A subject that came up recently for David Smith is asynchronous detach. So I've been thinking about accommodating it better in the interface.
One of the TODO features that I've been wanting to get to soon is global tracing. It so happens that David's work could use this feature. But that's not what makes me want to discuss both of them together. Rather, it's the similarity of the stickiest implementation question.

So first, what do I mean by global tracing? The idea is you can register a "global" utrace engine to watch events as they occur in whatever thread, seeing all threads in the system. This would use register/unregister calls parallel to the per-task attach/detach, and a call parallel to utrace_set_events. The events and callbacks would be the same, with some constraints. I don't want to get into all the details of global tracing here. For now, I've just outlined it enough to discuss its overlap with the asynchronous detach question.

Now, what do I mean by asynchronous detach? The "Teardown Races" rules (from utrace.txt, now a section under "utrace concepts" in the new utrace manual) address the issues of detach vs sudden deaths that are wholly outside of your control. Those rules ensure that when you think a task is stopped (i.e. via UTRACE_STOP), and you use utrace_control(task, UTRACE_DETACH) from another thread, you know for sure what callbacks do and don't get made. That's synchronous detach, because you synchronized with the thread by having it return UTRACE_STOP from a callback at some point, so it's stayed stopped until your utrace_control call. The cases where utrace_control returns -ESRCH or -EALREADY only arise here in the event of a simultaneous SIGKILL. (It's also synchronous detach when a callback in the task itself returns UTRACE_DETACH, of course.)

The other kind is asynchronous detach. That means the task is running merrily away, and you call utrace_control(task, UTRACE_DETACH) without knowing exactly what it's doing. From the beginning, I never gave this case too much thought. Now it's time to think more about it. The rub is simultaneous callbacks.
Since you haven't synchronized, the task you're detaching from might be running anywhere; it could be just now entering a callback to your engine. This can be on another CPU or, with preemption, just from getting rescheduled at an inopportune time. So when utrace_control returns 0, you can't be sure that one of your callbacks isn't just about to start in a race with your cleanup code.

What the utrace interface has always said about this is, "So don't do that." The utrace code goes to great pains to make sure you positively always either get a callback or don't, as opposed to jumping to zero or getting a callback you had disabled with utrace_set_events, and that sort of thing. But that's it. I had assumed you either would just get the thread quiescent before detaching, or would have your callbacks doing some safe synchronized access using engine->data, so they could tell you had done cleanup and would just do nothing and return.

What I overlooked is that not just your data structures, but your callbacks too might be going away, i.e. unloading the kernel module. This is where it ties in with global tracing. The interface to "detach" (unregister) a global tracing engine will have the same issue (but worse--there can be several CPUs/tasks simultaneously making callbacks). When this came up for the existing UTRACE_DETACH case, I realized that this issue is also the main determinant of the implementation details for doing global tracing. So I'm thinking about them together.

The question is, what contract would you like in the interface for asynchronous detach? The common case paths for most of the event reports do not involve any SMP-synchronizing instructions (just some barriers). This is crucial for keeping the overhead of "latent" tracing to the absolute minimum. Your callbacks can check some unlocked data structures and be very quick to filter out boring events, and the whole thing should scale well to huge numbers of threads being traced on many-CPU machines.
This concern constrains what I want to add to the implementation. I realize this is micro-optimizing early, but I don't want to make an interface choice that could fundamentally constrain future optimization and scaling without giving it a lot of thought beforehand. (Remember, this is the lowest layer, meant to make things both feasible and fast. It doesn't have to be the layer that's the simplest possible to use; higher layers can be that.)

For UTRACE_DETACH, one thing seems easy and cheap to implement, offhand. I can keep a record of which engine is being called into now, unsynchronized. Just sampling that record is never reliable, but it will use barriers to ensure ordering of making the record and fetching the engine->ops pointer. That means utrace_control(task, UTRACE_DETACH) could return -EINPROGRESS if a callback to your engine might be in progress or about to begin. That can always be a false positive, arising just after your callback has returned. Moreover, the only way this can work is that the detach happens regardless. Either it returns 0 and you know no more callbacks will be made, or it returns -EINPROGRESS and you know it's possible one more callback is already in progress or about to begin, but you don't know for sure that any is coming at all. There is no way to check later whether it's finished (if it has, your engine pointer is already invalid).

For global engines' detach, one option is to offer no help with your own data structures but to solve the module-unload problem using the module refcount. This means that rmmod would block while your callbacks are running, and the unload wouldn't actually commence until the callback finished. Once the unload has begun (your module's exit function starts running), new callbacks will be skipped automagically while you do your tear-down. (The module refcount works via per-CPU counts, so it does not involve SMP-synchronizing instructions.
It does use some memory somewhat wastefully (at least cacheline*ncpus), but it's already there anyway. In contrast, that amount of memory would be several times the current size of struct utrace_attached_engine, making per-CPU counts unattractive for that.)

I'm doing a lot of thinking out loud here. I want to explore all the ideas we can think of without synchronization, and mull over how (in)sufficient each is. So pitch in!

The other obvious option is to decide that two SMP-synchronizing operations per engine per event is OK. Looking at where our callbacks are, most of them are in places where we take locks and do a lot of synchronization and heavy operations anyway. So I guess we're concerned mostly with syscall tracing. That's already not available to global tracing, so certainly for global engines' callbacks the case for avoiding synchronizing instructions is looking thinner.

For syscalls, I did a little experiment to lend some perspective. I tested a stupid microbenchmark for system call overhead. It runs a million getpid system calls (the cheapest one), and samples CLOCK_THREAD_CPUTIME_ID (which is derived in the kernel from cycle counters or the high-resolution timer hardware) before and after. Here are the numbers. (These are averages from 10 runs in calm conditions, but these numbers are presumed unreliable and not at all precise, just to ponder.)

    1.7979558114  no audit  no trace
    5.9270243434  no audit  utrace-nop
    6.2908507868  no audit  utrace-sync

For comparison, here are the numbers with syscall audit enabled (auditd on):

    3.1960292903  audit-fast  no trace
    6.1579767031  audit-slow  no trace
    6.6884023171  audit       utrace-nop
    7.1273046673  audit       utrace-sync

This is on x86-64, with an upstream+utrace kernel approximately 2.6.27-rc1 (and a random configuration I use for development, not optimal). audit-fast means syscall audit enabled and using the new fast-path I recently added upstream (a change unrelated to utrace et al).
audit-slow means audit using the slow hardware return path, which is what x86 kernels before 2.6.27 have done. (When audit and utrace are combined, it always takes the slow path, so the audit-fast kernel is no different from the audit-slow kernel.) This is overhead that lots of people are putting up with now, e.g. auditd is enabled by default on Fedora--and no one had the fast-path version before this week's rawhide.

utrace-nop means a utrace engine is getting syscall_entry callbacks; its callback does:

    return utrace_syscall_action(action) | UTRACE_RESUME;

utrace-sync means a utrace engine whose callback does (with test_count initialized nonzero, e.g. ATOMIC_INIT(1), so the decrement never reaches zero):

    atomic_inc(&test_count);
    if (atomic_dec_and_test(&test_count))
        WARN_ON(1);
    return utrace_syscall_action(action) | UTRACE_RESUME;

The if condition never fires, but testing the synchronizing op's value this way mirrors the operations that would be added in the infrastructure to synchronize callbacks with asynchronous detach.

It's clear in the numbers that the biggest factor is the slow return path, e.g. the audit-fast vs audit-slow difference. That's done for arcane x86 kernel reasons that aren't relevant here. (The difference would be much less pronounced on other architectures, because this slow-path issue is especially slow on x86.) I think this will be improved in the future, so that utrace-nop might get faster by about the same amount as the audit-fast vs audit-slow difference (or nearly so).

If you factor that out, a few things pop out. First, utrace doing nothing is faster than audit doing nothing (go team). But compared to utrace-nop, utrace-sync nearly doubles the utrace overhead, making it well higher than the audit overhead. And remember, this test is in ideal, uncontested SMP conditions: it runs on one CPU of a 4-core machine while the other 3 CPUs are idle. The bad effects to other CPUs of any bus-locking or whatnot the -sync code incurs are not showing up at all. Oh, hey, I can factor it out by not using an x86.
Here are some more unreliable numbers, for the same test on a 2-CPU powerpc64:

    1.6260223104  no audit  no trace
    3.7841002367  no audit  utrace-nop
    5.0061109056  no audit  utrace-sync
    5.8225092223  audit     no trace
    6.1297943584  audit     utrace-nop
    6.8046667872  audit     utrace-sync

The raw tracing overhead (no audit utrace-nop) is something that can be optimized further, and is all local to the CPU running the syscall. The -sync cost for SMP interlock is probably only going to get worse in proportional terms as we get more cores, and it's a scaling limiter whose true impact is not reflected in the microbenchmark.

So I think this suggests it's not entirely goofy of me to balk at introducing these synchronizing instructions into the callback path. Perhaps there is nothing tolerable for the detach interface that can avoid them. But let's explore it thoroughly. I'm not trying to make the interface overly aggravating, but neither do I want it to give you enough rope that the whole thing can never be really well-optimized because of overly generous guarantees in the interface semantics.

I know these long rambles from me are hard to penetrate. But I need some help thinking about this stuff. And I'm not going to be satisfied by the simple answer without a lot of complicated convincing.

Thanks,
Roland