Re: global tracing
On Sun, 2008-08-03 at 19:50 -0700, Roland McGrath wrote: [...] For global tracing, those checks would be: if ((current->utrace_flags | utrace_global_flags) & mask) slow path; The cost is now two or three instructions with one load. It would increase to four or five instructions with two loads. By and large, these checks are already in places that take a lot of locks and so forth, so this addition seems pretty tiny. It's certainly no worse than adding a marker (in the current markers implementation), and probably usually far better, since it combines with the existing utrace check. If you really want to avoid it, there is a way: * Create another global variable, utrace_possible_flags. Each bit is set only if there is either a global tracer for the event, or at least one tracer in the system (keep a global counter). * Always check utrace_possible_flags first, and if it is set (thus requesting the slow path anyway), only then check the per-thread and global flags. OK, this has its cost (maintaining the counter and one extra check in the case where utrace _is_ actually used), but as long as there is no tracing happening in the system, performance does not suffer a single CPU cycle. It may even be a tiny bit better for architectures where accessing a global variable is cheaper than accessing current->utrace_flags... I'm not 100% sold that it's worth the complexity, but we can go that way if a nitpicker jumps up and argues. Petr Tesarik
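The two-level check Petr proposes can be modeled in plain userspace C. Everything here is illustrative, not actual kernel code: the names follow his description (utrace_possible_flags, utrace_global_flags), and the "keep a global counter" bookkeeping is collapsed into a single summary-word update for brevity.

```c
#include <assert.h>

static unsigned long utrace_global_flags;    /* events some global tracer wants */
static unsigned long utrace_possible_flags;  /* bit set iff any tracer, anywhere,
                                                wants the event */

/* Called on tracer attach/detach to keep the summary word current.
 * any_per_thread_flags stands in for "at least one per-thread tracer
 * wants this event" (the global counter in Petr's sketch). */
static void update_possible_flags(unsigned long any_per_thread_flags)
{
    utrace_possible_flags = utrace_global_flags | any_per_thread_flags;
}

/* The event fast path: with no tracing anywhere in the system, this is
 * a single load and test of the global summary word. */
static int need_slow_path(unsigned long thread_flags, unsigned long mask)
{
    if (!(utrace_possible_flags & mask))
        return 0;                     /* nobody anywhere wants this event */
    /* Only now pay for the second check of per-thread and global flags. */
    return ((thread_flags | utrace_global_flags) & mask) != 0;
}
```

This is exactly the trade-off described above: one extra check once utrace _is_ in use, zero extra cycles when it is not.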
Re: global tracing
Hi - The alternative I considered is the nonexistence of global tracing support, thus no utrace_global_flags test in the syscall fast path. It will never be in the fast path. It will always require TIF_SYSCALL_TRACE to be set on each thread, which means the slow path. [...] OK, I must have misunderstood your original posting: # [...] # d. Kernel already has checks here, so almost free. # #The utrace event hooks are at places where the kernel has had old #ptrace checks forever. The old code has fast paths that do: # if (current->ptrace & mask) slow path; #Now in those same places there is: # if (current->utrace_flags & mask) slow path; #So the cost of the checks is identical to what's already there. [...] #For global tracing, those checks would be: # if ((current->utrace_flags | utrace_global_flags) & mask) slow path; # [...] - FChE
Re: global tracing
It will never be in the fast path. It will always require TIF_SYSCALL_TRACE to be set on each thread, which means the slow path. [...] OK, I must have misunderstood your original posting: # [...] # d. Kernel already has checks here, so almost free. This refers to all the other cases, where there is just a check at the time of the event. The syscall case is special, requiring TIF_SYSCALL_TRACE. Thanks, Roland
Re: global tracing
Actually, this point is where I've been stuck these past weeks. If we add a marker or tracepoint to trace every syscall, we might have to put it in the tracehook or audit and set TIF_SYSCALL_TRACE for every process, or put the tracepoint in the syscall entrance/exit asm code and check another flag. Since the latter adds additional flag-checking in the fast path, I think it is not acceptable. I agree completely that it would be wrong to do any new arch work for this, especially assembly hacking. Certainly piggy-backing on the existing TIF_SYSCALL_TRACE or TIF_SYSCALL_AUDIT in some fashion is the way to go. If you don't need complete user register access at your tracepoint, then the audit path is an option. I suspect you do, and so TIF_SYSCALL_TRACE is what to use. Then you can put tracepoints in tracehook_report_syscall_*. It's straightforward to write a loop to set TIF_SYSCALL_TRACE on every task. The only wrinkle is dealing with clearing the flag correctly. You don't need a loop, because it can be cleared lazily by each thread when it gets into the slow path and finds it has no reason to be there. This is not very hard. It only requires adding a few lines in the utrace code to check your global-syscall-trace flag in deciding when to clear TIF_SYSCALL_TRACE. This would be unlike a plain tracepoint only in that you have to make this explicit call to switch it on and off. (Maybe this could be rolled into the tracepoint probe registration API.) I'm not at all arguing against having utrace global tracing to provide you this feature instead. (I already raised the pros/cons about that generally, and that discussion can continue.) But this is how you'd do it sensibly with tracepoints IMHO. (The details I just described are not much different from what utrace global tracing would have for handling TIF_SYSCALL_TRACE.) Thanks, Roland
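The lazy-clear idea above can be sketched in a few lines of userspace C. The names mimic the kernel's (TIF_SYSCALL_TRACE, a hypothetical global-syscall-trace flag) but this is only a model of the logic, not the real thread-info machinery:

```c
#include <assert.h>

#define TIF_SYSCALL_TRACE 0x1u

static int global_syscall_trace;   /* the hypothetical global switch */

struct task {
    unsigned int tif_flags;        /* thread-info flags */
    unsigned long utrace_flags;    /* per-thread utrace event mask */
};

/* Entered because TIF_SYSCALL_TRACE was set.  If neither a per-thread
 * engine nor the global switch wants syscall events any more, the
 * thread clears the flag itself -- so switching global tracing off
 * needs no loop over all tasks. */
static void syscall_slow_path(struct task *t)
{
    if (!t->utrace_flags && !global_syscall_trace)
        t->tif_flags &= ~TIF_SYSCALL_TRACE;   /* lazy clear */
    /* ... otherwise report the syscall entry/exit event ... */
}
```

Only switching tracing on needs the loop over all tasks; switching it off is absorbed into the slow path each thread was already taking.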
Re: global tracing
* Create another global variable utrace_possible_flags. Each bit is set only if there is either a global tracer for the event, or at least one tracer in the system (keep a global counter). * Always check utrace_possible_flags first, and if it is set (thus requesting the slow path anyway), only then check the per-thread and global flags. That seems like a bad trade-off. The common case to optimize is that this event is now not going to be traced. If someone somewhere is running strace on their programs, my task should not go through any slow paths just because of them. That's a degradation from today's performance with plain old ptrace. The fast path having two negative tests in the common case is surely better than what should be the fast path having a slow false positive for me, because someone else somewhere ran strace -f sleep 99. If it comes down to "exactly the current check only" being the acceptable cost, then the opposite direction is what makes sense to me. That is, have global tracing go do: task->utrace_flags |= global_utrace_flags; on every task whenever a new bit is set in global_utrace_flags. (Then there can be some lazy fixup for stale task->utrace_flags values after global_utrace_flags has bits cleared. It's essentially the same plan as for setting TIF_SYSCALL_TRACE for global syscall tracing.) Thanks, Roland
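This opposite direction, eager propagation on set, lazy fixup on clear, can also be modeled in userspace C. The array standing in for the kernel's task list and the helper names are illustrative only:

```c
#include <assert.h>

static unsigned long global_utrace_flags;

struct task { unsigned long utrace_flags; };

/* When a new global bit is set, OR it into every task's utrace_flags,
 * so the existing single-word fast path stays exactly as it is today. */
static void set_global_events(struct task *tasks, int ntasks,
                              unsigned long new_bits)
{
    global_utrace_flags |= new_bits;
    for (int i = 0; i < ntasks; i++)
        tasks[i].utrace_flags |= new_bits;   /* eager propagation */
}

/* Per-task fast path: identical cost to the non-global check. */
static int need_slow_path(const struct task *t, unsigned long mask)
{
    return (t->utrace_flags & mask) != 0;
}

/* Stale bits left after global bits are cleared get fixed up lazily,
 * when a task reaches the slow path and recomputes its own word from
 * its per-task engines plus the current global flags. */
static void lazy_fixup(struct task *t, unsigned long per_task_flags)
{
    t->utrace_flags = per_task_flags | global_utrace_flags;
}
```

The stale bits only cost the occasional spurious slow-path trip after a global tracer detaches; the untraced common case never pays anything new.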
Re: global tracing
Hi, Roland McGrath wrote: Actually, this point is where I've been stuck these past weeks. If we add a marker or tracepoint to trace every syscall, we might have to put it in the tracehook or audit and set TIF_SYSCALL_TRACE for every process, or put the tracepoint in the syscall entrance/exit asm code and check another flag. Since the latter adds additional flag-checking in the fast path, I think it is not acceptable. I agree completely that it would be wrong to do any new arch work for this, especially assembly hacking. Certainly piggy-backing on the existing TIF_SYSCALL_TRACE or TIF_SYSCALL_AUDIT in some fashion is the way to go. If you don't need complete user register access at your tracepoint, then the audit path is an option. I suspect you do, and so TIF_SYSCALL_TRACE is what to use. Then you can put tracepoints in tracehook_report_syscall_*. Actually, I did that and found it is not simple to hook the audit syscall path. It seems that the audit flag is not synchronously cleared/set on processes with an audit_context. I think tracehook is the better and simpler way to do this. But there is still some audit-related problem when I set the TIF_SYSCALL_TRACE flag on every process, and I'm investigating that. Maybe I need to improve syscall audit. It's straightforward to write a loop to set TIF_SYSCALL_TRACE on every task. The only wrinkle is dealing with clearing the flag correctly. You don't need a loop, because it can be cleared lazily by each thread when it gets into the slow path and finds it has no reason to be there. This is not very hard. It only requires adding a few lines in the utrace code to check your global-syscall-trace flag in deciding when to clear TIF_SYSCALL_TRACE. That's a good idea. I'll check that. This would be unlike a plain tracepoint only in that you have to make this explicit call to switch it on and off. (Maybe this could be rolled into the tracepoint probe registration API.) Sure; even so, we can enable it when initializing the tracepoint-marker conversion module. 
I'm not at all arguing against having utrace global tracing to provide you this feature instead. (I already raised the pros/cons about that generally and that discussion can continue.) But this is how you'd do it sensibly with tracepoints IMHO. (The details I just described are not much different from what utrace global tracing would have for handling TIF_SYSCALL_TRACE.) I agree with that. I think if I can set TIF_SYSCALL_TRACE on each process safely, it can work with utrace global tracing too. In that case, I can move to the utrace global tracing feature. Thank you, Thanks, Roland -- Masami Hiramatsu Software Engineer Hitachi Computer Products (America) Inc. Software Solutions Division e-mail: [EMAIL PROTECTED]
Re: global tracing
Answer to (a) is surely yes, but... Since you're sure, what would you say to convince a skeptic? ... wouldn't it be better to first push the base utrace upstream and add this as a feature thereafter? I think this is probably how it will go anyway. I want to get a plan on the table now. The consensus here about the details will inform my ideas for implementing it. I want to think it through enough to see how the innards would be and figure out if it will entail significant rearrangement of the utrace implementation. Thanks, Roland
Re: global tracing
This kind of interface would be nice to have in utrace only if it were significantly cheaper than doing what we do now: potentially attaching utrace engines to each thread -- or (in the near future, systemtap bug #6445) to subtrees of the process hierarchy. The overhead (memory + setup/teardown cost) is per-thread × per-tracer. We'd have to measure what it is in practice. I'd guess the memory won't be an issue unless you were really milking the system for performance. I'd guess the first issue will be big chunks of slowness at script setup/teardown when there are lots of threads on the system. The main feature of global tracing is that it avoids this overhead. It goes without saying that you could always just trace every thread individually and produce the same result at a high level. The other feature is its simplicity. The baseline work to do global tracing thread-by-thread is not entirely trivial, as David will attest. For subtrees, there wouldn't any time soon be an option other than global or by-each-thread. In the long run, there might be some new optimizations for using utrace to treat many threads all the same. Whatever comes along to benefit that case, I don't think it will constitute an argument either for or against global tracing. (An extra chunk of work per clone() may well be cheaper than extra work at every system call.) I assume what you mean here is for global syscall tracing. There is no such trade-off. With vanilla utrace, you always do both. With global tracing, you still always do the latter. Systemtap doesn't currently change outcomes in a callback, so reason c. doesn't apply much. [...] Actually, this is the main reason that utrace-level support sounds interesting to me. We have had requests for exposing some thread control primitives to systemtap probe handlers - to block/resume, send signals, that sort of stuff. *If* going through utrace (as opposed to a separate API) would make this smoother and compose better (should e.g. 
there be different systemtap scripts fighting over the threads), that could be worthwhile. We'd have to discuss concrete scenarios to get entirely clear on this. But off hand those sound like things that make sense to do with vanilla utrace on individual threads. i.e. blocking a thread implies that you maintain per-thread state, as opposed to just a per-event consideration of the thread at hand. (Also, for blocking specifically, utrace is the only kosher way to go about it--anything else fails badly at playing nicely with other tracing and debugging facilities.) So to me this says you just need whatever global tracing facility you're using to have a good place to make utrace setup calls when you discover you want to do this sort of thing. That's a feature that utrace global tracing clearly has. But given a particular scenario and a given other means of getting its necessary event hooks, that other means might well be fine in this regard too. To know, we'd have to get concrete about each of the specific tracepoints you would use instead. Thanks, Roland
Re: global tracing
Hi - On Tue, Aug 05, 2008 at 03:32:42PM -0700, Roland McGrath wrote: This kind of interface would be nice to have in utrace only if it were significantly cheaper than doing what we do now: potentially attaching utrace engines to each thread -- or (in the near future, systemtap bug #6445) to subtrees of the process hierarchy. The overhead (memory + setup/teardown cost) is per-thread × per-tracer. We'd have to measure what it is in practice. [...] Right. The other feature is its simplicity. The baseline work to do global tracing thread-by-thread is not entirely trivial, as David will attest. Right, though once it's done, it's done ... For subtrees, there wouldn't any time soon be an option other than global or by-each-thread. [...] ... and is necessary for this part anyway. (An extra chunk of work per clone() may well be cheaper than extra work at every system call.) I assume what you mean here is for global syscall tracing. There is no such trade-off. With vanilla utrace, you always do both. With global tracing, you still always do the latter. The alternative I considered is the nonexistence of global tracing support, thus no utrace_global_flags test in the syscall fast path. Systemtap doesn't currently change outcomes in a callback, so reason c. doesn't apply much. [...] Actually, this is the main reason that utrace-level support sounds interesting to me. We have had requests for exposing some thread control primitives to systemtap probe handlers - to block/resume, send signals, that sort of stuff. [...] We'd have to discuss concrete scenarios to get entirely clear on this. [...] Well, it would be desirable to have some facility to block/resume and send signals to threads. It would be desirable for this to be available not only from utrace probes, and not only targeting the currently utrace-hooked thread, but to enqueue the command to an arbitrary one. - FChE
Re: global tracing
Roland McGrath wrote: We've mentioned global tracing. I think it's time now to discuss it thoroughly and decide what we do or don't want to do. ... 2. Why do we want utrace global tracing? From a systemtap point of view, we'd certainly use global tracing. ... 3. What would it look like? Global engines' callbacks all run after all per-task engine callbacks. (This could change in future.) I guess in a perfect world callbacks would still be called in the order they were attached. But, if calling the global callbacks last makes things easier, I think systemtap could handle it. I had originally planned to rule out SYSCALL events for global tracing. The reason is that this is not like other event checks where a simple flag gets checked cheaply. Instead, it requires setting the low-level TIF_SYSCALL_TRACE on a thread, which makes it take a far slower path on system call entry and exit, and has a big impact on performance just from that alone. Global tracing has to set this individually on every thread, and then pay that big overhead across the board. If we had utrace memory map tracing (I believe it is on your TODO list), systemtap wouldn't use global (or even per-thread) SYSCALL events as much. ... I'd kind of prefer to exclude REAP events for global tracing. Currently systemtap only uses DEATH events, so I don't have much of an opinion there. ... 4. So, what's the plan? I need folks who might use global tracing to answer these questions: a. Do we want it? Yes. Systemtap does global tracing now, in a manner similar to crash-suspend.c. The code looks for global CLONE, EXEC, and DEATH events, so systemtap knows when threads come and go. Once systemtap finds a process the user has told us he's interested in, it attaches some additional per-thread engine(s). In the future, Frank has mentioned trying to do global memory map tracing, which would require global syscall tracing (or a future utrace memory map tracing feature). b. Do we want it right now? Yes. 
If you need beta testers, let me know. c. What justifies doing it in utrace (vs leaving it purely to tracepoints et al), to placate upstream critics? Please don't say, "That would be nice; your reasons sound good." That just does not help at all. The reasons in #2 above are ones I can think of, but I'm not arguing for them or for the feature. If you want the feature, *you* will be justifying it to the upstream critics. Let's here be as skeptical about adding the new complexity, before we decide on doing it, as our unsympathetic reviewers will be. Global tracing would be *really* nice; your reasons sound *great*. How's that? :-) Seriously, your reasons a. (Event vocabulary clearly aligned with utrace events), b. (Coordinated with per-task utrace callbacks), and d. (Kernel already has checks here, so almost free) apply most clearly to systemtap. Systemtap doesn't currently change outcomes in a callback, so reason c. doesn't apply much. Systemtap is interested in performance impacts and the a./b. advantages seem quite obvious to me. Avoiding the complexities of manually attaching/detaching to every thread in the system seems important also. -- David Smith [EMAIL PROTECTED] Red Hat http://www.redhat.com 256.217.0141 (direct) 256.837.0057 (fax)
Re: global tracing
2. Why do we want utrace global tracing? From a systemtap point of view, we'd certainly use global tracing. You're using tracepoints/markers too. (You'll use anything, you minx.) What we need is reasons for this to be a utrace feature. Global tracing would be *really* nice; your reasons sound *great*. How's that? :-) Cursing me with loud praise! Seriously, your reasons a. (Event vocabulary clearly aligned with utrace events), b. (Coordinated with per-task utrace callbacks), and d. (Kernel already has checks here, so almost free) apply most clearly to systemtap. Systemtap doesn't currently change outcomes in a callback, so reason c. doesn't apply much. Systemtap is interested in performance impacts and the a./b. advantages seem quite obvious to me. Ok. Since a. is basically aesthetic, I think what would be concrete here is to see how you'd use it in practice such that b. matters to you. Avoiding the complexities of manually attaching/detaching to every thread in the system seems important also. That's a reason to have some kind of global tracing as opposed to none. Sold. It's not a reason to have utrace global tracing instead of only tracepoints and markers. Thanks, Roland
global tracing
We've mentioned global tracing. I think it's time now to discuss it thoroughly and decide what we do or don't want to do. 1. So, what is global tracing? It's an interface to trace the events that a utrace engine can trace, but generically across the whole system without attaching to specific threads. 2. Why do we want utrace global tracing? (I won't go into what the ability to trace things is good for in the abstract, I assume we're all sold on that.) This has been an item on the utrace TODO list for a long time, since before we had any other plan for system-wide hooks in the kernel. Now we have tracepoints and markers (et al). So the question here is, why do we want to do this in utrace? In each place that utrace has a tracing hook (now all in linux/tracehook.h), you could easily add a tracepoint/marker. So what does utrace global tracing offer over using tracepoints? Here are my thoughts on this. I'm not 100% sold that these justify it. There is a clear argument not to add another feature that provides a second way to do what you can already do with tracepoints. a. Event vocabulary clearly aligned with utrace events. The identifiers for and details of all the places you can get events and what information is on hand match the per-task utrace interface. This makes it very straightforward to compose higher-level interfaces that describe events uniformly, whether they are tracked via the global or per-task mechanism. This is quite a weak argument. It would never be difficult to map the two different mechanisms to a uniform higher-level event vocabulary. b. Coordinated with per-task utrace callbacks. If system-wide hooks are an independent mechanism, it won't be obvious (or necessarily stay reliable) whether the tracepoint is before or after the utrace callbacks, etc. As part of a unified interface, that will be well-specified. 
(If we grow some complex callback order priority feature, the global hooks might have detailed options for where to land in the ordering with various per-task callbacks.) Moreover, it's natural for a global tracing callback to get informed directly about what other utrace engines are doing. e.g., a system-wide catch-all hook for debugging stray crashes can tell if an active debugger is doing something to the particular task and get out of the way. c. Callbacks can change outcomes. In utrace, the syscall and signal callbacks can affect what the task actually does in a well-specified way. Tracepoints just report events. For syscalls, off hand I can only see wanting this for fault injection. There might be other sensible uses. For signals, this might be crucial to doing the crash-catcher of last resort sort of thing (at least, to do it more efficiently than giving every task in the system a utrace engine just for that). What I'd expect this to do is catch SIGNAL_CORE with a global tracing callback that attaches a new per-task engine, ignores and pushes back the signal (like crash-suspend does), and the new engine UTRACE_STOPs until some user-level crash handling stuff wakes up and takes over. d. Kernel already has checks here, so almost free. The utrace event hooks are at places where the kernel has had old ptrace checks forever. The old code has fast paths that do: if (current->ptrace & mask) slow path; Now in those same places there is: if (current->utrace_flags & mask) slow path; So the cost of the checks is identical to what's already there. This is the main thing I've expected to soothe the upstream performance nit-pickers about utrace: zero new overhead if you ain't usin' it. For global tracing, those checks would be: if ((current->utrace_flags | utrace_global_flags) & mask) slow path; The cost is now two or three instructions with one load. It would increase to four or five instructions with two loads. 
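The two fast-path variants from point d. can be written out as a small userspace model. The mask values and function names are stand-ins; only the shape of the tests matters, one load for today's per-task check, two loads once the global word is OR-ed in:

```c
#include <assert.h>

static unsigned long utrace_global_flags;

struct task { unsigned long utrace_flags; };

/* Today's check: if (current->utrace_flags & mask) slow path; */
static int check_per_task(const struct task *t, unsigned long mask)
{
    return (t->utrace_flags & mask) != 0;
}

/* With global tracing:
 * if ((current->utrace_flags | utrace_global_flags) & mask) slow path; */
static int check_with_global(const struct task *t, unsigned long mask)
{
    return ((t->utrace_flags | utrace_global_flags) & mask) != 0;
}
```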
By and large, these checks are already in places that take a lot of locks and so forth, so this addition seems pretty tiny. It's certainly no worse than adding a marker (in the current markers implementation), and probably usually far better, since it combines with the existing utrace check. 3. What would it look like? Global tracing would use the same struct utrace_engine_ops, sharing all the same signatures for the callbacks. There would be a call to register a global tracing engine, which would give you an engine represented by the same struct utrace_attached_engine type (so this pointer is passed to your callbacks). All the calls to administer global tracing engines would be separate from the existing per-task utrace calls, though we overload the same types and use the same callbacks. Perhaps only register/unregister calls, though maybe also a set_events to change your event mask after the fact. I'm leaving aside the asynchronous detach details for now. Callbacks would
Re: asynchronous detach, global tracing
Roland McGrath [EMAIL PROTECTED] writes: [...] What the utrace interface has always said about this is, So don't do that. [...] What I overlooked is that not just your data structures, but your callbacks too might be going away, i.e. unloading the kernel module. I don't think the module-unloading case is so special. If there exist races involving utrace detach, then they will affect long-lasting modules too that may want to do some utracing then some other stuff, then perhaps return to utracing again. In this scenario, the data too is volatile or could be repurposed between utrace sessions. Such a module would need to know positively when no further callbacks will arrive. [...] For global engines' detach, one option is to offer no help with your own data structures but to solve the module-unload problem using the module refcount. [...] If having a per-cpu counter vector is sufficiently low weight for utrace to update it around every callback, how about letting a utrace engine specify an (optional?) percpu-integer vector? Then, the utrace client could use a similar synchronization algorithm as that of module/refcount unloading to assure itself of a complete and final utrace detach. It could even opt to reuse the counters between engines, or between utracing sessions, if it knows that its data/code lifetimes can work with that. - FChE