[PATCH] ptrace-tests: fix step-fork.c on powerpc for ptrace-utrace
Instead of using fork(), call syscall(__NR_fork) in step-fork.c to avoid looping on powerpc arch in libc. Signed-off-by: Veaceslav Falico vfal...@redhat.com --- --- ptrace-tests/tests/step-fork.c 2009-12-01 17:17:14.0 +0100 +++ ptrace-tests/tests/step-fork.c 2009-12-01 17:25:12.0 +0100 @@ -29,6 +29,7 @@ #include unistd.h #include sys/wait.h #include string.h +#include sys/syscall.h #include signal.h #ifndef PTRACE_SINGLESTEP @@ -78,7 +79,12 @@ main (int argc, char **argv) sigprocmask (SIG_BLOCK, mask, NULL); ptrace (PTRACE_TRACEME); raise (SIGUSR1); - if (fork () == 0) + + /* + * We can't use fork() directly because on powerpc it loops inside libc on + * ptrace over utrace. See http://lkml.org/lkml/2009/11/28/11 + */ + if (syscall(__NR_fork) == 0) { read (-1, NULL, 0); _exit (22);
Re: [RFC] [PATCH] In-kernel gdbstub based on utrace Infrastructure.
Hi - On Tue, Dec 01, 2009 at 05:11:32PM +0100, Ingo Molnar wrote: Those facilities are not overlapping with kgdb though so my point doesnt apply to them. An in-kernel gdb server sure overlaps/extends kgdb though. Only in name. One is highly invasive, for debugging the kernel across serial consoles. The other is highly noninvasive, for debugging user processes across normal userspace channels. They both happen to talk to gdb, but that's the end of the natural overlap. Even if kgdb was extended to be able to manage userspace, and if gdb itself was extended to be able to use that same single channel, this would still not duplicate the use scenario for an ordinary user debugging his own processes. (Plus, in the future where at least gdb is applied toward kernel+user debugging, it is unlikely to be the case that this would need to be done *over a single channel*. A separate channel for kernel and separate channels for userspace programs are no less likely.) Btw., perf does meet that definition: it functionally replaces all facilities that it overlaps/extends - such as Oprofile. [...] (And they currently separately coexist.) - FChE
Re: clone bug (glibc?) (Was: clone-multi-ptrace test failure)
On 11/30, Oleg Nesterov wrote: On 11/29, Roland McGrath wrote: Please file this test case on bugzilla.redhat.com for Fedora 12 glibc. https://bugzilla.redhat.com/show_bug.cgi?id=542731 It was closed as NOTABUG, Andreas Schwab wrote: If you call clone directly you are responsible for setting up the TLS area yourself. troll mode Very nice. If I understand correctly, this means clone(CLONE_VM) must not be used without CLONE_SETTLS, right? This in turn means clone(CLONE_VM) is not useable, afaics it is not possible to use CLONE_SETTLS in a more or less portable manner. Even arch/x86/ needs struct user_desc * or long addr depending on CONFIG_X86_32. And it used to work? I downloaded glibc-2.11, and afaics this was broken by Preserve SSE registers in runtime relocations on x86-64. commit: b48a267b8fbb885191a04cffdb4050a4d4c8a20b I do not understand glibc even remotely, but this lools like regression to me. I see nothing in the changelog or man page which explains that CLONE_VM requires CLONE_SETTLS now. /troll mode So. Any ptrace test which uses clone() is broken, at least on x86_64. Jan, Roland, how should we fix this? We can rewrite the code to use pthread_create(), this should be trivial. Unfortunately, libpthread is not trivial, it can shadow the problem and complicate the testing. And the stupid question. If I create the subthread via pthread_create(), how can I know its tid? I grepped glibc-2.11, and afaics pthread_create returns the pointer to struct pthread which has pid_t tid but I can not find the helper which returns -tid and struct pthread is not exported. Oleg.
Re: [PATCH] ptrace-tests: fix step-fork.c on powerpc for ptrace-utrace
On Tue, Dec 01, 2009 at 05:37:48PM +0100, Veaceslav Falico wrote: - if (fork () == 0) + + /* + * We can't use fork() directly because on powerpc it loops inside libc on + * ptrace over utrace. See http://lkml.org/lkml/2009/11/28/11 + */ + if (syscall(__NR_fork) == 0) { read (-1, NULL, 0); _exit (22); Sorry, the comment is just wrong. I'll resend the patch in several minutes. -- Veaceslav
[PATCH v2] ptrace-tests: fix step-fork.c on powerpc for ptrace-utrace
Instead of using fork(), call syscall(__NR_fork) in step-fork.c to avoid looping on powerpc arch in libc. Signed-off-by: Veaceslav Falico vfal...@redhat.com --- --- a/ptrace-tests/tests/step-fork.c2009-12-01 17:17:14.0 +0100 +++ b/ptrace-tests/tests/step-fork.c2009-12-01 18:35:15.0 +0100 @@ -29,6 +29,7 @@ #include unistd.h #include sys/wait.h #include string.h +#include sys/syscall.h #include signal.h #ifndef PTRACE_SINGLESTEP @@ -78,7 +79,12 @@ main (int argc, char **argv) sigprocmask (SIG_BLOCK, mask, NULL); ptrace (PTRACE_TRACEME); raise (SIGUSR1); - if (fork () == 0) + + /* +* Can't use fork() directly because on powerpc it loops inside libc under +* PTRACE_SINGLESTEP. See http://marc.info/?l=linux-kernelm=125927241130695 +*/ + if (syscall(__NR_fork) == 0) { read (-1, NULL, 0); _exit (22);
Re: powerpc: syscall_dotrace() retcode (Was: powerpc: fork stepping)
We don't really intend to impose any new requirements on the arch behavior here. It's up to the arch folks to decide. It does simplify life somewhat on x86 that you can change the registers at the syscall-entry stop and then after skipping the syscall, those registers will be unchanged from what you set. But it's never been that way on every other arch, and it doesn't need to be. The arch requirement on the tracehook_report_syscall_entry() return value handling is that it make the syscall not be run, and that the register state then left be compatible with using syscall_rollback() at the tracehook_report_syscall_exit() spot. As to what you get from ptrace explicitly fiddling with registers at syscall entry, that has always been arch-specific before and we haven't done anything to change that situation. On every arch, there is some way to change the syscall number to be run, and changing it to a known-bogus number (e.g. -1) makes sure no syscall runs. On every arch, it's possible at the tracehook_report_syscall_exit() spot to change the registers to exactly whatever you want userland to see. That's enough as it stands to make it possible to do whatever you want, some way or other. If the powerpc maintainers want to change the behavior here, that is fine by me. But there is no need for that just to satisfy general ptrace cleanups (or utrace). Normal concerns require that no such change break the ptrace behavior that userland could have relied on in the past. So off hand I don't see a reason to change at all. If every arch were to change so that registers changed at syscall-entry were left unmolested by aborting the syscall, then that might be a new consistency worth having. But short of that, I don't really see a benefit. All this implies that the ptrace-tests case relating to this needs to be tailored differently for powerpc and each other arch so it expects and verifies exactly the arch-specific behavior that's been seen in the past. Thanks, Roland
Re: clone bug (glibc?) (Was: clone-multi-ptrace test failure)
So. Any ptrace test which uses clone() is broken, at least on x86_64. If you use clone() directly then you need to have the code run in that child be purely under your control. You can't use miscellaneous libc calls nor any libpthread calls, only ones you are sure do not require any thread setup. Given TLS, that means even using errno or anything that might set it. It also means any libc function that might use any TLS you don't know about, i.e. really anything beyond the pure computation calls like str*/mem*. It also means running any dynamic linker code, such as relying on dynamic linking without LD_BIND_NOW (or -Wl,-z,now at compile time). The only thing that changed about this recently in glibc is that even more code paths through the dynamic linker now happen to depend on thread setup. Jan, Roland, how should we fix this? We can rewrite the code to use pthread_create(), this should be trivial. Unfortunately, libpthread is not trivial, it can shadow the problem and complicate the testing. We should avoid library code more thoroughly, not use more of it. As well as being complex, it also varies a lot across systems and interferes with using the same sources to translate to exact kernel-level testing across various people's development environments. I think the best bet is to link with -Wl,-z,now and then minimize the library code you rely on. (It really only matters to be extra careful about that for the code running in the clone child.) So you can use syscall() if you are not relying on its error-case behavior--if the system call fails, the function will set errno, which can rely on the TLS setup. And the stupid question. If I create the subthread via pthread_create(), how can I know its tid? I grepped glibc-2.11, and afaics pthread_create returns the pointer to struct pthread which has pid_t tid but I can not find the helper which returns -tid and struct pthread is not exported. There is no official exported way. You can use syscall(__NR_gettid). That kernel concept of a global thread ID number does not exist in pthreads, it is a detail of the Linux implementation. Thanks, Roland
Re: [RFC,PATCH 14/14] utrace core
On Tue, 2009-11-24 at 21:02 +0100, Oleg Nesterov wrote: + sect2 id=reaptitleFinal callbacks/title + para +The functionreport_reap/function callback is always the final event +in the life cycle of a traced thread. Tracing engines can use this as +the trigger to clean up their own data structures. The +functionreport_death/function callback is always the penultimate +event a tracing engine might see; it's seen unless the thread was +already in the midst of dying when the engine attached. Many tracing +engines will have no interest in when a parent reaps a dead process, +and nothing they want to do with a zombie thread once it dies; for +them, the functionreport_death/function callback is the natural +place to clean up data structures and detach. To facilitate writing +such engines robustly, given the asynchrony of +constantSIGKILL/constant, and without error-prone manual +implementation of synchronization schemes, the +applicationutrace/application infrastructure provides some special +guarantees about the functionreport_death/function and +functionreport_reap/function callbacks. It still takes some care +to be sure your tracing engine is robust to tear-down races, but these +rules make it reasonably straightforward and concise to handle a lot of +corner cases correctly. + /para + /sect2 This above text seems inconsistent. First you say report_reap() is the final callback and should be used for cleanup, then you say report_death() is the penultimate callback but might not always happen and people would normally use that as cleanup. If we cannot rely on report_death() then I would suggest to simply recommend report_reap() as cleanup callback and leave it at that. + para +There is nothing a kernel module can do to keep a structnamestruct +task_struct/structname alive outside of +functionrcu_read_lock/function. Sure there is, get_task_struct() comes to mind. When the task dies and is reaped +by its parent (or itself), that structure can be freed so that any +dangling pointers you have stored become invalid. +applicationutrace/application will not prevent this, but it can +help you detect it safely. By definition, a task that has been reaped +has had all its engines detached. All +applicationutrace/application calls can be safely called on a +detached engine if the caller holds a reference on that engine pointer, +even if the task pointer passed in the call is invalid. All calls +return constant-ESRCH/constant for a detached engine, which tells +you that the task pointer you passed could be invalid now. Since +functionutrace_control/function and +functionutrace_set_events/function do not block, you can call those +inside a functionrcu_read_lock/function section and be sure after +they don't return constant-ESRCH/constant that the task pointer is +still valid until functionrcu_read_unlock/function. The +infrastructure never holds task references of its own. And here you imply its existence. Though neither +functionrcu_read_lock/function nor any other lock is held while +making a callback, it's always guaranteed that the structnamestruct +task_struct/structname and the structnamestruct +utrace_engine/structname passed as arguments remain valid +until the callback function returns. + /para + + para +The common means for safely holding task pointers that is available to +kernel modules is to use structnamestruct pid/structname, which +permits functionput_pid/function from kernel modules. When using +that, the calls functionutrace_attach_pid/function, +functionutrace_control_pid/function, +functionutrace_set_events_pid/function, and +functionutrace_barrier_pid/function are available. + /para The above seems weird to me at best... why hold a pid around when you can keep a ref on the task struct? A pid might get reused leaving you with a completely different task than you thought you had. + /sect2 + + sect2 id=reap-after-death +title + Serialization of constantDEATH/constant and constantREAP/constant +/title +para + The second guarantee is the serialization of + constantDEATH/constant and constantREAP/constant event + callbacks for a given thread. The actual reaping by the parent + (functionrelease_task/function call) can occur simultaneously + while the thread is still doing the final steps of dying, including + the functionreport_death/function callback. If a tracing engine + has requested both constantDEATH/constant and + constantREAP/constant event reports, it's guaranteed that the + functionreport_reap/function callback will not be made until
Re: powerpc: syscall_dotrace() retcode (Was: powerpc: fork stepping)
On Tue, 2009-12-01 at 11:27 -0800, Roland McGrath wrote: If the powerpc maintainers want to change the behavior here, that is fine by me. But there is no need for that just to satisfy general ptrace cleanups (or utrace). Normal concerns require that no such change break the ptrace behavior that userland could have relied on in the past. So off hand I don't see a reason to change at all. If every arch were to change so that registers changed at syscall-entry were left unmolested by aborting the syscall, then that might be a new consistency worth having. But short of that, I don't really see a benefit. All this implies that the ptrace-tests case relating to this needs to be tailored differently for powerpc and each other arch so it expects and verifies exactly the arch-specific behavior that's been seen in the past. Ok thanks. I'm happy to not change it then, the risk of breaking some existing assumption is too high in my book. Cheers, Ben.
Re: [RFC] [PATCH] In-kernel gdbstub based on utrace Infrastructure.
* Frank Ch. Eigler f...@redhat.com wrote: Hi - Only in name. One is highly invasive, for debugging the kernel across serial consoles. The other is highly noninvasive, for debugging user processes across normal userspace channels. They both happen to talk to gdb, but that's the end of the natural overlap. [...] Well nothing that you mention here changes our obvious suggestion that an in-kernel gdb stub should obviously either be a kgdb extension, or a replacement of it. Help me out here: by kgdb extension do you imagine something new that an unprivileged user can use to debug his own process? Or do you imagine a new userspace facility that single-steps the kernel? Is this a trick question? Single-stepping the kernel on the same system [especially if it's an UP system] would certainly be a challenge ;-) What i mean is what i said: if you provide a new framework (especially if it's user visible - which both kgdb and the gdb stub is) you should either fully replace existing functionality or extend it. Overlapping it in an incomplete way is not useful to anyone. Extending kgdb to allow the use of it as if we used gdb locally would certainly be interesting - and then you could drop into the kernel anytime as well. But i'm not siding with any particular solution - i'm just seconding Peter's point that there's very clear overlap and inconsistency, and that ought to be resolved one way or another. We dont want to separate facilities for the same conceptual thing: examining application state (be that in user-space and kernel-space). This seems like a shallow sort of consistency. kgdb was added after ptrace existed -- why not extend ptrace instead to target the kernel? After all, it's examining application state. The answer is that it doesn't make a heck of a lot of sense. kgdb simply used gdb's preferred way of remote debugging. That's certainly the ugliest bit of it btw - but it's an externality to kgdb. Had it extended ptrace it wouldnt have gdb compatibility. So i think this example of yours is inapposite as well. Having said all that, i certainly subscribe to the view that neither kgdb nor ptrace is particularly cleanly done. So i wouldnt mind if something new existed that had a modern, flexible, extensible and generally pleasant interface and implementation. If you are heading in that direction, please let me know. Btw., perf does meet that definition: it functionally replaces all facilities that it overlaps/extends - such as Oprofile. [...] (And they currently separately coexist.) You didnt get my point apparently. Keeping the overlapped facility for compatibility (and general user inertia) is fine. Creating a new facility that doesnt do everything that the existing facility does, and not integrating it either, is not fine. oprofile and perfctr are closer in concept than kgdb and ptrace, yet AFAIK perfctr doesn't interface to oprofile, except perhaps to the extent of resolving contention over the underlying physical resources. In any case this is not a great analogy. (FYI, 'perfctr' is a different project that has existed for years, i suspect you meant perf events?) perf replaces oprofile functionally. If the in-kernel gdb stub replaced kgdb functionally you'd hear no complaints from me. Ingo
Re: [RFC,PATCH 14/14] utrace core
On 12/01, Peter Zijlstra wrote: On Tue, 2009-11-24 at 21:02 +0100, Oleg Nesterov wrote: + para +There is nothing a kernel module can do to keep a structnamestruct +task_struct/structname alive outside of +functionrcu_read_lock/function. Sure there is, get_task_struct() comes to mind. it is not exported ;) Peter, I skipped other comments about the documentation, I never read it myself. Update: I skipped a lot more for today ;) @@ -351,6 +394,10 @@ static inline void tracehook_report_vfor */ static inline void tracehook_prepare_release_task(struct task_struct *task) { + /* see utrace_add_engine() about this barrier */ + smp_mb(); + if (task_utrace_flags(task)) + utrace_release_task(task); } OK, that seems to properly order -exit_state vs -utrace_flags, This site does: assign -state mb observe -utrace_flags and the other site does: assign -utrace_flags mb observe -exit_state Yes, we hope. @@ -560,6 +625,20 @@ static inline void tracehook_report_deat int signal, void *death_cookie, int group_dead) { + /* +* This barrier ensures that our caller's setting of +* @task-exit_state precedes checking @task-utrace_flags here. +* If utrace_set_events() was just called to enable +* UTRACE_EVENT(DEATH), then we are obliged to call +* utrace_report_death() and not miss it. utrace_set_events() +* uses tasklist_lock to synchronize enabling the bit with the +* actual change to @task-exit_state, but we need this barrier +* to be sure we see a flags change made just before our caller +* took the tasklist_lock. +*/ + smp_mb(); + if (task_utrace_flags(task) _UTRACE_DEATH_EVENTS) + utrace_report_death(task, death_cookie, group_dead, signal); } I don't think its allowed to pair a mb with a lock-barrier, since the lock barriers are semi-permeable. Could you clarify? @@ -589,10 +668,20 @@ static inline void set_notify_resume(str * asynchronously, this will be called again before we return to * user mode. * - * Called without locks. + * Called without locks. However, on some machines this may be + * called with interrupts disabled. */ static inline void tracehook_notify_resume(struct pt_regs *regs) { + struct task_struct *task = current; + /* +* This pairs with the barrier implicit in set_notify_resume(). +* It ensures that we read the nonzero utrace_flags set before +* set_notify_resume() was called by utrace setup. +*/ + smp_rmb(); + if (task_utrace_flags(task)) + utrace_resume(task, regs); } Sending an IPI implies the mb? Yes, but this has nothing to do with IPI. The caller, do_notify_resume(), does: clear_thread_flag(TIF_NOTIFY_RESUME); tracehook_notify_resume: if (task_utrace_flags(task)) utrace_resume(); We should not read task_utrace_flags() before we clear TIF_NOTIFY_RESUME. +static inline struct utrace *task_utrace_struct(struct task_struct *task) +{ + struct utrace *utrace; + + /* +* This barrier ensures that any prior load of task-utrace_flags +* is ordered before this load of task-utrace. We use those +* utrace_flags checks in the hot path to decide to call into +* the utrace code. The first attach installs task-utrace before +* setting task-utrace_flags nonzero, with a barrier between. +* See utrace_task_alloc(). +*/ + smp_rmb(); + utrace = task-utrace; + + smp_read_barrier_depends(); /* See utrace_task_alloc(). */ + return utrace; +} I spot two barriers here, but only 1 over in utrace_task_alloc(), hmm? smp_read_barrier_depends() pairs with utrace_task_alloc()-wmb(). smp_rmb() is needed for another reason. Suppose the code does: if (task_utrace_flags() SOMETHING) do_something_with(task-utrace); if we race with utrace_attach_task(), we can see -utrace_flags != 0 but task-utrace == NULL without rmb(). +struct utrace_engine { +/* private: */ + struct kref kref; + void (*release)(void *); + struct list_head entry; + +/* public: */ + const struct utrace_engine_ops *ops; + void *data; + + unsigned long flags; +}; Sorry, the kernel is written in C, not C++. Hmm. I almost never read the comments, but these 2 look very clear to me ;) + * Most callbacks take an @action argument, giving the resume action + * chosen by other tracing engines. All callbacks take an @engine + * argument, and a @task argument, which is always equal to @current. Given that some functions have a lot of arguments (depleting regparam), isn't it more expensive to push current on the stack than it is to simply read it again? Yes, perhaps. Only -report_reap() really needs @task, it may be
Re: [RFC,PATCH 14/14] utrace core
Could we just drop the tracehook layer if this finally merged and call the low level functions directly? We can certainly do appropriate streamlining cleanups later, by all means. The original purpose of the tracehook layer is simply this: A person hacking on core kernel code or arch code should not have to think about all the innards of tracing (ptrace, utrace, or anything else). If he comes across a tracehook_* call, he can just read its kernel-doc for explanation of its parameters, return value, what it expects about its context (locking et al), and what the semantic significance of making that particular call is. If changes to the core/arch code in question do not require changing any of those aspects, then said person need not consider tracing issues further. If a change to a function's calling interface or contextual assumptions looks warranted, then he knows he should discuss the details with some tracing-related folks (i.e. find tracehook.h in MAINTAINERS and thus get to me and Oleg). Likewise, a person hacking on tracing code should not have to think about every corner of interaction with the core code or arch code. Each tracehook_* call's kernel-doc comments say everything that matters about how and where it's called. If some of those details are problematic for what we want to do inside the tracing code, then we know we have to hash out the details with the maintainers of the core or arch code that makes those calls. Otherwise we can keep our focus on tracing infrastructure without spending time getting lost in arcane details of the arch or core kernel code. These two things seem permanently worthwhile to me: that the core and arch source code use simple function calls without open-coding any innards of the tracing infrastructure; and that these functions have clear and complete documentation about their calling interfaces and context (locking et al). I find it natural and convenient that such calls have a common prefix that makes it obvious to any reader of the core code what subsystem the call relates to. Beyond those ideas, I certainly don't care at all what the names of these functions are, what common prefix is used, or in which source files those declarations and kernel-doc comments appear. Thanks, Roland
Re: [RFC,PATCH 0/14] utrace/ptrace
Note to all: I'm on the road this week (having had a holiday last week) and will be somewhat slow in replying on these threads, but I will be sure to get to them all. Yes, nobody likes 2 implementations. I guess Roland and me hate CONFIG_UTRACE much more than anybody else. :-) We both hate maintaining the old ptrace implementation, that is true! The other thing we hate is having our work delayed arbitrarily again and again to wait for the arch maintainers of arch's we don't use ourselves. Oleg and I have been trying to follow the advice we get on how to get this work merged in. When multiple gate-keepers give conflicting advice, we really don't know what to do next. Our interest is in whatever path smooths the merging of our work. We'd greatly prefer to spend our time hacking on our new code to make it better or different in cool and useful ways than to spend more time guessing what order of patches and rejuggling of the work will fit the changing whims of the next round of review. We don't want to rush anyone, like uninterested arch maintainers. We don't want to break anyone who doesn't care about our work (we do test for ptrace regressions but of course new code will always have new bugs so some instances of instability in the transition to a new ptrace implementation have to be expected no matter how hard we try). We just don't want to keep working out of tree. The advantage of making the new code optional is, obviously, that you can turn it off. That is, lagging arch's won't be broken, just unable to turn it on. And, anyone who doesn't want to try anything new yet can just turn it off and not be exposed to new code. The advantage of making the new code nonoptional is, obviously, that you can't turn it off. The old implementation will be gone and we won't have to maintain it any more (outside some -stable streams for a while). The kernel source will be cleaner with no optional old cruft code duplicating functionality. All ptrace users will be testing the new code, and so we'll find its bugs and gain confidence that it works solidly. Like I've said, we want to do whatever the people want. My concern about what Christoph has suggested is that it really seems like an open-ended delay depending on some arch people who are not even in the conversation. For anyone either who likes utrace or who is concerned about its details, the sooner we are working in-tree the sooner we can really hash it out thoroughly and get to having merged code that works well. As long as there is anything unfinished like arch work that we've decided is a gating event, then the review and hashing out of the new code itself does not seem to get very serious. (To wit, this thread is still talking about merge order and such, another long thread was about incidental trivia, and we've only just started to see a tiny bit of actual code review today.) Thanks, Roland
Re: [RFC,PATCH 14/14] utrace core
This above text seems inconsistent. First you say report_reap() is the final callback and should be used for cleanup, then you say report_death() is the penultimate callback but might not always happen and people would normally use that as cleanup. If we cannot rely on report_death() then I would suggest to simply recommend report_reap() as cleanup callback and leave it at that. I'm sorry the explanation was not clear to you. I'll try to make it clear now and then I'd appreciate your suggestions on how the documentation could be worded better. (I do appreciate your suggestion here but it's one based on an idea that's not factual, so I don't think we should follow it.) There is no unreliable aspect to it. If you call utrace_set_events() to ask for report_death() callbacks then you will get that callback for that task--guaranteed--unless utrace_set_events() returned an error code that tells you unambiguously that you could not get that callback. What's true is that report_reap() is the last callback you can ever get for a given task. If you had asked for report_death() callbacks, then you always get report_death() and if you've asked for both these callbacks then report_death() is always before report_reap(). (This is a special ordering guarantee, because in actuality the release_task() call that constitutes reap can happen in the parent simultaneously with the task's own exit path.) The one situation in which you cannot get report_death() is when the task was already dead when you called utrace_set_events(). In that case, it gives an error code as I mentioned above. Even in that situation, you can still ask to enable just the report_reap() callback. With either of these, if you enabled it successfully, then you are guaranteed to get it. When you get report_death() and are not interested in getting report_reap() afterwards, then you can return UTRACE_DETACH from report_death(). If you don't detach, then you will get report_reap() later too. The documentation mentions using report_death() to detach and clean up because many kinds of uses will have no interest in report_reap() at all. The only reason to get a report_reap() callback is if you are interested in tracking when a task's (real) parent reaps it with wait* calls, but usually people are only really interested in a task until it dies. + para +There is nothing a kernel module can do to keep a structnamestruct +task_struct/structname alive outside of +functionrcu_read_lock/function. Sure there is, get_task_struct() comes to mind. __put_task_struct() is not exported to modules and so put_task_struct() cannot be used by modules. +still valid until functionrcu_read_unlock/function. The +infrastructure never holds task references of its own. And here you imply its existence. [I take it this refers to the next quoted bit:] Though neither +functionrcu_read_lock/function nor any other lock is held while +making a callback, it's always guaranteed that the structnamestruct +task_struct/structname and the structnamestruct +utrace_engine/structname passed as arguments remain valid +until the callback function returns. No, we do not imply that the utrace infrastructure holds any task reference. The current task_struct is always live while it's running and until it's passed to release_task(). The task_struct passed to callbacks is current. That's all it means. The true facts are that utrace callbacks are all made in the current task except for report_reap(), which is made at the start of release_task(). So the kernel's core logic is holding a task reference at all times that utrace callbacks are made. If you really think it is clearer to explain that set of facts as utrace holds a reference, then please suggest the exact wording you have in mind. The above seems weird to me at best... why hold a pid around when you can keep a ref on the task struct? A pid might get reused leaving you with a completely different task than you thought you had. The *_pid interfaces are only there because put_pid() can be used by modules while put_task_struct() cannot. If we can make __put_task_struct() an exported symbol again, I would see no reason for these *_pid calls. Right, except you cannot generally rely on this report_death() thing, so a trace engine needs to deal with the report_reap() thing anyway. This is a false antecedent. I hope the explanation above made this subject clear to you. + sect2 id=interlocktitleInterlock with final callbacks/title [...] Hrmm, better mention this earlier, or at least refer to this when describing the above cleanup bits. This explanation is somewhat long and has its own whole section so as to be thoroughly explicit and clear. Do you think there should just be a cross reference here in the earlier mention of report_reap() and report_death()? Or do you think