> This series is an RFC for utrace-based non-disruptive application core
> dumper. Per Roland McGrath, this is possibly one of the common use-cases
> for utrace.
I'm not sure whether it's the "common use-case" (debuggers are still
that, I think). What I've noticed is that this is a feature that many
users often request (i.e. "much better gcore") and dearly want. It's a
fine example of the sort of thing that utrace is intended to facilitate.

> This is the first foray of mine into the hairy core dump land.

I guess I've been roaming around there far too long, because the actual
core dump parts seem like simple stuff to me. And you didn't even delve
into the core dump parts, you just call the old code! So I'll take this
as "foray ... into hairy utrace-ish synchronization issues", and then
not believe you about the "first" part. ;-)

> Admittedly, I may have missed some (not so) subtle issues that need
> careful consideration.

My reactions are in three areas that I'll discuss somewhat
independently.

1. technical issues with using binfmt->core_dump()
2. utrace issues
3. feature semantics issues

1. binfmt->core_dump()

What you're really calling is elf_core_dump(). I don't think we really
care about any feature to write other formats of core dump, so we can
think about what that code is doing rather than worrying about any
supposed general case of the binfmt hook.

elf_core_dump() expects to be called from do_coredump(), most
importantly, after coredump_wait(). Here every other task sharing
current->mm is known to be uninterruptibly blocked in exit_mm(). That
makes it impossible for anything to change any mappings (the only ways
to do it are on your own mm, not on another process). Because it
expects this, elf_core_dump() does its mm-scanning work without
mmap_sem. (See the comment near the top of elf_core_dump.)

There are several reasons why you can't have such an ironclad guarantee
for your case. The first reason is simply that you don't try to stop
all the threads in the mm, only the thread group.
I think that is perfectly fine as a feature decision for the thing
you're writing, as far as which threads to include in the dump--but
that's separate from the kernel robustness mandate, under which you
cannot presume no such races. The rest of the reasons are in the realm
of hairy synchronization issues, which I'll go into later. In short, I
think you want down_read(&mm->mmap_sem) around calling
binfmt->core_dump().

The other issue in this area is overloading mm->core_state. (Obviously
you have to do this to use the existing elf_core_dump code that anchors
its internal state records there.) Since you're doing this, you have to
make 100% sure that no task in the mm can get to exit_mm() and look at
your struct core_state. (Or else, you have to make it safe for that to
happen, by the way you set up and tear down your struct, which may be
tricky.) In utrace terms, this happens after the exit event and before
the death event.

One simple way is to just hold down_write(&mm->mmap_sem) around calling
elf_core_dump(). Then threads can get to exit_mm() but they will just
block in down_read(). But for several reasons, I think that is improper
and a bad plan. The other possible ways tie into the utrace issues I'll
get into below, but they are also imperfect. (I'm not really offering
solutions to these problems here, more just citing some cans of worms.
At the end, I'll have some conclusions stemming from the whole can
collection.)

2. utrace stuff

EXEC is not an interesting event here. That event is after everything
has changed. So either you got the exec'ing thread stopped before it
actually started the exec (i.e. syscall entry tracing or before), or
you didn't. If you did, then you are done and gone before the exec
happens. If you didn't, then you don't really care that an exec just
happened. That just happens to be the new state of the process before
you took the snapshot. In an MT exec, the other threads will have died
and been reaped before the EXEC event is reported.
So an MT exec case is really just like any other case where you have
one thread stopped successfully (in this case, after it exec'd), and
meanwhile some other threads died before you got them stopped.

EXIT is a potentially interesting event, but it's not the right one for
the way you are using it. That event means "starting to exit". Then
the thread does potentially a whole lot, conceivably including
arbitrary blocks, changing all sorts of shared state, etc., before the
DEATH event. So it's not right to wash your hands of the thread in
EXIT and immediately forget it exists--it could be doing more
interfering things thereafter. Moreover, you might have already missed
the EXIT event when you attach (utrace_set_events can't tell, unlike
for DEATH and REAP), and then you would get that callback before your
death callback.

You don't really need or want a separate death callback. Your quiesce
callback will get called first anyway. It needs to check for the death
case: check either the event argument or task->exit_state. When it's
called for the DEATH event, you shouldn't record the task as
"quiescent". You can't (necessarily) make user_regset calls on a dead
thread. Instead, that case is where you should drop your record of the
task, as you are now doing on the EXIT event.

I said EXIT is "potentially interesting". You don't need that event
for any bookkeeping reason. EXIT and SYSCALL_ENTRY are the two events
that are "potentially interesting" in the same way. These are the two
event reports that precede some complex kernel activity, rather than
being closely succeeded by user mode. If anyone is tracing these, then
you will get report_quiesce callbacks at these events. But you are not
guaranteed those callbacks just because you are listening for QUIESCE.
So if you want to maximize on "make all threads report quickly", then
you should enable the EXIT and SYSCALL_ENTRY event bits.
(There's no point for all the other events, because your UTRACE_STOP
ensures a report_quiesce callback of some sort either at or immediately
after each other event.) You don't actually need to define any
callback functions for these. You don't have any special work for
them; the report_quiesce callback will have done it already. You can
make report_quiesce call utrace_set_events to reset the bits, and that
takes effect immediately, so it doesn't try to make the event-specific
callback for the very same event that this report_quiesce call is
about. After you've gotten report_quiesce, you don't necessarily need
any events enabled any more (only perhaps QUIESCE or DEATH or REAP for
bookkeeping purposes).

That brings us to the SIGKILL question. Whatever place you got each
thread to stop, it can always be suddenly woken up by a SIGKILL. There
are two issues with this, and two approaches to coping with each.

First is bookkeeping: that you don't get confused about the state of
the thread and the utrace_engine. For that you can make sure to have a
REAP, DEATH, or QUIESCE callback enabled (QUIESCE will get you a
callback for death) that cleans up your bookkeeping synchronously, so
that you can never try to do anything with the dead task. Or, you can
do it "passively", just by having your asynchronous uses cope with
-ESRCH returns from utrace calls to mean the non-error case of the task
having been reaped and your engine implicitly detached. Importantly,
you'd have to use utrace_prepare_examine et al properly around access
to the task (user_regset et al) unless you hold your own task refs.

Second is safety in actually examining the threads. For that you have
the same two sorts of option. First option, you make sure to have a
DEATH or QUIESCE callback that cleans up your own bookkeeping
synchronously, so you ignore the thread. Second option, you cope
gracefully with failure in your asynchronous examination.
To do it safely you have to hold task refs and/or use
utrace_prepare_examine et al around the user_regset calls et al, and be
prepared to punt the thread silently when they fail. Each of those is
potentially hairy to get exactly right. And, in fact, you have to do
some measure of the second just to do the first. That is, DEATH event
reports come after the point at which user_regset calls may no longer
work (they could return -EIO or something). So your calls have to be
robust to that, even if you are synchronizing as much as you can.

Finally, there is the issue of blocked/nonresponsive threads. This is
not so much an issue about how you use utrace per se. It's just a set
of facts and constraints on what choices you have for the semantics.
Threads that are in the kernel, either blocked or arbitrarily
long-running (though that shouldn't happen), won't report a QUIESCE (or
any) event for an arbitrarily long time. The normal case is a thread
blocked in a system call, though there could be others. I'll discuss
the semantics question below. The upshot for the mechanics of the code
is that when utrace_control() returned -EINPROGRESS, you have to notice
if the thread is not responding quickly, and decide what to do.

3. semantics of the feature

Calling elf_core_dump() is obviously attractive when writing the
code--there's a big chunk you just call and don't think about. That's
natural and fine for a first prototype. But it imposes a variety of
constraints on how you can do things. Some of these impinge on the
implementation issues I've mentioned above. Others just affect what
options you have for the flexibility and semantics of the feature you
provide.

You create the file and do all the i/o in the context of some thread in
the process being examined. This has two kinds of issues: "disruptive"
issues, and feature semantics issues.
Of the former, the possibilities I can think of immediately are SIGXFSZ
and quota warnings that could be generated for the process, but there
may be other issues of that ilk. Those are positive disruptions; there
is also the related ilk of "perturbations", such as i/o charged to the
process (or CPU time for filesystem or i/o work).

As to the latter, the first point is that I think it's just not what
people would expect by default. I think people expect "like gcore",
i.e. the instigator of the dump must have permissions for full
inspection and control a la ptrace, and then all the actual file
writing happens entirely in the context of the dumper, not the dumpee.
(Context here means all of privilege, filesystem namespace, etc.) The
more general point is that we'd like the dumper to be in a position to
offer all manner of flexibility in where and how the data gets sent.

The next issue about the feature is just the general issue of
flexibility. One thing people have mused could be nice about a "fancy
new core dumper" is the chance for arbitrary new amounts of flexibility
in exactly what a dump-taker wants to do. An example is the choices
that vma_dump_size() makes, i.e. which memory to include in the dump.
These are controlled by coredump_filter, so a crude level of control
for that example is possible just by fiddling that momentarily. But
one can imagine fancier users wanting to apply more complex criteria
there.

Another aspect of flexibility is what you do about threads that will
not stop quickly. It's not kosher to make user_regset calls on these.
They might well claim to work, but some or all of their data could be
bogus, could be leaking uninitialized kernel stack bits, etc. Among
options that are purely nonperturbing, you have several in the
abstract. But using elf_core_dump() only gives you one: omit the
threads entirely. You have to keep them out of your list to prevent it
from making those bogus user_regset calls, so you don't have any other
choice.
I mentioned "purely nonperturbing" options. There is also the option
to perturb, and variants thereof. That is, you can use
UTRACE_INTERRUPT. Using it blindly is what today's "gcore" does
implicitly--interrupt every system call, no matter the implications for
the program logic. That's always an option, though it does not qualify
as nondisruptive. Its effect is like using SIGSTOP + SIGCONT (when the
process is ignoring SIGCONT). For many system calls, this is no
problem: they abort with -ERESTART*, get rolled back, and restart as if
nothing happened. For various things like i/o calls this is also true,
but it may also be that there are perturbations, ones that are
officially harmless, but still could affect the program's experience of
the system. For other things like some sleep calls, some supposedly
restart correctly, but you can't be entirely sure about the effects.
There are probably others that are just completely perturbed by
"interrupt with no signal".

From there, you can get into a world of "lightly perturbing" options.
That is, you can get fancy with looking at the state of the thread with
asm/syscall.h calls and decide what is safe to do. If it's blocked and
not in a syscall, that ought to be in a page fault (or maybe some other
kind of machine exception). It should be harmless to interrupt that,
since the faulting insn will just start over again in the end. For
various particular syscalls recognized from syscall_get_nr(), you can
assert that it's safe to interrupt them because they are known to
restart harmlessly enough. For example, futex is this way (and should
be a very common case). But it all gets into a touchy area. I think
you see why I raise this as a "flexibility" question rather than citing
one right thing to do.

Back to purely nonperturbing options, there are a few that make sense
off hand. Skipping the thread as if it doesn't exist at all is one,
but not the one I'd ever suggest.
The next obvious simple thing is to include a record (struct
elf_prstatus) for the task, but just give all zeros for its register
data (pr_reg). That says there is a thread, and gives all the info
that is easily at hand about it, but not what it's doing. The most
satisfying thing is like that, but filling in "as much as you can" for
the register data (you can get at least what /proc/pid/syscall has, if
not more). I have an idea about that, but that can wait until after
we've gotten to a place where we have any choice in the matter. (The
most satisfying thing overall is probably to combine that latter with a
simple version of the "lightly perturbing", e.g. interrupt if not in a
syscall, or if in a futex or wait call.)

Leaving nonresponsive threads aside, there is another angle of
flexibility that I always figured people might like in gcore. That is,
how much it synchronizes. For a coherent complete snapshot, you need
every thread to stop. But even if everything goes completely smoothly
in the terms we've been discussing, that pause could be a significant
perturbation, e.g. in a large process with many many threads to stop
and a lot of memory to spend i/o time dumping.

For nondisruptive sampling, you might want the option to sample each
thread, but not stop them all. That can get a full register snapshot
of the ones running in user mode (for which just that may be very
informative), and "as much as you can" register data for blocked ones.
You might do that for a "threads-only" snapshot and not dump any memory
(could be very quick, minimal i/o, useful for flight-recorder,
health-monitoring, etc.). Or you might dump memory too, where you are
synchronizing against mmap changes, but letting threads run--then the
stacks for the blocked ones are complete, even if still-running ones
are being overwritten while you dump. (That, e.g.,
might be the right balance for a monitoring sample where you just want
to notice the details of threads staying blocked for a long time while
minimizing any delays to the highly-active threads.)

Conclusions.

I've mentioned lots of kinds of flexibility one might like in the core
dumper's features. This doesn't mean I expect you to add every bell
and whistle I can imagine. It's just to get you thinking, and to
advocate for a different way to structure the code that would make it
possible to do more different and fancy things eventually.

In #1 there are two issues due to using elf_core_dump(). One is
mmap/vma synchronization, which you can just solve with mmap_sem. The
other is mm->core_state and the whole related set of synchronization
problems. In #2 most of the unresolved complexities have to do with
the essential synchronization problems of third-party examination. In
#3 all the oddities and lacks of flexibility derive from the basic
constraint of what elf_core_dump() does.

All of that is far simpler IMHO if you take a different approach and do
not use elf_core_dump() at all. There is only a little code from there
I would really want to reuse/refactor. For a prototype to start with,
I think it's certainly fine to just copy those few routines, and we can
worry later about refactoring to share code with elf_core_dump.

For overloading mm->core_state, you just don't do it. Don't touch
that, and you don't interact with do_coredump() at all, except inasmuch
as it can instigate some threads dying before you stop them (which is
always possible anyway).

What I'd do is have each thread fill in its own info (prstatus and
regsets) directly in the utrace callback. Then once each task is
logged as stopped, you only need to keep track of them as long as you
want them in UTRACE_STOP (if doing a non-stop option, detach them right
there).
You still have to do the third-party examination when
utrace_control(,,UTRACE_STOP) returns zero because the thread is
already stopped (either by another debugger or a job control stop).
But that's a pretty easy case where you know you only have to worry
about the SIGKILL races. (It's good enough just to zero the buffers
beforehand and ignore errors from the regset calls.)

That way you just use the QUIESCE callback. If it's for death, you
detach and forget the thread ever existed. If not, you reset your
engine with utrace_set_events to zero, collect all the data on
yourself, and return UTRACE_STOP. There are no other races to worry
about. After that, if threads die, they die, but you aren't looking
anyway. You only keep track of them at all so that they stay in
UTRACE_STOP, until you call utrace_control(,,UTRACE_DETACH) after the
dump (or before, in a non-stop option). That returns -ESRCH if it died
and detached already, but you don't care anyway.

Core dumping has three components: thread stuff, memory collection, and
file writing. I include all the synchronization bits in "thread
stuff", which I've outlined an approach for. The memory collection and
file writing are pretty well independent of the thread stuff. The
memory collection is straightforward, and there is just one way to do
it (modulo flexibility in elision choices). The file writing is
defined by what interface and control semantics you are using to drive
the feature.

At least for a normal fully-synchronous snapshot, in any interface you
have to do the thread stuff first. Before you start reading memory you
need to know threads are stopped so they won't touch it, and you need
them to call any regset writeback hooks. You can't even take mmap_sem
before finishing thread stuff, because if a thread can enter the kernel
and might do down_write(), you could deadlock waiting for it to finish
and stop while it waits for you to up_read().
So every implementation plan for whatever interface will probably have
that in common: first you do thread stuff, then you take mmap_sem and
do memory stuff.

Part of my point in bringing up all the flexibility is that I think in
the long run what we'll want is to break up the core dump work in the
kernel into several pieces that can be reused in a variety of contexts
and plug into multiple different interfaces. But we don't have to
worry about that for the prototype.

For a standalone feature prototype somewhat along the lines of what
you've done, one interface idea is attracting my attention. That is a
virtual-file "pull" interface, akin to /proc/kcore. So, rather than
userland magically poking a "write a file" command, instead userland
gets to read a virtual file of core dump contents, and any actual i/o
writing the data out is userland's business. What I have in mind is to
support random-access reads; this gives userland automatic flexibility
to just skip over some parts if it wants, and for the core file
portions corresponding to memory, this will completely elide fetching
those pages. Userland can use normal reads, or shovel the data with
splice or sendfile to write a file directly. Later, we can soup up the
internals so that splice/sendfile from the virtual file does optimized
internal magic to avoid copying the pages.

Here's the interface I have in mind: a magic file /proc/pid/gcore.
When you open the file, this starts trying to stop all the threads.
When you first actually read from the file, this blocks until all the
thread stuff is done. Then it takes mmap_sem and collects all the
phdrs. That is all the information other than actual memory contents,
and it's enough to know everything about the file layout. You can
collect that info directly on some pages in the core file format; the
ehdr and phdrs you collect are all the information you need to do later
i/o. (After that you can actually drop mmap_sem and it no longer
matters.)
Once the notes/headers (including all thread stuff) are ready, then you
keep those pages around as long as the file is open. A read call that
asks for those initial pages just gets those saved pages. A read call
for a larger offset is in the part of the file that corresponds to
memory. You can look up that offset in the phdrs you saved (in part of
those initial pages) and map from p_offset to p_vaddr, then just use
get_user_pages/access_process_vm to read those pages. The only other
bookkeeping you need is the list of tasks+engines you have attached;
when the virtual file is closed, you detach those engines and clean up.

For further fanciness, we can later make it support nonblocking read,
poll, and fasync. This lets a fancy user open the file, get a
notification when it's ready to read (all threads have stopped and
collected register data), and then read when they know it won't block
(except for page fault blocks in memory areas). Other fanciness that
is probably more important: we can later add an optimized splice_read
callback to the virtual file. This can send the saved initial pages
and the memory pages from get_user_pages() directly to splice_to_pipe.
This makes splice/sendfile calls do single-copy i/o that should go as
fast as traditional direct-from-kernel core writing, if not faster.

This is only one interface idea, but I think it's got some sex appeal.
The main point about organizing the control flow/synchronization plan
in this way applies to any sort of interface.

Thanks,
Roland