> This series is an RFC for utrace-based non-disruptive application core
> dumper. Per Roland McGrath, this is possibly one of the common use-cases
> for utrace.

I'm not sure whether it's the "common use-case" (debuggers are still that,
I think).  What I've noticed is that this is a feature that many users
often request (i.e. "much better gcore") and dearly want.  It's a fine
example of the sort of thing that utrace is intended to facilitate.

> This is the first foray of mine into the hairy core dump land.

I guess I've been roaming around there far too long, because the actual
core dump parts seem like simple stuff to me.  And you didn't even delve
into the core dump parts, you just call the old code!  So I'll take this as
"foray ... into hairy utrace-ish synchronization issues", and then not
believe you about the "first" part. ;-)

> Admittedly, I may have missed some (not so) subtle issues that need
> careful consideration.

My reactions are in three areas that I'll discuss somewhat independently.

1. technical issues with using binfmt->core_dump()
2. utrace issues
3. feature semantics issues


1. binfmt->core_dump()

What you're really calling is elf_core_dump().  I don't think we really
care about any feature to write other formats of core dump, so we can think
about what that code is doing rather than worrying about any supposed
general case of the binfmt hook.

elf_core_dump() expects to be called from do_coredump(), most
importantly, after coredump_wait().  Here every other task sharing
current->mm is known to be uninterruptibly blocked in exit_mm().
That makes it impossible for anything to change any mappings (the
only ways to do it are on your own mm, not on another process).
Because it expects this, elf_core_dump() does its mm-scanning work
without mmap_sem.  (See the comment near the top of elf_core_dump.)

There are several reasons why you can't have such an ironclad
guarantee for your case.  The first reason is simply that you don't
try to stop all the threads in the mm, only the thread group.  I
think that is perfectly fine as a feature decision for the thing
you're writing as far as which threads to include in the dump--but
that's separate from the kernel robustness mandate, under which you
cannot presume no such races.  The rest of the reasons are in the
realm of hairy synchronization issues, which I'll go into later.  
In short, I think you want down_read(&mm->mmap_sem) around calling
binfmt->core_dump().
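To make that concrete, here's a minimal sketch of the locking I mean.
The surrounding function and cprm are hypothetical, and the exact
core_dump() signature has changed across kernel versions, so treat this
as shape, not gospel:

	/*
	 * Hypothetical wrapper: unlike do_coredump(), we have no
	 * coredump_wait() guarantee that the mappings are frozen, so
	 * take mmap_sem for reading around the old dump code's
	 * mm-scanning work.
	 */
	struct mm_struct *mm = task->mm;
	int ret;

	down_read(&mm->mmap_sem);
	ret = mm->binfmt->core_dump(&cprm);	/* signature varies by version */
	up_read(&mm->mmap_sem);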

The other issue in this area is overloading mm->core_state.
(Obviously you have to do this to use the existing elf_core_dump
code that anchors its internal state records there.)  Since you're
doing this, you have to make 100% sure that no task in the mm can
get to exit_mm() and look at your struct core_state.  (Or else, you
have to make it safe for that to happen, by the way you set up and
tear down your struct, which may be tricky.)  In utrace terms, this
happens after the exit event and before the death event.

One simple way is to just hold down_write(&mm->mmap_sem) around
calling elf_core_dump().  Then threads can get to exit_mm() but they
will just block in down_read().  But for several reasons, I think
that is improper and a bad plan.

The other possible ways tie into the utrace issues I'll get into
below, but they are also imperfect.  (I'm not really offering
solutions to these problems here, more just citing some cans of
worms.  At the end, I'll have some conclusions stemming from the
whole can collection.)


2. utrace stuff

EXEC is not an interesting event here.  That event is after everything
has changed.  So either you got the exec'ing thread stopped before it
actually started the exec (i.e. syscall entry tracing or before), or
you didn't.  If you did, then you are done and gone before the exec
happens.  If you didn't, then you don't really care that an exec just
happened.  That just happens to be the new state of the process before
you took the snapshot.  In an MT exec, the other threads will have
died and been reaped before the EXEC event is reported.  So an MT exec
case is really just like any other case where you have one thread
stopped successfully (in this case, after it exec'd), and meanwhile
some other threads died before you got them stopped.

EXIT is a potentially interesting event, but it's not the right one
for the way you are using it.  That event means "starting to exit".
Then the thread does potentially a whole lot, conceivably including
arbitrary blocks, changing all sorts of shared state, etc., before the
DEATH event.  So it's not right to wash your hands of the thread in
EXIT and immediately forget it exists--it could be doing more
interfering things thereafter.  Moreover, you might have already
missed the EXIT event when you attach (utrace_set_events can't tell,
unlike for DEATH and REAP) and then you would get that callback before
your death callback.

You don't really need or want a separate death callback.  Your quiesce
callback will get called first anyway.  It needs to check for the
death case: either the event argument or task->exit_state.  When
it's called for the DEATH event, you shouldn't record the task as
"quiescent".  You can't (necessarily) make user_regset calls on a dead
thread.  Instead, that case is where you should drop your record of
the task as you are now doing on the EXIT event.
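Concretely, the check at the top of the quiesce callback can be this
simple.  (A fragment, not a whole function; the callback signature is
per one vintage of the utrace API, where report callbacks run on
current, and gcore_forget_thread is a hypothetical bookkeeping helper.)

	/* At the top of the report_quiesce callback: */
	if ((event & UTRACE_EVENT(DEATH)) || current->exit_state) {
		/* Dead or dying: drop our record of the thread and
		 * make no user_regset calls on it. */
		gcore_forget_thread(engine->data);	/* hypothetical */
		return UTRACE_DETACH;
	}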

I said EXIT is "potentially interesting".  You don't need that event
for any bookkeeping reason.  EXIT and SYSCALL_ENTRY are the two events
that are "potentially interesting" in the same way.  These are the two
event reports that precede some complex kernel activity, rather than
being closely succeeded by user mode.  If anyone is tracing these,
then you will get report_quiesce callbacks at these events.  But you
are not guaranteed those callbacks just because you are listening for
QUIESCE.  So if you want to maximize on "make all threads report
quickly", then you should enable the EXIT and SYSCALL_ENTRY event
bits.  (There's no point enabling all the other events, because your
UTRACE_STOP ensures a report_quiesce callback of some sort either at
or immediately after each other event.)

You don't actually need to define any callback functions for these.
You don't have any special work for them; the report_quiesce callback
will have done it already.  You can make report_quiesce call
utrace_set_events to reset the bits and that takes effect immediately
so it doesn't try to make the event-specific callback for the very
same event that this report_quiesce call is about.  After you've
gotten report_quiesce, you don't necessarily need any events enabled
any more (only perhaps QUIESCE or DEATH or REAP for bookkeeping
purposes).
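In code, the attach-time setup this adds up to might look like the
following sketch.  utrace_attach_task, utrace_set_events, and
utrace_control are the real utrace entry points; gcore_ops and the
surrounding error handling are mine:

	struct utrace_engine *engine;
	int ret;

	engine = utrace_attach_task(task, UTRACE_ATTACH_CREATE,
				    &gcore_ops, NULL);
	if (IS_ERR(engine))
		return PTR_ERR(engine);

	/*
	 * QUIESCE is the event we actually want.  EXIT and
	 * SYSCALL_ENTRY are enabled only so threads report sooner;
	 * the quiesce callback resets the mask to drop them again.
	 */
	ret = utrace_set_events(task, engine,
				UTRACE_EVENT(QUIESCE) | UTRACE_EVENT(EXIT) |
				UTRACE_EVENT(SYSCALL_ENTRY));
	if (!ret)
		ret = utrace_control(task, engine, UTRACE_STOP);
	/* 0 here means already stopped: do the third-party examination;
	   -EINPROGRESS means the stop is pending and report_quiesce
	   will run. */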

That brings us to the SIGKILL question.  Whatever place you got each
thread to stop, it can always be suddenly woken up by a SIGKILL.
There are two issues with this and two approaches to coping with each.

First is bookkeeping: that you don't get confused about the state of
the thread and the utrace_engine.  For that you can make sure to have
a REAP, DEATH, or QUIESCE callback enabled (QUIESCE will get you a
callback for death) that cleans up your bookkeeping synchronously so
that you can never try to do anything with the dead task.  Or, you can
do it "passively" just by having your asynchronous uses cope with
-ESRCH returns from utrace calls to mean the non-error case of the
task having been reaped and your engine implicitly detached.
Importantly, you'd have to use utrace_prepare_examine et al properly
around access to the task (user_regset et al) unless you hold your own
task refs.

Second is safety in actually examining the threads.  For that you have
the same two sorts of options.  First option, you make sure to have a
DEATH or QUIESCE callback that cleans up your own bookkeeping
synchronously so you ignore the thread.  Second option, you cope
gracefully with failure in your asynchronous examination.  To do it
safely you have to hold task refs and/or use utrace_prepare_examine et
al around the user_regset calls et al and be prepared to punt the
thread silently when they fail.
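The second option's "do it safely" dance goes roughly like this.
utrace_prepare_examine/utrace_finish_examine are the real API; this is
a fragment, and the punt-on-failure policy is the point:

	struct utrace_examiner exam;
	int ret;

	if (utrace_prepare_examine(task, engine, &exam))
		return;		/* died or resumed: silently punt the thread */

	/* Between prepare and finish, examining the stopped thread is safe. */
	ret = regset->get(task, regset, 0, regset->n * regset->size,
			  buf, NULL);

	if (ret || utrace_finish_examine(task, engine, &exam))
		return;		/* raced with SIGKILL: discard what we read */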

Each of those is potentially hairy to get exactly right.  And, in fact
you have to do some measure of the second just to do the first.  That
is, DEATH event reports are after the point at which user_regset calls
may no longer work (they could return -EIO or something).  So your
calls have to be robust to that, even if you are synchronizing as much
as you can.

Finally, there is the issue of blocked/nonresponsive threads.  This is
not so much an issue about how you use utrace per se.  It's just a set
of facts and constraints on what choices you have for the semantics.
Threads that are in the kernel, either blocked or arbitrarily
long-running (though the latter shouldn't happen), won't report a QUIESCE
(or any other) event for an arbitrarily long time.  The normal case is a thread
blocked in a system call, though there could be others.  I'll discuss
the semantics question below.  The upshot for the mechanics of the
code is that when utrace_control() returns -EINPROGRESS you have to
notice when the thread is not responding quickly, and decide what to do.
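Mechanically that might look like this sketch, where the waitqueue and
the helpers (dump->stopped_wait, thread_is_stopped, give_up_on) are
hypothetical stand-ins for your bookkeeping:

	ret = utrace_control(task, engine, UTRACE_STOP);
	if (ret == -EINPROGRESS) {
		/*
		 * The stop is pending but the thread is in the kernel,
		 * probably blocked in a syscall.  Give it a bounded
		 * time to report, then fall back to a policy from
		 * section 3: skip it, zero its registers, or
		 * UTRACE_INTERRUPT it.
		 */
		if (!wait_event_timeout(dump->stopped_wait,
					thread_is_stopped(dump, task),
					msecs_to_jiffies(500)))
			give_up_on(dump, task);
	}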


3. semantics of the feature

Calling elf_core_dump() is obviously attractive when writing the
code--there's a big chunk you just call and don't think about.  That's
natural and fine for a first prototype.  But it imposes a variety of
constraints on how you can do things.  Some of these impinge on the
implementation issues I've mentioned above.  Others just affect what
options you have for the flexibility and semantics of the feature you
provide.

You create the file and do all the i/o in the context of some thread in
the process being examined.  This has two kinds of issues: "disruptive"
issues, and feature semantics issues.  Of the former, the possibilities
I can think of immediately are SIGXFSZ and quota warnings that could be
generated for the process, but there may be other issues of that ilk.
Those are positive disruptions; there is also the related ilk of
"perturbations", such as i/o charged to the process (or CPU time for
filesystem or i/o work).

As to the latter, the first point is that I think it's just not what
people would expect by default.  I think people expect "like gcore",
i.e. the instigator of the dump must have permissions for full
inspection and control a la ptrace, and then all the actual file writing
happens entirely in the context of the dumper, not the dumpee.  (Context
here means all of privilege, filesystem namespace, etc.)  The more
general point is that we'd like the dumper to be in a position to offer
all manner of flexibility in where and how the data gets sent.

The next issue about the feature is just the general issue of
flexibility.  One thing people have mused could be nice about a "fancy
new core dumper" is the chance for arbitrary new amounts of flexibility
in exactly what a dump-taker wants to do.  An example is the choices
that vma_dump_size() makes, i.e. which memory to include in the dump.
These are controlled by coredump_filter, so a crude level of control for
that example is possible just by fiddling that momentarily.  But one can
imagine fancier users wanting to apply more complex criteria there.
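For reference, the "fiddle it momentarily" version is a trivial bit of
dump-taker userland; this sketch assumes the dumper saves and restores
the old value around the dump:

	#include <stdio.h>
	#include <sys/types.h>

	/* Momentarily rewrite a target's coredump_filter; the bit
	 * meanings are documented in Documentation/filesystems/proc.txt. */
	static int set_coredump_filter(pid_t pid, unsigned int mask)
	{
		char path[64];
		FILE *f;

		snprintf(path, sizeof(path),
			 "/proc/%d/coredump_filter", (int)pid);
		f = fopen(path, "w");
		if (!f)
			return -1;
		fprintf(f, "%#x\n", mask);
		return fclose(f);
	}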

Another aspect of flexibility is what you do about threads that will not
stop quickly.  It's not kosher to make user_regset calls on these.  They
might well claim to work, but some or all of their data could be bogus,
could be leaking uninitialized kernel stack bits, etc.  Among options
that are purely nonperturbing, you have several in the abstract.  But
using elf_core_dump() only gives you one: omit the threads entirely.
You have to keep them out of your list to prevent it from making those
bogus user_regset calls, so you don't have any other choice.

I mentioned "purely nonperturbing" options.  There is also the option to
perturb, and variants thereof.  That is, you can use UTRACE_INTERRUPT.
Using it blindly is what today's "gcore" does implicitly--interrupt
every system call, no matter the implications for the program logic.
That's always an option, though it does not qualify as nondisruptive.
Its effect is like using SIGSTOP + SIGCONT (when the process is ignoring
SIGCONT).  For many system calls, this is no problem: they abort with
-ERESTART*, get rolled back, and restart as if nothing happened.  For
various things like i/o calls this is also true but it may also be that
there are perturbations, ones that are officially harmless, but still
could affect the program's experience of the system.  Other things,
like some sleep calls, supposedly restart correctly, but you can't
be entirely sure about the effects.  There are probably others that are
just completely perturbed by "interrupt with no signal".

From there, you can get into a world of "lightly perturbing" options.
That is, you can get fancy with looking at the state of the thread with
asm/syscall.h calls and decide what is safe to do.  If it's blocked and
not in a syscall, it ought to be in a page fault (or maybe some other
kind of machine exception).  It should be harmless to interrupt that,
since the faulting insn will just start over again in the end.  For
various particular syscalls recognized from syscall_get_nr(), you can
assert that it's safe to interrupt them because they are known to
restart harmlessly enough.  For example, futex is this way (and should
be a very common case).  But it all gets into a touchy area.  I think
you see why I raise this as a "flexibility" question rather than citing
one right thing to do.
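Sketched with the asm/syscall.h helpers, it's something like this.
The whitelist is purely illustrative, and this is only meaningful on a
thread that is already stopped (task_pt_regs on a running thread tells
you nothing):

	#include <asm/syscall.h>
	#include <asm/unistd.h>

	/* Illustrative triage: is it probably harmless to interrupt
	 * this blocked thread?  The whitelist is an example, not a
	 * recommendation. */
	static bool safe_to_interrupt(struct task_struct *task)
	{
		struct pt_regs *regs = task_pt_regs(task);

		switch (syscall_get_nr(task, regs)) {
		case -1:		/* not in a syscall: page fault or
					   other exception, restarts cleanly */
			return true;
		case __NR_futex:	/* known to restart harmlessly */
		case __NR_wait4:
			return true;
		default:
			return false;
		}
	}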

Back to purely nonperturbing options, there are a few that make sense
offhand.  Skipping the thread as if it doesn't exist at all is one, but
not the one I'd ever suggest.  The next obvious simple thing is to
include a record (struct elf_prstatus) for the task, but just give all
zero for its register data (pr_reg).  That says there is a thread, and
all the info that is easily at hand about it, but not what it's doing.
The most satisfying thing is like that, but fill in "as much as you can"
for the register data (you can get at least what /proc/pid/syscall has,
if not more).  I have an idea about that, but that can wait until after
we've gotten to a place where we have any choice in the matter.  (The
most satisfying thing overall is probably to combine that latter with a
simple version of the "lightly perturbing", e.g. interrupt if not in a
syscall or if in a futex or wait call.)
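The zero-register record option is about this small in elf_prstatus
terms (a sketch; the note-writing glue and the "as much as you can"
overlay are left out):

	struct elf_prstatus prstatus;

	memset(&prstatus, 0, sizeof(prstatus));	/* pr_reg stays all zero */
	prstatus.pr_pid = task_pid_vnr(task);
	/* pr_ppid, pr_sigpend, CPU times, etc. are all cheaply at hand;
	 * an "as much as you can" variant could overlay e.g. the sp/pc
	 * that /proc/pid/syscall exposes, taken from task_pt_regs(task). */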

Leaving nonresponsive threads aside, there is another angle of
flexibility that I always figured people might like in gcore.  That is,
how much it synchronizes.  For a coherent complete snapshot, you need
every thread to stop.  But even if everything goes completely smoothly
in the terms we've been discussing, that pause could be a significant
perturbation, e.g. in a large process with many many threads to stop and
a lot of memory to spend i/o time dumping.  For nondisruptive sampling,
you might want the option to sample each thread, but not stop them all.
That can get a full register snapshot of the ones running in user mode
(for which just that may be very informative), and "as much as you can"
register data for blocked ones.  You might do that for a "threads-only"
snapshot and not dump any memory (could be very quick, minimal i/o,
useful for flight-recorder, health-monitoring, etc.).  Or you might dump
memory too, where you are synchronizing against mmap changes, but
letting threads run--then the stacks for the blocked ones are complete,
even if still-running ones are being overwritten while you dump.  (That
e.g. might be the right balance for a monitoring sample where you just
want to notice the details of threads staying blocked for a long time
while minimizing any delays to the highly-active threads.)


Conclusions.

I've mentioned lots of kinds of flexibility one might like in the core
dumper's features.  This doesn't mean I expect you to add every bell and
whistle I can imagine.  It's just to get you thinking, and to advocate
for a different way to structure the code that would make it possible to
do more different and fancy things eventually.

In #1 there are two issues due to using elf_core_dump().  One is
mmap/vma synchronization, which you can just solve with mmap_sem.  The
other is mm->core_state and the whole related set of synchronization
problems.  In #2 most of the unresolved complexities have to do with the
essential synchronization problems of third-party examination.  In #3
all the oddities and lack of flexibility derive from the basic
constraint of what elf_core_dump() does.

All of that is far simpler IMHO if you take a different approach and do
not use elf_core_dump() at all.  There is only a little code from there
I would really want to reuse/refactor.  For a prototype to start with I
think it's certainly fine to just copy those few routines, and we can
worry later about refactoring to share code with elf_core_dump.

For overloading mm->core_state, you just don't do it.  Don't touch that,
and you don't interact with do_coredump() at all, except inasmuch as it
can instigate some threads dying before you stop them (which is always
possible anyway).

What I'd do is have each thread fill in its own info (prstatus and
regsets) directly in the utrace callback.  Then once each task is logged
as stopped you only need to keep track of them as long as you want them
in UTRACE_STOP (if doing a non-stop option, detach them right there).
You still have to do the third-party examination when
utrace_control(,,UTRACE_STOP) returns zero because it's already stopped
(either another debugger or job control stop).  But that's a pretty easy
case where you know you only have to worry about the SIGKILL races.
(It's good enough just to zero the buffers beforehand and ignore errors
from the regset calls.)

That way you just use the QUIESCE callback.  If it's for death, you
detach and forget the thread ever existed.  If not, you reset your
engine with utrace_set_events to zero, collect all the data on yourself,
and return UTRACE_STOP.  There are no other races to worry about.  After
that, if threads die, they die, but you aren't looking anyway.  You only
keep track of them at all so that they stay in UTRACE_STOP, until you
call utrace_control(,,UTRACE_DETACH) after the dump (or before, in a
non-stop option).  That returns -ESRCH if it died and detached already,
but you don't care anyway.
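Pulling the pieces together, the whole per-thread flow fits in the one
callback.  A sketch, again against one vintage of the utrace API;
gcore_collect, gcore_mark_stopped, gcore_forget_thread, and the
engine->data bookkeeping struct are all hypothetical:

	static u32 gcore_report_quiesce(u32 action,
					struct utrace_engine *engine,
					unsigned long event)
	{
		struct gcore_thread *thr = engine->data;

		if ((event & UTRACE_EVENT(DEATH)) || current->exit_state) {
			gcore_forget_thread(thr); /* as if it never existed */
			return UTRACE_DETACH;
		}

		/* No more events wanted; this takes effect immediately. */
		utrace_set_events(current, engine, 0);

		/* Collect prstatus and regsets on ourselves: no races here. */
		gcore_collect(thr, current);
		gcore_mark_stopped(thr);	/* wake the dump-taker */

		return UTRACE_STOP;
	}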

Core dumping has three components: thread stuff, memory collection, and
file writing.  I include all the synchronization bits in "thread stuff",
which I've outlined an approach for.  The memory collection and file
writing are pretty well independent of the thread stuff.  The memory
collection is straightforward and there is just one way to do it (modulo
flexibility in elision choices).  The file writing is defined by what
interface and control semantics you are using to drive the feature.

At least for a normal fully-synchronous snapshot, in any interface you
have to do the thread stuff first.  Before you start reading memory you
need to know threads are stopped so they won't touch it, and you need
them to call any regset writeback hooks.  You can't even take mmap_sem
before finishing thread stuff, because if a thread can enter the kernel
and might do down_write(), you could deadlock waiting for it to finish
and stop while it waits for you to up_read().  So every implementation
plan for whatever interface will probably have that in common: first,
you do thread stuff, then you take mmap_sem and do memory stuff.

Part of my point in bringing up all the flexibility is that I think in
the long run what we'll want is to break up the core dump work in the
kernel into several pieces that can be reused in a variety of contexts
and plug into multiple different interfaces.  But we don't have to worry
about that for the prototype.

For a standalone feature prototype somewhat along the lines of what
you've done, one interface idea is attracting my attention.  That is a
virtual-file "pull" interface, akin to /proc/kcore.  So, rather than
userland magically poking a "write a file" command, instead userland
gets to read a virtual file of core dump contents and any actual i/o
writing the data out is userland's business.  What I have in mind is to
support random-access reads; this gives userland automatic flexibility
to just skip over some parts if it wants, and for the core file portions
corresponding to memory, this will completely elide fetching those pages.

Userland can use normal reads, or shovel the data with splice or
sendfile to write a file directly.  Later, we can soup up the internals
so that splice/sendfile from the virtual file does optimized internal
magic to avoid copying the pages.

Here's the interface I have in mind: magic file /proc/pid/gcore.  When
you open the file, this starts trying to stop all the threads.  When you
first actually read from the file, this blocks until all the thread
stuff is done.  Then it takes mmap_sem and collects all the phdrs.  That
is all the information other than actual memory contents, and it's
enough to know everything about the file layout.  You can collect that
info directly on some pages in the core file format; the ehdr and phdrs
you collect are all the information you need to do later i/o.  (After
that you can actually drop mmap_sem and it no longer matters.)  Once the
notes/headers (including all thread stuff) are ready, then you keep
those pages around as long as the file is open.  A read call that asks
for those initial pages just gets those saved pages.  A read call for a
larger offset is in the part of the file that corresponds to memory.
You can look up that offset in the phdrs you saved (in part of those
initial pages) and map from p_offset to p_vaddr, then just use
get_user_pages/access_process_vm to read those pages.  The only other
bookkeeping you need is the list of tasks+engines you have attached;
when the virtual file is closed, you detach those engines and clean up.
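In file_operations terms, the shape is roughly this.  Heavily hedged
sketch: every gcore_* helper is hypothetical and the real thing needs
the usual /proc plumbing:

	static int gcore_open(struct inode *inode, struct file *file)
	{
		/* Attach engines and fire off UTRACE_STOP for every
		   thread; do not wait for them here. */
		return gcore_begin(inode, &file->private_data);
	}

	static ssize_t gcore_read(struct file *file, char __user *buf,
				  size_t count, loff_t *ppos)
	{
		struct gcore_state *g = file->private_data;
		int err;

		err = gcore_wait_threads(g); /* block until thread stuff done */
		if (err)
			return err;
		gcore_make_headers(g);	/* once: mmap_sem, ehdr/phdrs/notes */

		if (*ppos < g->headers_size)	/* saved pages cover the start */
			return gcore_read_headers(g, buf, count, ppos);

		/* Beyond the headers: map p_offset back to p_vaddr via
		   the saved phdrs, then access_process_vm() to fetch
		   the memory contents. */
		return gcore_read_memory(g, buf, count, ppos);
	}

	static int gcore_release(struct inode *inode, struct file *file)
	{
		gcore_end(file->private_data);	/* UTRACE_DETACH all engines */
		return 0;
	}

	static const struct file_operations proc_gcore_operations = {
		.read		= gcore_read,
		.open		= gcore_open,
		.release	= gcore_release,
		.llseek		= generic_file_llseek,
	};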

For further fanciness, we can later make it support nonblocking read,
poll, and fasync.  This lets a fancy user open the file, get a
notification when it's ready to read (all threads have stopped and
collected register data), and then read when they know it won't block
(except for page fault blocks in memory areas).

Other fanciness that is probably more important, we can later add an
optimized splice_read callback to the virtual file.  This can send the
saved initial pages and the memory pages from get_user_pages() directly
to splice_to_pipe.  This makes splice/sendfile calls do single-copy i/o
that should go as fast as traditional direct-from-kernel core writing,
if not faster.

This is only one interface idea, but I think it's got some sex appeal.
The main point about organizing the control flow/synchronization plan in
this way applies to any sort of interface.


Thanks,
Roland
