A subject that came up recently for David Smith is asynchronous detach.
So I've been thinking about accommodating it better in the interface.

One of the TODO features that I've been wanting to get to soon is global
tracing.  It so happens that David's work could use this feature.  But
that's not what makes me want to discuss both of them together.  Rather,
it's the similarity of the stickiest implementation question.

So first, what do I mean by global tracing?  The idea is you can
register a "global" utrace engine to watch events as they occur in
whatever thread, seeing all threads in the system.  This would use
register/unregister calls parallel to the per-task attach/detach, and a
call parallel to utrace_set_events.  The events and callbacks would be
the same, with some constraints.  I don't want to get into all the
details of global tracing here.  For now, I've just outlined it enough
to discuss its overlap with the asynchronous detach question.
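
Just to make that shape concrete, the registration calls might look
something like the declarations below.  These are purely illustrative
guesses by analogy with the per-task API; none of these functions exist,
and the names are made up.

        /* Hypothetical, for illustration only. */
        struct utrace_attached_engine *
        utrace_attach_global(const struct utrace_engine_ops *ops, void *data);

        int utrace_set_global_events(struct utrace_attached_engine *engine,
                                     unsigned long events);

        int utrace_detach_global(struct utrace_attached_engine *engine);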

Now, what do I mean by asynchronous detach?  

The "Teardown Races" rules (from utrace.txt, now a section under "utrace
concepts" in the new utrace manual) address the issues of detach vs
sudden deaths that are wholly outside of your control.  Those rules
ensure that when you think a task is stopped (i.e. via UTRACE_STOP), and
you use utrace_control(task, UTRACE_DETACH) from another thread, you
know for sure what callbacks do and don't get made.  That's synchronous
detach, because you synchronized with the thread by having it return
UTRACE_STOP from a callback at some point, so it's stayed stopped until
your utrace_control call.  The cases where utrace_control returns -ESRCH
or -EALREADY only arise here in the event of a simultaneous SIGKILL.
(It's also synchronous detach when a callback in the task itself returns
UTRACE_DETACH, of course.)
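
To make the synchronous case concrete, the detaching thread boils down to
something like the sketch below.  The three-argument utrace_control() form
and the error codes come from utrace.txt, but treat the exact signature
here as an assumption rather than gospel.

        /*
         * Sketch of synchronous detach: some callback has already returned
         * UTRACE_STOP, so the task is known to be sitting stopped when this
         * runs in another thread.
         */
        static void detach_stopped_task(struct task_struct *task,
                                        struct utrace_attached_engine *engine)
        {
                switch (utrace_control(task, engine, UTRACE_DETACH)) {
                case 0:
                        /* Detached; no further callbacks will be made. */
                        break;
                case -ESRCH:
                case -EALREADY:
                        /*
                         * Only possible here if a SIGKILL raced with us;
                         * the Teardown Races rules say which callbacks you
                         * still get in that case.
                         */
                        break;
                }
        }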

The other kind is asynchronous detach.  That means the task is running
merrily away, and you call utrace_control(task, UTRACE_DETACH) without
knowing exactly what it's doing.  From the beginning, I never gave this
case too much thought.  Now it's time to think more about it.

The rub is simultaneous callbacks.  Since you haven't synchronized, the
task you're detaching from might be running anywhere; it could be just
now entering a callback to your engine.  This can happen on another CPU
or, with preemption, just because it got rescheduled at an inopportune time.
So when utrace_control returns 0, you can't be sure that one of your
callbacks isn't just about to start in a race with your cleanup code.

What the utrace interface has always said about this is, "So don't do
that."  The utrace code goes to great pains to make sure you positively
always either get a callback or don't, as opposed to jumping through a
zeroed function pointer, getting a callback you had disabled with
utrace_set_events, and that sort of thing.  But that's it.  I had assumed
you would either just get the thread quiescent before detaching, or have
your callbacks do some safe synchronized access through engine->data so
they could tell you had done cleanup and would just do nothing and return.
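
The engine->data pattern I had in mind amounts to something like the sketch
below.  Only the engine->data hook itself is real; all the names here are
made up.

        /*
         * Every callback checks a flag under a lock, so once the detaching
         * code has marked the state torn down, a racing callback just
         * returns without touching anything.
         */
        struct my_engine_state {
                spinlock_t lock;
                bool torn_down;         /* set by the detaching thread */
                /* ... the engine's real bookkeeping ... */
        };

        /* Called by the detacher around utrace_control(..., UTRACE_DETACH). */
        static void my_state_teardown(struct my_engine_state *state)
        {
                spin_lock(&state->lock);
                state->torn_down = true;
                spin_unlock(&state->lock);
        }

        /* Called first thing in every callback, via engine->data. */
        static bool my_state_is_gone(struct my_engine_state *state)
        {
                bool gone;

                spin_lock(&state->lock);
                gone = state->torn_down;
                spin_unlock(&state->lock);
                return gone;
        }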

What I overlooked is that not just your data structures, but your
callbacks too might be going away, i.e. unloading the kernel module.
This is where it ties in with global tracing.  The interface to "detach"
(unregister) a global tracing engine will have the same issue (but
worse--there can be several CPUs/tasks simultaneously making callbacks).

When this came up for the existing UTRACE_DETACH case, I realized that
this issue is also the main determinant of the implementation details
for doing global tracing.  So I'm thinking about them together.

The question is, what contract would you like in the interface for
asynchronous detach?

The common case paths for most of the event reports do not involve any
SMP-synchronizing instructions (just some barriers).  This is crucial for
keeping the overhead of "latent" tracing to the absolute minimum.  Your
callbacks can check some unlocked data structures and be very quick to
filter out boring events, and the whole thing should scale well to huge
numbers of threads being traced on many-CPU machines.  This concern
constrains what I want to add to the implementation.  I realize this is
micro-optimizing early, but I don't want to make an interface choice that
could fundamentally constrain future optimization and scaling without
giving it a lot of thought beforehand.  (Remember, this is the lowest
layer, meant to make things both feasible and fast.  It doesn't have to be
the layer that's the simplest possible to use; higher layers can be that.)

For UTRACE_DETACH, one thing seems easy and cheap to implement, offhand.
I can keep a record of which engine is being called into now, unsynchronized.
That means just sampling it is never reliable, but it will use barriers
to ensure ordering of making the record and fetching the engine->ops pointer.
That means utrace_control(task, UTRACE_DETACH) could return -EINPROGRESS
if a callback to your engine might be in progress or about to begin.
That can always be a false positive, e.g. when your callback has in fact just
returned.  Moreover, the only way this can work is if the detach happens
regardless.
Either it returns 0 and you know no more callbacks will be made, or it
returns -EINPROGRESS and you know that it's possible one more callback
is already in progress or about to begin, but you don't know for sure that
there is one coming at all.  There is no way to check if it's finished later
(if it has, your engine pointer is already invalid).
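
Under that proposed contract, a caller ends up looking roughly like this.
The -EINPROGRESS return does not exist today, and defer_private_free() is a
hypothetical stand-in for RCU or whatever deferral mechanism fits; the point
is only to spell out the caller's obligations.

        static void async_detach(struct task_struct *task,
                                 struct utrace_attached_engine *engine,
                                 void *engine_private)
        {
                int ret = utrace_control(task, engine, UTRACE_DETACH);

                if (ret == 0) {
                        /* No callback is or will be running; free now. */
                        kfree(engine_private);
                } else if (ret == -EINPROGRESS) {
                        /*
                         * The detach still took effect, but one last callback
                         * may be in progress or about to begin, and there is
                         * no way to poll for its completion (the engine
                         * pointer is already invalid).  So the cleanup has to
                         * be deferred by some other means.
                         */
                        defer_private_free(engine_private);
                }
        }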

For global engines' detach, one option is to offer no help with your own
data structures but to solve the module-unload problem using the module
refcount.  This means that rmmod would block while your callbacks are
running, and the unload wouldn't actually commence until they had finished.
Once the unload has begun (your module's exit function starts
running), new callbacks will be skipped automagically while you do your
tear-down.  (The module refcount works via per-CPU counts.  So it does not
involve SMP-synchronizing instructions.  It does use some memory somewhat
wastefully (at least cacheline*ncpus), but it's already there anyway.  By
contrast, giving each engine its own per-CPU count would cost that much
memory per engine, several times the current size of struct
utrace_attached_engine, which makes per-CPU counts unattractive for that purpose.)
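
Schematically, this option amounts to bracketing each callback with the
module reference, roughly as below.  try_module_get() and module_put() are
the real primitives; the ->owner field and the dispatch helper are made up.

        struct my_global_ops {
                struct module *owner;
                void (*report_event)(void *data);
        };

        static void call_global_engine(struct my_global_ops *ops, void *data)
        {
                /* Fails once the module's exit has begun: skip the callback. */
                if (!try_module_get(ops->owner))
                        return;

                ops->report_event(data);

                /* A pending rmmod can proceed once the count drains. */
                module_put(ops->owner);
        }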

I'm doing a lot of thinking out loud here.  I want to explore all the
synchronization-free ideas we can think of, and mull over how (in)sufficient
each one is.  So pitch in!

The other obvious option is to decide that two SMP-synchronizing operations
per engine per event is OK.  Looking at where our callbacks are, most of
them are in places where we take locks and do a lot of synchronization and
heavy operations anyway.  So I guess we're concerned mostly with syscall
tracing.  That's already not available to global tracing, so certainly for
global engines' callbacks the case for avoiding synchronizing instructions
is looking thinner.  
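
For concreteness, "two SMP-synchronizing operations per engine per event"
would amount to something like the sketch below (hypothetical fields and
helpers, not utrace code): bracket every callback with an atomic count on
the engine, so a detacher can wait for in-flight callbacks to drain.  This
assumes the engine has already been unlinked, so no new callback can start
once the count reaches zero.

        struct engine_sync {
                atomic_t inflight;              /* callbacks currently running */
                wait_queue_head_t idle;         /* woken when inflight hits 0 */
        };

        static void callback_enter(struct engine_sync *s)
        {
                atomic_inc(&s->inflight);               /* synchronizing op #1 */
        }

        static void callback_exit(struct engine_sync *s)
        {
                if (atomic_dec_and_test(&s->inflight))  /* synchronizing op #2 */
                        wake_up(&s->idle);
        }

        static void detach_wait_for_callbacks(struct engine_sync *s)
        {
                wait_event(s->idle, atomic_read(&s->inflight) == 0);
        }

The atomic_inc/atomic_dec_and_test pair here is exactly what the utrace-sync
test further down mimics.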

For syscalls, I did a little experiment to lend some perspective.  I tested
a stupid microbenchmark for system call overhead.  It runs a million getpid
system calls (the cheapest one), and samples CLOCK_THREAD_CPUTIME_ID (which
is derived in the kernel from cycle counters or the high-resolution timer
hardware) before and after.  The timing loop and the numbers are below.
(These are averages from 10 runs in calm conditions, but the numbers are
presumed unreliable and not at all precise; they're just something to ponder.)
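
Roughly, the loop is this (a reconstruction from the description, not the
original test program):

        /* Build with: gcc -O2 -o getpid-bench getpid-bench.c -lrt */
        #include <stdio.h>
        #include <time.h>
        #include <unistd.h>
        #include <sys/syscall.h>

        int main(void)
        {
                struct timespec t0, t1;
                long i;

                clock_gettime(CLOCK_THREAD_CPUTIME_ID, &t0);
                for (i = 0; i < 1000000; ++i)
                        syscall(SYS_getpid);    /* a real syscall every time */
                clock_gettime(CLOCK_THREAD_CPUTIME_ID, &t1);

                printf("%.10f\n", (t1.tv_sec - t0.tv_sec) +
                                  (t1.tv_nsec - t0.tv_nsec) / 1e9);
                return 0;
        }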

        1.7979558114    no audit no trace
        5.9270243434    no audit utrace-nop
        6.2908507868    no audit utrace-sync

For comparison, here are the numbers with syscall audit enabled (auditd on):

        3.1960292903    audit-fast no trace
        6.1579767031    audit-slow no trace
        6.6884023171    audit      utrace-nop
        7.1273046673    audit      utrace-sync

This is on x86-64, with an upstream+utrace kernel approximately 2.6.27-rc1
(and a random configuration I use for development, not optimal).

audit-fast means syscall audit enabled and using the new fast-path I
recently added upstream (a change unrelated to utrace et al).  audit-slow
means audit using the slow hardware return path, which is what x86 kernels
before 2.6.27 have done.  (When audit and utrace are combined, it always
takes the slow path, so the audit-fast kernel is no different from the
audit-slow kernel.)  This is overhead that lots of people are putting up
with now; e.g. auditd is enabled by default on Fedora--and no one had the
fast-path version before this week's Rawhide.

utrace-nop means a utrace engine is getting syscall_entry callbacks; it does:

        return utrace_syscall_action(action) | UTRACE_RESUME;

utrace-sync means a utrace engine whose callback does:

        atomic_inc(&test_count);
        if (atomic_dec_and_test(&test_count))
                WARN_ON(1);
        return utrace_syscall_action(action) | UTRACE_RESUME;

The if condition never fires, but testing the synchronizing op's value this
way mirrors the operations that would be added in the infrastructure to
synchronize callbacks with asynchronous detach.

It's clear in the numbers that the biggest factor is the slow return path,
e.g. the audit-fast vs audit-slow difference.  That's done for arcane x86
kernel reasons that aren't relevant here.  (The difference would be much
less pronounced on other architectures, because this slow path issue is
really especially slow on x86.)  I think this will be improved in the
future, so that utrace-nop might be faster by about the same amount as the
audit-fast vs audit-slow difference (or nearly so).

If you factor that out, a few things pop out.  First, utrace doing nothing
is faster than audit doing nothing (go team).  But compared to utrace-nop,
utrace-sync nearly doubles the utrace overhead, making it well higher than
the audit overhead.

And remember, this test is in ideal, uncontested SMP conditions.  The test
runs on one CPU of a 4-core machine, and the other 3 CPUs are idle.  The
bad effects on other CPUs of any bus-locking or whatnot that the -sync code
incurs are not showing up at all.

Oh, hey, I can factor it out by not using an x86.  Here are some more
unreliable numbers, for the same test on a 2-CPU powerpc64:

        1.6260223104    no audit no trace
        3.7841002367    no audit utrace-nop
        5.0061109056    no audit utrace-sync
        5.8225092223    audit    no trace
        6.1297943584    audit    utrace-nop
        6.8046667872    audit    utrace-sync

The raw tracing overhead (no audit utrace-nop) is something that can be
optimized further, and is all local to the CPU running the syscall.  The
-sync cost for SMP interlock is probably only going to get worse as we get
more cores, in proportional terms, and it's a scaling limiter whose true
impact is not reflected in the microbenchmark.

So I think this suggests it's not entirely goofy of me to balk at
introducing these synchronizing instructions into the callback path.
Perhaps there is nothing that's tolerable for the detach interface that can
avoid them.  But let's explore it thoroughly.  I'm not trying to make the
interface overly aggravating, but nor do I want it to give you enough rope
to ensure that the whole thing can never be really well-optimized because
of overly generous guarantees in the interface semantics.

I know these long rambles from me are hard to penetrate.  But I need some
help thinking about this stuff.  And I'm not going to be satisfied by the
simple answer, without a lot of complicated convincing.


Thanks,
Roland
