[PATCH] ptrace-tests: fix step-fork.c on powerpc for ptrace-utrace

2009-12-01 Thread Veaceslav Falico
Instead of using fork(), call syscall(__NR_fork) in step-fork.c
to avoid looping on powerpc arch in libc.

Signed-off-by: Veaceslav Falico vfal...@redhat.com
---

--- ptrace-tests/tests/step-fork.c  2009-12-01 17:17:14.0 +0100
+++ ptrace-tests/tests/step-fork.c  2009-12-01 17:25:12.0 +0100
@@ -29,6 +29,7 @@
 #include unistd.h
 #include sys/wait.h
 #include string.h
+#include sys/syscall.h
 #include signal.h
 
 #ifndef PTRACE_SINGLESTEP
@@ -78,7 +79,12 @@ main (int argc, char **argv)
sigprocmask (SIG_BLOCK, mask, NULL);
ptrace (PTRACE_TRACEME);
raise (SIGUSR1);
-   if (fork () == 0)
+
+   /*
+   * We can't use fork() directly because on powerpc it loops inside libc 
on 
+   * ptrace over utrace. See http://lkml.org/lkml/2009/11/28/11
+   */
+   if (syscall(__NR_fork) == 0)
  {
read (-1, NULL, 0);
_exit (22);



Re: [RFC] [PATCH] In-kernel gdbstub based on utrace Infrastructure.

2009-12-01 Thread Frank Ch. Eigler
Hi -

On Tue, Dec 01, 2009 at 05:11:32PM +0100, Ingo Molnar wrote:
 Those facilities are not overlapping with kgdb though so my point doesnt 
 apply to them. An in-kernel gdb server sure overlaps/extends kgdb 
 though.

Only in name.  One is highly invasive, for debugging the kernel across
serial consoles.  The other is highly noninvasive, for debugging user
processes across normal userspace channels.  They both happen to talk
to gdb, but that's the end of the natural overlap.

Even if kgdb was extended to be able to manage userspace, and if gdb
itself was extended to be able to use that same single channel, this
would still not duplicate the use scenario for an ordinary user
debugging his own processes.

(Plus, in the future where at least gdb is applied toward kernel+user
debugging, it is unlikely to be the case that this would need to be
done *over a single channel*.  A separate channel for kernel and
separate channels for userspace programs are no less likely.)

 Btw., perf does meet that definition: it functionally replaces all 
 facilities that it overlaps/extends - such as Oprofile. [...]

(And they currently separately coexist.)

- FChE



Re: clone bug (glibc?) (Was: clone-multi-ptrace test failure)

2009-12-01 Thread Oleg Nesterov
On 11/30, Oleg Nesterov wrote:

 On 11/29, Roland McGrath wrote:
 
  Please file this test case on bugzilla.redhat.com for Fedora 12 glibc.

 https://bugzilla.redhat.com/show_bug.cgi?id=542731

It was closed as NOTABUG, Andreas Schwab wrote:

 If you call clone directly you are responsible for setting up
 the TLS area yourself.

troll mode

Very nice. If I understand correctly, this means clone(CLONE_VM)
must not be used without CLONE_SETTLS, right?

This in turn means clone(CLONE_VM) is not useable, afaics it is not
possible to use CLONE_SETTLS in a more or less portable manner.
Even arch/x86/ needs struct user_desc * or long addr depending
on CONFIG_X86_32.

And it used to work? I downloaded glibc-2.11, and afaics this was
broken by

Preserve SSE registers in runtime relocations on x86-64.
commit: b48a267b8fbb885191a04cffdb4050a4d4c8a20b

I do not understand glibc even remotely, but this lools like
regression to me. I see nothing in the changelog or man page
which explains that CLONE_VM requires CLONE_SETTLS now.

/troll mode


So. Any ptrace test which uses clone() is broken, at least on x86_64.

Jan, Roland, how should we fix this? We can rewrite the code to use
pthread_create(), this should be trivial. Unfortunately, libpthread
is not trivial, it can shadow the problem and complicate the testing.

And the stupid question. If I create the subthread via pthread_create(),
how can I know its tid? I grepped glibc-2.11, and afaics pthread_create
returns the pointer to struct pthread which has pid_t tid but I can
not find the helper which returns -tid and struct pthread is not
exported.

Oleg.



Re: [PATCH] ptrace-tests: fix step-fork.c on powerpc for ptrace-utrace

2009-12-01 Thread Veaceslav Falico
On Tue, Dec 01, 2009 at 05:37:48PM +0100, Veaceslav Falico wrote:

 - if (fork () == 0)
 +
 + /*
 + * We can't use fork() directly because on powerpc it loops inside libc 
 on 
 + * ptrace over utrace. See http://lkml.org/lkml/2009/11/28/11
 + */
 + if (syscall(__NR_fork) == 0)
 {
   read (-1, NULL, 0);
   _exit (22);


Sorry, the comment is just wrong. I'll resend the patch in several minutes.

--
Veaceslav



[PATCH v2] ptrace-tests: fix step-fork.c on powerpc for ptrace-utrace

2009-12-01 Thread Veaceslav Falico
Instead of using fork(), call syscall(__NR_fork) in step-fork.c
to avoid looping on powerpc arch in libc.

Signed-off-by: Veaceslav Falico vfal...@redhat.com
---

--- a/ptrace-tests/tests/step-fork.c2009-12-01 17:17:14.0 +0100
+++ b/ptrace-tests/tests/step-fork.c2009-12-01 18:35:15.0 +0100
@@ -29,6 +29,7 @@
 #include unistd.h
 #include sys/wait.h
 #include string.h
+#include sys/syscall.h
 #include signal.h
 
 #ifndef PTRACE_SINGLESTEP
@@ -78,7 +79,12 @@ main (int argc, char **argv)
sigprocmask (SIG_BLOCK, mask, NULL);
ptrace (PTRACE_TRACEME);
raise (SIGUSR1);
-   if (fork () == 0)
+
+   /*
+* Can't use fork() directly because on powerpc it loops inside libc 
under
+* PTRACE_SINGLESTEP. See 
http://marc.info/?l=linux-kernelm=125927241130695
+*/
+   if (syscall(__NR_fork) == 0)
  {
read (-1, NULL, 0);
_exit (22);



Re: powerpc: syscall_dotrace() retcode (Was: powerpc: fork stepping)

2009-12-01 Thread Roland McGrath
We don't really intend to impose any new requirements on the arch behavior
here.  It's up to the arch folks to decide.  It does simplify life somewhat
on x86 that you can change the registers at the syscall-entry stop and then
after skipping the syscall, those registers will be unchanged from what you
set.  But it's never been that way on every other arch, and it doesn't need
to be.  The arch requirement on the tracehook_report_syscall_entry() return
value handling is that it make the syscall not be run, and that the
register state then left be compatible with using syscall_rollback() at the
tracehook_report_syscall_exit() spot.

As to what you get from ptrace explicitly fiddling with registers at
syscall entry, that has always been arch-specific before and we haven't
done anything to change that situation.  On every arch, there is some way
to change the syscall number to be run, and changing it to a known-bogus
number (e.g. -1) makes sure no syscall runs.  On every arch, it's possible
at the tracehook_report_syscall_exit() spot to change the registers to
exactly whatever you want userland to see.  That's enough as it stands to
make it possible to do whatever you want, some way or other.

If the powerpc maintainers want to change the behavior here, that is fine
by me.  But there is no need for that just to satisfy general ptrace
cleanups (or utrace).  Normal concerns require that no such change break
the ptrace behavior that userland could have relied on in the past.

So off hand I don't see a reason to change at all.  If every arch were to
change so that registers changed at syscall-entry were left unmolested by
aborting the syscall, then that might be a new consistency worth having.
But short of that, I don't really see a benefit.

All this implies that the ptrace-tests case relating to this needs to be
tailored differently for powerpc and each other arch so it expects and
verifies exactly the arch-specific behavior that's been seen in the past.


Thanks,
Roland



Re: clone bug (glibc?) (Was: clone-multi-ptrace test failure)

2009-12-01 Thread Roland McGrath
 So. Any ptrace test which uses clone() is broken, at least on x86_64.

If you use clone() directly then you need to have the code run in that
child be purely under your control.  You can't use miscellaneous libc
calls nor any libpthread calls, only ones you are sure do not require
any thread setup.  Given TLS, that means even using errno or anything
that might set it.  It also means any libc function that might use any
TLS you don't know about, i.e. really anything beyond the pure
computation calls like str*/mem*.  It also means running any dynamic
linker code, such as relying on dynamic linking without LD_BIND_NOW
(or -Wl,-z,now at compile time).  

The only thing that changed about this recently in glibc is that even
more code paths through the dynamic linker now happen to depend on
thread setup.

 Jan, Roland, how should we fix this? We can rewrite the code to use
 pthread_create(), this should be trivial. Unfortunately, libpthread
 is not trivial, it can shadow the problem and complicate the testing.

We should avoid library code more thoroughly, not use more of it.  
As well as being complex, it also varies a lot across systems and
interferes with using the same sources to translate to exact
kernel-level testing across various people's development environments.

I think the best bet is to link with -Wl,-z,now and then minimize the
library code you rely on.  (It really only matters to be extra careful
about that for the code running in the clone child.)  So you can use
syscall() if you are not relying on its error-case behavior--if the
system call fails, the function will set errno, which can rely on the
TLS setup.

 And the stupid question. If I create the subthread via pthread_create(),
 how can I know its tid? I grepped glibc-2.11, and afaics pthread_create
 returns the pointer to struct pthread which has pid_t tid but I can
 not find the helper which returns -tid and struct pthread is not
 exported.

There is no official exported way.  You can use syscall(__NR_gettid).
That kernel concept of a global thread ID number does not exist in
pthreads, it is a detail of the Linux implementation.


Thanks,
Roland



Re: [RFC,PATCH 14/14] utrace core

2009-12-01 Thread Peter Zijlstra
On Tue, 2009-11-24 at 21:02 +0100, Oleg Nesterov wrote:

 +  sect2 id=reaptitleFinal callbacks/title
 +  para
 +The functionreport_reap/function callback is always the final event
 +in the life cycle of a traced thread.  Tracing engines can use this as
 +the trigger to clean up their own data structures.  The
 +functionreport_death/function callback is always the penultimate
 +event a tracing engine might see; it's seen unless the thread was
 +already in the midst of dying when the engine attached.  Many tracing
 +engines will have no interest in when a parent reaps a dead process,
 +and nothing they want to do with a zombie thread once it dies; for
 +them, the functionreport_death/function callback is the natural
 +place to clean up data structures and detach.  To facilitate writing
 +such engines robustly, given the asynchrony of
 +constantSIGKILL/constant, and without error-prone manual
 +implementation of synchronization schemes, the
 +applicationutrace/application infrastructure provides some special
 +guarantees about the functionreport_death/function and
 +functionreport_reap/function callbacks.  It still takes some care
 +to be sure your tracing engine is robust to tear-down races, but these
 +rules make it reasonably straightforward and concise to handle a lot of
 +corner cases correctly.
 +  /para
 +  /sect2

This above text seems inconsistent. First you say report_reap() is the
final callback and should be used for cleanup, then you say
report_death() is the penultimate callback but might not always happen
and people would normally use that as cleanup.

If we cannot rely on report_death() then I would suggest to simply
recommend report_reap() as cleanup callback and leave it at that.

 +  para
 +There is nothing a kernel module can do to keep a structnamestruct
 +task_struct/structname alive outside of
 +functionrcu_read_lock/function. 

Sure there is, get_task_struct() comes to mind.

  When the task dies and is reaped
 +by its parent (or itself), that structure can be freed so that any
 +dangling pointers you have stored become invalid.
 +applicationutrace/application will not prevent this, but it can
 +help you detect it safely.  By definition, a task that has been reaped
 +has had all its engines detached.  All
 +applicationutrace/application calls can be safely called on a
 +detached engine if the caller holds a reference on that engine pointer,
 +even if the task pointer passed in the call is invalid.  All calls
 +return constant-ESRCH/constant for a detached engine, which tells
 +you that the task pointer you passed could be invalid now.  Since
 +functionutrace_control/function and
 +functionutrace_set_events/function do not block, you can call those
 +inside a functionrcu_read_lock/function section and be sure after
 +they don't return constant-ESRCH/constant that the task pointer is
 +still valid until functionrcu_read_unlock/function.  The
 +infrastructure never holds task references of its own.  

And here you imply its existence.

 Though neither
 +functionrcu_read_lock/function nor any other lock is held while
 +making a callback, it's always guaranteed that the structnamestruct
 +task_struct/structname and the structnamestruct
 +utrace_engine/structname passed as arguments remain valid
 +until the callback function returns.
 +  /para
 +
 +  para
 +The common means for safely holding task pointers that is available to
 +kernel modules is to use structnamestruct pid/structname, which
 +permits functionput_pid/function from kernel modules.  When using
 +that, the calls functionutrace_attach_pid/function,
 +functionutrace_control_pid/function,
 +functionutrace_set_events_pid/function, and
 +functionutrace_barrier_pid/function are available.
 +  /para

The above seems weird to me at best... why hold a pid around when you
can keep a ref on the task struct? A pid might get reused leaving you
with a completely different task than you thought you had.

 +  /sect2
 +
 +  sect2 id=reap-after-death
 +title
 +  Serialization of constantDEATH/constant and 
 constantREAP/constant
 +/title
 +para
 +  The second guarantee is the serialization of
 +  constantDEATH/constant and constantREAP/constant event
 +  callbacks for a given thread.  The actual reaping by the parent
 +  (functionrelease_task/function call) can occur simultaneously
 +  while the thread is still doing the final steps of dying, including
 +  the functionreport_death/function callback.  If a tracing engine
 +  has requested both constantDEATH/constant and
 +  constantREAP/constant event reports, it's guaranteed that the
 +  functionreport_reap/function callback will not be made until
 

Re: powerpc: syscall_dotrace() retcode (Was: powerpc: fork stepping)

2009-12-01 Thread Benjamin Herrenschmidt
On Tue, 2009-12-01 at 11:27 -0800, Roland McGrath wrote:
 
 If the powerpc maintainers want to change the behavior here, that is fine
 by me.  But there is no need for that just to satisfy general ptrace
 cleanups (or utrace).  Normal concerns require that no such change break
 the ptrace behavior that userland could have relied on in the past.
 
 So off hand I don't see a reason to change at all.  If every arch were to
 change so that registers changed at syscall-entry were left unmolested by
 aborting the syscall, then that might be a new consistency worth having.
 But short of that, I don't really see a benefit.
 
 All this implies that the ptrace-tests case relating to this needs to be
 tailored differently for powerpc and each other arch so it expects and
 verifies exactly the arch-specific behavior that's been seen in the past. 

Ok thanks. I'm happy to not change it then, the risk of breaking some
existing assumption is too high in my book.

Cheers,
Ben.




Re: [RFC] [PATCH] In-kernel gdbstub based on utrace Infrastructure.

2009-12-01 Thread Ingo Molnar

* Frank Ch. Eigler f...@redhat.com wrote:

 Hi -  
 
   Only in name.  One is highly invasive, for debugging the kernel across 
   serial consoles.  The other is highly noninvasive, for debugging user 
   processes across normal userspace channels.  They both happen to talk 
   to gdb, but that's the end of the natural overlap.
  [...]
 
  Well nothing that you mention here changes our obvious suggestion 
  that an in-kernel gdb stub should obviously either be a kgdb 
  extension, or a replacement of it.
 
 Help me out here: by kgdb extension do you imagine something new 
 that an unprivileged user can use to debug his own process?  Or do 
 you imagine a new userspace facility that single-steps the kernel?

Is this a trick question? Single-stepping the kernel on the same system 
[especially if it's an UP system] would certainly be a challenge ;-)

What i mean is what i said: if you provide a new framework (especially 
if it's user visible - which both kgdb and the gdb stub is) you should 
either fully replace existing functionality or extend it. Overlapping it 
in an incomplete way is not useful to anyone.

Extending kgdb to allow the use of it as if we used gdb locally would 
certainly be interesting - and then you could drop into the kernel 
anytime as well. But i'm not siding with any particular solution - i'm 
just seconding Peter's point that there's very clear overlap and 
inconsistency, and that ought to be resolved one way or another.

  We dont want to separate facilities for the same conceptual thing:
  examining application state (be that in user-space and
  kernel-space).
 
 This seems like a shallow sort of consistency.  kgdb was added after 
 ptrace existed -- why not extend ptrace instead to target the kernel? 
 After all, it's examining application state.  The answer is that it 
 doesn't make a heck of a lot of sense.

kgdb simply used gdb's preferred way of remote debugging. That's 
certainly the ugliest bit of it btw - but it's an externality to kgdb.

Had it extended ptrace it wouldnt have gdb compatibility.

So i think this example of yours is inapposite as well.

Having said all that, i certainly subscribe to the view that neither 
kgdb nor ptrace is particularly cleanly done. So i wouldnt mind if 
something new existed that had a modern, flexible, extensible and 
generally pleasant interface and implementation. If you are heading in 
that direction, please let me know.

Btw., perf does meet that definition: it functionally replaces all 
facilities that it overlaps/extends - such as Oprofile. [...]
   
   (And they currently separately coexist.)
  
  You didnt get my point apparently. Keeping the overlapped facility for 
  compatibility (and general user inertia) is fine. Creating a new 
  facility that doesnt do everything that the existing facility does, and 
  not integrating it either, is not fine.
 
 oprofile and perfctr are closer in concept than kgdb and ptrace, yet 
 AFAIK perfctr doesn't interface to oprofile, except perhaps to the 
 extent of resolving contention over the underlying physical resources. 
 In any case this is not a great analogy.

(FYI, 'perfctr' is a different project that has existed for years, i 
suspect you meant perf events?)

perf replaces oprofile functionally. If the in-kernel gdb stub replaced 
kgdb functionally you'd hear no complaints from me.

Ingo



Re: [RFC,PATCH 14/14] utrace core

2009-12-01 Thread Oleg Nesterov
On 12/01, Peter Zijlstra wrote:

 On Tue, 2009-11-24 at 21:02 +0100, Oleg Nesterov wrote:

  +  para
  +There is nothing a kernel module can do to keep a structnamestruct
  +task_struct/structname alive outside of
  +functionrcu_read_lock/function.

 Sure there is, get_task_struct() comes to mind.

it is not exported ;)

Peter, I skipped other comments about the documentation, I never read
it myself. Update: I skipped a lot more for today ;)

  @@ -351,6 +394,10 @@ static inline void tracehook_report_vfor
*/
   static inline void tracehook_prepare_release_task(struct task_struct *task)
   {
  +   /* see utrace_add_engine() about this barrier */
  +   smp_mb();
  +   if (task_utrace_flags(task))
  +   utrace_release_task(task);
   }

 OK, that seems to properly order -exit_state vs -utrace_flags,

 This site does:

  assign -state
  mb
  observe -utrace_flags

 and the other site does:

  assign -utrace_flags
  mb
  observe -exit_state

Yes, we hope.

  @@ -560,6 +625,20 @@ static inline void tracehook_report_deat
int signal, void *death_cookie,
int group_dead)
   {
  +   /*
  +* This barrier ensures that our caller's setting of
  +* @task-exit_state precedes checking @task-utrace_flags here.
  +* If utrace_set_events() was just called to enable
  +* UTRACE_EVENT(DEATH), then we are obliged to call
  +* utrace_report_death() and not miss it.  utrace_set_events()
  +* uses tasklist_lock to synchronize enabling the bit with the
  +* actual change to @task-exit_state, but we need this barrier
  +* to be sure we see a flags change made just before our caller
  +* took the tasklist_lock.
  +*/
  +   smp_mb();
  +   if (task_utrace_flags(task)  _UTRACE_DEATH_EVENTS)
  +   utrace_report_death(task, death_cookie, group_dead, signal);
   }

 I don't think its allowed to pair a mb with a lock-barrier, since the
 lock barriers are semi-permeable.

Could you clarify?

  @@ -589,10 +668,20 @@ static inline void set_notify_resume(str
* asynchronously, this will be called again before we return to
* user mode.
*
  - * Called without locks.
  + * Called without locks.  However, on some machines this may be
  + * called with interrupts disabled.
*/
   static inline void tracehook_notify_resume(struct pt_regs *regs)
   {
  +   struct task_struct *task = current;
  +   /*
  +* This pairs with the barrier implicit in set_notify_resume().
  +* It ensures that we read the nonzero utrace_flags set before
  +* set_notify_resume() was called by utrace setup.
  +*/
  +   smp_rmb();
  +   if (task_utrace_flags(task))
  +   utrace_resume(task, regs);
   }

 Sending an IPI implies the mb?

Yes, but this has nothing to do with IPI. The caller, do_notify_resume(),
does:
clear_thread_flag(TIF_NOTIFY_RESUME);
tracehook_notify_resume:
if (task_utrace_flags(task))
utrace_resume();

We should not read task_utrace_flags() before we clear TIF_NOTIFY_RESUME.

  +static inline struct utrace *task_utrace_struct(struct task_struct *task)
  +{
  +   struct utrace *utrace;
  +
  +   /*
  +* This barrier ensures that any prior load of task-utrace_flags
  +* is ordered before this load of task-utrace.  We use those
  +* utrace_flags checks in the hot path to decide to call into
  +* the utrace code.  The first attach installs task-utrace before
  +* setting task-utrace_flags nonzero, with a barrier between.
  +* See utrace_task_alloc().
  +*/
  +   smp_rmb();
  +   utrace = task-utrace;
  +
  +   smp_read_barrier_depends(); /* See utrace_task_alloc().  */
  +   return utrace;
  +}

 I spot two barriers here, but only 1 over in utrace_task_alloc(), hmm?

smp_read_barrier_depends() pairs with utrace_task_alloc()-wmb().

smp_rmb() is needed for another reason. Suppose the code does:

if (task_utrace_flags()  SOMETHING)
do_something_with(task-utrace);

if we race with utrace_attach_task(), we can see -utrace_flags != 0
but task-utrace == NULL without rmb().

  +struct utrace_engine {
  +/* private: */
  +   struct kref kref;
  +   void (*release)(void *);
  +   struct list_head entry;
  +
  +/* public: */
  +   const struct utrace_engine_ops *ops;
  +   void *data;
  +
  +   unsigned long flags;
  +};

 Sorry, the kernel is written in C, not C++.

Hmm. I almost never read the comments, but these 2 look very clear
to me ;)

  + * Most callbacks take an @action argument, giving the resume action
  + * chosen by other tracing engines.  All callbacks take an @engine
  + * argument, and a @task argument, which is always equal to @current.

 Given that some functions have a lot of arguments (depleting regparam),
 isn't it more expensive to push current on the stack than it is to
 simply read it again?

Yes, perhaps. Only -report_reap() really needs @task, it may be

Re: [RFC,PATCH 14/14] utrace core

2009-12-01 Thread Roland McGrath
 Could we just drop the tracehook layer if this finally merged
 and call the low level functions directly?

We can certainly do appropriate streamlining cleanups later, by all
means.  The original purpose of the tracehook layer is simply this:

A person hacking on core kernel code or arch code should not have to
think about all the innards of tracing (ptrace, utrace, or anything
else).  If he comes across a tracehook_* call, he can just read its
kernel-doc for explanation of its parameters, return value, what it
expects about its context (locking et al), and what the semantic
significance of making that particular call is.  If changes to the
core/arch code in question do not require changing any of those
aspects, then said person need not consider tracing issues further.
If a change to a function's calling interface or contextual
assumptions looks warranted, then he knows he should discuss the
details with some tracing-related folks (i.e. find tracehook.h in
MAINTAINERS and thus get to me and Oleg).

Likewise, a person hacking on tracing code should not have to think
about every corner of interaction with the core code or arch code.
Each tracehook_* call's kernel-doc comments say everything that
matters about how and where it's called.  If some of those details
are problematic for what we want to do inside the tracing code, then
we know we have to hash out the details with the maintainers of the
core or arch code that makes those calls.  Otherwise we can keep our
focus on tracing infrastructure without spending time getting lost
in arcane details of the arch or core kernel code.

These two things seem permanently worthwhile to me: that the core
and arch source code use simple function calls without open-coding
any innards of the tracing infrastructure; and that these functions
have clear and complete documentation about their calling interfaces
and context (locking et al).  I find it natural and convenient that
such calls have a common prefix that makes it obvious to any reader
of the core code what subsystem the call relates to.

Beyond those ideas, I certainly don't care at all what the names of
these functions are, what common prefix is used, or in which source
files those declarations and kernel-doc comments appear.


Thanks,
Roland



Re: [RFC,PATCH 0/14] utrace/ptrace

2009-12-01 Thread Roland McGrath
Note to all: I'm on the road this week (having had a holiday last week)
and will be somewhat slow in replying on these threads, but I will be
sure to get to them all.

 Yes, nobody likes 2 implementations. I guess Roland and me hate
 CONFIG_UTRACE much more than anybody else.

:-) We both hate maintaining the old ptrace implementation, that is true!
The other thing we hate is having our work delayed arbitrarily again and
again to wait for the arch maintainers of arch's we don't use ourselves.

Oleg and I have been trying to follow the advice we get on how to get this
work merged in.  When multiple gate-keepers give conflicting advice, we
really don't know what to do next.  Our interest is in whatever path
smooths the merging of our work.  We'd greatly prefer to spend our time
hacking on our new code to make it better or different in cool and useful
ways than to spend more time guessing what order of patches and rejuggling
of the work will fit the changing whims of the next round of review.

We don't want to rush anyone, like uninterested arch maintainers.  We don't
want to break anyone who doesn't care about our work (we do test for ptrace
regressions but of course new code will always have new bugs so some
instances of instability in the transition to a new ptrace implementation
have to be expected no matter how hard we try).  We just don't want to keep
working out of tree.

The advantage of making the new code optional is, obviously, that you can
turn it off.  That is, lagging arch's won't be broken, just unable to turn
it on.  And, anyone who doesn't want to try anything new yet can just turn
it off and not be exposed to new code.

The advantage of making the new code nonoptional is, obviously, that you
can't turn it off.  The old implementation will be gone and we won't have
to maintain it any more (outside some -stable streams for a while).  The
kernel source will be cleaner with no optional old cruft code duplicating
functionality.  All ptrace users will be testing the new code, and so we'll
find its bugs and gain confidence that it works solidly.

Like I've said, we want to do whatever the people want.  My concern about
what Christoph has suggested is that it really seems like an open-ended
delay depending on some arch people who are not even in the conversation.
For anyone either who likes utrace or who is concerned about its details,
the sooner we are working in-tree the sooner we can really hash it out
thoroughly and get to having merged code that works well.  As long as there
is anything unfinished like arch work that we've decided is a gating event,
then the review and hashing out of the new code itself does not seem to get
very serious.  (To wit, this thread is still talking about merge order and
such, another long thread was about incidental trivia, and we've only just
started to see a tiny bit of actual code review today.)


Thanks,
Roland



Re: [RFC,PATCH 14/14] utrace core

2009-12-01 Thread Roland McGrath
 This above text seems inconsistent. First you say report_reap() is the
 final callback and should be used for cleanup, then you say
 report_death() is the penultimate callback but might not always happen
 and people would normally use that as cleanup.

 If we cannot rely on report_death() then I would suggest to simply
 recommend report_reap() as cleanup callback and leave it at that.

I'm sorry the explanation was not clear to you.  I'll try to make it clear
now and then I'd appreciate your suggestions on how the documentation could
be worded better.  (I do appreciate your suggestion here but it's one based
on an idea that's not factual, so I don't think we should follow it.)

There is no unreliable aspect to it.  If you call utrace_set_events() to
ask for report_death() callbacks then you will get that callback for that
task--guaranteed--unless utrace_set_events() returned an error code that
tells you unambiguously that you could not get that callback.

What's true is that report_reap() is the last callback you can ever
get for a given task.  If you had asked for report_death() callbacks,
then you always get report_death() and if you've asked for both these
callbacks then report_death() is always before report_reap().  
(This is a special ordering guarantee, because in actuality the
release_task() call that constitutes reap can happen in the parent
simultaneously with the task's own exit path.)

The one situation in which you cannot get report_death() is when the task
was already dead when you called utrace_set_events().  In that case, it
gives an error code as I mentioned above.  Even in that situation, you can
still ask to enable just the report_reap() callback.  With either of these,
if you enabled it successfully, then you are guaranteed to get it.

When you get report_death() and are not interested in getting report_reap()
afterwards, then you can return UTRACE_DETACH from report_death().  If you
don't detach, then you will get report_reap() later too.  The documentation
mentions using report_death() to detach and clean up because many kinds of
uses will have no interest in report_reap() at all.  The only reason to get
a report_reap() callback is if you are interested in tracking when a task's
(real) parent reaps it with wait* calls, but usually people are only really
interested in a task until it dies.

  +  para
  +There is nothing a kernel module can do to keep a structnamestruct
  +task_struct/structname alive outside of
  +functionrcu_read_lock/function. 
 
 Sure there is, get_task_struct() comes to mind.

__put_task_struct() is not exported to modules and so put_task_struct()
cannot be used by modules.

  +still valid until functionrcu_read_unlock/function.  The
  +infrastructure never holds task references of its own.  
 
 And here you imply its existence.

[I take it this refers to the next quoted bit:]

  Though neither
  +functionrcu_read_lock/function nor any other lock is held while
  +making a callback, it's always guaranteed that the structnamestruct
  +task_struct/structname and the structnamestruct
  +utrace_engine/structname passed as arguments remain valid
  +until the callback function returns.

No, we do not imply that the utrace infrastructure holds any task
reference.  The current task_struct is always live while it's running
and until it's passed to release_task().  The task_struct passed to
callbacks is current.  That's all it means.  

The true facts are that utrace callbacks are all made in the current task
except for report_reap(), which is made at the start of release_task().
So the kernel's core logic is holding a task reference at all times that
utrace callbacks are made.  If you really think it is clearer to explain
that set of facts as utrace holds a reference, then please suggest the
exact wording you have in mind.

 The above seems weird to me at best... why hold a pid around when you
 can keep a ref on the task struct? A pid might get reused leaving you
 with a completely different task than you thought you had.

The *_pid interfaces are only there because put_pid() can be used by
modules while put_task_struct() cannot.  If we can make __put_task_struct()
an exported symbol again, I would see no reason for these *_pid calls.

 Right, except you cannot generally rely on this report_death() thing, so
 a trace engine needs to deal with the report_reap() thing anyway.

This is a false antecedent.  I hope the explanation above made this
subject clear to you.

  +  sect2 id=interlocktitleInterlock with final callbacks/title
[...]
 Hrmm, better mention this earlier, or at least refer to this when
 describing the above cleanup bits.

This explanation is somewhat long and has its own whole section so as to
be thoroughly explicit and clear.  Do you think there should just be a
cross reference here in the earlier mention of report_reap() and
report_death()?  Or do you think