Re: clone bug (glibc?) (Was: clone-multi-ptrace test failure)

2009-12-14 Thread Jan Kratochvil
On Tue, 01 Dec 2009 20:39:40 +0100, Roland McGrath wrote:
> I think the best bet is to link with -Wl,-z,now and then minimize the
> library code you rely on.

Checked in the fix below; it fixes at least Fedora 12 x86_64.

getppid() does not appear to be needed there: PTRACE_SYSCALL does stop
(WIFSTOPPED) on entry to __NR_exit, before WIFEXITED is reported, which
keeps the PASS/FAIL result reproducible.
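
A minimal sketch of that observation, assuming Linux and a fork()ed tracee
rather than the clone() threads of the real test. The helper name
syscall_stops_before_exit is illustrative and not part of the test suite.
It counts the syscall stops the tracer sees before the exit status arrives:

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns the number of syscall stops seen before the
   tracee's exit is reported, or -1 on error.  On success *exit_code holds
   the tracee's exit status.  */
static int
syscall_stops_before_exit (int *exit_code)
{
  int status, stops = 0;
  pid_t child = fork ();

  if (child == -1)
    return -1;
  if (child == 0)
    {
      /* Tracee: a fork()ed child has normal thread setup, so libc is safe.  */
      ptrace (PTRACE_TRACEME, 0, NULL, NULL);
      raise (SIGSTOP);		/* Hand control to the tracer.  */
      syscall (__NR_exit, 22);	/* The only syscall we care about.  */
      _exit (127);		/* Not reached.  */
    }

  /* Wait for the initial SIGSTOP.  */
  if (waitpid (child, &status, 0) != child || !WIFSTOPPED (status))
    return -1;
  /* Report syscall stops as SIGTRAP | 0x80 so they are unambiguous.  */
  if (ptrace (PTRACE_SETOPTIONS, child, NULL,
	      (void *) (long) PTRACE_O_TRACESYSGOOD) == -1)
    return -1;

  for (;;)
    {
      if (ptrace (PTRACE_SYSCALL, child, NULL, NULL) == -1)
	return -1;
      if (waitpid (child, &status, 0) != child)
	return -1;
      if (WIFEXITED (status))
	{
	  *exit_code = WEXITSTATUS (status);
	  return stops;
	}
      if (!WIFSTOPPED (status))
	return -1;		/* Killed by a signal.  */
      if (WSTOPSIG (status) == (SIGTRAP | 0x80))
	stops++;
    }
}
```

The exact count depends on which syscalls libc issues after raise() returns,
so a caller should only rely on seeing at least one stop before the exit
status 22.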


Regards,
Jan


--- Makefile.am 29 Nov 2009 02:23:25 -  1.60
+++ Makefile.am 14 Dec 2009 09:47:54 -  1.61
@@ -111,6 +111,8 @@ stopped_attach_transparency_LDFLAGS = -l
 erestartsys_trap_LDFLAGS = -lutil
 erestartsys_trap_debugger_LDFLAGS = -lutil
 erestartsys_trap_32fails_debugger_LDFLAGS = -lutil
+# After clone syscall it must call no glibc code (such as _dl_runtime_resolve).
+clone_multi_ptrace_LDFLAGS = -Wl,-z,now
 
 check_TESTS = $(SAFE)
 xcheck_TESTS = $(CRASHERS)
--- clone-multi-ptrace.c 5 Dec 2008 14:41:57 -   1.6
+++ clone-multi-ptrace.c 14 Dec 2009 09:47:54 -  1.7
@@ -65,10 +65,10 @@ static char grandchild_seen[THREAD_NUM];
 static int
 grandchild_func (void *unused)
 {
-  /* Need to have at least one syscall before exit */
-  getppid ();
-  /* _exit() would make ALL threads to exit. We need rew syscall */
+  /* _exit() would make ALL threads to exit.  We need rew syscall.  After the
+     clone syscall it must call no glibc code (such as _dl_runtime_resolve).  */
   syscall (__NR_exit, 22);
+
   return 0;
 }
 



Re: clone bug (glibc?) (Was: clone-multi-ptrace test failure)

2009-12-01 Thread Oleg Nesterov
On 11/30, Oleg Nesterov wrote:

> On 11/29, Roland McGrath wrote:
> >
> > Please file this test case on bugzilla.redhat.com for Fedora 12 glibc.
>
> https://bugzilla.redhat.com/show_bug.cgi?id=542731

It was closed as NOTABUG, Andreas Schwab wrote:

> If you call clone directly you are responsible for setting up
> the TLS area yourself.

<troll mode>

Very nice. If I understand correctly, this means clone(CLONE_VM)
must not be used without CLONE_SETTLS, right?

This in turn means clone(CLONE_VM) is not usable: afaics it is not
possible to use CLONE_SETTLS in a more or less portable manner.
Even arch/x86/ needs struct user_desc * or long addr depending
on CONFIG_X86_32.

And it used to work? I downloaded glibc-2.11, and afaics this was
broken by

Preserve SSE registers in runtime relocations on x86-64.
commit: b48a267b8fbb885191a04cffdb4050a4d4c8a20b

I do not understand glibc even remotely, but this looks like a
regression to me. I see nothing in the changelog or man page
which explains that CLONE_VM requires CLONE_SETTLS now.

</troll mode>


So. Any ptrace test which uses clone() is broken, at least on x86_64.

Jan, Roland, how should we fix this? We can rewrite the code to use
pthread_create(), this should be trivial. Unfortunately, libpthread
is not trivial, it can shadow the problem and complicate the testing.

And the stupid question. If I create the subthread via pthread_create(),
how can I know its tid? I grepped glibc-2.11, and afaics pthread_create
returns the pointer to struct pthread which has pid_t tid but I can
not find the helper which returns ->tid and struct pthread is not
exported.

Oleg.



Re: clone bug (glibc?) (Was: clone-multi-ptrace test failure)

2009-12-01 Thread Roland McGrath
> So. Any ptrace test which uses clone() is broken, at least on x86_64.

If you use clone() directly then you need to have the code run in that
child be purely under your control.  You can't use miscellaneous libc
calls nor any libpthread calls, only ones you are sure do not require
any thread setup.  Given TLS, that means even using errno or anything
that might set it.  It also means any libc function that might use any
TLS you don't know about, i.e. really anything beyond the pure
computation calls like str*/mem*.  It also means running any dynamic
linker code, such as relying on dynamic linking without LD_BIND_NOW
(or -Wl,-z,now at compile time).  

The only thing that changed about this recently in glibc is that even
more code paths through the dynamic linker now happen to depend on
thread setup.

> Jan, Roland, how should we fix this? We can rewrite the code to use
> pthread_create(), this should be trivial. Unfortunately, libpthread
> is not trivial, it can shadow the problem and complicate the testing.

We should avoid library code more thoroughly, not use more of it.  
As well as being complex, it also varies a lot across systems and
interferes with using the same sources to translate to exact
kernel-level testing across various people's development environments.

I think the best bet is to link with -Wl,-z,now and then minimize the
library code you rely on.  (It really only matters to be extra careful
about that for the code running in the clone child.)  So you can use
syscall() if you are not relying on its error-case behavior--if the
system call fails, the function will set errno, which can rely on the
TLS setup.
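
A minimal sketch of that advice, assuming Linux with glibc and a link using
-Wl,-z,now. The names raw_exit_child, run_clone_child and CHILD_STACK_SIZE
are illustrative and not taken from the test suite. The clone child makes
one raw exit syscall and touches nothing else:

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define CHILD_STACK_SIZE (64 * 1024)	/* Illustrative size.  */

/* Runs in the clone child.  It must not rely on any thread setup: no errno,
   no TLS, no lazy symbol resolution.  __NR_exit cannot fail and return, so
   syscall() never reaches its errno-setting error path here.  */
static int
raw_exit_child (void *unused)
{
  (void) unused;
  syscall (__NR_exit, 22);
  return 0;			/* Not reached.  */
}

/* Hypothetical helper: clones a child sharing our address space and returns
   its exit status, or -1 on failure.  */
static int
run_clone_child (void)
{
  char *stack = malloc (CHILD_STACK_SIZE);
  int status;
  pid_t pid;

  if (stack == NULL)
    return -1;
  /* Pass the top of the buffer; this assumes a downward-growing stack.  */
  pid = clone (raw_exit_child, stack + CHILD_STACK_SIZE,
	       CLONE_VM | SIGCHLD, NULL);
  if (pid == -1)
    {
      free (stack);
      return -1;
    }
  if (waitpid (pid, &status, 0) != pid)
    return -1;
  free (stack);
  return WIFEXITED (status) ? WEXITSTATUS (status) : -1;
}
```

SIGCHLD is used as the exit signal so that a plain waitpid() can reap the
child. The parent-side calls (malloc, waitpid) are fine because only the
code running in the clone child has to avoid library code.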

> And the stupid question. If I create the subthread via pthread_create(),
> how can I know its tid? I grepped glibc-2.11, and afaics pthread_create
> returns the pointer to struct pthread which has pid_t tid but I can
> not find the helper which returns ->tid and struct pthread is not
> exported.

There is no official exported way.  You can use syscall(__NR_gettid).
That kernel concept of a global thread ID number does not exist in
pthreads, it is a detail of the Linux implementation.
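
A minimal sketch of the syscall(__NR_gettid) approach, assuming Linux and a
program linked with pthread support. The names report_tid and
spawn_and_get_tid are illustrative. The new thread asks the kernel for its
own ID and stores it where the creator can read it:

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Thread body: store the kernel thread ID of the calling thread.  */
static void *
report_tid (void *arg)
{
  pid_t *tid_out = arg;

  *tid_out = (pid_t) syscall (__NR_gettid);
  return NULL;
}

/* Hypothetical helper: starts a thread, waits for it to finish, and returns
   the kernel thread ID it reported, or -1 on failure.  */
static pid_t
spawn_and_get_tid (void)
{
  pthread_t thread;
  pid_t tid = -1;

  if (pthread_create (&thread, NULL, report_tid, &tid) != 0)
    return -1;
  /* pthread_join orders the thread's store before our read of tid.  */
  if (pthread_join (thread, NULL) != 0)
    return -1;
  return tid;
}
```

This sketch joins the thread before reading the ID, which keeps it short. A
ptrace test needs the ID while the thread is still alive, so it would have
to publish the value through some synchronization of its own instead.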


Thanks,
Roland



Re: clone bug (glibc?) (Was: clone-multi-ptrace test failure)

2009-11-30 Thread Oleg Nesterov
On 11/29, Roland McGrath wrote:

> Please file this test case on bugzilla.redhat.com for Fedora 12 glibc.

https://bugzilla.redhat.com/show_bug.cgi?id=542731

Oleg.