Re: clone bug (glibc?) (Was: clone-multi-ptrace test failure)
On Tue, 01 Dec 2009 20:39:40 +0100, Roland McGrath wrote:
> I think the best bet is to link with -Wl,-z,now and then minimize the
> library code you rely on.

Checked in the fix below, for at least Fedora 12 x86_64.

getppid() does not look to be needed there: PTRACE_SYSCALL does stop
(WIFSTOPPED) on the entry to __NR_exit (before WIFEXITED), keeping the
PASS/FAIL reproducibility.

Regards,
Jan

--- Makefile.am	29 Nov 2009 02:23:25 - 1.60
+++ Makefile.am	14 Dec 2009 09:47:54 - 1.61
@@ -111,6 +111,8 @@ stopped_attach_transparency_LDFLAGS = -l
 erestartsys_trap_LDFLAGS = -lutil
 erestartsys_trap_debugger_LDFLAGS = -lutil
 erestartsys_trap_32fails_debugger_LDFLAGS = -lutil
+# After clone syscall it must call no glibc code (such as _dl_runtime_resolve).
+clone_multi_ptrace_LDFLAGS = -Wl,-z,now
 
 check_TESTS = $(SAFE)
 xcheck_TESTS = $(CRASHERS)
--- clone-multi-ptrace.c	5 Dec 2008 14:41:57 - 1.6
+++ clone-multi-ptrace.c	14 Dec 2009 09:47:54 - 1.7
@@ -65,10 +65,10 @@ static char grandchild_seen[THREAD_NUM];
 static int grandchild_func (void *unused)
 {
-  /* Need to have at least one syscall before exit */
-  getppid ();
-  /* _exit() would make ALL threads to exit.  We need rew syscall */
+  /* _exit() would make ALL threads to exit.  We need rew syscall.  After the
+     clone syscall it must call no glibc code (such as _dl_runtime_resolve).  */
   syscall (__NR_exit, 22);
+  return 0;
 }
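
For readers without the test suite at hand, the constraint in the patch comment can be sketched roughly as follows. This is not the actual clone-multi-ptrace.c code, only a minimal illustration under stated assumptions: the names child_func, run_clone_child, child_stack and CHILD_STACK_SIZE are made up for this example, the stack is assumed to grow downward (as on x86_64), and the binary is assumed to be linked with -Wl,-z,now as in the Makefile.am change. The parent also calls syscall() once before clone() so that its PLT entry is already resolved even if lazy binding is in effect.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical size and storage for the child's stack. */
#define CHILD_STACK_SIZE (64 * 1024)
static char child_stack[CHILD_STACK_SIZE] __attribute__ ((aligned (16)));

/* Runs in the clone child.  It deliberately does nothing except the raw
   exit syscall: no errno access, no other libc calls.  */
static int
child_func (void *unused)
{
  (void) unused;
  syscall (__NR_exit, 22);
  return 0;  /* Not reached; keeps the compiler quiet.  */
}

/* Clones a CLONE_VM child and returns its exit status, or -1 on error.  */
static int
run_clone_child (void)
{
  int status;
  pid_t pid;

  /* Make sure syscall() is already bound in the parent, in case the
     binary was not linked with -Wl,-z,now after all.  */
  syscall (__NR_getpid);

  /* Assumption: the stack grows down, so pass the top of the buffer.  */
  pid = clone (child_func, child_stack + CHILD_STACK_SIZE,
               CLONE_VM | SIGCHLD, NULL);
  if (pid == -1)
    return -1;
  if (waitpid (pid, &status, 0) != pid)
    return -1;
  if (!WIFEXITED (status))
    return -1;
  return WEXITSTATUS (status);
}
```

Calling run_clone_child () from main should report the child's exit code 22. Unlike the real test, this sketch does not use CLONE_THREAD or ptrace, so it only shows the "raw syscall only" rule for the child, not the tracing behaviour.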
Re: clone bug (glibc?) (Was: clone-multi-ptrace test failure)
On 11/30, Oleg Nesterov wrote:
>
> On 11/29, Roland McGrath wrote:
> >
> > Please file this test case on bugzilla.redhat.com for Fedora 12 glibc.
>
> https://bugzilla.redhat.com/show_bug.cgi?id=542731

It was closed as NOTABUG. Andreas Schwab wrote:

	If you call clone directly you are responsible for setting up the TLS
	area yourself.

<troll mode>

Very nice. If I understand correctly, this means clone(CLONE_VM) must not
be used without CLONE_SETTLS, right? This in turn means clone(CLONE_VM)
is not usable: afaics it is not possible to use CLONE_SETTLS in a more or
less portable manner. Even arch/x86/ needs "struct user_desc *" or
"long addr" depending on CONFIG_X86_32.

And it used to work? I downloaded glibc-2.11, and afaics this was broken by

	"Preserve SSE registers in runtime relocations on x86-64."
	commit: b48a267b8fbb885191a04cffdb4050a4d4c8a20b

I do not understand glibc even remotely, but this looks like a regression
to me. I see nothing in the changelog or man page which explains that
CLONE_VM requires CLONE_SETTLS now.

</troll mode>

So. Any ptrace test which uses clone() is broken, at least on x86_64.

Jan, Roland, how should we fix this?

We can rewrite the code to use pthread_create(), this should be trivial.
Unfortunately, libpthread is not trivial, it can shadow the problem and
complicate the testing.

And the stupid question. If I create the subthread via pthread_create(),
how can I know its tid? I grepped glibc-2.11, and afaics pthread_create
returns the pointer to "struct pthread" which has "pid_t tid", but I can
not find the helper which returns ->tid, and struct pthread is not exported.

Oleg.
Re: clone bug (glibc?) (Was: clone-multi-ptrace test failure)
> So. Any ptrace test which uses clone() is broken, at least on x86_64.

If you use clone() directly then you need to have the code run in that
child be purely under your control. You can't use miscellaneous libc
calls nor any libpthread calls, only ones you are sure do not require any
thread setup. Given TLS, that means even using errno or anything that
might set it. It also means any libc function that might use any TLS you
don't know about, i.e. really anything beyond the pure computation calls
like str*/mem*. It also means running any dynamic linker code, such as
relying on dynamic linking without LD_BIND_NOW (or -Wl,-z,now at compile
time).

The only thing that changed about this recently in glibc is that even
more code paths through the dynamic linker now happen to depend on thread
setup.

> Jan, Roland, how should we fix this?
>
> We can rewrite the code to use pthread_create(), this should be trivial.
> Unfortunately, libpthread is not trivial, it can shadow the problem and
> complicate the testing.

We should avoid library code more thoroughly, not use more of it. As well
as being complex, it also varies a lot across systems and interferes with
using the same sources to translate to exact kernel-level testing across
various people's development environments.

I think the best bet is to link with -Wl,-z,now and then minimize the
library code you rely on. (It really only matters to be extra careful
about that for the code running in the clone child.) So you can use
syscall() if you are not relying on its error-case behavior--if the
system call fails, the function will set errno, which can rely on the TLS
setup.

> And the stupid question. If I create the subthread via pthread_create(),
> how can I know its tid? I grepped glibc-2.11, and afaics pthread_create
> returns the pointer to "struct pthread" which has "pid_t tid", but I can
> not find the helper which returns ->tid, and struct pthread is not exported.

There is no official exported way. You can use syscall(__NR_gettid).
That kernel concept of a global thread ID number does not exist in
pthreads, it is a detail of the Linux implementation.

Thanks,
Roland
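
A minimal sketch of the syscall(__NR_gettid) suggestion, for a thread created with pthread_create(). The helper names my_gettid, thread_func and spawn_and_get_tid are hypothetical, chosen only for this illustration. The thread reports its own kernel tid back through the argument pointer, since there is no exported way to read it from the pthread_t. This assumes a normal pthread (TLS fully set up), so using syscall() and its errno handling here is fine.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: kernel thread ID of the calling thread.  */
static pid_t
my_gettid (void)
{
  return (pid_t) syscall (__NR_gettid);
}

/* Thread body: store our own tid where the creator can read it.  */
static void *
thread_func (void *arg)
{
  *(pid_t *) arg = my_gettid ();
  return NULL;
}

/* Starts one thread, waits for it, and returns the tid it reported,
   or -1 if the thread could not be created or joined.  */
static pid_t
spawn_and_get_tid (void)
{
  pthread_t th;
  pid_t tid = -1;

  if (pthread_create (&th, NULL, thread_func, &tid) != 0)
    return -1;
  if (pthread_join (th, NULL) != 0)
    return -1;
  return tid;
}
```

On Linux the tid reported by the new thread differs from getpid(), while my_gettid () called from the initial thread equals getpid(). Build with -pthread. A tracer that needs the tid while the thread is still running would have to synchronize on the stored value instead of joining first.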
Re: clone bug (glibc?) (Was: clone-multi-ptrace test failure)
On 11/29, Roland McGrath wrote:
>
> Please file this test case on bugzilla.redhat.com for Fedora 12 glibc.

https://bugzilla.redhat.com/show_bug.cgi?id=542731

Oleg.