On 2017-11-13 07:39, C Smith wrote:
> Hi Jan,
> 
> I have found a workaround for the problem. Instead of the startup segfault
> happening 10% of the time, I have now started my RT app 90 times with a
> single RT thread, and 80 times with its original three RT threads, with no
> segfaults.
> 
> Per your question: I don't think the problem is that __rt_print_init() is
> getting called twice. The normal order of execution is like this:
> 
> . printer_loop() gets called first when a Xenomai RT app starts up
> 
> . pthread_mutex_lock() sets the buffer_lock struct so __lock and __owner
> are nonzero:
> (gdb) p buffer_lock
> $4 = {__data = {__lock = 1, __count = 0, __owner = 18681, __kind = 0,
> __nusers = 1, {__spins = 0, __list = {__next = 0x0}}}, __size =
> "\001\000\000\000\000\000\000\000\371H\000\000\000\000\000\000\001\000\000\000\000\000\000",
> __align = 1}
> 
> . then pthread_cond_wait() calls __rt_print_init()
> 
> . inside __rt_print_init(), printer_wakeup has a valid __mutex:
> (gdb) print printer_wakeup
> $5 = {__data = {__lock = 0, __futex = 1, __total_seq = 1, __wakeup_seq = 0,
> __woken_seq = 0, __mutex = 0xb7fd4a1c, __nwaiters = 2, __broadcast_seq =
> 0}, __size = "\000\000\000\000\001\000\000\000\001", '\000' <repeats 23
> times>, "\034J\375\267\002\000\000\000\000\000\000\000\000\000\000",
> __align = 4294967296}
> 
> . Then continuing, we get to the first line of main() OK with no segfault.
> 
> You had advised me to watch for corruption of the variables that
> pthread_cond_wait() uses. In contrast to the above, when the segfault
> occurs, buffer_lock and printer_wakeup, which get passed into
> pthread_cond_wait(), contain all zeros:
> 
> (gdb) print buffer_lock
> $6 = {__data = {__lock = 0, __count = 0, __owner = 0, __kind = 0, __nusers
> = 0, {__spins = 0, __list = {__next = 0x0}}}, __size = '\000' <repeats 23
> times>, __align = 0}
> (gdb) print printer_wakeup
> $7 = {__data = {__lock = 0, __futex = 0, __total_seq = 0, __wakeup_seq = 0,
> __woken_seq = 0, __mutex = 0x0, __nwaiters = 0, __broadcast_seq = 0},
> __size = '\000' <repeats 47 times>, __align = 0}
> 
> There is one pointer in the pthread_cond_t structure:
> printer_wakeup.__data.__mutex
> So perhaps pthread_cond_wait() dereferences this NULL mutex pointer? The
> segfault always happens on an access to address 0xC.

You can probably find out what it dereferences by installing debug
symbols for glibc. But let's assume it's the mutex: this reference
is set by pthread_cond_wait itself when it associates the provided mutex
with the condition variable on function entry. Hence my assumption
that the corruption happens during the execution of cond_wait.

> 
> This segfault first appeared when I compiled my app for SMP, and it goes
> away if I use kernel arg maxcpus=1. Perhaps some SMP race condition is
> occasionally preventing the data structures (buffer_lock, printer_wakeup)
> from being ready for pthread_cond_wait()?
> 
> As a protection against this I have patched the rt_print.c printer_loop()
> code, skipping the call to pthread_cond_wait() if those two structures
> (buffer_lock, printer_wakeup) are not ready. There is no reason to wait
> on a condition variable whose mutex is neither locked nor initialized,
> right?
> 
> This is the patch:
> 
> --- rt_print_A.c    2014-09-24 13:57:49.000000000 -0700
> +++ rt_print_B.c    2017-11-11 23:24:34.309832301 -0800
> @@ -680,9 +680,10 @@
>      while (1) {
>          pthread_cleanup_push(unlock, &buffer_lock);
>          pthread_mutex_lock(&buffer_lock);
> -
> -        while (buffers == 0)
> -            pthread_cond_wait(&printer_wakeup, &buffer_lock);
> +
> +        if ((buffer_lock.__data.__lock != 0) && (printer_wakeup.__data.__mutex != 0))
> +            while (buffers == 0)
> +                pthread_cond_wait(&printer_wakeup, &buffer_lock);
> 
>          print_buffers();
> 
> Can you verify that this patch is safe?

It's definitely not, because we still have no clue what actually goes
wrong; the check only masks the symptom.

My suggestion to debug this via watchpoints still stands: first find
out which field is actually dereferenced at the crash, then set a
watchpoint on it during __rt_print_init.
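
Untested sketch of such a session (symbol names taken from your dumps,
adjust as needed):

(gdb) break __rt_print_init
(gdb) run
(gdb) watch -l buffer_lock.__data.__lock
(gdb) watch -l printer_wakeup.__data.__mutex
(gdb) continue

With hardware watchpoints, gdb stops at the exact instruction that
overwrites either field, which tells far more than inspecting the
structures after the fact.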

There is some ordering issue among the initialization functions that I
cannot explain yet. Have a specific look at when and how often
forked_child_init is run, because it a) reinitializes buffer_lock and
b) spawns the printer thread. In theory, everything should be up and
ready PRIOR to that spawning.
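
A crude way to see that ordering, assuming you can rebuild rt_print.c,
is to instrument the handler (illustrative scaffolding, not a fix):

#include <stdio.h>
#include <unistd.h>

/* Sketch: count how often the fork handler runs and in which process,
   to correlate its timing with the printer thread spawn. */
static void forked_child_init(void)
{
    static int runs;

    fprintf(stderr, "forked_child_init: run %d in pid %ld\n",
            ++runs, (long)getpid());

    /* ... existing body: reinitialize buffer_lock, then spawn the
       printer thread ... */
}

If the counter fires more than once per process, or only after the
printer thread is already inside pthread_cond_wait, that points at the
ordering problem.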

Jan

-- 
Siemens AG, Corporate Technology, CT RDA ITP SES-DE
Corporate Competence Center Embedded Linux
