2014-07-28 23:52 GMT+02:00 Philippe Gerum <r...@xenomai.org>: > On 07/28/2014 05:18 PM, Kim De Mey wrote: >> 2014-07-28 13:09 GMT+02:00 Philippe Gerum <r...@xenomai.org>: >>> On 07/23/2014 10:31 AM, Kim De Mey wrote: >>>> 2014-07-09 17:05 GMT+02:00 Kim De Mey <kim.de...@gmail.com>: >>>>> >>>>> 2014-07-09 16:16 GMT+02:00 Philippe Gerum <r...@xenomai.org>: >>>>>> On 07/09/2014 12:23 PM, Kim De Mey wrote: >>>>>>> >>>>>>> 2014-07-09 11:25 GMT+02:00 Philippe Gerum <r...@xenomai.org>: >>>>>>>> >>>>>>>> On 07/09/2014 10:09 AM, Kim De Mey wrote: >>>>>>>> >>>>>>>>> Newer version --dump-config output: >>>>>>>>> ... >>>>>>>>> CONFIG_XENO_COMPILER="gcc version 4.7.0 (Cavium Inc. Version: >>>>>>>>> SDK_3_1_0 build 27) " >>>>>>>>> ... >>>>>>>>> >>>>>>>> >>>>>>>> Thanks. I can't reproduce this issue yet, so posting the current gdb >>>>>>>> backtraces for all the threads still shown by "info threads" after a >>>>>>>> deadlock would help. >>>>>>>> >>>>>>>> TIA, >>>>>>>> >>>>>>> >>>>>>> Below backtraces of all the threads. It is a case with two "worker" >>>>>>> tasks in deadlock. The "main" and "create_delete" tasks continued to >>>>>>> their "tm_wkafter" loop. >>>>>>> >>>>>>> >>>>>>> (gdb) thread a a bt >>>>>>> >>>>>>> Thread 4 (Thread 9406): >>>>>>> #0 0x77cc2d94 in __pthread_mutex_lock_full (mutex=0x77b03300) at >>>>>>> pthread_mutex_lock.c:321 >>>>>>> #1 0x77cc7e64 in __GI___pthread_mutex_lock (mutex=0x77b03300) at >>>>>>> pthread_mutex_lock.c:55 >>>>>>> #2 0x77cf91e0 in threadobj_lock () from >>>>>>> >>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0 >>>>>>> #3 0x77cf92e4 in finalize_thread () from >>>>>>> >>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0 >>>>>>> #4 0x77cc5704 in __nptl_deallocate_tsd () at pthread_create.c:156 >>>>>>> #5 0x77cc5940 in start_thread (arg=0x75dff490) at pthread_create.c:315 >>>>>>> #6 0x77bf1bbc in __thread_start () from >>>>>>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6 >>>>>>> Backtrace stopped: frame did not save the PC >>>>>>> >>>>>>> Thread 3 (Thread 31145): >>>>>>> #0 0x77cc2d94 in __pthread_mutex_lock_full (mutex=0x77b030b8) at >>>>>>> pthread_mutex_lock.c:321 >>>>>>> #1 0x77cc7e64 in __GI___pthread_mutex_lock (mutex=0x77b030b8) at >>>>>>> pthread_mutex_lock.c:55 >>>>>>> #2 0x77cf91e0 in threadobj_lock () from >>>>>>> >>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0 >>>>>>> #3 0x77cf92e4 in finalize_thread () from >>>>>>> >>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0 >>>>>>> #4 0x77cc5704 in __nptl_deallocate_tsd () at pthread_create.c:156 >>>>>>> #5 0x77cc5940 in start_thread (arg=0x766ff490) at pthread_create.c:315 >>>>>>> #6 0x77bf1bbc in __thread_start () from >>>>>>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6 >>>>>>> Backtrace stopped: frame did not save the PC >>>>>>> >>>>>>> Thread 2 (Thread 6970): >>>>>>> #0 clock_nanosleep (clock_id=<optimized out>, flags=<optimized out>, >>>>>>> req=<optimized out>, rem=<optimized out>) >>>>>>> at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:50 >>>>>>> #1 0x77cfa0bc in threadobj_sleep () from >>>>>>> >>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0 >>>>>>> #2 0x77d2af08 in tm_wkafter () from >>>>>>> >>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0 >>>>>>> #3 0x10000b64 in create_delete (a=0, b=0, c=0, d=0) at >>>>>>> delete_child_hangs.c:31 >>>>>>> #4 0x77d296fc in task_trampoline () from >>>>>>> >>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0 >>>>>>> #5 0x77cf7d84 in thread_trampoline () from >>>>>>> >>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0 >>>>>>> #6 0x77cc592c in start_thread (arg=0x77a00490) at pthread_create.c:310 >>>>>>> #7 0x77bf1bbc in __thread_start () from >>>>>>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6 >>>>>>> Backtrace stopped: frame did not save the PC >>>>>>> >>>>>>> Thread 1 (Thread 6969): >>>>>>> #0 clock_nanosleep (clock_id=<optimized out>, flags=<optimized out>, >>>>>>> req=<optimized out>, rem=<optimized out>) >>>>>>> at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:50 >>>>>>> #1 0x77cfa0bc in threadobj_sleep () from >>>>>>> >>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0 >>>>>>> #2 0x77d2af08 in tm_wkafter () from >>>>>>> >>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0 >>>>>>> #3 0x10000c00 in main (argc=1, argv=0x10011030) at >>>>>>> delete_child_hangs.c:43 >>>>>>> >>>>>> >>>>>> Ok, unless we've entered the twilight zone, the only reason I could see >>>>>> this >>>>>> happening would be that the thread prologue somehow gets hit by an async >>>>>> cancellation signal while holding its own lock. If so, then the patch >>>>>> below >>>>>> would cause an assertion to trigger in such circumstance. You will have >>>>>> to >>>>>> turn on --enable-assert in your configuration, leaving debugging entirely >>>>>> off not to significantly affect the current timings. >>>>>> >>>>>> diff --git a/include/boilerplate/lock.h b/include/boilerplate/lock.h >>>>>> index 6f0218c..b704987 100644 >>>>>> --- a/include/boilerplate/lock.h >>>>>> +++ b/include/boilerplate/lock.h >>>>>> @@ -206,7 +206,7 @@ int __check_cancel_type(const char *locktype); >>>>>> #define read_unlock_safe(__lock, __state) \ >>>>>> __do_unlock_safe(__lock, __state) >>>>>> >>>>>> -#ifdef CONFIG_XENO_DEBUG >>>>>> +#ifndef NDEBUG >>>>>> #define mutex_type_attribute PTHREAD_MUTEX_ERRORCHECK >>>>>> #else >>>>>> #define mutex_type_attribute PTHREAD_MUTEX_NORMAL >>>>>> diff --git a/lib/copperplate/threadobj.c b/lib/copperplate/threadobj.c >>>>>> index 05bb6cb..6547460 100644 >>>>>> --- a/lib/copperplate/threadobj.c >>>>>> +++ b/lib/copperplate/threadobj.c >>>>>> @@ -1310,6 +1310,7 @@ int threadobj_cancel(struct threadobj *thobj) >>>>>> static void finalize_thread(void *p) /* thobj->lock free */ >>>>>> { >>>>>> struct threadobj *thobj = p; >>>>>> + int ret; >>>>>> >>>>>> if (thobj == NULL || thobj == THREADOBJ_IRQCONTEXT) >>>>>> return; >>>>>> @@ -1343,7 +1344,9 @@ static void finalize_thread(void *p) /* thobj->lock >>>>>> free */ >>>>>> * waiting for us to start, pending on >>>>>> * wait_on_barrier(). Instead, hand it over to this thread. >>>>>> */ >>>>>> - threadobj_lock(thobj); >>>>>> + ret = threadobj_lock(thobj); >>>>>> + assert(ret == 0); >>>>>> + (void)ret; >>>>>> if ((thobj->status & __THREAD_S_SAFE) == 0) { >>>>>> threadobj_unlock(thobj); >>>>>> destroy_thread(thobj); >>>>>> >>>>> >>>>> And...the assert kicks in! >>>>> >>>>> threadobj.c:1348: finalize_thread: Assertion `ret == 0' failed. >>>> >>>> >>>> Philippe, >>>> >>>> Could you explain this case of the async cancellation signal? I don't >>>> fully understand how this happens or why it causes the issue. >>>> >>>> I also tested once with CONFIG_XENO_ASYNC_CANCEL OFF and then the >>>> issue doesn't occur anymore. >>>> >>>> >>> >>> I have not been able to reproduce this race yet, but still the following >>> patch would make sense: >>> >>> diff --git a/lib/copperplate/threadobj.c b/lib/copperplate/threadobj.c >>> index 05bb6cb..48aa032 100644 >>> --- a/lib/copperplate/threadobj.c >>> +++ b/lib/copperplate/threadobj.c >>> @@ -1045,9 +1045,11 @@ static int wait_on_barrier(struct threadobj *thobj, >>> int mask) >>> if (status & mask) >>> break; >>> oldstate = thobj->cancel_state; >>> + push_cleanup_lock(&thobj->lock); >>> __threadobj_tag_unlocked(thobj); >>> __RT(pthread_cond_wait(&thobj->barrier, &thobj->lock)); >>> __threadobj_tag_locked(thobj); >>> + pop_cleanup_lock(&thobj->lock); >>> thobj->cancel_state = oldstate; >>> } >>> >>> @@ -1243,9 +1245,11 @@ static void cancel_sync(struct threadobj *thobj) /* >>> thobj->lock held */ >>> */ >>> while (thobj->status & __THREAD_S_WARMUP) { >>> oldstate = thobj->cancel_state; >>> + push_cleanup_lock(&thobj->lock); >>> __threadobj_tag_unlocked(thobj); >>> __RT(pthread_cond_wait(&thobj->barrier, &thobj->lock)); >>> __threadobj_tag_locked(thobj); >>> + pop_cleanup_lock(&thobj->lock); >>> thobj->cancel_state = oldstate; >>> } >>> >> >> I have tried this patch but it does not seem to make any difference. >> The threads are still deadlocked at the same location. >> >> I've only tested this on my setup with the old toolchain/kernel so >> far. I'll check also with the newer version. >> > > Ok, thanks for testing. I would not bother testing with another > toolchain, I don't buy the toolchain bug in this case. This looks like a > common race due to the async nature of the cancellation event which is > released while the caller holds its own lock without proper cleanup > actions. I'll dig deeper. >
Okay, in case there is anything you want me to try out, don't hesitate to tell me. Thanks a lot already! Kim _______________________________________________ Xenomai mailing list Xenomai@xenomai.org http://www.xenomai.org/mailman/listinfo/xenomai