2014-08-04 19:06 GMT+02:00 Philippe Gerum <r...@xenomai.org>:
> On 07/29/2014 09:31 AM, Kim De Mey wrote:
>> 2014-07-28 23:52 GMT+02:00 Philippe Gerum <r...@xenomai.org>:
>>> On 07/28/2014 05:18 PM, Kim De Mey wrote:
>>>> 2014-07-28 13:09 GMT+02:00 Philippe Gerum <r...@xenomai.org>:
>>>>> On 07/23/2014 10:31 AM, Kim De Mey wrote:
>>>>>> 2014-07-09 17:05 GMT+02:00 Kim De Mey <kim.de...@gmail.com>:
>>>>>>>
>>>>>>> 2014-07-09 16:16 GMT+02:00 Philippe Gerum <r...@xenomai.org>:
>>>>>>>> On 07/09/2014 12:23 PM, Kim De Mey wrote:
>>>>>>>>>
>>>>>>>>> 2014-07-09 11:25 GMT+02:00 Philippe Gerum <r...@xenomai.org>:
>>>>>>>>>>
>>>>>>>>>> On 07/09/2014 10:09 AM, Kim De Mey wrote:
>>>>>>>>>>
>>>>>>>>>>> Newer version --dump-config output:
>>>>>>>>>>> ...
>>>>>>>>>>> CONFIG_XENO_COMPILER="gcc version 4.7.0 (Cavium Inc. Version:
>>>>>>>>>>> SDK_3_1_0 build 27) "
>>>>>>>>>>> ...
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thanks. I can't reproduce this issue yet, so posting the current gdb
>>>>>>>>>> backtraces for all the threads still shown by "info threads" after a
>>>>>>>>>> deadlock would help.
>>>>>>>>>>
>>>>>>>>>> TIA,
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Below backtraces of all the threads. It is a case with two "worker"
>>>>>>>>> tasks in deadlock. The "main" and "create_delete" tasks continued to
>>>>>>>>> their "tm_wkafter" loop.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> (gdb) thread a a bt
>>>>>>>>>
>>>>>>>>> Thread 4 (Thread 9406):
>>>>>>>>> #0  0x77cc2d94 in __pthread_mutex_lock_full (mutex=0x77b03300) at
>>>>>>>>> pthread_mutex_lock.c:321
>>>>>>>>> #1  0x77cc7e64 in __GI___pthread_mutex_lock (mutex=0x77b03300) at
>>>>>>>>> pthread_mutex_lock.c:55
>>>>>>>>> #2  0x77cf91e0 in threadobj_lock () from
>>>>>>>>>
>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>> #3  0x77cf92e4 in finalize_thread () from
>>>>>>>>>
>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>> #4  0x77cc5704 in __nptl_deallocate_tsd () at pthread_create.c:156
>>>>>>>>> #5  0x77cc5940 in start_thread (arg=0x75dff490) at 
>>>>>>>>> pthread_create.c:315
>>>>>>>>> #6  0x77bf1bbc in __thread_start () from
>>>>>>>>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
>>>>>>>>> Backtrace stopped: frame did not save the PC
>>>>>>>>>
>>>>>>>>> Thread 3 (Thread 31145):
>>>>>>>>> #0  0x77cc2d94 in __pthread_mutex_lock_full (mutex=0x77b030b8) at
>>>>>>>>> pthread_mutex_lock.c:321
>>>>>>>>> #1  0x77cc7e64 in __GI___pthread_mutex_lock (mutex=0x77b030b8) at
>>>>>>>>> pthread_mutex_lock.c:55
>>>>>>>>> #2  0x77cf91e0 in threadobj_lock () from
>>>>>>>>>
>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>> #3  0x77cf92e4 in finalize_thread () from
>>>>>>>>>
>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>> #4  0x77cc5704 in __nptl_deallocate_tsd () at pthread_create.c:156
>>>>>>>>> #5  0x77cc5940 in start_thread (arg=0x766ff490) at 
>>>>>>>>> pthread_create.c:315
>>>>>>>>> #6  0x77bf1bbc in __thread_start () from
>>>>>>>>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
>>>>>>>>> Backtrace stopped: frame did not save the PC
>>>>>>>>>
>>>>>>>>> Thread 2 (Thread 6970):
>>>>>>>>> #0  clock_nanosleep (clock_id=<optimized out>, flags=<optimized out>,
>>>>>>>>> req=<optimized out>, rem=<optimized out>)
>>>>>>>>>      at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:50
>>>>>>>>> #1  0x77cfa0bc in threadobj_sleep () from
>>>>>>>>>
>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>> #2  0x77d2af08 in tm_wkafter () from
>>>>>>>>>
>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>>>>>>> #3  0x10000b64 in create_delete (a=0, b=0, c=0, d=0) at
>>>>>>>>> delete_child_hangs.c:31
>>>>>>>>> #4  0x77d296fc in task_trampoline () from
>>>>>>>>>
>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>>>>>>> #5  0x77cf7d84 in thread_trampoline () from
>>>>>>>>>
>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>> #6  0x77cc592c in start_thread (arg=0x77a00490) at 
>>>>>>>>> pthread_create.c:310
>>>>>>>>> #7  0x77bf1bbc in __thread_start () from
>>>>>>>>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
>>>>>>>>> Backtrace stopped: frame did not save the PC
>>>>>>>>>
>>>>>>>>> Thread 1 (Thread 6969):
>>>>>>>>> #0  clock_nanosleep (clock_id=<optimized out>, flags=<optimized out>,
>>>>>>>>> req=<optimized out>, rem=<optimized out>)
>>>>>>>>>      at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:50
>>>>>>>>> #1  0x77cfa0bc in threadobj_sleep () from
>>>>>>>>>
>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>>> #2  0x77d2af08 in tm_wkafter () from
>>>>>>>>>
>>>>>>>>> /repo/kdemey/buildroot-sdk31/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>>>>>>> #3  0x10000c00 in main (argc=1, argv=0x10011030) at
>>>>>>>>> delete_child_hangs.c:43
>>>>>>>>>
>>>>>>>>
>>>>>>>> Ok, unless we've entered the twilight zone, the only reason I could 
>>>>>>>> see this
>>>>>>>> happening would be that the thread prologue somehow gets hit by an 
>>>>>>>> async
>>>>>>>> cancellation signal while holding its own lock. If so, then the patch 
>>>>>>>> below
>>>>>>>> would cause an assertion to trigger in such circumstance. You will 
>>>>>>>> have to
>>>>>>>> turn on --enable-assert in your configuration, leaving debugging 
>>>>>>>> entirely
>>>>>>>> off not to significantly affect the current timings.
>>>>>>>>
>>>>>>>> diff --git a/include/boilerplate/lock.h b/include/boilerplate/lock.h
>>>>>>>> index 6f0218c..b704987 100644
>>>>>>>> --- a/include/boilerplate/lock.h
>>>>>>>> +++ b/include/boilerplate/lock.h
>>>>>>>> @@ -206,7 +206,7 @@ int __check_cancel_type(const char *locktype);
>>>>>>>>  #define read_unlock_safe(__lock, __state)      \
>>>>>>>>         __do_unlock_safe(__lock, __state)
>>>>>>>>
>>>>>>>> -#ifdef CONFIG_XENO_DEBUG
>>>>>>>> +#ifndef NDEBUG
>>>>>>>>  #define mutex_type_attribute PTHREAD_MUTEX_ERRORCHECK
>>>>>>>>  #else
>>>>>>>>  #define mutex_type_attribute PTHREAD_MUTEX_NORMAL
>>>>>>>> diff --git a/lib/copperplate/threadobj.c b/lib/copperplate/threadobj.c
>>>>>>>> index 05bb6cb..6547460 100644
>>>>>>>> --- a/lib/copperplate/threadobj.c
>>>>>>>> +++ b/lib/copperplate/threadobj.c
>>>>>>>> @@ -1310,6 +1310,7 @@ int threadobj_cancel(struct threadobj *thobj)
>>>>>>>>  static void finalize_thread(void *p) /* thobj->lock free */
>>>>>>>>  {
>>>>>>>>         struct threadobj *thobj = p;
>>>>>>>> +       int ret;
>>>>>>>>
>>>>>>>>         if (thobj == NULL || thobj == THREADOBJ_IRQCONTEXT)
>>>>>>>>                 return;
>>>>>>>> @@ -1343,7 +1344,9 @@ static void finalize_thread(void *p) /* 
>>>>>>>> thobj->lock
>>>>>>>> free */
>>>>>>>>          * waiting for us to start, pending on
>>>>>>>>          * wait_on_barrier(). Instead, hand it over to this thread.
>>>>>>>>          */
>>>>>>>> -       threadobj_lock(thobj);
>>>>>>>> +       ret = threadobj_lock(thobj);
>>>>>>>> +       assert(ret == 0);
>>>>>>>> +       (void)ret;
>>>>>>>>         if ((thobj->status & __THREAD_S_SAFE) == 0) {
>>>>>>>>                 threadobj_unlock(thobj);
>>>>>>>>                 destroy_thread(thobj);
>>>>>>>>
>>>>>>>
>>>>>>> And...the assert kicks in!
>>>>>>>
>>>>>>> threadobj.c:1348: finalize_thread: Assertion `ret == 0' failed.
>>>>>>
>>>>>>
>>>>>> Philippe,
>>>>>>
>>>>>> Could you explain this case of the async cancellation signal? I don't
>>>>>> fully understand how this happens or why it causes the issue.
>>>>>>
>>>>>> I also tested once with CONFIG_XENO_ASYNC_CANCEL OFF and then the
>>>>>> issue doesn't occur anymore.
>>>>>>
>>>>>>
>>>>>
>>>>> I have not been able to reproduce this race yet, but still the following 
>>>>> patch would make sense:
>>>>>
>>>>> diff --git a/lib/copperplate/threadobj.c b/lib/copperplate/threadobj.c
>>>>> index 05bb6cb..48aa032 100644
>>>>> --- a/lib/copperplate/threadobj.c
>>>>> +++ b/lib/copperplate/threadobj.c
>>>>> @@ -1045,9 +1045,11 @@ static int wait_on_barrier(struct threadobj 
>>>>> *thobj, int mask)
>>>>>                 if (status & mask)
>>>>>                         break;
>>>>>                 oldstate = thobj->cancel_state;
>>>>> +               push_cleanup_lock(&thobj->lock);
>>>>>                 __threadobj_tag_unlocked(thobj);
>>>>>                 __RT(pthread_cond_wait(&thobj->barrier, &thobj->lock));
>>>>>                 __threadobj_tag_locked(thobj);
>>>>> +               pop_cleanup_lock(&thobj->lock);
>>>>>                 thobj->cancel_state = oldstate;
>>>>>         }
>>>>>
>>>>> @@ -1243,9 +1245,11 @@ static void cancel_sync(struct threadobj *thobj) 
>>>>> /* thobj->lock held */
>>>>>          */
>>>>>         while (thobj->status & __THREAD_S_WARMUP) {
>>>>>                 oldstate = thobj->cancel_state;
>>>>> +               push_cleanup_lock(&thobj->lock);
>>>>>                 __threadobj_tag_unlocked(thobj);
>>>>>                 __RT(pthread_cond_wait(&thobj->barrier, &thobj->lock));
>>>>>                 __threadobj_tag_locked(thobj);
>>>>> +               pop_cleanup_lock(&thobj->lock);
>>>>>                 thobj->cancel_state = oldstate;
>>>>>         }
>>>>>
>>>>
>>>> I have tried this patch but it does not seem to make any difference.
>>>> The threads are still deadlocked at the same location.
>>>>
>>>> I've only tested this on my setup with the old toolchain/kernel so
>>>> far. I'll check also with the newer version.
>>>>
>>>
>>> Ok, thanks for testing. I would not bother testing with another
>>> toolchain, I don't buy the toolchain bug in this case. This looks like a
>>> common race due to the async nature of the cancellation event which is
>>> released while the caller holds its own lock without proper cleanup
>>> actions. I'll dig deeper.
>>>
>>
>> Okay, in case there is anything you want me to try out, don't hesitate
>> to tell me.
>>
>> Thanks a lot already!
>>
>>
>
> I could reproduce the issue eventually on your exact test code, by
> dropping the cpu affinity from the test options (--cpu-affinity) on a
> multi-core system. The patch below should paper over the current
> manifestation of this bug, but the root issue which bites here lives way
> below this API layer, and is clearly SMP-related. Are you running a
> multi-core board?

I have tried the test with your patch and the issue no longer occurs.
Yes, I am indeed running a multi-core board (cpu: Cavium Octeon II - 8
cores).
The issue does not occur if I run the board with only 1 core activated.

>
> diff --git a/lib/psos/task.c b/lib/psos/task.c
> index 89ba917..1c37cc7 100644
> --- a/lib/psos/task.c
> +++ b/lib/psos/task.c
> @@ -202,9 +202,10 @@ static void *task_trampoline(void *arg)
>
>         threadobj_unlock(&task->thobj);
>
> +       threadobj_notify_entry();
> +
>         CANCEL_RESTORE(svc);
>
> -       threadobj_notify_entry();
>         args->entry(args->arg0, args->arg1, args->arg2, args->arg3);
>         threadobj_lock(&task->thobj);
>         threadobj_set_magic(&task->thobj, ~task_magic);
>
> --
> Philippe.

_______________________________________________
Xenomai mailing list
Xenomai@xenomai.org
http://www.xenomai.org/mailman/listinfo/xenomai

Reply via email to