On Jan 10, 2014 2:52 PM, "Philippe Gerum" <r...@xenomai.org> wrote:
>
> On 01/10/2014 12:35 PM, Kim De Mey wrote:
>>
>> 2014/1/10 Philippe Gerum <r...@xenomai.org>:
>>>
>>> On 01/10/2014 10:25 AM, Philippe Gerum wrote:
>>>>
>>>>
>>>> On 01/09/2014 11:29 AM, Kim De Mey wrote:
>>>>>
>>>>>
>>>>> 2014/1/9 Philippe Gerum <r...@xenomai.org>:
>>>>>>
>>>>>>
>>>>>> On 01/08/2014 01:23 PM, Kim De Mey wrote:
>>>>>>>
>>>>>>>
>>>>>>> 2014/1/8 Philippe Gerum <r...@xenomai.org>:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 01/08/2014 10:25 AM, Kim De Mey wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Here are the backtraces:
>>>>>>>
>>>>>>> main thread:
>>>>>>> (gdb) bt
>>>>>>> #0  clock_nanosleep (clock_id=<optimized out>, flags=<optimized
out>,
>>>>>>> req=<optimized out>, rem=<optimized out>)
>>>>>>>        at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:51
>>>>>>> #1  0x00b45e10 in threadobj_sleep () from
>>>>>>>
>>>>>>>
/repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>
>>>>>>> #2  0x00b2b618 in tm_wkafter () from
>>>>>>>
>>>>>>>
/repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>>>>>
>>>>>>> #3  0x10000b3c in main (argc=1, argv=0x104120e0) at
>>>>>>> suspend_delete_easy.c:42
>>>>>>>
>>>>>>> first psos task:
>>>>>>> (gdb) bt
>>>>>>> #0  0x00b6b5c8 in __old_sem_wait (sem=<optimized out>) at
>>>>>>> ../nptl/sysdeps/unix/sysv/linux/sem_wait.c:105
>>>>>>> #1  0x00b45c84 in threadobj_cancel () from
>>>>>>>
>>>>>>>
/repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>
>>>>>>> #2  0x00b2ac44 in t_delete () from
>>>>>>>
>>>>>>>
/repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>>>>>
>>>>>>> #3  0x10000a70 in test (a=0, b=0, c=0, d=0) at
suspend_delete_easy.c:22
>>>>>>> #4  0x00b2a658 in task_trampoline () from
>>>>>>>
>>>>>>>
/repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>>>>>
>>>>>>> #5  0x00b63824 in start_thread (arg=<optimized out>) at
>>>>>>> pthread_create.c:299
>>>>>>> #6  0x00c7df4c in __thread_start () from
>>>>>>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
>>>>>>>
>>>>>>> second psos task:
>>>>>>> (gdb) bt
>>>>>>> #0  0x00b6cbbc in read () from
>>>>>>>
>>>>>>>
output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libpthread.so.0
>>>>>>>
>>>>>>> #1  0x00b47bd8 in notifier_wait () from
>>>>>>>
>>>>>>>
/repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>
>>>>>>> #2  0x00b460b0 in notifier_callback () from
>>>>>>>
>>>>>>>
/repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>
>>>>>>> #3  0x00b47d30 in notifier_sighandler () from
>>>>>>>
>>>>>>>
/repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>
>>>>>>> #4  <signal handler called>
>>>>>>> #5  sigcancel_handler (sig=32, si=0x1e03aa8, ctx=0x1e03b28) at
>>>>>>> init.c:136
>>>>>>> #6  <signal handler called>
>>>>>>> #7  clock_nanosleep (clock_id=<optimized out>, flags=<optimized
out>,
>>>>>>> req=<optimized out>, rem=<optimized out>)
>>>>>>>        at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:51
>>>>>>> #8  0x00b45e10 in threadobj_sleep () from
>>>>>>>
>>>>>>>
/repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
>>>>>>>
>>>>>>> #9  0x00b2b618 in tm_wkafter () from
>>>>>>>
>>>>>>>
/repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>>>>>
>>>>>>> #10 0x100009bc in idle_task (a=0, b=0, c=0, d=0) at
>>>>>>> suspend_delete_easy.c:9
>>>>>>> #11 0x00b2a658 in task_trampoline () from
>>>>>>>
>>>>>>>
/repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libpsos.so.0
>>>>>>>
>>>>>>> #12 0x00b63824 in start_thread (arg=<optimized out>) at
>>>>>>> pthread_create.c:299
>>>>>>> #13 0x00c7df4c in __thread_start () from
>>>>>>> output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libc.so.6
>>>>>>>
>>>>>>>
>>>>>>> Invocation command line was
>>>>>>>
>>>>>>>      $ ./configure --target=mips64-buildroot-linux-gnu
>>>>>>> --host=mips64-buildroot-linux-gnu --build=x86_64-unknown-linux-gnu
>>>>>>> --prefix=/usr --exec-prefix=/usr --sysconfdir=/etc --program-prefix=
>>>>>>> --disable-gtk-doc --disable-doc --disable-docs
--disable-documentation
>>>>>>> --with-xmlto=no --with-fop=no --enable-ipv6 --enable-static
>>>>>>> --enable-shared --with-core=mercury
>>>>>>> --includedir=/usr/include/xenomai-forge --disable-doc-install
>>>>>>> --enable-lores-clock
>>>>>>>
>>>>>>>
>>>>>>> Is this what you were asking?
>>>>>>
>>>>>>
>>>>>>
>>>>>> Yes, thanks.
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> By the way, I forgot the mention the following test that I did:
>>>>>>>
>>>>>>> When the application is blocked I've done an "echo 1 >
>>>>>>> /proc/pid/fd/xx",
>>>>>>> where xx is the file descriptor that is being read from in the
>>>>>>> notifier_wait().
>>>>>>> This kinda emulates a t_resume.
>>>>>>>
>>>>>>> After doing this the t_delete gets finalized and the application
>>>>>>> unblocks.
>>>>>>>
>>>>>>
>>>>>> This should fix the race, until the notifier is reworked.
>>>>>>
>>>>>> diff --git a/include/copperplate/threadobj.h
>>>>>> b/include/copperplate/threadobj.h
>>>>>> index 184c711..218c274 100644
>>>>>> --- a/include/copperplate/threadobj.h
>>>>>> +++ b/include/copperplate/threadobj.h
>>>>>> @@ -124,6 +124,7 @@ void threadobj_save_timeout(struct
>>>>>> threadobj_corespec *corespec,
>>>>>>    #define __THREAD_S_ACTIVE      (1 << 5)        /* Running user
>>>>>> code. */
>>>>>>    #define __THREAD_S_SUSPENDED   (1 << 6)        /* Suspended via
>>>>>> threadobj_suspend(). */
>>>>>>    #define __THREAD_S_SAFE                (1 << 7)        /* TCB
>>>>>> release deferred. */
>>>>>> +#define __THREAD_S_ZOMBIE      (1 << 8)        /* Deletion process
>>>>>> ongoing. */
>>>>>>    #define __THREAD_S_DEBUG       (1 << 31)       /* Debug mode
>>>>>> enabled. */
>>>>>>    /*
>>>>>>     * threadobj->run_state, locklessly updated by "current", merged
>>>>>> diff --git a/lib/copperplate/threadobj.c
b/lib/copperplate/threadobj.c
>>>>>> index a31479c..c54cdf8 100644
>>>>>> --- a/lib/copperplate/threadobj.c
>>>>>> +++ b/lib/copperplate/threadobj.c
>>>>>> @@ -420,11 +420,13 @@ static void notifier_callback(const struct
>>>>>> notifier *nf)
>>>>>>            * threadobj_suspend().
>>>>>>            */
>>>>>>           threadobj_lock(current);
>>>>>> -       current->status |= __THREAD_S_SUSPENDED;
>>>>>> -       threadobj_unlock(current);
>>>>>> -       notifier_wait(nf); /* Wait for threadobj_resume(). */
>>>>>> -       threadobj_lock(current);
>>>>>> -       current->status &= ~__THREAD_S_SUSPENDED;
>>>>>> +       if ((current->status & __THREAD_S_ZOMBIE) == 0) {
>>>>>> +               current->status |= __THREAD_S_SUSPENDED;
>>>>>> +               threadobj_unlock(current);
>>>>>> +               notifier_wait(nf); /* Wait for threadobj_resume(). */
>>>>>> +               threadobj_lock(current);
>>>>>> +               current->status &= ~__THREAD_S_SUSPENDED;
>>>>>> +       }
>>>>>>           threadobj_unlock(current);
>>>>>>    }
>>>>>>
>>>>>> @@ -488,7 +490,7 @@ static inline void threadobj_run_corespec(struct
>>>>>> threadobj *thobj)
>>>>>>    {
>>>>>>    }
>>>>>>
>>>>>> -static inline void threadobj_cancel_corespec(struct threadobj
*thobj)
>>>>>> +static inline void threadobj_cancel_corespec(struct threadobj
>>>>>> *thobj) /* thobj->lock held */
>>>>>>    {
>>>>>>    }
>>>>>>
>>>>>> @@ -872,9 +874,9 @@ void threadobj_init(struct threadobj *thobj,
>>>>>>
>>>>>>    static void destroy_thread(struct threadobj *thobj)
>>>>>>    {
>>>>>> +       threadobj_cleanup_corespec(thobj);
>>>>>>           __RT(pthread_cond_destroy(&thobj->barrier));
>>>>>>           __RT(pthread_mutex_destroy(&thobj->lock));
>>>>>> -       threadobj_cleanup_corespec(thobj);
>>>>>>    }
>>>>>>
>>>>>>    void threadobj_destroy(struct threadobj *thobj) /* thobj->lock
free */
>>>>>> @@ -1089,8 +1091,6 @@ static void cancel_sync(struct threadobj
>>>>>> *thobj) /* thobj->lock held */
>>>>>>           int oldstate, ret = 0;
>>>>>>           sem_t *sem;
>>>>>>
>>>>>> -       __threadobj_check_locked(thobj);
>>>>>> -
>>>>>>           /*
>>>>>>            * We have to allocate the cancel sync sema4 in the main
heap
>>>>>>            * dynamically, so that it always live in valid memory
when we
>>>>>> @@ -1106,6 +1106,7 @@ static void cancel_sync(struct threadobj
>>>>>> *thobj) /* thobj->lock held */
>>>>>>                   __STD(sem_init(sem, sem_scope_attribute, 0));
>>>>>>
>>>>>>           thobj->cancel_sem = sem;
>>>>>> +       thobj->status |= __THREAD_S_ZOMBIE;
>>>>>>
>>>>>>           /*
>>>>>>            * If the thread to delete is warming up, wait until it
>>>>>>
>>>>>> --
>>>>>> Philippe.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> I've tested the patch and it indeed fixes the problem. Thanks!
>>>>>
>>>>> For learning purpose, do you have an explanation as to why this
happens?
>>>>> Shouldn't the cancelling stop the blocking read? Or not in case the
>>>>> read comes after the cancel?
>>>>>
>>>>
>>>> Over mercury, since we don't have thread-directed suspend/resume
support
>>>> from the kernel, we force a particular thread into a suspended state by
>>>> sending it a notification signal via the async file I/O mechanism
>>>> (O_ASYNC). This way we can emulate a call like t_suspend() which
>>>> requires task-directed, immediate and unconditional action.
>>>>
>>>> The signal handler then waits on a read() call until it receives a
>>>> release message on the notifying I/O channel. But when that handler
>>>> runs, SIGCANCEL is blocked, so the cancellation point is temporarily
>>>> disabled. Conversely, when SIGCANCEL is first, the thread unwinds
>>>> properly.
>>>>
>>>> Looking at the idle_task's status when the issue happens:
>>>>
>>>> {rpm@cobalt} grep ^Sig /proc/6029/status
>>>> SigQ:    4/63608
>>>> SigPnd:    0000000000000000
>>>> ShdPnd:    0000000000000000
>>>> SigBlk:    0000020080000000
>>>>                   ^
>>>> __SIGRTMIN, aka nptl's SIGCANCEL
>>>>
>>>
>>> I'm unsure the spacing was right, so just in case:
>>> __SIGRTMIN is 32, so bit #31 is relevant (i.e. 0x80000000).
>>>
>>
>> I checked the Blocked Signals and bit #31 is blocked indeed, very
>> useful to check this SigBlk.
>> Thanks for explaining!
>>
>
> For the sake of completeness, this issue happens when SIGCANCEL
> and the notification signal pile up, which explains why SIGCANCEL ends up
being masked when read() is called by the notification handler.
> This is confirmed by the backtrace you sent:
>
>
> #0  0x00b6cbbc in read () from
output/host/usr/mips64-buildroot-linux-gnu/sysroot/lib32/libpthread.so.0
> #1  0x00b47bd8 in notifier_wait () from
/repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
> #2  0x00b460b0 in notifier_callback () from
/repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
> #3  0x00b47d30 in notifier_sighandler () from
/repo/kdemey/buildroot/output/host/usr/mips64-buildroot-linux-gnu/sysroot/usr/lib/libcopperplate.so.0
> #4  <signal handler called>
> #5  sigcancel_handler (sig=32, si=0x1e03aa8, ctx=0x1e03b28) at init.c:136
> #6  <signal handler called>
> #7  clock_nanosleep (clock_id=<optimized out>, flags=<optimized out>,
> req=<optimized out>, rem=<optimized out>)
>      at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:51
>
> Which gives:
>
> 1. idle_task receives SIGCANCEL, unblocked from tm_wkafter(), starts
running sigcancel_handler() from the nptl code (frames #7 -> #5). The
kernel has blocked SIGCANCEL on entry to the handler to prevent recursion
from the same signal.
>
> 2. before sigcancel_handler() unwinds, it is preempted by the
notification signal (NOTIFYSIG), which eventually blocks in read() (frame
#4 -> #0). This happens before the thread finalizer is called.
>
> So we end up sleeping on a read(), which impedes the releasing of the
sema4 cancel_sync() waits on.
>
> Randomness was introduced by the timing of events: we have to see both
SIGCANCEL and NOTIFYSIG pending at the same time for the target thread. In
that case, the kernel first delivers SIGCANCEL, then NOTIFYSIG, by order of
priority (__SIGRTMIN < SIGRTMIN + 8), creating a pile up situation. Raising
the priority of the t_suspend()+t_delete() caller compared to idle_task
makes the issue more likely, since the later will wake up last for handling
both signals. Nice bug.
>
> --
> Philippe.

That explains nicely why it did not always occur, which I was still
wondering. Interesting!
A nice bug it was.
_______________________________________________
Xenomai mailing list
Xenomai@xenomai.org
http://www.xenomai.org/mailman/listinfo/xenomai

Reply via email to