On 02/12/2016 03:08 PM, Charles Kiorpes wrote:
> 
> 
> On Fri, Feb 12, 2016 at 5:43 AM, Philippe Gerum <r...@xenomai.org
> <mailto:r...@xenomai.org>> wrote:
> 
>     On 02/11/2016 01:57 PM, Charles Kiorpes wrote:
>     >
>     > I attempted to run several tests:  'task-1', 'event-1', and 'mutex-1'.
>     > Each of these hung indefinitely.  A gdb trace indicated that they were
>     > hanging on __libc_do_syscall() within __pthread_cond_wait() within
>     > threadobj_cond_wait().
>     >
>     > I have attached the full backtrace from mutex-1 as mutex-1_bt.txt
>     >
> 
>     Ok, if the test suite does not pass, something is badly wrong, so we
>     should investigate that hang issue before anything else.
> 
>     The backtrace reveals that copperplate cannot handshake with a newly
>     spawned task, this is the purpose of the wait_on_barrier() call over the
>     context of rt_task_start(). That barrier should be signaled by a call to
>     threadobj_notify_entry() from the internal trampoline code of the
>     emerging thread (task_entry() in alchemy/task.c).
> 
>     - maybe task_prologue_2() (alchemy/task.c) which is called earlier hangs
>     indefinitely, and therefore prevents threadobj_notify_entry() from
>     running?
> 
>     - maybe the new thread does not even start for some reason, are we sure
>     task_entry() is reached (e.g. do we hit a breakpoint there?)
> 
>     Could you inspect the current thread list under gdb when the program
>     hangs?
> 
>     Also, I would recommend enabling full debugging for now
>     (--enable-debug=full) to get accurate line information, assuming the
>     issue still shows up with non-optimized code. Hopefully.
> 
>     --
>     Philippe.
> 
> 
> I ran the task-1 test under gdb with this Xenomai configuration:
> --with-core=mercury \
> --enable-debug=full \
> --enable-registry \
> --enable-smp \
> --enable-pshared \
> --enable-condvar-workaround
> 
> It appears that the new thread is being launched, and getting stuck in
> threadobj_wait_start() within task_prologue_2(), as you indicated might
> be the case.
> I have attached the thread list and a full backtrace for each thread (in
> separate files by thread id).
> 
> As per your other message, my kernel configs all include CONFIG_FUTEX.
> 
> I have tried glibc 2.19 and 2.21, as well as RT patched and vanilla kernels.
> 
> Interestingly, when I removed --enable-pshared from my configuration,
> the task-1 test passed.
> 

Here is the synchronization pattern the code normally follows once the parent 
has successfully spawned a child thread, which must wait for a start signal 
before it may run application code:

1. parent calls threadobj_start(child)
        1.1 child->status |= __THREAD_S_STARTED
        1.2 wait for child->status & __THREAD_S_ACTIVE

2. child calls threadobj_wait_start(self)
        2.1 wait for self->status & __THREAD_S_STARTED
        2.2 raise self->status |= __THREAD_S_ACTIVE

All accesses to the status bits are serialized by a per-thread mutex, taken and 
released through the threadobj_lock()/threadobj_unlock() accessors. As one would 
expect, the same mutex also covers the condvar signaling and waiting.
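
For illustration only, the handshake above can be sketched with plain pthreads. 
This is not the copperplate code: struct thread_desc, wait_for_bits(), 
child_main(), handshake_demo() and the S_STARTED/S_ACTIVE bits are hypothetical 
stand-ins for the thread descriptor, wait_on_barrier() and the __THREAD_S_* 
flags, and error handling is kept to a minimum.

```c
#include <pthread.h>

/* Hypothetical stand-ins for __THREAD_S_STARTED / __THREAD_S_ACTIVE. */
#define S_STARTED 0x1
#define S_ACTIVE  0x2

/* Hypothetical thread descriptor: status word, its mutex, and a condvar. */
struct thread_desc {
	pthread_mutex_t lock;
	pthread_cond_t barrier;
	int status;
};

/* Wait until any bit in mask is set. The caller must hold td->lock. */
static int wait_for_bits(struct thread_desc *td, int mask)
{
	while (!(td->status & mask))
		pthread_cond_wait(&td->barrier, &td->lock);
	return td->status;
}

/* Child side: steps 2.1 and 2.2 of the pattern. */
static void *child_main(void *arg)
{
	struct thread_desc *self = arg;

	pthread_mutex_lock(&self->lock);
	wait_for_bits(self, S_STARTED);         /* 2.1 wait for the start signal */
	self->status |= S_ACTIVE;               /* 2.2 tell the parent we run */
	pthread_cond_broadcast(&self->barrier);
	pthread_mutex_unlock(&self->lock);
	return NULL;
}

/* Parent side: steps 1.1 and 1.2. Returns the final status word, or -1. */
int handshake_demo(void)
{
	struct thread_desc td = { .status = 0 };
	pthread_t tid;
	int status;

	pthread_mutex_init(&td.lock, NULL);
	pthread_cond_init(&td.barrier, NULL);
	if (pthread_create(&tid, NULL, child_main, &td))
		return -1;

	pthread_mutex_lock(&td.lock);
	td.status |= S_STARTED;                 /* 1.1 release the child */
	pthread_cond_broadcast(&td.barrier);
	status = wait_for_bits(&td, S_ACTIVE);  /* 1.2 wait for the child */
	pthread_mutex_unlock(&td.lock);

	pthread_join(tid, NULL);
	pthread_cond_destroy(&td.barrier);
	pthread_mutex_destroy(&td.lock);
	return status;
}
```

Calling handshake_demo() from main() and building with -pthread should return 
with both bits set. The point of the sketch is that every read and write of 
->status happens with the mutex held, so neither side can miss the other's 
wakeup.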

When running in pshared mode, thread descriptors (holding ->status, the mutex 
and the barrier sync) are obtained from /dev/shm. With --disable-pshared, they 
live entirely in process-private memory.

That leaves the following possible causes for the hang:

Case 1: a race when manipulating the thread status due to inconsistent locking. 
I could not find any so far.

Case 2: a cache coherence issue on SMP, again caused by improper locking. With 
correct locking, the mutex operations should enforce the required memory 
barriers.

Case 3: anything not covered by the cases above.

- Could you post the disassembly of the wait_on_barrier() function (from 
objdump -dl rather than gdb's disass)?

- Does running both programs with --cpu-affinity=0/1 change the outcome?

- Without specifying any affinity this time, could you run the current test 
with the debug patch below applied (this is clearly not a fix)? The patch 
forces the code to read the value of the ->status field before waiting on the 
barrier. With that code in and a backtrace showing locals, we should be able to 
check the status word when threadobj_wait_start() is entered.

diff --git a/lib/copperplate/threadobj.c b/lib/copperplate/threadobj.c
index cc64caa..ed85a12 100644
--- a/lib/copperplate/threadobj.c
+++ b/lib/copperplate/threadobj.c
@@ -1273,7 +1273,9 @@ void threadobj_wait_start(void) /* current->lock free. */
        int status;
 
        threadobj_lock(current);
-       status = wait_on_barrier(current, __THREAD_S_STARTED|__THREAD_S_ABORTED);
+       status = current->status;
+       if (!(status & __THREAD_S_STARTED))
+               status = wait_on_barrier(current, __THREAD_S_STARTED|__THREAD_S_ABORTED);
        threadobj_unlock(current);
 
        /*

-- 
Philippe.
