Hi Nadav,

Thanks for the review.

More replies inline...[RG]

On Mon, Apr 3, 2017 at 6:16 AM, Nadav Har'El <n...@scylladb.com
<javascript:_e(%7B%7D,'cvml','n...@scylladb.com');>> wrote:

> Were you able to reproduce this problem - perhaps by reducing sleeping in
> the test case, or something?
>

[RG] The root cause is that a thread can start the next round of waiting
before the previous round has completed. To reproduce the problem, I added
an 'in' counter to the barrier along with a sleep after the call to await()
but before the mutex was acquired to update barrier->out. I was able to see
threads pass the barrier with a return value of 0 (i.e., they are not the
ones that received the special retval and thus responsible for resetting
the barrier for the next round of waits) and then re-start another round of
waiting before the previous round ends. This eventually let a thread from
the previous round cross the open barrier twice and push the count in the
latch down to -1, which prevents the next round from ever completing -
leading to a hang.  Adding the assert in latch.hh highlighted this issue as
well.

>
> Some comments below.
>
> On Mon, Apr 3, 2017 at 4:30 AM, 'rean' via OSv Development <
> osv-dev@googlegroups.com
> <javascript:_e(%7B%7D,'cvml','osv-dev@googlegroups.com');>> wrote:
>
>> Signed-off-by: Rean Griffith <r...@caa.columbia.edu
>> <javascript:_e(%7B%7D,'cvml','r...@caa.columbia.edu');>>
>> ---
>>  include/osv/latch.hh    |  3 +++
>>  libc/pthread_barrier.cc | 25 ++++++++++++++++---------
>>  2 files changed, 19 insertions(+), 9 deletions(-)
>>
>> diff --git a/include/osv/latch.hh b/include/osv/latch.hh
>> index 0fc47ba..d8a90a9 100644
>> --- a/include/osv/latch.hh
>> +++ b/include/osv/latch.hh
>> @@ -10,6 +10,7 @@
>>  #include <condition_variable>
>>  #include <mutex>
>>  #include <atomic>
>> +#include <assert.h>
>>
>>  class latch
>>  {
>> @@ -29,6 +30,8 @@ public:
>>          if (_count.fetch_add(-1, std::memory_order_release) == 1) {
>>              _condvar.notify_all();
>>          }
>> +        // Sanity check
>> +        assert(_count.load(std::memory_order_acquire) >= 0);
>>
>
> There is no reason to use memory_order_acquire here - memory_order_relaxed
> should be just as good.
>
[RG] Sure. I'll update that.

>
>
>>      }
>>
>>      bool is_released()
>> diff --git a/libc/pthread_barrier.cc b/libc/pthread_barrier.cc
>> index 2437d73..0fbbb63 100644
>> --- a/libc/pthread_barrier.cc
>> +++ b/libc/pthread_barrier.cc
>> @@ -15,6 +15,7 @@
>>  // pthread_barrierattr_t
>>  typedef struct
>>  {
>> +    unsigned int in;
>>      unsigned int out;
>>      unsigned int count;
>>      latch *ltch;
>> @@ -47,6 +48,7 @@ int pthread_barrier_init(pthread_barrier_t
>> *barrier_opq,
>>      // threads can manipulate the memory area associated with the
>>      // pthread_barrier_t so it doesn't matter what the value of pshared
>> is set to
>>      barrier->count = count;
>> +    barrier->in = 0;
>>      barrier->out = 0;
>>      barrier->ltch = new latch(count);
>>      barrier->mtx = new pthread_mutex_t;
>> @@ -67,7 +69,19 @@ int pthread_barrier_wait(pthread_barrier_t
>> *barrier_opq)
>>      int retval = 0;
>>      pthread_mutex_t *mtx = barrier->mtx;
>>
>> +    // Check whether we are about to start another round of waiting on
>> the
>> +    // barrier before the previous round completes (i.e., the 'in'
>> counter
>> +    // on the barrier plus 1 from the current thread entering is greater
>> +    // than barrier->count). If the round isn't complete then we should
>> +    // sleep until the previous round is finished and the 'in' counter is
>> +    // reset by the last thread exiting from the previous round
>> +    while(barrier->in + 1 > barrier->count) {
>> +        usleep(5);
>> +    }
>>
>
> This doesn't seem right to me for several reasons... First, how can we
> access barrier->in and barrier->count without the mutex protection?
> Second, it's not nice to use usleep() where we should have slept until an
> event (e.g., using a condition variable).
> Perhaps we should have, inside the mutex protection we have one line
> below, use a condition variable to wait?
>
[RG] I can change barrier->in and barrier->count to atomic ints and then we
should not need to synchronize to read their values.

re: usleep, I'll see what a prototype looks like with a condition variable
instead of a busy wait. If that works then we may not need to read the
barrier's 'in' counter or count at all.

>
> I'm a bit rusty as to what this code does, so I can't offer more detailed
> advice. I'm starting to get the feeling that using "latch" for this added
> more problems than it solved.
>
[RG] Fair enough. Let me see whether the atomic ints or condition variable
approach pan out.

thanks,
Rean

>
>
>> +
>>      pthread_mutex_lock(mtx);
>> +    // Increment the incoming count
>> +    barrier->in++;
>>      pthread_mutex_unlock(mtx);
>>
>>      latch *l  = barrier->ltch;
>> @@ -75,15 +89,6 @@ int pthread_barrier_wait(pthread_barrier_t
>> *barrier_opq)
>>      // All threads stuck here until we get at least 'count' waiters
>>      l->await();
>>
>> -    // If the last thread (thread x) to wait on the barrier is
>> descheduled here
>> -    // (immediately after being the count'th thread crossing the barrier)
>> -    // the barrier remains open (a new waiting thread will cross) until
>> -    // the barrier is reset below (when thread x is rescheduled), which
>> doesn't
>> -    // seem technically incorrect. Only one of the crossing threads will
>> get a
>> -    // retval of PTHREAD_BARRIER_SERIAL_THREAD, when
>> -    // barrier->out == barrier->count.
>> -    // All other crossing threads will get a retval of 0.
>> -
>>      pthread_mutex_lock(mtx);
>>      barrier->out++;
>>      // Make the last thread out responsible for resetting the barrier's
>> latch.
>> @@ -99,6 +104,8 @@ int pthread_barrier_wait(pthread_barrier_t
>> *barrier_opq)
>>          // Reset the 'out' counter so that the equality check above
>> works across
>>          // multiple rounds of threads waiting on the barrier
>>          barrier->out = 0;
>> +        // Reset the 'in' counter so that the next round of waits can
>> start
>> +        barrier->in = 0;
>>      }
>>      pthread_mutex_unlock(mtx);
>>      return retval;
>> --
>> 2.7.4
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "OSv Development" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to osv-dev+unsubscr...@googlegroups.com
>> <javascript:_e(%7B%7D,'cvml','osv-dev%2bunsubscr...@googlegroups.com');>.
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>

-- 
You received this message because you are subscribed to the Google Groups "OSv 
Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to osv-dev+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Reply via email to