A scenario and request for clarification below...
On 4/11/2005 15:21, Tom Keiser wrote:
On Apr 11, 2005 2:35 PM, Horst Birthelmer <[EMAIL PROTECTED]> wrote:
On Apr 6, 2005, at 1:40 PM, Tom Keiser wrote:
While auditing the callback code, I noticed some funky condition variable usage between FsyncCheckLWP and BreakVolumeCallbacksLater. One issue is that BreakVolumeCallbacksLater calls pthread_cond_broadcast on fsync_cond without holding the associated lock. Here's a quick patch for that race:
I don't actually think that would be a race condition; it just leads to some undefined scheduling behavior. If you haven't locked the mutex with which the other thread is doing its wait, you may just continue with that thread, depending on your scheduler's mood ;-)
It is a race condition. pthread_cond_wait() is not an atomic operation; it relies on the mutex to synchronize the sequence of operations that enqueues the calling thread onto the list of threads blocked on the cond var, drops the associated lock, and lets the kernel scheduler block the thread. If you call signal/broadcast without holding the lock, you can race against this bookkeeping inside pthread_cond_wait(), and a thread that is still in the process of blocking will never find out about that signal/broadcast, because cond vars are not stateful the way semaphores are.
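To make the lost-wakeup hazard concrete, here is a minimal sketch of the usual predicate/mutex/cond-var pattern (illustrative only; the names work_pending, worker, and producer are made up, and this is not the OpenAFS patch). The waiter re-checks a predicate under the mutex, and the signalling thread sets the predicate and broadcasts while still holding the mutex, so a waiter that is part-way through blocking in pthread_cond_wait() cannot miss the wakeup:

/* Sketch of the safe pattern: predicate + mutex + cond var.
 * Illustrative names only; not the OpenAFS code. */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int work_pending = 0;        /* predicate, protected by 'lock' */

static void *worker(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    while (!work_pending)           /* always re-check; cond vars are stateless */
        pthread_cond_wait(&cond, &lock);
    work_pending = 0;
    pthread_mutex_unlock(&lock);
    printf("worker: woke up with work pending\n");
    return NULL;
}

static void *producer(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    work_pending = 1;
    /* Broadcast while holding 'lock': pthread_cond_wait() releases the
     * mutex and blocks atomically with respect to it, so no waiter can
     * be stuck half-way between "unlock" and "blocked" when we signal. */
    pthread_cond_broadcast(&cond);
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t w, p;
    pthread_create(&w, NULL, worker, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(w, NULL);
    pthread_join(p, NULL);
    return 0;
}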
Locking that mutex means you _will_ definitely continue when you call the unlock and not at some undefined point in time. Here's what the POSIX standard says about this:
Actually, it cannot guarantee that type of ordering. Precisely when the thread will be scheduled to run is an implementation detail of whatever platform you're running, the number of processors, etc. The core distinction is whether you will _always_ wake up on a call to signal/broadcast, or only _sometimes_ wake up on a call to signal/broadcast. In the general case, there are algorithms where there will be no future signal/broadcast unless the thread wakes up every time. Luckily, the callback algorithm only experiences delays in the current implementation.
<quote> If more than one thread is blocked on a condition variable, the scheduling policy shall determine the order in which threads are unblocked. When each thread unblocked as a result of a pthread_cond_broadcast() or pthread_cond_signal() returns from its call to pthread_cond_wait() or pthread_cond_timedwait(), the thread shall own the mutex with which it called pthread_cond_wait() or pthread_cond_timedwait(). The thread(s) that are unblocked shall contend for the mutex according to the scheduling policy (if applicable), and as if each had called pthread_mutex_lock().
The pthread_cond_broadcast() or pthread_cond_signal() functions may be called by a thread whether or not it currently owns the mutex that threads calling pthread_cond_wait() or pthread_cond_timedwait() have associated with the condition variable during their waits; however, if predictable scheduling behavior is required, then that mutex shall be locked by the thread calling pthread_cond_broadcast() or pthread_cond_signal(). </quote>
The paragraph directly following your quote is extremely important:
<quote> The pthread_cond_broadcast() and pthread_cond_signal() functions shall have no effect if there are no threads currently blocked on cond. </quote>
And here's an important related paragraph from pthread_cond_wait() in the spec:
<quote> These functions atomically release mutex and cause the calling thread to block on the condition variable cond; atomically here means "atomically with respect to access by another thread to the mutex and then the condition variable". That is, if another thread is able to acquire the mutex after the about-to-block thread has released it, then a subsequent call to pthread_cond_broadcast() or pthread_cond_signal() in that thread shall behave as if it were issued after the about-to-block thread has blocked. </quote>
By "predictable scheduling" they are referring to whether there is a deterministic relationship between calling signal/broadcast, and a thread waking up. Calling signal/broadcast outside of the lock is nondeterministic; it may or may not have any effect.
When you use signal/broadcast outside of the lock, you have no guarantee whether the thread you're attempting to wake is busy, preparing to block, or already blocked. In the callback code we've been discussing, the result is that you could unnecessarily wait 5 minutes to break callbacks that could just as easily have been broken right now, and on busy servers you've just increased your callback-break backlog as well :-(
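As a rough sketch of why the failure mode here is a delay rather than a hang (this only loosely mirrors the fsync/callback loop; the checker name, the predicate, and the 300-second constant are illustrative), a waiter that uses pthread_cond_timedwait() still runs when its timeout expires even if a broadcast was lost:

/* Sketch of a periodic checker using a timed wait; illustrative only.
 * If a broadcast is lost, the loop still runs at the next timeout,
 * so the cost is latency (up to CHECK_INTERVAL), not a deadlock. */
#include <pthread.h>
#include <errno.h>
#include <time.h>

#define CHECK_INTERVAL 300              /* seconds; the "5 minutes" above */

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int breaks_queued = 0;           /* illustrative predicate */

static void *checker(void *arg)
{
    (void)arg;
    for (;;) {
        struct timespec deadline;
        clock_gettime(CLOCK_REALTIME, &deadline);
        deadline.tv_sec += CHECK_INTERVAL;

        pthread_mutex_lock(&lock);
        while (!breaks_queued) {
            if (pthread_cond_timedwait(&cond, &lock, &deadline) == ETIMEDOUT)
                break;                  /* timed out: go poll anyway */
        }
        breaks_queued = 0;
        pthread_mutex_unlock(&lock);

        /* ... do the deferred work here ... */
    }
    return NULL;
}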
Just to make sure I understand this problem, let me try providing a specific scenario:
      thread A                   thread B
      --------                   --------
 1:                              lock M;
 2:                              check queue;
 3:                              cond_wait(C,M);     // queue is empty
 4:                              -> kernel_scheduler {
 5:                                   unlock M;
 6:   lock M;                         ...
 7:   add item;                       ...
 8:   unlock M;                       ...
 9:   broadcast on C;                 ...            // noop, nobody blocked on C
10:                                   put B on C;
11:                                   desched B;
12:                              DEADLOCK
Is this a problem that can occur with a correctly operating implementation of POSIX threads, i.e. an example of correct but non-deterministic scheduling? I've expanded the operation of cond_wait as I understand it. I observe that M correctly protects the queue itself. Treating "broadcast on C" and "put B on C" as black boxes, however, it doesn't seem to help to move broadcast(C) inside M, i.e. by swapping 8 & 9. On the other hand, it would seem a simple matter to reverse the sequence of 5 & 10, but perhaps that is inconvenient for some multi-processor schedulers.
As I understand your comment, you are saying that the broadcast must be protected by the mutex or race conditions are possible. What am I missing here?
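For concreteness, here is one way the scenario above might look in C, with the broadcast held inside M as the patch discussed earlier does (the names M, C, and queue_len are illustrative, not the OpenAFS code; step numbers in the comments refer to the diagram):

/* Illustrative rendering of the diagram above; not the OpenAFS code. */
#include <pthread.h>

static pthread_mutex_t M = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  C = PTHREAD_COND_INITIALIZER;
static int queue_len = 0;               /* stand-in for "the queue" */

static void *thread_B(void *arg)        /* the waiter */
{
    (void)arg;
    pthread_mutex_lock(&M);             /* 1: lock M */
    while (queue_len == 0)              /* 2: check queue */
        pthread_cond_wait(&C, &M);      /* 3-5, 10, 11: unlock M, put B on C,
                                         * desched B -- atomic with respect to
                                         * M, per the spec quote above */
    queue_len--;                        /* consume the item */
    pthread_mutex_unlock(&M);
    return NULL;
}

static void *thread_A(void *arg)        /* the producer */
{
    (void)arg;
    pthread_mutex_lock(&M);             /* 6: lock M */
    queue_len++;                        /* 7: add item */
    pthread_cond_broadcast(&C);         /* 9: broadcast on C, now inside M */
    pthread_mutex_unlock(&M);           /* 8: unlock M */
    return NULL;
}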
Ted Anderson
