Re: [OpenAFS-devel] volserver hangs possible fix

Ted Anderson Mon, 18 Apr 2005 05:44:11 -0700

On 4/12/2005 12:04, Jeffrey Hutzelman wrote:

On Tuesday, April 12, 2005 11:11:29 AM -0400 Ted Anderson <[EMAIL PROTECTED]> wrote:

Just to make sure I understand this problem, let me try providing a
specific scenario:
     thread A        thread B
     --------        --------
  1:                 lock M;
  2:                 check queue;
  3:                 cond_wait(C,M);     // queue is empty
  4:                  -> kernel_scheduler {
  5:                         unlock M;
  6: lock M;                 ...
  7: add item;               ...
  8: unlock M;               ...
  9: broadcast on C;         ...         // noop, nobody blocked on C
10:                         put B on C;
11:                         desched B;
12:          DEADLOCK

Is this a problem that can occur with a corrently operating
implementation of POSIX threads, i.e. an example of correct but
non-deterministic scheduling?  I've expanded the operation of cond_wait
as I understand it.  I observe that M correctly protects the queue
itself.  Treating "broadcast C" and "put B on C" as black boxes, however,
it doesn't seem to help to move broadcast(C) inside M, i.e. by swapping 8
& 9.  On the other hand, tt would seem a simple matter to reverse the
sequence of 5 & 10, but perhaps that is inconvenient for some
multi-processor schedulers.

As I understand your comment, you are saying that the broadcast must be
protected by the mutex or race conditions are possible.  What am I
missing here?

What you're missing here is that cond_wait() is atomic. The lines you list as 5 and 10 happen at the same time, so it's not possible for thread A to acquire the lock before thread B is on the wait-list for the CV. This is why it is necessary to hold the lock through the broadcast in order to prevent a race -- otherwise, the broadcast could indeed occur between lines 2 and 10, and thread B would miss the wakeup.

Well, I always thought cond_wait() was atomic. But if that is true then I don't see why broadcast() needs to be covered by the mutex. In the above example (ignoring the kernel_scheduler stuff), in thread A's execution the broadcast (9) follows the locked item addition (6-8). So, either thread B goes first, gets added to the CV's queue and then gets awoken, or thread A goes first and there the item is added to the queue and B won't sleep. How does moving broadcast() inside the mutex change anything?

What I was responding to way this:
On 4/11/2005 16:26, Horst Birthelmer wrote:
> On Apr 11, 2005, at 9:21 PM, Tom Keiser wrote:
>> It is a race condition.  pthread_cond_wait() is not an atomic
>> operation; it relies on the mutex to synchronize a set of operations
>> that enqueue the calling thread onto the list of threads blocked on
>> the cond var, drop the associated lock, and become blocked by the
>> kernel scheduler.  If you call signal/broadcast without holding the
>> lock, you can race against these bookkeeping operations in
>> pthread_cond_wait().  Thus, a thread in the process of blocking will
>> never find out about that signal/broadcast, because cond vars are not
>> stateful like semaphores.
>
> The mutex passed to a condition variable isn't for guarding the
> bookkeeping, but that's another story, I guess.

It seemed that both Horst and Tom agreed that broadcast() should be protected, though neither was clear on the reason. The quotes from POSIX didn't clarify it for me. So what I tried to do was break out the sub-operations that must compose cond_wait() to try to see how protecting broadcast() with the mutex would help, but I couldn't construct a failing case.

Maybe one of you could outline a scenario where broadcast() must be protected to avoid a race condition?

Ted Anderson

_______________________________________________
OpenAFS-devel mailing list
[email protected]
https://lists.openafs.org/mailman/listinfo/openafs-devel

Re: [OpenAFS-devel] volserver hangs possible fix

Reply via email to