Re: [OpenAFS-devel] volserver hangs possible fix

Horst Birthelmer Mon, 18 Apr 2005 08:09:09 -0700


On Apr 18, 2005, at 4:30 PM, Ted Anderson wrote:

On 4/18/2005 08:58, Horst Birthelmer wrote:
The problem isn't whether cond_wait is atomic. It's what happens to the algorithm if it's not. Imagine the scenario where it's not atomic (and this was the part where I agreed with Tom) and you have the mutex locked in the cond_wait call, but the thread isn't in the queue yet. Now this thread gets interrupted by whatever event and you perform a ...cond_broadcast(). All the threads are woken up except the one that isn't in the queue yet. You have a thread waiting on a cond_var you weren't aware of... actually you have that thread waiting on that condition variable after you already performed a broadcast. That's pretty weird for the algorithm.
Okay, but I don't agree that this situation would generate a problem in correctly written CV code. As long as the mutex is still held by the trying-to-sleep thread when the broadcast() occurs, then its *reason* for sleeping will still be true and hence there will eventually be another thread to come along and wake it up.
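The "correctly written CV code" referred to above is usually taken to mean the predicate-loop pattern. A minimal sketch follows, assuming POSIX threads; the names `lock`, `cond`, `ready`, `waiter`, `make_ready` and `run_waiter_demo` are hypothetical and are not taken from the OpenAFS sources. The waiter tests its reason for sleeping only while holding the mutex, so a broadcast that arrives before it is queued is harmless as long as the predicate was changed under that same mutex.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical shared state for illustration; not OpenAFS code. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static bool ready = false;

/* Waiter: re-checks the predicate in a loop while holding the mutex. */
static void *waiter(void *arg)
{
    int *seen = arg;

    pthread_mutex_lock(&lock);
    while (!ready)                        /* the "reason for sleeping" */
        pthread_cond_wait(&cond, &lock);  /* releases lock while blocked */
    *seen = 1;
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* Producer: changes the predicate under the mutex, then broadcasts
 * after unlocking (the variant under discussion in this thread). */
static void make_ready(void)
{
    pthread_mutex_lock(&lock);
    ready = true;
    pthread_mutex_unlock(&lock);
    pthread_cond_broadcast(&cond);
}

/* Starts one waiter, makes the predicate true, joins the waiter.
 * Returns 1 if the waiter observed ready, -1 if the thread could
 * not be created. */
static int run_waiter_demo(void)
{
    pthread_t t;
    int seen = 0;

    if (pthread_create(&t, NULL, waiter, &seen) != 0)
        return -1;
    make_ready();
    pthread_join(t, NULL);
    return seen;
}
```

Whichever thread runs first, the waiter either finds `ready` already true and never sleeps, or is blocked in `pthread_cond_wait()` when the broadcast arrives.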

Right, that's what I meant by "weird for your algorithm". If you designed it the wrong way it'll hang here.

However, I am concerned that you introduce this scenario with "where it's not atomic". Are there cases where cond_wait() is not atomic and it is necessary to write code to take that into account? So my question is still what these correct but non-atomic cond_wait() implementations look like, and how putting the broadcast() under the protection of the mutex would help.

Most implementations don't have an atomic cond_wait since it's not mandated by POSIX ;-) You just have to treat it that way, since there's no guarantee that you can rely on an atomic implementation. It's no "problem" at all, it's just one aspect you have to keep in mind.

I introduced that with "where it's not atomic" because that's the very assumption in this discussion. None of the arguments would hold if it were atomic.

By definition, a cond_wait is used for waiting on some event inside a critical section. This means you entered the cond_wait with the mutex held. Now the cond_wait call enqueues the thread into the queue of threads waiting on this condition variable and unlocks the mutex. If you make the broadcast call protected by the same mutex, you will always have a "settled" environment. This means you're sure you don't have any threads inside the critical section, since otherwise you wouldn't get to do the broadcast (you would be waiting in mutex_lock), and during the broadcast no threads can enter the critical section. I can only repeat myself: it's safer, but from my point of view not a fix for the problem we had.
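The "settled environment" described above can be sketched as follows. This is a minimal illustration assuming POSIX threads, not the volserver or callback code; `mtx`, `cv`, `event_happened`, `woken`, `NWAITERS` and the function names are all hypothetical. The broadcast is issued while the same mutex is held, so no waiter can be between its predicate check and its enqueue at that moment.

```c
#include <pthread.h>

#define NWAITERS 4   /* arbitrary number of waiters, for illustration */

/* Hypothetical shared state; not taken from the OpenAFS sources. */
static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv  = PTHREAD_COND_INITIALIZER;
static int event_happened = 0;
static int woken = 0;

/* Waiter: enters the critical section, sleeps until the event is set. */
static void *wait_for_event(void *unused)
{
    (void)unused;

    pthread_mutex_lock(&mtx);
    while (!event_happened)
        pthread_cond_wait(&cv, &mtx);
    woken++;                      /* still protected by mtx */
    pthread_mutex_unlock(&mtx);
    return NULL;
}

/* Broadcaster: holds the mutex across both the state change and the
 * broadcast, so every waiter is either queued on cv or has not yet
 * entered the critical section. */
static void signal_event_locked(void)
{
    pthread_mutex_lock(&mtx);
    event_happened = 1;
    pthread_cond_broadcast(&cv);
    pthread_mutex_unlock(&mtx);
}

/* Starts NWAITERS waiters, broadcasts once, joins them all.
 * Returns the number of waiters that woke up, or -1 if a thread
 * could not be created. */
static int run_broadcast_demo(void)
{
    pthread_t threads[NWAITERS];
    int i;

    for (i = 0; i < NWAITERS; i++)
        if (pthread_create(&threads[i], NULL, wait_for_event, NULL) != 0)
            return -1;
    signal_event_locked();
    for (i = 0; i < NWAITERS; i++)
        pthread_join(threads[i], NULL);
    return woken;
}
```

A waiter that arrives after the broadcast finds `event_happened` already set and never sleeps, so a single broadcast is enough for all of them.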

I should also say that I have not looked at the particular callback code at issue here. Perhaps it is using broadcast() in some unusual fashion (i.e. not using the producer/consumer model) that affects this discussion.

Well, it did not hold the mutex during the broadcast, so for some people there is a theoretical possibility of a problem.

Horst

_______________________________________________
OpenAFS-devel mailing list
[email protected]
https://lists.openafs.org/mailman/listinfo/openafs-devel
