I've been looking recently at reasons why the fileserver performs badly on 
multi processor systems. As part of this, I've been taking a look at the way in 
which the vnode state machine is implemented as part of the demand-attach 
fileserver. On cursory inspection, our current implementation seems to have 
some significant implementation problems, including being susceptible to a 
variation of the thundering herd problem.

Firstly, some background (those familiar with DAFS can cut to the next 
paragraph!). In 1.4 a vnode can be unlocked, read locked, or write locked. 
These locks are handled using the standard AFS locks.h locking model - using 
either pthreads or LWP, depending on the build of the fileserver. With the 
demand-attach file server, these locks change into being a set of vnode states. 
Whilst many different states are defined, these states broadly divide into 
exclusive states (roughly akin to holding the write lock), STATE_READ (roughly 
similar to the read lock), STATE_ONLINE (similar to unlocked), and STATE_ERROR 
(which means something has gone wrong)

Threads use a pair of functions to either wait until a vnode is quiescent (in a 
non-exclusive state, and with no readers), or until it is non-exclusive (there 
is a third function which allows a thread to wait upon any state change, but 
that appears to be unused). These waits are implemented using a single, 
per-vnode, condition variable. Whenever a vnode's state is changed, we 
broadcast and wake up all threads waiting on that variable.

It's these broadcasts that cause us problems on multi-processor systems. 
Firstly, we broadcast regardless of the state change that has just occurred. If 
we have gone into an exclusive state, then we're waking up a load of threads 
that will be unable to make any progress. Secondly, broadcasting wakes up all 
pending threads, but the volume global lock means that only one can make 
progress. If the one that wins this race requires exclusive access, then all of 
the other woken threads will in turn acquire the global lock, note that they 
can't gain access to the vnode, and go back to sleep again. On a contended 
system, this will lead to a huge number of false wakeups. Thirdly, there are 
some situations where we broadcast multiple times for a single state change.

I think any solution to this would require threads to indicate what they are 
going to do once they have waited. This allow us to selectively wake threads 
requiring exclusive access but broadcast to threads requiring read access. 
These wakeups would then only be performed if the state that we have 
transitioned in to would allow those threads to make forward progress.

I'd welcome input from others more familiar with this code as to whether this 
is actually a problem, or if I'm missing something with the pthread condvar 
implementation that mitigates the problem.

Cheers,

Simon.

_______________________________________________
OpenAFS-devel mailing list
[email protected]
https://lists.openafs.org/mailman/listinfo/openafs-devel

Reply via email to