I've recently been looking at reasons why the fileserver performs badly on multi-processor systems. As part of this, I've been taking a look at the way in which the vnode state machine is implemented in the demand-attach fileserver. On cursory inspection, our current implementation seems to have some significant problems, including being susceptible to a variation of the thundering herd problem.
Firstly, some background (those familiar with DAFS can cut to the next paragraph!). In 1.4 a vnode can be unlocked, read locked, or write locked. These locks are handled using the standard AFS locks.h locking model - using either pthreads or LWP, depending on the build of the fileserver. With the demand-attach fileserver, these locks change into a set of vnode states. Whilst many different states are defined, they broadly divide into exclusive states (roughly akin to holding the write lock), STATE_READ (roughly similar to the read lock), STATE_ONLINE (similar to unlocked), and STATE_ERROR (which means something has gone wrong).

Threads use a pair of functions to wait either until a vnode is quiescent (in a non-exclusive state, and with no readers), or until it is non-exclusive (there is a third function which allows a thread to wait upon any state change, but that appears to be unused). These waits are implemented using a single, per-vnode, condition variable. Whenever a vnode's state is changed, we broadcast and wake up all threads waiting on that variable.

It's these broadcasts that cause us problems on multi-processor systems. Firstly, we broadcast regardless of the state change that has just occurred. If we have gone into an exclusive state, then we're waking up a load of threads that will be unable to make any progress. Secondly, broadcasting wakes up all pending threads, but the volume global lock means that only one can make progress at a time. If the one that wins this race requires exclusive access, then all of the other woken threads will in turn acquire the global lock, note that they can't gain access to the vnode, and go back to sleep again. On a contended system, this will lead to a huge number of false wakeups. Thirdly, there are some situations where we broadcast multiple times for a single state change.

I think any solution to this would require threads to indicate what they are going to do once they have waited.
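A minimal sketch of that idea, assuming two condition variables per vnode (one per kind of waiter) and a much-reduced state set. Every name here (vnode, vn_set_state, VN_ONLINE and so on) is hypothetical and is not the existing fileserver API; the per-vnode mutex merely stands in for the volume global lock, and the error state is left out.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical, reduced state set; the real DAFS vnode states are richer. */
typedef enum { VN_ONLINE, VN_READ, VN_EXCLUSIVE } vn_state;

typedef struct {
    pthread_mutex_t lock;    /* stands in for the volume global lock */
    pthread_cond_t  read_cv; /* waiters needing a non-exclusive vnode */
    pthread_cond_t  excl_cv; /* waiters needing a quiescent vnode */
    vn_state state;
    int readers, read_waiters, excl_waiters;
} vnode;

void vn_init(vnode *vn)
{
    pthread_mutex_init(&vn->lock, NULL);
    pthread_cond_init(&vn->read_cv, NULL);
    pthread_cond_init(&vn->excl_cv, NULL);
    vn->state = VN_ONLINE;
    vn->readers = vn->read_waiters = vn->excl_waiters = 0;
}

/* Change state (lock held) and wake only waiters that can now proceed. */
void vn_set_state(vnode *vn, vn_state s)
{
    vn->state = s;
    if (s == VN_EXCLUSIVE)
        return;                               /* nobody can make progress */
    if (vn->read_waiters > 0)
        pthread_cond_broadcast(&vn->read_cv); /* all readers may proceed */
    if (s == VN_ONLINE && vn->excl_waiters > 0)
        pthread_cond_signal(&vn->excl_cv);    /* only one could win anyway */
}

void vn_begin_read(vnode *vn)
{
    pthread_mutex_lock(&vn->lock);
    vn->read_waiters++;
    while (vn->state == VN_EXCLUSIVE)         /* wait until non-exclusive */
        pthread_cond_wait(&vn->read_cv, &vn->lock);
    vn->read_waiters--;
    vn->readers++;
    if (vn->state != VN_READ)
        vn_set_state(vn, VN_READ);
    pthread_mutex_unlock(&vn->lock);
}

void vn_end_read(vnode *vn)
{
    pthread_mutex_lock(&vn->lock);
    assert(vn->readers > 0);
    if (--vn->readers == 0)
        vn_set_state(vn, VN_ONLINE);          /* quiescent again */
    pthread_mutex_unlock(&vn->lock);
}

void vn_begin_excl(vnode *vn)
{
    pthread_mutex_lock(&vn->lock);
    vn->excl_waiters++;
    while (vn->state != VN_ONLINE)            /* wait until quiescent */
        pthread_cond_wait(&vn->excl_cv, &vn->lock);
    vn->excl_waiters--;
    vn_set_state(vn, VN_EXCLUSIVE);           /* wakes nobody */
    pthread_mutex_unlock(&vn->lock);
}

void vn_end_excl(vnode *vn)
{
    pthread_mutex_lock(&vn->lock);
    vn_set_state(vn, VN_ONLINE);
    pthread_mutex_unlock(&vn->lock);
}

typedef struct { vnode *vn; long *counter; int iters; } vn_demo_arg;

static void *vn_demo_worker(void *p)
{
    vn_demo_arg *a = p;
    for (int i = 0; i < a->iters; i++) {
        vn_begin_excl(a->vn);
        (*a->counter)++;      /* guarded only by the exclusive state */
        vn_end_excl(a->vn);
    }
    return NULL;
}

/* Contend for exclusive access from several threads; returns the counter. */
long vn_demo(int nthreads, int iters)
{
    enum { MAX_THREADS = 16 };
    pthread_t tid[MAX_THREADS];
    vnode vn;
    long counter = 0;
    vn_demo_arg arg = { &vn, &counter, iters };

    if (nthreads > MAX_THREADS)
        nthreads = MAX_THREADS;
    vn_init(&vn);
    for (int i = 0; i < nthreads; i++)
        pthread_create(&tid[i], NULL, vn_demo_worker, &arg);
    for (int i = 0; i < nthreads; i++)
        pthread_join(tid[i], NULL);
    return counter;
}
```

In this sketch a transition to ONLINE may wake both the readers and one exclusive waiter, so one side can still lose the race and sleep again; preferring one class over the other would be a policy decision, with the usual starvation trade-offs.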
This would allow us to selectively wake threads requiring exclusive access, but broadcast to threads requiring read access. These wakeups would then only be performed if the state that we have transitioned into would allow those threads to make forward progress.

I'd welcome input from others more familiar with this code as to whether this is actually a problem, or if I'm missing something with the pthread condvar implementation that mitigates the problem.

Cheers,

Simon.
