Recently I've been working on several problems with very different external symptoms but a similar root cause:
1) While accessing a particular fileserver, AFS clients experience performance delays; some also see repeated "server down/back up" messages.
   - root cause was a hardware bug on the fileserver that prevented timers from firing reliably; this unpredictably delayed any task in the rxevent queue, while leaving the rest of the fileserver's function relatively unaffected. (This was a pthreaded fileserver, by the way.)

2) Volume releases suffer from poor performance and occasionally fail with timeouts.
   - root cause was heavier-than-normal vlserver load (perhaps caused by disk performance slowdowns); this starved the LWP IOMGR, which in turn prevented the LWP rx_Listener from being dispatched (priority inversion), leading to a grossly delayed rxevent queue.

So in two very different situations, the rxevent queue was unable to process scheduled events in a timely manner, leading to very strange and difficult-to-diagnose symptoms. I'm writing this note to begin a discussion on possible ways to address this in OpenAFS.

One possible approach is to implement some watchdog/sentinel code to detect when the rxevent queue is not working correctly; that is, when it is unable to run scheduled events in a timely manner. Certainly rxevent can't watch itself; but rather than adding another thread as a watchdog, I chose to insert a sanity check into rxevent_Post(). This check compares the current time (if supplied in the "now" parameter) with the scheduled time of the top rxevent on the queue. If the current time is later than the scheduled time by more than a certain threshold, then we know the rxevent queue has fallen behind (is "sick") for some unknown reason. At that point, I set a state flag which causes any new connections to be aborted (with timeout or busy, for example). Both the threshold and the reply could be configurable, similar to the current implementation of the -busyat thread-busy threshold and response. Once the rxevent queue is able to catch up with its scheduling work, the "sick" state is reset. Finally, warning messages could be written to the log to indicate that the rxevent queue is having difficulties, and later that it has returned to normal. I have some prototype code working in my test environment; it needs some work before it will be suitable for review, but a simplified sketch of the idea appears at the end of this message.

Another possible approach is, instead of sending abort codes while we are "sick", to suspend RPC operations entirely; that is, to send no packets and process no calls until we are no longer "sick". That would mean the server process appears to "freeze" completely whenever the event thread gets stuck. This would certainly be an immediate alert that something is wrong, rather than the hit-or-miss mystery behavior we see today when the rxevent queue is not being dispatched.

But all of that is moot if the upstream development community finds these approaches misguided. One could argue, in the case of the first failure, that it's unreasonable to expect OpenAFS to work predictably on buggy hardware. One could also discount the second failure on the grounds that it's just an LWP bug, and that priority inversion is not possible in pthreaded ubik, which will be here real soon now. However, in my opinion these are just two instances of an underlying weak spot in OpenAFS; there may be other ways for the rxevent queue to become "sick". Therefore, I believe the rxevent queue could use some bulletproofing.

I look forward to your comments.
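For reference, here is a rough, self-contained sketch of the kind of check I have in mind. This is not my actual prototype: the names and types here (rxevent_SanityCheck, event_queue_head, rx_event_sick, RXEVENT_SICK_SECONDS) are invented for illustration, and the real rx event structures differ.

#include <stddef.h>   /* NULL */
#include <time.h>     /* struct timespec */

#define RXEVENT_SICK_SECONDS 5   /* hypothetical lateness threshold;
                                  * would be configurable in practice */

struct rxevent {
    struct timespec eventTime;   /* when this event is scheduled to fire */
    struct rxevent *next;
};

/* head of an earliest-deadline-first event queue */
static struct rxevent *event_queue_head;

/* state flag consulted by the code that accepts new connections */
static int rx_event_sick;

/*
 * Sanity check, intended to be called from rxevent_Post()-like code
 * whenever the caller supplies its idea of the current time.  If the
 * event at the head of the queue should have fired more than
 * RXEVENT_SICK_SECONDS ago, the event queue has fallen behind: mark
 * it "sick".  Once the head event is no longer overdue, clear the
 * flag again.
 */
static void
rxevent_SanityCheck(const struct timespec *now)
{
    struct rxevent *head = event_queue_head;

    if (head == NULL || now == NULL)
        return;

    if (now->tv_sec - head->eventTime.tv_sec > RXEVENT_SICK_SECONDS) {
        if (!rx_event_sick) {
            rx_event_sick = 1;
            /* log here: rxevent queue has fallen behind; new calls
             * will be aborted (or the server frozen) until recovery */
        }
    } else if (rx_event_sick) {
        rx_event_sick = 0;
        /* log here: rxevent queue has caught up; back to normal */
    }
}

In the real code, the path that accepts new connections would consult the "sick" flag and reply with a configurable abort code, analogous to the way the -busyat threshold and response are handled today.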
regards,
--
Mark Vitale
[email protected]
