Recently I've been working on several problems with very different external 
symptoms but a similar root cause:

1) While accessing a particular fileserver, AFS clients experience performance 
delays; some also see multiple "server down/back up" problems.
  - root cause was a hardware bug on the fileserver that prevented timers from 
firing reliably; this unpredictably delayed any task in the rxevent queue, 
while leaving the rest of the fileserver's functionality relatively 
unaffected.  (btw, this was a pthreaded fileserver).

2) Volume releases suffer from poor performance and occasionally fail with 
timeouts.
  - root cause was heavier-than-normal vlserver load (perhaps caused by disk 
performance slowdowns); this starved the LWP IOMGR, which in turn prevented 
the LWP rx_Listener from being dispatched (a priority inversion), leading to a 
grossly delayed rxevent queue.

So in two very different situations, the rxevent queue was unable to process 
scheduled events in a timely manner, leading to very strange and 
difficult-to-diagnose symptoms.

I'm writing this note to begin a discussion on possible ways to address this in 
OpenAFS.

One possible approach is to implement some watchdog/sentinel code to detect 
when the rxevent queue is not working correctly; that is, when it is unable to 
run scheduled events in a timely manner.  Certainly rxevent can't watch 
itself; but rather than adding another thread as a watchdog, I chose to insert 
a sanity check into rxevent_Post().  This check compares the current time (if 
supplied in the "now" parameter) with the scheduled time of the top rxevent on 
the queue.  If the current time is past the scheduled time by more than a 
certain threshold, then we know that the rxevent queue has fallen behind (is 
"sick") for some unknown reason.  At this point, I set a state flag which 
causes any new connections to be aborted (with timeout or busy, for example).  
Both the threshold and the abort reply could be configurable, similar to the 
current implementation of the -busyat thread-busy threshold and response.  
Once the rxevent queue is able to catch up with its scheduled work, the "sick" 
state is reset.  Finally, warning messages could be written to the log to 
indicate that the rxevent queue is having difficulties, and later that it has 
returned to normal.  I have some prototype code working in my test 
environment; it needs some work before it will be suitable for review.
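
Roughly, the check amounts to something like the following sketch.  This is 
illustrative only, not the actual prototype code: the names 
rxevent_SanityCheck, rxevent_sick, and event_sick_threshold, and the plain 
time_t arithmetic, are all stand-ins for whatever the real patch would use.

    #include <time.h>

    volatile int rxevent_sick = 0;            /* nonzero while the queue is behind */
    static time_t event_sick_threshold = 10;  /* would be configurable, like -busyat */

    /*
     * Called from the rxevent post path when the caller supplies "now";
     * head_scheduled is the scheduled time of the top event on the queue.
     */
    static void
    rxevent_SanityCheck(time_t now, time_t head_scheduled)
    {
        if (now - head_scheduled > event_sick_threshold) {
            if (!rxevent_sick) {
                rxevent_sick = 1;
                /* log a warning: the rxevent queue has fallen behind */
            }
        } else if (rxevent_sick) {
            rxevent_sick = 0;
            /* log: the rxevent queue has returned to normal */
        }
    }

The new-connection path would then consult rxevent_sick and, while it is set, 
answer with the configured abort rather than admitting the call.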

Another possible approach is, instead of sending abort codes when we are 
"sick", to suspend RPC operations completely; that is, to send no packets and 
process no calls until we are no longer "sick".  That would mean the server 
process appears to "freeze" entirely whenever the event thread gets stuck.  
Certainly this would be an immediate alert that something is wrong, rather 
than the hit-or-miss mystery behavior we see when only the rxevent queue is 
not being dispatched.
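
In sketch form, this amounts to a gate at the front of the packet path.  
Again, everything here is hypothetical: struct packet and 
handle_incoming_packet() are stand-ins for the real rx structures and 
dispatch code, and rxevent_sick is the same illustrative flag as in the 
sketch above.

    #include <stdio.h>

    extern volatile int rxevent_sick;   /* set by the sanity check above */

    struct packet { int serial; };      /* stand-in for the real rx packet */

    static void
    handle_incoming_packet(struct packet *pkt)
    {
        if (rxevent_sick) {
            /*
             * Process nothing while "sick": no reply, no abort.  To
             * clients the server appears frozen until the rxevent
             * queue catches up and the flag is cleared.
             */
            return;
        }
        printf("dispatching packet %d\n", pkt->serial);  /* normal dispatch */
    }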

But all that is moot if the upstream development community finds these 
approaches misguided.  One could argue, in the case of the first failure, that 
it's unreasonable to expect OpenAFS to work predictably on buggy hardware.  
One could also discount the second failure on the grounds that it's just an 
LWP bug, and that priority inversion is not possible in pthreaded-ubik, which 
will be here real soon now.  However, in my opinion these are just two 
instances of an underlying weak spot in OpenAFS; there may be other ways that 
the rxevent queue could become "sick".  Therefore, I believe the rxevent queue 
could use some bulletproofing.

I look forward to your comments.

regards,
--
Mark Vitale
[email protected]
