On 9/23/2013 3:52 PM, Mark Vitale wrote:
> Recently I've been working on several problems with very different externals 
> but a similar root cause:
> 
> 1) While accessing a particular fileserver, AFS clients experience 
> performance delays; some also see multiple "server down/back up" problems.
>   - root cause was a hardware bug on the fileserver that prevented timers 
> from firing reliably; this unpredictably delayed any task in the rxevent 
> queue, while leaving the rest of the fileserver function relatively 
> unaffected.  (btw, this was a pthreaded fileserver).

I sympathize.  Tracking this down must have been quite frustrating.  One
of the basic assumptions of the rx event implementation is that timers
are reliable and that the OS can be trusted to manage them.  If this
assumption is not true on a given system then rx_event will appear to
misbehave.

> 2) Volume releases suffer from poor performance and occasionally fail with 
> timeouts.
>   - root cause was heavier-than-normal vlserver load (perhaps caused by disk 
> performance slowdowns); this starved LWP IOMGR, which in turn prevented LWP 
> rx_Listener from being dispatched (priority inversion), leading to a grossly 
> delayed rxevent queue.

Did you intend to write "rx_Listener" in the above sentence or did you
mean the rx_Event thread?  If the listener thread never runs then
incoming packets are not processed in a timely manner; although it will
appear as if events are taking a long time to process, it is more likely
that they were simply not posted in a timely fashion.

If the listener thread is running and the event thread is not, then the
event queue becomes very long, and on 1.6 the cost of posting an event
grows with the queue length, to the point where it slows down the
listener thread.
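To illustrate the cost being described (the structure and names below are
illustrative only, not the actual rx event code): posting into a
time-sorted singly linked event list requires a linear scan to find the
insertion point, so each post into a queue of n earlier events costs O(n)
comparisons.

```c
#include <stddef.h>

/* Illustrative sketch only: a time-sorted event list, not the real
 * rx event implementation. */
struct event {
    long fire_time;          /* absolute time the event should fire */
    struct event *next;
};

/* Insert 'ev' keeping the list sorted by fire_time; returns the number
 * of comparisons performed, to make the O(n) cost visible. */
static int
post_event(struct event **head, struct event *ev)
{
    int comparisons = 0;
    struct event **pp = head;

    while (*pp != NULL) {
        comparisons++;
        if (ev->fire_time < (*pp)->fire_time)
            break;              /* found our slot */
        pp = &(*pp)->next;
    }
    ev->next = *pp;
    *pp = ev;
    return comparisons;
}
```

A post that lands at the tail of a long queue walks the entire list, which
is why a backlogged event queue makes every subsequent post (and the thread
doing the posting) slower.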

> So in two very different situations, the rxevent queue was unable to process 
> scheduled events in a timely manner, leading to very strange and 
> difficult-to-diagnose symptoms.
> 
> I'm writing this note to begin a discussion on possible ways to address this 
> in OpenAFS.
> 
> One possible approach is to implement some watchdog/sentinel code to detect 
> when the rxevent queue is not working correctly; that is, when it's unable to 
> run scheduled events in a timely manner.   Certainly rxevent can't watch 
> itself; but rather than adding another thread as a watchdog, I chose to 
> insert a sanity check into rxevent_Post().  This check essentially compares 
> the current time (if supplied on the "now" parameter) with the scheduled time 
> for the top rxevent on the queue.  If it's later than a certain threshold, 
> then we know that the rxevent queue has fallen behind (is "sick") for some 
> unknown reason.  At this point, I set a state flag which causes any new 
> connections to abort (with timeout or busy, for example).  Both the threshold 
> and reply could be configurable, similar to the current implementation of the 
> -busyat thread-busy threshold and response.  After the rxevent queue is able 
> to catch up with its scheduling work, the "sick" state is reset.  And lastly, 
> warning messages could be written to the log to indicate that the rxevent 
> queue is having difficulties and later has returned to normal.  I have some 
> prototype code working in my test environment; it needs some work before it 
> will be suitable for review.
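For concreteness, the check described above might be sketched roughly as
follows.  All of the names here (rxevent_queue_sick, rxevent_CheckHealth,
the threshold constant) are hypothetical, not identifiers from the OpenAFS
tree; this is only a sketch of the idea, not the prototype code.

```c
#include <stddef.h>

/* Hypothetical sketch of the described sanity check; none of these
 * names come from the actual OpenAFS source. */
#define RXEVENT_SICK_THRESHOLD 5    /* seconds of lateness tolerated */

struct rxevent_sketch {
    long eventTime;                 /* scheduled fire time, in seconds */
};

static int rxevent_queue_sick = 0;  /* flag consulted on new connections */

/* Called from the post path when the caller supplies "now": compare the
 * current time with the scheduled time of the earliest queued event.
 * Sets (or resets) the sick flag and returns its new value. */
static int
rxevent_CheckHealth(long now, const struct rxevent_sketch *top)
{
    if (top != NULL && now - top->eventTime > RXEVENT_SICK_THRESHOLD) {
        rxevent_queue_sick = 1;     /* queue has fallen behind */
    } else {
        rxevent_queue_sick = 0;     /* caught up again; reset */
    }
    return rxevent_queue_sick;
}
```

The connection-accept path would then consult the flag and reply with abort,
busy, or timeout as configured, analogous to the existing -busyat handling.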
> 
> Another possible approach is, instead of sending abort codes when we are 
> "sick", merely suspend RPC operations completely; that is, don't send any 
> packets or process any calls until we aren't "sick" again.   That would mean 
> that the server process appears to "freeze" entirely whenever the event 
> thread gets stuck.  Certainly this would be an immediate alert that something 
> is wrong, rather than the hit-or-miss mystery behavior when merely the 
> rxevent queue is not being dispatched.   
> 
> But all that is moot if the upstream development community finds these 
> approaches misguided.  One could argue in the case of the first failure, that 
> it's unreasonable to expect OpenAFS to work predictably on buggy hardware.  
> One could also discount the second failure on the grounds that it's just an 
> LWP bug, and that priority inversion is not possible in pthreaded-ubik, which 
> will be here real soon now.   However, in my opinion these are both just two 
> instances of an underlying weak spot in OpenAFS; there may be other ways that 
> the rxevent queue could become "sick".   Therefore, I believe the rxevent 
> queue could use some bulletproofing.


In rx, the listener and event processing are time critical.  If you
compare 1.6 to master you will see that Simon has redesigned event
management in order to cut down on the number of comparisons required in
that path.  Adding new checks or writing to a log file in the event post
path should be avoided.  Increasing the complexity of the code to work
around an OS or hardware failure that violates basic assumptions is a
bad idea.  If this was expected behavior of the OS or hardware, then we
might need to consider whether that OS or hardware should even be supported.

As for the LWP issue, it is unclear from your description what the root
cause of the problem is.  You have not described the resulting packet
flows on the wire and how those flows are being interpreted or
misinterpreted by the peer.

As a more general comment, I am concerned about repercussions of the
proposed behavior changes on the overall ecosystem.  When you say
"merely suspect RPC operations completely" I assume that you have a file
server or volume location server in mind but they aren't the only rx
servers.  I don't think that replacing one broken behavior with a new
broken behavior (aborts or refusal to respond) is a good idea.  Doing so
increases complexity and has the potential for unintended side effects.

Feel free to respond with more specific details of the problem.

Jeffrey Altman

