On 9/23/2013 3:52 PM, Mark Vitale wrote:
> Recently I've been working on several problems with very different externals
> but a similar root cause:
>
> 1) While accessing a particular fileserver, AFS clients experience
> performance delays; some also see multiple "server down/back up" problems.
> - root cause was a hardware bug on the fileserver that prevented timers
> from firing reliably; this unpredictably delayed any task in the rxevent
> queue, while leaving the rest of the fileserver function relatively
> unaffected. (btw, this was a pthreaded fileserver).
I sympathize. Tracking this down must have been quite frustrating. One of
the basic assumptions of the rx event implementation is that timers are
reliable and that the OS can be trusted to manage them. If this assumption
does not hold on a given system, then rx_event will appear to misbehave.

> 2) Volume releases suffer from poor performance and occasionally fail with
> timeouts.
> - root cause was heavier-than-normal vlserver load (perhaps caused by disk
> performance slowdowns); this starved LWP IOMGR, which in turn prevented LWP
> rx_Listener from being dispatched (priority inversion), leading to a grossly
> delayed rxevent queue.

Did you intend to write "rx_Listener" in the above sentence, or did you mean
the rx_Event thread? If the listener thread never runs, then incoming packets
are not processed in a timely manner; although it will appear as if events
are taking a long time to process, it is more likely that they were not
posted in a timely fashion. If the listener thread is running and the event
thread is not, then the event queue becomes very long, and on 1.6 the cost
of processing an event post increases to the point where it slows down the
listener thread.

> So in two very different situations, the rxevent queue was unable to process
> scheduled events in a timely manner, leading to very strange and
> difficult-to-diagnose symptoms.
>
> I'm writing this note to begin a discussion on possible ways to address this
> in OpenAFS.
>
> One possible approach is to implement some watchdog/sentinel code to detect
> when the rxevent queue is not working correctly; that is, when it's unable to
> run scheduled events in a timely manner. Certainly rxevent can't watch
> itself; but rather than adding another thread as a watchdog, I chose to
> insert a sanity check into rxevent_Post(). This check essentially compares
> the current time (if supplied on the "now" parameter) with the scheduled time
> for the top rxevent on the queue. If it's later than a certain threshold,
> then we know that the rxevent queue has fallen behind (is "sick") for some
> unknown reason. At this point, I set a state flag which causes any new
> connections to abort (with timeout or busy, for example). Both the threshold
> and reply could be configurable, similar to the current implementation of the
> -busyat thread-busy threshold and response. After the rxevent queue is able
> to catch up with its scheduling work, the "sick" state is reset. And lastly,
> warning messages could be written to the log to indicate that the rxevent
> queue is having difficulties and later has returned to normal. I have some
> prototype code working in my test environment; it needs some work before it
> will be suitable for review.
>
> Another possible approach is, instead of sending abort codes when we are
> "sick", merely suspend RPC operations completely; that is, don't send any
> packets or process any calls until we aren't "sick" again. That would mean
> that the server process appears to "freeze" entirely whenever the event
> thread gets stuck. Certainly this would be an immediate alert that something
> is wrong, rather than the hit-or-miss mystery behavior when merely the
> rxevent queue is not being dispatched.
>
> But all that is moot if the upstream development community finds these
> approaches misguided. One could argue in the case of the first failure that
> it's unreasonable to expect OpenAFS to work predictably on buggy hardware.
> One could also discount the second failure on the grounds that it's just an
> LWP bug, and that priority inversion is not possible in pthreaded-ubik, which
> will be here real soon now. However, in my opinion these are both just two
> instances of an underlying weak spot in OpenAFS; there may be other ways that
> the rxevent queue could become "sick". Therefore, I believe the rxevent
> queue could use some bulletproofing.
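Before responding to that, let me make sure I am reading the proposal
correctly. Below is a minimal, standalone sketch of the check as I understand
your description. Every name in it (rxevent_CheckLag, rx_event_lag_threshold,
rxevent_queue_sick, rx_ShouldRefuseNewConnection) is mine rather than anything
from your prototype, and it uses plain struct timeval in place of the rx clock
primitives, so treat it as an illustration only:

/*
 * Standalone sketch of the proposed "sick queue" check.  The names and
 * types here are illustrative stand-ins, not OpenAFS code.
 */
#include <stdio.h>
#include <sys/time.h>

/* Stand-in for the scheduled time of the event at the head of the queue. */
static struct timeval rxevent_head_when;

/* Tunable lag threshold in seconds, analogous in spirit to -busyat. */
static int rx_event_lag_threshold = 30;

/* Non-zero while the event queue is considered "sick". */
static int rxevent_queue_sick = 0;

/*
 * Compare the caller-supplied "now" against the scheduled time of the head
 * event.  If the head event is overdue by more than the threshold, mark the
 * queue sick; once the queue catches up, clear the flag.  Log only on state
 * transitions.
 */
static void
rxevent_CheckLag(const struct timeval *now)
{
    long lag = (long)(now->tv_sec - rxevent_head_when.tv_sec);

    if (lag > rx_event_lag_threshold) {
        if (!rxevent_queue_sick) {
            rxevent_queue_sick = 1;
            fprintf(stderr, "rxevent queue is %ld seconds behind\n", lag);
        }
    } else if (rxevent_queue_sick) {
        rxevent_queue_sick = 0;
        fprintf(stderr, "rxevent queue has returned to normal\n");
    }
}

/*
 * A new-connection path could consult the flag and answer with a busy or
 * timeout abort while the queue is sick, as proposed.
 */
static int
rx_ShouldRefuseNewConnection(void)
{
    return rxevent_queue_sick;
}

int
main(void)
{
    struct timeval now;

    /* Pretend the head event was scheduled 45 seconds ago. */
    gettimeofday(&now, NULL);
    rxevent_head_when = now;
    rxevent_head_when.tv_sec -= 45;

    rxevent_CheckLag(&now);
    printf("refuse new connections: %d\n", rx_ShouldRefuseNewConnection());
    return 0;
}

If something along the lines of rxevent_CheckLag() is called from inside
rxevent_Post(), then the comparison and any logging on a state change both
execute on the event post path, and that path is exactly what concerns me.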
In rx, the listener and event processing are time critical. If you compare
1.6 to master, you will see that Simon has redesigned event management in
order to cut down on the number of comparisons required in that path. Adding
new checks or writing to a log file in the event post path should be avoided.
Increasing the complexity of the code to work around an OS or hardware
failure that violates basic assumptions is a bad idea. If this were expected
behavior of the OS or hardware, then we might need to consider whether that
OS or hardware should even be supported.

For the LWP issue, it is unclear from your description what the root cause
of the problem is. You have not described the resulting packet flows on the
wire and how those flows are being interpreted or misinterpreted by the peer.

As a more general comment, I am concerned about the repercussions of the
proposed behavior changes on the overall ecosystem. When you say "merely
suspend RPC operations completely", I assume that you have a file server or
volume location server in mind, but they aren't the only rx servers. I don't
think that replacing one broken behavior with a new broken behavior (aborts
or refusal to respond) is a good idea. Doing so increases complexity and has
the potential for unintended side effects.

Feel free to respond with more specific details of the problem.

Jeffrey Altman
