On Sep 24, 2013, at 11:52 AM, Simon Wilkinson <[email protected]> wrote:

> So, it's worth noting that lots of the code involved here is very different 
> between 1.6 and master. I assume that you are seeing these issues on 1.6.
Yes, "essentially" 1.6.

> I rewrote the rxevent queue completely for master as part of YFS's RX 
> performance work. A side-effect of the new implementation is that it should 
> be less vulnerable to timer stalls - as soon as it is triggered, all of the 
> expired events will be run, rather than just a subset. Master is also moving 
> towards pthreaded ubik servers, so the need to work around thread starvation 
> (whether through problems with IOMGR, or the use of non-IOMGR I/O) will be 
> reduced. 
> 
> The challenge with rxevent is that, along with the listener thread, it is 
> performance critical for the OpenAFS RX stack. If we do add additional code 
> to handle edge cases, we need to be sure that the impact of that additional 
> code on the common case is negligible. The more uncommon the situation we're 
> trying to handle, the smaller the impact on the common case needs to be. In 
> particular anything that adds additional locking to the rxevent critical path 
> needs to be very carefully handled.
In my prototype, logging is done with dpf() from rxevent_Post(), and only when 
the sick/well state changes (as a one-shot). 

> My feeling is that it will be hard to justify adding the code you suggest to 
> master, where it seems that the only potential trigger is hardware failure. 
> There's a better case to make for 1.6, although as a "stable" release, I 
> think we'd probably be looking for something of minimal impact - logging, 
> rather than aborting seems like the safest bet there.
Okay.

> Finally, it's also worth considering that some platforms have very vague 
> ideas of what "timely" means when it comes to rxevent - some kernels only 
> schedule the event thread every half second - on a link with a low RTT, you 
> can end up with many timeout events going past before the event thread 
> actually notices!
In the prototype, "untimely" is when (the top) scheduled events miss their run 
time by more than 5 seconds. 
I realize that "untimely" from an Rx protocol point of view is several orders 
of magnitude more stringent than that, but I'm only trying to flag gross 
problems with this fix.
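
A minimal sketch of that check in C, to make the two ideas concrete: the 5-second "untimely" cutoff, and one-shot logging only when the sick/well state changes. This is not the prototype code. Every name here (timer_health, timer_health_check, LATE_THRESHOLD_SECS) is hypothetical, and fprintf() stands in for dpf(); the real check would live in rxevent_Post() and use Rx's own clock types.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical names throughout; none of this is OpenAFS API. */
#define LATE_THRESHOLD_SECS 5   /* "untimely" cutoff described above */

struct timer_health {
    bool sick;          /* true while the head event is grossly overdue */
    unsigned changes;   /* number of sick/well transitions reported */
};

/*
 * Compare the due time of the earliest pending event against "now"
 * (both in whole seconds).  Returns true only when the sick/well state
 * flips, so a message is emitted once per transition rather than once
 * per call.
 */
static bool
timer_health_check(struct timer_health *h, long head_due, long now)
{
    long lateness = (now > head_due) ? (now - head_due) : 0;
    bool late = lateness > LATE_THRESHOLD_SECS;

    if (late == h->sick)
        return false;           /* common case: no state change, no logging */

    h->sick = late;
    h->changes++;
    fprintf(stderr, "event queue is now %s (head event %ld s past due)\n",
            late ? "sick" : "well", lateness);
    return true;
}
```

In this sketch the common case costs one subtraction and two comparisons and takes no additional lock, assuming the caller already holds whatever protects the head of the queue.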

Thanks for your comments (and to everyone else that responded).  They were all 
very helpful.

The consensus so far seems to be that I should only attempt to address this in 
1.6 (not master), and then only if I can find a low-impact way to log the 
warning messages.  Have I gauged that correctly?

Regards,
--
Mark Vitale
[email protected]


