On 23 Sep 2013, at 20:52, Mark Vitale <[email protected]> wrote:

> Recently I've been working on several problems with very different externals 
> but a similar root cause:
[ ... ]
> So in two very different situations, the rxevent queue was unable to process 
> scheduled events in a timely manner, leading to very strange and 
> difficult-to-diagnose symptoms.

Hi Mark,

It's worth noting that much of the code involved here is very different 
between 1.6 and master. I assume that you are seeing these issues on 1.6.

I rewrote the rxevent queue completely for master as part of YFS's RX 
performance work. A side-effect of the new implementation is that it should be 
less vulnerable to timer stalls - as soon as it is triggered, all of the 
expired events will be run, rather than just a subset. Master is also moving 
towards pthreaded ubik servers, so the need to work around thread starvation 
(whether through problems with IOMGR, or the use of non-IOMGR I/O) will be 
reduced.
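
To illustrate the "run everything that has expired" behaviour, here is a 
minimal sketch. It is not the rxevent code from master: every name in it 
(sketch_event, sketch_post, sketch_run_expired, sketch_count) is hypothetical, 
it uses a plain sorted linked list, and it leaves out locking entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical types and names; this is NOT the OpenAFS rxevent API. */
struct sketch_event {
    long when_ms;                 /* expiry time, in milliseconds */
    void (*func)(void *arg);      /* handler to call once expired */
    void *arg;                    /* opaque argument for the handler */
    struct sketch_event *next;    /* next event (same or later expiry) */
};

/* Example handler: counts how many times it has been called. */
void
sketch_count(void *arg)
{
    int *counter = arg;
    (*counter)++;
}

/* Insert ev into a list kept sorted by ascending expiry time. */
void
sketch_post(struct sketch_event **head, struct sketch_event *ev)
{
    assert(head != NULL && ev != NULL);
    while (*head != NULL && (*head)->when_ms <= ev->when_ms)
        head = &(*head)->next;
    ev->next = *head;
    *head = ev;
}

/*
 * Run every event that expired at or before now_ms in one pass,
 * instead of stopping after a fixed subset. Returns the number run.
 */
int
sketch_run_expired(struct sketch_event **head, long now_ms)
{
    int ran = 0;

    assert(head != NULL);
    while (*head != NULL && (*head)->when_ms <= now_ms) {
        struct sketch_event *ev = *head;

        *head = ev->next;         /* unlink before calling the handler */
        ev->next = NULL;
        ev->func(ev->arg);
        ran++;
    }
    return ran;
}
```

The point of the loop is that a late wakeup costs only latency: however 
overdue the timer is, one call clears the whole backlog of expired events.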

The challenge with rxevent is that, along with the listener thread, it is 
performance critical for the OpenAFS RX stack. If we do add additional code to 
handle edge cases, we need to be sure that the impact of that additional code 
on the common case is negligible. The more uncommon the situation we're trying 
to handle, the smaller the impact on the common case needs to be. In particular, 
anything that adds additional locking to the rxevent critical path needs to be 
very carefully handled.

My feeling is that it will be hard to justify adding the code you suggest to 
master, where it seems that the only potential trigger is hardware failure. 
There's a better case to make for 1.6, although as a "stable" release, I think 
we'd probably be looking for something of minimal impact - logging, rather than 
aborting, seems like the safest bet there.

Finally, it's also worth considering that some platforms have very vague ideas 
of what "timely" means when it comes to rxevent. Some kernels only schedule 
the event thread every half second, so on a link with a low RTT you can end up 
with many timeout events going past before the event thread actually notices!
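
As a rough worked example of that arithmetic: the half-second wakeup interval 
is the figure above, but the 10 ms timeout is purely an assumed value for 
illustration (not a measured RX timeout), and the function name is 
hypothetical.

```c
#include <assert.h>

/*
 * Hypothetical helper: how many back-to-back timeouts of timeout_ms
 * fit entirely within one wakeup interval of the event thread.
 */
long
sketch_missed_timeouts(long wake_interval_ms, long timeout_ms)
{
    assert(wake_interval_ms >= 0 && timeout_ms > 0);
    return wake_interval_ms / timeout_ms;   /* integer division */
}
```

With a 500 ms wakeup interval and the assumed 10 ms timeout, 
sketch_missed_timeouts(500, 10) gives 50, so dozens of timeouts could already 
be overdue by the time the event thread next runs.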

Cheers,

Simon

_______________________________________________
OpenAFS-devel mailing list
[email protected]
https://lists.openafs.org/mailman/listinfo/openafs-devel
