On 23 Sep 2013, at 20:52, Mark Vitale <[email protected]> wrote:
> Recently I've been working on several problems with very different externals
> but a similar root cause:
[ ... ]
> So in two very different situations, the rxevent queue was unable to process
> scheduled events in a timely manner, leading to very strange and
> difficult-to-diagnose symptoms.

Hi Mark,

So, it's worth noting that lots of the code involved here is very different between 1.6 and master. I assume that you are seeing these issues on 1.6.

I rewrote the rxevent queue completely for master as part of YFS's RX performance work. A side-effect of the new implementation is that it should be less vulnerable to timer stalls: as soon as it is triggered, all of the expired events will be run, rather than just a subset. Master is also moving towards pthreaded ubik servers, so the need to work around thread starvation (whether through problems with IOMGR, or the use of non-IOMGR I/O) will be reduced.

The challenge with rxevent is that, along with the listener thread, it is performance critical for the OpenAFS RX stack. If we do add additional code to handle edge cases, we need to be sure that the impact of that additional code on the common case is negligible. The more uncommon the situation we're trying to handle, the smaller the impact on the common case needs to be. In particular, anything that adds additional locking to the rxevent critical path needs to be very carefully handled.

My feeling is that it will be hard to justify adding the code you suggest to master, where it seems that the only potential trigger is hardware failure. There's a better case to make for 1.6, although as a "stable" release, I think we'd probably be looking for something of minimal impact. Logging, rather than aborting, seems like the safest bet there.

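To make the "run everything that has expired" and "log rather than abort" points a bit more concrete, here is a rough toy sketch in Python. It is not the rxevent code from master or 1.6; the class, the method names and the `late_threshold` parameter are all made up for illustration, and it ignores locking entirely.

```python
import heapq
import itertools
import logging

log = logging.getLogger("rxevent_sketch")


class EventQueue:
    """Toy timer queue. All names here are hypothetical, not the OpenAFS API."""

    def __init__(self, late_threshold=1.0):
        self._heap = []                # entries are (when, seq, callback)
        self._seq = itertools.count()  # tie-breaker for events with equal times
        self.late_threshold = late_threshold

    def post(self, when, callback):
        """Schedule callback to run at time `when`."""
        heapq.heappush(self._heap, (when, next(self._seq), callback))

    def run_expired(self, now):
        """Run every event due at or before `now`; return how many ran."""
        ran = 0
        while self._heap and self._heap[0][0] <= now:
            when, _, callback = heapq.heappop(self._heap)
            if now - when > self.late_threshold:
                # Minimal-impact handling: log the stall and carry on.
                log.warning("event ran %.3fs late", now - when)
            callback()
            ran += 1
        return ran

    def pending(self):
        return len(self._heap)


# Simulate a stalled event thread that only wakes up at t=5.0.
fired = []
q = EventQueue(late_threshold=1.0)
for t in (1.0, 2.0, 3.0, 10.0):
    q.post(t, lambda t=t: fired.append(t))
q.run_expired(now=5.0)
```

After the single late wake-up, the three overdue events have all run in order and only the t=10.0 event is still queued. The lateness check is one subtraction and comparison per event that fires, which is the sort of cost that would need measuring before putting anything like it on the real critical path.
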
Finally, it's also worth considering that some platforms have very vague ideas of what "timely" means when it comes to rxevent - some kernels only schedule the event thread every half second - on a link with a low RTT, you can end up with many timeout events going past before the event thread actually notices!

Cheers,

Simon

_______________________________________________
OpenAFS-devel mailing list
[email protected]
https://lists.openafs.org/mailman/listinfo/openafs-devel
