I'm just getting back to this after a few weeks of distractions...

Andrew Deason <adeason <at> sinenomine.net> writes:
>
> On Mon, 30 Sep 2013 13:19:40 -0500
> Andrew Deason <adeason <at> sinenomine.net> wrote:
>
> > I don't think you should be avoiding this for master; no matter how
> > fast/better the code for processing rxevents is, if the rxevent
> > handler only gets to run every e.g. once every 20 seconds, it still
> > only gets events to run once every 20 seconds. That can always happen
> > in LWP if an LWP is not yielding for any reason, and it can happen
> > with any threading implementation if there is a bug.
>
> Hmm, also, one more thing I don't think has come up. I think people tend
> to wave away these problems when saying these are caused by hardware
> problems, or disks being slow, etc etc. While I tend to agree that it's
> not worthwhile to work around those to fix the local process we're
> running in, the impacts can be greater than that. For example, one of
> the situations that Mark described I believe causes clients to keep
> contacting a "sick" server, and then getting connection timeouts.

Yes, the case you are describing was initially reported as poor
performance for many clients contacting the "sick" fileserver. Only a
few of the affected clients recognized that the server was "down" (due
to rxevent-caused delays and timeouts), and even those clients
immediately marked the server "back up" when they received undelayed
responses to RXAFS_GetTime probes. So they would continue to contact
the "sick" fileserver.

> If such weird rxevent behavior slows down or brings down a whole cell,
> or a whole replicated volume, it's still our responsibility to handle it
> even if it's caused by a hardware fault. If we don't have some way of
> handling it, then such a failing fileserver process can be a single
> point of failure for data availability, even if volume data is
> replicated.

Agreed, that's essentially what happened in the "sick" fileserver case,
and why I realized this can really only be addressed on the server side;
most clients never realize that anything is wrong or that they should
switch to another fileserver.

> I'm not sure how possible/serious such issues are with the behavior
> described in this thread; I'll let Mark either correct me or advocate
> for this line of reasoning if he wants.

Done. Thanks, Andrew.

--
Mark Vitale
[email protected]
