Re: [OpenAFS-devel] Re: idle dead timeout processing in clients

omalleys Fri, 09 Dec 2011 10:28:03 -0800

Quoting [email protected]:

Quoting Russ Allbery <[email protected]>:
Andrew Deason <[email protected]> writes:
And, well, "visible" in a different sense. If it takes 20 minutes for a
read() to return, it's not visible in the sense that the application
needs a code path to deal with it; AFS isn't "down" but arguably just
"slow". If it takes 5 seconds for read() to return, but it returns -1
with ETIMEDOUT, for some environments that's worse / more visible. I've
had someone seem completely baffled when they were told that not
everyone runs AFS with hardmount turned on; that not only is that
behavior optional, but defaults to 'off'.
Yeah, I suppose it depends on the application.  If your two-week compute
job stalls for a half-hour, you might not notice.

We mostly use AFS for serving web pages, and if it takes more than twenty
seconds, you may as well just give up and return an error message, since
you're already past the point of recovery anyway.
The "original" timeout patch/hack was put in place simply so that if you had a web server, it wouldn't lock up to the point where you had to reboot it if AFS was offline for any reason, like maintenance, an AFS server crash, a router crashing, etc. But 20 minutes was fine, since it is faster than running all over campus to reboot wedged servers. For us, usually it was a scheduled maintenance for AFS, AFS crashed, or the router crashed. Typically much bigger issues.
It wasn't really meant to fix issues with people unplugging the network cables, switches, putting up firewalls, screwing around with the routing, etc. Those are theoretically known localized issues.
For that you need something more robust and better thought out than the original patch. Not just extending the original patch.
I was told by the person who wrote it, it probably wasn't in the right spot, and probably needed to be more well thought out/robust. But it solved the major issue, which we were both having.

I also just remembered: there is something about Windows XP's network stack that isn't as robust (NT was worse); it has a tendency to drop packets off the queue (oldest first), whereas Unix does just the opposite. If you make a request from a Windows client to your Unix server, and then the Windows network stack gets overloaded with network requests, or the Unix server is already busy, Windows times out the oldest request. The Unix server then tries to respond to the old request, which Windows disregards as irrelevant because it is already off its queue. This could be causing spikes in the AFS processes. (I can't remember if this just affects TCP or if it affects UDP also.)
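
To make the scenario above concrete, here is a toy model of a client that drops its oldest outstanding request when its table fills up and then ignores the late reply. It is only a sketch of the behavior as described, not a claim about how any real Windows or Unix network stack works; the class name PendingRequests, its capacity parameter, and the request ids are all made up for illustration.

```python
from collections import OrderedDict


class PendingRequests:
    """Toy client-side table of outstanding requests (hypothetical model)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pending = OrderedDict()  # request id -> payload, oldest first

    def send(self, req_id, payload):
        # When the table is full, drop the oldest outstanding request.
        evicted = None
        if len(self.pending) >= self.capacity:
            evicted, _ = self.pending.popitem(last=False)
        self.pending[req_id] = payload
        return evicted

    def receive(self, req_id):
        # A reply is accepted only if its request is still outstanding.
        if req_id in self.pending:
            del self.pending[req_id]
            return True
        return False


client = PendingRequests(capacity=2)
client.send(1, "fetch A")
client.send(2, "fetch B")
evicted = client.send(3, "fetch C")       # table full: request 1 is dropped
late_reply_accepted = client.receive(1)   # server still answers request 1
fresh_reply_accepted = client.receive(3)  # request 3 is still outstanding
```

In this model the server's work on request 1 is wasted: the client has already forgotten the request, so the reply is discarded, and a client that retries would generate the same work again.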





_______________________________________________
OpenAFS-devel mailing list
[email protected]
https://lists.openafs.org/mailman/listinfo/openafs-devel
