Quoting [email protected]:

Quoting Russ Allbery <[email protected]>:

Andrew Deason <[email protected]> writes:

And, well, "visible" in a different sense. If it takes 20 minutes for a
read() to return, it's not visible in the sense that the application
needs a code path to deal with it; AFS isn't "down" but arguably just
"slow". If it takes 5 seconds for read() to return, but it returns -1
with ETIMEDOUT, for some environments that's worse / more visible. I've
had someone seem completely baffled when they were told that not
everyone runs AFS with hardmount turned on; that not only is that
behavior optional, but defaults to 'off'.

Yeah, I suppose it depends on the application.  If your two-week compute
job stalls for a half-hour, you might not notice.

We mostly use AFS for serving web pages, and if it takes more than twenty
seconds, you may as well just give up and return an error message, since
you're already past the point of recovery anyway.

The "original" timeout patch/hack was put in place simply so that a web server wouldn't lock up to the point where you had to reboot it whenever AFS was offline for any reason: maintenance, an AFS server crash, a router crash, etc. A 20-minute timeout was fine for that, since waiting it out is faster than running all over campus to reboot wedged servers. For us the cause was usually scheduled AFS maintenance, an AFS crash, or a router crash. Typically much bigger issues.
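
On the application side, a web server that sees the timeout as an error (ETIMEDOUT, as Andrew describes above) needs its own code path for it. A rough sketch of what that could look like, not anything taken from the patch itself; serve_file and the opener parameter are made-up names for illustration, and the 503 response is just one reasonable choice:

```python
import errno


def serve_file(path, opener=open):
    """Return (status, body) for a file that may live on a filesystem
    that can time out. `opener` is a hypothetical hook so the timeout
    path can be exercised without a real outage."""
    try:
        with opener(path, "rb") as f:
            return 200, f.read()
    except OSError as e:
        if e.errno == errno.ETIMEDOUT:
            # The filesystem gave up instead of blocking; report the
            # outage to the client rather than wedging the worker.
            return 503, b"Storage temporarily unavailable\n"
        # Anything else (missing file, permissions, ...) is not a
        # timeout and is left to the caller.
        raise
```

With hard-mount style behavior the except branch would never run; the open or read would simply block until the file server came back.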

It wasn't really meant to fix issues with people unplugging network cables or switches, putting up firewalls, screwing around with the routing, etc. Those are theoretically known, localized issues.

For that you need something more robust and better thought out than the original patch, not just an extension of it.

I was told by the person who wrote it that it probably wasn't in the right spot, and that it probably needed to be better thought out and more robust. But it solved the major issue, which we were both having.


I also just remembered: Windows XP's network stack isn't as robust (NT was worse) and has a tendency to drop packets off the queue oldest first, whereas Unix does just the opposite. If you make a request from a Windows client to your Unix server, and then the Windows network stack gets overloaded with network requests or the Unix server is already overloaded, Windows times out the oldest request. The Unix server then tries to respond to that old request, which Windows disregards as irrelevant because it is already off its queue. This could be causing spikes in the AFS processes. (I can't remember whether this affects just TCP or UDP as well.)




_______________________________________________
OpenAFS-devel mailing list
[email protected]
https://lists.openafs.org/mailman/listinfo/openafs-devel
