On Fri, 2014-01-17 at 14:12 -0600, Andrew Deason wrote:
> time, so presumably if we contact a downed dbserver, the client will not
> try to contact that dbserver for quite some time.

To elaborate: the cache manager keeps track of every server, and
periodically sends a sort of "ping" to each server to find out which
servers are up. So, it will discover a server is down even if you're not
using it. And, other than the periodic pings, the cache manager will
never direct a request to a server it thinks is down. So, failover for
the CM itself is automatic, persistent, and often completely transparent.

The fileserver works a little differently, but also keeps track of which
server it is using, fails over when that server stops responding, and
generally avoids switching when it doesn't need to.

Ubik database servers all communicate among themselves, which is a
necessary part of the database replication mechanism. That happens even
when one server is down, but in such a way that you'll never notice a
communication failure between dbservers except in an unusual combination
of circumstances which can sometimes happen if a server goes down while
you are making a request that requires writing to the database.

> > I have a single-host test OpenAFS cell with 1.6.5.2, and I
> > have added a second IP address to '/etc/openafs/CellServDB'
> > with an existing DNS entry (just to be sure) but not assigned
> > to any machine: sometimes 'vos vldb' hangs for a while (105
> > seconds), doing 8 attempts to connect to the "down" DB server;
>
> I'm not sure how you are determining that we're making 8 attempts to
> contact the down server. Are you just seeing 8 packets go by? We can
> send many packets for a single attempt to contact the remote site.

Right. Even though AFS communicates over UDP, which itself is
connectionless, Rx does have the notion of connections and includes a
full transport layer including retransmission, sequencing, flow control,
and exponential backoff for congestion control.

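To make the "many packets for a single attempt" point concrete, here is
a toy sketch of a retransmission schedule with exponential backoff. The
function name and every number in it (initial interval, doubling factor,
cap, packet count) are made up for illustration; they are not Rx's
actual timer values, and this is not how Rx is implemented.

```python
def retransmit_schedule(initial_interval, factor, max_interval, packets):
    """Return a list of (send_time, wait_before_next) pairs for ONE request.

    Every parameter is a hypothetical illustration value, not an Rx timer.
    """
    schedule = []
    now = 0.0
    interval = initial_interval
    for _ in range(packets):
        # Send (or resend) the same request, then wait for an ack.
        schedule.append((now, interval))
        now += interval
        # No ack arrived: back off by growing the wait, up to a cap.
        interval = min(interval * factor, max_interval)
    return schedule


if __name__ == "__main__":
    # Assumed values: 1s initial wait, doubling, capped at 30s, 8 packets.
    for n, (sent_at, wait) in enumerate(retransmit_schedule(1.0, 2.0, 30.0, 8), 1):
        print(f"packet {n}: sent at t={sent_at:5.1f}s, waits {wait:4.1f}s for an ack")
```

With those assumed values, a single unanswered request puts 8 packets on
the wire, spaced further and further apart, which on a packet capture
can look like 8 separate attempts.
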
What you are actually seeing is multiple retransmissions of a request,
which may or may not be the first packet in a new connection. The packet
is retransmitted because the server did not reply with an
acknowledgement, and the intervals get longer because of exponential
backoff, which is a key factor in making sure that congested networks
eventually get better rather than only getting worse.

-- Jeff

_______________________________________________
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info