Hi Russ,

Uptime patterns are fairly interesting. We looked at uptime datasets from web servers, desktop PCs at Microsoft, and PlanetLab nodes, and Saroiu, Gummadi, and Gribble looked closely at the lifetimes of Gnutella and Napster participants in their paper titled "Measuring and Analyzing the Characteristics of Napster and Gnutella Hosts." Some quick observations:
On average, past uptime performance is predictive of future uptime. "Random" events that take down a node (e.g., software updates, power outages, someone bumping into the power button) happen to any given node with approximately the same probability over time. This is great because it implies that outages are not truly random events, and it opens up the use of history to predict future lifetimes.

Some classes of nodes are best characterized as bimodal: they are either in a phase where they experience long periods between reboots (presumably production/usage mode) or in a phase where they are rebooted very frequently (presumably experiencing some kind of distress, e.g., hardware problems, someone tweaking software configs, etc.).

Saroiu et al. studied Gnutella and Napster host lifetimes in detail at the IP level. The good news is that P2P participants seem to behave similarly regardless of protocol. Unfortunately, it does not seem possible from their graphs to check whether Petar's observation holds. And even if it does hold, taking advantage of it would require knowing not only that the failure probability drops with age, but precisely how it drops mathematically.

In any case, exponential backoff is a terrible way to ping nodes. It backs off so aggressively that moderately long-lived nodes will not be well served. Short-lived nodes will be fine; they'll get pinged frequently. Really long-lived nodes will be OK; sure, the first bunch of pings is wasted, but after a relatively quick flurry these nodes will hardly be pinged at all, which is fine for a node like my mail server that goes down every few years. It's the nodes in the middle of the lifetime distribution that will not be pinged frequently enough -- their failures will be missed by the failure detector and lead to pesky app-level timeouts.

It turns out that you want to ping nodes in proportion to the square root of their expected lifetimes. Hence sqrt(s) for the name of the detector and the project.
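To make the square-root rule concrete, here is a toy sketch (not the algorithm from the paper -- see the citation below for that). It assumes one reading of the rule: under a fixed total ping budget, each node's ping interval grows with the square root of its estimated expected lifetime, so mid-lifetime nodes are probed far more often than exponential backoff would allow. The function name and the assumption that lifetime estimates are given externally are mine:

```python
import math

def ping_intervals(expected_lifetimes, budget):
    """Split a total ping budget (pings per unit time) across nodes.

    Each node's ping rate is proportional to 1/sqrt(expected lifetime),
    i.e., its ping interval is proportional to sqrt(expected lifetime).
    Returns the ping interval for each node, in the same time units as
    the lifetimes.
    """
    # Weight each node by 1/sqrt(lifetime); normalize so the rates
    # sum to the total budget.
    weights = [1.0 / math.sqrt(t) for t in expected_lifetimes]
    total = sum(weights)
    # rate_i = budget * w_i / total; the interval is its reciprocal.
    return [total / (budget * w) for w in weights]
```

For example, a node expected to live 4x longer than another gets a ping interval only 2x longer, not the exponentially longer gap that backoff would assign.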
The details are in our paper:

Latency- and Bandwidth-Minimizing Optimal Failure Detectors. Kelvin So and Emin Gün Sirer. In Proceedings of the European Conference on Computer Systems, Lisbon, Portugal, March 2007.
http://www.cs.cornell.edu/People/egs/sqrt-s/papers.php

By the way, many other interesting apps in other domains are similar to the (rather ho-hum) failure detection problem. E.g., detecting when web pages/blogs/profiles are modified so they can be indexed, sensing events in a sensor network with limited battery capacity, etc.

	Gün.

On Thu, 2009-03-26 at 15:25 -0700, Russ Weeks wrote:
> 2009/3/26 Emin Gün Sirer <e...@systems.cs.cornell.edu>
>         This discussion touches on a related & elegant problem:
>
>         Suppose a peer is tasked with monitoring the liveness of N other
>         peers by pinging them periodically. What's the optimal strategy
>         it should use for pinging those N peers?
>
> That's a really interesting question... Petar Maymounkov, in his
> description of Kademlia, shows that in the Gnutella network, the
> longer a node has been alive, the more likely it is to stay alive.
> Does this suggest that the liveness "pings" should follow an
> exponential backoff as well? (in addition to the NAT traversal pings,
> I mean) Or is this behaviour likely specific to Gnutella and is
> unlikely to be found in other P2P systems?
>
> Russ
> _______________________________________________
> p2p-hackers mailing list
> p2p-hackers@lists.zooko.com
> http://lists.zooko.com/mailman/listinfo/p2p-hackers