Re: [p2p-hackers] TCP Keepalive timeouts

Emin Gün Sirer Thu, 26 Mar 2009 10:06:08 -0700

This discussion touches on a related & elegant problem:
        
        Suppose a peer is tasked with monitoring the liveness of N other
        peers by pinging them periodically. What's the optimal strategy
        it should use for pinging those N peers?
        
Many of you will realize that this is the problem statement for a "local
failure detector" commonly assumed by theoretical distributed system
papers to be built into every node. To my knowledge, despite the decades
of theoretical work on distributed failure detectors, the right way to
design and implement a local failure detector was not previously
explored, so my group took a look at it.


The answer, along with an open-source Python and Java implementation, is
here:
    http://www.cs.cornell.edu/People/egs/sqrt-s/

In its simplest mode of operation, the code takes as input an amount of
bandwidth to spend on failure detection, and pings the monitored nodes
at just the right frequency such that it will detect failures as early
as possible without exceeding the given bandwidth cap. An optional max
ping delay can be specified to keep the NAT tables updated.
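
To make the bandwidth trade-off concrete, here is a minimal sketch of the
simplest possible baseline: ping every peer at the same rate and spend the
whole budget. It is NOT the scheme from the page above, only a uniform-rate
illustration. The function name and all parameters are hypothetical.

```python
def ping_interval(num_peers, ping_bytes, budget_bytes_per_sec, max_delay=None):
    """Seconds between successive pings to the same peer.

    Uniform-rate baseline (an assumption for illustration): every peer is
    pinged equally often and the whole bandwidth budget is spent on pings.
    """
    if num_peers <= 0 or ping_bytes <= 0 or budget_bytes_per_sec <= 0:
        raise ValueError("all inputs must be positive")

    # One full round of pings costs num_peers * ping_bytes bytes, so a round
    # can start at most once every (round cost / budget) seconds.
    interval = num_peers * ping_bytes / budget_bytes_per_sec

    # An optional ceiling keeps NAT bindings refreshed. Note that when the
    # ceiling is the binding constraint, the bandwidth cap is exceeded.
    if max_delay is not None:
        interval = min(interval, max_delay)
    return interval
```

For example, 100 peers, 64-byte pings and a 640 bytes/s budget give one ping
per peer every 10 seconds.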

Hope this is useful, and if anyone is interested in solving timeless
problems of practical importance in distributed systems, graduate
schools are a good place, especially in the current economy :-),

Gün.

On Thu, 2009-03-26 at 09:42 -0700, David Barrett wrote:
> Agreed with Wesley's comments.  I tend to do:
> 
> - 60s keepalive messages over TCP connections, more to detect connection 
> failure than for any router/NAT reason.
> 
> - Xs keepalive for all UDP peer connections, to keep the NAT binding 
> alive (and to detect peer failures).
> 
> The trick of course is what's a good "X" and the answer is "there's no 
> single answer that works everywhere, or even a single answer that works 
> always for the same router".
> 
> I suggest setting up some set of servers you are certain have good 
> connectivity, and then having clients ping them, with the server 
> delaying each response by an exponentially increasing interval.  Keep 
> doing this until you don't get a response because it comes too late 
> (i.e., after the NAT has given up), at which point you've found the 
> NAT timeout.  Set "X" to the last delay that worked, and use that for 
> all peers.
> 
> However, X can change.  So rather than doing this once and being done 
> with it, use this approach to "walk up" to a keepalive frequency that is 
> "too low", then walk back "down" to a frequency that is "unnecessarily 
> high", and then repeat.  I've found X can change for a given router (or 
> collection of routers), perhaps under load?  So this keeps the system on 
> its toes.
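
A rough sketch of the probing described above, under stated assumptions.
`probe(delay)` is a hypothetical callback: it asks a well-connected server
to reply after `delay` seconds of silence and returns True if the reply
arrives. The starting delay, growth factor, step size and floor are
arbitrary illustrative values, not recommendations.

```python
def find_nat_timeout(probe, start=15.0, factor=2.0, ceiling=3600.0):
    """Return the longest delay for which probe(delay) succeeded, or None.

    The delay grows exponentially until a delayed reply is lost, which
    suggests the NAT binding expired before the reply came back.
    """
    last_good = None
    delay = start
    while delay <= ceiling:
        if not probe(delay):
            break
        last_good = delay
        delay *= factor
    return last_good


def next_keepalive(current, ok, step=5.0, floor=5.0):
    """Adjust the keepalive interval after each probe.

    Walk up (longer interval) while probes succeed, and walk back down
    (shorter interval) after a failure, never going below `floor`.
    """
    if ok:
        return current + step
    return max(floor, current - step)
```

Calling `next_keepalive` after every probe keeps the interval hovering
around the current timeout, so a change in the router's behaviour is picked
up within a few probes.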
> 
> Does this make sense?
> 
> -david
> Follow me at http://twitter.com/quinthar
> 
> Wesley Eddy wrote:
> > On Thu, Mar 26, 2009 at 10:09:20AM +0000, Will Morton wrote:
> >> On 26/03/2009, Richard Price <r.m.pr...@cs.bham.ac.uk> wrote:
> >>>  But my question is, say I'm in a P2P network and I'm connected to
> >>>  multiple peers. Is one keepalive message from a single peer, say every 2
> >>>  minutes, enough to stop my router timing out all my connections? Or do
> >>>  all peers each have to ensure that their individual connection does not
> >>>  time out by regularly sending keepalive messages? If so, how regularly?
> >>>
> >> If you are going through a NAT device, you need to send or receive a
> >> packet to/from each host every N seconds or else the NAT mapping for
> >> that host will be dropped.  These devices don't have very much memory
> >> so they don't wait very long before dumping idle connections.  I
> >> believe most of them use a Least-Recently Used algorithm to keep track
> >> of which details to dump when memory gets tight, so in a sense the
> >> different connections are competing with each other.  Anecdotal
> >> evidence suggests a safe value is about 30 seconds, but I have no hard
> >> data to back that up.
> >>
> >> If you're not going through NAT, you don't need to send a keep-alive
> >> to keep the connection up, but you'd want one anyway to detect dead
> >> hosts, although you could safely use one a lot longer than 30s.  How
> >> much longer would depend on how much memory you're happy to have
> >> sitting around keeping track of dead hosts.  Trade-offs, always. :-)
> >>
> > 
> > 
> > Even without NAT, you may still need this for some stateful firewalls
> > that tend to throw away their state too soon.
> > 
> > Whether or not you need this at all heavily depends upon the exact
> > NAT(s) and firewall implementations you're going through as well as
> > their configuration parameters, and they both vary widely in the wild.
> > Some follow the BEHAVE guidelines, but that number is still far from
> > 100%, so for at least several more years, you still have to code for the
> > worst-behaving cases.
> > 
> > _______________________________________________
> > p2p-hackers mailing list
> > p2p-hackers@lists.zooko.com
> > http://lists.zooko.com/mailman/listinfo/p2p-hackers
> 
> 
