Thanks Colin and Suresh!

On Wed, Mar 13, 2013 at 3:08 PM, Colin McCabe <cmcc...@alumni.cmu.edu> wrote:
> My understanding is that the 10 minute timeout helps to avoid replication
> storms, especially during startup.
>
> You might be interested in HDFS-3703, which adds a "stale" state which
> datanodes are placed into after 30 seconds of missing heartbeats. (This is
> an optional feature controlled by dfs.namenode.check.stale.datanode.)
>
> best,
> Colin
>
>
> On Tue, Mar 12, 2013 at 5:29 PM, André Oriani <aori...@gmail.com> wrote:
>
> > No take on this one?
> >
> > In ZooKeeper the heartbeats happen on every third of the timeout. If I am
> > not mistaken, the recommended timeout is more than 2 minutes to avoid
> > false positives.
> >
> > But I still cannot see the relationship on HDFS between heartbeat interval
> > and timeout. Okay, 10 minutes seems to be a conservative value to avoid
> > false positives in a big cluster. But that means 200 heartbeats.
> > Heartbeats on HDFS are not only used for liveness detection but also to
> > send information about free space and load and to receive commands from
> > the NameNode. So they are also essential for block placement decisions
> > and for ensuring the replication levels. Would that then be the reason
> > why heartbeats are so frequent? Can a lot happen to a DataNode in just
> > three seconds?
> >
> >
> > Thanks,
> > André Oriani
> >
> >
> >
> > On Thu, Mar 7, 2013 at 10:37 PM, André Oriani <aori...@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > Is there any particular reason why the default heartbeat interval is 3
> > > seconds and the timeout is 10 minutes? Everywhere I looked (code,
> > > Google, ...) only mentions the values but gives no clue on why those
> > > values were chosen.
> > >
> > >
> > > Thanks in advance,
> > > André Oriani
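
For reference, the arithmetic discussed in this thread can be sketched in a few lines of Python. This is only an illustration built from the values quoted above (3 second heartbeat interval, 10 minute timeout, 30 second stale threshold from HDFS-3703, and ZooKeeper heartbeating every third of its timeout). The function and constant names are hypothetical, made up for this sketch, and are not Hadoop or ZooKeeper APIs.

```python
# Illustrative only: all names are hypothetical, values are the ones
# quoted in the thread, not read from any Hadoop or ZooKeeper config.

HDFS_HEARTBEAT_INTERVAL_S = 3      # default heartbeat interval (thread)
HDFS_DEAD_TIMEOUT_S = 10 * 60      # "10 minute" timeout (thread)
HDFS_STALE_THRESHOLD_S = 30        # stale state from HDFS-3703 (thread)


def heartbeats_per_timeout(timeout_s: float, interval_s: float) -> float:
    """Number of heartbeats that must be missed before the timeout fires."""
    return timeout_s / interval_s


# HDFS: heartbeats missed before a DataNode is declared dead.
missed_before_dead = heartbeats_per_timeout(
    HDFS_DEAD_TIMEOUT_S, HDFS_HEARTBEAT_INTERVAL_S
)

# HDFS with HDFS-3703 enabled: heartbeats missed before a node is "stale".
missed_before_stale = heartbeats_per_timeout(
    HDFS_STALE_THRESHOLD_S, HDFS_HEARTBEAT_INTERVAL_S
)

# ZooKeeper, as described in the thread: one heartbeat every third of the
# timeout, so the ratio is 3 regardless of the timeout chosen.
zk_timeout_s = 120  # the "2 minutes" mentioned in the thread
missed_before_zk_expiry = heartbeats_per_timeout(zk_timeout_s, zk_timeout_s / 3)

print(missed_before_dead)       # 200.0
print(missed_before_stale)      # 10.0
print(missed_before_zk_expiry)  # 3.0
```

The gap between 200 and 3 is what the question is about: in HDFS the heartbeat interval is tuned for reporting load and receiving commands, while the much longer timeout is tuned for avoiding replication storms.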