Hi Nesvarbu,

It sounds like your problem might be related to the following JIRA:
https://issues.apache.org/jira/browse/HADOOP-5713

Here's the relevant code from FSNamesystem.java:

  long heartbeatInterval = conf.getLong("dfs.heartbeat.interval", 3) * 1000;
  this.heartbeatRecheckInterval = conf.getInt(
      "heartbeat.recheck.interval", 5 * 60 * 1000); // 5 minutes
  this.heartbeatExpireInterval = 2 * heartbeatRecheckInterval +
      10 * heartbeatInterval;

It looks like you specified dfs.heartbeat.recheck.interval instead of
heartbeat.recheck.interval, so your setting is being ignored and the
default recheck interval is still in effect. This inconsistency in the
key names is unfortunate :( (There's a quick worked example below the
quoted message.)

-Todd

On Fri, May 8, 2009 at 2:13 PM, nesvarbu No <nesvarbu...@gmail.com> wrote:
> Hi All,
>
> I've been testing HDFS with a 3-datanode cluster, and I've noticed that
> if I stop 1 datanode I can still read all the files, but the "hadoop dfs
> -copyFromLocal" command fails. In the namenode web interface I can see
> that it still thinks the datanode is alive; it only detects that the
> node is dead after about 10 minutes. After reading the list archives I
> tried modifying the heartbeat intervals with these options:
>
> <property>
>   <name>dfs.heartbeat.interval</name>
>   <value>1</value>
>   <description>Determines datanode heartbeat interval in
>   seconds.</description>
> </property>
>
> <property>
>   <name>dfs.heartbeat.recheck.interval</name>
>   <value>1</value>
>   <description>Determines datanode heartbeat interval in
>   seconds.</description>
> </property>
>
> <property>
>   <name>dfs.namenode.decommission.interval</name>
>   <value>1</value>
>   <description>Determines datanode heartbeat interval in
>   seconds.</description>
> </property>
>
> It still takes about 10 minutes to detect the dead node. Is there a way
> to shorten this interval? (I thought that if I set the replication
> factor to 2 and have 3 nodes, basically one spare, writes wouldn't
> fail, but they still do.)
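
P.S. For anyone following the math: with the defaults in the code above
(a 3-second heartbeat and a 5-minute recheck), the namenode declares a
datanode dead after 2 * 300000 + 10 * 3000 = 630000 ms, i.e. about 10.5
minutes, which matches the ~10 minutes you're seeing. Assuming the code
above is what your build runs, a sketch of a corrected config entry
would use the internal key name (the 15000 below is only an illustrative
value, not a recommendation):

  <property>
    <name>heartbeat.recheck.interval</name>
    <!-- milliseconds, not seconds; 15000 is just an example value -->
    <value>15000</value>
  </property>

Combined with your dfs.heartbeat.interval of 1 second, that would give
an expire interval of 2 * 15000 + 10 * 1000 = 40000 ms, i.e. about 40
seconds instead of 10 minutes.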