When nodes are not reporting heartbeats, can you ssh into them? Can they see the JT machine? What does netstat -a show?
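
As a rough sketch of the "can they see the other machine" check, something like the following could be run on an affected datanode while it is silent. It only tests whether a TCP connection can be opened. The function name `can_connect` is made up for illustration, and the host and port you pass in are assumptions that have to match your own cluster configuration.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    Hypothetical helper for illustration; it does not speak the Hadoop RPC
    protocol, it only checks basic TCP reachability.
    """
    try:
        # create_connection resolves the host name and attempts the connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

For example, `can_connect("namenode.example.com", 8020)` would probe a placeholder host on a placeholder port; substitute the address and RPC port your namenode actually listens on. A `False` result during the silent period would point at the network or at the listening process, while `True` suggests looking at the datanode process itself.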

Cheers,
Joep

________________________________
From: Rahul Das [rahul.h...@gmail.com]
Sent: Tuesday, August 02, 2011 11:21 PM
To: hdfs-user@hadoop.apache.org
Subject: Datanode not sending the Heartbeat messages to Namenode

Hi,

I found a strange behavior in my cluster. The datanodes randomly stop sending any information (no log output appears), so the namenode thinks they are down. After some time (approx. 30 minutes) the datanodes come back up and start behaving properly.

I tried to find an error log, but the datanode does not write any error message during this time. The namenode shows warnings similar to:

2011-07-28 20:59:35,275 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: PendingReplicationMonitor timed out block blk_8370263993564715002_23947922

I checked that this is not caused by a network outage or by some other process eating up the CPU.

Please help me with this.

--
Rahul