[ https://issues.apache.org/jira/browse/HADOOP-3232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Raghu Angadi updated HADOOP-3232:
---------------------------------

      Resolution: Fixed
    Release Note: The DU class runs the 'du' command in a separate thread so that it does not block the user. Otherwise the DataNode might miss heartbeats on large nodes.
          Status: Resolved  (was: Patch Available)

I just committed this. Thanks Johan!

> Datanodes time out
> ------------------
>
>                 Key: HADOOP-3232
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3232
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.16.2
>         Environment: 10 node cluster + 1 namenode
>            Reporter: Johan Oskarsson
>            Assignee: Johan Oskarsson
>            Priority: Critical
>             Fix For: 0.18.0
>
>         Attachments: du-nonblocking-v1.patch, du-nonblocking-v2-trunk.patch,
> du-nonblocking-v4-trunk.patch, du-nonblocking-v5-trunk.patch,
> du-nonblocking-v6-trunk.patch, hadoop-hadoop-datanode-new.log,
> hadoop-hadoop-datanode-new.out, hadoop-hadoop-datanode.out,
> hadoop-hadoop-namenode-master2.out
>
>
> I recently upgraded to 0.16.2 from 0.15.2 on our 10 node cluster.
> Unfortunately we're seeing datanode timeout issues. In previous versions
> we've often seen in the nn webui that the "last contact" value of one or two
> datanodes goes from the usual 0-3 sec to ~200-300 before it drops down to 0 again.
> This causes mild discomfort, but the big problems appear when all nodes do
> this at once, as happened a few times after the upgrade.
> It was suggested that this could be due to namenode garbage collection, but
> looking at the gc log output that doesn't seem to be the case.
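
The release note describes moving the 'du' call off the caller's thread so that a slow disk scan cannot delay heartbeats. A minimal sketch of that idea in Java follows. It is not the committed patch: the class name BackgroundUsage, the duBytes helper, and the refresh interval are all assumptions made for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/** Hypothetical sketch: caches a slow disk-usage measurement and refreshes it in the background. */
final class BackgroundUsage {
    private final AtomicLong usedBytes = new AtomicLong();
    private final LongSupplier measure;   // the slow call, e.g. () -> duBytes("/data")
    private final long intervalMs;        // assumed refresh interval
    private final Thread refresher;
    private volatile boolean running = true;

    BackgroundUsage(LongSupplier measure, long intervalMs) {
        this.measure = measure;
        this.intervalMs = intervalMs;
        // One blocking measurement at startup so getUsed() has a value to return.
        usedBytes.set(measure.getAsLong());
        refresher = new Thread(this::refreshLoop, "usage-refresher");
        refresher.setDaemon(true);
        refresher.start();
    }

    private void refreshLoop() {
        while (running) {
            try {
                Thread.sleep(intervalMs);
                usedBytes.set(measure.getAsLong());
            } catch (InterruptedException e) {
                return; // shutdown requested
            } catch (RuntimeException e) {
                // Measurement failed: keep the last known value and try again later.
            }
        }
    }

    /** Returns the cached value immediately; never waits for 'du'. */
    public long getUsed() {
        return usedBytes.get();
    }

    public void shutdown() {
        running = false;
        refresher.interrupt();
    }

    /** Runs 'du -sk dir' and converts the reported kilobytes to bytes. */
    public static long duBytes(String dir) {
        try {
            Process p = new ProcessBuilder("du", "-sk", dir)
                    .redirectError(ProcessBuilder.Redirect.INHERIT)
                    .start();
            try (BufferedReader r = new BufferedReader(
                    new InputStreamReader(p.getInputStream()))) {
                String line = r.readLine();
                p.waitFor();
                if (line == null) {
                    throw new IllegalStateException("no output from du");
                }
                return Long.parseLong(line.split("\\s+")[0]) * 1024L;
            }
        } catch (IOException e) {
            throw new IllegalStateException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

With something like this, a heartbeat path would call getUsed() and get the last known figure at once, while the scan itself runs on the refresher thread. The trade-off is that the reported usage can be stale by up to one refresh interval plus the duration of the scan.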