Re: Data node decommission doesn't seem to be working correctly
Hey Scott,

Hadoop tends to get confused by nodes with multiple hostnames or multiple IP addresses. Is this your case?

I can't remember precisely what our admin does, but I think he puts the IP address which Hadoop listens on in the exclude-hosts file. Look at the output of hadoop dfsadmin -report to determine precisely which IP address your datanode is listening on.

Brian

On May 17, 2010, at 11:32 PM, Scott White wrote:
> I followed the steps mentioned here:
> http://developer.yahoo.com/hadoop/tutorial/module2.html#decommission
> to decommission a data node. [...]
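A quick way to do that check from the command line (a sketch, not run against a real cluster here; it assumes a 0.20-era cluster and that each datanode section of the report starts with a "Name: <address>:<port>" line):

```shell
# List how each datanode registered with the namenode (hostname vs. IP),
# so you know which form to put in the exclude-hosts file.
# list_datanode_names takes a saved copy of the dfsadmin report.
list_datanode_names() {
    # Each datanode section of the report is assumed to begin with
    # a line of the form "Name: <address>:<port>".
    grep '^Name:' "$1" | awk '{print $2}'
}

# Typical usage against a live cluster (not run here):
#   hadoop dfsadmin -report > /tmp/report.txt
#   list_datanode_names /tmp/report.txt
```

If a node appears once under a hostname and once under an IP, that mismatch is exactly the kind of thing that makes the exclude file silently fail to match.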
Re: Data node decommission doesn't seem to be working correctly
Hi Scott,

You might be hitting two different issues:

1) Decommission not finishing. https://issues.apache.org/jira/browse/HDFS-694 explains decommission never finishing due to open files in 0.20.

2) Nodes showing up in both the live and dead node lists. I remember Suresh taking a look at this. It was something about the same node being registered with its hostname and its IP separately (when a datanode is rejumped and started fresh?). Cc-ing Suresh.

Koji

On 5/17/10 9:32 PM, Scott White <scottbl...@gmail.com> wrote:
> I followed the steps mentioned here:
> http://developer.yahoo.com/hadoop/tutorial/module2.html#decommission
> to decommission a data node. [...]
Re: Data node decommission doesn't seem to be working correctly
dfsadmin -report reports the hostname for that machine, not the IP. That machine happens to be the master node, which is why I am trying to decommission the data node there: I only want data nodes running on the slave nodes. dfsadmin -report reports the IPs for all the slave nodes.

One question: I believe the namenode was accidentally restarted during the twelve hours or so I was waiting for the decommission to complete. Would this put things into a bad state? I did try running dfsadmin -refreshNodes after it was restarted.

Scott

On Tue, May 18, 2010 at 5:44 AM, Brian Bockelman <bbock...@cse.unl.edu> wrote:
> Hey Scott,
> Hadoop tends to get confused by nodes with multiple hostnames or multiple
> IP addresses. Is this your case? [...]
Re: Data node decommission doesn't seem to be working correctly
Hey Scott,

If the node shows up in both the dead nodes and the live nodes as you say, the namenode is definitely not even attempting to decommission it. If HDFS had been attempting decommissioning when you restarted the namenode, the node would only show up in the dead nodes list.

Another option is to just turn off HDFS on that node alone, and not physically delete the data from the node until HDFS completely recovers. This is not recommended for production usage, as it creates a period where the cluster is in danger of losing files. However, it can be used as a one-off to get over this speed bump.

Brian

On May 18, 2010, at 12:02 PM, Scott White wrote:
> Dfsadmin -report reports the hostname for that machine and not the ip.
> That machine happens to be the master node which is why I am trying to
> decommission the data node there since I only want the data node running
> on the slave nodes. [...]
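The "turn it off and wait" alternative Brian describes can be sketched roughly as follows (hypothetical commands for a 0.20-era tarball install; the script location and working directory depend on your setup):

```shell
# On the node being removed: stop only the datanode daemon, leaving the
# on-disk block data untouched in case recovery from it is ever needed.
bin/hadoop-daemon.sh stop datanode

# From any node: watch re-replication progress. Do not wipe the old
# node's data directories until the fsck summary reports zero
# under-replicated blocks and a healthy filesystem.
bin/hadoop fsck / | tail -n 20
```

The waiting step is what closes the window Brian warns about: until re-replication finishes, some blocks may have fewer live replicas than configured.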
Data node decommission doesn't seem to be working correctly
I followed the steps mentioned here:
http://developer.yahoo.com/hadoop/tutorial/module2.html#decommission
to decommission a data node.

What I see from the namenode is that the hostname of the machine I decommissioned shows up in the list of dead nodes but also in the list of live nodes, where its admin status is marked as 'In Service'. It's been twelve hours and there is no sign in the namenode logs that the node has been decommissioned.

Any suggestions on what might be the problem and what to try to ensure that this node gets safely taken down?

Thanks in advance,
Scott
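For reference, the tutorial's decommission procedure boils down to roughly the following (a sketch; the exclude-file path and node name are placeholders — use whatever dfs.hosts.exclude points to in your hdfs-site.xml, and the name exactly as the node registered with the namenode):

```shell
# 1. Ensure hdfs-site.xml names an exclude file, e.g.:
#      <property>
#        <name>dfs.hosts.exclude</name>
#        <value>/path/to/excludes</value>   <!-- hypothetical path -->
#      </property>
#
# 2. Add the node to that file, in the same form (hostname or IP)
#    that it shows under "Name:" in dfsadmin -report:
echo "master.example.com" >> /path/to/excludes   # hypothetical name/path

# 3. Tell the namenode to re-read its include/exclude files:
hadoop dfsadmin -refreshNodes

# 4. Wait for the node's status to move through "Decommission in
#    progress" to "Decommissioned" before shutting the daemon down:
hadoop dfsadmin -report
```

If the entry in the exclude file does not match the registered name, step 3 succeeds silently but nothing is decommissioned — which matches the "In Service" status you are seeing.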