Re: Data node decommission doesn't seem to be working correctly

2010-05-18 Thread Brian Bockelman
Hey Scott,

Hadoop tends to get confused by nodes with multiple hostnames or multiple IP 
addresses.  Is this your case?

I can't remember precisely what our admin does, but I think he puts the IP 
address that Hadoop listens on into the exclude-hosts file.

Look in the output of 

hadoop dfsadmin -report

to determine precisely which IP address your datanode is listening on.
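
In case it's useful, here is a minimal sketch of that setup; the file path and the 
IP below are placeholders, not taken from your cluster.  In hdfs-site.xml 
(hadoop-site.xml on older setups):

  <property>
    <name>dfs.hosts.exclude</name>
    <value>/etc/hadoop/conf/dfs.exclude</value>
  </property>

The exclude file then lists one entry per line, e.g.

  10.0.0.42

and the namenode is asked to re-read it with

hadoop dfsadmin -refreshNodes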

Brian

On May 17, 2010, at 11:32 PM, Scott White wrote:

 I followed the steps mentioned here:
 http://developer.yahoo.com/hadoop/tutorial/module2.html#decommission to
 decommission a data node. What I see from the namenode is the hostname of
 the machine that I decommissioned shows up in both the list of dead nodes
 but also live nodes where its admin status is marked as 'In Service'. It's
 been twelve hours and there is no sign in the namenode logs that the node
 has been decommissioned. Any suggestions of what might be the problem and
 what to try to ensure that this node gets safely taken down?
 
 thanks in advance,
 Scott





Re: Data node decommission doesn't seem to be working correctly

2010-05-18 Thread Koji Noguchi
Hi Scott, 

You might be hitting two different issues.

1) Decommission not finishing.
   https://issues.apache.org/jira/browse/HDFS-694 explains decommission
never finishing due to open files in 0.20 (a quick check for open files is
sketched below).

2) Nodes showing up in both the live and dead node lists.
   I remember Suresh taking a look at this.
   It was something about the same node being registered with its hostname and IP
separately (when the datanode is rejumped and started fresh (?)).
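
For 1), assuming your build of fsck supports the -openforwrite option, something
like

hadoop fsck / -openforwrite

will list the files that are still open for write; on 0.20, blocks of those files
sitting on the decommissioning node are typically what keeps the decommission from
finishing.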

Cc-ing Suresh.

Koji

On 5/17/10 9:32 PM, Scott White scottbl...@gmail.com wrote:

 I followed the steps mentioned here:
 http://developer.yahoo.com/hadoop/tutorial/module2.html#decommission to
 decommission a data node. What I see from the namenode is the hostname of
 the machine that I decommissioned shows up in both the list of dead nodes
 but also live nodes where its admin status is marked as 'In Service'. It's
 been twelve hours and there is no sign in the namenode logs that the node
 has been decommissioned. Any suggestions of what might be the problem and
 what to try to ensure that this node gets safely taken down?
 
 thanks in advance,
 Scott



Re: Data node decommission doesn't seem to be working correctly

2010-05-18 Thread Scott White
dfsadmin -report reports the hostname for that machine and not the IP. That
machine happens to be the master node, which is why I am trying to
decommission the data node there, since I only want the data node running on
the slave nodes. dfsadmin -report reports all the IPs for the slave nodes.

One question: I believe that the namenode was accidentally restarted during
the 12 hours or so I was waiting for the decommission to complete. Would
this put things into a bad state? I did try running dfsadmin -refreshNodes
after it was restarted.

Scott


On Tue, May 18, 2010 at 5:44 AM, Brian Bockelman bbock...@cse.unl.edu wrote:

 Hey Scott,

 Hadoop tends to get confused by nodes with multiple hostnames or multiple
 IP addresses.  Is this your case?

 I can't remember precisely what our admin does, but I think he puts in the
 IP address which Hadoop listens on in the exclude-hosts file.

 Look in the output of

 hadoop dfsadmin -report

 to determine precisely which IP address your datanode is listening on.

 Brian

 On May 17, 2010, at 11:32 PM, Scott White wrote:

  I followed the steps mentioned here:
  http://developer.yahoo.com/hadoop/tutorial/module2.html#decommission to
  decommission a data node. What I see from the namenode is the hostname of
  the machine that I decommissioned shows up in both the list of dead nodes
  but also live nodes where its admin status is marked as 'In Service'.
 It's
  been twelve hours and there is no sign in the namenode logs that the node
  has been decommissioned. Any suggestions of what might be the problem and
  what to try to ensure that this node gets safely taken down?
 
  thanks in advance,
  Scott




Re: Data node decommission doesn't seem to be working correctly

2010-05-18 Thread Brian Bockelman
Hey Scott,

If the node shows up in both the dead nodes and the live nodes as you say, HDFS is 
definitely not even attempting to decommission it.  If HDFS were attempting 
decommissioning and you restarted the namenode, the node would show up only in the 
dead nodes list.
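
One way to double-check is the per-node entry in the report, which should include a
Decommission Status line (the hostname below is just a placeholder):

hadoop dfsadmin -report | grep -A 5 'master.example.com'

If that node still says Normal rather than Decommission in progress, the exclude
file probably isn't being picked up at all.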

Another option is to just turn off HDFS on that node alone, and not physically 
delete the data from the node until HDFS completely recovers.  This is not 
recommended for production use, as it creates a period during which the cluster is 
in danger of losing files.  However, it can be used as a one-off to get over this 
speed bump.
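
A rough sketch of that route (the daemon script location varies by install, so
treat these as placeholders):

# on the node being retired
hadoop-daemon.sh stop datanode

# from anywhere, repeat until the count drops back to 0
hadoop fsck / | grep -i 'under-replicated'

Once fsck reports no under-replicated or missing blocks, it should be safe to wipe
the data directories on the retired node.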

Brian

On May 18, 2010, at 12:02 PM, Scott White wrote:

 Dfsadmin -report reports the hostname for that machine and not the ip. That
 machine happens to be the master node which is why I am trying to
 decommission the data node there since I only want the data node running on
 the slave nodes. Dfs admin -report reports all the ips for the slave nodes.
 
 One question: I believe that the namenode was accidentally restarted during
 the 12 hours or so I was waiting for the decommission to complete. Would
 this put things into a bad state? I did try running dfsadmin -refreshNodes
 after it was restarted.
 
 Scott
 
 
 On Tue, May 18, 2010 at 5:44 AM, Brian Bockelman bbock...@cse.unl.edu wrote:
 
 Hey Scott,
 
 Hadoop tends to get confused by nodes with multiple hostnames or multiple
 IP addresses.  Is this your case?
 
 I can't remember precisely what our admin does, but I think he puts in the
 IP address which Hadoop listens on in the exclude-hosts file.
 
 Look in the output of
 
 hadoop dfsadmin -report
 
 to determine precisely which IP address your datanode is listening on.
 
 Brian
 
 On May 17, 2010, at 11:32 PM, Scott White wrote:
 
 I followed the steps mentioned here:
 http://developer.yahoo.com/hadoop/tutorial/module2.html#decommission to
 decommission a data node. What I see from the namenode is the hostname of
 the machine that I decommissioned shows up in both the list of dead nodes
 but also live nodes where its admin status is marked as 'In Service'.
 It's
 been twelve hours and there is no sign in the namenode logs that the node
 has been decommissioned. Any suggestions of what might be the problem and
 what to try to ensure that this node gets safely taken down?
 
 thanks in advance,
 Scott
 
 





Data node decommission doesn't seem to be working correctly

2010-05-17 Thread Scott White
I followed the steps mentioned here:
http://developer.yahoo.com/hadoop/tutorial/module2.html#decommission to
decommission a data node. What I see from the namenode is that the hostname of
the machine I decommissioned shows up in both the list of dead nodes and the
list of live nodes, where its admin status is marked as 'In Service'. It's
been twelve hours and there is no sign in the namenode logs that the node
has been decommissioned. Any suggestions as to what might be the problem and
what to try to ensure that this node gets safely taken down?

thanks in advance,
Scott