Adding nodes
Is this the right procedure to add nodes? I took some of it from the Hadoop wiki FAQ: http://wiki.apache.org/hadoop/FAQ

1. Update conf/slaves
2. On the new slave nodes, start the datanode and tasktracker
3. Run hadoop balancer

Do I also need to run dfsadmin -refreshNodes?
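For reference, on a 0.20-era tarball install those three steps might look like the following; the hostname and the HADOOP_HOME layout are assumptions, not part of the original post:

    # on the master: add the new host to the slaves file
    echo "newnode.example.com" >> $HADOOP_HOME/conf/slaves

    # on the new node: start the worker daemons
    $HADOOP_HOME/bin/hadoop-daemon.sh start datanode
    $HADOOP_HOME/bin/hadoop-daemon.sh start tasktracker

    # from any node: spread existing blocks onto the new datanode
    $HADOOP_HOME/bin/hadoop balancer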
Re: Adding nodes
You only have to refresh nodes if you're making use of an allow file (dfs.hosts).
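If such an allow file is in use, the refresh itself is short; the file path below is illustrative only and must match whatever dfs.hosts points at:

    # add the new host to the allow file the namenode reads, then re-read it
    echo "newnode.example.com" >> /etc/hadoop/conf/dfs.hosts.allow
    hadoop dfsadmin -refreshNodes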
Re: Adding nodes
On Thu, Mar 1, 2012 at 4:46 PM, Joey Echeverria wrote:
> You only have to refresh nodes if you're making use of an allow file.

Thanks. Does that mean that when the tasktracker/datanode starts up, it communicates with the namenode using the masters file?
Re: Adding nodes
Not quite. Datanodes get the namenode host from fs.default.name in core-site.xml. Tasktrackers find the jobtracker from the mapred.job.tracker setting in mapred-site.xml.
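Concretely, the two settings might look like this; the hostnames are placeholders, and the ports (8020 and 8021) are common conventions rather than required values:

    <!-- core-site.xml on every node -->
    <property>
      <name>fs.default.name</name>
      <value>hdfs://namenode.example.com:8020</value>
    </property>

    <!-- mapred-site.xml on every node -->
    <property>
      <name>mapred.job.tracker</name>
      <value>jobtracker.example.com:8021</value>
    </property>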
Re: Adding nodes
The masters and slaves files, if I remember correctly, are used to start the correct daemons on the correct nodes from the master node.

Raj
Re: Adding nodes
On Thu, Mar 1, 2012 at 4:57 PM, Joey Echeverria wrote:
> Not quite. Datanodes get the namenode host from fs.default.name in core-site.xml.

I actually meant to ask how the namenode/jobtracker knows there is a new node in the cluster. Is it initiated by the namenode when the slaves file is edited? Or is it initiated by the tasktracker when the tasktracker is started?
Re: Adding nodes
What Joey said is correct for Cloudera's distribution. I am not confident about other distributions, as I haven't tried them.

Thanks,
Anil
Re: Adding nodes
What Joey said is correct for both the Apache and Cloudera distros. The DN/TT daemons will connect to the NN/JT using the config files. The masters and slaves files are only used for starting the correct daemons on the correct nodes.

Raj
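Put another way, the slaves file only matters to the cluster start/stop scripts, which ssh to each listed host. A sketch of that relationship, with paths assumed from a tarball install:

    # conf/slaves is read only by scripts like these, run on the master:
    $HADOOP_HOME/bin/start-dfs.sh      # ssh to each host in conf/slaves and start a datanode
    $HADOOP_HOME/bin/start-mapred.sh   # ssh to each host in conf/slaves and start a tasktracker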
Re: Adding nodes
It is initiated by the slave. If you have defined files to state which slaves can talk to the namenode (using the config dfs.hosts) and which hosts cannot (using the property dfs.hosts.exclude), then you would need to edit these files and issue the refresh command.

Arpit
Hortonworks, Inc.
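A sketch of that include/exclude setup on the namenode; the file paths here are arbitrary examples:

    <!-- hdfs-site.xml on the namenode -->
    <property>
      <name>dfs.hosts</name>
      <value>/etc/hadoop/conf/hosts.include</value>
    </property>
    <property>
      <name>dfs.hosts.exclude</name>
      <value>/etc/hadoop/conf/hosts.exclude</value>
    </property>

After editing either file, apply the change with:

    hadoop dfsadmin -refreshNodes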
Re: Adding nodes
Thanks all for the answers!!
Re: Adding nodes
Mohit,

New datanodes will connect to the namenode, so that's how the namenode knows. Just make sure the datanodes have the correct fs.default.name in their core-site.xml and then start them. The namenode can, however, choose to reject a datanode if you are using the dfs.hosts and dfs.hosts.exclude settings in the namenode's hdfs-site.xml.

The namenode doesn't actually care about the slaves file. It's only used by the start/stop scripts.

On 2012/03/02 10:35, Mohit Anchlia wrote:
> I actually meant to ask how does namenode/jobtracker know there is a new node in the cluster.
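One way to confirm that a new datanode has registered is to ask the namenode directly; this is a standard 0.20-era command, offered here as a usage hint rather than part of the original reply:

    # lists every datanode known to the namenode, with capacity and state
    hadoop dfsadmin -report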
Re: Dynamically adding nodes in Hadoop
Thanks for all the input. I am trying to do the cluster setup in EC2 but am not able to find how I can do DNS updates centrally. If anyone knows how to do this, please help.
Dynamically adding nodes in Hadoop
Hi,

I am trying to add nodes dynamically to a running Hadoop cluster. I started the tasktracker and datanode on the new node, and it works fine. But when some node tries to fetch values (for the reduce phase), it fails with an unknown host exception. When I add a node to a running cluster, do I have to add its hostname to the /etc/hosts file of all nodes (slaves + master)? Or is there some other way?
Re: Dynamically adding nodes in Hadoop
Madhu,

On Sat, Dec 17, 2011 at 4:36 PM, madhu phatak wrote:
> When I add a node to a running cluster, do I have to add its hostname to the /etc/hosts file of all nodes (slaves + master)?

Yes.

> Or is there some other way?

You can run a DNS server and have the resolution centrally managed.

Harsh J
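If you stay with /etc/hosts, the update has to land on every node. A hypothetical sketch of pushing one new entry from the master; the IP, the hostname, and passwordless ssh with sudo rights are all assumptions:

    # append the new node's entry to /etc/hosts on the master and every slave
    NEW_ENTRY="10.0.1.42 newnode.example.com"
    echo "$NEW_ENTRY" | sudo tee -a /etc/hosts
    for host in $(cat $HADOOP_HOME/conf/slaves); do
      ssh "$host" "echo '$NEW_ENTRY' | sudo tee -a /etc/hosts"
    done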
Re: Dynamically adding nodes in Hadoop
Hi,

Add it to the slaves file too. Using /etc/hosts is also recommended, to avoid DNS issues. After it is added to slaves, the new node has to be started and should quickly appear in the web UI. If you don't need the nodes all the time, you can set up an exclude file and refresh your cluster (http://wiki.apache.org/hadoop/FAQ#I_want_to_make_a_large_cluster_smaller_by_taking_out_a_bunch_of_nodes_simultaneously._How_can_this_be_done.3F).

- Alex
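The exclude-and-refresh step Alex mentions might look like this; the exclude path must match whatever dfs.hosts.exclude points at in hdfs-site.xml, and the hostname is a placeholder:

    # mark the node for decommissioning, then tell the namenode to re-read the file
    echo "oldnode.example.com" >> /etc/hadoop/conf/hosts.exclude
    hadoop dfsadmin -refreshNodes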
Re: Dynamically adding nodes in Hadoop
Actually, I would recommend avoiding /etc/hosts and using DNS if this is going to be a production-grade cluster...

Mike Segel
After adding nodes to 0.20.2 cluster, getting Could not complete file errors and hung JobTracker
Hi all,

We are currently in the process of replacing the servers in our Hadoop 0.20.2 production cluster, and in the last couple of days we have experienced an error similar to the following (from the JobTracker log) several times, which then appears to hang the JobTracker:

    2010-10-15 04:13:38,980 INFO org.apache.hadoop.mapred.JobInProgress: Job job_201010140844_0510 has completed successfully.
    2010-10-15 04:13:44,192 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete file /user/kaduindexer-18509/us/201010150300/dealdocid_pre_merged_1/_logs/history/phx-phadoop34_1287060250080_job_201010140844_0510_se_DocID_Merge_1_201010150300 retrying...

(the same "Could not complete file ... retrying..." line then repeats roughly every 400 ms)

We hadn't seen an issue like this until we added 6 new nodes to our existing 65-node cluster. The only other configuration change made recently was to set up include/exclude files for DFS and MapReduce to enable Hadoop's node-decommissioning functionality. Once we encounter this issue (which has happened twice in the last 24 hours), we end up needing to restart the MapReduce processes, which we cannot do on a frequent basis. After the last occurrence, I increased the value of mapred.job.tracker.handler.count to 60 and am waiting to see if it has an impact.

Has anyone else seen this behavior before? Are there any recommendations for trying to prevent this from happening in the future?

Thanks in advance,
-Bobby
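For reference, the handler-count change Bobby describes is a jobtracker-side setting; this snippet simply restates the value given in the message, and the JobTracker must be restarted for it to take effect:

    <!-- mapred-site.xml on the jobtracker -->
    <property>
      <name>mapred.job.tracker.handler.count</name>
      <value>60</value>
    </property>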