I've got to be doing something stupid cause I can't find any mention of others having this problem. Here's what's happening. I had a cluster of nine nodes running (1 namenode and 8 datanodes) the 0.15.3 release. I've been running the mapred samples and reformatting filesystems, just getting comfortable with the software. When I upgraded to 0.16.0 release I reformatted (mke2fs) all of my data partitions (including the namenode data). I ran a "hadoop namenode -format" which ran fine. Then I brought them back up, the only slave to connect to the master was the master itself acting as a datanode. The dfs daemon was started on the slave nodes but it just doesn't seem to connect to the master.

I know that the slaves are doing *something* with the master because if i start them before the namenode is running then I get lots of log messages about attempting to reconnect. Below is my site config and logs from the namenode and a "zombie" datanode.

**** hadoop-site.xml (same across all nodes) ****
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
 <name>hadoop.tmp.dir</name>
 <value>/mnt/sda1/hadoop-datastore-0.15.3/hadoop-${user.name}</value>
 <description>...</description>
</property>
<property>
 <name>fs.default.name</name>
 <value>hdfs://head00:54310</value>
 <description>...</description>
</property>
<property>
 <name>mapred.job.tracker</name>
 <value>head00:54311</value>
 <description>...</description>
</property>
<property>
 <name>dfs.replication</name>
 <value>2</value>
 <description>...</description>
</property>
</configuration>


**** namenode log ****
2008-03-10 19:32:53,186 INFO org.apache.hadoop.dfs.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = head00/192.168.16.48
STARTUP_MSG:   args = []
************************************************************/
2008-03-10 19:32:54,260 INFO org.apache.hadoop.dfs.NameNode: Namenode up at: head00/192.168.16.48:54310 2008-03-10 19:32:54,267 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=NameNode, sessionId=null 2008-03-10 19:32:54,628 INFO org.apache.hadoop.dfs.StateChange: STATE* Network topology has 0 racks and 0 datanodes 2008-03-10 19:32:54,629 INFO org.apache.hadoop.dfs.StateChange: STATE* UnderReplicatedBlocks has 0 blocks 2008-03-10 19:32:55,421 INFO org.mortbay.util.Credential: Checking Resource aliases 2008-03-10 19:32:56,048 INFO org.mortbay.http.HttpServer: Version Jetty/5.1.4 2008-03-10 19:32:56,051 INFO org.mortbay.util.Container: Started HttpContext[/static,/static] 2008-03-10 19:32:56,052 INFO org.mortbay.util.Container: Started HttpContext[/logs,/logs] 2008-03-10 19:32:57,493 INFO org.mortbay.util.Container: Started [EMAIL PROTECTED] 2008-03-10 19:32:57,826 INFO org.mortbay.util.Container: Started WebApplicationContext[/,/] 2008-03-10 19:32:58,112 INFO org.mortbay.http.SocketListener: Started SocketListener on 0.0.0.0:50070 2008-03-10 19:32:58,112 INFO org.mortbay.util.Container: Started [EMAIL PROTECTED] 2008-03-10 19:32:58,112 INFO org.apache.hadoop.fs.FSNamesystem: Web-server up at: 50070 2008-03-10 19:32:58,116 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 54310: starting 2008-03-10 19:32:58,139 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 54310: starting 2008-03-10 19:32:58,140 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 54310: starting
...
2008-03-10 19:32:58,626 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on 54310: starting 2008-03-10 19:32:58,626 INFO org.apache.hadoop.ipc.Server: IPC Server handler 4 on 54310: starting 2008-03-10 19:33:02,684 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.registerDatanode: node registration from 192.168.16.48:50010 storage DS-437400207-192.168.16.48-50010-1205199182672 ***** this is the namenode connecting to itself as a data node ***** 2008-03-10 19:33:02,693 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/192.168.16.48:50010 2008-03-10 19:38:01,248 INFO org.apache.hadoop.fs.FSNamesystem: Roll Edit Log from 192.168.16.48 2008-03-10 19:38:01,249 INFO org.apache.hadoop.fs.FSNamesystem: Number of transactions: 0 Total time for transactions(ms): 0 Number of syncs: 0 SyncTimes(ms): 0 2008-03-10 19:43:02,227 INFO org.apache.hadoop.fs.FSNamesystem: Roll Edit Log from 192.168.16.48 2008-03-10 19:48:02,374 INFO org.apache.hadoop.fs.FSNamesystem: Roll Edit Log from 192.168.16.48

***** datanode log *****
2008-03-10 20:30:31,392 INFO org.apache.hadoop.dfs.DataNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting DataNode
STARTUP_MSG:   host = node05/192.168.16.55
STARTUP_MSG:   args = []
************************************************************/
2008-03-10 20:30:31,786 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=DataNode, sessionId=null 2008-03-10 20:30:32,000 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: head00/192.168.16.48:54310. Already tried 1 time(s). 2008-03-10 20:30:33,006 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: head00/192.168.16.48:54310. Already tried 2 time(s).
...
2008-03-10 20:31:30,244 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: head00/192.168.16.48:54310. Already tried 4 time(s). 2008-03-10 20:31:31,250 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: head00/192.168.16.48:54310. Already tried 5 time(s). 2008-03-10 20:31:36,130 INFO org.apache.hadoop.dfs.Storage: Storage directory /mnt/sda1/hadoop-datastore-0.15.3/hadoop-hadoop/dfs/data is not formatted.
2008-03-10 20:31:36,131 INFO org.apache.hadoop.dfs.Storage: Formatting ...
2008-03-10 20:31:36,196 INFO org.apache.hadoop.dfs.DataNode: Opened server at 50010 2008-03-10 20:31:36,419 INFO org.mortbay.util.Credential: Checking Resource aliases 2008-03-10 20:31:36,563 INFO org.mortbay.http.HttpServer: Version Jetty/5.1.4 2008-03-10 20:31:36,566 INFO org.mortbay.util.Container: Started HttpContext[/static,/static] 2008-03-10 20:31:36,566 INFO org.mortbay.util.Container: Started HttpContext[/logs,/logs] 2008-03-10 20:31:37,553 INFO org.mortbay.util.Container: Started [EMAIL PROTECTED] 2008-03-10 20:31:37,714 INFO org.mortbay.util.Container: Started WebApplicationContext[/,/] 2008-03-10 20:31:37,730 INFO org.mortbay.http.SocketListener: Started SocketListener on 0.0.0.0:50075 2008-03-10 20:31:37,730 INFO org.mortbay.util.Container: Started [EMAIL PROTECTED]

I can leave the cluster running for hours and this slave will never "register" itself with the namenode. I've been messing with this problem for three days now and I'm out of ideas. Any suggestions?

Regards,
Tim Nelson

Reply via email to