I've got to be doing something stupid cause I can't find any mention of
others having this problem. Here's what's happening. I had a cluster of
nine nodes running (1 namenode and 8 datanodes) the 0.15.3 release. I've
been running the mapred samples and reformatting filesystems, just
getting comfortable with the software. When I upgraded to 0.16.0 release
I reformatted (mke2fs) all of my data partitions (including the namenode
data). I ran a "hadoop namenode -format" which ran fine. Then I brought
them back up, the only slave to connect to the master was the master
itself acting as a datanode. The dfs daemon was started on the slave
nodes but it just doesn't seem to connect to the master.
I know that the slaves are doing *something* with the master because if
i start them before the namenode is running then I get lots of log
messages about attempting to reconnect. Below is my site config and logs
from the namenode and a "zombie" datanode.
**** hadoop-site.xml (same across all nodes) ****
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/mnt/sda1/hadoop-datastore-0.15.3/hadoop-${user.name}</value>
<description>...</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://head00:54310</value>
<description>...</description>
</property>
<property>
<name>mapred.job.tracker</name>
<value>head00:54311</value>
<description>...</description>
</property>
<property>
<name>dfs.replication</name>
<value>2</value>
<description>...</description>
</property>
</configuration>
**** namenode log ****
2008-03-10 19:32:53,186 INFO org.apache.hadoop.dfs.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = head00/192.168.16.48
STARTUP_MSG: args = []
************************************************************/
2008-03-10 19:32:54,260 INFO org.apache.hadoop.dfs.NameNode: Namenode up
at: head00/192.168.16.48:54310
2008-03-10 19:32:54,267 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
Initializing JVM Metrics with processName=NameNode, sessionId=null
2008-03-10 19:32:54,628 INFO org.apache.hadoop.dfs.StateChange: STATE*
Network topology has 0 racks and 0 datanodes
2008-03-10 19:32:54,629 INFO org.apache.hadoop.dfs.StateChange: STATE*
UnderReplicatedBlocks has 0 blocks
2008-03-10 19:32:55,421 INFO org.mortbay.util.Credential: Checking
Resource aliases
2008-03-10 19:32:56,048 INFO org.mortbay.http.HttpServer: Version
Jetty/5.1.4
2008-03-10 19:32:56,051 INFO org.mortbay.util.Container: Started
HttpContext[/static,/static]
2008-03-10 19:32:56,052 INFO org.mortbay.util.Container: Started
HttpContext[/logs,/logs]
2008-03-10 19:32:57,493 INFO org.mortbay.util.Container: Started
[EMAIL PROTECTED]
2008-03-10 19:32:57,826 INFO org.mortbay.util.Container: Started
WebApplicationContext[/,/]
2008-03-10 19:32:58,112 INFO org.mortbay.http.SocketListener: Started
SocketListener on 0.0.0.0:50070
2008-03-10 19:32:58,112 INFO org.mortbay.util.Container: Started
[EMAIL PROTECTED]
2008-03-10 19:32:58,112 INFO org.apache.hadoop.fs.FSNamesystem:
Web-server up at: 50070
2008-03-10 19:32:58,116 INFO org.apache.hadoop.ipc.Server: IPC Server
listener on 54310: starting
2008-03-10 19:32:58,139 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 0 on 54310: starting
2008-03-10 19:32:58,140 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 1 on 54310: starting
...
2008-03-10 19:32:58,626 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 9 on 54310: starting
2008-03-10 19:32:58,626 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 4 on 54310: starting
2008-03-10 19:33:02,684 INFO org.apache.hadoop.dfs.StateChange: BLOCK*
NameSystem.registerDatanode: node registration from 192.168.16.48:50010
storage DS-437400207-192.168.16.48-50010-1205199182672 ***** this is
the namenode connecting to itself as a data node *****
2008-03-10 19:33:02,693 INFO org.apache.hadoop.net.NetworkTopology:
Adding a new node: /default-rack/192.168.16.48:50010
2008-03-10 19:38:01,248 INFO org.apache.hadoop.fs.FSNamesystem: Roll
Edit Log from 192.168.16.48
2008-03-10 19:38:01,249 INFO org.apache.hadoop.fs.FSNamesystem: Number
of transactions: 0 Total time for transactions(ms): 0 Number of syncs: 0
SyncTimes(ms): 0
2008-03-10 19:43:02,227 INFO org.apache.hadoop.fs.FSNamesystem: Roll
Edit Log from 192.168.16.48
2008-03-10 19:48:02,374 INFO org.apache.hadoop.fs.FSNamesystem: Roll
Edit Log from 192.168.16.48
***** datanode log *****
2008-03-10 20:30:31,392 INFO org.apache.hadoop.dfs.DataNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting DataNode
STARTUP_MSG: host = node05/192.168.16.55
STARTUP_MSG: args = []
************************************************************/
2008-03-10 20:30:31,786 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
Initializing JVM Metrics with processName=DataNode, sessionId=null
2008-03-10 20:30:32,000 INFO org.apache.hadoop.ipc.Client: Retrying
connect to server: head00/192.168.16.48:54310. Already tried 1 time(s).
2008-03-10 20:30:33,006 INFO org.apache.hadoop.ipc.Client: Retrying
connect to server: head00/192.168.16.48:54310. Already tried 2 time(s).
...
2008-03-10 20:31:30,244 INFO org.apache.hadoop.ipc.Client: Retrying
connect to server: head00/192.168.16.48:54310. Already tried 4 time(s).
2008-03-10 20:31:31,250 INFO org.apache.hadoop.ipc.Client: Retrying
connect to server: head00/192.168.16.48:54310. Already tried 5 time(s).
2008-03-10 20:31:36,130 INFO org.apache.hadoop.dfs.Storage: Storage
directory /mnt/sda1/hadoop-datastore-0.15.3/hadoop-hadoop/dfs/data is
not formatted.
2008-03-10 20:31:36,131 INFO org.apache.hadoop.dfs.Storage: Formatting ...
2008-03-10 20:31:36,196 INFO org.apache.hadoop.dfs.DataNode: Opened
server at 50010
2008-03-10 20:31:36,419 INFO org.mortbay.util.Credential: Checking
Resource aliases
2008-03-10 20:31:36,563 INFO org.mortbay.http.HttpServer: Version
Jetty/5.1.4
2008-03-10 20:31:36,566 INFO org.mortbay.util.Container: Started
HttpContext[/static,/static]
2008-03-10 20:31:36,566 INFO org.mortbay.util.Container: Started
HttpContext[/logs,/logs]
2008-03-10 20:31:37,553 INFO org.mortbay.util.Container: Started
[EMAIL PROTECTED]
2008-03-10 20:31:37,714 INFO org.mortbay.util.Container: Started
WebApplicationContext[/,/]
2008-03-10 20:31:37,730 INFO org.mortbay.http.SocketListener: Started
SocketListener on 0.0.0.0:50075
2008-03-10 20:31:37,730 INFO org.mortbay.util.Container: Started
[EMAIL PROTECTED]
I can leave the cluster running for hours and this slave will never
"register" itself with the namenode. I've been messing with this problem
for three days now and I'm out of ideas. Any suggestions?
Regards,
Tim Nelson