I had a running Hadoop cluster (version 2.2.0.2.0.6.0-76 from Hortonworks).
Yesterday a lot of things happened, and at some point we decided to
reboot all datanodes one by one. Unfortunately the operator did not monitor
the namenode health monitor.

The result of the above operation is that all datanodes show as dead nodes,
all blocks are lost, ...

We decided to reboot one datanode once again to see whether it would log
anything interesting. The log ended with these messages:

INFO  ipc.Server (Server.java:run(861)) - IPC Server Responder: starting
INFO  ipc.Server (Server.java:run(688)) - IPC Server listener on 8010: starting

and hangs there. At the same time, on the namenode I can see only two types
of messages:

INFO  hdfs.StateChange (FSNamesystem.java:completeFile(2805)) - DIR*
completeFile: [SOME PATH] is closed by
DFSClient_NONMAPREDUCE_288661168_33

and a lot of:

WARN  blockmanagement.BlockManager
(PendingReplicationBlocks.java:pendingReplicationCheck(249)) -
PendingReplicationMonitor timed out blk_1074405820_668233
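
For anyone who wants to check the same thing on their own logs, here is a minimal sketch of how such a tally could be produced. It assumes each log entry sits on a single line (unlike the wrapped excerpts in this mail) and contains the log level followed by the logger name and an opening parenthesis, as in the lines quoted above. The function name tally_log_lines and the sample list are hypothetical and only for illustration.

```python
import re
from collections import Counter

# Finds "LEVEL  logger.Name (" anywhere in the line, so a leading
# timestamp (not shown in the excerpts above) is simply skipped.
LINE_RE = re.compile(r"\b(DEBUG|INFO|WARN|ERROR|FATAL)\s+(\S+)\s+\(")

def tally_log_lines(lines):
    """Count log lines per (level, logger) pair; unparsable lines are ignored."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts

# Sample input built from the two message types quoted above (unwrapped).
sample = [
    "INFO  hdfs.StateChange (FSNamesystem.java:completeFile(2805)) - DIR* "
    "completeFile: [SOME PATH] is closed by DFSClient_NONMAPREDUCE_288661168_33",
    "WARN  blockmanagement.BlockManager "
    "(PendingReplicationBlocks.java:pendingReplicationCheck(249)) - "
    "PendingReplicationMonitor timed out blk_1074405820_668233",
    "WARN  blockmanagement.BlockManager "
    "(PendingReplicationBlocks.java:pendingReplicationCheck(249)) - "
    "PendingReplicationMonitor timed out blk_1074405820_668233",
]

for (level, logger), count in tally_log_lines(sample).most_common():
    print(level, logger, count)
```

To run it against a real file, pass the open file object instead of the sample list, since iterating over a file yields one line at a time.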

Today we decided to restart the namenode and all datanodes. After the
restart, the web page http://[server]:50070/dfshealth.jsp responds VERY
slowly. I don't see any errors in the logs except 5 like the one below:

 ERROR datanode.DataNode (DataXceiver.java:run(225)) -
maelhd21:50010:DataXceiver error processing WRITE_BLOCK operation
src: /node1:33470 dest: /node3:50010

org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block
BP-1037132819-192.168.61.196-1409328081083:blk_1075994366_2257020 already
exists in state FINALIZED and thus cannot be created.

3 out of 5 nodes show as live, but refreshing the Hadoop status page takes
more than 10 minutes.

The question of course is: what should I check or do now?


P.S. I asked the same question on StackOverflow:
http://stackoverflow.com/questions/31020877/datanodes-are-cannot-connect-to-namenode
