Hi Enis,
Could you please check what the storage id of node ..15.233 was before and
after you restarted it? The storage id is the string that starts with DS and is
followed by digits.
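If it helps, here is a rough Python sketch for pulling the storage ids out of a namenode log. It is only an illustration: the regular expression is modelled on the registerDatanode lines quoted below, and the function name storage_ids is made up for this example. It tolerates the line wrapping that mail clients introduce.

```python
import re

# Assumption: registration messages look like the ones in the quoted log, i.e.
# "node registration from <ip>:<port> storage <id starting with DS>".
REGISTRATION = re.compile(
    r"node\s+registration\s+from\s+(\S+)\s+storage\s+(DS\S*)"
)


def storage_ids(log_text, node_ip):
    """Return the storage ids registered for node_ip, in log order."""
    ids = []
    for address, storage_id in REGISTRATION.findall(log_text):
        # address is "<ip>:<port>"; compare only the ip part.
        if address.split(":")[0] == node_ip:
            ids.append(storage_id)
    return ids


# Sample taken from the quoted namenode log.
sample = """2007-11-11 19:31:29,913 INFO org.apache.hadoop.dfs.StateChange: BLOCK*
NameSystem.registerDatanode: node registration from 192.168.15.236:50010
storage DS1183829041"""

print(storage_ids(sample, "192.168.15.236"))  # ['DS1183829041']
```

Running it over the full namenode log with "192.168.15.233" should list every id that node registered with. Two different values would mean the id changed across the restart.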
There was an old bug: if a data-node was configured with multiple directories,
it took the storage id value from the last directory in the list. So if you
added a directory to the data-node configuration by appending its path to the
list, the storage was considered empty, a new storage id was assigned, and all
blocks were then removed. I fixed it at some point. The new behavior is that
the data-node checks that all directories have consistent values, and those
that are new are automatically formatted with the existing storage id.
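To make the difference concrete, here is a toy Python model of the two behaviors. This is not the actual Hadoop code, just a sketch under simple assumptions: each configured directory is represented by the storage id it holds, with None for a new, unformatted directory. The function names and the id "DS123" are hypothetical.

```python
def old_storage_id(dir_ids):
    """Old, buggy behavior: the last directory in the list decides.

    Returns None when the last directory is new, which stands for
    "storage considered empty, a new storage id gets assigned".
    """
    return dir_ids[-1]


def new_storage_id(dir_ids):
    """Fixed behavior: all formatted directories must agree on one id.

    New directories (None) are ignored here; in the real system they would
    be formatted with the existing id.
    """
    existing = {sid for sid in dir_ids if sid is not None}
    if len(existing) > 1:
        raise ValueError("inconsistent storage ids: %s" % sorted(existing))
    if not existing:
        return None  # nothing formatted yet, so a fresh id is needed
    return existing.pop()


# One existing directory plus a new disk appended to the end of the list.
dirs = ["DS123", None]
print(old_storage_id(dirs))  # None
print(new_storage_id(dirs))  # DS123
```

With the old logic the appended empty directory wins and the node comes back with a new identity. With the fix the existing id is kept.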
So if you see that the storage id for data-node ..15.233 changed after the
restart, then it is that old problem.
Thanks,
--Konstantin
Enis Soztutar wrote:
Hi,
After a serious power failure on our cluster running 0.13.0, we have
been able to restore our previous state. But we have realized that a
nontrivial number of blocks are missing. It seems that the namenode
requested that all the blocks kept on one specific machine be deleted,
which resulted in the deletion of all the replicas. To clarify, for
some reason all the blocks on that machine, as well as all the other
replicas of those blocks, were deleted by the namenode. Does anyone know
what might have happened? Is this a bug that we should seriously
consider fixing, or has it already been fixed?
The datanode that caused the data loss was 192.168.15.233. It was first
started as a slave, then removed to add a new hard disk, and then added
back to the cluster.
Below are the relevant logs:
Namenode:
2007-11-11 19:15:11,564 INFO org.apache.hadoop.net.NetworkTopology:
Adding a new node: /default-rack/192.168.15.233:50010
2007-11-11 19:15:12,094 INFO org.apache.hadoop.dfs.StateChange: BLOCK*
NameSystem.registerDatanode: node registration from 192.168.15.231:50010
storage DS1698199061
2007-11-11 19:26:49,654 INFO org.apache.hadoop.dfs.StateChange: STATE*
SafeModeInfo.leave: Safe mode is OFF.
...
2007-11-11 19:26:49,654 INFO org.apache.hadoop.dfs.StateChange: STATE*
Network topology has 1 racks and 36 datanodes
2007-11-11 19:26:49,654 INFO org.apache.hadoop.dfs.StateChange: STATE*
UnderReplicatedBlocks has 56 blocks
...
2007-11-11 19:30:05,782 INFO org.apache.hadoop.fs.FSNamesystem: Roll
Edit Log
2007-11-11 19:30:40,469 INFO org.apache.hadoop.fs.FSNamesystem: Roll
FSImage
2007-11-11 19:31:29,913 INFO org.apache.hadoop.dfs.StateChange: BLOCK*
NameSystem.registerDatanode: node registration from 192.168.15.236:50010
storage DS1183829041
...
2007-11-11 19:45:03,483 INFO org.apache.hadoop.dfs.StateChange: BLOCK*
NameSystem.heartbeatCheck: lost heartbeat from 192.168.15.233:50010
2007-11-11 19:45:03,734 INFO org.apache.hadoop.net.NetworkTopology:
Removing a node: /default-rack/192.168.15.233:50010
...
2007-11-11 19:45:46,123 INFO org.apache.hadoop.net.NetworkTopology:
Removing a node: /default-rack/192.168.15.233:50010
2007-11-11 19:45:46,123 INFO org.apache.hadoop.net.NetworkTopology:
Adding a new node: /default-rack/192.168.15.233:50010
...
Example logs for one of the missing blocks, blk_8859727972037265136:
on 192.168.15.203
2007-11-11 19:53:53,755 INFO org.apache.hadoop.dfs.DataNode: Deleting
block blk_8859727972037265136 file
/data/hadoop/dfs/data/current/subdir63/subdir63/subdir63/subdir63/subdir49/blk_8859727972037265136
on 192.168.15.225
2007-11-11 20:18:07,964 INFO org.apache.hadoop.dfs.DataNode: Deleting
block blk_8859727972037265136 file
/data2/hadoop/dfs/data/current/subdir11/subdir63/blk_8859727972037265136
on 192.168.15.233
2007-11-11 19:54:56,078 INFO org.apache.hadoop.dfs.DataNode: Deleting
block blk_8859727972037265136 file
/data/hadoop/dfs/data/current/subdir36/subdir47/blk_8859727972037265136
and the complete log for 192.168.15.233 is:
...
2007-11-11 20:03:37,789 INFO org.apache.hadoop.dfs.DataNode:
Deleting block blk_3987170016844853189 file
/data/hadoop/dfs/data/current/subdir38/blk_3987170016844853189
2007-11-11 20:03:37,807 INFO org.apache.hadoop.dfs.DataNode: Deleting
block blk_4414509271638104493 file
/data/hadoop/dfs/data/current/subdir56/subdir40/blk_4414509271638104493
2007-11-11 20:03:37,807 INFO org.apache.hadoop.dfs.DataNode: Deleting
block blk_4651660909902273726 file
/data/hadoop/dfs/data/current/subdir32/subdir3/blk_4651660909902273726
2007-11-11 20:03:37,808 INFO org.apache.hadoop.dfs.DataNode: Deleting
block blk_5189049009734931732 file
/data/hadoop/dfs/data/current/subdir56/subdir42/blk_5189049009734931732
2007-11-11 20:03:37,808 INFO org.apache.hadoop.dfs.DataNode: Deleting
block blk_5395031642694782019 file
/data/hadoop/dfs/data/current/subdir41/subdir31/blk_5395031642694782019
2007-11-11 20:03:37,808 INFO org.apache.hadoop.dfs.DataNode: Deleting
block blk_5567722351418795177 file
/data/hadoop/dfs/data/current/subdir56/subdir42/blk_5567722351418795177
2007-11-11 20:03:37,808 INFO org.apache.hadoop.dfs.DataNode: Deleting
block blk_5592463115430469494 file
/data/hadoop/dfs/data/current/subdir10/subdir48/blk_5592463115430469494
... (for all blocks in the datanode)
2007-11-11 20:03:42,941 WARN org.apache.hadoop.dfs.DataNode: Unexpected
error trying to delete block blk_-9219752334498294080. Block not found
in blockMap.
2007-11-11 20:03:42,941 WARN org.apache.hadoop.dfs.DataNode: Unexpected
error trying to delete block blk_-9217018193785551154. Block not found
in blockMap.
2007-11-11 20:03:42,942 WARN org.apache.hadoop.dfs.DataNode: Unexpected
error trying to delete block blk_-9211664991594450527. Block not found
in blockMap.
2007-11-11 20:03:42,942 WARN org.apache.hadoop.dfs.DataNode: Unexpected
error trying to delete block blk_-9211471391608631351. Block not found
in blockMap.
2007-11-11 20:03:42,942 WARN org.apache.hadoop.dfs.DataNode: Unexpected
error trying to delete block blk_-9208445774532268187. Block not found
in blockMap.
2007-11-11 20:03:42,942 WARN org.apache.hadoop.dfs.DataNode: Unexpected
error trying to delete block blk_-9202539319669633125. Block not found
in blockMap.
...
Thanks in advance.
Enis Soztutar