Thank you Konstantin,
We did indeed add a new, empty disk to the machine and updated the
configuration accordingly, which suggests that we may be hitting the bug
you described. I have checked the current storage id of the datanode, but
I could not obtain the previous one. Do you remember the Jira issue
number?
However, I still do not understand how the namenode came to order the
deletion of those blocks on the other nodes. Is it the case that the
datanode identified all of its blocks as invalid, so that the namenode
deleted all the replicas of those blocks?
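
One way to recover the previous storage id might be to scan the namenode log for registration messages. The following is a minimal sketch, assuming each log entry sits on a single line in the "node registration from <host:port> storage <id>" form quoted further down in this thread. The function names are hypothetical and other Hadoop versions may word the message differently.

```python
import re
from collections import defaultdict

# Pattern for the registration message as quoted in this thread.
# Assumption: one log entry per line (the e-mail wraps them, the real log does not).
REGISTRATION = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) .*"
    r"node registration from (?P<node>\d+\.\d+\.\d+\.\d+:\d+) "
    r"storage (?P<sid>DS\S+)"
)


def storage_id_history(lines):
    """Return {node: [(timestamp, storage_id), ...]} in log order."""
    history = defaultdict(list)
    for line in lines:
        match = REGISTRATION.search(line)
        if match:
            history[match.group("node")].append(
                (match.group("ts"), match.group("sid"))
            )
    return dict(history)


def changed_nodes(history):
    """Keep only nodes that registered with more than one distinct storage id."""
    return {
        node: entries
        for node, entries in history.items()
        if len({sid for _, sid in entries}) > 1
    }
```

Calling `changed_nodes(storage_id_history(open(path)))` on a namenode log (the path is whatever your installation uses) would list any node whose storage id changed between registrations.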
Konstantin Shvachko wrote:
Hi Enis,
Could you please verify what the storage id for the node ..15.233 was
before and after you restarted it? The storage id is the string that
starts with DS.... followed by digits.
There was an old bug: if you configured a data-node with multiple
directories, it would take the storage id value from the last directory
in the list. So if you added a directory to the data-node configuration
by appending the path to the list, the storage would be considered
empty, a new storage id would be assigned, and all blocks would then be
removed. I fixed it at some point. The new behavior is that the data-node
checks that all dirs have consistent values, and those that are new are
automatically formatted with the existing storage id.
So if you see that the storage id for data-node ..15.233 changed after
the restart, then it's the old problem.
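
The difference between the old and the fixed behavior described above can be illustrated with a toy model. This is only a sketch of the logic as stated in this message, not the actual Hadoop code. The function names, the `(path, storage_id)` representation and the example id values are assumptions made for illustration.

```python
def old_storage_id(dirs, new_id):
    """Old behaviour as described: the last directory in the list decides.

    `dirs` is the configured list of (path, storage_id) pairs, where
    storage_id is None for a freshly added, unformatted directory.
    `new_id` is the id that would be assigned to an empty storage.
    """
    last_id = dirs[-1][1]
    return last_id if last_id is not None else new_id


def fixed_storage_id(dirs, new_id):
    """Fixed behaviour as described: existing ids must agree, and new
    directories adopt the existing id instead of triggering a new one."""
    existing = {sid for _, sid in dirs if sid is not None}
    if len(existing) > 1:
        raise ValueError(f"inconsistent storage ids: {sorted(existing)}")
    return existing.pop() if existing else new_id


# Hypothetical configuration: an existing directory plus an appended empty one.
dirs = [("/data/hadoop/dfs/data", "DS-old"), ("/data2/hadoop/dfs/data", None)]
assert old_storage_id(dirs, "DS-new") == "DS-new"    # node looks like a new, empty storage
assert fixed_storage_id(dirs, "DS-new") == "DS-old"  # node keeps its identity
```

In the old model the appended empty directory is last in the list, so the node comes back under a new id, which matches the symptom of a changed storage id after the restart.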
Thanks,
--Konstantin
Enis Soztutar wrote:
Hi,
After a serious power failure on our cluster running 0.13.0, we have
been able to restore our previous state. But we have realized that a
nontrivial number of blocks are missing. It seems that the namenode
requested that all the blocks kept on one specific machine be deleted,
which resulted in the deletion of all their replicas. To clarify, for
some reason all the blocks on that machine, as well as all the other
replicas of those blocks, were deleted by the namenode. Does anyone
know what might have happened? Is this a bug that we should seriously
consider fixing, or has it already been fixed?
The datanode that caused the data loss was 192.168.15.233. It was first
started as a slave, then removed so that a new hard disk could be added,
and then added back to the cluster.
Below are the relevant logs:
Namenode:
2007-11-11 19:15:11,564 INFO org.apache.hadoop.net.NetworkTopology:
Adding a new node: /default-rack/192.168.15.233:50010
2007-11-11 19:15:12,094 INFO org.apache.hadoop.dfs.StateChange:
BLOCK* NameSystem.registerDatanode: node registration from
192.168.15.231:50010 storage DS1698199061
2007-11-11 19:26:49,654 INFO org.apache.hadoop.dfs.StateChange:
STATE* SafeModeInfo.leave: Safe mode is OFF.
...
2007-11-11 19:26:49,654 INFO org.apache.hadoop.dfs.StateChange:
STATE* Network topology has 1 racks and 36 datanodes
2007-11-11 19:26:49,654 INFO org.apache.hadoop.dfs.StateChange:
STATE* UnderReplicatedBlocks has 56 blocks
...
2007-11-11 19:30:05,782 INFO org.apache.hadoop.fs.FSNamesystem: Roll
Edit Log
2007-11-11 19:30:40,469 INFO org.apache.hadoop.fs.FSNamesystem: Roll
FSImage
2007-11-11 19:31:29,913 INFO org.apache.hadoop.dfs.StateChange:
BLOCK* NameSystem.registerDatanode: node registration from
192.168.15.236:50010 storage DS1183829041
...
2007-11-11 19:45:03,483 INFO org.apache.hadoop.dfs.StateChange:
BLOCK* NameSystem.heartbeatCheck: lost heartbeat from
192.168.15.233:50010
2007-11-11 19:45:03,734 INFO org.apache.hadoop.net.NetworkTopology:
Removing a node: /default-rack/192.168.15.233:50010
...
2007-11-11 19:45:46,123 INFO org.apache.hadoop.net.NetworkTopology:
Removing a node: /default-rack/192.168.15.233:50010
2007-11-11 19:45:46,123 INFO org.apache.hadoop.net.NetworkTopology:
Adding a new node: /default-rack/192.168.15.233:50010
...
And example logs for one of the missing blocks, blk_8859727972037265136:
on 192.168.15.203
2007-11-11 19:53:53,755 INFO org.apache.hadoop.dfs.DataNode: Deleting
block blk_8859727972037265136 file
/data/hadoop/dfs/data/current/subdir63/subdir63/subdir63/subdir63/subdir49/blk_8859727972037265136
on 192.168.15.225
2007-11-11 20:18:07,964 INFO org.apache.hadoop.dfs.DataNode: Deleting
block blk_8859727972037265136 file
/data2/hadoop/dfs/data/current/subdir11/subdir63/blk_8859727972037265136
on 192.168.15.233
2007-11-11 19:54:56,078 INFO org.apache.hadoop.dfs.DataNode: Deleting
block blk_8859727972037265136 file
/data/hadoop/dfs/data/current/subdir36/subdir47/blk_8859727972037265136
And the complete log for 192.168.15.233 is:
...
2007-11-11 20:03:37,789 INFO org.apache.hadoop.dfs.DataNode:
Deleting block blk_3987170016844853189 file
/data/hadoop/dfs/data/current/subdir38/blk_3987170016844853189
2007-11-11 20:03:37,807 INFO org.apache.hadoop.dfs.DataNode: Deleting
block blk_4414509271638104493 file
/data/hadoop/dfs/data/current/subdir56/subdir40/blk_4414509271638104493
2007-11-11 20:03:37,807 INFO org.apache.hadoop.dfs.DataNode: Deleting
block blk_4651660909902273726 file
/data/hadoop/dfs/data/current/subdir32/subdir3/blk_4651660909902273726
2007-11-11 20:03:37,808 INFO org.apache.hadoop.dfs.DataNode: Deleting
block blk_5189049009734931732 file
/data/hadoop/dfs/data/current/subdir56/subdir42/blk_5189049009734931732
2007-11-11 20:03:37,808 INFO org.apache.hadoop.dfs.DataNode: Deleting
block blk_5395031642694782019 file
/data/hadoop/dfs/data/current/subdir41/subdir31/blk_5395031642694782019
2007-11-11 20:03:37,808 INFO org.apache.hadoop.dfs.DataNode: Deleting
block blk_5567722351418795177 file
/data/hadoop/dfs/data/current/subdir56/subdir42/blk_5567722351418795177
2007-11-11 20:03:37,808 INFO org.apache.hadoop.dfs.DataNode: Deleting
block blk_5592463115430469494 file
/data/hadoop/dfs/data/current/subdir10/subdir48/blk_5592463115430469494
... (for all blocks in the datanode)
2007-11-11 20:03:42,941 WARN org.apache.hadoop.dfs.DataNode:
Unexpected error trying to delete block blk_-9219752334498294080.
Block not found in blockMap.
2007-11-11 20:03:42,941 WARN org.apache.hadoop.dfs.DataNode:
Unexpected error trying to delete block blk_-9217018193785551154.
Block not found in blockMap.
2007-11-11 20:03:42,942 WARN org.apache.hadoop.dfs.DataNode:
Unexpected error trying to delete block blk_-9211664991594450527.
Block not found in blockMap.
2007-11-11 20:03:42,942 WARN org.apache.hadoop.dfs.DataNode:
Unexpected error trying to delete block blk_-9211471391608631351.
Block not found in blockMap.
2007-11-11 20:03:42,942 WARN org.apache.hadoop.dfs.DataNode:
Unexpected error trying to delete block blk_-9208445774532268187.
Block not found in blockMap.
2007-11-11 20:03:42,942 WARN org.apache.hadoop.dfs.DataNode:
Unexpected error trying to delete block blk_-9202539319669633125.
Block not found in blockMap.
...
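
For anyone comparing datanode logs like the ones above, a small script can summarize which blocks were deleted and which delete requests hit the "Block not found in blockMap" warning. This is a minimal sketch, assuming each log entry is on one line in the format shown above. The function name `summarize` is hypothetical.

```python
import re

# Message formats copied from the datanode log lines quoted above.
# Assumption: one log entry per line (the e-mail wraps them, the real log does not).
DELETING = re.compile(r"Deleting block (blk_-?\d+) file (\S+)")
NOT_FOUND = re.compile(
    r"Unexpected error trying to delete block (blk_-?\d+)\.\s+"
    r"Block not found in blockMap"
)


def summarize(lines):
    """Return ({block: file_path} for deleted blocks, [blocks not found])."""
    deleted = {}
    not_found = []
    for line in lines:
        match = DELETING.search(line)
        if match:
            deleted[match.group(1)] = match.group(2)
            continue
        match = NOT_FOUND.search(line)
        if match:
            not_found.append(match.group(1))
    return deleted, not_found
```

Running this over the log of each datanode and intersecting the deleted block ids would show which blocks lost every replica.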
Thanks in advance.
Enis Soztutar