Thank you, Konstantin,

We did add a new, empty disk to the machine and updated the configuration accordingly, so we may well be hitting the bug you described. I have checked the current storage id of the datanode, but I could not recover the previous one. Do you remember the Jira issue number?

However, I still do not understand how the namenode came to order the deletion of those blocks on the other nodes. Is it the case that the datanode identified all of its blocks as invalid, so that the namenode deleted all the replicas of those blocks?



Konstantin Shvachko wrote:
Hi Enis,

Could you please verify what the storage id for the node ..15.233 was
before and after you restarted it? The storage id is the string that starts with DS
followed by digits.
There was an old bug: if you configured a data-node with multiple directories, it
would take the storage id value from the last directory in the list.
So if you added a directory to the data-node configuration by appending its path to the list, the storage would be considered empty, a new storage id would be assigned, and all blocks would then be removed. I fixed it at some point. The new behavior is that the data-node checks that all dirs have consistent values, and those that are new are automatically formatted with the
existing storage id.
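
As a rough illustration of the difference between the two behaviors described above, here is a minimal sketch in Python. It is not the actual data-node code: the function names, the dict layout, and the id generator are all made up for this example, and only the "DS followed by digits" shape of the id comes from this thread.

```python
import random


def new_storage_id():
    # Hypothetical generator: the thread only says an id is "DS" plus digits.
    return "DS" + str(random.randint(0, 2**31 - 1))


def resolve_old(dirs):
    """Old behavior as described: the last directory in the list decides."""
    last = dirs[-1]
    if last["storage_id"] is None:
        # The last dir is new, so the whole node is treated as empty storage
        # and gets a fresh id, even though earlier dirs still hold blocks.
        return new_storage_id(), True
    return last["storage_id"], False


def resolve_fixed(dirs):
    """Fixed behavior as described: existing ids must agree, new dirs inherit."""
    existing = {d["storage_id"] for d in dirs if d["storage_id"] is not None}
    if len(existing) > 1:
        raise ValueError("inconsistent storage ids: %s" % sorted(existing))
    treated_as_empty = not existing
    sid = existing.pop() if existing else new_storage_id()
    for d in dirs:
        if d["storage_id"] is None:
            # "Format" the new directory with the id the node already has.
            d["storage_id"] = sid
    return sid, treated_as_empty
```

Each function returns the storage id the node would register with and a flag saying whether the node is treated as empty. With one old directory followed by one newly appended directory, `resolve_old` hands out a fresh id and flags the node as empty, while `resolve_fixed` keeps the existing id.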
So if you see that the storage id for data-node ..15.233 changed after the restart, then
it's the old problem.
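
One way to check this, assuming the name-node logs from before the restart are still available, is to collect the storage id from every registration line and look for nodes that registered with more than one id. The sketch below is only an illustration: the regular expression is modeled on the `NameSystem.registerDatanode: node registration from ... storage ...` lines quoted in this thread, the function names are hypothetical, and it assumes one log entry per line.

```python
import re

# Modeled on the registration lines quoted in this thread; assumes one log
# entry per line and a leading "YYYY-MM-DD HH:MM:SS,mmm" timestamp.
REGISTRATION = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) .*?"
    r"registerDatanode: node registration from (\S+) storage (DS\S*)"
)


def storage_ids_by_node(lines):
    """Map each node address to its (timestamp, storage id) registrations, in log order."""
    seen = {}
    for line in lines:
        match = REGISTRATION.search(line)
        if match:
            timestamp, node, storage_id = match.groups()
            seen.setdefault(node, []).append((timestamp, storage_id))
    return seen


def changed_nodes(seen):
    """Keep only the nodes that registered with more than one distinct storage id."""
    return {
        node: regs
        for node, regs in seen.items()
        if len({storage_id for _, storage_id in regs}) > 1
    }
```

Passing the lines of the name-node log to `storage_ids_by_node` and the result to `changed_nodes` would list ..15.233 if its id really changed across the restart.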

Thanks,

--Konstantin


Enis Soztutar wrote:
Hi,

After a serious power failure on our cluster running 0.13.0, we have been able to restore our previous state. But we have realized that a nontrivial number of blocks are missing. It seems that the namenode requested that all the blocks kept on one specific machine be deleted, which resulted in the deletion of all their replicas. To clarify, for some reason all the blocks on that machine, as well as all the other replicas of those blocks, were deleted by the namenode. Does anyone know what might have happened? Is this a bug that we should seriously consider fixing, or might it already have been fixed?

The datanode that caused the data loss was 192.168.15.233. It was first started as a slave, then removed so that a new hard disk could be added, and then added back to the cluster.

Below are the relevant logs:

Namenode:

2007-11-11 19:15:11,564 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/192.168.15.233:50010
2007-11-11 19:15:12,094 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.registerDatanode: node registration from 192.168.15.231:50010 storage DS1698199061
...
2007-11-11 19:26:49,654 INFO org.apache.hadoop.dfs.StateChange: STATE* SafeModeInfo.leave: Safe mode is OFF.
2007-11-11 19:26:49,654 INFO org.apache.hadoop.dfs.StateChange: STATE* Network topology has 1 racks and 36 datanodes
2007-11-11 19:26:49,654 INFO org.apache.hadoop.dfs.StateChange: STATE* UnderReplicatedBlocks has 56 blocks
...
2007-11-11 19:30:05,782 INFO org.apache.hadoop.fs.FSNamesystem: Roll Edit Log
2007-11-11 19:30:40,469 INFO org.apache.hadoop.fs.FSNamesystem: Roll FSImage
2007-11-11 19:31:29,913 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.registerDatanode: node registration from 192.168.15.236:50010 storage DS1183829041
...
2007-11-11 19:45:03,483 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.heartbeatCheck: lost heartbeat from 192.168.15.233:50010
2007-11-11 19:45:03,734 INFO org.apache.hadoop.net.NetworkTopology: Removing a node: /default-rack/192.168.15.233:50010
...
2007-11-11 19:45:46,123 INFO org.apache.hadoop.net.NetworkTopology: Removing a node: /default-rack/192.168.15.233:50010
2007-11-11 19:45:46,123 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/192.168.15.233:50010
...


Example logs for one of the missing blocks, blk_8859727972037265136:

on 192.168.15.203
2007-11-11 19:53:53,755 INFO org.apache.hadoop.dfs.DataNode: Deleting block blk_8859727972037265136 file /data/hadoop/dfs/data/current/subdir63/subdir63/subdir63/subdir63/subdir49/blk_8859727972037265136

on 192.168.15.225
2007-11-11 20:18:07,964 INFO org.apache.hadoop.dfs.DataNode: Deleting block blk_8859727972037265136 file /data2/hadoop/dfs/data/current/subdir11/subdir63/blk_8859727972037265136

on 192.168.15.233
2007-11-11 19:54:56,078 INFO org.apache.hadoop.dfs.DataNode: Deleting block blk_8859727972037265136 file /data/hadoop/dfs/data/current/subdir36/subdir47/blk_8859727972037265136

and the complete log for 192.168.15.233 is:

...
2007-11-11 20:03:37,789 INFO org.apache.hadoop.dfs.DataNode: Deleting block blk_3987170016844853189 file /data/hadoop/dfs/data/current/subdir38/blk_3987170016844853189
2007-11-11 20:03:37,807 INFO org.apache.hadoop.dfs.DataNode: Deleting block blk_4414509271638104493 file /data/hadoop/dfs/data/current/subdir56/subdir40/blk_4414509271638104493
2007-11-11 20:03:37,807 INFO org.apache.hadoop.dfs.DataNode: Deleting block blk_4651660909902273726 file /data/hadoop/dfs/data/current/subdir32/subdir3/blk_4651660909902273726
2007-11-11 20:03:37,808 INFO org.apache.hadoop.dfs.DataNode: Deleting block blk_5189049009734931732 file /data/hadoop/dfs/data/current/subdir56/subdir42/blk_5189049009734931732
2007-11-11 20:03:37,808 INFO org.apache.hadoop.dfs.DataNode: Deleting block blk_5395031642694782019 file /data/hadoop/dfs/data/current/subdir41/subdir31/blk_5395031642694782019
2007-11-11 20:03:37,808 INFO org.apache.hadoop.dfs.DataNode: Deleting block blk_5567722351418795177 file /data/hadoop/dfs/data/current/subdir56/subdir42/blk_5567722351418795177
2007-11-11 20:03:37,808 INFO org.apache.hadoop.dfs.DataNode: Deleting block blk_5592463115430469494 file /data/hadoop/dfs/data/current/subdir10/subdir48/blk_5592463115430469494
... (for all blocks in the datanode)

2007-11-11 20:03:42,941 WARN org.apache.hadoop.dfs.DataNode: Unexpected error trying to delete block blk_-9219752334498294080. Block not found in blockMap.
2007-11-11 20:03:42,941 WARN org.apache.hadoop.dfs.DataNode: Unexpected error trying to delete block blk_-9217018193785551154. Block not found in blockMap.
2007-11-11 20:03:42,942 WARN org.apache.hadoop.dfs.DataNode: Unexpected error trying to delete block blk_-9211664991594450527. Block not found in blockMap.
2007-11-11 20:03:42,942 WARN org.apache.hadoop.dfs.DataNode: Unexpected error trying to delete block blk_-9211471391608631351. Block not found in blockMap.
2007-11-11 20:03:42,942 WARN org.apache.hadoop.dfs.DataNode: Unexpected error trying to delete block blk_-9208445774532268187. Block not found in blockMap.
2007-11-11 20:03:42,942 WARN org.apache.hadoop.dfs.DataNode: Unexpected error trying to delete block blk_-9202539319669633125. Block not found in blockMap.
...



Thanks in advance.
Enis Soztutar











