[ http://issues.apache.org/jira/browse/HADOOP-124?page=comments#action_12412273 ]
Konstantin Shvachko commented on HADOOP-124:
--------------------------------------------

For future development in this direction:
We should persistently store on the name node all storage IDs to which the name node has ever assigned blocks.
With that knowledge the name node can reject blocks from any newly registered data storage that is not on its list.
In other words, when a data node registers a NEW data storage it should not report any blocks from that storage, and the name node can verify this, since it never assigned any blocks to that storage.
This would prevent us from accidentally connecting data nodes that belong to different clusters (DFS instances).

> don't permit two datanodes to run from same dfs.data.dir
> --------------------------------------------------------
>
>          Key: HADOOP-124
>          URL: http://issues.apache.org/jira/browse/HADOOP-124
>      Project: Hadoop
>         Type: Bug
>   Components: dfs
>     Versions: 0.2
>  Environment: ~30 node cluster
>     Reporter: Bryan Pendleton
>     Assignee: Konstantin Shvachko
>     Priority: Critical
>      Fix For: 0.3
>  Attachments: DatanodeRegister.txt, DirNotSharing.patch
>
> DFS files are still rotting.
> I suspect that there's a problem with block accounting/detecting identical
> hosts in the namenode. I have 30 physical nodes, with various numbers of
> local disks, meaning that my current 'bin/hadoop dfs -report' shows 80 nodes
> after a full restart. However, when I discovered the problem (which resulted
> in losing about 500 GB worth of temporary data because of missing blocks in
> some of the larger chunks) -report showed 96 nodes. I suspect somehow there
> were extra datanodes running against the same paths, and that the namenode
> was counting those as replicated instances, which then showed up
> over-replicated, and one of them was told to delete its local block, leading
> to the block actually getting lost.
> I will debug it more the next time the situation arises.
> This is at least the 5th time I've had a large amount of file data "rot" in DFS since January.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
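
As an illustration of the storage-ID check proposed in the comment above, here is a minimal sketch in Java. It is not taken from the attached patches or from the Hadoop code base: the class name StorageRegistry and the methods recordAssignment and acceptBlockReport are hypothetical, and the set of known storage IDs is kept in memory only, whereas the proposal calls for storing it persistently on the name node.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/**
 * Hypothetical sketch of the name-node-side check; not actual Hadoop code.
 */
class StorageRegistry {
    // Storage IDs the name node has ever assigned blocks to.
    // Assumption: a real name node would persist this set; here it lives in memory.
    private final Set<String> knownStorageIds = new HashSet<>();

    /** Record that the name node assigned a block to the given storage. */
    void recordAssignment(String storageId) {
        knownStorageIds.add(storageId);
    }

    /**
     * Decide whether a block report from a registering storage is acceptable.
     * A storage the name node never assigned blocks to must report no blocks.
     */
    boolean acceptBlockReport(String storageId, List<Long> reportedBlockIds) {
        if (knownStorageIds.contains(storageId)) {
            return true; // known storage: its blocks may be legitimate
        }
        // Unknown (new) storage: accept only an empty report.
        return reportedBlockIds.isEmpty();
    }
}
```

With this sketch, a storage from a different DFS instance that registers with a non-empty block report would be rejected, because its ID was never recorded through recordAssignment, while a genuinely new and empty storage would still be allowed to join.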
