[ https://issues.apache.org/jira/browse/HDFS-11019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15589190#comment-15589190 ]
Kuhu Shukla edited comment on HDFS-11019 at 10/19/16 4:36 PM:
--------------------------------------------------------------

[~jojochuang] Thank you for reporting this. In Hadoop 2.6 (CDH 5.7.2), the attached test shows the same behavior as described above:

{code}
INFO BlockStateChange (CorruptReplicasMap.java:addToCorruptReplicasMap(76)) - BLOCK NameSystem.addToCorruptReplicasMap: blk_12345 added as corrupt on 127.0.0.1:12345 by null because TEST
1. corruptReplicaMap=[127.0.0.1:12345]
2. corruptReplicaMap=null
INFO BlockStateChange (CorruptReplicasMap.java:addToCorruptReplicasMap(76)) - BLOCK NameSystem.addToCorruptReplicasMap: blk_12345 added as corrupt on 127.0.0.1:12345 by null because TEST
3. corruptReplicaMap=[127.0.0.1:12345] //should be null
4. corruptReplicaMap=[127.0.0.1:12345] //should be null
{code}

This behavior was fixed through HDFS-9958; running the same test against that change produces the following output:

{code}
1. corruptReplicaMap=[127.0.0.1:63829]
2. corruptReplicaMap=null
3. corruptReplicaMap=null
4. corruptReplicaMap=null
{code}

The code change is in BlockManager#findAndMarkBlockAsCorrupt in releases 2.7.3 and up:

{code}
if (storage == null) {
  storage = storedBlock.findStorageInfo(node);
}
if (storage == null) {
  blockLog.debug("BLOCK* findAndMarkBlockAsCorrupt: {} not found on {}",
      blk, dn);
  return;
}
{code}

Hope this helps.


> Inconsistent number of corrupt replicas if a corrupt replica is reported
> multiple times
> ---------------------------------------------------------------------------------------
>
>                 Key: HDFS-11019
>                 URL: https://issues.apache.org/jira/browse/HDFS-11019
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>         Environment: CDH5.7.2
>            Reporter: Wei-Chiu Chuang
>         Attachments: HDFS-11019.test.patch
>
>
> While investigating a block corruption issue, I found the following warning
> message in the namenode log:
> {noformat}
> (a client reports a block replica is corrupt)
> 2016-10-12 10:07:37,166 INFO BlockStateChange: BLOCK NameSystem.addToCorruptReplicasMap: blk_1073803461 added as corrupt on 10.0.0.63:50010 by /10.0.0.62 because client machine reported it
> 2016-10-12 10:07:37,166 INFO BlockStateChange: BLOCK* invalidateBlock: blk_1073803461_74513(stored=blk_1073803461_74553) on 10.0.0.63:50010
> 2016-10-12 10:07:37,166 INFO BlockStateChange: BLOCK* InvalidateBlocks: add blk_1073803461_74513 to 10.0.0.63:50010
> (another client reports a block replica is corrupt)
> 2016-10-12 10:07:37,728 INFO BlockStateChange: BLOCK NameSystem.addToCorruptReplicasMap: blk_1073803461 added as corrupt on 10.0.0.63:50010 by /10.0.0.64 because client machine reported it
> 2016-10-12 10:07:37,728 INFO BlockStateChange: BLOCK* invalidateBlock: blk_1073803461_74513(stored=blk_1073803461_74553) on 10.0.0.63:50010
> (ReplicationMonitor thread kicks in to invalidate the replica and add a new one)
> 2016-10-12 10:07:37,888 INFO BlockStateChange: BLOCK* ask 10.0.0.56:50010 to replicate blk_1073803461_74553 to datanode(s) 10.0.0.63:50010
> 2016-10-12 10:07:37,888 INFO BlockStateChange: BLOCK* BlockManager: ask 10.0.0.63:50010 to delete [blk_1073803461_74513]
> (the two maps are inconsistent)
> 2016-10-12 10:08:00,335 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Inconsistent number of corrupt replicas for blk_1073803461_74553 blockMap has 0 but corrupt replicas map has 1
> {noformat}
> It seems that when a corrupt block replica is reported twice, blocksMap and
> corruptReplicasMap become inconsistent.
> Looking at the log, I suspect the bug is in
> {{BlockManager#removeStoredBlock}}. When a corrupt replica is reported,
> BlockManager removes the block from blocksMap. If the block is already
> removed (that is, the corrupt replica is reported twice), it returns early;
> otherwise (that is, the corrupt replica is reported the first time), it also
> removes the block from corruptReplicasMap (the block is added to
> corruptReplicasMap in {{BlockManager#markBlockAsCorrupt}}). Therefore, after
> the second corruption report, the corrupt replica is removed from blocksMap,
> but the entry in corruptReplicasMap is not removed.
> I can't tell what the impact of the inconsistency is, but I feel it's a
> good idea to fix it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
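For readers following along, the double-report bookkeeping failure and the effect of the HDFS-9958 early-return guard can be reproduced in isolation. The following is a minimal, hypothetical Java sketch: class, field, and method names are simplified stand-ins for the real BlockManager internals, and invalidation is compressed into a single call rather than waiting for the DataNode's deletion report.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * Toy model of the two NameNode maps that fall out of sync.
 * Not the real BlockManager API; a simplified illustration only.
 */
class ToyBlockManager {
    // blockId -> datanodes believed to hold a live replica (cf. blocksMap)
    final Map<Long, Set<String>> blocksMap = new HashMap<>();
    // blockId -> datanodes whose replica was reported corrupt
    final Map<Long, Set<String>> corruptReplicasMap = new HashMap<>();
    // true models the HDFS-9958 early-return guard (2.7.3 and up)
    final boolean guarded;

    ToyBlockManager(boolean guarded) { this.guarded = guarded; }

    void addReplica(long blk, String node) {
        blocksMap.computeIfAbsent(blk, k -> new HashSet<>()).add(node);
    }

    /** A client reports the replica of blk on node as corrupt. */
    void findAndMarkBlockAsCorrupt(long blk, String node) {
        Set<String> storages = blocksMap.get(blk);
        boolean stored = storages != null && storages.contains(node);
        if (guarded && !stored) {
            return; // fixed code: replica not found on node, ignore report
        }
        // cf. markBlockAsCorrupt: record the corrupt replica
        corruptReplicasMap.computeIfAbsent(blk, k -> new HashSet<>()).add(node);
        if (stored) {
            // cf. removeStoredBlock: drop the replica and the corrupt entry
            storages.remove(node);
            corruptReplicasMap.remove(blk);
        }
        // if !stored (a duplicate report), removeStoredBlock returned early,
        // so the fresh corruptReplicasMap entry above is never cleaned up
    }

    int corruptEntries(long blk) {
        Set<String> s = corruptReplicasMap.get(blk);
        return s == null ? 0 : s.size();
    }

    public static void main(String[] args) {
        ToyBlockManager buggy = new ToyBlockManager(false);
        buggy.addReplica(42L, "10.0.0.63:50010");
        buggy.findAndMarkBlockAsCorrupt(42L, "10.0.0.63:50010"); // 1st report
        buggy.findAndMarkBlockAsCorrupt(42L, "10.0.0.63:50010"); // duplicate
        System.out.println("pre-fix corrupt map entries:  "
            + buggy.corruptEntries(42L));

        ToyBlockManager fixed = new ToyBlockManager(true);
        fixed.addReplica(42L, "10.0.0.63:50010");
        fixed.findAndMarkBlockAsCorrupt(42L, "10.0.0.63:50010"); // 1st report
        fixed.findAndMarkBlockAsCorrupt(42L, "10.0.0.63:50010"); // duplicate
        System.out.println("post-fix corrupt map entries: "
            + fixed.corruptEntries(42L));
    }
}
```

With the guard disabled, the duplicate report leaves one stale corruptReplicasMap entry while blocksMap holds no replica, mirroring the "blockMap has 0 but corrupt replicas map has 1" warning; with the guard enabled, the duplicate report is a no-op and both maps stay consistent.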