[ https://issues.apache.org/jira/browse/HDFS-15495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17167200#comment-17167200 ]
Stephen O'Donnell edited comment on HDFS-15495 at 7/29/20, 4:12 PM:
--------------------------------------------------------------------

There are a few things to think about here.

For non-EC blocks:
* If a block is missing it will not block decommission, as it will not be on the DN in question and hence will not be checked.
* If a block is already under-replicated, decommission should proceed OK, provided the block can be made perfectly replicated.
* Decommission will block if there are not enough nodes in the cluster to make the blocks perfectly replicated - e.g. decommissioning 1 node from a 3 node cluster.

For EC, a missing block is more complicated. Consider a 6-3 EC file:
* If 1 to 3 blocks are already lost, the file is still readable. If you decommission a host holding one of the remaining blocks, I think it will first reconstruct the missing 1 - 3 blocks, and then schedule a simple copy of the decommissioning block.
* If more than 3 blocks are lost, it will not be able to complete the first step, will therefore never get to the second step, and will likely hang (I have not tested this myself yet).

-Looking at the code, I think the NN does not check whether there are sufficient EC block sources before it schedules the reconstruction work on a DN - it is left to the DN to figure that part out and fail the task.-

-It looks like we might need to do something a bit smarter in ErasureCodingWork to allow the block being decommissioned to be copied to a new DN even if EC reconstruction cannot happen. Something would also need to change in the decommission logic to notice the file is corrupt, handle the local block anyway, and not wait for the file to be healthy.-

I looked into this a bit more, and I think it will be tricky to fix. When the EC file is corrupted, its block goes into the LowRedundancyBlocks list, but into the QUEUE_WITH_CORRUPT_BLOCKS queue. When the decommission monitor then checks the block, it sees it as "needing replication", but it also sees it is already in neededReconstruction, so it does not add it to the list of blocks the BlockManager needs to replicate. The decommission monitor relies on `BlockManager.computeBlockReconstructionWork` to take care of the under-replication, but that method never considers corrupt blocks, as it knows it cannot reconstruct them. Therefore we have an already corrupt EC file stuck in the QUEUE_WITH_CORRUPT_BLOCKS queue of neededReconstruction, and a decommission monitor which needs the block to simply be copied off the DN it is currently on, but nothing will ever do that.
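To make that interaction concrete, here is a minimal, self-contained sketch of the behaviour described above. It is deliberately not the real code from org.apache.hadoop.hdfs.server.blockmanagement - the names BlockGroup, ReconstructionQueues, scheduleWork and decommissionCanFinish are hypothetical stand-ins for the DatanodeAdminManager / LowRedundancyBlocks / computeBlockReconstructionWork interplay, assuming an RS-6-3 policy and the numLive=4 count from the log in the issue description below.

{code:java}
// Simplified illustration only; class and method names are hypothetical,
// NOT the actual HDFS block management classes.
import java.util.ArrayList;
import java.util.List;

public class DecommissionHangSketch {

  /** Minimal model of an EC block group under an RS-6-3 policy. */
  static class BlockGroup {
    final int dataUnits = 6;        // RS-6-3: 6 data units...
    final int parityUnits = 3;      // ...plus 3 parity units, so numExpected = 9
    final int liveReplicas;         // internal blocks still on live DNs
    final boolean onDecommissioningNode;

    BlockGroup(int liveReplicas, boolean onDecommissioningNode) {
      this.liveReplicas = liveReplicas;
      this.onDecommissioningNode = onDecommissioningNode;
    }

    boolean isCorrupt() {
      // Fewer live internal blocks than data units: unreadable and unreconstructable.
      return liveReplicas < dataUnits;
    }

    boolean isFullyRedundant() {
      return liveReplicas >= dataUnits + parityUnits;
    }
  }

  /** Stand-in for the low-redundancy queues, including the corrupt queue. */
  static class ReconstructionQueues {
    final List<BlockGroup> corruptQueue = new ArrayList<>();
    final List<BlockGroup> lowRedundancyQueue = new ArrayList<>();

    boolean contains(BlockGroup b) {
      return corruptQueue.contains(b) || lowRedundancyQueue.contains(b);
    }

    void add(BlockGroup b) {
      if (b.isCorrupt()) {
        corruptQueue.add(b);        // parked here; never scheduled
      } else {
        lowRedundancyQueue.add(b);
      }
    }

    /** Plays the role of computeBlockReconstructionWork: corrupt blocks are skipped. */
    List<BlockGroup> scheduleWork() {
      List<BlockGroup> scheduled = new ArrayList<>(lowRedundancyQueue);
      lowRedundancyQueue.clear();
      return scheduled;             // corruptQueue is deliberately left untouched
    }
  }

  /** Plays the role of one decommission-monitor pass over a single block. */
  static boolean decommissionCanFinish(BlockGroup b, ReconstructionQueues queues) {
    if (!b.onDecommissioningNode || b.isFullyRedundant()) {
      return true;                  // nothing for this block to wait on
    }
    if (!queues.contains(b)) {
      queues.add(b);                // hand it over and wait for reconstruction...
    }
    return false;                   // ...which, for a corrupt group, never happens
  }

  public static void main(String[] args) {
    // numLive=4 as in the log below: 4 < 6, so the group is corrupt.
    BlockGroup corruptGroup = new BlockGroup(4, true);
    ReconstructionQueues queues = new ReconstructionQueues();

    for (int tick = 1; tick <= 3; tick++) {
      boolean done = decommissionCanFinish(corruptGroup, queues);
      int scheduled = queues.scheduleWork().size();
      System.out.printf("tick %d: decommission done=%b, reconstruction tasks=%d%n",
          tick, done, scheduled);
    }
    // Prints done=false and 0 tasks on every tick: the corrupt group sits in
    // the corrupt queue and nothing ever copies the replica off the
    // decommissioning DN, matching the endless monitor loop in the log.
  }
}
{code}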
> Decommissioning a DataNode with corrupted EC files should not be blocked
> indefinitely
> -------------------------------------------------------------------------------------
>
>                 Key: HDFS-15495
>                 URL: https://issues.apache.org/jira/browse/HDFS-15495
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: block placement, ec
>    Affects Versions: 3.0.0
>            Reporter: Siyao Meng
>            Assignee: Siyao Meng
>            Priority: Major
>
> Originally discovered in patched CDH 6.2.1 (with a bunch of EC fixes: HDFS-14699, HDFS-14849, HDFS-14847, HDFS-14920, HDFS-14768, HDFS-14946, HDFS-15186).
> When there's an EC file marked as corrupted on NN, if the admin tries to decommission a DataNode having one of the remaining blocks of the corrupted EC file, *the decom will never finish* unless the file is recovered by putting the missing blocks back in:
> {code:title=The endless DatanodeAdminManager check loop, every 30s}
> 2020-07-23 16:36:12,805 TRACE blockmanagement.DatanodeAdminManager: Processed 0 blocks so far this tick
> 2020-07-23 16:36:12,806 DEBUG blockmanagement.DatanodeAdminManager: Processing Decommission In Progress node 127.0.1.7:5007
> 2020-07-23 16:36:12,806 TRACE blockmanagement.DatanodeAdminManager: Block blk_-9223372036854775728_1013 numExpected=9, numLive=4
> 2020-07-23 16:36:12,806 INFO BlockStateChange: Block: blk_-9223372036854775728_1013, Expected Replicas: 9, live replicas: 4, corrupt replicas: 0, decommissioned replicas: 0, decommissioning replicas: 1, maintenance replicas: 0, live entering maintenance replicas: 0, excess replicas: 0, Is Open File: false, Datanodes having this block: 127.0.1.12:5012 127.0.1.10:5010 127.0.1.8:5008 127.0.1.11:5011 127.0.1.7:5007 , Current Datanode: 127.0.1.7:5007, Is current datanode decommissioning: true, Is current datanode entering maintenance: false
> 2020-07-23 16:36:12,806 DEBUG blockmanagement.DatanodeAdminManager: Node 127.0.1.7:5007 still has 1 blocks to replicate before it is a candidate to finish Decommission In Progress.
> 2020-07-23 16:36:12,806 INFO blockmanagement.DatanodeAdminManager: Checked 1 blocks and 1 nodes this tick
> {code}
> "Corrupted" file here meaning the EC file doesn't have enough EC blocks in the block group to be reconstructed. e.g. for {{RS-6-3-1024k}}, when there are less than 6 blocks for an EC file, the file can no longer be retrieved correctly.
> Will check on trunk as well soon.
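For reference, a tiny standalone snippet (a hypothetical helper, not a Hadoop class) that plugs the counts from the log above into the RS-6-3 thresholds discussed in the comment:

{code:java}
/** Hypothetical illustration only; the thresholds come from the RS-6-3 layout. */
public class EcGroupState {
  static final int DATA_UNITS = 6;    // RS-6-3-1024k: 6 data units
  static final int PARITY_UNITS = 3;  // plus 3 parity units -> numExpected = 9

  public static void main(String[] args) {
    int numExpected = DATA_UNITS + PARITY_UNITS;  // 9, matches "Expected Replicas" above
    int numLive = 4;                              // matches "live replicas" above

    int missing = numExpected - numLive;          // 5 internal blocks are gone
    // Any 6 of the 9 internal blocks are enough to read or to reconstruct the rest.
    boolean readable = numLive >= DATA_UNITS;         // 4 >= 6 -> false
    boolean reconstructable = numLive >= DATA_UNITS;  // same threshold -> false

    System.out.printf("missing=%d readable=%b reconstructable=%b%n",
        missing, readable, reconstructable);
    // With only 4 of 9 internal blocks live, neither a read nor a reconstruction
    // is possible, which is why the decommission check keeps repeating every tick.
  }
}
{code}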