[ https://issues.apache.org/jira/browse/HDFS-15495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17167200#comment-17167200 ]

Stephen O'Donnell edited comment on HDFS-15495 at 7/29/20, 4:12 PM:
--------------------------------------------------------------------

There are a few things to think about here.

For non-EC blocks:
 * If a block is missing it will not block decommission, as it will not be on 
the DN in question and hence will never be checked.
 * If a block is already under-replicated, decommission should proceed OK, 
provided the block can be made perfectly replicated.
 * Decommission will block if there are not enough nodes in the cluster to make 
the blocks perfectly replicated - e.g. decommissioning 1 node from a 3-node 
cluster (a minimal sketch of this rule follows below).
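
To make that last rule concrete, here is a tiny self-contained sketch (invented 
names, not the actual HDFS code): a replicated block only holds up decommission 
when the nodes remaining after decommission cannot host the full replication 
factor.

{code:java}
/** Toy model of the replicated-block rule above; not actual HDFS code. */
public class ReplicatedDecommissionCheck {

  /**
   * A block holds up decommission only if the nodes left after decommission
   * cannot host the full replication factor.
   */
  static boolean blocksDecommission(int replicationFactor, int nodesAfterDecommission) {
    return nodesAfterDecommission < replicationFactor;
  }

  public static void main(String[] args) {
    // Decommissioning 1 node of a 3-node cluster with replication=3: stuck.
    System.out.println(blocksDecommission(3, 2)); // true
    // A 6-node cluster can still fully replicate, so decommission can finish.
    System.out.println(blocksDecommission(3, 5)); // false
  }
}
{code}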

For EC, a missing block is more complicated. Consider an RS-6-3 EC file.
 * If 1 to 3 blocks are already lost, the file is still readable. If you 
decommission a host holding one of the remaining blocks, I think it will first 
reconstruct the missing 1 - 3 blocks, and then schedule a simple copy of the 
block being decommissioned.
 * If more than 3 blocks are lost, the first step can never complete, so it 
never gets to the second step and will likely hang (I have not tested this 
myself yet). A small worked example of the arithmetic follows below.
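
For illustration, a minimal sketch of that arithmetic (invented names, not HDFS 
code): an RS-6-3 block group has 9 internal blocks, and any 6 of them are 
enough to read or reconstruct the group.

{code:java}
/** Toy model of the RS-6-3 arithmetic above; not actual HDFS code. */
public class EcGroupHealth {
  static final int DATA_UNITS = 6;
  static final int PARITY_UNITS = 3;   // 9 internal blocks per block group in total

  /** Any 6 of the 9 internal blocks are enough to read or reconstruct the group. */
  static boolean reconstructable(int lostInternalBlocks) {
    return (DATA_UNITS + PARITY_UNITS - lostInternalBlocks) >= DATA_UNITS;
  }

  public static void main(String[] args) {
    System.out.println(reconstructable(3)); // true: still readable, missing blocks can be rebuilt
    System.out.println(reconstructable(4)); // false: reconstruction can never succeed
  }
}
{code}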


 -Looking at the code, I think the NN does not check if there are sufficient EC 
block sources before it schedules the reconstruction work on a DN - it is left 
to the DN to figure that part out and fail the task.-

-It looks like we might need to do something a bit smarter in ErasureCodingWork 
to allow the block being decommissioned to be copied to a new DN even if EC 
reconstruction cannot happen. Something would also need to change in the 
Decommission logic to notice the file is corrupt and also handle the local 
block, and not wait for the file to be healthy.-



I looked into this a bit more, and I think it will be tricky to fix. When the 
EC file is corrupted, it goes into the LowRedundancyBlocks structure, but in 
the QUEUE_WITH_CORRUPT_BLOCKS queue.

Then when the decommission monitor checks the block, it sees it as "needing 
replication", but it also sees it is already in neededReconstruction, so it 
does not add it to the list of blocks the BlockManager needs to replicate 
(sketched below).
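
Roughly, the effect is the following (a self-contained sketch with invented 
names, not the real DatanodeAdminManager/BlockManager code):

{code:java}
import java.util.HashSet;
import java.util.Set;

/** Toy model of the skip described above; invented names, not the real classes. */
public class DecommissionMonitorSketch {

  // Blocks the block manager already tracks, including the corrupt EC group.
  static final Set<Long> neededReconstruction = new HashSet<>(Set.of(42L));

  /** Returns true only if the monitor queues new replication work for this block. */
  static boolean scheduleIfNeeded(long blockId, boolean needsMoreReplicas) {
    if (!needsMoreReplicas) {
      return false;                     // block is already fully replicated
    }
    if (neededReconstruction.contains(blockId)) {
      return false;                     // already queued (here: in the CORRUPT queue), so skip
    }
    neededReconstruction.add(blockId);  // hand it to the block manager
    return true;
  }

  public static void main(String[] args) {
    // The corrupt EC block is never (re)queued, so nothing ever copies its replica.
    System.out.println(scheduleIfNeeded(42L, true)); // false
  }
}
{code}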

The decommission monitor relies on 
`BlockManager.computeBlockReconstructionWork` to take care of the 
under-replication. That path never considers corrupt blocks, as it knows it 
cannot reconstruct them.
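
The shape of that behaviour, as a self-contained sketch (invented names and a 
simplified priority scheme, not the real LowRedundancyBlocks/BlockManager 
code):

{code:java}
import java.util.ArrayList;
import java.util.List;

/** Toy model of the scan described above; not the real LowRedundancyBlocks code. */
public class ReconstructionWorkSketch {

  // Priority levels, lowest number = most urgent; the corrupt queue is the last one.
  static final int QUEUE_HIGHEST_PRIORITY = 0;
  static final int QUEUE_WITH_CORRUPT_BLOCKS = 4;

  /** Only queues below the corrupt level are ever scanned for reconstruction work. */
  static List<Integer> queuesScannedForWork() {
    List<Integer> scanned = new ArrayList<>();
    for (int level = QUEUE_HIGHEST_PRIORITY; level < QUEUE_WITH_CORRUPT_BLOCKS; level++) {
      scanned.add(level);
    }
    return scanned;  // the corrupt queue never shows up, so its blocks are never picked up
  }

  public static void main(String[] args) {
    System.out.println(queuesScannedForWork()); // [0, 1, 2, 3]
  }
}
{code}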

Therefore, we have an already corrupt EC block stuck in the 
neededReconstruction QUEUE_WITH_CORRUPT_BLOCKS queue, and a decommission 
monitor which needs that block to simply be copied from the DN it is currently 
on, but nothing will ever do that.



> Decommissioning a DataNode with corrupted EC files should not be blocked 
> indefinitely
> -------------------------------------------------------------------------------------
>
>                 Key: HDFS-15495
>                 URL: https://issues.apache.org/jira/browse/HDFS-15495
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: block placement, ec
>    Affects Versions: 3.0.0
>            Reporter: Siyao Meng
>            Assignee: Siyao Meng
>            Priority: Major
>
> Originally discovered in patched CDH 6.2.1 (with a bunch of EC fixes: 
> HDFS-14699, HDFS-14849, HDFS-14847, HDFS-14920, HDFS-14768, HDFS-14946, 
> HDFS-15186).
> When there's an EC file marked as corrupted on NN, if the admin tries to 
> decommission a DataNode having one of the remaining blocks of the corrupted 
> EC file, *the decom will never finish* unless the file is recovered by 
> putting the missing blocks back in:
> {code:title=The endless DatanodeAdminManager check loop, every 30s}
> 2020-07-23 16:36:12,805 TRACE blockmanagement.DatanodeAdminManager: Processed 
> 0 blocks so far this tick
> 2020-07-23 16:36:12,806 DEBUG blockmanagement.DatanodeAdminManager: 
> Processing Decommission In Progress node 127.0.1.7:5007
> 2020-07-23 16:36:12,806 TRACE blockmanagement.DatanodeAdminManager: Block 
> blk_-9223372036854775728_1013 numExpected=9, numLive=4
> 2020-07-23 16:36:12,806 INFO BlockStateChange: Block: 
> blk_-9223372036854775728_1013, Expected Replicas: 9, live replicas: 4, 
> corrupt replicas: 0, decommissioned replicas: 0, decommissioning replicas: 1, 
> maintenance replicas: 0, live entering maintenance replicas: 0, excess 
> replicas: 0, Is Open File: false, Datanodes having this block: 
> 127.0.1.12:5012 127.0.1.10:5010 127.0.1.8:5008 127.0.1.11:5011 127.0.1.7:5007 
> , Current Datanode: 127.0.1.7:5007, Is current datanode decommissioning: 
> true, Is current datanode entering maintenance: false
> 2020-07-23 16:36:12,806 DEBUG blockmanagement.DatanodeAdminManager: Node 
> 127.0.1.7:5007 still has 1 blocks to replicate before it is a candidate to 
> finish Decommission In Progress.
> 2020-07-23 16:36:12,806 INFO blockmanagement.DatanodeAdminManager: Checked 1 
> blocks and 1 nodes this tick
> {code}
> "Corrupted" file here meaning the EC file doesn't have enough EC blocks in 
> the block group to be reconstructed. e.g. for {{RS-6-3-1024k}}, when there 
> are less than 6 blocks for an EC file, the file can no longer be retrieved 
> correctly.
> Will check on trunk as well soon.


