[jira] [Updated] (HDFS-16874) Improve DataNode decommission for Erasure Coding

Jing Zhao (Jira) Mon, 19 Dec 2022 14:21:25 -0800


     [ 
https://issues.apache.org/jira/browse/HDFS-16874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Jing Zhao updated HDFS-16874:
-----------------------------
    Description: 
There are a couple of issues with the current DataNode decommission 
implementation when large amounts of Erasure Coding data are involved in the 
data re-replication/reconstruction process:
 # Slowness. In HDFS-8786 we made a decision to use re-replication for DataNode 
decommission if the internal EC block is still available. While this strategy 
reduces the CPU cost caused by EC reconstruction, it greatly limits the overall 
data recovery bandwidth, since there is only one single DataNode as the source. 
While high density HDD hosts are more and more widely used by HDFS especially 
along with Erasure Coding for warm data use case, this becomes a big pain for 
cluster management. In our production, to decommission a DataNode with several 
hundred TB EC data stored might take several days. HDFS-16613 provides 
optimization based on the existing mechanism, but more fundamentally we may 
want to allow EC reconstruction for DataNode decommission so as to achieve much 
larger recovery bandwidth.
 # The semantic of the existing EC reconstruction command (the 
BlockECReconstructionInfoProto msg sent from NN to DN) is not clear. The 
existing reconstruction command depends on the holes in the 
srcNodes/liveBlockIndices arrays to indicate the target internal blocks for 
recovery, while the holes can also be caused by the fact that the corresponding 
datanode is too busy so it cannot be used as the reconstruction source. This 
causes the later DataNode side reconstruction may not be consistent with the 
original intention. E.g., if the index of the missing block is 6, and the 
datanode storing block 0 is busy, the src nodes in the reconstruction command 
only cover blocks [1, 2, 3, 4, 5, 7, 8]. The target datanode may reconstruct 
the internal block 0 instead of 6. HDFS-16566 is working on this issue by 
indicating an excluding index list. More fundamentally we can follow the same 
path but go steps further by adding an optional field explicitly indicating the 
target block indices in the command protobuf msg. With the extension the 
DataNode will no longer use the holes in the src node array to "guess" the 
reconstruction targets.

Internally we have developed and applied fixes by following the above 
directions. We have seen significant improvement (100+ times speed up) in terms 
of datanode decommission speed for EC data. The more clear semantic of the 
reconstruction command protobuf msg also help prevent potential data corruption 
during the EC reconstruction.

We will use this ticket to track the similar fixes for the Apache releases.

  was:
There are a couple of issues with the current DataNode decommission 
implementation when large amounts of Erasure Coding data are involved in the 
data re-replication/reconstruction process:
 # Slowness. In HDFS-8786 we made a decision to use re-replication for DataNode 
decommission if the internal EC block is still available. While this strategy 
reduces the CPU cost caused by EC reconstruction, it greatly limits the overall 
data recovery bandwidth, since there is only one single DataNode as the source. 
While high density HDD hosts are more and more widely used by HDFS especially 
along with Erasure Coding for warm data use case, this becomes a big pain for 
cluster management. In our production, to decommission a DataNode with several 
hundred TB EC data stored might take several days. HDFS-16613 provides 
optimization based on the existing mechanism, but more fundamentally we may 
want to allow EC reconstruction for DataNode decommission so as to achieve much 
larger recovery bandwidth.
 # The semantic of the existing EC reconstruction command (the 
BlockECReconstructionInfoProto msg sent from NN to DN) is not clear. The 
existing reconstruction command depends on the holes in the 
srcNodes/liveBlockIndices arrays to indicate the target internal blocks for 
recovery, while the holes can also be caused by the fact that the corresponding 
datanode is too busy so it cannot be used as the reconstruction source. This 
causes the later DataNode side reconstruction may not be consistent with the 
original intention. E.g., if the index of the missing block is 6, and the 
datanode storing block 0 is busy, the src nodes in the reconstruction command 
only cover blocks [1, 2, 3, 4, 5, 7, 8]. The target datanode may reconstruct 
the internal block 0 instead of 6. HDFS-16566 is working on this issue by 
indicating an excluding index list. More fundamentally we can follow the same 
path but go steps further by adding an optional field explicitly indicating the 
target block indices in the command protobuf msg. With the extension the 
DataNode will no longer use the holes in the src node array to "guess" the 
reconstruction targets.

Internally we have developed and applied fixes by following the above 
directions. We have seen significant improvement (100+ times) in terms of 
datanode decommission speed for EC data. The more clear semantic of the 
reconstruction command protobuf msg also help prevent potential data corruption 
during the EC reconstruction.

We will use this ticket to track the similar fixes for the Apache releases.


> Improve DataNode decommission for Erasure Coding
> ------------------------------------------------
>
>                 Key: HDFS-16874
>                 URL: https://issues.apache.org/jira/browse/HDFS-16874
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: ec, erasure-coding
>            Reporter: Jing Zhao
>            Assignee: Jing Zhao
>            Priority: Major
>
> There are a couple of issues with the current DataNode decommission 
> implementation when large amounts of Erasure Coding data are involved in the 
> data re-replication/reconstruction process:
>  # Slowness. In HDFS-8786 we made a decision to use re-replication for 
> DataNode decommission if the internal EC block is still available. While this 
> strategy reduces the CPU cost caused by EC reconstruction, it greatly limits 
> the overall data recovery bandwidth, since there is only one single DataNode 
> as the source. While high density HDD hosts are more and more widely used by 
> HDFS especially along with Erasure Coding for warm data use case, this 
> becomes a big pain for cluster management. In our production, to decommission 
> a DataNode with several hundred TB EC data stored might take several days. 
> HDFS-16613 provides optimization based on the existing mechanism, but more 
> fundamentally we may want to allow EC reconstruction for DataNode 
> decommission so as to achieve much larger recovery bandwidth.
>  # The semantic of the existing EC reconstruction command (the 
> BlockECReconstructionInfoProto msg sent from NN to DN) is not clear. The 
> existing reconstruction command depends on the holes in the 
> srcNodes/liveBlockIndices arrays to indicate the target internal blocks for 
> recovery, while the holes can also be caused by the fact that the 
> corresponding datanode is too busy so it cannot be used as the reconstruction 
> source. This causes the later DataNode side reconstruction may not be 
> consistent with the original intention. E.g., if the index of the missing 
> block is 6, and the datanode storing block 0 is busy, the src nodes in the 
> reconstruction command only cover blocks [1, 2, 3, 4, 5, 7, 8]. The target 
> datanode may reconstruct the internal block 0 instead of 6. HDFS-16566 is 
> working on this issue by indicating an excluding index list. More 
> fundamentally we can follow the same path but go steps further by adding an 
> optional field explicitly indicating the target block indices in the command 
> protobuf msg. With the extension the DataNode will no longer use the holes in 
> the src node array to "guess" the reconstruction targets.
> Internally we have developed and applied fixes by following the above 
> directions. We have seen significant improvement (100+ times speed up) in 
> terms of datanode decommission speed for EC data. The more clear semantic of 
> the reconstruction command protobuf msg also help prevent potential data 
> corruption during the EC reconstruction.
> We will use this ticket to track the similar fixes for the Apache releases.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org

[jira] [Updated] (HDFS-16874) Improve DataNode decommission for Erasure Coding

Reply via email to