[ https://issues.apache.org/jira/browse/HDFS-16874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17847246#comment-17847246 ]
Chenyu Zheng commented on HDFS-16874:
-------------------------------------

[~jingzhao] Do you have any new plans? I found HDFS-17515 and HDFS-17516, which may be relevant to this. May I take this over and create some subtasks?

> Improve DataNode decommission for Erasure Coding
> ------------------------------------------------
>
>                 Key: HDFS-16874
>                 URL: https://issues.apache.org/jira/browse/HDFS-16874
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: ec, erasure-coding
>            Reporter: Jing Zhao
>            Assignee: Jing Zhao
>            Priority: Major
>
> There are a couple of issues with the current DataNode decommission
> implementation when large amounts of Erasure Coding data are involved in the
> data re-replication/reconstruction process:
> # Slowness. In HDFS-8786 we decided to use re-replication for DataNode
> decommission if the internal EC block is still available. While this
> strategy reduces the CPU cost of EC reconstruction, it severely limits the
> overall data recovery bandwidth, since a single DataNode is the only source.
> As high-density HDD hosts become more widely used by HDFS, especially
> together with Erasure Coding for warm-data use cases, this has become a
> major pain point for cluster management. In our production clusters,
> decommissioning a DataNode storing several hundred TB of EC data can take
> several days. HDFS-16613 optimizes the existing mechanism, but more
> fundamentally we may want to allow EC reconstruction during DataNode
> decommission so as to achieve much larger recovery bandwidth.
> # The semantics of the existing EC reconstruction command (the
> BlockECReconstructionInfoProto msg sent from NN to DN) are not clear.
> The existing reconstruction command depends on the holes in the
> srcNodes/liveBlockIndices arrays to indicate the target internal blocks for
> recovery, but a hole can also mean that the corresponding DataNode is too
> busy to be used as a reconstruction source. As a result, the DataNode-side
> reconstruction may not match the original intent. E.g., if the index of the
> missing block is 6 and the DataNode storing block 0 is busy, the source
> nodes in the reconstruction command only cover blocks [1, 2, 3, 4, 5, 7, 8];
> the target DataNode may then reconstruct internal block 0 instead of 6.
> HDFS-16566 addresses this issue by indicating an excluded-index list. More
> fundamentally, we can follow the same path but go a step further by adding
> an optional field that explicitly indicates the target block indices in the
> command protobuf msg. With this extension the DataNode will no longer use
> the holes in the src node array to "guess" the reconstruction targets.
> Internally we have developed and applied fixes following the above
> directions and have seen significant improvement (a 100+ times speedup) in
> DataNode decommission speed for EC data. The clearer semantics of the
> reconstruction command protobuf msg also help prevent potential data
> corruption during EC reconstruction.
> We will use this ticket to track the corresponding fixes for the Apache
> releases.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
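
For readers following along, here is a minimal standalone sketch (hypothetical names, not Hadoop's actual classes or protobuf fields) of the ambiguity the quoted description explains: inferring reconstruction targets from "holes" in the live-block-index array cannot distinguish a lost block from a healthy block whose DataNode was merely skipped as a source, while an explicit target-index list is unambiguous.

```python
# Sketch of the HDFS-16874 ambiguity. Assumes an RS-6-3 EC group with
# internal block indices 0..8. All function and variable names here are
# illustrative, not taken from the Hadoop code base.

TOTAL_BLOCKS = 9  # 6 data + 3 parity internal blocks

def infer_targets_from_holes(live_block_indices):
    """Old semantics: any index absent from the source array is assumed
    to need reconstruction. But an index may be absent simply because
    its DataNode was too busy to serve as a source."""
    return sorted(set(range(TOTAL_BLOCKS)) - set(live_block_indices))

def targets_from_command(explicit_target_indices):
    """Proposed semantics: the NameNode names the targets explicitly,
    so a busy-but-healthy source can no longer be mistaken for a loss."""
    return sorted(explicit_target_indices)

# Block 6 is actually missing; block 0 is healthy but its DataNode is
# busy, so the NameNode omits index 0 from the source array.
live = [1, 2, 3, 4, 5, 7, 8]

print(infer_targets_from_holes(live))  # [0, 6] -- block 0 wrongly included
print(targets_from_command([6]))       # [6]    -- unambiguous
```

Under the old semantics the target DataNode may pick index 0 from the inferred set and rebuild a block that was never lost, which is exactly the inconsistency the optional explicit-target field is meant to remove.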