Jing Zhao created HDFS-16874:
--------------------------------
Summary: Improve DataNode decommission for Erasure Coding
Key: HDFS-16874
URL: https://issues.apache.org/jira/browse/HDFS-16874
Project: Hadoop HDFS
Issue Type: Improvement
Components: ec, erasure-coding
Reporter: Jing Zhao
Assignee: Jing Zhao
There are a couple of issues with the current DataNode decommission
implementation when large amounts of Erasure Coding data are involved in the
data re-replication/reconstruction process:
# Slowness. In HDFS-8786 we made a decision to use re-replication for DataNode
decommission if the internal EC block is still available. While this strategy
reduces the CPU cost of EC reconstruction, it severely limits the overall data
recovery bandwidth, since a single DataNode is the only source. As high-density
HDD hosts are increasingly used by HDFS, especially together with Erasure
Coding for the warm-data use case, this has become a major pain point for
cluster management. In our production clusters, decommissioning a DataNode
storing several hundred TB of EC data can take several days. HDFS-16613
provides an optimization based on the existing mechanism, but more
fundamentally we may want to allow EC reconstruction during DataNode
decommission so as to achieve much higher recovery bandwidth.
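A back-of-the-envelope sketch of the single-source bottleneck described above (all figures below are illustrative assumptions, not measurements from our clusters):

```java
// Rough model of decommission time. With re-replication, the one
// decommissioning DataNode is the only copy source, so its NIC/disk
// bandwidth bounds the whole process. With EC reconstruction, the read
// load is spread over the surviving DataNodes holding the other internal
// blocks, so many reconstruction tasks can run in parallel.
// All numbers are hypothetical, for illustration only.
public class DecommissionEstimate {
    static double days(double dataTb, double aggregateGbps) {
        double gbit = dataTb * 1000 * 8;      // TB -> Gbit (decimal units)
        return gbit / aggregateGbps / 86400;  // seconds -> days
    }

    public static void main(String[] args) {
        double dataTb = 300;      // EC data on the decommissioning node
        double perNodeGbps = 2;   // usable recovery bandwidth per DataNode
        int parallelNodes = 100;  // DNs that can serve reconstruction reads

        System.out.printf("re-replication : %.1f days%n",
                days(dataTb, perNodeGbps));
        System.out.printf("reconstruction : %.1f days%n",
                days(dataTb, perNodeGbps * parallelNodes));
    }
}
```

Under these assumed numbers the single-source path takes roughly two weeks, while spreading the load over 100 nodes brings it down to hours, which is the kind of gap motivating this change.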
# The semantics of the existing EC reconstruction command (the
BlockECReconstructionInfoProto msg sent from the NN to a DN) are not clear. The
existing reconstruction command depends on the holes in the
srcNodes/liveBlockIndices arrays to indicate the target internal blocks for
recovery, but a hole can also appear because the corresponding DataNode is too
busy to be used as a reconstruction source. As a result, the later
DataNode-side reconstruction may not be consistent with the original intention.
E.g., if the index of the missing block is 6, and the DataNode storing block 0
is busy, the src nodes in the reconstruction command only cover blocks
[1, 2, 3, 4, 5, 7, 8]. The target DataNode may then reconstruct internal block
0 instead of 6. HDFS-16566 addresses this issue by indicating an excluded-index
list. More fundamentally, we can follow the same path but go a step further by
adding an optional field that explicitly indicates the target block indices in
the command protobuf msg. With this extension the DataNode no longer needs to
use the holes in the src node array to "guess" the reconstruction targets.
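The ambiguity above can be sketched as follows (class and field names here are illustrative, not the actual Hadoop classes or protobuf fields):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Shows why inferring reconstruction targets from holes in
// liveBlockIndices is ambiguous: an index can be absent either because
// the internal block is missing or because its DataNode is too busy to
// serve as a source.
public class TargetInference {
    // The "old" inference: every index absent from liveBlockIndices is
    // treated as a reconstruction target.
    static List<Integer> inferFromHoles(byte[] liveBlockIndices, int totalUnits) {
        boolean[] present = new boolean[totalUnits];
        for (byte i : liveBlockIndices) present[i] = true;
        List<Integer> targets = new ArrayList<>();
        for (int i = 0; i < totalUnits; i++) {
            if (!present[i]) targets.add(i);
        }
        return targets;
    }

    public static void main(String[] args) {
        // RS(6,3): internal block indices 0..8. Block 6 is genuinely
        // missing; block 0's DataNode is merely busy, so it is also
        // absent from the source arrays.
        byte[] liveBlockIndices = {1, 2, 3, 4, 5, 7, 8};

        // Hole-based inference cannot tell the two cases apart:
        System.out.println(inferFromHoles(liveBlockIndices, 9)); // [0, 6]

        // With an explicit (hypothetical) targetIndices field in the
        // command, the DataNode reconstructs exactly what the NameNode
        // intended:
        byte[] targetIndices = {6};
        System.out.println(Arrays.toString(targetIndices)); // [6]
    }
}
```

The explicit field makes the command self-describing: the src arrays only say where to read from, while the targets say what to rebuild, so a busy-node hole can no longer be mistaken for a missing block.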
Internally we have developed and applied fixes following the above directions,
and have seen a significant improvement (100+ times) in DataNode decommission
speed for EC data. The clearer semantics of the reconstruction command
protobuf msg also help prevent potential data corruption during EC
reconstruction.
We will use this ticket to track similar fixes for the Apache releases.