[
https://issues.apache.org/jira/browse/HDFS-17620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18021706#comment-18021706
]
ASF GitHub Bot commented on HDFS-17620:
---------------------------------------
github-actions[bot] closed pull request #7035: HDFS-17620. Better block
placement for small EC files
URL: https://github.com/apache/hadoop/pull/7035
> Better block placement for small EC files
> -----------------------------------------
>
> Key: HDFS-17620
> URL: https://issues.apache.org/jira/browse/HDFS-17620
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: erasure-coding, namenode
> Affects Versions: 3.3.6
> Reporter: Junegunn Choi
> Priority: Major
> Labels: pull-request-available
> Attachments: image-2024-09-10-13-22-50-247.png, screenshot-1.png
>
>
> h2. Problem description
> If an erasure-coded file is not large enough to fill the stripe width of the
> EC policy, the block distribution can be suboptimal.
> For example, an RS-6-3-1024K EC file smaller than 1024K will have 1 data
> block and 3 parity blocks. While all 9 (6 + 3) storage locations are chosen
> by the block placement policy, only 4 of them are used: the first location
> for the data block and the last 3 for the parity blocks. If the cluster has
> only a few racks (e.g. 3), the current scheme, which orders the locations to
> form the shortest pipeline, tends to put the last nodes on the same rack,
> resulting in a suboptimal rack distribution.
> {noformat}
> Locations: N1 N2 N3 N4 N5 N6 N7 N8 N9
> Racks:     R1 R1 R1 R2 R2 R2 R3 R3 R3
> Blocks:    D1                P1 P2 P3
> {noformat}
> We can see that blocks are stored in only 2 racks, not 3.
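> For reference, here is a minimal sketch (plain Java, not HDFS code; the
> constants and class name are made up to mirror RS-6-3-1024k) of why only 4
> of the 9 chosen locations actually receive a block for a sub-cell file:
> {code:java}
> // A sub-cell file needs one data block per started cell plus all parity
> // blocks, so most of the 9 chosen locations stay empty.
> public class SmallEcFile {
>   static final int DATA_UNITS = 6;            // RS-6-3
>   static final int PARITY_UNITS = 3;
>   static final long CELL_SIZE = 1024 * 1024;  // 1024k cell size
>
>   // One data block per started cell, capped at the stripe width.
>   static int dataBlocks(long fileSize) {
>     long startedCells = (fileSize + CELL_SIZE - 1) / CELL_SIZE;
>     return (int) Math.min(DATA_UNITS, Math.max(1, startedCells));
>   }
>
>   public static void main(String[] args) {
>     long fileSize = 500L * 1024;  // a 500 KiB file, smaller than one cell
>     int used = dataBlocks(fileSize) + PARITY_UNITS;
>     // 9 locations are chosen, but only 4 blocks (D1 + P1-P3) are written.
>     System.out.println("chosen=" + (DATA_UNITS + PARITY_UNITS) + " used=" + used);
>   }
> }
> {code}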
> Because the block does not span enough racks, an {{ErasureCodingWork}} is
> later created to replicate the block to a new rack. However, the current
> code tries to copy the block to the first node in the chosen locations,
> regardless of its rack, so the work is not guaranteed to improve the
> situation, and we constantly see {{PendingReconstructionMonitor timed out}}
> messages in the log.
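> A toy illustration (plain Java, not the NameNode code; the rack lists just
> mirror the example above) of why the extra copy only helps if it lands on a
> rack that does not already hold a block of the group:
> {code:java}
> import java.util.*;
>
> // Copying to a node on an already-used rack does not increase the number
> // of racks the group spans, so the placement stays unsatisfactory.
> public class RackSpread {
>   static int distinctRacks(List<String> racksWithBlock) {
>     return new HashSet<>(racksWithBlock).size();
>   }
>
>   public static void main(String[] args) {
>     // From the example above: D1 on R1, P1-P3 on R3.
>     List<String> current = List.of("R1", "R3", "R3", "R3");
>
>     List<String> copyToUsedRack = new ArrayList<>(current);
>     copyToUsedRack.add("R1");   // target chosen regardless of its rack
>     List<String> copyToNewRack = new ArrayList<>(current);
>     copyToNewRack.add("R2");    // target chosen on an unused rack
>
>     System.out.println(distinctRacks(current));        // 2
>     System.out.println(distinctRacks(copyToUsedRack)); // still 2
>     System.out.println(distinctRacks(copyToNewRack));  // 3
>   }
> }
> {code}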
> h2. Proposed solution
> 1. Reorder the chosen locations by rack so that the parity blocks are stored
> in as many racks as possible (see the sketch below).
> 2. Make {{ErasureCodingWork}} try to find a target on a new rack.
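> A minimal sketch of idea 1 (plain Java with a made-up {{Node}} record, not
> the actual patch or Hadoop's {{DatanodeStorageInfo}}): take one node per
> rack in round-robin order so that the first and last slots of the block
> group are spread over as many racks as possible.
> {code:java}
> import java.util.*;
>
> public class RackRoundRobin {
>   record Node(String name, String rack) {}
>
>   static List<Node> reorderByRack(List<Node> chosen) {
>     // Group the chosen locations by rack, preserving the original order.
>     Map<String, Deque<Node>> byRack = new LinkedHashMap<>();
>     for (Node n : chosen) {
>       byRack.computeIfAbsent(n.rack(), r -> new ArrayDeque<>()).add(n);
>     }
>     // Take one node from each rack in turn until every node is placed.
>     List<Node> reordered = new ArrayList<>(chosen.size());
>     while (reordered.size() < chosen.size()) {
>       for (Deque<Node> q : byRack.values()) {
>         if (!q.isEmpty()) {
>           reordered.add(q.poll());
>         }
>       }
>     }
>     return reordered;
>   }
>
>   public static void main(String[] args) {
>     String[] racks = {"R1", "R1", "R1", "R2", "R2", "R2", "R3", "R3", "R3"};
>     List<Node> chosen = new ArrayList<>();
>     for (int i = 0; i < racks.length; i++) {
>       chosen.add(new Node("N" + (i + 1), racks[i]));
>     }
>     // Prints N1 N4 N7 N2 N5 N8 N3 N6 N9: D1 lands on R1 and the parity
>     // blocks on R1, R2 and R3, so all three racks are used.
>     reorderByRack(chosen).forEach(n -> System.out.print(n.name() + " "));
>   }
> }
> {code}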
> h2. Real-world test result
> We first noticed the problem on our HBase cluster running Hadoop 3.3.6 on 18
> nodes across 3 racks. After setting RS-6-3-1024K policy on the HBase data
> directory, we noticed that
> 1. FSCK reports "Unsatisfactory placement block groups" for small EC files.
> {noformat}
> /hbase/***: Replica placement policy is violated for ***. Block should be
> additionally replicated on 2 more rack(s). Total number of racks in the
> cluster: 3
> ...
> Erasure Coded Block Groups:
> ...
> Unsatisfactory placement block groups: 1475 (2.5252092 %)
> {noformat}
> 2. Namenode keeps logging "PendingReconstructionMonitor timed out" messages
> every recheck-interval (5 minutes).
> 3. The {{FSNamesystem.UnderReplicatedBlocks}} metric spikes and then clears
> every recheck-interval.
> After applying the patch, all of these problems are gone: "Unsatisfactory
> placement block groups" is now zero, and there are no more metric spikes or
> "timed out" logs.
> !screenshot-1.png|width=500!
> !image-2024-09-10-13-22-50-247.png|width=500!