[
https://issues.apache.org/jira/browse/HDFS-17620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18021706#comment-18021706
]
ASF GitHub Bot commented on HDFS-17620:
---------------------------------------
github-actions[bot] closed pull request #7035: HDFS-17620. Better block
placement for small EC files
URL: https://github.com/apache/hadoop/pull/7035
> Better block placement for small EC files
> -----------------------------------------
>
> Key: HDFS-17620
> URL: https://issues.apache.org/jira/browse/HDFS-17620
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: erasure-coding, namenode
> Affects Versions: 3.3.6
> Reporter: Junegunn Choi
> Priority: Major
> Labels: pull-request-available
> Attachments: image-2024-09-10-13-22-50-247.png, screenshot-1.png
>
>
> h2. Problem description
> If an erasure-coded file is not large enough to fill the stripe width of the
> EC policy, the block distribution can be suboptimal.
> For example, an RS-6-3-1024K EC file smaller than 1024K will have 1 data
> block and 3 parity blocks. While all 9 (6 + 3) storage locations are chosen
> by the block placement policy, only 4 of them are used: the first location
> for the data block and the last 3 for the parity blocks. If the cluster has
> only a few racks (e.g. 3), the current scheme, which orders the locations to
> form the shortest pipeline, tends to put the last nodes on the same rack,
> resulting in a suboptimal rack distribution.
> {noformat}
> Locations: N1 N2 N3 N4 N5 N6 N7 N8 N9
> Racks:     R1 R1 R1 R2 R2 R2 R3 R3 R3
> Blocks:    D1                P1 P2 P3
> {noformat}
> We can see that blocks are stored in only 2 racks, not 3.
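> For reference, here is a minimal sketch (plain Java, not HDFS code; the
> constants and class name are made up to mirror RS-6-3-1024k) of why only 4
> of the 9 chosen locations actually receive a block for a sub-cell file:
> {code:java}
> // A sub-cell file needs one data block per started cell plus all parity
> // blocks, so most of the 9 chosen locations stay empty.
> public class SmallEcFile {
>   static final int DATA_UNITS = 6;            // RS-6-3
>   static final int PARITY_UNITS = 3;
>   static final long CELL_SIZE = 1024 * 1024;  // 1024k cell size
>
>   // One data block per started cell, capped at the stripe width.
>   static int dataBlocks(long fileSize) {
>     long startedCells = (fileSize + CELL_SIZE - 1) / CELL_SIZE;
>     return (int) Math.min(DATA_UNITS, Math.max(1, startedCells));
>   }
>
>   public static void main(String[] args) {
>     long fileSize = 500L * 1024;  // a 500 KiB file, smaller than one cell
>     int used = dataBlocks(fileSize) + PARITY_UNITS;
>     // 9 locations are chosen, but only 4 blocks (D1 + P1-P3) are written.
>     System.out.println("chosen=" + (DATA_UNITS + PARITY_UNITS) + " used=" + used);
>   }
> }
> {code}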
> Because the block does not span enough racks, an {{ErasureCodingWork}} is
> later created to replicate the block to a new rack. However, the current
> code tries to copy the block to the first node in the chosen locations,
> regardless of its rack, so the work is not guaranteed to improve the
> situation, and we constantly see {{PendingReconstructionMonitor timed out}}
> messages in the log.
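> A toy illustration (plain Java, not the NameNode code; the rack lists just
> mirror the example above) of why the extra copy only helps if it lands on a
> rack that does not already hold a block of the group:
> {code:java}
> import java.util.*;
>
> // Copying to a node on an already-used rack does not increase the number
> // of racks the group spans, so the placement stays unsatisfactory.
> public class RackSpread {
>   static int distinctRacks(List<String> racksWithBlock) {
>     return new HashSet<>(racksWithBlock).size();
>   }
>
>   public static void main(String[] args) {
>     // From the example above: D1 on R1, P1-P3 on R3.
>     List<String> current = List.of("R1", "R3", "R3", "R3");
>
>     List<String> copyToUsedRack = new ArrayList<>(current);
>     copyToUsedRack.add("R1");   // target chosen regardless of its rack
>     List<String> copyToNewRack = new ArrayList<>(current);
>     copyToNewRack.add("R2");    // target chosen on an unused rack
>
>     System.out.println(distinctRacks(current));        // 2
>     System.out.println(distinctRacks(copyToUsedRack)); // still 2
>     System.out.println(distinctRacks(copyToNewRack));  // 3
>   }
> }
> {code}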
> h2. Proposed solution
> 1. Reorder the chosen locations by rack so that the parity blocks are stored
> in as many racks as possible (see the sketch below).
> 2. Make {{ErasureCodingWork}} try to find a target on a new rack.
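> A minimal sketch of idea 1 (plain Java with a made-up {{Node}} record, not
> the actual patch or Hadoop's {{DatanodeStorageInfo}}): take one node per
> rack in round-robin order so that the first and last slots of the block
> group are spread over as many racks as possible.
> {code:java}
> import java.util.*;
>
> public class RackRoundRobin {
>   record Node(String name, String rack) {}
>
>   static List<Node> reorderByRack(List<Node> chosen) {
>     // Group the chosen locations by rack, preserving the original order.
>     Map<String, Deque<Node>> byRack = new LinkedHashMap<>();
>     for (Node n : chosen) {
>       byRack.computeIfAbsent(n.rack(), r -> new ArrayDeque<>()).add(n);
>     }
>     // Take one node from each rack in turn until every node is placed.
>     List<Node> reordered = new ArrayList<>(chosen.size());
>     while (reordered.size() < chosen.size()) {
>       for (Deque<Node> q : byRack.values()) {
>         if (!q.isEmpty()) {
>           reordered.add(q.poll());
>         }
>       }
>     }
>     return reordered;
>   }
>
>   public static void main(String[] args) {
>     String[] racks = {"R1", "R1", "R1", "R2", "R2", "R2", "R3", "R3", "R3"};
>     List<Node> chosen = new ArrayList<>();
>     for (int i = 0; i < racks.length; i++) {
>       chosen.add(new Node("N" + (i + 1), racks[i]));
>     }
>     // Prints N1 N4 N7 N2 N5 N8 N3 N6 N9: D1 lands on R1 and the parity
>     // blocks on R1, R2 and R3, so all three racks are used.
>     reorderByRack(chosen).forEach(n -> System.out.print(n.name() + " "));
>   }
> }
> {code}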
> h2. Real-world test result
> We first noticed the problem on our HBase cluster running Hadoop 3.3.6 on 18
> nodes across 3 racks. After setting RS-6-3-1024K policy on the HBase data
> directory, we noticed that
> 1. FSCK reports "Unsatisfactory placement block groups" for small EC files.
> {noformat}
> /hbase/***: Replica placement policy is violated for ***. Block should be
> additionally replicated on 2 more rack(s). Total number of racks in the
> cluster: 3
> ...
> Erasure Coded Block Groups:
> ...
> Unsatisfactory placement block groups: 1475 (2.5252092 %)
> {noformat}
> 2. Namenode keeps logging "PendingReconstructionMonitor timed out" messages
> every recheck-interval (5 minutes).
> 3. The {{FSNamesystem.UnderReplicatedBlocks}} metric spikes and then clears
> every recheck-interval.
> After applying the patch, all of these problems are gone: "Unsatisfactory
> placement block groups" is now zero, and there are no more metric spikes or
> "timed out" logs.
> !screenshot-1.png|width=500!
> !image-2024-09-10-13-22-50-247.png|width=500!