Junegunn Choi created HDFS-17620:
------------------------------------
Summary: Better block placement for small EC files
Key: HDFS-17620
URL: https://issues.apache.org/jira/browse/HDFS-17620
Project: Hadoop HDFS
Issue Type: Bug
Components: erasure-coding, namenode
Affects Versions: 3.3.6
Reporter: Junegunn Choi
h2. Problem description
If an erasure-coded file is not large enough to fill the stripe width of the EC
policy, the block distribution can be suboptimal.
For example, an RS-6-3-1024K EC file smaller than 1024K has only 1 data block
and 3 parity blocks. The block placement policy still chooses all 9 (6 + 3)
storage locations, but only 4 of them are used: the first holds the data block
and the last 3 hold the parity blocks. If the cluster has a very small number
of racks (e.g. 3), the current scheme of ordering the locations into a
shortest-path pipeline makes the last nodes likely to end up in the same rack,
resulting in a suboptimal rack distribution.
{noformat}
Locations: N1 N2 N3 N4 N5 N6 N7 N8 N9
Racks:     R1 R1 R1 R2 R2 R2 R3 R3 R3
Blocks:    D1                P1 P2 P3
{noformat}
We can see that blocks are stored in only 2 racks, not 3.
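To make the arithmetic concrete, here is a minimal sketch (plain Java, not Hadoop
code; the node and rack layout simply mirrors the diagram above) that counts the
racks actually covered by the used locations:
{code:java}
import java.util.HashSet;
import java.util.Set;

public class RackCountExample {
  public static void main(String[] args) {
    // Racks of the 9 chosen locations, in pipeline order (matches the diagram).
    String[] racks = {"R1", "R1", "R1", "R2", "R2", "R2", "R3", "R3", "R3"};

    // A file smaller than one cell uses only internal block indices
    // 0 (the single data block) and 6, 7, 8 (the three parity blocks).
    int[] usedIndices = {0, 6, 7, 8};

    Set<String> usedRacks = new HashSet<>();
    for (int i : usedIndices) {
      usedRacks.add(racks[i]);
    }
    // Prints 2: D1 lands on R1 and P1-P3 all land on R3, so the block
    // group spans only 2 of the 3 racks.
    System.out.println("Distinct racks used: " + usedRacks.size());
  }
}
{code}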
Because the block group does not span enough racks, an {{ErasureCodingWork}} is
later created to replicate a block to a new rack. However, the current code
tries to copy the block to the first node among the chosen locations,
regardless of its rack, so the reconstruction is not guaranteed to improve the
situation, and we constantly see {{PendingReconstructionMonitor timed out}}
messages in the log.
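For illustration, a simplified sketch (not the actual {{ErasureCodingWork}}
logic; the node names and the target list are hypothetical) of why copying to
the first chosen node may not add a rack:
{code:java}
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class FirstTargetExample {
  public static void main(String[] args) {
    // Racks that already hold internal blocks of the group (see diagram above).
    Set<String> coveredRacks = new HashSet<>(Arrays.asList("R1", "R3"));

    // Candidate targets in the order the current code considers them; the
    // first one happens to sit on a rack the group already covers.
    List<String> targets = Arrays.asList("N2", "N5");
    Map<String, String> rackOf = Map.of("N2", "R1", "N5", "R2");

    String chosen = targets.get(0); // first node, its rack is not checked
    boolean addsRack = !coveredRacks.contains(rackOf.get(chosen));
    // Prints false: copying to N2 (R1) leaves the group on 2 racks, the
    // placement is still violated, and the reconstruction is retried.
    System.out.println("Copy to " + chosen + " adds a new rack: " + addsRack);
  }
}
{code}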
h2. Proposed solution
1. Reorder the chosen locations by rack so that the parity blocks are stored in
as many racks as possible (see the sketch below).
2. Make {{ErasureCodingWork}} try to find a target on a new rack.
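A minimal sketch of idea (1), assuming a simple round-robin interleave of the
chosen locations by rack (the helper and node names are made up for
illustration; this is not the actual patch):
{code:java}
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RackInterleaveExample {
  /** Reorder locations so that consecutive positions cycle through racks. */
  static List<String> interleaveByRack(List<String> nodes, Map<String, String> rackOf) {
    // Group nodes by rack, preserving the original order within each rack.
    Map<String, Deque<String>> byRack = new LinkedHashMap<>();
    for (String n : nodes) {
      byRack.computeIfAbsent(rackOf.get(n), r -> new ArrayDeque<>()).add(n);
    }
    // Take one node from each rack in turn until all nodes are consumed.
    List<String> out = new ArrayList<>();
    while (out.size() < nodes.size()) {
      for (Deque<String> q : byRack.values()) {
        if (!q.isEmpty()) {
          out.add(q.poll());
        }
      }
    }
    return out;
  }

  public static void main(String[] args) {
    List<String> nodes = Arrays.asList("N1", "N2", "N3", "N4", "N5", "N6", "N7", "N8", "N9");
    Map<String, String> rackOf = new HashMap<>();
    for (int i = 0; i < nodes.size(); i++) {
      rackOf.put(nodes.get(i), "R" + (i / 3 + 1)); // N1-N3 on R1, N4-N6 on R2, N7-N9 on R3
    }
    // Prints [N1, N4, N7, N2, N5, N8, N3, N6, N9]: the parity positions
    // (indices 6, 7, 8) now fall on R1, R2 and R3, so even a 1-data-block
    // group spans all three racks.
    System.out.println(interleaveByRack(nodes, rackOf));
  }
}
{code}
With such an ordering, a small block group already satisfies the rack
requirement, and idea (2) covers the remaining case by preferring a
reconstruction target on a rack the group does not cover yet.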
h2. Real-world test result
We first noticed the problem on our HBase cluster running Hadoop 3.3.6 on 18
nodes across 3 racks. After setting the RS-6-3-1024K policy on the HBase data
directory, we observed the following:
1. FSCK reports "Unsatisfactory placement block groups" for small EC files.
{noformat}
/hbase/***: Replica placement policy is violated for ***. Block should be
additionally replicated on 2 more rack(s). Total number of racks in the
cluster: 3
...
Erasure Coded Block Groups:
...
Unsatisfactory placement block groups: 1475 (2.5252092 %)
{noformat}
2. The Namenode keeps logging "PendingReconstructionMonitor timed out" messages
every recheck-interval (5 minutes).
3. The {{FSNamesystem.UnderReplicatedBlocks}} metric bumps and clears every
recheck-interval.
After applying the patch, all of these problems are gone: "Unsatisfactory
placement block groups" is now zero, and there are no more metric bumps or
"timed out" logs.