[ https://issues.apache.org/jira/browse/HDFS-17599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Haiyang Hu resolved HDFS-17599.
-------------------------------
Resolution: Fixed
> EC: Fix the mismatch between locations and indices for mover
> ------------------------------------------------------------
>
> Key: HDFS-17599
> URL: https://issues.apache.org/jira/browse/HDFS-17599
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: balancer & mover
> Affects Versions: 3.3.0, 3.4.0
> Reporter: Tao Li
> Assignee: Tao Li
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.5.0
>
> Attachments: image-2024-08-03-17-59-08-059.png,
> image-2024-08-03-18-00-01-950.png
>
>
> We use the EC policy (6+3), and some of our nodes were in the
> ENTERING_MAINTENANCE state.
>
> When we moved the data of some directories from SSD to HDD, some block moves
> failed because the disk was full, as shown in the figure below
> (blk_-9223372033441574269).
> We tried the move again and got the error "{color:#ff0000}Replica
> does not exist{color}".
> The fsck output shows that the mover used the wrong block ID
> (blk_-9223372033441574270) when moving the block.
>
> {*}Mover Logs{*}:
> !image-2024-08-03-17-59-08-059.png|width=741,height=85!
>
> {*}FSCK Info{*}:
> !image-2024-08-03-18-00-01-950.png|width=738,height=120!
>
> {*}Root Cause{*}:
> Similar to HDFS-16333: when the mover is initialized, only `LIVE` nodes
> are processed. As a result, a datanode in the `ENTERING_MAINTENANCE` state
> is filtered out of the locations when `DBlockStriped` is initialized, but the
> indices are not adjusted to match, so the locations and indices end up with
> different lengths. The EC block then computes the wrong block ID when getting
> the internal block (see `DBlockStriped#getInternalBlock`).
>
> We added debug logs, and a few key messages are shown below.
> {color:#ff0000}The result is an incorrect correspondence: xx.xx.7.31 ->
> -9223372033441574270{color}.
> {code:java}
> DBlock getInternalBlock(StorageGroup storage) {
>   // storage == xx.xx.7.31
>   // idxInLocs == 1; locations is [xx.xx.85.29:DISK, xx.xx.7.31:DISK,
>   // xx.xx.207.22:DISK, xx.xx.8.25:DISK, xx.xx.79.30:DISK, xx.xx.87.21:DISK,
>   // xx.xx.8.38:DISK]. xx.xx.179.31 is in the ENTERING_MAINTENANCE state
>   // and has been filtered out.
>   int idxInLocs = locations.indexOf(storage);
>   if (idxInLocs == -1) {
>     return null;
>   }
>   // idxInGroup == 2 (indices is [1,2,3,4,5,6,7,8])
>   byte idxInGroup = indices[idxInLocs];
>   // blkId: -9223372033441574272 + 2 = -9223372033441574270
>   long blkId = getBlock().getBlockId() + idxInGroup;
>   long numBytes = getInternalBlockLength(getNumBytes(), cellSize,
>       dataBlockNum, idxInGroup);
>   Block blk = new Block(getBlock());
>   blk.setBlockId(blkId);
>   blk.setNumBytes(numBytes);
>   DBlock dblk = new DBlock(blk);
>   dblk.addLocation(storage);
>   return dblk;
> }
> {code}
> {*}Solution{*}:
> When initializing `DBlockStriped`, if any location is filtered out, we also
> need to remove the corresponding element from the indices so that the two
> stay aligned.
>
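A minimal sketch of the adaptation described under Solution; it is not the actual patch and does not use Hadoop classes. `StripedIndexSketch`, `filter`, and `internalBlockId` are hypothetical names, and datanodes are represented as plain strings. The position of xx.xx.179.31 in the unfiltered location list (second) is an assumption, chosen so that the numbers line up with the block IDs quoted above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;

/** Hypothetical stand-in for the mover's location/index bookkeeping; not Hadoop code. */
final class StripedIndexSketch {

  /** Locations and indices after filtering, kept the same length. */
  static final class Filtered {
    final List<String> locations;
    final byte[] indices;

    Filtered(List<String> locations, byte[] indices) {
      this.locations = locations;
      this.indices = indices;
    }
  }

  /**
   * Keeps only the usable nodes and, for every node that is dropped,
   * also drops the entry at the same position in indices.
   */
  static Filtered filter(List<String> locations, byte[] indices, Set<String> usable) {
    if (locations.size() != indices.length) {
      throw new IllegalArgumentException("locations and indices must have equal length");
    }
    List<String> keptLocations = new ArrayList<>();
    byte[] keptIndices = new byte[indices.length];
    int kept = 0;
    for (int i = 0; i < locations.size(); i++) {
      String node = locations.get(i);
      if (usable.contains(node)) {
        keptLocations.add(node);
        keptIndices[kept++] = indices[i];
      }
    }
    return new Filtered(keptLocations, Arrays.copyOf(keptIndices, kept));
  }

  /** Same arithmetic as the quoted getInternalBlock: group block ID plus index in group. */
  static long internalBlockId(long groupBlockId, List<String> locations,
      byte[] indices, String node) {
    int idxInLocs = locations.indexOf(node);
    if (idxInLocs == -1) {
      throw new IllegalArgumentException("unknown node: " + node);
    }
    return groupBlockId + indices[idxInLocs];
  }
}
```

Under the assumed ordering, with the eight locations and indices [1,2,3,4,5,6,7,8] from the debug log, filtering only the locations makes xx.xx.7.31 resolve to -9223372033441574270, while filtering both lists together through `filter` resolves it to -9223372033441574269.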