Tao Li created HDFS-17599:
-----------------------------
Summary: Fix the mismatch between locations and indices for mover
Key: HDFS-17599
URL: https://issues.apache.org/jira/browse/HDFS-17599
Project: Hadoop HDFS
Issue Type: Bug
Affects Versions: 3.4.0, 3.3.0
Reporter: Tao Li
Assignee: Tao Li
Attachments: image-2024-08-03-17-59-08-059.png,
image-2024-08-03-18-00-01-950.png
We use the EC policy (6+3), and some of our nodes were in the
ENTERING_MAINTENANCE state.
When we moved the data of some directories from SSD to HDD, some block moves
failed because the disk was full, as shown in the figure below (blk_-9223372033441574269).
We tried the move again and got the following error: "{color:#FF0000}Replica
does not exist{color}".
The fsck output shows that the mover used the wrong block ID
(blk_-9223372033441574270) when moving the block.
{*}Mover Logs{*}:
!image-2024-08-03-17-59-08-059.png|width=741,height=85!
{*}FSCK Info{*}:
!image-2024-08-03-18-00-01-950.png|width=738,height=120!
{*}Root Cause{*}:
This is similar to HDFS-16333: when the mover is initialized, only `LIVE` nodes
are processed. As a result, a datanode in the `ENTERING_MAINTENANCE` state is
filtered out of the locations, but the indices are not adjusted to match, so
the lengths of the locations and the indices no longer agree. The EC block then
computes the wrong block ID when getting the internal block (see
`DBlockStriped#getInternalBlock`).
We added debug logs; the key values are shown as comments in the code below.
{color:#FF0000}The result is an incorrect mapping: xx.xx.7.31 ->
-9223372033441574270{color}.
{code:java}
DBlock getInternalBlock(StorageGroup storage) {
  // storage == xx.xx.7.31
  // idxInLocs == 1; locations is [xx.xx.85.29:DISK, xx.xx.7.31:DISK,
  // xx.xx.207.22:DISK, xx.xx.8.25:DISK, xx.xx.79.30:DISK, xx.xx.87.21:DISK,
  // xx.xx.8.38:DISK]. xx.xx.179.31, which is in the ENTERING_MAINTENANCE
  // state, was filtered out.
  int idxInLocs = locations.indexOf(storage);
  if (idxInLocs == -1) {
    return null;
  }
  // idxInGroup == 2 (indices is [1,2,3,4,5,6,7,8])
  byte idxInGroup = indices[idxInLocs];
  // blkId: -9223372033441574272 + 2 = -9223372033441574270
  long blkId = getBlock().getBlockId() + idxInGroup;
  long numBytes = getInternalBlockLength(getNumBytes(), cellSize,
      dataBlockNum, idxInGroup);
  Block blk = new Block(getBlock());
  blk.setBlockId(blkId);
  blk.setNumBytes(numBytes);
  DBlock dblk = new DBlock(blk);
  dblk.addLocation(storage);
  return dblk;
}
{code}
{*}Solution{*}:
When initializing DBlockStriped, if any location is filtered out, we need to
remove the corresponding element from the indices so that the two stay aligned.
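The adaptation can be sketched as follows. This is a minimal, standalone illustration and not the actual patch: the class `StripedIndexFilterSketch`, its methods, and the node names `dnM`, `dnA`, `dnB` are hypothetical, and the position of the ENTERING_MAINTENANCE node in the original location list is assumed for the example. Only the block group ID and the two resulting block IDs are taken from the report above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: keep locations and indices aligned while filtering.
class StripedIndexFilterSketch {

  /** Surviving locations together with their matching block indices. */
  static final class Filtered {
    final List<String> locations;
    final byte[] indices;

    Filtered(List<String> locations, byte[] indices) {
      this.locations = locations;
      this.indices = indices;
    }
  }

  /** Drops non-live locations and the index at the same position. */
  static Filtered filterLive(List<String> locations, byte[] indices,
      Set<String> liveNodes) {
    if (locations.size() != indices.length) {
      throw new IllegalArgumentException("locations and indices differ in length");
    }
    List<String> keptLocs = new ArrayList<>();
    byte[] keptIdx = new byte[indices.length];
    int n = 0;
    for (int i = 0; i < locations.size(); i++) {
      if (liveNodes.contains(locations.get(i))) {
        keptLocs.add(locations.get(i));
        keptIdx[n++] = indices[i]; // the index travels with its location
      }
    }
    return new Filtered(keptLocs, Arrays.copyOf(keptIdx, n));
  }

  /** Same arithmetic as getInternalBlock: group ID plus index in group. */
  static long internalBlockId(long groupId, List<String> locations,
      byte[] indices, String storage) {
    int idxInLocs = locations.indexOf(storage);
    if (idxInLocs == -1) {
      throw new IllegalArgumentException("unknown storage: " + storage);
    }
    return groupId + indices[idxInLocs];
  }

  public static void main(String[] args) {
    long groupId = -9223372033441574272L;
    // Assumed layout: dnM is ENTERING_MAINTENANCE and comes first.
    List<String> locations = Arrays.asList("dnM", "dnA", "dnB");
    byte[] indices = {1, 2, 3};
    Set<String> live = new HashSet<>(Arrays.asList("dnA", "dnB"));

    Filtered f = filterLive(locations, indices, live);

    // Filtered locations with the unfiltered indices: off by one position.
    System.out.println("mismatched: "
        + internalBlockId(groupId, f.locations, indices, "dnB"));
    // Filtered locations with the filtered indices: correct internal block.
    System.out.println("adapted:    "
        + internalBlockId(groupId, f.locations, f.indices, "dnB"));
  }
}
```

With the assumed layout, the mismatched lookup yields -9223372033441574270 and the adapted lookup yields -9223372033441574269, which matches the IDs seen in the mover logs and fsck output.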