[ https://issues.apache.org/jira/browse/HDFS-17599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17870823#comment-17870823 ]
ASF GitHub Bot commented on HDFS-17599:
---------------------------------------

haiyang1987 commented on PR #6980:
URL: https://github.com/apache/hadoop/pull/6980#issuecomment-2267443935

   The code at
   https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/mover/Mover.java#L225-L230
   can be removed:
   ```
   for(MLocation ml : locations) {
     StorageGroup source = storages.getSource(ml);
     if (source != null) {
       db.addLocation(source);
     }
   }
   ```

> Fix the mismatch between locations and indices for mover
> --------------------------------------------------------
>
>                 Key: HDFS-17599
>                 URL: https://issues.apache.org/jira/browse/HDFS-17599
>             Project: Hadoop HDFS
>          Issue Type: Bug
>   Affects Versions: 3.3.0, 3.4.0
>            Reporter: Tao Li
>            Assignee: Tao Li
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: image-2024-08-03-17-59-08-059.png,
> image-2024-08-03-18-00-01-950.png
>
>
> We set the EC policy to (6+3) and also have nodes in the
> ENTERING_MAINTENANCE state.
>
> When we moved the data of some directories from SSD to HDD, some block
> moves failed because the disk was full, as shown in the figure below
> (blk_-9223372033441574269).
> We tried the move again and got the following error:
> "{color:#ff0000}Replica does not exist{color}".
> The fsck output shows that the mover used the wrong block ID
> (blk_-9223372033441574270) when moving the block.
>
> {*}Mover Logs{*}:
> !image-2024-08-03-17-59-08-059.png|width=741,height=85!
>
> {*}FSCK Info{*}:
> !image-2024-08-03-18-00-01-950.png|width=738,height=120!
>
> {*}Root Cause{*}:
> Similar to HDFS-16333: when the mover is initialized, only `LIVE` nodes
> are processed. As a result, a datanode in the `ENTERING_MAINTENANCE` state
> is filtered out of the locations when `DBlockStriped` is initialized, but
> the indices are not adjusted, so the lengths of the locations and the
> indices no longer match.
> Finally, the EC block computes the wrong block ID when getting the
> internal block (see `DBlockStriped#getInternalBlock`).
>
> We added debug logs, and a few key messages are shown below.
> {color:#ff0000}The result is an incorrect correspondence: xx.xx.7.31 ->
> -9223372033441574270{color}.
> {code:java}
> DBlock getInternalBlock(StorageGroup storage) {
>   // storage == xx.xx.7.31
>   // idxInLocs == 1. locations is [xx.xx.85.29:DISK, xx.xx.7.31:DISK,
>   // xx.xx.207.22:DISK, xx.xx.8.25:DISK, xx.xx.79.30:DISK,
>   // xx.xx.87.21:DISK, xx.xx.8.38:DISK]; xx.xx.179.31 is in the
>   // ENTERING_MAINTENANCE state and was filtered out.
>   int idxInLocs = locations.indexOf(storage);
>   if (idxInLocs == -1) {
>     return null;
>   }
>   // idxInGroup == 2 (indices is [1,2,3,4,5,6,7,8])
>   byte idxInGroup = indices[idxInLocs];
>   // blkId: -9223372033441574272 + 2 = -9223372033441574270
>   long blkId = getBlock().getBlockId() + idxInGroup;
>   long numBytes = getInternalBlockLength(getNumBytes(), cellSize,
>       dataBlockNum, idxInGroup);
>   Block blk = new Block(getBlock());
>   blk.setBlockId(blkId);
>   blk.setNumBytes(numBytes);
>   DBlock dblk = new DBlock(blk);
>   dblk.addLocation(storage);
>   return dblk;
> } {code}
> {*}Solution{*}:
> When initializing DBlockStriped, if any location is filtered out, we need
> to remove the corresponding element from the indices so that the two stay
> aligned.
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
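
The root cause and solution described in the issue can be illustrated with a small standalone sketch. It is not the Hadoop code or the actual patch: `StripedIndexDemo`, `liveLocations`, `liveIndices`, `internalBlockId` and the `dn0`..`dn7` node names are hypothetical, and the position of the maintenance node (`dn1`) in the unfiltered list is an assumption chosen so that the arithmetic reproduces the two block IDs quoted in the issue.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the locations/indices bookkeeping; not Hadoop code.
final class StripedIndexDemo {

  /** Keeps only the locations that are in the live set, preserving order. */
  static List<String> liveLocations(List<String> locations, Set<String> live) {
    List<String> kept = new ArrayList<>();
    for (String loc : locations) {
      if (live.contains(loc)) {
        kept.add(loc);
      }
    }
    return kept;
  }

  /** Drops the indices entry of every location that is not live. */
  static byte[] liveIndices(List<String> locations, byte[] indices,
      Set<String> live) {
    byte[] kept = new byte[indices.length];
    int n = 0;
    for (int i = 0; i < locations.size(); i++) {
      if (live.contains(locations.get(i))) {
        kept[n++] = indices[i];
      }
    }
    return Arrays.copyOf(kept, n);
  }

  /** Internal block ID = block group ID + index of the storage in the group. */
  static long internalBlockId(long groupId, List<String> locations,
      byte[] indices, String storage) {
    int idxInLocs = locations.indexOf(storage);
    if (idxInLocs == -1) {
      throw new IllegalArgumentException("unknown storage: " + storage);
    }
    return groupId + indices[idxInLocs];
  }

  public static void main(String[] args) {
    long groupId = -9223372033441574272L;
    List<String> all = Arrays.asList(
        "dn0", "dn1", "dn2", "dn3", "dn4", "dn5", "dn6", "dn7");
    byte[] indices = {1, 2, 3, 4, 5, 6, 7, 8};
    // Assumption: dn1 is the ENTERING_MAINTENANCE node, so it is not live.
    Set<String> live = new HashSet<>(all);
    live.remove("dn1");

    List<String> locations = liveLocations(all, live);

    // Mismatch: locations were filtered, indices were not.
    // Prints -9223372033441574270 (the wrong block ID from the issue).
    System.out.println(internalBlockId(groupId, locations, indices, "dn2"));

    // Adapted: the matching indices entry is removed as well.
    // Prints -9223372033441574269 (the block that actually failed to move).
    System.out.println(internalBlockId(groupId, locations,
        liveIndices(all, indices, live), "dn2"));
  }
}
```

The point of the sketch is only that locations and indices are parallel arrays, so any filter applied to one has to be applied to the other at the same positions.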