[ 
https://issues.apache.org/jira/browse/HDFS-17599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17870823#comment-17870823
 ] 

ASF GitHub Bot commented on HDFS-17599:
---------------------------------------

haiyang1987 commented on PR #6980:
URL: https://github.com/apache/hadoop/pull/6980#issuecomment-2267443935

   The code at
https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/mover/Mover.java#L225-L230
 can be removed:
   ```
   for (MLocation ml : locations) {
     StorageGroup source = storages.getSource(ml);
     if (source != null) {
       db.addLocation(source);
     }
   }
   ``` 
   




> Fix the mismatch between locations and indices for mover
> --------------------------------------------------------
>
>                 Key: HDFS-17599
>                 URL: https://issues.apache.org/jira/browse/HDFS-17599
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 3.3.0, 3.4.0
>            Reporter: Tao Li
>            Assignee: Tao Li
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: image-2024-08-03-17-59-08-059.png, 
> image-2024-08-03-18-00-01-950.png
>
>
> We set the EC policy to (6+3) and also have nodes that were in state 
> ENTERING_MAINTENANCE.
>  
> When we moved the data of some directories from SSD to HDD, some blocks 
> failed to move because the disk was full, as shown in the figure below 
> (blk_-9223372033441574269).
> We tried the move again and got the error "{color:#ff0000}Replica 
> does not exist{color}".
> The fsck output shows that the mover used the wrong block ID 
> (blk_-9223372033441574270) when moving the block.
>  
> {*}Mover Logs{*}:
> !image-2024-08-03-17-59-08-059.png|width=741,height=85!
>  
> {*}FSCK Info{*}:
> !image-2024-08-03-18-00-01-950.png|width=738,height=120!
>  
> {*}Root Cause{*}:
> Similar to HDFS-16333: when the mover is initialized, only `LIVE` nodes 
> are processed. As a result, a datanode in the `ENTERING_MAINTENANCE` state 
> is filtered out of the locations when `DBlockStriped` is initialized, but the 
> indices are not adjusted, so the locations and indices end up with different 
> lengths. The EC block then computes the wrong block ID when getting an 
> internal block (see `DBlockStriped#getInternalBlock`).
>  
> We added debug logs, and a few key messages are shown below. 
> {color:#ff0000}The result is an incorrect correspondence: xx.xx.7.31 -> 
> -9223372033441574270{color}.
> {code:java}
> DBlock getInternalBlock(StorageGroup storage) {
>   // storage == xx.xx.7.31
>   // idxInLocs == 1; locations is [xx.xx.85.29:DISK, xx.xx.7.31:DISK,
>   // xx.xx.207.22:DISK, xx.xx.8.25:DISK, xx.xx.79.30:DISK, xx.xx.87.21:DISK,
>   // xx.xx.8.38:DISK]; xx.xx.179.31, which is in the ENTERING_MAINTENANCE
>   // state, has been filtered out
>   int idxInLocs = locations.indexOf(storage);
>   if (idxInLocs == -1) {
>     return null;
>   }
>   // idxInGroup == 2 (indices is [1,2,3,4,5,6,7,8])   
>   byte idxInGroup = indices[idxInLocs];
>   // blkId: -9223372033441574272 + 2 = -9223372033441574270
>   long blkId = getBlock().getBlockId() + idxInGroup;
>   long numBytes = getInternalBlockLength(getNumBytes(), cellSize,
>       dataBlockNum, idxInGroup);
>   Block blk = new Block(getBlock());
>   blk.setBlockId(blkId);
>   blk.setNumBytes(numBytes);
>   DBlock dblk = new DBlock(blk);
>   dblk.addLocation(storage);
>   return dblk;
> } {code}
> {*}Solution{*}:
> When initializing DBlockStriped, if any location is filtered out, we need to 
> remove the corresponding element from the indices so that the two stay aligned.
>  
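
The solution described above can be sketched in isolation as follows. This is only an illustration, not the code from the PR: the class `StripedIndexAlignment`, the helper `filter`, and the plain-string stand-ins for `StorageGroup` are hypothetical, and the position of xx.xx.179.31 at the head of the original location list is an assumption inferred from the block IDs in the description.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StripedIndexAlignment {
  /** Locations and indices that have been filtered in lockstep. */
  static final class Aligned {
    final List<String> locations;
    final byte[] indices;

    Aligned(List<String> locations, byte[] indices) {
      this.locations = locations;
      this.indices = indices;
    }
  }

  /** Hypothetical helper: drops each unusable location together with its index. */
  static Aligned filter(List<String> locations, byte[] indices, Set<String> usable) {
    List<String> keptLocations = new ArrayList<>();
    byte[] keptIndices = new byte[indices.length];
    int n = 0;
    for (int i = 0; i < locations.size(); i++) {
      if (usable.contains(locations.get(i))) {
        keptLocations.add(locations.get(i));
        keptIndices[n++] = indices[i];
      }
    }
    return new Aligned(keptLocations, Arrays.copyOf(keptIndices, n));
  }

  /** Same arithmetic as getInternalBlock: block group ID plus index in group. */
  static long internalBlockId(long groupId, List<String> locations,
      byte[] indices, String storage) {
    int idxInLocs = locations.indexOf(storage);
    if (idxInLocs == -1) {
      throw new IllegalArgumentException("unknown storage: " + storage);
    }
    return groupId + indices[idxInLocs];
  }

  public static void main(String[] args) {
    long groupId = -9223372033441574272L;
    // Assumed original order: the ENTERING_MAINTENANCE node comes first.
    List<String> all = Arrays.asList("xx.xx.179.31", "xx.xx.85.29",
        "xx.xx.7.31", "xx.xx.207.22", "xx.xx.8.25", "xx.xx.79.30",
        "xx.xx.87.21", "xx.xx.8.38");
    byte[] indices = {1, 2, 3, 4, 5, 6, 7, 8};
    Set<String> live = new HashSet<>(all);
    live.remove("xx.xx.179.31");

    Aligned aligned = filter(all, indices, live);
    // Filtered locations paired with the unfiltered indices (the bug).
    System.out.println(
        internalBlockId(groupId, aligned.locations, indices, "xx.xx.7.31"));
    // Filtered locations paired with the filtered indices (the fix).
    System.out.println(
        internalBlockId(groupId, aligned.locations, aligned.indices, "xx.xx.7.31"));
  }
}
```

Under the assumed ordering, the first call reproduces the wrong ID -9223372033441574270 from the debug logs, and the second yields -9223372033441574269, the block that originally failed to move.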



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
