[jira] [Updated] (HDFS-17609) [FGL] Fix lock mode in some RPC

2024-08-29 Thread farmmamba (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

farmmamba updated HDFS-17609:
-
Parent: HDFS-17384
Issue Type: Sub-task  (was: Improvement)

> [FGL] Fix lock mode in some RPC
> ---
>
> Key: HDFS-17609
> URL: https://issues.apache.org/jira/browse/HDFS-17609
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17594) [ARR] RouterCacheAdmin supports asynchronous rpc.

2024-08-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877702#comment-17877702
 ] 

ASF GitHub Bot commented on HDFS-17594:
---

Archie-wang commented on code in PR #6986:
URL: https://github.com/apache/hadoop/pull/6986#discussion_r1736172568


##
hadoop-hdfs-project/hadoop-hdfs-rbf/src/main/java/org/apache/hadoop/hdfs/server/federation/router/RouterAsyncCacheAdmin.java:
##
@@ -0,0 +1,75 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hdfs.server.federation.router;
+
+import org.apache.hadoop.fs.CacheFlag;
+import org.apache.hadoop.hdfs.protocol.CacheDirectiveEntry;
+import org.apache.hadoop.hdfs.protocol.CacheDirectiveInfo;
+import org.apache.hadoop.hdfs.protocol.CachePoolEntry;
+import org.apache.hadoop.fs.BatchedRemoteIterator.BatchedEntries;
+import org.apache.hadoop.hdfs.server.federation.resolver.FederationNamespaceInfo;
+import org.apache.hadoop.hdfs.server.federation.resolver.RemoteLocation;
+import org.apache.hadoop.hdfs.server.federation.router.async.ApplyFunction;
+
+import java.io.IOException;
+import java.util.EnumSet;
+import java.util.Map;
+
+import static org.apache.hadoop.hdfs.server.federation.router.async.AsyncUtil.asyncApply;
+import static org.apache.hadoop.hdfs.server.federation.router.async.AsyncUtil.asyncReturn;
+
+/**
+ * Module that implements all the asynchronous RPC calls in
+ * {@link org.apache.hadoop.hdfs.protocol.ClientProtocol} related to Cache Admin
+ * in the {@link RouterRpcServer}.
+ */
+public class RouterAsyncCacheAdmin extends RouterCacheAdmin{
+
+  public RouterAsyncCacheAdmin(RouterRpcServer server) {
+    super(server);
+  }
+
+  @Override
+  public long addCacheDirective(
+      CacheDirectiveInfo path, EnumSet<CacheFlag> flags) throws IOException {
+    invokeAddCacheDirective(path, flags);
+    asyncApply((ApplyFunction<Map<RemoteLocation, Long>, Long>) 

Review Comment:
   done





> [ARR] RouterCacheAdmin supports asynchronous rpc.
> -
>
> Key: HDFS-17594
> URL: https://issues.apache.org/jira/browse/HDFS-17594
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Reporter: Jian Zhang
>Assignee: Jian Zhang
>Priority: Major
>  Labels: pull-request-available
>
> *Describe*
> The main new addition is RouterAsyncCacheAdmin, which extends 
> RouterCacheAdmin so that cache admin supports asynchronous rpc.
>  
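For readers unfamiliar with the pattern in the diff above, here is a minimal, self-contained sketch of its general shape: start a call, chain a transformation onto it, and return a placeholder to the caller immediately. `MiniAsync` and `AsyncCacheAdminSketch` are hypothetical names built on `java.util.concurrent.CompletableFuture`; they are not Hadoop's `AsyncUtil` or the code in PR #6986, and the transformation that picks the first reply value is an assumption.

```java
import java.util.Collections;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

/** Hypothetical method shaped like the addCacheDirective override in the diff. */
final class AsyncCacheAdminSketch {
  static long addCacheDirective(Map<String, Long> simulatedReply) {
    // Stands in for invokeAddCacheDirective(path, flags): starts the remote call.
    MiniAsync.start(CompletableFuture.supplyAsync(() -> simulatedReply));
    // Assumed transformation: take the first ID from the per-location replies.
    MiniAsync.asyncApply((Function<Map<String, Long>, Long>)
        reply -> reply.values().iterator().next());
    // The calling thread is released right away with a placeholder value.
    return MiniAsync.asyncReturnLong();
  }

  public static void main(String[] args) {
    long placeholder = addCacheDirective(Collections.singletonMap("ns0", 42L));
    System.out.println("placeholder = " + placeholder);
    System.out.println("real result = " + MiniAsync.current().join());
  }
}

/** Hypothetical, simplified helper; not Hadoop's AsyncUtil. */
final class MiniAsync {
  private static final ThreadLocal<CompletableFuture<Object>> CURRENT =
      new ThreadLocal<>();

  /** Registers the in-flight call of the current thread. */
  static <T> void start(CompletableFuture<T> call) {
    CURRENT.set(call.thenApply(value -> (Object) value));
  }

  /** Chains a transformation onto the in-flight call without blocking. */
  @SuppressWarnings("unchecked")
  static <T, R> void asyncApply(Function<T, R> function) {
    CompletableFuture<Object> current = CURRENT.get();
    CURRENT.set(current.<Object>thenApply(value -> function.apply((T) value)));
  }

  /** Returns a placeholder immediately; the real value arrives via current(). */
  static long asyncReturnLong() {
    return 0L;
  }

  /** Exposes the chained future so the response can be completed later. */
  static CompletableFuture<Object> current() {
    return CURRENT.get();
  }
}
```

In this sketch the method's return value is only a placeholder; whoever owns the chained future sends the real result once it completes.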






[jira] [Commented] (HDFS-17594) [ARR] RouterCacheAdmin supports asynchronous rpc.

2024-08-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877762#comment-17877762
 ] 

ASF GitHub Bot commented on HDFS-17594:
---

hadoop-yetus commented on PR #6986:
URL: https://github.com/apache/hadoop/pull/6986#issuecomment-2318280824

   :confetti_ball: **+1 overall**

   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |  18m 22s |  |  Docker mode activated.  |
   |||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  1s |  |  No case conflicting files found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  |
   | +0 :ok: |  xmllint  |   0m  0s |  |  xmllint was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to include 4 new or modified test files.  |
   |||| _ HDFS-17531 Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  49m 58s |  |  HDFS-17531 passed  |
   | +1 :green_heart: |  compile  |   0m 51s |  |  HDFS-17531 passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04  |
   | +1 :green_heart: |  compile  |   0m 39s |  |  HDFS-17531 passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05  |
   | +1 :green_heart: |  checkstyle  |   0m 31s |  |  HDFS-17531 passed  |
   | +1 :green_heart: |  mvnsite  |   0m 44s |  |  HDFS-17531 passed  |
   | +1 :green_heart: |  javadoc  |   0m 44s |  |  HDFS-17531 passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04  |
   | +1 :green_heart: |  javadoc  |   0m 31s |  |  HDFS-17531 passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05  |
   | +1 :green_heart: |  spotbugs  |   1m 27s |  |  HDFS-17531 passed  |
   | +1 :green_heart: |  shadedclient  |  39m 13s |  |  branch has no errors when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 34s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 43s |  |  the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04  |
   | +1 :green_heart: |  javac  |   0m 43s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 31s |  |  the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05  |
   | +1 :green_heart: |  javac  |   0m 31s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks issues.  |
   | +1 :green_heart: |  checkstyle  |   0m 20s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   0m 34s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 30s |  |  the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04  |
   | +1 :green_heart: |  javadoc  |   0m 25s |  |  the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05  |
   | +1 :green_heart: |  spotbugs  |   1m 28s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  39m 23s |  |  patch has no errors when building and testing our client artifacts.  |
   |||| _ Other Tests _ |
   | +1 :green_heart: |  unit  |  34m 19s |  |  hadoop-hdfs-rbf in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 37s |  |  The patch does not generate ASF License warnings.  |
   |  |   | 194m 23s |  |  |
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.47 ServerAPI=1.47 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6986/2/artifact/out/Dockerfile |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6986 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets xmllint |
   | uname | Linux 966aea133556 5.15.0-117-generic #127-Ubuntu SMP Fri Jul 5 20:13:28 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | HDFS-17531 / aeecad6efb3deb638059b6cf3e7790f9aec009f5 |
   | Default Java | Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
   | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
   |  Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6986/2/testReport/ |
   | Max. process+thread count | 3497 (vs. ulimit of 5500) |
   | modules | C: hadoop-hdfs-project/hadoop-hdfs-rbf U: hadoop-hdfs-project/hadoop-hdfs-rbf |
   | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6986/2/console |
   | versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
   | Powered by | Apache Yetus 0.14.0 https://yetus.apache.

[jira] [Commented] (HDFS-17599) EC: Fix the mismatch between locations and indices for mover

2024-08-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877962#comment-17877962
 ] 

ASF GitHub Bot commented on HDFS-17599:
---

haiyang1987 merged PR #6980:
URL: https://github.com/apache/hadoop/pull/6980




> EC: Fix the mismatch between locations and indices for mover
> 
>
> Key: HDFS-17599
> URL: https://issues.apache.org/jira/browse/HDFS-17599
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Tao Li
>Assignee: Tao Li
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2024-08-03-17-59-08-059.png, 
> image-2024-08-03-18-00-01-950.png
>
>
> We set the EC policy to (6+3) and also have nodes in the 
> ENTERING_MAINTENANCE state.
>  
> When we moved the data of some directories from SSD to HDD, some block moves 
> failed because the disk was full, as shown in the figure below 
> (blk_-9223372033441574269).
> We tried the move again and got the error "{color:#ff}Replica 
> does not exist{color}".
> The fsck output shows that the wrong block ID 
> (blk_-9223372033441574270) was used when moving the block.
>  
> {*}Mover Logs{*}:
> !image-2024-08-03-17-59-08-059.png|width=741,height=85!
>  
> {*}FSCK Info{*}:
> !image-2024-08-03-18-00-01-950.png|width=738,height=120!
>  
> {*}Root Cause{*}:
> As in HDFS-16333, when the mover is initialized, only `LIVE` nodes 
> are processed. As a result, a datanode in the `ENTERING_MAINTENANCE` state 
> is filtered out of the locations when `DBlockStriped` is initialized, but the 
> indices are not adjusted, so the locations and indices end up with different 
> lengths. The EC block then computes the wrong block ID when getting an 
> internal block (see `DBlockStriped#getInternalBlock`).
>  
> We added debug logs; a few key messages are shown below. 
> {color:#ff}The result is an incorrect mapping: xx.xx.7.31 -> 
> -9223372033441574270{color}.
> {code:java}
> DBlock getInternalBlock(StorageGroup storage) {
>   // storage == xx.xx.7.31
>   // idxInLocs == 1; locations is [xx.xx.85.29:DISK, xx.xx.7.31:DISK, 
>   // xx.xx.207.22:DISK, xx.xx.8.25:DISK, xx.xx.79.30:DISK, xx.xx.87.21:DISK, 
>   // xx.xx.8.38:DISK] because xx.xx.179.31, which is in the 
>   // ENTERING_MAINTENANCE state, was filtered out
>   int idxInLocs = locations.indexOf(storage);
>   if (idxInLocs == -1) {
>     return null;
>   }
>   // idxInGroup == 2 (indices is [1,2,3,4,5,6,7,8])
>   byte idxInGroup = indices[idxInLocs];
>   // blkId: -9223372033441574272 + 2 = -9223372033441574270
>   long blkId = getBlock().getBlockId() + idxInGroup;
>   long numBytes = getInternalBlockLength(getNumBytes(), cellSize,
>       dataBlockNum, idxInGroup);
>   Block blk = new Block(getBlock());
>   blk.setBlockId(blkId);
>   blk.setNumBytes(numBytes);
>   DBlock dblk = new DBlock(blk);
>   dblk.addLocation(storage);
>   return dblk;
> } {code}
> {*}Solution{*}:
> When initializing `DBlockStriped`, if any location is filtered out, the 
> corresponding element must also be removed from the indices so that the two 
> stay aligned.
>  
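The mismatch and the proposed adaptation can be reproduced in isolation. The following is a minimal sketch, not the Hadoop implementation or the merged patch: `StripedIndexSketch` and its methods are hypothetical, storages are plain strings, and it assumes the filtered node xx.xx.179.31 was first in the original locations list (the description gives only the list after filtering).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;

/** Hypothetical stand-in for the locations/indices bookkeeping; not Hadoop code. */
final class StripedIndexSketch {

  /** Locations and indices after filtering, always the same length. */
  static final class Filtered {
    final List<String> locations;
    final byte[] indices;

    Filtered(List<String> locations, byte[] indices) {
      this.locations = locations;
      this.indices = indices;
    }
  }

  /** Drops every excluded location together with the index at the same position. */
  static Filtered filter(List<String> locations, byte[] indices, Set<String> excluded) {
    List<String> keptLocations = new ArrayList<>();
    byte[] keptIndices = new byte[indices.length];
    int kept = 0;
    for (int i = 0; i < locations.size(); i++) {
      String location = locations.get(i);
      if (!excluded.contains(location)) {
        keptLocations.add(location);
        keptIndices[kept++] = indices[i];
      }
    }
    return new Filtered(keptLocations, Arrays.copyOf(keptIndices, kept));
  }

  /** Internal block ID = block group ID + index of the storage within the group. */
  static long internalBlockId(long groupId, List<String> locations,
      byte[] indices, String storage) {
    int idxInLocs = locations.indexOf(storage);
    if (idxInLocs == -1) {
      throw new IllegalArgumentException("unknown storage: " + storage);
    }
    return groupId + indices[idxInLocs];
  }

  public static void main(String[] args) {
    long groupId = -9223372033441574272L;
    // Assumed original order: the filtered node xx.xx.179.31 comes first.
    List<String> all = Arrays.asList("xx.xx.179.31", "xx.xx.85.29", "xx.xx.7.31",
        "xx.xx.207.22", "xx.xx.8.25", "xx.xx.79.30", "xx.xx.87.21", "xx.xx.8.38");
    byte[] indices = {1, 2, 3, 4, 5, 6, 7, 8};
    Filtered f = filter(all, indices, Collections.singleton("xx.xx.179.31"));

    // Mismatch: filtered locations paired with the unfiltered indices.
    System.out.println(internalBlockId(groupId, f.locations, indices, "xx.xx.7.31"));
    // Aligned: filtered locations paired with the filtered indices.
    System.out.println(internalBlockId(groupId, f.locations, f.indices, "xx.xx.7.31"));
  }
}
```

Under that assumption the first call prints -9223372033441574270, the wrong ID from the debug log, and the second prints -9223372033441574269, the block that originally failed to move.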






[jira] [Commented] (HDFS-17599) EC: Fix the mismatch between locations and indices for mover

2024-08-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877963#comment-17877963
 ] 

ASF GitHub Bot commented on HDFS-17599:
---

haiyang1987 commented on PR #6980:
URL: https://github.com/apache/hadoop/pull/6980#issuecomment-2320052315

   Committed to trunk. Thanks @tomscut for your work, and @Hexiaoqiao for 
your reviews.
   
   










[jira] [Updated] (HDFS-17599) EC: Fix the mismatch between locations and indices for mover

2024-08-29 Thread Haiyang Hu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haiyang Hu updated HDFS-17599:
--
Component/s: balancer & mover

> EC: Fix the mismatch between locations and indices for mover
> 
>
> Key: HDFS-17599
> URL: https://issues.apache.org/jira/browse/HDFS-17599
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Tao Li
>Assignee: Tao Li
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2024-08-03-17-59-08-059.png, 
> image-2024-08-03-18-00-01-950.png
>
>






[jira] [Resolved] (HDFS-17599) EC: Fix the mismatch between locations and indices for mover

2024-08-29 Thread Haiyang Hu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haiyang Hu resolved HDFS-17599.
---
Resolution: Fixed

> EC: Fix the mismatch between locations and indices for mover
> 
>
> Key: HDFS-17599
> URL: https://issues.apache.org/jira/browse/HDFS-17599
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Tao Li
>Assignee: Tao Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0
>
> Attachments: image-2024-08-03-17-59-08-059.png, 
> image-2024-08-03-18-00-01-950.png
>
>






[jira] [Updated] (HDFS-17599) EC: Fix the mismatch between locations and indices for mover

2024-08-29 Thread Haiyang Hu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haiyang Hu updated HDFS-17599:
--
Fix Version/s: 3.5.0



