[jira] [Updated] (HDFS-17609) [FGL] Fix lock mode in some RPC
[ https://issues.apache.org/jira/browse/HDFS-17609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] farmmamba updated HDFS-17609: - Parent: HDFS-17384 Issue Type: Sub-task (was: Improvement) > [FGL] Fix lock mode in some RPC > --- > > Key: HDFS-17609 > URL: https://issues.apache.org/jira/browse/HDFS-17609 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: farmmamba >Assignee: farmmamba >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-17594) [ARR] RouterCacheAdmin supports asynchronous rpc.
[ https://issues.apache.org/jira/browse/HDFS-17594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877702#comment-17877702 ] ASF GitHub Bot commented on HDFS-17594: --- Archie-wang commented on code in PR #6986: URL: https://github.com/apache/hadoop/pull/6986#discussion_r1736172568

## hadoop-hdfs-project/hadoop-hdfs-rbf/src/main/java/org/apache/hadoop/hdfs/server/federation/router/RouterAsyncCacheAdmin.java:
## @@ -0,0 +1,75 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hdfs.server.federation.router;
+
+import org.apache.hadoop.fs.CacheFlag;
+import org.apache.hadoop.hdfs.protocol.CacheDirectiveEntry;
+import org.apache.hadoop.hdfs.protocol.CacheDirectiveInfo;
+import org.apache.hadoop.hdfs.protocol.CachePoolEntry;
+import org.apache.hadoop.fs.BatchedRemoteIterator.BatchedEntries;
+import org.apache.hadoop.hdfs.server.federation.resolver.FederationNamespaceInfo;
+import org.apache.hadoop.hdfs.server.federation.resolver.RemoteLocation;
+import org.apache.hadoop.hdfs.server.federation.router.async.ApplyFunction;
+
+import java.io.IOException;
+import java.util.EnumSet;
+import java.util.Map;
+
+import static org.apache.hadoop.hdfs.server.federation.router.async.AsyncUtil.asyncApply;
+import static org.apache.hadoop.hdfs.server.federation.router.async.AsyncUtil.asyncReturn;
+
+/**
+ * Module that implements all the asynchronous RPC calls in
+ * {@link org.apache.hadoop.hdfs.protocol.ClientProtocol} related to Cache Admin
+ * in the {@link RouterRpcServer}.
+ */
+public class RouterAsyncCacheAdmin extends RouterCacheAdmin {
+
+  public RouterAsyncCacheAdmin(RouterRpcServer server) {
+    super(server);
+  }
+
+  @Override
+  public long addCacheDirective(
+      CacheDirectiveInfo path, EnumSet<CacheFlag> flags) throws IOException {
+    invokeAddCacheDirective(path, flags);
+    asyncApply((ApplyFunction, Long>)

Review Comment: done

> [ARR] RouterCacheAdmin supports asynchronous rpc.
> -
>
> Key: HDFS-17594
> URL: https://issues.apache.org/jira/browse/HDFS-17594
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: rbf
> Reporter: Jian Zhang
> Assignee: Jian Zhang
> Priority: Major
> Labels: pull-request-available
>
> *Describe*
> The main new addition is RouterAsyncCacheAdmin, which extends
> RouterCacheAdmin so that cache admin supports asynchronous rpc.
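The pattern in this diff — fire the underlying invocation, then chain a transformation onto the eventual result with asyncApply/asyncReturn — can be sketched with plain CompletableFuture. This is an analogy, not Hadoop's actual AsyncUtil API; the class name and the returned value (42L) are made up for illustration:

```java
import java.util.concurrent.CompletableFuture;

public class AsyncCacheAdminSketch {
    // Hypothetical stand-in for invokeAddCacheDirective: the router would
    // dispatch the RPC downstream and return immediately with a future.
    static CompletableFuture<Long> invokeAddCacheDirective() {
        return CompletableFuture.supplyAsync(() -> 42L);
    }

    // Mirrors the invoke-then-asyncApply shape: the transformation is chained
    // onto the pending result, so the handler thread is never blocked waiting
    // for the downstream call to respond.
    static CompletableFuture<Long> addCacheDirectiveAsync() {
        return invokeAddCacheDirective()
            .thenApply(directiveId -> directiveId); // asyncApply analog
    }

    public static void main(String[] args) {
        // join() here only so the sketch can print the settled value.
        System.out.println(addCacheDirectiveAsync().join()); // prints 42
    }
}
```

The payoff is the same as in RouterAsyncCacheAdmin: the RPC handler returns at once, and the real response is filled in when the downstream call completes.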
[jira] [Commented] (HDFS-17594) [ARR] RouterCacheAdmin supports asynchronous rpc.
[ https://issues.apache.org/jira/browse/HDFS-17594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877762#comment-17877762 ] ASF GitHub Bot commented on HDFS-17594: --- hadoop-yetus commented on PR #6986: URL: https://github.com/apache/hadoop/pull/6986#issuecomment-2318280824

:confetti_ball: **+1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|:--------|:-------:|:-------:|
| +0 :ok: | reexec | 18m 22s | | Docker mode activated. |
|||| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 1s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 0s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. |
| +0 :ok: | xmllint | 0m 0s | | xmllint was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 4 new or modified test files. |
|||| _ HDFS-17531 Compile Tests _ |
| +1 :green_heart: | mvninstall | 49m 58s | | HDFS-17531 passed |
| +1 :green_heart: | compile | 0m 51s | | HDFS-17531 passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 |
| +1 :green_heart: | compile | 0m 39s | | HDFS-17531 passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
| +1 :green_heart: | checkstyle | 0m 31s | | HDFS-17531 passed |
| +1 :green_heart: | mvnsite | 0m 44s | | HDFS-17531 passed |
| +1 :green_heart: | javadoc | 0m 44s | | HDFS-17531 passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 |
| +1 :green_heart: | javadoc | 0m 31s | | HDFS-17531 passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
| +1 :green_heart: | spotbugs | 1m 27s | | HDFS-17531 passed |
| +1 :green_heart: | shadedclient | 39m 13s | | branch has no errors when building and testing our client artifacts. |
|||| _ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 0m 34s | | the patch passed |
| +1 :green_heart: | compile | 0m 43s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 |
| +1 :green_heart: | javac | 0m 43s | | the patch passed |
| +1 :green_heart: | compile | 0m 31s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
| +1 :green_heart: | javac | 0m 31s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 :green_heart: | checkstyle | 0m 20s | | the patch passed |
| +1 :green_heart: | mvnsite | 0m 34s | | the patch passed |
| +1 :green_heart: | javadoc | 0m 30s | | the patch passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 |
| +1 :green_heart: | javadoc | 0m 25s | | the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
| +1 :green_heart: | spotbugs | 1m 28s | | the patch passed |
| +1 :green_heart: | shadedclient | 39m 23s | | patch has no errors when building and testing our client artifacts. |
|||| _ Other Tests _ |
| +1 :green_heart: | unit | 34m 19s | | hadoop-hdfs-rbf in the patch passed. |
| +1 :green_heart: | asflicense | 0m 37s | | The patch does not generate ASF License warnings. |
| | | 194m 23s | | |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.47 ServerAPI=1.47 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6986/2/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/6986 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets xmllint |
| uname | Linux 966aea133556 5.15.0-117-generic #127-Ubuntu SMP Fri Jul 5 20:13:28 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | HDFS-17531 / aeecad6efb3deb638059b6cf3e7790f9aec009f5 |
| Default Java | Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_422-8u422-b05-1~20.04-b05 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6986/2/testReport/ |
| Max. process+thread count | 3497 (vs. ulimit of 5500) |
| modules | C: hadoop-hdfs-project/hadoop-hdfs-rbf U: hadoop-hdfs-project/hadoop-hdfs-rbf |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6986/2/console |
| versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
| Powered by | Apache Yetus 0.14.0 https://yetus.apache. |
[jira] [Commented] (HDFS-17599) EC: Fix the mismatch between locations and indices for mover
[ https://issues.apache.org/jira/browse/HDFS-17599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877962#comment-17877962 ] ASF GitHub Bot commented on HDFS-17599: --- haiyang1987 merged PR #6980: URL: https://github.com/apache/hadoop/pull/6980

> EC: Fix the mismatch between locations and indices for mover
> 
>
> Key: HDFS-17599
> URL: https://issues.apache.org/jira/browse/HDFS-17599
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 3.3.0, 3.4.0
> Reporter: Tao Li
> Assignee: Tao Li
> Priority: Major
> Labels: pull-request-available
> Attachments: image-2024-08-03-17-59-08-059.png, image-2024-08-03-18-00-01-950.png
>
> We set the EC policy to (6+3) and also had nodes in the ENTERING_MAINTENANCE state.
>
> When we moved the data of some directories from SSD to HDD, some blocks failed to move because the disk was full, as shown in the figure below (blk_-9223372033441574269).
> We tried the move again and hit the error "{color:#ff}Replica does not exist{color}".
> From the fsck output, we found that the wrong block id (blk_-9223372033441574270) was used when moving the block.
>
> {*}Mover Logs{*}:
> !image-2024-08-03-17-59-08-059.png|width=741,height=85!
>
> {*}FSCK Info{*}:
> !image-2024-08-03-18-00-01-950.png|width=738,height=120!
>
> {*}Root Cause{*}:
> Similar to HDFS-16333: when the mover is initialized, only `LIVE` nodes are processed. As a result, a datanode in the `ENTERING_MAINTENANCE` state is filtered out of the locations when `DBlockStriped` is initialized, but the indices are not adapted, so the locations and indices lengths no longer match. The EC block then computes the wrong block id when getting an internal block (see `DBlockStriped#getInternalBlock`).
>
> We added debug logs, and a few key messages are shown below.
> {color:#ff}The result is an incorrect correspondence: xx.xx.7.31 -> -9223372033441574270{color}.
> {code:java}
> DBlock getInternalBlock(StorageGroup storage) {
>   // storage == xx.xx.7.31
>   // idxInLocs == 1 (locations are [xx.xx.85.29:DISK, xx.xx.7.31:DISK,
>   // xx.xx.207.22:DISK, xx.xx.8.25:DISK, xx.xx.79.30:DISK, xx.xx.87.21:DISK,
>   // xx.xx.8.38:DISK]; xx.xx.179.31, which is in the ENTERING_MAINTENANCE
>   // state, has been filtered out)
>   int idxInLocs = locations.indexOf(storage);
>   if (idxInLocs == -1) {
>     return null;
>   }
>   // idxInGroup == 2 (indices is [1,2,3,4,5,6,7,8])
>   byte idxInGroup = indices[idxInLocs];
>   // blkId: -9223372033441574272 + 2 = -9223372033441574270
>   long blkId = getBlock().getBlockId() + idxInGroup;
>   long numBytes = getInternalBlockLength(getNumBytes(), cellSize,
>       dataBlockNum, idxInGroup);
>   Block blk = new Block(getBlock());
>   blk.setBlockId(blkId);
>   blk.setNumBytes(numBytes);
>   DBlock dblk = new DBlock(blk);
>   dblk.addLocation(storage);
>   return dblk;
> } {code}
> {*}Solution{*}:
> When initializing DBlockStriped, if any location is filtered out, remove the corresponding element from the indices so the two arrays stay aligned.
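The solution above — drop the index entry that corresponds to each filtered location — can be sketched as follows. This is a minimal illustration with hypothetical names, not the actual DBlockStriped code:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class StripedIndicesSketch {
    // When a location is filtered out (e.g. an ENTERING_MAINTENANCE datanode),
    // drop the matching element from indices too, so that indices[i] always
    // describes the node at locations[i].
    static byte[] filterIndices(List<String> allLocations,
                                List<String> keptLocations,
                                byte[] indices) {
        List<Byte> kept = new ArrayList<>();
        for (int i = 0; i < allLocations.size(); i++) {
            if (keptLocations.contains(allLocations.get(i))) {
                kept.add(indices[i]);
            }
        }
        byte[] out = new byte[kept.size()];
        for (int i = 0; i < out.length; i++) {
            out[i] = kept.get(i);
        }
        return out;
    }

    public static void main(String[] args) {
        // dn0 is in maintenance and filtered out of the locations.
        List<String> all = Arrays.asList("dn0", "dn1", "dn2");
        List<String> live = Arrays.asList("dn1", "dn2");
        byte[] adapted = filterIndices(all, live, new byte[]{1, 2, 3});
        // dn1 now sits at position 0 of the kept locations but still maps to
        // internal-block index 2, so blockId + idxInGroup comes out right.
        System.out.println(Arrays.toString(adapted)); // prints [2, 3]
    }
}
```

Without this adaptation, indices[0] (here 1) would be attributed to dn1, which is exactly the off-by-one that produced blk_-9223372033441574270 instead of the intended block.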
[jira] [Commented] (HDFS-17599) EC: Fix the mismatch between locations and indices for mover
[ https://issues.apache.org/jira/browse/HDFS-17599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877963#comment-17877963 ] ASF GitHub Bot commented on HDFS-17599: --- haiyang1987 commented on PR #6980: URL: https://github.com/apache/hadoop/pull/6980#issuecomment-2320052315 Committed to trunk. Thanks @tomscut for your works. And @Hexiaoqiao for your reviews.
[jira] [Updated] (HDFS-17599) EC: Fix the mismatch between locations and indices for mover
[ https://issues.apache.org/jira/browse/HDFS-17599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haiyang Hu updated HDFS-17599: -- Component/s: balancer & mover
[jira] [Resolved] (HDFS-17599) EC: Fix the mismatch between locations and indices for mover
[ https://issues.apache.org/jira/browse/HDFS-17599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haiyang Hu resolved HDFS-17599. --- Resolution: Fixed
[jira] [Updated] (HDFS-17599) EC: Fix the mismatch between locations and indices for mover
[ https://issues.apache.org/jira/browse/HDFS-17599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haiyang Hu updated HDFS-17599: -- Fix Version/s: 3.5.0