[jira] [Commented] (HDFS-16831) [RBF SBN] GetNamenodesForNameserviceId should shuffle Observer NameNodes every time
[ https://issues.apache.org/jira/browse/HDFS-16831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17627468#comment-17627468 ] ASF GitHub Bot commented on HDFS-16831:
---
hadoop-yetus commented on PR #5098: URL: https://github.com/apache/hadoop/pull/5098#issuecomment-1299591441

:broken_heart: **-1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|:--------|:-------:|:-------:|
| +0 :ok: | reexec | 0m 51s |  | Docker mode activated. |
|  |  |  |  | _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s |  | No case conflicting files found. |
| +0 :ok: | codespell | 0m 0s |  | codespell was not available. |
| +0 :ok: | detsecrets | 0m 0s |  | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s |  | The patch does not contain any @author tags. |
| -1 :x: | test4tests | 0m 0s |  | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
|  |  |  |  | _ trunk Compile Tests _ |
| -1 :x: | mvninstall | 3m 40s | [/branch-mvninstall-root.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5098/1/artifact/out/branch-mvninstall-root.txt) | root in trunk failed. |
| +1 :green_heart: | compile | 3m 48s |  | trunk passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | compile | 0m 34s |  | trunk passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | checkstyle | 0m 36s |  | trunk passed |
| +1 :green_heart: | mvnsite | 0m 46s |  | trunk passed |
| +1 :green_heart: | javadoc | 0m 50s |  | trunk passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | javadoc | 0m 53s |  | trunk passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | spotbugs | 1m 31s |  | trunk passed |
| +1 :green_heart: | shadedclient | 28m 29s |  | branch has no errors when building and testing our client artifacts. |
|  |  |  |  | _ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 0m 41s |  | the patch passed |
| +1 :green_heart: | compile | 0m 41s |  | the patch passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | javac | 0m 41s |  | the patch passed |
| +1 :green_heart: | compile | 0m 35s |  | the patch passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | javac | 0m 35s |  | the patch passed |
| +1 :green_heart: | blanks | 0m 0s |  | The patch has no blanks issues. |
| +1 :green_heart: | checkstyle | 0m 20s |  | the patch passed |
| +1 :green_heart: | mvnsite | 0m 37s |  | the patch passed |
| +1 :green_heart: | javadoc | 0m 36s |  | the patch passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | javadoc | 0m 55s |  | the patch passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | spotbugs | 1m 26s |  | the patch passed |
| +1 :green_heart: | shadedclient | 24m 24s |  | patch has no errors when building and testing our client artifacts. |
|  |  |  |  | _ Other Tests _ |
| -1 :x: | unit | 0m 19s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs-rbf.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5098/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs-rbf.txt) | hadoop-hdfs-rbf in the patch failed. |
| +1 :green_heart: | asflicense | 0m 40s |  | The patch does not generate ASF License warnings. |
|  |  | 73m 39s |  |  |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5098/1/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/5098 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux 5ecda8112937 4.15.0-191-generic #202-Ubuntu SMP Thu Aug 4 01:49:29 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / 15470ab94a91c304b21992590a0c3f1c837957ed |
| Default Java | Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5098/1/testReport/ |
| Max. process+thr
[jira] [Commented] (HDFS-16804) AddVolume contains a race condition with shutdown block pool
[ https://issues.apache.org/jira/browse/HDFS-16804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17627461#comment-17627461 ] ASF GitHub Bot commented on HDFS-16804:
---
MingXiangLi commented on code in PR #5033: URL: https://github.com/apache/hadoop/pull/5033#discussion_r1011148137

## hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetImpl.java:

@@ -3220,7 +3220,7 @@ public static void setBlockPoolId(String bpid) {
   }

   @Override
-  public void shutdownBlockPool(String bpid) {
+  public synchronized void shutdownBlockPool(String bpid) {

Review Comment: should add synchronized to addBlockPool() too?

> AddVolume contains a race condition with shutdown block pool
>
> Key: HDFS-16804
> URL: https://issues.apache.org/jira/browse/HDFS-16804
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: ZanderXu
> Assignee: ZanderXu
> Priority: Major
> Labels: pull-request-available
>
> AddVolume contains a race condition with shutdownBlockPool: the global ReplicaMap can be left holding blocks that belong to the removed block pool, and the new volume can be left holding an unused BlockPoolSlice for that pool, causing problems such as an incorrect dfsUsed and an incorrect numBlocks for the volume.
> Let's review the logic of addVolume and shutdownBlockPool respectively.
>
> AddVolume logic:
> * Step 1: Get all namespaceInfo from blockPoolManager
> * Step 2: Create one temporary FsVolumeImpl object
> * Step 3: Create blockPoolSlices according to the namespaceInfo and add them to the temporary FsVolumeImpl object
> * Step 4: Scan all blocks of the namespaceInfo from the volume and store them in one temporary ReplicaMap
> * Step 5: Activate the temporary FsVolumeImpl created before (with the FsDatasetImpl synchronized lock)
> ** Step 5.1: Merge all blocks of the temporary ReplicaMap into the global ReplicaMap
> ** Step 5.2: Add the FsVolumeImpl to the volumes
> ShutdownBlockPool logic (with the blockPool write lock):
> * Step 1: Clean up the blockPool from the global ReplicaMap
> * Step 2: Shut down the block pool on all the volumes
> ** Step 2.1: Do some cleanup for the block pool, such as saveReplica, saveDfsUsed, etc.
> ** Step 2.2: Remove the blockPool from bpSlices
> The race condition can be reproduced by the following steps:
> * AddVolume Step 1: Get all namespaceInfo from blockPoolManager
> * ShutdownBlockPool Step 1: Clean up the blockPool from the global ReplicaMap
> * ShutdownBlockPool Step 2: Shut down the block pool on all the volumes
> * AddVolume Steps 2~5
> Actual result:
> * The global replicaMap contains blocks that belong to the removed blockPool
> * The bpSlices of the FsVolumeImpl contain one blockPoolSlice that belongs to the removed blockPool
> Expected result:
> * The global replicaMap shouldn't contain any blocks that belong to the removed blockPool
> * The bpSlices of any FsVolumeImpl shouldn't contain any blockPoolSlice that belongs to the removed blockPool
--
This message was sent by Atlassian Jira (v8.20.10#820010)
-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
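The reproduction steps above can be modeled as a short, deterministic sketch (hypothetical class and simplified types; the real logic lives in FsDatasetImpl and ReplicaMap):

```java
import java.util.*;

// Simplified model of the AddVolume vs. shutdownBlockPool race (hypothetical
// names; block pools are mapped to plain sets of block IDs for illustration).
public class AddVolumeRaceDemo {
  static final Map<String, Set<Long>> globalReplicaMap = new HashMap<>();

  // AddVolume Step 5.1: merge the temporary map into the global map.
  static void mergeAll(Map<String, Set<Long>> tmp) {
    for (Map.Entry<String, Set<Long>> e : tmp.entrySet()) {
      globalReplicaMap.computeIfAbsent(e.getKey(), k -> new HashSet<>())
          .addAll(e.getValue());
    }
  }

  // ShutdownBlockPool Step 1: drop the pool from the global map.
  static void shutdownBlockPool(String bpid) {
    globalReplicaMap.remove(bpid);
  }

  public static void main(String[] args) {
    String bpid = "BP-1";
    // AddVolume Steps 1-4: scan blocks into a temporary map.
    Map<String, Set<Long>> tmp = new HashMap<>();
    tmp.put(bpid, new HashSet<>(Arrays.asList(1001L, 1002L)));
    // The block pool is shut down while the scan result is still pending.
    shutdownBlockPool(bpid);
    // AddVolume Step 5 then merges the stale entries back in.
    mergeAll(tmp);
    // Bug: the removed pool is resurrected in the global map.
    System.out.println(globalReplicaMap.containsKey(bpid)); // prints "true"
  }
}
```

The demo runs the two operations in the exact order listed in the reproduction, which is why no threads are needed to show the stale state.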
[jira] [Commented] (HDFS-16804) AddVolume contains a race condition with shutdown block pool
[ https://issues.apache.org/jira/browse/HDFS-16804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17627459#comment-17627459 ] ASF GitHub Bot commented on HDFS-16804:
---
MingXiangLi commented on code in PR #5033: URL: https://github.com/apache/hadoop/pull/5033#discussion_r1011147892

## hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetImpl.java:

@@ -452,6 +449,9 @@ private synchronized void activateVolume(
       throw new IOException(errorMsg);
     }
     volumeMap.mergeAll(replicaMap);
+    for (String bp : volumeMap.getBlockPoolList()) {

Review Comment: LGTM here

> AddVolume contains a race condition with shutdown block pool
>
> Key: HDFS-16804
> URL: https://issues.apache.org/jira/browse/HDFS-16804
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: ZanderXu
> Assignee: ZanderXu
> Priority: Major
> Labels: pull-request-available
>
> AddVolume contains a race condition with shutdownBlockPool: the global ReplicaMap can be left holding blocks that belong to the removed block pool, and the new volume can be left holding an unused BlockPoolSlice for that pool, causing problems such as an incorrect dfsUsed and an incorrect numBlocks for the volume.
> Let's review the logic of addVolume and shutdownBlockPool respectively.
>
> AddVolume logic:
> * Step 1: Get all namespaceInfo from blockPoolManager
> * Step 2: Create one temporary FsVolumeImpl object
> * Step 3: Create blockPoolSlices according to the namespaceInfo and add them to the temporary FsVolumeImpl object
> * Step 4: Scan all blocks of the namespaceInfo from the volume and store them in one temporary ReplicaMap
> * Step 5: Activate the temporary FsVolumeImpl created before (with the FsDatasetImpl synchronized lock)
> ** Step 5.1: Merge all blocks of the temporary ReplicaMap into the global ReplicaMap
> ** Step 5.2: Add the FsVolumeImpl to the volumes
> ShutdownBlockPool logic (with the blockPool write lock):
> * Step 1: Clean up the blockPool from the global ReplicaMap
> * Step 2: Shut down the block pool on all the volumes
> ** Step 2.1: Do some cleanup for the block pool, such as saveReplica, saveDfsUsed, etc.
> ** Step 2.2: Remove the blockPool from bpSlices
> The race condition can be reproduced by the following steps:
> * AddVolume Step 1: Get all namespaceInfo from blockPoolManager
> * ShutdownBlockPool Step 1: Clean up the blockPool from the global ReplicaMap
> * ShutdownBlockPool Step 2: Shut down the block pool on all the volumes
> * AddVolume Steps 2~5
> Actual result:
> * The global replicaMap contains blocks that belong to the removed blockPool
> * The bpSlices of the FsVolumeImpl contain one blockPoolSlice that belongs to the removed blockPool
> Expected result:
> * The global replicaMap shouldn't contain any blocks that belong to the removed blockPool
> * The bpSlices of any FsVolumeImpl shouldn't contain any blockPoolSlice that belongs to the removed blockPool
[jira] [Updated] (HDFS-16831) [RBF SBN] GetNamenodesForNameserviceId should shuffle Observer NameNodes every time
[ https://issues.apache.org/jira/browse/HDFS-16831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HDFS-16831:
--
Labels: pull-request-available (was: )

> [RBF SBN] GetNamenodesForNameserviceId should shuffle Observer NameNodes every time
>
> Key: HDFS-16831
> URL: https://issues.apache.org/jira/browse/HDFS-16831
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: ZanderXu
> Assignee: ZanderXu
> Priority: Major
> Labels: pull-request-available
>
> The method getNamenodesForNameserviceId in MembershipNamenodeResolver.class should shuffle Observer NameNodes every time. The current logic returns the cached list, which causes all read requests to be forwarded to the first Observer NameNode.
>
> The related code is as below:
> {code:java}
> @Override
> public List getNamenodesForNameserviceId(
>     final String nsId, boolean listObserversFirst) throws IOException {
>   List ret = cacheNS.get(Pair.of(nsId, listObserversFirst));
>   if (ret != null) {
>     return ret;
>   }
>   ...
> }{code}
[jira] [Commented] (HDFS-16831) [RBF SBN] GetNamenodesForNameserviceId should shuffle Observer NameNodes every time
[ https://issues.apache.org/jira/browse/HDFS-16831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17627455#comment-17627455 ] ASF GitHub Bot commented on HDFS-16831:
---
ZanderXu opened a new pull request, #5098: URL: https://github.com/apache/hadoop/pull/5098

### Description of PR

[HDFS-16831](https://issues.apache.org/jira/browse/HDFS-16831) The method `getNamenodesForNameserviceId` in `MembershipNamenodeResolver.class` should shuffle Observer NameNodes every time. The current logic returns the cached list, which causes all read requests to be forwarded to the first Observer NameNode. The related code is as below:

```
@Override
public List getNamenodesForNameserviceId(
    final String nsId, boolean listObserversFirst) throws IOException {
  List ret = cacheNS.get(Pair.of(nsId, listObserversFirst));
  if (ret != null) {
    return ret;
  }
  ...
}
```

> [RBF SBN] GetNamenodesForNameserviceId should shuffle Observer NameNodes every time
>
> Key: HDFS-16831
> URL: https://issues.apache.org/jira/browse/HDFS-16831
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: ZanderXu
> Assignee: ZanderXu
> Priority: Major
>
> The method getNamenodesForNameserviceId in MembershipNamenodeResolver.class should shuffle Observer NameNodes every time. The current logic returns the cached list, which causes all read requests to be forwarded to the first Observer NameNode.
>
> The related code is as below:
> {code:java}
> @Override
> public List getNamenodesForNameserviceId(
>     final String nsId, boolean listObserversFirst) throws IOException {
>   List ret = cacheNS.get(Pair.of(nsId, listObserversFirst));
>   if (ret != null) {
>     return ret;
>   }
>   ...
> }{code}
[jira] [Created] (HDFS-16831) [RBF SBN] GetNamenodesForNameserviceId should shuffle Observer NameNodes every time
ZanderXu created HDFS-16831:
---
Summary: [RBF SBN] GetNamenodesForNameserviceId should shuffle Observer NameNodes every time
Key: HDFS-16831
URL: https://issues.apache.org/jira/browse/HDFS-16831
Project: Hadoop HDFS
Issue Type: Bug
Reporter: ZanderXu
Assignee: ZanderXu

The method getNamenodesForNameserviceId in MembershipNamenodeResolver.class should shuffle Observer NameNodes every time. The current logic returns the cached list, which causes all read requests to be forwarded to the first Observer NameNode.

The related code is as below:
{code:java}
@Override
public List getNamenodesForNameserviceId(
    final String nsId, boolean listObserversFirst) throws IOException {
  List ret = cacheNS.get(Pair.of(nsId, listObserversFirst));
  if (ret != null) {
    return ret;
  }
  ...
}{code}
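A minimal sketch of the proposed behavior (hypothetical helper and simplified types; the real method is MembershipNamenodeResolver#getNamenodesForNameserviceId and returns namenode contexts, not strings): copy the cached list and shuffle only its Observer prefix on each call, so reads spread across Observers while the cache itself is never mutated.

```java
import java.util.*;

// Sketch of shuffling only the Observer prefix of a cached namenode list.
// Hypothetical names; the cached list is assumed to have Observers first,
// as in the listObserversFirst == true case described in the issue.
public class ObserverShuffle {
  static List<String> shuffleObserversFirst(List<String> cached,
      int observerCount, Random rnd) {
    List<String> ret = new ArrayList<>(cached);  // never mutate the cached list
    // subList is a live view of ret, so shuffling it reorders the prefix only.
    Collections.shuffle(ret.subList(0, observerCount), rnd);
    return ret;
  }

  public static void main(String[] args) {
    List<String> cached = Arrays.asList("obs1", "obs2", "obs3", "active", "standby");
    List<String> ret = shuffleObserversFirst(cached, 3, new Random());
    // The non-Observer tail keeps its order; only the Observer head is randomized.
    System.out.println(ret.subList(3, 5)); // prints "[active, standby]"
  }
}
```

Returning a shuffled copy keeps the cache read-only, which matters because the cached list may be handed to many concurrent RPC handlers.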
[jira] [Commented] (HDFS-16804) AddVolume contains a race condition with shutdown block pool
[ https://issues.apache.org/jira/browse/HDFS-16804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17627438#comment-17627438 ] ASF GitHub Bot commented on HDFS-16804:
---
ZanderXu commented on code in PR #5033: URL: https://github.com/apache/hadoop/pull/5033#discussion_r1011108331

## hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetImpl.java:

@@ -452,6 +449,9 @@ private synchronized void activateVolume(
       throw new IOException(errorMsg);
     }
     volumeMap.mergeAll(replicaMap);
+    for (String bp : volumeMap.getBlockPoolList()) {

Review Comment: @Hexiaoqiao Sir, thanks for your review.
- Moving this logic here is to prevent lock leaks when `volumeMap.mergeAll(replicaMap)` fails.
- `storageMap` is a `ConcurrentHashMap`, and modifications always need to acquire the `synchronized` lock.
- And this VolumeLock can only be used after `volumes.addVolume(ref);`

@MingXiangLi Master, can you help me double-check it? Thanks so much.

> AddVolume contains a race condition with shutdown block pool
>
> Key: HDFS-16804
> URL: https://issues.apache.org/jira/browse/HDFS-16804
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: ZanderXu
> Assignee: ZanderXu
> Priority: Major
> Labels: pull-request-available
>
> AddVolume contains a race condition with shutdownBlockPool: the global ReplicaMap can be left holding blocks that belong to the removed block pool, and the new volume can be left holding an unused BlockPoolSlice for that pool, causing problems such as an incorrect dfsUsed and an incorrect numBlocks for the volume.
> Let's review the logic of addVolume and shutdownBlockPool respectively.
>
> AddVolume logic:
> * Step 1: Get all namespaceInfo from blockPoolManager
> * Step 2: Create one temporary FsVolumeImpl object
> * Step 3: Create blockPoolSlices according to the namespaceInfo and add them to the temporary FsVolumeImpl object
> * Step 4: Scan all blocks of the namespaceInfo from the volume and store them in one temporary ReplicaMap
> * Step 5: Activate the temporary FsVolumeImpl created before (with the FsDatasetImpl synchronized lock)
> ** Step 5.1: Merge all blocks of the temporary ReplicaMap into the global ReplicaMap
> ** Step 5.2: Add the FsVolumeImpl to the volumes
> ShutdownBlockPool logic (with the blockPool write lock):
> * Step 1: Clean up the blockPool from the global ReplicaMap
> * Step 2: Shut down the block pool on all the volumes
> ** Step 2.1: Do some cleanup for the block pool, such as saveReplica, saveDfsUsed, etc.
> ** Step 2.2: Remove the blockPool from bpSlices
> The race condition can be reproduced by the following steps:
> * AddVolume Step 1: Get all namespaceInfo from blockPoolManager
> * ShutdownBlockPool Step 1: Clean up the blockPool from the global ReplicaMap
> * ShutdownBlockPool Step 2: Shut down the block pool on all the volumes
> * AddVolume Steps 2~5
> Actual result:
> * The global replicaMap contains blocks that belong to the removed blockPool
> * The bpSlices of the FsVolumeImpl contain one blockPoolSlice that belongs to the removed blockPool
> Expected result:
> * The global replicaMap shouldn't contain any blocks that belong to the removed blockPool
> * The bpSlices of any FsVolumeImpl shouldn't contain any blockPoolSlice that belongs to the removed blockPool
[jira] [Commented] (HDFS-16811) Support to make dfs.namenode.decommission.backoff.monitor.pending.limit reconfigurable
[ https://issues.apache.org/jira/browse/HDFS-16811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17627429#comment-17627429 ] ASF GitHub Bot commented on HDFS-16811:
---
haiyang1987 commented on PR #5068: URL: https://github.com/apache/hadoop/pull/5068#issuecomment-1299492638

@tomscut thanks for helping to review it.

> Support to make dfs.namenode.decommission.backoff.monitor.pending.limit reconfigurable
>
> Key: HDFS-16811
> URL: https://issues.apache.org/jira/browse/HDFS-16811
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: Haiyang Hu
> Assignee: Haiyang Hu
> Priority: Major
> Labels: pull-request-available
>
> When the Backoff monitor is enabled, the parameter dfs.namenode.decommission.backoff.monitor.pending.limit can be dynamically adjusted to determine the maximum number of blocks related to decommission and maintenance operations that can be loaded into the replication queue.
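For illustration, a minimal sketch of what "reconfigurable" means here (hypothetical class and method; in Hadoop this is done by overriding reconfigurePropertyImpl() in the NameNode): the monitor reads the limit from a mutable holder that an admin can update without a restart.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of a live-reconfigurable limit; not the actual
// Hadoop implementation. The monitor thread would read pendingLimit.get()
// on each scan instead of caching the configured value at startup.
public class ReconfigurableLimit {
  static final String KEY =
      "dfs.namenode.decommission.backoff.monitor.pending.limit";
  static final AtomicInteger pendingLimit = new AtomicInteger(1000);

  // Called for a dfsadmin -reconfig style update; keeps the previous
  // value when the property doesn't match or the new value is malformed.
  static int reconfigure(String property, String newValue) {
    if (KEY.equals(property)) {
      try {
        pendingLimit.set(Integer.parseInt(newValue));
      } catch (NumberFormatException ignored) {
        // malformed input: keep the previous value
      }
    }
    return pendingLimit.get();
  }

  public static void main(String[] args) {
    System.out.println(reconfigure(KEY, "5000")); // prints "5000"
  }
}
```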
[jira] [Commented] (HDFS-16804) AddVolume contains a race condition with shutdown block pool
[ https://issues.apache.org/jira/browse/HDFS-16804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17627417#comment-17627417 ] ASF GitHub Bot commented on HDFS-16804:
---
DaveTeng0 commented on code in PR #5033: URL: https://github.com/apache/hadoop/pull/5033#discussion_r1011082084

## hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/ReplicaMap.java:

@@ -166,25 +167,24 @@ void addAll(ReplicaMap other) {
   /**
    * Merge all entries from the given replica map into the local replica map.
    */
-  void mergeAll(ReplicaMap other) {
+  void mergeAll(ReplicaMap other) throws IOException {
     Set bplist = other.map.keySet();
     for (String bp : bplist) {
       checkBlockPool(bp);
       try (AutoCloseDataSetLock l = lockManager.writeLock(LockLevel.BLOCK_POOl, bp)) {
         LightWeightResizableGSet replicaInfos = other.map.get(bp);
         LightWeightResizableGSet curSet = map.get(bp);
+        if (curSet == null) {
+          // Can't find the block pool id in the replicaMap. Maybe it has been removed.

Review Comment: Thanks Zander for your detailed explanation! Got it!!

> AddVolume contains a race condition with shutdown block pool
>
> Key: HDFS-16804
> URL: https://issues.apache.org/jira/browse/HDFS-16804
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: ZanderXu
> Assignee: ZanderXu
> Priority: Major
> Labels: pull-request-available
>
> AddVolume contains a race condition with shutdownBlockPool: the global ReplicaMap can be left holding blocks that belong to the removed block pool, and the new volume can be left holding an unused BlockPoolSlice for that pool, causing problems such as an incorrect dfsUsed and an incorrect numBlocks for the volume.
> Let's review the logic of addVolume and shutdownBlockPool respectively.
>
> AddVolume logic:
> * Step 1: Get all namespaceInfo from blockPoolManager
> * Step 2: Create one temporary FsVolumeImpl object
> * Step 3: Create blockPoolSlices according to the namespaceInfo and add them to the temporary FsVolumeImpl object
> * Step 4: Scan all blocks of the namespaceInfo from the volume and store them in one temporary ReplicaMap
> * Step 5: Activate the temporary FsVolumeImpl created before (with the FsDatasetImpl synchronized lock)
> ** Step 5.1: Merge all blocks of the temporary ReplicaMap into the global ReplicaMap
> ** Step 5.2: Add the FsVolumeImpl to the volumes
> ShutdownBlockPool logic (with the blockPool write lock):
> * Step 1: Clean up the blockPool from the global ReplicaMap
> * Step 2: Shut down the block pool on all the volumes
> ** Step 2.1: Do some cleanup for the block pool, such as saveReplica, saveDfsUsed, etc.
> ** Step 2.2: Remove the blockPool from bpSlices
> The race condition can be reproduced by the following steps:
> * AddVolume Step 1: Get all namespaceInfo from blockPoolManager
> * ShutdownBlockPool Step 1: Clean up the blockPool from the global ReplicaMap
> * ShutdownBlockPool Step 2: Shut down the block pool on all the volumes
> * AddVolume Steps 2~5
> Actual result:
> * The global replicaMap contains blocks that belong to the removed blockPool
> * The bpSlices of the FsVolumeImpl contain one blockPoolSlice that belongs to the removed blockPool
> Expected result:
> * The global replicaMap shouldn't contain any blocks that belong to the removed blockPool
> * The bpSlices of any FsVolumeImpl shouldn't contain any blockPoolSlice that belongs to the removed blockPool
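The null-check in the review hunk above can be sketched with plain collections (simplified types; the actual patch works on ReplicaMap under a per-pool lock, and how it reacts to the missing pool is its own choice): if the local map no longer knows the block pool, it was shut down concurrently, so in this simplified variant the stale entries are skipped instead of resurrecting the pool.

```java
import java.util.*;

// Simplified sketch of mergeAll() guarding against a concurrently removed
// block pool (hypothetical types; the real code uses LightWeightResizableGSet).
public class MergeAllSketch {
  static void mergeAll(Map<String, Set<Long>> local, Map<String, Set<Long>> other) {
    for (Map.Entry<String, Set<Long>> e : other.entrySet()) {
      Set<Long> curSet = local.get(e.getKey());
      if (curSet == null) {
        // Block pool id not found locally; maybe it has been removed. Skip it.
        continue;
      }
      curSet.addAll(e.getValue());
    }
  }

  public static void main(String[] args) {
    Map<String, Set<Long>> local = new HashMap<>();
    local.put("BP-live", new HashSet<>(Collections.singleton(1L)));
    Map<String, Set<Long>> other = new HashMap<>();
    other.put("BP-live", new HashSet<>(Collections.singleton(2L)));
    other.put("BP-removed", new HashSet<>(Collections.singleton(3L)));
    mergeAll(local, other);
    // The removed pool is not re-created; the live pool gets the new block.
    System.out.println(local.containsKey("BP-removed")); // prints "false"
  }
}
```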
[jira] [Commented] (HDFS-16816) RBF: auto-create user home dir for trash paths by router
[ https://issues.apache.org/jira/browse/HDFS-16816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17627403#comment-17627403 ] Virajith Jalaparti commented on HDFS-16816:
---
[~inigoiri] / [~fengnanli] do you guys think this is a useful feature to add? If so, my suggestion here, as mentioned on the [PR|https://github.com/apache/hadoop/pull/5071/files#r1011042738], would be to add a new API for {{provisionTrashPath}}.

> RBF: auto-create user home dir for trash paths by router
>
> Key: HDFS-16816
> URL: https://issues.apache.org/jira/browse/HDFS-16816
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: rbf
> Reporter: Xing Lin
> Assignee: Xing Lin
> Priority: Minor
> Labels: pull-request-available
>
> In RBF, trash files are moved to the trash root under the user's home dir at the corresponding namespace/namenode where the files reside. This was added in HDFS-16024. When the user home dir is not created beforehand at a namenode, we run into permission-denied exceptions when trying to create the parent dir for the trash file before moving the file into it. We propose to enhance the Router to auto-create a user's home dir at the namenode for trash paths, using the router's identity (which is assumed to be a super-user).
[jira] [Commented] (HDFS-16816) RBF: auto-create user home dir for trash paths by router
[ https://issues.apache.org/jira/browse/HDFS-16816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17627402#comment-17627402 ] ASF GitHub Bot commented on HDFS-16816:
---
virajith commented on code in PR #5071: URL: https://github.com/apache/hadoop/pull/5071#discussion_r1011042738

## hadoop-hdfs-project/hadoop-hdfs-rbf/src/main/java/org/apache/hadoop/hdfs/server/federation/router/RouterClientProtocol.java:

@@ -734,6 +811,13 @@ public boolean mkdirs(String src, FsPermission masked, boolean createParent)
         new Class[] {String.class, FsPermission.class, boolean.class},
         new RemoteParam(), masked, createParent);
+    // Auto-create user home dir for a trash path.
+    // moveToTrash() will first call fs.mkdirs() to create the parent dir, before calling rename()
+    // to move the file into it. As a result, we need to create user home dir in mkdirs().
+    if (autoCreateUserHomeForTrash) {
+      createUserHomeForTrashPath(src, locations);

Review Comment: I don't think you should be piggybacking on {{mkdirs}} to get this created. If this is a useful feature to have, I'd suggest adding a new FileSystem API named something like {{provisionTrashPath()}}, similar to {{provisionEZTrash}}, and calling it in {{TrashPolicy#moveToTrash}} before {{mkdirs}}. Can you also call out why you need this vs. provisioning user trash paths as part of the (off-band) process of provisioning the user home directories on HDFS?

> RBF: auto-create user home dir for trash paths by router
>
> Key: HDFS-16816
> URL: https://issues.apache.org/jira/browse/HDFS-16816
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: rbf
> Reporter: Xing Lin
> Assignee: Xing Lin
> Priority: Minor
> Labels: pull-request-available
>
> In RBF, trash files are moved to the trash root under the user's home dir at the corresponding namespace/namenode where the files reside. This was added in HDFS-16024. When the user home dir is not created beforehand at a namenode, we run into permission-denied exceptions when trying to create the parent dir for the trash file before moving the file into it. We propose to enhance the Router to auto-create a user's home dir at the namenode for trash paths, using the router's identity (which is assumed to be a super-user).
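The path handling being discussed can be sketched with a hypothetical helper (not the actual RouterClientProtocol code): trash paths have the shape /user/&lt;name&gt;/.Trash/..., so the home dir to pre-create under the router's super-user identity can be derived from the path itself.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper illustrating how a trash path maps to the user home
// dir that would need to be provisioned; assumes the default HDFS layout
// where home dirs live under /user.
public class TrashHomeDir {
  private static final Pattern TRASH =
      Pattern.compile("^(/user/[^/]+)/\\.Trash(/.*)?$");

  // Returns the home dir to provision, or null if src is not a trash path.
  static String homeDirForTrashPath(String src) {
    Matcher m = TRASH.matcher(src);
    return m.matches() ? m.group(1) : null;
  }

  public static void main(String[] args) {
    System.out.println(homeDirForTrashPath(
        "/user/alice/.Trash/Current/data/file1")); // prints "/user/alice"
    System.out.println(homeDirForTrashPath("/data/file1")); // prints "null"
  }
}
```

Only paths matching the trash layout trigger provisioning, so ordinary mkdirs calls would be unaffected by a check like this.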
[jira] [Assigned] (HDFS-16816) RBF: auto-create user home dir for trash paths by router
[ https://issues.apache.org/jira/browse/HDFS-16816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Virajith Jalaparti reassigned HDFS-16816:
--
Assignee: Xing Lin

> RBF: auto-create user home dir for trash paths by router
>
> Key: HDFS-16816
> URL: https://issues.apache.org/jira/browse/HDFS-16816
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: rbf
> Reporter: Xing Lin
> Assignee: Xing Lin
> Priority: Minor
> Labels: pull-request-available
>
> In RBF, trash files are moved to the trash root under the user's home dir at the corresponding namespace/namenode where the files reside. This was added in HDFS-16024. When the user home dir is not created beforehand at a namenode, we run into permission-denied exceptions when trying to create the parent dir for the trash file before moving the file into it. We propose to enhance the Router to auto-create a user's home dir at the namenode for trash paths, using the router's identity (which is assumed to be a super-user).
[jira] [Commented] (HDFS-16827) [RBF SBN] RouterStateIdContext shouldn't update the ResponseState if client doesn't use ObserverReadProxyProvider
[ https://issues.apache.org/jira/browse/HDFS-16827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17627347#comment-17627347 ] ASF GitHub Bot commented on HDFS-16827: --- simbadzina commented on code in PR #5088: URL: https://github.com/apache/hadoop/pull/5088#discussion_r1010933954 ## hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Server.java: ## @@ -2879,9 +2881,12 @@ private void processRpcRequest(RpcRequestHeaderProto header, stateId = alignmentContext.receiveRequestState( header, getMaxIdleTime()); call.setClientStateId(stateId); -if (header.hasRouterFederatedState()) { - call.setFederatedNamespaceState(header.getRouterFederatedState()); -} + } + if (header.hasRouterFederatedState()) { +call.setFederatedNamespaceState(header.getRouterFederatedState()); + } else if (header.hasStateId()) { +// Only used to determine whether to return federatedNamespaceState. +call.setFederatedNamespaceState(EMPTY_BYTE_STRING); Review Comment: Can you expand on why we need this else part? > [RBF SBN] RouterStateIdContext shouldn't update the ResponseState if client > doesn't use ObserverReadProxyProvider > - > > Key: HDFS-16827 > URL: https://issues.apache.org/jira/browse/HDFS-16827 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: ZanderXu >Assignee: ZanderXu >Priority: Major > Labels: pull-request-available > > RouterStateIdContext shouldn't update the ResponseState if client doesn't use > ObserverReadProxyProvider. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
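To make the question in the review concrete, here is a minimal standalone model of the branch under discussion — not the actual `Server.java` code, and the method name is invented for illustration. The idea it captures: federated namespace state is echoed back only to clients that sent `routerFederatedState` themselves; a client that sent only a plain `stateId` (i.e. one not using ObserverReadProxyProvider through the router) gets an empty marker recorded, signalling that no federated state should be returned.

```java
import java.util.Optional;

public class FederatedStateSketch {
    // Decide what federated-state payload to record for a call, based on what
    // the client's request header carried. The empty string models
    // EMPTY_BYTE_STRING from the patch; Optional.empty() models "no alignment
    // context at all". Names and types are simplified for illustration.
    static Optional<String> federatedStateFor(String routerFederatedState, boolean hasStateId) {
        if (routerFederatedState != null) {
            // Client goes through RBF with ObserverReadProxyProvider:
            // keep its federated state so the response can update it.
            return Optional.of(routerFederatedState);
        } else if (hasStateId) {
            // Client sent a plain stateId only: record an empty marker so the
            // server knows not to attach federated namespace state in the reply.
            return Optional.of("");
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(federatedStateFor(null, true).get().isEmpty()); // prints true
    }
}
```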
[jira] [Commented] (HDFS-16829) Delay deleting blocks with older generation stamp until the block is fully replicated.
[ https://issues.apache.org/jira/browse/HDFS-16829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17627294#comment-17627294 ] Rushabh Shah commented on HDFS-16829: - Thank you [~ayushsaxena] for the response. > What if the client still pushed more data to the single datanode? The data > contained by the other datanodes won't be complete right? The proposal is to keep the blocks on disk so some operator can manually restore them. We don't want to make the replica with old genstamp primary replica automatically. > Moreover with this in genuine cases also you would be occupying the memory > with blocks with old genstamps, if the cluster is really unstable and has a > lot of updatePipelines or things like that, it may be a issue in that case IMHO this shouldn't be an issue. We have ReplicationMonitor thread which works pretty well. Even in case of many update pipelines, ReplicationMonitor is fast enough to replicate the blocks to maintain replication factor. The proposal is just to wait deleting the older genstamp blocks until ReplicationMonitor replicates that block. If you are concerned about this behavior then we can introduce a config key which will default to current behavior and if enabled it will delay until it is fully replicated. > You can get something if you figure out really quick before the configured > time after which the block gets deleted. In this case, the replica with old genstamp was deleted within 3 seconds after the file was closed. One needs to be super human to catch this in production. :) > dfs.client.block.write.replace-datanode-on-failure.min-replication``, if it > doesn't behave that way, good to have a new config with such behaviour > semantics Thank you for pointing out this config. Let me check this out. > I think if there is one node only, we can give it to try that we have >syncBlock always true in such cases, might help or may be not if the Disk has >some issues in some cases... 
Given that I don't have logs either from datanode or hdfs client, I am not able to understand what exactly happened? :( > Delay deleting blocks with older generation stamp until the block is fully > replicated. > -- > > Key: HDFS-16829 > URL: https://issues.apache.org/jira/browse/HDFS-16829 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode, namenode >Affects Versions: 2.10.1 >Reporter: Rushabh Shah >Priority: Critical > > We encountered this data loss issue in one of our production clusters which > runs hbase service. We received a missing block alert in this cluster. This > error was logged in the datanode holding the block. > {noformat} > 2022-10-27 18:37:51,341 ERROR [17546151_2244173222]] datanode.DataNode - > nodeA:51010:DataXceiver error processing READ_BLOCK operation src: > /nodeA:31722 dst: > java.io.IOException: Offset 64410559 and length 4096 don't match block > BP-958889176-1567030695029:blk_3317546151_2244173222 ( blockLen 59158528 ) > at > org.apache.hadoop.hdfs.server.datanode.BlockSender.(BlockSender.java:384) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:603) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:145) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:100) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:298) > at java.lang.Thread.run(Thread.java:750) > {noformat} > The node +nodeA+ has this block blk_3317546151_2244173222 with file length: > 59158528 but the length of this block according to namenode is 64414655 > (according to fsck) > This are the sequence of events for this block. > > 1. Namenode created a file with 3 replicas with block id: blk_3317546151 and > genstamp: 2244173147. > 2. The first datanode in the pipeline (This physical host was also running > region server process which was hdfs client) was restarting at the same time. 
> Unfortunately this node was sick and it didn't log anything neither in > datanode process or regionserver process during the time of block creation. > 3. Namenode updated the pipeline just with the first node. > 4. Namenode logged updatePipeline success with just 1st node nodeA with block > size: 64414655 and new generation stamp: 2244173222 > 5. Namenode asked nodeB and nodeC to delete the block since it has old > generation stamp. > 6. All the reads (client reads and data transfer reads) from nodeA are > failing with the above stack trace. > See logs below from namenode and nodeB and nodeC. > {noformat} > Logs from namenode - > 2022-10-23 12:36:34,449 INFO [on default port 8020] hdfs.StateChange - >
[jira] [Commented] (HDFS-16829) Delay deleting blocks with older generation stamp until the block is fully replicated.
[ https://issues.apache.org/jira/browse/HDFS-16829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17627290#comment-17627290 ] Viraj Jasani commented on HDFS-16829: - {quote}I think if there is one node only, we can give it to try that we have syncBlock always true in such cases {quote} If only one node, having syncBlock true can have much of latency impact? > Delay deleting blocks with older generation stamp until the block is fully > replicated. > -- > > Key: HDFS-16829 > URL: https://issues.apache.org/jira/browse/HDFS-16829 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode, namenode >Affects Versions: 2.10.1 >Reporter: Rushabh Shah >Priority: Critical > > We encountered this data loss issue in one of our production clusters which > runs hbase service. We received a missing block alert in this cluster. This > error was logged in the datanode holding the block. > {noformat} > 2022-10-27 18:37:51,341 ERROR [17546151_2244173222]] datanode.DataNode - > nodeA:51010:DataXceiver error processing READ_BLOCK operation src: > /nodeA:31722 dst: > java.io.IOException: Offset 64410559 and length 4096 don't match block > BP-958889176-1567030695029:blk_3317546151_2244173222 ( blockLen 59158528 ) > at > org.apache.hadoop.hdfs.server.datanode.BlockSender.(BlockSender.java:384) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:603) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:145) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:100) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:298) > at java.lang.Thread.run(Thread.java:750) > {noformat} > The node +nodeA+ has this block blk_3317546151_2244173222 with file length: > 59158528 but the length of this block according to namenode is 64414655 > (according to fsck) > This are the sequence of events for this block. > > 1. 
Namenode created a file with 3 replicas with block id: blk_3317546151 and > genstamp: 2244173147. > 2. The first datanode in the pipeline (This physical host was also running > region server process which was hdfs client) was restarting at the same time. > Unfortunately this node was sick and it didn't log anything neither in > datanode process or regionserver process during the time of block creation. > 3. Namenode updated the pipeline just with the first node. > 4. Namenode logged updatePipeline success with just 1st node nodeA with block > size: 64414655 and new generation stamp: 2244173222 > 5. Namenode asked nodeB and nodeC to delete the block since it has old > generation stamp. > 6. All the reads (client reads and data transfer reads) from nodeA are > failing with the above stack trace. > See logs below from namenode and nodeB and nodeC. > {noformat} > Logs from namenode - > 2022-10-23 12:36:34,449 INFO [on default port 8020] hdfs.StateChange - > BLOCK* allocate blk_3317546151_2244173147, replicas=nodeA:51010, nodeB:51010 > , nodeC:51010 for > 2022-10-23 12:36:34,978 INFO [on default port 8020] namenode.FSNamesystem - > updatePipeline(blk_3317546151_2244173147 => blk_3317546151_2244173222) success > 2022-10-23 12:36:34,978 INFO [on default port 8020] namenode.FSNamesystem - > updatePipeline(blk_3317546151_2244173147, newGS=2244173222, > newLength=64414655, newNodes=[nodeA:51010], > client=DFSClient_NONMAPREDUCE_1038417265_1) > 2022-10-23 12:36:35,004 INFO [on default port 8020] hdfs.StateChange - DIR* > completeFile: is closed by DFSClient_NONMAPREDUCE_1038417265_1 > {noformat} > {noformat} > - Logs from nodeB - > 2022-10-23 12:36:35,084 INFO [0.180.160.231:51010]] datanode.DataNode - > Received BP-958889176-1567030695029:blk_3317546151_2244173147 size 64414655 > from nodeA:30302 > 2022-10-23 12:36:35,084 INFO [0.180.160.231:51010]] datanode.DataNode - > PacketResponder: BP-958889176-1567030695029:blk_3317546151_2244173147, > type=HAS_DOWNSTREAM_IN_PIPELINE, 
downstreams=1:[nodeC:51010] terminating > 2022-10-23 12:36:39,738 INFO [/data-2/hdfs/current] > impl.FsDatasetAsyncDiskService - Deleted BP-958889176-1567030695029 > blk_3317546151_2244173147 file > /data-2/hdfs/current/BP-958889176-1567030695029/current/finalized/subdir189/subdir188/blk_3317546151 > {noformat} > > {noformat} > - Logs from nodeC - > 2022-10-23 12:36:34,985 INFO [ype=LAST_IN_PIPELINE] datanode.DataNode - > Received BP-958889176-1567030695029:blk_3317546151_2244173147 size 64414655 > from nodeB:56486 > 2022-10-23 12:36:34,985 INFO [ype=LAST_IN_PIPELINE] datanode.DataNode - > PacketResponder: BP-9588
[jira] [Commented] (HDFS-16829) Delay deleting blocks with older generation stamp until the block is fully replicated.
[ https://issues.apache.org/jira/browse/HDFS-16829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17627288#comment-17627288 ] Ayush Saxena commented on HDFS-16829: - {quote}One suggested improvement is to delay deleting the blocks with old generation stamp until the block is replicated sufficiently which satisfies replication factor. {quote} What if the client still pushed more data to the single datanode? The data contained by the other datanodes won't be complete right? You can get something if you figure out really quick before the configured time after which the block gets deleted. Moreover with this in genuine cases also you would be occupying the memory with blocks with old genstamps, if the cluster is really unstable and has a lot of updatePipelines or things like that, it may be a issue in that case {quote}3. Namenode updated the pipeline just with the first node. {quote} I think there is a minimum configuration as well, say if you configure that to two, you won't be able to succeed if it doesn't have minimum 2 datanodes in the pipeline. It can prevent you with such situation where one node is screwed. Not sure but I need to check if it is: `` dfs.client.block.write.replace-datanode-on-failure.min-replication``, if it doesn't behave that way, good to have a new config with such behaviour semantics {quote}Due to disk write errors or bug in BlockReceiver, nodeA didn't flush/sync the last 5MB (64414655-59158528) of data to disk. I assume it buffered in memory since nameonode got an acknowledgement from the client that updatePipeline succeeded. {quote} I think if there is one node only, we can give it to try that we have syncBlock always true in such cases, might help or may be not if the Disk has some issues in some cases... > Delay deleting blocks with older generation stamp until the block is fully > replicated. 
> -- > > Key: HDFS-16829 > URL: https://issues.apache.org/jira/browse/HDFS-16829 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode, namenode >Affects Versions: 2.10.1 >Reporter: Rushabh Shah >Priority: Critical > > We encountered this data loss issue in one of our production clusters which > runs hbase service. We received a missing block alert in this cluster. This > error was logged in the datanode holding the block. > {noformat} > 2022-10-27 18:37:51,341 ERROR [17546151_2244173222]] datanode.DataNode - > nodeA:51010:DataXceiver error processing READ_BLOCK operation src: > /nodeA:31722 dst: > java.io.IOException: Offset 64410559 and length 4096 don't match block > BP-958889176-1567030695029:blk_3317546151_2244173222 ( blockLen 59158528 ) > at > org.apache.hadoop.hdfs.server.datanode.BlockSender.(BlockSender.java:384) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:603) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:145) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:100) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:298) > at java.lang.Thread.run(Thread.java:750) > {noformat} > The node +nodeA+ has this block blk_3317546151_2244173222 with file length: > 59158528 but the length of this block according to namenode is 64414655 > (according to fsck) > This are the sequence of events for this block. > > 1. Namenode created a file with 3 replicas with block id: blk_3317546151 and > genstamp: 2244173147. > 2. The first datanode in the pipeline (This physical host was also running > region server process which was hdfs client) was restarting at the same time. > Unfortunately this node was sick and it didn't log anything neither in > datanode process or regionserver process during the time of block creation. > 3. Namenode updated the pipeline just with the first node. > 4. 
Namenode logged updatePipeline success with just 1st node nodeA with block > size: 64414655 and new generation stamp: 2244173222 > 5. Namenode asked nodeB and nodeC to delete the block since it has old > generation stamp. > 6. All the reads (client reads and data transfer reads) from nodeA are > failing with the above stack trace. > See logs below from namenode and nodeB and nodeC. > {noformat} > Logs from namenode - > 2022-10-23 12:36:34,449 INFO [on default port 8020] hdfs.StateChange - > BLOCK* allocate blk_3317546151_2244173147, replicas=nodeA:51010, nodeB:51010 > , nodeC:51010 for > 2022-10-23 12:36:34,978 INFO [on default port 8020] namenode.FSNamesystem - > updatePipeline(blk_3317546151_2244173147 => blk_3317546151_2244173222) success > 2022-10-23 12:36:34,978 INFO [on defaul
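For reference, the client-side setting mentioned in the comment above is a real hdfs-site.xml property (in hdfs-default.xml it defaults to 0, i.e. best-effort replacement). A sketch of the entry, with 2 chosen purely as an example value:

```xml
<!-- Example only: require at least 2 live datanodes in a write pipeline.
     With fewer, the client write fails instead of continuing on one node,
     which would have prevented the single-node pipeline described above. -->
<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.min-replication</name>
  <value>2</value>
</property>
```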
[jira] [Commented] (HDFS-3570) Balancer shouldn't rely on "DFS Space Used %" as that ignores non-DFS used space
[ https://issues.apache.org/jira/browse/HDFS-3570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17627136#comment-17627136 ] ASF GitHub Bot commented on HDFS-3570: -- ashutoshcipher commented on PR #5044: URL: https://github.com/apache/hadoop/pull/5044#issuecomment-1298432054 Thanks @ZanderXu, I will work on writing UT. > Balancer shouldn't rely on "DFS Space Used %" as that ignores non-DFS used > space > > > Key: HDFS-3570 > URL: https://issues.apache.org/jira/browse/HDFS-3570 > Project: Hadoop HDFS > Issue Type: Bug > Components: balancer & mover >Affects Versions: 2.0.0-alpha >Reporter: Harsh J >Assignee: Ashutosh Gupta >Priority: Minor > Labels: pull-request-available > Attachments: HDFS-3570.003.patch, HDFS-3570.2.patch, > HDFS-3570.aash.1.patch > > > Report from a user here: > https://groups.google.com/a/cloudera.org/d/msg/cdh-user/pIhNyDVxdVY/b7ENZmEvBjIJ, > post archived at http://pastebin.com/eVFkk0A0 > This user had a specific DN that had a large non-DFS usage among > dfs.data.dirs, and very little DFS usage (which is computed against total > possible capacity). > Balancer apparently only looks at the usage, and ignores to consider that > non-DFS usage may also be high on a DN/cluster. Hence, it thinks that if a > DFS Usage report from DN is 8% only, its got a lot of free space to write > more blocks, when that isn't true as shown by the case of this user. It went > on scheduling writes to the DN to balance it out, but the DN simply can't > accept any more blocks as a result of its disks' state. > I think it would be better if we _computed_ the actual utilization based on > {{(100-(actual remaining space))/(capacity)}}, as opposed to the current > {{(dfs used)/(capacity)}}. Thoughts? > This isn't very critical, however, cause it is very rare to see DN space > being used for non DN data, but it does expose a valid bug. 
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
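The proposal in the report — computing utilization from remaining capacity instead of DFS-used bytes — can be illustrated with a tiny standalone calculation. The numbers are made up to mirror the reported scenario; this is not Balancer code.

```java
// Standalone illustration: a datanode with low DFS usage but heavy non-DFS
// usage looks nearly empty under the current metric, while the
// remaining-space-based metric shows it is almost full.
public class UtilizationDemo {
    // Current Balancer view: DFS-used bytes over capacity.
    static double dfsUsedPercent(long dfsUsed, long capacity) {
        return 100.0 * dfsUsed / capacity;
    }

    // Proposed view: everything that is not remaining counts as used,
    // which includes non-DFS data sharing the same disks.
    static double actualUsedPercent(long remaining, long capacity) {
        return 100.0 * (capacity - remaining) / capacity;
    }

    public static void main(String[] args) {
        long capacity = 1000L;  // GB, hypothetical datanode
        long dfsUsed = 80L;     // 8% DFS usage, as in the report
        long nonDfsUsed = 850L; // large non-DFS usage
        long remaining = capacity - dfsUsed - nonDfsUsed;

        System.out.println(dfsUsedPercent(dfsUsed, capacity));      // prints 8.0
        System.out.println(actualUsedPercent(remaining, capacity)); // prints 93.0
    }
}
```

Under the current metric the Balancer would keep scheduling writes to this node; under the proposed one it would correctly treat it as over-utilized.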
[jira] [Commented] (HDFS-16804) AddVolume contains a race condition with shutdown block pool
[ https://issues.apache.org/jira/browse/HDFS-16804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17627109#comment-17627109 ] ASF GitHub Bot commented on HDFS-16804: --- Hexiaoqiao commented on code in PR #5033: URL: https://github.com/apache/hadoop/pull/5033#discussion_r1010316878 ## hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetImpl.java: ## @@ -452,6 +449,9 @@ private synchronized void activateVolume( throw new IOException(errorMsg); } volumeMap.mergeAll(replicaMap); +for (String bp : volumeMap.getBlockPoolList()) { Review Comment: Could `storageMap` meet some concurrent-modify issues when move hold lock logic here? Thanks. > AddVolume contains a race condition with shutdown block pool > > > Key: HDFS-16804 > URL: https://issues.apache.org/jira/browse/HDFS-16804 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: ZanderXu >Assignee: ZanderXu >Priority: Major > Labels: pull-request-available > > Add Volume contains a race condition with shutdown block pool, causing the > ReplicaMap still contains some blocks belong to the removed block pool. > And the new volume still contains one unused BlockPoolSlice belongs to the > removed block pool, caused some problems, such as: incorrect dfsUsed, > incorrect numBlocks of the volume. > Let's review the logic of addVolume and shutdownBlockPool respectively. 
> > AddVolume Logic: > * Step1: Get all namespaceInfo from blockPoolManager > * Step2: Create one temporary FsVolumeImpl object > * Step3: Create some blockPoolSlice according to the namespaceInfo and add > them to the temporary FsVolumeImpl object > * Step4: Scan all blocks of the namespaceInfo from the volume and store them > by one temporary ReplicaMap > * Step5: Active the temporary FsVolumeImpl which created before (with > FsDatasetImpl synchronized lock) > ** Step5.1: Merge all blocks of the temporary ReplicaMap to the global > ReplicaMap > ** Step5.2: Add the FsVolumeImpl to the volumes > ShutdownBlockPool Logic:(with blockPool write lock) > * Step1: Cleanup the blockPool from the global ReplicaMap > * Step2: Shutdown the block pool from all the volumes > ** Step2.1: do some clean operations for the block pool, such as > saveReplica, saveDfsUsed, etc > ** Step2.2: remove the blockPool from bpSlices > The race condition can be reproduced by the following steps: > * AddVolume Step1: Get all namespaceInfo from blockPoolManager > * ShutdownBlockPool Step1: Cleanup the blockPool from the global ReplicaMap > * ShutdownBlockPool Step2: Shutdown the block pool from all the volumes > * AddVolume Step 2~5 > And result: > * The global replicaMap contains some blocks belong to the removed blockPool > * The bpSlices of the FsVolumeImpl contains one blockPoolSlice belong to the > removed blockPool > Expected result: > * The global replicaMap shouldn't contain any blocks belong to the removed > blockPool > * The bpSlices of any FsVolumeImpl shouldn't contain any blockPoolSlice > belong to the removed blockPool -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
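The expected results listed above imply that volume activation must re-check, under the lock, which block pools are still alive, and drop replicas of any pool shut down between the scan and the activation. A minimal standalone model of that check — illustrative names, not `FsDatasetImpl` code:

```java
import java.util.HashSet;
import java.util.Set;

public class AddVolumeRaceSketch {
    // Stand-ins for the dataset's live block pools and the global ReplicaMap's
    // pool list; in the real code these are guarded by the dataset/BP locks.
    static final Set<String> activePools = new HashSet<>();
    static final Set<String> replicaMapPools = new HashSet<>();

    // Activation merges only pools that are still active, so replicas of a
    // pool shut down during the (unlocked) scan never reach the global map.
    static void activate(Set<String> scannedPools) {
        synchronized (AddVolumeRaceSketch.class) {
            for (String bp : scannedPools) {
                if (activePools.contains(bp)) {
                    replicaMapPools.add(bp);
                } // else: pool was shut down concurrently; skip its replicas
            }
        }
    }

    public static void main(String[] args) {
        activePools.add("BP-1");
        // BP-2 models a pool that was shut down while the new volume was scanned.
        Set<String> scanned = new HashSet<>(Set.of("BP-1", "BP-2"));
        activate(scanned);
        System.out.println(replicaMapPools); // prints [BP-1]
    }
}
```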
[jira] [Commented] (HDFS-16830) Improve router msync operation
[ https://issues.apache.org/jira/browse/HDFS-16830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17627073#comment-17627073 ] zhengchenyu commented on HDFS-16830: Hi, in our production cluster, a huge number of msync calls is sent to the active NameNode. I think we should still pursue two pieces of work: (1) propagate the state id only as the client needs it, avoiding msync to NameNodes the client does not use; (2) share msync results, reducing the number of msync operations. [~simbadzina] How about my proposal? > Improve router msync operation > -- > > Key: HDFS-16830 > URL: https://issues.apache.org/jira/browse/HDFS-16830 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode, rbf >Reporter: zhengchenyu >Assignee: zhengchenyu >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org

[jira] [Commented] (HDFS-16785) DataNode hold BP write lock to scan disk
[ https://issues.apache.org/jira/browse/HDFS-16785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17627074#comment-17627074 ] ASF GitHub Bot commented on HDFS-16785: --- Hexiaoqiao commented on PR #4945: URL: https://github.com/apache/hadoop/pull/4945#issuecomment-1298331657 addendum: 1. almost none cases to trigger more than one thread add volume; 2. even in the worst case, one volume instance will be failed to add to volumeMap when active volume. In one word, I think it is acceptable. > DataNode hold BP write lock to scan disk > > > Key: HDFS-16785 > URL: https://issues.apache.org/jira/browse/HDFS-16785 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: ZanderXu >Assignee: ZanderXu >Priority: Major > Labels: pull-request-available > > When patching the fine-grained locking of datanode, I found that `addVolume` > will hold the write block of the BP lock to scan the new volume to get the > blocks. If we try to add one full volume that was fixed offline before, i > will hold the write lock for a long time. > The related code as bellows: > {code:java} > for (final NamespaceInfo nsInfo : nsInfos) { > String bpid = nsInfo.getBlockPoolID(); > try (AutoCloseDataSetLock l = lockManager.writeLock(LockLevel.BLOCK_POOl, > bpid)) { > fsVolume.addBlockPool(bpid, this.conf, this.timer); > fsVolume.getVolumeMap(bpid, tempVolumeMap, ramDiskReplicaTracker); > } catch (IOException e) { > LOG.warn("Caught exception when adding " + fsVolume + > ". Will throw later.", e); > exceptions.add(e); > } > } {code} > And I noticed that this lock is added by HDFS-15382, means that this logic is > not in lock before. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
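The pattern this PR moves toward — performing the expensive disk scan outside the lock into a temporary replica map, then taking the lock only for the cheap merge — can be sketched in isolation as follows. This is a simplified model with invented names, not the actual `FsDatasetImpl` code.

```java
import java.util.HashMap;
import java.util.Map;

public class VolumeScanSketch {
    static final Object datasetLock = new Object();
    // Stand-in for the global ReplicaMap, keyed by block id -> volume path.
    static final Map<String, String> globalReplicaMap = new HashMap<>();

    // Expensive disk scan done WITHOUT holding the dataset/BP lock, writing
    // into a private temporary map; a slow or full volume no longer blocks
    // other lock holders for the duration of the scan.
    static Map<String, String> scanVolume(String volume) {
        Map<String, String> temp = new HashMap<>();
        temp.put("blk_1", volume); // stand-ins for replicas read off disk
        temp.put("blk_2", volume);
        return temp;
    }

    // Only the cheap merge happens under the lock, keeping hold time short.
    static void activateVolume(Map<String, String> temp) {
        synchronized (datasetLock) {
            globalReplicaMap.putAll(temp);
        }
    }

    public static void main(String[] args) {
        activateVolume(scanVolume("/data-1"));
        System.out.println(globalReplicaMap.size()); // prints 2
    }
}
```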
[jira] [Commented] (HDFS-16785) DataNode hold BP write lock to scan disk
[ https://issues.apache.org/jira/browse/HDFS-16785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17627067#comment-17627067 ] ASF GitHub Bot commented on HDFS-16785: --- Hexiaoqiao commented on PR #4945: URL: https://github.com/apache/hadoop/pull/4945#issuecomment-1298321977 LGTM. +1 from my side. sorry for the late response. cc @MingXiangLi any more comments here? > DataNode hold BP write lock to scan disk > > > Key: HDFS-16785 > URL: https://issues.apache.org/jira/browse/HDFS-16785 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: ZanderXu >Assignee: ZanderXu >Priority: Major > Labels: pull-request-available > > When patching the fine-grained locking of datanode, I found that `addVolume` > will hold the write block of the BP lock to scan the new volume to get the > blocks. If we try to add one full volume that was fixed offline before, i > will hold the write lock for a long time. > The related code as bellows: > {code:java} > for (final NamespaceInfo nsInfo : nsInfos) { > String bpid = nsInfo.getBlockPoolID(); > try (AutoCloseDataSetLock l = lockManager.writeLock(LockLevel.BLOCK_POOl, > bpid)) { > fsVolume.addBlockPool(bpid, this.conf, this.timer); > fsVolume.getVolumeMap(bpid, tempVolumeMap, ramDiskReplicaTracker); > } catch (IOException e) { > LOG.warn("Caught exception when adding " + fsVolume + > ". Will throw later.", e); > exceptions.add(e); > } > } {code} > And I noticed that this lock is added by HDFS-15382, means that this logic is > not in lock before. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-16830) Improve router msync operation
zhengchenyu created HDFS-16830: -- Summary: Improve router msync operation Key: HDFS-16830 URL: https://issues.apache.org/jira/browse/HDFS-16830 Project: Hadoop HDFS Issue Type: Improvement Components: namenode, rbf Reporter: zhengchenyu Assignee: zhengchenyu -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-13522) HDFS-13522: Add federated nameservices states to client protocol and propagate it between routers and clients.
[ https://issues.apache.org/jira/browse/HDFS-13522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17627065#comment-17627065 ] zhengchenyu commented on HDFS-13522: [~simbadzina] I agree with you! Indeed, in our production cluster I did not dare to disable msync, so many msync calls are still sent to the active NameNode. Although msync is a low-cost operation, I think it is necessary for us to reduce the number of msync calls. > HDFS-13522: Add federated nameservices states to client protocol and > propagate it between routers and clients. > -- > > Key: HDFS-13522 > URL: https://issues.apache.org/jira/browse/HDFS-13522 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: federation, namenode >Reporter: Erik Krogen >Assignee: Simbarashe Dzinamarira >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0, 3.3.5 > > Attachments: HDFS-13522.001.patch, HDFS-13522.002.patch, > HDFS-13522_WIP.patch, RBF_ Observer support.pdf, Router+Observer RPC > clogging.png, ShortTerm-Routers+Observer.png, > observer_reads_in_rbf_proposal_simbadzina_v1.pdf, > observer_reads_in_rbf_proposal_simbadzina_v2.pdf > > Time Spent: 20h 50m > Remaining Estimate: 0h > > Changes will need to occur to the router to support the new observer node. > One such change will be to make the router understand the observer state, > e.g. {{{}FederationNamenodeServiceState{}}}. > This patch captures the state of all namespaces in the routers and propagates > it to clients. A follow up patch will change router behavior to direct > requests to the observer. > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16811) Support to make dfs.namenode.decommission.backoff.monitor.pending.limit reconfigurable
[ https://issues.apache.org/jira/browse/HDFS-16811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17627045#comment-17627045 ] ASF GitHub Bot commented on HDFS-16811: --- tomscut commented on PR #5068: URL: https://github.com/apache/hadoop/pull/5068#issuecomment-1298265787 The failed unit test seems unrelated to the change. > Support to make dfs.namenode.decommission.backoff.monitor.pending.limit > reconfigurable > --- > > Key: HDFS-16811 > URL: https://issues.apache.org/jira/browse/HDFS-16811 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Haiyang Hu >Assignee: Haiyang Hu >Priority: Major > Labels: pull-request-available > > When the Backoff monitor is enabled, the parameter > dfs.namenode.decommission.backoff.monitor.pending.limit can be dynamically > adjusted to determines the maximum number of blocks related to decommission > and maintenance operations that can be loaded into the replication queue. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16811) Support to make dfs.namenode.decommission.backoff.monitor.pending.limit reconfigurable
[ https://issues.apache.org/jira/browse/HDFS-16811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17627003#comment-17627003 ] ASF GitHub Bot commented on HDFS-16811: --- haiyang1987 commented on PR #5068: URL: https://github.com/apache/hadoop/pull/5068#issuecomment-1298205771 Trigger notification > Support to make dfs.namenode.decommission.backoff.monitor.pending.limit > reconfigurable > --- > > Key: HDFS-16811 > URL: https://issues.apache.org/jira/browse/HDFS-16811 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Haiyang Hu >Assignee: Haiyang Hu >Priority: Major > Labels: pull-request-available > > When the Backoff monitor is enabled, the parameter > dfs.namenode.decommission.backoff.monitor.pending.limit can be dynamically > adjusted to determines the maximum number of blocks related to decommission > and maintenance operations that can be loaded into the replication queue. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org