[jira] [Updated] (HDFS-16871) DiskBalancer process may throw IllegalArgumentException when the target DataNode has a capital letter in its hostname
[ https://issues.apache.org/jira/browse/HDFS-16871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Ma updated HDFS-16871:
-----------------------------
    Attachment: screenshot-1.png

> DiskBalancer process may throw IllegalArgumentException when the target
> DataNode has a capital letter in its hostname
>
> Key: HDFS-16871
> URL: https://issues.apache.org/jira/browse/HDFS-16871
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Daniel Ma
> Assignee: Daniel Ma
> Priority: Major
> Attachments: screenshot-1.png

--
This message was sent by Atlassian Jira (v8.20.10#820010)

To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-16871) DiskBalancer process may throw IllegalArgumentException when the target DataNode has a capital letter in its hostname
Daniel Ma created HDFS-16871:
-----------------------------

Summary: DiskBalancer process may throw IllegalArgumentException when the target DataNode has a capital letter in its hostname
Key: HDFS-16871
URL: https://issues.apache.org/jira/browse/HDFS-16871
Project: Hadoop HDFS
Issue Type: Bug
Reporter: Daniel Ma
Attachments: screenshot-1.png
[jira] [Assigned] (HDFS-16871) DiskBalancer process may throw IllegalArgumentException when the target DataNode has a capital letter in its hostname
[ https://issues.apache.org/jira/browse/HDFS-16871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Ma reassigned HDFS-16871:
--------------------------------
    Assignee: Daniel Ma

> DiskBalancer process may throw IllegalArgumentException when the target
> DataNode has a capital letter in its hostname
>
> Key: HDFS-16871
> URL: https://issues.apache.org/jira/browse/HDFS-16871
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Daniel Ma
> Assignee: Daniel Ma
> Priority: Major
> Attachments: screenshot-1.png
[jira] [Updated] (HDFS-16871) DiskBalancer process may throw IllegalArgumentException when the target DataNode has a capital letter in its hostname
[ https://issues.apache.org/jira/browse/HDFS-16871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Ma updated HDFS-16871:
-----------------------------
    Description:
The DiskBalancer process reads DataNode hostnames as lowercase letters,
!screenshot-1.png!
but there is no case transformation in getNodeByName.
For a DataNode with a lowercase hostname, everything is OK. But for a DataNode with an uppercase hostname, an IllegalArgumentException is thrown as below:
{code:java}
2022-10-09 16:15:26,631 ERROR tools.DiskBalancerCLI: java.lang.IllegalArgumentException: Unable to find the specified node. node-group-1YlRf0002
{code}

> DiskBalancer process may throw IllegalArgumentException when the target
> DataNode has a capital letter in its hostname
>
> Key: HDFS-16871
> URL: https://issues.apache.org/jira/browse/HDFS-16871
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Daniel Ma
> Assignee: Daniel Ma
> Priority: Major
> Attachments: screenshot-1.png
>
> The DiskBalancer process reads DataNode hostnames as lowercase letters,
> !screenshot-1.png!
> but there is no case transformation in getNodeByName.
> For a DataNode with a lowercase hostname, everything is OK. But for a
> DataNode with an uppercase hostname, an IllegalArgumentException is thrown
> as below:
> {code:java}
> 2022-10-09 16:15:26,631 ERROR tools.DiskBalancerCLI:
> java.lang.IllegalArgumentException: Unable to find the specified node.
> node-group-1YlRf0002
> {code}
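The report above boils down to a missing case normalization in the node lookup: hostnames are stored lowercased but the lookup argument is not lowercased. A minimal sketch of the fix idea, with hypothetical class and method names (not the real DiskBalancer API), could look like:

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Illustrative sketch only -- names are hypothetical, not the actual
// DiskBalancer code. The idea: key the node map by lowercased hostname and
// lowercase the lookup argument too, so a mixed-case hostname such as
// "node-group-1YlRf0002" still resolves.
public class NodeLookupSketch {
    private final Map<String, String> nodesByHost = new HashMap<>();

    void addNode(String hostname, String uuid) {
        // DNS hostnames are case-insensitive, so normalize on insert.
        nodesByHost.put(hostname.toLowerCase(Locale.ROOT), uuid);
    }

    String getNodeByName(String hostname) {
        // Normalize on lookup as well; this is the step the bug report says
        // is missing ("no letter case transform when getNodeByName").
        String uuid = nodesByHost.get(hostname.toLowerCase(Locale.ROOT));
        if (uuid == null) {
            throw new IllegalArgumentException(
                "Unable to find the specified node. " + hostname);
        }
        return uuid;
    }

    /** Registers a lowercase node, then looks it up with mixed case. */
    static String demo() {
        NodeLookupSketch cluster = new NodeLookupSketch();
        cluster.addNode("node-group-1ylrf0002", "dn-uuid-1");
        return cluster.getNodeByName("node-group-1YlRf0002");
    }

    public static void main(String[] args) {
        System.out.println(demo()); // resolves instead of throwing
    }
}
```

Using Locale.ROOT avoids locale-dependent surprises (e.g. the Turkish dotless i) when lowercasing.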
[jira] [Updated] (HDFS-16871) DiskBalancer process may throw IllegalArgumentException when the target DataNode has a capital letter in its hostname
[ https://issues.apache.org/jira/browse/HDFS-16871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Ma updated HDFS-16871:
-----------------------------
    Description:
The DiskBalancer process reads DataNode hostnames as lowercase letters,
!screenshot-1.png!
but there is no case transformation in getNodeByName.
!screenshot-2.png!
For a DataNode with a lowercase hostname, everything is OK. But for a DataNode with an uppercase hostname, an IllegalArgumentException is thrown as below:
{code:java}
2022-10-09 16:15:26,631 ERROR tools.DiskBalancerCLI: java.lang.IllegalArgumentException: Unable to find the specified node. node-group-1YlRf0002
{code}

  was:
The DiskBalancer process reads DataNode hostnames as lowercase letters,
!screenshot-1.png!
but there is no case transformation in getNodeByName.
For a DataNode with a lowercase hostname, everything is OK. But for a DataNode with an uppercase hostname, an IllegalArgumentException is thrown as below:
{code:java}
2022-10-09 16:15:26,631 ERROR tools.DiskBalancerCLI: java.lang.IllegalArgumentException: Unable to find the specified node. node-group-1YlRf0002
{code}

> DiskBalancer process may throw IllegalArgumentException when the target
> DataNode has a capital letter in its hostname
>
> Key: HDFS-16871
> URL: https://issues.apache.org/jira/browse/HDFS-16871
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Daniel Ma
> Assignee: Daniel Ma
> Priority: Major
> Attachments: screenshot-1.png, screenshot-2.png
>
> The DiskBalancer process reads DataNode hostnames as lowercase letters,
> !screenshot-1.png!
> but there is no case transformation in getNodeByName.
> !screenshot-2.png!
> For a DataNode with a lowercase hostname, everything is OK. But for a
> DataNode with an uppercase hostname, an IllegalArgumentException is thrown
> as below:
> {code:java}
> 2022-10-09 16:15:26,631 ERROR tools.DiskBalancerCLI:
> java.lang.IllegalArgumentException: Unable to find the specified node.
> node-group-1YlRf0002
> {code}
[jira] [Updated] (HDFS-16871) DiskBalancer process may throw IllegalArgumentException when the target DataNode has a capital letter in its hostname
[ https://issues.apache.org/jira/browse/HDFS-16871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Ma updated HDFS-16871:
-----------------------------
    Attachment: screenshot-2.png

> DiskBalancer process may throw IllegalArgumentException when the target
> DataNode has a capital letter in its hostname
>
> Key: HDFS-16871
> URL: https://issues.apache.org/jira/browse/HDFS-16871
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Daniel Ma
> Assignee: Daniel Ma
> Priority: Major
> Attachments: screenshot-1.png, screenshot-2.png
>
> The DiskBalancer process reads DataNode hostnames as lowercase letters,
> !screenshot-1.png!
> but there is no case transformation in getNodeByName.
> For a DataNode with a lowercase hostname, everything is OK. But for a
> DataNode with an uppercase hostname, an IllegalArgumentException is thrown
> as below:
> {code:java}
> 2022-10-09 16:15:26,631 ERROR tools.DiskBalancerCLI:
> java.lang.IllegalArgumentException: Unable to find the specified node.
> node-group-1YlRf0002
> {code}
[jira] [Updated] (HDFS-16871) DiskBalancer process may throw IllegalArgumentException when the target DataNode has a capital letter in its hostname
[ https://issues.apache.org/jira/browse/HDFS-16871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Ma updated HDFS-16871:
-----------------------------
    Description:
The DiskBalancer process reads DataNode hostnames as lowercase letters,
!screenshot-1.png!
but there is no case transformation in getNodeByName.
!screenshot-2.png!
For a DataNode with a lowercase hostname, everything is OK. But for a DataNode with an uppercase hostname, when the Balancer process tries to migrate onto it, an IllegalArgumentException is thrown as below:
{code:java}
2022-10-09 16:15:26,631 ERROR tools.DiskBalancerCLI: java.lang.IllegalArgumentException: Unable to find the specified node. node-group-1YlRf0002
{code}

  was:
The DiskBalancer process reads DataNode hostnames as lowercase letters,
!screenshot-1.png!
but there is no case transformation in getNodeByName.
!screenshot-2.png!
For a DataNode with a lowercase hostname, everything is OK. But for a DataNode with an uppercase hostname, an IllegalArgumentException is thrown as below:
{code:java}
2022-10-09 16:15:26,631 ERROR tools.DiskBalancerCLI: java.lang.IllegalArgumentException: Unable to find the specified node. node-group-1YlRf0002
{code}

> DiskBalancer process may throw IllegalArgumentException when the target
> DataNode has a capital letter in its hostname
>
> Key: HDFS-16871
> URL: https://issues.apache.org/jira/browse/HDFS-16871
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Daniel Ma
> Assignee: Daniel Ma
> Priority: Major
> Attachments: screenshot-1.png, screenshot-2.png
>
> The DiskBalancer process reads DataNode hostnames as lowercase letters,
> !screenshot-1.png!
> but there is no case transformation in getNodeByName.
> !screenshot-2.png!
> For a DataNode with a lowercase hostname, everything is OK. But for a
> DataNode with an uppercase hostname, when the Balancer process tries to
> migrate onto it, an IllegalArgumentException is thrown as below:
> {code:java}
> 2022-10-09 16:15:26,631 ERROR tools.DiskBalancerCLI:
> java.lang.IllegalArgumentException: Unable to find the specified node.
> node-group-1YlRf0002
> {code}
[jira] [Commented] (HDFS-16871) DiskBalancer process may throw IllegalArgumentException when the target DataNode has a capital letter in its hostname
[ https://issues.apache.org/jira/browse/HDFS-16871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17649213#comment-17649213 ]

ASF GitHub Bot commented on HDFS-16871:
---------------------------------------

Daniel-009497 opened a new pull request, #5240:
URL: https://github.com/apache/hadoop/pull/5240

For a DataNode with a lowercase hostname everything is OK, but for a DataNode with an uppercase hostname, when the Balancer process tries to migrate onto it, an IllegalArgumentException is thrown. For more details, please refer to HDFS-16871.

> DiskBalancer process may throw IllegalArgumentException when the target
> DataNode has a capital letter in its hostname
>
> Key: HDFS-16871
> URL: https://issues.apache.org/jira/browse/HDFS-16871
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Daniel Ma
> Assignee: Daniel Ma
> Priority: Major
> Attachments: screenshot-1.png, screenshot-2.png
>
> The DiskBalancer process reads DataNode hostnames as lowercase letters,
> !screenshot-1.png!
> but there is no case transformation in getNodeByName.
> !screenshot-2.png!
> For a DataNode with a lowercase hostname, everything is OK. But for a
> DataNode with an uppercase hostname, when the Balancer process tries to
> migrate onto it, an IllegalArgumentException is thrown as below:
> {code:java}
> 2022-10-09 16:15:26,631 ERROR tools.DiskBalancerCLI:
> java.lang.IllegalArgumentException: Unable to find the specified node.
> node-group-1YlRf0002
> {code}
[jira] [Updated] (HDFS-16871) DiskBalancer process may throw IllegalArgumentException when the target DataNode has a capital letter in its hostname
[ https://issues.apache.org/jira/browse/HDFS-16871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HDFS-16871:
----------------------------------
    Labels: pull-request-available  (was: )

> DiskBalancer process may throw IllegalArgumentException when the target
> DataNode has a capital letter in its hostname
>
> Key: HDFS-16871
> URL: https://issues.apache.org/jira/browse/HDFS-16871
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Daniel Ma
> Assignee: Daniel Ma
> Priority: Major
> Labels: pull-request-available
> Attachments: screenshot-1.png, screenshot-2.png
>
> The DiskBalancer process reads DataNode hostnames as lowercase letters,
> !screenshot-1.png!
> but there is no case transformation in getNodeByName.
> !screenshot-2.png!
> For a DataNode with a lowercase hostname, everything is OK. But for a
> DataNode with an uppercase hostname, when the Balancer process tries to
> migrate onto it, an IllegalArgumentException is thrown as below:
> {code:java}
> 2022-10-09 16:15:26,631 ERROR tools.DiskBalancerCLI:
> java.lang.IllegalArgumentException: Unable to find the specified node.
> node-group-1YlRf0002
> {code}
[jira] [Created] (HDFS-16872) Fix log throttling by declaring LogThrottlingHelper as static members
Chengbing Liu created HDFS-16872: Summary: Fix log throttling by declaring LogThrottlingHelper as static members Key: HDFS-16872 URL: https://issues.apache.org/jira/browse/HDFS-16872 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 3.3.4 Reporter: Chengbing Liu In our production cluster with Observer NameNode enabled, we have plenty of logs printed by {{FSEditLogLoader}} and {{RedundantEditLogInputStream}}. The {{LogThrottlingHelper}} doesn't seem to work. {noformat} 2022-10-25 09:26:50,380 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Start loading edits file ByteStringEditLog[17686250688, 17686250688], ByteStringEditLog[17686250688, 17686250688], ByteStringEditLog[17686250688, 17686250688] maxTxnsToRead = 9223372036854775807 2022-10-25 09:26:50,380 INFO org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: Fast-forwarding stream 'ByteStringEditLog[17686250688, 17686250688], ByteStringEditLog[17686250688, 17686250688], ByteStringEditLog[17686250688, 17686250688]' to transaction ID 17686250688 2022-10-25 09:26:50,380 INFO org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: Fast-forwarding stream 'ByteStringEditLog[17686250688, 17686250688]' to transaction ID 17686250688 2022-10-25 09:26:50,380 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Loaded 1 edits file(s) (the last named ByteStringEditLog[17686250688, 17686250688], ByteStringEditLog[17686250688, 17686250688], ByteStringEditLog[17686250688, 17686250688]) of total size 527.0, total edits 1.0, total load time 0.0 ms 2022-10-25 09:26:50,387 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Start loading edits file ByteStringEditLog[17686250689, 17686250693], ByteStringEditLog[17686250689, 17686250693], ByteStringEditLog[17686250689, 17686250693] maxTxnsToRead = 9223372036854775807 2022-10-25 09:26:50,387 INFO org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: Fast-forwarding stream 'ByteStringEditLog[17686250689, 17686250693], 
ByteStringEditLog[17686250689, 17686250693], ByteStringEditLog[17686250689, 17686250693]' to transaction ID 17686250689
2022-10-25 09:26:50,387 INFO org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: Fast-forwarding stream 'ByteStringEditLog[17686250689, 17686250693]' to transaction ID 17686250689
2022-10-25 09:26:50,387 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Loaded 1 edits file(s) (the last named ByteStringEditLog[17686250689, 17686250693], ByteStringEditLog[17686250689, 17686250693], ByteStringEditLog[17686250689, 17686250693]) of total size 890.0, total edits 5.0, total load time 1.0 ms
{noformat}

After some digging, I found the cause is that the {{LogThrottlingHelper}} objects are declared as instance variables of all the enclosing classes, including {{FSImage}}, {{FSEditLogLoader}} and {{RedundantEditLogInputStream}}. Therefore the logging frequency is not limited across different instances. For classes with only a limited number of instances, such as {{FSImage}}, this is fine. For classes whose instances are created frequently, such as {{FSEditLogLoader}} and {{RedundantEditLogInputStream}}, it results in plenty of logs.

This can be fixed by declaring the {{LogThrottlingHelper}} objects as static members.
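The report's reasoning can be reproduced with a toy throttler. The sketch below is not the real org.apache.hadoop.log.LogThrottlingHelper API; it only illustrates why per-instance throttling state fails for frequently created classes and why a class-level (static) member fixes it:

```java
import java.util.Arrays;

// Self-contained illustration, not the real LogThrottlingHelper: a throttler
// that allows one log line per period. When every short-lived loader instance
// owns its own throttler, the suppression state is lost with each new
// instance; a static throttler shares the state across all instances.
public class ThrottleSketch {
    static class Throttler {
        private final long periodMs;
        private long lastLogMs;

        Throttler(long periodMs) {
            this.periodMs = periodMs;
            this.lastLogMs = -periodMs; // so the very first call logs
        }

        boolean shouldLog(long nowMs) {
            if (nowMs - lastLogMs >= periodMs) {
                lastLogMs = nowMs;
                return true;
            }
            return false;
        }
    }

    // Shared across all loader instances -- the proposed fix.
    private static final Throttler SHARED = new Throttler(5000);
    // One throttler per instance -- the buggy pattern described above.
    private final Throttler perInstance = new Throttler(5000);

    /**
     * Simulates three loaders created 1 ms apart starting at baseMs and
     * returns {logsWithStaticThrottler, logsWithPerInstanceThrottler}.
     */
    static int[] simulate(long baseMs) {
        int staticLogs = 0;
        int instanceLogs = 0;
        for (int i = 0; i < 3; i++) {
            ThrottleSketch loader = new ThrottleSketch();
            if (SHARED.shouldLog(baseMs + i)) {
                staticLogs++;
            }
            if (loader.perInstance.shouldLog(baseMs + i)) {
                instanceLogs++;
            }
        }
        return new int[] {staticLogs, instanceLogs};
    }

    public static void main(String[] args) {
        // Static state suppresses repeats; per-instance state logs every time.
        System.out.println(Arrays.toString(simulate(0))); // [1, 3]
    }
}
```

The per-instance counter allows all three "Fast-forwarding stream" lines through (one per fresh loader), which matches the flood shown in the {noformat} excerpt.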
[jira] [Commented] (HDFS-16870) Client ip should also be recorded when NameNode is processing reportBadBlocks
[ https://issues.apache.org/jira/browse/HDFS-16870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17649323#comment-17649323 ] ASF GitHub Bot commented on HDFS-16870: --- hadoop-yetus commented on PR #5237: URL: https://github.com/apache/hadoop/pull/5237#issuecomment-1357678231 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 2m 0s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | -1 :x: | test4tests | 0m 0s | | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 43m 43s | | trunk passed | | +1 :green_heart: | compile | 1m 50s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | compile | 1m 28s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | checkstyle | 1m 8s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 42s | | trunk passed | | +1 :green_heart: | javadoc | 1m 10s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | javadoc | 1m 36s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | spotbugs | 3m 45s | | trunk passed | | +1 :green_heart: | shadedclient | 27m 9s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 1m 28s | | the patch passed | | +1 :green_heart: | compile | 1m 32s | | the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | javac | 1m 32s | | the patch passed | | +1 :green_heart: | compile | 1m 29s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | javac | 1m 29s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 0m 55s | | the patch passed | | +1 :green_heart: | mvnsite | 1m 32s | | the patch passed | | +1 :green_heart: | javadoc | 1m 0s | | the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | javadoc | 1m 27s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | spotbugs | 3m 43s | | the patch passed | | +1 :green_heart: | shadedclient | 26m 29s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | -1 :x: | unit | 503m 22s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5237/2/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. | | +1 :green_heart: | asflicense | 1m 15s | | The patch does not generate ASF License warnings. 
| | | | 626m 49s | | | | Reason | Tests | |---:|:--| | Failed junit tests | hadoop.hdfs.TestLeaseRecovery2 | | | hadoop.hdfs.server.namenode.ha.TestSeveralNameNodes | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5237/2/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/5237 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets | | uname | Linux 30f4e88eb6a3 4.15.0-200-generic #211-Ubuntu SMP Thu Nov 24 18:16:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / 05137dd0ffcd6ca5b4442228db70a75be696df01 | | Default Java | Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5237/2/testReport/ | | Max. process+thread count | 2087 (vs. u
[jira] [Commented] (HDFS-16871) DiskBalancer process may throw IllegalArgumentException when the target DataNode has a capital letter in its hostname
[ https://issues.apache.org/jira/browse/HDFS-16871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17649413#comment-17649413 ] ASF GitHub Bot commented on HDFS-16871: --- hadoop-yetus commented on PR #5240: URL: https://github.com/apache/hadoop/pull/5240#issuecomment-1357978080 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 1m 3s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 1s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | -1 :x: | test4tests | 0m 0s | | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 42m 36s | | trunk passed | | +1 :green_heart: | compile | 1m 33s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | compile | 1m 25s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | checkstyle | 1m 10s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 35s | | trunk passed | | +1 :green_heart: | javadoc | 1m 12s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | javadoc | 1m 30s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | spotbugs | 3m 48s | | trunk passed | | +1 :green_heart: | shadedclient | 26m 30s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 1m 26s | | the patch passed | | +1 :green_heart: | compile | 1m 29s | | the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | javac | 1m 29s | | the patch passed | | +1 :green_heart: | compile | 1m 22s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | javac | 1m 22s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 0m 56s | | the patch passed | | +1 :green_heart: | mvnsite | 1m 26s | | the patch passed | | +1 :green_heart: | javadoc | 0m 55s | | the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | javadoc | 1m 20s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | spotbugs | 3m 34s | | the patch passed | | +1 :green_heart: | shadedclient | 26m 39s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | -1 :x: | unit | 381m 0s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5240/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. | | +1 :green_heart: | asflicense | 0m 42s | | The patch does not generate ASF License warnings. 
| | | | 500m 7s | | | | Reason | Tests | |---:|:--| | Failed junit tests | hadoop.hdfs.TestLeaseRecovery2 | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5240/1/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/5240 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets | | uname | Linux 013a56b05a40 4.15.0-200-generic #211-Ubuntu SMP Thu Nov 24 18:16:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / 56adaeb28a02bde2599da949dc69ef1f339a44c9 | | Default Java | Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5240/1/testReport/ | | Max. process+thread count | 2108 (vs. ulimit of 5500) | | modules | C: hadoop-hdfs-project/hadoop-h
[jira] [Commented] (HDFS-16867) Exiting Mover due to an exception in MoverMetrics.create
[ https://issues.apache.org/jira/browse/HDFS-16867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17649443#comment-17649443 ] ASF GitHub Bot commented on HDFS-16867: --- hadoop-yetus commented on PR #5203: URL: https://github.com/apache/hadoop/pull/5203#issuecomment-1358132911 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 57s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | -1 :x: | test4tests | 0m 0s | | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 43m 24s | | trunk passed | | +1 :green_heart: | compile | 1m 38s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | compile | 1m 24s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | checkstyle | 1m 5s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 33s | | trunk passed | | +1 :green_heart: | javadoc | 1m 9s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | javadoc | 1m 31s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | spotbugs | 3m 42s | | trunk passed | | +1 :green_heart: | shadedclient | 26m 12s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 1m 27s | | the patch passed | | +1 :green_heart: | compile | 1m 33s | | the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | javac | 1m 33s | | the patch passed | | +1 :green_heart: | compile | 1m 23s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | javac | 1m 23s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 1m 1s | | the patch passed | | +1 :green_heart: | mvnsite | 1m 31s | | the patch passed | | +1 :green_heart: | javadoc | 0m 55s | | the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | javadoc | 1m 30s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | spotbugs | 3m 46s | | the patch passed | | +1 :green_heart: | shadedclient | 26m 44s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | -1 :x: | unit | 359m 24s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5203/2/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. | | +1 :green_heart: | asflicense | 0m 55s | | The patch does not generate ASF License warnings. 
| | | | 479m 20s | | | | Reason | Tests | |---:|:--| | Failed junit tests | hadoop.hdfs.TestDFSInotifyEventInputStream | | | hadoop.hdfs.TestLeaseRecovery2 | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5203/2/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/5203 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets | | uname | Linux bd6ad7aa00f3 4.15.0-200-generic #211-Ubuntu SMP Thu Nov 24 18:16:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / 76f08187024b39e0c4035be3b30c35a24b2fa9be | | Default Java | Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5203/2/testReport/ | | Max. process+thread count | 1883 (vs. ulimit of
[jira] [Created] (HDFS-16873) FileStatus compareTo does not specify ordering
DDillon created HDFS-16873:
---------------------------

Summary: FileStatus compareTo does not specify ordering
Key: HDFS-16873
URL: https://issues.apache.org/jira/browse/HDFS-16873
Project: Hadoop HDFS
Issue Type: Improvement
Reporter: DDillon

The Javadoc of FileStatus does not specify the field and manner in which objects are ordered. This is critical to understand in order to use the Comparable interface without making assumptions. A quick inspection of the code shows that the ordering is by path name, but users shouldn't have to read the code to confirm seemingly obvious assumptions.
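For illustration, a hypothetical stand-in class (not the actual FileStatus) showing the kind of Javadoc the issue asks for, with the ordering stated explicitly on compareTo:

```java
import java.util.Arrays;

// Hypothetical stand-in for FileStatus: the point is that compareTo's Javadoc
// names the field (path) and the manner (lexicographic) of the ordering, so
// callers need not read the implementation to learn it.
public class FileStatusSketch implements Comparable<FileStatusSketch> {
    private final String path;

    FileStatusSketch(String path) {
        this.path = path;
    }

    String getPath() {
        return path;
    }

    /**
     * Compares this status to another by {@code path}, lexicographically.
     * Note: this ordering is consistent with equals only if paths are unique.
     */
    @Override
    public int compareTo(FileStatusSketch other) {
        return this.path.compareTo(other.path);
    }

    /** Sorts two statuses and returns the path that orders first. */
    static String firstAfterSort() {
        FileStatusSketch[] statuses = {
            new FileStatusSketch("/b"), new FileStatusSketch("/a")
        };
        Arrays.sort(statuses); // uses compareTo, i.e. path order
        return statuses[0].getPath();
    }

    public static void main(String[] args) {
        System.out.println(firstAfterSort()); // "/a"
    }
}
```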
[jira] [Commented] (HDFS-16873) FileStatus compareTo does not specify ordering
[ https://issues.apache.org/jira/browse/HDFS-16873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17649457#comment-17649457 ]

DDillon commented on HDFS-16873:
--------------------------------

https://github.com/apache/hadoop/pull/5219

> FileStatus compareTo does not specify ordering
>
> Key: HDFS-16873
> URL: https://issues.apache.org/jira/browse/HDFS-16873
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: DDillon
> Priority: Trivial
>
> The Javadoc of FileStatus does not specify the field and manner in which
> objects are ordered. This is critical to understand in order to use the
> Comparable interface without making assumptions. A quick inspection of the
> code shows that the ordering is by path name, but users shouldn't have to
> read the code to confirm seemingly obvious assumptions.
[jira] [Commented] (HDFS-16867) Exiting Mover due to an exception in MoverMetrics.create
[ https://issues.apache.org/jira/browse/HDFS-16867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17649458#comment-17649458 ] ASF GitHub Bot commented on HDFS-16867: --- Jing9 commented on code in PR #5203: URL: https://github.com/apache/hadoop/pull/5203#discussion_r1052583684 ## hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/balancer/NameNodeConnector.java: ## @@ -161,6 +162,7 @@ public static void checkOtherInstanceRunning(boolean toCheck) { private final Path idPath; private OutputStream out; private final List targetPaths; + private final MoverMetrics moverMetrics; Review Comment: NameNodeConnector will also be used by Balancer, while MoverMetrics is only used by Mover. So not sure if placing MoverMetrics directly in NameNodeConnector is a good way from the semantic perspective. ## hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/mover/Mover.java: ## @@ -160,7 +160,7 @@ Collections. emptySet(), movedWinWidth, moverThreads, 0, BlockStoragePolicySuite.ID_BIT_LENGTH]; this.excludedPinnedBlocks = excludedPinnedBlocks; this.nnc = nnc; -this.metrics = MoverMetrics.create(this); Review Comment: If the main issue is the potential naming conflict caused by multiple mover instances, can we track the existing MoverMetrics instances and their NNC mappings at the class level (i.e. through a class static field) to avoid the duplication? ## hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/mover/Mover.java: ## @@ -160,7 +160,7 @@ Collections. emptySet(), movedWinWidth, moverThreads, 0, BlockStoragePolicySuite.ID_BIT_LENGTH]; this.excludedPinnedBlocks = excludedPinnedBlocks; this.nnc = nnc; -this.metrics = MoverMetrics.create(this); +this.metrics = nnc.getMoverMetrics(); Review Comment: We also need to add some UTs to reproduce the issue (without your fix) and validate the fix.
> Exiting Mover due to an exception in MoverMetrics.create > > > Key: HDFS-16867 > URL: https://issues.apache.org/jira/browse/HDFS-16867 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: ZhiWei Shi >Assignee: ZhiWei Shi >Priority: Major > Labels: pull-request-available > > After the Mover process is started for a period of time, the process exits > unexpectedly and an error is reported in the log > {code:java} > [hdfs@${hostname} hadoop-3.3.2-nn]$ nohup bin/hdfs mover -p > /test-mover-jira9534 > mover.log.jira9534.20221209.2 & > [hdfs@{hostname} hadoop-3.3.2-nn]$ tail -f mover.log.jira9534.20221209.2 > ... > 22/12/09 14:22:32 INFO balancer.Dispatcher: Start moving > blk_1073911285_170466 with size=134217728 from 10.108.182.205:800:DISK to > ${ip1}:800:ARCHIVE through ${ip2}:800 > 22/12/09 14:22:32 INFO balancer.Dispatcher: Successfully moved > blk_1073911285_170466 with size=134217728 from 10.108.182.205:800:DISK to > ${ip1}:800:ARCHIVE through ${ip2}:800 > 22/12/09 14:22:42 INFO impl.MetricsSystemImpl: Stopping Mover metrics > system... > 22/12/09 14:22:42 INFO impl.MetricsSystemImpl: Mover metrics system stopped. > 22/12/09 14:22:42 INFO impl.MetricsSystemImpl: Mover metrics system shutdown > complete. > Dec 9, 2022, 2:22:42 PM Mover took 13mins, 19sec > 22/12/09 14:22:42 ERROR mover.Mover: Exiting Mover due to an exception > org.apache.hadoop.metrics2.MetricsException: Metrics source > Mover-${BlockpoolID} already exists! 
> at > org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:152) > at > org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:125) > at > org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:229) > at > org.apache.hadoop.hdfs.server.mover.MoverMetrics.create(MoverMetrics.java:49) > at org.apache.hadoop.hdfs.server.mover.Mover.<init>(Mover.java:162) > at org.apache.hadoop.hdfs.server.mover.Mover.run(Mover.java:684) > at org.apache.hadoop.hdfs.server.mover.Mover$Cli.run(Mover.java:826) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:81) > at org.apache.hadoop.hdfs.server.mover.Mover.main(Mover.java:908) > {code} > 1. "final ExitStatus r = m.run()" returns only after scheduling one of the replicas. > 2. When "r == ExitStatus.IN_PROGRESS", iter.remove() is not executed. > 3. "new Mover" and "this.metrics = MoverMetrics.create(this)" are therefore executed multiple > times for the same nnc, which leads to the error. > {code:java} > //Mover.java > for (final StorageType t : diff.existing) { > for (final MLocation ml : locations) { > final Source source = storages.getSource(ml); > if (ml.storageType == t
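The class-level tracking Jing9 suggests in the review above could be sketched roughly as below. All names here (`MetricsRegistrySketch`, `getOrCreate`, the placeholder `Object` source) are hypothetical; the real `MoverMetrics.create` registers with `DefaultMetricsSystem`, which is what throws `MetricsException` on a duplicate source name:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of class-level (static) tracking of metrics sources:
// a second Mover created over the same blockpool reuses the existing source
// instead of re-registering it and hitting "Metrics source ... already exists!".
public class MetricsRegistrySketch {
    static final ConcurrentMap<String, Object> SOURCES = new ConcurrentHashMap<>();
    static final AtomicInteger CREATIONS = new AtomicInteger();

    // Stand-in for a guarded MoverMetrics.create(...): registration happens
    // at most once per source name, no matter how many Movers are built.
    public static Object getOrCreate(String sourceName) {
        return SOURCES.computeIfAbsent(sourceName, name -> {
            CREATIONS.incrementAndGet();  // runs only for a genuinely new name
            return new Object();          // placeholder for the metrics source
        });
    }

    public static void main(String[] args) {
        Object first = getOrCreate("Mover-BP-1234");
        Object second = getOrCreate("Mover-BP-1234");  // same nnc -> same name
        System.out.println(first == second);  // shared instance, no duplicate registration
        System.out.println(CREATIONS.get());  // registered exactly once
    }
}
```

The PR under review instead moves the metrics object onto the NameNodeConnector; either way, the invariant is one registered source per blockpool, not one per `new Mover`.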
[jira] [Created] (HDFS-16874) Improve DataNode decommission for Erasure Coding
Jing Zhao created HDFS-16874: Summary: Improve DataNode decommission for Erasure Coding Key: HDFS-16874 URL: https://issues.apache.org/jira/browse/HDFS-16874 Project: Hadoop HDFS Issue Type: Improvement Components: ec, erasure-coding Reporter: Jing Zhao Assignee: Jing Zhao There are a couple of issues with the current DataNode decommission implementation when large amounts of Erasure Coding data are involved in the data re-replication/reconstruction process: # Slowness. In HDFS-8786 we made a decision to use re-replication for DataNode decommission if the internal EC block is still available. While this strategy reduces the CPU cost caused by EC reconstruction, it greatly limits the overall data recovery bandwidth, since there is only a single DataNode as the source. As high-density HDD hosts are more and more widely used by HDFS, especially along with Erasure Coding for the warm-data use case, this becomes a big pain for cluster management. In our production, decommissioning a DataNode with several hundred TB of EC data stored might take several days. HDFS-16613 provides optimization based on the existing mechanism, but more fundamentally we may want to allow EC reconstruction for DataNode decommission so as to achieve much larger recovery bandwidth. # The semantics of the existing EC reconstruction command (the BlockECReconstructionInfoProto msg sent from NN to DN) are not clear. The existing reconstruction command depends on the holes in the srcNodes/liveBlockIndices arrays to indicate the target internal blocks for recovery, while the holes can also be caused by the fact that the corresponding datanode is too busy to be used as the reconstruction source. As a result, the later DataNode-side reconstruction may not be consistent with the original intention. E.g., if the index of the missing block is 6, and the datanode storing block 0 is busy, the src nodes in the reconstruction command only cover blocks [1, 2, 3, 4, 5, 7, 8].
The target datanode may reconstruct the internal block 0 instead of 6. HDFS-16566 is working on this issue by indicating an excluding index list. More fundamentally, we can follow the same path but go a step further by adding an optional field explicitly indicating the target block indices in the command protobuf msg. With this extension, the DataNode will no longer use the holes in the src node array to "guess" the reconstruction targets. Internally we have developed and applied fixes following the above directions. We have seen significant improvement (100+ times speed-up) in datanode decommission speed for EC data. The clearer semantics of the reconstruction command protobuf msg also help prevent potential data corruption during the EC reconstruction. We will use this ticket to track similar fixes for the Apache releases. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
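The ambiguity described in point 2 of the ticket can be illustrated with made-up values. This is a hypothetical sketch (the `holes` helper and class name are invented), assuming an RS(6,3) group with nine internal blocks indexed 0..8, where block 6 is truly missing but the node holding block 0 is merely busy:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of why holes in the live-block-index array are ambiguous: the
// receiving DataNode sees two holes but cannot tell which one is the real
// reconstruction target and which node is only busy.
public class EcTargetIndices {
    // Return every index of 0..totalBlocks-1 absent from liveBlockIndices.
    public static List<Integer> holes(int totalBlocks, int[] liveBlockIndices) {
        boolean[] live = new boolean[totalBlocks];
        for (int i : liveBlockIndices) {
            live[i] = true;
        }
        List<Integer> missing = new ArrayList<>();
        for (int i = 0; i < totalBlocks; i++) {
            if (!live[i]) {
                missing.add(i);  // a hole: truly missing, or source just busy
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Node holding block 0 is busy; block 6 is the real target.
        int[] liveBlockIndices = {1, 2, 3, 4, 5, 7, 8};
        System.out.println(holes(9, liveBlockIndices));  // two holes: ambiguous
        // An explicit target-index field in the command leaves nothing to infer:
        List<Integer> explicitTargets = Arrays.asList(6);
        System.out.println(explicitTargets);
    }
}
```

With the optional explicit-targets field the ticket proposes, the DataNode reconstructs exactly the listed indices instead of guessing among the holes.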
[jira] [Updated] (HDFS-16874) Improve DataNode decommission for Erasure Coding
[ https://issues.apache.org/jira/browse/HDFS-16874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-16874: - Description: There are a couple of issues with the current DataNode decommission implementation when large amounts of Erasure Coding data are involved in the data re-replication/reconstruction process: # Slowness. In HDFS-8786 we made a decision to use re-replication for DataNode decommission if the internal EC block is still available. While this strategy reduces the CPU cost caused by EC reconstruction, it greatly limits the overall data recovery bandwidth, since there is only one single DataNode as the source. While high density HDD hosts are more and more widely used by HDFS especially along with Erasure Coding for warm data use case, this becomes a big pain for cluster management. In our production, to decommission a DataNode with several hundred TB EC data stored might take several days. HDFS-16613 provides optimization based on the existing mechanism, but more fundamentally we may want to allow EC reconstruction for DataNode decommission so as to achieve much larger recovery bandwidth. # The semantic of the existing EC reconstruction command (the BlockECReconstructionInfoProto msg sent from NN to DN) is not clear. The existing reconstruction command depends on the holes in the srcNodes/liveBlockIndices arrays to indicate the target internal blocks for recovery, while the holes can also be caused by the fact that the corresponding datanode is too busy so it cannot be used as the reconstruction source. This causes the later DataNode side reconstruction may not be consistent with the original intention. E.g., if the index of the missing block is 6, and the datanode storing block 0 is busy, the src nodes in the reconstruction command only cover blocks [1, 2, 3, 4, 5, 7, 8]. The target datanode may reconstruct the internal block 0 instead of 6. 
HDFS-16566 is working on this issue by indicating an excluding index list. More fundamentally we can follow the same path but go steps further by adding an optional field explicitly indicating the target block indices in the command protobuf msg. With the extension the DataNode will no longer use the holes in the src node array to "guess" the reconstruction targets. Internally we have developed and applied fixes by following the above directions. We have seen significant improvement (100+ times speed up) in terms of datanode decommission speed for EC data. The more clear semantic of the reconstruction command protobuf msg also help prevent potential data corruption during the EC reconstruction. We will use this ticket to track the similar fixes for the Apache releases. was: There are a couple of issues with the current DataNode decommission implementation when large amounts of Erasure Coding data are involved in the data re-replication/reconstruction process: # Slowness. In HDFS-8786 we made a decision to use re-replication for DataNode decommission if the internal EC block is still available. While this strategy reduces the CPU cost caused by EC reconstruction, it greatly limits the overall data recovery bandwidth, since there is only one single DataNode as the source. While high density HDD hosts are more and more widely used by HDFS especially along with Erasure Coding for warm data use case, this becomes a big pain for cluster management. In our production, to decommission a DataNode with several hundred TB EC data stored might take several days. HDFS-16613 provides optimization based on the existing mechanism, but more fundamentally we may want to allow EC reconstruction for DataNode decommission so as to achieve much larger recovery bandwidth. # The semantic of the existing EC reconstruction command (the BlockECReconstructionInfoProto msg sent from NN to DN) is not clear. 
The existing reconstruction command depends on the holes in the srcNodes/liveBlockIndices arrays to indicate the target internal blocks for recovery, while the holes can also be caused by the fact that the corresponding datanode is too busy so it cannot be used as the reconstruction source. This causes the later DataNode side reconstruction may not be consistent with the original intention. E.g., if the index of the missing block is 6, and the datanode storing block 0 is busy, the src nodes in the reconstruction command only cover blocks [1, 2, 3, 4, 5, 7, 8]. The target datanode may reconstruct the internal block 0 instead of 6. HDFS-16566 is working on this issue by indicating an excluding index list. More fundamentally we can follow the same path but go steps further by adding an optional field explicitly indicating the target block indices in the command protobuf msg. With the extension the DataNode will no longer use the holes in the src node array to "guess" the reconstruction targets. Internally we have developed and applied f
[jira] [Updated] (HDFS-16874) Improve DataNode decommission for Erasure Coding
[ https://issues.apache.org/jira/browse/HDFS-16874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-16874: - Description: There are a couple of issues with the current DataNode decommission implementation when large amounts of Erasure Coding data are involved in the data re-replication/reconstruction process: # Slowness. In HDFS-8786 we made a decision to use re-replication for DataNode decommission if the internal EC block is still available. While this strategy reduces the CPU cost caused by EC reconstruction, it greatly limits the overall data recovery bandwidth, since there is only one single DataNode as the source. While high density HDD hosts are more and more widely used by HDFS especially along with Erasure Coding for warm data use case, this becomes a big pain for cluster management. In our production, to decommission a DataNode with several hundred TB EC data stored might take several days. HDFS-16613 provides optimization based on the existing mechanism, but more fundamentally we may want to allow EC reconstruction for DataNode decommission so as to achieve much larger recovery bandwidth. # The semantic of the existing EC reconstruction command (the BlockECReconstructionInfoProto msg sent from NN to DN) is not clear. The existing reconstruction command depends on the holes in the srcNodes/liveBlockIndices arrays to indicate the target internal blocks for recovery, while the holes can also be caused by the fact that the corresponding datanode is too busy so it cannot be used as the reconstruction source. This causes the later DataNode side reconstruction may not be consistent with the original intention. E.g., if the index of the missing block is 6, and the datanode storing block 0 is busy, the src nodes in the reconstruction command only cover blocks [1, 2, 3, 4, 5, 7, 8]. The target datanode may reconstruct the internal block 0 instead of 6. 
HDFS-16566 is working on this issue by indicating an excluding index list. More fundamentally we can follow the same path but go a step further by adding an optional field explicitly indicating the target block indices in the command protobuf msg. With the extension the DataNode will no longer use the holes in the src node array to "guess" the reconstruction targets. Internally we have developed and applied fixes by following the above directions. We have seen significant improvement (100+ times speed up) in terms of datanode decommission speed for EC data. The more clear semantic of the reconstruction command protobuf msg also help prevent potential data corruption during the EC reconstruction. We will use this ticket to track the similar fixes for the Apache releases. was: There are a couple of issues with the current DataNode decommission implementation when large amounts of Erasure Coding data are involved in the data re-replication/reconstruction process: # Slowness. In HDFS-8786 we made a decision to use re-replication for DataNode decommission if the internal EC block is still available. While this strategy reduces the CPU cost caused by EC reconstruction, it greatly limits the overall data recovery bandwidth, since there is only one single DataNode as the source. While high density HDD hosts are more and more widely used by HDFS especially along with Erasure Coding for warm data use case, this becomes a big pain for cluster management. In our production, to decommission a DataNode with several hundred TB EC data stored might take several days. HDFS-16613 provides optimization based on the existing mechanism, but more fundamentally we may want to allow EC reconstruction for DataNode decommission so as to achieve much larger recovery bandwidth. # The semantic of the existing EC reconstruction command (the BlockECReconstructionInfoProto msg sent from NN to DN) is not clear. 
The existing reconstruction command depends on the holes in the srcNodes/liveBlockIndices arrays to indicate the target internal blocks for recovery, while the holes can also be caused by the fact that the corresponding datanode is too busy so it cannot be used as the reconstruction source. This causes the later DataNode side reconstruction may not be consistent with the original intention. E.g., if the index of the missing block is 6, and the datanode storing block 0 is busy, the src nodes in the reconstruction command only cover blocks [1, 2, 3, 4, 5, 7, 8]. The target datanode may reconstruct the internal block 0 instead of 6. HDFS-16566 is working on this issue by indicating an excluding index list. More fundamentally we can follow the same path but go steps further by adding an optional field explicitly indicating the target block indices in the command protobuf msg. With the extension the DataNode will no longer use the holes in the src node array to "guess" the reconstruction targets. Internally we have developed and applied
[jira] [Created] (HDFS-16875) Erasure Coding: data access proxy to allow old clients to read EC data
Jing Zhao created HDFS-16875: Summary: Erasure Coding: data access proxy to allow old clients to read EC data Key: HDFS-16875 URL: https://issues.apache.org/jira/browse/HDFS-16875 Project: Hadoop HDFS Issue Type: New Feature Components: ec, erasure-coding Reporter: Jing Zhao Assignee: Jing Zhao Erasure Coding is only supported by Hadoop 3, while many production deployments still depend on Hadoop 2. Upgrading the whole data tech stack to the Hadoop 3 release may involve big migration efforts and even reliability risks, considering the incompatibilities between these two Hadoop major releases as well as the potential uncovered issues and risks hidden in newer releases. Therefore, we need to find a solution, with the least amount of migration effort and risk, to adopt Erasure Coding for cost efficiency but still allow HDFS clients with old versions (Hadoop 2.x) to access EC data in a transparent manner. Internally we have developed an EC access proxy which translates the EC data for old clients. We also extend the NameNode RPC so it can recognize HDFS clients with/without the EC support, and redirect the old clients to the proxy. With the proxy we set up separate Erasure Coding clusters storing hundreds of PB of data, while leaving other production clusters and all the upper layer applications untouched. Considering some changes are made at fundamental components of HDFS (e.g., client-NN RPC header), we do not aim to merge the change to trunk. We will use this ticket to share the design and implementation details (including the code) and collect feedback. We may use a separate github repo to open source the implementation later. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16872) Fix log throttling by declaring LogThrottlingHelper as static members
[ https://issues.apache.org/jira/browse/HDFS-16872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17649564#comment-17649564 ] ASF GitHub Bot commented on HDFS-16872: --- ChengbingLiu opened a new pull request, #5246: URL: https://github.com/apache/hadoop/pull/5246 ### Description of PR In our production cluster with Observer NameNode enabled, we have plenty of logs printed by `FSEditLogLoader` and `RedundantEditLogInputStream`. The `LogThrottlingHelper` doesn't seem to work. ``` 2022-10-25 09:26:50,380 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Start loading edits file ByteStringEditLog[17686250688, 17686250688], ByteStringEditLog[17686250688, 17686250688], ByteStringEditLog[17686250688, 17686250688] maxTxnsToRead = 9223372036854775807 2022-10-25 09:26:50,380 INFO org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: Fast-forwarding stream 'ByteStringEditLog[17686250688, 17686250688], ByteStringEditLog[17686250688, 17686250688], ByteStringEditLog[17686250688, 17686250688]' to transaction ID 17686250688 2022-10-25 09:26:50,380 INFO org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: Fast-forwarding stream 'ByteStringEditLog[17686250688, 17686250688]' to transaction ID 17686250688 2022-10-25 09:26:50,380 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Loaded 1 edits file(s) (the last named ByteStringEditLog[17686250688, 17686250688], ByteStringEditLog[17686250688, 17686250688], ByteStringEditLog[17686250688, 17686250688]) of total size 527.0, total edits 1.0, total load time 0.0 ms 2022-10-25 09:26:50,387 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Start loading edits file ByteStringEditLog[17686250689, 17686250693], ByteStringEditLog[17686250689, 17686250693], ByteStringEditLog[17686250689, 17686250693] maxTxnsToRead = 9223372036854775807 2022-10-25 09:26:50,387 INFO org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: Fast-forwarding stream
'ByteStringEditLog[17686250689, 17686250693], ByteStringEditLog[17686250689, 17686250693], ByteStringEditLog[17686250689, 17686250693]' to transaction ID 17686250689 2022-10-25 09:26:50,387 INFO org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: Fast-forwarding stream 'ByteStringEditLog[17686250689, 17686250693]' to transaction ID 17686250689 2022-10-25 09:26:50,387 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Loaded 1 edits file(s) (the last named ByteStringEditLog[17686250689, 17686250693], ByteStringEditLog[17686250689, 17686250693], ByteStringEditLog[17686250689, 17686250693]) of total size 890.0, total edits 5.0, total load time 1.0 ms ``` After some digging, I found the cause is that `LogThrottlingHelper`'s are declared as instance variables of all the enclosing classes, including `FSImage`, `FSEditLogLoader` and `RedundantEditLogInputStream`. Therefore the logging frequency will not be limited across different instances. For classes with only limited number of instances, such as `FSImage`, this is fine. For others whose instances are created frequently, such as `FSEditLogLoader` and `RedundantEditLogInputStream`, it will result in plenty of logs. This can be fixed by declaring `LogThrottlingHelper`'s as static members. ### How was this patch tested? Through a test case. > Fix log throttling by declaring LogThrottlingHelper as static members > - > > Key: HDFS-16872 > URL: https://issues.apache.org/jira/browse/HDFS-16872 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.3.4 >Reporter: Chengbing Liu >Priority: Major > > In our production cluster with Observer NameNode enabled, we have plenty of > logs printed by {{FSEditLogLoader}} and {{RedundantEditLogInputStream}}. The > {{LogThrottlingHelper}} doesn't seem to work. 
> {noformat} > 2022-10-25 09:26:50,380 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: > Start loading edits file ByteStringEditLog[17686250688, 17686250688], > ByteStringEditLog[17686250688, 17686250688], ByteStringEditLog[17686250688, > 17686250688] maxTxnsToRead = 9223372036854775807 > 2022-10-25 09:26:50,380 INFO > org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: > Fast-forwarding stream 'ByteStringEditLog[17686250688, 17686250688], > ByteStringEditLog[17686250688, 17686250688], ByteStringEditLog[17686250688, > 17686250688]' to transaction ID 17686250688 > 2022-10-25 09:26:50,380 INFO > org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: > Fast-forwarding stream 'ByteStringEditLog[17686250688, 17686250688]' to > transaction ID 17686250688 > 2022-10-25 09:26:50,380 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: > Loaded 1 edits file(s) (the las
[jira] [Updated] (HDFS-16872) Fix log throttling by declaring LogThrottlingHelper as static members
[ https://issues.apache.org/jira/browse/HDFS-16872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HDFS-16872: -- Labels: pull-request-available (was: ) > Fix log throttling by declaring LogThrottlingHelper as static members > - > > Key: HDFS-16872 > URL: https://issues.apache.org/jira/browse/HDFS-16872 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.3.4 >Reporter: Chengbing Liu >Priority: Major > Labels: pull-request-available > > In our production cluster with Observer NameNode enabled, we have plenty of > logs printed by {{FSEditLogLoader}} and {{RedundantEditLogInputStream}}. The > {{LogThrottlingHelper}} doesn't seem to work. > {noformat} > 2022-10-25 09:26:50,380 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: > Start loading edits file ByteStringEditLog[17686250688, 17686250688], > ByteStringEditLog[17686250688, 17686250688], ByteStringEditLog[17686250688, > 17686250688] maxTxnsToRead = 9223372036854775807 > 2022-10-25 09:26:50,380 INFO > org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: > Fast-forwarding stream 'ByteStringEditLog[17686250688, 17686250688], > ByteStringEditLog[17686250688, 17686250688], ByteStringEditLog[17686250688, > 17686250688]' to transaction ID 17686250688 > 2022-10-25 09:26:50,380 INFO > org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: > Fast-forwarding stream 'ByteStringEditLog[17686250688, 17686250688]' to > transaction ID 17686250688 > 2022-10-25 09:26:50,380 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: > Loaded 1 edits file(s) (the last named ByteStringEditLog[17686250688, > 17686250688], ByteStringEditLog[17686250688, 17686250688], > ByteStringEditLog[17686250688, 17686250688]) of total size 527.0, total edits > 1.0, total load time 0.0 ms > 2022-10-25 09:26:50,387 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: > Start loading edits file ByteStringEditLog[17686250689, 17686250693], > ByteStringEditLog[17686250689, 
17686250693], ByteStringEditLog[17686250689, > 17686250693] maxTxnsToRead = 9223372036854775807 > 2022-10-25 09:26:50,387 INFO > org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: > Fast-forwarding stream 'ByteStringEditLog[17686250689, 17686250693], > ByteStringEditLog[17686250689, 17686250693], ByteStringEditLog[17686250689, > 17686250693]' to transaction ID 17686250689 > 2022-10-25 09:26:50,387 INFO > org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: > Fast-forwarding stream 'ByteStringEditLog[17686250689, 17686250693]' to > transaction ID 17686250689 > 2022-10-25 09:26:50,387 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: > Loaded 1 edits file(s) (the last named ByteStringEditLog[17686250689, > 17686250693], ByteStringEditLog[17686250689, 17686250693], > ByteStringEditLog[17686250689, 17686250693]) of total size 890.0, total edits > 5.0, total load time 1.0 ms > {noformat} > After some digging, I found the cause is that {{LogThrottlingHelper}}'s are > declared as instance variables of all the enclosing classes, including > {{FSImage}}, {{FSEditLogLoader}} and {{RedundantEditLogInputStream}}. > Therefore the logging frequency will not be limited across different > instances. For classes with only limited number of instances, such as > {{FSImage}}, this is fine. For others whose instances are created frequently, > such as {{FSEditLogLoader}} and {{RedundantEditLogInputStream}}, it will > result in plenty of logs. > This can be fixed by declaring {{LogThrottlingHelper}}'s as static members. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
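Why the static declaration matters can be sketched with a toy throttler. This is a hypothetical count-based stand-in (the real `LogThrottlingHelper` is time-based, and all names below are invented): a per-instance helper starts fresh for every short-lived `FSEditLogLoader`/`RedundantEditLogInputStream`, so nothing is ever suppressed, while a static helper throttles across instances:

```java
import java.util.concurrent.atomic.AtomicLong;

// Toy count-based throttler contrasting per-instance vs. static scope.
public class ThrottleScopeDemo {
    static final class CountThrottler {
        private final AtomicLong seen = new AtomicLong();
        // Allow only the first record out of every `period` calls.
        boolean shouldLog(int period) {
            return seen.getAndIncrement() % period == 0;
        }
    }

    // Shared across all "loader" instances, as the fix proposes.
    private static final CountThrottler STATIC_THROTTLER = new CountThrottler();
    // Re-created with each instance, as in the buggy code.
    private final CountThrottler instanceThrottler = new CountThrottler();

    boolean logViaInstance() { return instanceThrottler.shouldLog(100); }
    static boolean logViaStatic() { return STATIC_THROTTLER.shouldLog(100); }

    public static void main(String[] args) {
        int instanceLogs = 0, staticLogs = 0;
        for (int i = 0; i < 50; i++) {  // 50 short-lived loader instances
            ThrottleScopeDemo loader = new ThrottleScopeDemo();
            if (loader.logViaInstance()) instanceLogs++;  // first call always logs
            if (logViaStatic()) staticLogs++;
        }
        System.out.println(instanceLogs);  // one log per instance: no throttling
        System.out.println(staticLogs);    // throttled across instances
    }
}
```

This mirrors the PR's observation: classes like `FSImage` with few instances are fine either way, but frequently created classes need the helper at class level for the suppression to take effect.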
[jira] [Commented] (HDFS-16872) Fix log throttling by declaring LogThrottlingHelper as static members
[ https://issues.apache.org/jira/browse/HDFS-16872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17649580#comment-17649580 ] ASF GitHub Bot commented on HDFS-16872: --- ChengbingLiu commented on PR #5246: URL: https://github.com/apache/hadoop/pull/5246#issuecomment-1358874741 Here is why I added the reset-`lastLogTimestampMs` logic and slightly tweaked the test case: Each test method in `TestFSEditLogLoader` is executed twice (due to the `@Parameterized` annotation of the class). Since I change the `LogThrottlingHelper` to a static field, other test method execution can have two effects: 1. set the `lastLogTimestampMs` field to a current time; 2. create `LoggingAction`s with suppressed logs. This will break the previous way of testing log throttling, which uses a faked timer. Therefore this patch: 1. resets `lastLogTimestampMs` if the `currentTimeMs` is smaller (can only happen in test cases) 2. clears previous logs by a `loadFSEdits` call before the original test cases These may not be the best solution. Please take a look @xkrogen . Thanks. > Fix log throttling by declaring LogThrottlingHelper as static members > - > > Key: HDFS-16872 > URL: https://issues.apache.org/jira/browse/HDFS-16872 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.3.4 >Reporter: Chengbing Liu >Priority: Major > Labels: pull-request-available > > In our production cluster with Observer NameNode enabled, we have plenty of > logs printed by {{FSEditLogLoader}} and {{RedundantEditLogInputStream}}. The > {{LogThrottlingHelper}} doesn't seem to work. 
> {noformat} > 2022-10-25 09:26:50,380 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: > Start loading edits file ByteStringEditLog[17686250688, 17686250688], > ByteStringEditLog[17686250688, 17686250688], ByteStringEditLog[17686250688, > 17686250688] maxTxnsToRead = 9223372036854775807 > 2022-10-25 09:26:50,380 INFO > org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: > Fast-forwarding stream 'ByteStringEditLog[17686250688, 17686250688], > ByteStringEditLog[17686250688, 17686250688], ByteStringEditLog[17686250688, > 17686250688]' to transaction ID 17686250688 > 2022-10-25 09:26:50,380 INFO > org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: > Fast-forwarding stream 'ByteStringEditLog[17686250688, 17686250688]' to > transaction ID 17686250688 > 2022-10-25 09:26:50,380 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: > Loaded 1 edits file(s) (the last named ByteStringEditLog[17686250688, > 17686250688], ByteStringEditLog[17686250688, 17686250688], > ByteStringEditLog[17686250688, 17686250688]) of total size 527.0, total edits > 1.0, total load time 0.0 ms > 2022-10-25 09:26:50,387 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: > Start loading edits file ByteStringEditLog[17686250689, 17686250693], > ByteStringEditLog[17686250689, 17686250693], ByteStringEditLog[17686250689, > 17686250693] maxTxnsToRead = 9223372036854775807 > 2022-10-25 09:26:50,387 INFO > org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: > Fast-forwarding stream 'ByteStringEditLog[17686250689, 17686250693], > ByteStringEditLog[17686250689, 17686250693], ByteStringEditLog[17686250689, > 17686250693]' to transaction ID 17686250689 > 2022-10-25 09:26:50,387 INFO > org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: > Fast-forwarding stream 'ByteStringEditLog[17686250689, 17686250693]' to > transaction ID 17686250689 > 2022-10-25 09:26:50,387 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: > Loaded 1 edits file(s) 
(the last named ByteStringEditLog[17686250689, > 17686250693], ByteStringEditLog[17686250689, 17686250693], > ByteStringEditLog[17686250689, 17686250693]) of total size 890.0, total edits > 5.0, total load time 1.0 ms > {noformat} > After some digging, I found the cause is that {{LogThrottlingHelper}}'s are > declared as instance variables of all the enclosing classes, including > {{FSImage}}, {{FSEditLogLoader}} and {{RedundantEditLogInputStream}}. > Therefore the logging frequency will not be limited across different > instances. For classes with only limited number of instances, such as > {{FSImage}}, this is fine. For others whose instances are created frequently, > such as {{FSEditLogLoader}} and {{RedundantEditLogInputStream}}, it will > result in plenty of logs. > This can be fixed by declaring {{LogThrottlingHelper}}'s as static members.