[jira] [Updated] (HDFS-16871) DiskBalancer process may throw IllegalArgumentException when the target DataNode has a capital letter in its hostname
[ https://issues.apache.org/jira/browse/HDFS-16871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Ma updated HDFS-16871:
-----------------------------
    Attachment: screenshot-1.png

> DiskBalancer process may throw IllegalArgumentException when the target
> DataNode has a capital letter in its hostname
>
> Key: HDFS-16871
> URL: https://issues.apache.org/jira/browse/HDFS-16871
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Daniel Ma
> Assignee: Daniel Ma
> Priority: Major
> Attachments: screenshot-1.png

--
This message was sent by Atlassian Jira (v8.20.10#820010)

To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-16871) DiskBalancer process may throw IllegalArgumentException when the target DataNode has a capital letter in its hostname
Daniel Ma created HDFS-16871:
-----------------------------

Summary: DiskBalancer process may throw IllegalArgumentException when the target DataNode has a capital letter in its hostname
Key: HDFS-16871
URL: https://issues.apache.org/jira/browse/HDFS-16871
Project: Hadoop HDFS
Issue Type: Bug
Reporter: Daniel Ma
Attachments: screenshot-1.png
[jira] [Assigned] (HDFS-16871) DiskBalancer process may throw IllegalArgumentException when the target DataNode has a capital letter in its hostname
[ https://issues.apache.org/jira/browse/HDFS-16871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Ma reassigned HDFS-16871:
--------------------------------
    Assignee: Daniel Ma

> DiskBalancer process may throw IllegalArgumentException when the target
> DataNode has a capital letter in its hostname
>
> Key: HDFS-16871
> URL: https://issues.apache.org/jira/browse/HDFS-16871
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Daniel Ma
> Assignee: Daniel Ma
> Priority: Major
> Attachments: screenshot-1.png
[jira] [Updated] (HDFS-16871) DiskBalancer process may throw IllegalArgumentException when the target DataNode has a capital letter in its hostname
[ https://issues.apache.org/jira/browse/HDFS-16871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Ma updated HDFS-16871:
-----------------------------
    Description:
The DiskBalancer process reads DataNode hostnames as lowercase letters,
!screenshot-1.png!
but there is no case transformation in getNodeByName.
For a DataNode with a lowercase hostname, everything is OK. But for a DataNode with an uppercase hostname, an IllegalArgumentException is thrown as below:
{code:java}
2022-10-09 16:15:26,631 ERROR tools.DiskBalancerCLI: java.lang.IllegalArgumentException: Unable to find the specified node. node-group-1YlRf0002
{code}

> DiskBalancer process may throw IllegalArgumentException when the target
> DataNode has a capital letter in its hostname
>
> Key: HDFS-16871
> URL: https://issues.apache.org/jira/browse/HDFS-16871
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Daniel Ma
> Assignee: Daniel Ma
> Priority: Major
> Attachments: screenshot-1.png
>
> The DiskBalancer process reads DataNode hostnames as lowercase letters,
> !screenshot-1.png!
> but there is no case transformation in getNodeByName.
> For a DataNode with a lowercase hostname, everything is OK. But for a
> DataNode with an uppercase hostname, an IllegalArgumentException is thrown
> as below:
> {code:java}
> 2022-10-09 16:15:26,631 ERROR tools.DiskBalancerCLI:
> java.lang.IllegalArgumentException: Unable to find the specified node.
> node-group-1YlRf0002
> {code}
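The report above boils down to a missing case normalization in the node lookup: hostnames are stored lowercased but the lookup argument is not lowercased. A minimal sketch of the fix idea, with hypothetical class and method names (not the real DiskBalancer API), could look like:

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Illustrative sketch only -- names are hypothetical, not the actual
// DiskBalancer code. The idea: key the node map by lowercased hostname and
// lowercase the lookup argument too, so a mixed-case hostname such as
// "node-group-1YlRf0002" still resolves.
public class NodeLookupSketch {
    private final Map<String, String> nodesByHost = new HashMap<>();

    void addNode(String hostname, String uuid) {
        // DNS hostnames are case-insensitive, so normalize on insert.
        nodesByHost.put(hostname.toLowerCase(Locale.ROOT), uuid);
    }

    String getNodeByName(String hostname) {
        // Normalize on lookup as well; this is the step the bug report says
        // is missing ("no letter case transform when getNodeByName").
        String uuid = nodesByHost.get(hostname.toLowerCase(Locale.ROOT));
        if (uuid == null) {
            throw new IllegalArgumentException(
                "Unable to find the specified node. " + hostname);
        }
        return uuid;
    }

    /** Registers a lowercase node, then looks it up with mixed case. */
    static String demo() {
        NodeLookupSketch cluster = new NodeLookupSketch();
        cluster.addNode("node-group-1ylrf0002", "dn-uuid-1");
        return cluster.getNodeByName("node-group-1YlRf0002");
    }

    public static void main(String[] args) {
        System.out.println(demo()); // resolves instead of throwing
    }
}
```

Using Locale.ROOT avoids locale-dependent surprises (e.g. the Turkish dotless i) when lowercasing.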
[jira] [Updated] (HDFS-16871) DiskBalancer process may throw IllegalArgumentException when the target DataNode has a capital letter in its hostname
[ https://issues.apache.org/jira/browse/HDFS-16871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Ma updated HDFS-16871:
-----------------------------
    Description:
The DiskBalancer process reads DataNode hostnames as lowercase letters,
!screenshot-1.png!
but there is no case transformation in getNodeByName.
!screenshot-2.png!
For a DataNode with a lowercase hostname, everything is OK. But for a DataNode with an uppercase hostname, an IllegalArgumentException is thrown as below:
{code:java}
2022-10-09 16:15:26,631 ERROR tools.DiskBalancerCLI: java.lang.IllegalArgumentException: Unable to find the specified node. node-group-1YlRf0002
{code}

  was:
The DiskBalancer process reads DataNode hostnames as lowercase letters,
!screenshot-1.png!
but there is no case transformation in getNodeByName.
For a DataNode with a lowercase hostname, everything is OK. But for a DataNode with an uppercase hostname, an IllegalArgumentException is thrown as below:
{code:java}
2022-10-09 16:15:26,631 ERROR tools.DiskBalancerCLI: java.lang.IllegalArgumentException: Unable to find the specified node. node-group-1YlRf0002
{code}

> DiskBalancer process may throw IllegalArgumentException when the target
> DataNode has a capital letter in its hostname
>
> Key: HDFS-16871
> URL: https://issues.apache.org/jira/browse/HDFS-16871
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Daniel Ma
> Assignee: Daniel Ma
> Priority: Major
> Attachments: screenshot-1.png, screenshot-2.png
>
> The DiskBalancer process reads DataNode hostnames as lowercase letters,
> !screenshot-1.png!
> but there is no case transformation in getNodeByName.
> !screenshot-2.png!
> For a DataNode with a lowercase hostname, everything is OK. But for a
> DataNode with an uppercase hostname, an IllegalArgumentException is thrown
> as below:
> {code:java}
> 2022-10-09 16:15:26,631 ERROR tools.DiskBalancerCLI:
> java.lang.IllegalArgumentException: Unable to find the specified node.
> node-group-1YlRf0002
> {code}
[jira] [Updated] (HDFS-16871) DiskBalancer process may throw IllegalArgumentException when the target DataNode has a capital letter in its hostname
[ https://issues.apache.org/jira/browse/HDFS-16871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Ma updated HDFS-16871:
-----------------------------
    Attachment: screenshot-2.png

> DiskBalancer process may throw IllegalArgumentException when the target
> DataNode has a capital letter in its hostname
>
> Key: HDFS-16871
> URL: https://issues.apache.org/jira/browse/HDFS-16871
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Daniel Ma
> Assignee: Daniel Ma
> Priority: Major
> Attachments: screenshot-1.png, screenshot-2.png
>
> The DiskBalancer process reads DataNode hostnames as lowercase letters,
> !screenshot-1.png!
> but there is no case transformation in getNodeByName.
> For a DataNode with a lowercase hostname, everything is OK. But for a
> DataNode with an uppercase hostname, an IllegalArgumentException is thrown
> as below:
> {code:java}
> 2022-10-09 16:15:26,631 ERROR tools.DiskBalancerCLI:
> java.lang.IllegalArgumentException: Unable to find the specified node.
> node-group-1YlRf0002
> {code}
[jira] [Updated] (HDFS-16871) DiskBalancer process may throw IllegalArgumentException when the target DataNode has a capital letter in its hostname
[ https://issues.apache.org/jira/browse/HDFS-16871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Ma updated HDFS-16871:
-----------------------------
    Description:
The DiskBalancer process reads DataNode hostnames as lowercase letters,
!screenshot-1.png!
but there is no case transformation in getNodeByName.
!screenshot-2.png!
For a DataNode with a lowercase hostname, everything is OK. But for a DataNode with an uppercase hostname, when the Balancer process tries to migrate onto it, an IllegalArgumentException is thrown as below:
{code:java}
2022-10-09 16:15:26,631 ERROR tools.DiskBalancerCLI: java.lang.IllegalArgumentException: Unable to find the specified node. node-group-1YlRf0002
{code}

  was:
The DiskBalancer process reads DataNode hostnames as lowercase letters,
!screenshot-1.png!
but there is no case transformation in getNodeByName.
!screenshot-2.png!
For a DataNode with a lowercase hostname, everything is OK. But for a DataNode with an uppercase hostname, an IllegalArgumentException is thrown as below:
{code:java}
2022-10-09 16:15:26,631 ERROR tools.DiskBalancerCLI: java.lang.IllegalArgumentException: Unable to find the specified node. node-group-1YlRf0002
{code}

> DiskBalancer process may throw IllegalArgumentException when the target
> DataNode has a capital letter in its hostname
>
> Key: HDFS-16871
> URL: https://issues.apache.org/jira/browse/HDFS-16871
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Daniel Ma
> Assignee: Daniel Ma
> Priority: Major
> Attachments: screenshot-1.png, screenshot-2.png
>
> The DiskBalancer process reads DataNode hostnames as lowercase letters,
> !screenshot-1.png!
> but there is no case transformation in getNodeByName.
> !screenshot-2.png!
> For a DataNode with a lowercase hostname, everything is OK. But for a
> DataNode with an uppercase hostname, when the Balancer process tries to
> migrate onto it, an IllegalArgumentException is thrown as below:
> {code:java}
> 2022-10-09 16:15:26,631 ERROR tools.DiskBalancerCLI:
> java.lang.IllegalArgumentException: Unable to find the specified node.
> node-group-1YlRf0002
> {code}
[jira] [Commented] (HDFS-16871) DiskBalancer process may throw IllegalArgumentException when the target DataNode has a capital letter in its hostname
[ https://issues.apache.org/jira/browse/HDFS-16871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17649213#comment-17649213 ]

ASF GitHub Bot commented on HDFS-16871:
---------------------------------------

Daniel-009497 opened a new pull request, #5240:
URL: https://github.com/apache/hadoop/pull/5240

For a DataNode with a lowercase hostname everything is OK, but for a DataNode with an uppercase hostname, when the Balancer process tries to migrate onto it, an IllegalArgumentException is thrown. For more details, please refer to HDFS-16871.

> DiskBalancer process may throw IllegalArgumentException when the target
> DataNode has a capital letter in its hostname
>
> Key: HDFS-16871
> URL: https://issues.apache.org/jira/browse/HDFS-16871
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Daniel Ma
> Assignee: Daniel Ma
> Priority: Major
> Attachments: screenshot-1.png, screenshot-2.png
>
> The DiskBalancer process reads DataNode hostnames as lowercase letters,
> !screenshot-1.png!
> but there is no case transformation in getNodeByName.
> !screenshot-2.png!
> For a DataNode with a lowercase hostname, everything is OK. But for a
> DataNode with an uppercase hostname, when the Balancer process tries to
> migrate onto it, an IllegalArgumentException is thrown as below:
> {code:java}
> 2022-10-09 16:15:26,631 ERROR tools.DiskBalancerCLI:
> java.lang.IllegalArgumentException: Unable to find the specified node.
> node-group-1YlRf0002
> {code}
[jira] [Updated] (HDFS-16871) DiskBalancer process may throw IllegalArgumentException when the target DataNode has a capital letter in its hostname
[ https://issues.apache.org/jira/browse/HDFS-16871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HDFS-16871:
----------------------------------
    Labels: pull-request-available  (was: )

> DiskBalancer process may throw IllegalArgumentException when the target
> DataNode has a capital letter in its hostname
>
> Key: HDFS-16871
> URL: https://issues.apache.org/jira/browse/HDFS-16871
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Daniel Ma
> Assignee: Daniel Ma
> Priority: Major
> Labels: pull-request-available
> Attachments: screenshot-1.png, screenshot-2.png
>
> The DiskBalancer process reads DataNode hostnames as lowercase letters,
> !screenshot-1.png!
> but there is no case transformation in getNodeByName.
> !screenshot-2.png!
> For a DataNode with a lowercase hostname, everything is OK. But for a
> DataNode with an uppercase hostname, when the Balancer process tries to
> migrate onto it, an IllegalArgumentException is thrown as below:
> {code:java}
> 2022-10-09 16:15:26,631 ERROR tools.DiskBalancerCLI:
> java.lang.IllegalArgumentException: Unable to find the specified node.
> node-group-1YlRf0002
> {code}
[jira] [Created] (HDFS-16872) Fix log throttling by declaring LogThrottlingHelper as static members
Chengbing Liu created HDFS-16872: Summary: Fix log throttling by declaring LogThrottlingHelper as static members Key: HDFS-16872 URL: https://issues.apache.org/jira/browse/HDFS-16872 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 3.3.4 Reporter: Chengbing Liu In our production cluster with Observer NameNode enabled, we have plenty of logs printed by {{FSEditLogLoader}} and {{RedundantEditLogInputStream}}. The {{LogThrottlingHelper}} doesn't seem to work. {noformat} 2022-10-25 09:26:50,380 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Start loading edits file ByteStringEditLog[17686250688, 17686250688], ByteStringEditLog[17686250688, 17686250688], ByteStringEditLog[17686250688, 17686250688] maxTxnsToRead = 9223372036854775807 2022-10-25 09:26:50,380 INFO org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: Fast-forwarding stream 'ByteStringEditLog[17686250688, 17686250688], ByteStringEditLog[17686250688, 17686250688], ByteStringEditLog[17686250688, 17686250688]' to transaction ID 17686250688 2022-10-25 09:26:50,380 INFO org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: Fast-forwarding stream 'ByteStringEditLog[17686250688, 17686250688]' to transaction ID 17686250688 2022-10-25 09:26:50,380 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Loaded 1 edits file(s) (the last named ByteStringEditLog[17686250688, 17686250688], ByteStringEditLog[17686250688, 17686250688], ByteStringEditLog[17686250688, 17686250688]) of total size 527.0, total edits 1.0, total load time 0.0 ms 2022-10-25 09:26:50,387 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Start loading edits file ByteStringEditLog[17686250689, 17686250693], ByteStringEditLog[17686250689, 17686250693], ByteStringEditLog[17686250689, 17686250693] maxTxnsToRead = 9223372036854775807 2022-10-25 09:26:50,387 INFO org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: Fast-forwarding stream 'ByteStringEditLog[17686250689, 17686250693], 
ByteStringEditLog[17686250689, 17686250693], ByteStringEditLog[17686250689, 17686250693]' to transaction ID 17686250689
2022-10-25 09:26:50,387 INFO org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: Fast-forwarding stream 'ByteStringEditLog[17686250689, 17686250693]' to transaction ID 17686250689
2022-10-25 09:26:50,387 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Loaded 1 edits file(s) (the last named ByteStringEditLog[17686250689, 17686250693], ByteStringEditLog[17686250689, 17686250693], ByteStringEditLog[17686250689, 17686250693]) of total size 890.0, total edits 5.0, total load time 1.0 ms
{noformat}

After some digging, I found the cause is that the {{LogThrottlingHelper}} objects are declared as instance variables of all the enclosing classes, including {{FSImage}}, {{FSEditLogLoader}} and {{RedundantEditLogInputStream}}. Therefore the logging frequency is not limited across different instances. For classes with only a limited number of instances, such as {{FSImage}}, this is fine. For classes whose instances are created frequently, such as {{FSEditLogLoader}} and {{RedundantEditLogInputStream}}, it results in plenty of logs.

This can be fixed by declaring the {{LogThrottlingHelper}} objects as static members.
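The report's reasoning can be reproduced with a toy throttler. The sketch below is not the real org.apache.hadoop.log.LogThrottlingHelper API; it only illustrates why per-instance throttling state fails for frequently created classes and why a class-level (static) member fixes it:

```java
import java.util.Arrays;

// Self-contained illustration, not the real LogThrottlingHelper: a throttler
// that allows one log line per period. When every short-lived loader instance
// owns its own throttler, the suppression state is lost with each new
// instance; a static throttler shares the state across all instances.
public class ThrottleSketch {
    static class Throttler {
        private final long periodMs;
        private long lastLogMs;

        Throttler(long periodMs) {
            this.periodMs = periodMs;
            this.lastLogMs = -periodMs; // so the very first call logs
        }

        boolean shouldLog(long nowMs) {
            if (nowMs - lastLogMs >= periodMs) {
                lastLogMs = nowMs;
                return true;
            }
            return false;
        }
    }

    // Shared across all loader instances -- the proposed fix.
    private static final Throttler SHARED = new Throttler(5000);
    // One throttler per instance -- the buggy pattern described above.
    private final Throttler perInstance = new Throttler(5000);

    /**
     * Simulates three loaders created 1 ms apart starting at baseMs and
     * returns {logsWithStaticThrottler, logsWithPerInstanceThrottler}.
     */
    static int[] simulate(long baseMs) {
        int staticLogs = 0;
        int instanceLogs = 0;
        for (int i = 0; i < 3; i++) {
            ThrottleSketch loader = new ThrottleSketch();
            if (SHARED.shouldLog(baseMs + i)) {
                staticLogs++;
            }
            if (loader.perInstance.shouldLog(baseMs + i)) {
                instanceLogs++;
            }
        }
        return new int[] {staticLogs, instanceLogs};
    }

    public static void main(String[] args) {
        // Static state suppresses repeats; per-instance state logs every time.
        System.out.println(Arrays.toString(simulate(0))); // [1, 3]
    }
}
```

The per-instance counter allows all three "Fast-forwarding stream" lines through (one per fresh loader), which matches the flood shown in the {noformat} excerpt.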
[jira] [Commented] (HDFS-16870) Client ip should also be recorded when NameNode is processing reportBadBlocks
[ https://issues.apache.org/jira/browse/HDFS-16870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17649323#comment-17649323 ] ASF GitHub Bot commented on HDFS-16870: --- hadoop-yetus commented on PR #5237: URL: https://github.com/apache/hadoop/pull/5237#issuecomment-1357678231 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 2m 0s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | -1 :x: | test4tests | 0m 0s | | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 43m 43s | | trunk passed | | +1 :green_heart: | compile | 1m 50s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | compile | 1m 28s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | checkstyle | 1m 8s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 42s | | trunk passed | | +1 :green_heart: | javadoc | 1m 10s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | javadoc | 1m 36s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | spotbugs | 3m 45s | | trunk passed | | +1 :green_heart: | shadedclient | 27m 9s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 1m 28s | | the patch passed | | +1 :green_heart: | compile | 1m 32s | | the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | javac | 1m 32s | | the patch passed | | +1 :green_heart: | compile | 1m 29s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | javac | 1m 29s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 0m 55s | | the patch passed | | +1 :green_heart: | mvnsite | 1m 32s | | the patch passed | | +1 :green_heart: | javadoc | 1m 0s | | the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | javadoc | 1m 27s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | spotbugs | 3m 43s | | the patch passed | | +1 :green_heart: | shadedclient | 26m 29s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | -1 :x: | unit | 503m 22s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5237/2/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. | | +1 :green_heart: | asflicense | 1m 15s | | The patch does not generate ASF License warnings. 
| | | | 626m 49s | | | | Reason | Tests | |---:|:--| | Failed junit tests | hadoop.hdfs.TestLeaseRecovery2 | | | hadoop.hdfs.server.namenode.ha.TestSeveralNameNodes | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5237/2/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/5237 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets | | uname | Linux 30f4e88eb6a3 4.15.0-200-generic #211-Ubuntu SMP Thu Nov 24 18:16:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / 05137dd0ffcd6ca5b4442228db70a75be696df01 | | Default Java | Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5237/2/testReport/ | | Max. process+thread count | 2087 (vs. u
[jira] [Commented] (HDFS-16871) DiskBalancer process may throw IllegalArgumentException when the target DataNode has a capital letter in its hostname
[ https://issues.apache.org/jira/browse/HDFS-16871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17649413#comment-17649413 ] ASF GitHub Bot commented on HDFS-16871: --- hadoop-yetus commented on PR #5240: URL: https://github.com/apache/hadoop/pull/5240#issuecomment-1357978080 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 1m 3s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 1s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | -1 :x: | test4tests | 0m 0s | | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 42m 36s | | trunk passed | | +1 :green_heart: | compile | 1m 33s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | compile | 1m 25s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | checkstyle | 1m 10s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 35s | | trunk passed | | +1 :green_heart: | javadoc | 1m 12s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | javadoc | 1m 30s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | spotbugs | 3m 48s | | trunk passed | | +1 :green_heart: | shadedclient | 26m 30s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 1m 26s | | the patch passed | | +1 :green_heart: | compile | 1m 29s | | the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | javac | 1m 29s | | the patch passed | | +1 :green_heart: | compile | 1m 22s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | javac | 1m 22s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 0m 56s | | the patch passed | | +1 :green_heart: | mvnsite | 1m 26s | | the patch passed | | +1 :green_heart: | javadoc | 0m 55s | | the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | javadoc | 1m 20s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | spotbugs | 3m 34s | | the patch passed | | +1 :green_heart: | shadedclient | 26m 39s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | -1 :x: | unit | 381m 0s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5240/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. | | +1 :green_heart: | asflicense | 0m 42s | | The patch does not generate ASF License warnings. 
| | | | 500m 7s | | | | Reason | Tests | |---:|:--| | Failed junit tests | hadoop.hdfs.TestLeaseRecovery2 | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5240/1/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/5240 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets | | uname | Linux 013a56b05a40 4.15.0-200-generic #211-Ubuntu SMP Thu Nov 24 18:16:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / 56adaeb28a02bde2599da949dc69ef1f339a44c9 | | Default Java | Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5240/1/testReport/ | | Max. process+thread count | 2108 (vs. ulimit of 5500) | | modules | C: hadoop-hdfs-project/hadoop-h
[jira] [Commented] (HDFS-16867) Exiting Mover due to an exception in MoverMetrics.create
[ https://issues.apache.org/jira/browse/HDFS-16867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17649443#comment-17649443 ] ASF GitHub Bot commented on HDFS-16867: --- hadoop-yetus commented on PR #5203: URL: https://github.com/apache/hadoop/pull/5203#issuecomment-1358132911 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 57s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | -1 :x: | test4tests | 0m 0s | | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 43m 24s | | trunk passed | | +1 :green_heart: | compile | 1m 38s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | compile | 1m 24s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | checkstyle | 1m 5s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 33s | | trunk passed | | +1 :green_heart: | javadoc | 1m 9s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | javadoc | 1m 31s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | spotbugs | 3m 42s | | trunk passed | | +1 :green_heart: | shadedclient | 26m 12s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 1m 27s | | the patch passed | | +1 :green_heart: | compile | 1m 33s | | the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | javac | 1m 33s | | the patch passed | | +1 :green_heart: | compile | 1m 23s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | javac | 1m 23s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 1m 1s | | the patch passed | | +1 :green_heart: | mvnsite | 1m 31s | | the patch passed | | +1 :green_heart: | javadoc | 0m 55s | | the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | javadoc | 1m 30s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | spotbugs | 3m 46s | | the patch passed | | +1 :green_heart: | shadedclient | 26m 44s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | -1 :x: | unit | 359m 24s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5203/2/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. | | +1 :green_heart: | asflicense | 0m 55s | | The patch does not generate ASF License warnings. 
| | | | 479m 20s | | | | Reason | Tests | |---:|:--| | Failed junit tests | hadoop.hdfs.TestDFSInotifyEventInputStream | | | hadoop.hdfs.TestLeaseRecovery2 | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5203/2/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/5203 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets | | uname | Linux bd6ad7aa00f3 4.15.0-200-generic #211-Ubuntu SMP Thu Nov 24 18:16:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / 76f08187024b39e0c4035be3b30c35a24b2fa9be | | Default Java | Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5203/2/testReport/ | | Max. process+thread count | 1883 (vs. ulimit of
[jira] [Created] (HDFS-16873) FileStatus compareTo does not specify ordering
DDillon created HDFS-16873:
---------------------------

Summary: FileStatus compareTo does not specify ordering
Key: HDFS-16873
URL: https://issues.apache.org/jira/browse/HDFS-16873
Project: Hadoop HDFS
Issue Type: Improvement
Reporter: DDillon

The Javadoc of FileStatus does not specify the field and manner in which objects are ordered. This is critical to understand in order to use the Comparable interface without making assumptions. A quick inspection of the code shows that the ordering is by path name, but users shouldn't have to read the code to confirm seemingly obvious assumptions.
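For illustration, a hypothetical stand-in class (not the actual FileStatus) showing the kind of Javadoc the issue asks for, with the ordering stated explicitly on compareTo:

```java
import java.util.Arrays;

// Hypothetical stand-in for FileStatus: the point is that compareTo's Javadoc
// names the field (path) and the manner (lexicographic) of the ordering, so
// callers need not read the implementation to learn it.
public class FileStatusSketch implements Comparable<FileStatusSketch> {
    private final String path;

    FileStatusSketch(String path) {
        this.path = path;
    }

    String getPath() {
        return path;
    }

    /**
     * Compares this status to another by {@code path}, lexicographically.
     * Note: this ordering is consistent with equals only if paths are unique.
     */
    @Override
    public int compareTo(FileStatusSketch other) {
        return this.path.compareTo(other.path);
    }

    /** Sorts two statuses and returns the path that orders first. */
    static String firstAfterSort() {
        FileStatusSketch[] statuses = {
            new FileStatusSketch("/b"), new FileStatusSketch("/a")
        };
        Arrays.sort(statuses); // uses compareTo, i.e. path order
        return statuses[0].getPath();
    }

    public static void main(String[] args) {
        System.out.println(firstAfterSort()); // "/a"
    }
}
```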
[jira] [Commented] (HDFS-16873) FileStatus compareTo does not specify ordering
[ https://issues.apache.org/jira/browse/HDFS-16873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17649457#comment-17649457 ]

DDillon commented on HDFS-16873:
--------------------------------

https://github.com/apache/hadoop/pull/5219

> FileStatus compareTo does not specify ordering
>
> Key: HDFS-16873
> URL: https://issues.apache.org/jira/browse/HDFS-16873
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: DDillon
> Priority: Trivial
>
> The Javadoc of FileStatus does not specify the field and manner in which
> objects are ordered. This is critical to understand in order to use the
> Comparable interface without making assumptions. A quick inspection of the
> code shows that the ordering is by path name, but users shouldn't have to
> read the code to confirm seemingly obvious assumptions.
[jira] [Commented] (HDFS-16867) Exiting Mover due to an exception in MoverMetrics.create
[ https://issues.apache.org/jira/browse/HDFS-16867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17649458#comment-17649458 ] ASF GitHub Bot commented on HDFS-16867: --- Jing9 commented on code in PR #5203: URL: https://github.com/apache/hadoop/pull/5203#discussion_r1052583684 ## hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/balancer/NameNodeConnector.java: ## @@ -161,6 +162,7 @@ public static void checkOtherInstanceRunning(boolean toCheck) { private final Path idPath; private OutputStream out; private final List targetPaths; + private final MoverMetrics moverMetrics; Review Comment: NameNodeConnector will also be used by Balancer, while MoverMetrics is only used by Mover. So not sure if placing MoverMetrics directly in NameNodeConnector is a good way from the semantic perspective. ## hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/mover/Mover.java: ## @@ -160,7 +160,7 @@ Collections. emptySet(), movedWinWidth, moverThreads, 0, BlockStoragePolicySuite.ID_BIT_LENGTH]; this.excludedPinnedBlocks = excludedPinnedBlocks; this.nnc = nnc; -this.metrics = MoverMetrics.create(this); Review Comment: If the main issue is the potential naming conflict caused by multiple mover instances, can we track the existing MoverMetrics instances and their NNC mappings at the class level (i.e. through a class static field) to avoid the duplication? ## hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/mover/Mover.java: ## @@ -160,7 +160,7 @@ Collections. emptySet(), movedWinWidth, moverThreads, 0, BlockStoragePolicySuite.ID_BIT_LENGTH]; this.excludedPinnedBlocks = excludedPinnedBlocks; this.nnc = nnc; -this.metrics = MoverMetrics.create(this); +this.metrics = nnc.getMoverMetrics(); Review Comment: We also need to add some UTs to reproduce the issue (without your fix) and validate the fix.
> Exiting Mover due to an exception in MoverMetrics.create > > > Key: HDFS-16867 > URL: https://issues.apache.org/jira/browse/HDFS-16867 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: ZhiWei Shi >Assignee: ZhiWei Shi >Priority: Major > Labels: pull-request-available > > After the Mover process is started for a period of time, the process exits > unexpectedly and an error is reported in the log > {code:java} > [hdfs@${hostname} hadoop-3.3.2-nn]$ nohup bin/hdfs mover -p > /test-mover-jira9534 > mover.log.jira9534.20221209.2 & > [hdfs@{hostname} hadoop-3.3.2-nn]$ tail -f mover.log.jira9534.20221209.2 > ... > 22/12/09 14:22:32 INFO balancer.Dispatcher: Start moving > blk_1073911285_170466 with size=134217728 from 10.108.182.205:800:DISK to > ${ip1}:800:ARCHIVE through ${ip2}:800 > 22/12/09 14:22:32 INFO balancer.Dispatcher: Successfully moved > blk_1073911285_170466 with size=134217728 from 10.108.182.205:800:DISK to > ${ip1}:800:ARCHIVE through ${ip2}:800 > 22/12/09 14:22:42 INFO impl.MetricsSystemImpl: Stopping Mover metrics > system... > 22/12/09 14:22:42 INFO impl.MetricsSystemImpl: Mover metrics system stopped. > 22/12/09 14:22:42 INFO impl.MetricsSystemImpl: Mover metrics system shutdown > complete. > Dec 9, 2022, 2:22:42 PM Mover took 13mins, 19sec > 22/12/09 14:22:42 ERROR mover.Mover: Exiting Mover due to an exception > org.apache.hadoop.metrics2.MetricsException: Metrics source > Mover-${BlockpoolID} already exists! 
> at > org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:152) > at > org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:125) > at > org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:229) > at > org.apache.hadoop.hdfs.server.mover.MoverMetrics.create(MoverMetrics.java:49) > at org.apache.hadoop.hdfs.server.mover.Mover.<init>(Mover.java:162) > at org.apache.hadoop.hdfs.server.mover.Mover.run(Mover.java:684) > at org.apache.hadoop.hdfs.server.mover.Mover$Cli.run(Mover.java:826) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:81) > at org.apache.hadoop.hdfs.server.mover.Mover.main(Mover.java:908) > {code} > 1. "final ExitStatus r = m.run()" returns only after scheduling one of the replicas. > 2. When "r == ExitStatus.IN_PROGRESS", iter.remove() is not executed. > 3. "new Mover" and "this.metrics = MoverMetrics.create(this)" are therefore executed multiple > times for the same nnc, which leads to the error. > {code:java} > //Mover.java > for (final StorageType t : diff.existing) { > for (final MLocation ml : locations) { > final Source source = storages.getSource(ml); > if (ml.storageType == t
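The class-level tracking Jing9 suggests in the review above could be sketched roughly as below. All names here (`MetricsRegistrySketch`, `getOrCreate`, the placeholder `Object` source) are hypothetical; the real `MoverMetrics.create` registers with `DefaultMetricsSystem`, which is what throws `MetricsException` on a duplicate source name:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of class-level (static) tracking of metrics sources:
// a second Mover created over the same blockpool reuses the existing source
// instead of re-registering it and hitting "Metrics source ... already exists!".
public class MetricsRegistrySketch {
    static final ConcurrentMap<String, Object> SOURCES = new ConcurrentHashMap<>();
    static final AtomicInteger CREATIONS = new AtomicInteger();

    // Stand-in for a guarded MoverMetrics.create(...): registration happens
    // at most once per source name, no matter how many Movers are built.
    public static Object getOrCreate(String sourceName) {
        return SOURCES.computeIfAbsent(sourceName, name -> {
            CREATIONS.incrementAndGet();  // runs only for a genuinely new name
            return new Object();          // placeholder for the metrics source
        });
    }

    public static void main(String[] args) {
        Object first = getOrCreate("Mover-BP-1234");
        Object second = getOrCreate("Mover-BP-1234");  // same nnc -> same name
        System.out.println(first == second);  // shared instance, no duplicate registration
        System.out.println(CREATIONS.get());  // registered exactly once
    }
}
```

The PR under review instead moves the metrics object onto the NameNodeConnector; either way, the invariant is one registered source per blockpool, not one per `new Mover`.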
[jira] [Created] (HDFS-16874) Improve DataNode decommission for Erasure Coding
Jing Zhao created HDFS-16874: Summary: Improve DataNode decommission for Erasure Coding Key: HDFS-16874 URL: https://issues.apache.org/jira/browse/HDFS-16874 Project: Hadoop HDFS Issue Type: Improvement Components: ec, erasure-coding Reporter: Jing Zhao Assignee: Jing Zhao There are a couple of issues with the current DataNode decommission implementation when large amounts of Erasure Coding data are involved in the data re-replication/reconstruction process: # Slowness. In HDFS-8786 we made a decision to use re-replication for DataNode decommission if the internal EC block is still available. While this strategy reduces the CPU cost caused by EC reconstruction, it greatly limits the overall data recovery bandwidth, since there is only a single DataNode as the source. As high-density HDD hosts are more and more widely used by HDFS, especially along with Erasure Coding for the warm-data use case, this becomes a big pain for cluster management. In our production, decommissioning a DataNode with several hundred TB of EC data stored might take several days. HDFS-16613 provides optimization based on the existing mechanism, but more fundamentally we may want to allow EC reconstruction for DataNode decommission so as to achieve much larger recovery bandwidth. # The semantics of the existing EC reconstruction command (the BlockECReconstructionInfoProto msg sent from NN to DN) are not clear. The existing reconstruction command depends on the holes in the srcNodes/liveBlockIndices arrays to indicate the target internal blocks for recovery, while the holes can also be caused by the fact that the corresponding datanode is too busy to be used as the reconstruction source. As a result, the later DataNode-side reconstruction may not be consistent with the original intention. E.g., if the index of the missing block is 6, and the datanode storing block 0 is busy, the src nodes in the reconstruction command only cover blocks [1, 2, 3, 4, 5, 7, 8].
The target datanode may reconstruct the internal block 0 instead of 6. HDFS-16566 is working on this issue by indicating an excluding index list. More fundamentally, we can follow the same path but go a step further by adding an optional field explicitly indicating the target block indices in the command protobuf msg. With this extension, the DataNode will no longer use the holes in the src node array to "guess" the reconstruction targets. Internally we have developed and applied fixes following the above directions. We have seen significant improvement (100+ times speed-up) in datanode decommission speed for EC data. The clearer semantics of the reconstruction command protobuf msg also help prevent potential data corruption during the EC reconstruction. We will use this ticket to track similar fixes for the Apache releases. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
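The ambiguity described in point 2 of the ticket can be illustrated with made-up values. This is a hypothetical sketch (the `holes` helper and class name are invented), assuming an RS(6,3) group with nine internal blocks indexed 0..8, where block 6 is truly missing but the node holding block 0 is merely busy:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of why holes in the live-block-index array are ambiguous: the
// receiving DataNode sees two holes but cannot tell which one is the real
// reconstruction target and which node is only busy.
public class EcTargetIndices {
    // Return every index of 0..totalBlocks-1 absent from liveBlockIndices.
    public static List<Integer> holes(int totalBlocks, int[] liveBlockIndices) {
        boolean[] live = new boolean[totalBlocks];
        for (int i : liveBlockIndices) {
            live[i] = true;
        }
        List<Integer> missing = new ArrayList<>();
        for (int i = 0; i < totalBlocks; i++) {
            if (!live[i]) {
                missing.add(i);  // a hole: truly missing, or source just busy
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Node holding block 0 is busy; block 6 is the real target.
        int[] liveBlockIndices = {1, 2, 3, 4, 5, 7, 8};
        System.out.println(holes(9, liveBlockIndices));  // two holes: ambiguous
        // An explicit target-index field in the command leaves nothing to infer:
        List<Integer> explicitTargets = Arrays.asList(6);
        System.out.println(explicitTargets);
    }
}
```

With the optional explicit-targets field the ticket proposes, the DataNode reconstructs exactly the listed indices instead of guessing among the holes.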
[jira] [Updated] (HDFS-16874) Improve DataNode decommission for Erasure Coding
[ https://issues.apache.org/jira/browse/HDFS-16874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-16874: - Description: There are a couple of issues with the current DataNode decommission implementation when large amounts of Erasure Coding data are involved in the data re-replication/reconstruction process: # Slowness. In HDFS-8786 we made a decision to use re-replication for DataNode decommission if the internal EC block is still available. While this strategy reduces the CPU cost caused by EC reconstruction, it greatly limits the overall data recovery bandwidth, since there is only one single DataNode as the source. While high density HDD hosts are more and more widely used by HDFS especially along with Erasure Coding for warm data use case, this becomes a big pain for cluster management. In our production, to decommission a DataNode with several hundred TB EC data stored might take several days. HDFS-16613 provides optimization based on the existing mechanism, but more fundamentally we may want to allow EC reconstruction for DataNode decommission so as to achieve much larger recovery bandwidth. # The semantic of the existing EC reconstruction command (the BlockECReconstructionInfoProto msg sent from NN to DN) is not clear. The existing reconstruction command depends on the holes in the srcNodes/liveBlockIndices arrays to indicate the target internal blocks for recovery, while the holes can also be caused by the fact that the corresponding datanode is too busy so it cannot be used as the reconstruction source. This causes the later DataNode side reconstruction may not be consistent with the original intention. E.g., if the index of the missing block is 6, and the datanode storing block 0 is busy, the src nodes in the reconstruction command only cover blocks [1, 2, 3, 4, 5, 7, 8]. The target datanode may reconstruct the internal block 0 instead of 6. 
HDFS-16566 is working on this issue by indicating an excluding index list. More fundamentally we can follow the same path but go steps further by adding an optional field explicitly indicating the target block indices in the command protobuf msg. With the extension the DataNode will no longer use the holes in the src node array to "guess" the reconstruction targets. Internally we have developed and applied fixes by following the above directions. We have seen significant improvement (100+ times speed up) in terms of datanode decommission speed for EC data. The more clear semantic of the reconstruction command protobuf msg also help prevent potential data corruption during the EC reconstruction. We will use this ticket to track the similar fixes for the Apache releases. was: There are a couple of issues with the current DataNode decommission implementation when large amounts of Erasure Coding data are involved in the data re-replication/reconstruction process: # Slowness. In HDFS-8786 we made a decision to use re-replication for DataNode decommission if the internal EC block is still available. While this strategy reduces the CPU cost caused by EC reconstruction, it greatly limits the overall data recovery bandwidth, since there is only one single DataNode as the source. While high density HDD hosts are more and more widely used by HDFS especially along with Erasure Coding for warm data use case, this becomes a big pain for cluster management. In our production, to decommission a DataNode with several hundred TB EC data stored might take several days. HDFS-16613 provides optimization based on the existing mechanism, but more fundamentally we may want to allow EC reconstruction for DataNode decommission so as to achieve much larger recovery bandwidth. # The semantic of the existing EC reconstruction command (the BlockECReconstructionInfoProto msg sent from NN to DN) is not clear. 
The existing reconstruction command depends on the holes in the srcNodes/liveBlockIndices arrays to indicate the target internal blocks for recovery, while the holes can also be caused by the fact that the corresponding datanode is too busy so it cannot be used as the reconstruction source. This causes the later DataNode side reconstruction may not be consistent with the original intention. E.g., if the index of the missing block is 6, and the datanode storing block 0 is busy, the src nodes in the reconstruction command only cover blocks [1, 2, 3, 4, 5, 7, 8]. The target datanode may reconstruct the internal block 0 instead of 6. HDFS-16566 is working on this issue by indicating an excluding index list. More fundamentally we can follow the same path but go steps further by adding an optional field explicitly indicating the target block indices in the command protobuf msg. With the extension the DataNode will no longer use the holes in the src node array to "guess" the reconstruction targets. Internally we have developed and applied f
[jira] [Updated] (HDFS-16874) Improve DataNode decommission for Erasure Coding
[ https://issues.apache.org/jira/browse/HDFS-16874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jing Zhao updated HDFS-16874: - Description: There are a couple of issues with the current DataNode decommission implementation when large amounts of Erasure Coding data are involved in the data re-replication/reconstruction process: # Slowness. In HDFS-8786 we made a decision to use re-replication for DataNode decommission if the internal EC block is still available. While this strategy reduces the CPU cost caused by EC reconstruction, it greatly limits the overall data recovery bandwidth, since there is only one single DataNode as the source. While high density HDD hosts are more and more widely used by HDFS especially along with Erasure Coding for warm data use case, this becomes a big pain for cluster management. In our production, to decommission a DataNode with several hundred TB EC data stored might take several days. HDFS-16613 provides optimization based on the existing mechanism, but more fundamentally we may want to allow EC reconstruction for DataNode decommission so as to achieve much larger recovery bandwidth. # The semantic of the existing EC reconstruction command (the BlockECReconstructionInfoProto msg sent from NN to DN) is not clear. The existing reconstruction command depends on the holes in the srcNodes/liveBlockIndices arrays to indicate the target internal blocks for recovery, while the holes can also be caused by the fact that the corresponding datanode is too busy so it cannot be used as the reconstruction source. This causes the later DataNode side reconstruction may not be consistent with the original intention. E.g., if the index of the missing block is 6, and the datanode storing block 0 is busy, the src nodes in the reconstruction command only cover blocks [1, 2, 3, 4, 5, 7, 8]. The target datanode may reconstruct the internal block 0 instead of 6. 
HDFS-16566 is working on this issue by indicating an excluding index list. More fundamentally we can follow the same path but go a step further by adding an optional field explicitly indicating the target block indices in the command protobuf msg. With the extension the DataNode will no longer use the holes in the src node array to "guess" the reconstruction targets. Internally we have developed and applied fixes by following the above directions. We have seen significant improvement (100+ times speed up) in terms of datanode decommission speed for EC data. The more clear semantic of the reconstruction command protobuf msg also help prevent potential data corruption during the EC reconstruction. We will use this ticket to track the similar fixes for the Apache releases. was: There are a couple of issues with the current DataNode decommission implementation when large amounts of Erasure Coding data are involved in the data re-replication/reconstruction process: # Slowness. In HDFS-8786 we made a decision to use re-replication for DataNode decommission if the internal EC block is still available. While this strategy reduces the CPU cost caused by EC reconstruction, it greatly limits the overall data recovery bandwidth, since there is only one single DataNode as the source. While high density HDD hosts are more and more widely used by HDFS especially along with Erasure Coding for warm data use case, this becomes a big pain for cluster management. In our production, to decommission a DataNode with several hundred TB EC data stored might take several days. HDFS-16613 provides optimization based on the existing mechanism, but more fundamentally we may want to allow EC reconstruction for DataNode decommission so as to achieve much larger recovery bandwidth. # The semantic of the existing EC reconstruction command (the BlockECReconstructionInfoProto msg sent from NN to DN) is not clear. 
The existing reconstruction command depends on the holes in the srcNodes/liveBlockIndices arrays to indicate the target internal blocks for recovery, while the holes can also be caused by the fact that the corresponding datanode is too busy so it cannot be used as the reconstruction source. This causes the later DataNode side reconstruction may not be consistent with the original intention. E.g., if the index of the missing block is 6, and the datanode storing block 0 is busy, the src nodes in the reconstruction command only cover blocks [1, 2, 3, 4, 5, 7, 8]. The target datanode may reconstruct the internal block 0 instead of 6. HDFS-16566 is working on this issue by indicating an excluding index list. More fundamentally we can follow the same path but go steps further by adding an optional field explicitly indicating the target block indices in the command protobuf msg. With the extension the DataNode will no longer use the holes in the src node array to "guess" the reconstruction targets. Internally we have developed and applied
[jira] [Created] (HDFS-16875) Erasure Coding: data access proxy to allow old clients to read EC data
Jing Zhao created HDFS-16875: Summary: Erasure Coding: data access proxy to allow old clients to read EC data Key: HDFS-16875 URL: https://issues.apache.org/jira/browse/HDFS-16875 Project: Hadoop HDFS Issue Type: New Feature Components: ec, erasure-coding Reporter: Jing Zhao Assignee: Jing Zhao Erasure Coding is only supported by Hadoop 3, while many production deployments still depend on Hadoop 2. Upgrading the whole data tech stack to the Hadoop 3 release may involve big migration efforts and even reliability risks, considering the incompatibilities between these two Hadoop major releases as well as the potential uncovered issues and risks hidden in newer releases. Therefore, we need to find a solution, with the least amount of migration effort and risk, to adopt Erasure Coding for cost efficiency but still allow HDFS clients with old versions (Hadoop 2.x) to access EC data in a transparent manner. Internally we have developed an EC access proxy which translates the EC data for old clients. We also extend the NameNode RPC so it can recognize HDFS clients with/without the EC support, and redirect the old clients to the proxy. With the proxy we set up separate Erasure Coding clusters storing hundreds of PB of data, while leaving other production clusters and all the upper layer applications untouched. Considering some changes are made at fundamental components of HDFS (e.g., client-NN RPC header), we do not aim to merge the change to trunk. We will use this ticket to share the design and implementation details (including the code) and collect feedback. We may use a separate github repo to open source the implementation later. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16872) Fix log throttling by declaring LogThrottlingHelper as static members
[ https://issues.apache.org/jira/browse/HDFS-16872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17649564#comment-17649564 ] ASF GitHub Bot commented on HDFS-16872: --- ChengbingLiu opened a new pull request, #5246: URL: https://github.com/apache/hadoop/pull/5246 ### Description of PR In our production cluster with Observer NameNode enabled, we have plenty of logs printed by `FSEditLogLoader` and `RedundantEditLogInputStream`. The `LogThrottlingHelper` doesn't seem to work. ``` 2022-10-25 09:26:50,380 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Start loading edits file ByteStringEditLog[17686250688, 17686250688], ByteStringEditLog[17686250688, 17686250688], ByteStringEditLog[17686250688, 17686250688] maxTxnsToRead = 9223372036854775807 2022-10-25 09:26:50,380 INFO org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: Fast-forwarding stream 'ByteStringEditLog[17686250688, 17686250688], ByteStringEditLog[17686250688, 17686250688], ByteStringEditLog[17686250688, 17686250688]' to transaction ID 17686250688 2022-10-25 09:26:50,380 INFO org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: Fast-forwarding stream 'ByteStringEditLog[17686250688, 17686250688]' to transaction ID 17686250688 2022-10-25 09:26:50,380 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Loaded 1 edits file(s) (the last named ByteStringEditLog[17686250688, 17686250688], ByteStringEditLog[17686250688, 17686250688], ByteStringEditLog[17686250688, 17686250688]) of total size 527.0, total edits 1.0, total load time 0.0 ms 2022-10-25 09:26:50,387 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Start loading edits file ByteStringEditLog[17686250689, 17686250693], ByteStringEditLog[17686250689, 17686250693], ByteStringEditLog[17686250689, 17686250693] maxTxnsToRead = 9223372036854775807 2022-10-25 09:26:50,387 INFO org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: Fast-forwarding stream
'ByteStringEditLog[17686250689, 17686250693], ByteStringEditLog[17686250689, 17686250693], ByteStringEditLog[17686250689, 17686250693]' to transaction ID 17686250689 2022-10-25 09:26:50,387 INFO org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: Fast-forwarding stream 'ByteStringEditLog[17686250689, 17686250693]' to transaction ID 17686250689 2022-10-25 09:26:50,387 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Loaded 1 edits file(s) (the last named ByteStringEditLog[17686250689, 17686250693], ByteStringEditLog[17686250689, 17686250693], ByteStringEditLog[17686250689, 17686250693]) of total size 890.0, total edits 5.0, total load time 1.0 ms ``` After some digging, I found the cause is that `LogThrottlingHelper`'s are declared as instance variables of all the enclosing classes, including `FSImage`, `FSEditLogLoader` and `RedundantEditLogInputStream`. Therefore the logging frequency will not be limited across different instances. For classes with only limited number of instances, such as `FSImage`, this is fine. For others whose instances are created frequently, such as `FSEditLogLoader` and `RedundantEditLogInputStream`, it will result in plenty of logs. This can be fixed by declaring `LogThrottlingHelper`'s as static members. ### How was this patch tested? Through a test case. > Fix log throttling by declaring LogThrottlingHelper as static members > - > > Key: HDFS-16872 > URL: https://issues.apache.org/jira/browse/HDFS-16872 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.3.4 >Reporter: Chengbing Liu >Priority: Major > > In our production cluster with Observer NameNode enabled, we have plenty of > logs printed by {{FSEditLogLoader}} and {{RedundantEditLogInputStream}}. The > {{LogThrottlingHelper}} doesn't seem to work. 
> {noformat} > 2022-10-25 09:26:50,380 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: > Start loading edits file ByteStringEditLog[17686250688, 17686250688], > ByteStringEditLog[17686250688, 17686250688], ByteStringEditLog[17686250688, > 17686250688] maxTxnsToRead = 9223372036854775807 > 2022-10-25 09:26:50,380 INFO > org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: > Fast-forwarding stream 'ByteStringEditLog[17686250688, 17686250688], > ByteStringEditLog[17686250688, 17686250688], ByteStringEditLog[17686250688, > 17686250688]' to transaction ID 17686250688 > 2022-10-25 09:26:50,380 INFO > org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: > Fast-forwarding stream 'ByteStringEditLog[17686250688, 17686250688]' to > transaction ID 17686250688 > 2022-10-25 09:26:50,380 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: > Loaded 1 edits file(s) (the las
[jira] [Updated] (HDFS-16872) Fix log throttling by declaring LogThrottlingHelper as static members
[ https://issues.apache.org/jira/browse/HDFS-16872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HDFS-16872: -- Labels: pull-request-available (was: ) > Fix log throttling by declaring LogThrottlingHelper as static members > - > > Key: HDFS-16872 > URL: https://issues.apache.org/jira/browse/HDFS-16872 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.3.4 >Reporter: Chengbing Liu >Priority: Major > Labels: pull-request-available > > In our production cluster with Observer NameNode enabled, we have plenty of > logs printed by {{FSEditLogLoader}} and {{RedundantEditLogInputStream}}. The > {{LogThrottlingHelper}} doesn't seem to work. > {noformat} > 2022-10-25 09:26:50,380 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: > Start loading edits file ByteStringEditLog[17686250688, 17686250688], > ByteStringEditLog[17686250688, 17686250688], ByteStringEditLog[17686250688, > 17686250688] maxTxnsToRead = 9223372036854775807 > 2022-10-25 09:26:50,380 INFO > org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: > Fast-forwarding stream 'ByteStringEditLog[17686250688, 17686250688], > ByteStringEditLog[17686250688, 17686250688], ByteStringEditLog[17686250688, > 17686250688]' to transaction ID 17686250688 > 2022-10-25 09:26:50,380 INFO > org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: > Fast-forwarding stream 'ByteStringEditLog[17686250688, 17686250688]' to > transaction ID 17686250688 > 2022-10-25 09:26:50,380 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: > Loaded 1 edits file(s) (the last named ByteStringEditLog[17686250688, > 17686250688], ByteStringEditLog[17686250688, 17686250688], > ByteStringEditLog[17686250688, 17686250688]) of total size 527.0, total edits > 1.0, total load time 0.0 ms > 2022-10-25 09:26:50,387 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: > Start loading edits file ByteStringEditLog[17686250689, 17686250693], > ByteStringEditLog[17686250689, 
17686250693], ByteStringEditLog[17686250689, > 17686250693] maxTxnsToRead = 9223372036854775807 > 2022-10-25 09:26:50,387 INFO > org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: > Fast-forwarding stream 'ByteStringEditLog[17686250689, 17686250693], > ByteStringEditLog[17686250689, 17686250693], ByteStringEditLog[17686250689, > 17686250693]' to transaction ID 17686250689 > 2022-10-25 09:26:50,387 INFO > org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: > Fast-forwarding stream 'ByteStringEditLog[17686250689, 17686250693]' to > transaction ID 17686250689 > 2022-10-25 09:26:50,387 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: > Loaded 1 edits file(s) (the last named ByteStringEditLog[17686250689, > 17686250693], ByteStringEditLog[17686250689, 17686250693], > ByteStringEditLog[17686250689, 17686250693]) of total size 890.0, total edits > 5.0, total load time 1.0 ms > {noformat} > After some digging, I found the cause is that {{LogThrottlingHelper}}'s are > declared as instance variables of all the enclosing classes, including > {{FSImage}}, {{FSEditLogLoader}} and {{RedundantEditLogInputStream}}. > Therefore the logging frequency will not be limited across different > instances. For classes with only limited number of instances, such as > {{FSImage}}, this is fine. For others whose instances are created frequently, > such as {{FSEditLogLoader}} and {{RedundantEditLogInputStream}}, it will > result in plenty of logs. > This can be fixed by declaring {{LogThrottlingHelper}}'s as static members. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
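Why the static declaration matters can be sketched with a toy throttler. This is a hypothetical count-based stand-in (the real `LogThrottlingHelper` is time-based, and all names below are invented): a per-instance helper starts fresh for every short-lived `FSEditLogLoader`/`RedundantEditLogInputStream`, so nothing is ever suppressed, while a static helper throttles across instances:

```java
import java.util.concurrent.atomic.AtomicLong;

// Toy count-based throttler contrasting per-instance vs. static scope.
public class ThrottleScopeDemo {
    static final class CountThrottler {
        private final AtomicLong seen = new AtomicLong();
        // Allow only the first record out of every `period` calls.
        boolean shouldLog(int period) {
            return seen.getAndIncrement() % period == 0;
        }
    }

    // Shared across all "loader" instances, as the fix proposes.
    private static final CountThrottler STATIC_THROTTLER = new CountThrottler();
    // Re-created with each instance, as in the buggy code.
    private final CountThrottler instanceThrottler = new CountThrottler();

    boolean logViaInstance() { return instanceThrottler.shouldLog(100); }
    static boolean logViaStatic() { return STATIC_THROTTLER.shouldLog(100); }

    public static void main(String[] args) {
        int instanceLogs = 0, staticLogs = 0;
        for (int i = 0; i < 50; i++) {  // 50 short-lived loader instances
            ThrottleScopeDemo loader = new ThrottleScopeDemo();
            if (loader.logViaInstance()) instanceLogs++;  // first call always logs
            if (logViaStatic()) staticLogs++;
        }
        System.out.println(instanceLogs);  // one log per instance: no throttling
        System.out.println(staticLogs);    // throttled across instances
    }
}
```

This mirrors the PR's observation: classes like `FSImage` with few instances are fine either way, but frequently created classes need the helper at class level for the suppression to take effect.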
[jira] [Commented] (HDFS-16872) Fix log throttling by declaring LogThrottlingHelper as static members
[ https://issues.apache.org/jira/browse/HDFS-16872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17649580#comment-17649580 ] ASF GitHub Bot commented on HDFS-16872: --- ChengbingLiu commented on PR #5246: URL: https://github.com/apache/hadoop/pull/5246#issuecomment-1358874741 Here is why I added the reset-`lastLogTimestampMs` logic and slightly tweaked the test case: Each test method in `TestFSEditLogLoader` is executed twice (due to the `@Parameterized` annotation of the class). Since I change the `LogThrottlingHelper` to a static field, other test method execution can have two effects: 1. set the `lastLogTimestampMs` field to a current time; 2. create `LoggingAction`s with suppressed logs. This will break the previous way of testing log throttling, which uses a faked timer. Therefore this patch: 1. resets `lastLogTimestampMs` if the `currentTimeMs` is smaller (can only happen in test cases) 2. clears previous logs by a `loadFSEdits` call before the original test cases These may not be the best solution. Please take a look @xkrogen . Thanks. > Fix log throttling by declaring LogThrottlingHelper as static members > - > > Key: HDFS-16872 > URL: https://issues.apache.org/jira/browse/HDFS-16872 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.3.4 >Reporter: Chengbing Liu >Priority: Major > Labels: pull-request-available > > In our production cluster with Observer NameNode enabled, we have plenty of > logs printed by {{FSEditLogLoader}} and {{RedundantEditLogInputStream}}. The > {{LogThrottlingHelper}} doesn't seem to work. 
> {noformat} > 2022-10-25 09:26:50,380 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: > Start loading edits file ByteStringEditLog[17686250688, 17686250688], > ByteStringEditLog[17686250688, 17686250688], ByteStringEditLog[17686250688, > 17686250688] maxTxnsToRead = 9223372036854775807 > 2022-10-25 09:26:50,380 INFO > org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: > Fast-forwarding stream 'ByteStringEditLog[17686250688, 17686250688], > ByteStringEditLog[17686250688, 17686250688], ByteStringEditLog[17686250688, > 17686250688]' to transaction ID 17686250688 > 2022-10-25 09:26:50,380 INFO > org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: > Fast-forwarding stream 'ByteStringEditLog[17686250688, 17686250688]' to > transaction ID 17686250688 > 2022-10-25 09:26:50,380 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: > Loaded 1 edits file(s) (the last named ByteStringEditLog[17686250688, > 17686250688], ByteStringEditLog[17686250688, 17686250688], > ByteStringEditLog[17686250688, 17686250688]) of total size 527.0, total edits > 1.0, total load time 0.0 ms > 2022-10-25 09:26:50,387 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: > Start loading edits file ByteStringEditLog[17686250689, 17686250693], > ByteStringEditLog[17686250689, 17686250693], ByteStringEditLog[17686250689, > 17686250693] maxTxnsToRead = 9223372036854775807 > 2022-10-25 09:26:50,387 INFO > org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: > Fast-forwarding stream 'ByteStringEditLog[17686250689, 17686250693], > ByteStringEditLog[17686250689, 17686250693], ByteStringEditLog[17686250689, > 17686250693]' to transaction ID 17686250689 > 2022-10-25 09:26:50,387 INFO > org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream: > Fast-forwarding stream 'ByteStringEditLog[17686250689, 17686250693]' to > transaction ID 17686250689 > 2022-10-25 09:26:50,387 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: > Loaded 1 edits file(s) 
(the last named ByteStringEditLog[17686250689, > 17686250693], ByteStringEditLog[17686250689, 17686250693], > ByteStringEditLog[17686250689, 17686250693]) of total size 890.0, total edits > 5.0, total load time 1.0 ms > {noformat} > After some digging, I found the cause is that {{LogThrottlingHelper}}'s are > declared as instance variables of all the enclosing classes, including > {{FSImage}}, {{FSEditLogLoader}} and {{RedundantEditLogInputStream}}. > Therefore the logging frequency will not be limited across different > instances. For classes with only limited number of instances, such as > {{FSImage}}, this is fine. For others whose instances are created frequently, > such as {{FSEditLogLoader}} and {{RedundantEditLogInputStream}}, it will > result in plenty of logs. > This can be fixed by declaring {{LogThrottlingHelper}}'s as static members.