[jira] [Work logged] (HDFS-16583) DatanodeAdminDefaultMonitor can get stuck in an infinite loop

2022-05-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16583?focusedWorklogId=774408&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-774408
 ]

ASF GitHub Bot logged work on HDFS-16583:
-

Author: ASF GitHub Bot
Created on: 25/May/22 08:08
Start Date: 25/May/22 08:08
Worklog Time Spent: 10m 
  Work Description: sodonnel commented on PR #4332:
URL: https://github.com/apache/hadoop/pull/4332#issuecomment-1136926780

   @jojochuang I don't think the test failure is related. Could you have 
another look and see if this change is good? Thanks!




Issue Time Tracking
---

Worklog Id: (was: 774408)
Time Spent: 1h 10m  (was: 1h)

> DatanodeAdminDefaultMonitor can get stuck in an infinite loop
> -
>
> Key: HDFS-16583
> URL: https://issues.apache.org/jira/browse/HDFS-16583
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Stephen O'Donnell
>Assignee: Stephen O'Donnell
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> We encountered a case where the decommission monitor in the namenode got 
> stuck for about 6 hours. The logs give:
> {code}
> 2022-05-15 01:09:25,490 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager: Stopping 
> maintenance of dead node 10.185.3.132:50010
> 2022-05-15 01:10:20,918 INFO org.apache.hadoop.http.HttpServer2: Process 
> Thread Dump: jsp requested
> 
> 2022-05-15 01:19:06,810 WARN 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> PendingReconstructionMonitor timed out blk_4501753665_3428271426
> 2022-05-15 01:19:06,810 WARN 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> PendingReconstructionMonitor timed out blk_4501753659_3428271420
> 2022-05-15 01:19:06,810 WARN 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> PendingReconstructionMonitor timed out blk_4501753662_3428271423
> 2022-05-15 01:19:06,810 WARN 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> PendingReconstructionMonitor timed out blk_4501753663_3428271424
> 2022-05-15 06:00:57,281 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager: Stopping 
> maintenance of dead node 10.185.3.34:50010
> 2022-05-15 06:00:58,105 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem write lock 
> held for 17492614 ms via
> java.lang.Thread.getStackTrace(Thread.java:1559)
> org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:263)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:220)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.writeUnlock(FSNamesystem.java:1601)
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager$Monitor.run(DatanodeAdminManager.java:496)
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> java.lang.Thread.run(Thread.java:748)
>   Number of suppressed write-lock reports: 0
>   Longest write-lock held interval: 17492614
> {code}
> We only have the one thread dump triggered by the FC:
> {code}
> Thread 80 (DatanodeAdminMonitor-0):
>   State: RUNNABLE
>   Blocked count: 16
>   Waited count: 453693
>   Stack:
> 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager$Monitor.check(DatanodeAdminManager.java:538)
> 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager$Monitor.run(DatanodeAdminManager.java:494)
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> java.lang.Thread.run(Thread.java:748)
> {code}
> This was the line of code:
> {code}
> private void check() {
>   final Iterator>>
>   it = new CyclicItera

[jira] [Commented] (HDFS-13522) RBF: Support observer node from Router-Based Federation

2022-05-25 Thread zhengchenyu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17541887#comment-17541887
 ] 

zhengchenyu commented on HDFS-13522:


Thanks for this good patch!

But I have a question. In this patch the router hides the client's state id. I 
think that means we cannot enable or disable this feature, and cannot configure it.

As far as I know, the Observer NameNode can't guarantee absolute consistency, so 
users may need to configure auto-msync-period and some other parameters to meet 
different consistency requirements.

I think it would be better to use the client's state id at the client RPC call 
level. (Note: of course, that needs more changes, especially for mounting 
multiple nameservices.)
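
For reference, the auto-msync period mentioned above is a client-side setting 
handled by the ObserverReadProxyProvider. A minimal sketch of setting it 
programmatically (the nameservice id "ns1" and the 2-second value are example 
assumptions; in practice it would go into the client configuration files):

{code:java}
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.conf.Configuration;

public class ObserverAutoMsyncExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Bound how stale observer reads may become before the proxy provider
    // issues an msync() to the active NameNode. "ns1" is a hypothetical
    // nameservice id and 2 seconds is only an example value.
    conf.setTimeDuration(
        "dfs.client.failover.observer.auto-msync-period.ns1",
        2, TimeUnit.SECONDS);
  }
}
{code}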

> RBF: Support observer node from Router-Based Federation
> ---
>
> Key: HDFS-13522
> URL: https://issues.apache.org/jira/browse/HDFS-13522
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: federation, namenode
>Reporter: Erik Krogen
>Assignee: Simbarashe Dzinamarira
>Priority: Major
>  Labels: pull-request-available
> Attachments: HDFS-13522.001.patch, HDFS-13522.002.patch, 
> HDFS-13522_WIP.patch, RBF_ Observer support.pdf, Router+Observer RPC 
> clogging.png, ShortTerm-Routers+Observer.png
>
>  Time Spent: 8h 10m
>  Remaining Estimate: 0h
>
> Changes will need to occur to the router to support the new observer node.
> One such change will be to make the router understand the observer state, 
> e.g. {{FederationNamenodeServiceState}}.






[jira] [Work logged] (HDFS-16593) Correct inaccurate BlocksRemoved metric on DataNode side

2022-05-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16593?focusedWorklogId=774428&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-774428
 ]

ASF GitHub Bot logged work on HDFS-16593:
-

Author: ASF GitHub Bot
Created on: 25/May/22 08:55
Start Date: 25/May/22 08:55
Worklog Time Spent: 10m 
  Work Description: hadoop-yetus commented on PR #4353:
URL: https://github.com/apache/hadoop/pull/4353#issuecomment-1136978477

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 38s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  1s |  |  codespell was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | -1 :x: |  test4tests  |   0m  0s |  |  The patch doesn't appear to include 
any new or modified tests. Please justify why no new tests are needed for this 
patch. Also please list what manual steps were performed to verify this patch.  
|
    _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  37m 22s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   1m 42s |  |  trunk passed with JDK 
Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1  |
   | +1 :green_heart: |  compile  |   1m 36s |  |  trunk passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  checkstyle  |   1m 25s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   1m 45s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   1m 25s |  |  trunk passed with JDK 
Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1  |
   | +1 :green_heart: |  javadoc  |   1m 46s |  |  trunk passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  spotbugs  |   3m 40s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  23m 10s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   1m 22s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 24s |  |  the patch passed with JDK 
Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1  |
   | +1 :green_heart: |  javac  |   1m 24s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 19s |  |  the patch passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  javac  |   1m 19s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   1m  1s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   1m 25s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 57s |  |  the patch passed with JDK 
Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1  |
   | +1 :green_heart: |  javadoc  |   1m 33s |  |  the patch passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  spotbugs  |   3m 22s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  22m 39s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | -1 :x: |  unit  | 250m  0s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4353/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   1m 14s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 359m  6s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | 
hadoop.hdfs.server.datanode.fsdataset.impl.TestFsDatasetImpl |
   |   | 
hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaPlacement |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4353/1/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/4353 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell |
   | uname | Linux 7673a3cf7782 4.15.0-156-generic #163-Ubuntu SMP Thu Aug 19 
23:31:58 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / e8b4ab83cb5ae4dd3ed1e727aaac59d4607d2fe9 |
   | Default Java | Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
   | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Private 
Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 
/usr/lib/jvm/java-8-open

[jira] [Updated] (HDFS-15225) RBF: Add snapshot counts to content summary in router

2022-05-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HDFS-15225:
--
Labels: pull-request-available  (was: )

> RBF: Add snapshot counts to content summary in router
> -
>
> Key: HDFS-15225
> URL: https://issues.apache.org/jira/browse/HDFS-15225
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Quan Li
>Assignee: Quan Li
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>







[jira] [Work logged] (HDFS-15225) RBF: Add snapshot counts to content summary in router

2022-05-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15225?focusedWorklogId=774471&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-774471
 ]

ASF GitHub Bot logged work on HDFS-15225:
-

Author: ASF GitHub Bot
Created on: 25/May/22 10:48
Start Date: 25/May/22 10:48
Worklog Time Spent: 10m 
  Work Description: ayushtkn opened a new pull request, #4356:
URL: https://github.com/apache/hadoop/pull/4356

   ### Description of PR
   Add snapshot counts in getContentSummary
   
   ### How was this patch tested?
   UT
   
   ### For code changes:
   
   - [ ] Does the title or this PR starts with the corresponding JIRA issue id 
(e.g. 'HADOOP-17799. Your PR title ...')?




Issue Time Tracking
---

Worklog Id: (was: 774471)
Remaining Estimate: 0h
Time Spent: 10m

> RBF: Add snapshot counts to content summary in router
> -
>
> Key: HDFS-15225
> URL: https://issues.apache.org/jira/browse/HDFS-15225
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Quan Li
>Assignee: Quan Li
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>







[jira] [Assigned] (HDFS-15225) RBF: Add snapshot counts to content summary in router

2022-05-25 Thread Ayush Saxena (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ayush Saxena reassigned HDFS-15225:
---

Assignee: Ayush Saxena  (was: Quan Li)

> RBF: Add snapshot counts to content summary in router
> -
>
> Key: HDFS-15225
> URL: https://issues.apache.org/jira/browse/HDFS-15225
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Quan Li
>Assignee: Ayush Saxena
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>







[jira] [Commented] (HDFS-15225) RBF: Add snapshot counts to content summary in router

2022-05-25 Thread Ayush Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17541970#comment-17541970
 ] 

Ayush Saxena commented on HDFS-15225:
-

Taking this up. Have raised a PR

> RBF: Add snapshot counts to content summary in router
> -
>
> Key: HDFS-15225
> URL: https://issues.apache.org/jira/browse/HDFS-15225
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Quan Li
>Assignee: Ayush Saxena
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>







[jira] [Work logged] (HDFS-15225) RBF: Add snapshot counts to content summary in router

2022-05-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15225?focusedWorklogId=774552&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-774552
 ]

ASF GitHub Bot logged work on HDFS-15225:
-

Author: ASF GitHub Bot
Created on: 25/May/22 13:12
Start Date: 25/May/22 13:12
Worklog Time Spent: 10m 
  Work Description: hadoop-yetus commented on PR #4356:
URL: https://github.com/apache/hadoop/pull/4356#issuecomment-1137221132

   :confetti_ball: **+1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 53s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  1s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  1s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 1 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  66m 43s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   0m 45s |  |  trunk passed with JDK 
Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1  |
   | +1 :green_heart: |  compile  |   0m 40s |  |  trunk passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  checkstyle  |   0m 33s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   0m 46s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   0m 50s |  |  trunk passed with JDK 
Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1  |
   | +1 :green_heart: |  javadoc  |   0m 58s |  |  trunk passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  spotbugs  |   1m 30s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  20m  5s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 33s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 35s |  |  the patch passed with JDK 
Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1  |
   | +1 :green_heart: |  javac  |   0m 35s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 30s |  |  the patch passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  javac  |   0m 30s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   0m 18s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   0m 34s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 33s |  |  the patch passed with JDK 
Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1  |
   | +1 :green_heart: |  javadoc  |   0m 50s |  |  the patch passed with JDK 
Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  spotbugs  |   1m 17s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  20m  0s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |  21m 25s |  |  hadoop-hdfs-rbf in the patch 
passed.  |
   | +1 :green_heart: |  asflicense  |   0m 40s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 142m 46s |  |  |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4356/1/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/4356 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux 760dae6db650 4.15.0-169-generic #177-Ubuntu SMP Thu Feb 3 
10:50:38 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 84e23e1be7a1c33b268da8d97653b2c9ddd93a50 |
   | Default Java | Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
   | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Private 
Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 
/usr/lib/jvm/java-8-openjdk-amd64:Private 
Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
   |  Test Results | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4356/1/testReport/ |
   | Max. process+thread count | 2069 (vs. ulimit of 5500) |
   | modules | C: hadoop-hdfs-project/hadoop-hdfs-rbf U: 
hadoop-hdfs-project/hadoop-hdfs-rbf |
   | Console output | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4356/1/console |
 

[jira] [Work logged] (HDFS-16592) Fix typo for BalancingPolicy

2022-05-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16592?focusedWorklogId=774728&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-774728
 ]

ASF GitHub Bot logged work on HDFS-16592:
-

Author: ASF GitHub Bot
Created on: 25/May/22 17:54
Start Date: 25/May/22 17:54
Worklog Time Spent: 10m 
  Work Description: ayushtkn commented on code in PR #4351:
URL: https://github.com/apache/hadoop/pull/4351#discussion_r881957388


##
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/balancer/BalancingPolicy.java:
##
@@ -28,7 +28,7 @@
  * Balancing policy.
  * Since a datanode may contain multiple block pools,
  * {@link Pool} implies {@link Node}
- * but NOT the other way around
+ * but not the other way around

Review Comment:
   I think this is just changing the case of NOT, which we can tolerate; it 
feels like it was intentionally capitalized to give the "not" extra weight or 
bring it to notice. This isn't a typo nor a grammatical error in my book, just 
a style of writing that is generally acceptable.





Issue Time Tracking
---

Worklog Id: (was: 774728)
Time Spent: 40m  (was: 0.5h)

> Fix typo for BalancingPolicy
> 
>
> Key: HDFS-16592
> URL: https://issues.apache.org/jira/browse/HDFS-16592
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer & mover, documentation, namenode
>Affects Versions: 3.4.0
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Minor
>  Labels: pull-request-available
> Attachments: image-2022-05-24-11-29-14-019.png
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
>  !image-2022-05-24-11-29-14-019.png! 
> 'NOT' should be changed to lowercase rather than uppercase.






[jira] [Created] (HDFS-16594) Many RpcCalls are blocked for a while while Decommission works

2022-05-25 Thread JiangHua Zhu (Jira)
JiangHua Zhu created HDFS-16594:
---

 Summary: Many RpcCalls are blocked for a while while Decommission 
works
 Key: HDFS-16594
 URL: https://issues.apache.org/jira/browse/HDFS-16594
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: namenode
Affects Versions: 2.9.2
Reporter: JiangHua Zhu
 Attachments: image-2022-05-26-02-05-38-878.png

When some DataNodes need to go offline, decommission starts working and 
periodically checks the number of blocks remaining to be processed. By default, 
after checking more than 500,000 
(${dfs.namenode.decommission.blocks.per.interval}) blocks, the 
DatanodeAdminDefaultMonitor thread sleeps for a while before continuing.
If the number of blocks to be checked is very large, for example when the 
number of replicas managed by a DataNode reaches 900,000 or even 1,000,000, the 
DatanodeAdminDefaultMonitor keeps holding the FSNamesystemLock for the whole 
check, which blocks a lot of RpcCalls. Here are some logs:
 !image-2022-05-26-02-05-38-878.png! 

It can be seen that in the last inspection there were more than 1,000,000 
blocks.
When the check is over, the FSNamesystemLock is released and the RpcCalls start working:
'
2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 36 on 
8021:Server@494] - Slow RPC : sendHeartbeat took 3488 milliseconds to process 
from client Call#5571549 Retry#0 
org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
10.196.145.92:35727
2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 135 on 
8021:Server@494] - Slow RPC : sendHeartbeat took 3472 milliseconds to process 
from client Call#36795561 Retry#0 
org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
10.196.99.152:37793
2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 108 on 
8021:Server@494] - Slow RPC : sendHeartbeat took 3445 milliseconds to process 
from client Call#5497586 Retry#0 
org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
10.196.146.56:23475
'
'
2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 33 on 
8021:Server@494] - Slow RPC : sendHeartbeat took 3435 milliseconds to process 
from client Call#6043903 Retry#0 
org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
10.196.82.106:34746
2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 139 on 
8021:Server@494] - Slow RPC : sendHeartbeat took 3436 milliseconds to process 
from client Call#274471 Retry#0 
org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
10.196.149.175:46419
2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 77 on 
8021:Server@494] - Slow RPC : sendHeartbeat took 3436 milliseconds to process 
from client Call#73375524 Retry#0 
org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
10.196.81.46:34241
'
Since these RpcCalls wait for a long time, their RpcQueueTime + RpcProcessingTime 
is much longer than usual. A very large number of RpcCalls were affected during 
this period.
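
To illustrate the pattern described above, here is a much-simplified sketch 
(illustrative only, not the real DatanodeAdminDefaultMonitor code): the whole 
per-interval scan runs under the FSNamesystem write lock, so heartbeats and 
other RpcCalls queue up until the block budget is exhausted.

{code:java}
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustrative sketch only; this is NOT the real DatanodeAdminDefaultMonitor code.
class DecommissionScanSketch {
  // Stand-in for FSNamesystemLock.
  private final ReentrantReadWriteLock fsLock = new ReentrantReadWriteLock();
  // dfs.namenode.decommission.blocks.per.interval (default 500,000).
  private final int blocksPerInterval = 500_000;
  // e.g. a DataNode with roughly 1,000,000 replicas still to be checked.
  private int pendingBlocks = 1_000_000;

  void runOneMonitorCycle() {
    fsLock.writeLock().lock();     // heartbeats and other RpcCalls queue up here
    try {
      int checked = 0;
      while (pendingBlocks > 0 && checked < blocksPerInterval) {
        pendingBlocks--;           // "check" one block
        checked++;
      }
    } finally {
      fsLock.writeLock().unlock(); // only now can the queued RpcCalls proceed
    }
    // The monitor then sleeps until its next scheduled run.
  }
}
{code}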






[jira] [Assigned] (HDFS-16594) Many RpcCalls are blocked for a while while Decommission works

2022-05-25 Thread JiangHua Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JiangHua Zhu reassigned HDFS-16594:
---

Assignee: JiangHua Zhu

> Many RpcCalls are blocked for a while while Decommission works
> --
>
> Key: HDFS-16594
> URL: https://issues.apache.org/jira/browse/HDFS-16594
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 2.9.2
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Major
> Attachments: image-2022-05-26-02-05-38-878.png
>
>
> When some DataNodes need to go offline, decommission starts working and 
> periodically checks the number of blocks remaining to be processed. By default, 
> after checking more than 500,000 
> (${dfs.namenode.decommission.blocks.per.interval}) blocks, the 
> DatanodeAdminDefaultMonitor thread sleeps for a while before continuing.
> If the number of blocks to be checked is very large, for example when the 
> number of replicas managed by a DataNode reaches 900,000 or even 1,000,000, the 
> DatanodeAdminDefaultMonitor keeps holding the FSNamesystemLock for the whole 
> check, which blocks a lot of RpcCalls. Here are some logs:
>  !image-2022-05-26-02-05-38-878.png! 
> It can be seen that in the last inspection there were more than 1,000,000 
> blocks.
> When the check is over, the FSNamesystemLock is released and the RpcCalls 
> start working:
> '
> 2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 36 on 
> 8021:Server@494] - Slow RPC : sendHeartbeat took 3488 milliseconds to process 
> from client Call#5571549 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
> 10.196.145.92:35727
> 2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 135 on 
> 8021:Server@494] - Slow RPC : sendHeartbeat took 3472 milliseconds to process 
> from client Call#36795561 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
> 10.196.99.152:37793
> 2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 108 on 
> 8021:Server@494] - Slow RPC : sendHeartbeat took 3445 milliseconds to process 
> from client Call#5497586 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
> 10.196.146.56:23475
> '
> '
> 2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 33 on 
> 8021:Server@494] - Slow RPC : sendHeartbeat took 3435 milliseconds to process 
> from client Call#6043903 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
> 10.196.82.106:34746
> 2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 139 on 
> 8021:Server@494] - Slow RPC : sendHeartbeat took 3436 milliseconds to process 
> from client Call#274471 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
> 10.196.149.175:46419
> 2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 77 on 
> 8021:Server@494] - Slow RPC : sendHeartbeat took 3436 milliseconds to process 
> from client Call#73375524 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
> 10.196.81.46:34241
> '
> Since these RpcCalls wait for a long time, their RpcQueueTime + 
> RpcProcessingTime is much longer than usual. A very large number of RpcCalls 
> were affected during this period.






[jira] [Work logged] (HDFS-16592) Fix typo for BalancingPolicy

2022-05-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16592?focusedWorklogId=774756&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-774756
 ]

ASF GitHub Bot logged work on HDFS-16592:
-

Author: ASF GitHub Bot
Created on: 25/May/22 18:26
Start Date: 25/May/22 18:26
Worklog Time Spent: 10m 
  Work Description: jianghuazhu commented on code in PR #4351:
URL: https://github.com/apache/hadoop/pull/4351#discussion_r881983491


##
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/balancer/BalancingPolicy.java:
##
@@ -28,7 +28,7 @@
  * Balancing policy.
  * Since a datanode may contain multiple block pools,
  * {@link Pool} implies {@link Node}
- * but NOT the other way around
+ * but not the other way around

Review Comment:
   OK
   I think I learned something new.
   Thanks @asymc for the guidance.
   I will update the status of jira later.





Issue Time Tracking
---

Worklog Id: (was: 774756)
Time Spent: 50m  (was: 40m)

> Fix typo for BalancingPolicy
> 
>
> Key: HDFS-16592
> URL: https://issues.apache.org/jira/browse/HDFS-16592
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer & mover, documentation, namenode
>Affects Versions: 3.4.0
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Minor
>  Labels: pull-request-available
> Attachments: image-2022-05-24-11-29-14-019.png
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
>  !image-2022-05-24-11-29-14-019.png! 
> 'NOT' should be changed to lowercase rather than uppercase.






[jira] [Work logged] (HDFS-16592) Fix typo for BalancingPolicy

2022-05-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16592?focusedWorklogId=774757&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-774757
 ]

ASF GitHub Bot logged work on HDFS-16592:
-

Author: ASF GitHub Bot
Created on: 25/May/22 18:27
Start Date: 25/May/22 18:27
Worklog Time Spent: 10m 
  Work Description: jianghuazhu commented on code in PR #4351:
URL: https://github.com/apache/hadoop/pull/4351#discussion_r881984629


##
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/balancer/BalancingPolicy.java:
##
@@ -28,7 +28,7 @@
  * Balancing policy.
  * Since a datanode may contain multiple block pools,
  * {@link Pool} implies {@link Node}
- * but NOT the other way around
+ * but not the other way around

Review Comment:
   I think I learned something new.
   Thanks @ayushtkn  for the guidance.
   I will update the status of jira later.





Issue Time Tracking
---

Worklog Id: (was: 774757)
Time Spent: 1h  (was: 50m)

> Fix typo for BalancingPolicy
> 
>
> Key: HDFS-16592
> URL: https://issues.apache.org/jira/browse/HDFS-16592
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer & mover, documentation, namenode
>Affects Versions: 3.4.0
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Minor
>  Labels: pull-request-available
> Attachments: image-2022-05-24-11-29-14-019.png
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
>  !image-2022-05-24-11-29-14-019.png! 
> 'NOT' should be changed to lowercase rather than uppercase.






[jira] [Commented] (HDFS-16594) Many RpcCalls are blocked for a while while Decommission works

2022-05-25 Thread JiangHua Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17542190#comment-17542190
 ] 

JiangHua Zhu commented on HDFS-16594:
-

In my opinion, processing RpcCalls on time has a relatively high priority, so 
the time that the DatanodeAdminDefaultMonitor holds the FSNamesystemLock cannot 
be allowed to get too long.
Here are 2 ways to optimize this:
1. Lower the default value of ${dfs.namenode.decommission.blocks.per.interval}, 
for example to 1 or 2.
2. Add time-slice processing to the DatanodeAdminDefaultMonitor. For example, 
once the DatanodeAdminDefaultMonitor has worked for more than 500ms, force it 
to sleep for 10ms and then resume (see the sketch below).

We can choose one of these 2 methods.
[~weichiu]  [~ayushtkn], do you guys have some good suggestions?
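
A rough sketch of what option 2 could look like (purely illustrative; the 
500ms/10ms values come from the suggestion above, the lock is a stand-in for 
FSNamesystemLock, and this is not an existing patch):

{code:java}
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustrative only: time-sliced scanning as suggested in option 2.
class TimeSlicedScanSketch {
  private final ReentrantReadWriteLock fsLock = new ReentrantReadWriteLock();
  private static final long SLICE_MILLIS = 500;  // work at most ~500ms per slice
  private static final long PAUSE_MILLIS = 10;   // then yield the lock for ~10ms

  void scan(Iterable<Object> blocks) throws InterruptedException {
    long sliceStart = System.currentTimeMillis();
    fsLock.writeLock().lock();
    boolean locked = true;
    try {
      for (Object block : blocks) {
        checkBlock(block);
        if (System.currentTimeMillis() - sliceStart > SLICE_MILLIS) {
          // Drop the lock so queued RpcCalls (e.g. sendHeartbeat) can run.
          fsLock.writeLock().unlock();
          locked = false;
          TimeUnit.MILLISECONDS.sleep(PAUSE_MILLIS);
          fsLock.writeLock().lock();
          locked = true;
          sliceStart = System.currentTimeMillis();
        }
      }
    } finally {
      if (locked) {
        fsLock.writeLock().unlock();
      }
    }
  }

  private void checkBlock(Object block) {
    // Placeholder for the real per-block check.
  }
}
{code}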


> Many RpcCalls are blocked for a while while Decommission works
> --
>
> Key: HDFS-16594
> URL: https://issues.apache.org/jira/browse/HDFS-16594
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 2.9.2
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Major
> Attachments: image-2022-05-26-02-05-38-878.png
>
>
> When some DataNodes need to go offline, decommission starts working and 
> periodically checks the number of blocks remaining to be processed. By default, 
> after checking more than 500,000 
> (${dfs.namenode.decommission.blocks.per.interval}) blocks, the 
> DatanodeAdminDefaultMonitor thread sleeps for a while before continuing.
> If the number of blocks to be checked is very large, for example when the 
> number of replicas managed by a DataNode reaches 900,000 or even 1,000,000, the 
> DatanodeAdminDefaultMonitor keeps holding the FSNamesystemLock for the whole 
> check, which blocks a lot of RpcCalls. Here are some logs:
>  !image-2022-05-26-02-05-38-878.png! 
> It can be seen that in the last inspection there were more than 1,000,000 
> blocks.
> When the check is over, the FSNamesystemLock is released and the RpcCalls 
> start working:
> '
> 2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 36 on 
> 8021:Server@494] - Slow RPC : sendHeartbeat took 3488 milliseconds to process 
> from client Call#5571549 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
> 10.196.145.92:35727
> 2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 135 on 
> 8021:Server@494] - Slow RPC : sendHeartbeat took 3472 milliseconds to process 
> from client Call#36795561 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
> 10.196.99.152:37793
> 2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 108 on 
> 8021:Server@494] - Slow RPC : sendHeartbeat took 3445 milliseconds to process 
> from client Call#5497586 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
> 10.196.146.56:23475
> '
> '
> 2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 33 on 
> 8021:Server@494] - Slow RPC : sendHeartbeat took 3435 milliseconds to process 
> from client Call#6043903 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
> 10.196.82.106:34746
> 2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 139 on 
> 8021:Server@494] - Slow RPC : sendHeartbeat took 3436 milliseconds to process 
> from client Call#274471 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
> 10.196.149.175:46419
> 2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 77 on 
> 8021:Server@494] - Slow RPC : sendHeartbeat took 3436 milliseconds to process 
> from client Call#73375524 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
> 10.196.81.46:34241
> '
> Since these RpcCalls wait for a long time, their RpcQueueTime + 
> RpcProcessingTime is much longer than usual. A very large number of RpcCalls 
> were affected during this period.






[jira] [Commented] (HDFS-16594) Many RpcCalls are blocked for a while while Decommission works

2022-05-25 Thread Stephen O'Donnell (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17542230#comment-17542230
 ] 

Stephen O'Donnell commented on HDFS-16594:
--

Some people have reported good results with the DatanodeAdminBackoffMonitor. 
What about giving it a try and seeing if the locking is better?
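
For reference, the alternative monitor is selected through the 
dfs.namenode.decommission.monitor.class setting (available in recent 3.x 
releases). A minimal sketch; in practice this would normally be set in 
hdfs-site.xml rather than in code:

{code:java}
import org.apache.hadoop.conf.Configuration;

public class BackoffMonitorConfigExample {
  public static void main(String[] args) {
    // Illustration only: select the backoff-based decommission monitor.
    Configuration conf = new Configuration();
    conf.set("dfs.namenode.decommission.monitor.class",
        "org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminBackoffMonitor");
  }
}
{code}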

> Many RpcCalls are blocked for a while while Decommission works
> --
>
> Key: HDFS-16594
> URL: https://issues.apache.org/jira/browse/HDFS-16594
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 2.9.2
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Major
> Attachments: image-2022-05-26-02-05-38-878.png
>
>
> When some DataNodes need to go offline, decommission starts working and 
> periodically checks the number of blocks remaining to be processed. By default, 
> after checking more than 500,000 
> (${dfs.namenode.decommission.blocks.per.interval}) blocks, the 
> DatanodeAdminDefaultMonitor thread sleeps for a while before continuing.
> If the number of blocks to be checked is very large, for example when the 
> number of replicas managed by a DataNode reaches 900,000 or even 1,000,000, the 
> DatanodeAdminDefaultMonitor keeps holding the FSNamesystemLock for the whole 
> check, which blocks a lot of RpcCalls. Here are some logs:
>  !image-2022-05-26-02-05-38-878.png! 
> It can be seen that in the last inspection there were more than 1,000,000 
> blocks.
> When the check is over, the FSNamesystemLock is released and the RpcCalls 
> start working:
> '
> 2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 36 on 
> 8021:Server@494] - Slow RPC : sendHeartbeat took 3488 milliseconds to process 
> from client Call#5571549 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
> 10.196.145.92:35727
> 2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 135 on 
> 8021:Server@494] - Slow RPC : sendHeartbeat took 3472 milliseconds to process 
> from client Call#36795561 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
> 10.196.99.152:37793
> 2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 108 on 
> 8021:Server@494] - Slow RPC : sendHeartbeat took 3445 milliseconds to process 
> from client Call#5497586 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
> 10.196.146.56:23475
> '
> '
> 2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 33 on 
> 8021:Server@494] - Slow RPC : sendHeartbeat took 3435 milliseconds to process 
> from client Call#6043903 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
> 10.196.82.106:34746
> 2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 139 on 
> 8021:Server@494] - Slow RPC : sendHeartbeat took 3436 milliseconds to process 
> from client Call#274471 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
> 10.196.149.175:46419
> 2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 77 on 
> 8021:Server@494] - Slow RPC : sendHeartbeat took 3436 milliseconds to process 
> from client Call#73375524 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
> 10.196.81.46:34241
> '
> Since these RpcCalls wait for a long time, their RpcQueueTime + 
> RpcProcessingTime is much longer than usual. A very large number of RpcCalls 
> were affected during this period.






[jira] [Work logged] (HDFS-16587) Allow configuring Handler number for the JournalNodeRpcServer

2022-05-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16587?focusedWorklogId=774802&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-774802
 ]

ASF GitHub Bot logged work on HDFS-16587:
-

Author: ASF GitHub Bot
Created on: 25/May/22 20:36
Start Date: 25/May/22 20:36
Worklog Time Spent: 10m 
  Work Description: ayushtkn commented on code in PR #4339:
URL: https://github.com/apache/hadoop/pull/4339#discussion_r882092952


##
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/JournalNodeRpcServer.java:
##
@@ -90,13 +91,15 @@ public class JournalNodeRpcServer implements 
QJournalProtocol,
 new QJournalProtocolServerSideTranslatorPB(this);
 BlockingService service = QJournalProtocolService
 .newReflectiveBlockingService(translator);
+int handler = conf.getInt(DFS_JOURNALNODE_HANDLER_COUNT_KEY,
+DFS_JOURNALNODE_HANDLER_COUNT_DEFAULT);

Review Comment:
   Add some validation here for a valid value range, e.g. it shouldn't be 
negative.
   If it is an invalid entry, add a warning log and use the default value.
   After that, add an INFO log reporting the handler count...
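
   A rough sketch of the suggested validation (the two constant names come from 
the patch above; the LOG field, method shape, and message wording are 
assumptions for illustration only):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical helper showing the suggested validation; not part of the patch.
final class HandlerCountUtil {
  private static final Logger LOG =
      LoggerFactory.getLogger(HandlerCountUtil.class);

  static int getHandlerCount(Configuration conf, String key, int defaultValue) {
    int handlerCount = conf.getInt(key, defaultValue);
    if (handlerCount <= 0) {
      // Invalid (zero or negative) value: warn and fall back to the default.
      LOG.warn("Invalid value {} configured for {}, using default {}",
          handlerCount, key, defaultValue);
      handlerCount = defaultValue;
    }
    LOG.info("JournalNodeRpcServer handler count is {}", handlerCount);
    return handlerCount;
  }

  private HandlerCountUtil() {
  }
}
{code}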





Issue Time Tracking
---

Worklog Id: (was: 774802)
Time Spent: 40m  (was: 0.5h)

> Allow configuring Handler number for the JournalNodeRpcServer
> -
>
> Key: HDFS-16587
> URL: https://issues.apache.org/jira/browse/HDFS-16587
> Project: Hadoop HDFS
>  Issue Type: Wish
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> We can allow configuring the handler number for the JournalNodeRpcServer.






[jira] [Work logged] (HDFS-16586) Purge FsDatasetAsyncDiskService threadgroup; it causes BPServiceActor$CommandProcessingThread IllegalThreadStateException 'fatal exception and exit'

2022-05-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16586?focusedWorklogId=774847&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-774847
 ]

ASF GitHub Bot logged work on HDFS-16586:
-

Author: ASF GitHub Bot
Created on: 25/May/22 23:35
Start Date: 25/May/22 23:35
Worklog Time Spent: 10m 
  Work Description: saintstack commented on PR #4348:
URL: https://github.com/apache/hadoop/pull/4348#issuecomment-1137960069

   Each run has a different test fail. This PR does not change functionality. 
Merging.




Issue Time Tracking
---

Worklog Id: (was: 774847)
Time Spent: 2h 50m  (was: 2h 40m)

> Purge FsDatasetAsyncDiskService threadgroup; it causes 
> BPServiceActor$CommandProcessingThread IllegalThreadStateException 'fatal 
> exception and exit' 
> -
>
> Key: HDFS-16586
> URL: https://issues.apache.org/jira/browse/HDFS-16586
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 3.3.0, 3.2.3
>Reporter: Michael Stack
>Assignee: Michael Stack
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> The below failed block finalize is causing a downstreamer's test to fail when 
> it uses hadoop 3.2.3 or 3.3.0+:
> {code:java}
> 2022-05-19T18:21:08,243 INFO  [Command processor] 
> impl.FsDatasetAsyncDiskService(234): Scheduling blk_1073741840_1016 replica 
> FinalizedReplica, blk_1073741840_1016, FINALIZED
>   getNumBytes()     = 52
>   getBytesOnDisk()  = 52
>   getVisibleLength()= 52
>   getVolume()       = 
> /Users/stack/checkouts/hbase.apache.git/hbase-server/target/test-data/d544dd1e-b42d-8fae-aa9a-99e3eb52f61c/cluster_e8660d1b-733a-b023-2e91-dc3f951cf189/dfs/data/data2
>   getBlockURI()     = 
> file:/Users/stack/checkouts/hbase.apache.git/hbase-server/target/test-data/d544dd1e-b42d-8fae-aa9a-99e3eb52f61c/cluster_e8660d1b-733a-b023-2e91-dc3f951cf189/dfs/data/data2/current/BP-62743752-127.0.0.1-1653009535881/current/finalized/subdir0/subdir0/blk_1073741840
>  for deletion
> 2022-05-19T18:21:08,243 DEBUG [IPC Server handler 0 on default port 54774] 
> metrics.TopMetrics(134): a metric is reported: cmd: delete user: stack.hfs.0 
> (auth:SIMPLE)
> 2022-05-19T18:21:08,243 DEBUG [IPC Server handler 0 on default port 54774] 
> top.TopAuditLogger(78): --- logged event for top service: 
> allowed=true ugi=stack.hfs.0 (auth:SIMPLE) ip=/127.0.0.1 cmd=delete  
> src=/user/stack/test-data/b8167d53-bcd7-c682-a767-55faaf7f3e96/data/default/t1/4499521075f51d5138fe4f1916daf92d/.tmp
>   dst=null  perm=null
> 2022-05-19T18:21:08,243 DEBUG [PacketResponder: 
> BP-62743752-127.0.0.1-1653009535881:blk_1073741830_1006, 
> type=LAST_IN_PIPELINE] datanode.BlockReceiver$PacketResponder(1645): 
> PacketResponder: BP-62743752-127.0.0.1-1653009535881:blk_1073741830_1006, 
> type=LAST_IN_PIPELINE, replyAck=seqno: 901 reply: SUCCESS 
> downstreamAckTimeNanos: 0 flag: 0
> 2022-05-19T18:21:08,243 DEBUG [PacketResponder: 
> BP-62743752-127.0.0.1-1653009535881:blk_1073741830_1006, 
> type=LAST_IN_PIPELINE] datanode.BlockReceiver$PacketResponder(1327): 
> PacketResponder: BP-62743752-127.0.0.1-1653009535881:blk_1073741830_1006, 
> type=LAST_IN_PIPELINE: seqno=-2 waiting for local datanode to finish write.
> 2022-05-19T18:21:08,243 ERROR [Command processor] 
> datanode.BPServiceActor$CommandProcessingThread(1276): Command processor 
> encountered fatal exception and exit.
> java.lang.IllegalThreadStateException: null
>   at java.lang.ThreadGroup.addUnstarted(ThreadGroup.java:865) ~[?:?]
>   at java.lang.Thread.(Thread.java:430) ~[?:?]
>   at java.lang.Thread.(Thread.java:704) ~[?:?]
>   at java.lang.Thread.(Thread.java:525) ~[?:?]
>   at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService$1.newThread(FsDatasetAsyncDiskService.java:113)
>  ~[hadoop-hdfs-3.2.3.jar:?]
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.(ThreadPoolExecutor.java:623)
>  ~[?:?]
>   at 
> java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:912)
>  ~[?:?]
>   at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1343) 
> ~[?:?]
>   at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.execute(FsDatasetAsyncDiskService.java:189)
>  ~[hadoop-hdfs-3.2.3.jar:?]
>   at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.deleteAsync(FsDatasetAsyncDiskService.java:238)
>  ~[hadoop-hdfs-3.2.3.jar:?]
>   at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2184)
>  ~[hadoop-hdfs-3.2.3.jar:

[jira] [Work logged] (HDFS-16586) Purge FsDatasetAsyncDiskService threadgroup; it causes BPServiceActor$CommandProcessingThread IllegalThreadStateException 'fatal exception and exit'

2022-05-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16586?focusedWorklogId=774848&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-774848
 ]

ASF GitHub Bot logged work on HDFS-16586:
-

Author: ASF GitHub Bot
Created on: 25/May/22 23:35
Start Date: 25/May/22 23:35
Worklog Time Spent: 10m 
  Work Description: saintstack merged PR #4348:
URL: https://github.com/apache/hadoop/pull/4348




Issue Time Tracking
---

Worklog Id: (was: 774848)
Time Spent: 3h  (was: 2h 50m)

> Purge FsDatasetAsyncDiskService threadgroup; it causes 
> BPServiceActor$CommandProcessingThread IllegalThreadStateException 'fatal 
> exception and exit' 
> -
>
> Key: HDFS-16586
> URL: https://issues.apache.org/jira/browse/HDFS-16586
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 3.3.0, 3.2.3
>Reporter: Michael Stack
>Assignee: Michael Stack
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> The below failed block finalize is causing a downstreamer's test to fail when 
> it uses hadoop 3.2.3 or 3.3.0+:
> {code:java}
> 2022-05-19T18:21:08,243 INFO  [Command processor] 
> impl.FsDatasetAsyncDiskService(234): Scheduling blk_1073741840_1016 replica 
> FinalizedReplica, blk_1073741840_1016, FINALIZED
>   getNumBytes()     = 52
>   getBytesOnDisk()  = 52
>   getVisibleLength()= 52
>   getVolume()       = 
> /Users/stack/checkouts/hbase.apache.git/hbase-server/target/test-data/d544dd1e-b42d-8fae-aa9a-99e3eb52f61c/cluster_e8660d1b-733a-b023-2e91-dc3f951cf189/dfs/data/data2
>   getBlockURI()     = 
> file:/Users/stack/checkouts/hbase.apache.git/hbase-server/target/test-data/d544dd1e-b42d-8fae-aa9a-99e3eb52f61c/cluster_e8660d1b-733a-b023-2e91-dc3f951cf189/dfs/data/data2/current/BP-62743752-127.0.0.1-1653009535881/current/finalized/subdir0/subdir0/blk_1073741840
>  for deletion
> 2022-05-19T18:21:08,243 DEBUG [IPC Server handler 0 on default port 54774] 
> metrics.TopMetrics(134): a metric is reported: cmd: delete user: stack.hfs.0 
> (auth:SIMPLE)
> 2022-05-19T18:21:08,243 DEBUG [IPC Server handler 0 on default port 54774] 
> top.TopAuditLogger(78): --- logged event for top service: 
> allowed=true ugi=stack.hfs.0 (auth:SIMPLE) ip=/127.0.0.1 cmd=delete  
> src=/user/stack/test-data/b8167d53-bcd7-c682-a767-55faaf7f3e96/data/default/t1/4499521075f51d5138fe4f1916daf92d/.tmp
>   dst=null  perm=null
> 2022-05-19T18:21:08,243 DEBUG [PacketResponder: 
> BP-62743752-127.0.0.1-1653009535881:blk_1073741830_1006, 
> type=LAST_IN_PIPELINE] datanode.BlockReceiver$PacketResponder(1645): 
> PacketResponder: BP-62743752-127.0.0.1-1653009535881:blk_1073741830_1006, 
> type=LAST_IN_PIPELINE, replyAck=seqno: 901 reply: SUCCESS 
> downstreamAckTimeNanos: 0 flag: 0
> 2022-05-19T18:21:08,243 DEBUG [PacketResponder: 
> BP-62743752-127.0.0.1-1653009535881:blk_1073741830_1006, 
> type=LAST_IN_PIPELINE] datanode.BlockReceiver$PacketResponder(1327): 
> PacketResponder: BP-62743752-127.0.0.1-1653009535881:blk_1073741830_1006, 
> type=LAST_IN_PIPELINE: seqno=-2 waiting for local datanode to finish write.
> 2022-05-19T18:21:08,243 ERROR [Command processor] 
> datanode.BPServiceActor$CommandProcessingThread(1276): Command processor 
> encountered fatal exception and exit.
> java.lang.IllegalThreadStateException: null
>   at java.lang.ThreadGroup.addUnstarted(ThreadGroup.java:865) ~[?:?]
>   at java.lang.Thread.(Thread.java:430) ~[?:?]
>   at java.lang.Thread.(Thread.java:704) ~[?:?]
>   at java.lang.Thread.(Thread.java:525) ~[?:?]
>   at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService$1.newThread(FsDatasetAsyncDiskService.java:113)
>  ~[hadoop-hdfs-3.2.3.jar:?]
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.(ThreadPoolExecutor.java:623)
>  ~[?:?]
>   at 
> java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:912)
>  ~[?:?]
>   at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1343) 
> ~[?:?]
>   at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.execute(FsDatasetAsyncDiskService.java:189)
>  ~[hadoop-hdfs-3.2.3.jar:?]
>   at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.deleteAsync(FsDatasetAsyncDiskService.java:238)
>  ~[hadoop-hdfs-3.2.3.jar:?]
>   at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2184)
>  ~[hadoop-hdfs-3.2.3.jar:?]
>   at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2103)
>  ~[hadoo

[jira] [Work logged] (HDFS-16586) Purge FsDatasetAsyncDiskService threadgroup; it causes BPServiceActor$CommandProcessingThread IllegalThreadStateException 'fatal exception and exit'

2022-05-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16586?focusedWorklogId=774854&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-774854
 ]

ASF GitHub Bot logged work on HDFS-16586:
-

Author: ASF GitHub Bot
Created on: 26/May/22 00:02
Start Date: 26/May/22 00:02
Worklog Time Spent: 10m 
  Work Description: saintstack merged PR #4347:
URL: https://github.com/apache/hadoop/pull/4347




Issue Time Tracking
---

Worklog Id: (was: 774854)
Time Spent: 3h 20m  (was: 3h 10m)

> Purge FsDatasetAsyncDiskService threadgroup; it causes 
> BPServiceActor$CommandProcessingThread IllegalThreadStateException 'fatal 
> exception and exit' 
> -
>
> Key: HDFS-16586
> URL: https://issues.apache.org/jira/browse/HDFS-16586
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 3.3.0, 3.2.3
>Reporter: Michael Stack
>Assignee: Michael Stack
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> The below failed block finalize is causing a downstreamer's test to fail when 
> it uses hadoop 3.2.3 or 3.3.0+:
> {code:java}
> 2022-05-19T18:21:08,243 INFO  [Command processor] 
> impl.FsDatasetAsyncDiskService(234): Scheduling blk_1073741840_1016 replica 
> FinalizedReplica, blk_1073741840_1016, FINALIZED
>   getNumBytes()     = 52
>   getBytesOnDisk()  = 52
>   getVisibleLength()= 52
>   getVolume()       = 
> /Users/stack/checkouts/hbase.apache.git/hbase-server/target/test-data/d544dd1e-b42d-8fae-aa9a-99e3eb52f61c/cluster_e8660d1b-733a-b023-2e91-dc3f951cf189/dfs/data/data2
>   getBlockURI()     = 
> file:/Users/stack/checkouts/hbase.apache.git/hbase-server/target/test-data/d544dd1e-b42d-8fae-aa9a-99e3eb52f61c/cluster_e8660d1b-733a-b023-2e91-dc3f951cf189/dfs/data/data2/current/BP-62743752-127.0.0.1-1653009535881/current/finalized/subdir0/subdir0/blk_1073741840
>  for deletion
> 2022-05-19T18:21:08,243 DEBUG [IPC Server handler 0 on default port 54774] 
> metrics.TopMetrics(134): a metric is reported: cmd: delete user: stack.hfs.0 
> (auth:SIMPLE)
> 2022-05-19T18:21:08,243 DEBUG [IPC Server handler 0 on default port 54774] 
> top.TopAuditLogger(78): --- logged event for top service: 
> allowed=true ugi=stack.hfs.0 (auth:SIMPLE) ip=/127.0.0.1 cmd=delete  
> src=/user/stack/test-data/b8167d53-bcd7-c682-a767-55faaf7f3e96/data/default/t1/4499521075f51d5138fe4f1916daf92d/.tmp
>   dst=null  perm=null
> 2022-05-19T18:21:08,243 DEBUG [PacketResponder: 
> BP-62743752-127.0.0.1-1653009535881:blk_1073741830_1006, 
> type=LAST_IN_PIPELINE] datanode.BlockReceiver$PacketResponder(1645): 
> PacketResponder: BP-62743752-127.0.0.1-1653009535881:blk_1073741830_1006, 
> type=LAST_IN_PIPELINE, replyAck=seqno: 901 reply: SUCCESS 
> downstreamAckTimeNanos: 0 flag: 0
> 2022-05-19T18:21:08,243 DEBUG [PacketResponder: 
> BP-62743752-127.0.0.1-1653009535881:blk_1073741830_1006, 
> type=LAST_IN_PIPELINE] datanode.BlockReceiver$PacketResponder(1327): 
> PacketResponder: BP-62743752-127.0.0.1-1653009535881:blk_1073741830_1006, 
> type=LAST_IN_PIPELINE: seqno=-2 waiting for local datanode to finish write.
> 2022-05-19T18:21:08,243 ERROR [Command processor] 
> datanode.BPServiceActor$CommandProcessingThread(1276): Command processor 
> encountered fatal exception and exit.
> java.lang.IllegalThreadStateException: null
>   at java.lang.ThreadGroup.addUnstarted(ThreadGroup.java:865) ~[?:?]
>   at java.lang.Thread.(Thread.java:430) ~[?:?]
>   at java.lang.Thread.(Thread.java:704) ~[?:?]
>   at java.lang.Thread.(Thread.java:525) ~[?:?]
>   at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService$1.newThread(FsDatasetAsyncDiskService.java:113)
>  ~[hadoop-hdfs-3.2.3.jar:?]
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.<init>(ThreadPoolExecutor.java:623)
>  ~[?:?]
>   at 
> java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:912)
>  ~[?:?]
>   at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1343) 
> ~[?:?]
>   at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.execute(FsDatasetAsyncDiskService.java:189)
>  ~[hadoop-hdfs-3.2.3.jar:?]
>   at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.deleteAsync(FsDatasetAsyncDiskService.java:238)
>  ~[hadoop-hdfs-3.2.3.jar:?]
>   at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2184)
>  ~[hadoop-hdfs-3.2.3.jar:?]
>   at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2103)
> 
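
The failure mode in the trace above can be reproduced outside Hadoop with a 
small sketch: a thread factory that binds new threads to a ThreadGroup starts 
throwing IllegalThreadStateException from ThreadGroup.addUnstarted() once that 
group has been destroyed. The class below is illustrative only (it is not the 
FsDatasetAsyncDiskService code) and assumes a JDK where ThreadGroup.destroy() 
is still operative; the method has been degraded in recent JDK releases.
{code:java}
// Minimal sketch, not the Hadoop code: constructing a Thread in a destroyed
// ThreadGroup throws IllegalThreadStateException from ThreadGroup.addUnstarted(),
// the same call that appears at the top of the stack trace above.
public class DestroyedThreadGroupSketch {
  public static void main(String[] args) throws InterruptedException {
    ThreadGroup group = new ThreadGroup("async-disk-service-sketch");

    // Run one thread to completion so the group becomes empty.
    Thread worker = new Thread(group, () -> { });
    worker.start();
    worker.join();

    group.destroy(); // the group is now destroyed

    // This constructor call fails with java.lang.IllegalThreadStateException.
    new Thread(group, () -> { });
  }
}
{code}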

[jira] [Work logged] (HDFS-16586) Purge FsDatasetAsyncDiskService threadgroup; it causes BPServiceActor$CommandProcessingThread IllegalThreadStateException 'fatal exception and exit'

2022-05-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16586?focusedWorklogId=774853&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-774853
 ]

ASF GitHub Bot logged work on HDFS-16586:
-

Author: ASF GitHub Bot
Created on: 26/May/22 00:02
Start Date: 26/May/22 00:02
Worklog Time Spent: 10m 
  Work Description: saintstack commented on PR #4347:
URL: https://github.com/apache/hadoop/pull/4347#issuecomment-1137995656

   TestRollingUpgrade failed on each of the three test runs. Running it in a 
loop locally, it does not fail with the patch in place. Looking at the test 
failures, the complaint is from the shutdown handler in QJM or a hang waiting 
on a JMX bean change, unrelated to this non-functional change inside the 
datanode.




Issue Time Tracking
---

Worklog Id: (was: 774853)
Time Spent: 3h 10m  (was: 3h)

> Purge FsDatasetAsyncDiskService threadgroup; it causes 
> BPServiceActor$CommandProcessingThread IllegalThreadStateException 'fatal 
> exception and exit' 
> -
>
> Key: HDFS-16586
> URL: https://issues.apache.org/jira/browse/HDFS-16586
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 3.3.0, 3.2.3
>Reporter: Michael Stack
>Assignee: Michael Stack
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> The below failed block finalize is causing a downstreamer's test to fail when 
> it uses hadoop 3.2.3 or 3.3.0+:
> {code:java}
> 2022-05-19T18:21:08,243 INFO  [Command processor] 
> impl.FsDatasetAsyncDiskService(234): Scheduling blk_1073741840_1016 replica 
> FinalizedReplica, blk_1073741840_1016, FINALIZED
>   getNumBytes()     = 52
>   getBytesOnDisk()  = 52
>   getVisibleLength()= 52
>   getVolume()       = 
> /Users/stack/checkouts/hbase.apache.git/hbase-server/target/test-data/d544dd1e-b42d-8fae-aa9a-99e3eb52f61c/cluster_e8660d1b-733a-b023-2e91-dc3f951cf189/dfs/data/data2
>   getBlockURI()     = 
> file:/Users/stack/checkouts/hbase.apache.git/hbase-server/target/test-data/d544dd1e-b42d-8fae-aa9a-99e3eb52f61c/cluster_e8660d1b-733a-b023-2e91-dc3f951cf189/dfs/data/data2/current/BP-62743752-127.0.0.1-1653009535881/current/finalized/subdir0/subdir0/blk_1073741840
>  for deletion
> 2022-05-19T18:21:08,243 DEBUG [IPC Server handler 0 on default port 54774] 
> metrics.TopMetrics(134): a metric is reported: cmd: delete user: stack.hfs.0 
> (auth:SIMPLE)
> 2022-05-19T18:21:08,243 DEBUG [IPC Server handler 0 on default port 54774] 
> top.TopAuditLogger(78): --- logged event for top service: 
> allowed=true ugi=stack.hfs.0 (auth:SIMPLE) ip=/127.0.0.1 cmd=delete  
> src=/user/stack/test-data/b8167d53-bcd7-c682-a767-55faaf7f3e96/data/default/t1/4499521075f51d5138fe4f1916daf92d/.tmp
>   dst=null  perm=null
> 2022-05-19T18:21:08,243 DEBUG [PacketResponder: 
> BP-62743752-127.0.0.1-1653009535881:blk_1073741830_1006, 
> type=LAST_IN_PIPELINE] datanode.BlockReceiver$PacketResponder(1645): 
> PacketResponder: BP-62743752-127.0.0.1-1653009535881:blk_1073741830_1006, 
> type=LAST_IN_PIPELINE, replyAck=seqno: 901 reply: SUCCESS 
> downstreamAckTimeNanos: 0 flag: 0
> 2022-05-19T18:21:08,243 DEBUG [PacketResponder: 
> BP-62743752-127.0.0.1-1653009535881:blk_1073741830_1006, 
> type=LAST_IN_PIPELINE] datanode.BlockReceiver$PacketResponder(1327): 
> PacketResponder: BP-62743752-127.0.0.1-1653009535881:blk_1073741830_1006, 
> type=LAST_IN_PIPELINE: seqno=-2 waiting for local datanode to finish write.
> 2022-05-19T18:21:08,243 ERROR [Command processor] 
> datanode.BPServiceActor$CommandProcessingThread(1276): Command processor 
> encountered fatal exception and exit.
> java.lang.IllegalThreadStateException: null
>   at java.lang.ThreadGroup.addUnstarted(ThreadGroup.java:865) ~[?:?]
>   at java.lang.Thread.<init>(Thread.java:430) ~[?:?]
>   at java.lang.Thread.<init>(Thread.java:704) ~[?:?]
>   at java.lang.Thread.<init>(Thread.java:525) ~[?:?]
>   at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService$1.newThread(FsDatasetAsyncDiskService.java:113)
>  ~[hadoop-hdfs-3.2.3.jar:?]
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.<init>(ThreadPoolExecutor.java:623)
>  ~[?:?]
>   at 
> java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:912)
>  ~[?:?]
>   at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1343) 
> ~[?:?]
>   at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.execute(FsDatasetAsyncDiskService.java:189)
>  ~[hadoop-hdfs-3.2.3.jar:?]
>   at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.dele

[jira] [Resolved] (HDFS-16586) Purge FsDatasetAsyncDiskService threadgroup; it causes BPServiceActor$CommandProcessingThread IllegalThreadStateException 'fatal exception and exit'

2022-05-25 Thread Michael Stack (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Stack resolved HDFS-16586.
--
Fix Version/s: 3.4.0
   3.2.4
   3.3.4
 Hadoop Flags: Reviewed
   Resolution: Fixed

Merged to branch-3, branch-3.3, and to branch-3.2. Thank you for the review 
[~hexiaoqiao] 
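
The general direction named in the issue title, dropping the ThreadGroup from 
the async disk service's thread factory, can be sketched as below. The names 
and pool sizes are assumptions for illustration; this is not the committed 
patch.
{code:java}
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative only: a ThreadFactory that names its threads but does not bind
// them to a ThreadGroup, so executor workers can always be created regardless
// of any group's lifecycle.
public class GroupFreeAsyncDiskExecutorSketch {
  public static ThreadPoolExecutor newExecutor() {
    ThreadFactory factory = new ThreadFactory() {
      private final AtomicInteger counter = new AtomicInteger();
      @Override
      public Thread newThread(Runnable r) {
        Thread t = new Thread(r,
            "asyncDiskService-sketch-" + counter.incrementAndGet());
        t.setDaemon(true);
        return t;
      }
    };
    return new ThreadPoolExecutor(1, 4, 60L, TimeUnit.SECONDS,
        new LinkedBlockingQueue<>(), factory);
  }
}
{code}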

> Purge FsDatasetAsyncDiskService threadgroup; it causes 
> BPServiceActor$CommandProcessingThread IllegalThreadStateException 'fatal 
> exception and exit' 
> -
>
> Key: HDFS-16586
> URL: https://issues.apache.org/jira/browse/HDFS-16586
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 3.3.0, 3.2.3
>Reporter: Michael Stack
>Assignee: Michael Stack
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.2.4, 3.3.4
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> The below failed block finalize is causing a downstreamer's test to fail when 
> it uses hadoop 3.2.3 or 3.3.0+:
> {code:java}
> 2022-05-19T18:21:08,243 INFO  [Command processor] 
> impl.FsDatasetAsyncDiskService(234): Scheduling blk_1073741840_1016 replica 
> FinalizedReplica, blk_1073741840_1016, FINALIZED
>   getNumBytes()     = 52
>   getBytesOnDisk()  = 52
>   getVisibleLength()= 52
>   getVolume()       = 
> /Users/stack/checkouts/hbase.apache.git/hbase-server/target/test-data/d544dd1e-b42d-8fae-aa9a-99e3eb52f61c/cluster_e8660d1b-733a-b023-2e91-dc3f951cf189/dfs/data/data2
>   getBlockURI()     = 
> file:/Users/stack/checkouts/hbase.apache.git/hbase-server/target/test-data/d544dd1e-b42d-8fae-aa9a-99e3eb52f61c/cluster_e8660d1b-733a-b023-2e91-dc3f951cf189/dfs/data/data2/current/BP-62743752-127.0.0.1-1653009535881/current/finalized/subdir0/subdir0/blk_1073741840
>  for deletion
> 2022-05-19T18:21:08,243 DEBUG [IPC Server handler 0 on default port 54774] 
> metrics.TopMetrics(134): a metric is reported: cmd: delete user: stack.hfs.0 
> (auth:SIMPLE)
> 2022-05-19T18:21:08,243 DEBUG [IPC Server handler 0 on default port 54774] 
> top.TopAuditLogger(78): --- logged event for top service: 
> allowed=true ugi=stack.hfs.0 (auth:SIMPLE) ip=/127.0.0.1 cmd=delete  
> src=/user/stack/test-data/b8167d53-bcd7-c682-a767-55faaf7f3e96/data/default/t1/4499521075f51d5138fe4f1916daf92d/.tmp
>   dst=null  perm=null
> 2022-05-19T18:21:08,243 DEBUG [PacketResponder: 
> BP-62743752-127.0.0.1-1653009535881:blk_1073741830_1006, 
> type=LAST_IN_PIPELINE] datanode.BlockReceiver$PacketResponder(1645): 
> PacketResponder: BP-62743752-127.0.0.1-1653009535881:blk_1073741830_1006, 
> type=LAST_IN_PIPELINE, replyAck=seqno: 901 reply: SUCCESS 
> downstreamAckTimeNanos: 0 flag: 0
> 2022-05-19T18:21:08,243 DEBUG [PacketResponder: 
> BP-62743752-127.0.0.1-1653009535881:blk_1073741830_1006, 
> type=LAST_IN_PIPELINE] datanode.BlockReceiver$PacketResponder(1327): 
> PacketResponder: BP-62743752-127.0.0.1-1653009535881:blk_1073741830_1006, 
> type=LAST_IN_PIPELINE: seqno=-2 waiting for local datanode to finish write.
> 2022-05-19T18:21:08,243 ERROR [Command processor] 
> datanode.BPServiceActor$CommandProcessingThread(1276): Command processor 
> encountered fatal exception and exit.
> java.lang.IllegalThreadStateException: null
>   at java.lang.ThreadGroup.addUnstarted(ThreadGroup.java:865) ~[?:?]
>   at java.lang.Thread.<init>(Thread.java:430) ~[?:?]
>   at java.lang.Thread.<init>(Thread.java:704) ~[?:?]
>   at java.lang.Thread.<init>(Thread.java:525) ~[?:?]
>   at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService$1.newThread(FsDatasetAsyncDiskService.java:113)
>  ~[hadoop-hdfs-3.2.3.jar:?]
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.<init>(ThreadPoolExecutor.java:623)
>  ~[?:?]
>   at 
> java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:912)
>  ~[?:?]
>   at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1343) 
> ~[?:?]
>   at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.execute(FsDatasetAsyncDiskService.java:189)
>  ~[hadoop-hdfs-3.2.3.jar:?]
>   at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.deleteAsync(FsDatasetAsyncDiskService.java:238)
>  ~[hadoop-hdfs-3.2.3.jar:?]
>   at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2184)
>  ~[hadoop-hdfs-3.2.3.jar:?]
>   at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2103)
>  ~[hadoop-hdfs-3.2.3.jar:?]
>   at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:736)
>  ~[hadoop-hdf

[jira] [Updated] (HDFS-16594) Many RpcCalls are blocked for a while while Decommission works

2022-05-25 Thread JiangHua Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JiangHua Zhu updated HDFS-16594:

Description: 
When there are some DataNodes that need to go offline, Decommission starts to 
work and periodically checks the number of blocks remaining to be processed. 
By default, once more than 500,000 
(${dfs.namenode.decommission.blocks.per.interval}) blocks have been checked, 
the DatanodeAdminDefaultMonitor thread will sleep for a while before 
continuing.
If the number of blocks to be checked is very large, for example when the 
number of replicas managed by the DataNode reaches 900,000 or even 1,000,000, 
the DatanodeAdminDefaultMonitor will keep holding the FSNamesystemLock during 
this period, which blocks a lot of RpcCalls. Here are some logs:
 !image-2022-05-26-02-05-38-878.png! 

It can be seen that in the last check, more than 1,000,000 blocks were 
processed.
When the check is over, FSNamesystemLock is released and RpcCall starts working:
'
2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 36 on 
8021:Server@494] - Slow RPC : sendHeartbeat took 3488 milliseconds to process 
from client Call#5571549 Retry#0 
org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
...:35727
2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 135 on 
8021:Server@494] - Slow RPC : sendHeartbeat took 3472 milliseconds to process 
from client Call#36795561 Retry#0 
org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
...:37793
2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 108 on 
8021:Server@494] - Slow RPC : sendHeartbeat took 3445 milliseconds to process 
from client Call#5497586 Retry#0 
org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
...:23475
'
'
2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 33 on 
8021:Server@494] - Slow RPC : sendHeartbeat took 3435 milliseconds to process 
from client Call#6043903 Retry#0 
org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
...:34746
2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 139 on 
8021:Server@494] - Slow RPC : sendHeartbeat took 3436 milliseconds to process 
from client Call#274471 Retry#0 
org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
...:46419
2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 77 on 
8021:Server@494] - Slow RPC : sendHeartbeat took 3436 milliseconds to process 
from client Call#73375524 Retry#0 
org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
...:34241
'
Since RpcCall is waiting for a long time, RpcQueueTime+RpcProcessingTime will 
be longer than usual. A very large number of RpcCalls were affected during this 
time.

  was:
When there are some DataNodes that need to go offline, Decommission starts to 
work, and periodically checks the number of blocks remaining to be processed. 
By default, when checking more than 
50w(${dfs.namenode.decommission.blocks.per.interval}) blocks, the 
DatanodeAdminDefaultMonitor thread will sleep for a while before continuing.
If the number of blocks to be checked is very large, for example, the number of 
replicas managed by the DataNode reaches 90w or even 100w, during this period, 
the DatanodeAdminDefaultMonitor will continue to hold the FSNamesystemLock, 
which will block a lot of RpcCalls. Here are some logs:
 !image-2022-05-26-02-05-38-878.png! 

It can be seen that in the last inspection process, there were more than 100w 
blocks.
When the check is over, FSNamesystemLock is released and RpcCall starts working:
'
2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 36 on 
8021:Server@494] - Slow RPC : sendHeartbeat took 3488 milliseconds to process 
from client Call#5571549 Retry#0 
org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
10.196.145.92:35727
2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 135 on 
8021:Server@494] - Slow RPC : sendHeartbeat took 3472 milliseconds to process 
from client Call#36795561 Retry#0 
org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
10.196.99.152:37793
2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 108 on 
8021:Server@494] - Slow RPC : sendHeartbeat took 3445 milliseconds to process 
from client Call#5497586 Retry#0 
org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
10.196.146.56:23475
'
'
2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 33 on 
8021:Server@494] - Slow RPC : sendHeartbeat took 3435 milliseconds to process 
from client Call#6043903 Retry#0 
org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
10.196.82.106:34746
2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 
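
The behaviour described above can be sketched as follows. The names are 
invented and this is not the DatanodeAdminDefaultMonitor source; it only 
illustrates why a very large per-interval block count translates into a long 
FSNamesystemLock hold and queued RPCs such as sendHeartbeat.
{code:java}
import java.util.Iterator;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustrative sketch (not the NameNode code): each scheduled run checks up to
// "blocksPerInterval" blocks while holding the write lock, then releases the
// lock and sleeps until the next run. With a very large per-interval count,
// the lock is held for the whole scan and RPCs such as sendHeartbeat queue up.
class DecommissionScanSketch implements Runnable {
  private final ReentrantReadWriteLock namesystemLock = new ReentrantReadWriteLock();
  private final Iterator<String> pendingBlockIds;
  private final int blocksPerInterval;

  DecommissionScanSketch(Iterator<String> pendingBlockIds, int blocksPerInterval) {
    this.pendingBlockIds = pendingBlockIds;
    this.blocksPerInterval = blocksPerInterval;
  }

  @Override
  public void run() {
    int checked = 0;
    namesystemLock.writeLock().lock();    // held for the whole interval's work
    try {
      while (pendingBlockIds.hasNext() && checked < blocksPerInterval) {
        pendingBlockIds.next().isEmpty(); // placeholder for the per-block check
        checked++;
      }
    } finally {
      namesystemLock.writeLock().unlock(); // only now can queued RPCs proceed
    }
  }
}
{code}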

[jira] [Created] (HDFS-16595) Slow peer metrics - add median, mad and upper latency limits

2022-05-25 Thread Viraj Jasani (Jira)
Viraj Jasani created HDFS-16595:
---

 Summary: Slow peer metrics - add median, mad and upper latency 
limits
 Key: HDFS-16595
 URL: https://issues.apache.org/jira/browse/HDFS-16595
 Project: Hadoop HDFS
  Issue Type: New Feature
Reporter: Viraj Jasani
Assignee: Viraj Jasani


Slow datanode metrics include the slow node and its reporting node details. 
With HDFS-16582, we added the aggregate latency perceived by the reporting 
nodes.

In order to get more insight into how the outlier slow node's latencies differ 
from those of the rest of the nodes, we should also expose the median, the 
median absolute deviation and the calculated upper latency limit details.
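
As a rough illustration of the statistics being proposed, the median, the 
median absolute deviation (MAD) and an upper latency limit derived from them 
can be computed as below. The multiplier of 3 is an assumption for the sketch, 
not necessarily what the outlier detector uses.
{code:java}
import java.util.Arrays;

// Illustrative sketch of median, MAD and a derived upper latency limit.
final class SlowPeerStatsSketch {
  static double median(double[] sorted) {
    int n = sorted.length;
    return n % 2 == 1 ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
  }

  public static void main(String[] args) {
    double[] latenciesMs = {1.2, 1.3, 1.4, 1.5, 9.8}; // sample aggregate latencies
    Arrays.sort(latenciesMs);
    double med = median(latenciesMs);

    double[] deviations = new double[latenciesMs.length];
    for (int i = 0; i < latenciesMs.length; i++) {
      deviations[i] = Math.abs(latenciesMs[i] - med);
    }
    Arrays.sort(deviations);
    double mad = median(deviations);

    double upperLimitMs = med + 3 * mad; // assumed multiplier of 3
    System.out.printf("median=%.2f mad=%.2f upperLimit=%.2f%n", med, mad, upperLimitMs);
  }
}
{code}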



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-16595) Slow peer metrics - add median, mad and upper latency limits

2022-05-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16595?focusedWorklogId=774883&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-774883
 ]

ASF GitHub Bot logged work on HDFS-16595:
-

Author: ASF GitHub Bot
Created on: 26/May/22 02:26
Start Date: 26/May/22 02:26
Worklog Time Spent: 10m 
  Work Description: virajjasani opened a new pull request, #4357:
URL: https://github.com/apache/hadoop/pull/4357

   ### Description of PR
   Slow datanode metrics include the slow node and its reporting node details. 
With HDFS-16582, we added the aggregate latency perceived by the reporting 
nodes.
   
   In order to get more insight into how the outlier slow node's latencies 
differ from those of the rest of the nodes, we should also expose the median, 
the median absolute deviation and the calculated upper latency limit details.
   
   ### How was this patch tested?
   UTs and Dev cluster testing:
   
   https://user-images.githubusercontent.com/34790606/170402806-e3ee106b-93e0-42f1-a4d4-2695dd679e98.png
   
   
   ### For code changes:
   
   - [X] Does the title or this PR starts with the corresponding JIRA issue id 
(e.g. 'HADOOP-17799. Your PR title ...')?
   




Issue Time Tracking
---

Worklog Id: (was: 774883)
Remaining Estimate: 0h
Time Spent: 10m

> Slow peer metrics - add median, mad and upper latency limits
> 
>
> Key: HDFS-16595
> URL: https://issues.apache.org/jira/browse/HDFS-16595
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Slow datanode metrics include the slow node and its reporting node details. 
> With HDFS-16582, we added the aggregate latency perceived by the reporting 
> nodes.
> In order to get more insight into how the outlier slow node's latencies 
> differ from those of the rest of the nodes, we should also expose the median, 
> the median absolute deviation and the calculated upper latency limit details.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16595) Slow peer metrics - add median, mad and upper latency limits

2022-05-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HDFS-16595:
--
Labels: pull-request-available  (was: )

> Slow peer metrics - add median, mad and upper latency limits
> 
>
> Key: HDFS-16595
> URL: https://issues.apache.org/jira/browse/HDFS-16595
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Slow datanode metrics include the slow node and its reporting node details. 
> With HDFS-16582, we added the aggregate latency perceived by the reporting 
> nodes.
> In order to get more insight into how the outlier slow node's latencies 
> differ from those of the rest of the nodes, we should also expose the median, 
> the median absolute deviation and the calculated upper latency limit details.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-14750) RBF: Improved isolation for downstream name nodes. {Dynamic}

2022-05-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14750?focusedWorklogId=774884&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-774884
 ]

ASF GitHub Bot logged work on HDFS-14750:
-

Author: ASF GitHub Bot
Created on: 26/May/22 02:27
Start Date: 26/May/22 02:27
Worklog Time Spent: 10m 
  Work Description: kokonguyen191 commented on PR #4307:
URL: https://github.com/apache/hadoop/pull/4307#issuecomment-1138077437

   @goiri Can you help review the PR? Thanks!




Issue Time Tracking
---

Worklog Id: (was: 774884)
Time Spent: 3h 50m  (was: 3h 40m)

> RBF: Improved isolation for downstream name nodes. {Dynamic}
> 
>
> Key: HDFS-14750
> URL: https://issues.apache.org/jira/browse/HDFS-14750
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: CR Hota
>Assignee: CR Hota
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> This Jira tracks the work on dynamic allocation of resources in routers for 
> downstream HDFS clusters. 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16592) Fix typo for BalancingPolicy

2022-05-25 Thread JiangHua Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JiangHua Zhu resolved HDFS-16592.
-
Resolution: Not A Problem

> Fix typo for BalancingPolicy
> 
>
> Key: HDFS-16592
> URL: https://issues.apache.org/jira/browse/HDFS-16592
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer & mover, documentation, namenode
>Affects Versions: 3.4.0
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Minor
>  Labels: pull-request-available
> Attachments: image-2022-05-24-11-29-14-019.png
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
>  !image-2022-05-24-11-29-14-019.png! 
> 'NOT' should be lowercase rather than uppercase.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-16592) Fix typo for BalancingPolicy

2022-05-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16592?focusedWorklogId=774888&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-774888
 ]

ASF GitHub Bot logged work on HDFS-16592:
-

Author: ASF GitHub Bot
Created on: 26/May/22 02:50
Start Date: 26/May/22 02:50
Worklog Time Spent: 10m 
  Work Description: jianghuazhu closed pull request #4351: HDFS-16592.Fix 
typo for BalancingPolicy.
URL: https://github.com/apache/hadoop/pull/4351




Issue Time Tracking
---

Worklog Id: (was: 774888)
Time Spent: 1h 10m  (was: 1h)

> Fix typo for BalancingPolicy
> 
>
> Key: HDFS-16592
> URL: https://issues.apache.org/jira/browse/HDFS-16592
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer & mover, documentation, namenode
>Affects Versions: 3.4.0
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Minor
>  Labels: pull-request-available
> Attachments: image-2022-05-24-11-29-14-019.png
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
>  !image-2022-05-24-11-29-14-019.png! 
> 'NOT' should be lowercase rather than uppercase.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16594) Many RpcCalls are blocked for a while while Decommission works

2022-05-25 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17542319#comment-17542319
 ] 

Wei-Chiu Chuang commented on HDFS-16594:


Stephen is very experienced in this area of the code, so I'd take heed of his 
suggestions :) 

There were a few changes made to improve this area. 2.9.2 is old and I am not 
sure it has all the updates we made over the last 2-3 years.



> Many RpcCalls are blocked for a while while Decommission works
> --
>
> Key: HDFS-16594
> URL: https://issues.apache.org/jira/browse/HDFS-16594
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 2.9.2
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Major
> Attachments: image-2022-05-26-02-05-38-878.png
>
>
> When there are some DataNodes that need to go offline, Decommission starts to 
> work and periodically checks the number of blocks remaining to be processed. 
> By default, once more than 500,000 
> (${dfs.namenode.decommission.blocks.per.interval}) blocks have been checked, 
> the DatanodeAdminDefaultMonitor thread will sleep for a while before 
> continuing.
> If the number of blocks to be checked is very large, for example when the 
> number of replicas managed by the DataNode reaches 900,000 or even 1,000,000, 
> the DatanodeAdminDefaultMonitor will keep holding the FSNamesystemLock during 
> this period, which blocks a lot of RpcCalls. Here are some logs:
>  !image-2022-05-26-02-05-38-878.png! 
> It can be seen that in the last check, more than 1,000,000 blocks were 
> processed.
> When the check is over, FSNamesystemLock is released and RpcCall starts 
> working:
> '
> 2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 36 on 
> 8021:Server@494] - Slow RPC : sendHeartbeat took 3488 milliseconds to process 
> from client Call#5571549 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
> ...:35727
> 2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 135 on 
> 8021:Server@494] - Slow RPC : sendHeartbeat took 3472 milliseconds to process 
> from client Call#36795561 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
> ...:37793
> 2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 108 on 
> 8021:Server@494] - Slow RPC : sendHeartbeat took 3445 milliseconds to process 
> from client Call#5497586 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
> ...:23475
> '
> '
> 2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 33 on 
> 8021:Server@494] - Slow RPC : sendHeartbeat took 3435 milliseconds to process 
> from client Call#6043903 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
> ...:34746
> 2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 139 on 
> 8021:Server@494] - Slow RPC : sendHeartbeat took 3436 milliseconds to process 
> from client Call#274471 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
> ...:46419
> 2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 77 on 
> 8021:Server@494] - Slow RPC : sendHeartbeat took 3436 milliseconds to process 
> from client Call#73375524 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
> ...:34241
> '
> Since RpcCall is waiting for a long time, RpcQueueTime+RpcProcessingTime will 
> be longer than usual. A very large number of RpcCalls were affected during 
> this time.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-16561) Handle error returned by strtol

2022-05-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16561?focusedWorklogId=774905&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-774905
 ]

ASF GitHub Bot logged work on HDFS-16561:
-

Author: ASF GitHub Bot
Created on: 26/May/22 05:39
Start Date: 26/May/22 05:39
Worklog Time Spent: 10m 
  Work Description: GauthamBanasandra merged PR #4287:
URL: https://github.com/apache/hadoop/pull/4287




Issue Time Tracking
---

Worklog Id: (was: 774905)
Time Spent: 2h 50m  (was: 2h 40m)

> Handle error returned by strtol
> ---
>
> Key: HDFS-16561
> URL: https://issues.apache.org/jira/browse/HDFS-16561
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: libhdfs++
>Affects Versions: 3.4.0
>Reporter: Gautham Banasandra
>Assignee: Gautham Banasandra
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> *strtol* is used in 
> [hdfs-chmod.cc|https://github.com/apache/hadoop/blob/6dddbd42edd57cc26279c678756386a47c040af5/hadoop-hdfs-project/hadoop-hdfs-native-client/src/main/native/libhdfspp/tools/hdfs-chmod/hdfs-chmod.cc#L144].
>  The call to strtol could error out when an invalid input is provided. Need 
> to handle the error given out by strtol.
> Tasks to do -
> 1. Detect the error returned by strtol. The [strtol documentation 
> |https://en.cppreference.com/w/cpp/string/byte/strtol]explains how to do so.
> 2. Return false to the caller if the error is detected.
> 3. Extend 
> [this|https://github.com/apache/hadoop/blob/6dddbd42edd57cc26279c678756386a47c040af5/hadoop-hdfs-project/hadoop-hdfs-native-client/src/main/native/libhdfspp/tests/tools/hdfs-chmod-mock.cc]
>  unit test and add a case which exercises this by passing an invalid input. 
> Please refer to this PR to get more context on how this unit test is written 
> - https://github.com/apache/hadoop/pull/3588.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-16561) Handle error returned by strtol

2022-05-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16561?focusedWorklogId=774906&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-774906
 ]

ASF GitHub Bot logged work on HDFS-16561:
-

Author: ASF GitHub Bot
Created on: 26/May/22 05:41
Start Date: 26/May/22 05:41
Worklog Time Spent: 10m 
  Work Description: GauthamBanasandra commented on PR #4287:
URL: https://github.com/apache/hadoop/pull/4287#issuecomment-1138174893

   Congratulations on your first PR @rishabh1704 🎉 Welcome to the Hadoop 
community 😊




Issue Time Tracking
---

Worklog Id: (was: 774906)
Time Spent: 3h  (was: 2h 50m)

> Handle error returned by strtol
> ---
>
> Key: HDFS-16561
> URL: https://issues.apache.org/jira/browse/HDFS-16561
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: libhdfs++
>Affects Versions: 3.4.0
>Reporter: Gautham Banasandra
>Assignee: Gautham Banasandra
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> *strtol* is used in 
> [hdfs-chmod.cc|https://github.com/apache/hadoop/blob/6dddbd42edd57cc26279c678756386a47c040af5/hadoop-hdfs-project/hadoop-hdfs-native-client/src/main/native/libhdfspp/tools/hdfs-chmod/hdfs-chmod.cc#L144].
>  The call to strtol could error out when an invalid input is provided. Need 
> to handle the error given out by strtol.
> Tasks to do -
> 1. Detect the error returned by strtol. The [strtol documentation 
> |https://en.cppreference.com/w/cpp/string/byte/strtol]explains how to do so.
> 2. Return false to the caller if the error is detected.
> 3. Extend 
> [this|https://github.com/apache/hadoop/blob/6dddbd42edd57cc26279c678756386a47c040af5/hadoop-hdfs-project/hadoop-hdfs-native-client/src/main/native/libhdfspp/tests/tools/hdfs-chmod-mock.cc]
>  unit test and add a case which exercises this by passing an invalid input. 
> Please refer to this PR to get more context on how this unit test is written 
> - https://github.com/apache/hadoop/pull/3588.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16561) Handle error returned by strtol

2022-05-25 Thread Gautham Banasandra (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gautham Banasandra updated HDFS-16561:
--
Fix Version/s: 3.4.0

> Handle error returned by strtol
> ---
>
> Key: HDFS-16561
> URL: https://issues.apache.org/jira/browse/HDFS-16561
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: libhdfs++
>Affects Versions: 3.4.0
>Reporter: Gautham Banasandra
>Assignee: Gautham Banasandra
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> *strtol* is used in 
> [hdfs-chmod.cc|https://github.com/apache/hadoop/blob/6dddbd42edd57cc26279c678756386a47c040af5/hadoop-hdfs-project/hadoop-hdfs-native-client/src/main/native/libhdfspp/tools/hdfs-chmod/hdfs-chmod.cc#L144].
>  The call to strtol could error out when an invalid input is provided. Need 
> to handle the error given out by strtol.
> Tasks to do -
> 1. Detect the error returned by strtol. The [strtol documentation 
> |https://en.cppreference.com/w/cpp/string/byte/strtol]explains how to do so.
> 2. Return false to the caller if the error is detected.
> 3. Extend 
> [this|https://github.com/apache/hadoop/blob/6dddbd42edd57cc26279c678756386a47c040af5/hadoop-hdfs-project/hadoop-hdfs-native-client/src/main/native/libhdfspp/tests/tools/hdfs-chmod-mock.cc]
>  unit test and add a case which exercises this by passing an invalid input. 
> Please refer to this PR to get more context on how this unit test is written 
> - https://github.com/apache/hadoop/pull/3588.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16561) Handle error returned by strtol

2022-05-25 Thread Gautham Banasandra (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gautham Banasandra resolved HDFS-16561.
---
Resolution: Fixed

Merged PR https://github.com/apache/hadoop/pull/4287 to trunk. Thank you 
[~__rishuu__] for your contribution.

> Handle error returned by strtol
> ---
>
> Key: HDFS-16561
> URL: https://issues.apache.org/jira/browse/HDFS-16561
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: libhdfs++
>Affects Versions: 3.4.0
>Reporter: Gautham Banasandra
>Assignee: Gautham Banasandra
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> *strtol* is used in 
> [hdfs-chmod.cc|https://github.com/apache/hadoop/blob/6dddbd42edd57cc26279c678756386a47c040af5/hadoop-hdfs-project/hadoop-hdfs-native-client/src/main/native/libhdfspp/tools/hdfs-chmod/hdfs-chmod.cc#L144].
>  The call to strtol could error out when an invalid input is provided. Need 
> to handle the error given out by strtol.
> Tasks to do -
> 1. Detect the error returned by strtol. The [strtol documentation 
> |https://en.cppreference.com/w/cpp/string/byte/strtol]explains how to do so.
> 2. Return false to the caller if the error is detected.
> 3. Extend 
> [this|https://github.com/apache/hadoop/blob/6dddbd42edd57cc26279c678756386a47c040af5/hadoop-hdfs-project/hadoop-hdfs-native-client/src/main/native/libhdfspp/tests/tools/hdfs-chmod-mock.cc]
>  unit test and add a case which exercises this by passing an invalid input. 
> Please refer to this PR to get more context on how this unit test is written 
> - https://github.com/apache/hadoop/pull/3588.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16594) Many RpcCalls are blocked for a while while Decommission works

2022-05-25 Thread JiangHua Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17542370#comment-17542370
 ] 

JiangHua Zhu commented on HDFS-16594:
-

Thanks [~sodonnell] and [~weichiu] for your comments and for following this issue.

> Many RpcCalls are blocked for a while while Decommission works
> --
>
> Key: HDFS-16594
> URL: https://issues.apache.org/jira/browse/HDFS-16594
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 2.9.2
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Major
> Attachments: image-2022-05-26-02-05-38-878.png
>
>
> When there are some DataNodes that need to go offline, Decommission starts to 
> work and periodically checks the number of blocks remaining to be processed. 
> By default, once more than 500,000 
> (${dfs.namenode.decommission.blocks.per.interval}) blocks have been checked, 
> the DatanodeAdminDefaultMonitor thread will sleep for a while before 
> continuing.
> If the number of blocks to be checked is very large, for example when the 
> number of replicas managed by the DataNode reaches 900,000 or even 1,000,000, 
> the DatanodeAdminDefaultMonitor will keep holding the FSNamesystemLock during 
> this period, which blocks a lot of RpcCalls. Here are some logs:
>  !image-2022-05-26-02-05-38-878.png! 
> It can be seen that in the last check, more than 1,000,000 blocks were 
> processed.
> When the check is over, FSNamesystemLock is released and RpcCall starts 
> working:
> '
> 2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 36 on 
> 8021:Server@494] - Slow RPC : sendHeartbeat took 3488 milliseconds to process 
> from client Call#5571549 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
> ...:35727
> 2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 135 on 
> 8021:Server@494] - Slow RPC : sendHeartbeat took 3472 milliseconds to process 
> from client Call#36795561 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
> ...:37793
> 2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 108 on 
> 8021:Server@494] - Slow RPC : sendHeartbeat took 3445 milliseconds to process 
> from client Call#5497586 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
> ...:23475
> '
> '
> 2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 33 on 
> 8021:Server@494] - Slow RPC : sendHeartbeat took 3435 milliseconds to process 
> from client Call#6043903 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
> ...:34746
> 2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 139 on 
> 8021:Server@494] - Slow RPC : sendHeartbeat took 3436 milliseconds to process 
> from client Call#274471 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
> ...:46419
> 2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 77 on 
> 8021:Server@494] - Slow RPC : sendHeartbeat took 3436 milliseconds to process 
> from client Call#73375524 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
> ...:34241
> '
> Since RpcCall is waiting for a long time, RpcQueueTime+RpcProcessingTime will 
> be longer than usual. A very large number of RpcCalls were affected during 
> this time.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-16587) Allow configuring Handler number for the JournalNodeRpcServer

2022-05-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16587?focusedWorklogId=774910&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-774910
 ]

ASF GitHub Bot logged work on HDFS-16587:
-

Author: ASF GitHub Bot
Created on: 26/May/22 06:27
Start Date: 26/May/22 06:27
Worklog Time Spent: 10m 
  Work Description: ZanderXu commented on code in PR #4339:
URL: https://github.com/apache/hadoop/pull/4339#discussion_r882356503


##
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/JournalNodeRpcServer.java:
##
@@ -90,13 +91,15 @@ public class JournalNodeRpcServer implements 
QJournalProtocol,
 new QJournalProtocolServerSideTranslatorPB(this);
 BlockingService service = QJournalProtocolService
 .newReflectiveBlockingService(translator);
+int handler = conf.getInt(DFS_JOURNALNODE_HANDLER_COUNT_KEY,
+DFS_JOURNALNODE_HANDLER_COUNT_DEFAULT);

Review Comment:
   @ayushtkn Thanks for your comment, it helped me a lot. The patch has been 
updated; please help review it again. Thanks
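
   For reference, the setting the snippet reads could be consumed as in the 
hedged sketch below; the key name and default are assumptions inferred from 
the constant names in the diff and may not match the final patch exactly.
{code:java}
import org.apache.hadoop.conf.Configuration;

// Hedged sketch: reading a JournalNode RPC handler count the way the diff
// above does. The key name and default value are assumptions.
class JournalNodeHandlerCountSketch {
  static int handlerCount(Configuration conf) {
    return conf.getInt("dfs.journalnode.handler.count", 5);
  }
}
{code}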





Issue Time Tracking
---

Worklog Id: (was: 774910)
Time Spent: 50m  (was: 40m)

> Allow configuring Handler number for the JournalNodeRpcServer
> -
>
> Key: HDFS-16587
> URL: https://issues.apache.org/jira/browse/HDFS-16587
> Project: Hadoop HDFS
>  Issue Type: Wish
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> We can allow configuring the handler number for the JournalNodeRpcServer.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org