[jira] [Commented] (HDFS-16100) HA: Improve performance of Standby node transition to Active

2021-07-06 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17376251#comment-17376251
 ] 

Xiaoqiao He commented on HDFS-16100:


[^HDFS-16100.001.patch] looks safe to check in and is an improvement, in my view. Kindly 
ping [~weichiu], [~inigoiri]: would you mind giving it another review?
CI does not seem to be working well, so I triggered it manually; see: 
https://ci-hadoop.apache.org/job/PreCommit-HDFS-Build/671/

>  HA: Improve performance of Standby node transition to Active
> -
>
> Key: HDFS-16100
> URL: https://issues.apache.org/jira/browse/HDFS-16100
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 3.3.1
>Reporter: wudeyu
>Assignee: wudeyu
>Priority: Major
> Attachments: HDFS-16100.001.patch, HDFS-16100.patch
>
>
> pendingDNMessages in the Standby Node holds postponed block reports so that 
> they can be processed later. Block reports in pendingDNMessages are handled as follows:
>  # If the GS (generation stamp) of the replica is in the future, the Standby Node 
> processes it when the corresponding edit log entry (e.g. add_block) is loaded.
>  # If the replica is corrupted, the Standby Node processes it during the transition to 
> Active.
>  # If the DataNode is removed, its block reports are removed from 
> pendingDNMessages.
> Consequently, the more corrupted replicas there are, the longer the 
> transition takes. In our situation, there were 60 million block reports in 
> pendingDNMessages before the transition. Processing them took almost 7 minutes, 
> and the process was killed by zkfc. Most of these block reports had replica state RBW 
> with a wrong GS (lower than that of the stored block in the Standby Node).
> In my opinion, the Standby Node could ignore block reports whose replica state 
> is RBW with a wrong GS, because the Active node/DataNode will remove them later.
>  
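
The proposal in the last paragraph amounts to a filter over the postponed reports. A minimal Java sketch follows, assuming simplified stand-in types: `PendingReport`, `ReplicaState`, `isIgnorable` and `reportsToProcess` are hypothetical names used for illustration, not the actual HDFS classes and not the contents of the attached patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongUnaryOperator;

// Hypothetical sketch only: these types are simplified stand-ins, not HDFS classes.
final class PendingReportFilter {
    enum ReplicaState { FINALIZED, RBW, RWR, RUR, TEMPORARY }

    /** One postponed block report: block id, reported replica state, reported GS. */
    static final class PendingReport {
        final long blockId;
        final ReplicaState state;
        final long reportedGs;

        PendingReport(long blockId, ReplicaState state, long reportedGs) {
            this.blockId = blockId;
            this.state = state;
            this.reportedGs = reportedGs;
        }
    }

    /** True for an RBW replica whose GS is lower than the GS of the stored block. */
    static boolean isIgnorable(PendingReport report, long storedGs) {
        return report.state == ReplicaState.RBW && report.reportedGs < storedGs;
    }

    /** Keeps only the reports that still need processing during the transition. */
    static List<PendingReport> reportsToProcess(List<PendingReport> pending,
                                                LongUnaryOperator storedGsOf) {
        List<PendingReport> kept = new ArrayList<>();
        for (PendingReport report : pending) {
            if (!isIgnorable(report, storedGsOf.applyAsLong(report.blockId))) {
                kept.add(report);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<PendingReport> pending = new ArrayList<>();
        pending.add(new PendingReport(1L, ReplicaState.RBW, 100L));       // RBW with older GS
        pending.add(new PendingReport(2L, ReplicaState.FINALIZED, 100L)); // not RBW, kept
        // Assumption for the demo: every stored block currently has GS 150.
        System.out.println(reportsToProcess(pending, id -> 150L).size());
    }
}
```

Dropping such entries before the transition would shorten the queue that has to be drained while zkfc is waiting, which is the cost described above.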



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-16116) Fix Hadoop FedBalance shell and federationBanance markdown bug.

2021-07-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16116?focusedWorklogId=619748=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-619748
 ]

ASF GitHub Bot logged work on HDFS-16116:
-

Author: ASF GitHub Bot
Created on: 07/Jul/21 04:28
Start Date: 07/Jul/21 04:28
Worklog Time Spent: 10m 
  Work Description: hadoop-yetus commented on pull request #3181:
URL: https://github.com/apache/hadoop/pull/3181#issuecomment-875267101


   :broken_heart: **-1 overall**

   | Vote | Subsystem | Runtime | Logfile | Comment |
   |:----:|----------:|--------:|:-------:|:-------:|
   | +0 :ok: | reexec | 17m 39s | | Docker mode activated. |
   |||| _ Prechecks _ |
   | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
   | +0 :ok: | codespell | 0m 0s | | codespell was not available. |
   | +0 :ok: | shelldocs | 0m 0s | | Shelldocs was not available. |
   | +0 :ok: | markdownlint | 0m 0s | | markdownlint was not available. |
   | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
   | -1 :x: | test4tests | 0m 0s | | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
   |||| _ trunk Compile Tests _ |
   | +0 :ok: | mvndep | 12m 32s | | Maven dependency ordering for branch |
   | +1 :green_heart: | mvninstall | 23m 1s | | trunk passed |
   | +1 :green_heart: | mvnsite | 1m 49s | | trunk passed |
   | +1 :green_heart: | shadedclient | 15m 33s | | branch has no errors when building and testing our client artifacts. |
   |||| _ Patch Compile Tests _ |
   | +0 :ok: | mvndep | 0m 21s | | Maven dependency ordering for patch |
   | +1 :green_heart: | mvninstall | 1m 19s | | the patch passed |
   | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
   | +1 :green_heart: | mvnsite | 1m 35s | | the patch passed |
   | +1 :green_heart: | shellcheck | 0m 8s | | No new issues. |
   | +1 :green_heart: | shadedclient | 15m 33s | | patch has no errors when building and testing our client artifacts. |
   |||| _ Other Tests _ |
   | +1 :green_heart: | unit | 1m 31s | | hadoop-common in the patch passed. |
   | +1 :green_heart: | unit | 0m 20s | | hadoop-federation-balance in the patch passed. |
   | +1 :green_heart: | asflicense | 0m 29s | | The patch does not generate ASF License warnings. |
   | | | 93m 34s | | |

   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3181/3/artifact/out/Dockerfile |
   | GITHUB PR | https://github.com/apache/hadoop/pull/3181 |
   | Optional Tests | dupname asflicense mvnsite unit codespell shellcheck shelldocs markdownlint |
   | uname | Linux 3dac8065161a 4.15.0-136-generic #140-Ubuntu SMP Thu Jan 28 05:20:47 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 981002ac1ee636b739e797256ec1762ee6ebd91c |
   | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3181/3/testReport/ |
   | Max. process+thread count | 572 (vs. ulimit of 5500) |
   | modules | C: hadoop-common-project/hadoop-common hadoop-tools/hadoop-federation-balance U: . |
   | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3181/3/console |
   | versions | git=2.25.1 maven=3.6.3 shellcheck=0.7.0 |
   | Powered by | Apache Yetus 0.14.0-SNAPSHOT https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 619748)
Time Spent: 40m  (was: 0.5h)

> Fix Hadoop FedBalance shell and federationBanance markdown bug.
> ---
>
> Key: HDFS-16116
> URL: https://issues.apache.org/jira/browse/HDFS-16116
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rbf
>Reporter: panlijie
>Assignee: panlijie
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Fix Hadoop FedBalance shell and federationBanance 

[jira] [Work logged] (HDFS-16116) Fix Hadoop FedBalance shell and federationBanance markdown bug.

2021-07-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16116?focusedWorklogId=619742=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-619742
 ]

ASF GitHub Bot logged work on HDFS-16116:
-

Author: ASF GitHub Bot
Created on: 07/Jul/21 04:13
Start Date: 07/Jul/21 04:13
Worklog Time Spent: 10m 
  Work Description: hadoop-yetus commented on pull request #3181:
URL: https://github.com/apache/hadoop/pull/3181#issuecomment-875261420


   :broken_heart: **-1 overall**

   | Vote | Subsystem | Runtime | Logfile | Comment |
   |:----:|----------:|--------:|:-------:|:-------:|
   | +0 :ok: | reexec | 12m 7s | | Docker mode activated. |
   |||| _ Prechecks _ |
   | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
   | +0 :ok: | codespell | 0m 0s | | codespell was not available. |
   | +0 :ok: | shelldocs | 0m 0s | | Shelldocs was not available. |
   | +0 :ok: | markdownlint | 0m 0s | | markdownlint was not available. |
   | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
   | -1 :x: | test4tests | 0m 0s | | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
   |||| _ trunk Compile Tests _ |
   | +0 :ok: | mvndep | 13m 1s | | Maven dependency ordering for branch |
   | +1 :green_heart: | mvninstall | 20m 19s | | trunk passed |
   | +1 :green_heart: | mvnsite | 1m 53s | | trunk passed |
   | +1 :green_heart: | shadedclient | 13m 13s | | branch has no errors when building and testing our client artifacts. |
   |||| _ Patch Compile Tests _ |
   | +0 :ok: | mvndep | 0m 28s | | Maven dependency ordering for patch |
   | +1 :green_heart: | mvninstall | 1m 20s | | the patch passed |
   | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
   | +1 :green_heart: | mvnsite | 1m 34s | | the patch passed |
   | +1 :green_heart: | shellcheck | 0m 7s | | No new issues. |
   | +1 :green_heart: | shadedclient | 13m 5s | | patch has no errors when building and testing our client artifacts. |
   |||| _ Other Tests _ |
   | +1 :green_heart: | unit | 1m 39s | | hadoop-common in the patch passed. |
   | +1 :green_heart: | unit | 0m 24s | | hadoop-federation-balance in the patch passed. |
   | +1 :green_heart: | asflicense | 0m 34s | | The patch does not generate ASF License warnings. |
   | | | 81m 30s | | |

   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3181/2/artifact/out/Dockerfile |
   | GITHUB PR | https://github.com/apache/hadoop/pull/3181 |
   | Optional Tests | dupname asflicense mvnsite unit codespell shellcheck shelldocs markdownlint |
   | uname | Linux 17f3b7f8e5f8 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 981002ac1ee636b739e797256ec1762ee6ebd91c |
   | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3181/2/testReport/ |
   | Max. process+thread count | 720 (vs. ulimit of 5500) |
   | modules | C: hadoop-common-project/hadoop-common hadoop-tools/hadoop-federation-balance U: . |
   | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3181/2/console |
   | versions | git=2.25.1 maven=3.6.3 shellcheck=0.7.0 |
   | Powered by | Apache Yetus 0.14.0-SNAPSHOT https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   




Issue Time Tracking
---

Worklog Id: (was: 619742)
Time Spent: 0.5h  (was: 20m)

> Fix Hadoop FedBalance shell and federationBanance markdown bug.
> ---
>
> Key: HDFS-16116
> URL: https://issues.apache.org/jira/browse/HDFS-16116
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rbf
>Reporter: panlijie
>Assignee: panlijie
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Fix Hadoop FedBalance shell and federationBanance 

[jira] [Updated] (HDFS-16094) HDFS balancer process start failed owing to daemon pid file is not cleared in some exception senario

2021-07-06 Thread Daniel Ma (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Ma updated HDFS-16094:
-
Target Version/s: 3.1.1  (was: 3.4.0)

> HDFS balancer process start failed owing to daemon pid file is not cleared in 
> some exception senario
> 
>
> Key: HDFS-16094
> URL: https://issues.apache.org/jira/browse/HDFS-16094
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: scripts
>Affects Versions: 3.3.1
>Reporter: Daniel Ma
>Priority: Major
>
> The HDFS balancer process fails to start when the daemon pid file is not cleared in 
> some exception scenarios, and the log contains no useful information for 
> troubleshooting, as shown below.
> {code:java}
> // code placeholder
> hadoop_error "${daemonname} is running as process $(cat "${daemon_pidfile}")
> {code}
> In fact, the process is not running, contrary to what the error message above says.
> Therefore, the error log should print more explicit information that 
> tells users to clear the pid file and where the pid file is located.
>  
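
The check requested above could look roughly like the following. The real logic lives in Hadoop's shell scripts; this is only a Java sketch of the idea, and `PidFileCheck`, `describe` and `isAlive` are hypothetical names. The liveness probe is passed in as a parameter so that the message logic can be exercised without depending on real process ids.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.LongPredicate;

// Hypothetical sketch only: not part of Hadoop's scripts or APIs.
final class PidFileCheck {
    /** Default liveness probe based on ProcessHandle (Java 9+). */
    static boolean isAlive(long pid) {
        return ProcessHandle.of(pid).map(ProcessHandle::isAlive).orElse(false);
    }

    /** Builds a message that distinguishes a running daemon from a stale pid file. */
    static String describe(String daemonName, Path pidFile, LongPredicate alive)
            throws IOException {
        if (!Files.exists(pidFile)) {
            return daemonName + " is not running (no pid file at " + pidFile + ")";
        }
        String text = new String(Files.readAllBytes(pidFile), StandardCharsets.UTF_8).trim();
        long pid;
        try {
            pid = Long.parseLong(text);
        } catch (NumberFormatException e) {
            return "pid file " + pidFile + " does not contain a valid pid; remove it and retry";
        }
        if (alive.test(pid)) {
            return daemonName + " is running as process " + pid;
        }
        // The pid file exists but no such process is alive: name the file explicitly.
        return daemonName + " is not running, but a stale pid file for process " + pid
            + " exists at " + pidFile + "; remove it and retry";
    }

    public static void main(String[] args) throws IOException {
        Path pidFile = Files.createTempFile("demo", ".pid");
        String self = Long.toString(ProcessHandle.current().pid());
        Files.write(pidFile, self.getBytes(StandardCharsets.UTF_8));
        System.out.println(describe("balancer", pidFile, PidFileCheck::isAlive));
        Files.delete(pidFile);
    }
}
```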






[jira] [Work logged] (HDFS-16116) Fix Hadoop FedBalance shell and federationBanance markdown bug.

2021-07-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16116?focusedWorklogId=619702=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-619702
 ]

ASF GitHub Bot logged work on HDFS-16116:
-

Author: ASF GitHub Bot
Created on: 07/Jul/21 02:11
Start Date: 07/Jul/21 02:11
Worklog Time Spent: 10m 
  Work Description: hadoop-yetus commented on pull request #3181:
URL: https://github.com/apache/hadoop/pull/3181#issuecomment-875214589


   :broken_heart: **-1 overall**

   | Vote | Subsystem | Runtime | Logfile | Comment |
   |:----:|----------:|--------:|:-------:|:-------:|
   | +0 :ok: | reexec | 0m 34s | | Docker mode activated. |
   |||| _ Prechecks _ |
   | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
   | +0 :ok: | codespell | 0m 0s | | codespell was not available. |
   | +0 :ok: | shelldocs | 0m 0s | | Shelldocs was not available. |
   | +0 :ok: | markdownlint | 0m 0s | | markdownlint was not available. |
   | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
   | -1 :x: | test4tests | 0m 0s | | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
   |||| _ trunk Compile Tests _ |
   | +0 :ok: | mvndep | 12m 55s | | Maven dependency ordering for branch |
   | +1 :green_heart: | mvninstall | 20m 9s | | trunk passed |
   | +1 :green_heart: | mvnsite | 1m 54s | | trunk passed |
   | +1 :green_heart: | shadedclient | 13m 22s | | branch has no errors when building and testing our client artifacts. |
   |||| _ Patch Compile Tests _ |
   | +0 :ok: | mvndep | 0m 27s | | Maven dependency ordering for patch |
   | +1 :green_heart: | mvninstall | 1m 18s | | the patch passed |
   | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
   | +1 :green_heart: | mvnsite | 1m 34s | | the patch passed |
   | +1 :green_heart: | shellcheck | 0m 9s | | No new issues. |
   | +1 :green_heart: | shadedclient | 12m 59s | | patch has no errors when building and testing our client artifacts. |
   |||| _ Other Tests _ |
   | +1 :green_heart: | unit | 1m 38s | | hadoop-common in the patch passed. |
   | +1 :green_heart: | unit | 0m 25s | | hadoop-federation-balance in the patch passed. |
   | +1 :green_heart: | asflicense | 0m 34s | | The patch does not generate ASF License warnings. |
   | | | 69m 40s | | |

   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3181/1/artifact/out/Dockerfile |
   | GITHUB PR | https://github.com/apache/hadoop/pull/3181 |
   | Optional Tests | dupname asflicense mvnsite unit codespell shellcheck shelldocs markdownlint |
   | uname | Linux c18d15971a47 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 725b76f468575b31fa8539a3b5ea9abf4aca8a71 |
   | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3181/1/testReport/ |
   | Max. process+thread count | 693 (vs. ulimit of 5500) |
   | modules | C: hadoop-common-project/hadoop-common hadoop-tools/hadoop-federation-balance U: . |
   | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3181/1/console |
   | versions | git=2.25.1 maven=3.6.3 shellcheck=0.7.0 |
   | Powered by | Apache Yetus 0.14.0-SNAPSHOT https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   




Issue Time Tracking
---

Worklog Id: (was: 619702)
Time Spent: 20m  (was: 10m)

> Fix Hadoop FedBalance shell and federationBanance markdown bug.
> ---
>
> Key: HDFS-16116
> URL: https://issues.apache.org/jira/browse/HDFS-16116
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rbf
>Reporter: panlijie
>Assignee: panlijie
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Fix Hadoop FedBalance shell and federationBanance 

[jira] [Comment Edited] (HDFS-16115) Asynchronously handle BPServiceActor command mechanism may result in BPServiceActor never fails even CommandProcessingThread is closed with fatal error.

2021-07-06 Thread Daniel Ma (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17376146#comment-17376146
 ] 

Daniel Ma edited comment on HDFS-16115 at 7/7/21, 2:01 AM:
---

[~hexiaoqiao] Thanks for your review and tips. Actually, the issue that my patch 
attempts to solve is entirely different from the one addressed in HDFS-15651.

I had noticed that Jira previously, but it does not fully solve my issue. 

What I try to solve in this patch is:

1- Once the CommandProcessingThread catches a non-fatal error or exception, it 
retries up to 5 times instead of simply being interrupted, and after it reaches the 
maximum number of retries, we need to stop the corresponding BPServiceActor thread as well. 

In HDFS-15651, no matter what kind of error occurs, the thread is simply 
closed. However, many non-fatal errors, such as "cannot create native thread", 
can probably recover automatically; yet when the number of threads in the OS 
drops again, the BPServiceActor service remains dead and cannot recover by itself.

2- In my patch, for non-fatal errors, BPOfferService always runs a 
periodic thread that tries to recover a BPServiceActor thread that died 
owing to a non-fatal error, which is the essential difference between our patch 
and HDFS-15651.
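
The retry behaviour in point 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the code of the attached patch: `BoundedRetry`, `Task`, `runWithRetry` and `isNativeThreadExhaustion` are hypothetical names, and treating only the "unable to create new native thread" error as non-fatal is an assumption based on the log quoted in the issue. A real implementation would probably also wait between attempts.

```java
import java.util.function.Predicate;

// Hypothetical sketch only: not the actual BPServiceActor code.
final class BoundedRetry {
    /** A unit of work that may fail, e.g. processing one queued command. */
    interface Task {
        void run() throws Exception;
    }

    /** Assumed classification: thread exhaustion is treated as non-fatal. */
    static boolean isNativeThreadExhaustion(Throwable t) {
        return t instanceof OutOfMemoryError
            && t.getMessage() != null
            && t.getMessage().contains("unable to create new native thread");
    }

    /**
     * Runs the task, retrying up to maxRetries times on non-fatal errors.
     * On a fatal error, or once retries are exhausted, stopOwner is invoked
     * so that the owning thread is stopped as well. Returns true on success.
     */
    static boolean runWithRetry(Task task, int maxRetries,
                                Predicate<Throwable> isNonFatal, Runnable stopOwner) {
        int retries = 0;
        while (true) {
            try {
                task.run();
                return true;
            } catch (Throwable t) {
                boolean retryable = isNonFatal.test(t) && retries < maxRetries;
                if (!retryable) {
                    stopOwner.run();
                    return false;
                }
                retries++;
            }
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        boolean ok = runWithRetry(
            () -> {
                if (++calls[0] < 3) {
                    throw new OutOfMemoryError("unable to create new native thread");
                }
            },
            5,
            BoundedRetry::isNativeThreadExhaustion,
            () -> System.out.println("stopping owner thread"));
        System.out.println("succeeded=" + ok + " calls=" + calls[0]);
    }
}
```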

 


was (Author: daniel ma):
[~hexiaoqiao] Thanks for your review and tips. Actually, the issue that my patch 
attempts to solve is entirely different from the one addressed in HDFS-15651.

I had noticed that Jira previously, but it does not fully solve my issue. 

What I try to solve in this patch is:

1- Once the CommandProcessingThread is dead, we need to stop the corresponding 
BPServiceActor thread as well. 

2- In my patch, for non-fatal errors, BPOfferService always runs a 
periodic thread that tries to recover a BPServiceActor thread that died 
owing to a non-fatal error, which is the essential difference between our patch 
and HDFS-15651.

 

> Asynchronously handle BPServiceActor command mechanism may result in 
> BPServiceActor never fails even CommandProcessingThread is closed with fatal 
> error.
> 
>
> Key: HDFS-16115
> URL: https://issues.apache.org/jira/browse/HDFS-16115
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.3.1
>Reporter: Daniel Ma
>Priority: Critical
> Fix For: 3.3.1
>
> Attachments: 0001-HDFS-16115.patch
>
>
> This is an improvement issue. It actually consists of two sub-issues:
> 1- The BPServiceActor thread handles commands from the NameNode asynchronously 
> (CommandProcessingThread handles the commands). If any exception or 
> error occurs in CommandProcessingThread and causes the thread to fail and 
> stop, BPServiceActor is not aware of it and keeps putting commands 
> from the NameNode into the queue to be handled by CommandProcessingThread, 
> although CommandProcessingThread is already dead.
> 2- The second sub-issue builds on the first one. If CommandProcessingThread 
> dies owing to a non-fatal error such as "can not create native thread", which 
> is caused by too many threads existing in the OS, this kind of problem should be 
> given much more tolerance instead of simply shutting down the thread and never 
> recovering automatically, because the non-fatal errors mentioned above can 
> probably recover soon by themselves.
> {code:java}
> // code placeholder
> 2021-07-02 16:26:02,315 | WARN  | Command processor | Exception happened when process queue BPServiceActor.java:1393
> java.lang.OutOfMemoryError: unable to create new native thread
> at java.lang.Thread.start0(Native Method)
> at java.lang.Thread.start(Thread.java:717)
> at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
> at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367)
> at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.execute(FsDatasetAsyncDiskService.java:180)
> at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.deleteAsync(FsDatasetAsyncDiskService.java:229)
> at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2315)
> at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2237)
> at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:752)
> at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:698)
> at 
> 

[jira] [Comment Edited] (HDFS-16115) Asynchronously handle BPServiceActor command mechanism may result in BPServiceActor never fails even CommandProcessingThread is closed with fatal error.

2021-07-06 Thread Daniel Ma (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17375401#comment-17375401
 ] 

Daniel Ma edited comment on HDFS-16115 at 7/7/21, 1:47 AM:
---

Hello [~brahmareddy],[~hemant]

Could you please help review this patch? Thanks.


was (Author: daniel ma):
Hello [~brahmareddy],[~hemant]

Please help review this patch. Thanks.

> Asynchronously handle BPServiceActor command mechanism may result in 
> BPServiceActor never fails even CommandProcessingThread is closed with fatal 
> error.
> 
>
> Key: HDFS-16115
> URL: https://issues.apache.org/jira/browse/HDFS-16115
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.3.1
>Reporter: Daniel Ma
>Priority: Critical
> Fix For: 3.3.1
>
> Attachments: 0001-HDFS-16115.patch
>
>
> This is an improvement issue. It actually consists of two sub-issues:
> 1- The BPServiceActor thread handles commands from the NameNode asynchronously 
> (CommandProcessingThread handles the commands). If any exception or 
> error occurs in CommandProcessingThread and causes the thread to fail and 
> stop, BPServiceActor is not aware of it and keeps putting commands 
> from the NameNode into the queue to be handled by CommandProcessingThread, 
> although CommandProcessingThread is already dead.
> 2- The second sub-issue builds on the first one. If CommandProcessingThread 
> dies owing to a non-fatal error such as "can not create native thread", which 
> is caused by too many threads existing in the OS, this kind of problem should be 
> given much more tolerance instead of simply shutting down the thread and never 
> recovering automatically, because the non-fatal errors mentioned above can 
> probably recover soon by themselves.
> {code:java}
> // code placeholder
> 2021-07-02 16:26:02,315 | WARN  | Command processor | Exception happened when process queue BPServiceActor.java:1393
> java.lang.OutOfMemoryError: unable to create new native thread
> at java.lang.Thread.start0(Native Method)
> at java.lang.Thread.start(Thread.java:717)
> at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
> at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367)
> at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.execute(FsDatasetAsyncDiskService.java:180)
> at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.deleteAsync(FsDatasetAsyncDiskService.java:229)
> at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2315)
> at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2237)
> at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:752)
> at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:698)
> at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processCommand(BPServiceActor.java:1417)
> at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.lambda$enqueue$2(BPServiceActor.java:1463)
> at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processQueue(BPServiceActor.java:1382)
> at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.run(BPServiceActor.java:1365)
> {code}
> Currently, the DataNode BPServiceActor cannot return to normal even after the 
> non-fatal error has been eliminated.
> Therefore, this patch does two things:
> 1- Add a retry mechanism to the BPServiceActor thread and the CommandProcessingThread 
> thread; the retry count is 5 by default and configurable.
> 2- Add a periodic monitor thread to BPOfferService. If a BPServiceActor 
> thread dies owing to too many non-fatal errors, it should not be 
> simply removed from the BPServiceActor list stored in BPOfferService; instead, 
> the monitor thread periodically tries to restart these dead 
> BPServiceActor threads. The interval is also configurable.
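
The monitor described in point 2 could be sketched like this. It is an illustration under stated assumptions rather than the patch itself: `ActorMonitor` and the `Actor` interface are hypothetical stand-ins for BPOfferService and BPServiceActor, and the interval is passed in directly instead of being read from configuration.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch only: not the actual BPOfferService code.
final class ActorMonitor {
    /** Stand-in for a BPServiceActor-like worker. */
    interface Actor {
        boolean isAlive();
        boolean diedFromNonFatalError();
        void restart();
    }

    private final List<Actor> actors = new CopyOnWriteArrayList<>();
    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();

    void register(Actor actor) {
        actors.add(actor);
    }

    /** One monitoring pass; returns the number of actors that were restarted. */
    int checkOnce() {
        int restarted = 0;
        for (Actor actor : actors) {
            // Dead actors stay in the list; only non-fatal deaths are retried.
            if (!actor.isAlive() && actor.diedFromNonFatalError()) {
                actor.restart();
                restarted++;
            }
        }
        return restarted;
    }

    /** Schedules checkOnce() at a fixed delay. */
    void start(long intervalMillis) {
        scheduler.scheduleWithFixedDelay(() -> {
            try {
                checkOnce();
            } catch (RuntimeException e) {
                // Swallow so that one failed restart does not cancel future passes.
            }
        }, intervalMillis, intervalMillis, TimeUnit.MILLISECONDS);
    }

    void stop() {
        scheduler.shutdownNow();
    }

    public static void main(String[] args) {
        boolean[] alive = {false};
        ActorMonitor monitor = new ActorMonitor();
        monitor.register(new Actor() {
            public boolean isAlive() { return alive[0]; }
            public boolean diedFromNonFatalError() { return true; }
            public void restart() { alive[0] = true; }
        });
        System.out.println("restarted=" + monitor.checkOnce());
        monitor.stop();
    }
}
```

Keeping the pass in a separate `checkOnce()` method means the restart decision can be tested without waiting on the scheduler.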






[jira] [Comment Edited] (HDFS-16115) Asynchronously handle BPServiceActor command mechanism may result in BPServiceActor never fails even CommandProcessingThread is closed with fatal error.

2021-07-06 Thread Daniel Ma (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17375401#comment-17375401
 ] 

Daniel Ma edited comment on HDFS-16115 at 7/7/21, 1:46 AM:
---

Hello [~brahmareddy],[~hemant]

Please help review this patch. Thanks.


was (Author: daniel ma):
Hello [~brahmareddy],

[~ayush]

Please help review this patch. Thanks.




[jira] [Updated] (HDFS-16116) Fix Hadoop FedBalance shell and federationBanance markdown bug.

2021-07-06 Thread panlijie (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

panlijie updated HDFS-16116:

Fix Version/s: 3.4.0

> Fix Hadoop FedBalance shell and federationBanance markdown bug.
> ---
>
> Key: HDFS-16116
> URL: https://issues.apache.org/jira/browse/HDFS-16116
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rbf
>Reporter: panlijie
>Assignee: panlijie
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Fix Hadoop FedBalance shell and federationBanance markdown bug.






[jira] [Commented] (HDFS-16115) Asynchronously handle BPServiceActor command mechanism may result in BPServiceActor never fails even CommandProcessingThread is closed with fatal error.

2021-07-06 Thread Daniel Ma (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17376146#comment-17376146
 ] 

Daniel Ma commented on HDFS-16115:
--

[~hexiaoqiao] Thanks for your review and tips. Actually, the issue my patch 
attempts to solve is quite different from the one described in HDFS-15651.

I noticed that Jira previously, but it does not fully solve my issue.

What I am trying to solve in this patch is:

1- Once the CommandProcessingThread is dead, we need to stop the corresponding 
BPServiceActor thread as well.

2- In my patch, for non-fatal errors, BPOfferService always runs a periodic 
thread that tries to recover any BPServiceActor thread that died owing to a 
non-fatal error. This is the essential difference between our patch and 
HDFS-15651.
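
The periodic recovery described in point 2 could be sketched roughly as 
follows. This is only an illustration, not the code in the attached patch: 
SketchActor and SketchActorMonitor are hypothetical stand-ins, and the real 
BPServiceActor / BPOfferService classes are not used here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for one actor; not the real BPServiceActor. */
final class SketchActor {
  private boolean alive = true;
  private boolean fatal = false;

  /** Marks the actor as dead and records whether the cause was fatal. */
  synchronized void die(boolean fatalError) {
    alive = false;
    fatal = fatalError;
  }

  synchronized boolean isAlive() {
    return alive;
  }

  /** Restarts the actor only if it is dead because of a non-fatal error. */
  synchronized boolean restartIfRecoverable() {
    if (alive || fatal) {
      return false;
    }
    alive = true;
    return true;
  }
}

/** Hypothetical monitor that periodically revives recoverable actors. */
final class SketchActorMonitor {
  private final List<SketchActor> actors;

  SketchActorMonitor(List<SketchActor> actors) {
    this.actors = new ArrayList<>(actors);
  }

  /** One monitoring pass; returns how many actors were restarted. */
  int reviveOnce() {
    int restarted = 0;
    for (SketchActor actor : actors) {
      if (actor.restartIfRecoverable()) {
        restarted++;
      }
    }
    return restarted;
  }

  /** Runs reviveOnce() repeatedly with a caller-chosen (configurable) delay. */
  ScheduledExecutorService start(long intervalMillis) {
    ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();
    scheduler.scheduleWithFixedDelay(
        this::reviveOnce, intervalMillis, intervalMillis, TimeUnit.MILLISECONDS);
    return scheduler;
  }
}
```

Actors that died from a fatal error are deliberately left dead; only the 
non-fatal case is retried, which matches the distinction drawn in the comment.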

 

> Asynchronously handle BPServiceActor command mechanism may result in 
> BPServiceActor never fails even CommandProcessingThread is closed with fatal 
> error.
> 
>
> Key: HDFS-16115
> URL: https://issues.apache.org/jira/browse/HDFS-16115
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.3.1
>Reporter: Daniel Ma
>Priority: Critical
> Fix For: 3.3.1
>
> Attachments: 0001-HDFS-16115.patch
>
>
> This is an improvement issue with two sub-issues:
> 1- The BPServiceActor thread handles commands from the NameNode 
> asynchronously (CommandProcessingThread handles the commands). If an 
> exception or error in CommandProcessingThread makes that thread fail and 
> stop, BPServiceActor is not aware of it and keeps putting commands from the 
> NameNode into the queue for CommandProcessingThread to handle, even though 
> CommandProcessingThread is already dead.
> 2- The second sub-issue builds on the first. If CommandProcessingThread died 
> owing to a non-fatal error such as "cannot create native thread", which is 
> caused by too many threads existing in the OS, the problem should be given 
> much more tolerance instead of simply shutting down the thread and never 
> recovering automatically, because such non-fatal errors can probably recover 
> by themselves soon:
> {code:java}
> // code placeholder
> 2021-07-02 16:26:02,315 | WARN  | Command processor | Exception happened when process queue BPServiceActor.java:1393
> java.lang.OutOfMemoryError: unable to create new native thread
> at java.lang.Thread.start0(Native Method)
> at java.lang.Thread.start(Thread.java:717)
> at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
> at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367)
> at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.execute(FsDatasetAsyncDiskService.java:180)
> at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.deleteAsync(FsDatasetAsyncDiskService.java:229)
> at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2315)
> at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2237)
> at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:752)
> at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:698)
> at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processCommand(BPServiceActor.java:1417)
> at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.lambda$enqueue$2(BPServiceActor.java:1463)
> at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processQueue(BPServiceActor.java:1382)
> at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.run(BPServiceActor.java:1365)
> {code}
> Currently, the DataNode BPServiceActor cannot return to normal even after the 
> non-fatal error has been eliminated.
> Therefore, this patch does two things:
> 1- Add a retry mechanism to the BPServiceActor thread and the 
> CommandProcessingThread thread; the retry count is 5 by default and 
> configurable.
> 2- Add a periodic monitor thread in BPOfferService. If a BPServiceActor 
> thread is dead owing to too many non-fatal errors, it should not simply be 
> removed from the BPServiceActor list stored in BPOfferService; instead, the 
> monitor thread periodically tries to start these dead BPServiceActor 
> threads. The interval is also configurable.
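
A bounded retry of the kind described in item 1 might look like the sketch 
below. It is an assumption-laden illustration rather than the patch itself: 
BoundedRetry and its methods are hypothetical names, the default of 5 comes 
from the description, and the message check is loose because the exact 
OutOfMemoryError wording can differ between JDK versions.

```java
import java.util.function.Supplier;

/** Hypothetical helper; not part of Hadoop. */
final class BoundedRetry {
  /** The description states the retry count is 5 by default. */
  static final int DEFAULT_MAX_RETRIES = 5;

  /** True for the "native thread" OutOfMemoryError seen in the stack trace. */
  static boolean isThreadExhaustion(Throwable t) {
    return t instanceof OutOfMemoryError
        && t.getMessage() != null
        && t.getMessage().contains("native thread");
  }

  /**
   * Runs the task, retrying up to maxRetries times when it fails with the
   * non-fatal thread-exhaustion error. Any other OutOfMemoryError, or a
   * failure after the last retry, is rethrown unchanged.
   */
  static <T> T call(Supplier<T> task, int maxRetries, long backoffMillis) {
    int failures = 0;
    while (true) {
      try {
        return task.get();
      } catch (OutOfMemoryError e) {
        if (!isThreadExhaustion(e) || failures >= maxRetries) {
          throw e;
        }
        failures++;
      }
      try {
        // Give the OS a chance to release threads before the next attempt.
        Thread.sleep(backoffMillis);
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException("interrupted while waiting to retry", ie);
      }
    }
  }
}
```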




[jira] [Updated] (HDFS-16115) Asynchronously handle BPServiceActor command mechanism may result in BPServiceActor never fails even CommandProcessingThread is closed with fatal error.

2021-07-06 Thread Daniel Ma (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Ma updated HDFS-16115:
-
Description: 
This is an improvement issue with two sub-issues:

1- The BPServiceActor thread handles commands from the NameNode asynchronously 
(CommandProcessingThread handles the commands). If an exception or error in 
CommandProcessingThread makes that thread fail and stop, BPServiceActor is not 
aware of it and keeps putting commands from the NameNode into the queue for 
CommandProcessingThread to handle, even though CommandProcessingThread is 
already dead.

2- The second sub-issue builds on the first. If CommandProcessingThread died 
owing to a non-fatal error such as "cannot create native thread", which is 
caused by too many threads existing in the OS, the problem should be given 
much more tolerance instead of simply shutting down the thread and never 
recovering automatically, because such non-fatal errors can probably recover 
by themselves soon:
{code:java}
// code placeholder
2021-07-02 16:26:02,315 | WARN  | Command processor | Exception happened when process queue BPServiceActor.java:1393
java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:717)
at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367)
at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.execute(FsDatasetAsyncDiskService.java:180)
at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.deleteAsync(FsDatasetAsyncDiskService.java:229)
at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2315)
at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2237)
at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:752)
at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:698)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processCommand(BPServiceActor.java:1417)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.lambda$enqueue$2(BPServiceActor.java:1463)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processQueue(BPServiceActor.java:1382)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.run(BPServiceActor.java:1365)

{code}
Currently, the DataNode BPServiceActor cannot return to normal even after the 
non-fatal error has been eliminated.

Therefore, this patch does two things:

1- Add a retry mechanism to the BPServiceActor thread and the 
CommandProcessingThread thread; the retry count is 5 by default and 
configurable.

2- Add a periodic monitor thread in BPOfferService. If a BPServiceActor thread 
is dead owing to too many non-fatal errors, it should not simply be removed 
from the BPServiceActor list stored in BPOfferService; instead, the monitor 
thread periodically tries to start these dead BPServiceActor threads. The 
interval is also configurable.

  was:
This is an improvement issue with two sub-issues:

1- The BPServiceActor thread handles commands from the NameNode asynchronously 
(CommandProcessingThread handles the commands). If an exception or error in 
CommandProcessingThread makes that thread fail and stop, BPServiceActor is not 
aware of it and keeps putting commands from the NameNode into the queue for 
CommandProcessingThread to handle, even though CommandProcessingThread is 
already dead.

2- The second sub-issue builds on the first. If CommandProcessingThread died 
owing to a non-fatal error such as "cannot create native thread", which is 
caused by too many threads existing in the OS, the problem should be given 
much more tolerance instead of simply shutting down the thread and never 
recovering automatically, because such non-fatal errors can probably recover 
by themselves soon.

Currently, the DataNode BPServiceActor cannot return to normal even after the 
non-fatal error has been eliminated.

Therefore, this patch does two things:

1- Add a retry mechanism to the BPServiceActor thread and the 
CommandProcessingThread thread; the retry count is 5 by default and 
configurable.

2- Add a periodic monitor thread in BPOfferService. If a BPServiceActor thread 
is dead owing to too many non-fatal errors, it should not simply be removed 
from the BPServiceActor list stored in BPOfferService; instead, the monitor 
thread periodically tries to start these special 

[jira] [Updated] (HDFS-16116) Fix Hadoop FedBalance shell and federationBanance markdown bug.

2021-07-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HDFS-16116:
--
Labels: pull-request-available  (was: )

> Fix Hadoop FedBalance shell and federationBanance markdown bug.
> ---
>
> Key: HDFS-16116
> URL: https://issues.apache.org/jira/browse/HDFS-16116
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rbf
>Reporter: panlijie
>Assignee: panlijie
>Priority: Critical
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Fix Hadoop FedBalance shell and federationBanance markdown bug.






[jira] [Work logged] (HDFS-16116) Fix Hadoop FedBalance shell and federationBanance markdown bug.

2021-07-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16116?focusedWorklogId=619681=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-619681
 ]

ASF GitHub Bot logged work on HDFS-16116:
-

Author: ASF GitHub Bot
Created on: 07/Jul/21 01:00
Start Date: 07/Jul/21 01:00
Worklog Time Spent: 10m 
  Work Description: xiaoxiaopan118 opened a new pull request #3181:
URL: https://github.com/apache/hadoop/pull/3181


   …n bug.
   
   ## NOTICE
   
   Please create an issue in ASF JIRA before opening a pull request,
   and you need to set the title of the pull request which starts with
   the corresponding JIRA issue number. (e.g. HADOOP-X. Fix a typo in YYY.)
   For more details, please see 
https://cwiki.apache.org/confluence/display/HADOOP/How+To+Contribute
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 619681)
Remaining Estimate: 0h
Time Spent: 10m

> Fix Hadoop FedBalance shell and federationBanance markdown bug.
> ---
>
> Key: HDFS-16116
> URL: https://issues.apache.org/jira/browse/HDFS-16116
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rbf
>Reporter: panlijie
>Assignee: panlijie
>Priority: Critical
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Fix Hadoop FedBalance shell and federationBanance markdown bug.






[jira] [Assigned] (HDFS-16116) Fix Hadoop FedBalance shell and federationBanance markdown bug.

2021-07-06 Thread panlijie (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

panlijie reassigned HDFS-16116:
---

Assignee: panlijie

> Fix Hadoop FedBalance shell and federationBanance markdown bug.
> ---
>
> Key: HDFS-16116
> URL: https://issues.apache.org/jira/browse/HDFS-16116
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rbf
>Reporter: panlijie
>Assignee: panlijie
>Priority: Critical
>
> Fix Hadoop FedBalance shell and federationBanance markdown bug.






[jira] [Created] (HDFS-16116) Fix Hadoop FedBalance shell and federationBanance markdown bug.

2021-07-06 Thread panlijie (Jira)
panlijie created HDFS-16116:
---

 Summary: Fix Hadoop FedBalance shell and federationBanance 
markdown bug.
 Key: HDFS-16116
 URL: https://issues.apache.org/jira/browse/HDFS-16116
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: rbf
Reporter: panlijie


Fix Hadoop FedBalance shell and federationBanance markdown bug.






[jira] [Work logged] (HDFS-16114) the balancer parameters print error

2021-07-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16114?focusedWorklogId=619550=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-619550
 ]

ASF GitHub Bot logged work on HDFS-16114:
-

Author: ASF GitHub Bot
Created on: 06/Jul/21 18:44
Start Date: 06/Jul/21 18:44
Worklog Time Spent: 10m 
  Work Description: hadoop-yetus commented on pull request #3179:
URL: https://github.com/apache/hadoop/pull/3179#issuecomment-874997298


   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |   1m  1s |  |  Docker mode activated.  |
   |||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  1s |  |  No case conflicting files found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain any @author tags.  |
   | -1 :x: |  test4tests  |   0m  0s |  |  The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.  |
   |||| _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  37m 13s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   1m 43s |  |  trunk passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  compile  |   1m 25s |  |  trunk passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  checkstyle  |   1m 17s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   1m 40s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   1m 11s |  |  trunk passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  javadoc  |   1m 41s |  |  trunk passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  spotbugs  |   3m 59s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  21m 17s |  |  branch has no errors when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   1m 28s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 32s |  |  the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  javac  |   1m 32s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 22s |  |  the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  javac  |   1m 22s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks issues.  |
   | +1 :green_heart: |  checkstyle  |   1m 16s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   1m 33s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 59s |  |  the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  javadoc  |   1m 38s |  |  the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  spotbugs  |   4m  8s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  21m 12s |  |  patch has no errors when building and testing our client artifacts.  |
   |||| _ Other Tests _ |
   | -1 :x: |  unit  | 373m 50s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3179/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 45s |  |  The patch does not generate ASF License warnings.  |
   |  |   | 478m 53s |  |  |
   
   
   | Reason | Tests |
   |-------:|:------|
   | Failed junit tests | hadoop.hdfs.server.namenode.TestDecommissioningStatusWithBackoffMonitor |
   |   | hadoop.hdfs.server.namenode.ha.TestEditLogTailer |
   |   | hadoop.hdfs.server.namenode.TestDecommissioningStatus |
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3179/1/artifact/out/Dockerfile |
   | GITHUB PR | https://github.com/apache/hadoop/pull/3179 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell |
   | uname | Linux 685d6a45fd47 4.15.0-112-generic #113-Ubuntu SMP Thu Jul 9 23:41:39 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / cec6f8fac0014e0c88f299e3729f9402900ddc7f |
   | Default Java | Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 |
   | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 

[jira] [Updated] (HDFS-16101) Remove unuse variable and IoException in ProvidedStorageMap

2021-07-06 Thread Ayush Saxena (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ayush Saxena updated HDFS-16101:

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

Committed to trunk. Thanks [~lei w] for the contribution!

> Remove unuse variable and IoException in ProvidedStorageMap
> ---
>
> Key: HDFS-16101
> URL: https://issues.apache.org/jira/browse/HDFS-16101
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: lei w
>Assignee: lei w
>Priority: Minor
> Fix For: 3.4.0
>
> Attachments: HDFS-16101.001.patch
>
>
> Remove unuse variable and IoException in ProvidedStorageMap






[jira] [Commented] (HDFS-16101) Remove unuse variable and IoException in ProvidedStorageMap

2021-07-06 Thread Ayush Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17375907#comment-17375907
 ] 

Ayush Saxena commented on HDFS-16101:
-

+1

> Remove unuse variable and IoException in ProvidedStorageMap
> ---
>
> Key: HDFS-16101
> URL: https://issues.apache.org/jira/browse/HDFS-16101
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: lei w
>Assignee: lei w
>Priority: Minor
> Attachments: HDFS-16101.001.patch
>
>
> Remove unuse variable and IoException in ProvidedStorageMap






[jira] [Work logged] (HDFS-16114) the balancer parameters print error

2021-07-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16114?focusedWorklogId=619465=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-619465
 ]

ASF GitHub Bot logged work on HDFS-16114:
-

Author: ASF GitHub Bot
Created on: 06/Jul/21 16:38
Start Date: 06/Jul/21 16:38
Worklog Time Spent: 10m 
  Work Description: hemanthboyina edited a comment on pull request #3179:
URL: https://github.com/apache/hadoop/pull/3179#issuecomment-874913168


   +1 LGTM, will wait for jenkins




Issue Time Tracking
---

Worklog Id: (was: 619465)
Time Spent: 40m  (was: 0.5h)

> the balancer parameters print error
> ---
>
> Key: HDFS-16114
> URL: https://issues.apache.org/jira/browse/HDFS-16114
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: jiaguodong
>Priority: Minor
>  Labels: balancer, pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> public String toString() {
>  return String.format("%s.%s [%s," + " threshold = %s,"
>  + " max idle iteration = %s," + " #excluded nodes = %s,"
>  + " #included nodes = %s," + " #source nodes = %s,"
>  + " #blockpools = %s," + " run during upgrade = %s,"
>  {color:#FF}+ " hot block time interval = %s]"{color}
> {color:#FF} + " sort top nodes = %s",{color}
>  Balancer.class.getSimpleName(), getClass().getSimpleName(), policy,
>  threshold, maxIdleIteration, excludedNodes.size(),
>  includedNodes.size(), sourceNodes.size(), blockpools.size(),
>  runDuringUpgrade, {color:#FF}sortTopNodes, hotBlockTimeInterval{color});
> }
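> The output is printed incorrectly: the labels and the arguments are in 
> different orders, so "hot block time interval" shows sortTopNodes and 
> "sort top nodes" shows hotBlockTimeInterval, and the closing "]" appears 
> before the last field.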
>  
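
One way to make the output line up is to keep the labels and the arguments 
in the same order and move the closing bracket to the end. The sketch below 
only covers the last three fields and is not the change in PR #3179: 
BalancerParamsFormatSketch is a hypothetical class, and the field types 
(boolean, boolean, long) are assumptions.

```java
/** Hypothetical sketch; not the real BalancerParameters class. */
final class BalancerParamsFormatSketch {
  /**
   * Formats the three trailing fields so that each %s receives the argument
   * named by its label, with the closing bracket after the last field.
   */
  static String describe(boolean runDuringUpgrade, boolean sortTopNodes,
      long hotBlockTimeInterval) {
    return String.format(
        "[run during upgrade = %s," + " sort top nodes = %s,"
            + " hot block time interval = %s]",
        runDuringUpgrade, sortTopNodes, hotBlockTimeInterval);
  }
}
```

Swapping the two labels instead of the two arguments would work equally 
well; what matters is that both lists follow the same order.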






[jira] [Work logged] (HDFS-16114) the balancer parameters print error

2021-07-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16114?focusedWorklogId=619462=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-619462
 ]

ASF GitHub Bot logged work on HDFS-16114:
-

Author: ASF GitHub Bot
Created on: 06/Jul/21 16:37
Start Date: 06/Jul/21 16:37
Worklog Time Spent: 10m 
  Work Description: hemanthboyina commented on pull request #3179:
URL: https://github.com/apache/hadoop/pull/3179#issuecomment-874913168


   +1 LGTM




Issue Time Tracking
---

Worklog Id: (was: 619462)
Time Spent: 0.5h  (was: 20m)

> the balancer parameters print error
> ---
>
> Key: HDFS-16114
> URL: https://issues.apache.org/jira/browse/HDFS-16114
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: jiaguodong
>Priority: Minor
>  Labels: balancer, pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> public String toString() {
>  return String.format("%s.%s [%s," + " threshold = %s,"
>  + " max idle iteration = %s," + " #excluded nodes = %s,"
>  + " #included nodes = %s," + " #source nodes = %s,"
>  + " #blockpools = %s," + " run during upgrade = %s,"
>  {color:#FF}+ " hot block time interval = %s]"{color}
> {color:#FF} + " sort top nodes = %s",{color}
>  Balancer.class.getSimpleName(), getClass().getSimpleName(), policy,
>  threshold, maxIdleIteration, excludedNodes.size(),
>  includedNodes.size(), sourceNodes.size(), blockpools.size(),
>  runDuringUpgrade, {color:#FF}sortTopNodes, hotBlockTimeInterval{color});
> }
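> The output is printed incorrectly: the labels and the arguments are in 
> different orders, so "hot block time interval" shows sortTopNodes and 
> "sort top nodes" shows hotBlockTimeInterval, and the closing "]" appears 
> before the last field.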
>  






[jira] [Work logged] (HDFS-16088) Standby NameNode process getLiveDatanodeStorageReport request to reduce Active load

2021-07-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16088?focusedWorklogId=619390=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-619390
 ]

ASF GitHub Bot logged work on HDFS-16088:
-

Author: ASF GitHub Bot
Created on: 06/Jul/21 14:41
Start Date: 06/Jul/21 14:41
Worklog Time Spent: 10m 
  Work Description: tomscut edited a comment on pull request #3140:
URL: https://github.com/apache/hadoop/pull/3140#issuecomment-874820122


   This failed unit test `TestDecommissioningStatusWithBackoffMonitor` is 
unrelated to the change. I filed a JIRA 
[HDFS-16112](https://issues.apache.org/jira/browse/HDFS-16112).




Issue Time Tracking
---

Worklog Id: (was: 619390)
Time Spent: 3h 50m  (was: 3h 40m)

> Standby NameNode process getLiveDatanodeStorageReport request to reduce 
> Active load
> ---
>
> Key: HDFS-16088
> URL: https://issues.apache.org/jira/browse/HDFS-16088
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: tomscut
>Assignee: tomscut
>Priority: Major
>  Labels: pull-request-available
> Attachments: standyby-ipcserver.jpg
>
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> As with HDFS-13183, NameNodeConnector#getLiveDatanodeStorageReport() can also 
> send its request to the SNN to reduce the load on the ANN.
> There are two points that need to be mentioned:
>  1. FSNamesystem#getDatanodeStorageReport() is OperationCategory.UNCHECKED, 
> so we can access SNN directly.
>  2. We can share the same UT(testBalancerRequestSBNWithHA) with 
> NameNodeConnector#getBlocks().
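
The idea of asking a Standby NameNode first and falling back to the Active 
one can be sketched as below. This is a generic illustration under stated 
assumptions, not NameNodeConnector code: StandbyFirstSketch and Endpoint are 
hypothetical, and a null report merely simulates an unavailable node.

```java
import java.util.List;

/** Hypothetical sketch of "standby first, then active"; not HDFS code. */
final class StandbyFirstSketch {
  /** Hypothetical endpoint; a null report simulates an unavailable node. */
  static final class Endpoint {
    final boolean standby;
    final String report;

    Endpoint(boolean standby, String report) {
      this.standby = standby;
      this.report = report;
    }

    String fetchReport() {
      if (report == null) {
        throw new IllegalStateException("endpoint unavailable");
      }
      return report;
    }
  }

  /** Tries every standby endpoint first, then falls back to an active one. */
  static String fetchReport(List<Endpoint> endpoints) {
    for (Endpoint e : endpoints) {
      if (e.standby) {
        try {
          return e.fetchReport();
        } catch (IllegalStateException ignored) {
          // This standby is unavailable; try the next candidate.
        }
      }
    }
    for (Endpoint e : endpoints) {
      if (!e.standby) {
        return e.fetchReport();
      }
    }
    throw new IllegalStateException("no NameNode endpoint available");
  }
}
```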






[jira] [Work logged] (HDFS-16088) Standby NameNode process getLiveDatanodeStorageReport request to reduce Active load

2021-07-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16088?focusedWorklogId=619389=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-619389
 ]

ASF GitHub Bot logged work on HDFS-16088:
-

Author: ASF GitHub Bot
Created on: 06/Jul/21 14:40
Start Date: 06/Jul/21 14:40
Worklog Time Spent: 10m 
  Work Description: tomscut commented on pull request #3140:
URL: https://github.com/apache/hadoop/pull/3140#issuecomment-874820122


   This failed unit test is unrelated to the change. I filed a JIRA 
[HDFS-16112](https://issues.apache.org/jira/browse/HDFS-16112).




Issue Time Tracking
---

Worklog Id: (was: 619389)
Time Spent: 3h 40m  (was: 3.5h)

> Standby NameNode process getLiveDatanodeStorageReport request to reduce 
> Active load
> ---
>
> Key: HDFS-16088
> URL: https://issues.apache.org/jira/browse/HDFS-16088
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: tomscut
>Assignee: tomscut
>Priority: Major
>  Labels: pull-request-available
> Attachments: standyby-ipcserver.jpg
>
>  Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> As with HDFS-13183, NameNodeConnector#getLiveDatanodeStorageReport() can also 
> send its request to the SNN to reduce the load on the ANN.
> There are two points that need to be mentioned:
>  1. FSNamesystem#getDatanodeStorageReport() is OperationCategory.UNCHECKED, 
> so we can access SNN directly.
>  2. We can share the same UT(testBalancerRequestSBNWithHA) with 
> NameNodeConnector#getBlocks().



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16095) Add lsQuotaList command and getQuotaListing api for hdfs quota

2021-07-06 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17375756#comment-17375756
 ] 

Xiaoqiao He commented on HDFS-16095:


[~zhuxiangyi] Thanks for involving me here. 
{quote}It has a potential to hold the fsn/fsd lock for a long time and cause 
service outage or delays.{quote}
[~kihwal] has left a comment on the GitHub PR, and I totally support it. It is 
fatal for the NameNode to hold the global lock for a long time. I have heard 
of many cases where the NameNode went out of service because 
quotaUsage/count/du was invoked on a huge path. So we should avoid adding more 
heavy operations to the NameNode before a fine-grained locking solution is 
ready. Thanks again.

> Add lsQuotaList command and getQuotaListing api for hdfs quota
> --
>
> Key: HDFS-16095
> URL: https://issues.apache.org/jira/browse/HDFS-16095
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Affects Versions: 3.4.0
>Reporter: Xiangyi Zhu
>Assignee: Xiangyi Zhu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Currently hdfs does not support obtaining all quota information. The 
> administrator may need to check which quotas have been added to a certain 
> directory, or the quotas of the entire cluster.






[jira] [Commented] (HDFS-16115) Asynchronously handle BPServiceActor command mechanism may result in BPServiceActor never fails even CommandProcessingThread is closed with fatal error.

2021-07-06 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17375741#comment-17375741
 ] 

Xiaoqiao He commented on HDFS-16115:


[~Daniel Ma] Thanks for your report. I am not sure whether HDFS-15651 could solve 
this issue. IMO, if this thread hits an error, the DataNode process will exit too. 
Could you offer some more information or logs for this issue? Thanks.

> Asynchronously handle BPServiceActor command mechanism may result in 
> BPServiceActor never fails even CommandProcessingThread is closed with fatal 
> error.
> 
>
> Key: HDFS-16115
> URL: https://issues.apache.org/jira/browse/HDFS-16115
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.3.1
>Reporter: Daniel Ma
>Priority: Critical
> Fix For: 3.3.1
>
> Attachments: 0001-HDFS-16115.patch
>
>
> It is an improvement issue. The issue actually has two sub-issues:
> 1- The BPServiceActor thread handles commands from the NameNode asynchronously 
> (CommandProcessingThread handles the commands). If an exception or error in 
> CommandProcessingThread causes that thread to fail and stop, BPServiceActor 
> is not aware of it and keeps putting commands from the NameNode into the queue 
> to be handled by CommandProcessingThread, although that thread is already dead.
> 2- The second sub-issue builds on the first one. If CommandProcessingThread 
> died owing to a non-fatal error such as "can not create native thread", which 
> is caused by too many threads existing in the OS, this kind of problem should 
> be given much more tolerance instead of simply shutting down the thread and 
> never recovering automatically, because such non-fatal errors can probably 
> recover by themselves soon.
> Currently, the DataNode BPServiceActor cannot return to normal even after the 
> non-fatal error has been eliminated.
> Therefore, in this patch, two things will be done:
> 1- Add a retry mechanism to the BPServiceActor thread and the 
> CommandProcessingThread thread; the retry count is 5 by default and 
> configurable.
> 2- Add a periodic monitor thread to BPOfferService. If a BPServiceActor 
> thread is dead owing to too many non-fatal errors, it should not simply be 
> removed from the BPServiceActor list stored in BPOfferService; instead, 
> the monitor thread will periodically try to start these dead 
> BPServiceActor threads. The interval is also configurable.
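
The retry part of the proposal quoted above can be illustrated with a minimal 
Java sketch. It is not taken from 0001-HDFS-16115.patch: the class BoundedRetry, 
the exception NonFatalException and the constant DEFAULT_MAX_RETRIES are 
hypothetical names, and only the default of 5 comes from the description.

```java
import java.util.concurrent.Callable;

/** Hypothetical sketch of a bounded retry loop; not code from the patch. */
final class BoundedRetry {
  /** Assumed default, mirroring the "5 by default" in the description. */
  static final int DEFAULT_MAX_RETRIES = 5;

  /** Hypothetical marker for errors that are treated as recoverable. */
  static final class NonFatalException extends Exception {
    NonFatalException(String message) {
      super(message);
    }
  }

  /**
   * Runs the task and retries it only when it throws NonFatalException.
   * Returns the number of attempts used. Once maxRetries retries have been
   * spent, the last NonFatalException is rethrown to the caller.
   */
  static int runWithRetry(Callable<Void> task, int maxRetries) throws Exception {
    int attempts = 0;
    while (true) {
      attempts++;
      try {
        task.call();
        return attempts;
      } catch (NonFatalException e) {
        if (attempts > maxRetries) {
          // Give up here; a separate monitor could restart the worker later.
          throw e;
        }
      }
    }
  }
}
```

Any other exception type propagates immediately, which keeps fatal errors from 
being retried. How the real patch classifies errors as fatal or non-fatal is not 
shown in this message.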






[jira] [Work logged] (HDFS-16088) Standby NameNode process getLiveDatanodeStorageReport request to reduce Active load

2021-07-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16088?focusedWorklogId=619376=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-619376
 ]

ASF GitHub Bot logged work on HDFS-16088:
-

Author: ASF GitHub Bot
Created on: 06/Jul/21 14:07
Start Date: 06/Jul/21 14:07
Worklog Time Spent: 10m 
  Work Description: hadoop-yetus commented on pull request #3140:
URL: https://github.com/apache/hadoop/pull/3140#issuecomment-874792521


   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |   0m 46s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 1 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  33m 18s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   1m 23s |  |  trunk passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  compile  |   1m 14s |  |  trunk passed with JDK 
Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  checkstyle  |   1m  3s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   1m 23s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   0m 54s |  |  trunk passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  javadoc  |   1m 28s |  |  trunk passed with JDK 
Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  spotbugs  |   3m 21s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  18m 55s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   1m 18s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 18s |  |  the patch passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  javac  |   1m 18s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 11s |  |  the patch passed with JDK 
Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  javac  |   1m 11s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   0m 54s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   1m 17s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 49s |  |  the patch passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  javadoc  |   1m 19s |  |  the patch passed with JDK 
Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  spotbugs  |   3m 22s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  18m 42s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | -1 :x: |  unit  | 360m 59s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3140/6/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 54s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 453m  3s |  |  |
   
   
   | Reason | Tests |
   |-------:|:------|
   | Failed junit tests | 
hadoop.hdfs.server.namenode.TestDecommissioningStatusWithBackoffMonitor |
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3140/6/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/3140 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell |
   | uname | Linux 9663799cf29b 4.15.0-136-generic #140-Ubuntu SMP Thu Jan 28 
05:20:47 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / d80028d40add35d007191d18a972441307f7a814 |
   | Default Java | Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 |
   | Multi-JDK versions | 
/usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 
/usr/lib/jvm/java-8-openjdk-amd64:Private 
Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 |
   |  Test Results | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3140/6/testReport/ |
   | Max. process+thread count | 1956 (vs. ulimit of 5500) |
   | modules | C: 

[jira] [Work logged] (HDFS-16087) RBF balance process is stuck at DisableWrite stage

2021-07-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16087?focusedWorklogId=619342=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-619342
 ]

ASF GitHub Bot logged work on HDFS-16087:
-

Author: ASF GitHub Bot
Created on: 06/Jul/21 13:06
Start Date: 06/Jul/21 13:06
Worklog Time Spent: 10m 
  Work Description: hadoop-yetus commented on pull request #3141:
URL: https://github.com/apache/hadoop/pull/3141#issuecomment-874743045


   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |   1m  3s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  1s |  |  codespell was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 2 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +0 :ok: |  mvndep  |  12m 44s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  25m  4s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |  27m 37s |  |  trunk passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  compile  |  24m  6s |  |  trunk passed with JDK 
Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  checkstyle  |   4m 29s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   1m 43s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   1m 35s |  |  trunk passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  javadoc  |   1m 52s |  |  trunk passed with JDK 
Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  spotbugs  |   2m 34s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  19m 28s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 27s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   1m  5s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  26m 47s |  |  the patch passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  javac  |  26m 47s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  23m  1s |  |  the patch passed with JDK 
Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  javac  |  23m  1s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | -0 :warning: |  checkstyle  |   4m 28s | 
[/results-checkstyle-root.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3141/3/artifact/out/results-checkstyle-root.txt)
 |  root: The patch generated 41 new + 1 unchanged - 0 fixed = 42 total (was 1) 
 |
   | +1 :green_heart: |  mvnsite  |   1m 36s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   1m 28s |  |  the patch passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  javadoc  |   1m 52s |  |  the patch passed with JDK 
Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  spotbugs  |   2m 55s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  18m 39s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | -1 :x: |  unit  |  20m 53s | 
[/patch-unit-hadoop-tools_hadoop-federation-balance.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3141/3/artifact/out/patch-unit-hadoop-tools_hadoop-federation-balance.txt)
 |  hadoop-federation-balance in the patch passed.  |
   | -1 :x: |  unit  |  40m 22s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs-rbf.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3141/3/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs-rbf.txt)
 |  hadoop-hdfs-rbf in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   1m  8s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 271m  6s |  |  |
   
   
   | Reason | Tests |
   |-------:|:------|
   | Failed junit tests | 
hadoop.tools.fedbalance.procedure.TestBalanceProcedureScheduler |
   |   | hadoop.hdfs.rbfbalance.TestRouterDistCpProcedure |
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3141/3/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/3141 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell |
   | uname | Linux 

[jira] [Work logged] (HDFS-16087) RBF balance process is stuck at DisableWrite stage

2021-07-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16087?focusedWorklogId=619327=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-619327
 ]

ASF GitHub Bot logged work on HDFS-16087:
-

Author: ASF GitHub Bot
Created on: 06/Jul/21 12:21
Start Date: 06/Jul/21 12:21
Worklog Time Spent: 10m 
  Work Description: hadoop-yetus commented on pull request #3141:
URL: https://github.com/apache/hadoop/pull/3141#issuecomment-874712397


   :confetti_ball: **+1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |   0m 32s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  1s |  |  codespell was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 2 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +0 :ok: |  mvndep  |  12m 33s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  20m 14s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |  23m 44s |  |  trunk passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  compile  |  19m 23s |  |  trunk passed with JDK 
Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  checkstyle  |   3m 49s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   1m 45s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   1m 30s |  |  trunk passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  javadoc  |   1m 47s |  |  trunk passed with JDK 
Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  spotbugs  |   2m 21s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  15m 14s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 27s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   0m 59s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  21m  7s |  |  the patch passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  javac  |  21m  7s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  18m 42s |  |  the patch passed with JDK 
Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  javac  |  18m 42s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | -0 :warning: |  checkstyle  |   3m 50s | 
[/results-checkstyle-root.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3141/5/artifact/out/results-checkstyle-root.txt)
 |  root: The patch generated 39 new + 1 unchanged - 0 fixed = 40 total (was 1) 
 |
   | +1 :green_heart: |  mvnsite  |   1m 40s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   1m 35s |  |  the patch passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  javadoc  |   1m 44s |  |  the patch passed with JDK 
Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  spotbugs  |   2m 49s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  15m 17s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |   6m 50s |  |  hadoop-federation-balance in 
the patch passed.  |
   | +1 :green_heart: |  unit  |  20m 39s |  |  hadoop-hdfs-rbf in the patch 
passed.  |
   | +1 :green_heart: |  asflicense  |   0m 58s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 203m 56s |  |  |
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3141/5/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/3141 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell |
   | uname | Linux 7beea82002b2 4.15.0-136-generic #140-Ubuntu SMP Thu Jan 28 
05:20:47 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 97f2f3502975b60a8304b87bc7e4ca9b9914db0c |
   | Default Java | Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 |
   | Multi-JDK versions | 
/usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 
/usr/lib/jvm/java-8-openjdk-amd64:Private 
Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 |
   |  Test Results | 

[jira] [Work logged] (HDFS-16087) RBF balance process is stuck at DisableWrite stage

2021-07-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16087?focusedWorklogId=619326=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-619326
 ]

ASF GitHub Bot logged work on HDFS-16087:
-

Author: ASF GitHub Bot
Created on: 06/Jul/21 12:19
Start Date: 06/Jul/21 12:19
Worklog Time Spent: 10m 
  Work Description: hadoop-yetus commented on pull request #3141:
URL: https://github.com/apache/hadoop/pull/3141#issuecomment-874710808


   :confetti_ball: **+1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |   0m 40s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  1s |  |  codespell was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 2 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +0 :ok: |  mvndep  |  12m 35s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  22m 10s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |  23m 17s |  |  trunk passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  compile  |  19m  0s |  |  trunk passed with JDK 
Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  checkstyle  |   3m 51s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   1m 29s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   1m 30s |  |  trunk passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  javadoc  |   1m 43s |  |  trunk passed with JDK 
Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  spotbugs  |   2m 19s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  15m 19s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 28s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   1m  9s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  22m 27s |  |  the patch passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  javac  |  22m 27s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  19m 11s |  |  the patch passed with JDK 
Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  javac  |  19m 11s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | -0 :warning: |  checkstyle  |   3m 46s | 
[/results-checkstyle-root.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3141/4/artifact/out/results-checkstyle-root.txt)
 |  root: The patch generated 39 new + 1 unchanged - 0 fixed = 40 total (was 1) 
 |
   | +1 :green_heart: |  mvnsite  |   1m 44s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   1m 35s |  |  the patch passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  javadoc  |   1m 52s |  |  the patch passed with JDK 
Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  spotbugs  |   2m 52s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  16m 51s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |   7m 19s |  |  hadoop-federation-balance in 
the patch passed.  |
   | +1 :green_heart: |  unit  |  21m 57s |  |  hadoop-hdfs-rbf in the patch 
passed.  |
   | +1 :green_heart: |  asflicense  |   0m 59s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 210m 20s |  |  |
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3141/4/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/3141 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell |
   | uname | Linux cd3278928f94 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 
11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 83fa66c987843b8549d997b7213d3f3f3c3e0957 |
   | Default Java | Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 |
   | Multi-JDK versions | 
/usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 
/usr/lib/jvm/java-8-openjdk-amd64:Private 
Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 |
   |  Test Results | 

[jira] [Updated] (HDFS-16100) HA: Improve performance of Standby node transition to Active

2021-07-06 Thread wudeyu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wudeyu updated HDFS-16100:
--
Attachment: HDFS-16100.001.patch

>  HA: Improve performance of Standby node transition to Active
> -
>
> Key: HDFS-16100
> URL: https://issues.apache.org/jira/browse/HDFS-16100
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 3.3.1
>Reporter: wudeyu
>Assignee: wudeyu
>Priority: Major
> Attachments: HDFS-16100.001.patch, HDFS-16100.patch
>
>
> pendingDNMessages in the Standby is used to support processing of postponed 
> block reports. Block reports in pendingDNMessages are handled as follows:
>  # If the GS of a replica is in the future, the Standby Node will process it 
> when the corresponding edit log (e.g. add_block) is loaded.
>  # If a replica is corrupted, the Standby Node will process it while it 
> transitions to Active.
>  # If a DataNode is removed, its corresponding block reports will be removed 
> from pendingDNMessages.
> Obviously, as the number of corrupted replicas grows, the transition takes 
> more time. In our situation, there were 60 million block reports in 
> pendingDNMessages before the transition. Processing the block reports took 
> almost 7 minutes, and it was killed by zkfc. The replica state of most of the 
> block reports was RBW with a wrong GS (less than that of the stored block in 
> the Standby Node).
> In my opinion, the Standby Node could ignore block reports whose replica state 
> is RBW with a wrong GS, because the Active node/DataNode will remove them later.
>  
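
The filtering rule proposed in the description can be written down as a small 
predicate. This is only a sketch under assumptions, not the content of 
HDFS-16100.001.patch: the class StaleRbwFilter, its local ReplicaState enum and 
the method canSkip are hypothetical names chosen for illustration.

```java
/** Hypothetical sketch of the proposed rule; not code from the patch. */
final class StaleRbwFilter {
  /** Local stand-in for the replica states a DataNode can report. */
  enum ReplicaState { FINALIZED, RBW, RWR, RUR, TEMPORARY }

  /**
   * Returns true when a queued block report may be ignored under the rule
   * in the description: the replica is RBW and its generation stamp (GS)
   * is lower than the GS of the block stored on the Standby.
   */
  static boolean canSkip(ReplicaState reported, long reportedGs, long storedGs) {
    return reported == ReplicaState.RBW && reportedGs < storedGs;
  }
}
```

A report whose GS is in the future (reportedGs greater than storedGs) is not 
skipped by this predicate, since the description says such reports are processed 
once the corresponding edit log is loaded.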






[jira] [Work logged] (HDFS-16087) RBF balance process is stuck at DisableWrite stage

2021-07-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16087?focusedWorklogId=619310=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-619310
 ]

ASF GitHub Bot logged work on HDFS-16087:
-

Author: ASF GitHub Bot
Created on: 06/Jul/21 11:51
Start Date: 06/Jul/21 11:51
Worklog Time Spent: 10m 
  Work Description: hadoop-yetus commented on pull request #3141:
URL: https://github.com/apache/hadoop/pull/3141#issuecomment-874693976


   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |   0m 45s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 2 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +0 :ok: |  mvndep  |  12m 45s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  20m 40s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |  22m 39s |  |  trunk passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  compile  |  19m 45s |  |  trunk passed with JDK 
Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  checkstyle  |   3m 50s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   1m 28s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   1m 26s |  |  trunk passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  javadoc  |   1m 44s |  |  trunk passed with JDK 
Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  spotbugs  |   2m 27s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  16m  6s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 29s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   0m 59s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  21m 28s |  |  the patch passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  javac  |  21m 28s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  20m  8s |  |  the patch passed with JDK 
Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  javac  |  20m  8s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | -0 :warning: |  checkstyle  |   3m 46s | 
[/results-checkstyle-root.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3141/2/artifact/out/results-checkstyle-root.txt)
 |  root: The patch generated 41 new + 1 unchanged - 0 fixed = 42 total (was 1) 
 |
   | +1 :green_heart: |  mvnsite  |   1m 29s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   1m 27s |  |  the patch passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  javadoc  |   1m 47s |  |  the patch passed with JDK 
Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  spotbugs  |   2m 51s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  15m 59s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |   7m  0s |  |  hadoop-federation-balance in 
the patch passed.  |
   | +1 :green_heart: |  unit  |  22m  9s |  |  hadoop-hdfs-rbf in the patch 
passed.  |
   | -1 :x: |  asflicense  |   1m  0s | 
[/results-asflicense.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3141/2/artifact/out/results-asflicense.txt)
 |  The patch generated 1 ASF License warnings.  |
   |  |   | 208m  7s |  |  |
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3141/2/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/3141 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell |
   | uname | Linux 7cd7d12d662d 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 
11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / c5486937397250a74b25e7bd7954af3313fa8644 |
   | Default Java | Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 |
   | Multi-JDK versions | 
/usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 

[jira] [Work logged] (HDFS-16110) Remove unused method reportChecksumFailure in DFSClient

2021-07-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16110?focusedWorklogId=619282=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-619282
 ]

ASF GitHub Bot logged work on HDFS-16110:
-

Author: ASF GitHub Bot
Created on: 06/Jul/21 11:42
Start Date: 06/Jul/21 11:42
Worklog Time Spent: 10m 
  Work Description: jojochuang merged pull request #3174:
URL: https://github.com/apache/hadoop/pull/3174


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 619282)
Time Spent: 1.5h  (was: 1h 20m)

> Remove unused method reportChecksumFailure in DFSClient
> ---
>
> Key: HDFS-16110
> URL: https://issues.apache.org/jira/browse/HDFS-16110
> Project: Hadoop HDFS
>  Issue Type: Wish
>Reporter: tomscut
>Assignee: tomscut
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Remove the unused method reportChecksumFailure from DFSClient and fix some 
> code style issues along the way.






[jira] [Work logged] (HDFS-16088) Standby NameNode process getLiveDatanodeStorageReport request to reduce Active load

2021-07-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16088?focusedWorklogId=619221=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-619221
 ]

ASF GitHub Bot logged work on HDFS-16088:
-

Author: ASF GitHub Bot
Created on: 06/Jul/21 11:33
Start Date: 06/Jul/21 11:33
Worklog Time Spent: 10m 
  Work Description: tomscut commented on pull request #3140:
URL: https://github.com/apache/hadoop/pull/3140#issuecomment-874412101








Issue Time Tracking
---

Worklog Id: (was: 619221)
Time Spent: 3h 20m  (was: 3h 10m)

> Standby NameNode process getLiveDatanodeStorageReport request to reduce 
> Active load
> ---
>
> Key: HDFS-16088
> URL: https://issues.apache.org/jira/browse/HDFS-16088
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: tomscut
>Assignee: tomscut
>Priority: Major
>  Labels: pull-request-available
> Attachments: standyby-ipcserver.jpg
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> As with HDFS-13183, NameNodeConnector#getLiveDatanodeStorageReport() can also 
> send its request to the SNN to reduce the load on the ANN.
> Two points need to be mentioned:
>  1. FSNamesystem#getDatanodeStorageReport() is OperationCategory.UNCHECKED, 
> so we can access the SNN directly.
>  2. We can share the same UT (testBalancerRequestSBNWithHA) with 
> NameNodeConnector#getBlocks().






[jira] [Work logged] (HDFS-16114) the balancer parameters print error

2021-07-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16114?focusedWorklogId=619211=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-619211
 ]

ASF GitHub Bot logged work on HDFS-16114:
-

Author: ASF GitHub Bot
Created on: 06/Jul/21 11:32
Start Date: 06/Jul/21 11:32
Worklog Time Spent: 10m 
  Work Description: JiaguodongF opened a new pull request #3179:
URL: https://github.com/apache/hadoop/pull/3179


   public String toString() {
   return String.format("%s.%s [%s," + " threshold = %s,"
   + " max idle iteration = %s," + " #excluded nodes = %s,"
   + " #included nodes = %s," + " #source nodes = %s,"
   + " #blockpools = %s," + " run during upgrade = %s,"
   + " hot block time interval = %s]"
   + " sort top nodes = %s",
   Balancer.class.getSimpleName(), getClass().getSimpleName(), policy,
   threshold, maxIdleIteration, excludedNodes.size(),
   includedNodes.size(), sourceNodes.size(), blockpools.size(),
   runDuringUpgrade, sortTopNodes, hotBlockTimeInterval);
   }
   
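   The parameters are printed incorrectly: the last two format placeholders 
   and their arguments are in different orders.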
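
For illustration, here is a minimal, self-contained sketch of the kind of mismatch shown in the toString() above. The class name FormatOrderDemo and the two constant values are hypothetical and not taken from Balancer; the reordered version is only one possible correction, not necessarily what pull request #3179 does.

```java
public class FormatOrderDemo {
    // Hypothetical stand-ins for the two parameters named in the report.
    static final boolean SORT_TOP_NODES = true;
    static final long HOT_BLOCK_TIME_INTERVAL = 30L;

    // Placeholders and arguments are in different orders, so each value is
    // printed under the other parameter's label. The closing bracket also
    // ends up before the last parameter.
    static String mismatched() {
        return String.format("[hot block time interval = %s] sort top nodes = %s",
            SORT_TOP_NODES, HOT_BLOCK_TIME_INTERVAL);
    }

    // Arguments follow the order of the placeholders, and the bracket closes
    // after the last parameter.
    static String matched() {
        return String.format("[hot block time interval = %s, sort top nodes = %s]",
            HOT_BLOCK_TIME_INTERVAL, SORT_TOP_NODES);
    }

    public static void main(String[] args) {
        System.out.println(mismatched());
        System.out.println(matched());
    }
}
```

String.format binds arguments to placeholders purely by position, so the compiler cannot catch this kind of swap when every placeholder is %s.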
   




Issue Time Tracking
---

Worklog Id: (was: 619211)
Time Spent: 20m  (was: 10m)

> the balancer parameters print error
> ---
>
> Key: HDFS-16114
> URL: https://issues.apache.org/jira/browse/HDFS-16114
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: jiaguodong
>Priority: Minor
>  Labels: balancer, pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> public String toString() {
>  return String.format("%s.%s [%s," + " threshold = %s,"
>  + " max idle iteration = %s," + " #excluded nodes = %s,"
>  + " #included nodes = %s," + " #source nodes = %s,"
>  + " #blockpools = %s," + " run during upgrade = %s,"
>  {color:#FF}+ " hot block time interval = %s]"{color}
> {color:#FF} + " sort top nodes = %s",{color}
>  Balancer.class.getSimpleName(), getClass().getSimpleName(), policy,
>  threshold, maxIdleIteration, excludedNodes.size(),
>  includedNodes.size(), sourceNodes.size(), blockpools.size(),
>  runDuringUpgrade, {color:#FF}sortTopNodes, hotBlockTimeInterval{color});
> }
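> The parameters are printed incorrectly: the highlighted format placeholders 
> and their arguments are in different orders.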
>  






[jira] [Work logged] (HDFS-16110) Remove unused method reportChecksumFailure in DFSClient

2021-07-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16110?focusedWorklogId=619180=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-619180
 ]

ASF GitHub Bot logged work on HDFS-16110:
-

Author: ASF GitHub Bot
Created on: 06/Jul/21 11:28
Start Date: 06/Jul/21 11:28
Worklog Time Spent: 10m 
  Work Description: tomscut commented on pull request #3174:
URL: https://github.com/apache/hadoop/pull/3174#issuecomment-874394011








Issue Time Tracking
---

Worklog Id: (was: 619180)
Time Spent: 1h 20m  (was: 1h 10m)

> Remove unused method reportChecksumFailure in DFSClient
> ---
>
> Key: HDFS-16110
> URL: https://issues.apache.org/jira/browse/HDFS-16110
> Project: Hadoop HDFS
>  Issue Type: Wish
>Reporter: tomscut
>Assignee: tomscut
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Remove the unused method reportChecksumFailure and fix some code style 
> issues in DFSClient along the way.






[jira] [Work logged] (HDFS-16088) Standby NameNode process getLiveDatanodeStorageReport request to reduce Active load

2021-07-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16088?focusedWorklogId=619157=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-619157
 ]

ASF GitHub Bot logged work on HDFS-16088:
-

Author: ASF GitHub Bot
Created on: 06/Jul/21 11:24
Start Date: 06/Jul/21 11:24
Worklog Time Spent: 10m 
  Work Description: Hexiaoqiao commented on a change in pull request #3140:
URL: https://github.com/apache/hadoop/pull/3140#discussion_r664253973



##
File path: 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/balancer/TestBalancerWithHANameNodes.java
##
@@ -236,4 +241,93 @@ private void testBalancerWithObserver(boolean 
withObserverFailure)
   }
 }
   }
+
+  /**
+   * Comparing the results of getLiveDatanodeStorageReport()
+   * from the active and standby NameNodes,
+   * the results should be the same.
+   */
+  @Test(timeout = 6)
+  public void testGetLiveDatanodeStorageReport() throws Exception {
+Configuration conf = new HdfsConfiguration();
+TestBalancer.initConf(conf);
+assertEquals(TEST_CAPACITIES.length, TEST_RACKS.length);
+NNConf nn1Conf = new MiniDFSNNTopology.NNConf("nn1");
+nn1Conf.setIpcPort(HdfsClientConfigKeys.DFS_NAMENODE_RPC_PORT_DEFAULT);
+Configuration copiedConf = new Configuration(conf);
+// Try capture NameNodeConnector log.
+LogCapturer log =LogCapturer.captureLogs(
+LoggerFactory.getLogger(NameNodeConnector.class));
+// We needs to assert datanode info from ANN and SNN, so the
+// heartbeat should disabled for the duration of method execution
+copiedConf.setInt(DFSConfigKeys.DFS_HEARTBEAT_INTERVAL_KEY, 6);
+cluster = new MiniDFSCluster.Builder(copiedConf)
+.nnTopology(MiniDFSNNTopology.simpleHATopology())
+.numDataNodes(TEST_CAPACITIES.length)
+.racks(TEST_RACKS)
+.simulatedCapacities(TEST_CAPACITIES)
+.build();
+HATestUtil.setFailoverConfigurations(cluster, conf);
+try {
+  cluster.waitActive();
+  cluster.transitionToActive(0);
+  URI namenode = (URI) DFSUtil.getInternalNsRpcUris(conf)
+  .toArray()[0];
+  String nsId = DFSUtilClient.getNameServiceIds(conf)
+  .toArray()[0].toString();
+
+  // request to active namenode
+  NameNodeConnector nncActive = new NameNodeConnector(
+  "nncActive", namenode,
+  nsId, new Path("/test"),
+  null, conf, NameNodeConnector.DEFAULT_MAX_IDLE_ITERATIONS);
+  DatanodeStorageReport[] ldspFromAnn =

Review comment:
   `ldspFromAnn` is not very explicit here, IMO. Would `datanodeStorageReports` 
be clearer?

##
File path: 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/balancer/TestBalancerWithHANameNodes.java
##
@@ -236,4 +241,93 @@ private void testBalancerWithObserver(boolean 
withObserverFailure)
   }
 }
   }
+
+  /**
+   * Comparing the results of getLiveDatanodeStorageReport()
+   * from the active and standby NameNodes,
+   * the results should be the same.
+   */
+  @Test(timeout = 6)
+  public void testGetLiveDatanodeStorageReport() throws Exception {
+Configuration conf = new HdfsConfiguration();
+TestBalancer.initConf(conf);
+assertEquals(TEST_CAPACITIES.length, TEST_RACKS.length);
+NNConf nn1Conf = new MiniDFSNNTopology.NNConf("nn1");
+nn1Conf.setIpcPort(HdfsClientConfigKeys.DFS_NAMENODE_RPC_PORT_DEFAULT);
+Configuration copiedConf = new Configuration(conf);
+// Try capture NameNodeConnector log.
+LogCapturer log =LogCapturer.captureLogs(
+LoggerFactory.getLogger(NameNodeConnector.class));
+// We needs to assert datanode info from ANN and SNN, so the
+// heartbeat should disabled for the duration of method execution
+copiedConf.setInt(DFSConfigKeys.DFS_HEARTBEAT_INTERVAL_KEY, 6);
+cluster = new MiniDFSCluster.Builder(copiedConf)
+.nnTopology(MiniDFSNNTopology.simpleHATopology())
+.numDataNodes(TEST_CAPACITIES.length)
+.racks(TEST_RACKS)
+.simulatedCapacities(TEST_CAPACITIES)
+.build();
+HATestUtil.setFailoverConfigurations(cluster, conf);
+try {
+  cluster.waitActive();
+  cluster.transitionToActive(0);
+  URI namenode = (URI) DFSUtil.getInternalNsRpcUris(conf)
+  .toArray()[0];
+  String nsId = DFSUtilClient.getNameServiceIds(conf)
+  .toArray()[0].toString();
+
+  // request to active namenode

Review comment:
   It is better for a comment to begin with an uppercase character and end 
with a period.





[jira] [Commented] (HDFS-15796) ConcurrentModificationException error happens on NameNode occasionally

2021-07-06 Thread Stephen O'Donnell (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17375453#comment-17375453
 ] 

Stephen O'Donnell commented on HDFS-15796:
--

[~Daniel Ma] I understand how a ConcurrentModificationException occurs, but 
targets and excluded nodes are local variables of the method, and the call to 
getTargets is already synchronized (at least on trunk).

Can you please highlight the line the exception occurs on so we can see the 
exact call which is giving the problem? Thanks.

> ConcurrentModificationException error happens on NameNode occasionally
> --
>
> Key: HDFS-15796
> URL: https://issues.apache.org/jira/browse/HDFS-15796
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.1.1
>Reporter: Daniel Ma
>Priority: Critical
> Attachments: 0001-HDFS-15796.patch
>
>
> ConcurrentModificationException error happens on NameNode occasionally.
>  
> {code:java}
> 2021-01-23 20:21:18,107 | ERROR | RedundancyMonitor | RedundancyMonitor 
> thread received Runtime exception.  | BlockManager.java:4746
> java.util.ConcurrentModificationException
>   at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:909)
>   at java.util.ArrayList$Itr.next(ArrayList.java:859)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:1907)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockReconstructionWork(BlockManager.java:1859)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4862)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$RedundancyMonitor.run(BlockManager.java:4729)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
>  
>  






[jira] [Updated] (HDFS-16114) the balancer parameters print error

2021-07-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HDFS-16114:
--
Labels: balancer pull-request-available  (was: balancer)

> the balancer parameters print error
> ---
>
> Key: HDFS-16114
> URL: https://issues.apache.org/jira/browse/HDFS-16114
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: jiaguodong
>Priority: Minor
>  Labels: balancer, pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> public String toString() {
>  return String.format("%s.%s [%s," + " threshold = %s,"
>  + " max idle iteration = %s," + " #excluded nodes = %s,"
>  + " #included nodes = %s," + " #source nodes = %s,"
>  + " #blockpools = %s," + " run during upgrade = %s,"
>  {color:#FF}+ " hot block time interval = %s]"{color}
> {color:#FF} + " sort top nodes = %s",{color}
>  Balancer.class.getSimpleName(), getClass().getSimpleName(), policy,
>  threshold, maxIdleIteration, excludedNodes.size(),
>  includedNodes.size(), sourceNodes.size(), blockpools.size(),
>  runDuringUpgrade, {color:#FF}sortTopNodes, hotBlockTimeInterval{color});
> }
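> The parameters are printed incorrectly: the highlighted format placeholders 
> and their arguments are in different orders.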
>  






[jira] [Work logged] (HDFS-16114) the balancer parameters print error

2021-07-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16114?focusedWorklogId=618990=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-618990
 ]

ASF GitHub Bot logged work on HDFS-16114:
-

Author: ASF GitHub Bot
Created on: 06/Jul/21 10:18
Start Date: 06/Jul/21 10:18
Worklog Time Spent: 10m 
  Work Description: JiaguodongF opened a new pull request #3179:
URL: https://github.com/apache/hadoop/pull/3179


   public String toString() {
   return String.format("%s.%s [%s," + " threshold = %s,"
   + " max idle iteration = %s," + " #excluded nodes = %s,"
   + " #included nodes = %s," + " #source nodes = %s,"
   + " #blockpools = %s," + " run during upgrade = %s,"
   + " hot block time interval = %s]"
   + " sort top nodes = %s",
   Balancer.class.getSimpleName(), getClass().getSimpleName(), policy,
   threshold, maxIdleIteration, excludedNodes.size(),
   includedNodes.size(), sourceNodes.size(), blockpools.size(),
   runDuringUpgrade, sortTopNodes, hotBlockTimeInterval);
   }
   
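   The parameters are printed incorrectly: the last two format placeholders 
   and their arguments are in different orders.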
   




Issue Time Tracking
---

Worklog Id: (was: 618990)
Remaining Estimate: 0h
Time Spent: 10m

> the balancer parameters print error
> ---
>
> Key: HDFS-16114
> URL: https://issues.apache.org/jira/browse/HDFS-16114
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: jiaguodong
>Priority: Minor
>  Labels: balancer
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> public String toString() {
>  return String.format("%s.%s [%s," + " threshold = %s,"
>  + " max idle iteration = %s," + " #excluded nodes = %s,"
>  + " #included nodes = %s," + " #source nodes = %s,"
>  + " #blockpools = %s," + " run during upgrade = %s,"
>  {color:#FF}+ " hot block time interval = %s]"{color}
> {color:#FF} + " sort top nodes = %s",{color}
>  Balancer.class.getSimpleName(), getClass().getSimpleName(), policy,
>  threshold, maxIdleIteration, excludedNodes.size(),
>  includedNodes.size(), sourceNodes.size(), blockpools.size(),
>  runDuringUpgrade, {color:#FF}sortTopNodes, hotBlockTimeInterval{color});
> }
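> The parameters are printed incorrectly: the highlighted format placeholders 
> and their arguments are in different orders.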
>  






[jira] [Commented] (HDFS-16098) ERROR tools.DiskBalancerCLI: java.lang.IllegalArgumentException

2021-07-06 Thread Daniel Ma (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17375427#comment-17375427
 ] 

Daniel Ma commented on HDFS-16098:
--

[~wangyanfu]

Thanks for reporting this issue.

Could you please share more details about the error stack trace?

> ERROR tools.DiskBalancerCLI: java.lang.IllegalArgumentException
> ---
>
> Key: HDFS-16098
> URL: https://issues.apache.org/jira/browse/HDFS-16098
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: diskbalancer
>Affects Versions: 2.6.0
> Environment: VERSION info:
> Hadoop 2.6.0-cdh5.14.4
>Reporter: wangyanfu
>Priority: Blocker
>  Labels: diskbalancer
> Fix For: 2.6.0
>
> Attachments: image-2021-07-01-18-34-54-905.png, on-branch-3.1.jpg
>
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> When I tried to run 
> hdfs diskbalancer -plan $(hostname -f)
>  
> I got this message:
> 21/06/30 11:30:41 ERROR tools.DiskBalancerCLI: 
> java.lang.IllegalArgumentException
>  
> I then tried writing the real hostname into the command; it did not work and 
> gave the same error.
> I also tried using --plan instead of -plan; it did not work and gave the 
> same error.
> I found this 
> [link|https://community.cloudera.com/t5/Support-Questions/Error-trying-to-balance-disks-on-node/m-p/59989#M54850]
>   but it offers no solution. Can somebody help me?






[jira] [Comment Edited] (HDFS-15796) ConcurrentModificationException error happens on NameNode occasionally

2021-07-06 Thread Daniel Ma (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17375421#comment-17375421
 ] 

Daniel Ma edited comment on HDFS-15796 at 7/6/21, 9:57 AM:
---

[~sodonnell]

Thanks for reviewing. Actually, you missed the for loop here:
{code:java}
// Code placeholder
synchronized (pendingReconstruction) {
  List<DatanodeStorageInfo> targets = pendingReconstruction
      .getTargets(rw.getBlock());
  if (targets != null) {
    for (DatanodeStorageInfo dn : targets) {
      if (!excludedNodes.contains(dn.getDatanodeDescriptor())) {
        excludedNodes.add(dn.getDatanodeDescriptor());
      }
    }
  }
}
{code}
The problem happens when the code above tries to traverse the DataNodes stored 
in the pendingReconstruction object while the DataNode list is also being 
modified elsewhere.

In other words, if you modify a List (delete or add an element) and iterate 
over it at the same time, a ConcurrentModificationException will be thrown.
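
The general behaviour described in this comment can be reproduced in isolation. The sketch below is not HDFS code: the class CmeDemo, its method names, and the "dn1".."dn4" strings are hypothetical, and it uses a single thread for determinism, whereas the report concerns a list changed from elsewhere. Iterating over a copy is shown only as one common way to avoid the exception, not as the approach taken in 0001-HDFS-15796.patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.ConcurrentModificationException;
import java.util.List;

public class CmeDemo {
    // Returns true if structurally modifying an ArrayList while iterating
    // over it makes the fail-fast iterator throw.
    static boolean modifyWhileIterating() {
        List<String> targets = new ArrayList<>(Arrays.asList("dn1", "dn2", "dn3"));
        try {
            for (String dn : targets) {
                if (dn.equals("dn1")) {
                    targets.add("dn4"); // structural change during iteration
                }
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    // Iterates over a private copy, so changes to the original list do not
    // disturb the loop. Returns the number of elements visited.
    static int iterateOverSnapshot() {
        List<String> targets = new ArrayList<>(Arrays.asList("dn1", "dn2", "dn3"));
        int visited = 0;
        for (String dn : new ArrayList<>(targets)) {
            targets.remove(dn);
            visited++;
        }
        return visited;
    }

    public static void main(String[] args) {
        System.out.println(modifyWhileIterating());
        System.out.println(iterateOverSnapshot());
    }
}
```

In a multi-threaded setting the copy itself would have to be taken under the same lock that guards writes to the list; otherwise the copy constructor can observe the list mid-update.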


was (Author: daniel ma):
[~sodonnell]

Thanks for reviewing, Actually you missed the for loop here:
{code:java}
// Code placeholder
synchronized (pendingReconstruction) {
  List<DatanodeStorageInfo> targets = pendingReconstruction
      .getTargets(rw.getBlock());
  if (targets != null) {
    for (DatanodeStorageInfo dn : targets) {
      if (!excludedNodes.contains(dn.getDatanodeDescriptor())) {
        excludedNodes.add(dn.getDatanodeDescriptor());
      }
    }
  }
}
{code}
The problem happens when the code above try to travel the DataNodes stored in 
pendingReconstruction object, while the DataNode list is also be modified 
elsewhere.

In other words, if you modify a List(delete or add an element) and visit it in 
the same time, ConcurrentModificationException will be casted.

> ConcurrentModificationException error happens on NameNode occasionally
> --
>
> Key: HDFS-15796
> URL: https://issues.apache.org/jira/browse/HDFS-15796
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.1.1
>Reporter: Daniel Ma
>Priority: Critical
> Attachments: 0001-HDFS-15796.patch
>
>
> ConcurrentModificationException error happens on NameNode occasionally.
>  
> {code:java}
> 2021-01-23 20:21:18,107 | ERROR | RedundancyMonitor | RedundancyMonitor 
> thread received Runtime exception.  | BlockManager.java:4746
> java.util.ConcurrentModificationException
>   at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:909)
>   at java.util.ArrayList$Itr.next(ArrayList.java:859)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:1907)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockReconstructionWork(BlockManager.java:1859)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4862)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$RedundancyMonitor.run(BlockManager.java:4729)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
>  
>  






[jira] [Commented] (HDFS-15796) ConcurrentModificationException error happens on NameNode occasionally

2021-07-06 Thread Daniel Ma (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17375421#comment-17375421
 ] 

Daniel Ma commented on HDFS-15796:
--

[~sodonnell]

Thanks for reviewing. Actually, you missed the for loop here:
{code:java}
// Code placeholder
synchronized (pendingReconstruction) {
  List<DatanodeStorageInfo> targets = pendingReconstruction
      .getTargets(rw.getBlock());
  if (targets != null) {
    for (DatanodeStorageInfo dn : targets) {
      if (!excludedNodes.contains(dn.getDatanodeDescriptor())) {
        excludedNodes.add(dn.getDatanodeDescriptor());
      }
    }
  }
}
{code}
The problem happens when the code above tries to traverse the DataNodes stored 
in the pendingReconstruction object while the DataNode list is also being 
modified elsewhere.

In other words, if you modify a List (delete or add an element) and iterate 
over it at the same time, a ConcurrentModificationException will be thrown.

> ConcurrentModificationException error happens on NameNode occasionally
> --
>
> Key: HDFS-15796
> URL: https://issues.apache.org/jira/browse/HDFS-15796
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.1.1
>Reporter: Daniel Ma
>Priority: Critical
> Attachments: 0001-HDFS-15796.patch
>
>
> ConcurrentModificationException error happens on NameNode occasionally.
>  
> {code:java}
> 2021-01-23 20:21:18,107 | ERROR | RedundancyMonitor | RedundancyMonitor 
> thread received Runtime exception.  | BlockManager.java:4746
> java.util.ConcurrentModificationException
>   at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:909)
>   at java.util.ArrayList$Itr.next(ArrayList.java:859)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:1907)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockReconstructionWork(BlockManager.java:1859)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4862)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$RedundancyMonitor.run(BlockManager.java:4729)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
>  
>  






[jira] [Updated] (HDFS-16115) Asynchronously handle BPServiceActor command mechanism may result in BPServiceActor never fails even CommandProcessingThread is closed with fatal error.

2021-07-06 Thread Daniel Ma (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Ma updated HDFS-16115:
-
Description: 
It is an improvement issue. The issue actually has two sub-issues:

1- The BPServiceActor thread handles commands from the NameNode asynchronously 
(CommandProcessingThread handles the commands). If any exception or error in 
CommandProcessingThread causes that thread to fail and stop, BPServiceActor is 
not aware of it and keeps putting commands from the NameNode into the queue to 
be handled by CommandProcessingThread, even though CommandProcessingThread is 
already dead.

2- The second sub-issue builds on the first. If CommandProcessingThread dies 
from a non-fatal error such as "can not create native thread", which is caused 
by too many threads existing in the OS, the problem should be given much more 
tolerance instead of simply shutting down the thread and never recovering 
automatically, because such non-fatal errors can probably recover soon by 
themselves.

Currently, the DataNode BPServiceActor cannot return to normal even after the 
non-fatal error has been eliminated.

Therefore, this patch will do two things:

1- Add a retry mechanism to the BPServiceActor thread and the 
CommandProcessingThread thread; the retry count is 5 by default and 
configurable.

2- Add a periodic monitor thread in BPOfferService. If a BPServiceActor thread 
dies from too many non-fatal errors, it should not simply be removed from the 
BPServiceActor list stored in BPOfferService; instead, the monitor thread will 
periodically try to restart these dead BPServiceActor threads. The interval is 
also configurable.
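
As a rough illustration of the first item (a bounded, configurable retry), here is a minimal sketch. It is not taken from 0001-HDFS-16115.patch: the class BoundedRetry, the method names, and the reading of "5" as the total number of attempts are all assumptions. It also treats any Exception as non-fatal, which is a simplification; deciding which failures count as non-fatal is left to the actual patch.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicInteger;

public class BoundedRetry {
    // Assumed default, mirroring "5 by default and configurable".
    static final int DEFAULT_MAX_ATTEMPTS = 5;

    // Calls the task until it succeeds or maxAttempts calls have failed,
    // then rethrows the last failure.
    static <T> T runWithRetry(Callable<T> task, int maxAttempts) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                last = e; // treated as non-fatal in this sketch
            }
        }
        throw last;
    }

    // Demo helper: runs a task that fails `failures` times before succeeding.
    // Returns the number of calls made, or -1 if the attempts ran out.
    static int attemptsUntilSuccess(int failures, int maxAttempts) {
        AtomicInteger calls = new AtomicInteger();
        try {
            Integer result = BoundedRetry.<Integer>runWithRetry(() -> {
                if (calls.incrementAndGet() <= failures) {
                    throw new IllegalStateException("transient failure");
                }
                return calls.get();
            }, maxAttempts);
            return result;
        } catch (Exception e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        System.out.println(attemptsUntilSuccess(2, DEFAULT_MAX_ATTEMPTS));
        System.out.println(attemptsUntilSuccess(10, DEFAULT_MAX_ATTEMPTS));
    }
}
```

A real implementation would probably also wait between attempts. The monitor thread from the second item would sit one level above this, restarting actors whose attempts were exhausted.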

  was:
It is an improvement issue. Actually the issue has two sub issues:

1- BPServerActor thread handle commands from NameNode in aysnchronous way ( 
CommandProcessThread handle commands ), so if there are any exceptions or 
errors happen in thread CommandProcessthread resulting the thread fails and 
stop, of which BPServiceActor cannot aware and still keep putting commands from 
namenode into queues waiting to be handled by CommandProcessThread, actually 
CommandProcessThread was dead already.

2-the second sub issue is based on the first one, if CommandProcessThread was 
dead owing to some non-fatal errors like "can not create native thread" which 
is caused by too many threads existed in OS, this kind of problem should be 
given much more torlerance instead of simply shudown the thread and never 
recover automatically, because the non-fatal errors mentioned above probably 
can be recovered soon by itself,

currently, Datanode BPServiceActor cannot turn to normal even when the 
non-fatal error was eliminated.

Therefore, in this patch, two things will be done:

1-Add retry mechanism in BPServiceActor thread and CommandProcessThread thread 
which is 5 by default and configurable;

2-Add a monitor periodical thread in BPOfferService, if a BPServiceActor thread 
is dead owing to  too many times non-fatal error, it should not be simply 
removed from BPServviceActor lists stored in BPOfferService, instead, the 
monitor thread will periodically try to start these special dead BPService 
Actor thread. the interval is also configurable.


> Asynchronously handle BPServiceActor command mechanism may result in 
> BPServiceActor never fails even CommandProcessingThread is closed with fatal 
> error.
> 
>
> Key: HDFS-16115
> URL: https://issues.apache.org/jira/browse/HDFS-16115
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.3.1
>Reporter: Daniel Ma
>Priority: Critical
> Fix For: 3.3.1
>
> Attachments: 0001-HDFS-16115.patch
>
>
> It is an improvement issue. The issue actually has two sub-issues:
> 1- The BPServiceActor thread handles commands from the NameNode 
> asynchronously (CommandProcessingThread handles the commands). If any 
> exception or error in CommandProcessingThread causes that thread to fail and 
> stop, BPServiceActor is not aware of it and keeps putting commands from the 
> NameNode into the queue to be handled by CommandProcessingThread, even though 
> CommandProcessingThread is already dead.
> 2- The second sub-issue builds on the first. If CommandProcessingThread dies 
> from a non-fatal error such as "can not create native thread", which is 
> caused by too many threads existing in the OS, the problem should be given 
> much more tolerance instead of simply shutting down the thread and never 
> recovering automatically, because such non-fatal errors can probably recover 
> soon by themselves.
> Currently, Datanode BPServiceActor 

[jira] [Updated] (HDFS-16115) Asynchronously handle BPServiceActor command mechanism may result in BPServiceActor never fails even CommandProcessingThread is closed with fatal error.

2021-07-06 Thread Daniel Ma (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Ma updated HDFS-16115:
-
Description: 
It is an improvement issue. The issue actually has two sub-issues:

1- The BPServiceActor thread handles commands from the NameNode asynchronously 
(CommandProcessingThread handles the commands). If any exception or error in 
CommandProcessingThread causes that thread to fail and stop, BPServiceActor is 
not aware of it and keeps putting commands from the NameNode into the queue to 
be handled by CommandProcessingThread, even though CommandProcessingThread is 
already dead.

2- The second sub-issue builds on the first. If CommandProcessingThread dies 
from a non-fatal error such as "can not create native thread", which is caused 
by too many threads existing in the OS, the problem should be given much more 
tolerance instead of simply shutting down the thread and never recovering 
automatically, because such non-fatal errors can probably recover soon by 
themselves.

Currently, the DataNode BPServiceActor cannot return to normal even after the 
non-fatal error has been eliminated.

Therefore, this patch will do two things:

1- Add a retry mechanism to the BPServiceActor thread and the 
CommandProcessingThread thread; the retry count is 5 by default and 
configurable.

2- Add a periodic monitor thread in BPOfferService. If a BPServiceActor thread 
dies from too many non-fatal errors, it should not simply be removed from the 
BPServiceActor list stored in BPOfferService; instead, the monitor thread will 
periodically try to restart these dead BPServiceActor threads. The interval is 
also configurable.

  was:
It is an improvement issue. Actually the issue has two sub issues:

1- BPServerActor thread handle commands from NameNode in aysnchronous way ( 
CommandProcessThread handle commands ), so if there are any exceptions or 
errors happen in thread CommandProcessthread resulting the thread fails and 
stop, of which BPServiceActor cannot aware and still keep putting commands from 
namenode into queues waiting to be handled by CommandProcessThread, actually 
CommandProcessThread was dead already.

2-the second sub issue is based on the first one, if CommandProcessThread was 
dead owing to some non-fatal errors like "can not create native thread" which 
is caused by too many threads existed in OS, this kind of problem should be 
given much more torlerance instead of simply shudown the thread and never 
recover automatically, because the non-fatal errors mentioned above probably 
can be recovered soon by itself,

currently, Datanode BPServiceActor cannot turn to normal even when the 
non-fatal error was eliminated.

Therefore, in this patch, two things will be done:

1-Add retry mechanism in BPServiceActor thread and CommandProcessThread thread 
which is 5 by default and configurable;

2-Add a monitor periodical thread in BPOfferService, if a BPServiceActor thread 
is dead owing to  too many times non-fatal error, it should not be simply 
remove from BPServviceActor lists stored in BPOfferService, instead, the 
monitor thread will periodically try to start these special dead BPService 
Actor thread. the interval is also configurable.


> Asynchronously handle BPServiceActor command mechanism may result in 
> BPServiceActor never fails even CommandProcessingThread is closed with fatal 
> error.
> 
>
> Key: HDFS-16115
> URL: https://issues.apache.org/jira/browse/HDFS-16115
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.3.1
>Reporter: Daniel Ma
>Priority: Critical
> Fix For: 3.3.1
>
> Attachments: 0001-HDFS-16115.patch
>
>
> It is an improvement issue. Actually the issue has two sub issues:
> 1- BPServerActor thread handle commands from NameNode in aysnchronous way ( 
> CommandProcessThread handle commands ), so if there are any exceptions or 
> errors happen in thread CommandProcessthread resulting the thread fails and 
> stop, of which BPServiceActor cannot aware and still keep putting commands 
> from namenode into queues waiting to be handled by CommandProcessThread, 
> actually CommandProcessThread was dead already.
> 2-the second sub issue is based on the first one, if CommandProcessThread was 
> dead owing to some non-fatal errors like "can not create native thread" which 
> is caused by too many threads existed in OS, this kind of problem should be 
> given much more torlerance instead of simply shudown the thread and never 
> recover automatically, because the non-fatal errors mentioned above probably 
> can be recovered soon by itself,
> currently, Datanode BPServiceActor 

[jira] [Updated] (HDFS-16115) Asynchronously handle BPServiceActor command mechanism may result in BPServiceActor never fails even CommandProcessingThread is closed with fatal error.

2021-07-06 Thread Daniel Ma (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Ma updated HDFS-16115:
-
Description: 
This is an improvement issue. It actually consists of two sub-issues:

1- The BPServiceActor thread handles commands from the NameNode asynchronously 
(CommandProcessThread handles the commands). If an exception or error in 
CommandProcessThread makes that thread fail and stop, BPServiceActor is not 
aware of it and keeps putting commands from the NameNode into the queue to be 
handled by CommandProcessThread, although CommandProcessThread is already dead.

2- The second sub-issue builds on the first one. If CommandProcessThread dies 
from a non-fatal error such as "can not create native thread", which is caused 
by too many threads existing in the OS, this kind of problem should be given 
much more tolerance instead of simply shutting down the thread and never 
recovering automatically, because non-fatal errors like this can probably 
recover by themselves soon.

Currently, the DataNode BPServiceActor cannot return to normal even after the 
non-fatal error has been eliminated.

Therefore, this patch does two things:

1- Add a retry mechanism to the BPServiceActor thread and the 
CommandProcessThread thread; the retry limit is 5 by default and configurable.

2- Add a periodic monitor thread in BPOfferService. If a BPServiceActor thread 
dies because of too many non-fatal errors, it should not simply be removed 
from the BPServiceActor list stored in BPOfferService; instead, the monitor 
thread periodically tries to restart these dead BPServiceActor threads. The 
interval is also configurable.
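
The monitor described in item 2 might be sketched in Java as follows. This is only an illustration under stated assumptions, not the code in the attached patch: ActorMonitorSketch and its nested Actor class are hypothetical stand-ins for BPOfferService and BPServiceActor, and "restart" here simply means running the same Runnable on a fresh thread.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for the monitor inside BPOfferService. */
final class ActorMonitorSketch {
    /** Hypothetical stand-in for a BPServiceActor: a restartable worker. */
    static final class Actor {
        private final Runnable body;
        private Thread thread;
        private int startCount;

        Actor(Runnable body) { this.body = body; }

        synchronized void start() {
            thread = new Thread(body, "actor-sketch");
            thread.setDaemon(true);
            thread.start();
            startCount++;
        }

        synchronized boolean isAlive() { return thread != null && thread.isAlive(); }

        synchronized int startCount() { return startCount; }

        /** Waits for the current worker thread to finish (used by callers and tests). */
        void awaitExit() {
            Thread t;
            synchronized (this) { t = thread; }
            if (t == null) return;
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    // Dead actors stay in this list instead of being removed.
    private final List<Actor> actors = new CopyOnWriteArrayList<>();
    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "actor-monitor-sketch");
            t.setDaemon(true);
            return t;
        });

    void register(Actor actor) { actors.add(actor); }

    /** One monitoring pass: restart every dead actor, return how many were restarted. */
    int restartDeadActors() {
        int restarted = 0;
        for (Actor a : actors) {
            if (!a.isAlive()) {
                a.start();
                restarted++;
            }
        }
        return restarted;
    }

    /** Runs a monitoring pass at the given (configurable) interval. */
    void startMonitoring(long intervalMillis) {
        scheduler.scheduleWithFixedDelay(() -> {
            try {
                restartDeadActors();
            } catch (Throwable t) {
                // Swallow so the periodic task is not cancelled; retry on the next pass.
            }
        }, intervalMillis, intervalMillis, TimeUnit.MILLISECONDS);
    }

    void stop() { scheduler.shutdownNow(); }
}
```

The try/catch inside the scheduled task matters: ScheduledExecutorService suppresses all later runs of a periodic task once one run throws, so a restart that fails (for example, because threads are still exhausted) would otherwise silently end the monitoring.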

  was:
It is an improvement issue. Actually the issue has two sub issues:

1- BPServerActor thread handle commands from NameNode in aysnchronous way ( 
CommandProcessThread handle commands ), so if there are any exceptions or 
errors happen in thread CommandProcessthread resulting the thread fails and 
stop, of which BPServiceActor cannot aware and still keep putting commands from 
namenode into queues waiting to be handled by CommandProcessThread, actually 
CommandProcessThread was dead already.

2-the second sub issue is based on the first one, if CommandProcessThread was 
dead owing to some non-fatal errors like "can not create native thread" which 
is caused by too many threads existed in OS, this kind of problem should be 
given much more torlerance instead of simply shudown the thread and never 
recover automatically, because the non-fatal errors mentioned above probably 
can be recovered soon by itself,

currently, Datanode BPServiceActor cannot turn to normal even when the 
non-fatal error was eliminated.

Therefor, in this patch, two things was be done:

1-Add retry mechanism in BPServiceActor thread and CommandProcessThread thread 
which is 5 by default and configurable;

2-Add a monitor periodical thread in BPOfferService, if a BPServiceActor thread 
is dead owing to  too many times non-fatal error, it should not be simply 
remove from BPServviceActor lists stored in BPOfferService, instead, the 
monitor thread will periodically try to start these special dead BPService 
Actor thread. the interval is also configurable.


> Asynchronously handle BPServiceActor command mechanism may result in 
> BPServiceActor never fails even CommandProcessingThread is closed with fatal 
> error.
> 
>
> Key: HDFS-16115
> URL: https://issues.apache.org/jira/browse/HDFS-16115
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.3.1
>Reporter: Daniel Ma
>Priority: Critical
> Fix For: 3.3.1
>
> Attachments: 0001-HDFS-16115.patch
>
>
> It is an improvement issue. Actually the issue has two sub issues:
> 1- BPServerActor thread handle commands from NameNode in aysnchronous way ( 
> CommandProcessThread handle commands ), so if there are any exceptions or 
> errors happen in thread CommandProcessthread resulting the thread fails and 
> stop, of which BPServiceActor cannot aware and still keep putting commands 
> from namenode into queues waiting to be handled by CommandProcessThread, 
> actually CommandProcessThread was dead already.
> 2-the second sub issue is based on the first one, if CommandProcessThread was 
> dead owing to some non-fatal errors like "can not create native thread" which 
> is caused by too many threads existed in OS, this kind of problem should be 
> given much more torlerance instead of simply shudown the thread and never 
> recover automatically, because the non-fatal errors mentioned above probably 
> can be recovered soon by itself,
> currently, Datanode BPServiceActor 

[jira] [Updated] (HDFS-16115) Asynchronously handle BPServiceActor command mechanism may result in BPServiceActor never fails even CommandProcessingThread is closed with fatal error.

2021-07-06 Thread Daniel Ma (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Ma updated HDFS-16115:
-
Description: 
This is an improvement issue. It actually consists of two sub-issues:

1- The BPServiceActor thread handles commands from the NameNode asynchronously 
(CommandProcessThread handles the commands). If an exception or error in 
CommandProcessThread makes that thread fail and stop, BPServiceActor is not 
aware of it and keeps putting commands from the NameNode into the queue to be 
handled by CommandProcessThread, although CommandProcessThread is already dead.

2- The second sub-issue builds on the first one. If CommandProcessThread dies 
from a non-fatal error such as "can not create native thread", which is caused 
by too many threads existing in the OS, this kind of problem should be given 
much tolerance instead of simply shutting down the thread and never 
recovering automatically, because non-fatal errors like this can probably 
recover by themselves soon.

Currently, the DataNode BPServiceActor cannot return to normal even after the 
non-fatal error has been eliminated.

Therefore, this patch does two things:

1- Add a retry mechanism to the BPServiceActor thread and the 
CommandProcessThread thread; the retry limit is 5 by default and configurable.

2- Add a periodic monitor thread in BPOfferService. If a BPServiceActor thread 
dies because of too many non-fatal errors, it should not simply be removed 
from the BPServiceActor list stored in BPOfferService; instead, the monitor 
thread periodically tries to restart these dead BPServiceActor threads. The 
interval is also configurable.

  was:
It is an improvement issue. Actually the issue has two sub issues:

1- BPServerActor thread handle commands from NameNode in aysnchronous way ( 
CommandProcessThread handle commands ), so if there are any exceptions or 
errors happen in thread CommandProcessthread resulting the thread fails and 
stop, of which BPServiceActor cannot aware and still keep putting commands from 
namenode into queues waiting to be handled by CommandProcessThread, actually 
CommandProcessThread was dead already.

2-the second sub issue is based on the first one, if CommandProcessThread was 
dead owing to some non-fatal errors like "can not create native thread" which 
is caused by too many threads existed on the node, this kind of problem should 
be given much torlerance instead of simply shudown the thread and never recover 
automatically, because the non-fatal errors mentioned above probably can be 
recovered soon by itself,

currently, Datanode BPServiceActor cannot turn to normal even when the 
non-fatal error was eliminated.

Therefor, in this patch, two things was be done:

1-Add retry mechanism in BPServiceActor thread and CommandProcessThread thread 
which is 5 by default and configurable;

2-Add a monitor periodical thread in BPOfferService, if a BPServiceActor thread 
is dead owing to  too many times non-fatal error, it should not be simply 
remove from BPServviceActor lists stored in BPOfferService, instead, the 
monitor thread will periodically try to start these special dead BPService 
Actor thread. the interval is also configurable.


> Asynchronously handle BPServiceActor command mechanism may result in 
> BPServiceActor never fails even CommandProcessingThread is closed with fatal 
> error.
> 
>
> Key: HDFS-16115
> URL: https://issues.apache.org/jira/browse/HDFS-16115
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.3.1
>Reporter: Daniel Ma
>Priority: Critical
> Fix For: 3.3.1
>
> Attachments: 0001-HDFS-16115.patch
>
>
> It is an improvement issue. Actually the issue has two sub issues:
> 1- BPServerActor thread handle commands from NameNode in aysnchronous way ( 
> CommandProcessThread handle commands ), so if there are any exceptions or 
> errors happen in thread CommandProcessthread resulting the thread fails and 
> stop, of which BPServiceActor cannot aware and still keep putting commands 
> from namenode into queues waiting to be handled by CommandProcessThread, 
> actually CommandProcessThread was dead already.
> 2-the second sub issue is based on the first one, if CommandProcessThread was 
> dead owing to some non-fatal errors like "can not create native thread" which 
> is caused by too many threads existed in OS, this kind of problem should be 
> given much torlerance instead of simply shudown the thread and never recover 
> automatically, because the non-fatal errors mentioned above probably can be 
> recovered soon by itself,
> currently, Datanode BPServiceActor cannot turn to 

[jira] [Updated] (HDFS-16115) Asynchronously handle BPServiceActor command mechanism may result in BPServiceActor never fails even CommandProcessingThread is closed with fatal error.

2021-07-06 Thread Daniel Ma (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Ma updated HDFS-16115:
-
Description: 
This is an improvement issue. It actually consists of two sub-issues:

1- The BPServiceActor thread handles commands from the NameNode asynchronously 
(CommandProcessThread handles the commands). If an exception or error in 
CommandProcessThread makes that thread fail and stop, BPServiceActor is not 
aware of it and keeps putting commands from the NameNode into the queue to be 
handled by CommandProcessThread, although CommandProcessThread is already dead.

2- The second sub-issue builds on the first one. If CommandProcessThread dies 
from a non-fatal error such as "can not create native thread", which is caused 
by too many threads existing in the OS, this kind of problem should be given 
much more tolerance instead of simply shutting down the thread and never 
recovering automatically, because non-fatal errors like this can probably 
recover by themselves soon.

Currently, the DataNode BPServiceActor cannot return to normal even after the 
non-fatal error has been eliminated.

Therefore, this patch does two things:

1- Add a retry mechanism to the BPServiceActor thread and the 
CommandProcessThread thread; the retry limit is 5 by default and configurable.

2- Add a periodic monitor thread in BPOfferService. If a BPServiceActor thread 
dies because of too many non-fatal errors, it should not simply be removed 
from the BPServiceActor list stored in BPOfferService; instead, the monitor 
thread periodically tries to restart these dead BPServiceActor threads. The 
interval is also configurable.

  was:
It is an improvement issue. Actually the issue has two sub issues:

1- BPServerActor thread handle commands from NameNode in aysnchronous way ( 
CommandProcessThread handle commands ), so if there are any exceptions or 
errors happen in thread CommandProcessthread resulting the thread fails and 
stop, of which BPServiceActor cannot aware and still keep putting commands from 
namenode into queues waiting to be handled by CommandProcessThread, actually 
CommandProcessThread was dead already.

2-the second sub issue is based on the first one, if CommandProcessThread was 
dead owing to some non-fatal errors like "can not create native thread" which 
is caused by too many threads existed in OS, this kind of problem should be 
given much torlerance instead of simply shudown the thread and never recover 
automatically, because the non-fatal errors mentioned above probably can be 
recovered soon by itself,

currently, Datanode BPServiceActor cannot turn to normal even when the 
non-fatal error was eliminated.

Therefor, in this patch, two things was be done:

1-Add retry mechanism in BPServiceActor thread and CommandProcessThread thread 
which is 5 by default and configurable;

2-Add a monitor periodical thread in BPOfferService, if a BPServiceActor thread 
is dead owing to  too many times non-fatal error, it should not be simply 
remove from BPServviceActor lists stored in BPOfferService, instead, the 
monitor thread will periodically try to start these special dead BPService 
Actor thread. the interval is also configurable.


> Asynchronously handle BPServiceActor command mechanism may result in 
> BPServiceActor never fails even CommandProcessingThread is closed with fatal 
> error.
> 
>
> Key: HDFS-16115
> URL: https://issues.apache.org/jira/browse/HDFS-16115
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.3.1
>Reporter: Daniel Ma
>Priority: Critical
> Fix For: 3.3.1
>
> Attachments: 0001-HDFS-16115.patch
>
>
> It is an improvement issue. Actually the issue has two sub issues:
> 1- BPServerActor thread handle commands from NameNode in aysnchronous way ( 
> CommandProcessThread handle commands ), so if there are any exceptions or 
> errors happen in thread CommandProcessthread resulting the thread fails and 
> stop, of which BPServiceActor cannot aware and still keep putting commands 
> from namenode into queues waiting to be handled by CommandProcessThread, 
> actually CommandProcessThread was dead already.
> 2-the second sub issue is based on the first one, if CommandProcessThread was 
> dead owing to some non-fatal errors like "can not create native thread" which 
> is caused by too many threads existed in OS, this kind of problem should be 
> given much more torlerance instead of simply shudown the thread and never 
> recover automatically, because the non-fatal errors mentioned above probably 
> can be recovered soon by itself,
> currently, Datanode BPServiceActor cannot 

[jira] [Updated] (HDFS-16115) Asynchronously handle BPServiceActor command mechanism may result in BPServiceActor never fails even CommandProcessingThread is closed with fatal error.

2021-07-06 Thread Daniel Ma (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Ma updated HDFS-16115:
-
Description: 
This is an improvement issue. It actually consists of two sub-issues:

1- The BPServiceActor thread handles commands from the NameNode asynchronously 
(CommandProcessThread handles the commands). If an exception or error in 
CommandProcessThread makes that thread fail and stop, BPServiceActor is not 
aware of it and keeps putting commands from the NameNode into the queue to be 
handled by CommandProcessThread, although CommandProcessThread is already dead.

2- The second sub-issue builds on the first one. If CommandProcessThread dies 
from a non-fatal error such as "can not create native thread", which is caused 
by too many threads existing on the node, this kind of problem should be given 
much tolerance instead of simply shutting down the thread and never 
recovering automatically, because non-fatal errors like this can probably 
recover by themselves soon.

Currently, the DataNode BPServiceActor cannot return to normal even after the 
non-fatal error has been eliminated.

Therefore, this patch does two things:

1- Add a retry mechanism to the BPServiceActor thread and the 
CommandProcessThread thread; the retry limit is 5 by default and configurable.

2- Add a periodic monitor thread in BPOfferService. If a BPServiceActor thread 
dies because of too many non-fatal errors, it should not simply be removed 
from the BPServiceActor list stored in BPOfferService; instead, the monitor 
thread periodically tries to restart these dead BPServiceActor threads. The 
interval is also configurable.

  was:
It is an improvement issue. Actually the issue has two sub issues:

1- BPServerActor thread handle commands from NameNode in aysnchronous way ( 
CommandProcessThread handle commands ), so if there are any exceptions or 
errors happen in thread CommandProcessthread resulting the thread fails and 
stop, of which BPServiceActor cannot aware and still keep putting commands from 
namenode into queues waiting to be handled by CommandProcessThread, actually 
CommandProcessThread was dead already.

2-the second sub issue is based on the first one, if CommandProcessThread was 
dead owing to some non-fatal errors like "can not create native thread" which 
is caused by too many threads existed on the node, this kind of problem should 
be given much torlerance instead of simply shudown the thread and never recover 
automatically, because the non-fatal errors mentioned above probably can be 
recovered soon by itself,

currently, Datanode BPServiceActor cannot turn to normal even when the 
non-fatal error was eliminated.

Therefor, in this patch, two things was be done:

1-Add retry mechanism in BPServiceActor thread and CommandProcessThread thread 
which is 5 by default and configurable;

2-Add a monitor periodical thread in BPOfferService, if a BPServiceActor thread 
is dead owing to  too many times non-fatal error, it should not be simply 
remove from BPServviceActor lists stored in BPOfferService, instead, the 
monitor thread will periodically try to start these special dead BPService 
Actor thread.


> Asynchronously handle BPServiceActor command mechanism may result in 
> BPServiceActor never fails even CommandProcessingThread is closed with fatal 
> error.
> 
>
> Key: HDFS-16115
> URL: https://issues.apache.org/jira/browse/HDFS-16115
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.3.1
>Reporter: Daniel Ma
>Priority: Critical
> Fix For: 3.3.1
>
> Attachments: 0001-HDFS-16115.patch
>
>
> It is an improvement issue. Actually the issue has two sub issues:
> 1- BPServerActor thread handle commands from NameNode in aysnchronous way ( 
> CommandProcessThread handle commands ), so if there are any exceptions or 
> errors happen in thread CommandProcessthread resulting the thread fails and 
> stop, of which BPServiceActor cannot aware and still keep putting commands 
> from namenode into queues waiting to be handled by CommandProcessThread, 
> actually CommandProcessThread was dead already.
> 2-the second sub issue is based on the first one, if CommandProcessThread was 
> dead owing to some non-fatal errors like "can not create native thread" which 
> is caused by too many threads existed on the node, this kind of problem 
> should be given much torlerance instead of simply shudown the thread and 
> never recover automatically, because the non-fatal errors mentioned above 
> probably can be recovered soon by itself,
> currently, Datanode BPServiceActor cannot turn to normal even when the 

[jira] [Updated] (HDFS-16115) Asynchronously handle BPServiceActor command mechanism may result in BPServiceActor never fails even CommandProcessingThread is closed with fatal error.

2021-07-06 Thread Daniel Ma (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Ma updated HDFS-16115:
-
Description: 
This is an improvement issue. It actually consists of two sub-issues:

1- The BPServiceActor thread handles commands from the NameNode asynchronously 
(CommandProcessThread handles the commands). If an exception or error in 
CommandProcessThread makes that thread fail and stop, BPServiceActor is not 
aware of it and keeps putting commands from the NameNode into the queue to be 
handled by CommandProcessThread, although CommandProcessThread is already dead.

2- The second sub-issue builds on the first one. If CommandProcessThread dies 
from a non-fatal error such as "can not create native thread", which is caused 
by too many threads existing on the node, this kind of problem should be given 
much tolerance instead of simply shutting down the thread and never 
recovering automatically, because non-fatal errors like this can probably 
recover by themselves soon.

Currently, the DataNode BPServiceActor cannot return to normal even after the 
non-fatal error has been eliminated.

Therefore, this patch does two things:

1- Add a retry mechanism to the BPServiceActor thread and the 
CommandProcessThread thread; the retry limit is 5 by default and configurable.

2- Add a periodic monitor thread in BPOfferService. If a BPServiceActor thread 
dies because of too many non-fatal errors, it should not simply be removed 
from the BPServiceActor list stored in BPOfferService; instead, the monitor 
thread periodically tries to restart these dead BPServiceActor threads.

  was:
It is an improvement issue. Actually the issue has two sub issues:

1- BPServerActor thread handle commands from NameNode in aysnchronous way ( 
CommandProcessThread handle commands ), so if there are any exceptions or 
errors happen in thread CommandProcessthread resulting the thread fails and 
stop, of which BPServiceActor cannot aware and still keep putting commands from 
namenode into queues waiting to be handled by CommandProcessThread, actually 
CommandProcessThread was dead already.

2-the second sub issue is based on the first one, if CommandProcessThread was 
dead owing to some non-fatal errors like "can not create native thread" which 
is caused by too many threads existed on the node, this kind of problem should 
be given much torlerance instead of simply shudown the thread and never recover 
automatically, because the non-fatal errors mentioned above probably can be 
recovered soon by itself,

currently, Datanode BPServiceActor cannot turn to normal even when the 
non-fatal error was eliminated.


> Asynchronously handle BPServiceActor command mechanism may result in 
> BPServiceActor never fails even CommandProcessingThread is closed with fatal 
> error.
> 
>
> Key: HDFS-16115
> URL: https://issues.apache.org/jira/browse/HDFS-16115
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.3.1
>Reporter: Daniel Ma
>Priority: Critical
> Fix For: 3.3.1
>
> Attachments: 0001-HDFS-16115.patch
>
>
> It is an improvement issue. Actually the issue has two sub issues:
> 1- BPServerActor thread handle commands from NameNode in aysnchronous way ( 
> CommandProcessThread handle commands ), so if there are any exceptions or 
> errors happen in thread CommandProcessthread resulting the thread fails and 
> stop, of which BPServiceActor cannot aware and still keep putting commands 
> from namenode into queues waiting to be handled by CommandProcessThread, 
> actually CommandProcessThread was dead already.
> 2-the second sub issue is based on the first one, if CommandProcessThread was 
> dead owing to some non-fatal errors like "can not create native thread" which 
> is caused by too many threads existed on the node, this kind of problem 
> should be given much torlerance instead of simply shudown the thread and 
> never recover automatically, because the non-fatal errors mentioned above 
> probably can be recovered soon by itself,
> currently, Datanode BPServiceActor cannot turn to normal even when the 
> non-fatal error was eliminated.
> Therefor, in this patch, two things was be done:
> 1-Add retry mechanism in BPServiceActor thread and CommandProcessThread 
> thread which is 5 by default and configurable;
> 2-Add a monitor periodical thread in BPOfferService, if a BPServiceActor 
> thread is dead owing to  too many times non-fatal error, it should not be 
> simply remove from BPServviceActor lists stored in BPOfferService, instead, 
> the monitor thread will periodically try to start these special dead 
> BPService 

[jira] [Updated] (HDFS-16115) Asynchronously handle BPServiceActor command mechanism may result in BPServiceActor never fails even CommandProcessingThread is closed with fatal error.

2021-07-06 Thread Daniel Ma (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Ma updated HDFS-16115:
-
Description: 
This is an improvement issue. It actually consists of two sub-issues:

1- The BPServiceActor thread handles commands from the NameNode asynchronously 
(CommandProcessThread handles the commands). If an exception or error in 
CommandProcessThread makes that thread fail and stop, BPServiceActor is not 
aware of it and keeps putting commands from the NameNode into the queue to be 
handled by CommandProcessThread, although CommandProcessThread is already dead.

2- The second sub-issue builds on the first one. If CommandProcessThread dies 
from a non-fatal error such as "can not create native thread", which is caused 
by too many threads existing on the node, this kind of problem should be given 
much tolerance instead of simply shutting down the thread and never 
recovering automatically, because non-fatal errors like this can probably 
recover by themselves soon.

Currently, the DataNode BPServiceActor cannot return to normal even after the 
non-fatal error has been eliminated.
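
The failure mode in sub-issue 1 can be reproduced with a plain producer/consumer pair. The Java sketch below is an illustration only; CommandQueueSketch and its methods are hypothetical names, not DataNode classes. It contrasts enqueuing without any check (the behaviour described above) with a producer that first checks whether the processing thread is still alive.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical model of an actor feeding a command-processing thread. */
final class CommandQueueSketch {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final Thread processor;

    CommandQueueSketch(Runnable processorBody) {
        processor = new Thread(processorBody, "command-processor-sketch");
        processor.setDaemon(true);
        processor.start();
    }

    /** Behaviour described in sub-issue 1: enqueue even if nobody will consume. */
    void enqueueBlindly(String command) {
        queue.add(command);
    }

    /** Alternative: refuse the command once the processing thread has exited. */
    boolean enqueueIfProcessorAlive(String command) {
        return processor.isAlive() && queue.offer(command);
    }

    int pending() {
        return queue.size();
    }

    /** Waits until the processing thread has exited. */
    void awaitProcessorExit() {
        try {
            processor.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Once the processing thread has exited, enqueueBlindly keeps growing the queue while enqueueIfProcessorAlive returns false, which gives the caller a chance to react. The liveness check is inherently racy (the thread can die right after the check), so it narrows the window rather than closing it.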

  was:
It is an improvement issue. Actually the issue has two sub issues:

1- BPServerActor thread handle commands from NameNode in aysnchronous way ( 
CommandProcessThread handle commands ), so if there are any exception or errors 
happens in thread CommandProcessthread resulting the thread fails and stop, of 
which BPServiceActor cannot aware and still keep put commands from namenode 
into queues waiting to be handled by CommandProcessThread, actually 
CommandProcessThread was dead already.

2-the second sub issue is based on the first one, if CommandProcessThread fails 
owing to some non-fatal error like "can not create native thread" which is 
caused by too many threads existed on the node, this kind of problem should be 
given much torlerance instead of simply shudown the thread and never recover 
automatically, because the non-fatal eror mention above may recover soon by 
itself,

currently, Datanode BPServiceActor cannot turn to normal even when the 
non-fatal error was eliminated.


> Asynchronously handle BPServiceActor command mechanism may result in 
> BPServiceActor never fails even CommandProcessingThread is closed with fatal 
> error.
> 
>
> Key: HDFS-16115
> URL: https://issues.apache.org/jira/browse/HDFS-16115
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.3.1
>Reporter: Daniel Ma
>Priority: Critical
> Fix For: 3.3.1
>
> Attachments: 0001-HDFS-16115.patch
>
>
> It is an improvement issue. Actually the issue has two sub issues:
> 1- BPServerActor thread handle commands from NameNode in aysnchronous way ( 
> CommandProcessThread handle commands ), so if there are any exceptions or 
> errors happen in thread CommandProcessthread resulting the thread fails and 
> stop, of which BPServiceActor cannot aware and still keep putting commands 
> from namenode into queues waiting to be handled by CommandProcessThread, 
> actually CommandProcessThread was dead already.
> 2-the second sub issue is based on the first one, if CommandProcessThread was 
> dead owing to some non-fatal errors like "can not create native thread" which 
> is caused by too many threads existed on the node, this kind of problem 
> should be given much torlerance instead of simply shudown the thread and 
> never recover automatically, because the non-fatal errors mentioned above 
> probably can be recovered soon by itself,
> currently, Datanode BPServiceActor cannot turn to normal even when the 
> non-fatal error was eliminated.






[jira] [Comment Edited] (HDFS-16115) Asynchronously handle BPServiceActor command mechanism may result in BPServiceActor never fails even CommandProcessingThread is closed with fatal error.

2021-07-06 Thread Daniel Ma (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17375401#comment-17375401
 ] 

Daniel Ma edited comment on HDFS-16115 at 7/6/21, 9:33 AM:
---

Hello [~brahmareddy], [~ayush],

Please help review this patch. Thanks.


was (Author: daniel ma):
[~brahmareddy]

[~ayush]

Pls help to review this patch.

> Asynchronously handle BPServiceActor command mechanism may result in 
> BPServiceActor never fails even CommandProcessingThread is closed with fatal 
> error.
> 
>
> Key: HDFS-16115
> URL: https://issues.apache.org/jira/browse/HDFS-16115
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.3.1
>Reporter: Daniel Ma
>Priority: Critical
> Fix For: 3.3.1
>
> Attachments: 0001-HDFS-16115.patch
>
>
> This is an improvement issue with two sub-issues:
> 1. The BPServiceActor thread handles commands from the NameNode 
> asynchronously (CommandProcessingThread processes the commands). If an 
> exception or error in CommandProcessingThread causes that thread to fail and 
> stop, BPServiceActor is not aware of it and keeps putting commands from the 
> NameNode into the queue for CommandProcessingThread, even though that thread 
> is already dead.
> 2. The second sub-issue builds on the first. If CommandProcessingThread fails 
> because of a non-fatal error such as "can not create native thread", which is 
> caused by too many threads existing on the node, the failure should be 
> tolerated rather than simply shutting down the thread with no automatic 
> recovery, because such a non-fatal error may clear up by itself soon.
> Currently, the DataNode BPServiceActor cannot return to normal even after the 
> non-fatal error has been eliminated.
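
The first sub-issue in the description above (a producer that keeps enqueuing for a consumer thread that has already died) can be illustrated with a minimal sketch. This is not Hadoop code and not the attached patch: `CommandQueueSketch`, `offer`, `isWorkerAlive`, and `awaitWorkerExit` are hypothetical names, and checking liveness before enqueuing is only one possible way to make the failure visible to the producer.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical stand-in for the actor / processing-thread pair; not Hadoop code.
class CommandQueueSketch {
  private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
  private volatile Thread worker;
  private volatile Throwable lastFailure;

  // Starts the worker that drains the queue, one command at a time.
  synchronized void startWorker() {
    Thread t = new Thread(() -> {
      try {
        while (true) {
          queue.take().run(); // an unchecked exception here ends the thread
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }, "command-worker");
    t.setDaemon(true);
    // Record why the worker died instead of losing the failure silently.
    t.setUncaughtExceptionHandler((thread, error) -> lastFailure = error);
    t.start();
    worker = t;
  }

  boolean isWorkerAlive() {
    Thread t = worker;
    return t != null && t.isAlive();
  }

  Throwable lastFailure() {
    return lastFailure;
  }

  // Producer side: refuse to enqueue once the consumer is gone.
  boolean offer(Runnable command) {
    return isWorkerAlive() && queue.offer(command);
  }

  // Waits up to the given time for the worker to exit; true if it has exited.
  boolean awaitWorkerExit(long millis) {
    Thread t = worker;
    if (t == null) {
      return true;
    }
    try {
      t.join(millis);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return !t.isAlive();
  }
}
```

With this shape, a caller that sees `offer` return false can inspect `lastFailure()` and decide whether to fail fast or restart the worker, rather than filling a queue that nothing will ever drain.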






[jira] [Commented] (HDFS-16115) Asynchronously handle BPServiceActor command mechanism may result in BPServiceActor never fails even CommandProcessingThread is closed with fatal error.

2021-07-06 Thread Daniel Ma (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17375401#comment-17375401
 ] 

Daniel Ma commented on HDFS-16115:
--

[~brahmareddy]

[~ayush]

Please help review this patch.

> Asynchronously handle BPServiceActor command mechanism may result in 
> BPServiceActor never fails even CommandProcessingThread is closed with fatal 
> error.
> 
>
> Key: HDFS-16115
> URL: https://issues.apache.org/jira/browse/HDFS-16115
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.3.1
>Reporter: Daniel Ma
>Priority: Critical
> Fix For: 3.3.1
>
> Attachments: 0001-HDFS-16115.patch
>
>
> This is an improvement issue with two sub-issues:
> 1. The BPServiceActor thread handles commands from the NameNode 
> asynchronously (CommandProcessingThread processes the commands). If an 
> exception or error in CommandProcessingThread causes that thread to fail and 
> stop, BPServiceActor is not aware of it and keeps putting commands from the 
> NameNode into the queue for CommandProcessingThread, even though that thread 
> is already dead.
> 2. The second sub-issue builds on the first. If CommandProcessingThread fails 
> because of a non-fatal error such as "can not create native thread", which is 
> caused by too many threads existing on the node, the failure should be 
> tolerated rather than simply shutting down the thread with no automatic 
> recovery, because such a non-fatal error may clear up by itself soon.
> Currently, the DataNode BPServiceActor cannot return to normal even after the 
> non-fatal error has been eliminated.






[jira] [Updated] (HDFS-16114) the balancer parameters print error

2021-07-06 Thread jiaguodong (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaguodong updated HDFS-16114:
--
Description: 
public String toString() {
  return String.format("%s.%s [%s," + " threshold = %s,"
      + " max idle iteration = %s," + " #excluded nodes = %s,"
      + " #included nodes = %s," + " #source nodes = %s,"
      + " #blockpools = %s," + " run during upgrade = %s,"
      + " hot block time interval = %s]"   // highlighted in the report
      + " sort top nodes = %s",            // highlighted in the report
      Balancer.class.getSimpleName(), getClass().getSimpleName(), policy,
      threshold, maxIdleIteration, excludedNodes.size(),
      includedNodes.size(), sourceNodes.size(), blockpools.size(),
      runDuringUpgrade, sortTopNodes, hotBlockTimeInterval);  // highlighted in the report
}

The parameters are printed incorrectly: the format string places "hot block 
time interval" (with the closing "]") before "sort top nodes", while the 
arguments are passed in the order sortTopNodes, hotBlockTimeInterval, so the 
two values are printed under each other's labels.
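
For illustration, one way the mismatch could be resolved is to reorder the last two format fragments so that they follow the argument order and the closing bracket comes last. The sketch below is standalone and is not the committed fix: `BalancerParamsSketch`, `describe`, and the literal class names passed to the format string are placeholders for the real fields and `getSimpleName()` calls.

```java
// Hypothetical standalone version of the toString() from the report.
class BalancerParamsSketch {
  static String describe(String policy, double threshold, int maxIdleIteration,
      int excludedNodes, int includedNodes, int sourceNodes, int blockpools,
      boolean runDuringUpgrade, boolean sortTopNodes, long hotBlockTimeInterval) {
    return String.format("%s.%s [%s," + " threshold = %s,"
        + " max idle iteration = %s," + " #excluded nodes = %s,"
        + " #included nodes = %s," + " #source nodes = %s,"
        + " #blockpools = %s," + " run during upgrade = %s,"
        // Labels now appear in the same order as the arguments below,
        // and the closing bracket is part of the final fragment.
        + " sort top nodes = %s," + " hot block time interval = %s]",
        "Balancer", "BalancerParameters", policy,
        threshold, maxIdleIteration, excludedNodes,
        includedNodes, sourceNodes, blockpools,
        runDuringUpgrade, sortTopNodes, hotBlockTimeInterval);
  }

  public static void main(String[] args) {
    System.out.println(describe("datanode", 10.0, 5, 0, 0, 0, 0, false, true, 0L));
  }
}
```

Swapping the two arguments instead of the two labels would also make labels and values line up, but the closing "]" would still need to move to the end of the format string.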

 

> the balancer parameters print error
> ---
>
> Key: HDFS-16114
> URL: https://issues.apache.org/jira/browse/HDFS-16114
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: jiaguodong
>Priority: Minor
>  Labels: balancer
>
> public String toString() {
>   return String.format("%s.%s [%s," + " threshold = %s,"
>       + " max idle iteration = %s," + " #excluded nodes = %s,"
>       + " #included nodes = %s," + " #source nodes = %s,"
>       + " #blockpools = %s," + " run during upgrade = %s,"
>       + " hot block time interval = %s]"   // highlighted in the report
>       + " sort top nodes = %s",            // highlighted in the report
>       Balancer.class.getSimpleName(), getClass().getSimpleName(), policy,
>       threshold, maxIdleIteration, excludedNodes.size(),
>       includedNodes.size(), sourceNodes.size(), blockpools.size(),
>       runDuringUpgrade, sortTopNodes, hotBlockTimeInterval);  // highlighted in the report
> }
> The parameters are printed incorrectly: the format string places "hot block 
> time interval" (with the closing "]") before "sort top nodes", while the 
> arguments are passed in the order sortTopNodes, hotBlockTimeInterval, so the 
> two values are printed under each other's labels.
>  






[jira] [Updated] (HDFS-16115) Asynchronously handle BPServiceActor command mechanism may result in BPServiceActor never fails even CommandProcessingThread is closed with fatal error.

2021-07-06 Thread Daniel Ma (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Ma updated HDFS-16115:
-
Description: 
This is an improvement issue with two sub-issues:

1. The BPServiceActor thread handles commands from the NameNode asynchronously 
(CommandProcessingThread processes the commands). If an exception or error in 
CommandProcessingThread causes that thread to fail and stop, BPServiceActor is 
not aware of it and keeps putting commands from the NameNode into the queue 
for CommandProcessingThread, even though that thread is already dead.

2. The second sub-issue builds on the first. If CommandProcessingThread fails 
because of a non-fatal error such as "can not create native thread", which is 
caused by too many threads existing on the node, the failure should be 
tolerated rather than simply shutting down the thread with no automatic 
recovery, because such a non-fatal error may clear up by itself soon.

Currently, the DataNode BPServiceActor cannot return to normal even after the 
non-fatal error has been eliminated.

  was:
This is an improvement issue with two sub-issues:

1. The BPServiceActor thread handles commands from the NameNode asynchronously 
(CommandProcessingThread processes the commands). If an exception or error in 
CommandProcessingThread causes that thread to fail and stop, BPServiceActor is 
not aware of it and keeps putting commands from the NameNode into the queue to 
be handled by CommandProcessingThread.

2. The second sub-issue builds on the first. If CommandProcessingThread fails 
because of a non-fatal error such as "can not create native thread", which is 
caused by too many threads existing on the node, the failure should be 
tolerated rather than simply shutting down the thread with no automatic 
recovery, because such a non-fatal error may clear up by itself soon.

Currently, the DataNode BPServiceActor cannot return to normal even after the 
non-fatal error has been eliminated.


> Asynchronously handle BPServiceActor command mechanism may result in 
> BPServiceActor never fails even CommandProcessingThread is closed with fatal 
> error.
> 
>
> Key: HDFS-16115
> URL: https://issues.apache.org/jira/browse/HDFS-16115
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.3.1
>Reporter: Daniel Ma
>Priority: Critical
> Fix For: 3.3.1
>
> Attachments: 0001-HDFS-16115.patch
>
>
> This is an improvement issue with two sub-issues:
> 1. The BPServiceActor thread handles commands from the NameNode 
> asynchronously (CommandProcessingThread processes the commands). If an 
> exception or error in CommandProcessingThread causes that thread to fail and 
> stop, BPServiceActor is not aware of it and keeps putting commands from the 
> NameNode into the queue for CommandProcessingThread, even though that thread 
> is already dead.
> 2. The second sub-issue builds on the first. If CommandProcessingThread fails 
> because of a non-fatal error such as "can not create native thread", which is 
> caused by too many threads existing on the node, the failure should be 
> tolerated rather than simply shutting down the thread with no automatic 
> recovery, because such a non-fatal error may clear up by itself soon.
> Currently, the DataNode BPServiceActor cannot return to normal even after the 
> non-fatal error has been eliminated.






[jira] [Updated] (HDFS-16114) the balancer parameters print error

2021-07-06 Thread jiaguodong (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaguodong updated HDFS-16114:
--
Labels: balancer  (was: )

> the balancer parameters print error
> ---
>
> Key: HDFS-16114
> URL: https://issues.apache.org/jira/browse/HDFS-16114
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: jiaguodong
>Priority: Minor
>  Labels: balancer
>







[jira] [Updated] (HDFS-16114) the balancer parameters print error

2021-07-06 Thread jiaguodong (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaguodong updated HDFS-16114:
--
Priority: Minor  (was: Major)

> the balancer parameters print error
> ---
>
> Key: HDFS-16114
> URL: https://issues.apache.org/jira/browse/HDFS-16114
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: jiaguodong
>Priority: Minor
>







[jira] [Updated] (HDFS-16114) the balancer parameters print error

2021-07-06 Thread jiaguodong (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaguodong updated HDFS-16114:
--
Summary: the balancer parameters print error  (was: the balancer will exit)

> the balancer parameters print error
> ---
>
> Key: HDFS-16114
> URL: https://issues.apache.org/jira/browse/HDFS-16114
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: jiaguodong
>Priority: Major
>







[jira] [Updated] (HDFS-16115) Asynchronously handle BPServiceActor command mechanism may result in BPServiceActor never fails even CommandProcessingThread is closed with fatal error.

2021-07-06 Thread Daniel Ma (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Ma updated HDFS-16115:
-
Attachment: 0001-HDFS-16115.patch

> Asynchronously handle BPServiceActor command mechanism may result in 
> BPServiceActor never fails even CommandProcessingThread is closed with fatal 
> error.
> 
>
> Key: HDFS-16115
> URL: https://issues.apache.org/jira/browse/HDFS-16115
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.3.1
>Reporter: Daniel Ma
>Priority: Critical
> Fix For: 3.3.1
>
> Attachments: 0001-HDFS-16115.patch
>
>
> This is an improvement issue with two sub-issues:
> 1. The BPServiceActor thread handles commands from the NameNode 
> asynchronously (CommandProcessingThread processes the commands). If an 
> exception or error in CommandProcessingThread causes that thread to fail and 
> stop, BPServiceActor is not aware of it and keeps putting commands from the 
> NameNode into the queue to be handled by CommandProcessingThread.
> 2. The second sub-issue builds on the first. If CommandProcessingThread fails 
> because of a non-fatal error such as "can not create native thread", which is 
> caused by too many threads existing on the node, the failure should be 
> tolerated rather than simply shutting down the thread with no automatic 
> recovery, because such a non-fatal error may clear up by itself soon.
> Currently, the DataNode BPServiceActor cannot return to normal even after the 
> non-fatal error has been eliminated.






[jira] [Created] (HDFS-16115) Asynchronously handle BPServiceActor command mechanism may result in BPServiceActor never fails even CommandProcessingThread is closed with fatal error.

2021-07-06 Thread Daniel Ma (Jira)
Daniel Ma created HDFS-16115:


 Summary: Asynchronously handle BPServiceActor command mechanism 
may result in BPServiceActor never fails even CommandProcessingThread is closed 
with fatal error.
 Key: HDFS-16115
 URL: https://issues.apache.org/jira/browse/HDFS-16115
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: datanode
Affects Versions: 3.3.1
Reporter: Daniel Ma
 Fix For: 3.3.1


This is an improvement issue with two sub-issues:

1. The BPServiceActor thread handles commands from the NameNode asynchronously 
(CommandProcessingThread processes the commands). If an exception or error in 
CommandProcessingThread causes that thread to fail and stop, BPServiceActor is 
not aware of it and keeps putting commands from the NameNode into the queue to 
be handled by CommandProcessingThread.

2. The second sub-issue builds on the first. If CommandProcessingThread fails 
because of a non-fatal error such as "can not create native thread", which is 
caused by too many threads existing on the node, the failure should be 
tolerated rather than simply shutting down the thread with no automatic 
recovery, because such a non-fatal error may clear up by itself soon.

Currently, the DataNode BPServiceActor cannot return to normal even after the 
non-fatal error has been eliminated.
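
The tolerance asked for in the second sub-issue could look roughly like a bounded retry around the processing step. The sketch below is an assumption-laden illustration, not the attached patch: `RetryingWorkerSketch`, `runWithTolerance`, the `recoverable` predicate, and the retry and backoff parameters are all hypothetical, and which errors count as non-fatal is a policy decision the issue leaves open.

```java
import java.util.function.Predicate;

// Hypothetical helper; names and policy are illustrative, not Hadoop's.
class RetryingWorkerSketch {
  // Runs the task, retrying while the failure is classified as recoverable.
  // Returns the number of attempts that were needed to succeed.
  static int runWithTolerance(Runnable task, Predicate<Throwable> recoverable,
      int maxRetries, long backoffMillis) {
    int attempts = 0;
    while (true) {
      attempts++;
      try {
        task.run();
        return attempts;
      } catch (RuntimeException | Error failure) {
        // Give up on failures classified as fatal, or when retries run out.
        if (!recoverable.test(failure) || attempts > maxRetries) {
          throw failure;
        }
        try {
          Thread.sleep(backoffMillis); // give the node time to recover
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          throw failure;
        }
      }
    }
  }
}
```

A caller would pass a predicate that recognises the transient condition it wants to tolerate; anything else propagates immediately, so a genuinely fatal error still stops the thread.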






[jira] [Created] (HDFS-16114) the balancer will exit

2021-07-06 Thread jiaguodong (Jira)
jiaguodong created HDFS-16114:
-

 Summary: the balancer will exit
 Key: HDFS-16114
 URL: https://issues.apache.org/jira/browse/HDFS-16114
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: jiaguodong









[jira] [Commented] (HDFS-15796) ConcurrentModificationException error happens on NameNode occasionally

2021-07-06 Thread Stephen O'Donnell (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17375386#comment-17375386
 ] 

Stephen O'Donnell commented on HDFS-15796:
--

I don't think the patch will help anything here. All the methods on 
PendingReconstruction are already synchronized, eg:

{code}
  List getTargets(BlockInfo block) {
    synchronized (pendingReconstructions) {
      PendingBlockInfo found = pendingReconstructions.get(block);
      if (found != null) {
        return found.targets;
      }
    }
    return null;
  }
{code}

You posted a code snippet above, but it has no line numbers, so I cannot see 
which line threw the exception. Can you tell me which line is 1907 in your 
code snippet above?
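
As general background on the exception itself, and not as a diagnosis of this issue: a synchronized accessor only protects the lookup. If it hands back the internal list, the caller iterates after the lock has been released, and any modification during that iteration trips the fail-fast check in `ArrayList`. The sketch below reproduces this deterministically in a single thread and contrasts it with returning a snapshot; `SharedListSketch` and its methods are hypothetical and only loosely modelled on the snippet above.

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of a synchronized map whose values are mutable lists.
class SharedListSketch {
  private final Map<String, List<String>> pending = new HashMap<>();

  void add(String block, String target) {
    synchronized (pending) {
      pending.computeIfAbsent(block, k -> new ArrayList<>()).add(target);
    }
  }

  // Returns the internal list: the lock is released before the caller iterates.
  List<String> getTargets(String block) {
    synchronized (pending) {
      return pending.get(block);
    }
  }

  // Returns a snapshot taken while the lock is still held.
  List<String> getTargetsCopy(String block) {
    synchronized (pending) {
      List<String> found = pending.get(block);
      return found == null ? null : new ArrayList<>(found);
    }
  }

  // Iterates the given view while a writer adds to block "b1".
  // Returns true if the iteration failed with ConcurrentModificationException.
  static boolean iterateWhileAdding(SharedListSketch s, List<String> view) {
    try {
      for (String target : view) {
        s.add("b1", target + "-new"); // stands in for a concurrent writer
      }
      return false;
    } catch (ConcurrentModificationException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    SharedListSketch s = new SharedListSketch();
    s.add("b1", "dn1");
    s.add("b1", "dn2");
    System.out.println(iterateWhileAdding(s, s.getTargets("b1")));     // shared list
    System.out.println(iterateWhileAdding(s, s.getTargetsCopy("b1"))); // snapshot
  }
}
```

Whether anything like this applies to the reported stack trace depends on which list is being iterated at line 1907, which is exactly what the question above is trying to establish.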

> ConcurrentModificationException error happens on NameNode occasionally
> --
>
> Key: HDFS-15796
> URL: https://issues.apache.org/jira/browse/HDFS-15796
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.1.1
>Reporter: Daniel Ma
>Priority: Critical
> Attachments: 0001-HDFS-15796.patch
>
>
> ConcurrentModificationException error happens on NameNode occasionally.
>  
> {code:java}
> 2021-01-23 20:21:18,107 | ERROR | RedundancyMonitor | RedundancyMonitor 
> thread received Runtime exception.  | BlockManager.java:4746
> java.util.ConcurrentModificationException
>   at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:909)
>   at java.util.ArrayList$Itr.next(ArrayList.java:859)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:1907)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockReconstructionWork(BlockManager.java:1859)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4862)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$RedundancyMonitor.run(BlockManager.java:4729)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
>  
>  






[jira] [Commented] (HDFS-15796) ConcurrentModificationException error happens on NameNode occasionally

2021-07-06 Thread Daniel Ma (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17375331#comment-17375331
 ] 

Daniel Ma commented on HDFS-15796:
--

[~weichiu],[~hexiaoqiao] 

Please help review this patch, thanks.

> ConcurrentModificationException error happens on NameNode occasionally
> --
>
> Key: HDFS-15796
> URL: https://issues.apache.org/jira/browse/HDFS-15796
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.1.1
>Reporter: Daniel Ma
>Priority: Critical
> Attachments: 0001-HDFS-15796.patch
>
>
> ConcurrentModificationException error happens on NameNode occasionally.
>  
> {code:java}
> 2021-01-23 20:21:18,107 | ERROR | RedundancyMonitor | RedundancyMonitor 
> thread received Runtime exception.  | BlockManager.java:4746
> java.util.ConcurrentModificationException
>   at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:909)
>   at java.util.ArrayList$Itr.next(ArrayList.java:859)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:1907)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockReconstructionWork(BlockManager.java:1859)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4862)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$RedundancyMonitor.run(BlockManager.java:4729)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
>  
>  






[jira] [Updated] (HDFS-15796) ConcurrentModificationException error happens on NameNode occasionally

2021-07-06 Thread Daniel Ma (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Ma updated HDFS-15796:
-
Attachment: 0001-HDFS-15796.patch

> ConcurrentModificationException error happens on NameNode occasionally
> --
>
> Key: HDFS-15796
> URL: https://issues.apache.org/jira/browse/HDFS-15796
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.1.1
>Reporter: Daniel Ma
>Priority: Critical
> Attachments: 0001-HDFS-15796.patch
>
>
> ConcurrentModificationException error happens on NameNode occasionally.
>  
> {code:java}
> 2021-01-23 20:21:18,107 | ERROR | RedundancyMonitor | RedundancyMonitor 
> thread received Runtime exception.  | BlockManager.java:4746
> java.util.ConcurrentModificationException
>   at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:909)
>   at java.util.ArrayList$Itr.next(ArrayList.java:859)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:1907)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockReconstructionWork(BlockManager.java:1859)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4862)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$RedundancyMonitor.run(BlockManager.java:4729)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
>  
>  






[jira] [Updated] (HDFS-15796) ConcurrentModificationException error happens on NameNode occasionally

2021-07-06 Thread Daniel Ma (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Ma updated HDFS-15796:
-
Target Version/s: 3.3.1  (was: 3.4.0)

> ConcurrentModificationException error happens on NameNode occasionally
> --
>
> Key: HDFS-15796
> URL: https://issues.apache.org/jira/browse/HDFS-15796
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.1.1
>Reporter: Daniel Ma
>Priority: Critical
> Attachments: 0001-HDFS-15796.patch
>
>
> ConcurrentModificationException error happens on NameNode occasionally.
>  
> {code:java}
> 2021-01-23 20:21:18,107 | ERROR | RedundancyMonitor | RedundancyMonitor 
> thread received Runtime exception.  | BlockManager.java:4746
> java.util.ConcurrentModificationException
>   at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:909)
>   at java.util.ArrayList$Itr.next(ArrayList.java:859)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:1907)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockReconstructionWork(BlockManager.java:1859)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4862)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$RedundancyMonitor.run(BlockManager.java:4729)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
>  
>  






[jira] [Commented] (HDFS-16095) Add lsQuotaList command and getQuotaListing api for hdfs quota

2021-07-06 Thread Xiangyi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17375286#comment-17375286
 ] 

Xiangyi Zhu commented on HDFS-16095:


[~weichiu],[~ayushtkn],[~hexiaoqiao],[~kihwal]  Looking forward to your 
comments.

> Add lsQuotaList command and getQuotaListing api for hdfs quota
> --
>
> Key: HDFS-16095
> URL: https://issues.apache.org/jira/browse/HDFS-16095
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Affects Versions: 3.4.0
>Reporter: Xiangyi Zhu
>Assignee: Xiangyi Zhu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Currently hdfs does not support obtaining all quota information. The 
> administrator may need to check which quotas have been added to a certain 
> directory, or the quotas of the entire cluster.






[jira] [Work logged] (HDFS-16088) Standby NameNode process getLiveDatanodeStorageReport request to reduce Active load

2021-07-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16088?focusedWorklogId=618891=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-618891
 ]

ASF GitHub Bot logged work on HDFS-16088:
-

Author: ASF GitHub Bot
Created on: 06/Jul/21 06:35
Start Date: 06/Jul/21 06:35
Worklog Time Spent: 10m 
  Work Description: tomscut commented on pull request #3140:
URL: https://github.com/apache/hadoop/pull/3140#issuecomment-874499453


   Thanks @Hexiaoqiao for your comments and suggestions. I fixed the problems.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 618891)
Time Spent: 3h  (was: 2h 50m)

> Standby NameNode process getLiveDatanodeStorageReport request to reduce 
> Active load
> ---
>
> Key: HDFS-16088
> URL: https://issues.apache.org/jira/browse/HDFS-16088
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: tomscut
>Assignee: tomscut
>Priority: Major
>  Labels: pull-request-available
> Attachments: standyby-ipcserver.jpg
>
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> As with HDFS-13183, NameNodeConnector#getLiveDatanodeStorageReport() can also 
> send its request to the SNN to reduce the load on the ANN.
> Two points are worth mentioning:
>  1. FSNamesystem#getDatanodeStorageReport() is OperationCategory.UNCHECKED, 
> so we can access the SNN directly.
>  2. We can share the same UT (testBalancerRequestSBNWithHA) with 
> NameNodeConnector#getBlocks().


