[jira] [Commented] (YARN-11622) ResourceManager asynchronous switch from Standy to Active exception

2024-05-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17844494#comment-17844494
 ] 

ASF GitHub Bot commented on YARN-11622:
---

hiwangzhihui commented on PR #6352:
URL: https://github.com/apache/hadoop/pull/6352#issuecomment-2099626750

   Hi @slfan1989 @dineshchitlangia, all of the review comments on this PR have 
been addressed. If you have time, please review it again. Thank you very much.




> ResourceManager asynchronous switch from Standy to Active exception
> ---
>
> Key: YARN-11622
> URL: https://issues.apache.org/jira/browse/YARN-11622
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.0.0-alpha4, 3.1.1, 3.3.0
>Reporter: wangzhihui
>Assignee: wangzhihui
>Priority: Major
>  Labels: pull-request-available
> Attachments: rm_ha_solution.png, yuque_diagram (1).jpg, 
> yuque_diagram.jpg
>
>
> h1. Two exception cases:
> h2. The first case:
> *The exception desc:*
> {code:java}
> 14:52:57,426 FATAL event.AsyncDispatcher (AsyncDispatcher.java:dispatch(203)) 
> - Error in dispatcher thread
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.access$1200(ResourceManager.java:610)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.handleTransitionToStandByInNewThread(ResourceManager.java:941)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.access$1100(ResourceManager.java:144)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:902)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:892)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
> at java.lang.Thread.run(Thread.java:748)
> {code}
>  
>  * ActiveStandbyElector and ZKRMStateStore each triggered a toStandby event at 
> 14:52:57. The two asynchronous events are referred to below as Thread_1 and 
> Thread_2.
>  * As shown in the following figure, Thread_1 sets activeServices to null 
> while reinitializing it during the toStandby process. Thread_2 then 
> dereferences activeServices while executing the 
> handleTransitionToStandByInNewThread method, which results in a 
> NullPointerException and causes the ResourceManager process to exit.
> !yuque_diagram.jpg|width=629,height=100!
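The race in the first case can be illustrated with a minimal, self-contained sketch. It is not Hadoop code: `TransitionRaceSketch`, `ActiveServices`, and `runConcurrently` are hypothetical names, and guarding the transition with `synchronized` is only one possible way to close the window, not necessarily what the PR does. The point is that when the "set to null, then recreate" step and the dereference happen under the same lock, a second thread cannot observe the null field.

```java
/**
 * Illustrative stand-in for the ResourceManager HA transition.
 * None of these names are Hadoop APIs.
 */
final class TransitionRaceSketch {

  /** Hypothetical placeholder for RMActiveServices. */
  static final class ActiveServices {
    void stop() {
      // A real service would release its resources here.
    }
  }

  private ActiveServices activeServices = new ActiveServices();
  private int completedTransitions = 0;

  /**
   * Stops and recreates the active services. Because the method is
   * synchronized, a second caller cannot run while the field is null.
   */
  synchronized void transitionToStandby() {
    activeServices.stop();                  // would throw NPE if the field were null
    activeServices = null;                  // window an unsynchronized caller could hit
    activeServices = new ActiveServices();  // reinitialize for the next transition
    completedTransitions++;
  }

  synchronized int completedTransitions() {
    return completedTransitions;
  }

  /** Runs transitions from several threads and returns how many completed. */
  static int runConcurrently(int threads, int transitionsPerThread) {
    TransitionRaceSketch rm = new TransitionRaceSketch();
    Thread[] workers = new Thread[threads];
    for (int i = 0; i < threads; i++) {
      workers[i] = new Thread(() -> {
        for (int j = 0; j < transitionsPerThread; j++) {
          rm.transitionToStandby();
        }
      });
      workers[i].start();
    }
    for (Thread worker : workers) {
      try {
        worker.join();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException("interrupted while waiting", e);
      }
    }
    return rm.completedTransitions();
  }

  public static void main(String[] args) {
    System.out.println(runConcurrently(2, 1000)); // prints 2000
  }
}
```

If `synchronized` were removed from `transitionToStandby`, one worker could call `stop()` while the other had just set the field to null, which is the same interleaving as Thread_1 and Thread_2 in the report.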
> h2. The second case:
> *The exception desc:* 
> {code:java}
> 06:17:35,913 WARN ha.ActiveStandbyElector 
> (ActiveStandbyElector.java:becomeActive(900)) - Exception handling the 
> winning of election
> org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:146)
> at 
> org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:896)
> at 
> org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:543)
> at 
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:558)
> at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
> Caused by: org.apache.hadoop.ha.ServiceFailedException: Error on refreshAll 
> during transition to Active
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:315)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:144)
> ... 4 more
> Caused by: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation 
> failed
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:765)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:307)
> ... 5 more
> Caused by: java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:467)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:423)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:754)
> ... 6 more
> 06:17:35,917 ERROR resourcemanager.ResourceManager 
> (ResourceManager.java:handle(898)) - Received RMFatalEvent of type 
> TRANSITION_TO_ACTIVE_FAILED, caused by failure to refresh configuration 
> settings: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation 
> failed
> {code}
>  * ActiveStandbyElector and ZKRMStateStore triggered toActive 

[jira] [Commented] (YARN-11622) ResourceManager asynchronous switch from Standy to Active exception

2024-05-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17844405#comment-17844405
 ] 

ASF GitHub Bot commented on YARN-11622:
---

hadoop-yetus commented on PR #6352:
URL: https://github.com/apache/hadoop/pull/6352#issuecomment-2098960920

   :confetti_ball: **+1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |   3m 48s |  |  Docker mode activated.  |
   |||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  |
   | +0 :ok: |  xmllint  |   0m  0s |  |  xmllint was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to include 3 new or modified test files.  |
   |||| _ branch-3.3 Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  33m 38s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  compile  |   0m 36s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  checkstyle  |   0m 28s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  mvnsite  |   0m 39s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  javadoc  |   0m 34s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  spotbugs  |   1m 17s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  shadedclient  |  21m 36s |  |  branch has no errors when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 35s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 32s |  |  the patch passed  |
   | +1 :green_heart: |  javac  |   0m 32s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks issues.  |
   | +1 :green_heart: |  checkstyle  |   0m 21s |  |  hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 0 new + 66 unchanged - 2 fixed = 66 total (was 68)  |
   | +1 :green_heart: |  mvnsite  |   0m 34s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 24s |  |  the patch passed  |
   | +1 :green_heart: |  spotbugs  |   1m 15s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  21m 37s |  |  patch has no errors when building and testing our client artifacts.  |
   |||| _ Other Tests _ |
   | +1 :green_heart: |  unit  |  78m 53s |  |  hadoop-yarn-server-resourcemanager in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 25s |  |  The patch does not generate ASF License warnings.  |
   |  |   | 168m 38s |  |  |
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.45 ServerAPI=1.45 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6352/16/artifact/out/Dockerfile |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6352 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient codespell detsecrets xmllint spotbugs checkstyle |
   | uname | Linux 51964cb3034d 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | branch-3.3 / fcdd4fddd6ed005e750afd4c399fb98ab976b7a1 |
   | Default Java | Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~18.04-b09 |
   |  Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6352/16/testReport/ |
   | Max. process+thread count | 942 (vs. ulimit of 5500) |
   | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager |
   | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6352/16/console |
   | versions | git=2.17.1 maven=3.6.0 spotbugs=4.2.2 |
   | Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   





[jira] [Commented] (YARN-11622) ResourceManager asynchronous switch from Standy to Active exception

2024-04-22 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17839515#comment-17839515
 ] 

ASF GitHub Bot commented on YARN-11622:
---

hadoop-yetus commented on PR #6352:
URL: https://github.com/apache/hadoop/pull/6352#issuecomment-2068639212

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |   3m 59s |  |  Docker mode activated.  |
   |||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  |
   | +0 :ok: |  xmllint  |   0m  0s |  |  xmllint was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to include 3 new or modified test files.  |
   |||| _ branch-3.3 Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  33m 28s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  compile  |   0m 37s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  checkstyle  |   0m 28s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  mvnsite  |   0m 41s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  javadoc  |   0m 32s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  spotbugs  |   1m 16s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  shadedclient  |  21m 43s |  |  branch has no errors when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 36s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 30s |  |  the patch passed  |
   | +1 :green_heart: |  javac  |   0m 30s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks issues.  |
   | +1 :green_heart: |  checkstyle  |   0m 19s |  |  hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 0 new + 66 unchanged - 2 fixed = 66 total (was 68)  |
   | +1 :green_heart: |  mvnsite  |   0m 31s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 25s |  |  the patch passed  |
   | +1 :green_heart: |  spotbugs  |   1m 15s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  21m 50s |  |  patch has no errors when building and testing our client artifacts.  |
   |||| _ Other Tests _ |
   | -1 :x: |  unit  |  78m 38s | [/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6352/15/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt) |  hadoop-yarn-server-resourcemanager in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 25s |  |  The patch does not generate ASF License warnings.  |
   |  |   | 168m 30s |  |  |
   
   
   | Reason | Tests |
   |-------:|:------|
   | Failed junit tests | hadoop.yarn.server.resourcemanager.TestRMHA |
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.45 ServerAPI=1.45 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6352/15/artifact/out/Dockerfile |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6352 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient codespell detsecrets xmllint spotbugs checkstyle |
   | uname | Linux a78ebb5045b9 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | branch-3.3 / 2e5c5a262f54e8b1994b39b4e4537b16fd525da6 |
   | Default Java | Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~18.04-b09 |
   |  Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6352/15/testReport/ |
   | Max. process+thread count | 941 (vs. ulimit of 5500) |
   | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager |
   | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6352/15/console |
   | versions | git=2.17.1 maven=3.6.0 spotbugs=4.2.2 |
   | Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   





[jira] [Commented] (YARN-11622) ResourceManager asynchronous switch from Standy to Active exception

2024-04-20 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17839256#comment-17839256
 ] 

ASF GitHub Bot commented on YARN-11622:
---

hadoop-yetus commented on PR #6352:
URL: https://github.com/apache/hadoop/pull/6352#issuecomment-2067716035

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |   0m 22s |  |  Docker mode activated.  |
   |||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  |
   | +0 :ok: |  xmllint  |   0m  0s |  |  xmllint was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to include 4 new or modified test files.  |
   |||| _ branch-3.3 Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  33m 26s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  compile  |   0m 38s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  checkstyle  |   0m 27s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  mvnsite  |   0m 41s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  javadoc  |   0m 30s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  spotbugs  |   1m 15s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  shadedclient  |  21m 57s |  |  branch has no errors when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ |
   | -1 :x: |  mvninstall  |   0m 30s | [/patch-mvninstall-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6352/14/artifact/out/patch-mvninstall-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt) |  hadoop-yarn-server-resourcemanager in the patch failed.  |
   | -1 :x: |  compile  |   0m 28s | [/patch-compile-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6352/14/artifact/out/patch-compile-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt) |  hadoop-yarn-server-resourcemanager in the patch failed.  |
   | -1 :x: |  javac  |   0m 28s | [/patch-compile-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6352/14/artifact/out/patch-compile-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt) |  hadoop-yarn-server-resourcemanager in the patch failed.  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks issues.  |
   | +1 :green_heart: |  checkstyle  |   0m 18s |  |  hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 0 new + 66 unchanged - 9 fixed = 66 total (was 75)  |
   | -1 :x: |  mvnsite  |   0m 31s | [/patch-mvnsite-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6352/14/artifact/out/patch-mvnsite-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt) |  hadoop-yarn-server-resourcemanager in the patch failed.  |
   | +1 :green_heart: |  javadoc  |   0m 23s |  |  the patch passed  |
   | -1 :x: |  spotbugs  |   0m 30s | [/patch-spotbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6352/14/artifact/out/patch-spotbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt) |  hadoop-yarn-server-resourcemanager in the patch failed.  |
   | -1 :x: |  shadedclient  |   8m  7s |  |  patch has errors when building and testing our client artifacts.  |
   |||| _ Other Tests _ |
   | -1 :x: |  unit  |   0m 31s | [/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6352/14/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt) |  hadoop-yarn-server-resourcemanager in the patch failed.  |
   | +1 :green_heart: |  asflicense  |   0m 24s |  |  The patch does not generate ASF License warnings.  |
   |  |   |  70m 57s |  |  |
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.45 ServerAPI=1.45 base: 

[jira] [Commented] (YARN-11622) ResourceManager asynchronous switch from Standy to Active exception

2024-04-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838431#comment-17838431
 ] 

ASF GitHub Bot commented on YARN-11622:
---

hiwangzhihui commented on PR #6352:
URL: https://github.com/apache/hadoop/pull/6352#issuecomment-2062890178

   @dineshchitlangia 
   I have pushed the latest code; please review it again.
   Results and exceptions are now tracked in the 
TransitionToActiveStandbyRunner.call method.
   About the failed test cases:
   TestApplicationMasterLauncher passes locally,
   but TestFSConfigToCSConfigConverterMain mishandles "System.exit" and causes 
the process to exit.
   I will fix TestFSConfigToCSConfigConverterMain separately afterward.
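
For readers who have not opened the PR, the idea of tracking a transition's result and exception inside a `call` method can be sketched roughly as follows. This is an assumption-laden illustration, not the PR's code: only the name `TransitionToActiveStandbyRunner` comes from the comment above, while `TransitionRunnerSketch`, `HaState`, `Result`, `submit`, and `demo` are hypothetical. The sketch runs every transition as a `Callable` on a single-threaded executor, so transitions cannot overlap, and each one reports either the state it reached or the exception it hit.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch; not the code from PR #6352. */
final class TransitionRunnerSketch {

  enum HaState { ACTIVE, STANDBY }

  /** Outcome of one transition: the state afterwards and any failure. */
  static final class Result {
    final HaState state;
    final Exception failure; // null when the transition succeeded

    Result(HaState state, Exception failure) {
      this.state = state;
      this.failure = failure;
    }
  }

  // One worker thread, so queued transitions run strictly one at a time.
  private final ExecutorService executor = Executors.newSingleThreadExecutor();

  // Only read and written on the executor thread.
  private HaState state = HaState.STANDBY;

  /** Queues a transition; simulateFailure stands in for a refreshAll error. */
  Future<Result> submit(HaState target, boolean simulateFailure) {
    Callable<Result> runner = () -> {
      try {
        if (simulateFailure) {
          throw new IllegalStateException("simulated refreshAll failure");
        }
        state = target;
        return new Result(state, null);
      } catch (Exception e) {
        // Record the failure instead of letting it kill the worker thread.
        return new Result(state, e);
      }
    };
    return executor.submit(runner);
  }

  void shutdown() {
    executor.shutdown();
  }

  /** Runs one successful and one failing transition and summarizes both. */
  static String demo() {
    TransitionRunnerSketch rm = new TransitionRunnerSketch();
    try {
      Result ok = rm.submit(HaState.ACTIVE, false).get();
      Result failed = rm.submit(HaState.STANDBY, true).get();
      return ok.state + "," + failed.state + "," + (failed.failure != null);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    } catch (ExecutionException e) {
      throw new IllegalStateException(e);
    } finally {
      rm.shutdown();
    }
  }

  public static void main(String[] args) {
    System.out.println(demo()); // prints ACTIVE,ACTIVE,true
  }
}
```

In this sketch the failed transition leaves the state at ACTIVE and hands the exception back to the caller, which is the kind of bookkeeping the comment describes. How the actual PR reacts to a failure is not shown in this thread.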
   
   





[jira] [Commented] (YARN-11622) ResourceManager asynchronous switch from Standy to Active exception

2024-04-17 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838268#comment-17838268
 ] 

ASF GitHub Bot commented on YARN-11622:
---

hadoop-yetus commented on PR #6352:
URL: https://github.com/apache/hadoop/pull/6352#issuecomment-2061626619

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |   0m 20s |  |  Docker mode activated.  |
   |||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to include 2 new or modified test files.  |
   |||| _ branch-3.3 Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  33m 42s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  compile  |   0m 37s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  checkstyle  |   0m 29s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  mvnsite  |   0m 41s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  javadoc  |   0m 33s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  spotbugs  |   1m 15s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  shadedclient  |  22m 25s |  |  branch has no errors when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 35s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 32s |  |  the patch passed  |
   | +1 :green_heart: |  javac  |   0m 32s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks issues.  |
   | +1 :green_heart: |  checkstyle  |   0m 18s |  |  hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 0 new + 66 unchanged - 1 fixed = 66 total (was 67)  |
   | +1 :green_heart: |  mvnsite  |   0m 33s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 22s |  |  the patch passed  |
   | +1 :green_heart: |  spotbugs  |   1m 12s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  22m 10s |  |  patch has no errors when building and testing our client artifacts.  |
   |||| _ Other Tests _ |
   | -1 :x: |  unit  |  78m 50s | [/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6352/13/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt) |  hadoop-yarn-server-resourcemanager in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 26s |  |  The patch does not generate ASF License warnings.  |
   |  |   | 166m 35s |  |  |
   
   
   | Reason | Tests |
   |-------:|:------|
   | Failed junit tests | hadoop.yarn.server.resourcemanager.TestApplicationMasterLauncher |
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.45 ServerAPI=1.45 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6352/13/artifact/out/Dockerfile |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6352 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux 809d9c31990d 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | branch-3.3 / dd7de21ff1b1db9f214b2c14379cd6f8aab70b41 |
   | Default Java | Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~18.04-b09 |
   |  Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6352/13/testReport/ |
   | Max. process+thread count | 962 (vs. ulimit of 5500) |
   | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager |
   | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6352/13/console |
   | versions | git=2.17.1 maven=3.6.0 spotbugs=4.2.2 |
   | Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   





[jira] [Commented] (YARN-11622) ResourceManager asynchronous switch from Standy to Active exception

2024-04-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17837649#comment-17837649
 ] 

ASF GitHub Bot commented on YARN-11622:
---

hadoop-yetus commented on PR #6352:
URL: https://github.com/apache/hadoop/pull/6352#issuecomment-2058742919

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |   3m 53s |  |  Docker mode activated.  |
   |||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to include 2 new or modified test files.  |
   |||| _ branch-3.3 Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  33m 43s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  compile  |   0m 37s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  checkstyle  |   0m 29s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  mvnsite  |   0m 42s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  javadoc  |   0m 33s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  spotbugs  |   1m 15s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  shadedclient  |  22m  1s |  |  branch has no errors when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 35s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 30s |  |  the patch passed  |
   | +1 :green_heart: |  javac  |   0m 30s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks issues.  |
   | +1 :green_heart: |  checkstyle  |   0m 18s |  |  hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 0 new + 66 unchanged - 1 fixed = 66 total (was 67)  |
   | +1 :green_heart: |  mvnsite  |   0m 32s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 23s |  |  the patch passed  |
   | +1 :green_heart: |  spotbugs  |   1m 11s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  23m  5s |  |  patch has no errors when building and testing our client artifacts.  |
   |||| _ Other Tests _ |
   | -1 :x: |  unit  |  77m 28s | [/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6352/12/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt) |  hadoop-yarn-server-resourcemanager in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 22s |  |  The patch does not generate ASF License warnings.  |
   |  |   | 169m  3s |  |  |
   
   
   | Reason | Tests |
   |-------:|:------|
   | Failed junit tests | hadoop.yarn.server.resourcemanager.TestApplicationMasterLauncher |
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.45 ServerAPI=1.45 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6352/12/artifact/out/Dockerfile |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6352 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux cdda08d8ad0b 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | branch-3.3 / e71bf8e6c219a1f9a6ab0c8c032631c8f4dd6425 |
   | Default Java | Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~18.04-b09 |
   |  Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6352/12/testReport/ |
   | Max. process+thread count | 964 (vs. ulimit of 5500) |
   | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager |
   | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6352/12/console |
   | versions | git=2.17.1 maven=3.6.0 spotbugs=4.2.2 |
   | Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   





[jira] [Commented] (YARN-11622) ResourceManager asynchronous switch from Standy to Active exception

2024-04-06 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17834527#comment-17834527
 ] 

ASF GitHub Bot commented on YARN-11622:
---

hiwangzhihui commented on code in PR #6352:
URL: https://github.com/apache/hadoop/pull/6352#discussion_r1554610415


##
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java:
##
@@ -1118,38 +1123,36 @@ protected void serviceStop() throws Exception {
 }
   }
 
-/**
+  /**
* Transition to standby state in a new thread. The transition operation is
* asynchronous to avoid deadlock caused by cyclic dependency.
*/
-  private void handleTransitionToStandByInNewThread() {
-Thread standByTransitionThread =
-new Thread(activeServices.standByTransitionRunnable);
-standByTransitionThread.setName("StandByTransitionThread");
-standByTransitionThread.start();
+  void handleTransitionToStandByInNewThread() {
+toActiveStandbyExecutor.submit(
+new RMFatalToStandbyRunner(ResourceManager.getClusterTimeStamp()));

Review Comment:
   @dineshchitlangia Thank you for your review! I will address this next.





> ResourceManager asynchronous switch from Standy to Active exception
> ---
>
> Key: YARN-11622
> URL: https://issues.apache.org/jira/browse/YARN-11622
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.0.0-alpha4, 3.1.1, 3.3.0
>Reporter: wangzhihui
>Assignee: wangzhihui
>Priority: Major
>  Labels: pull-request-available
> Attachments: rm_ha_solution.png, yuque_diagram (1).jpg, 
> yuque_diagram.jpg
>
>
> h1. Two exception cases:
> h2. The first case:
> *The exception desc:*
> {code:java}
> 14:52:57,426 FATAL event.AsyncDispatcher (AsyncDispatcher.java:dispatch(203)) 
> - Error in dispatcher thread
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.access$1200(ResourceManager.java:610)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.handleTransitionToStandByInNewThread(ResourceManager.java:941)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.access$1100(ResourceManager.java:144)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:902)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:892)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
> at java.lang.Thread.run(Thread.java:748)
> {code}
>  
>  * ActiveStandbyElector and ZKRMStateStore both triggered a toStandby event at 
> 14:52:57. The two asynchronous events are referred to as Thread_1 and Thread_2.
>  * As shown in the following figure, Thread_1 resets activeServices to null 
> during the toStandby process. At this point, Thread_2 uses activeServices 
> when executing the handleTransitionToStandByInNewThread method, which 
> results in a NullPointerException and causes the ResourceManager to exit.
> !yuque_diagram.jpg|width=629,height=100!
> h2. The second case:
> *The exception desc:* 
> {code:java}
> 06:17:35,913 WARN ha.ActiveStandbyElector 
> (ActiveStandbyElector.java:becomeActive(900)) - Exception handling the 
> winning of election
> org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:146)
> at 
> org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:896)
> at 
> org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:543)
> at 
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:558)
> at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
> Caused by: org.apache.hadoop.ha.ServiceFailedException: Error on refreshAll 
> during transition to Active
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:315)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:144)
> ... 4 more
> Caused by: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation 
> failed
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:765)
> at 
> 

[jira] [Commented] (YARN-11622) ResourceManager asynchronous switch from Standy to Active exception

2024-04-04 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17834120#comment-17834120
 ] 

ASF GitHub Bot commented on YARN-11622:
---

dineshchitlangia commented on code in PR #6352:
URL: https://github.com/apache/hadoop/pull/6352#discussion_r1552669244


##
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java:
##
@@ -1118,38 +1123,36 @@ protected void serviceStop() throws Exception {
 }
   }
 
-/**
+  /**
* Transition to standby state in a new thread. The transition operation is
* asynchronous to avoid deadlock caused by cyclic dependency.
*/
-  private void handleTransitionToStandByInNewThread() {
-Thread standByTransitionThread =
-new Thread(activeServices.standByTransitionRunnable);
-standByTransitionThread.setName("StandByTransitionThread");
-standByTransitionThread.start();
+  void handleTransitionToStandByInNewThread() {
+toActiveStandbyExecutor.submit(
+new RMFatalToStandbyRunner(ResourceManager.getClusterTimeStamp()));

Review Comment:
   Based on the spotbugs issue definition, L1131 returns a value that we are 
not checking. In theory, it may return a bad value and it could go unnoticed.
   
   In this case, you are using ExecutorService.submit() and that could 
potentially throw NPE or RejectedExecutionException.
   The method signature does not specify the `throws` definition. This could be 
a reason you have the spotbugs issue.
   
   Reference - 
https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/concurrent/ExecutorService.html#submit(java.util.concurrent.Callable)
   
   
   Please modify the code here to accommodate the needs of submit().
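
A minimal sketch of one way to accommodate submit(), assuming a hypothetical helper class `StandbyTransitionSubmitter` that is not part of the PR. It keeps the returned `Future` instead of discarding it, and it catches `RejectedExecutionException`, which `submit()` throws when the executor cannot accept the task (for example after shutdown). Only `ExecutorService`, `Future`, and `RejectedExecutionException` are real JDK APIs here; everything else is illustrative.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;

// Hypothetical helper, not taken from the PR.
final class StandbyTransitionSubmitter {
  // A single worker thread runs submitted transitions one at a time.
  private final ExecutorService executor = Executors.newSingleThreadExecutor();

  /**
   * Submits the transition and returns its Future so the caller can inspect
   * the outcome. Returns null if the executor rejected the task.
   */
  Future<?> submitTransition(Runnable transition) {
    try {
      return executor.submit(transition);
    } catch (RejectedExecutionException e) {
      // The executor is shut down or saturated; report it instead of failing silently.
      System.err.println("Transition task rejected: " + e);
      return null;
    }
  }

  void shutdown() {
    executor.shutdown();
  }

  public static void main(String[] args) {
    StandbyTransitionSubmitter submitter = new StandbyTransitionSubmitter();
    Future<?> accepted = submitter.submitTransition(() -> System.out.println("transition ran"));
    System.out.println("accepted: " + (accepted != null));
    submitter.shutdown();
    Future<?> rejected = submitter.submitTransition(() -> { });
    System.out.println("accepted after shutdown: " + (rejected != null));
  }
}
```

Whether a rejected submission should be logged, retried, or escalated is a design decision for the PR; the sketch only shows that the return value and the unchecked exception are both observable at the call site.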
   






[jira] [Commented] (YARN-11622) ResourceManager asynchronous switch from Standy to Active exception

2024-01-19 Thread wangzhihui (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808866#comment-17808866
 ] 

wangzhihui commented on YARN-11622:
---

Hi [~slfan1989],

I'm sorry for the delay; I've been busy lately and haven't been able to promptly 
address the questions and details on this issue.
I have added a testTransitionedToStandbyShouldNotNPE test case to reproduce the 
problem described in YARN-11622.
One SpotBugs warning remains, and we still need to discuss how to handle it.
Looking forward to your reply, thank you.
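
For readers following the thread, the NullPointerException in the first case comes from one thread resetting a shared field to null while another thread dereferences it. A minimal sketch of that hazard and of a read-once guard follows. The class `ActiveServicesHolder` and its members are hypothetical stand-ins, not ResourceManager code, and this guard is not the fix used in the PR (which submits the transition to an executor).

```java
// Hypothetical stand-in for an object whose field may be cleared by another thread.
final class ActiveServicesHolder {
  // volatile so that a clear() from one thread is visible to readers in another.
  private volatile Runnable standbyTask;

  ActiveServicesHolder(Runnable task) {
    this.standbyTask = task;
  }

  /** Simulates the thread that resets the shared field to null. */
  void clear() {
    standbyTask = null;
  }

  /**
   * Reads the field once into a local variable, so a concurrent clear()
   * between the null check and the call cannot cause a NullPointerException.
   * Returns true if the task ran, false if the field was already null.
   */
  boolean runIfPresent() {
    Runnable snapshot = standbyTask;
    if (snapshot == null) {
      return false;
    }
    snapshot.run();
    return true;
  }

  public static void main(String[] args) {
    ActiveServicesHolder holder =
        new ActiveServicesHolder(() -> System.out.println("standby task ran"));
    System.out.println("ran: " + holder.runIfPresent());
    holder.clear();
    System.out.println("ran after clear: " + holder.runIfPresent());
  }
}
```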

> ResourceManager asynchronous switch from Standy to Active exception
> ---
>
> Key: YARN-11622
> URL: https://issues.apache.org/jira/browse/YARN-11622
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.0.0-alpha4, 3.1.1, 3.3.0
>Reporter: wangzhihui
>Assignee: wangzhihui
>Priority: Major
>  Labels: pull-request-available
> Attachments: rm_ha_solution.png, yuque_diagram (1).jpg, 
> yuque_diagram.jpg
>
>
> h1. Two exception cases:
> h2. The first case:
> *The exception desc:*
> {code:java}
> 14:52:57,426 FATAL event.AsyncDispatcher (AsyncDispatcher.java:dispatch(203)) 
> - Error in dispatcher thread
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.access$1200(ResourceManager.java:610)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.handleTransitionToStandByInNewThread(ResourceManager.java:941)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.access$1100(ResourceManager.java:144)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:902)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:892)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
> at java.lang.Thread.run(Thread.java:748)
> {code}
>  
>  * ActiveStandbyElector and ZKRMStateStore both triggered a toStandby event at 
> 14:52:57. The two asynchronous events are referred to as Thread_1 and Thread_2.
>  * As shown in the following figure, Thread_1 resets activeServices to null 
> during the toStandby process. At this point, Thread_2 uses activeServices 
> when executing the handleTransitionToStandByInNewThread method, which 
> results in a NullPointerException and causes the ResourceManager to exit.
> !yuque_diagram.jpg|width=629,height=100!
> h2. The second case:
> *The exception desc:* 
> {code:java}
> 06:17:35,913 WARN ha.ActiveStandbyElector 
> (ActiveStandbyElector.java:becomeActive(900)) - Exception handling the 
> winning of election
> org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:146)
> at 
> org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:896)
> at 
> org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:543)
> at 
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:558)
> at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
> Caused by: org.apache.hadoop.ha.ServiceFailedException: Error on refreshAll 
> during transition to Active
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:315)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:144)
> ... 4 more
> Caused by: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation 
> failed
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:765)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:307)
> ... 5 more
> Caused by: java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:467)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:423)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:754)
> ... 6 more
> 06:17:35,917 ERROR resourcemanager.ResourceManager 
> (ResourceManager.java:handle(898)) - Received RMFatalEvent of type 
> TRANSITION_TO_ACTIVE_FAILED, caused by failure to refresh configuration 
> settings: 

[jira] [Commented] (YARN-11622) ResourceManager asynchronous switch from Standy to Active exception

2024-01-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808800#comment-17808800
 ] 

ASF GitHub Bot commented on YARN-11622:
---

hadoop-yetus commented on PR #6352:
URL: https://github.com/apache/hadoop/pull/6352#issuecomment-1901074068

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |   6m  7s |  |  Docker mode activated.  |
   |||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to include 1 new or modified test files.  |
   |||| _ branch-3.3 Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  46m 10s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  compile  |   0m 56s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  checkstyle  |   0m 44s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  mvnsite  |   1m  1s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  javadoc  |   0m 47s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  spotbugs  |   1m 56s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  shadedclient  |  35m 46s |  |  branch has no errors when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 56s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 47s |  |  the patch passed  |
   | +1 :green_heart: |  javac  |   0m 47s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks issues.  |
   | +1 :green_heart: |  checkstyle  |   0m 31s |  |  hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 0 new + 67 unchanged - 1 fixed = 67 total (was 68)  |
   | +1 :green_heart: |  mvnsite  |   0m 52s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 34s |  |  the patch passed  |
   | -1 :x: |  spotbugs  |   1m 59s | [/new-spotbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.html](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6352/11/artifact/out/new-spotbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.html) |  hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0)  |
   | +1 :green_heart: |  shadedclient  |  35m 22s |  |  patch has no errors when building and testing our client artifacts.  |
   |||| _ Other Tests _ |
   | +1 :green_heart: |  unit  |  91m 27s |  |  hadoop-yarn-server-resourcemanager in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 35s |  |  The patch does not generate ASF License warnings.  |
   |  |   | 228m  4s |  |  |
   
   
   | Reason | Tests |
   |-------:|:------|
   | SpotBugs | module:hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager |
   |  |  Exceptional return value of java.util.concurrent.ExecutorService.submit(Callable) ignored in org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.handleTransitionToStandByInNewThread()  At ResourceManager.java:ignored in org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.handleTransitionToStandByInNewThread()  At ResourceManager.java:[line 1131] |
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.44 ServerAPI=1.44 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6352/11/artifact/out/Dockerfile |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6352 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux d0db6dadbf15 5.15.0-88-generic #98-Ubuntu SMP Mon Oct 2 15:18:56 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | branch-3.3 / 59252a76d39ea9520acc7cfd39216bbb01a36767 |
   | Default Java | Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~18.04-b09 |
   |  Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6352/11/testReport/ |
   | Max. process+thread count | 948 (vs. ulimit of 5500) |
   | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager |
   | Console output | 

[jira] [Commented] (YARN-11622) ResourceManager asynchronous switch from Standy to Active exception

2024-01-18 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808332#comment-17808332
 ] 

ASF GitHub Bot commented on YARN-11622:
---

hadoop-yetus commented on PR #6352:
URL: https://github.com/apache/hadoop/pull/6352#issuecomment-1898963967

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |   4m 20s |  |  Docker mode activated.  |
   |||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to include 1 new or modified test files.  |
   |||| _ branch-3.3 Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  33m 44s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  compile  |   0m 35s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  checkstyle  |   0m 28s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  mvnsite  |   0m 40s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  javadoc  |   0m 30s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  spotbugs  |   1m 14s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  shadedclient  |  21m 17s |  |  branch has no errors when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 34s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 29s |  |  the patch passed  |
   | +1 :green_heart: |  javac  |   0m 29s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks issues.  |
   | +1 :green_heart: |  checkstyle  |   0m 19s |  |  hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 0 new + 66 unchanged - 1 fixed = 66 total (was 67)  |
   | +1 :green_heart: |  mvnsite  |   0m 29s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 22s |  |  the patch passed  |
   | -1 :x: |  spotbugs  |   1m 15s | [/new-spotbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.html](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6352/10/artifact/out/new-spotbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.html) |  hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0)  |
   | +1 :green_heart: |  shadedclient  |  21m 33s |  |  patch has no errors when building and testing our client artifacts.  |
   |||| _ Other Tests _ |
   | -1 :x: |  unit  |  78m 12s | [/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6352/10/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt) |  hadoop-yarn-server-resourcemanager in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 23s |  |  The patch does not generate ASF License warnings.  |
   |  |   | 167m 20s |  |  |
   
   
   | Reason | Tests |
   |-------:|:------|
   | SpotBugs | module:hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager |
   |  |  Exceptional return value of java.util.concurrent.ExecutorService.submit(Callable) ignored in org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.handleTransitionToStandByInNewThread()  At ResourceManager.java:ignored in org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.handleTransitionToStandByInNewThread()  At ResourceManager.java:[line 1131] |
   | Failed junit tests | hadoop.yarn.server.resourcemanager.TestRMHA |
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6352/10/artifact/out/Dockerfile |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6352 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux f16f271e28e6 5.15.0-88-generic #98-Ubuntu SMP Mon Oct 2 15:18:56 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | branch-3.3 / 5ae791898e1e8d053e7aebefd0532ff533b09087 |
   | Default Java | Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~18.04-b09 |
   |  Test Results | 

[jira] [Commented] (YARN-11622) ResourceManager asynchronous switch from Standy to Active exception

2024-01-08 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17804337#comment-17804337
 ] 

ASF GitHub Bot commented on YARN-11622:
---

hiwangzhihui commented on PR #6352:
URL: https://github.com/apache/hadoop/pull/6352#issuecomment-1881191578

   > > 
/new-spotbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.html
   > 
   > @hiwangzhihui I'll take a look at this later. I've been a little busy 
lately.
   
   Me too. I will address the problems and questions above as soon as I have time.
   





[jira] [Commented] (YARN-11622) ResourceManager asynchronous switch from Standy to Active exception

2024-01-05 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17803526#comment-17803526
 ] 

ASF GitHub Bot commented on YARN-11622:
---

hadoop-yetus commented on PR #6352:
URL: https://github.com/apache/hadoop/pull/6352#issuecomment-1878567308

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime | Logfile | Comment |
   |:---:|---:|---:|:---:|:---:|
   | +0 :ok: | reexec | 0m 21s | | Docker mode activated. |
   |||| _ Prechecks _ |
   | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
   | +0 :ok: | codespell | 0m 0s | | codespell was not available. |
   | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. |
   | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
   | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. |
   |||| _ branch-3.3 Compile Tests _ |
   | +1 :green_heart: | mvninstall | 66m 37s | | branch-3.3 passed |
   | +1 :green_heart: | compile | 0m 33s | | branch-3.3 passed |
   | +1 :green_heart: | checkstyle | 0m 27s | | branch-3.3 passed |
   | +1 :green_heart: | mvnsite | 0m 37s | | branch-3.3 passed |
   | +1 :green_heart: | javadoc | 2m 20s | | branch-3.3 passed |
   | +1 :green_heart: | spotbugs | 1m 12s | | branch-3.3 passed |
   | +1 :green_heart: | shadedclient | 22m 16s | | branch has no errors when building and testing our client artifacts. |
   |||| _ Patch Compile Tests _ |
   | +1 :green_heart: | mvninstall | 0m 31s | | the patch passed |
   | +1 :green_heart: | compile | 0m 27s | | the patch passed |
   | +1 :green_heart: | javac | 0m 27s | | the patch passed |
   | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
   | +1 :green_heart: | checkstyle | 0m 20s | | hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 0 new + 66 unchanged - 1 fixed = 66 total (was 67) |
   | +1 :green_heart: | mvnsite | 0m 30s | | the patch passed |
   | +1 :green_heart: | javadoc | 0m 21s | | the patch passed |
   | -1 :x: | spotbugs | 1m 14s | [/new-spotbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.html](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6352/9/artifact/out/new-spotbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.html) | hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) |
   | +1 :green_heart: | shadedclient | 21m 34s | | patch has no errors when building and testing our client artifacts. |
   |||| _ Other Tests _ |
   | +1 :green_heart: | unit | 75m 34s | | hadoop-yarn-server-resourcemanager in the patch passed. |
   | +1 :green_heart: | asflicense | 0m 23s | | The patch does not generate ASF License warnings. |
   | | | 196m 13s | | |
   
   
   | Reason | Tests |
   |---:|:---|
   | SpotBugs | module:hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager |
   | | Exceptional return value of java.util.concurrent.ExecutorService.submit(Callable) ignored in org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.handleTransitionToStandByInNewThread() At ResourceManager.java:ignored in org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.handleTransitionToStandByInNewThread() At ResourceManager.java:[line 1131] |
   
   
   | Subsystem | Report/Notes |
   |---:|:---|
   | Docker | ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6352/9/artifact/out/Dockerfile |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6352 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux 8243ad94cb2d 5.15.0-88-generic #98-Ubuntu SMP Mon Oct 2 15:18:56 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | branch-3.3 / b1202a8f8f6e6d94a0319dfa54264a0a31e3825a |
   | Default Java | Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~18.04-b09 |
   | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6352/9/testReport/ |
   | Max. process+thread count | 939 (vs. ulimit of 5500) |
   | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager |
   | Console output | 

[jira] [Commented] (YARN-11622) ResourceManager asynchronous switch from Standy to Active exception

2024-01-02 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17801959#comment-17801959
 ] 

ASF GitHub Bot commented on YARN-11622:
---

slfan1989 commented on PR #6352:
URL: https://github.com/apache/hadoop/pull/6352#issuecomment-1874688435

   > 
/new-spotbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.html
   
   @hiwangzhihui I'll take a look at this later. I've been a little busy lately.





[jira] [Commented] (YARN-11622) ResourceManager asynchronous switch from Standy to Active exception

2024-01-02 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17801794#comment-17801794
 ] 

ASF GitHub Bot commented on YARN-11622:
---

hadoop-yetus commented on PR #6352:
URL: https://github.com/apache/hadoop/pull/6352#issuecomment-1874062645

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime | Logfile | Comment |
   |:---:|---:|---:|:---:|:---:|
   | +0 :ok: | reexec | 0m 20s | | Docker mode activated. |
   |||| _ Prechecks _ |
   | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
   | +0 :ok: | codespell | 0m 0s | | codespell was not available. |
   | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. |
   | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
   | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. |
   |||| _ branch-3.3 Compile Tests _ |
   | +1 :green_heart: | mvninstall | 34m 59s | | branch-3.3 passed |
   | +1 :green_heart: | compile | 0m 27s | | branch-3.3 passed |
   | +1 :green_heart: | checkstyle | 0m 26s | | branch-3.3 passed |
   | +1 :green_heart: | mvnsite | 0m 37s | | branch-3.3 passed |
   | +1 :green_heart: | javadoc | 0m 30s | | branch-3.3 passed |
   | +1 :green_heart: | spotbugs | 1m 8s | | branch-3.3 passed |
   | +1 :green_heart: | shadedclient | 22m 31s | | branch has no errors when building and testing our client artifacts. |
   |||| _ Patch Compile Tests _ |
   | +1 :green_heart: | mvninstall | 0m 27s | | the patch passed |
   | +1 :green_heart: | compile | 0m 28s | | the patch passed |
   | +1 :green_heart: | javac | 0m 28s | | the patch passed |
   | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
   | -0 :warning: | checkstyle | 0m 17s | [/results-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6352/8/artifact/out/results-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt) | hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 2 new + 66 unchanged - 1 fixed = 68 total (was 67) |
   | +1 :green_heart: | mvnsite | 0m 32s | | the patch passed |
   | +1 :green_heart: | javadoc | 0m 21s | | the patch passed |
   | -1 :x: | spotbugs | 1m 10s | [/new-spotbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.html](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6352/8/artifact/out/new-spotbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.html) | hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) |
   | +1 :green_heart: | shadedclient | 22m 29s | | patch has no errors when building and testing our client artifacts. |
   |||| _ Other Tests _ |
   | +1 :green_heart: | unit | 79m 12s | | hadoop-yarn-server-resourcemanager in the patch passed. |
   | +1 :green_heart: | asflicense | 0m 25s | | The patch does not generate ASF License warnings. |
   | | | 167m 22s | | |
   
   
   | Reason | Tests |
   |---:|:---|
   | SpotBugs | module:hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager |
   | | Exceptional return value of java.util.concurrent.ExecutorService.submit(Callable) ignored in org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.handleTransitionToStandByInNewThread() At ResourceManager.java:ignored in org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.handleTransitionToStandByInNewThread() At ResourceManager.java:[line 1131] |
   
   
   | Subsystem | Report/Notes |
   |---:|:---|
   | Docker | ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6352/8/artifact/out/Dockerfile |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6352 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux 28e7cb248cd3 5.15.0-88-generic #98-Ubuntu SMP Mon Oct 2 15:18:56 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | branch-3.3 / b96e1b5c12775549a674caeb440b3f1cd4c93ac2 |
   | Default Java | Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~18.04-b09 |
   | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6352/8/testReport/ |
   | Max. 

[jira] [Commented] (YARN-11622) ResourceManager asynchronous switch from Standy to Active exception

2024-01-02 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17801793#comment-17801793
 ] 

ASF GitHub Bot commented on YARN-11622:
---

hadoop-yetus commented on PR #6352:
URL: https://github.com/apache/hadoop/pull/6352#issuecomment-1874055222

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime | Logfile | Comment |
   |:---:|---:|---:|:---:|:---:|
   | +0 :ok: | reexec | 4m 7s | | Docker mode activated. |
   |||| _ Prechecks _ |
   | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
   | +0 :ok: | codespell | 0m 0s | | codespell was not available. |
   | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. |
   | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
   | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. |
   |||| _ branch-3.3 Compile Tests _ |
   | +1 :green_heart: | mvninstall | 34m 32s | | branch-3.3 passed |
   | +1 :green_heart: | compile | 0m 30s | | branch-3.3 passed |
   | +1 :green_heart: | checkstyle | 0m 24s | | branch-3.3 passed |
   | +1 :green_heart: | mvnsite | 0m 36s | | branch-3.3 passed |
   | +1 :green_heart: | javadoc | 0m 28s | | branch-3.3 passed |
   | +1 :green_heart: | spotbugs | 1m 9s | | branch-3.3 passed |
   | +1 :green_heart: | shadedclient | 22m 12s | | branch has no errors when building and testing our client artifacts. |
   |||| _ Patch Compile Tests _ |
   | +1 :green_heart: | mvninstall | 0m 30s | | the patch passed |
   | +1 :green_heart: | compile | 0m 27s | | the patch passed |
   | +1 :green_heart: | javac | 0m 27s | | the patch passed |
   | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
   | -0 :warning: | checkstyle | 0m 21s | [/results-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6352/7/artifact/out/results-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt) | hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 2 new + 66 unchanged - 1 fixed = 68 total (was 67) |
   | +1 :green_heart: | mvnsite | 0m 28s | | the patch passed |
   | +1 :green_heart: | javadoc | 0m 17s | | the patch passed |
   | -1 :x: | spotbugs | 1m 14s | [/new-spotbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.html](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6352/7/artifact/out/new-spotbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.html) | hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) |
   | +1 :green_heart: | shadedclient | 22m 35s | | patch has no errors when building and testing our client artifacts. |
   |||| _ Other Tests _ |
   | +1 :green_heart: | unit | 77m 24s | | hadoop-yarn-server-resourcemanager in the patch passed. |
   | +1 :green_heart: | asflicense | 0m 24s | | The patch does not generate ASF License warnings. |
   | | | 168m 28s | | |
   
   
   | Reason | Tests |
   |---:|:---|
   | SpotBugs | module:hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager |
   | | Exceptional return value of java.util.concurrent.ExecutorService.submit(Callable) ignored in org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.handleTransitionToStandByInNewThread() At ResourceManager.java:ignored in org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.handleTransitionToStandByInNewThread() At ResourceManager.java:[line 1131] |
   
   
   | Subsystem | Report/Notes |
   |---:|:---|
   | Docker | ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6352/7/artifact/out/Dockerfile |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6352 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux 8cf1e1a7058b 5.15.0-88-generic #98-Ubuntu SMP Mon Oct 2 15:18:56 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | branch-3.3 / 5b713ed148e3724626a80fe27bedc28ac2d42957 |
   | Default Java | Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~18.04-b09 |
   | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6352/7/testReport/ |
   | Max. 

[jira] [Commented] (YARN-11622) ResourceManager asynchronous switch from Standy to Active exception

2024-01-02 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17801745#comment-17801745
 ] 

ASF GitHub Bot commented on YARN-11622:
---

hiwangzhihui commented on PR #6352:
URL: https://github.com/apache/hadoop/pull/6352#issuecomment-1873902545

   > @hiwangzhihui Can we fix the `checkstyle` and `spotbugs` issues? Because 
this PR involves changes to RM, we need to be more careful. Can we reproduce 
this issue? Can you provide some configuration and the steps to reproduce it?
   
   The SpotBugs warning expects the caller to submit RMFatalToStandbyRunner and 
then wait for its execution result.
   However, waiting for the execution result synchronously leads to a "cyclic 
dependency" problem.
   In addition, the call method of TransitionToActiveStandbyRunnern already 
handles both the execution result and any exception uniformly and logs them.
   RMFatalToStandbyRunner has only two possible outcomes: ① it executes 
successfully, or ② it fails with an exception and the RM process exits.
   In my opinion this warning can be ignored in this case, since adding a 
thread just to wait for the result would be redundant.
   @slfan1989 How would you view and handle this warning? I would like to hear 
your opinion again.
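
   The trade-off described above can be sketched in Java. This is an illustration under assumptions, not code from PR #6352: `SubmitResultSketch`, `failingTask`, `selfReporting`, and `submitAndInspect` are hypothetical names. The sketch relies only on the documented behavior of `ExecutorService.submit(Callable)`: an exception thrown by the task is stored in the returned `Future`, so discarding the `Future` hides the failure unless the task reports it itself.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicReference;

final class SubmitResultSketch {
    // Hypothetical task that always fails, standing in for a transition that throws.
    static Callable<Void> failingTask() {
        return () -> {
            throw new IllegalStateException("transition failed");
        };
    }

    // Wraps a task so that failures are recorded inside call() itself.
    // The caller may then discard the Future without losing the error.
    static Callable<Void> selfReporting(Callable<Void> task,
                                        AtomicReference<Throwable> sink) {
        return () -> {
            try {
                task.call();
            } catch (Exception e) {
                sink.set(e); // a real service would log here and possibly exit
            }
            return null;
        };
    }

    // Keeps the Future and inspects it. Returns the failure cause, or null
    // if the task completed normally.
    static Throwable submitAndInspect(ExecutorService pool, Callable<Void> task) {
        Future<Void> future = pool.submit(task);
        try {
            future.get();
            return null;
        } catch (ExecutionException e) {
            return e.getCause();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return e;
        }
    }
}
```

   With `failingTask()` alone, the failure is visible only through `Future.get()`. With `selfReporting(...)`, the `Future` completes normally and the failure is available in the sink, which is the situation in which ignoring the return value of `submit` loses no information.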





[jira] [Commented] (YARN-11622) ResourceManager asynchronous switch from Standy to Active exception

2024-01-02 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17801743#comment-17801743
 ] 

ASF GitHub Bot commented on YARN-11622:
---

hiwangzhihui commented on code in PR #6352:
URL: https://github.com/apache/hadoop/pull/6352#discussion_r1438289028


##
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java:
##
@@ -1118,38 +1124,25 @@ protected void serviceStop() throws Exception {
 }
   }
 
-/**

Review Comment:
   The SpotBugs warning expects the caller to submit RMFatalToStandbyRunner and 
then wait for its execution result.
   However, waiting for the execution result synchronously leads to a "cyclic 
dependency" problem.
   In addition, the call method of TransitionToActiveStandbyRunnern already 
handles both the execution result and any exception uniformly and logs them.
   RMFatalToStandbyRunner has only two possible outcomes: ① it executes 
successfully, or ② it fails with an exception and the RM process exits.
   In my opinion this warning can be ignored in this case, since adding a 
thread just to wait for the result would be redundant.
   @slfan1989 How would you view and handle this warning? I would like to hear 
your opinion again.
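
   The comment does not spell out what the "cyclic dependency" looks like. One way a synchronous wait can become cyclic, shown here purely as an assumption and with hypothetical names (`SelfWaitSketch`, `innerTimesOut`), is a task on a single-threaded executor that waits for another task queued on the same executor. The inner task cannot start until the outer one returns, so the wait can only end by timing out.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class SelfWaitSketch {
    // Runs an outer task on a single-threaded executor. The outer task submits
    // an inner task to the same executor and waits for it with a timeout.
    // Returns true if the inner task did not finish within the timeout.
    static boolean innerTimesOut(long timeoutMillis) {
        ExecutorService single = Executors.newSingleThreadExecutor();
        try {
            Future<Boolean> outer = single.submit(() -> {
                Future<String> inner = single.submit(() -> "done");
                try {
                    inner.get(timeoutMillis, TimeUnit.MILLISECONDS);
                    return false;
                } catch (TimeoutException e) {
                    // The only worker thread is occupied by this outer task,
                    // so the inner task never got a chance to run.
                    return true;
                }
            });
            return outer.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            single.shutdownNow();
        }
    }
}
```

   Without the timeout, the outer task would block forever. Whether the RM code path has this exact shape is not confirmed by the thread.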
   






[jira] [Commented] (YARN-11622) ResourceManager asynchronous switch from Standy to Active exception

2024-01-02 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17801742#comment-17801742
 ] 

ASF GitHub Bot commented on YARN-11622:
---

hiwangzhihui commented on code in PR #6352:
URL: https://github.com/apache/hadoop/pull/6352#discussion_r1438289028


##
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java:
##
@@ -1118,38 +1124,25 @@ protected void serviceStop() throws Exception {
 }
   }
 
-/**

Review Comment:
   The SpotBugs warning expects the caller to submit RMFatalToStandbyRunner and 
then wait for its execution result.
   However, waiting for the execution result synchronously leads to a "cyclic 
dependency" problem.
   In addition, the call method of TransitionToActiveStandbyRunnern already 
handles both the execution result and any exception uniformly and logs them.
   RMFatalToStandbyRunner has only two possible outcomes: ① it executes 
successfully, or ② it fails with an exception and the RM process exits.
   In my opinion this warning can be ignored in this case, since adding a 
thread just to wait for the result would be redundant.
   @slfan1989 How would you view and handle this warning? I would like to hear 
your opinion again.
   





> ResourceManager asynchronous switch from Standy to Active exception
> ---
>
> Key: YARN-11622
> URL: https://issues.apache.org/jira/browse/YARN-11622
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.0.0-alpha4, 3.1.1, 3.3.0
>Reporter: wangzhihui
>Assignee: wangzhihui
>Priority: Major
>  Labels: pull-request-available
> Attachments: rm_ha_solution.png, yuque_diagram (1).jpg, 
> yuque_diagram.jpg
>
>
> h1. Two exception cases:
> h2. The first case:
> *The exception desc:*
> {code:java}
> 14:52:57,426 FATAL event.AsyncDispatcher (AsyncDispatcher.java:dispatch(203)) 
> - Error in dispatcher thread
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.access$1200(ResourceManager.java:610)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.handleTransitionToStandByInNewThread(ResourceManager.java:941)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.access$1100(ResourceManager.java:144)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:902)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:892)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
> at java.lang.Thread.run(Thread.java:748)
> {code}
>  
>  * ActiveStandbyElector and ZKRMStateStore each triggered a toStandby event at 
> 14:52:57. The two asynchronous events are referred to as Thread_1 and 
> Thread_2, respectively.
>  * As shown in the following figure, Thread_1 resets activeServices to null 
> during the toStandby process. At this point, Thread_2 uses activeServices 
> while executing the handleTransitionToStandByInNewThread method, which 
> ultimately results in a NullPointerException and the ResourceManager 
> server exiting.
> !yuque_diagram.jpg|width=629,height=100!
> h2. The second case:
> *The exception desc:* 
> {code:java}
> 06:17:35,913 WARN ha.ActiveStandbyElector 
> (ActiveStandbyElector.java:becomeActive(900)) - Exception handling the 
> winning of election
> org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:146)
> at 
> org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:896)
> at 
> org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:543)
> at 
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:558)
> at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
> Caused by: org.apache.hadoop.ha.ServiceFailedException: Error on refreshAll 
> during transition to Active
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:315)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:144)
> ... 4 more
> Caused by: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation 
> failed
> at 
> 

[jira] [Commented] (YARN-11622) ResourceManager asynchronous switch from Standy to Active exception

2024-01-02 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17801741#comment-17801741
 ] 

ASF GitHub Bot commented on YARN-11622:
---

hiwangzhihui commented on code in PR #6352:
URL: https://github.com/apache/hadoop/pull/6352#discussion_r1436500920


##
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java:
##
@@ -1118,38 +1124,25 @@ protected void serviceStop() throws Exception {
 }
   }
 
-/**

Review Comment:
   Thanks for your reminder! It is necessary to address the SpotBugs notices. I 
need a better design for tracking task execution results. The checkstyle 
issues will be addressed together in the follow-up.





> ResourceManager asynchronous switch from Standy to Active exception
> ---
>
> Key: YARN-11622
> URL: https://issues.apache.org/jira/browse/YARN-11622
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.0.0-alpha4, 3.1.1, 3.3.0
>Reporter: wangzhihui
>Assignee: wangzhihui
>Priority: Major
>  Labels: pull-request-available
> Attachments: rm_ha_solution.png, yuque_diagram (1).jpg, 
> yuque_diagram.jpg
>
>
> h1. Two exception cases:
> h2. The first case:
> *The exception desc:*
> {code:java}
> 14:52:57,426 FATAL event.AsyncDispatcher (AsyncDispatcher.java:dispatch(203)) 
> - Error in dispatcher thread
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.access$1200(ResourceManager.java:610)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.handleTransitionToStandByInNewThread(ResourceManager.java:941)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.access$1100(ResourceManager.java:144)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:902)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:892)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
> at java.lang.Thread.run(Thread.java:748)
> {code}
>  
>  * ActiveStandbyElector and ZKRMStateStore each triggered a toStandby event at 
> 14:52:57. The two asynchronous events are referred to as Thread_1 and 
> Thread_2, respectively.
>  * As shown in the following figure, Thread_1 resets activeServices to null 
> during the toStandby process. At this point, Thread_2 uses activeServices 
> while executing the handleTransitionToStandByInNewThread method, which 
> ultimately results in a NullPointerException and the ResourceManager 
> server exiting.
> !yuque_diagram.jpg|width=629,height=100!
> h2. The second case:
> *The exception desc:* 
> {code:java}
> 06:17:35,913 WARN ha.ActiveStandbyElector 
> (ActiveStandbyElector.java:becomeActive(900)) - Exception handling the 
> winning of election
> org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:146)
> at 
> org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:896)
> at 
> org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:543)
> at 
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:558)
> at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
> Caused by: org.apache.hadoop.ha.ServiceFailedException: Error on refreshAll 
> during transition to Active
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:315)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:144)
> ... 4 more
> Caused by: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation 
> failed
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:765)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:307)
> ... 5 more
> Caused by: java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:467)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:423)
> at 
> 

[jira] [Commented] (YARN-11622) ResourceManager asynchronous switch from Standy to Active exception

2023-12-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17801195#comment-17801195
 ] 

ASF GitHub Bot commented on YARN-11622:
---

hiwangzhihui commented on code in PR #6352:
URL: https://github.com/apache/hadoop/pull/6352#discussion_r1438289028


##
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java:
##
@@ -1118,38 +1124,25 @@ protected void serviceStop() throws Exception {
 }
   }
 
-/**

Review Comment:
   The SpotBugs warning expects the RMFatalToStandbyRunner task to be submitted 
and its execution result to be waited for.
   However, waiting for the execution result synchronously would cause a 
"cyclic dependency" issue.
   In addition, the call method of TransitionToActiveStandbyRunnern already 
handles both execution results and exceptions uniformly and logs them.
   An RMFatalToStandbyRunner execution has only two possible outcomes: ① it 
succeeds, or ② it fails with an exception and the RM process exits.
   In my opinion, this warning can be ignored in this case, as adding a thread 
to wait for the result would be redundant.
   @slfan1989  How do you view this warning, and how should it be handled? I 
would like to hear your opinion again.
   






[jira] [Commented] (YARN-11622) ResourceManager asynchronous switch from Standy to Active exception

2023-12-29 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17801193#comment-17801193
 ] 

ASF GitHub Bot commented on YARN-11622:
---

hiwangzhihui commented on code in PR #6352:
URL: https://github.com/apache/hadoop/pull/6352#discussion_r1438289028


##
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java:
##
@@ -1118,38 +1124,25 @@ protected void serviceStop() throws Exception {
 }
   }
 
-/**

Review Comment:
   The SpotBugs warning expects the RMFatalToStandbyRunner task to be submitted 
and its execution result to be waited for.
   However, waiting for the execution result synchronously would cause a 
"cyclic dependency" issue.
   In addition, the call method of TransitionToActiveStandbyRunnern already 
handles both execution results and exceptions uniformly and logs them.
   An RMFatalToStandbyRunner execution has only two possible outcomes: ① it 
succeeds, or ② it fails with an exception and the RM process exits.
   In my opinion, this warning can be ignored in this case, as adding a thread 
to wait for the result would be redundant.
   @slfan1989  How do you view this warning, and how should it be handled? I 
would like to hear your opinion again.
   






[jira] [Commented] (YARN-11622) ResourceManager asynchronous switch from Standy to Active exception

2023-12-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800508#comment-17800508
 ] 

ASF GitHub Bot commented on YARN-11622:
---

hiwangzhihui commented on code in PR #6352:
URL: https://github.com/apache/hadoop/pull/6352#discussion_r1436500920


##
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java:
##
@@ -1118,38 +1124,25 @@ protected void serviceStop() throws Exception {
 }
   }
 
-/**

Review Comment:
   Thanks for your reminder! It is necessary to address the SpotBugs notices. I 
need a better design for tracking task execution results. The checkstyle 
issues will be addressed together in the follow-up.






[jira] [Commented] (YARN-11622) ResourceManager asynchronous switch from Standy to Active exception

2023-12-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17797036#comment-17797036
 ] 

ASF GitHub Bot commented on YARN-11622:
---

slfan1989 commented on code in PR #6352:
URL: https://github.com/apache/hadoop/pull/6352#discussion_r1427570096


##
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java:
##
@@ -1118,38 +1124,25 @@ protected void serviceStop() throws Exception {
 }
   }
 
-/**

Review Comment:
   The code needs to retain its comments. Why should we delete this part of the 
comments?





> ResourceManager asynchronous switch from Standy to Active exception
> ---
>
> Key: YARN-11622
> URL: https://issues.apache.org/jira/browse/YARN-11622
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.0.0-alpha4, 3.1.1, 3.3.0
>Reporter: wangzhihui
>Assignee: wangzhihui
>Priority: Major
>  Labels: pull-request-available
> Attachments: rm_ha_solution.png, yuque_diagram (1).jpg, 
> yuque_diagram.jpg
>
>
> h1. Two exception cases:
> h2. The first case:
> *The exception desc:*
> {code:java}
> 14:52:57,426 FATAL event.AsyncDispatcher (AsyncDispatcher.java:dispatch(203)) 
> - Error in dispatcher thread
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.access$1200(ResourceManager.java:610)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.handleTransitionToStandByInNewThread(ResourceManager.java:941)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.access$1100(ResourceManager.java:144)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:902)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:892)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
> at java.lang.Thread.run(Thread.java:748)
> {code}
>  
>  * ActiveStandbyElector and ZKRMStateStore each triggered a toStandby event at 
> 14:52:57. The two asynchronous events are referred to as Thread_1 and 
> Thread_2, respectively.
>  * As shown in the following figure, Thread_1 resets activeServices to null 
> during the toStandby process. At this point, Thread_2 uses activeServices 
> while executing the handleTransitionToStandByInNewThread method, which 
> ultimately results in a NullPointerException and the ResourceManager 
> server exiting.
> !yuque_diagram.jpg|width=629,height=100!
> h2. The second case:
> *The exception desc:* 
> {code:java}
> 06:17:35,913 WARN ha.ActiveStandbyElector 
> (ActiveStandbyElector.java:becomeActive(900)) - Exception handling the 
> winning of election
> org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:146)
> at 
> org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:896)
> at 
> org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:543)
> at 
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:558)
> at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
> Caused by: org.apache.hadoop.ha.ServiceFailedException: Error on refreshAll 
> during transition to Active
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:315)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:144)
> ... 4 more
> Caused by: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation 
> failed
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:765)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:307)
> ... 5 more
> Caused by: java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:467)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:423)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:754)
> ... 6 more
> 06:17:35,917 ERROR resourcemanager.ResourceManager 
> (ResourceManager.java:handle(898)) - Received 

[jira] [Commented] (YARN-11622) ResourceManager asynchronous switch from Standy to Active exception

2023-12-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17797034#comment-17797034
 ] 

ASF GitHub Bot commented on YARN-11622:
---

slfan1989 commented on PR #6352:
URL: https://github.com/apache/hadoop/pull/6352#issuecomment-1857310239

   @hiwangzhihui Can we fix the `checkstyle` and `spotbugs` issues? Because this 
PR involves changes to the RM, we need to be more careful. Can we reproduce 
this issue? Can you provide the configuration and the steps to reproduce it?




> ResourceManager asynchronous switch from Standy to Active exception
> ---
>
> Key: YARN-11622
> URL: https://issues.apache.org/jira/browse/YARN-11622
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.0.0-alpha4, 3.1.1, 3.3.0
>Reporter: wangzhihui
>Assignee: wangzhihui
>Priority: Major
>  Labels: pull-request-available
> Attachments: rm_ha_solution.png, yuque_diagram (1).jpg, 
> yuque_diagram.jpg
>
>
> h1. Two exception cases:
> h2. The first case:
> *The exception desc:*
> {code:java}
> 14:52:57,426 FATAL event.AsyncDispatcher (AsyncDispatcher.java:dispatch(203)) 
> - Error in dispatcher thread
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.access$1200(ResourceManager.java:610)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.handleTransitionToStandByInNewThread(ResourceManager.java:941)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.access$1100(ResourceManager.java:144)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:902)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:892)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
> at java.lang.Thread.run(Thread.java:748)
> {code}
>  
>  * ActiveStandbyElector and ZKRMStateStore each triggered a toStandby event at 
> 14:52:57. The two asynchronous events are referred to as Thread_1 and 
> Thread_2, respectively.
>  * As shown in the following figure, Thread_1 resets activeServices to null 
> during the toStandby process. At this point, Thread_2 uses activeServices 
> while executing the handleTransitionToStandByInNewThread method, which 
> ultimately results in a NullPointerException and the ResourceManager 
> server exiting.
> !yuque_diagram.jpg|width=629,height=100!
> h2. The second case:
> *The exception desc:* 
> {code:java}
> 06:17:35,913 WARN ha.ActiveStandbyElector 
> (ActiveStandbyElector.java:becomeActive(900)) - Exception handling the 
> winning of election
> org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:146)
> at 
> org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:896)
> at 
> org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:543)
> at 
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:558)
> at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
> Caused by: org.apache.hadoop.ha.ServiceFailedException: Error on refreshAll 
> during transition to Active
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:315)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:144)
> ... 4 more
> Caused by: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation 
> failed
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:765)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:307)
> ... 5 more
> Caused by: java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:467)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:423)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:754)
> ... 6 more
> 06:17:35,917 ERROR resourcemanager.ResourceManager 
> (ResourceManager.java:handle(898)) - Received RMFatalEvent of type 
> TRANSITION_TO_ACTIVE_FAILED, caused by failure to refresh configuration 
> settings: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation 

[jira] [Commented] (YARN-11622) ResourceManager asynchronous switch from Standy to Active exception

2023-12-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17796776#comment-17796776
 ] 

ASF GitHub Bot commented on YARN-11622:
---

hadoop-yetus commented on PR #6352:
URL: https://github.com/apache/hadoop/pull/6352#issuecomment-1855967221

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 20s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 1 new or modified test files.  |
   |||| _ branch-3.3 Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  33m 47s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  compile  |   0m 34s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  checkstyle  |   0m 26s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  mvnsite  |   0m 38s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  javadoc  |   0m 30s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  spotbugs  |   1m 14s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  shadedclient  |  21m 23s |  |  branch has no errors 
when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 35s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 31s |  |  the patch passed  |
   | +1 :green_heart: |  javac  |   0m 31s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | -0 :warning: |  checkstyle  |   0m 19s | 
[/results-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6352/6/artifact/out/results-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt)
 |  
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:
 The patch generated 65 new + 67 unchanged - 0 fixed = 132 total (was 67)  |
   | +1 :green_heart: |  mvnsite  |   0m 33s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 23s |  |  the patch passed  |
   | -1 :x: |  spotbugs  |   1m 17s | 
[/new-spotbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.html](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6352/6/artifact/out/new-spotbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.html)
 |  
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0)  |
   | +1 :green_heart: |  shadedclient  |  21m 15s |  |  patch has no errors 
when building and testing our client artifacts.  |
   |||| _ Other Tests _ |
   | +1 :green_heart: |  unit  |  78m 10s |  |  
hadoop-yarn-server-resourcemanager in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 24s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 163m  1s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | SpotBugs | 
module:hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 |
   |  |  Exceptional return value of 
java.util.concurrent.ExecutorService.submit(Callable) ignored in 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(RMFatalEvent)
  At ResourceManager.java:[line 1005] |
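
For readers unfamiliar with this SpotBugs pattern: it fires when the `Future` returned by `ExecutorService.submit(Callable)` is dropped, because an exception thrown by the task is stored in that `Future` and is otherwise never observed. The sketch below is a generic illustration under that assumption, not the code in PR #6352; the class `SubmitResultDemo` and the method `runAndReport` are hypothetical names.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical demo class; not part of Hadoop.
public class SubmitResultDemo {

  // Submits the task, waits for it, and returns "ok" or the failure
  // message so the caller can see what happened inside the task.
  public static String runAndReport(ExecutorService pool, Callable<Void> task) {
    Future<Void> future = pool.submit(task); // keep the Future instead of dropping it
    try {
      future.get(); // rethrows a task failure wrapped in ExecutionException
      return "ok";
    } catch (ExecutionException e) {
      return "failed: " + e.getCause().getMessage();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // restore the interrupt flag
      return "interrupted";
    }
  }

  public static void main(String[] args) {
    ExecutorService pool = Executors.newSingleThreadExecutor();
    try {
      System.out.println(runAndReport(pool, () -> null));
      System.out.println(runAndReport(pool, () -> {
        throw new IllegalStateException("transition failed");
      }));
    } finally {
      pool.shutdown();
    }
  }
}
```

Whether blocking on `get()` is acceptable depends on the caller; a dispatcher thread would more likely log the failure from inside the task or attach a callback instead.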
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.43 ServerAPI=1.43 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6352/6/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6352 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux 5090dd5d03a3 5.15.0-88-generic #98-Ubuntu SMP Mon Oct 2 
15:18:56 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | branch-3.3 / b3f4933c2a30fa3435b18da52526421de6085692 |
   | Default Java | Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~18.04-b09 |
   |  Test Results | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6352/6/testReport/ |
   | 

[jira] [Commented] (YARN-11622) ResourceManager asynchronous switch from Standy to Active exception

2023-12-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17796696#comment-17796696
 ] 

ASF GitHub Bot commented on YARN-11622:
---

hadoop-yetus commented on PR #6352:
URL: https://github.com/apache/hadoop/pull/6352#issuecomment-1855705561

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |   0m 32s |  |  Docker mode activated.  |
   |||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 1 new or modified test files.  |
   |||| _ branch-3.3 Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  47m 28s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  compile  |   0m 53s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  checkstyle  |   0m 42s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  mvnsite  |   1m  0s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  javadoc  |   0m 45s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  spotbugs  |   1m 55s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  shadedclient  |  37m 36s |  |  branch has no errors 
when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 54s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 47s |  |  the patch passed  |
   | +1 :green_heart: |  javac  |   0m 47s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | -0 :warning: |  checkstyle  |   0m 31s | 
[/results-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6352/5/artifact/out/results-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt)
 |  
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:
 The patch generated 65 new + 68 unchanged - 0 fixed = 133 total (was 68)  |
   | +1 :green_heart: |  mvnsite  |   0m 50s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 33s |  |  the patch passed  |
   | -1 :x: |  spotbugs  |   1m 59s | 
[/new-spotbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.html](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6352/5/artifact/out/new-spotbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.html)
 |  
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 generated 2 new + 0 unchanged - 0 fixed = 2 total (was 0)  |
   | +1 :green_heart: |  shadedclient  |  40m 19s |  |  patch has no errors 
when building and testing our client artifacts.  |
   |||| _ Other Tests _ |
   | +1 :green_heart: |  unit  |  91m 48s |  |  
hadoop-yarn-server-resourcemanager in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 34s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 230m 35s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | SpotBugs | 
module:hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 |
   |  |  Exceptional return value of 
java.util.concurrent.ExecutorService.submit(Callable) ignored in 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.handleTransitionToStandByInNewThread()
  At ResourceManager.java:[line 1132] |
   |  |  Should 
org.apache.hadoop.yarn.server.resourcemanager.TransitionToActiveStandbyRunner$TransitionToActiveStandbyResult
 be a _static_ inner class?  At TransitionToActiveStandbyRunner.java:[lines 64-78] |
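
The "static inner class" finding refers to the difference between an inner class, whose instances are tied to an instance of the enclosing class, and a static nested class, which has no such tie. A minimal sketch of that difference follows; `NestedClassDemo`, `Runner`, `InnerResult` and `StaticResult` are hypothetical names chosen for illustration and are not the classes from the PR.

```java
import java.lang.reflect.Modifier;

// Hypothetical classes for illustration; not the code from the PR.
public class NestedClassDemo {

  public static class Runner {
    // Inner (non-static) class: an instance can only be created
    // through an enclosing Runner instance.
    public class InnerResult {
      public final boolean success;
      public InnerResult(boolean success) { this.success = success; }
    }

    // Static nested class: can be created without any Runner instance.
    public static class StaticResult {
      public final boolean success;
      public StaticResult(boolean success) { this.success = success; }
    }
  }

  // Reports whether a nested class was declared static.
  public static boolean isStaticNested(Class<?> c) {
    return Modifier.isStatic(c.getModifiers());
  }

  public static void main(String[] args) {
    Runner.InnerResult inner = new Runner().new InnerResult(true); // needs a Runner
    Runner.StaticResult nested = new Runner.StaticResult(true);    // does not
    System.out.println(inner.success + " " + isStaticNested(Runner.InnerResult.class));
    System.out.println(nested.success + " " + isStaticNested(Runner.StaticResult.class));
  }
}
```

SpotBugs suggests the static form when the nested class never uses the enclosing instance, as is typical for a plain result holder.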
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.43 ServerAPI=1.43 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6352/5/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6352 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux 1e0849ac3c48 5.15.0-88-generic #98-Ubuntu SMP Mon Oct 2 
15:18:56 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | 

[jira] [Commented] (YARN-11622) ResourceManager asynchronous switch from Standy to Active exception

2023-12-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17796686#comment-17796686
 ] 

ASF GitHub Bot commented on YARN-11622:
---

hadoop-yetus commented on PR #6352:
URL: https://github.com/apache/hadoop/pull/6352#issuecomment-1855670300

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |   3m 44s |  |  Docker mode activated.  |
   |||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 1 new or modified test files.  |
   |||| _ branch-3.3 Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  48m 55s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  compile  |   0m 51s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  checkstyle  |   0m 42s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  mvnsite  |   0m 59s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  javadoc  |   0m 45s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  spotbugs  |   1m 55s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  shadedclient  |  36m 50s |  |  branch has no errors 
when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 54s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 47s |  |  the patch passed  |
   | +1 :green_heart: |  javac  |   0m 47s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | -0 :warning: |  checkstyle  |   0m 32s | 
[/results-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6352/4/artifact/out/results-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt)
 |  
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:
 The patch generated 65 new + 68 unchanged - 0 fixed = 133 total (was 68)  |
   | +1 :green_heart: |  mvnsite  |   0m 51s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 32s |  |  the patch passed  |
   | -1 :x: |  spotbugs  |   1m 58s | 
[/new-spotbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.html](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6352/4/artifact/out/new-spotbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.html)
 |  
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 generated 2 new + 0 unchanged - 0 fixed = 2 total (was 0)  |
   | +1 :green_heart: |  shadedclient  |  37m 39s |  |  patch has no errors 
when building and testing our client artifacts.  |
   |||| _ Other Tests _ |
   | +1 :green_heart: |  unit  |  91m 40s |  |  
hadoop-yarn-server-resourcemanager in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 35s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 231m 36s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | SpotBugs | 
module:hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 |
   |  |  Exceptional return value of 
java.util.concurrent.ExecutorService.submit(Callable) ignored in 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.handleTransitionToStandByInNewThread()
  At ResourceManager.java:[line 1132] |
   |  |  Should 
org.apache.hadoop.yarn.server.resourcemanager.TransitionToActiveStandbyRunner$TransitionToActiveStandbyResult
 be a _static_ inner class?  At TransitionToActiveStandbyRunner.java:[lines 64-78] |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.43 ServerAPI=1.43 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6352/4/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6352 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux 05d22095f668 5.15.0-88-generic #98-Ubuntu SMP Mon Oct 2 
15:18:56 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | 

[jira] [Commented] (YARN-11622) ResourceManager asynchronous switch from Standy to Active exception

2023-12-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17796639#comment-17796639
 ] 

ASF GitHub Bot commented on YARN-11622:
---

hadoop-yetus commented on PR #6352:
URL: https://github.com/apache/hadoop/pull/6352#issuecomment-1855501964

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |   0m 21s |  |  Docker mode activated.  |
   |||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 1 new or modified test files.  |
   |||| _ branch-3.3 Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  34m 49s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  compile  |   0m 30s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  checkstyle  |   0m 23s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  mvnsite  |   0m 34s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  javadoc  |   0m 28s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  spotbugs  |   1m  8s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  shadedclient  |  23m 19s |  |  branch has no errors 
when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 31s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 28s |  |  the patch passed  |
   | +1 :green_heart: |  javac  |   0m 28s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | -0 :warning: |  checkstyle  |   0m 17s | 
[/results-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6352/3/artifact/out/results-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt)
 |  
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:
 The patch generated 65 new + 67 unchanged - 0 fixed = 132 total (was 67)  |
   | +1 :green_heart: |  mvnsite  |   0m 30s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 20s |  |  the patch passed  |
   | -1 :x: |  spotbugs  |   1m 10s | 
[/new-spotbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.html](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6352/3/artifact/out/new-spotbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.html)
 |  
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 generated 2 new + 0 unchanged - 0 fixed = 2 total (was 0)  |
   | +1 :green_heart: |  shadedclient  |  23m 12s |  |  patch has no errors 
when building and testing our client artifacts.  |
   |||| _ Other Tests _ |
   | +1 :green_heart: |  unit  |  78m 27s |  |  
hadoop-yarn-server-resourcemanager in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 24s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 168m 24s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | SpotBugs | 
module:hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 |
   |  |  Exceptional return value of 
java.util.concurrent.ExecutorService.submit(Callable) ignored in 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.handleTransitionToStandByInNewThread()
  At ResourceManager.java:[line 1132] |
   |  |  Should 
org.apache.hadoop.yarn.server.resourcemanager.TransitionToActiveStandbyRunner$TransitionToActiveStandbyResult
 be a _static_ inner class?  At TransitionToActiveStandbyRunner.java:[lines 64-78] |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.43 ServerAPI=1.43 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6352/3/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6352 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux 22546f611ee5 5.15.0-88-generic #98-Ubuntu SMP Mon Oct 2 
15:18:56 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | 

[jira] [Commented] (YARN-11622) ResourceManager asynchronous switch from Standy to Active exception

2023-12-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17796635#comment-17796635
 ] 

ASF GitHub Bot commented on YARN-11622:
---

hadoop-yetus commented on PR #6352:
URL: https://github.com/apache/hadoop/pull/6352#issuecomment-1855484061

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |   0m 21s |  |  Docker mode activated.  |
   |||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 1 new or modified test files.  |
   |||| _ branch-3.3 Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  35m 22s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  compile  |   0m 32s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  checkstyle  |   0m 24s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  mvnsite  |   0m 34s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  javadoc  |   0m 28s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  spotbugs  |   1m 10s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  shadedclient  |  22m 36s |  |  branch has no errors 
when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 32s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 27s |  |  the patch passed  |
   | +1 :green_heart: |  javac  |   0m 27s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | -0 :warning: |  checkstyle  |   0m 16s | 
[/results-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6352/2/artifact/out/results-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt)
 |  
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:
 The patch generated 64 new + 67 unchanged - 0 fixed = 131 total (was 67)  |
   | +1 :green_heart: |  mvnsite  |   0m 30s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 19s |  |  the patch passed  |
   | -1 :x: |  spotbugs  |   1m  9s | 
[/new-spotbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.html](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6352/2/artifact/out/new-spotbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.html)
 |  
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 generated 2 new + 0 unchanged - 0 fixed = 2 total (was 0)  |
   | +1 :green_heart: |  shadedclient  |  23m 22s |  |  patch has no errors 
when building and testing our client artifacts.  |
   |||| _ Other Tests _ |
   | +1 :green_heart: |  unit  |  77m 21s |  |  
hadoop-yarn-server-resourcemanager in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 24s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 166m 58s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | SpotBugs | 
module:hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 |
   |  |  Exceptional return value of 
java.util.concurrent.ExecutorService.submit(Callable) ignored in 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.handleTransitionToStandByInNewThread()
  At ResourceManager.java:[line 1132] |
   |  |  Should 
org.apache.hadoop.yarn.server.resourcemanager.TransitionToActiveStandbyRunner$TransitionToActiveStandbyResult
 be a _static_ inner class?  At TransitionToActiveStandbyRunner.java:[lines 64-78] |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.43 ServerAPI=1.43 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6352/2/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6352 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux 1409d555494d 5.15.0-88-generic #98-Ubuntu SMP Mon Oct 2 
15:18:56 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | 

[jira] [Commented] (YARN-11622) ResourceManager asynchronous switch from Standy to Active exception

2023-12-13 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17796597#comment-17796597
 ] 

ASF GitHub Bot commented on YARN-11622:
---

hiwangzhihui opened a new pull request, #6352:
URL: https://github.com/apache/hadoop/pull/6352

   
   
   ### Description of PR
   YARN-11622 Fix ResourceManager asynchronous switch from Standby to Active 
exception
   
   ### How was this patch tested?
   add TestRMHA.testTransitionToActiveFailedAfterToStandbyNotSkip
   add TestRMHA.testLessEpochRMFatalToStandbyRunnerShouldNotExecute
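
The failure mode in the issue is one thread clearing a shared field while another still expects to use it. The second test name hints at an epoch check that skips stale to-standby requests. The sketch below illustrates that general idea only: it is an assumption about the approach, not the patch itself, and `HaStateDemo`, `currentEpoch`, `toStandbyIfCurrent` and `toActive` are hypothetical names.

```java
// Hypothetical sketch; not the ResourceManager code or the actual patch.
public class HaStateDemo {
  private Object activeServices = new Object(); // stands in for the active services
  private long epoch = 0;                       // bumped on every state transition

  // Epoch to record when a to-standby request is queued.
  public synchronized long currentEpoch() {
    return epoch;
  }

  // Runs the standby transition only if no other transition has happened
  // since the request was queued. Returns true if the transition ran.
  public synchronized boolean toStandbyIfCurrent(long requestEpoch) {
    if (requestEpoch != epoch || activeServices == null) {
      return false; // stale request: another thread already transitioned
    }
    activeServices = null; // stands in for stopping the active services
    epoch++;
    return true;
  }

  // Stands in for a successful transition to active.
  public synchronized void toActive() {
    activeServices = new Object();
    epoch++;
  }

  public static void main(String[] args) {
    HaStateDemo rm = new HaStateDemo();
    long queuedAt = rm.currentEpoch(); // two events queued at the same epoch
    System.out.println(rm.toStandbyIfCurrent(queuedAt)); // first request runs
    System.out.println(rm.toStandbyIfCurrent(queuedAt)); // second request is skipped
  }
}
```

Because both the check and the state change happen under the same lock, the second request can never observe a half-finished transition.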
   
   ### For code changes:
   
   - [ ] Does the title of this PR start with the corresponding JIRA issue id 
(e.g. 'HADOOP-17799. Your PR title ...')?
   - [ ] Object storage: have the integration tests been executed and the 
endpoint declared according to the connector-specific documentation?
   - [ ] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)?
   - [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, 
`NOTICE-binary` files?
   
   




> ResourceManager asynchronous switch from Standy to Active exception
> ---
>
> Key: YARN-11622
> URL: https://issues.apache.org/jira/browse/YARN-11622
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.0.0-alpha4, 3.1.1, 3.3.0
>Reporter: wangzhihui
>Assignee: wangzhihui
>Priority: Major
>  Labels: pull-request-available
> Attachments: rm_ha_solution.png, yuque_diagram (1).jpg, 
> yuque_diagram.jpg
>
>
> h1. Two exception cases:
> h2. The first case:
> *The exception desc:*
> {code:java}
> 14:52:57,426 FATAL event.AsyncDispatcher (AsyncDispatcher.java:dispatch(203)) 
> - Error in dispatcher thread
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.access$1200(ResourceManager.java:610)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.handleTransitionToStandByInNewThread(ResourceManager.java:941)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.access$1100(ResourceManager.java:144)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:902)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:892)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
> at java.lang.Thread.run(Thread.java:748)
> {code}
>  
>  * ActiveStandbyElector and ZKRMStateStore each triggered a toStandby event at 
> 14:52:57. The two asynchronous events are referred to as Thread_1 and Thread_2.
>  * As shown in the following figure, Thread_1 sets activeServices to null 
> during the toStandby process. Thread_2 then uses activeServices while 
> executing the handleTransitionToStandByInNewThread method, which results in a 
> NullPointerException and causes the ResourceManager server to exit.
> !yuque_diagram.jpg|width=629,height=100!
> h2. The second case:
> *The exception desc:* 
> {code:java}
> 06:17:35,913 WARN ha.ActiveStandbyElector 
> (ActiveStandbyElector.java:becomeActive(900)) - Exception handling the 
> winning of election
> org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:146)
> at 
> org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:896)
> at 
> org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:543)
> at 
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:558)
> at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
> Caused by: org.apache.hadoop.ha.ServiceFailedException: Error on refreshAll 
> during transition to Active
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:315)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:144)
> ... 4 more
> Caused by: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation 
> failed
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:765)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:307)
> ... 5 more
> Caused by: 

[jira] [Commented] (YARN-11622) ResourceManager asynchronous switch from Standy to Active exception

2023-12-13 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17796598#comment-17796598
 ] 

ASF GitHub Bot commented on YARN-11622:
---

hiwangzhihui commented on PR #6352:
URL: https://github.com/apache/hadoop/pull/6352#issuecomment-1855319775

   > @hiwangzhihui Thank you for your contribution! I have a question: why did 
the active RM lose contact with ZK? Was it because the active RM hit a full GC? 
Even if the situation described in your JIRA occurs, the cluster should still 
have completed the HA switch. Did the original standby RM become the active RM?
   
   





[jira] [Commented] (YARN-11622) ResourceManager asynchronous switch from Standy to Active exception

2023-12-13 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17796596#comment-17796596
 ] 

ASF GitHub Bot commented on YARN-11622:
---

hiwangzhihui commented on PR #6352:
URL: https://github.com/apache/hadoop/pull/6352#issuecomment-1855305666

   > @hiwangzhihui Thank you for your contribution! I have a question, why did 
active RM lose contact with ZK? Is it because Active RM has full gc? Even if 
the situation described in your JIRA occurs, the cluster should have completed 
the HA switch. Has the original standby RM changed to active RM?
   
   In issue YARN-11625, the Active ResourceManager fails to switch when an 
exception occurs.
   





[jira] [Commented] (YARN-11622) ResourceManager asynchronous switch from Standy to Active exception

2023-12-13 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17796591#comment-17796591
 ] 

ASF GitHub Bot commented on YARN-11622:
---

hiwangzhihui closed pull request #6352: YARN-11622. Fix ResourceManager 
asynchronous switch from Standy to Active exception
URL: https://github.com/apache/hadoop/pull/6352




> ResourceManager asynchronous switch from Standy to Active exception
> ---
>
> Key: YARN-11622
> URL: https://issues.apache.org/jira/browse/YARN-11622
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.0.0-alpha4, 3.1.1, 3.3.0
>Reporter: wangzhihui
>Assignee: wangzhihui
>Priority: Major
>  Labels: pull-request-available
> Attachments: rm_ha_solution.png, yuque_diagram (1).jpg, 
> yuque_diagram.jpg
>
>
> h1. Two exception cases:
> h2. The first case:
> *The exception desc:*
> {code:java}
> 14:52:57,426 FATAL event.AsyncDispatcher (AsyncDispatcher.java:dispatch(203)) 
> - Error in dispatcher thread
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.access$1200(ResourceManager.java:610)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.handleTransitionToStandByInNewThread(ResourceManager.java:941)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.access$1100(ResourceManager.java:144)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:902)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:892)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
> at java.lang.Thread.run(Thread.java:748)
> {code}
>  
>  * ActiveStandbyElector and ZKRMStateStore each triggered a toStandby event at 
> 14:52:57. The two asynchronous events are referred to as Thread_1 and Thread_2.
>  * As shown in the following figure, Thread_1 reinitializes activeServices to 
> null during the toStandby process. At this point, Thread_2 uses activeServices 
> while executing the handleTransitionToStandByInNewThread method, which results 
> in a NullPointerException and causes the ResourceManager server to exit.
> !yuque_diagram.jpg|width=629,height=100!
> h2. The second case:
> *The exception desc:* 
> {code:java}
> 06:17:35,913 WARN ha.ActiveStandbyElector 
> (ActiveStandbyElector.java:becomeActive(900)) - Exception handling the 
> winning of election
> org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:146)
> at 
> org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:896)
> at 
> org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:543)
> at 
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:558)
> at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
> Caused by: org.apache.hadoop.ha.ServiceFailedException: Error on refreshAll 
> during transition to Active
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:315)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:144)
> ... 4 more
> Caused by: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation 
> failed
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:765)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:307)
> ... 5 more
> Caused by: java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:467)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:423)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:754)
> ... 6 more
> 06:17:35,917 ERROR resourcemanager.ResourceManager 
> (ResourceManager.java:handle(898)) - Received RMFatalEvent of type 
> TRANSITION_TO_ACTIVE_FAILED, caused by failure to refresh configuration 
> settings: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation 
> failed
> {code}
>  * ActiveStandbyElector and ZKRMStateStore triggered a toActive event and a 
> toStandby event at 06:17:35. The two asynchronous events are 

[jira] [Commented] (YARN-11622) ResourceManager asynchronous switch from Standy to Active exception

2023-12-13 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17796590#comment-17796590
 ] 

ASF GitHub Bot commented on YARN-11622:
---

hiwangzhihui commented on PR #6352:
URL: https://github.com/apache/hadoop/pull/6352#issuecomment-1855282113

   > @hiwangzhihui Thank you for your contribution! I have a question, why did 
active RM lose contact with ZK? Is it because Active RM has full gc? Even if 
the situation described in your JIRA occurs, the cluster should have completed 
the HA switch. Has the original standby RM changed to active RM?
   
   





[jira] [Commented] (YARN-11622) ResourceManager asynchronous switch from Standy to Active exception

2023-12-13 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17796589#comment-17796589
 ] 

ASF GitHub Bot commented on YARN-11622:
---

hiwangzhihui commented on PR #6352:
URL: https://github.com/apache/hadoop/pull/6352#issuecomment-1855280766

   > @hiwangzhihui Thank you for your contribution! I have a question, why did 
active RM lose contact with ZK? Is it because Active RM has full gc? Even if 
the situation described in your JIRA occurs, the cluster should have completed 
the HA switch. Has the original standby RM changed to active RM?
   
   The cause is that the ZooKeeper server was unstable. The standby RM 
successfully changed to the active RM.
   
   





[jira] [Commented] (YARN-11622) ResourceManager asynchronous switch from Standy to Active exception

2023-12-13 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17796579#comment-17796579
 ] 

ASF GitHub Bot commented on YARN-11622:
---

hiwangzhihui commented on code in PR #6352:
URL: https://github.com/apache/hadoop/pull/6352#discussion_r1426253410


##
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/AdminService.java:
##
@@ -339,8 +360,29 @@ public synchronized void transitionToActive(
   }
 
   @Override
-  public synchronized void transitionToStandby(
+  public void transitionToStandby(
   HAServiceProtocol.StateChangeRequestInfo reqInfo) throws IOException {
+if(rm.rmContext.isHAEnabled()){

Review Comment:
   Thanks for your review.  I have already formatted it.






[jira] [Commented] (YARN-11622) ResourceManager asynchronous switch from Standy to Active exception

2023-12-13 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17796279#comment-17796279
 ] 

ASF GitHub Bot commented on YARN-11622:
---

hadoop-yetus commented on PR #6352:
URL: https://github.com/apache/hadoop/pull/6352#issuecomment-1853961644

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |   4m 46s |  |  Docker mode activated.  |
   |||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to include 1 new or modified test files.  |
   |||| _ branch-3.3 Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  33m 44s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  compile  |   0m 33s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  checkstyle  |   0m 27s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  mvnsite  |   0m 37s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  javadoc  |   0m 31s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  spotbugs  |   1m 12s |  |  branch-3.3 passed  |
   | +1 :green_heart: |  shadedclient  |  21m 43s |  |  branch has no errors when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 35s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 30s |  |  the patch passed  |
   | +1 :green_heart: |  javac  |   0m 29s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks issues.  |
   | -0 :warning: |  checkstyle  |   0m 19s | [/results-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6352/1/artifact/out/results-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt) |  hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 70 new + 67 unchanged - 0 fixed = 137 total (was 67)  |
   | +1 :green_heart: |  mvnsite  |   0m 33s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 22s |  |  the patch passed  |
   | -1 :x: |  spotbugs  |   1m 13s | [/new-spotbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.html](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6352/1/artifact/out/new-spotbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.html) |  hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager generated 2 new + 0 unchanged - 0 fixed = 2 total (was 0)  |
   | +1 :green_heart: |  shadedclient  |  21m 23s |  |  patch has no errors when building and testing our client artifacts.  |
   |||| _ Other Tests _ |
   | +1 :green_heart: |  unit  |  79m 53s |  |  hadoop-yarn-server-resourcemanager in the patch passed.  |
   | -1 :x: |  asflicense  |   0m 24s | [/results-asflicense.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6352/1/artifact/out/results-asflicense.txt) |  The patch generated 1 ASF License warnings.  |
   |  |   | 169m 32s |  |  |
   
   
   | Reason | Tests |
   |-------:|:------|
   | SpotBugs | module:hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager |
   |  |  Exceptional return value of java.util.concurrent.ExecutorService.submit(Callable) ignored in org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.handleTransitionToStandByInNewThread()  At ResourceManager.java:[line 1133] |
   |  |  Should org.apache.hadoop.yarn.server.resourcemanager.TransitionToActiveStandbyRunner$TransitionToActiveStandbyResult be a _static_ inner class?  At TransitionToActiveStandbyRunner.java:[lines 46-60] |
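
   The two SpotBugs findings above are generic Java patterns and can be shown with a small sketch. This is not the code from PR #6352: `SubmitResultSketch`, `TransitionResult`, and `runTransition` are hypothetical names chosen for illustration. It assumes the usual remedies, namely keeping the `Future` returned by `ExecutorService.submit(Callable)` so that a failure in the task is observed, and declaring the result holder as a static nested class so that it carries no hidden reference to an enclosing instance.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration of the two SpotBugs findings; not Hadoop code.
class SubmitResultSketch {

    // Static nested class: it does not capture an enclosing instance.
    static final class TransitionResult {
        final boolean success;
        final String message;

        TransitionResult(boolean success, String message) {
            this.success = success;
            this.message = message;
        }
    }

    /** Submits the task and inspects its Future instead of discarding it. */
    static TransitionResult runTransition(Callable<TransitionResult> task) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            // Keep the Future so an exception thrown by the task is not lost.
            Future<TransitionResult> future = executor.submit(task);
            return future.get();
        } catch (ExecutionException e) {
            // The task itself failed; report its cause to the caller.
            return new TransitionResult(false, String.valueOf(e.getCause()));
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return new TransitionResult(false, "interrupted");
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        TransitionResult ok = runTransition(() -> new TransitionResult(true, "standby"));
        System.out.println(ok.success + " " + ok.message);
    }
}
```

   Blocking on `future.get()` is only one option. A caller that must not block could instead keep the `Future` and check it later, which would also address the ignored-return-value warning.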
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6352/1/artifact/out/Dockerfile |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6352 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux 0a5f4c65cda5 5.15.0-88-generic #98-Ubuntu SMP Mon Oct 2 
15:18:56 UTC 2023 

[jira] [Commented] (YARN-11622) ResourceManager asynchronous switch from Standy to Active exception

2023-12-13 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17796250#comment-17796250
 ] 

ASF GitHub Bot commented on YARN-11622:
---

slfan1989 commented on code in PR #6352:
URL: https://github.com/apache/hadoop/pull/6352#discussion_r1425283550


##
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMHA.java:
##
@@ -47,30 +34,40 @@
 import org.apache.hadoop.security.AccessControlException;
 import org.apache.hadoop.security.UserGroupInformation;
 import org.apache.hadoop.service.AbstractService;
+import org.apache.hadoop.test.GenericTestUtils;
 import org.apache.hadoop.yarn.conf.HAUtil;
 import org.apache.hadoop.yarn.conf.YarnConfiguration;
 import org.apache.hadoop.yarn.event.Dispatcher;
 import org.apache.hadoop.yarn.event.DrainDispatcher;
 import org.apache.hadoop.yarn.event.Event;
 import org.apache.hadoop.yarn.event.EventHandler;
 import org.apache.hadoop.yarn.exceptions.YarnRuntimeException;
-import org.apache.hadoop.yarn.server.resourcemanager.recovery.records.ApplicationStateData;
 import org.apache.hadoop.yarn.server.resourcemanager.recovery.MemoryRMStateStore;
 import org.apache.hadoop.yarn.server.resourcemanager.recovery.StoreFencedException;
+import org.apache.hadoop.yarn.server.resourcemanager.recovery.records.ApplicationStateData;
 import org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMApp;
 import org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttempt;
 import org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptState;
+import org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNode;
 import org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics;
+import org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.NodeUpdateSchedulerEvent;
 import org.codehaus.jettison.json.JSONException;
 import org.codehaus.jettison.json.JSONObject;
 import org.junit.Assert;
 import org.junit.Before;
 import org.junit.Test;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
 
-import com.sun.jersey.api.client.Client;
-import com.sun.jersey.api.client.ClientResponse;
-import com.sun.jersey.api.client.WebResource;
-import com.sun.jersey.api.client.config.DefaultClientConfig;
+import javax.ws.rs.core.MediaType;
+import java.io.IOException;
+import java.net.InetSocketAddress;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.function.Supplier;
+
+import static org.assertj.core.api.Assertions.assertThat;
+import static org.junit.Assert.*;

Review Comment:
   Avoid the wildcard import (`*`).






[jira] [Commented] (YARN-11622) ResourceManager asynchronous switch from Standy to Active exception

2023-12-13 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17796236#comment-17796236
 ] 

ASF GitHub Bot commented on YARN-11622:
---

slfan1989 commented on code in PR #6352:
URL: https://github.com/apache/hadoop/pull/6352#discussion_r1425282821


##
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/AdminService.java:
##
@@ -339,8 +360,29 @@ public synchronized void transitionToActive(
   }
 
   @Override
-  public synchronized void transitionToStandby(
+  public void transitionToStandby(
   HAServiceProtocol.StateChangeRequestInfo reqInfo) throws IOException {
+if(rm.rmContext.isHAEnabled()){

Review Comment:
   We need to pay attention to the indentation of the code.



##
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/AdminService.java:
##
@@ -1105,4 +1147,44 @@ private boolean validateForInvalidNode(String node,
 }
 return isKnown;
   }
+
+  private class AdminServiceToActiveRunner extends TransitionToActiveStandbyRunner{
+HAServiceProtocol.StateChangeRequestInfo reqInfo;
+
+public AdminServiceToActiveRunner(long clusterTimeStamp ,StateChangeRequestInfo reqInfo) {
+  super(clusterTimeStamp);
+  this.reqInfo = reqInfo;
+}
+
+@Override
+public void doTransaction() throws Exception{
+  innerTransitionToActive(reqInfo);
+}
+

Review Comment:
   avoid






[jira] [Commented] (YARN-11622) ResourceManager asynchronous switch from Standy to Active exception

2023-12-13 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17796221#comment-17796221
 ] 

ASF GitHub Bot commented on YARN-11622:
---

slfan1989 commented on PR #6352:
URL: https://github.com/apache/hadoop/pull/6352#issuecomment-1853819722

   @hiwangzhihui Thank you for your contribution! I have a question: why did 
the active RM lose contact with ZK? Was it because of a full GC on the active 
RM? Even if the situation described in your JIRA occurs, the cluster should 
have completed the HA switch. Did the original standby RM become the active RM?





[jira] [Commented] (YARN-11622) ResourceManager asynchronous switch from Standy to Active exception

2023-12-13 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17796159#comment-17796159
 ] 

ASF GitHub Bot commented on YARN-11622:
---

hiwangzhihui opened a new pull request, #6352:
URL: https://github.com/apache/hadoop/pull/6352

   …tive exception
   
   
   
   ### Description of PR
   YARN-11622 Fix ResourceManager asynchronous switch from Standy to Active 
exception
   
   ### How was this patch tested?
   add TestRMHA.testTransitionToActiveFailedAfterToStandbyNotSkip
   add TestRMHA.testLessEpochRMFatalToStandbyRunnerShouldNotExecute
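
The name of the second test suggests that a to-standby runner created under an older epoch should not execute. A minimal sketch of such a guard, assuming a monotonically increasing epoch counter, might look as follows. All names here (`EpochGuardSketch`, `newEpoch`, `runIfCurrent`) are hypothetical and are not taken from the patch, so this illustrates the idea and does not describe the actual implementation.

```java
import java.util.concurrent.atomic.AtomicLong;

public class EpochGuardSketch {

  /** Latest epoch; bumped each time a new (hypothetical) HA cycle starts. */
  private final AtomicLong currentEpoch = new AtomicLong(0);

  /** Number of actions that actually ran. */
  private int executed = 0;

  /** Starts a new cycle and returns its epoch. */
  public long newEpoch() {
    return currentEpoch.incrementAndGet();
  }

  /**
   * Runs the action only if it was created under the current epoch.
   * A request carrying a smaller epoch is considered stale and is skipped.
   */
  public synchronized boolean runIfCurrent(long runnerEpoch, Runnable action) {
    if (runnerEpoch < currentEpoch.get()) {
      return false; // stale request from an earlier cycle
    }
    action.run();
    executed++;
    return true;
  }

  public synchronized int executedCount() {
    return executed;
  }

  public static void main(String[] args) {
    EpochGuardSketch guard = new EpochGuardSketch();
    long oldEpoch = guard.newEpoch();
    long newEpoch = guard.newEpoch();
    System.out.println("old runner ran: " + guard.runIfCurrent(oldEpoch, () -> { }));
    System.out.println("new runner ran: " + guard.runIfCurrent(newEpoch, () -> { }));
  }
}
```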
   
   ### For code changes:
   
   - [ ] Does the title of this PR start with the corresponding JIRA issue id 
(e.g. 'HADOOP-17799. Your PR title ...')?
   - [ ] Object storage: have the integration tests been executed and the 
endpoint declared according to the connector-specific documentation?
   - [ ] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)?
   - [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, 
`NOTICE-binary` files?
   
   





[jira] [Commented] (YARN-11622) ResourceManager asynchronous switch from Standy to Active exception

2023-12-06 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17793675#comment-17793675
 ] 

Xiaoqiao He commented on YARN-11622:


Add [~hiwangzhihui] to contributor list and assign this ticket to him.

> ResourceManager asynchronous switch from Standy to Active exception
> ---
>
> Key: YARN-11622
> URL: https://issues.apache.org/jira/browse/YARN-11622
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 3.0.0-alpha4, 3.1.1, 3.3.0
> Reporter: wangzhihui
> Assignee: wangzhihui
> Priority: Major
> Attachments: rm_ha_solution.png, yuque_diagram (1).jpg, yuque_diagram.jpg
>
>
> h1. Two exception cases:
> h2. The first case:
> *The exception desc:*
> {code:java}
> 14:52:57,426 FATAL event.AsyncDispatcher (AsyncDispatcher.java:dispatch(203)) - Error in dispatcher thread
> java.lang.NullPointerException
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.access$1200(ResourceManager.java:610)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.handleTransitionToStandByInNewThread(ResourceManager.java:941)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.access$1100(ResourceManager.java:144)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:902)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:892)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
>     at java.lang.Thread.run(Thread.java:748)
> {code}
>  
>  * ActiveStandbyElector and ZKRMStateStore each triggered a toStandby event at 
> 14:52:57. The two asynchronous events are referred to as Thread_1 and 
> Thread_2.
>  * As shown in the following figure, Thread_1 sets activeServices to null 
> while reinitializing it during the toStandby process. At that point, Thread_2 
> uses activeServices while executing the handleTransitionToStandByInNewThread 
> method, which results in a NullPointerException and causes the 
> ResourceManager server to exit.
> !yuque_diagram.jpg|width=629,height=100!
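
The first case can be illustrated with a small, self-contained model: a shared field that is briefly null while one thread reinitializes it, and a second thread that runs the same sequence at the same time. The sketch below guards the whole sequence with one lock so that neither thread can observe the null window. It is not the fix from the patch, and every name in it (`ActiveServicesRaceSketch`, `ActiveServices`, `transitionLock`, `runTwoThreads`) is a hypothetical stand-in.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

public class ActiveServicesRaceSketch {

  /** Hypothetical stand-in for the ResourceManager's active services. */
  public static final class ActiveServices {
    void stop() { }
  }

  private final Object transitionLock = new Object();
  private ActiveServices activeServices = new ActiveServices();

  /**
   * Stops and re-creates activeServices. The field is null for a moment in
   * the middle, so the whole sequence runs under a single lock.
   */
  public void transitionToStandby() {
    synchronized (transitionLock) {
      ActiveServices current = activeServices;
      activeServices = null;      // an unguarded reader would see null here
      current.stop();
      activeServices = new ActiveServices();
    }
  }

  /** Runs two concurrent callers and counts NullPointerExceptions. */
  public static int runTwoThreads(int iterations) {
    ActiveServicesRaceSketch rm = new ActiveServicesRaceSketch();
    AtomicInteger failures = new AtomicInteger();
    CountDownLatch start = new CountDownLatch(1);
    Runnable body = () -> {
      try {
        start.await(); // release both threads at the same time
        for (int i = 0; i < iterations; i++) {
          rm.transitionToStandby();
        }
      } catch (NullPointerException e) {
        failures.incrementAndGet();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    };
    Thread t1 = new Thread(body, "Thread_1");
    Thread t2 = new Thread(body, "Thread_2");
    t1.start();
    t2.start();
    start.countDown();
    try {
      t1.join();
      t2.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return failures.get();
  }

  public static void main(String[] args) {
    System.out.println("NullPointerExceptions: " + runTwoThreads(10_000));
  }
}
```

Removing the `synchronized` block from `transitionToStandby` makes the null window visible to the other thread, which is the interleaving described above.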
> h2. The second case:
> *The exception desc:* 
> {code:java}
> 06:17:35,913 WARN ha.ActiveStandbyElector (ActiveStandbyElector.java:becomeActive(900)) - Exception handling the winning of election
> org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
>     at org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:146)
>     at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:896)
>     at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:543)
>     at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:558)
>     at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
> Caused by: org.apache.hadoop.ha.ServiceFailedException: Error on refreshAll during transition to Active
>     at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:315)
>     at org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:144)
>     ... 4 more
> Caused by: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation failed
>     at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:765)
>     at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:307)
>     ... 5 more
> Caused by: java.lang.NullPointerException
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:467)
>     at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:423)
>     at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:754)
>     ... 6 more
> 06:17:35,917 ERROR resourcemanager.ResourceManager (ResourceManager.java:handle(898)) - Received RMFatalEvent of type TRANSITION_TO_ACTIVE_FAILED, caused by failure to refresh configuration settings: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation failed
> {code}
>  * ActiveStandbyElector and ZKRMStateStore triggered a toActive event and a 
> toStandby event at 06:17:35. The two asynchronous events are referred to as 
> Thread_1 and Thread_2.
>  * During the execution of Thread_1, CapacityScheduler.reinitialize is 
> called to refresh the Scheduler 

[jira] [Commented] (YARN-11622) ResourceManager asynchronous switch from Standy to Active exception

2023-12-05 Thread wangzhihui (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17793182#comment-17793182
 ] 

wangzhihui commented on YARN-11622:
---

Hi [~slfan1989], could you please review another related issue?

https://issues.apache.org/jira/browse/YARN-11625


[jira] [Commented] (YARN-11622) ResourceManager asynchronous switch from Standy to Active exception

2023-12-04 Thread wangzhihui (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17793117#comment-17793117
 ] 

wangzhihui commented on YARN-11622:
---

[~hexiaoqiao] [~elgoiri] [~slfan1989] Thank you all. I will start working on the fix soon.

> ResourceManager asynchronous switch from Standy to Active exception
> ---
>
> Key: YARN-11622
> URL: https://issues.apache.org/jira/browse/YARN-11622
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.0.0-alpha4, 3.1.1, 3.3.0
>Reporter: wangzhihui
>Priority: Major
> Attachments: rm_ha_solution.png, yuque_diagram (1).jpg, 
> yuque_diagram.jpg
>
>
> h1. Two exception cases:
> h2. The first case:
> *The exception desc:*
> {code:java}
> 14:52:57,426 FATAL event.AsyncDispatcher (AsyncDispatcher.java:dispatch(203)) 
> - Error in dispatcher thread
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.access$1200(ResourceManager.java:610)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.handleTransitionToStandByInNewThread(ResourceManager.java:941)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.access$1100(ResourceManager.java:144)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:902)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:892)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
> at java.lang.Thread.run(Thread.java:748){{}} * {code}
>  
>  * ActiveStandbyElector and ZKRMStateStore triggered toStandy event at 
> 14:52:57,
> Two asynchronous events are respectively referred to as Thread_ 1、Thread_ 2.
>  * As shown in the following figure, Thread_1 during the toStandby process , 
> reinitializes the activeServices to null. At this point, Thread_2 will use 
> the "activeServices" when executing the handleTransitionToStandByInNewThread 
> method ultimately resulting in a NullPointerException and the Reosurcemanager 
> server exit.
> !yuque_diagram.jpg|width=629,height=100!
> h2. The second case:
> *The exception desc:* 
> {code:java}
> 06:17:35,913 WARN ha.ActiveStandbyElector 
> (ActiveStandbyElector.java:becomeActive(900)) - Exception handling the 
> winning of election
> org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:146)
> at 
> org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:896)
> at 
> org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:543)
> at 
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:558)
> at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
> Caused by: org.apache.hadoop.ha.ServiceFailedException: Error on refreshAll 
> during transition to Active
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:315)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:144)
> ... 4 more
> Caused by: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation 
> failed
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:765)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:307)
> ... 5 more
> Caused by: java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:467)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:423)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:754)
> ... 6 more
> 06:17:35,917 ERROR resourcemanager.ResourceManager 
> (ResourceManager.java:handle(898)) - Received RMFatalEvent of type 
> TRANSITION_TO_ACTIVE_FAILED, caused by failure to refresh configuration 
> settings: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation 
> failed
> {code}
>  * ActiveStandbyElector and ZKRMStateStore triggered a toActive event and a 
> toStandby event at 06:17:35. The two asynchronous events are referred to as 
> Thread_1 and Thread_2.
>  * During the execution of Thread_1, CapacityScheduler.reinitialize is 
> called to refresh the Scheduler 

[jira] [Commented] (YARN-11622) ResourceManager asynchronous switch from Standy to Active exception

2023-12-04 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17793106#comment-17793106
 ] 

Xiaoqiao He commented on YARN-11622:


Great, thanks [~slfan1989] and [~elgoiri].
[~hiwangzhihui] would you mind submitting a PR via GitHub? We will follow up 
and move this bugfix forward.


[jira] [Commented] (YARN-11622) ResourceManager asynchronous switch from Standy to Active exception

2023-12-04 Thread Jira


[ 
https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17793035#comment-17793035
 ] 

Íñigo Goiri commented on YARN-11622:


Not having a single place to track the locks is obviously an issue.
Adding this entity tracking all the access makes sense to me.
The only concern for me would be performance; let's add some evaluation for 
that once we have the implementation.
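
To make the idea of a single place that tracks the locks concrete, here is a 
minimal sketch, not the patch itself. It assumes one lock object that every 
HA transition must acquire, so a second toStandby call can no longer observe 
activeServices after another thread has set it to null. The names 
HaTransitionGuard, HaState, ActiveServices and transitionLock are hypothetical 
and are not taken from the Hadoop code base; treating activeServices as null 
while in standby is a simplification made for the example.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch: none of these names come from the Hadoop code base.
class HaTransitionGuard {
    enum HaState { ACTIVE, STANDBY }

    // Stand-in for the ResourceManager's active services.
    static final class ActiveServices {
        private boolean running = true;
        void stop() { running = false; }
    }

    // The single lock that every transition has to take.
    private final ReentrantLock transitionLock = new ReentrantLock();
    private HaState state = HaState.STANDBY;
    private ActiveServices activeServices; // null while in standby (simplification)

    void transitionToActive() {
        transitionLock.lock();
        try {
            if (state == HaState.ACTIVE) {
                return; // already active, nothing to do
            }
            activeServices = new ActiveServices();
            state = HaState.ACTIVE;
        } finally {
            transitionLock.unlock();
        }
    }

    // Returns true if this call performed the transition, false if the
    // ResourceManager was already in standby when the lock was acquired.
    boolean transitionToStandby() {
        transitionLock.lock();
        try {
            if (state == HaState.STANDBY || activeServices == null) {
                return false; // another thread got here first
            }
            activeServices.stop();
            activeServices = null;
            state = HaState.STANDBY;
            return true;
        } finally {
            transitionLock.unlock();
        }
    }

    HaState getState() {
        transitionLock.lock();
        try {
            return state;
        } finally {
            transitionLock.unlock();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        HaTransitionGuard guard = new HaTransitionGuard();
        guard.transitionToActive();

        // Two concurrent toStandby requests, as in the first exception case.
        AtomicInteger performed = new AtomicInteger();
        Runnable toStandby = () -> {
            if (guard.transitionToStandby()) {
                performed.incrementAndGet();
            }
        };
        Thread t1 = new Thread(toStandby, "Thread_1");
        Thread t2 = new Thread(toStandby, "Thread_2");
        t1.start();
        t2.start();
        t1.join();
        t2.join();

        System.out.println("transitions performed: " + performed.get()
            + ", state: " + guard.getState());
    }
}
```

In this sketch exactly one of the two threads performs the transition and the 
other returns without touching activeServices. Whether a single coarse lock 
is cheap enough in the real ResourceManager is the performance question 
raised above and would still need to be measured.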

> ResourceManager asynchronous switch from Standy to Active exception
> ---
>
> Key: YARN-11622
> URL: https://issues.apache.org/jira/browse/YARN-11622
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.0.0-alpha4, 3.1.1, 3.3.0
>Reporter: wangzhihui
>Priority: Major
> Attachments: rm_ha_solution.png, yuque_diagram (1).jpg, 
> yuque_diagram.jpg
>
>
> h1. Two exception cases:
> h2. The first case:
> *The exception desc:*
> {code:java}
> 14:52:57,426 FATAL event.AsyncDispatcher (AsyncDispatcher.java:dispatch(203)) 
> - Error in dispatcher thread
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.access$1200(ResourceManager.java:610)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.handleTransitionToStandByInNewThread(ResourceManager.java:941)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.access$1100(ResourceManager.java:144)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:902)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:892)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
> at java.lang.Thread.run(Thread.java:748)
> {code}
>  
>  * ActiveStandbyElector and ZKRMStateStore both triggered a toStandby event at 
> 14:52:57. The two asynchronous events are referred to as Thread_1 and Thread_2.
>  * As shown in the following figure, Thread_1 reinitializes activeServices to 
> null during the toStandby process. At that point Thread_2 uses activeServices 
> while executing the handleTransitionToStandByInNewThread method, which 
> results in a NullPointerException and causes the ResourceManager server to 
> exit.
> !yuque_diagram.jpg|width=629,height=100!
> h2. The second case:
> *The exception desc:* 
> {code:java}
> 06:17:35,913 WARN ha.ActiveStandbyElector 
> (ActiveStandbyElector.java:becomeActive(900)) - Exception handling the 
> winning of election
> org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:146)
> at 
> org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:896)
> at 
> org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:543)
> at 
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:558)
> at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
> Caused by: org.apache.hadoop.ha.ServiceFailedException: Error on refreshAll 
> during transition to Active
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:315)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:144)
> ... 4 more
> Caused by: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation 
> failed
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:765)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:307)
> ... 5 more
> Caused by: java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:467)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:423)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:754)
> ... 6 more
> 06:17:35,917 ERROR resourcemanager.ResourceManager 
> (ResourceManager.java:handle(898)) - Received RMFatalEvent of type 
> TRANSITION_TO_ACTIVE_FAILED, caused by failure to refresh configuration 
> settings: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation 
> failed
> {code}
>  * ActiveStandbyElector and ZKRMStateStore triggered a toActive event and a 
> toStandby event at 06:17:35. The two asynchronous events are respectively 
> referred