[ https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17808332#comment-17808332 ]
ASF GitHub Bot commented on YARN-11622:
---------------------------------------

hadoop-yetus commented on PR #6352:
URL: https://github.com/apache/hadoop/pull/6352#issuecomment-1898963967

:broken_heart: **-1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:--------:|:-------:|
| +0 :ok: | reexec | 4m 20s | | Docker mode activated. |
|||| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 0s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. |
|||| _ branch-3.3 Compile Tests _ |
| +1 :green_heart: | mvninstall | 33m 44s | | branch-3.3 passed |
| +1 :green_heart: | compile | 0m 35s | | branch-3.3 passed |
| +1 :green_heart: | checkstyle | 0m 28s | | branch-3.3 passed |
| +1 :green_heart: | mvnsite | 0m 40s | | branch-3.3 passed |
| +1 :green_heart: | javadoc | 0m 30s | | branch-3.3 passed |
| +1 :green_heart: | spotbugs | 1m 14s | | branch-3.3 passed |
| +1 :green_heart: | shadedclient | 21m 17s | | branch has no errors when building and testing our client artifacts. |
|||| _ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 0m 34s | | the patch passed |
| +1 :green_heart: | compile | 0m 29s | | the patch passed |
| +1 :green_heart: | javac | 0m 29s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 :green_heart: | checkstyle | 0m 19s | | hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 0 new + 66 unchanged - 1 fixed = 66 total (was 67) |
| +1 :green_heart: | mvnsite | 0m 29s | | the patch passed |
| +1 :green_heart: | javadoc | 0m 22s | | the patch passed |
| -1 :x: | spotbugs | 1m 15s | [/new-spotbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.html](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6352/10/artifact/out/new-spotbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.html) | hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) |
| +1 :green_heart: | shadedclient | 21m 33s | | patch has no errors when building and testing our client artifacts. |
|||| _ Other Tests _ |
| -1 :x: | unit | 78m 12s | [/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6352/10/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt) | hadoop-yarn-server-resourcemanager in the patch passed. |
| +1 :green_heart: | asflicense | 0m 23s | | The patch does not generate ASF License warnings. |
| | | 167m 20s | | |

| Reason | Tests |
|-------:|:------|
| SpotBugs | module:hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager |
| | Exceptional return value of java.util.concurrent.ExecutorService.submit(Callable) ignored in org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.handleTransitionToStandByInNewThread() At ResourceManager.java:ignored in org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.handleTransitionToStandByInNewThread() At ResourceManager.java:[line 1131] |
| Failed junit tests | hadoop.yarn.server.resourcemanager.TestRMHA |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6352/10/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/6352 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux f16f271e28e6 5.15.0-88-generic #98-Ubuntu SMP Mon Oct 2 15:18:56 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | branch-3.3 / 5ae791898e1e8d053e7aebefd0532ff533b09087 |
| Default Java | Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~18.04-b09 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6352/10/testReport/ |
| Max. process+thread count | 934 (vs. ulimit of 5500) |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6352/10/console |
| versions | git=2.17.1 maven=3.6.0 spotbugs=4.2.2 |
| Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |

This message was automatically generated.
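
The SpotBugs finding in the report concerns the Future returned by ExecutorService.submit(Callable): an exception thrown inside the submitted task is stored in that Future, so discarding the return value silently loses the failure. The following is a minimal sketch of one way to consume the result. The class SubmitResultDemo and the method runAndReport are hypothetical names used only for illustration; they are not taken from ResourceManager.java or from the patch.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical demo class; not taken from ResourceManager.java.
final class SubmitResultDemo {

  // Submits the task and inspects the Future instead of discarding it.
  // Returns the failure message, or null when the task succeeded.
  static String runAndReport(ExecutorService executor, Callable<Void> task) {
    Future<Void> future = executor.submit(task);
    try {
      future.get(); // a task failure surfaces here as ExecutionException
      return null;
    } catch (ExecutionException e) {
      return e.getCause().getMessage();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt status
      return "interrupted";
    }
  }

  public static void main(String[] args) {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    try {
      String failure = runAndReport(executor, () -> {
        throw new IllegalStateException("transition failed");
      });
      System.out.println("task failure: " + failure);
    } finally {
      executor.shutdown();
    }
  }
}
```

Blocking on get() is only one option; logging the failure inside the task itself would also keep it from being lost, and which approach suits the patch is a judgment for its author.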
> ResourceManager asynchronous switch from Standy to Active exception
> -------------------------------------------------------------------
>
>                 Key: YARN-11622
>                 URL: https://issues.apache.org/jira/browse/YARN-11622
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 3.0.0-alpha4, 3.1.1, 3.3.0
>            Reporter: wangzhihui
>            Assignee: wangzhihui
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: rm_ha_solution.png, yuque_diagram (1).jpg, yuque_diagram.jpg
>
>
> h1. Two exception cases:
> h2. The first case:
> *Exception description:*
> {code:java}
> 14:52:57,426 FATAL event.AsyncDispatcher (AsyncDispatcher.java:dispatch(203)) - Error in dispatcher thread
> java.lang.NullPointerException
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.access$1200(ResourceManager.java:610)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.handleTransitionToStandByInNewThread(ResourceManager.java:941)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.access$1100(ResourceManager.java:144)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:902)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:892)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
>         at java.lang.Thread.run(Thread.java:748)
> {code}
>
> * ActiveStandbyElector and ZKRMStateStore both triggered a toStandby event at
> 14:52:57. The two asynchronous events are referred to as Thread_1 and Thread_2.
> * As shown in the following figure, Thread_1 resets activeServices to null
> during the toStandby process. Thread_2 then uses activeServices while
> executing the handleTransitionToStandByInNewThread method, which results in a
> NullPointerException and causes the ResourceManager server to exit.
> !yuque_diagram.jpg|width=629,height=100!
> h2. The second case:
> *Exception description:*
> {code:java}
> 06:17:35,913 WARN ha.ActiveStandbyElector (ActiveStandbyElector.java:becomeActive(900)) - Exception handling the winning of election
> org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
>         at org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:146)
>         at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:896)
>         at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:543)
>         at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:558)
>         at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
> Caused by: org.apache.hadoop.ha.ServiceFailedException: Error on refreshAll during transition to Active
>         at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:315)
>         at org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:144)
>         ... 4 more
> Caused by: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation failed
>         at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:765)
>         at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:307)
>         ... 5 more
> Caused by: java.lang.NullPointerException
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:467)
>         at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:423)
>         at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:754)
>         ... 6 more
> 06:17:35,917 ERROR resourcemanager.ResourceManager (ResourceManager.java:handle(898)) - Received RMFatalEvent of type TRANSITION_TO_ACTIVE_FAILED, caused by failure to refresh configuration settings: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation failed
> {code}
> * ActiveStandbyElector and ZKRMStateStore triggered a toActive event and a
> toStandby event at 06:17:35. The two asynchronous events are referred to as
> Thread_1 and Thread_2.
> * While Thread_1 is executing, CapacityScheduler.reinitialize is called to
> refresh the scheduler configuration. At that time, the csConfProvider
> property of the CapacityScheduler has not been initialized and its value is
> null. As a result, when the reinitialize method uses csConfProvider, it
> triggers a NullPointerException and causes Thread_1's transition to active to
> fail.
> !yuque_diagram (1).jpg|width=568,height=155!
> h1. Solution
> Because the lock in ResourceManager's transitionToActive and
> transitionToStandby methods covers only a limited scope, events triggered
> asynchronously outside that scope can interfere with each other, leading to
> unpredictable issues. The proposed solution is to encapsulate the different
> asynchronous tasks as TransitionToActiveStandbyRunner instances and enqueue
> them to be executed in order by a SingleThreadExecutor. This approach
> resolves the asynchronous problem and makes switching between the active and
> standby states clearer and more controllable.
> !rm_ha_solution.png|width=362,height=353!
> h2. TransitionToActiveStandbyRunner and Subclasses
> h3. TransitionToActiveStandbyRunner
> TransitionToActiveStandbyRunner is a template class; the logic for each
> scenario is placed in, and executed by, the doTransaction method.
> {code:java}
> import java.util.concurrent.Callable;
>
> public abstract class TransitionToActiveStandbyRunner
>     implements Callable<TransitionToActiveStandbyResult> {
>
>   @Override
>   public TransitionToActiveStandbyResult call() throws Exception {
>     // ... before log ...
>     TransitionToActiveStandbyResult result = doTransaction();
>     // ... after log ...
>     return result;
>   }
>
>   // Each subclass supplies the scenario-specific transition logic.
>   public abstract TransitionToActiveStandbyResult doTransaction();
> }
> {code}
> h3. Subclasses
> *AdminServiceToActiveRunner*
> AdminServiceToActiveRunner encapsulates the logic of the transitionToActive
> method in AdminService, handling requests from clients and
> ActiveStandbyElector to transition to the active state.
> *AdminServiceToStandbyRunner*
> AdminServiceToStandbyRunner encapsulates the logic of the transitionToStandby
> method in AdminService, handling requests from clients and
> ActiveStandbyElector to transition to the standby state.
> *RmStartAndStopToStandby*
> RmStartAndStopToStandby is used for transitioning the ResourceManager service
> to standby when it is stopping or starting.
>
> *RMStartToActiveRunner*
> RMStartToActiveRunner is used for transitioning the ResourceManager service
> to active when it is starting.
>
> *RMFatalToStandbyRunner*
> RMFatalToStandbyRunner handles an RMFatalEvent by transitioning to standby
> when YARN HA mode is enabled.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
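
The queueing idea in the Solution section of the issue description can be sketched roughly as follows. This is only an illustration under stated assumptions: SerializedTransitions, HaState, submitTransition, and await are hypothetical names, not the classes in the PR, and a real transition does far more than flip a field. The sketch relies on the documented guarantee that an executor from Executors.newSingleThreadExecutor() runs queued tasks sequentially, so a later transition always observes the state left by the earlier one.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration; not the ResourceManager or the PR's classes.
final class SerializedTransitions {
  enum HaState { STANDBY, ACTIVE }

  // One worker thread: queued tasks run one at a time, in submission order.
  private final ExecutorService executor = Executors.newSingleThreadExecutor();

  // Read and written only by the worker thread, so transitions cannot interleave.
  private HaState state = HaState.STANDBY;

  // Enqueues a transition; the Future yields a "FROM -> TO" description.
  Future<String> submitTransition(HaState target) {
    return executor.submit(() -> {
      HaState previous = state;
      state = target;
      return previous + " -> " + target;
    });
  }

  // Blocks until the queued transition has run and returns its description.
  static String await(Future<String> future) {
    try {
      return future.get();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt status
      throw new IllegalStateException(e);
    } catch (ExecutionException e) {
      throw new IllegalStateException(e.getCause());
    }
  }

  void shutdown() {
    executor.shutdown();
  }

  public static void main(String[] args) {
    SerializedTransitions rm = new SerializedTransitions();
    // Two events arrive back to back, as in the two cases in the description.
    Future<String> toActive = rm.submitTransition(HaState.ACTIVE);
    Future<String> toStandby = rm.submitTransition(HaState.STANDBY);
    System.out.println(await(toActive));
    System.out.println(await(toStandby));
    rm.shutdown();
  }
}
```

Because both callers only enqueue work and never touch the state directly, neither event can observe a half-finished transition from the other, which is the property the two exception cases lacked.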