[ https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
wangzhihui updated YARN-11622:
------------------------------
    Affects Version/s: 3.1.3

> ResourceManager asynchronous switch to Standy、Active exception
> --------------------------------------------------------------
>
>              Key: YARN-11622
>              URL: https://issues.apache.org/jira/browse/YARN-11622
>          Project: Hadoop YARN
>       Issue Type: Bug
>       Components: resourcemanager
> Affects Versions: 3.0.0, 3.1.3
>         Reporter: wangzhihui
>         Priority: Major
>      Attachments: rm_ha_solution.png, yuque_diagram (1).jpg, yuque_diagram.jpg
>
>
> h1. Two exception cases:
> h2. The first case:
> *The exception desc:*
> {code:java}
> 14:52:57,426 FATAL event.AsyncDispatcher (AsyncDispatcher.java:dispatch(203)) - Error in dispatcher thread
> java.lang.NullPointerException
>   at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.access$1200(ResourceManager.java:610)
>   at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.handleTransitionToStandByInNewThread(ResourceManager.java:941)
>   at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.access$1100(ResourceManager.java:144)
>   at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:902)
>   at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:892)
>   at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
>   at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
>
> * ActiveStandbyElector and ZKRMStateStore both triggered a toStandby event at 14:52:57. The two asynchronous events are referred to below as Thread_1 and Thread_2.
> * As shown in the following figure, during the toStandby process Thread_1 reinitializes activeServices to null.
> At this point, Thread_2 uses activeServices while executing the handleTransitionToStandByInNewThread method, which results in a NullPointerException and causes the ResourceManager process to exit.
> !yuque_diagram.jpg|width=629,height=100!
> h2. The second case:
> *The exception desc:*
> {code:java}
> 06:17:35,913 WARN ha.ActiveStandbyElector (ActiveStandbyElector.java:becomeActive(900)) - Exception handling the winning of election
> org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
>   at org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:146)
>   at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:896)
>   at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:543)
>   at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:558)
>   at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
> Caused by: org.apache.hadoop.ha.ServiceFailedException: Error on refreshAll during transition to Active
>   at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:315)
>   at org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:144)
>   ... 4 more
> Caused by: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation failed
>   at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:765)
>   at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:307)
>   ... 5 more
> Caused by: java.lang.NullPointerException
>   at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:467)
>   at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:423)
>   at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:754)
>   ... 6 more
> 06:17:35,917 ERROR resourcemanager.ResourceManager (ResourceManager.java:handle(898)) - Received RMFatalEvent of type TRANSITION_TO_ACTIVE_FAILED, caused by failure to refresh configuration settings: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation failed
> {code}
> * ActiveStandbyElector and ZKRMStateStore triggered a toActive event and a toStandby event at 06:17:35. The two asynchronous events are referred to below as Thread_1 and Thread_2.
> * While Thread_1 is executing, CapacityScheduler.reinitialize is called to refresh the scheduler configuration. At that moment the csConfProvider property of CapacityScheduler has not been initialized and is null. As a result, reinitialize uses csConfProvider, throws a NullPointerException, and the transition to active on Thread_1 fails.
> !yuque_diagram (1).jpg|width=568,height=155!
> h1. Solution
> The locks in ResourceManager's transitionToActive and transitionToStandby methods cover only a limited scope, so events triggered asynchronously outside that scope can interfere with each other, leading to unpredictable failures. The proposed solution is to encapsulate each asynchronous task as a TransitionToActiveStandbyRunner and place it in a queue that a SingleThreadExecutor executes in order. This serializes the transitions and makes switching between active and standby clearer and more controllable.
> !rm_ha_solution.png|width=362,height=353!
> h2. TransitionToActiveStandbyRunner and Subclasses
> h3. TransitionToActiveStandbyRunner
> TransitionToActiveStandbyRunner is a template class; the logic for each scenario is placed in, and executed by, the doTransaction method.
> {code:java}
> import java.util.concurrent.Callable;
>
> // Template for one queued HA transition; subclasses implement doTransaction().
> public abstract class TransitionToActiveStandbyRunner
>     implements Callable<TransitionToActiveStandbyResult> {
>   @Override
>   public TransitionToActiveStandbyResult call() throws Exception {
>     // ... before log ...
>     TransitionToActiveStandbyResult result = doTransaction();
>     // ... after log ...
>     return result;
>   }
>
>   // Scenario-specific transition logic.
>   public abstract TransitionToActiveStandbyResult doTransaction();
> }
> {code}
> h3. Subclasses
> *AdminServiceToActiveRunner*
> AdminServiceToActiveRunner encapsulates the logic of the transitionToActive method in AdminService, handling requests from clients and ActiveStandbyElector to transition to the active state.
> *AdminServiceToStandbyRunner*
> AdminServiceToStandbyRunner encapsulates the logic of the transitionToStandby method in AdminService, handling requests from clients and ActiveStandbyElector to transition to the standby state.
> *RmStartAndStopToStandby*
> RmStartAndStopToStandby transitions the ResourceManager service to standby when it is starting or stopping.
>
> *RMStartToActiveRunner*
> RMStartToActiveRunner transitions the ResourceManager service to active when it is starting.
>
> *RMFatalToStandbyRunner*
> RMFatalToStandbyRunner handles an RMFatalEvent when YARN HA mode is enabled by transitioning the ResourceManager to standby.
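
The queueing idea in the Solution section can be sketched roughly as follows. This is a minimal illustration, not the code from the patch: TransitionQueueSketch, HaState, submit, and the "elector" / "stateStore" labels are hypothetical names, and a plain enum stands in for TransitionToActiveStandbyResult. The only real API assumed is java.util.concurrent (Executors.newSingleThreadExecutor, Callable, Future).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative only: none of these types are the classes from the YARN-11622 patch.
class TransitionQueueSketch {
    enum HaState { ACTIVE, STANDBY }

    // One worker thread drains the queue, so two transitions never overlap.
    private final ExecutorService executor = Executors.newSingleThreadExecutor();
    // Written only by the worker thread; callers read it after Future.get().
    private final List<String> history = new ArrayList<>();
    private HaState state = HaState.STANDBY;

    // Enqueue a transition; 'source' names the (hypothetical) trigger of the event.
    Future<HaState> submit(String source, HaState target) {
        Callable<HaState> runner = () -> {
            history.add(source + "->" + target);
            state = target;
            return state;
        };
        return executor.submit(runner);
    }

    void shutdown() {
        executor.shutdown();
    }

    // Submits two competing transitions and returns the order in which they ran.
    static List<String> runDemo() {
        TransitionQueueSketch rm = new TransitionQueueSketch();
        try {
            Future<HaState> first = rm.submit("elector", HaState.ACTIVE);
            Future<HaState> second = rm.submit("stateStore", HaState.STANDBY);
            first.get();
            second.get();
            return new ArrayList<>(rm.history);
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            rm.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(runDemo());
    }
}
```

Because every transition goes through the same single-threaded executor, a toStandby request submitted while a toActive is still running simply waits in the queue instead of observing half-initialized state such as a null activeServices or csConfProvider.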