[jira] [Commented] (YARN-11622) ResourceManager asynchronous switch from Standy to Active exception

ASF GitHub Bot (Jira) Wed, 13 Dec 2023 04:23:15 -0800


    [ 
https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17796221#comment-17796221
 ]


ASF GitHub Bot commented on YARN-11622:
---------------------------------------

slfan1989 commented on PR #6352:
URL: https://github.com/apache/hadoop/pull/6352#issuecomment-1853819722

   @hiwangzhihui Thank you for your contribution! I have a question: why did 
the active RM lose contact with ZK? Was it caused by a full GC on the active 
RM? Even if the situation described in your JIRA occurs, the cluster should 
still have completed the HA switch. Did the original standby RM become active?




> ResourceManager asynchronous switch from Standy to Active exception
> -------------------------------------------------------------------
>
>                 Key: YARN-11622
>                 URL: https://issues.apache.org/jira/browse/YARN-11622
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 3.0.0-alpha4, 3.1.1, 3.3.0
>            Reporter: wangzhihui
>            Assignee: wangzhihui
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: rm_ha_solution.png, yuque_diagram (1).jpg, 
> yuque_diagram.jpg
>
>
> h1. Two exception cases:
> h2. The first case:
> *Exception description:*
> {code:java}
> 14:52:57,426 FATAL event.AsyncDispatcher (AsyncDispatcher.java:dispatch(203)) 
> - Error in dispatcher thread
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.access$1200(ResourceManager.java:610)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.handleTransitionToStandByInNewThread(ResourceManager.java:941)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.access$1100(ResourceManager.java:144)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:902)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:892)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
> at java.lang.Thread.run(Thread.java:748)
> {code}
>  
>  * ActiveStandbyElector and ZKRMStateStore each triggered a toStandby event 
> at 14:52:57. The two asynchronous events are referred to below as Thread_1 
> and Thread_2.
>  * As shown in the following figure, Thread_1 sets activeServices to null 
> while reinitializing it during the toStandby process. Thread_2 then uses 
> activeServices when executing the handleTransitionToStandByInNewThread 
> method, which results in a NullPointerException and causes the 
> ResourceManager server to exit.
> !yuque_diagram.jpg|width=629,height=100!
> h2. The second case:
> *Exception description:*
> {code:java}
> 06:17:35,913 WARN ha.ActiveStandbyElector 
> (ActiveStandbyElector.java:becomeActive(900)) - Exception handling the 
> winning of election
> org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:146)
> at 
> org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:896)
> at 
> org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:543)
> at 
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:558)
> at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
> Caused by: org.apache.hadoop.ha.ServiceFailedException: Error on refreshAll 
> during transition to Active
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:315)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:144)
> ... 4 more
> Caused by: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation 
> failed
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:765)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:307)
> ... 5 more
> Caused by: java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:467)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:423)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:754)
> ... 6 more
> 06:17:35,917 ERROR resourcemanager.ResourceManager 
> (ResourceManager.java:handle(898)) - Received RMFatalEvent of type 
> TRANSITION_TO_ACTIVE_FAILED, caused by failure to refresh configuration 
> settings: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation 
> failed
> {code}
>  * ActiveStandbyElector and ZKRMStateStore triggered a toActive event and a 
> toStandby event at 06:17:35. The two asynchronous events are referred to 
> below as Thread_1 and Thread_2.
>  * While Thread_1 is executing, CapacityScheduler.reinitialize is called to 
> refresh the scheduler configuration. At this point the csConfProvider 
> property of the CapacityScheduler has not been initialized and is null. 
> When reinitialize uses csConfProvider, it throws a NullPointerException, and 
> Thread_1's transition to active fails.
> !yuque_diagram (1).jpg|width=568,height=155!
> h1. Solution
> The lock in ResourceManager's transitionToActive and transitionToStandby 
> methods covers only a limited scope, so different events triggered 
> asynchronously outside that scope can interfere with each other and cause 
> unpredictable failures. The proposed solution is to encapsulate each 
> asynchronous task as a TransitionToActiveStandbyRunner and enqueue it for 
> in-order execution by a SingleThreadExecutor. This serializes the 
> transitions and makes switching to active and to standby clearer and more 
> controllable.
> !rm_ha_solution.png|width=362,height=353!
> h2. TransitionToActiveStandbyRunner and Subclasses
> h3. TransitionToActiveStandbyRunner
>  TransitionToActiveStandbyRunner is a template class where the logic for 
> different scenarios is placed and executed within the doTransaction method.
> {code:java}
> public abstract class TransitionToActiveStandbyRunner
>     implements Callable<TransitionToActiveStandbyResult> {
>
>     @Override
>     public TransitionToActiveStandbyResult call() throws Exception {
>         // ... before log ...
>         TransitionToActiveStandbyResult result = doTransaction();
>         // ... after log ...
>         return result;
>     }
>
>     public abstract TransitionToActiveStandbyResult doTransaction();
> }
> {code}
> h3. Subclasses
> *AdminServiceToActiveRunner*
> AdminServiceToActiveRunner encapsulates the logic of the transitionToActive 
> method in AdminService, handling the requests from clients and 
> ActiveStandbyElector to transition to the active state.
> *AdminServiceToStandbyRunner*
> AdminServiceToStandbyRunner encapsulates the logic of the transitionToStandby 
> method in AdminService, handling the requests from clients and 
> ActiveStandbyElector to transition to the standby state.
> *RmStartAndStopToStandby*
> RmStartAndStopToStandby transitions the ResourceManager service to standby 
> when it is starting or stopping.
>  
> *RMStartToActiveRunner*
> RMStartToActiveRunner transitions the ResourceManager service to active 
> when it is starting.
>  
> *RMFatalToStandbyRunner*
> RMFatalToStandbyRunner handles an RMFatalEvent by transitioning to standby 
> when YARN HA mode is enabled.
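
The queueing idea in the quoted Solution section can be illustrated with a minimal, self-contained sketch. It is not the code from the pull request: the class TransitionQueueSketch, the HaState enum, the simplified TransitionRunner, and the submit and runDemo methods are all hypothetical names chosen for illustration. It only shows that transitions submitted to a single-thread executor run one at a time, in submission order, so no transition can observe another one half-finished.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class TransitionQueueSketch {

    /** Hypothetical, simplified HA states. */
    enum HaState { STANDBY, ACTIVE }

    /** Template runner in the shape of the quoted TransitionToActiveStandbyRunner. */
    abstract static class TransitionRunner implements Callable<HaState> {
        @Override
        public final HaState call() throws Exception {
            // A real runner would log before and after the transition here.
            return doTransaction();
        }

        abstract HaState doTransaction() throws Exception;
    }

    private final ExecutorService executor = Executors.newSingleThreadExecutor();
    // Both fields are read and written only on the executor thread.
    private final List<String> history = new ArrayList<>();
    private HaState state = HaState.STANDBY;

    /** Enqueue a transition; it runs only after all earlier ones have finished. */
    Future<HaState> submit(final HaState target) {
        return executor.submit(new TransitionRunner() {
            @Override
            HaState doTransaction() {
                history.add(state + "->" + target);
                state = target;
                return state;
            }
        });
    }

    /** Submits three transitions and returns the order in which they ran. */
    static List<String> runDemo() throws Exception {
        TransitionQueueSketch rm = new TransitionQueueSketch();
        try {
            rm.submit(HaState.ACTIVE);
            rm.submit(HaState.STANDBY);
            Future<HaState> last = rm.submit(HaState.ACTIVE);
            last.get(); // FIFO queue: earlier tasks have completed by now
            return new ArrayList<>(rm.history);
        } finally {
            rm.executor.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runDemo());
    }
}
```

Because every transition reads and writes the shared state on the same executor thread, the sketch needs no additional locking; callers that need the outcome wait on the returned Future.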



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
