[jira] [Commented] (YARN-11622) ResourceManager asynchronous switch from Standy to Active exception

Jira Mon, 04 Dec 2023 13:13:07 -0800


    [ 
https://issues.apache.org/jira/browse/YARN-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17793035#comment-17793035
 ]


Íñigo Goiri commented on YARN-11622:
------------------------------------

Not having a single place to track the locks is obviously an issue.
Adding this entity to track all access makes sense to me.
My only concern is performance; let's add some evaluation for 
that once we have the implementation.

> ResourceManager asynchronous switch from Standy to Active exception
> -------------------------------------------------------------------
>
>                 Key: YARN-11622
>                 URL: https://issues.apache.org/jira/browse/YARN-11622
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 3.0.0-alpha4, 3.1.1, 3.3.0
>            Reporter: wangzhihui
>            Priority: Major
>         Attachments: rm_ha_solution.png, yuque_diagram (1).jpg, 
> yuque_diagram.jpg
>
>
> h1. Two exception cases:
> h2. The first case:
> *Exception description:*
> {code:java}
> 14:52:57,426 FATAL event.AsyncDispatcher (AsyncDispatcher.java:dispatch(203)) 
> - Error in dispatcher thread
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.access$1200(ResourceManager.java:610)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.handleTransitionToStandByInNewThread(ResourceManager.java:941)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.access$1100(ResourceManager.java:144)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:902)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:892)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
> at java.lang.Thread.run(Thread.java:748)
> {code}
>  
>  * ActiveStandbyElector and ZKRMStateStore both triggered a toStandby event at 
> 14:52:57. The two asynchronous events are referred to as Thread_1 and Thread_2.
>  * As shown in the following figure, during the toStandby process Thread_1 
> resets activeServices to null. At that point, Thread_2 uses 
> activeServices while executing the handleTransitionToStandByInNewThread 
> method, which results in a NullPointerException and causes the ResourceManager 
> server to exit.
> !yuque_diagram.jpg|width=629,height=100!
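The interleaving described in the first case can be reproduced in isolation. The following is a minimal sketch, not the ResourceManager code: the class names, the `stop()` method, and the latches are assumptions introduced only to force the ordering (Thread_1 clears the field, then Thread_2 dereferences it) that happens by chance in production.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;

public class ActiveServicesRaceSketch {
    // Hypothetical stand-in for RMActiveServices.
    static class ActiveServices {
        void stop() { }
    }

    // Shared field that both transition threads touch without a common lock.
    private volatile ActiveServices activeServices = new ActiveServices();

    // Returns true if Thread_2 hit a NullPointerException.
    static boolean demo() throws InterruptedException {
        ActiveServicesRaceSketch rm = new ActiveServicesRaceSketch();
        CountDownLatch cleared = new CountDownLatch(1); // Thread_1 has nulled the field
        CountDownLatch used = new CountDownLatch(1);    // Thread_2 has tried to use it
        AtomicBoolean sawNpe = new AtomicBoolean(false);

        Thread t1 = new Thread(() -> {
            rm.activeServices = null;                   // reinitialization starts
            cleared.countDown();
            try {
                used.await();                           // keep the window open (demo only)
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            rm.activeServices = new ActiveServices();   // reinitialization finishes
        }, "Thread_1");

        Thread t2 = new Thread(() -> {
            try {
                cleared.await();
                rm.activeServices.stop();               // dereferences null
            } catch (NullPointerException e) {
                sawNpe.set(true);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                used.countDown();
            }
        }, "Thread_2");

        t1.start();
        t2.start();
        t1.join();
        t2.join();
        return sawNpe.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("Thread_2 saw NPE: " + demo());
    }
}
```

In the real ResourceManager there are no latches; the same ordering simply occurs whenever the second event arrives inside the window in which the field is null.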
> h2. The second case:
> *Exception description:*
> {code:java}
> 06:17:35,913 WARN ha.ActiveStandbyElector 
> (ActiveStandbyElector.java:becomeActive(900)) - Exception handling the 
> winning of election
> org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:146)
> at 
> org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:896)
> at 
> org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:543)
> at 
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:558)
> at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
> Caused by: org.apache.hadoop.ha.ServiceFailedException: Error on refreshAll 
> during transition to Active
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:315)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:144)
> ... 4 more
> Caused by: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation 
> failed
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:765)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:307)
> ... 5 more
> Caused by: java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:467)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:423)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:754)
> ... 6 more
> 06:17:35,917 ERROR resourcemanager.ResourceManager 
> (ResourceManager.java:handle(898)) - Received RMFatalEvent of type 
> TRANSITION_TO_ACTIVE_FAILED, caused by failure to refresh configuration 
> settings: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation 
> failed
> {code}
>  * ActiveStandbyElector and ZKRMStateStore triggered a toActive event and a 
> toStandby event at 06:17:35. The two asynchronous events are referred to as 
> Thread_1 and Thread_2.
>  * During the execution of Thread_1, CapacityScheduler.reinitialize is 
> called to refresh the scheduler configuration. At this time, the 
> csConfProvider property of the CapacityScheduler is not yet initialized and 
> its value is null. As a result, when reinitialize uses csConfProvider it 
> throws a NullPointerException, and Thread_1's transition to active fails.
> !yuque_diagram (1).jpg|width=568,height=155!
> h1. Solution
> Because the lock in ResourceManager's transitionToActive and 
> transitionToStandby methods covers only a limited scope, events 
> triggered asynchronously outside that scope can interfere with each other, 
> leading to unpredictable failures. The proposed solution is to encapsulate 
> each asynchronous task as a TransitionToActiveStandbyRunner and enqueue 
> it in a queue whose entries are executed in order by a SingleThreadExecutor. 
> This approach removes the race between asynchronous transitions and makes 
> switching between the active and standby states clearer and more controllable.
> !rm_ha_solution.png|width=362,height=353!
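The queueing idea in the Solution paragraph can be sketched with the standard `java.util.concurrent` API. This is only an illustration under stated assumptions: `HaState`, `submit`, and the event source names are hypothetical, and the real proposal wraps AdminService and ResourceManager logic rather than a simple enum field.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class TransitionQueueSketch {
    // Hypothetical stand-in for the RM's HA state; not the real ResourceManager API.
    enum HaState { STANDBY, ACTIVE }

    // Both fields are only read or written by the single executor thread.
    private HaState state = HaState.STANDBY;
    private final List<String> history = new ArrayList<>();

    // One worker thread: queued transitions run strictly one after another.
    private final ExecutorService executor = Executors.newSingleThreadExecutor();

    // Any caller (elector, state store, admin RPC) enqueues instead of
    // running the transition on its own thread.
    Future<HaState> submit(String source, HaState target) {
        Callable<HaState> runner = () -> {
            history.add(source + ":" + state + "->" + target);
            state = target;
            return state;
        };
        return executor.submit(runner);
    }

    static List<String> demo() throws Exception {
        TransitionQueueSketch rm = new TransitionQueueSketch();
        Future<HaState> toActive = rm.submit("elector", HaState.ACTIVE);
        Future<HaState> toStandby = rm.submit("fatalEvent", HaState.STANDBY);
        toActive.get();
        toStandby.get();   // Future.get() makes the worker's writes visible here
        rm.executor.shutdown();
        return rm.history;
    }

    public static void main(String[] args) throws Exception {
        demo().forEach(System.out::println);
    }
}
```

Because every transition runs on the same worker thread, a second request can never observe a half-finished first one. The cost is that callers wait on a `Future`, which is where a performance evaluation of the queueing overhead would focus.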
> h2. TransitionToActiveStandbyRunner and Subclasses
> h3. TransitionToActiveStandbyRunner
> TransitionToActiveStandbyRunner is an abstract template class. The logic for 
> each scenario is implemented in the doTransaction method, which call() invokes.
> {code:java}
> import java.util.concurrent.Callable;
>
> public abstract class TransitionToActiveStandbyRunner
>     implements Callable<TransitionToActiveStandbyResult> {
>
>     @Override
>     public TransitionToActiveStandbyResult call() throws Exception {
>         // ... before log ...
>         TransitionToActiveStandbyResult result = doTransaction();
>         // ... after log ...
>         return result;
>     }
>
>     // Scenario-specific transition logic, supplied by each subclass.
>     public abstract TransitionToActiveStandbyResult doTransaction();
> }
> {code}
> h3. Subclasses
> *AdminServiceToActiveRunner*
> AdminServiceToActiveRunner encapsulates the logic of the transitionToActive 
> method in AdminService, handling the requests from clients and 
> ActiveStandbyElector to transition to the active state.
> *AdminServiceToStandbyRunner*
> AdminServiceToStandbyRunner encapsulates the logic of the transitionToStandby 
> method in AdminService, handling the requests from clients and 
> ActiveStandbyElector to transition to the standby state.
> *RmStartAndStopToStandby*
> RmStartAndStopToStandby is used for transitioning the ResourceManager service 
> to standby when it is stopping or starting.
>  
> *RMStartToActiveRunner*
> RMStartToActiveRunner is used for transitioning the ResourceManager service 
> to active when it is starting.
>  
> *RMFatalToStandbyRunner*
> RMFatalToStandbyRunner handles an RMFatalEvent by transitioning to standby 
> when YARN HA mode is enabled.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
