[jira] [Commented] (YARN-5333) Some recovered apps are put into default queue when RM HA

Jun Gong (JIRA) Tue, 02 Aug 2016 03:34:38 -0700

    [ 
https://issues.apache.org/jira/browse/YARN-5333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15403741#comment-15403741
 ]


Jun Gong commented on YARN-5333:
--------------------------------

Thanks [~rohithsharma], [~jianhe] for the review and comments!

bq. 1. Should private boolean isTransitingToActive = false; is volatile?
Yes, it needs be volatile. I'll update it.

{quote}
2. Since none of the refreshXXX methods are synchronized, patch introduces a 
concurrency issue. If there is an explicit admin call for refreshing at the 
time of transitionToActive, then checkRMStatus will be executed for other admin 
calls. Until RM transition-to-active completely, explicit admin commands should 
not allowed to refresh. I think, we should incorporate similar to 
refreshAdminAcl method.
{quote}
How about adding {{synchronized}} to each refresh functions? It avoids adding 
more logic. When admin command comes, we could just call corresponding refresh 
functions. I think it does not matter to call refresh function many times.

bq. 3. I think flag checkRMHAState can be passed to method checkRMStatus.
I was thinking it. If adding checkRMHAState to checkRMStatus, we need add this 
parameter(checkRMHAState) to all refresh functions too(which is similar to 
refreshAdminAcl), there are a lot of places that call refresh functions. It 
might be better to just add a check before checkRMStatus?

bq. I think if you can simulate test for generally instead of specific to fair 
scheduler, this test can be moved to class TestRMHA. There is already test 
TestRMHA#testTransitionedToActiveRefreshFail, probable the same test can be 
changed?
Thanks. I'll update the test case.

{quote}
Instead of reusing the existing refreshAll method, I checked each refresh 
method, it should be cleaner to just create a new method which includes all 
necessary reconfig steps. This also avoids unnecessary audit logs, acl checks.
{quote}
Yes, it will be more clear to add a new method to include all reconfig steps. 
My doubt is that there will be two places that do similar reconfig things(the 
one is in refresh functions, the other is in the new added method). Then we 
need to modify both places if there is some change for one of them.

> Some recovered apps are put into default queue when RM HA
> ---------------------------------------------------------
>
>                 Key: YARN-5333
>                 URL: https://issues.apache.org/jira/browse/YARN-5333
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Jun Gong
>            Assignee: Jun Gong
>         Attachments: YARN-5333.01.patch, YARN-5333.02.patch, 
> YARN-5333.03.patch, YARN-5333.04.patch, YARN-5333.05.patch
>
>
> Enable RM HA and use FairScheduler, 
> {{yarn.scheduler.fair.allow-undeclared-pools}} is set to false, 
> {{yarn.scheduler.fair.user-as-default-queue}} is set to false.
> Reproduce steps:
> 1. Start two RMs.
> 2. After RMs are running, change both RM's file 
> {{etc/hadoop/fair-scheduler.xml}}, then add some queues.
> 3. Submit some apps to the new added queues.
> 4. Stop the active RM, then the standby RM will transit to active and recover 
> apps.
> However the new active RM will put recovered apps into default queue because 
> it might have not loaded the new {{fair-scheduler.xml}}. We need call 
> {{initScheduler}} before start active services or bring {{refreshAll()}} in 
> front of {{rm.transitionToActive()}}. *It seems it is also important for 
> other scheduler*.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (YARN-5333) Some recovered apps are put into default queue when RM HA

Reply via email to