[ 
https://issues.apache.org/jira/browse/YUNIKORN-936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17448620#comment-17448620
 ] 

Peter Bacsko commented on YUNIKORN-936:
---------------------------------------

[~wilfreds] [~yuchaoran2011]
I untangled what happens during recovery. Not saying it was easy... :) But it 
was worth it!

{noformat}
KubernetesShim.recoverSchedulerState() [shim]
  AppManagementService.waitForRecovery() [shim]
    AppManagementService.recoverApps() [shim]  <for each app>
      Application.TriggerAppRecovery() [shim]
          Application.handleRecoverApplicationEvent()  [shim]
            RMProxy.UpdateApplication() [core]
              Scheduler.handleRMEvent() [core]
                ClusterContext.handleRMUpdateApplicationEvent() [core]
                  RMProxy.handleRMEvents() [core]
                    RMProxy.processApplicationUpdateEvent() [core]
                      AsyncRMCallback.UpdateApplication() [shim]
                        ...
{noformat}

A lot of event handling happens on separate goroutines, so in 
{{AppManagementService.waitForRecovery()}} we return earlier from 
{{recoverApps()}}. Then we call {{AppManagementService.waitForAppRecovery()}}, 
which waits until all applications transition into {{Accepted}} state. 

Only at this point we go ahead and perform node recovery. 

{noformat}
KubernetesShim.recoverSchedulerState() [shim]
  Context.WaitForRecovery() [shim]
    Context.recover() [shim]
      utils.WaitForCondition() [shim]
        <anon function> [shim]
          <dispatcher call> [shim]
            SchedulerNode.handle() [shim]
              SchedulerNode.handleNodeRecovery() [shim]
                RMProxy.UpdateNode()  [core]
                  ... <skip some call sites>
                    PartitionContext.AddNode() [core]
                      PartitionContext.AddAllocation() [core]
{noformat}      

In the unit test, I could reproduce the issue when I removed the line 
{{ms.mockRM.waitForAcceptedApplication(t, appID1, 1000)}} and the problem 
occurred in {{AddAllocation()}} - at that point, the application wasn't stored 
yet. But as far as I can see it, this cannot happen during the normal 
production code path, because by the time we get there, the application has 
been added by {{ClusterContext.handleRMUpdateApplicationEvent()}}.




> app and node recovery event ordering
> ------------------------------------
>
>                 Key: YUNIKORN-936
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-936
>             Project: Apache YuniKorn
>          Issue Type: Improvement
>          Components: core - common
>            Reporter: Wilfred Spiegelenburg
>            Assignee: Peter Bacsko
>            Priority: Major
>
> While working on YUNIKORN-905 a number of unit tests failed due to event 
> ordering. Looking at the change we might have had an issue in the RMProxy for 
> a long time.
> An update request could contain apps, asks and nodes. Processing was ordered 
> like that too. During recovery the order was/is important. There was never an 
> order requirement on the events send by a shim or a use of complex updates 
> events to support this ordering by the shim.
> An event to recover a node could be a separate UpdateRequest from the 
> applications that should be recovered. That means we relied on the go routine 
> and event ordering to hopefully do things correctly: i.e. events send by the 
> shim to create new apps would be processed before node recovery started. Even 
> in the previous implementation there was no guarantee that all the 
> application were added before a node was recovered. The unit tests in the 
> core used the order processing dependency to make sure it worked.
> That is not the real world scenario. and thus a dangerous assumption.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to