[ 
https://issues.apache.org/jira/browse/YUNIKORN-1217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17545525#comment-17545525
 ] 

Peter Bacsko edited comment on YUNIKORN-1217 at 6/2/22 3:48 PM:
----------------------------------------------------------------

During today's sync-up, we agreed with [~wilfreds] that the simplest approach 
is to sort the retrieved pods based on {{CreationTime}}. Since the driver pod 
is created earlier than its executors, the driver will always be processed first.

We just have to sort the pod slice in {{AppManagementService.recoverApps()}}:
{noformat}
pods, err := m.ListPods()
if err != nil {
        log.Logger().Error("failed to list apps", zap.Error(err))
        return recoveringApps, err
}

// put new sort code here

for _, pod := range pods {
        app := svc.podEventHandler.HandleEvent(general.AddPod, general.Recovery, pod)
        recoveringApps[app.GetApplicationID()] = app
}
{noformat}


 



> Ensure that Spark driver pod is processed before executor pods during recovery
> ------------------------------------------------------------------------------
>
>                 Key: YUNIKORN-1217
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-1217
>             Project: Apache YuniKorn
>          Issue Type: Sub-task
>          Components: shim - kubernetes
>            Reporter: Peter Bacsko
>            Assignee: Peter Bacsko
>            Priority: Major
>
> When running a Spark workload with gang scheduling, the driver and executor 
> pods have different annotations.
> It is critical that we process the driver first, because it has the task 
> group definitions. Based on 
> [https://yunikorn.apache.org/docs/next/user_guide/gang_scheduling/], the 
> executor only needs {{yunikorn.apache.org/taskGroupName}}.
> So when we add the pods in the recovery code path, we have to start with the 
> driver.


