[ https://issues.apache.org/jira/browse/YUNIKORN-1217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17545525#comment-17545525 ]
Peter Bacsko edited comment on YUNIKORN-1217 at 6/2/22 3:48 PM:
----------------------------------------------------------------

During today's sync-up, we agreed with [~wilfreds] that the simplest approach is to sort the retrieved pods based on {{CreationTime}}. Since drivers are created earlier than executors, this will always work. We just have to sort the pod slice in {{AppManagementService.recoverApps()}}:

{noformat}
pods, err := m.ListPods()
if err != nil {
	log.Logger().Error("failed to list apps", zap.Error(err))
	return recoveringApps, err
}

// put new sort code here

for _, pod := range pods {
	app := svc.podEventHandler.HandleEvent(general.AddPod, general.Recovery, pod)
	recoveringApps[app.GetApplicationID()] = app
}
{noformat}

> Ensure that Spark driver pod is processed before executor pods during recovery
> ------------------------------------------------------------------------------
>
>                 Key: YUNIKORN-1217
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-1217
>             Project: Apache YuniKorn
>          Issue Type: Sub-task
>          Components: shim - kubernetes
>            Reporter: Peter Bacsko
>            Assignee: Peter Bacsko
>            Priority: Major
>
> When running a Spark workload with gang scheduling, the driver and executor
> pods have different annotations.
> It is critical that we process the driver first, because it has the task
> group definitions. Based on
> [https://yunikorn.apache.org/docs/next/user_guide/gang_scheduling/], the
> executor only needs {{yunikorn.apache.org/taskGroupName}}.
> So when we add the pods in the recovery code path, we have to start with the
> driver.