[ https://issues.apache.org/jira/browse/YUNIKORN-2099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Craig Condit updated YUNIKORN-2099:
-----------------------------------
    Labels: release-notes  (was: )

> [Umbrella] State initialisation simplification (phase 2)
> --------------------------------------------------------
>
>                 Key: YUNIKORN-2099
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2099
>             Project: Apache YuniKorn
>          Issue Type: Improvement
>          Components: core - scheduler, shim - kubernetes
>            Reporter: Craig Condit
>            Assignee: Craig Condit
>            Priority: Major
>              Labels: release-notes
>             Fix For: 1.5.0
>
>
> Startup rebuilds all state of the cluster. This is called recovery. The name
> is a bit misleading as it is not really recovery as it is loading the current
> state. State initialisation is a better term to use.
> The current recovery code links the loading of applications and tasks (pods)
> to node loading. This makes the recovery code complex and thus fragile. It
> could, in a worst case scenario, lead to a pod not being recovered correctly.
> Recovery should be a step by step process that has boundaries and steps:
> * load node
> ** register nodes with the core
> * load pods
> ** create applications in core
> ** register running pods as allocations with the core
> ** register pending pods as asks with the core
> * process changes for nodes and pods
> * start scheduling
> No nodes, applications or asks on existing apps should be declined. Even if
> the queue does not exist a running application must be added and handled.
> The current rejection of an application if it cannot be placed in the queue
> is an incorrect behaviour.
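
The step ordering proposed in the issue description can be illustrated with a minimal sketch. Every name here (`Core`, `Pod`, `initialise_state`, `unknown_queue_apps`) is hypothetical and is not the YuniKorn API; YuniKorn itself is written in Go, and Python is used only to keep the illustration short. The issue does not say how an application with a missing queue is eventually handled, so the sketch simply accepts it and records the fact.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, Optional


@dataclass
class Pod:
    """Hypothetical pod record; `node` is set only for running pods."""
    name: str
    app_id: str
    queue: str
    node: Optional[str] = None


@dataclass
class Core:
    """Hypothetical stand-in for the scheduler core (not the real API)."""
    known_queues: set
    nodes: list = field(default_factory=list)
    applications: dict = field(default_factory=dict)  # app_id -> requested queue
    unknown_queue_apps: set = field(default_factory=set)
    allocations: list = field(default_factory=list)   # (pod name, node name)
    asks: list = field(default_factory=list)          # pending pod names
    scheduling: bool = False

    def add_application(self, app_id: str, queue: str) -> None:
        # Assumption: an application is never rejected. If its queue is
        # missing, it is still added and only flagged for later handling.
        if app_id in self.applications:
            return
        self.applications[app_id] = queue
        if queue not in self.known_queues:
            self.unknown_queue_apps.add(app_id)


def initialise_state(
    core: Core,
    nodes: Iterable[str],
    pods: Iterable[Pod],
    changes: Iterable[Callable[[Core], None]] = (),
) -> None:
    # Step 1: load nodes and register them with the core.
    for node in nodes:
        core.nodes.append(node)

    # Step 2: load pods, independently of node loading.
    for pod in pods:
        core.add_application(pod.app_id, pod.queue)
        if pod.node is not None:
            core.allocations.append((pod.name, pod.node))  # running pod
        else:
            core.asks.append(pod.name)                     # pending pod

    # Step 3: process node and pod changes that arrived while loading.
    for change in changes:
        change(core)

    # Step 4: only now start scheduling.
    core.scheduling = True
```

Calling `initialise_state(core, ["node-1"], pods)` with one running and one pending pod would leave the running pod as an allocation and the pending pod as an ask, with both applications present even if one names a queue the core does not know. Each phase completes before the next begins, which is the boundary the issue asks for.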