Craig Condit created YUNIKORN-1670:
--------------------------------------

             Summary: Application recovery can fail if app is rejected
                 Key: YUNIKORN-1670
                 URL: https://issues.apache.org/jira/browse/YUNIKORN-1670
             Project: Apache YuniKorn
          Issue Type: Bug
          Components: shim - kubernetes
            Reporter: Craig Condit
            Assignee: Craig Condit


During application recovery, the current code waits up to 30 seconds for all 
applications to transition to "Accepted". However, if an application is 
rejected, or if the cluster is large enough that recovery takes longer than 
30 seconds, the wait times out and recovery fails.

Similar to how informer sync was recently updated, we should modify the logic 
to keep trying, but log periodically. Additionally, we should not look 
specifically for the Accepted state, but for any state other than New or 
Recovering. This ensures that we have processed all the applications, 
including any that were rejected.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
