Craig Condit created YUNIKORN-1670:
--------------------------------------

Summary: Application recovery can fail if app is rejected
Key: YUNIKORN-1670
URL: https://issues.apache.org/jira/browse/YUNIKORN-1670
Project: Apache YuniKorn
Issue Type: Bug
Components: shim - kubernetes
Reporter: Craig Condit
Assignee: Craig Condit

During application recovery, the current code waits up to 30 seconds for all applications to transition to "Accepted". However, if an application is rejected, or if the cluster is large enough that not every application transitions within that window, recovery will not succeed.

Similar to how informer sync was recently updated, we should modify the logic to keep trying, but log periodically. Additionally, we should not look specifically for the Accepted state, but for state != New and state != Recovering. This ensures that we have processed all the applications.
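
The proposed behaviour could be sketched roughly as follows. This is only an illustration of the idea, not the shim's actual code: the `App` and `AppState` types, the state constants, `pendingApps`, `waitForRecovery`, and the `list` callback are all hypothetical names invented for this example. The loop has no fixed deadline, treats any state other than New or Recovering as processed, and logs at a configurable interval while it waits.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"
)

// AppState is a hypothetical stand-in for the shim's application states.
type AppState string

const (
	StateNew        AppState = "New"
	StateRecovering AppState = "Recovering"
	StateAccepted   AppState = "Accepted"
	StateRejected   AppState = "Rejected"
)

// App is a hypothetical, minimal application record.
type App struct {
	ID    string
	State AppState
}

// pendingApps returns the IDs of applications that are still New or
// Recovering. Any other state (Accepted, Rejected, ...) counts as processed.
func pendingApps(apps []App) []string {
	var pending []string
	for _, a := range apps {
		if a.State == StateNew || a.State == StateRecovering {
			pending = append(pending, a.ID)
		}
	}
	return pending
}

// waitForRecovery polls list() until no application is pending. It never
// gives up on its own; it stops only when ctx is cancelled. While waiting,
// it logs the pending applications at most once per logEvery.
func waitForRecovery(ctx context.Context, list func() []App, poll, logEvery time.Duration) error {
	ticker := time.NewTicker(poll)
	defer ticker.Stop()
	lastLog := time.Now()
	for {
		pending := pendingApps(list())
		if len(pending) == 0 {
			return nil
		}
		if time.Since(lastLog) >= logEvery {
			log.Printf("still waiting for %d application(s) to be processed: %v", len(pending), pending)
			lastLog = time.Now()
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}

func main() {
	apps := []App{
		{ID: "app-1", State: StateRecovering},
		{ID: "app-2", State: StateRejected}, // rejected apps must not block recovery
	}
	calls := 0
	list := func() []App {
		calls++
		if calls >= 3 { // simulate app-1 being accepted after a few polls
			apps[0].State = StateAccepted
		}
		return apps
	}
	err := waitForRecovery(context.Background(), list, 10*time.Millisecond, 15*time.Millisecond)
	fmt.Println("recovery finished, err =", err)
}
```

In this sketch the rejected application is ignored by `pendingApps`, so it cannot stall recovery, and a caller that still wants an upper bound can pass a context with a timeout instead of relying on a hard-coded 30 seconds.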