Weiwei Yang created YUNIKORN-1642:
-------------------------------------

             Summary: Scheduler recovery failed due to listing operation timeout
                 Key: YUNIKORN-1642
                 URL: https://issues.apache.org/jira/browse/YUNIKORN-1642
             Project: Apache YuniKorn
          Issue Type: Bug
          Components: shim - kubernetes
            Reporter: Weiwei Yang


The listing operation in the recovery phase 
(https://github.com/apache/yunikorn-k8shim/blob/c25ac60ffbc175c4966f917da21d184f34dea7b4/pkg/client/apifactory.go#L225)
can sometimes fail on large clusters, because the response time of the API 
server is not guaranteed. When that happens, we see logs like this:

{noformat}
2023-03-16T07:00:46.181Z        WARN    client/apifactory.go:218        Failed to sync informers        {"error": "timeout waiting for condition"}
2023-03-16T07:00:46.182Z        INFO    general/general.go:344  Pod list retrieved from api server      {"nr of pods": 0}
2023-03-16T07:00:46.182Z        INFO    general/general.go:365  Application recovery statistics {"nr of recoverable apps": 0, "nr of total pods": 0, "nr of pods without application metadata": 0, "nr of pods to be recovered": 0}
I0316 07:00:51.319100       1 trace.go:205] Trace[140954425]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.20.11/tools/cache/reflector.go:167 (16-Mar-2023 07:00:16.168) (total time: 35150ms):
{noformat}

Because the failure is only logged as a WARN, the scheduler continues even 
though the informers returned nothing. It then wrongly concludes that nothing 
needs to be recovered and goes ahead with scheduling. This causes subsequent 
scheduler failures, and eventually nothing can be scheduled anymore.

This should be a FATAL error, so that the scheduler can be restarted to retry 
the recovery.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: dev-h...@yunikorn.apache.org
