[ 
https://issues.apache.org/jira/browse/YUNIKORN-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit resolved YUNIKORN-1642.
------------------------------------
     Fix Version/s: 1.3.0
    Target Version: 1.3.0
        Resolution: Fixed

Merged to master.

Update: the direction of the fix changed. The timeout has been removed; 
instead, we log every 10 seconds that we are still waiting for the sync to 
complete.
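
For illustration, here is a minimal sketch of that wait-and-log behavior using
only the Go standard library. The names waitForSync, hasSynced and logf are
hypothetical and are not taken from the YuniKorn code; the merged change in
yunikorn-k8shim may be structured differently. The intervals are shortened so
the demo finishes quickly.

```go
package main

import (
	"log"
	"time"
)

// waitForSync blocks until hasSynced returns true. There is deliberately no
// timeout: it polls every pollInterval and reports progress through logf at
// most once per logInterval while it is still waiting. It returns the total
// time spent waiting.
// NOTE: hypothetical helper for illustration, not the YuniKorn implementation.
func waitForSync(hasSynced func() bool, pollInterval, logInterval time.Duration,
	logf func(format string, args ...interface{})) time.Duration {
	start := time.Now()
	nextLog := start.Add(logInterval)
	for !hasSynced() {
		time.Sleep(pollInterval)
		if now := time.Now(); !now.Before(nextLog) {
			logf("still waiting for informer sync, elapsed=%s",
				now.Sub(start).Round(time.Millisecond))
			nextLog = now.Add(logInterval)
		}
	}
	return time.Since(start)
}

func main() {
	// Stand-in for a slow API server: reports "synced" after about 35ms.
	ready := time.Now().Add(35 * time.Millisecond)
	hasSynced := func() bool { return time.Now().After(ready) }

	// The note above mentions a 10-second log interval; 10ms keeps the demo short.
	waited := waitForSync(hasSynced, time.Millisecond, 10*time.Millisecond, log.Printf)
	log.Printf("sync complete after %s", waited.Round(time.Millisecond))
}
```

The trade-off in this sketch is that a slow API server delays startup instead
of producing an empty recovery, and the periodic log line shows that the
process is waiting rather than hung.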

> Scheduler recovery failed due to listing operation timeout
> ----------------------------------------------------------
>
>                 Key: YUNIKORN-1642
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-1642
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: shim - kubernetes
>            Reporter: Weiwei Yang
>            Assignee: Weiwei Yang
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.3.0
>
>
> The listing operation in the recovery phase is here: 
> https://github.com/apache/yunikorn-k8shim/blob/c25ac60ffbc175c4966f917da21d184f34dea7b4/pkg/client/apifactory.go#L225
> It can fail on large clusters because the response time from the API 
> server is not guaranteed. When it does, we see logs like this:
> {noformat}
> 2023-03-16T07:00:46.181Z      WARN    client/apifactory.go:218        Failed to sync informers        {"error": "timeout waiting for condition"}
> 2023-03-16T07:00:46.182Z      INFO    general/general.go:344  Pod list retrieved from api server      {"nr of pods": 0}
> 2023-03-16T07:00:46.182Z      INFO    general/general.go:365  Application recovery statistics {"nr of recoverable apps": 0, "nr of total pods": 0, "nr of pods without application metadata": 0, "nr of pods to be recovered": 0}
> I0316 07:00:51.319100       1 trace.go:205] Trace[140954425]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.20.11/tools/cache/reflector.go:167 (16-Mar-2023 07:00:16.168) (total time: 35150ms):
> {noformat}
> Because this is only a WARN, startup continues even though the informers 
> returned nothing. The scheduler concludes that nothing needs to be 
> recovered and goes ahead with scheduling. This causes subsequent scheduler 
> failures, and eventually nothing can be scheduled anymore.
> This should be a FATAL error, so that the scheduler can be restarted to 
> retry recovery.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

