[ https://issues.apache.org/jira/browse/YUNIKORN-584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Kinga Marton reassigned YUNIKORN-584:
-------------------------------------

    Assignee: Weiwei Yang

> App recovery is skipped when applicationID is not set in pods' label
> --------------------------------------------------------------------
>
>                 Key: YUNIKORN-584
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-584
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: shim - kubernetes
>            Reporter: Chaoran Yu
>            Assignee: Weiwei Yang
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 0.10
>
> There are cases when YK may think the cluster doesn't have enough
> resources even though that's not actually the case. This has happened to me
> twice: after YK had been running in a cluster for a few days, the
> [nodes endpoint|https://yunikorn.apache.org/docs/next/api/scheduler#nodes]
> showed that the cluster had only one node (i.e. the node that YK itself was
> running on), even though the K8s cluster has 10 nodes in total. If I then try
> to schedule a workload that requires more resources than are available on that
> node, YK leaves the pods pending with an event like the one below:
> {quote}Normal PodUnschedulable 41s yunikorn Task <namespace>/<pod> is
> pending for the requested resources become available{quote}
> because it's not aware that other nodes in the cluster have available
> resources.
> All of this can be fixed by just restarting YK (scaling the replica down to 0
> and then back up to 1), so a caching issue appears to be the cause, although
> the exact conditions that trigger this bug are not yet clear to me.
> My environment is on AWS EKS with K8s 1.17, if that matters.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org
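For reference, the symptom can be checked and the workaround from the report applied with commands along these lines. This is a sketch only: the REST port/path follow the scheduler API docs linked in the issue for a YuniKorn 0.x install, and the `yunikorn-scheduler` deployment name and `yunikorn` namespace are assumptions from a typical deployment that may differ in your cluster.

```shell
# Port-forward the scheduler's REST service locally (port 9080 assumed),
# then count the nodes the scheduler currently knows about. If this number
# is smaller than the node count reported by `kubectl get nodes`, the
# scheduler's view of the cluster is stale.
kubectl -n yunikorn port-forward deployment/yunikorn-scheduler 9080:9080 &
curl -s http://localhost:9080/ws/v1/nodes | jq 'length'
kubectl get nodes --no-headers | wc -l

# Workaround described above: restart YK by scaling its replicas to 0
# and back to 1, forcing it to rebuild its cache on startup.
kubectl -n yunikorn scale deployment yunikorn-scheduler --replicas=0
kubectl -n yunikorn scale deployment yunikorn-scheduler --replicas=1
```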