[ https://issues.apache.org/jira/browse/YUNIKORN-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304143#comment-17304143 ]
Wilfred Spiegelenburg commented on YUNIKORN-584:
------------------------------------------------

Do you have a little more information for us to work with? We run clusters for multiple days and have not seen this before. The YK logs and a listing of the nodes from k8s would be a good start.

The reason I ask is that YK never removes or adds a node by itself. The shim uses listeners and gets informed by k8s when nodes are added or removed. The node additions and removals are then passed on to the core. We do not remove or add nodes unless k8s tells us to. All these actions are clearly logged at multiple points, in both the shim and the core. Even if a listener fails, the node just stays as it is.

When it happens again, grab the logs for us before scaling the deployment.

> The node information could become out of sync with the underlying cluster resources
> ------------------------------------------------------------------------------------
>
>                 Key: YUNIKORN-584
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-584
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: shim - kubernetes
>            Reporter: Chaoran Yu
>            Priority: Critical
>             Fix For: 0.10
>
>
> There are cases when YK may think that the cluster doesn't have enough resources even though that's not actually the case. This has happened to me twice: after running YK in a cluster for a few days, one day the [nodes endpoint|https://yunikorn.apache.org/docs/next/api/scheduler#nodes] shows that the cluster has only one node (i.e. the node that YK itself is running on), even though the K8s cluster has 10 nodes in total. And if I try to schedule a workload that requires more resources than are available on that node, YK will leave the pods pending with an event like the one below:
> {quote}Normal PodUnschedulable 41s yunikorn Task <namespace>/<pod> is pending for the requested resources become available{quote}
> because it's not aware that other nodes in the cluster have available resources.
> All of this can be fixed by just restarting YK (scaling the replicas down to 0 and then back up to 1). So it seems that a caching issue is the cause, although the exact conditions that trigger this bug are not yet clear to me.
> My environment is on AWS EKS with K8s 1.17, if that matters.
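For reference, the node add/remove flow described in the comment above relies on the standard Kubernetes informer mechanism from client-go. Below is a minimal, self-contained sketch of that listener pattern; it is not the actual YuniKorn shim code, and the resync period and the print statements are illustrative stand-ins for the shim forwarding events to the core.

{code:go}
package main

import (
	"fmt"
	"time"

	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	// In-cluster config; the real shim wires this up through its own client layer.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Shared informer factory; the 30s resync period is an illustrative value.
	factory := informers.NewSharedInformerFactory(clientset, 30*time.Second)
	nodeInformer := factory.Core().V1().Nodes().Informer()

	// The scheduler never adds or removes nodes on its own; it only reacts to
	// the add/update/delete events the API server delivers to these handlers.
	nodeInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			node := obj.(*v1.Node)
			fmt.Printf("node added: %s\n", node.Name) // would be forwarded to the core
		},
		UpdateFunc: func(oldObj, newObj interface{}) {
			node := newObj.(*v1.Node)
			fmt.Printf("node updated: %s\n", node.Name)
		},
		DeleteFunc: func(obj interface{}) {
			node, ok := obj.(*v1.Node)
			if !ok {
				// Deletes can arrive wrapped in a tombstone if the watch missed the event.
				tombstone, tok := obj.(cache.DeletedFinalStateUnknown)
				if !tok {
					return
				}
				node, ok = tombstone.Obj.(*v1.Node)
				if !ok {
					return
				}
			}
			fmt.Printf("node removed: %s\n", node.Name) // would be forwarded to the core
		},
	})

	stopCh := make(chan struct{})
	defer close(stopCh)
	factory.Start(stopCh)
	factory.WaitForCacheSync(stopCh)
	select {} // block forever; a real process would handle shutdown signals
}
{code}

If the watch connection breaks, client-go's reflector relists and re-watches on its own, so a node silently disappearing from the scheduler's view would point at either the event handlers or the core's node cache rather than at k8s itself; the shim and core logs mentioned above should show which side lost track of the nodes.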