[ https://issues.apache.org/jira/browse/YUNIKORN-1085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17495314#comment-17495314 ]
Chaoran Yu commented on YUNIKORN-1085:
--------------------------------------

Our deployment has the environment variable DISABLE_RESERVATION set to true, so reservation should have been disabled. But yes, I'll collect the node information the next time I run into the problem.

> DaemonSet pods may fail to be scheduled on new nodes added during autoscaling
> -----------------------------------------------------------------------------
>
>                 Key: YUNIKORN-1085
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-1085
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: shim - kubernetes
>    Affects Versions: 0.12.2
>         Environment: Amazon EKS, K8s 1.20, Cluster Autoscaler
>            Reporter: Chaoran Yu
>            Priority: Blocker
>         Attachments: sampleNode.txt, samplePod.yaml
>
>
> After YUNIKORN-704 was done, YuniKorn should have the same mechanism as the
> default scheduler when it comes to scheduling DaemonSet pods. That's the case
> most of the time in our deployments. But recently we have found that
> DaemonSet scheduling became problematic again: when the K8s Cluster
> Autoscaler adds new nodes in response to pending pods in the cluster, EKS
> automatically creates a pod of the CNI DaemonSet (Amazon's container
> networking module) on each newly created node. But YuniKorn could not
> schedule these pods successfully. There are no informative error messages.
> The default queue that these pods belong to has available resources too.
> Because the pods couldn't be scheduled, EKS refuses to mark the new nodes as
> ready, and they then get stuck in the NotReady state. This issue is not
> always reproducible, but it has happened a few times. The root cause needs
> to be researched further.
> Note that when this bug happened, the mitigation that worked was to disable
> the YuniKorn admission controller, delete all the pending DaemonSet pods, and
> wait for the default scheduler to schedule them all, after which the new
> nodes become Ready.
> So it seems that there are edge cases not covered by the previous work,
> where YuniKorn handles DaemonSet pods differently from the default
> scheduler.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)