[jira] [Commented] (YUNIKORN-1085) DaemonSet pods may fail to be scheduled on new nodes added during autoscaling

Chaoran Yu (Jira) Sun, 20 Feb 2022 20:23:07 -0800


    [ 
https://issues.apache.org/jira/browse/YUNIKORN-1085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17495314#comment-17495314
 ]


Chaoran Yu commented on YUNIKORN-1085:
--------------------------------------

Our deployment has the environment variable DISABLE_RESERVATION set to true, so 
reservation should have been disabled. But yes, I'll collect the node 
information the next time I run into the problem.

> DaemonSet pods may fail to be scheduled on new nodes added during autoscaling
> -----------------------------------------------------------------------------
>
>                 Key: YUNIKORN-1085
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-1085
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: shim - kubernetes
>    Affects Versions: 0.12.2
>         Environment: Amazon EKS, K8s 1.20, Cluster Autoscaler
>            Reporter: Chaoran Yu
>            Priority: Blocker
>         Attachments: sampleNode.txt, samplePod.yaml
>
>
> After YUNIKORN-704 was done, YuniKorn should have the same mechanism as the 
> default scheduler when it comes to scheduling DaemonSet pods. That's the case 
> most of the time in our deployments. But recently we have found that DaemonSet 
> scheduling became problematic again: when the K8s Cluster Autoscaler adds new 
> nodes in response to pending pods in the cluster, EKS will automatically 
> create a CNI DaemonSet (Amazon's container networking module), one pod on 
> each newly created node. But YuniKorn could not schedule these pods 
> successfully. There were no informative error messages. The default queue that 
> these pods belong to had available resources too. Because they couldn't be 
> scheduled, EKS refused to mark the new nodes as ready, and they then got stuck 
> in the NotReady state. This issue is not always reproducible, but it has 
> happened a few times. The root cause needs to be further researched.
> Note that when this bug happened, the mitigation that worked was to disable 
> the YuniKorn admission controller, delete all the pending DaemonSet pods, 
> and wait for the default scheduler to schedule them all; the new nodes 
> then became Ready. So it seems that there are edge cases that haven't been 
> covered by the previous work, where YuniKorn handles DaemonSet pods 
> differently from the default scheduler.
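
For the "delete all the pending DaemonSet pods" step of the mitigation above, a rough sketch of how the affected pods might be identified is shown here. It is not part of the original report. It assumes the pod list has been exported as JSON (for example with `kubectl get pods --all-namespaces -o json`) and relies only on the standard Pod fields `metadata.ownerReferences[].kind` and `status.phase`. The function name and the sample pod names are hypothetical.

```python
def pending_daemonset_pods(pod_list):
    """Return (namespace, name) pairs for Pending pods owned by a DaemonSet.

    `pod_list` is a dict shaped like the JSON output of
    `kubectl get pods --all-namespaces -o json` (a v1 PodList).
    """
    result = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        owners = meta.get("ownerReferences", [])
        # A DaemonSet pod carries an ownerReference whose kind is "DaemonSet".
        is_daemonset = any(o.get("kind") == "DaemonSet" for o in owners)
        phase = pod.get("status", {}).get("phase")
        if is_daemonset and phase == "Pending":
            result.append((meta.get("namespace"), meta.get("name")))
    return result


# Hypothetical sample data, for illustration only.
sample = {
    "items": [
        {
            "metadata": {
                "namespace": "kube-system",
                "name": "cni-pod-a",
                "ownerReferences": [{"kind": "DaemonSet", "name": "cni"}],
            },
            "status": {"phase": "Pending"},
        },
        {
            "metadata": {
                "namespace": "kube-system",
                "name": "cni-pod-b",
                "ownerReferences": [{"kind": "DaemonSet", "name": "cni"}],
            },
            "status": {"phase": "Running"},
        },
        {
            "metadata": {
                "namespace": "default",
                "name": "app-pod",
                "ownerReferences": [{"kind": "ReplicaSet", "name": "app"}],
            },
            "status": {"phase": "Pending"},
        },
    ]
}

for namespace, name in pending_daemonset_pods(sample):
    print(f"{namespace}/{name}")
```

The sketch only lists candidates. Deleting them, and disabling the admission controller first, would still be done by hand as described in the report.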



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org