[ https://issues.apache.org/jira/browse/YUNIKORN-704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17400874#comment-17400874 ]

Wilfred Spiegelenburg commented on YUNIKORN-704:
------------------------------------------------

{quote}My concern is: will that break the cordon node scenario? When we cordon 
a node, it sets unschedulable flag in the node spec, not a taint 
(node.kubernetes.io/unreachable). In this case, if we ignore unschedulable flag 
completely, we will continue to schedule pods onto a cordoned node, isn't it?
{quote}
See here for [how the node cordon 
works|https://kubernetes.io/docs/concepts/architecture/nodes/#manual-node-administration].
 The toleration on the daemon set pod indirectly ignores the unschedulable 
flag. Being unschedulable is not a taint on a node. However, the node controller 
converts certain [conditions into 
taints|https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/#taint-based-evictions].
 So in the end the node that is unschedulable will carry the taint as well.
 The drawback of completely ignoring the unschedulable flag on the core side 
is a performance drop. Instead of skipping the node, which can only be used for 
special pods, the predicates are used to reject the many more pods that are 
checked against the node while scheduling.
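For context, the DaemonSet controller automatically adds a toleration for exactly that taint to the pods it creates. A minimal illustrative pod-spec fragment (not taken from the attached YAML) would look like:

```yaml
# Added automatically by the DaemonSet controller: lets the pod ignore the
# node.kubernetes.io/unschedulable taint that the node controller derives
# from the spec.unschedulable (cordon) flag on the node.
tolerations:
- key: node.kubernetes.io/unschedulable
  operator: Exists
  effect: NoSchedule
```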

My preference would be to use the 2nd option:
 * pull the node info from the daemon set pod
 * update the AllocationAsk with the information:
 ** a tag with key _UnschedulableNode_
 ** set the value of the tag to the node name
 * in the scheduler special case the placement based on the tag
 ** directly check the node without iterating over the nodes
 ** follow the standard pre-alloc and allocation checks
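The steps above can be sketched roughly as follows. This is an illustrative Go sketch, not the actual YuniKorn shim code: the pared-down {{Pod}} struct, the {{buildAskTags}} helper, and the owner-kind check are all assumptions for the example; only the _UnschedulableNode_ tag key comes from the proposal itself.

```go
package main

import "fmt"

// Pod is a pared-down stand-in for the Kubernetes pod object the shim sees;
// real code would use v1.Pod from k8s.io/api/core/v1.
type Pod struct {
	Name      string
	NodeName  string // spec.nodeName, pre-set by the DaemonSet controller
	OwnerKind string // kind of the owning controller, e.g. "DaemonSet"
}

// tagUnschedulableNode is the tag key proposed in the comment above.
const tagUnschedulableNode = "UnschedulableNode"

// buildAskTags (hypothetical helper) pulls the target node name from a
// daemon set pod and records it as a tag on the AllocationAsk, so the core
// can check that one node directly instead of iterating over all nodes,
// then follow the standard pre-allocation and allocation checks.
func buildAskTags(p Pod) map[string]string {
	tags := map[string]string{}
	if p.OwnerKind == "DaemonSet" && p.NodeName != "" {
		tags[tagUnschedulableNode] = p.NodeName
	}
	return tags
}

func main() {
	pod := Pod{Name: "fluent-bit-abc12", NodeName: "node-1", OwnerKind: "DaemonSet"}
	fmt.Println(buildAskTags(pod)) // map[UnschedulableNode:node-1]
}
```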

> [Umbrella] Use the same mechanism to schedule daemon set pods as the default 
> scheduler
> --------------------------------------------------------------------------------------
>
>                 Key: YUNIKORN-704
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-704
>             Project: Apache YuniKorn
>          Issue Type: Improvement
>          Components: shim - kubernetes
>            Reporter: Chaoran Yu
>            Assignee: Ting Yao,Huang
>            Priority: Blocker
>             Fix For: 1.0.0
>
>         Attachments: fluent-bit-describe.yaml, fluent-bit.yaml
>
>
> We sometimes see DaemonSet pods fail to be scheduled. Please see attached 
> files for the YAML and _kubectl describe_ output of one such pod. We 
> originally suspected [node 
> reservation|https://github.com/apache/incubator-yunikorn-core/blob/v0.10.0/pkg/scheduler/context.go#L41]
>  was to blame. But even after setting the DISABLE_RESERVATION environment 
> variable to true, we still see such scheduling failures. The issue is 
> especially severe when K8s nodes have disk pressure that causes lots of pods 
> to be evicted. Newly created pods will stay in pending forever. We have to 
> temporarily uninstall YuniKorn and let the default scheduler do the 
> scheduling for these pods. 
> This issue is critical because lots of important pods belong to a DaemonSet, 
> such as Fluent Bit, a common logging solution. This is probably the last 
> remaining roadblock for us to have the confidence to have YuniKorn entirely 
> replace the default scheduler.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
