[
https://issues.apache.org/jira/browse/YUNIKORN-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17508916#comment-17508916
]
Manikandan R commented on YUNIKORN-1117:
After digging deeper into this, I learned that it is better to depend on "Taints
and Tolerations" instead of "Conditions", as the former has superseded the
latter because of its flexibility (Namespaces etc). For all the conditions
documented above, k8s automatically creates a taint against that particular
node. Each taint also has an effect, such as NoSchedule or NoExecute. For
example, for the "Ready == false" condition, the "node.kubernetes.io/not-ready"
taint is created against that node with the NoSchedule effect. NoSchedule means
that the node should not be picked for scheduling.
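As a rough illustration of the taint and effect relationship described above, here is a minimal Go sketch. The `Taint` and `TaintEffect` types are simplified stand-ins for the ones in `k8s.io/api/core/v1`, declared locally so the snippet compiles on its own, and `blocksScheduling` is a hypothetical helper name, not an existing YuniKorn or Kubernetes function.

```go
package main

import "fmt"

// TaintEffect and Taint are simplified stand-ins for the types in
// k8s.io/api/core/v1 (assumption: only Key and Effect matter here).
type TaintEffect string

const (
	TaintEffectNoSchedule       TaintEffect = "NoSchedule"
	TaintEffectPreferNoSchedule TaintEffect = "PreferNoSchedule"
	TaintEffectNoExecute        TaintEffect = "NoExecute"
)

type Taint struct {
	Key    string
	Effect TaintEffect
}

// blocksScheduling (hypothetical helper) reports whether any taint carries
// a hard effect: NoSchedule or NoExecute. PreferNoSchedule is only a soft
// preference, so it does not block scheduling in this sketch.
func blocksScheduling(taints []Taint) bool {
	for _, t := range taints {
		if t.Effect == TaintEffectNoSchedule || t.Effect == TaintEffectNoExecute {
			return true
		}
	}
	return false
}

func main() {
	notReady := []Taint{{Key: "node.kubernetes.io/not-ready", Effect: TaintEffectNoSchedule}}
	soft := []Taint{{Key: "example.com/soft", Effect: TaintEffectPreferNoSchedule}}

	fmt.Println(blocksScheduling(notReady)) // true
	fmt.Println(blocksScheduling(soft))     // false
	fmt.Println(blocksScheduling(nil))      // false
}
```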
We can make the changes in phases, as described below:
Phase 1:
As part of handling the NodeUpdate event, the Shim should be able to read the
{{Taints}} info from {{*v1.Node}} and pass it to the core through
{{*si.NodeInfo}} to do two things:
1. Mark the node as "unschedulable", based on the taint effect, so that it
does not get picked for scheduling.
2. Update metrics.
To do this, we need to add a few more fields to the {{*si.NodeInfo}} message
as well.
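The Phase 1 flow could look roughly like the Go sketch below. Everything in it is an assumption for illustration: `NodeInfo` is a trimmed-down stand-in for the real {{*si.NodeInfo}} message, the `Unschedulable` and `TaintKeys` fields are made-up names for the "few more fields", and `toNodeInfo` and `countUnschedulable` are hypothetical helpers.

```go
package main

import "fmt"

// Taint is a simplified stand-in for v1.Taint from k8s.io/api/core/v1.
type Taint struct {
	Key    string
	Effect string
}

// NodeInfo is a hypothetical, trimmed-down stand-in for the si.NodeInfo
// message. Unschedulable and TaintKeys are assumed field names.
type NodeInfo struct {
	NodeID        string
	Unschedulable bool
	TaintKeys     []string
}

// toNodeInfo copies the taint keys seen on a node update and marks the node
// unschedulable when any taint has a NoSchedule or NoExecute effect.
func toNodeInfo(nodeID string, taints []Taint) NodeInfo {
	info := NodeInfo{NodeID: nodeID}
	for _, t := range taints {
		info.TaintKeys = append(info.TaintKeys, t.Key)
		if t.Effect == "NoSchedule" || t.Effect == "NoExecute" {
			info.Unschedulable = true
		}
	}
	return info
}

// countUnschedulable computes the kind of value a metric could report.
func countUnschedulable(nodes []NodeInfo) int {
	n := 0
	for _, node := range nodes {
		if node.Unschedulable {
			n++
		}
	}
	return n
}

func main() {
	nodes := []NodeInfo{
		toNodeInfo("node-1", []Taint{{Key: "node.kubernetes.io/not-ready", Effect: "NoSchedule"}}),
		toNodeInfo("node-2", nil),
	}
	fmt.Println(nodes[0].Unschedulable)    // true
	fmt.Println(countUnschedulable(nodes)) // 1
}
```

Whether the core should receive the raw taints or only a derived flag is a design choice this sketch does not settle; it carries both so either side can use them.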
Phase 2:
We can handle Tolerations in future phases if needed. Tolerations describe
how pods can tolerate these taints. Usually, pods don't need to tolerate the
built-in (system) taints discussed above. A user- or admin-defined taint, on
the other hand, could be used to prevent Pods that don't require a GPU from
being scheduled on GPU nodes.
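To make the GPU example concrete, here is a small Go sketch of toleration matching. The types are local, simplified stand-ins for `v1.Taint` and `v1.Toleration`; the taint key `example.com/gpu` is a made-up name; and `tolerates` and `canSchedule` are hypothetical helpers that follow the matching rules as I understand them from the Kubernetes documentation (an empty effect or key on a toleration matches any effect or key, and `Exists` ignores the value).

```go
package main

import "fmt"

// Simplified stand-ins for v1.Taint and v1.Toleration.
type Taint struct{ Key, Value, Effect string }
type Toleration struct{ Key, Operator, Value, Effect string }

// tolerates reports whether a single toleration matches a single taint.
func tolerates(tol Toleration, taint Taint) bool {
	if tol.Effect != "" && tol.Effect != taint.Effect {
		return false
	}
	if tol.Key != "" && tol.Key != taint.Key {
		return false
	}
	switch tol.Operator {
	case "", "Equal": // an empty operator is treated as Equal
		return tol.Value == taint.Value
	case "Exists":
		return true
	default:
		return false
	}
}

// canSchedule requires every NoSchedule or NoExecute taint on the node to
// be matched by at least one of the pod's tolerations.
func canSchedule(tolerations []Toleration, taints []Taint) bool {
	for _, taint := range taints {
		if taint.Effect != "NoSchedule" && taint.Effect != "NoExecute" {
			continue
		}
		tolerated := false
		for _, tol := range tolerations {
			if tolerates(tol, taint) {
				tolerated = true
				break
			}
		}
		if !tolerated {
			return false
		}
	}
	return true
}

func main() {
	// Hypothetical admin-defined taint on a GPU node.
	gpuNode := []Taint{{Key: "example.com/gpu", Value: "true", Effect: "NoSchedule"}}
	gpuPod := []Toleration{{Key: "example.com/gpu", Operator: "Equal", Value: "true", Effect: "NoSchedule"}}

	fmt.Println(canSchedule(nil, gpuNode))    // false: pod without toleration is kept off
	fmt.Println(canSchedule(gpuPod, gpuNode)) // true: GPU pod tolerates the taint
}
```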
> Use k8s Nodes condition to determine health of the node
> ---
>
> Key: YUNIKORN-1117
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1117
> Project: Apache YuniKorn
> Issue Type: Improvement
>Reporter: Manikandan R
>Assignee: Manikandan R
>Priority: Major
> Fix For: 1.0.0
>
>
> Among multiple conditions discussed in
> [https://kubernetes.io/docs/concepts/architecture/nodes/#condition], only
> "Ready" has been used. We should use other conditions as well to determine a
> generic {{isNodeHealthy}} factor and eventually passing to core as well.
> Please refer the discussion
> [https://github.com/apache/incubator-yunikorn-k8shim/pull/380#issuecomment-1066328969]
> for more details.