[ 
https://issues.apache.org/jira/browse/YUNIKORN-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17508916#comment-17508916
 ] 

Manikandan R commented on YUNIKORN-1117:
----------------------------------------

On digging deeper into this, I learned that it is better to depend on "Taints 
and Tolerations" instead of "Conditions", as the former has superseded the 
latter because of its flexibility (Namespaces etc). For each of the conditions 
documented above, k8s automatically creates a taint against that particular 
node. Each taint also has an effect, such as NoSchedule or NoExecute. For 
example, for the "Ready == false" condition, the "node.kubernetes.io/not-ready" 
taint is created against that node with the NoSchedule effect. NoSchedule means 
that the node should not be picked for scheduling new pods.
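
As a rough illustration of the condition-to-taint relationship described above, 
the sketch below lists the well-known taint keys that Kubernetes documents for 
node conditions. The type {{conditionState}} and the function {{taintKeyFor}} 
are hypothetical names used only for this example; they are not part of the 
shim or of the Kubernetes API.

```go
package main

// conditionState is a hypothetical key type for this sketch: a node
// condition type paired with its status ("True", "False" or "Unknown").
type conditionState struct {
	Type   string
	Status string
}

// conditionTaints lists the well-known taint keys that Kubernetes adds to a
// node when the corresponding condition is observed.
var conditionTaints = map[conditionState]string{
	{Type: "Ready", Status: "False"}:             "node.kubernetes.io/not-ready",
	{Type: "Ready", Status: "Unknown"}:           "node.kubernetes.io/unreachable",
	{Type: "MemoryPressure", Status: "True"}:     "node.kubernetes.io/memory-pressure",
	{Type: "DiskPressure", Status: "True"}:       "node.kubernetes.io/disk-pressure",
	{Type: "PIDPressure", Status: "True"}:        "node.kubernetes.io/pid-pressure",
	{Type: "NetworkUnavailable", Status: "True"}: "node.kubernetes.io/network-unavailable",
}

// taintKeyFor returns the taint key associated with a condition state, and
// false when the state does not lead to a taint (for example Ready == True).
func taintKeyFor(condType, status string) (string, bool) {
	key, ok := conditionTaints[conditionState{Type: condType, Status: status}]
	return key, ok
}
```

Since every unhealthy condition surfaces as a taint, a consumer that reads only 
the taints does not need to interpret each condition separately.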

We can make the changes in phases, as described below:

Phase 1:

As part of handling the NodeUpdate event, the shim should be able to receive 
the {{Taints}} info from {{*v1.Node}} and pass it to the core through 
{{*si.NodeInfo}} in order to do two things: 1. Mark the node as 
"unschedulable", based on the taint effect, so that it doesn't get picked for 
scheduling. 2. Update metrics. To do this, we need to add a few more fields to 
the {{*si.NodeInfo}} message as well.
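
A minimal sketch of what Phase 1 could look like, under stated assumptions: the 
{{Taint}} struct below is a local stand-in mirroring the key, value and effect 
fields of a Kubernetes taint, and {{NodeInfo}} with its {{Schedulable}} and 
{{BlockingTaints}} fields is a hypothetical stand-in for the extra fields that 
would be added to {{*si.NodeInfo}}. None of these names are taken from the 
actual shim or scheduler interface. The sketch also ignores tolerations, so any 
NoSchedule or NoExecute taint, including a user-defined one, marks the node as 
unschedulable.

```go
package main

// TaintEffect mirrors the effect names used by Kubernetes taints.
type TaintEffect string

const (
	NoSchedule       TaintEffect = "NoSchedule"
	PreferNoSchedule TaintEffect = "PreferNoSchedule"
	NoExecute        TaintEffect = "NoExecute"
)

// Taint is a local stand-in with the same fields as a Kubernetes node taint.
type Taint struct {
	Key    string
	Value  string
	Effect TaintEffect
}

// NodeInfo is a hypothetical stand-in for the message passed to the core.
// Schedulable and BlockingTaints are assumed fields, not existing ones.
type NodeInfo struct {
	NodeID         string
	Schedulable    bool
	BlockingTaints []string // keys of taints that block scheduling, for metrics
}

// buildNodeInfo derives the schedulable flag from the taint effects.
// NoSchedule and NoExecute block scheduling; PreferNoSchedule is only a
// soft preference and is therefore not treated as blocking here.
func buildNodeInfo(nodeID string, taints []Taint) NodeInfo {
	info := NodeInfo{NodeID: nodeID, Schedulable: true}
	for _, t := range taints {
		if t.Effect == NoSchedule || t.Effect == NoExecute {
			info.Schedulable = false
			info.BlockingTaints = append(info.BlockingTaints, t.Key)
		}
	}
	return info
}
```

The list of blocking taint keys is kept alongside the flag so that the metrics 
update can report why a node was excluded, not only that it was.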

Phase 2:

We can handle Tolerations in future phases if needed. Tolerations describe 
which taints a pod can tolerate. Usually, pods don't need to tolerate the 
built-in, system-defined taints discussed above. A user- or admin-defined 
taint, on the other hand, could be used to prevent pods that don't require a 
GPU from being scheduled on GPU nodes.
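
To make the toleration side concrete, here is a sketch of the matching rule as 
Kubernetes documents it: a toleration matches a taint when the effects agree 
(an empty toleration effect matches any effect), the keys agree (an empty 
toleration key matches any key), and either the operator is "Exists" or the 
values are equal. The {{Taint}} and {{Toleration}} structs are local stand-ins, 
{{tolerates}} and {{podFits}} are hypothetical helper names, and the "gpu" 
taint key in the usage note is made up for illustration.

```go
package main

// Taint and Toleration are local stand-ins carrying the same fields as
// their Kubernetes counterparts, with plain strings for simplicity.
type Taint struct {
	Key, Value, Effect string
}

type Toleration struct {
	Key, Operator, Value, Effect string
}

// tolerates reports whether a single toleration matches a single taint.
func tolerates(tol Toleration, t Taint) bool {
	if tol.Effect != "" && tol.Effect != t.Effect {
		return false
	}
	if tol.Key != "" && tol.Key != t.Key {
		return false
	}
	switch tol.Operator {
	case "", "Equal": // "Equal" is the default operator
		return tol.Value == t.Value
	case "Exists":
		return true
	default:
		return false
	}
}

// podFits reports whether every NoSchedule or NoExecute taint on the node
// is matched by at least one of the pod's tolerations.
func podFits(tolerations []Toleration, taints []Taint) bool {
	for _, t := range taints {
		if t.Effect != "NoSchedule" && t.Effect != "NoExecute" {
			continue
		}
		tolerated := false
		for _, tol := range tolerations {
			if tolerates(tol, t) {
				tolerated = true
				break
			}
		}
		if !tolerated {
			return false
		}
	}
	return true
}
```

With a GPU node tainted as {{Taint{Key: "gpu", Value: "true", Effect: 
"NoSchedule"}}}, a pod without tolerations does not fit, while a pod carrying 
{{Toleration{Key: "gpu", Operator: "Exists"}}} does.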

> Use k8s Nodes condition to determine health of the node
> -------------------------------------------------------
>
>                 Key: YUNIKORN-1117
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-1117
>             Project: Apache YuniKorn
>          Issue Type: Improvement
>            Reporter: Manikandan R
>            Assignee: Manikandan R
>            Priority: Major
>             Fix For: 1.0.0
>
>
> Among the multiple conditions discussed in 
> [https://kubernetes.io/docs/concepts/architecture/nodes/#condition], only 
> "Ready" is currently used. We should use the other conditions as well to 
> determine a generic {{isNodeHealthy}} factor and eventually pass it to the 
> core.
> Please refer to the discussion 
> [https://github.com/apache/incubator-yunikorn-k8shim/pull/380#issuecomment-1066328969]
>  for more details.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
