[
https://issues.apache.org/jira/browse/YUNIKORN-1091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17496466#comment-17496466
]
Wilfred Spiegelenburg commented on YUNIKORN-1091:
-------------------------------------------------
The [problematic
code|https://github.com/apache/incubator-yunikorn-k8shim/blob/v0.12.2/pkg/cache/nodes.go#L210-L220]
is in the nodes.go file.
To explain the problem case:
The flow-on effect of the caches being out of sync is that the shim could send
an out-of-sync resource value to the core. The capacity will be updated
in the core but not in the corresponding shim _SchedulerNode_. Later in the
process, if a pod is scheduled by something other than YuniKorn, we get
into trouble.
In the call {{schedulerNodes.updateNodeOccupiedResources()}} we use the nodes
cache. We retrieve a node from the cache in [line
166|https://github.com/apache/incubator-yunikorn-k8shim/blob/v0.12.2/pkg/cache/nodes.go#L166].
This is not yet the problem: we update the occupied resources correctly. But
then we send an update to the core based on the cached node. Since we did NOT
update the cached node with the capacity change in lines 210-220, the capacity
that is sent is the wrong value.
This means that the capacity change we saw earlier is wiped out and the
capacity is reset to the value that was set when the node was created.
> node capacity update not updated in cache
> -----------------------------------------
>
> Key: YUNIKORN-1091
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1091
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: shim - kubernetes
> Reporter: Wilfred Spiegelenburg
> Priority: Critical
>
> {{ctx.updateNode()}} gets called by the K8s client event handling. This
> updates the two caches using two calls:
> # {{ctx.schedulerCache.UpdateNode()}}
> # {{ctx.nodes.updateNode()}}
> The first call adds the node (type _framework.NodeInfo_) if it does not exist
> and otherwise replaces the entry in the cache with the new info (GOOD).
> The second call adds the node (type _SchedulerNode_) if it does not exist
> (GOOD). If the node exists it checks for a resource change. If there is a
> change, the call notifies the scheduler core. It does NOT update the secondary
> cache with this change. This leaves the SchedulerNode capacity unchanged and
> out of sync.