[
https://issues.apache.org/jira/browse/YUNIKORN-839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17419582#comment-17419582
]
Wilfred Spiegelenburg commented on YUNIKORN-839:
------------------------------------------------
Some further detail on the comment above. Note that this does not need fixing
in the code in this jira. We just need to be aware of it, and either log a
follow-up or document the case.
Take a cluster with 10 nodes. The two node distributions below show the same
graph but really have a completely different cluster utilisation:
||node resource type||node set 1||node set 2||graph shown||
|fully utilised cpu|n1, n2, n3|n1, n2, n3|3 nodes in top usage bucket|
|fully utilised memory|n4, n5, n6|n1, n2, n3|3 nodes in top usage bucket|
|fully utilised GPU|n7, n8, n9|n1, n2, n3|3 nodes in top usage bucket|
|_empty_ nodes|n10|n4, n5, n6, n7, n8, n9, n10 |not displayed|
The difference matters to an administrator, but we do not give them any data
to tell the two cases apart.
It might be good to add a count showing the number of distinct nodes in the top
and bottom buckets over all the resources. For {{node set 1}} we would show 9
in the top range and for {{node set 2}} we would show 3. The closer that
distinct number of nodes is to the number shown in the bucket, the more
concentrated the load is. The larger the difference, the more distributed the
load is.
So: 3 distinct nodes compared to 3 in the bucket is highly concentrated, 9
distinct nodes compared to 3 in the bucket is highly distributed.
I know this is a simple example, and we probably need to take the average
number of nodes in a bucket over all the resource types, but I think this makes
clear what I see as the problem.
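The distinct node count suggested above could be computed roughly as follows. This is only a sketch in Go: the {{topBucketNodes}} type and both helper functions are made-up names for illustration and are not part of the YuniKorn code base.

```go
package main

import "sort"

// topBucketNodes maps a resource type (for example "cpu") to the names of the
// nodes whose usage of that resource falls in the top usage bucket.
// Assumption: this layout is invented for the example, it is not the
// YuniKorn data model.
type topBucketNodes map[string][]string

// distinctNodes returns the sorted names of all nodes that appear in the top
// bucket of at least one resource type.
func distinctNodes(buckets topBucketNodes) []string {
	seen := make(map[string]struct{})
	for _, nodes := range buckets {
		for _, name := range nodes {
			seen[name] = struct{}{}
		}
	}
	names := make([]string, 0, len(seen))
	for name := range seen {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

// largestBucket returns the highest per-resource node count, which is the
// number the graph shows for the top bucket in the simple example.
func largestBucket(buckets topBucketNodes) int {
	largest := 0
	for _, nodes := range buckets {
		if len(nodes) > largest {
			largest = len(nodes)
		}
	}
	return largest
}
```

With the two node sets from the table, {{distinctNodes}} gives 9 nodes for node set 1 and 3 for node set 2, while {{largestBucket}} gives 3 for both. Showing the two numbers side by side would let an administrator see whether the load is concentrated or distributed.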
> getNodesUtilJSON is broken
> --------------------------
>
> Key: YUNIKORN-839
> URL: https://issues.apache.org/jira/browse/YUNIKORN-839
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: core - common
> Affects Versions: 0.11, 0.10
> Reporter: Wilfred Spiegelenburg
> Assignee: Ting Yao,Huang
> Priority: Critical
>
> In {{getNodesUtilJSON()}} the utilisation is only calculated for resources
> that are defined on a partition level. These might be the same resources as
> defined in the nodes or they might not be.
> When walking over the nodes, the last node visited will ultimately set the
> {{resourceExists}} flag. If I have a set of 2 nodes, one that has the resource
> GPU defined and one that has not, the outcome of the call will differ
> depending on the order in which the nodes get visited.
> If the node with GPU is visited last, the second loop will be entered with
> {{resourceExists}} true. If it was checked first, {{resourceExists}} will be
> false.
> That means the output of the call is unreliable. It also means that any web
> UI parts that display the details cannot be relied on.
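The order dependence described in the quoted issue can be illustrated with a small Go sketch. It is not the actual {{getNodesUtilJSON()}} code: the {{node}} type and both functions are hypothetical and only mimic a flag that is overwritten on every iteration, next to an order independent check for comparison.

```go
package main

// node is a hypothetical stand-in for a scheduler node. It only records which
// resource types the node defines.
type node struct {
	name      string
	resources map[string]bool
}

// lastNodeWins mimics the reported behaviour: the flag is overwritten on
// every iteration, so only the last node visited decides the result.
func lastNodeWins(nodes []node, resource string) bool {
	resourceExists := false
	for _, n := range nodes {
		resourceExists = n.resources[resource]
	}
	return resourceExists
}

// anyNodeHas is an order independent alternative: it reports true as soon as
// any node defines the resource.
func anyNodeHas(nodes []node, resource string) bool {
	for _, n := range nodes {
		if n.resources[resource] {
			return true
		}
	}
	return false
}
```

For one node with GPU and one without, {{lastNodeWins}} returns true only when the GPU node is visited last, while {{anyNodeHas}} returns true for either visiting order.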