[jira] [Commented] (YUNIKORN-839) getNodesUtilJSON is broken

Wilfred Spiegelenburg (Jira) Thu, 23 Sep 2021 22:54:05 -0700


    [ 
https://issues.apache.org/jira/browse/YUNIKORN-839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17419582#comment-17419582
 ]


Wilfred Spiegelenburg commented on YUNIKORN-839:
------------------------------------------------

Some further detail on the comment above. Note that this does not need fixing 
in the code in this jira. We just need to be aware of it and either log a 
follow-up or document the case.

Take a cluster of 10 nodes. The two node distributions below produce the same 
graph but represent completely different cluster utilisation:

 
||node resource type||node set 1||node set 2||graph shown||
|fully utilised cpu|n1, n2, n3|n1, n2, n3|3 nodes in top usage bucket|
|fully utilised memory|n4, n5, n6|n1, n2, n3|3 nodes in top usage bucket|
|fully utilised GPU|n7, n8, n9|n1, n2, n3|3 nodes in top usage bucket|
|_empty_ nodes|n10|n4, n5, n6, n7, n8, n9, n10|not displayed|

The difference matters to an administrator, but we do not give them any data 
to tell the two cases apart.

It might be good to add a count showing the number of distinct nodes in the top 
and bottom buckets over all the resources. For {{node set 1}} we would show 9 
in the top range and for {{node set 2}} we would show 3. The closer that 
distinct number of nodes is to the number shown in the bucket, the more 
concentrated the load is. The larger the difference, the more distributed the 
load is.

So 3 distinct nodes compared to 3 in the bucket is highly concentrated, while 
9 distinct nodes compared to 3 in the bucket is highly distributed.

I know this is a simple example, and we probably need to take the average of 
nodes in a bucket over all the resource types, but I think this makes it clear 
what I see as the problem.

 

> getNodesUtilJSON is broken
> --------------------------
>
>                 Key: YUNIKORN-839
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-839
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: core - common
>    Affects Versions: 0.11, 0.10
>            Reporter: Wilfred Spiegelenburg
>            Assignee: Ting Yao,Huang
>            Priority: Critical
>
> {{getNodesUtilJSON()}} only calculates utilisation for resources that are 
> defined on a partition level. These might be the same resources as defined 
> in the nodes or they might not be.
> When walking over the nodes, the last node visited ultimately sets the 
> {{resourceExists}} flag. If I have a set of 2 nodes, one that has the resource 
> GPU defined and one that has not, the outcome of the call depends on the 
> order in which the nodes get visited.
> If the node with GPU is visited last, the second loop will be entered with 
> {{resourceExists}} true. If it is visited first, {{resourceExists}} will be 
> false.
> That means the output of the call is unreliable. It also means that any web 
> UI parts that display the details cannot be relied on.
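
The order dependence described in the quoted issue can be reproduced with a 
small Go sketch. This is not the actual {{getNodesUtilJSON()}} code: the 
{{node}} type and the {{lastWins}} and {{anyNode}} helpers are hypothetical and 
only model a flag that each visited node overwrites, next to an 
order-independent alternative.

```go
package main

import "fmt"

// node is a hypothetical stand-in that only records which resource types a
// node defines. It is not the YuniKorn node object.
type node map[string]bool

// lastWins models the behaviour described in the issue: every visited node
// overwrites the flag, so only the last node decides the result.
func lastWins(nodes []node, resource string) bool {
	resourceExists := false
	for _, n := range nodes {
		resourceExists = n[resource]
	}
	return resourceExists
}

// anyNode is one possible order-independent check: the resource exists as soon
// as a single node defines it. This is an assumption, not the committed fix.
func anyNode(nodes []node, resource string) bool {
	for _, n := range nodes {
		if n[resource] {
			return true
		}
	}
	return false
}

func main() {
	withGPU := node{"vcore": true, "memory": true, "gpu": true}
	noGPU := node{"vcore": true, "memory": true}

	fmt.Println(lastWins([]node{noGPU, withGPU}, "gpu")) // true
	fmt.Println(lastWins([]node{withGPU, noGPU}, "gpu")) // false
	fmt.Println(anyNode([]node{noGPU, withGPU}, "gpu"))  // true
	fmt.Println(anyNode([]node{withGPU, noGPU}, "gpu"))  // true
}
```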


