[ 
https://issues.apache.org/jira/browse/YUNIKORN-2573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17841449#comment-17841449
 ] 

Yu-Lin Chen commented on YUNIKORN-2573:
---------------------------------------

Hi [~arhtur007], 



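The issue is caused by the different lock levels used when updating nodeInfo 
and when updating partitionContext.
The two updates do not share a lock, and the node update happens before the 
partition update, so there is a window in which the node already shows the 
new resources while the partition total has not been updated yet.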
 * [https://github.com/apache/yunikorn-core/blob/03c3ccce0d618ba163e726515d0a3185f213a695/pkg/scheduler/context.go#L687]
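
To illustrate the window between the two updates, here is a minimal sketch. It is not the yunikorn-core code: the node and partition types, their methods, and observeBetweenUpdates are hypothetical stand-ins for nodeInfo and partitionContext, each guarded by its own lock. The read is placed between the two steps on purpose, to show what a concurrent reader could see at that moment.

```go
package main

import (
	"fmt"
	"sync"
)

// node is a hypothetical stand-in for nodeInfo, guarded by its own lock.
type node struct {
	sync.RWMutex
	capacity int
}

// partition is a hypothetical stand-in for partitionContext, guarded by a
// different lock than node.
type partition struct {
	sync.RWMutex
	total int
}

// setCapacity updates the node under the node lock only and returns the
// change in capacity.
func (n *node) setCapacity(c int) (delta int) {
	n.Lock()
	defer n.Unlock()
	delta = c - n.capacity
	n.capacity = c
	return delta
}

func (n *node) getCapacity() int {
	n.RLock()
	defer n.RUnlock()
	return n.capacity
}

// addTotal updates the partition under the partition lock only.
func (p *partition) addTotal(delta int) {
	p.Lock()
	defer p.Unlock()
	p.total += delta
}

func (p *partition) getTotal() int {
	p.RLock()
	defer p.RUnlock()
	return p.total
}

// observeBetweenUpdates applies the node update, reads both values in the
// window before the partition update, then applies the partition update.
// It returns what a reader could have seen inside that window.
func observeBetweenUpdates(n *node, p *partition, newCap int) (nodeSeen, partSeen int) {
	delta := n.setCapacity(newCap)                     // step 1: node lock only
	nodeSeen, partSeen = n.getCapacity(), p.getTotal() // the unprotected window
	p.addTotal(delta)                                  // step 2: partition lock only
	return nodeSeen, partSeen
}

func main() {
	n := &node{capacity: 10}
	p := &partition{total: 10}
	nodeSeen, partSeen := observeBetweenUpdates(n, p, 20)
	// In the window the node is already at 20 while the partition is still at 10.
	fmt.Println(nodeSeen, partSeen, p.getTotal())
}
```

A check that waits only for the node value can therefore pass while the partition total is still stale.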

 

I think one possible solution is to change waitForAvailableNodeResource() to 
waitForTotalPartitionResource():
 * [https://github.com/apache/yunikorn-core/blob/03c3ccce0d618ba163e726515d0a3185f213a695/pkg/scheduler/tests/operation_test.go#L502-L503]

 

> Flaky test TestUpdateNodeCapacityWithMultipleNodes
> --------------------------------------------------
>
>                 Key: YUNIKORN-2573
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2573
>             Project: Apache YuniKorn
>          Issue Type: Bug
>            Reporter: Arthur Wang
>            Assignee: Arthur Wang
>            Priority: Minor
>
> [GitHub pipeline|https://github.com/apache/yunikorn-core/actions/runs/8770718393/job/24067600801]
> The GitHub CI run occasionally fails.
> Still working on finding the root cause.
> Since there is always an error or warning from the scheduler health check when 
> running multiple tests at the same time,
> it may be a test setup issue.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
