Wilfred Spiegelenburg created YUNIKORN-2171:
-----------------------------------------------

             Summary: race condition in node removal and scheduling cycle
                 Key: YUNIKORN-2171
                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2171
             Project: Apache YuniKorn
          Issue Type: Bug
          Components: core - scheduler
    Affects Versions: 1.4.0, 1.3.0, 1.2.0, 1.1.0, 1.0.0
            Reporter: Wilfred Spiegelenburg
            Assignee: Wilfred Spiegelenburg


When a node is removed, the partition resources, and thus the root queue max 
resources, are decreased. The node removal locks the partition, removes the 
node, and releases the partition lock before proceeding. Cleanup of the node's 
allocations happens only after that. This means that for a short period of time 
the root queue max resources are already decreased while its usage is not.

A scheduling cycle can be running during the node removal. The queue headroom 
is calculated as the difference between the queue max resources and the queue 
usage. The whole queue hierarchy is traversed for this.
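
The headroom calculation described above can be sketched as follows. This is a 
minimal illustration and not the YuniKorn implementation: the Queue type, its 
fields and the headroom function are hypothetical, resources are reduced to a 
single integer dimension, and every queue is assumed to have an explicit max.

```go
package main

import "fmt"

// Queue is a hypothetical, simplified queue: one resource dimension,
// and every queue carries an explicit max (assumption for this sketch).
type Queue struct {
	name   string
	parent *Queue
	max    int64
	used   int64
}

// headroom walks from q up to the root and returns the smallest
// (max - used) found on the way. A negative value means that some
// queue in the hierarchy is already over its max.
func headroom(q *Queue) int64 {
	h := q.max - q.used
	for p := q.parent; p != nil; p = p.parent {
		if d := p.max - p.used; d < h {
			h = d
		}
	}
	return h
}

func main() {
	root := &Queue{name: "root", max: 100, used: 90}
	leaf := &Queue{name: "root.default", parent: root, max: 50, used: 10}
	// The leaf has 40 left, but the root only has 10: the root limits.
	fmt.Println(headroom(leaf))
}
```

In this sketch the leaf's headroom is limited by the root queue, which is the 
situation in which lowering the root max matters.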

If the headroom is limited by the root queue, then we could have a race between 
the removal of the node allocations and scheduling:
 * scheduling starts and queue headroom is calculated
 * node is removed, queue max is lowered
 * scheduling finds new allocation
 * new allocation gets added to the queue updating usage
 * root queue is over its max already or would go over max: scheduling fails
 * node allocation removal proceeds and corrects the queue usage
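
The interleaving listed above can be replayed step by step with a small 
sketch. It runs sequentially in a single goroutine so the outcome is 
deterministic, and it does not use YuniKorn code: rootQueue, tryAllocate, 
replay and all numbers are assumptions chosen for illustration.

```go
package main

import "fmt"

// rootQueue is a hypothetical stand-in for the root queue with a
// single resource dimension.
type rootQueue struct {
	max, used int64
}

func (q *rootQueue) headroom() int64 { return q.max - q.used }

// tryAllocate adds size to the usage unless that would exceed max.
func (q *rootQueue) tryAllocate(size int64) bool {
	if q.used+size > q.max {
		return false
	}
	q.used += size
	return true
}

// replay walks through the race in a fixed order and reports the
// headroom seen by the scheduler, whether the allocation succeeded
// during the window, and whether a retry succeeds after cleanup.
func replay() (before int64, duringOK bool, afterOK bool) {
	root := &rootQueue{max: 100, used: 85}
	const removedCap, removedAlloc, ask = 50, 45, 10

	before = root.headroom() // scheduling starts, headroom calculated
	root.max -= removedCap   // node is removed, queue max is lowered
	// Scheduler finds an allocation based on the stale headroom and
	// tries to add it to the queue: usage is still the old value.
	duringOK = before >= ask && root.tryAllocate(ask)
	root.used -= removedAlloc       // node allocation removal proceeds
	afterOK = root.tryAllocate(ask) // same ask fits once usage is corrected
	return
}

func main() {
	before, duringOK, afterOK := replay()
	fmt.Println(before, duringOK, afterOK)
}
```

With these numbers the allocation is rejected only inside the window between 
lowering the max and cleaning up the usage, and the same request fits once the 
cleanup has run.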



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
