[ 
https://issues.apache.org/jira/browse/YUNIKORN-288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg resolved YUNIKORN-288.
--------------------------------------------
    Resolution: Duplicate

I can reproduce the issue, but I do not have a workaround.

The root cause is that communication between the cache and the scheduler is 
event driven. When the scheduler makes an update, it asks the cache to update 
via an async event. The application state only changes once that event has 
been processed, and the time that takes is undefined: it depends on OS-level 
thread scheduling, the system load, the number of cores and so on for the 
machine the scheduler runs on. All of that is out of our control and is 
deployment dependent.
Any delay in the event processing leaves the application between two states: 
the outstanding request has been processed and removed in the scheduler, but 
the state in the cache has not changed yet. The application is then no longer 
considered for scheduling, but it also does not yet block other applications 
based on its new expected state.
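
To make the window concrete, here is a minimal Go sketch of the pattern 
described above. It is not YuniKorn code: the types `cache` and `scheduler`, 
the function `simulate` and the simplified "only one app may be Starting" 
check are all hypothetical names and assumptions made for illustration. The 
event delay is modelled deterministically by choosing whether the cache 
drains its event channel between two scheduling cycles.

```go
package main

import "fmt"

// Hypothetical, simplified application states.
const (
	accepted = "Accepted"
	starting = "Starting"
)

// cache owns the application state. It only changes state when it
// processes an event sent by the scheduler.
type cache struct {
	state  map[string]string
	events chan string // IDs of apps that must move to Starting
}

// processEvents drains every queued event, as the cache's handler would
// once its goroutine gets CPU time.
func (c *cache) processEvents() {
	for {
		select {
		case id := <-c.events:
			c.state[id] = starting
		default:
			return
		}
	}
}

// scheduler owns the outstanding requests and reads state from the cache.
type scheduler struct {
	pending map[string]int
	order   []string // fixed iteration order keeps the sketch deterministic
	cache   *cache
}

// scheduleCycle allocates at most one request. It backs off while the
// cache reports any app as Starting (assumed state-aware behaviour).
func (s *scheduler) scheduleCycle() string {
	for _, st := range s.cache.state {
		if st == starting {
			return ""
		}
	}
	for _, id := range s.order {
		if s.pending[id] > 0 {
			s.pending[id]--      // request removed in the scheduler right away
			s.cache.events <- id // state change is only requested, not applied
			return id
		}
	}
	return ""
}

// simulate runs two scheduling cycles and returns the apps that ended up
// in Starting. eventDelayed controls whether the cache processes the
// first event before the second cycle runs.
func simulate(eventDelayed bool) []string {
	c := &cache{
		state:  map[string]string{"app02": accepted, "app03": accepted},
		events: make(chan string, 8),
	}
	s := &scheduler{
		pending: map[string]int{"app02": 1, "app03": 1},
		order:   []string{"app02", "app03"},
		cache:   c,
	}
	s.scheduleCycle()
	if !eventDelayed {
		c.processEvents()
	}
	s.scheduleCycle()
	c.processEvents()

	var out []string
	for _, id := range s.order {
		if c.state[id] == starting {
			out = append(out, id)
		}
	}
	return out
}

func main() {
	fmt.Println(simulate(false)) // [app02]
	fmt.Println(simulate(true))  // [app02 app03]
}
```

With the event processed in time only app02 moves to Starting. With the event 
delayed by a single cycle, app02 has no outstanding request left and is not 
yet Starting in the cache, so the second cycle allocates for app03 as well.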

There will always be cases where the event processing is delayed long enough 
for one or more scheduling cycles to run. How likely that is depends on the 
outstanding requests, applications and queues configured. The worst case is 
multiple applications in the same queue with one outstanding request each: 
scheduling is fastest for that case because it requires the least amount of 
work on the scheduler side.

We have logged YUNIKORN-317 to merge the scheduling logic and the cache into 
one. It is the only possible solution.
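
As an illustration only of why merging closes the window, the Go sketch 
below keeps the outstanding requests and the application state in one 
object and updates both in the same step. The actual design for 
YUNIKORN-317 is not described in this message, so `application`, 
`mergedScheduler` and `simulateMerged` are hypothetical names, and the 
"only one app may be Starting" check is a simplifying assumption.

```go
package main

import "fmt"

// application holds both the state and the outstanding requests, so there
// is no second copy of the state that can lag behind.
type application struct {
	id      string
	state   string
	pending int
}

// mergedScheduler is a hypothetical scheduler that owns the applications
// directly instead of asking a separate cache to update them.
type mergedScheduler struct {
	apps []*application
}

// scheduleCycle allocates at most one request and changes the state in the
// same step, so the next cycle always sees the new state.
func (s *mergedScheduler) scheduleCycle() string {
	for _, a := range s.apps {
		if a.state == "Starting" {
			return "" // assumed state-aware behaviour: wait for this app
		}
	}
	for _, a := range s.apps {
		if a.state == "Accepted" && a.pending > 0 {
			a.pending--
			a.state = "Starting" // synchronous, no async event involved
			return a.id
		}
	}
	return ""
}

// simulateMerged runs the given number of cycles back to back and returns
// the apps that ended up in Starting.
func simulateMerged(cycles int) []string {
	s := &mergedScheduler{apps: []*application{
		{id: "app02", state: "Accepted", pending: 1},
		{id: "app03", state: "Accepted", pending: 1},
	}}
	for i := 0; i < cycles; i++ {
		s.scheduleCycle()
	}
	var out []string
	for _, a := range s.apps {
		if a.state == "Starting" {
			out = append(out, a.id)
		}
	}
	return out
}

func main() {
	fmt.Println(simulateMerged(3)) // [app02]
}
```

However many cycles run back to back, only app02 reaches Starting, because 
there is no point at which the request is gone but the state is still stale.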

> With StateAware Sorting policy enabled, Race condition while moving the apps 
> from accepted to starting state
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: YUNIKORN-288
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-288
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: core - scheduler
>            Reporter: Ayub Pathan
>            Assignee: Wilfred Spiegelenburg
>            Priority: Critical
>
> With 3 apps submitted, each having one pod, the expected states are as 
> follows.
> {noformat}
> At this point, the apps state will be
> app01 - Starting
> app02 - Accepted
> app03 - Accepted{noformat}
> Now submit one more pod to app01. The expected states for each app are 
> then:
> {noformat}
> Add another pod for app01, and once this pod is allocated, verify app states:
>    app01 - Running => pod1, pod2
>    app02 - Starting => pod1
>    app03 - Accepted => pod1
>  {noformat}
> Due to a race condition, once the app01 pods are allocated, both app02 and 
> app03 move to the Starting state. This is not expected: app03 should remain 
> in the Accepted state, as shown above.



