[jira] [Comment Edited] (YUNIKORN-582) Consider a fallback mechanism to schedule the app in case of gang failure instead of marking the app as failed

Kinga Marton (Jira) Thu, 27 May 2021 05:00:06 -0700


    [ 
https://issues.apache.org/jira/browse/YUNIKORN-582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17351695#comment-17351695
 ]


Kinga Marton edited comment on YUNIKORN-582 at 5/27/21, 11:59 AM:
------------------------------------------------------------------

Hi [~chenya_zhang], for now we plan to introduce a very simple fix for this 
issue: after the timeout we will schedule the application as a simple 
application, not as one with gang requirements. In the future we can 
think about different scheduling policies.
{quote}1. Do we plan to re-prioritize a failed-to-be-scheduled app to the front 
of the queue?
{quote}
No, we are not planning to introduce priorities right now. We can think about 
it when we introduce application priority in a general way as well.
{quote}2. Is there still going to be a timeout if an app is retried for a few 
times?
{quote}
No, the application will be retried only once, at which point it is handled as 
a normal application. From this point on it behaves exactly the same way as a 
simple application.
{quote}3. Curious if YK expects the requested resource from an application to 
be reasonable/considerate.
{quote}
YK does not perform this kind of check.
{quote}4. Do we plan to let users know that their app is retried for one, two, 
three times
{quote}
The application will be retried only once. We can push some events to let the 
user know that the application is scheduled as a simple app. We will also have 
a new application state (Resuming, or FallingBack...) which will let the user 
know that the application is being retried after the timeout. Again, we can 
think about scheduling policies that define how many times to retry the 
application, but for now let's keep it as simple as possible.

 

The main workflow we have in mind is the following one: 

*Shim side*
 * Parse the gang scheduling style and pass it to the core as an application 
attribute.
 * When the shim gets the state change event, check whether the app moved into 
the Starting state (or a newly introduced state) while its state on the shim 
side is TryReserving. If so, start the application on the shim side as well 
and let the real pods be scheduled.
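
The shim-side check can be sketched roughly as follows. This is only an illustration, not the shim's actual code: the function name {{shouldStartRealPods}} and the {{triggerState}} parameter are hypothetical, and the trigger state is passed in because it is still open whether it will be Starting or a newly introduced state.

```go
package main

import "fmt"

// shouldStartRealPods is a hypothetical helper, not part of the shim API.
// It reports whether the shim should start the application and let the real
// pods be scheduled: the core must have moved the app into the trigger state
// while the shim still considers the app to be in TryReserving.
func shouldStartRealPods(coreState, shimState, triggerState string) bool {
	return coreState == triggerState && shimState == "TryReserving"
}

func main() {
	// Assumption for the example: the trigger state is "Starting".
	fmt.Println(shouldStartRealPods("Starting", "TryReserving", "Starting"))
	fmt.Println(shouldStartRealPods("Starting", "Running", "Starting"))
}
```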

*Core side*
 * When the placeholder times out, check the configured style:
 ** If it is {{hard}}, do the cleanup and fail the application, as we do now.
 ** If the style is {{soft}}:
 *** Clean up the placeholders, but don't fail the application; instead 
move it into a new state (Resuming, or FallingBack...).
 *** When all the placeholders are deleted, instead of failing the app, move it 
into the -Starting- Accepted state. From this point it will be handled as a 
simple application.
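
A minimal sketch of the core-side flow above, under stated assumptions: the type and method names ({{App}}, {{GangStyle}}, {{OnPlaceholderTimeout}}, {{OnPlaceholderRemoved}}) are made up for illustration and are not the core's real API, and {{Resuming}} stands in for whichever name the new state ends up with.

```go
package main

import "fmt"

// All names below are hypothetical and exist only for this sketch.
type GangStyle string

const (
	Hard GangStyle = "hard"
	Soft GangStyle = "soft"
)

type AppState string

const (
	Accepted AppState = "Accepted"
	Resuming AppState = "Resuming" // could also end up being called FallingBack
	Failed   AppState = "Failed"
)

type App struct {
	Style        GangStyle
	State        AppState
	Placeholders int // placeholders that still have to be cleaned up
}

// OnPlaceholderTimeout is called when the placeholder timeout fires.
// Hard style fails the app; soft style moves it into the intermediate state
// while the placeholders are being cleaned up.
func (a *App) OnPlaceholderTimeout() {
	if a.Style == Soft {
		a.State = Resuming
		return
	}
	a.State = Failed
}

// OnPlaceholderRemoved is called once per deleted placeholder. When the last
// one is gone and the app is resuming, it goes back to Accepted and is
// handled as a simple application from then on.
func (a *App) OnPlaceholderRemoved() {
	if a.Placeholders > 0 {
		a.Placeholders--
	}
	if a.Placeholders == 0 && a.State == Resuming {
		a.State = Accepted
	}
}

func main() {
	app := &App{Style: Soft, State: Accepted, Placeholders: 2}
	app.OnPlaceholderTimeout()
	app.OnPlaceholderRemoved()
	app.OnPlaceholderRemoved()
	fmt.Println(app.State) // prints "Accepted"
}
```

With {{Style: Hard}} the same timeout call leaves the app in {{Failed}}, and removing the placeholders afterwards does not change that.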

 cc [~maniraj...@gmail.com] for the shim side changes.


> Consider a fallback mechanism to schedule the app in case of gang failure 
> instead of marking the app as failed
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: YUNIKORN-582
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-582
>             Project: Apache YuniKorn
>          Issue Type: Sub-task
>          Components: core - scheduler
>            Reporter: Ayub Pathan
>            Assignee: Kinga Marton
>            Priority: Major
>
> In cases where the app encounters gang issues because placeholder pod 
> allocation failed (for various reasons), YuniKorn currently marks the app 
> as failed. 
> Instead, consider a configurable option for hard or soft gang scheduling 
> which allows a fallback mechanism to schedule the app successfully. This needs 
> to be brainstormed to see if it makes sense. Let us use this jira for 
> documenting all the thoughts.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

