[ https://issues.apache.org/jira/browse/YUNIKORN-582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17351695#comment-17351695 ]
Kinga Marton commented on YUNIKORN-582:
---------------------------------------

Hi [~chenya_zhang],

For now we plan to introduce a very simple fix for this issue: after the timeout, we will schedule the application as a simple application, not as one with gang requirements. In the future we can think about different scheduling policies.

{quote}1. Do we plan to re-prioritize a failed-to-be-scheduled app to the front of the queue?{quote}
No, we are not planning to introduce priorities right now. We can think about it when we introduce application priority in a general way.

{quote}2. Is there still going to be a timeout if an app is retried for a few times?{quote}
No, the application will be retried only once, and on that retry it will be handled as a normal application. From this point on it will behave exactly like a simple application.

{quote}3. Curious if YK expects the requested resource from an application to be reasonable/considerate.{quote}
YK does not perform this kind of check.

{quote}4. Do we plan to let users know that their app is retried for one, two, three times{quote}
The application will be retried only once. We can push some events to let the user know that the application is scheduled as a simple app. We will also have a new application state (Resuming, or FallingBack...) that will let the user know that the application is retried after the timeout. Again, we can think about scheduling policies that define how many times to retry the application, but for now let's keep it as simple as possible.

The main workflow we have in mind is the following:

*Shim side*
* Parse the gang scheduling style and pass it to the core as an application attribute.
* When the shim gets the state change event, check whether the app moved into the Starting state (or a newly introduced state) while its shim-side state is TryReserving. If so, start the application on the shim side as well and let the real pods be scheduled.

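As a rough illustration of the shim-side check described above, here is a minimal Go sketch. Every identifier in it ({{CoreStarting}}, {{CoreResuming}}, {{ShimTryReserving}}, {{shouldStartAfterFallback}}) is hypothetical and chosen only for this example; none of them is taken from the YuniKorn code base. The name of the new state is still open, so {{Resuming}} is only a stand-in.

```go
package main

import "fmt"

// All names below are hypothetical and exist only for this sketch.

// Application states as reported by the core in a state change event.
const (
	CoreStarting = "Starting"
	CoreResuming = "Resuming" // stand-in for the proposed new state; the final name is undecided
)

// ShimTryReserving is the shim-side state while placeholders are being reserved.
const ShimTryReserving = "TryReserving"

// shouldStartAfterFallback reports whether the shim should start the
// application and let the real pods be scheduled. This is the case only
// when the shim is still in TryReserving and the core has moved the app
// into Starting (or the newly introduced state).
func shouldStartAfterFallback(coreState, shimState string) bool {
	if shimState != ShimTryReserving {
		return false
	}
	return coreState == CoreStarting || coreState == CoreResuming
}

func main() {
	// Core fell back to simple scheduling while the shim was still reserving.
	fmt.Println(shouldStartAfterFallback(CoreStarting, ShimTryReserving))
	// The shim already left TryReserving, so there is nothing to do.
	fmt.Println(shouldStartAfterFallback(CoreStarting, "Running"))
}
```

Keeping the decision in a small pure function would make it easy to unit-test independently of the event handling code, but that is a design suggestion, not part of the proposal.
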
*Core side*
* When the placeholder times out, check the configured style:
** If it is {{hard}}, do the cleanup and fail the application, as we do now.
** If it is {{soft}}:
*** Do the placeholder cleanup, but don't fail the application; instead move it into a new state (Resuming, or FallingBack...).
*** When all the placeholders are deleted, instead of failing the app, move it into the Starting state. From this point it will be handled as a simple application.

> Consider a fallback mechanism to schedule the app in case of gang failure instead of marking the app as failed
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: YUNIKORN-582
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-582
>             Project: Apache YuniKorn
>          Issue Type: Sub-task
>          Components: core - scheduler
>            Reporter: Ayub Pathan
>            Assignee: Kinga Marton
>            Priority: Major
>
> In cases where the app encounters gang issues due to placeholder pod
> allocation (failed due to various reasons), YuniKorn currently marks the app
> as failed.
> Instead, consider a configurable option for hard or soft gang scheduling
> which allows a fallback mechanism to schedule the app successfully. This needs
> to be brainstormed to see if it makes sense. Let us use this jira for
> documenting all the thoughts.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)