[ https://issues.apache.org/jira/browse/YUNIKORN-588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Kinga Marton reassigned YUNIKORN-588:
-------------------------------------

    Assignee:     (was: Kinga Marton)

> Placeholder pods are not cleaned up in a timely manner when the Spark driver fails
> ----------------------------------------------------------------------
>
>                 Key: YUNIKORN-588
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-588
>             Project: Apache YuniKorn
>          Issue Type: Sub-task
>          Components: shim - kubernetes
>    Affects Versions: 0.10
>            Reporter: Chaoran Yu
>            Priority: Major
>              Labels: spark
>         Attachments: Screen Shot 2021-03-19 at 9.41.48 PM.png
>
>
> When a Spark job is gang scheduled and the driver pod fails immediately upon
> running (e.g. due to an error in the Spark application code), the placeholder
> pods keep reserving resources. They are not terminated until the configured
> timeout has passed, even though they should have been cleaned up the moment
> the driver failed: at that point it is already known that none of the
> executors will have a chance to start.
> Something probably needs to be done at the Spark operator plugin level to
> trigger placeholder cleanup and release the resources sooner.
> Edit: Actually, the fix needs to work without the Spark operator plugin,
> because the user might not be using it; the Spark job could well have been
> submitted via spark-submit.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)