[ 
https://issues.apache.org/jira/browse/YUNIKORN-588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kinga Marton reassigned YUNIKORN-588:
-------------------------------------

    Assignee:     (was: Kinga Marton)

> Placeholder pods are not cleaned up timely when the Spark driver fails
> ----------------------------------------------------------------------
>
>                 Key: YUNIKORN-588
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-588
>             Project: Apache YuniKorn
>          Issue Type: Sub-task
>          Components: shim - kubernetes
>    Affects Versions: 0.10
>            Reporter: Chaoran Yu
>            Priority: Major
>              Labels: spark
>         Attachments: Screen Shot 2021-03-19 at 9.41.48 PM.png
>
>
> When a Spark job is gang scheduled and the driver pod fails immediately upon 
> running (e.g. due to an error in the Spark application code), the placeholder 
> pods keep reserving resources. They are not terminated until the configured 
> timeout has passed, even though they should be cleaned up the moment the 
> driver fails: at that point we already know that none of the executors will 
> ever get a chance to start.
> Something probably needs to be done at the Spark operator plugin level to 
> trigger placeholder cleanup and release the resources sooner.
> Edit: Actually, the fix needs to work without the Spark operator plugin, 
> because the user might not be using it; the Spark job could well have been 
> submitted via spark-submit.
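
A minimal sketch of the idea described above, written as a standalone client-go
watcher rather than actual shim code: when a Spark driver pod enters the Failed
phase, delete the placeholder pods gang-scheduled for the same application
instead of waiting for the placeholder timeout. The "spark-role=driver" and
"spark-app-selector" labels are the ones Spark on Kubernetes puts on driver
pods; the "placeholder=true" selector for placeholder pods is an assumption for
illustration only, since the YuniKorn shim tracks its placeholders internally
and a proper fix would live there.

// Hypothetical external watcher: when a Spark driver pod fails, delete the
// placeholder pods that were gang-scheduled for the same application instead
// of waiting for the placeholder timeout to expire.
package main

import (
	"context"
	"fmt"
	"log"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	ctx := context.Background()
	ns := "spark"

	// Watch Spark driver pods; "spark-role=driver" is the label Spark on
	// Kubernetes applies to driver pods (via spark-submit or the operator).
	watcher, err := client.CoreV1().Pods(ns).Watch(ctx, metav1.ListOptions{
		LabelSelector: "spark-role=driver",
	})
	if err != nil {
		log.Fatal(err)
	}

	for event := range watcher.ResultChan() {
		pod, ok := event.Object.(*corev1.Pod)
		if !ok || pod.Status.Phase != corev1.PodFailed {
			continue
		}
		appID := pod.Labels["spark-app-selector"]
		log.Printf("driver %s failed, cleaning placeholders for app %s", pod.Name, appID)

		// ASSUMPTION: placeholder pods can be selected by a "placeholder=true"
		// label plus the application id; the real shim keeps its own record of
		// placeholder pods, so the actual selector may differ.
		selector := fmt.Sprintf("placeholder=true,spark-app-selector=%s", appID)
		err := client.CoreV1().Pods(ns).DeleteCollection(ctx,
			metav1.DeleteOptions{}, metav1.ListOptions{LabelSelector: selector})
		if err != nil {
			log.Printf("placeholder cleanup failed: %v", err)
		}
	}
}

The same reaction could instead be wired into the shim's own pod event
handling, which would make it work for both spark-submit and the Spark
operator, in line with the edit in the description above.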



