[jira] [Created] (YUNIKORN-588) Placeholder pods are not cleaned up timely when the Spark driver fails

Chaoran Yu (Jira) Fri, 19 Mar 2021 18:58:06 -0700

Chaoran Yu created YUNIKORN-588:
-----------------------------------

             Summary: Placeholder pods are not cleaned up timely when the Spark 
driver fails
                 Key: YUNIKORN-588
                 URL: https://issues.apache.org/jira/browse/YUNIKORN-588
             Project: Apache YuniKorn
          Issue Type: Bug
          Components: shim - kubernetes
    Affects Versions: 0.10
            Reporter: Chaoran Yu



When a Spark job is gang scheduled and the driver pod fails immediately after 
it starts running (e.g. due to an error in the Spark application code), the 
placeholder pods keep trying to reserve resources. They are not terminated 
until the configured timeout has passed, even though they should be cleaned up 
the moment the driver fails, because at that point we already know that none 
of the executors will have a chance to start. 
Something probably needs to be done at the Spark operator plugin level to 
trigger placeholder cleanup so that resources are released sooner.
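
To make the expected behavior concrete, here is a minimal sketch of the 
decision described above. It is not YuniKorn or Spark operator code: the Pod 
type, the role values, and the placeholders_to_release function are 
hypothetical names used only for illustration. The one thing taken from 
Kubernetes is the set of pod phase strings.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Pod:
    name: str
    role: str   # hypothetical label: "driver", "executor", or "placeholder"
    phase: str  # Kubernetes pod phase: Pending, Running, Succeeded, Failed, Unknown


def placeholders_to_release(pods: List[Pod], elapsed_s: float, timeout_s: float) -> List[str]:
    """Return the names of placeholder pods that should be deleted now.

    Assumed policy (illustrative only): release every placeholder as soon as
    the driver has failed, and otherwise fall back to the configured timeout.
    """
    placeholders = [p.name for p in pods if p.role == "placeholder"]
    driver_failed = any(p.role == "driver" and p.phase == "Failed" for p in pods)
    if driver_failed or elapsed_s >= timeout_s:
        return placeholders
    return []


if __name__ == "__main__":
    pods = [
        Pod("spark-driver", "driver", "Failed"),
        Pod("ph-1", "placeholder", "Running"),
        Pod("ph-2", "placeholder", "Pending"),
    ]
    # The driver failed 5 seconds in; the timeout is still far away.
    print(placeholders_to_release(pods, elapsed_s=5, timeout_s=900))
```

The behavior reported in this issue corresponds to having only the timeout 
check. The sketch adds the driver-failure check in front of it, so the 
timeout remains as a fallback.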



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: dev-h...@yunikorn.apache.org