[jira] [Created] (SPARK-27347) [MESOS] Fix supervised driver retry logic when agent crashes/restarts

Sam Tran (JIRA) Tue, 02 Apr 2019 08:32:24 -0700

Sam Tran created SPARK-27347:
--------------------------------

             Summary: [MESOS] Fix supervised driver retry logic when agent 
crashes/restarts
                 Key: SPARK-27347
                 URL: https://issues.apache.org/jira/browse/SPARK-27347
             Project: Spark
          Issue Type: Bug
          Components: Mesos
    Affects Versions: 2.4.0, 2.3.2, 2.2.1
            Reporter: Sam Tran



We ran into scenarios where Spark jobs submitted with {{--supervise}} were 
retried multiple times after an agent crashed, came back, and re-registered, 
even though those jobs had already been relaunched on a different agent.

That is:
 * a supervised driver is running on agent1
 * agent1 crashes
 * the driver is relaunched on another agent as {{<task-id>-retry-1}}
 * agent1 comes back online and re-registers with the scheduler
 * Spark relaunches the same job as {{<task-id>-retry-2}}
 * two copies of the job are now running simultaneously, and the first retry 
is effectively orphaned within ZooKeeper

This happens because when an agent comes back and re-registers, it sends a 
{{TASK_FAILED}} status update for its old driver task. The previous logic would 
indiscriminately remove the {{submissionId}} from ZooKeeper's 
{{launchedDrivers}} node and add it to the {{retryList}} node.

Then, when a new offer came in, it would launch another {{-retry-}} task even 
though one was already running.
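
The sequence above can be replayed with a small toy model. This is only a sketch under stated assumptions, not Spark's scheduler code: {{ToyScheduler}}, its fields and the {{guard_stale_updates}} flag are hypothetical names, the agent crash is simplified to a {{TASK_FAILED}} update for the original task, and the guard shown (ignore a failure update whose task id is not the one currently tracked for the submission) is one possible way to avoid the duplicate, not necessarily the fix that was adopted.

```python
from dataclasses import dataclass, field


@dataclass
class ToyScheduler:
    """Toy stand-in for the dispatcher state; NOT Spark's actual classes."""
    guard_stale_updates: bool                              # hypothetical switch
    launched_drivers: dict = field(default_factory=dict)   # submission id -> current task id
    retry_list: list = field(default_factory=list)         # submission ids awaiting relaunch
    retries: dict = field(default_factory=dict)            # submission id -> retry count
    running_tasks: set = field(default_factory=set)        # task ids believed to be running

    def launch(self, submission_id):
        # First launch uses the bare id; relaunches get a -retry-N suffix.
        n = self.retries.get(submission_id, 0)
        task_id = submission_id if n == 0 else f"{submission_id}-retry-{n}"
        self.launched_drivers[submission_id] = task_id
        self.running_tasks.add(task_id)
        return task_id

    def status_update(self, submission_id, task_id, state):
        if state != "TASK_FAILED":
            return
        # Assumed guard: a failure report for a task that is no longer the
        # tracked one for this submission is stale and is ignored.
        if self.guard_stale_updates and self.launched_drivers.get(submission_id) != task_id:
            return
        # Behaviour described in the report: drop the submission from
        # launched_drivers and queue it for retry, whatever task failed.
        self.running_tasks.discard(task_id)
        self.launched_drivers.pop(submission_id, None)
        self.retries[submission_id] = self.retries.get(submission_id, 0) + 1
        self.retry_list.append(submission_id)

    def resource_offer(self):
        # Relaunch everything queued for retry when an offer arrives.
        launched = [self.launch(s) for s in self.retry_list]
        self.retry_list.clear()
        return launched


def replay(guard):
    s = ToyScheduler(guard_stale_updates=guard)
    first = s.launch("driver-1")
    s.status_update("driver-1", first, "TASK_FAILED")  # agent1 crashes (simplified)
    s.resource_offer()                                  # relaunched as driver-1-retry-1
    s.status_update("driver-1", first, "TASK_FAILED")  # agent1 re-registers, stale update
    s.resource_offer()                                  # next offer arrives
    return sorted(s.running_tasks)


if __name__ == "__main__":
    print("without guard:", replay(False))
    print("with guard:   ", replay(True))
```

Without the guard, the replay ends with both retry tasks running and only the second one tracked in {{launched_drivers}}, which mirrors the orphaned first retry described above. With the guard, the stale update is dropped and only the first retry remains.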



