Sam Tran created SPARK-27347:
--------------------------------

             Summary: [MESOS] Fix supervised driver retry logic when agent crashes/restarts
                 Key: SPARK-27347
                 URL: https://issues.apache.org/jira/browse/SPARK-27347
             Project: Spark
          Issue Type: Bug
          Components: Mesos
    Affects Versions: 2.4.0, 2.3.2, 2.2.1
            Reporter: Sam Tran

Ran into scenarios where {{--supervise}} Spark jobs were retried multiple times when an agent crashed, came back, and re-registered, even though those jobs had already been relaunched on a different agent.

That is:
* supervised driver is running on agent1
* agent1 crashes
* driver is relaunched on another agent as `<task-id>-retry-1`
* agent1 comes back online and re-registers with the scheduler
* Spark relaunches the same job as `<task-id>-retry-2`
* now there are two jobs running simultaneously, and the first retry job is effectively orphaned within ZooKeeper

This is because when an agent comes back and re-registers, it sends a {{TASK_FAILED}} status update for its old driver task. The previous logic would indiscriminately remove the {{submissionId}} from ZooKeeper's {{launchedDrivers}} node and add it to the {{retryList}} node. Then, when a new offer came in, it would launch another {{-retry-}} task even though one was already running.
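To make the failure mode concrete, here is a minimal Python sketch of one possible guard: a {{TASK_FAILED}} update is acted on only if it refers to the driver's most recent task ID. This is a toy model, not Spark's Mesos scheduler code and not necessarily the fix applied for this issue. The names `ToyScheduler`, `DriverState`, `on_task_failed`, `on_offer`, and `replay_scenario` are hypothetical, and in-memory collections stand in for the ZooKeeper nodes.

```python
from dataclasses import dataclass, field


@dataclass
class DriverState:
    """Hypothetical record for one supervised driver."""
    submission_id: str
    current_task_id: str  # task ID of the most recent launch
    retries: int = 0


@dataclass
class ToyScheduler:
    """In-memory stand-in for the launchedDrivers and retryList nodes."""
    launched_drivers: dict = field(default_factory=dict)  # submission_id -> DriverState
    retry_list: list = field(default_factory=list)

    def launch(self, submission_id):
        # The first launch uses the submission ID as the task ID (an assumption).
        state = DriverState(submission_id, submission_id)
        self.launched_drivers[submission_id] = state
        return state.current_task_id

    def on_task_failed(self, submission_id, task_id):
        """Queue a retry only if the failed task is the current one."""
        state = self.launched_drivers.get(submission_id)
        if state is None or task_id != state.current_task_id:
            # Stale update, e.g. from an agent that re-registered after
            # the driver was already relaunched elsewhere: ignore it.
            return False
        del self.launched_drivers[submission_id]
        self.retry_list.append(state)
        return True

    def on_offer(self):
        """Relaunch every queued driver; return the new task IDs."""
        launched = []
        while self.retry_list:
            state = self.retry_list.pop(0)
            state.retries += 1
            state.current_task_id = f"{state.submission_id}-retry-{state.retries}"
            self.launched_drivers[state.submission_id] = state
            launched.append(state.current_task_id)
        return launched


def replay_scenario():
    """Replay the sequence from the report; return all retry task IDs."""
    sched = ToyScheduler()
    original = sched.launch("driver-1")
    sched.on_task_failed("driver-1", original)  # agent1 crashes
    relaunched = sched.on_offer()               # relaunch on another agent
    sched.on_task_failed("driver-1", original)  # agent1 re-registers (stale)
    relaunched += sched.on_offer()              # nothing left to retry
    return relaunched
```

With the task-ID comparison in place, `replay_scenario()` launches only the `-retry-1` task. Dropping the comparison from `on_task_failed` reproduces the reported behavior: the stale update moves the driver back to the retry list and the next offer launches `-retry-2`.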