[ https://issues.apache.org/jira/browse/SPARK-27347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-27347:
------------------------------------

    Assignee: Apache Spark

> [MESOS] Fix supervised driver retry logic when agent crashes/restarts
> ---------------------------------------------------------------------
>
>                 Key: SPARK-27347
>                 URL: https://issues.apache.org/jira/browse/SPARK-27347
>             Project: Spark
>          Issue Type: Bug
>          Components: Mesos
>    Affects Versions: 2.2.1, 2.3.2, 2.4.0
>            Reporter: Sam Tran
>            Assignee: Apache Spark
>            Priority: Major
>              Labels: mesos
>
> Ran into scenarios where {{--supervised}} Spark jobs were retried multiple times when an agent would crash, come back, and re-register, even though those jobs had already been relaunched on a different agent.
> That is:
>  * the supervised driver is running on agent1
>  * agent1 crashes
>  * the driver is relaunched on another agent as {{<task-id>-retry-1}}
>  * agent1 comes back online and re-registers with the scheduler
>  * Spark relaunches the same job as {{<task-id>-retry-2}}
>  * now there are two drivers running simultaneously, and the first retry task is effectively orphaned within Zookeeper
> This happens because when an agent comes back and re-registers, it sends a {{TASK_FAILED}} status update for its old driver task. The previous logic would indiscriminately remove the {{submissionId}} from Zookeeper's {{launchedDrivers}} node and add it to the {{retryList}} node.
> Then, when a new offer came in, the scheduler would launch another {{-retry}} task even though one was already running (see the hedged sketch after the log excerpt below).
> A sample log looks like this:
> {code:java}
> 19/01/15 19:21:38 TRACE MesosClusterScheduler: Received offers from Mesos: ... [offers] ...
> 19/01/15 19:21:39 TRACE MesosClusterScheduler: Using offer 5d421001-0630-4214-9ecb-d5838a2ec149-O2532 to launch driver driver-20190115192138-0001 with taskId: value: "driver-20190115192138-0001"
> ...
> 19/01/15 19:21:42 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001 state=TASK_STARTING message=''
> 19/01/15 19:21:43 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001 state=TASK_RUNNING message=''
> ...
> 19/01/15 19:29:12 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001 state=TASK_LOST message='health check timed out' reason=REASON_SLAVE_REMOVED
> ...
> 19/01/15 19:31:12 TRACE MesosClusterScheduler: Using offer 5d421001-0630-4214-9ecb-d5838a2ec149-O2681 to launch driver driver-20190115192138-0001 with taskId: value: "driver-20190115192138-0001-retry-1"
> ...
> 19/01/15 19:31:15 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001-retry-1 state=TASK_STARTING message=''
> 19/01/15 19:31:16 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001-retry-1 state=TASK_RUNNING message=''
> ...
> 19/01/15 19:33:45 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001 state=TASK_FAILED message='Unreachable agent re-reregistered'
> ...
> 19/01/15 19:33:45 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001 state=TASK_FAILED message='Abnormal executor termination: unknown container' reason=REASON_EXECUTOR_TERMINATED
> 19/01/15 19:33:45 ERROR MesosClusterScheduler: Unable to find driver with driver-20190115192138-0001 in status update
> ...
> 19/01/15 19:33:47 TRACE MesosClusterScheduler: Using offer 5d421001-0630-4214-9ecb-d5838a2ec149-O2729 to launch driver driver-20190115192138-0001 with taskId: value: "driver-20190115192138-0001-retry-2"
> ...
> 19/01/15 19:33:50 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001-retry-2 state=TASK_STARTING message=''
> 19/01/15 19:33:51 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001-retry-2 state=TASK_RUNNING message=''{code}
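
For illustration, here is a minimal, self-contained sketch of the kind of guard described above: ignore a terminal status update whose task id does not match the task the scheduler most recently launched for that submission. The names ({{LaunchedState}}, {{retryTaskId}}, {{shouldRelaunch}}) are hypothetical stand-ins for the scheduler's internal state, not the actual MesosClusterScheduler API or the final patch.

{code:scala}
object SupervisedRetrySketch {

  // Hypothetical model of the scheduler's per-submission state: the task id
  // it most recently launched, e.g. "driver-...-0001", then "...-retry-1".
  final case class LaunchedState(submissionId: String, currentTaskId: String)

  // Task id for the Nth retry attempt: <submission-id>-retry-N.
  def retryTaskId(submissionId: String, retryCount: Int): String =
    s"$submissionId-retry-$retryCount"

  // Only treat a terminal status update as grounds for a retry when it refers
  // to the task we currently believe is running; a TASK_FAILED carrying a
  // stale task id (from the old agent re-registering) is ignored instead of
  // moving the submission back onto the retry list.
  def shouldRelaunch(state: LaunchedState, failedTaskId: String): Boolean =
    failedTaskId == state.currentTaskId

  def main(args: Array[String]): Unit = {
    val submission = "driver-20190115192138-0001"

    // agent1 crashes: TASK_LOST for the original task id triggers retry-1.
    var state = LaunchedState(submission, currentTaskId = submission)
    assert(shouldRelaunch(state, failedTaskId = submission))
    state = state.copy(currentTaskId = retryTaskId(submission, 1))

    // agent1 re-registers and reports TASK_FAILED for the *original* task id;
    // with the guard in place this stale update no longer enqueues retry-2.
    assert(!shouldRelaunch(state, failedTaskId = submission))
    println("stale TASK_FAILED ignored; no duplicate retry launched")
  }
}
{code}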