[ https://issues.apache.org/jira/browse/SPARK-32007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Suraj Sharma updated SPARK-32007:
---------------------------------
    Environment:
|Java Version|1.8.0_121 (Oracle Corporation)|
|Java Home|/usr/java/jdk1.8.0_121/jre|
|Scala Version|version 2.11.12|
|OS|Amazon Linux|
h4.

  was:
||Name||Value||
|Java Version|1.8.0_121 (Oracle Corporation)|
|Java Home|/usr/java/jdk1.8.0_121/jre|
|Scala Version|version 2.11.12|
|OS|Amazon Linux|
h4.

> Spark Driver Supervise does not work reliably
> ---------------------------------------------
>
>                 Key: SPARK-32007
>                 URL: https://issues.apache.org/jira/browse/SPARK-32007
>             Project: Spark
>          Issue Type: Question
>          Components: Spark Core
>    Affects Versions: 2.4.4
>         Environment: |Java Version|1.8.0_121 (Oracle Corporation)|
> |Java Home|/usr/java/jdk1.8.0_121/jre|
> |Scala Version|version 2.11.12|
> |OS|Amazon Linux|
> h4.
>            Reporter: Suraj Sharma
>            Priority: Critical
>
> I have a standalone cluster setup. I DO NOT have a streaming use case. I use
> AWS EC2 machines to run the Spark master and worker processes.
> *Problem*: If a Spark worker machine running some drivers and executors dies,
> then the drivers are not spawned again on other healthy machines.
> *Below are my findings:*
> ||Action/Behaviour||Executor||Driver||
> |Worker Machine Stop|Relaunches on an active machine|NO Relaunch|
> |kill -9 to process|Relaunches on other machines|Relaunches on other machines|
> |kill to process|Relaunches on other machines|Relaunches on other machines|
> *Cluster Setup:*
> # I have a Spark standalone cluster
> # {{spark.driver.supervise=true}}
> # Spark Master HA is enabled and is backed by ZooKeeper
> # Spark version = 2.4.4
> # I am using a systemd script for the Spark worker process
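
For anyone trying to reproduce the setup described in the report, the sketch below shows one way such a submission might be assembled. It is an illustration only, not the reporter's actual command: the report does not say how the driver was submitted, so cluster deploy mode is an assumption here, and the master hostnames, jar path and main class are hypothetical placeholders. The only parts taken from Spark itself are the {{spark-submit}} options {{--master}}, {{--deploy-mode}}, {{--supervise}} and {{--class}}, and the comma-separated master list used with ZooKeeper-backed HA.

```python
import shlex


def build_submit_command(masters, app_jar, main_class, supervise=True):
    """Return a spark-submit argv for a standalone cluster.

    masters    -- list of "host:port" strings; with ZooKeeper-backed HA,
                  all masters are listed in one spark:// URL
    app_jar    -- path to the application jar (hypothetical in this sketch)
    main_class -- entry-point class (hypothetical in this sketch)
    supervise  -- add --supervise, which asks the standalone master to
                  restart the driver if it exits with a non-zero code
    """
    master_url = "spark://" + ",".join(masters)
    argv = [
        "spark-submit",
        "--master", master_url,
        # Assumption: supervision applies to drivers launched inside the
        # cluster, so cluster deploy mode is used here.
        "--deploy-mode", "cluster",
    ]
    if supervise:
        argv.append("--supervise")
    argv += ["--class", main_class, app_jar]
    return argv


# Hypothetical hosts, class name and jar path, for illustration only.
cmd = build_submit_command(
    ["master1.example.com:7077", "master2.example.com:7077"],
    "/path/to/app.jar",
    "com.example.Main",
)
print(shlex.join(cmd))
```

Building the command as a list rather than a single string keeps quoting out of the picture until the final {{shlex.join}}, which makes it easier to compare a supervised and an unsupervised submission when testing the three failure cases in the findings table.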