If you're using multiple masters with ZooKeeper, then you should set your master URL to include both masters:

spark://host01:7077,host02:7077

and set the property spark.deploy.recoveryMode=ZOOKEEPER.

See here for more info: http://spark.apache.org/docs/latest/spark-standalone.html#standby-masters-with-zookeeper

From: James King
Date: Friday, May 8, 2015 at 11:22 AM
To: user
Subject: Submit Spark application in cluster mode and supervised

I have two hosts, host01 and host02 (let's call them). I run one Master and two Workers on host01, and I also run one Master and two Workers on host02. So I have one LIVE Master on host01 and a STANDBY Master on host02. The LIVE Master is aware of all Workers in the cluster.

Now I submit a Spark application using:

bin/spark-submit --class SomeApp --deploy-mode cluster --supervise --master spark://host01:7077 Some.jar

This is to make the driver resilient to failure.

Now the interesting part: if I stop the cluster (all daemons on all hosts) and restart the Master and Workers only on host01, the job resumes, as expected. But if I stop the cluster (all daemons on all hosts) and restart the Master and Workers only on host02, the job does not resume execution. Why? I can see the driver listed in the host02 WebUI, but there is no job execution.

Please let me know why. Am I wrong to expect it to resume execution in this case?
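To make the advice above concrete, here is a minimal sketch of the ZooKeeper HA setup. Note this assumes a running ZooKeeper ensemble; the quorum address (zk01:2181,...) and the /spark znode directory are hypothetical placeholders you would replace with your own:

```
# conf/spark-env.sh on EVERY Master host: enable ZooKeeper-based recovery
# so a standby Master can take over the persisted application/driver state.
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
  -Dspark.deploy.zookeeper.url=zk01:2181,zk02:2181,zk03:2181 \
  -Dspark.deploy.zookeeper.dir=/spark"

# Submit listing BOTH masters, so the client and the supervised driver
# can fail over to whichever Master is currently the leader.
bin/spark-submit --class SomeApp --deploy-mode cluster --supervise \
  --master spark://host01:7077,host02:7077 Some.jar
```

With only --master spark://host01:7077, the submission is registered against host01 alone, which matches the behaviour described below: state recovers when host01 comes back, but not when only host02 is restarted.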