I have two hosts, let's call them host01 and host02. On host01 I run one Master and two Workers; on host02 I also run one Master and two Workers.
So I have one LIVE Master on host01 and a STANDBY Master on host02, and the LIVE Master is aware of all Workers in the cluster. I submit a Spark application in cluster deploy mode with supervision, so that the driver is resilient to failure:

bin/spark-submit --class SomeApp --deploy-mode cluster --supervise --master spark://host01:7077 Some.jar

Now the interesting part. If I stop the cluster (all daemons on all hosts) and then restart the Master and Workers *only* on host01, the job resumes, as expected. But if I stop the cluster (all daemons on all hosts) and then restart the Master and Workers *only* on host02, the job *does not* resume execution. Why? I can see the driver listed in the host02 Web UI, but no job executes. Am I wrong to expect it to resume execution in this case?
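For context, here is roughly how such a two-Master standalone cluster is brought up. This is a sketch under assumptions: the question does not say how recovery is configured, but a LIVE/STANDBY Master pair implies some `spark.deploy.recoveryMode` is set, and the ZooKeeper address `zk01:2181` below is hypothetical.

```shell
# conf/spark-env.sh on BOTH hosts -- ZooKeeper-based master recovery
# (assumed; required for LIVE/STANDBY failover in standalone mode):
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
  -Dspark.deploy.zookeeper.url=zk01:2181"

# On host01: start the Master, then each Worker pointed at BOTH masters
# so Workers can re-register with whichever Master becomes LIVE.
sbin/start-master.sh
sbin/start-slave.sh spark://host01:7077,host02:7077   # run once per Worker

# On host02: same pattern.
sbin/start-master.sh
sbin/start-slave.sh spark://host01:7077,host02:7077
```

Note that for failover to work, both clients and Workers should be given the comma-separated list of all Master URLs, not just `spark://host01:7077`.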