I have two hosts; let's call them host01 and host02.

I run one Master and two Workers on host01
I also run one Master and two Workers on host02

Now I have one LIVE Master on host01 and a STANDBY Master on host02.
The LIVE Master is aware of all Workers in the cluster.
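(For context, a standalone Master HA pair of this kind is typically configured in spark-env.sh along these lines. This is just a sketch assuming ZooKeeper-based recovery; the quorum address zk01:2181 is a placeholder, not my actual ensemble:)

# spark-env.sh on host01 and host02 (assumed HA configuration)
# ZooKeeper-based recovery, so the STANDBY Master can take over on failure
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
  -Dspark.deploy.zookeeper.url=zk01:2181 \
  -Dspark.deploy.zookeeper.dir=/spark"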

Now I submit a Spark application using

bin/spark-submit --class SomeApp --deploy-mode cluster --supervise \
  --master spark://host01:7077 Some.jar

The --supervise flag is there to make the driver resilient to failure.
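(As an aside, my understanding is that with an HA pair the master URL can also list both Masters so the client can fail over between them. A sketch, assuming both Masters listen on the default port 7077:)

bin/spark-submit --class SomeApp --deploy-mode cluster --supervise \
  --master spark://host01:7077,host02:7077 Some.jar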

Now the interesting part:

If I stop the cluster (all daemons on all hosts) and restart the Master
and Workers *only* on host01, the job resumes, as expected.

But if I stop the cluster (all daemons on all hosts) and restart the Master
and Workers *only* on host02, the job does *not* resume execution. Why?

I can see the driver listed in the host02 Web UI, but there is no job
execution. Please let me know why.
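(For reference, by "restart the Master and Workers only on host02" I mean roughly the following. A sketch; the worker script is named start-slave.sh in older Spark releases and start-worker.sh in newer ones:)

# on host02, after stopping all daemons on both hosts
sbin/start-master.sh
sbin/start-slave.sh spark://host01:7077,host02:7077   # run once per Worker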

Am I wrong to expect it to resume execution in this case?
