Many thanks, Silvio,

What I found out later is that if there is a catastrophic failure and all
the daemons fail at the same time, before any fail-over can take place,
then when you bring the cluster back up the job resumes only on the Master
it was last running on before the failure.

Otherwise, during a partial failure, normal fail-over takes place and the
driver is handed over to another Master.

That answers my initial question.
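
For the record, here is a rough sketch of the setup Silvio describes
below; the ZooKeeper address (zk01:2181) is just a placeholder for
whatever quorum you actually run:

  # conf/spark-env.sh on both masters: enable ZooKeeper recovery
  SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER"
  SPARK_DAEMON_JAVA_OPTS="$SPARK_DAEMON_JAVA_OPTS -Dspark.deploy.zookeeper.url=zk01:2181"
  export SPARK_DAEMON_JAVA_OPTS

  # submit against both masters so the driver can follow a fail-over
  bin/spark-submit --class SomeApp --deploy-mode cluster --supervise \
    --master spark://host01:7077,host02:7077 Some.jar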

Regards
jk

On Fri, May 8, 2015 at 7:34 PM, Silvio Fiorito <
silvio.fior...@granturing.com> wrote:

>   If you’re using multiple masters with ZooKeeper then you should set
> your master URL to be
>
>  spark://host01:7077,host02:7077
>
>  And the property spark.deploy.recoveryMode=ZOOKEEPER
>
>  See here for more info:
> http://spark.apache.org/docs/latest/spark-standalone.html#standby-masters-with-zookeeper
>
>   From: James King
> Date: Friday, May 8, 2015 at 11:22 AM
> To: user
> Subject: Submit Spark application in cluster mode and supervised
>
>   I have two hosts, let's call them host01 and host02.
>
>  I run one Master and two Workers on host01
> I also run one Master and two Workers on host02
>
>  Now I have 1 LIVE Master on host01 and a STANDBY Master on host02
> The LIVE Master is aware of all Workers in the cluster
>
>  Now I submit a Spark application using
>
>  bin/spark-submit --class SomeApp --deploy-mode cluster --supervise
> --master spark://host01:7077 Some.jar
>
>  This is to make the driver resilient to failure.
>
>  Now the interesting part:
>
>  If I stop the cluster (all daemons on all hosts) and restart
> the Master and Workers *only* on host01, the job resumes, as expected.
>
>  But if I stop the cluster (all daemons on all hosts) and restart the
> Master and Workers *only* on host02, the job *does not* resume
> execution. Why?
>
>  I can see the driver listed in the host02 WebUI, but there is no job
> execution. Please let me know why.
>
>  Am I wrong to expect it to resume execution in this case?
>
