Thanks Aaron, this is useful! - Manoj
On Mon, Apr 14, 2014 at 8:12 PM, Aaron Davidson <ilike...@gmail.com> wrote:

> Launching drivers inside the cluster was a feature added in 0.9, for
> standalone cluster mode:
> http://spark.apache.org/docs/latest/spark-standalone.html#launching-applications-inside-the-cluster
>
> Note the "supervise" flag, which will cause the driver to be restarted if
> it fails. This is a rather low-level mechanism which by default will just
> cause the whole job to rerun from the beginning. Special recovery would
> have to be implemented by hand, via some sort of state checkpointing into a
> globally visible storage system (e.g., HDFS), which, for example, Spark
> Streaming already does.
>
> Currently, this feature is not supported in YARN or Mesos fine-grained
> mode.
>
>
> On Mon, Apr 14, 2014 at 2:08 PM, Manoj Samel <manojsamelt...@gmail.com> wrote:
>
>> Could you please elaborate how drivers can be restarted automatically?
>>
>> Thanks,
>>
>>
>> On Mon, Apr 14, 2014 at 10:30 AM, Aaron Davidson <ilike...@gmail.com> wrote:
>>
>>> Master and slave are somewhat overloaded terms in the Spark ecosystem
>>> (see the glossary:
>>> http://spark.apache.org/docs/latest/cluster-overview.html#glossary).
>>> Are you actually asking about the Spark "driver" and "executors", or the
>>> standalone cluster "master" and "workers"?
>>>
>>> To briefly answer for either possibility:
>>> (1) Drivers are not fault tolerant but can be restarted automatically.
>>> Executors may be removed at any point without failing the job (though
>>> losing an Executor may slow the job significantly), and Executors may be
>>> added at any point and will be used immediately.
>>> (2) Standalone cluster Masters are fault tolerant; a failure will only
>>> temporarily stall new jobs from starting or getting new resources, and
>>> does not affect currently-running jobs. Workers can fail and will simply
>>> cause jobs to lose their current Executors. New Workers can be added at
>>> any point.
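[For context, a sketch of the 0.9-era launch syntax from the standalone-mode docs Aaron links above; the cluster URL, jar path, and main class below are placeholders, not values from this thread:]

```shell
# Launch the driver inside the standalone cluster. The --supervise flag
# asks the standalone Master to restart the driver if it exits abnormally,
# which by default reruns the whole job from the beginning.
# (spark:// URL, jar location, and class name are hypothetical.)
./bin/spark-class org.apache.spark.deploy.Client launch \
    --supervise \
    spark://master-host:7077 \
    hdfs://namenode:8020/user/me/my-app.jar \
    com.example.MyApp
```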
>>>
>>>
>>> On Mon, Apr 14, 2014 at 11:00 AM, Ian Ferreira <ianferre...@hotmail.com> wrote:
>>>
>>>> Folks,
>>>>
>>>> I was wondering what the failure support modes were for Spark while
>>>> running jobs:
>>>>
>>>> 1. What happens when a master fails?
>>>> 2. What happens when a slave fails?
>>>> 3. Can you add and remove slaves mid-job?
>>>>
>>>> Regarding the install on Mesos, if I understand correctly the Spark
>>>> master is behind a Zookeeper quorum, so that isolates the slaves from a
>>>> master failure, but what about the masters behind the quorum?
>>>>
>>>> Cheers,
>>>> - Ian
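[To illustrate the checkpoint-based recovery Aaron describes: Spark Streaming rebuilds its driver-side state from a checkpoint directory on restart rather than rerunning from scratch. A minimal Scala sketch, in which the HDFS path, app name, and batch interval are assumptions; it needs a real Spark deployment to run:]

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointedApp {
  // Globally visible storage for checkpoints (hypothetical HDFS path)
  val checkpointDir = "hdfs://namenode:8020/user/me/checkpoints"

  // Builds a fresh context; only invoked when no checkpoint exists yet.
  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("CheckpointedApp")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint(checkpointDir) // periodically persist state to HDFS
    // ... define the streaming computation here ...
    ssc
  }

  def main(args: Array[String]): Unit = {
    // On a supervised restart, recover the context (and its state)
    // from the checkpoint instead of starting from the beginning.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
```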