Thanks Aaron, this is useful! - Manoj
On Mon, Apr 14, 2014 at 8:12 PM, Aaron Davidson <ilike...@gmail.com> wrote:

> Launching drivers inside the cluster was a feature added in 0.9, for
> standalone cluster mode:
> http://spark.apache.org/docs/latest/spark-standalone.html#launching-applications-inside-the-cluster
>
> Note the "supervise" flag, which will cause the driver to be restarted if
> it fails. This is a rather low-level mechanism which by default will just
> cause the whole job to rerun from the beginning. Special recovery would
> have to be implemented by hand, via some sort of state checkpointing into a
> globally visible storage system (e.g., HDFS), which, for example, Spark
> Streaming already does.
>
> Currently, this feature is not supported in YARN or Mesos fine-grained
> mode.
>
>
> On Mon, Apr 14, 2014 at 2:08 PM, Manoj Samel <manojsamelt...@gmail.com> wrote:
>
>> Could you please elaborate how drivers can be restarted automatically?
>>
>> Thanks,
>>
>>
>> On Mon, Apr 14, 2014 at 10:30 AM, Aaron Davidson <ilike...@gmail.com> wrote:
>>
>>> Master and slave are somewhat overloaded terms in the Spark ecosystem
>>> (see the glossary:
>>> http://spark.apache.org/docs/latest/cluster-overview.html#glossary).
>>> Are you actually asking about the Spark "driver" and "executors", or the
>>> standalone cluster "master" and "workers"?
>>>
>>> To briefly answer for either possibility:
>>> (1) Drivers are not fault tolerant but can be restarted automatically.
>>> Executors may be removed at any point without failing the job (though
>>> losing an Executor may slow the job significantly), and Executors may be
>>> added at any point and will be used immediately.
>>> (2) Standalone cluster Masters are fault tolerant; a failure will only
>>> temporarily stall new jobs from starting or getting new resources, and
>>> does not affect currently-running jobs. Workers can fail and will simply
>>> cause jobs to lose their current Executors. New Workers can be added at
>>> any point.
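[For context, a sketch of the 0.9-era launch syntax from the standalone-mode docs Aaron links above; the cluster URL, jar path, and main class below are placeholders, not values from this thread:]

```shell
# Launch the driver inside the standalone cluster. The --supervise flag
# asks the standalone Master to restart the driver if it exits abnormally,
# which by default reruns the whole job from the beginning.
# (spark:// URL, jar location, and class name are hypothetical.)
./bin/spark-class org.apache.spark.deploy.Client launch \
    --supervise \
    spark://master-host:7077 \
    hdfs://namenode:8020/user/me/my-app.jar \
    com.example.MyApp
```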
>>>
>>>
>>> On Mon, Apr 14, 2014 at 11:00 AM, Ian Ferreira <ianferre...@hotmail.com> wrote:
>>>
>>>> Folks,
>>>>
>>>> I was wondering what the failure support modes were for Spark while
>>>> running jobs:
>>>>
>>>> 1. What happens when a master fails?
>>>> 2. What happens when a slave fails?
>>>> 3. Can you add and remove slaves mid-job?
>>>>
>>>> Regarding the install on Mesos, if I understand correctly the Spark
>>>> master is behind a Zookeeper quorum, so that isolates the slaves from a
>>>> master failure, but what about the masters behind the quorum?
>>>>
>>>> Cheers,
>>>> - Ian
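[To illustrate the checkpoint-based recovery Aaron describes: Spark Streaming rebuilds its driver-side state from a checkpoint directory on restart rather than rerunning from scratch. A minimal Scala sketch, in which the HDFS path, app name, and batch interval are assumptions; it needs a real Spark deployment to run:]

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointedApp {
  // Globally visible storage for checkpoints (hypothetical HDFS path)
  val checkpointDir = "hdfs://namenode:8020/user/me/checkpoints"

  // Builds a fresh context; only invoked when no checkpoint exists yet.
  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("CheckpointedApp")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint(checkpointDir) // periodically persist state to HDFS
    // ... define the streaming computation here ...
    ssc
  }

  def main(args: Array[String]): Unit = {
    // On a supervised restart, recover the context (and its state)
    // from the checkpoint instead of starting from the beginning.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
```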