Thanks. This is getting a bit confusing. I have these modes for using Spark:
1. Spark local. Everything runs on the same host --> --master local[n]. No need to start master and slaves; it uses resources as you submit the job.

2. Spark Standalone. Uses a simple cluster manager included with Spark that makes it easy to set up a cluster --> --master spark://<HOSTNAME>:7077. Can run on different hosts. Does not rely on Yarn; it looks after scheduling itself. Need to start master and slaves.

3. Spark on YARN. The doc says:

There are two deploy modes that can be used to launch Spark applications *on YARN*. *In cluster mode*, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. *In client mode*, the driver runs in the client process, and the application master is only used for requesting resources from YARN.

Unlike Spark standalone <http://spark.apache.org/docs/latest/spark-standalone.html> and Mesos <http://spark.apache.org/docs/latest/running-on-mesos.html> modes, in which the master's address is specified in the --master parameter, in YARN mode the ResourceManager's address is picked up from the Hadoop configuration. Thus, the --master parameter is yarn.

So either we have --> --master yarn --deploy-mode cluster, OR --> --master yarn-client (i.e. --master yarn --deploy-mode client). A side-by-side sketch of these invocations is at the end of this thread.

So I am not sure running Spark on Yarn in either yarn-client or yarn-cluster mode is going to make much difference. It sounds like yarn-cluster supersedes yarn-client?

Any comments welcome.

Dr Mich Talebzadeh

LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

On 7 June 2016 at 15:40, Sebastian Piu <sebastian....@gmail.com> wrote:

> If you run that job then the driver will ALWAYS run in the machine from
> where you are issuing the spark-submit command (e.g. some edge node with
> the clients installed), no matter where the resource manager is running.
>
> If you change yarn-client for yarn-cluster then your driver will start
> somewhere else in the cluster, as will the workers, and the spark-submit
> command will return before the program finishes.
>
> On Tue, 7 Jun 2016, 14:53 Jacek Laskowski, <ja...@japila.pl> wrote:
>
>> Hi,
>>
>> --master yarn-client is deprecated and you should use --master yarn
>> --deploy-mode client instead. There are two deploy modes: client
>> (default) and cluster. See
>> http://spark.apache.org/docs/latest/cluster-overview.html.
>>
>> Regards,
>> Jacek Laskowski
>> ----
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark http://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>>
>> On Tue, Jun 7, 2016 at 2:50 PM, Mich Talebzadeh
>> <mich.talebza...@gmail.com> wrote:
>> > ok, thanks.
>> >
>> > So I start SparkSubmit or a similar Spark app on the Yarn resource
>> > manager node.
>> >
>> > What you are stating is that Yarn may decide to start the driver
>> > program in another node as opposed to the resource manager node
>> >
>> > ${SPARK_HOME}/bin/spark-submit \
>> >     --driver-memory=4G \
>> >     --num-executors=5 \
>> >     --executor-memory=4G \
>> >     --master yarn-client \
>> >     --executor-cores=4 \
>> >
>> > due to lack of resources in the resource manager node? What is the
>> > likelihood of that? The resource manager node is the de facto master
>> > node, in all probability much more powerful than the other nodes. Also,
>> > the node running the resource manager is running one of the node
>> > managers as well.
>> > So in theory maybe, in practice maybe not?
>> >
>> > HTH
>> >
>> > Dr Mich Talebzadeh
>> >
>> > LinkedIn
>> > https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> >
>> > http://talebzadehmich.wordpress.com
>> >
>> > On 7 June 2016 at 13:20, Sebastian Piu <sebastian....@gmail.com> wrote:
>> >>
>> >> What you are explaining is right for yarn-client mode, but the question
>> >> is about yarn-cluster, in which case the spark driver is also submitted
>> >> and run in one of the node managers.
>> >>
>> >> On Tue, 7 Jun 2016, 13:45 Mich Talebzadeh, <mich.talebza...@gmail.com>
>> >> wrote:
>> >>>
>> >>> Can you elaborate on the above statement please?
>> >>>
>> >>> When you start Yarn you start the resource manager daemon only on the
>> >>> resource manager node:
>> >>>
>> >>> yarn-daemon.sh start resourcemanager
>> >>>
>> >>> Then you start the nodemanager daemons on all nodes:
>> >>>
>> >>> yarn-daemon.sh start nodemanager
>> >>>
>> >>> A Spark app has to start somewhere. That is SparkSubmit, and that is
>> >>> deterministic. I start SparkSubmit, which talks to the Yarn Resource
>> >>> Manager, which initialises and registers an Application Master. The
>> >>> crucial point is the Yarn Resource Manager, which is basically a
>> >>> resource scheduler. It optimizes for cluster resource utilization to
>> >>> keep all resources in use all the time. However, the resource manager
>> >>> itself is on the resource manager node.
>> >>>
>> >>> Now I always start my Spark app on the same node as the resource
>> >>> manager node and let Yarn take care of the rest.
>> >>>
>> >>> Thanks
>> >>>
>> >>> Dr Mich Talebzadeh
>> >>>
>> >>> LinkedIn
>> >>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> >>>
>> >>> http://talebzadehmich.wordpress.com
>> >>>
>> >>> On 7 June 2016 at 12:17, Jacek Laskowski <ja...@japila.pl> wrote:
>> >>>>
>> >>>> Hi,
>> >>>>
>> >>>> It's not possible. YARN uses CPU and memory for resource constraints
>> >>>> and places the AM on any node available. Same for executors (unless
>> >>>> data locality constrains the placement).
>> >>>>
>> >>>> Jacek
>> >>>>
>> >>>> On 6 Jun 2016 1:54 a.m., "Saiph Kappa" <saiph.ka...@gmail.com> wrote:
>> >>>>>
>> >>>>> Hi,
>> >>>>>
>> >>>>> In yarn-cluster mode, is there any way to specify on which node I
>> >>>>> want the driver to run?
>> >>>>>
>> >>>>> Thanks.
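To make the modes discussed above concrete, here is a minimal sketch of the corresponding spark-submit invocations. The application jar (app.jar), main class (com.example.MyApp) and core counts are placeholders for illustration only, not anything taken from this thread:

# Local mode: driver and executors run in a single JVM on the submitting host
${SPARK_HOME}/bin/spark-submit \
  --master local[4] \
  --class com.example.MyApp \
  app.jar

# Standalone mode: the master's address is given explicitly on the command line
${SPARK_HOME}/bin/spark-submit \
  --master spark://<HOSTNAME>:7077 \
  --class com.example.MyApp \
  app.jar

# YARN client mode (replaces the deprecated --master yarn-client):
# the driver runs inside the spark-submit process on the submitting host
${SPARK_HOME}/bin/spark-submit \
  --master yarn \
  --deploy-mode client \
  --class com.example.MyApp \
  app.jar

# YARN cluster mode (replaces the deprecated --master yarn-cluster):
# the driver runs inside the YARN application master on a node chosen by
# YARN, and spark-submit can return before the application finishes
${SPARK_HOME}/bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MyApp \
  app.jar

In both YARN variants the ResourceManager's address is not passed via --master at all; it is picked up from the Hadoop client configuration (HADOOP_CONF_DIR / YARN_CONF_DIR), which is why the value is simply yarn.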