DB Tsai, if in yarn-cluster mode the driver runs inside yarn, how can you do an rdd.collect and bring the results back to your application?
On Thu, Jun 19, 2014 at 2:33 PM, DB Tsai <dbt...@stanford.edu> wrote:
> We are submitting the spark job in our tomcat application using
> yarn-cluster mode with great success. As Kevin said, yarn-client mode
> runs the driver in your local JVM, and it will have really bad network
> overhead when one does a reduce action, which will pull all the results
> from the executors to your local JVM. Also, since you can only have one
> spark context object in one JVM, it will be tricky to run multiple spark
> jobs concurrently with yarn-client mode.
>
> This is how we submit a spark job with yarn-cluster mode. Please use the
> current master code; otherwise, after the job is finished, spark will
> kill the JVM and exit your app.
>
> We set up the configuration of spark in a scala map.
>
> def getArgsFromConf(conf: Map[String, String]): Array[String] = {
>   Array[String](
>     "--jar", conf.get("app.jar").getOrElse(""),
>     "--addJars", conf.get("spark.addJars").getOrElse(""),
>     "--class", conf.get("spark.mainClass").getOrElse(""),
>     "--num-executors", conf.get("spark.numWorkers").getOrElse("1"),
>     "--driver-memory", conf.get("spark.masterMemory").getOrElse("1g"),
>     "--executor-memory", conf.get("spark.workerMemory").getOrElse("1g"),
>     "--executor-cores", conf.get("spark.workerCores").getOrElse("1"))
> }
>
> System.setProperty("SPARK_YARN_MODE", "true")
> val sparkConf = new SparkConf
> val args = getArgsFromConf(conf)
> new Client(new ClientArguments(args, sparkConf), hadoopConfig,
>   sparkConf).run
>
> Sincerely,
>
> DB Tsai
> -------------------------------------------------------
> My Blog: https://www.dbtsai.com
> LinkedIn: https://www.linkedin.com/in/dbtsai
>
>
> On Thu, Jun 19, 2014 at 11:22 AM, Kevin Markey <kevin.mar...@oracle.com> wrote:
> > Yarn client is much like Spark client mode, except that the executors
> > are running in Yarn containers managed by the Yarn resource manager on
> > the cluster instead of as Spark workers managed by the Spark master.
> > The driver executes as a local client in your local JVM. It
> > communicates with the workers on the cluster. Transformations are
> > scheduled on the cluster by the driver's logic. Actions involve
> > communication between the local driver and remote cluster executors.
> > So, there is some additional network overhead, especially if the
> > driver is not co-located on the cluster. In yarn-cluster mode, in
> > contrast, the driver is executed as a thread in a Yarn application
> > master on the cluster.
> >
> > In either case, the assembly JAR must be available to the application
> > on the cluster. Best to copy it to HDFS and specify its location by
> > exporting it as SPARK_JAR.
> >
> > Kevin Markey
> >
> >
> > On 06/19/2014 11:22 AM, Koert Kuipers wrote:
> >
> > I am trying to understand how yarn-client mode works. I am not using
> > spark-submit, but instead launching a spark job from within my own
> > application.
> >
> > I can see my application contacting yarn successfully, but then in
> > yarn I get an immediate error:
> >
> > Application application_1403117970283_0014 failed 2 times due to AM
> > Container for appattempt_1403117970283_0014_000002 exited with
> > exitCode: -1000 due to: File
> > file:/home/koert/test-assembly-0.1-SNAPSHOT.jar does not exist
> > .Failing this attempt.. Failing the application.
> >
> > Why is yarn trying to fetch my jar, and why as a local file? I would
> > expect the jar to be sent to yarn over the wire upon job submission.
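
For anyone trying to assemble this, here is DB Tsai's snippet fleshed out into a self-contained sketch. The imports and the locally constructed Hadoop Configuration are assumptions based on Spark's stable YARN client classes around the 1.0 timeframe; adjust them to your build.

import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkConf
import org.apache.spark.deploy.yarn.{Client, ClientArguments}

object YarnClusterSubmit {

  // Translate an application-level config map into the argument vector
  // that ClientArguments expects, falling back to modest defaults.
  def getArgsFromConf(conf: Map[String, String]): Array[String] =
    Array(
      "--jar", conf.getOrElse("app.jar", ""),
      "--addJars", conf.getOrElse("spark.addJars", ""),
      "--class", conf.getOrElse("spark.mainClass", ""),
      "--num-executors", conf.getOrElse("spark.numWorkers", "1"),
      "--driver-memory", conf.getOrElse("spark.masterMemory", "1g"),
      "--executor-memory", conf.getOrElse("spark.workerMemory", "1g"),
      "--executor-cores", conf.getOrElse("spark.workerCores", "1"))

  def submit(conf: Map[String, String]): Unit = {
    System.setProperty("SPARK_YARN_MODE", "true") // tell SparkConf we are running against YARN
    val sparkConf = new SparkConf
    val hadoopConfig = new Configuration()        // picks up core-site.xml / yarn-site.xml from the classpath
    val args = getArgsFromConf(conf)
    new Client(new ClientArguments(args, sparkConf), hadoopConfig, sparkConf).run()
  }
}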
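
And a sketch of Kevin's SPARK_JAR point, which would also explain Koert's error: every jar path handed to the YARN client must be readable by the NodeManagers, so a file: path that exists only on the submitting machine cannot be localized. The HDFS paths and the main class below are placeholders, and YarnClusterSubmit is the hypothetical helper from the sketch above.

// One-time setup outside the JVM (paths are placeholders):
//   hadoop fs -put spark-assembly.jar hdfs:///apps/spark/spark-assembly.jar
//   hadoop fs -put test-assembly-0.1-SNAPSHOT.jar hdfs:///apps/myapp/
// and export the assembly location before starting tomcat, since the
// YARN client reads it from the environment:
//   export SPARK_JAR=hdfs:///apps/spark/spark-assembly.jar

// Then hand the YARN client HDFS URIs rather than local file paths:
val conf = Map(
  "app.jar"         -> "hdfs:///apps/myapp/test-assembly-0.1-SNAPSHOT.jar", // placeholder path
  "spark.mainClass" -> "com.mycompany.MyJob"                                // hypothetical class name
)
YarnClusterSubmit.submit(conf)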