Hello Spark experts,
      I have tried reading the Spark documentation and searched many posts in
this forum, but I couldn't find a satisfactory answer to my question. I have
recently started using Spark, so I may be missing something, and that's why I
am looking for your guidance here.

I have a situation where I am running a web application in Jetty using Spring
Boot. My web application receives a REST web service request, and based on that
it needs to trigger a Spark calculation job on a YARN cluster. Since my job can
take a long time to run and accesses data in HDFS, I want to run the Spark job
in yarn-cluster mode, and I don't want to keep a SparkContext alive in my web
layer. Another reason is that my application is multi-tenant, so each tenant can
run its own job; in yarn-cluster mode each tenant's job can start its own driver
and run as its own Spark application. In the web app JVM, I assume I can't run
multiple SparkContexts in one JVM.

I want to trigger Spark jobs in yarn-cluster mode programmatically, from Java
code in my web application. What is the best way to achieve this? I am
exploring the following options and would like your guidance on which one is best.

1. I can use the *org.apache.spark.deploy.yarn.Client* class and its
/submitApplication()/ method (a rough sketch is below). But I assume this class
is not a public API and can change between Spark releases. I also noticed that
in Spark 1.2 this class was made private to the spark package, whereas in 1.1
it was public. So if I use this approach, I risk breaking my code when I
upgrade Spark.
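
For reference, here is roughly what calling that class looks like from Java.
This is an untested sketch based on the 1.1-era API where the class was still
public; the constructor and argument flags are from memory and may differ
between releases, and the jar path, main class and job argument are
placeholders of mine:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.spark.SparkConf;
    import org.apache.spark.deploy.yarn.Client;
    import org.apache.spark.deploy.yarn.ClientArguments;

    public class YarnClientLauncher {

        public static void main(String[] cmdLine) {
            // Placeholders: replace the jar, main class and argument with your own.
            String[] args = new String[] {
                    "--name", "tenant-42-calculation",
                    "--jar", "hdfs:///apps/my-spark-job.jar",
                    "--class", "com.example.MySparkJob",
                    "--arg", "tenant-42"
            };

            SparkConf sparkConf = new SparkConf();
            Configuration hadoopConf = new Configuration(); // expects yarn-site.xml on the classpath

            // run() submits the application and follows its progress; 1.2 added
            // submitApplication(), which returns the YARN ApplicationId without waiting.
            Client client = new Client(new ClientArguments(args, sparkConf), hadoopConf, sparkConf);
            client.run();
        }
    }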

2. I can use the *spark-submit* command-line script to submit my jobs. But to
trigger it from my web application I would need to use either the Java
ProcessBuilder API or some library built on top of it (see the sketch after
this list). This has two issues. First, it doesn't sound like a clean way of
doing it; I should have a programmatic way of triggering my Spark applications
on YARN, and if the YARN API allows it, why don't we have this in Spark?
Second, I would lose the ability to monitor the submitted application and get
its status. The only crude way of doing that is reading the output stream of
the spark-submit shell, which again doesn't sound like a good approach.
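
To make option 2 concrete, here is a minimal ProcessBuilder sketch. The Spark
install path, application jar, main class, resource settings and log path are
all placeholders I made up for illustration:

    import java.io.File;

    public class SparkSubmitLauncher {

        public static void main(String[] args) throws Exception {
            ProcessBuilder pb = new ProcessBuilder(
                    "/opt/spark/bin/spark-submit",
                    "--master", "yarn-cluster",
                    "--class", "com.example.MySparkJob",
                    "--num-executors", "4",
                    "--executor-memory", "2g",
                    "/opt/apps/my-spark-job.jar",
                    "tenant-42");
            pb.redirectErrorStream(true);                              // merge stderr into stdout
            pb.redirectOutput(new File("/tmp/tenant-42-submit.log"));  // keep the output for debugging
            Process process = pb.start();
            int exitCode = process.waitFor();                          // blocks until spark-submit exits
            System.out.println("spark-submit exited with code " + exitCode);
        }
    }

On the monitoring concern: whichever way the job is submitted, Hadoop's
YarnClient API can report an application's state given its ApplicationId, so I
would not have to scrape spark-submit's output. A rough sketch, assuming the
Hadoop 2.x client libraries are on the classpath (the application id string is
a placeholder):

    import org.apache.hadoop.yarn.api.records.ApplicationId;
    import org.apache.hadoop.yarn.api.records.ApplicationReport;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;
    import org.apache.hadoop.yarn.util.ConverterUtils;

    public class YarnStatusCheck {

        public static void main(String[] args) throws Exception {
            YarnClient yarnClient = YarnClient.createYarnClient();
            yarnClient.init(new YarnConfiguration());   // picks up yarn-site.xml from the classpath
            yarnClient.start();

            // Placeholder id; in practice it would be captured when the job is submitted.
            ApplicationId appId = ConverterUtils.toApplicationId("application_1425000000000_0001");
            ApplicationReport report = yarnClient.getApplicationReport(appId);
            System.out.println(report.getYarnApplicationState()
                    + " / " + report.getFinalApplicationStatus());

            yarnClient.stop();
        }
    }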

Please suggest the best way of doing this with the latest version of
Spark (1.2.1). I later plan to deploy this entire application on Amazon
EMR, so the approach should work there as well.

Thanks in advance 


