Hello Spark experts, I have read the Spark documentation and searched many posts in this forum, but I couldn't find a satisfactory answer to my question. I recently started using Spark, so I may be missing something; that's why I'm looking for your guidance here.
I am running a web application in Jetty using Spring Boot. The application receives a REST web service request and, based on it, needs to trigger a Spark calculation job on a YARN cluster. Since the job can take a long time to run and accesses data from HDFS, I want to run it in yarn-cluster mode, and I don't want to keep a SparkContext alive in my web layer. Another reason is that my application is multi-tenant: each tenant can run its own job, and in yarn-cluster mode each tenant's job starts its own driver and runs in its own Spark cluster. In the web app JVM, I assume I can't run multiple SparkContexts in one JVM.

I want to trigger Spark jobs in yarn-cluster mode programmatically, from Java code in my web application. What is the best way to achieve this? I am exploring various options and am looking for your guidance on which one is best:

1. I can use the *org.apache.spark.deploy.yarn.Client* class and its /submitApplication()/ method. But I assume this class is not a public API and can change between Spark releases. I also noticed that this class was made private to the spark package in Spark 1.2; in version 1.1 it was public. So if I use this method, I risk breaking my code when I upgrade Spark.

2. I can use the *spark-submit* command-line shell to submit my jobs. But to trigger it from my web application I would need to use either the Java ProcessBuilder API or some package built on it. This has two issues. First, it doesn't feel like a clean way of doing it; I should have a programmatic way of triggering my Spark applications on YARN, and if the YARN API allows it, why don't we have this in Spark? Second, I will lose the ability to monitor the submitted application and get its status. The only crude way of doing that is reading the output stream of the spark-submit shell, which again doesn't sound like a good approach.

Please suggest the best way of doing this with the latest version of Spark (1.2.1).
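To make option 2 concrete, here is a rough sketch of what I mean by triggering spark-submit through ProcessBuilder, including the crude output-stream monitoring I mentioned. The Spark home, jar path, main class, and arguments are placeholders, not my real application:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SparkSubmitLauncher {

    /**
     * Build the spark-submit command line for a yarn-cluster job.
     * sparkHome, mainClass, appJar, and appArgs are placeholders for
     * the caller's actual installation and application.
     */
    static List<String> buildSubmitCommand(String sparkHome, String mainClass,
                                           String appJar, String... appArgs) {
        List<String> cmd = new ArrayList<>(Arrays.asList(
                sparkHome + "/bin/spark-submit",
                "--master", "yarn-cluster",   // Spark 1.2-era syntax for cluster mode on YARN
                "--class", mainClass,
                appJar));
        cmd.addAll(Arrays.asList(appArgs));
        return cmd;
    }

    public static void main(String[] args) throws Exception {
        List<String> cmd = buildSubmitCommand("/opt/spark", "com.example.MyJob",
                "/opt/jobs/my-job.jar", "--tenant", "acme");
        ProcessBuilder pb = new ProcessBuilder(cmd);
        pb.redirectErrorStream(true);         // merge stderr into stdout so one reader sees everything

        Process p = pb.start();

        // Crude monitoring: scan spark-submit's output for the YARN application id.
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = r.readLine()) != null) {
                if (line.contains("application_")) {
                    System.out.println("submitted: " + line);
                }
            }
        }

        int exit = p.waitFor();               // non-zero exit means the submission failed
        System.out.println("spark-submit exited with " + exit);
    }
}
```

This is exactly the part that feels fragile to me: the status of the job is only visible by parsing text output, and the process must be spawned per tenant request.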
Later I plan to deploy this entire application on Amazon EMR, so the approach should work there as well. Thanks in advance.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/What-is-best-way-to-run-spark-job-in-yarn-cluster-mode-from-java-program-servlet-container-and-NOT-u-tp21817.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.