Re: Multi-tenancy for Spark (Streaming) Applications
Hi,

by now I think I understand a bit better how spark-submit and YARN work together, and how the Spark driver and executors interact on YARN. For my use case, as described on https://spark.apache.org/docs/latest/submitting-applications.html, I would probably have an end-user-facing gateway that submits my Spark (Streaming) application to the YARN cluster in yarn-cluster mode. I have a couple of questions regarding that setup:

* The gateway does not need to be written in Scala or Java; it has no contact with the Spark libraries at all and just executes a program on the command line (./spark-submit ...), right?

* Since my application is a streaming application, it won't finish by itself. What is the best way to terminate the application on the cluster from my gateway program? Can I just send SIGTERM to the spark-submit process, and is that recommended?

* There are probably many ways to achieve this, but what is a good way to send commands/instructions to the running Spark application? If I want to push commands from the gateway to the Spark driver, I guess I need its IP address; how do I obtain it? If I want the Spark driver to pull its instructions instead, what is a good way to do so? Any suggestions?

Thanks,
Tobias
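P.S.: purely for illustration, here is a rough sketch of the gateway part as I picture it. The class name and paths are made up, and the remark about SIGTERM is my assumption, not something I have verified:

    // Hypothetical gateway sketch: shells out to spark-submit and has no
    // Spark libraries on its classpath. In yarn-cluster mode the driver
    // runs inside the cluster, so I suspect SIGTERM to spark-submit alone
    // may not stop the job; "yarn application -kill <appId>" should.
    import scala.sys.process._

    object SparkSubmitGateway {                // made-up name
      def submit(appJar: String, mainClass: String): Process =
        Seq("./bin/spark-submit",
            "--master", "yarn-cluster",
            "--class", mainClass,
            appJar).run()                      // non-blocking process handle

      def sigterm(p: Process): Unit =
        p.destroy()                            // SIGTERM to spark-submit

      def yarnKill(appId: String): Int =
        Seq("yarn", "application", "-kill", appId).!   // exit code of kill
    }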
Re: Multi-tenancy for Spark (Streaming) Applications
Hi,

On Thu, Sep 4, 2014 at 10:33 AM, Tathagata Das tathagata.das1...@gmail.com wrote:
> In the current state of Spark Streaming, creating separate Java processes,
> each having a streaming context, is probably the best approach to
> dynamically adding and removing input sources. All of these should be able
> to use a YARN cluster for resource allocation.

So, for example, I would write a server application that accepts a command like "createNewInstance" and then calls spark-submit, pushing my actual application to the YARN cluster? Or could I use spark-jobserver?

Thanks,
Tobias
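P.S.: something like the following is what I picture for the server side; all names here are made up, and spark-jobserver would presumably replace the hand-rolled process handling:

    // Hypothetical sketch: one spark-submit child process (and hence one
    // driver JVM with its own StreamingContext) per createNewInstance.
    import scala.collection.concurrent.TrieMap
    import scala.sys.process._

    object PipelineServer {
      private val instances = TrieMap.empty[String, Process]

      def handle(command: String, userId: String): Unit = command match {
        case "createNewInstance" =>
          instances(userId) =
            Seq("./bin/spark-submit", "--master", "yarn-cluster",
                "--class", "example.StreamingPipeline",  // made-up class
                "/path/to/app.jar", userId).run()
        case "removeInstance" =>
          // destroy() sends SIGTERM to spark-submit; the YARN application
          // itself may still need "yarn application -kill".
          instances.remove(userId).foreach(_.destroy())
        case other =>
          sys.error("unknown command: " + other)
      }
    }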
Re: Multi-tenancy for Spark (Streaming) Applications
In the current state of Spark Streaming, creating separate Java processes, each having a streaming context, is probably the best approach to dynamically adding and removing input sources. All of these should be able to use a YARN cluster for resource allocation.

On Wed, Sep 3, 2014 at 6:30 PM, Tobias Pfeiffer t...@preferred.jp wrote:
> Hi,
>
> I am not sure if multi-tenancy is the right word, but I am thinking about a
> Spark application where multiple users can, say, log into some web interface
> and specify a data processing pipeline with a streaming source, processing
> steps, and output.
>
> Now, as far as I know, there can be only one StreamingContext per JVM, and I
> also cannot add sources or processing steps once it has been started. Are
> there any ideas/suggestions for how to achieve dynamic adding and removing
> of input sources and processing pipelines? Do I need a separate 'java'
> process per user? Also, can I realize such a thing when using YARN for
> dynamic allocation?
>
> Thanks,
> Tobias
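For concreteness, a minimal sketch of such a one-process-one-StreamingContext application; host, port, batch interval, and names are placeholders, not from this thread:

    // Each user/pipeline gets its own JVM running something like this.
    // The DStream graph must be fully wired up before start(); adding a
    // source afterwards means launching a new process with a new context.
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StreamingPipeline {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("pipeline-" + args(0))
        val ssc  = new StreamingContext(conf, Seconds(10))

        // placeholder pipeline: word counts over a socket source
        val lines = ssc.socketTextStream("stream-source-host", 9999)
        lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

        ssc.start()
        ssc.awaitTermination()   // runs until the process is terminated
      }
    }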