Re: Multi-tenancy for Spark (Streaming) Applications

2014-09-11 Thread Tobias Pfeiffer
Hi,

by now I think I understand a bit better how spark-submit and YARN play
together, and how the Spark driver and executors interact on YARN.

Now, for my use case, as described on
https://spark.apache.org/docs/latest/submitting-applications.html, I would
probably have an end-user-facing gateway that submits my Spark (Streaming)
application to the YARN cluster in yarn-cluster mode.

I have a couple of questions regarding that setup:
* That gateway does not need to be written in Scala or Java; it has no
contact with the Spark libraries at all and just executes a program on the
command line (./spark-submit ...), right? (See the first sketch after this
list for what I have in mind.)
* Since my application is a streaming application, it won't finish by
itself. What is the best way to terminate the application on the cluster
from my gateway program? Can I just send SIGTERM to the spark-submit
process, and is that recommended?
* I guess there are many possibilities here, but what is a good way to send
commands/instructions to the running Spark application? If I want to push
commands from the gateway to the Spark driver, I guess I need to find out
its IP address first; how would I do that? If I want the Spark driver to
pull its instructions instead (as in the second sketch after this list),
what is a good way to do so? Any suggestions?
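
For the first question, this is roughly what I have in mind for the gateway
(just an untested sketch; the spark-submit path, class name, and jar are
placeholders, and the gateway would not have to be a JVM program at all,
since it only runs an external command):

import scala.sys.process._

object Gateway {
  // The gateway never links against the Spark libraries; it only shells
  // out to the spark-submit script of the Spark distribution.
  def submitPipeline(jarPath: String, mainClass: String): Process = {
    Seq(
      "/opt/spark/bin/spark-submit",   // placeholder install path
      "--master", "yarn-cluster",
      "--class", mainClass,
      jarPath
    ).run()  // non-blocking; the handle refers to the local client process
  }
}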
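
For the second and third questions, instead of sending SIGTERM to
spark-submit (which, in yarn-cluster mode, would as far as I understand
only affect the local client, not the application running on the cluster),
I was wondering whether the driver could pull its instructions, e.g. by
polling a well-known HDFS location that the gateway writes to. A rough,
untested sketch of what I mean (the control path is made up):

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ControlledStreamingApp {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("controlled-streaming-app"), Seconds(10))

    // ... define all sources and processing steps here, before start() ...

    ssc.start()

    // The gateway signals the application by creating this (made-up)
    // marker file; the driver polls for it instead of waiting for a signal.
    val stopMarker = new Path("/control/myapp/stop")
    val fs = FileSystem.get(ssc.sparkContext.hadoopConfiguration)
    var stopped = false
    while (!stopped) {
      Thread.sleep(10000)
      if (fs.exists(stopMarker)) {
        // stopSparkContext = true, stopGracefully = true: let data that
        // has already been received finish processing before shutdown.
        ssc.stop(true, true)
        stopped = true
      }
    }
  }
}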

Thanks,
Tobias


Re: Multi-tenancy for Spark (Streaming) Applications

2014-09-08 Thread Tobias Pfeiffer
Hi,

On Thu, Sep 4, 2014 at 10:33 AM, Tathagata Das tathagata.das1...@gmail.com
wrote:

 In the current state of Spark Streaming, creating separate Java processes
 each having a streaming context is probably the best approach to
 dynamically adding and removing input sources. All of these should be
 able to use a YARN cluster for resource allocation.


So, for example, I would write a server application that accepts a command
like createNewInstance and then calls spark-submit, pushing my actual
application to the YARN cluster (roughly as in the sketch below)? Or could
I use spark-jobserver?
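
To make it concrete, something like this rough sketch is what I have in
mind (class, jar, and path names are placeholders):

import scala.collection.mutable
import scala.sys.process._

object PipelineManager {
  // One spark-submit client process per user pipeline, tracked by id.
  private val running = mutable.Map[String, Process]()

  def createNewInstance(pipelineId: String, userConfig: String): Unit = {
    val proc = Seq(
      "/opt/spark/bin/spark-submit",               // placeholder path
      "--master", "yarn-cluster",
      "--class", "com.example.StreamingPipeline",  // placeholder class
      "/opt/app/pipeline-assembly.jar",            // placeholder jar
      pipelineId, userConfig
    ).run()
    running(pipelineId) = proc
  }

  // Note: in yarn-cluster mode, killing the local client process does not
  // necessarily stop the application on the cluster, so removal would
  // rather go through a control channel to the driver (or
  // "yarn application -kill").
  def removeInstance(pipelineId: String): Unit =
    running.remove(pipelineId).foreach(_.destroy())
}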

Thanks
Tobias


Re: Multi-tenancy for Spark (Streaming) Applications

2014-09-03 Thread Tathagata Das
In the current state of Spark Streaming, creating separate Java processes
each having a streaming context is probably the best approach to
dynamically adding and removing input sources. All of these should be
able to use a YARN cluster for resource allocation.


On Wed, Sep 3, 2014 at 6:30 PM, Tobias Pfeiffer t...@preferred.jp wrote:

 Hi,

 I am not sure if multi-tenancy is the right word, but I am thinking
 about a Spark application where multiple users can, say, log into some web
 interface and specify a data processing pipeline with a streaming source,
 processing steps, and output.

 Now as far as I know, there can be only one StreamingContext per JVM, and
 also I cannot add sources or processing steps once it has been started. Are
 there any ideas/suggestions for how to achieve dynamic adding and removing
 of input sources and processing pipelines? Do I need a separate 'java'
 process per user?
 Also, can I realize such a thing when using YARN for dynamic allocation?
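
 To illustrate the constraint I mean: as far as I understand, the whole
 pipeline has to be wired up before start(), roughly as in this sketch
 (host, port, and the word-count steps are just made-up examples):

 import org.apache.spark.SparkConf
 import org.apache.spark.streaming.{Seconds, StreamingContext}
 import org.apache.spark.streaming.StreamingContext._

 object SingleUserPipeline {
   def main(args: Array[String]): Unit = {
     val pipelineId = args(0)  // e.g. passed in by the gateway
     val ssc = new StreamingContext(
       new SparkConf().setAppName("pipeline-" + pipelineId), Seconds(5))

     // All input streams and transformations must be defined here;
     // they cannot be added or removed after start().
     val lines = ssc.socketTextStream("stream-source.example.com", 9999)
     lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

     ssc.start()              // from here on the topology is fixed
     ssc.awaitTermination()
   }
 }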

 Thanks
 Tobias