Hi,

On Tue, Sep 9, 2014 at 12:59 PM, Ron's Yahoo! <zlgonza...@yahoo.com> wrote:

> I want to create a synchronous REST API that will process some data that
> is passed in as some request.
> I would imagine that the Spark Streaming job on YARN is a long-running
> job that waits on requests from something. What that something is is
> still not clear to me, but I would imagine that it’s some queue.
> The goal is to be able to push a message onto a queue with some id, and
> then get the processed results back from Spark Streaming.
That is not exactly a Spark Streaming use case, I think. Spark Streaming pulls data from some source (like a queue), processes all data collected in a certain interval as a mini-batch, and stores the results somewhere. It is not well suited to handling request-response cycles in a synchronous way; you might consider using plain Spark (without Streaming) for that.

For example, you could use the Unfiltered library (http://unfiltered.databinder.net/Unfiltered.html) and, within the request handling, run some RDD operation and return the output as the HTTP response. This works fine, as multiple threads can submit Spark jobs concurrently: https://spark.apache.org/docs/latest/job-scheduling.html

You could also check https://github.com/adobe-research/spindle -- that seems to be similar to what you are doing.

> The goal is for the REST API to be able to respond to lots of calls with
> low latency.
> Hope that clarifies things...

Note that "low latency" for "lots of calls" is maybe not something that Spark was built for. Even if you do close to no data processing, you may not get below 200 ms or so due to the overhead of submitting jobs etc., from my experience.

Tobias
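P.S. A rough sketch of the Unfiltered-plus-plain-Spark approach described above, in case it helps. This is only illustrative: the endpoint path, the input file, and the word-count job are made up for the example, and the exact Unfiltered server API varies a bit between versions. The key point is a single long-lived SparkContext shared by all request threads, with each HTTP request synchronously submitting its own Spark job.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import unfiltered.request._
import unfiltered.response._

object SyncSparkApi {

  // One long-lived SparkContext shared across requests; Spark's scheduler
  // handles jobs submitted concurrently from multiple request threads.
  val sc = new SparkContext(
    new SparkConf().setAppName("sync-api").setMaster("local[*]"))

  // GET /count/<word> runs a small RDD job and returns the result
  // as the HTTP response body.
  val plan = unfiltered.filter.Planify {
    case GET(Path(Seg("count" :: word :: Nil))) =>
      // Hypothetical input path; substitute your own data source.
      val n = sc.textFile("/data/corpus.txt")
                .filter(line => line.contains(word))
                .count()
      ResponseString(n.toString)
  }

  def main(args: Array[String]): Unit = {
    // Embedded Jetty server from unfiltered-jetty.
    unfiltered.jetty.Http(8080).filter(plan).run()
  }
}
```

Each request blocks until its job finishes, so the per-job submission overhead mentioned above applies to every call.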