Re: Spark streaming for synchronous API

Ron's Yahoo! Mon, 08 Sep 2014 22:04:19 -0700

Hi Tobias,
  My assumption was that starting up a new job to listen on a particular 
queue topic could be done asynchronously.
  For example, let’s say there’s a particular topic T1 in a Kafka queue. If I 
have a new set of requests coming from a particular client A, I was wondering 
if I could create a partition A.
  The streaming job is submitted to listen to T1.A and will write to a topic 
T2.A, which the REST endpoint would be listening on.
  It does seem a little contrived, but the ultimate goal here is to get a bunch 
of messages from a queue, distribute them to a set of Spark jobs that process 
them and write the results back to another queue, which the REST endpoint 
synchronously waits on. Storm might be a better fit, but the background behind 
this question is that I want to reuse the same set of transformations for both 
batch and streaming, with the streaming use case represented by a REST call.
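
To make the "same transformations for batch and streaming" idea concrete, here is a minimal sketch in plain Python rather than Spark. The names transform, run_batch and run_micro_batches are made up for illustration, and the strip/uppercase logic is only a placeholder for real processing. In Spark, the analogous approach would presumably be to write the logic as an RDD-to-RDD function and reuse it on each micro-batch via DStream.transform.

```python
from typing import Iterable, Iterator, List


def transform(records: Iterable[str]) -> Iterator[str]:
    """Shared pipeline (placeholder logic): drop blank records, normalize the rest."""
    for record in records:
        cleaned = record.strip()
        if cleaned:
            yield cleaned.upper()


def run_batch(dataset: List[str]) -> List[str]:
    # Batch path: apply the shared pipeline to the whole dataset at once.
    return list(transform(dataset))


def run_micro_batches(stream: Iterable[List[str]]) -> Iterator[List[str]]:
    # Streaming path: apply the same pipeline to each micro-batch as it arrives.
    for micro_batch in stream:
        yield list(transform(micro_batch))
```

The point of the sketch is only that the processing logic lives in one function, and the batch and streaming paths differ just in how they feed records into it.
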
  In other words, job submission would not be on the request path, so I 
would expect the latency to be limited to the processing, the write back, and 
the consumption of the processed message by the original REST request.
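
A rough sketch of that request/response flow, under stated assumptions: in-memory queues stand in for the two topics (T1.A and T2.A), a background thread stands in for the long-running job, and uppercasing stands in for the actual processing. The names requests_q, responses, start_worker and handle_request are hypothetical and not part of any Kafka or Spark API.

```python
import queue
import threading
import uuid

requests_q = queue.Queue()          # stands in for the request topic (T1.A)
responses = {}                      # request id -> reply queue, stands in for T2.A
responses_lock = threading.Lock()


def worker():
    # Stand-in for the long-running job: read, process, write back by id.
    while True:
        item = requests_q.get()
        if item is None:            # sentinel used to stop the worker
            break
        request_id, payload = item
        result = payload.upper()    # placeholder for the real transformations
        with responses_lock:
            reply_q = responses.get(request_id)
        if reply_q is not None:     # the caller may already have timed out
            reply_q.put(result)


def start_worker():
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread


def handle_request(payload, timeout=5.0):
    # What the REST handler would do: tag the message with an id,
    # enqueue it, then block until the matching reply arrives.
    request_id = str(uuid.uuid4())
    reply_q = queue.Queue(maxsize=1)
    with responses_lock:
        responses[request_id] = reply_q
    requests_q.put((request_id, payload))
    try:
        return reply_q.get(timeout=timeout)   # raises queue.Empty on timeout
    finally:
        with responses_lock:
            responses.pop(request_id, None)
```

With a worker started via start_worker(), a call such as handle_request("hello") blocks until the reply for its own id comes back, so the caller's latency is the queue hops plus the processing time.
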
  Let me know what you think…


Thanks,
Ron

On Sep 8, 2014, at 9:28 PM, Tobias Pfeiffer <t...@preferred.jp> wrote:

> Hi,
> 
> On Tue, Sep 9, 2014 at 12:59 PM, Ron's Yahoo! <zlgonza...@yahoo.com> wrote:
>  I want to create a synchronous REST API that will process some data that is 
> passed in as some request.
>  I would imagine that the Spark Streaming Job on YARN is a long running job 
> that waits on requests from something. What that something is is still not 
> clear to me, but I would imagine that it’s some queue. The goal is to be able 
> to push a message onto a queue with some id, and then get the processed 
> results back from Spark Streaming.
> 
> That is not exactly a Spark Streaming use case, I think. Spark Streaming 
> pulls data from some source (like a queue), then processes all data collected 
> in a certain interval in a mini-batch, and stores that data somewhere. It is 
> not well suited for handling request-response cycles in a synchronous way; 
> you might consider using plain Spark (without Streaming) for that.
> 
> For example, you could use the Unfiltered library 
> (http://unfiltered.databinder.net/Unfiltered.html) and, within request 
> handling, do some RDD operation, returning the output as the HTTP response. 
> This works fine, as multiple threads can submit Spark jobs concurrently (see 
> https://spark.apache.org/docs/latest/job-scheduling.html). You could also 
> check https://github.com/adobe-research/spindle -- that seems to be similar 
> to what you are doing.
> 
>  The goal is for the REST API to be able to respond to lots of calls with low 
> latency.
>  Hope that clarifies things...
> 
> Note that "low latency" for "lots of calls" is maybe not something that Spark 
> was built for. Even if you do close to no data processing, you may not get 
> below 200ms or so due to the overhead of submitting jobs etc., in my 
> experience.
> 
> Tobias
> 
>
