Re: Spark streaming for synchronous API

2014-09-08 Thread Tobias Pfeiffer
Ron,

On Tue, Sep 9, 2014 at 11:27 AM, Ron's Yahoo! zlgonza...@yahoo.com.invalid
 wrote:

   I’m trying to figure out how I can run Spark Streaming like an API.
   The goal is to have a synchronous REST API that runs the spark data flow
 on YARN.


I guess I *may* develop something similar in the future.

By a synchronous REST API, do you mean that submitting the job is
synchronous and you would fetch the processing results via a different
call? Or do you want to submit a job and get the processed data back as an
HTTP stream?

To begin with, is it even possible to have Spark Streaming run as a yarn
 job?


I think it is very much possible to run Spark Streaming as a YARN job; at
least it worked well with Mesos.

Tobias


Re: Spark streaming for synchronous API

2014-09-08 Thread Ron's Yahoo!
Tobias,
  Let me explain a little more.
  I want to create a synchronous REST API that will process data passed in 
with each request.
  I would imagine that the Spark Streaming job on YARN is a long-running job 
that waits on requests from something. What that something is isn’t clear to 
me yet, but I would imagine it’s some queue. The goal is to be able to push a 
message onto a queue with some id, and then get the processed results back 
from Spark Streaming.
  The goal is for the REST API to be able to respond to lots of calls with low 
latency.
  Hope that clarifies things...

Thanks,
Ron


On Sep 8, 2014, at 7:41 PM, Tobias Pfeiffer t...@preferred.jp wrote:

 Ron,
 
 On Tue, Sep 9, 2014 at 11:27 AM, Ron's Yahoo! zlgonza...@yahoo.com.invalid 
 wrote:
   I’m trying to figure out how I can run Spark Streaming like an API.
   The goal is to have a synchronous REST API that runs the spark data flow on 
 YARN.
 
 I guess I *may* develop something similar in the future.
 
 By a synchronous REST API, do you mean that submitting the job is 
 synchronous and you would fetch the processing results via a different call? 
 Or do you want to submit a job and get the processed data back as an HTTP 
 stream?
 
 To begin with, is it even possible to have Spark Streaming run as a yarn job?
 
 I think it is very much possible to run Spark Streaming as a YARN job; at 
 least it worked well with Mesos.
 
 Tobias
 



Re: Spark streaming for synchronous API

2014-09-08 Thread Tobias Pfeiffer
Hi,

On Tue, Sep 9, 2014 at 12:59 PM, Ron's Yahoo! zlgonza...@yahoo.com wrote:

  I want to create a synchronous REST API that will process data passed in
 with each request.
  I would imagine that the Spark Streaming job on YARN is a long-running
 job that waits on requests from something. What that something is isn’t
 clear to me yet, but I would imagine it’s some queue. The goal is to be
 able to push a message onto a queue with some id, and then get the
 processed results back from Spark Streaming.


That is not exactly a Spark Streaming use case, I think. Spark Streaming
pulls data from some source (like a queue), then processes all data
collected in a certain interval in a mini-batch, and stores that data
somewhere. It is not well suited for handling request-response cycles in a
synchronous way; you might consider using plain Spark (without Streaming)
for that.

For example, you could use the Unfiltered library
(http://unfiltered.databinder.net/Unfiltered.html) and, within the request
handling, do some RDD operation, returning the output as the HTTP response.
This works fine, as multiple threads can submit Spark jobs concurrently (see
https://spark.apache.org/docs/latest/job-scheduling.html). You could also
check https://github.com/adobe-research/spindle -- that seems to be similar
to what you are doing.
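
As a rough sketch of that idea (the dataset path, port, and endpoint are all
made up; the master is assumed to be supplied via spark-submit, and one
long-lived SparkContext is shared across all request threads):

import org.apache.spark.{SparkConf, SparkContext}
import unfiltered.request._
import unfiltered.response._

object SyncApi {
  def main(args: Array[String]): Unit = {
    // One SparkContext for the whole server; Spark's scheduler can run
    // jobs submitted from multiple request threads concurrently.
    val sc = new SparkContext(new SparkConf().setAppName("sync-api"))
    val data = sc.textFile("hdfs:///some/dataset").cache()

    val plan = unfiltered.filter.Planify {
      case GET(Path(Seg("count" :: word :: Nil))) =>
        // Each HTTP request triggers a blocking Spark job on the cached RDD.
        ResponseString(data.filter(_.contains(word)).count().toString)
    }
    unfiltered.jetty.Http(8080).filter(plan).run()
  }
}

Since the RDD is cached up front, each call should only pay the
job-scheduling overhead plus the actual computation.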

 The goal is for the REST API to be able to respond to lots of calls with
 low latency.
  Hope that clarifies things...


Note that low latency for lots of calls is maybe not something that
Spark was built for. Even if you do close to no data processing, in my
experience you may not get below 200 ms or so, due to the overhead of
submitting jobs etc.

Tobias


Re: Spark streaming for synchronous API

2014-09-08 Thread Ron's Yahoo!
Hi Tobias,
  So I guess where I was coming from was the assumption that starting up a new 
job to listen on a particular queue topic could be done asynchronously.
  For example, let’s say there’s a particular topic T1 in a Kafka queue. If I 
have a new set of requests coming from a particular client A, I was wondering 
if I could create a partition A.
  The streaming job is submitted to listen to T1.A and will write to a topic 
T2.A, which the REST endpoint would be listening on.
  It does seem a little contrived, but the ultimate goal here is to get a 
bunch of messages from a queue, distribute them to a bunch of Spark jobs that 
process them and write back to another queue, which the REST endpoint 
synchronously waits on. Storm might be a better fit, but the background to 
this question is that I want to reuse the same set of transformations for 
both batch and streaming, with the streaming use case represented by a REST 
call.
  In other words, job submission would not be part of the equation, so I 
would imagine the latency is limited to the processing, the write-back, and 
the consumption of the processed message by the original REST request.
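
Concretely, on the REST side I picture something like the following
correlator (just a sketch; the names and the String payloads are made up).
The handler registers the request id before publishing to T1.A, so a fast
reply cannot be missed; a consumer thread reading T2.A completes the
matching promise:

import scala.collection.concurrent.TrieMap
import scala.concurrent.duration._
import scala.concurrent.{Await, Future, Promise}

object ReplyCorrelator {
  // Outstanding requests, keyed by the id that travels with the message.
  private val pending = TrieMap.empty[String, Promise[String]]

  // Called by the REST handler before publishing to T1.A.
  def register(id: String): Future[String] = {
    val p = Promise[String]()
    pending.put(id, p)
    p.future
  }

  // Called from the T2.A consumer loop for each processed message.
  def complete(id: String, result: String): Unit =
    pending.remove(id).foreach(_.success(result))
}

// In the REST handler (sketch; publishToT1A is a hypothetical producer call):
//   val reply = ReplyCorrelator.register(id)
//   publishToT1A(id, payload)
//   val result = Await.result(reply, 5.seconds) // request thread blocks here
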
  Let me know what you think…

Thanks,
Ron

On Sep 8, 2014, at 9:28 PM, Tobias Pfeiffer t...@preferred.jp wrote:

 Hi,
 
 On Tue, Sep 9, 2014 at 12:59 PM, Ron's Yahoo! zlgonza...@yahoo.com wrote:
 I want to create a synchronous REST API that will process data passed in 
 with each request.
 I would imagine that the Spark Streaming job on YARN is a long-running job 
 that waits on requests from something. What that something is isn’t clear 
 to me yet, but I would imagine it’s some queue. The goal is to be able to 
 push a message onto a queue with some id, and then get the processed 
 results back from Spark Streaming.
 
 That is not exactly a Spark Streaming use case, I think. Spark Streaming 
 pulls data from some source (like a queue), then processes all data collected 
 in a certain interval in a mini-batch, and stores that data somewhere. It is 
 not well suited for handling request-response cycles in a synchronous way; 
 you might consider using plain Spark (without Streaming) for that.
 
 For example, you could use the Unfiltered library 
 (http://unfiltered.databinder.net/Unfiltered.html) and, within the request 
 handling, do some RDD operation, returning the output as the HTTP response. 
 This works fine, as multiple threads can submit Spark jobs concurrently (see 
 https://spark.apache.org/docs/latest/job-scheduling.html). You could also 
 check https://github.com/adobe-research/spindle -- that seems to be similar 
 to what you are doing.
 
  The goal is for the REST API to be able to respond to lots of calls with 
 low latency.
  Hope that clarifies things...
 
 Note that low latency for lots of calls is maybe not something that Spark 
 was built for. Even if you do close to no data processing, in my experience 
 you may not get below 200 ms or so, due to the overhead of submitting jobs 
 etc.
 
 Tobias
 
 



Re: Spark streaming for synchronous API

2014-09-08 Thread Tobias Pfeiffer
Hi,

On Tue, Sep 9, 2014 at 2:02 PM, Ron's Yahoo! zlgonza...@yahoo.com wrote:

   So I guess where I was coming from was the assumption that starting up a
 new job to listen on a particular queue topic could be done asynchronously.


No, with the current state of Spark Streaming, all data sources and the
processing pipeline must be fixed when you start your StreamingContext. You
cannot add new data sources dynamically at the moment; see
http://apache-spark-user-list.1001560.n3.nabble.com/Multi-tenancy-for-Spark-Streaming-Applications-td13398.html
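
To illustrate (the ZooKeeper address, group id, and transformation are
placeholders; spark-shell style for brevity): the whole DStream graph,
including every input source, is declared before start() is called:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("fixed-pipeline")
val ssc = new StreamingContext(conf, Seconds(1)) // fixed 1-second batches

// Every input source must be declared here, before start().
val lines = KafkaUtils.createStream(ssc, "zk-host:2181", "my-group",
  Map("T1" -> 1)).map(_._2)
lines.map(_.toUpperCase).print() // stand-in transformation

ssc.start() // from here on, no new sources or operations can be added
ssc.awaitTermination()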


   For example, let’s say there’s a particular topic T1 in a Kafka queue.
 If I have a new set of requests coming from a particular client A, I was
 wondering if I could create a partition A.
   The streaming job is submitted to listen to T1.A and will write to a
 topic T2.A, which the REST endpoint would be listening on.


That doesn't seem like a good way to use Kafka. It may be possible, but I
am pretty sure you should create a new topic T_A instead of a partition A
in an existing topic. With some modifications to Spark Streaming's
KafkaReceiver you *might* be able to get it to work as you imagine, but it
was not meant to be used that way, I think.

Also, you will not get low latency, because Spark Streaming processes
data in batches of fixed interval length (say, 1 second) and in the worst
case your query will wait up to 1 second before processing even starts.

If I understand correctly what you are trying to do (which I am not sure
about), I would probably recommend choosing a somewhat different
architecture, in particular given that you cannot dynamically add data
sources.

Tobias