Re: Spark streaming for synchronous API
Hi again,

On Tue, Sep 9, 2014 at 2:20 PM, Tobias Pfeiffer wrote:
> On Tue, Sep 9, 2014 at 2:02 PM, Ron's Yahoo! wrote:
>> For example, let's say there's a particular topic T1 in a Kafka queue.
>> If I have a new set of requests coming from a particular client A, I was
>> wondering if I could create a partition A.
>> The streaming job is submitted to listen to T1.A and will write to a
>> topic T2.A, which the REST endpoint would be listening on.
>
> That doesn't seem like a good way to use Kafka. It may be possible, but I
> am pretty sure you should create a new topic T_A instead of a partition A
> in an existing topic. With some modifications of Spark Streaming's
> KafkaReceiver you *might* be able to get it to work as you imagine, but it
> was not meant to be that way, I think.

Maybe I was wrong about a new topic being the better way. Looking, for example, at the way that Samza consumes Kafka streams <http://samza.incubator.apache.org/learn/documentation/latest/introduction/concepts.html>, it seems like there is one task per partition, and data can go into partitions keyed by user ID. So maybe a new partition is actually the conceptually better way. Nonetheless, the built-in KafkaReceiver doesn't support assignment of partitions to receivers, AFAIK ;-)

Tobias
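The per-client partitioning idea above can be sketched with a little code. This is an illustrative model, not Kafka's actual partitioner implementation: the point is that hash-based assignment of a keyed message (a stable hash of the client id, modulo the partition count) sends every message for one client to the same partition, and thus, in the Samza model, to the same task. The function name and partition count below are made up for the example.

```python
import zlib

def partition_for(client_id: str, num_partitions: int) -> int:
    """Hash-based partition assignment: the same client id always maps
    to the same partition, so one consumer task can own all of that
    client's messages."""
    return zlib.crc32(client_id.encode("utf-8")) % num_partitions

# Every message keyed by "client-A" lands in one partition; different
# clients may share a partition, since partitions are a fixed pool.
p_a = partition_for("client-A", 4)
p_b = partition_for("client-B", 4)
```

Note the trade-off this sketch makes visible: the partition count is fixed when the topic is created, so "a partition per client" only works if the set of clients is known up front, which is the same static-configuration limitation discussed below for Spark Streaming itself.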
Re: Spark streaming for synchronous API
Hi,

On Tue, Sep 9, 2014 at 2:02 PM, Ron's Yahoo! wrote:
> So I guess where I was coming from was the assumption that starting up a
> new job to be listening on a particular queue topic could be done
> asynchronously.

No, with the current state of Spark Streaming, all data sources and the processing pipeline must be fixed when you start your StreamingContext. You cannot add new data sources dynamically at the moment, see http://apache-spark-user-list.1001560.n3.nabble.com/Multi-tenancy-for-Spark-Streaming-Applications-td13398.html

> For example, let's say there's a particular topic T1 in a Kafka queue.
> If I have a new set of requests coming from a particular client A, I was
> wondering if I could create a partition A.
> The streaming job is submitted to listen to T1.A and will write to a
> topic T2.A, which the REST endpoint would be listening on.

That doesn't seem like a good way to use Kafka. It may be possible, but I am pretty sure you should create a new topic T_A instead of a partition A in an existing topic. With some modifications of Spark Streaming's KafkaReceiver you *might* be able to get it to work as you imagine, but it was not meant to be that way, I think.

Also, you will not get "low latency", because Spark Streaming processes data in batches of a fixed interval length (say, 1 second), and in the worst case your query will wait up to 1 second before processing even starts.

If I understand correctly what you are trying to do (which I am not sure about), I would probably recommend choosing a somewhat different architecture, in particular given that you cannot dynamically add data sources.

Tobias
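The worst-case-latency point above is simple arithmetic, sketched here under the assumption that batches fire at multiples of the batch interval: a record arriving just after a batch boundary waits almost a full interval before processing even starts.

```python
import math

def wait_before_processing(arrival_time: float, batch_interval: float) -> float:
    """Time a record waits for the next batch boundary, assuming
    batches start at integer multiples of the interval."""
    next_boundary = math.ceil(arrival_time / batch_interval) * batch_interval
    return next_boundary - arrival_time

# With a 1-second interval, a record arriving at t = 0.1 s waits 0.9 s
# before its batch starts; one arriving just before a boundary waits
# almost nothing. Actual end-to-end latency adds the processing time
# of the batch itself on top of this wait.
```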
Re: Spark streaming for synchronous API
Hi Tobias,

So I guess where I was coming from was the assumption that starting up a new job to listen on a particular queue topic could be done asynchronously.

For example, let's say there's a particular topic T1 in a Kafka queue. If I have a new set of requests coming from a particular client A, I was wondering if I could create a partition A. The streaming job is submitted to listen to T1.A and will write to a topic T2.A, which the REST endpoint would be listening on.

It does seem a little contrived, but the ultimate goal here is to get a bunch of messages from a queue, distribute them to a bunch of Spark jobs that process and write back to another queue, which the REST endpoint synchronously waits on. Storm might be a better fit, but the background behind this question is that I want to reuse the same set of transformations for both batch and streaming, with the streaming use case represented by a REST call. In other words, job submission would not be part of the equation, so I would imagine the latency is limited to the processing, the write-back, and the consumption of the processed message by the original REST request.

Let me know what you think...

Thanks,
Ron

On Sep 8, 2014, at 9:28 PM, Tobias Pfeiffer wrote:
> Hi,
>
> On Tue, Sep 9, 2014 at 12:59 PM, Ron's Yahoo! wrote:
>> I want to create a synchronous REST API that will process some data that
>> is passed in as some request.
>> I would imagine that the Spark Streaming job on YARN is a long-running
>> job that waits on requests from something. What that something is, is
>> still not clear to me, but I would imagine that it's some queue. The goal
>> is to be able to push a message onto a queue with some id, and then get
>> the processed results back from Spark Streaming.
>
> That is not exactly a Spark Streaming use case, I think. Spark Streaming
> pulls data from some source (like a queue), then processes all data
> collected in a certain interval in a mini-batch, and stores that data
> somewhere. It is not well suited for handling request-response cycles in
> a synchronous way; you might consider using plain Spark (without
> Streaming) for that.
>
> For example, you could use the Unfiltered library
> <http://unfiltered.databinder.net/Unfiltered.html> and, within request
> handling, do some RDD operation, returning the output as the HTTP
> response. This works fine, as multiple threads can submit Spark jobs
> concurrently <https://spark.apache.org/docs/latest/job-scheduling.html>.
> You could also check https://github.com/adobe-research/spindle -- that
> seems to be similar to what you are doing.
>
>> The goal is for the REST API to be able to respond to lots of calls with
>> low latency.
>> Hope that clarifies things...
>
> Note that "low latency" for "lots of calls" is maybe not something that
> Spark was built for. Even if you do close to no data processing, you may
> not get below 200 ms or so, due to the overhead of submitting jobs etc.,
> from my experience.
>
> Tobias
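The round trip Ron describes (push a request onto one queue, have the processed result come back on another, with the REST handler blocking until its answer arrives) is essentially a correlation-id pattern. Here is a minimal in-process sketch of that shape, using plain Python queues and a stand-in worker instead of Kafka topics and a Spark job; all names are illustrative.

```python
import queue
import threading
import uuid

requests = queue.Queue()   # stands in for the request topic (T1)
pending = {}               # correlation id -> {"done": Event, "result": ...}
pending_lock = threading.Lock()

def worker():
    """Stands in for the streaming job: consume, process, publish back."""
    while True:
        corr_id, payload = requests.get()
        result = payload.upper()  # placeholder for the real transformation
        with pending_lock:
            entry = pending.pop(corr_id)
        entry["result"] = result
        entry["done"].set()       # "publish" to the response channel (T2)

def call_sync(payload, timeout=5.0):
    """What the REST handler does: tag the request with an id, enqueue
    it, and block until the result for that id comes back."""
    corr_id = uuid.uuid4().hex
    entry = {"done": threading.Event()}
    with pending_lock:
        pending[corr_id] = entry
    requests.put((corr_id, payload))
    if not entry["done"].wait(timeout):
        raise TimeoutError("no response for " + corr_id)
    return entry["result"]

threading.Thread(target=worker, daemon=True).start()
```

Note that nothing here needs a partition per client: the correlation id in the message itself routes each response back to the right waiting caller, which is one reason Tobias's suggestion of not multiplying topics or partitions per client is workable.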
Re: Spark streaming for synchronous API
Hi,

On Tue, Sep 9, 2014 at 12:59 PM, Ron's Yahoo! wrote:
> I want to create a synchronous REST API that will process some data that
> is passed in as some request.
> I would imagine that the Spark Streaming job on YARN is a long-running
> job that waits on requests from something. What that something is, is
> still not clear to me, but I would imagine that it's some queue.
> The goal is to be able to push a message onto a queue with some id, and
> then get the processed results back from Spark Streaming.

That is not exactly a Spark Streaming use case, I think. Spark Streaming pulls data from some source (like a queue), then processes all data collected in a certain interval in a mini-batch, and stores that data somewhere. It is not well suited for handling request-response cycles in a synchronous way; you might consider using plain Spark (without Streaming) for that.

For example, you could use the Unfiltered library <http://unfiltered.databinder.net/Unfiltered.html> and, within request handling, do some RDD operation, returning the output as the HTTP response. This works fine, as multiple threads can submit Spark jobs concurrently <https://spark.apache.org/docs/latest/job-scheduling.html>. You could also check https://github.com/adobe-research/spindle -- that seems to be similar to what you are doing.

> The goal is for the REST API to be able to respond to lots of calls with
> low latency.
> Hope that clarifies things...

Note that "low latency" for "lots of calls" is maybe not something that Spark was built for. Even if you do close to no data processing, you may not get below 200 ms or so, due to the overhead of submitting jobs etc., from my experience.

Tobias
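The "multiple threads can submit Spark jobs concurrently" suggestion has a simple shape that can be sketched without Spark at all: a pool of request-handler threads, each submitting an independent job against one shared context, with the scheduler interleaving them. `handle_request` below is entirely hypothetical and merely stands in for a per-request RDD action against a shared SparkContext.

```python
from concurrent.futures import ThreadPoolExecutor

def handle_request(payload):
    """Stand-in for a per-request Spark job, i.e. roughly where code
    like sc.parallelize(payload).map(f).collect() would go against a
    SparkContext shared by all handler threads."""
    return sum(payload)

# Several HTTP handler threads submit jobs at once; the executor here
# (and Spark's own scheduler, FIFO or fair, in the real setup) runs
# them concurrently and returns each result to its own caller.
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(handle_request, [i, i + 1]) for i in range(4)]
    results = [f.result() for f in futures]
```

This is the synchronous request-response fit Tobias describes: each HTTP request maps to one Spark job and one response, with no streaming context or queue in between.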
Re: Spark streaming for synchronous API
Tobias,

Let me explain a little more. I want to create a synchronous REST API that will process some data that is passed in as some request. I would imagine that the Spark Streaming job on YARN is a long-running job that waits on requests from something. What that something is, is still not clear to me, but I would imagine that it's some queue. The goal is to be able to push a message onto a queue with some id, and then get the processed results back from Spark Streaming.

The goal is for the REST API to be able to respond to lots of calls with low latency. Hope that clarifies things...

Thanks,
Ron

On Sep 8, 2014, at 7:41 PM, Tobias Pfeiffer wrote:
> Ron,
>
> On Tue, Sep 9, 2014 at 11:27 AM, Ron's Yahoo! wrote:
>> I'm trying to figure out how I can run Spark Streaming like an API.
>> The goal is to have a synchronous REST API that runs the Spark data flow
>> on YARN.
>
> I guess I *may* develop something similar in the future.
>
> By "a synchronous REST API", do you mean that submitting the job is
> synchronous and you would fetch the processing results via a different
> call? Or do you want to submit a job and get the processed data back as
> an HTTP stream?
>
>> To begin with, is it even possible to have Spark Streaming run as a
>> YARN job?
>
> I think it is very much possible to run Spark Streaming as a YARN job; at
> least it worked well with Mesos.
>
> Tobias
Re: Spark streaming for synchronous API
Ron,

On Tue, Sep 9, 2014 at 11:27 AM, Ron's Yahoo! wrote:
> I'm trying to figure out how I can run Spark Streaming like an API.
> The goal is to have a synchronous REST API that runs the Spark data flow
> on YARN.

I guess I *may* develop something similar in the future.

By "a synchronous REST API", do you mean that submitting the job is synchronous and you would fetch the processing results via a different call? Or do you want to submit a job and get the processed data back as an HTTP stream?

> To begin with, is it even possible to have Spark Streaming run as a YARN
> job?

I think it is very much possible to run Spark Streaming as a YARN job; at least it worked well with Mesos.

Tobias
Spark streaming for synchronous API
Hi,

I'm trying to figure out how I can run Spark Streaming like an API. The goal is to have a synchronous REST API that runs the Spark data flow on YARN. Has anyone done something like this? Can you share your architecture?

To begin with, is it even possible to have Spark Streaming run as a YARN job?

Thanks,
Ron

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org