Does Spark Streaming calculate during a batch?

2014-11-13 Thread Michael Campbell
I was running a proof of concept for my company with Spark Streaming, and the conclusion I came to is that Spark collects data for the batch duration, THEN starts the data-pipeline calculations. My batch size was 5 minutes, and the CPU was all but dead for those 5 minutes, then when the 5 minutes were up the

Re: Does Spark Streaming calculate during a batch?

2014-11-13 Thread jay vyas
1) You have a receiver thread. That thread might use a lot of CPU, or not, depending on how you implement the thread in onStart.
2) Every 5 minutes, Spark will submit a job which processes every RDD which was created (i.e., using the store() call) in the receiver. That job will run asynchronously
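
A toy sketch of the two pieces described above, in plain Python rather than Spark: a receiver that only buffers what it is handed through a store() call, and a job that drains that buffer once per batch interval. ToyReceiver and run_batch are hypothetical names made up for illustration and are not Spark APIs; the real receiver and scheduler are considerably more involved.

```python
from typing import Callable, List


class ToyReceiver:
    """Stand-in for a streaming receiver (hypothetical, not Spark's Receiver class)."""

    def __init__(self) -> None:
        self._buffer: List[str] = []

    def store(self, record: str) -> None:
        # Receiving only buffers the record; no pipeline computation happens here.
        self._buffer.append(record)

    def drain(self) -> List[str]:
        # Hand over everything received so far and start a fresh buffer.
        batch, self._buffer = self._buffer, []
        return batch


def run_batch(receiver: ToyReceiver, job: Callable[[List[str]], int]) -> int:
    # Called once per batch interval: the job sees all records stored since the last call.
    return job(receiver.drain())


receiver = ToyReceiver()
for word in ["a", "b", "c"]:
    receiver.store(word)

first = run_batch(receiver, len)   # processes the three buffered records
second = run_batch(receiver, len)  # nothing new arrived, so this batch is empty
```

In this sketch the cost of store() depends entirely on what the receiver does, while the pipeline work happens only inside run_batch.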

Re: Does Spark Streaming calculate during a batch?

2014-11-13 Thread Sean Owen
Yes. Data is collected for 5 minutes, then processing starts at the end. The result may be an arbitrary function of the data in the interval, so the interval has to finish before computation can start. If you want more continuous processing, you can simply reduce the batch interval to, say, 1
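
To make the effect of the batch interval concrete, here is a small simulation in plain Python (not Spark code). It groups timestamped events into fixed intervals and reports the earliest time each batch result could be computed, which is the end of its interval. The function name micro_batches, the sample events, and the 60-second comparison interval are all illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def micro_batches(events: List[Tuple[float, int]], interval: float) -> List[Tuple[float, int]]:
    """Group (timestamp_seconds, value) events into fixed intervals.

    Returns (emit_time, batch_sum) pairs, where emit_time is the end of the
    interval, i.e. the earliest moment the batch result can be computed.
    """
    buckets: Dict[int, List[int]] = defaultdict(list)
    for ts, value in events:
        buckets[int(ts // interval)].append(value)
    # Intervals that received no events are simply absent from the output.
    return [((index + 1) * interval, sum(values)) for index, values in sorted(buckets.items())]


# Hypothetical events: (arrival time in seconds, value)
events = [(10.0, 1), (70.0, 2), (130.0, 3), (290.0, 4)]

five_minute = micro_batches(events, 300.0)  # one batch, ready only at t=300
one_minute = micro_batches(events, 60.0)    # first batch ready at t=60
```

With the 300-second interval nothing can be emitted before t=300, whereas the shorter interval yields its first result at t=60, at the cost of more, smaller batches.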

Re: Does Spark Streaming calculate during a batch?

2014-11-13 Thread Michael Campbell
On Thu, Nov 13, 2014 at 11:02 AM, Sean Owen so...@cloudera.com wrote:
> Yes. Data is collected for 5 minutes, then processing starts at the end. The result may be an arbitrary function of the data in the interval, so the interval has to finish before computation can start.

Thanks everyone.