I was running a proof of concept for my company with Spark Streaming, and
the conclusion I came to is that Spark collects data for the
batch duration, THEN starts the data-pipeline calculations.
My batch size was 5 minutes, and the CPU was all but dead for those 5
minutes; then, when the 5 minutes were up, the CPU lit up as the
calculations ran.
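For reference, a minimal sketch of the kind of job I was testing (the
host, port, and app name are illustrative, not my actual setup):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Minutes, StreamingContext}

    object BatchDemo {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("BatchDemo").setMaster("local[2]")
        // 5-minute batch duration: records are buffered for the whole
        // interval before any transformation runs on them
        val ssc = new StreamingContext(conf, Minutes(5))

        // hypothetical source; any input DStream behaves the same way
        val lines = ssc.socketTextStream("localhost", 9999)
        lines.count().print()   // fires once per 5-minute batch

        ssc.start()
        ssc.awaitTermination()
      }
    }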
1) You have a receiver thread. That thread might use a lot of CPU, or
not, depending on how you implement the thread in onStart (see the
sketch below).
2) Every 5 minutes, Spark will submit a job which processes every RDD
that was created (i.e. using the store() call) in the receiver. That
job will run asynchronously.
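A minimal sketch of such a receiver, assuming a made-up source that
just emits a timestamp once per second:

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    // Point 1: the thread started in onStart is the receiver thread.
    // Point 2: each store() call adds a record that Spark batches into
    // the RDD for the current interval and processes when it ends.
    class TickReceiver extends Receiver[String](StorageLevel.MEMORY_ONLY) {

      def onStart(): Unit = {
        new Thread("tick-receiver") {
          override def run(): Unit = {
            while (!isStopped()) {
              store("tick " + System.currentTimeMillis()) // buffered, not yet processed
              Thread.sleep(1000)
            }
          }
        }.start()
      }

      def onStop(): Unit = ()  // the loop above exits once isStopped() is true
    }

You would plug it in with ssc.receiverStream(new TickReceiver).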
Yes. Data is collected for 5 minutes, then processing starts at the
end. The result may be an arbitrary function of the data in the
interval, so the interval has to finish before computation can start.
If you want more continuous processing, you can simply reduce the
batch interval to, say, 1 second.
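Assuming a setup like the sketch earlier in the thread (reusing its
conf), that is a one-line change:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // same job as before, but jobs are now submitted roughly once per
    // second instead of once every 5 minutes
    val ssc = new StreamingContext(conf, Seconds(1))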
Thanks everyone.