Yes. Data is collected for the full 5-minute interval, and processing
starts at the end of it. The result may be an arbitrary function of the
data in the interval, so the interval has to finish before computation
can start.

If you want more continuous processing, you can simply reduce the
batch interval to, say, 1 minute.
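For reference, here is a minimal sketch of what a shorter batch interval
looks like in code; the socket source, host/port, and word-count-style
keys are placeholders for illustration, not details from your job:

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val conf = new SparkConf().setAppName("StreamingPoC")
  // 60-second batches instead of 300-second ones: each micro-batch is
  // collected for 1 minute, then the pipeline runs on that batch.
  val ssc = new StreamingContext(conf, Seconds(60))

  val lines = ssc.socketTextStream("localhost", 9999)  // placeholder source
  val counts = lines.flatMap(_.split(" "))
                    .map(word => (word, 1))
                    .reduceByKey(_ + _)   // runs once per completed batch
  counts.print()

  ssc.start()
  ssc.awaitTermination()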

On Thu, Nov 13, 2014 at 3:35 PM, Michael Campbell
<michael.campb...@gmail.com> wrote:
> I was running a proof of concept for my company with Spark Streaming, and
> the conclusion I came to is that Spark collects data for the batch duration,
> THEN starts the data-pipeline calculations.
>
> My batch size was 5 minutes, and the CPUs were all but dead for those 5
> minutes; then, when the 5 minutes were up, the CPUs would spike for a while,
> presumably doing the calculations.
>
> Is this presumption true, or is it running the data through the calculation
> pipeline before the batch is up?
>
> What could lead to the periodic CPU spike - I had a reduceByKey, so was it
> doing that only after all the batch data was in?
>
> Thanks
