I was running a proof of concept for my company with Spark Streaming, and
the conclusion I came to is that Spark collects data for the whole
batch duration, THEN starts the data-pipeline calculations.

My batch size was 5 minutes, and the CPUs were all but idle for those 5
minutes; then, when the 5 minutes were up, the CPUs would spike for a while,
presumably doing the calculations.

Is this presumption true, or does it run the data through the calculation
pipeline before the batch is up?

What could lead to the periodic CPU spike? I had a reduceByKey in the
pipeline, so was it running that only after all the batch data was in?
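To make the question concrete, here is a toy sketch (plain Python, not Spark — the function name and batching by record count are my own illustration) of the behavior I think I'm seeing: records are only buffered while the batch interval is open, and the reduceByKey-style aggregation fires all at once when the batch closes.

```python
from collections import defaultdict

def micro_batch_run(records, batch_size):
    """Toy model of the suspected micro-batch behavior: buffer records
    until a batch is complete, then run the aggregation in one burst."""
    batches = [records[i:i + batch_size]
               for i in range(0, len(records), batch_size)]
    results = []
    for batch in batches:
        # While the batch "fills up", nothing is computed...
        # ...then the pipeline (here, a reduceByKey by summation) runs at once.
        counts = defaultdict(int)
        for key, value in batch:
            counts[key] += value
        results.append(dict(counts))
    return results
```

If Spark instead pipelined the per-record work as data arrived, I would expect the CPU load to be spread across the interval rather than spiking at the end.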

Thanks
