I was running a proof of concept for my company with Spark Streaming, and the conclusion I came to is that Spark collects data for the full batch duration and only THEN starts the data-pipeline calculations.
My batch interval was 5 minutes. The CPUs were all but idle for those 5 minutes, and when the interval was up they would spike for a while, presumably doing the calculations. Is this presumption correct, or does Spark run the data through the calculation pipeline before the batch is up? What could cause the periodic CPU spike? I had a reduceByKey in the pipeline, so was it performed only after all the batch data had arrived? Thanks
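To make my presumption concrete, here is a toy stdlib-only sketch of the behavior I think I'm seeing (this is NOT real Spark code, and the keys/values are made up): records are buffered for the whole batch interval, and the reduceByKey-style aggregation runs only once the interval closes.

```python
from collections import defaultdict

def run_batch(buffered_records):
    """Simulate the work I presume Spark does only AFTER the batch
    interval ends: a reduceByKey-style sum over (key, value) pairs."""
    counts = defaultdict(int)
    for key, value in buffered_records:
        counts[key] += value
    return dict(counts)

# Records "collected" during the 5-minute window; in my mental model
# the CPUs are mostly idle while this buffer fills up.
batch = [("a", 1), ("b", 2), ("a", 3)]

# Only now would the CPUs spike: the whole batch goes through the
# calculation pipeline at once.
print(run_batch(batch))  # {'a': 4, 'b': 2}
```

If instead Spark were aggregating incrementally as records arrived, I'd expect the CPU load to be spread across the whole 5 minutes rather than spiking at the end.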