1) You have a receiver thread. That thread might use a lot of CPU, or not,
depending on how you implement the thread in onStart.

2) Every 5 minutes, Spark will submit a job that processes
every RDD which was created (i.e. using the store() call) in the
receiver. That job runs asynchronously to the receiver, which
is still working to produce new RDDs for the next batch (rough sketch below).
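
In case a concrete picture helps, here's a rough sketch of both pieces.
The names (DummySource, ReceiverCpuDemo), the 50 ms sleep, and local[2]
are placeholders I made up for illustration, not anything from your job:

// Sketch only -- illustrative names, not the code from this thread.
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._  // pair functions (reduceByKey) on older releases
import org.apache.spark.streaming.receiver.Receiver

// (1) The receiver: onStart just spins up its own thread; how much CPU
// that thread burns depends entirely on what the loop in receive() does.
class DummySource(intervalMs: Long)
    extends Receiver[String](StorageLevel.MEMORY_ONLY) {

  def onStart(): Unit = {
    new Thread("dummy-receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  def onStop(): Unit = {}  // the loop below checks isStopped() and exits on its own

  private def receive(): Unit = {
    while (!isStopped()) {
      // hands a record to Spark; it lands in the blocks of the current batch
      store("key " + System.currentTimeMillis() % 10)
      Thread.sleep(intervalMs)
    }
  }
}

// (2) The driver: with a 300-second batch interval, every 5 minutes Spark
// submits a job over whatever was store()d during that window, and the
// reduceByKey runs as part of that job -- asynchronously to the receiver,
// which keeps ingesting for the next batch.
object ReceiverCpuDemo {
  def main(args: Array[String]): Unit = {
    // local[2] or more: the receiver pins one core, the batch jobs need the rest
    val conf = new SparkConf().setMaster("local[2]").setAppName("receiver-cpu-demo")
    val ssc  = new StreamingContext(conf, Seconds(300))

    ssc.receiverStream(new DummySource(intervalMs = 50))
       .map(rec => (rec, 1))
       .reduceByKey(_ + _)
       .print()

    ssc.start()
    ssc.awaitTermination()
  }
}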


So, maybe you're monitoring the CPU only on the
Spark workers that are running the batch jobs, and not
on the Spark worker that is doing the RDD ingestion?

On Thu, Nov 13, 2014 at 10:35 AM, Michael Campbell <
michael.campb...@gmail.com> wrote:

> I was running a proof of concept for my company with spark streaming, and
> the conclusion I came to is that spark collects data for the
> batch-duration, THEN starts the data-pipeline calculations.
>
> My batch size was 5 minutes, and the CPU was all but dead for 5; then when
> the 5 minutes were up the CPUs would spike for a while, presumably doing
> the calculations.
>
> Is this presumption true, or is it running the data through the
> calculation pipeline before the batch is up?
>
> What could lead to the periodic CPU spike - I had a reduceByKey, so was it
> doing that only after all the batch data was in?
>
> Thanks
>



-- 
jay vyas
