Hint: print() just gives a sample of what is in the data, and does not
force processing of all the data (only the first partition of the RDD is
computed to get the 10 items). count() actually processes all the data.
This is all due to lazy evaluation: if you don't need to use all the data,
don't compute all the data :)
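As a rough analogy in plain Python (generators, not actual Spark code), this is what laziness looks like: a print()-style action pulls only the first few records, while a count()-style action forces every record through the pipeline.

```python
import itertools

# Plain-Python analogy for Spark's lazy evaluation (not Spark itself).
# `processed` tracks how many records were actually computed.
processed = []

def transform(records):
    # Lazy: nothing in this loop runs until a consumer pulls values.
    for r in records:
        processed.append(r)  # record that this element was computed
        yield r * 2

data = range(1000)

# "print()-like" action: take only the first 10 items.
first_ten = list(itertools.islice(transform(data), 10))
print(len(processed))  # 10 -- only 10 records were ever computed

# "count()-like" action: consume everything, forcing all the work.
processed.clear()
total = sum(1 for _ in transform(data))
print(total, len(processed))  # 1000 1000 -- every record was computed
```

This is why your pipeline feels fast under print (the work for the other ~600,000 records simply never happens) and slow under count (all of it happens).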

HTH

TD

On Thu, Mar 5, 2015 at 3:10 PM, eleroy <ele...@msn.com> wrote:

> Hello,
>
> Getting started with Spark.
> Got JavaNetworkWordCount working on a 3-node cluster, with netcat on port
> 9999 running an infinite loop printing random numbers 0-100.
>
> With a batch duration of 1 sec, I do see a list of (word, count) values
> every second. The list is limited to 10 values (as per the docs).
>
> The count is ~6,000 per number. Since my input is random numbers from 0 to
> 100 and I count ~6,000 for each, the distribution being homogeneous, that
> would mean about 600,000 values are being ingested.
> I switched to using a constant number, and then I'm seeing between 200,000
> and 2,000,000 counts, but the console response is erratic: it's not 1 sec
> anymore, it's sometimes 2 sec, sometimes more, and sometimes much faster...
>
> I am looking to do 1-to-1 processing (one value outputs one result) so I
> replaced the flatMap function with a map function, and do my calculation.
>
> Now I'd like to know how many events I was able to process, but it's not
> clear at all:
> if I use print, it's fast again (1 sec) but I only see the first 10
> results. I tried adding a counter... and realized the counter only seems
> to increment by 11 each time.
>
> This is very confusing... It looks like the counter is only incremented
> for the elements touched by the print statement... so does that mean the
> other values are not even calculated until requested?
>
> If I use .count() on the output RDD, then I do see a realistic count, but
> then it doesn't take 1 sec anymore: it's more like 4 to 5 sec to get
> 600,000 - 1,000,000 events counted.
>
> I'm not sure where to go from here or how to benchmark the time it takes
> to actually process the events...
>
> Any hint or useful link would be appreciated.
> Thanks for your help.
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Duration-1s-not-matching-reality-tp21938.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
