On Wed, Jan 14, 2015 at 4:53 AM, Tobias Pfeiffer <t...@preferred.jp> wrote:
>> Now I don't know (yet) if all of the functions I want to compute can be
>> expressed in this way and I was wondering about *how much* more expensive we
>> are talking about.
>
>
> OK, it seems like even on a local machine (with no network overhead), the
> groupByKey version is about 5 times slower than any of the other
> (reduceByKey, combineByKey etc.) functions...

Even without network overhead, you're still paying the cost of setting
up the shuffle and of serializing and deserializing every record as it
passes between stages.

Yes, the overhead is that groupByKey has to move all the data around.
If an aggregation is really what's needed, most of the reducing /
combining can happen locally, map-side, before the result ever goes
anywhere else on the network.
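
To make that concrete, here is a minimal sketch of the two approaches
(the toy RDD, the key count, and the local[*] master are placeholders
of mine, not anything from the thread):

    import org.apache.spark.{SparkConf, SparkContext}

    object ReduceVsGroup {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("reduce-vs-group").setMaster("local[*]")
        val sc = new SparkContext(conf)

        // A million records spread over a thousand keys.
        val pairs = sc.parallelize(1 to 1000000).map(i => (i % 1000, 1L))

        // reduceByKey combines map-side: each partition ships at most one
        // (key, partialSum) pair per key into the shuffle.
        val viaReduce = pairs.reduceByKey(_ + _)

        // groupByKey ships every single (key, 1L) record through the
        // shuffle before any summing can happen.
        val viaGroup = pairs.groupByKey().mapValues(_.sum)

        assert(viaReduce.collectAsMap() == viaGroup.collectAsMap())
        sc.stop()
      }
    }

Both jobs produce the same per-key sums; the difference is how much
data crosses the shuffle boundary to get there.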

There's also the issue of keys with very large value sets: groupByKey
has to materialize all of a key's values in memory on one executor, so
a single hot key can blow up a task.
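
Even something like a per-key average, which looks like it needs the
full group, can be done with a bounded accumulator instead. A sketch
reusing the pairs RDD from above (my example, not from the thread):

    // aggregateByKey keeps only a running (sum, count) per key, so no
    // key's full value list is ever held in memory at once.
    val avgPerKey = pairs.aggregateByKey((0L, 0L))(
      (acc, v) => (acc._1 + v, acc._2 + 1L),   // fold one value into the partition-local accumulator
      (a, b)   => (a._1 + b._1, a._2 + b._2)   // merge accumulators from different partitions
    ).mapValues { case (sum, count) => sum.toDouble / count }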
