The map-side combine is not really necessary here: it cannot reduce the size
of the shuffled data much (the key still has to be serialized for each
value), but it does reduce the number of key-value pairs, which can in turn
reduce the amount of work done later (in the repartitioning and the
post-shuffle groupby).

On Tue, Jul 14, 2015 at 7:11 PM, Matt Cheah <mch...@palantir.com> wrote:
> Hi everyone,
>
> I was examining the Pyspark implementation of groupByKey in rdd.py. I would
> like to submit a patch giving Scala RDD’s groupByKey similar robustness
> against large groups, since Pyspark’s implementation has logic to spill
> parts of a single group to disk along the way.
>
> Its implementation appears to do the following (a rough Scala sketch
> follows the list):
>
> 1. Combine and group-by-key per partition locally, potentially spilling
> individual groups to disk
> 2. Shuffle the data explicitly using partitionBy
> 3. After the shuffle, do another local groupByKey to get the final result,
> again potentially spilling individual groups to disk
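>
> A minimal sketch of that structure in Scala (hypothetical name, and
> simplified: in-memory maps stand in for PySpark's spilling merger):
>
> import org.apache.spark.HashPartitioner
> import org.apache.spark.rdd.RDD
> import scala.collection.mutable
>
> def pySparkStyleGroupByKey(rdd: RDD[(String, Int)],
>                            numPartitions: Int): RDD[(String, List[Int])] = {
>   // 1. Group per partition locally (the real PySpark code can spill
>   // individual groups to disk at this point).
>   val locallyGrouped = rdd.mapPartitions { iter =>
>     val groups = mutable.Map.empty[String, List[Int]]
>     iter.foreach { case (k, v) => groups(k) = v :: groups.getOrElse(k, Nil) }
>     groups.iterator
>   }
>   // 2. Shuffle explicitly with partitionBy.
>   val shuffled = locallyGrouped.partitionBy(new HashPartitioner(numPartitions))
>   // 3. Merge the partial groups per key after the shuffle.
>   shuffled.mapPartitions { iter =>
>     val merged = mutable.Map.empty[String, List[Int]]
>     iter.foreach { case (k, vs) => merged(k) = vs ::: merged.getOrElse(k, Nil) }
>     merged.iterator
>   }
> }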
>
> My question is: what specific benefit does the explicit map-side-combine
> step (#1) provide here? I was under the impression that map-side combine
> for groupByKey was not optimal and is turned off in the Scala
> implementation – Scala PairRDDFunctions.groupByKey calls combineByKey with
> map-side-combine set to false. Is it something specific to how Pyspark can
> potentially spill the individual groups to disk?
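>
> For reference, the Scala implementation is roughly the following
> (paraphrased from the 1.x PairRDDFunctions source; the exact code may
> differ):
>
> def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = {
>   // The source notes that map-side combine does not reduce the amount of
>   // data shuffled here and forces all map-side records into a hash table.
>   val createCombiner = (v: V) => CompactBuffer(v)
>   val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v
>   val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2
>   val bufs = combineByKey[CompactBuffer[V]](
>     createCombiner, mergeValue, mergeCombiners, partitioner,
>     mapSideCombine = false)
>   bufs.asInstanceOf[RDD[(K, Iterable[V])]]
> }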
>
> Thanks,
>
> -Matt Cheah
>
> P.S. Relevant Links:
>
> https://issues.apache.org/jira/browse/SPARK-3074
> https://github.com/apache/spark/pull/1977
>
