The map-side combine is not strictly necessary here: it cannot reduce the size
of the shuffled data by much (the key still has to be serialized alongside each
value), but it can reduce the number of key-value pairs, and potentially reduce
the number of operations later (repartition and groupby).
On Tue, Jul 14, 2015 at 7:11 PM, Matt Cheah mch...@palantir.com wrote:
Hi everyone,
I was examining the Pyspark implementation of groupByKey in rdd.py. I would
like to submit a patch improving Scala RDD’s groupByKey that has a similar
robustness against large groups, as Pyspark’s implementation has logic to
spill part of a single group to disk along the way.
Its implementation appears to do the following:
1. Combine and group-by-key per partition locally, potentially spilling
individual groups to disk
2. Shuffle the data explicitly using partitionBy
3. After the shuffle, do another local groupByKey to get the final result,
again potentially spilling individual groups to disk
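The three steps above can be sketched in plain Python (this is a simplified model, not PySpark's actual implementation: the disk spilling is omitted, and the partition count and hash partitioner are illustrative assumptions):

```python
from collections import defaultdict

def local_group(pairs):
    """Steps 1 and 3: group values by key within one partition.
    (In PySpark this may spill individual groups to disk; omitted here.)"""
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return list(groups.items())

def shuffle(partitions, num_partitions=2):
    """Step 2: redistribute pairs so each key lands in exactly one
    partition, analogous to partitionBy with a hash partitioner."""
    out = [[] for _ in range(num_partitions)]
    for part in partitions:
        for k, values in part:
            out[hash(k) % num_partitions].append((k, values))
    return out

def group_by_key(partitions):
    combined = [local_group(p) for p in partitions]  # step 1
    shuffled = shuffle(combined)                     # step 2
    # step 3: flatten the pre-combined value lists and group again
    return [local_group((k, v) for k, vs in p for v in vs)
            for p in shuffled]

data = [[("a", 1), ("b", 2)], [("a", 3), ("b", 4)]]
result = {k: sorted(vs) for part in group_by_key(data) for k, vs in part}
# result == {'a': [1, 3], 'b': [2, 4]}
```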
My question is: what specific benefit does the explicit map-side-combine step
(#1) provide here? I was under the impression that map-side-combine
for groupByKey was not optimal and is turned off in the Scala implementation
– Scala PairRDDFunctions.groupByKey calls combineByKey with
map-side-combine set to false. Is it something specific to how Pyspark can
potentially spill the individual groups to disk?
Thanks,
-Matt Cheah
P.S. Relevant Links:
https://issues.apache.org/jira/browse/SPARK-3074
https://github.com/apache/spark/pull/1977