Hi everyone,

I was examining the PySpark implementation of groupByKey in rdd.py. I
would like to submit a patch giving Scala RDD's groupByKey similar
robustness against large groups, since PySpark's implementation has
logic to spill part of a single group to disk along the way.

Its implementation appears to do the following (sketched in code below):
1. Combine and group-by-key per partition locally, potentially spilling
individual groups to disk
2. Shuffle the data explicitly using partitionBy
3. After the shuffle, do another local groupByKey to get the final result,
again potentially spilling individual groups to disk
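
For concreteness, here is a minimal PySpark sketch of that three-step
shape. The combine_locally and group_by_key_locally functions are
simplified in-memory stand-ins for the spill-capable helpers in rdd.py,
not the actual names or code:

    from pyspark import SparkContext

    sc = SparkContext(appName="groupByKeySketch")

    def combine_locally(items):
        # Stand-in for step 1: combine and group within a partition. The
        # real helper in rdd.py can spill oversized individual groups to
        # disk as it builds them.
        groups = {}
        for k, v in items:
            groups.setdefault(k, []).append(v)
        return iter(groups.items())

    def group_by_key_locally(items):
        # Stand-in for step 3: merge the per-partition lists that arrive
        # after the shuffle; the real helper can again spill to disk.
        merged = {}
        for k, vs in items:
            merged.setdefault(k, []).extend(vs)
        return iter(merged.items())

    pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)], 2)

    result = (pairs
              .mapPartitions(combine_locally)        # 1. local combine/group
              .partitionBy(2)                        # 2. explicit shuffle
              .mapPartitions(group_by_key_locally))  # 3. final local group

    print(sorted(result.collect()))  # [('a', [1, 3]), ('b', [2])]
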
My question is: what specific benefit does the explicit map-side-combine
step (#1) provide here? I was under the impression that map-side combine
for groupByKey was not worthwhile and is turned off in the Scala
implementation: Scala's PairRDDFunctions.groupByKey calls combineByKey
with map-side-combine set to false. Is it something specific to how
PySpark can spill the individual groups to disk?
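
For reference, the same grouping can be expressed through combineByKey
in PySpark as well; this is a rough sketch of the combiner functions,
reusing the pairs RDD from the sketch above (in Scala the analogous call
also passes mapSideCombine = false, a flag which PySpark's combineByKey
does not expose as far as I can tell):

    # groupByKey written via combineByKey with list-append combiners,
    # loosely mirroring the buffer-based combiners in Scala's
    # PairRDDFunctions.groupByKey.

    def create_combiner(v):
        return [v]

    def merge_value(buf, v):
        buf.append(v)
        return buf

    def merge_combiners(buf1, buf2):
        buf1.extend(buf2)
        return buf1

    grouped = pairs.combineByKey(create_combiner, merge_value,
                                 merge_combiners)
    print(sorted(grouped.collect()))  # [('a', [1, 3]), ('b', [2])]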

Thanks,

-Matt Cheah

P.S. Relevant Links:

https://issues.apache.org/jira/browse/SPARK-3074
https://github.com/apache/spark/pull/1977


