Hi everyone, I was examining the PySpark implementation of groupByKey in rdd.py. I would like to submit a patch giving Scala RDD's groupByKey similar robustness against large groups, since PySpark's implementation has logic to spill parts of a single group to disk along the way.
Its implementation appears to do the following:

1. Combine and group by key per partition locally, potentially spilling individual groups to disk.
2. Shuffle the data explicitly using partitionBy.
3. After the shuffle, do another local groupByKey to get the final result, again potentially spilling individual groups to disk.

My question is: what benefit does the explicit map-side-combine step (#1) provide here? I was under the impression that map-side combine was not optimal for groupByKey and is turned off in the Scala implementation: Scala's PairRDDFunctions.groupByKey calls combineByKey with map-side combine set to false. Is it something specific to how PySpark can potentially spill the individual groups to disk?

Thanks,

-Matt Cheah

P.S. Relevant links:
https://issues.apache.org/jira/browse/SPARK-3074
https://github.com/apache/spark/pull/1977
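For reference, here is how I understand the three-step flow: a simplified, pure-Python sketch, not the actual rdd.py code (spilling to disk and the external merger are omitted; the partition layout and helper names are my own):

```python
from collections import defaultdict

def local_group(records):
    """Step 1: group values by key within a single partition
    (the map-side-combine step in question)."""
    groups = defaultdict(list)
    for k, v in records:
        groups[k].append(v)
    return list(groups.items())

def shuffle(partitions, num_partitions):
    """Step 2: redistribute (key, values) pairs so each key lands in
    exactly one output partition (a stand-in for partitionBy)."""
    out = [[] for _ in range(num_partitions)]
    for part in partitions:
        for k, vs in part:
            out[hash(k) % num_partitions].append((k, vs))
    return out

def group_by_key(partitions, num_partitions=2):
    # Step 1: map-side grouping per input partition.
    combined = [local_group(p) for p in partitions]
    # Step 2: explicit shuffle.
    shuffled = shuffle(combined, num_partitions)
    # Step 3: reduce-side merge of the pre-grouped value lists.
    result = {}
    for part in shuffled:
        merged = defaultdict(list)
        for k, vs in part:
            merged[k].extend(vs)
        result.update(merged)
    return result

# Example: two input partitions, key "a" split across both.
print(group_by_key([[("a", 1), ("b", 2)], [("a", 3)]]))
# {"a": [1, 3], "b": [2]}
```

As sketched, step 1 only shrinks the shuffled data when keys repeat within a partition; with mostly distinct keys it just adds a pass, which is presumably why the Scala side skips it.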