Re: PySpark GroupByKey implementation question

2015-07-15 Thread Davies Liu
The map-side combine is not strictly necessary: it cannot reduce the size of
the data for shuffling by much (the key does need to be serialized for each
value), but it can reduce the number of key-value pairs, and potentially
reduce the number of operations later (repartition and groupby).
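To make the trade-off concrete, here is a small standalone illustration (my
own sketch in plain Python, not anything from rdd.py): grouping keeps every
value, so the combined form serializes to roughly the same number of bytes,
but it contains far fewer records, which is what the later repartition and
groupby steps care about.

    import pickle

    partition = [("a", 1), ("a", 2), ("a", 3), ("b", 4), ("b", 5)]

    # Without map-side combine: one shuffled record per value.
    no_combine = partition

    # With map-side combine: one shuffled record per distinct key,
    # but all of the values are still carried along.
    combined = {}
    for k, v in partition:
        combined.setdefault(k, []).append(v)
    with_combine = list(combined.items())

    print(len(no_combine), len(with_combine))    # 5 records vs. 2
    print(len(pickle.dumps(no_combine)),
          len(pickle.dumps(with_combine)))       # byte sizes stay close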

On Tue, Jul 14, 2015 at 7:11 PM, Matt Cheah mch...@palantir.com wrote:
 Hi everyone,

 I was examining the PySpark implementation of groupByKey in rdd.py. I would
 like to submit a patch giving Scala RDD’s groupByKey similar robustness
 against large groups, since PySpark’s implementation has logic to spill part
 of a single group to disk along the way.

 Its implementation appears to do the following (a rough sketch follows the
 list):

 1. Combine and group-by-key per partition locally, potentially spilling
 individual groups to disk
 2. Shuffle the data explicitly using partitionBy
 3. After the shuffle, do another local groupByKey to get the final result,
 again potentially spilling individual groups to disk
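 In outline, and ignoring the spilling machinery, a minimal sketch of those
 three steps against the ordinary RDD API might look like this (an
 illustration only, not the actual rdd.py code, which merges through an
 external, disk-backed merger rather than a plain dict):

     def local_group(iterator):
         # Steps 1 and 3 both do this in-memory grouping; the real code
         # can spill these per-key lists to disk when they grow too large.
         groups = {}
         for k, v in iterator:
             groups.setdefault(k, []).append(v)
         return iter(groups.items())

     def naive_group_by_key(rdd, num_partitions):
         # rdd is assumed to be an RDD of key-value pairs
         combined = rdd.mapPartitions(local_group)        # step 1
         shuffled = combined.partitionBy(num_partitions)  # step 2
         def merge(iterator):                             # step 3
             merged = {}
             for k, vs in iterator:
                 merged.setdefault(k, []).extend(vs)
             return iter(merged.items())
         return shuffled.mapPartitions(merge)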

 My question is: what benefit does the explicit map-side-combine step (#1)
 specifically provide here? I was under the impression that map-side combine
 for groupByKey was not optimal and is turned off in the Scala implementation
 – Scala PairRDDFunctions.groupByKey calls combineByKey with
 map-side-combine set to false. Is it something specific to how PySpark can
 potentially spill the individual groups to disk?
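 (For reference, a semantic sketch only: groupByKey behaves like the
 combineByKey call below, where pairs stands for any RDD of key-value pairs.
 The Scala version builds on the same primitive but additionally passes
 mapSideCombine = false, a flag PySpark's combineByKey does not appear to
 expose.)

     grouped = pairs.combineByKey(
         lambda v: [v],             # createCombiner: start a group
         lambda acc, v: acc + [v],  # mergeValue: append within a partition
         lambda a, b: a + b,        # mergeCombiners: concatenate after shuffle
     )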

 Thanks,

 -Matt Cheah

 P.S. Relevant Links:

 https://issues.apache.org/jira/browse/SPARK-3074
 https://github.com/apache/spark/pull/1977




