I think we should start without map-side-combine for Scala, because it's easier to hit an OOM in the JVM than in Python (we don't have a hard limit in Python yet).
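
To make the trade-off concrete, here is a toy pure-Python simulation of a group-by-key with and without a map-side combine. It is only a sketch, not Spark or PySpark code: `local_group`, `group_by_key` and the stats fields are names made up for this illustration, partitions are plain in-memory lists, and nothing is spilled to disk.

```python
from collections import defaultdict


def local_group(pairs):
    """Group (key, value) pairs into {key: [values]} within one partition."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def group_by_key(partitions, num_reducers, map_side_combine):
    """Toy group-by-key over in-memory lists; returns (result, stats)."""
    shuffled = [[] for _ in range(num_reducers)]
    records = 0            # key-value records crossing the shuffle
    values = 0             # values carried by those records
    largest_map_group = 0  # largest group buffered on the map side

    for part in partitions:
        if map_side_combine:
            # Combine per partition: one record per distinct key,
            # at the cost of buffering each whole group on the map side.
            out = list(local_group(part).items())
            largest_map_group = max(
                [largest_map_group] + [len(vals) for _, vals in out]
            )
        else:
            # No combine: every value travels as its own one-element record.
            out = [(key, [value]) for key, value in part]
        for key, vals in out:
            # Route each record to a reducer by key hash.
            shuffled[hash(key) % num_reducers].append((key, vals))
            records += 1
            values += len(vals)

    # After the shuffle, merge the records that arrived at each reducer.
    result = {}
    for bucket in shuffled:
        for key, vals in bucket:
            result.setdefault(key, []).extend(vals)

    stats = {
        "records": records,
        "values": values,
        "largest_map_group": largest_map_group,
    }
    return result, stats


partitions = [
    [(1, "a"), (1, "b"), (2, "c")],
    [(1, "d"), (2, "e"), (2, "f")],
]
combined, combined_stats = group_by_key(partitions, 2, map_side_combine=True)
plain, plain_stats = group_by_key(partitions, 2, map_side_combine=False)
print(combined == plain)
print(combined_stats)
print(plain_stats)
```

On this input both variants produce the same groups and ship the same 6 values. The combined variant sends 4 records instead of 6, but it has to hold each per-partition group in memory on the map side, which is the part that worries me for the JVM.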

On Wed, Jul 15, 2015 at 9:52 AM, Matt Cheah <mch...@palantir.com> wrote:
> Should we actually enable map-side-combine for groupByKey in Scala RDD as
> well, then? If we implement external-group-by, should we implement it with
> the map-side-combine semantics that PySpark uses?
>
> -Matt Cheah
>
> On 7/15/15, 8:21 AM, "Davies Liu" <dav...@databricks.com> wrote:
>
>> The map-side-combine is not that necessary, given that it cannot reduce
>> the size of the data for shuffling much (we do need to serialize the key
>> for each value), but it can reduce the number of key-value pairs, and
>> potentially reduce the number of operations later (repartition and
>> groupby).
>>
>> On Tue, Jul 14, 2015 at 7:11 PM, Matt Cheah <mch...@palantir.com> wrote:
>>> Hi everyone,
>>>
>>> I was examining the PySpark implementation of groupByKey in rdd.py. I
>>> would like to submit a patch that gives Scala RDD's groupByKey similar
>>> robustness against large groups, since PySpark's implementation has
>>> logic to spill part of a single group to disk along the way.
>>>
>>> Its implementation appears to do the following:
>>>
>>> 1. Combine and group-by-key per partition locally, potentially spilling
>>>    individual groups to disk
>>> 2. Shuffle the data explicitly using partitionBy
>>> 3. After the shuffle, do another local groupByKey to get the final
>>>    result, again potentially spilling individual groups to disk
>>>
>>> My question is: what specifically is the benefit of the explicit
>>> map-side-combine step (#1) here? I was under the impression that
>>> map-side-combine for groupByKey was not optimal and is turned off in the
>>> Scala implementation: Scala's PairRDDFunctions.groupByKey calls
>>> combineByKey with map-side-combine set to false. Is it something
>>> specific to how PySpark can potentially spill the individual groups to
>>> disk?
>>>
>>> Thanks,
>>>
>>> -Matt Cheah
>>>
>>> P.S. Relevant Links:
>>>
>>> https://issues.apache.org/jira/browse/SPARK-3074
>>> https://github.com/apache/spark/pull/1977