Re: PySpark GroupByKey implementation question

Davies Liu Wed, 15 Jul 2015 10:02:07 -0700

I think we should start without map-side-combine for Scala, because
it's easier to OOM in the JVM than in Python (we don't have a hard limit
in Python yet).


On Wed, Jul 15, 2015 at 9:52 AM, Matt Cheah <mch...@palantir.com> wrote:
> Should we actually enable map-side-combine for groupByKey in Scala RDD as
> well, then? If we implement external-group-by, should we implement it with
> the map-side-combine semantics that PySpark uses?
>
> -Matt Cheah
>
> On 7/15/15, 8:21 AM, "Davies Liu" <dav...@databricks.com> wrote:
>
>>The map-side-combine is not that necessary, given that it cannot reduce
>>the size of the shuffled data much (we do need to serialize the key for
>>each value). It can, however, reduce the number of key-value pairs and
>>potentially reduce the number of operations later (repartition and
>>groupby).
>>
>>On Tue, Jul 14, 2015 at 7:11 PM, Matt Cheah <mch...@palantir.com> wrote:
>>> Hi everyone,
>>>
>>> I was examining the PySpark implementation of groupByKey in rdd.py. I
>>> would like to submit a patch improving Scala RDD's groupByKey so that it
>>> has similar robustness against large groups, as PySpark's implementation
>>> has logic to spill part of a single group to disk along the way.
>>>
>>> Its implementation appears to do the following:
>>>
>>> 1. Combine and group-by-key per partition locally, potentially spilling
>>>    individual groups to disk
>>> 2. Shuffle the data explicitly using partitionBy
>>> 3. After the shuffle, do another local groupByKey to get the final result,
>>>    again potentially spilling individual groups to disk
>>>
>>> My question is: what specific benefit does the explicit map-side-combine
>>> step (#1) provide here? I was under the impression that map-side-combine
>>> for groupByKey was not optimal and is turned off in the Scala
>>> implementation: PairRDDFunctions.groupByKey calls combineByKey with
>>> map-side-combine set to false. Is it something specific to how PySpark
>>> can potentially spill the individual groups to disk?
>>>
>>> Thanks,
>>>
>>> -Matt Cheah
>>>
>>> P.S. Relevant Links:
>>>
>>>
>>> https://issues.apache.org/jira/browse/SPARK-3074
>>>
>>> https://github.com/apache/spark/pull/1977
>>>
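
For readers following the thread, the three steps Matt lists can be
sketched in plain Python. This is only an illustration of the data flow,
not the code in rdd.py: it keeps everything in memory and leaves out the
spilling logic entirely, and the names local_group, shuffle, merge_groups
and group_by_key are hypothetical helpers made up for this sketch.

```python
from collections import defaultdict

def local_group(pairs):
    """Step 1: group values by key within a single partition."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return list(groups.items())

def shuffle(partitions, num_partitions):
    """Step 2: route every (key, partial_group) record by key hash."""
    out = [[] for _ in range(num_partitions)]
    for part in partitions:
        for key, values in part:
            out[hash(key) % num_partitions].append((key, values))
    return out

def merge_groups(pairs):
    """Step 3: merge the partial groups that arrived for each key."""
    merged = defaultdict(list)
    for key, values in pairs:
        merged[key].extend(values)
    return dict(merged)

def group_by_key(partitions, num_partitions=2):
    combined = [local_group(p) for p in partitions]   # map-side combine
    shuffled = shuffle(combined, num_partitions)      # explicit shuffle
    return [merge_groups(p) for p in shuffled]        # reduce-side group

# Two input partitions (assumed sample data).
partitions = [[("a", 1), ("b", 2), ("a", 3)], [("a", 4), ("b", 5)]]
result = {}
for part in group_by_key(partitions):
    result.update(part)
print(result)
```

Dropping step 1 (the Scala behaviour Matt describes) would mean shuffling
the raw (key, value) pairs and doing all of the grouping in step 3.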

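Davies's point about record count versus shuffle size can be illustrated
with a rough measurement. The sketch below assumes each record is pickled
on its own and that values (1 KB each here) are much larger than keys;
both are assumptions chosen for illustration, and PySpark's actual
serializers batch records, so real numbers will differ.

```python
import pickle

# 1000 records over 10 keys; every value is a distinct 1 KB bytes object.
records = [("k%d" % (i % 10), bytes([i % 256]) * 1024) for i in range(1000)]

# Map-side combine: one (key, [values]) record per key.
groups = {}
for key, value in records:
    groups.setdefault(key, []).append(value)
combined = list(groups.items())

raw_bytes = sum(len(pickle.dumps(r)) for r in records)
combined_bytes = sum(len(pickle.dumps(r)) for r in combined)

print(len(records), "records,", raw_bytes, "bytes without combine")
print(len(combined), "records,", combined_bytes, "bytes with combine")
```

Under these assumptions the number of records falls from 1000 to 10,
while the total serialized size shrinks only slightly, because the values
dominate the payload. With small values and long keys the saving in bytes
would be larger.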
