Yes, groupByKey() does partition by the hash of the key unless you specify
a custom Partitioner.

(1) If you were to use groupByKey() when the data was already partitioned
correctly, the data would indeed not be shuffled. Here is the associated
code, you'll see that it simply checks that the Partitioner the groupBy()
is looking for is equal to the Partitioner of the pre-existing RDD:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L89

By the way, I should warn you that groupByKey() is not a recommended
operation if you can avoid it, as it has non-obvious performance issues
when running with serious data.


On Sat, Jul 12, 2014 at 12:20 PM, Guanhua Yan <gh...@lanl.gov> wrote:

> Hi:
>
> I have trouble understanding the default partitioner (hash) in Spark.
> Suppose that an RDD with two partitions is created as follows:
>
> x = sc.parallelize([("a", 1), ("b", 4), ("a", 10), ("c", 7)], 2)
>
> Does spark partition x based on the hash of the key (e.g., "a", "b", "c") by 
> default?
>
> (1) Assuming this is correct, if I further use the groupByKey primitive, 
> x.groupByKey(), all the records sharing the same key should be located in the 
> same partition. Then it's not necessary to shuffle the data records around, 
> as all the grouping operations can be done locally.
>
> (2) If it's not true, how could I specify a partitioner simply based on the 
> hashing of the key (in Python)?
>
> Thank you,
>
> - Guanhua
>
>

Reply via email to