Date: Sat, 12 Jul 2014 16:32:22 -0700
To: user@spark.apache.org
Subject: Re: Confused by groupByKey() and the default partitioner
Hi:
I have trouble understanding the default partitioner (hash) in Spark.
Suppose that an RDD with two partitions is created as follows:
x = sc.parallelize([('a', 1), ('b', 4), ('a', 10), ('c', 7)], 2)
Does Spark partition x based on the hash of the key (e.g., 'a', 'b', 'c') by
default?
(1) Assuming this is the case, if I then call groupByKey() on x, will the
data be shuffled again even though it is already partitioned by key?
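To make the question concrete, here is a plain-Python sketch (not actual Spark code) of how a hash partitioner assigns keys to partitions; Spark's HashPartitioner does the equivalent with the key's hashCode modulo the partition count, and Python's built-in hash() stands in for that here:

```python
def hash_partition(key, num_partitions):
    # Assign a key to a partition by hashing it, the same idea
    # as Spark's default HashPartitioner.
    return hash(key) % num_partitions

pairs = [("a", 1), ("b", 4), ("a", 10), ("c", 7)]
partitions = {0: [], 1: []}
for k, v in pairs:
    partitions[hash_partition(k, 2)].append((k, v))

# Every pair with the same key lands in the same partition,
# e.g. ("a", 1) and ("a", 10) always end up together.
```

Which partition each key gets depends on the hash values, but the property that matters for groupByKey() is that all pairs sharing a key map to the same partition.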
Yes, groupByKey() does partition by the hash of the key unless you specify
a custom Partitioner.
(1) If you were to use groupByKey() when the data was already partitioned
correctly, the data would indeed not be shuffled. Here is the associated
code; you'll see that it simply checks that the RDD's existing partitioner
matches the partitioner requested for the operation, and if they match it
runs the aggregation over the existing partitions without a shuffle.
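The check described above can be sketched in plain Python. This is an illustrative model, not the real PySpark API: the class and function names below are hypothetical, and they only mirror the Scala-side logic where combineByKey-style operations compare the RDD's current partitioner against the requested one before deciding whether to shuffle:

```python
class HashPartitioner:
    """Toy stand-in for Spark's HashPartitioner (illustrative only)."""
    def __init__(self, num_partitions):
        self.num_partitions = num_partitions

    def __eq__(self, other):
        # Two hash partitioners are equivalent when they produce the
        # same key-to-partition mapping, i.e. same partition count.
        return (isinstance(other, HashPartitioner)
                and self.num_partitions == other.num_partitions)

def group_by_key_plan(existing_partitioner, requested_partitioner):
    # Mirrors the decision Spark makes: if the data is already laid
    # out the way the operation needs, skip the shuffle entirely.
    if existing_partitioner == requested_partitioner:
        return "no shuffle: map over existing partitions"
    return "shuffle: repartition by key, then group"

# An RDD already hash-partitioned into 2 partitions needs no shuffle
# when grouped with the same partitioner, but does when the requested
# partition count differs.
print(group_by_key_plan(HashPartitioner(2), HashPartitioner(2)))
print(group_by_key_plan(HashPartitioner(2), HashPartitioner(4)))
```

The key design point is that partitioner equality, not object identity, drives the decision, which is why pre-partitioning an RDD with the same partitioner (and partition count) that a later operation will request lets that operation run shuffle-free.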