Confused by groupByKey() and the default partitioner

2014-07-12 Thread Guanhua Yan
Hi:

I have trouble understanding the default partitioner (hash) in Spark. Suppose that an RDD with two partitions is created as follows:

    x = sc.parallelize([('a', 1), ('b', 4), ('a', 10), ('c', 7)], 2)

Does Spark partition x based on the hash of the key (e.g., 'a', 'b', 'c') by default?

(1) Assuming this is the case, if I then call groupByKey() on data that is already partitioned correctly, will the data be shuffled again?
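To make the hash-partitioning behavior concrete, here is a plain-Python sketch (not Spark code) of what a hash partitioner does: each key is assigned to partition hash(key) % numPartitions, so all pairs sharing a key land in the same partition. The function name hash_partition is illustrative, not a Spark API.

```python
def hash_partition(pairs, num_partitions):
    """Assign each (key, value) pair to hash(key) % num_partitions,
    mimicking the semantics of Spark's HashPartitioner."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        index = hash(key) % num_partitions  # non-negative in Python
        partitions[index].append((key, value))
    return partitions

data = [('a', 1), ('b', 4), ('a', 10), ('c', 7)]
parts = hash_partition(data, 2)
# Both ('a', 1) and ('a', 10) end up in the same partition,
# since they share the key 'a'.
```

Note that sc.parallelize itself just splits the input list into contiguous slices; it is key-oriented operations like groupByKey() that introduce hash partitioning.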

Re: Confused by groupByKey() and the default partitioner

2014-07-12 Thread Aaron Davidson
Yes, groupByKey() does partition by the hash of the key unless you specify a custom Partitioner.

(1) If you were to use groupByKey() when the data was already partitioned correctly, the data would indeed not be shuffled. Here is the associated code; you'll see that it simply checks whether the RDD's existing partitioner matches the one requested, and skips the shuffle when they are equal.
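The check Aaron describes can be sketched in plain Python. The HashPartitioner class and needs_shuffle helper below are illustrative stand-ins, not Spark's actual classes: the point is only that a shuffle happens when the current partitioner differs from the requested one, and equality of partitioners means the existing layout can be reused.

```python
class HashPartitioner:
    """Minimal stand-in for a hash partitioner: two partitioners are
    considered equal if they would place every key identically, which
    for hash partitioning means equal partition counts."""
    def __init__(self, num_partitions):
        self.num_partitions = num_partitions

    def __eq__(self, other):
        return (isinstance(other, HashPartitioner)
                and self.num_partitions == other.num_partitions)

def needs_shuffle(current_partitioner, requested_partitioner):
    # groupByKey-style operations shuffle only when the RDD's current
    # partitioner does not match the requested one.
    return current_partitioner != requested_partitioner

# An unpartitioned RDD (partitioner None) must be shuffled:
assert needs_shuffle(None, HashPartitioner(2))
# An RDD already hash-partitioned into 2 parts is reused as-is:
assert not needs_shuffle(HashPartitioner(2), HashPartitioner(2))
# A different partition count still forces a shuffle:
assert needs_shuffle(HashPartitioner(2), HashPartitioner(4))
```

This is why pre-partitioning with partitionBy() before repeated key-based operations can save repeated shuffles.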