Hi Cheng, 

Are you saying that by setting up the lineage
schemaRdd.keyBy(_.getString(1)).partitionBy(new HashPartitioner(n)).values.applySchema(schema)
Spark SQL will know that a SQL "GROUP BY" on CustomerCode will not have to
shuffle?

But the prepared RDD will already have been shuffled, so we pay an upfront cost
in exchange for cheaper future groupings (assuming we cache it, I suppose).
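
To check my understanding, the end-to-end flow would presumably look something
like the sketch below. This is only a sketch, not working code from the thread;
my assumptions are the exact Orders projection (which puts CustomerCode at
ordinal 0 rather than the ordinal 1 in your example), the sqlContext handle, the
"PreparedOrders" table name, and reading n from spark.sql.shuffle.partitions
with a default of 200:

import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._   // pair-RDD implicits (pre-1.3)

// Pay the shuffle cost once, up front, while preparing the data.
val schemaRdd = sqlContext.sql("SELECT CustomerCode, Cost FROM Orders")
val schema = schemaRdd.schema

// Match the partition count Spark SQL uses for its own shuffles.
val n = sqlContext.getConf("spark.sql.shuffle.partitions", "200").toInt

// CustomerCode is at ordinal 0 in this particular projection.
val rows = schemaRdd
  .keyBy(_.getString(0))
  .partitionBy(new HashPartitioner(n))
  .values

// Reapply the schema, cache, and register so later queries reuse the layout.
val prepared = sqlContext.applySchema(rows, schema)
prepared.cache()
prepared.registerTempTable("PreparedOrders")

// Later aggregations on CustomerCode run against the cached, hash-partitioned
// data, so at the very least the network IO should be reduced.
sqlContext.sql(
  "SELECT CustomerCode, SUM(Cost) FROM PreparedOrders GROUP BY CustomerCode"
).collect()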

Mick

> On 20 Jan 2015, at 20:44, Cheng Lian <lian.cs....@gmail.com> wrote:
> 
> First of all, even if the underlying dataset is partitioned as expected, a 
> shuffle can't be avoided, because Spark SQL knows nothing about the 
> underlying data distribution. However, this does reduce network IO.
> 
> You can prepare your data like this (say CustomerCode is a string field with 
> ordinal 1):
> 
> val schemaRdd = sql(...)
> val schema = schemaRdd.schema
> val prepared = schemaRdd
>   .keyBy(_.getString(1))
>   .partitionBy(new HashPartitioner(n))
>   .values
>   .applySchema(schema)
> 
> n should be equal to spark.sql.shuffle.partitions.
> 
> Cheng
> 
> On 1/19/15 7:44 AM, Mick Davies wrote:
> 
>> Is it possible to use a HashPartitioner or something similar to distribute a
>> SchemaRDD's data by the hash of a particular column or set of columns?
>> 
>> Having done this, I would then hope that GROUP BY could avoid a shuffle.
>> 
>> E.g. set up a HashPartitioner on the CustomerCode field so that 
>> 
>> SELECT CustomerCode, SUM(Cost)
>> FROM Orders
>> GROUP BY CustomerCode
>> 
>> would not need to shuffle.
>> 
>> Cheers 
>> Mick
>> 
> 
> 
