Help with Spark SQL Hash Distribution

2015-05-04 Thread Mani
I am trying to distribute a table by a particular column, the key that I'll 
be using to perform join operations on the table. Is it possible to do this 
with Spark SQL?
I looked at the partitionBy() method for RDDs, but I'm not sure how to specify 
which column is the key. Can anyone give an example?

Thanks
Mani
Graduate Student, Department of Computer Science
Virginia Tech
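
For the RDD route the question mentions, a minimal sketch (the record type, 
column name, and partition count below are illustrative, not from the thread; 
it assumes a running SparkContext):

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

case class Record(id: Int, value: String)   // hypothetical row type

// partitionBy() takes a Partitioner, not a column name: it hashes on the
// first element of each tuple, so keyBy chooses which "column" is the key.
def partitionOnId(rows: RDD[Record]): RDD[(Int, Record)] =
  rows.keyBy(_.id)                          // -> RDD[(Int, Record)]
      .partitionBy(new HashPartitioner(100))
      .cache()                              // keep the layout for reuse
```

Caching after partitionBy matters: without it, each later action recomputes 
(and reshuffles) the RDD, losing the benefit of the pre-partitioning.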


Re: Help with Spark SQL Hash Distribution

2015-05-04 Thread Michael Armbrust
If you do a join with at least one equality relationship between the two
tables, Spark SQL will automatically hash partition the data and perform
the join.

If you are looking to pre-partition the data, that information is not yet
propagated from the in-memory cached representation, so it won't help you
avoid an extra shuffle, but Kai (cc-ed) was hoping to add that feature.
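
A minimal sketch of the first point (table and column names are illustrative): 
any join with an equality predicate is enough for Spark SQL to hash-partition 
both sides on the join key and perform the join, with no manual partitioning:

```scala
// Spark 1.x style, assuming a SQLContext named sqlContext and registered
// temp tables tableA / tableB (all names here are hypothetical).
val joined = sqlContext.sql(
  "SELECT a.*, b.other FROM tableA a JOIN tableB b ON a.key = b.key")

// Equivalent DataFrame API form: the === equality condition is what
// lets the planner choose a hash-partitioned shuffle join.
val joined2 = tableA.join(tableB, tableA("key") === tableB("key"))
```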

On Mon, May 4, 2015 at 9:05 PM, Mani man...@vt.edu wrote:

 I am trying to distribute a table using a particular column which is the
 key that I’ll be using to perform join operations on the table. Is it
 possible to do this with Spark SQL?
 I checked the method partitionBy() for rdds. But not sure how to specify
 which column is the key? Can anyone give an example?

 Thanks
 Mani
 Graduate Student, Department of Computer Science
 Virginia Tech