If you perform a join with at least one equality predicate between the two tables, Spark SQL will automatically hash-partition the data on the join key and perform the join.
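For example, a minimal sketch (table and column names here are hypothetical) of an equi-join that Spark SQL would plan this way:

```scala
// Sketch: an equi-join in Spark SQL (hypothetical table/column names).
// Because the ON clause is an equality predicate, Spark will hash-partition
// both sides on customerId before joining -- no manual partitioning needed.
val joined = sqlContext.sql(
  """SELECT o.orderId, c.name
    |FROM orders o
    |JOIN customers c ON o.customerId = c.customerId""".stripMargin)
```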
If you are looking to pre-partition the data, that information is not yet propagated from the in-memory cached representation, so it won't help avoid an extra shuffle, but Kai (cc-ed) was hoping to add that feature.

On Mon, May 4, 2015 at 9:05 PM, Mani <man...@vt.edu> wrote:
> I am trying to distribute a table using a particular column, which is the
> key that I'll be using to perform join operations on the table. Is it
> possible to do this with Spark SQL?
> I checked the method partitionBy() for RDDs, but I am not sure how to
> specify which column is the key. Can anyone give an example?
>
> Thanks
> Mani
> Graduate Student, Department of Computer Science
> Virginia Tech
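To the original question: `partitionBy()` is defined on pair RDDs, so you first map each record into a `(key, value)` tuple, where the key is the column you want to partition on. A sketch, assuming a hypothetical `rowsRdd` whose elements carry a `customerId` field:

```scala
import org.apache.spark.HashPartitioner

// Sketch (hypothetical RDD and field names): key the RDD by the join
// column, then hash-partition so rows with the same key are co-located.
val byCustomer = rowsRdd
  .map(row => (row.customerId, row))       // build (key, value) pairs
  .partitionBy(new HashPartitioner(100))   // 100 partitions, hashed on key
  .cache()                                 // keep the partitioned layout
```

Two pair RDDs partitioned with the same partitioner can then be joined without an extra shuffle; as noted above, this does not currently carry over to Spark SQL's cached tables.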