Is it possible to use a HashPartioner or something similar to distribute a SchemaRDDs data by the hash of a particular column or set of columns.
Having done this I would then hope that GROUP BY could avoid shuffle E.g. set up a HashPartioner on CustomerCode field so that SELECT CustomerCode, SUM(Cost) FROM Orders GROUP BY CustomerCode would not need to shuffle. Cheers Mick -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SQL-Using-HashPartitioner-to-distribute-by-column-tp21237.html Sent from the Apache Spark User List mailing list archive at Nabble.com. --------------------------------------------------------------------- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org