You don't need to repartition your data just for the join. But if either side of the join is already partitioned, Spark will take advantage of that as part of join optimization.
Whether you should reduceByKey before the join really depends on your join logic. ReduceByKey will shuffle, and the following join could cause another shuffle, so I am not sure it is a smart approach.

Yong

________________________________
From: shyla deshpande <deshpandesh...@gmail.com>
Sent: Wednesday, March 29, 2017 12:33 PM
To: user
Subject: Re: Spark SQL, dataframe join questions.

On Tue, Mar 28, 2017 at 2:57 PM, shyla deshpande <deshpandesh...@gmail.com> wrote:

Following are my questions. Thank you.

1. When joining dataframes, is it a good idea to repartition on the key column that is used in the join, or is the optimizer too smart, so forget it?

2. In RDD joins, wherever possible we do reduceByKey before the join to avoid a big shuffle of data. Do we need to do anything similar with dataframe joins, or is the optimizer too smart, so forget it?
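To make the reduceByKey point concrete, here is a minimal toy model in plain Python (not Spark; all names and data are illustrative). It mimics reduceByKey and an inner join, showing that pre-aggregating shrinks the left side to one record per key before the join, which is the data that a real shuffle would otherwise have to move:

```python
# Toy model of reduceByKey-before-join, in plain Python (not Spark).
# All function names and sample data below are illustrative.
from collections import defaultdict

def reduce_by_key(pairs, func):
    """Mimic RDD.reduceByKey: combine all values that share a key."""
    acc = {}
    for k, v in pairs:
        acc[k] = func(acc[k], v) if k in acc else v
    return list(acc.items())

def inner_join(left, right):
    """Mimic RDD.join: inner join on key, yielding (k, (lv, rv)) pairs."""
    rights = defaultdict(list)
    for k, v in right:
        rights[k].append(v)
    return [(k, (lv, rv)) for k, lv in left for rv in rights[k]]

events = [("u1", 1), ("u1", 1), ("u2", 1), ("u2", 1), ("u2", 1)]
users = [("u1", "alice"), ("u2", "bob")]

# Joining the raw events pushes every event record through the join...
raw = inner_join(events, users)                  # 5 rows, one per event
# ...while reducing first leaves only one record per key to join.
counts = reduce_by_key(events, lambda a, b: a + b)
joined = inner_join(counts, users)               # 2 rows, one per user
```

The trade-off from the reply above still applies: the pre-aggregation is itself a shuffle, so whether this saves anything depends on how much the reduce shrinks the data and whether the join would have reused the resulting partitioning.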