You don't need to repartition your data just for the purpose of a join. But if
either side of the join is already partitioned, Spark will take advantage of
that as part of its join optimization.
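
As a minimal sketch (the SparkSession setup, column names, and sample rows are
all made up for illustration), you can compare the physical plan of a plain
join against one with an explicit repartition on the join key; the explain()
output shows where the exchanges (shuffles) happen:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("join-plans")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Made-up example data; "id" is the join key.
val left  = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "lval")
val right = Seq((1, "x"), (2, "y")).toDF("id", "rval")

// Plain join: Catalyst inserts any needed Exchange (shuffle) itself.
left.join(right, "id").explain()

// Explicit repartition on the key: the join can reuse this partitioning,
// but the repartition call already paid the shuffle cost up front.
left.repartition($"id")
  .join(right.repartition($"id"), "id")
  .explain()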

Whether you should reduceByKey before the join really depends on your join
logic. reduceByKey will shuffle, and the following join COULD cause another
shuffle, so I am not sure it is a smart approach.
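
To make the trade-off concrete, here is a minimal RDD sketch (the sample data
is invented) where reduceByKey shuffles once and leaves its result
hash-partitioned by key, so the subsequent join can reuse that partitioner and
only shuffle the other side:

// Reusing the SparkSession from the sketch above.
val sc = spark.sparkContext

val events = sc.parallelize(Seq(("u1", 1), ("u1", 2), ("u2", 5)))
val users  = sc.parallelize(Seq(("u1", "alice"), ("u2", "bob")))

// reduceByKey shuffles and leaves `totals` hash-partitioned by key...
val totals = events.reduceByKey(_ + _)

// ...so this join only has to shuffle `users` to match that partitioning.
// Whether the extra pass pays off depends on how much reduceByKey actually
// shrinks the data, which is the caveat above.
totals.join(users).collect().foreach(println)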

Yong

________________________________
From: shyla deshpande <deshpandesh...@gmail.com>
Sent: Wednesday, March 29, 2017 12:33 PM
To: user
Subject: Re: Spark SQL, dataframe join questions.



On Tue, Mar 28, 2017 at 2:57 PM, shyla deshpande
<deshpandesh...@gmail.com> wrote:

Following are my questions. Thank you.

1. When joining dataframes, is it a good idea to repartition on the key column
that is used in the join, or is the optimizer smart enough that I can forget it?

2. In RDD joins, wherever possible we do reduceByKey before the join to avoid a
big shuffle of data. Do we need to do anything similar with dataframe joins, or
is the optimizer smart enough that I can forget it?
