Hi, I found the following two links helpful and am sharing them with you:
http://stackoverflow.com/questions/38353524/how-to-ensure-partitioning-induced-by-spark-dataframe-join
http://spark.apache.org/docs/latest/configuration.html

Regards,
Vaquar Khan

On Wed, Mar 29, 2017 at 2:45 PM, Vidya Sujeet <sjayatheer...@gmail.com> wrote:

> In repartition, every element in the partition is moved to a new
> partition, doing a full shuffle, compared to the shuffles done by
> reduceByKey. With this in mind, repartition could improve your query
> performance. reduceByKey will also shuffle, based on the aggregation.
>
> The best way to decide is to check the query plan of your DataFrame join
> query and do RDD joins accordingly, if needed.
>
> On Wed, Mar 29, 2017 at 10:55 AM, Yong Zhang <java8...@hotmail.com> wrote:
>
>> You don't need to repartition your data just for the join. But if either
>> side of the join is already partitioned, Spark will take advantage of
>> that as part of join optimization.
>>
>> Whether you should reduceByKey before the join really depends on your
>> join logic. reduceByKey will shuffle, and the following join COULD cause
>> another shuffle, so I am not sure it is a smart approach.
>>
>> Yong
>>
>> ------------------------------
>> *From:* shyla deshpande <deshpandesh...@gmail.com>
>> *Sent:* Wednesday, March 29, 2017 12:33 PM
>> *To:* user
>> *Subject:* Re: Spark SQL, dataframe join questions.
>>
>> On Tue, Mar 28, 2017 at 2:57 PM, shyla deshpande <
>> deshpandesh...@gmail.com> wrote:
>>
>>> Following are my questions. Thank you.
>>>
>>> 1. When joining DataFrames, is it a good idea to repartition on the key
>>> column used in the join, or is the optimizer smart enough that we can
>>> forget it?
>>>
>>> 2. In an RDD join, wherever possible we do reduceByKey before the join
>>> to avoid a big shuffle of data. Do we need to do anything similar with
>>> DataFrame joins, or is the optimizer smart enough that we can forget it?
>>>

--
Regards,
Vaquar Khan
+1-224-436-0783
IT Architect / Lead Consultant
Greater Chicago