Hi, I found the following two links helpful and am sharing them with you:
http://stackoverflow.com/questions/38353524/how-to-ensure-partitioning-induced-by-spark-dataframe-join
http://spark.apache.org/docs/latest/configuration.html

Regards,
Vaquar Khan

On Wed, Mar 29, 2017 at 2:45 PM, Vidya Sujeet <sjayatheer...@gmail.com> wrote:

> In repartition, every element in the partition is moved to a new
> partition, doing a full shuffle, compared to the shuffles done by
> reduceByKey. With this in mind, repartition could improve your query
> performance. reduceByKey will also shuffle, based on the aggregation.
>
> The best way to decide is to check the query plan of your DataFrame join
> query and do RDD joins accordingly, if needed.
>
> On Wed, Mar 29, 2017 at 10:55 AM, Yong Zhang <java8...@hotmail.com> wrote:
>
>> You don't need to repartition your data just for the join. But if either
>> side of the join is already partitioned, Spark will take advantage of
>> that as part of join optimization.
>>
>> Whether you should reduceByKey before the join really depends on your
>> join logic. reduceByKey will shuffle, and the following join COULD cause
>> another shuffle, so I am not sure it is a smart approach.
>>
>> Yong
>>
>> ------------------------------
>> *From:* shyla deshpande <deshpandesh...@gmail.com>
>> *Sent:* Wednesday, March 29, 2017 12:33 PM
>> *To:* user
>> *Subject:* Re: Spark SQL, dataframe join questions.
>>
>> On Tue, Mar 28, 2017 at 2:57 PM, shyla deshpande <
>> deshpandesh...@gmail.com> wrote:
>>
>>> Following are my questions. Thank you.
>>>
>>> 1. When joining DataFrames, is it a good idea to repartition on the key
>>> column used in the join, or is the optimizer smart enough that we can
>>> forget it?
>>>
>>> 2. In an RDD join, wherever possible we do reduceByKey before the join
>>> to avoid a big shuffle of data. Do we need to do anything similar with
>>> DataFrame joins, or is the optimizer smart enough that we can forget it?
>>>

--
Regards,
Vaquar Khan
+1-224-436-0783
IT Architect / Lead Consultant
Greater Chicago