You need to show us the execution plan, so we can understand what is your issue.
Use the spark shell code to show how your DF is built, how you partition them, 
then use explain(true) on your join DF, and show the output here, so we can 
better help you.
Yong

> Date: Tue, 5 Apr 2016 09:46:59 -0700
> From: darshan.m...@gmail.com
> To: user@spark.apache.org
> Subject: Plan issue with spark 1.5.2
> 
> 
> I am using spark 1.5.2. I have a question regarding plan generated by spark.
> I have 3 data-frames which has the data for different countries. I have
> around 150 countries and data is skewed.
> 
> My 95% queries will have country as criteria. However, I have seen issues
> with the plans generated for queries which has country as join column.
> 
> Data-frames are partitioned based on the country.Not only these dataframes
> are co-partitioned, these are co-located as well. E.g. Data for UK in
> data-frame df1, df2 df3 will be at on same hdfs datanode. 
> 
> Then when i join these 3 tables and country is one of the join column. I
> assume that the join should be the map side join but it shuffles the data
> from 3 dataframes and then join using shuffled data. Apart from country
> there are other columns in join.
> 
> Is this correct behavior? If it is an issue is it fixed in latest versions?
> 
> Thanks
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Plan-issue-with-spark-1-5-2-tp26681.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 
                                          

Reply via email to