Plan issue with spark 1.5.2

2016-04-05 Thread dsing001

I am using Spark 1.5.2 and have a question about the plan Spark generates.
I have 3 data-frames which hold data for different countries. There are
around 150 countries and the data is skewed.

About 95% of my queries use country as a criterion. However, I have seen
issues with the plans generated for queries that use country as a join
column.

The data-frames are partitioned by country. Not only are these dataframes
co-partitioned, they are co-located as well; e.g. the data for the UK in
data-frames df1, df2 and df3 sits on the same HDFS datanode.
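
Roughly, the data is laid out like this (the paths and writer calls below are
only an illustration of the setup, not the actual job):

  // Sketch of the layout described above: each data-frame is written to HDFS
  // partitioned by country, so each country's rows end up in their own
  // directory (e.g. .../country=UK/).
  df1.write.partitionBy("country").parquet("hdfs:///data/df1")
  df2.write.partitionBy("country").parquet("hdfs:///data/df2")
  df3.write.partitionBy("country").parquet("hdfs:///data/df3")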

When I join these 3 tables with country as one of the join columns, I would
expect a map-side join, but Spark shuffles the data from all 3 dataframes and
then joins using the shuffled data. Apart from country there are other
columns in the join.
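
For concreteness, the join looks roughly like this ("id" is just a
placeholder for the other join columns):

  // Illustrative only: df1/df2/df3 are the co-partitioned data-frames above.
  val joined = df1
    .join(df2, df1("country") === df2("country") && df1("id") === df2("id"))
    .join(df3, df1("country") === df3("country") && df1("id") === df3("id"))

  // The physical plan shows an Exchange (shuffle) before each join rather
  // than a map-side join.
  joined.explain()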

Is this the correct behavior? If it is an issue, has it been fixed in later
versions?

Thanks




Partition pruning in spark 1.5.2

2016-04-05 Thread dsing001
Hi,

I am using 1.5.2. I have a dataframe which is partitioned by country, so
there are around 150 partitions in the dataframe. When I run Spark SQL with
country = 'UK' it still reads all the partitions and is not able to prune the
others, so all the queries run for a similar time regardless of which country
I pass. Is this expected?
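
Roughly what I am running (the path and table name are just illustrative):

  // Sketch: the data is stored partitioned by country (directories like
  // .../country=UK/) and queried with a country filter.
  val df = sqlContext.read.parquet("hdfs:///data/events")
  df.registerTempTable("events")

  // I expected only the country=UK directory to be scanned here, but the job
  // still appears to read all ~150 partitions.
  sqlContext.sql("SELECT * FROM events WHERE country = 'UK'").explain()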

Is there a way to fix this in 1.5.2, e.g. via some parameter, or is it fixed
in later versions?

Thanks


