Re: Partition pruning in spark 1.5.2

2016-04-06 Thread Darshan Singh
It worked fine and I was looking for this only, as I do not want to cache the dataframe because the data in some of the partitions will change. However, I have a much larger number of partitions (the column is not just country but something whose values can number in the hundreds of thousands). Now the metadata is much bigger than
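A sketch of the layout behind this concern, written for a 1.5.2 spark-shell (where sqlContext is predefined); the user_id column is hypothetical. With partitionBy, every distinct value of the partition column becomes its own directory, so a column with hundreds of thousands of values produces hundreds of thousands of directories, all of which partition discovery must enumerate before any pruning can happen:

// Deliberately pathological: 100,000 distinct values of a hypothetical
// high-cardinality column. Do not actually run this at scale.
val df = sqlContext.range(0, 100000).selectExpr("id as user_id", "rand() as score")
df.write.partitionBy("user_id").save("/path/to/data")
// On disk this becomes one directory per distinct value:
//   /path/to/data/user_id=0/
//   /path/to/data/user_id=1/
//   ... hundreds of thousands more ...
// Partition discovery must list and parse every one of these directories,
// which is the metadata growth described above.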

RE: Partition pruning in spark 1.5.2

2016-04-05 Thread Yong Zhang
that is the original question. Thanks, Yong From: mich...@databricks.com Date: Tue, 5 Apr 2016 13:28:46 -0700 Subject: Re: Partition pruning in spark 1.5.2 To: darshan.m...@gmail.com CC: user@spark.apache.org The following should ensure partition pruning happens: df.write.partitionBy("country").s

Re: Partition pruning in spark 1.5.2

2016-04-05 Thread Darshan Singh
Thanks a lot. I will try this one as well. On Tue, Apr 5, 2016 at 9:28 PM, Michael Armbrust wrote: > The following should ensure partition pruning happens: > > df.write.partitionBy("country").save("/path/to/data") > sqlContext.read.load("/path/to/data").where("country =

Re: Partition pruning in spark 1.5.2

2016-04-05 Thread Michael Armbrust
The following should ensure partition pruning happens: df.write.partitionBy("country").save("/path/to/data") sqlContext.read.load("/path/to/data").where("country = 'UK'") On Tue, Apr 5, 2016 at 1:13 PM, Darshan Singh wrote: > Thanks for the reply. > > Now I saved the
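For reference, here is the full pattern as it would look in a 1.5.2 spark-shell (sqlContext is predefined there); the sample data is invented, the path is the one from the thread:

import sqlContext.implicits._

// Toy stand-in for the real table.
val df = Seq(("UK", 1), ("FR", 2), ("DE", 3)).toDF("country", "value")

// partitionBy writes one directory per country value:
//   /path/to/data/country=UK/, /path/to/data/country=FR/, ...
df.write.partitionBy("country").save("/path/to/data")

// Reading the partitioned layout back gives the optimizer the metadata it
// needs, so a filter on the partition column prunes whole directories.
val uk = sqlContext.read.load("/path/to/data").where("country = 'UK'")
uk.explain() // the physical plan's scan should cover only country=UK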

Re: Partition pruning in spark 1.5.2

2016-04-05 Thread Darshan Singh
Thanks for the reply. Now I saved the part_movies as a parquet file, then created a new dataframe from the saved parquet file, and I did not persist it. Then I ran the same query. It still read all 20 partitions, this time from HDFS. So what is the exact scenario in which it will prune partitions? I
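One hedged guess at what happened here: if the parquet data was saved without partitionBy, the directory layout carries no partition values, so there is nothing for the reader to prune on. The contrast, sketched with the thread's part_movies dataframe (paths invented):

// Flat layout: a filter on movie cannot prune anything on disk.
part_movies.write.parquet("/path/to/flat")

// Partitioned layout: movie=<value> directories enable pruning on read.
part_movies.write.partitionBy("movie").save("/path/to/partitioned")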

Re: Partition pruning in spark 1.5.2

2016-04-05 Thread Michael Armbrust
For the in-memory cache, we still launch tasks; we just skip blocks when possible using statistics about those blocks. On Tue, Apr 5, 2016 at 12:14 PM, Darshan Singh wrote: > Thanks. It is not my exact scenario but I have tried to reproduce it. I > have used 1.5.2. > > I
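A sketch of that behaviour against the part_movies table from the repro quoted below (spark-shell assumed). In 1.5 the batch skipping is controlled by spark.sql.inMemoryColumnarStorage.partitionPruning, which defaults to true:

// Cache the table; the cache stores column batches with min/max statistics.
sqlContext.cacheTable("part_movies")
sqlContext.sql("select * from part_movies where movie = 10").count() // materializes the cache

// A task still runs per cached partition, but inside each task any batch
// whose statistics rule out movie = 10 is skipped rather than scanned.
sqlContext.sql("select * from part_movies where movie = 10").count()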

Re: Partition pruning in spark 1.5.2

2016-04-05 Thread Darshan Singh
Thanks. It is not my exact scenario but I have tried to reproduce it. I have used 1.5.2. I have a part_movies dataframe which has 20 partitions, one for each movie. I created the following query: val part_sql = sqlContext.sql("select * from part_movies where movie = 10") part_sql.count() I expect
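A hedged reconstruction of this repro for a 1.5.2 spark-shell; the table and column names are from the message, while the revenue column and the path are invented for illustration:

// 20 movies, one partition directory per movie id.
val movies = sqlContext.range(0, 20).selectExpr("id as movie", "id * 100 as revenue")
movies.write.partitionBy("movie").save("/path/to/part_movies")

// Register the on-disk layout (not the in-memory dataframe) as the table.
sqlContext.read.load("/path/to/part_movies").registerTempTable("part_movies")

val part_sql = sqlContext.sql("select * from part_movies where movie = 10")
part_sql.count() // with pruning, only the movie=10 directory should be read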

Re: Partition pruning in spark 1.5.2

2016-04-05 Thread Michael Armbrust
Can you show your full code? How are you partitioning the data? How are you reading it? What is the resulting query plan (run explain() or EXPLAIN)? On Tue, Apr 5, 2016 at 10:02 AM, dsing001 wrote: > Hi, > > I am using 1.5.2. I have a dataframe which is partitioned
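For completeness, a minimal way to produce the plan Michael is asking for, using the part_movies query from earlier in the thread:

val q = sqlContext.sql("select * from part_movies where movie = 10")

// explain() prints the physical plan; explain(true) adds the logical and
// optimized plans. A pruned read should reference only the matching
// partition directories in its scan.
q.explain(true)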

Partition pruning in spark 1.5.2

2016-04-05 Thread dsing001
Hi, I am using 1.5.2. I have a dataframe which is partitioned based on country, so I have around 150 partitions in the dataframe. When I run Spark SQL with country = 'UK', it still reads all partitions and is not able to prune the others. Thus all the queries run for similar times