spark.sql.hive.convertMetastoreParquet is true. I can't reproduce the issue of scanning all partitions now... :P

Anyway, I found another email thread, "Re: Spark Sql behaves strangely with tables with a lot of partitions", and I observe the same issue as Jerrick: the Spark driver calls listStatus on the whole table folder even when I load a pure Hive table (not one created by Spark), which takes noticeable time for a large partitioned table. Even though spark.sql.parquet.cacheMetadata is true, this listStatus runs before every query; the result is not cached.
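A minimal sketch of one way to check the settings and observe the repeated listing cost from spark-shell (Spark 1.5, Scala); the events table and its date partition column are hypothetical stand-ins for a large Hive-partitioned Parquet table:

    // Spark 1.5 spark-shell with Hive support: sqlContext is a HiveContext.
    // Confirm the two settings discussed in this thread.
    println(sqlContext.getConf("spark.sql.hive.convertMetastoreParquet", "<unset>")) // expect "true"
    println(sqlContext.getConf("spark.sql.parquet.cacheMetadata", "<unset>"))        // expect "true"

    // Run the same query twice. If the driver-side listStatus of the whole
    // table folder were cached, the second run should plan noticeably
    // faster; in the behavior described above, both runs pay the listing cost.
    def timed[T](label: String)(body: => T): T = {
      val start = System.nanoTime()
      val result = body
      println(s"$label: ${(System.nanoTime() - start) / 1e6} ms")
      result
    }

    val q = "SELECT count(*) FROM events WHERE date = '2015-11-01'"  // hypothetical table
    timed("first run")(sqlContext.sql(q).collect())
    timed("second run")(sqlContext.sql(q).collect())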
2015-11-05 8:50 GMT+08:00 Cheng Lian <lian.cs....@gmail.com>:

> Is there any chance that "spark.sql.hive.convertMetastoreParquet" is
> turned off?
>
> Cheng
>
> On 11/4/15 5:15 PM, Rex Xiong wrote:
>
> Thanks, Cheng Lian.
> I found that in 1.5, if I use Spark to create this table with partition
> discovery, partition pruning can be performed, but for my old table
> definition in pure Hive, the execution plan does a Parquet scan across
> all partitions, and it runs very slowly.
> It looks like the execution plan optimization is different.
>
> 2015-11-03 23:10 GMT+08:00 Cheng Lian <lian.cs....@gmail.com>:
>
>> SPARK-11153 should be irrelevant, because you are filtering on a
>> partition key, while SPARK-11153 is about Parquet filter push-down and
>> doesn't affect partition pruning.
>>
>> Cheng
>>
>> On 11/3/15 7:14 PM, Rex Xiong wrote:
>>
>> We found that query performance is very poor due to this issue:
>> https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-11153
>> We usually filter on the partition key, the date; it's a string type in
>> 1.3.1 and works great.
>> But in 1.5, it needs to do a Parquet scan over all partitions.
>>
>> On Oct 31, 2015, at 7:38 PM, "Rex Xiong" <bycha...@gmail.com> wrote:
>>
>>> Adding this thread back to the email list; I forgot to reply all.
>>>
>>> On Oct 31, 2015, at 7:23 PM, "Michael Armbrust" <mich...@databricks.com> wrote:
>>>
>>>> Not that I know of.
>>>>
>>>> On Sat, Oct 31, 2015 at 12:22 PM, Rex Xiong <bycha...@gmail.com> wrote:
>>>>
>>>>> Good to know; I will give it a try.
>>>>> So there is no easy way to achieve this with a pure Hive table?
>>>>>
>>>>> On Oct 31, 2015, at 7:17 PM, "Michael Armbrust" <mich...@databricks.com> wrote:
>>>>>
>>>>>> Yeah, this was rewritten to be faster in Spark 1.5. We use it with
>>>>>> 10,000s of partitions.
>>>>>>
>>>>>> On Sat, Oct 31, 2015 at 7:17 AM, Rex Xiong <bycha...@gmail.com> wrote:
>>>>>>
>>>>>>> 1.3.1.
>>>>>>> Is it much improved in 1.5+?
>>>>>>>
>>>>>>> 2015-10-30 19:23 GMT+08:00 Michael Armbrust <mich...@databricks.com>:
>>>>>>>
>>>>>>>>> We have tried the schema merging feature, but it's too slow; there
>>>>>>>>> are hundreds of partitions.
>>>>>>>>
>>>>>>>> Which version of Spark?
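A related sketch for checking whether partition pruning actually happens for a filter on the partition key, assuming the same hypothetical events table plus a hypothetical /data/events partition-discovery layout; if pruning works, the physical plan should reference only the matching partition directory rather than all partitions:

    // Metastore table case: inspect the plan for the partition-key filter.
    sqlContext.sql(
      "SELECT count(*) FROM events WHERE date = '2015-11-01'").explain(true)

    // Partition-discovery case: a root folder containing date=.../*.parquet
    // subdirectories, read directly as Parquet.
    val df = sqlContext.read.parquet("/data/events")  // hypothetical path
    df.filter(df("date") === "2015-11-01").explain(true)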