Is there a plan to fix this? I also ran into this issue with a *"select * from tbl where ... limit 10"* query. Spark SQL is 100x slower than Presto in the worst case (a table with 1.6M partitions). This is a serious blocker for us, since we have many tables with close to (and over) 1M partitions, and any query against these big tables spends 5 minutes just fetching the full partition metadata.
I briefly looked at the code, and it looks like resolving metastore relations is the first thing the analyzer does, prior to any other optimization rules such as partition pruning. So in the Hive metastore client, it ends up calling getAllPartitions() with no filter expression. I am wondering how much work would be involved to fix this issue. Can you please advise on what you think should be done?

On Mon, Apr 13, 2015 at 3:27 PM, Michael Armbrust <mich...@databricks.com> wrote:

> Yeah, we don't currently push down predicates into the metastore. Though,
> we do prune partitions based on predicates (so we don't read the data).
>
> On Mon, Apr 13, 2015 at 2:53 PM, Tom Graves <tgraves...@yahoo.com.invalid> wrote:
>
> > Hey,
> > I was trying out Spark SQL using the HiveContext and doing a select on a
> > partitioned table with lots of partitions (16,000+). It took over 6 minutes
> > before it even started the job. It looks like it was querying the Hive
> > metastore and got a good chunk of data back, which I'm guessing is info on
> > the partitions. Running the same query using Hive takes 45 seconds for the
> > entire job.
> > I know Spark SQL doesn't support all the Hive optimizations. Is this a
> > known limitation currently?
> > Thanks,
> > Tom
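For anyone digging into this, the gap is roughly the difference between the two metastore calls below. This is just a minimal sketch against the Hive metastore Thrift client (HiveMetaStoreClient), not the Spark code path itself; the database/table names and the "ds" partition column in the filter string are hypothetical, and an actual fix would also need to translate Catalyst predicates on partition columns into the metastore's filter syntax:

```scala
import org.apache.hadoop.hive.conf.HiveConf
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient

object PartitionFetchSketch {
  def main(args: Array[String]): Unit = {
    val client = new HiveMetaStoreClient(new HiveConf())

    // What Spark SQL effectively does today: fetch metadata for *every*
    // partition of the table, even when the query only touches a few.
    // On a table with ~1.6M partitions, this is the multi-minute call.
    val all = client.listPartitions("mydb", "big_table", Short.MaxValue)

    // What predicate pushdown could do instead: hand the metastore a filter
    // string derived from the partition-column predicates, so only the
    // matching partitions come back over Thrift. ("ds" is a hypothetical
    // partition column; the filter grammar supports simple comparisons
    // and AND/OR on partition keys.)
    val pruned = client.listPartitionsByFilter(
      "mydb", "big_table", "ds = '2015-04-13'", Short.MaxValue)

    println(s"all=${all.size()} pruned=${pruned.size()}")
    client.close()
  }
}
```

If the analyzer could defer (or re-run) metastore relation resolution until after partition-pruning predicates are known, it could issue the filtered call rather than getAllPartitions(), which is presumably where most of the 5 minutes goes.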