Is there a plan to fix this? I also ran into this issue with a *"select *
from tbl where ... limit 10"* query. In the worst case (a table with 1.6M
partitions), Spark SQL is 100x slower than Presto. This is a serious
blocker for us, since we have many tables with close to (and over) 1M
partitions, and any query against these big tables wastes 5 minutes
fetching the full partition metadata.

I briefly looked at the code, and it looks like resolving metastore
relations is the first thing the analyzer does, before any other
optimization rules such as partition pruning run. As a result, the Hive
metastore client ends up calling getAllPartitions() with no filter
expression. I am wondering how much work would be involved in fixing this
issue. Can you please advise on what you think should be done?


On Mon, Apr 13, 2015 at 3:27 PM, Michael Armbrust <mich...@databricks.com>
wrote:

> Yeah, we don't currently push down predicates into the metastore. We do,
> though, prune partitions based on predicates (so we don't read the data).
>
> On Mon, Apr 13, 2015 at 2:53 PM, Tom Graves <tgraves...@yahoo.com.invalid>
> wrote:
>
> > Hey,
> >
> > I was trying out Spark SQL using the HiveContext and doing a select on
> > a partitioned table with lots of partitions (16,000+). It took over 6
> > minutes before it even started the job. It looks like it was querying
> > the Hive metastore and got a good chunk of data back, which I'm
> > guessing is info on the partitions. Running the same query using Hive
> > takes 45 seconds for the entire job.
> >
> > I know Spark SQL doesn't support all the Hive optimizations. Is this a
> > known limitation currently?
> >
> > Thanks,
> > Tom
>