Select distinct on partitioned column requires reading all the files?

2015-02-23 Thread Stephen Boesch
When querying a hive table according to a partitioning column, it would be
logical that a simple

select count(distinct partitioned_column_name) from my_partitioned_table

would complete almost instantaneously.

But we are seeing that both hive and impala are unable to execute this
query properly: they just read the entire table!

What do we need to do to ensure the above command executes rapidly?


Re: Select distinct on partitioned column requires reading all the files?

2015-02-23 Thread Stephen Boesch
Great thanks.  Is this a server-side-only /requires restart parameter?

2015-02-23 22:36 GMT-08:00 Gopal Vijayaraghavan gop...@apache.org:

 Hi,

 Are you sure you have

 hive.optimize.metadataonly=true ?

 I’m not saying it will complete instantaneously (possibly even be very
 slow, due to the lack of a temp-table optimization of that), but it won’t
 read any part of the actual table.

 Cheers,
 Gopal

 From: Stephen Boesch java...@gmail.com
 Reply-To: user@hive.apache.org user@hive.apache.org
 Date: Monday, February 23, 2015 at 10:26 PM
 To: user@hive.apache.org user@hive.apache.org
 Subject: Select distinct on partitioned column requires reading all the
 files?


 When querying a hive table according to a partitioning column, it would be
 logical that a simple

 select count(distinct partitioned_column_name) from my_partitioned_table

 would complete almost instantaneously.

 But we are seeing that both hive and impala are unable to execute this
 query properly: they just read the entire table!

 What do we need to do to ensure the above command executes rapidly?



Re: Select distinct on partitioned column requires reading all the files?

2015-02-23 Thread Gopal Vijayaraghavan
Hi,

Are you sure you have

hive.optimize.metadataonly=true ?

I¹m not saying it will complete instantaneously (possibly even be very slow,
due to the lack of a temp-table optimization of that), but it won¹t read any
part of the actual table.

Cheers,
Gopal

From:  Stephen Boesch java...@gmail.com
Reply-To:  user@hive.apache.org user@hive.apache.org
Date:  Monday, February 23, 2015 at 10:26 PM
To:  user@hive.apache.org user@hive.apache.org
Subject:  Select distinct on partitioned column requires reading all the
files?


When querying a hive table according to a partitioning column, it would be
logical that a simple
select count(distinct partitioned_column_name) from my_partitioned_table
would complete almost instantaneously.
But we are seeing that both hive and impala are unable to execute this query
properly: they just read the entire table!
What do we need to do to ensure the above command executes rapidly?