Select distinct on partitioned column requires reading all the files?
When querying a Hive table on a partitioning column, one would expect a simple select count(distinct partitioned_column_name) from my_partitioned_table to complete almost instantaneously. But we are seeing that both Hive and Impala are unable to execute this query efficiently: they just read the entire table! What do we need to do to ensure the above query executes rapidly?
Re: Select distinct on partitioned column requires reading all the files?
Great, thanks. Is this a server-side-only / requires-restart parameter?

2015-02-23 22:36 GMT-08:00 Gopal Vijayaraghavan gop...@apache.org:

> Hi,
> Are you sure you have hive.optimize.metadataonly=true? I'm not saying it will complete instantaneously (it may even be very slow, due to the lack of a temp-table optimization for that), but it won't read any part of the actual table.
> Cheers, Gopal
Re: Select distinct on partitioned column requires reading all the files?
Hi,

Are you sure you have hive.optimize.metadataonly=true? I'm not saying it will complete instantaneously (it may even be very slow, due to the lack of a temp-table optimization for that), but it won't read any part of the actual table.

Cheers,
Gopal

From: Stephen Boesch java...@gmail.com
Reply-To: user@hive.apache.org user@hive.apache.org
Date: Monday, February 23, 2015 at 10:26 PM
To: user@hive.apache.org user@hive.apache.org
Subject: Select distinct on partitioned column requires reading all the files?
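
For reference, the hive.optimize.metadataonly suggestion above can be tried roughly as follows. This is a minimal sketch, assuming a Hive CLI or Beeline session and assuming the property has not been restricted by the administrator, so that it can be set per session. The table and column names are the placeholders from the original question.

```sql
-- Print the current value of the property for this session.
SET hive.optimize.metadataonly;

-- Enable the metadata-only optimization for this session
-- (assumption: session-level overrides are permitted on this cluster).
SET hive.optimize.metadataonly=true;

-- Inspect the plan before running the query.
EXPLAIN
SELECT COUNT(DISTINCT partitioned_column_name)
FROM my_partitioned_table;

-- The query from the original question.
SELECT COUNT(DISTINCT partitioned_column_name)
FROM my_partitioned_table;
```

Comparing the EXPLAIN output with the property set to false and to true is one way to check whether the optimization is being applied. As noted above, the query may still be slow even when no table data is read.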