Hello.

I'm running Hadoop 2.2.0 and Hive 0.12.0.

I have an ORC table partitioned by 'range' and sorted by 'time'. I want
to select the max(time) value from the table for a given set of
partitions. I begin with a query like the following:

select max(time) from my_table where range > 1234;

The minimum/maximum values for a given integer column are stored in the
ORC metadata. If I run '--orcfiledump' on the files in the specified
partition(s), I can see the following output:

Column 1: count: 233056 min: 1393123416 max: 1393123499 sum: 324675782247877
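
For reference, the output above came from running the ORC dump tool
directly against one of the partition files, e.g. (the path below is a
placeholder for an actual ORC file under the partition directory):

hive --orcfiledump /user/hive/warehouse/my_table/range=1235/000000_0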

However, my queries do not seem to be using this information. The query
above ends up with several hundred mappers and takes a very long time to
run. Running an orcfiledump on the files themselves (and simply pulling
the values for column 1) is faster by several orders of magnitude.

I have verified that 'hive.optimize.index.filter' and 'hive.optimize.ppd'
are set to 'true'.  What can I do to avoid processing actual records for
this particular query, and instead use the ORC file metadata or metastore
metadata?
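
For completeness, this is how I checked the settings in my session (the
output shown is illustrative):

hive> set hive.optimize.ppd;
hive.optimize.ppd=true
hive> set hive.optimize.index.filter;
hive.optimize.index.filter=true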

Regards,

Bryan Jeffrey
