Hello. I'm running Hadoop 2.2.0 and Hive 0.12.0.
I have an ORC table partitioned by 'range' and sorted by 'time'. I want to select the max(time) value for a given set of partitions. I begin with a query that looks like the following:

  select max(time) from my_table where range > 1234;

The minimum/maximum values for a given integer column are stored in the ORC metadata. If I run '--orcfiledump' on the files in the specified partition(s), I can see output like the following:

  Column 1: count: 233056 min: 1393123416 max: 1393123499 sum: 324675782247877

However, my queries do not seem to be using this information. The query I'm running ends up with several hundred mappers, and takes a very long time to run on the data. Running an orcfiledump on the files themselves (and simply pulling the values for column 1) is faster by several orders of magnitude.

I have verified that 'hive.optimize.index.filter' and 'hive.optimize.ppd' are both set to 'true'.

What can I do to avoid processing actual records for this particular query, and instead use the ORC file metadata or metastore metadata?

Regards,
Bryan Jeffrey
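For reference, the checks described above can be reproduced roughly as follows. This is a sketch only: the warehouse path and file name below are hypothetical placeholders, not the actual table location.

```shell
# Dump ORC metadata (per-stripe column statistics, including min/max)
# for one file in a partition. The path is a hypothetical example.
hive --orcfiledump /user/hive/warehouse/my_table/range=1235/000000_0

# In the Hive CLI, the settings mentioned above can be confirmed/enabled
# for the session before running the query:
#   SET hive.optimize.ppd=true;
#   SET hive.optimize.index.filter=true;
#   select max(time) from my_table where range > 1234;
```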