Re: Review Request: Use sorted nature of compact indexes

Kevin Wilfong Tue, 01 Nov 2011 16:57:00 -0700

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2605/
-----------------------------------------------------------

(Updated 2011-11-01 23:56:58.062678)

Review request for hive, Yongqiang He, Ning Zhang, and namit jain.

Changes
-------

Thanks for the feedback, Namit. This diff should address all of your comments
that I haven't already addressed with my own comments.

The most notable change in this diff is that the BinarySearchRecordReader can
now be used for both HiveInputFormat and CombineHiveInputFormat. Their
respective record readers now inherit from BinarySearchRecordReader, and the
CompactIndexHandler sets a variable representing whether or not the
BinarySearchRecordReader should attempt to run its optimized search algorithm
on the data. If that variable is not set, the record readers behave exactly as
they did before.

I updated the tests I added and verified they passed. I also ran the tests
most likely to be affected, notably any tests related to indexes, and verified
they still passed.

Summary
-------

The CompactIndexHandler determines whether the reentrant query it creates is a
candidate for exploiting the fact that the index is sorted (it has an
appropriate number of non-partition conditions, and the query plan is of the
form expected). It
sets the input format to HiveSortedInputFormat, and marks the FilterOperator
for the non-partition condition.

The HiveSortedInputFormat extends HiveInputFormat, so its splits consist of
data from a single file, and its record reader is HiveBinarySearchRecordReader.
HiveBinarySearchRecordReader starts by assuming it is performing a binary
search. It sets the appropriate flags in IOContext, which acts as the means of
communication between the FilterOperators and the record reader. The
non-partition FilterOperator is responsible for executing a comparison between
the value in the row and column of interest and the constant. It also provides
the type of the generic UDF. It sets this data in the IOContext. As long as
the binary search continues, the FilterOperators do not forward rows to the
operators below them. The record reader uses the comparison and the type of
the generic UDF to execute a binary search on the underlying RCFile until it
finds the block of interest, or determines that if any block is of interest it
is the last one. The search then proceeds linearly from the beginning of the
identified block. If a problem ever occurs during the binary search, such as
the comparison failing for some reason, a linear search begins from the
beginning of the data which has yet to be eliminated.
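As a rough illustration of the search described above (not the code in the
diff), the sketch below models a sorted file as an array of non-empty blocks of
int keys. The class name SortedBlockSearchSketch and both method names are
hypothetical, and it only handles an equality condition. It binary searches on
the first key of each block to pick the block to start from, then scans
linearly from the beginning of that block.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; none of these names come from the Hive diff.
final class SortedBlockSearchSketch {

    // Returns the index of the block a linear scan for `target` should start
    // from. Assumes blocks are non-empty and keys are sorted ascending across
    // and within blocks.
    static int findStartBlock(int[][] blocks, int target) {
        int lo = 0;
        int hi = blocks.length - 1;
        int start = 0; // fall back to the first block if nothing is eliminated
        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2;
            if (blocks[mid][0] < target) {
                // The first match may sit inside this block or a later one.
                start = mid;
                lo = mid + 1;
            } else {
                // First key >= target: matches begin at this block's first
                // row or in an earlier block.
                hi = mid - 1;
            }
        }
        return start;
    }

    // Linear scan from the beginning of the identified block, collecting keys
    // equal to `target` and stopping once a larger key is seen.
    static List<Integer> scanEquals(int[][] blocks, int target) {
        List<Integer> hits = new ArrayList<>();
        if (blocks.length == 0) {
            return hits;
        }
        for (int b = findStartBlock(blocks, target); b < blocks.length; b++) {
            for (int key : blocks[b]) {
                if (key > target) {
                    return hits; // sorted input: no later row can match
                }
                if (key == target) {
                    hits.add(key);
                }
            }
        }
        return hits;
    }
}
```

Note that when the search runs off the end, the last block is returned, which
mirrors the "if any block is of interest it is the last one" case.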

Regardless of whether or not a binary search is performed, the record reader
attempts to end the linear search as soon as it can based on the comparison and
the type of the generic UDF.
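
The early termination can be pictured with the following sketch, again under
assumptions of my own: keys are sorted ascending, cmp is the result of
comparing the row's value to the constant, and the enum CompareOp and the
methods canStop and rowsRead are hypothetical names rather than anything in the
diff.

```java
// Hypothetical sketch; none of these names come from the Hive diff.
final class EarlyStopSketch {

    enum CompareOp { EQUAL, LESS_THAN, LESS_THAN_OR_EQUAL, GREATER_THAN, GREATER_THAN_OR_EQUAL }

    // cmp < 0: row value is below the constant; 0: equal; > 0: above.
    // Returns true when no later row in ascending order can satisfy `op`.
    static boolean canStop(CompareOp op, int cmp) {
        switch (op) {
            case EQUAL:
            case LESS_THAN_OR_EQUAL:
                return cmp > 0;  // already past the constant
            case LESS_THAN:
                return cmp >= 0; // reached or passed the constant
            default:
                return false;    // greater-than conditions match through the end
        }
    }

    // Counts how many rows a linear scan reads before it can stop.
    static int rowsRead(int[] sortedKeys, CompareOp op, int constant) {
        int read = 0;
        for (int key : sortedKeys) {
            read++;
            if (canStop(op, Integer.compare(key, constant))) {
                break;
            }
        }
        return read;
    }
}
```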

This addresses bug HIVE-2535.
https://issues.apache.org/jira/browse/HIVE-2535

Diffs (updated)
-----

trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1183507
trunk/conf/hive-default.xml 1183507
trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExecDriver.java 1183507
trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExprNodeGenericFuncEvaluator.java 1183507
trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FilterOperator.java 1183507
trunk/ql/src/java/org/apache/hadoop/hive/ql/index/compact/CompactIndexHandler.java 1183507
trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveRecordReader.java 1183507
trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveBinarySearchRecordReader.java PRE-CREATION
trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveInputFormat.java 1183507
trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveRecordReader.java 1183507
trunk/ql/src/java/org/apache/hadoop/hive/ql/io/IOContext.java 1183507
trunk/ql/src/java/org/apache/hadoop/hive/ql/io/RCFile.java 1183507
trunk/ql/src/java/org/apache/hadoop/hive/ql/io/RCFileRecordReader.java 1183507
trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/FilterDesc.java 1183507
trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1183507
trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFBaseCompare.java 1183507
trunk/ql/src/test/org/apache/hadoop/hive/ql/hooks/VerifyHiveSortedInputFormatUsedHook.java PRE-CREATION
trunk/ql/src/test/org/apache/hadoop/hive/ql/io/TestHiveBinarySearchRecordReader.java PRE-CREATION
trunk/ql/src/test/queries/clientpositive/index_compact_binary_search.q PRE-CREATION
trunk/ql/src/test/results/clientpositive/index_compact_binary_search.q.out PRE-CREATION

Diff: https://reviews.apache.org/r/2605/diff

Testing
-------

I added a test to verify the functionality of the HiveBinarySearchRecordReader.

I also added a .q file to test that this returns the correct results when the
underlying index is stored in an RCFile and when it is stored as a text file,
with all of the supported operators.

I ran the .q files to verify they still pass.

I ran some queries to verify there was a CPU benefit to doing this. I saw as
much as a 45% reduction in the total CPU used by the map reduce job to scan the
index, for a large data set.

Thanks,

Kevin
