[ https://issues.apache.org/jira/browse/HIVE-1694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13000145#comment-13000145 ]
Prajakta Kalmegh commented on HIVE-1694: ---------------------------------------- We are currently working on implementing a new index type to get a correct COUNT for group-by queries that are re-written to use index table instead of base table. We have three implementation options as listed below: 1. Create a new index handler called 'AggregationIndexHandler' which pre-computes the count on indexed columns In this appraoch, we plan to implement a new index type that stores the COUNT of the indexed column per index entry as suggested by John in November 1st 2010's comment. The 'AggregationIndexHandler' will override the default implementation for 'analyzeIndexDefinition(...)' and 'generateIndexBuildTaskList(...)' methods. The 'analyzeIndexDefinition(...)' method implementation will add a FieldSchema object for COUNT rather than 'offsets' in the StorageDescriptor for the index table. The 'generateIndexBuildTaskList(...)' method will be the same but there will be a change in the implementation of the 'getIndexBuilderMapRedTask(...)' method which is invoked within this method. Currently, for the index rebuild on the following index creation, CREATE INDEX lineitem_lshipdate_idx ON TABLE lineitem(l_shipdate) AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' WITH DEFERRED REBUILD; ALTER INDEX lineitem_lshipdate_idx ON lineitem REBUILD; the 'getIndexBuilderMapRedTask(...)' method creates a command 'INSERT OVERWRITE TABLE `default__lineitem_lineitem_lshipdate_idx__` SELECT `l_shipdate`,INPUT__FILE__NAME, collect_set (BLOCK__OFFSET__INSIDE__FILE) FROM `lineitem` GROUP BY `l_shipdate`, INPUT__FILE__NAME' in the CompactIndexHandler. This command is later passed to the driver to compile and rebuild the index which collects data from base table and stores it in index table. With the 'AggregationIndexHandler', we plan to create a command like ' INSERT OVERWRITE TABLE `default__lineitem_lineitem_lshipdate_idx__` SELECT `l_shipdate`,INPUT__FILE__NAME, COUNT(`l_shipdate`) FROM `lineitem` GROUP BY `l_shipdate`, INPUT__FILE__NAME' which will count the number of unique indexed column entries. In the current implementation of our optimizer, we are doing a 'size(_offsets)' to replace the COUNT in group-by queries. We need to change this implementation if we use the AggregationIndexHandler. This will still give a comparable performance improvement on the original queries that will, otherwise, scan full tables for a COUNT of the indexed columns. Note: We plan to go ahead with creating a new HiveIndexHandler type rather than changing the implementation of the CompactIndexHandler to let in some flexibility in the use of AggregationIndexHandler only for cases where it is required. 2. Implement a 'CompactRowIndexHandler' similar to 'CompactIndexHandler' but it stores the ROW_OFFSET_INSIDE_FILE instead of BLOCK_OFFSET_INSIDE_FILE We create a new index handler called 'CompactRowIndexHandler'. This will give us a ROW_OFFSET_INSIDE_FILE array<bigint> and we can compute the COUNT for group-by queries by doing a s'ize(ROW_OFFSET_INSIDE_FILE)' on it. The command will look like, ' INSERT OVERWRITE TABLE `default__lineitem_lineitem_lshipdate_idx__` SELECT `l_shipdate`,INPUT__FILE__NAME, collect_set (ROW_OFFSET_INSIDE_FILE) FROM `lineitem` GROUP BY `l_shipdate`, INPUT__FILE__NAME' which will give us an array of all row offsets in the input file grouped by indexed column. We had a look at the HIVE-1803 patch. If we re-use the implementation changes done in Hive1803 to classes HiveContextAwareRecordReader.java, IOContext.java, VirtualColumn.java and MapOperator.java, we can implement 'CompactRowIndexHandler.java' to support the above index type. 3. Use BitMapIndexHandler without creating a new index handler Another approach is to use the Hive1803 BitMapIndexHandler to get a count of all those bits in the bitmap array for which the row_offset is ON. This will indeed give us a correct count for all the rows that contain the indexed column. However, we will have multiple counts per block entry. For this, we may need to do the SUM again to get a correct count per indexed column entry. Although this is do-able, it will be somewhat complex to implement. Please let us know what you think of the proposed solutions. If you have a better implementation approach in mind, please do share it with us. > Accelerate GROUP BY execution using indexes > ------------------------------------------- > > Key: HIVE-1694 > URL: https://issues.apache.org/jira/browse/HIVE-1694 > Project: Hive > Issue Type: New Feature > Components: Indexing, Query Processor > Affects Versions: 0.7.0 > Reporter: Nikhil Deshpande > Assignee: Nikhil Deshpande > Attachments: HIVE-1694.1.patch.txt, HIVE-1694_2010-10-28.diff, > demo_q1.hql, demo_q2.hql > > > The index building patch (Hive-417) is checked into trunk, this JIRA issue > tracks supporting indexes in Hive compiler & execution engine for SELECT > queries. > This is in ref. to John's comment at > https://issues.apache.org/jira/browse/HIVE-417?focusedCommentId=12884869&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12884869 > on creating separate JIRA issue for tracking index usage in optimizer & query > execution. > The aim of this effort is to use indexes to accelerate query execution (for > certain class of queries). E.g. > - Filters and range scans (already being worked on by He Yongqiang as part of > HIVE-417?) > - Joins (index based joins) > - Group By, Order By and other misc cases > The proposal is multi-step: > 1. Building index based operators, compiler and execution engine changes > 2. Optimizer enhancements (e.g. cost-based optimizer to compare and choose > between index scans, full table scans etc.) > This JIRA initially focuses on the first step. This JIRA is expected to hold > the information about index based plans & operator implementations for above > mentioned cases. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira