[jira] Commented: (HIVE-1694) Accelerate GROUP BY execution using indexes

Prajakta Kalmegh (JIRA) Sun, 27 Feb 2011 23:00:03 -0800

    [ 
https://issues.apache.org/jira/browse/HIVE-1694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13000145#comment-13000145
 ]


Prajakta Kalmegh commented on HIVE-1694:
----------------------------------------

We are currently working on implementing a new index type to get a correct 
COUNT for group-by queries that are re-written to use index table instead of 
base table. We have three implementation options as listed below:

1. Create a new index handler called 'AggregationIndexHandler' which 
pre-computes the count on indexed columns
        In this appraoch, we plan to implement a new index type that stores the 
COUNT of the indexed column per index entry as suggested by John in November 
1st 2010's comment. The 'AggregationIndexHandler' will override the default 
implementation for 'analyzeIndexDefinition(...)' and 
'generateIndexBuildTaskList(...)' methods. The 'analyzeIndexDefinition(...)' 
method implementation will add a FieldSchema object for COUNT rather than 
'offsets' in the StorageDescriptor for the index table. The 
'generateIndexBuildTaskList(...)' method will be the same but there will be a 
change in the implementation of the 'getIndexBuilderMapRedTask(...)' method 
which is invoked within this method. 
       
Currently, for the index rebuild on the following index creation, 

CREATE INDEX lineitem_lshipdate_idx ON TABLE lineitem(l_shipdate) AS 
'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' WITH DEFERRED 
REBUILD;
ALTER INDEX lineitem_lshipdate_idx ON lineitem REBUILD;

the 'getIndexBuilderMapRedTask(...)' method creates a command 'INSERT OVERWRITE 
TABLE `default__lineitem_lineitem_lshipdate_idx__` SELECT 
`l_shipdate`,INPUT__FILE__NAME, collect_set (BLOCK__OFFSET__INSIDE__FILE)  FROM 
`lineitem` GROUP BY `l_shipdate`, INPUT__FILE__NAME' in the 
CompactIndexHandler. This command is later passed to the driver to compile and 
rebuild the index which collects data from base table and stores it in index 
table.

With the 'AggregationIndexHandler', we plan to create a command like ' INSERT 
OVERWRITE TABLE `default__lineitem_lineitem_lshipdate_idx__` SELECT 
`l_shipdate`,INPUT__FILE__NAME, COUNT(`l_shipdate`) FROM `lineitem`  GROUP BY 
`l_shipdate`, INPUT__FILE__NAME' which will count the number of unique indexed 
column entries.

        In the current implementation of our optimizer, we are doing a 
'size(_offsets)' to replace the COUNT in group-by queries. We need to change 
this implementation if we use the AggregationIndexHandler. This will still give 
a comparable performance improvement on the original queries that will, 
otherwise, scan full tables for a COUNT of the indexed columns. 
Note: We plan to go ahead with creating a new HiveIndexHandler type rather than 
changing the implementation of the CompactIndexHandler to let in some 
flexibility in the use of AggregationIndexHandler only for cases where it is 
required. 

2. Implement a 'CompactRowIndexHandler' similar to 'CompactIndexHandler' but it 
stores the ROW_OFFSET_INSIDE_FILE instead of BLOCK_OFFSET_INSIDE_FILE
        We create a new index handler called 'CompactRowIndexHandler'. This 
will give us a ROW_OFFSET_INSIDE_FILE array<bigint> and we can compute the 
COUNT for group-by queries by doing a s'ize(ROW_OFFSET_INSIDE_FILE)' on it.  
The command will look like,

' INSERT OVERWRITE TABLE `default__lineitem_lineitem_lshipdate_idx__` SELECT 
`l_shipdate`,INPUT__FILE__NAME, collect_set (ROW_OFFSET_INSIDE_FILE) FROM 
`lineitem`  GROUP BY `l_shipdate`, INPUT__FILE__NAME' which will give us an 
array of all row offsets in the input file grouped by indexed column.

We had a look at the HIVE-1803 patch. If we re-use the implementation changes 
done in Hive1803 to classes HiveContextAwareRecordReader.java, IOContext.java, 
VirtualColumn.java and MapOperator.java, we can implement 
'CompactRowIndexHandler.java' to support the above index type.

3. Use BitMapIndexHandler without creating a new index handler
        Another approach is to use the Hive1803 BitMapIndexHandler to get a 
count of all those bits in the bitmap array for which the row_offset is ON. 
This will indeed give us a correct count for all the rows that contain the 
indexed column. However, we will have multiple counts per block entry. For 
this, we may need to do the SUM again to get a correct count per indexed column 
entry. Although this is do-able, it will be somewhat complex to implement.

Please let us know what you think of the proposed solutions. If you have a 
better implementation approach in mind, please do share it with us. 


> Accelerate GROUP BY execution using indexes
> -------------------------------------------
>
>                 Key: HIVE-1694
>                 URL: https://issues.apache.org/jira/browse/HIVE-1694
>             Project: Hive
>          Issue Type: New Feature
>          Components: Indexing, Query Processor
>    Affects Versions: 0.7.0
>            Reporter: Nikhil Deshpande
>            Assignee: Nikhil Deshpande
>         Attachments: HIVE-1694.1.patch.txt, HIVE-1694_2010-10-28.diff, 
> demo_q1.hql, demo_q2.hql
>
>
> The index building patch (Hive-417) is checked into trunk, this JIRA issue 
> tracks supporting indexes in Hive compiler & execution engine for SELECT 
> queries.
> This is in ref. to John's comment at
> https://issues.apache.org/jira/browse/HIVE-417?focusedCommentId=12884869&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12884869
> on creating separate JIRA issue for tracking index usage in optimizer & query 
> execution.
> The aim of this effort is to use indexes to accelerate query execution (for 
> certain class of queries). E.g.
> - Filters and range scans (already being worked on by He Yongqiang as part of 
> HIVE-417?)
> - Joins (index based joins)
> - Group By, Order By and other misc cases
> The proposal is multi-step:
> 1. Building index based operators, compiler and execution engine changes
> 2. Optimizer enhancements (e.g. cost-based optimizer to compare and choose 
> between index scans, full table scans etc.)
> This JIRA initially focuses on the first step. This JIRA is expected to hold 
> the information about index based plans & operator implementations for above 
> mentioned cases. 

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Commented: (HIVE-1694) Accelerate GROUP BY execution using indexes

Reply via email to