[
https://issues.apache.org/jira/browse/HIVE-1694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Nikhil Deshpande updated HIVE-1694:
-----------------------------------
Status: Patch Available (was: Open)
This is a patch to demonstrate query performance gains using indexes
(added in HIVE-417). The patch is over latest hive trunk.
ChangeLog for the patch:
- Implements a new rewrite for a certain set of queries with GROUP BY to speed
those queries by running them on index data instead of base table.
- Implements a skeleton generic rewrite engine.
- Implements the rewrite rule for a GroupBy queries set (mentioned above).
More details in the class comment GbToCompactSumIdxRewrite.
- Rewrite needs to be currently explicitly enabled with a flag
hive.ql.rw.gb_to_idx.
- Modifies metastore & metadata API for getting some index info.
- Modifies QB metadata & parseblock code to add some rewrite assist methods.
- Inserts a rewrite hook into Semantic Analyzer.
- Fixes a bug in ql QTestUtil to clean-up indexed tables properly
- Contains new test for Group By rewrite using indexes:
ql/src/test/queries/clientpositive/ql_rewrite_gbtoidx.q
Quick performance test results on a very small Hadoop cluster:
2 queries (chosen to demonstrate perf gains) run on TPC-H benchmark data
lineitem table.
Timings in seconds, data set size (1M, 1G etc.) is TPC-H scale factor.
{noformat}
-----------------------------------------------
1M 1G 10G 30G
-----------------------------------------------
q1_no_idx 24.161 76.790 506.005 1551.555
q1_with_idx 21.268 27.292 35.502 86.133
-----------------------------------------------
q1_no_idx 73.660 130.587 764.619 2146.423
q2_with_idx 69.393 75.493 92.867 190.619
-----------------------------------------------
{noformat}
Hadoop cluster description used for above perf test:
- 2 server class machines (each box: CentOS 5.x Linux, 5 SAS disks in RAID5,
16GB RAM)
- 2-node Hadoop cluster (0.20.2), un-tuned and un-optimized, data not
partitioned and clustered, Hive tables stored in row-store format, HDFS
replication factor: 2
- Sun JDK 1.6 (server mode JVM, JVM_HEAP_SIZE:4GB RAM)
- Queries on TPC-H Data (lineitem table: 70% of TPC-H data size, e.g. TPC-H
30GB data: 21GB lineitem, ~180Million tuples)
These changes are being maintained at http://github.com/prafullat/hive
> Accelerate query execution using indexes
> ----------------------------------------
>
> Key: HIVE-1694
> URL: https://issues.apache.org/jira/browse/HIVE-1694
> Project: Hive
> Issue Type: New Feature
> Components: Indexing, Query Processor
> Affects Versions: 0.7.0
> Reporter: Nikhil Deshpande
>
> The index building patch (Hive-417) is checked into trunk, this JIRA issue
> tracks supporting indexes in Hive compiler & execution engine for SELECT
> queries.
> This is in ref. to John's comment at
> https://issues.apache.org/jira/browse/HIVE-417?focusedCommentId=12884869&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12884869
> on creating separate JIRA issue for tracking index usage in optimizer & query
> execution.
> The aim of this effort is to use indexes to accelerate query execution (for
> certain class of queries). E.g.
> - Filters and range scans (already being worked on by He Yongqiang as part of
> HIVE-417?)
> - Joins (index based joins)
> - Group By, Order By and other misc cases
> The proposal is multi-step:
> 1. Building index based operators, compiler and execution engine changes
> 2. Optimizer enhancements (e.g. cost-based optimizer to compare and choose
> between index scans, full table scans etc.)
> This JIRA initially focuses on the first step. This JIRA is expected to hold
> the information about index based plans & operator implementations for above
> mentioned cases.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.