[
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
He Yongqiang updated HIVE-417:
------------------------------
Attachment: hive-indexing.3.patch
With this patch, the index works, but it is not very intelligent yet.
This is how the patch works:
=== How to create the index table and generate index data ===
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
drop table src_rc_index;
//create an index table on table src_rc; the indexed column is key.
//The index table's data is stored as textfile (seq and rcfile also work).
create index src_rc_index type compact on table src_rc(key) stored as textfile;
hive> show table extended like src_rc_index;
tableName:src_rc_index
owner:heyongqiang
location:file:/user/hive/warehouse/src_rc_index
inputformat:org.apache.hadoop.mapred.TextInputFormat
outputformat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
columns:struct columns { i32 key, string _bucketname, list<string> _offsets}
About the index table's schema: besides the indexed columns from the base
table, the index table has two more columns (_bucketname string,
_offsets array<string>).
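To make the schema concrete, here is a minimal Python sketch of how rows of this shape could be derived from a base table. It is only an illustration, not the map-reduce job in the patch; the function name build_compact_index and the sample file names and offsets are hypothetical.

```python
from collections import defaultdict

def build_compact_index(records):
    """Group (key, bucketname, offset) triples into compact index rows.

    Each output row mirrors the index table schema:
    (key, _bucketname, _offsets), with offsets stored as strings.
    """
    grouped = defaultdict(list)
    for key, bucketname, offset in records:
        grouped[(key, bucketname)].append(offset)
    # Sort rows by (key, bucketname); sort offsets numerically within a row.
    return [
        (key, bucketname, [str(o) for o in sorted(offsets)])
        for (key, bucketname), offsets in sorted(grouped.items())
    ]

# Hypothetical base-table records: (key, file holding the row, offset).
records = [
    (0, "/warehouse/src_rc/part-0", 2048),
    (1, "/warehouse/src_rc/part-0", 0),
    (0, "/warehouse/src_rc/part-0", 0),
    (0, "/warehouse/src_rc/part-1", 512),
]
index_rows = build_compact_index(records)
```

Each distinct (key, file) pair yields one row, so a frequently occurring key costs one offset list per file rather than one row per occurrence.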
//generate the actual index table's data (partitions are also supported)
update index src_rc_index;
=== How to use the index table ===
//find the offsets for 'key=0' in the index table, and put the bucketname and
//offset list in a temp directory
insert overwrite directory "/tmp/index_result" select `_bucketname` ,
`_offsets` from src_rc_index where key=0;
set hive.exec.index_file=/tmp/index_result;
//use a new index input format to prune input splits based on the offset list
//stored in "hive.exec.index_file", which is populated by the previous command
set hive.input.format=org.apache.hadoop.hive.ql.index.io.HiveIndexInputFormat;
//this query will not scan the whole base data
select key, value from src_rc where key=0;
Things done in the patch:
1) hql command for creating index table
2) hql command and map-reduce job for updating index (generating the index
table's data).
3) a HiveIndexInputFormat that uses the offsets obtained from the index table
to reduce the number of blocks/map tasks
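The pruning idea in item 3 can be sketched as follows. This is a simplified Python model, not the actual HiveIndexInputFormat code: splits are modeled as hypothetical (path, start, length) tuples, and a split is kept only if at least one indexed offset for its file falls inside its byte range.

```python
def prune_splits(splits, index_hits):
    """Keep only the splits that contain at least one indexed offset.

    splits:     list of (path, start, length) tuples (a hypothetical
                model of input splits).
    index_hits: dict mapping path -> list of offset strings, as read
                from the index query result.
    """
    kept = []
    for path, start, length in splits:
        offsets = index_hits.get(path, [])
        if any(start <= int(off) < start + length for off in offsets):
            kept.append((path, start, length))
    return kept

# Hypothetical splits and index hits for key=0.
splits = [
    ("/warehouse/src_rc/part-0", 0, 1024),
    ("/warehouse/src_rc/part-0", 1024, 1024),
    ("/warehouse/src_rc/part-0", 2048, 1024),
    ("/warehouse/src_rc/part-1", 0, 1024),
]
index_hits = {"/warehouse/src_rc/part-0": ["0", "2048"]}
selected = prune_splits(splits, index_hits)
```

In this example only two of the four splits survive, so only two map tasks would be launched.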
Things that still need to be done:
1) right now the index table is specified manually in queries. We need this to
be more intelligent by automatically generating a plan that uses the index.
2) The HiveIndexInputFormat needs a new RecordReader to seek to a given offset
instead of scanning the whole block.
3) right now we use a map-reduce job to scan the whole index table to find the
matching offsets. But since the index table is sorted, we can leverage the sort
order to avoid the map-reduce job in many cases. (The easiest way is to do a
binary search in the client.)
The first item is the most important. I think the third may need much
more work (though I may be wrong).
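As a rough illustration of the third item, a client-side lookup over a sorted index could look like the Python sketch below. It assumes the index rows are already sorted by key and held in memory, which is a simplification; the function name lookup_sorted_index and the sample rows are hypothetical, and a real client would have to seek within the sorted index file instead.

```python
from bisect import bisect_left, bisect_right

def lookup_sorted_index(index_rows, key):
    """Return the (bucketname, offsets) pairs for `key` via binary search.

    index_rows must be sorted by key; each row is
    (key, bucketname, offsets).
    """
    # Building this list is linear; it only keeps the sketch short.
    keys = [row[0] for row in index_rows]
    lo = bisect_left(keys, key)    # first row with row key >= key
    hi = bisect_right(keys, key)   # first row with row key > key
    return [(bucket, offsets) for _, bucket, offsets in index_rows[lo:hi]]

# Hypothetical sorted index rows.
index_rows = [
    (0, "/warehouse/src_rc/part-0", ["0", "2048"]),
    (0, "/warehouse/src_rc/part-1", ["512"]),
    (1, "/warehouse/src_rc/part-0", ["0"]),
    (5, "/warehouse/src_rc/part-1", ["1024"]),
]
hits = lookup_sorted_index(index_rows, 0)
```

The same two bisect calls with different lower and upper keys would also serve a range predicate, since all matching rows are contiguous in a sorted index.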
(Note: although this patch has been tested on a production cluster, it could
still have bugs. We would really appreciate it if you report any bugs you
find here.)
> Implement Indexing in Hive
> --------------------------
>
> Key: HIVE-417
> URL: https://issues.apache.org/jira/browse/HIVE-417
> Project: Hadoop Hive
> Issue Type: New Feature
> Components: Metastore, Query Processor
> Affects Versions: 0.3.0, 0.3.1, 0.4.0, 0.6.0
> Reporter: Prasad Chakka
> Assignee: He Yongqiang
> Attachments: hive-417.proto.patch, hive-417-2009-07-18.patch,
> hive-indexing.3.patch
>
>
> Implement indexing on Hive so that lookup and range queries are efficient.