[
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12713003#action_12713003
]
He Yongqiang commented on HIVE-417:
-----------------------------------
I checked how MySQL handles indexes and found that MySQL also cannot use an index
for the situations described in my earlier post:
{quote}
but, we can not use it for queries like:
4) select * from table1 where col2>34 and col3<3
5) select * from table1 where col2 =34
6) select * from table1 where col3 <45
{quote}
Here is a basic idea for our index design, along the lines of what Prasad
suggested in a previous comment:
1) index structure
Use an MR job to create the index. The input is a file with all columns, and the
mapper outputs kv pairs where the key is <indexed col1, indexed col2,...> and the
value is the offset.
We define a comparator for <indexed col1, indexed col2,...> so that the shuffle
phase sorts all mappers' output. In the reducer, we combine the kv pairs into
<indexed col1, indexed col2,...> list_of_offsets
This is a dense sorted index; we then create a sparse index on top of the dense
index.
We also collect column data distribution information (a histogram) while
doing this.
2)
We consider using an index for a query only when the query involves the
leftmost columns of the index.
We also need to consider index merge when a query involves two indexes, and a
cost estimate to decide whether using the index will actually decrease query
time (this is work that needs to be done in the optimizer).
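
To make the leftmost-part rule concrete, here is a minimal sketch in Python. It is not Hive or MySQL code; the function names and the index on (col1, col2, col3) are assumptions made up for illustration, and the check is simplified: it looks only at which columns a query constrains and ignores the difference between equality and range predicates.

```python
def usable_prefix_length(index_columns, predicate_columns):
    """Count how many leading index columns the query constrains."""
    count = 0
    for col in index_columns:
        if col not in predicate_columns:
            break  # the first unconstrained column ends the usable prefix
        count += 1
    return count


def can_use_index(index_columns, predicate_columns):
    """An index is a candidate only if its first column is constrained."""
    return usable_prefix_length(index_columns, predicate_columns) > 0


# Hypothetical composite index; the column order is an assumption.
INDEX = ("col1", "col2", "col3")

print(can_use_index(INDEX, {"col2", "col3"}))  # query 4: False
print(can_use_index(INDEX, {"col2"}))          # query 5: False
print(can_use_index(INDEX, {"col3"}))          # query 6: False
print(can_use_index(INDEX, {"col1", "col3"}))  # True, but only col1 is usable
```

Under that assumed index, queries 4) to 6) from the quote are rejected because none of them constrains col1. Whether an accepted index is actually worth using is still left to the cost estimate.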
As a first step, we can finish part 1 and the Hive QL part, and then consider
part 2 (the optimizer part). After part 1 is finished, I will examine part 2 in
more detail.
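
As a rough illustration of part 1, the build can be simulated in memory with plain Python. This is only a sketch of the idea, not a Hadoop job: the row layout, the function names, and the sparse-index step size are all assumptions, and Python's built-in tuple ordering stands in for the comparator on <indexed col1, indexed col2,...>.

```python
from collections import defaultdict


def map_phase(rows, indexed_cols):
    """Mapper: emit (key, offset), where key is the tuple of indexed values."""
    for offset, row in rows:
        yield tuple(row[c] for c in indexed_cols), offset


def reduce_phase(pairs):
    """Shuffle + reduce: sort by key and merge the offsets of equal keys."""
    grouped = defaultdict(list)
    for key, offset in pairs:
        grouped[key].append(offset)
    # Sorted list of (key, list_of_offsets): the dense index.
    return [(key, sorted(grouped[key])) for key in sorted(grouped)]


def build_sparse_index(dense, step):
    """Keep every step-th dense entry as (key, position in the dense index)."""
    return [(dense[i][0], i) for i in range(0, len(dense), step)]


def build_histogram(dense):
    """Count rows per distinct value of the first indexed column."""
    hist = defaultdict(int)
    for key, offsets in dense:
        hist[key[0]] += len(offsets)
    return dict(hist)


# Hypothetical input: (file offset, row) pairs.
rows = [
    (0, {"col1": 5, "col2": 34, "col3": 1}),
    (40, {"col1": 2, "col2": 10, "col3": 7}),
    (80, {"col1": 5, "col2": 34, "col3": 9}),
    (120, {"col1": 2, "col2": 50, "col3": 3}),
]

dense = reduce_phase(map_phase(rows, ["col1", "col2"]))
sparse = build_sparse_index(dense, step=2)
histogram = build_histogram(dense)

print(dense)      # [((2, 10), [40]), ((2, 50), [120]), ((5, 34), [0, 80])]
print(sparse)     # [((2, 10), 0), ((5, 34), 2)]
print(histogram)  # {2: 2, 5: 2}
```

In a real job the sort would be done by the shuffle phase across many mappers, and the dense index would be written to files rather than held in a list.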
> Implement Indexing in Hive
> --------------------------
>
> Key: HIVE-417
> URL: https://issues.apache.org/jira/browse/HIVE-417
> Project: Hadoop Hive
> Issue Type: New Feature
> Components: Metastore, Query Processor
> Affects Versions: 0.2.0, 0.3.0, 0.3.1, 0.4.0
> Reporter: Prasad Chakka
> Assignee: He Yongqiang
> Attachments: hive-417.proto.patch
>
>
> Implement indexing on Hive so that lookup and range queries are efficient.