[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12713003#action_12713003 ]
He Yongqiang commented on HIVE-417:
-----------------------------------

I checked how MySQL handles indexes and found that MySQL cannot use an index for the situations in my earlier post either:
{quote}
but, we can not use it for queries like:
4) select * from table1 where col2>34 and col3<3
5) select * from table1 where col2 =34
6) select * from table1 where col3 <45
{quote}

Here is a basic idea for our index design, along the lines of what Prasad commented in a previous post:

1) Index structure
Use an MR job to create the index. The input is a file with all columns, and the mapper outputs kv pairs where the key is <indexed col1, indexed col2,...> and the value is the offset:
<indexed col1, indexed col2,...> offset
We define a comparator for <indexed col1, indexed col2,...> so that the shuffle phase sorts all mappers' output. In the reducer, we combine the kv pairs into
<indexed col1, indexed col2,...> list_of_offsets
This is a dense sorted index. We then create a sparse index on top of the dense index. We also collect column data distribution information (a histogram) while doing this.

2) Index usage
We consider using an index for a query only when the query involves the leftmost columns of the index. We also need to consider index merge when a query involves two indexes, and a cost estimate to decide whether using the index will actually decrease query time (this is work that needs to be done in the optimizer).

As a first step, we can finish part 1 and the Hive QL part, and then consider part 2 (the optimizer part). After part 1 is finished, I will examine part 2 in more detail.

> Implement Indexing in Hive
> --------------------------
>
>                 Key: HIVE-417
>                 URL: https://issues.apache.org/jira/browse/HIVE-417
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Metastore, Query Processor
>    Affects Versions: 0.2.0, 0.3.0, 0.3.1, 0.4.0
>            Reporter: Prasad Chakka
>            Assignee: He Yongqiang
>         Attachments: hive-417.proto.patch
>
>
> Implement indexing on Hive so that lookup and range queries are efficient.

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
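
The index design described in the comment above can be sketched in a few lines of Python. This is only an in-memory illustration of the idea, not the code in hive-417.proto.patch and not a real MR job: the function names, the sample rows, the stride parameter, and the assumption that the index is on (col1, col2, col3) are all hypothetical. Python's built-in tuple ordering stands in for the comparator used in the shuffle phase, and the histogram collection is left out.

```python
from bisect import bisect_right
from collections import defaultdict


def map_phase(rows, indexed_cols):
    """Mapper: emit (<indexed col1, indexed col2, ...>, offset) for each row."""
    for offset, row in rows:
        yield tuple(row[c] for c in indexed_cols), offset


def reduce_phase(pairs):
    """Shuffle + reducer: sort by key and merge offsets per key (dense index)."""
    grouped = defaultdict(list)
    for key, offset in pairs:
        grouped[key].append(offset)
    # Tuple ordering plays the role of the custom key comparator.
    return [(key, sorted(offsets)) for key, offsets in sorted(grouped.items())]


def build_sparse_index(dense, stride):
    """Sparse index: every stride-th dense key plus its position in the dense index."""
    return [(dense[i][0], i) for i in range(0, len(dense), stride)]


def lookup(dense, sparse, key, stride):
    """Locate the dense block through the sparse index, then scan only that block."""
    sparse_keys = [k for k, _ in sparse]
    slot = bisect_right(sparse_keys, key) - 1
    if slot < 0:
        return []  # key sorts before the first indexed key
    start = sparse[slot][1]
    for k, offsets in dense[start:start + stride]:
        if k == key:
            return offsets
    return []


def usable_prefix(index_cols, predicate_cols):
    """Leftmost rule: count the leading index columns the query constrains.

    A result of 0 means the index is not considered for the query.
    """
    count = 0
    for col in index_cols:
        if col not in predicate_cols:
            break
        count += 1
    return count


# Hypothetical sample input: (offset, row) pairs, offsets standing in for file positions.
rows = [
    (0, {"col1": 5, "col2": 34, "col3": 1}),
    (40, {"col1": 2, "col2": 10, "col3": 7}),
    (80, {"col1": 5, "col2": 34, "col3": 9}),
    (120, {"col1": 8, "col2": 1, "col3": 3}),
]

dense = reduce_phase(map_phase(rows, ["col1", "col2"]))
sparse = build_sparse_index(dense, stride=2)
print(lookup(dense, sparse, (5, 34), stride=2))  # [0, 80]

# Queries 4), 5) and 6) from the quote, against an assumed index on (col1, col2, col3).
index_cols = ["col1", "col2", "col3"]
print(usable_prefix(index_cols, {"col2", "col3"}))  # 0
print(usable_prefix(index_cols, {"col2"}))          # 0
print(usable_prefix(index_cols, {"col3"}))          # 0
```

Under these assumptions the three quoted queries all get a prefix length of 0, which matches the observation that an index whose leftmost column is not in the predicate cannot be used. The cost estimation and index merge mentioned in part 2 are not covered by this sketch.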