[
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714492#action_12714492
]
Seymour Zhang commented on HIVE-417:
------------------------------------
Hello Prasad and Yongqiang, thank you very much for this great effort.
One suggestion: since the indexes are already built with MapReduce, could
queries that can be answered from those indexes skip the time-consuming
MapReduce phase at query time? The index already gives us all of the
files/offsets, so we could go directly to those specific file offsets to get
the relevant rows of the table. This would greatly speed up such queries.
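
To illustrate the direct-lookup idea, here is a minimal sketch. It assumes the
index yields (file path, byte offset) pairs and that rows are newline-terminated
text on a local filesystem. The function name fetch_rows and this index layout
are hypothetical, not part of the HIVE-417 patch; a real implementation would
read from HDFS and handle Hive's file formats.

```python
from collections import defaultdict


def fetch_rows(index_entries):
    """Fetch rows by seeking to indexed offsets instead of scanning files.

    index_entries: iterable of (file_path, byte_offset) pairs (assumed layout).
    """
    # Group offsets by file so each file is opened only once.
    by_file = defaultdict(list)
    for path, offset in index_entries:
        by_file[path].append(offset)

    rows = []
    for path, offsets in by_file.items():
        with open(path, "rb") as f:
            # Visit offsets in ascending order to keep seeks moving forward.
            for offset in sorted(offsets):
                f.seek(offset)
                rows.append(f.readline().rstrip(b"\n").decode("utf-8"))
    return rows
```

No job is launched here: the cost is one open per file plus one seek and one
short read per matching row.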
This would help in one of my own uses of Hive. I have already sharded my log
data by date and bucketed it by a hash of certain columns into a hierarchy of
files, and I have sorted each file on the hashing columns. Because many rows
share the same column values but have different timestamps, I would like to
treat each run of rows with the same column values as a block and use a single
index entry per block. This would greatly reduce the index size for my data
while remaining very useful for my queries on those columns.
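
A minimal sketch of such a block index follows. It assumes a tab-separated
text file already sorted on the key column, so each key forms one contiguous
run of rows. The names build_block_index and read_block and the
(start offset, byte length) entry layout are hypothetical and are not taken
from the attached patch.

```python
def build_block_index(path, key_col=0, sep="\t"):
    """Return {key: (start_offset, byte_length)}, one entry per run of equal keys.

    Assumes the file is sorted on key_col; on unsorted input a later run
    of the same key would overwrite the earlier entry.
    """
    index = {}
    with open(path, "rb") as f:
        cur_key, start = None, 0
        while True:
            off = f.tell()
            line = f.readline()
            key = line.decode("utf-8").rstrip("\n").split(sep)[key_col] if line else None
            if key != cur_key:
                # The key changed (or EOF was reached): close the previous block.
                if cur_key is not None:
                    index[cur_key] = (start, off - start)
                cur_key, start = key, off
            if not line:
                break
    return index


def read_block(path, index, key):
    """Read every row of one key with a single seek and a single read."""
    entry = index.get(key)
    if entry is None:
        return []
    start, length = entry
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(length).decode("utf-8").splitlines()
```

The index then holds one entry per distinct key rather than one per row, and
a lookup still needs only one seek to reach all rows for that key.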
> Implement Indexing in Hive
> --------------------------
>
> Key: HIVE-417
> URL: https://issues.apache.org/jira/browse/HIVE-417
> Project: Hadoop Hive
> Issue Type: New Feature
> Components: Metastore, Query Processor
> Affects Versions: 0.2.0, 0.3.0, 0.3.1, 0.4.0
> Reporter: Prasad Chakka
> Assignee: He Yongqiang
> Attachments: hive-417.proto.patch
>
>
> Implement indexing on Hive so that lookup and range queries are efficient.