[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12758116#action_12758116 ]
Joydeep Sen Sarma commented on HIVE-417:
----------------------------------------
are there any references on this technique?
someone had earlier suggested this (apparently from reading Netezza
documentation) - but i don't understand when it would work. why would a (fairly
large) sequencefile block contain only a limited range of values (assuming the
metadata stores a min-max range)? most cases i can imagine in our dataset would
either have low cardinality columns (so most values would be present in every
block) or, for high cardinality ones, a distribution that is random (relative
to the primary sort key) - so the range would seem ineffective.
unless there are columns that are closely related to how the data is
sorted/partitioned (perhaps some product ids are limited to a specific range of
time - but the partitioning is on time and not product id - and even that
sounds dubious).
a bloom filter would seem much more plausible at allowing good filtering. even
then i don't understand why this sort of metadata should be kept along with the
block and not separately (much more flexible - can be added on demand) as this
jira is headed towards.
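
a rough sketch of the argument above, in Python rather than anything Hive-specific. it counts how many blocks a point lookup still has to read when each block carries either a min-max range or a bloom filter, once with the column laid out randomly relative to the sort key and once with the column itself sorted. everything here is an assumption made for illustration: the block size, the value domain, the bloom filter sizing, and the names `BloomFilter`, `blocks_to_scan` and `chunk` are hypothetical and not part of Hive or of the patches attached to this jira.

```python
import hashlib
import random


class BloomFilter:
    """Tiny Bloom filter; the sizes are illustrative, not tuned."""

    def __init__(self, num_bits=16384, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, value):
        # Derive k bit positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(str(value).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(value))


def blocks_to_scan(blocks, target):
    """Count the blocks each kind of metadata fails to rule out for col = target."""
    minmax_hits = 0
    bloom_hits = 0
    for block in blocks:
        # Min-max metadata: the block survives if target lies inside its range.
        if min(block) <= target <= max(block):
            minmax_hits += 1
        # Bloom metadata: the block survives if the filter cannot exclude target.
        bloom = BloomFilter()
        for value in block:
            bloom.add(value)
        if bloom.might_contain(target):
            bloom_hits += 1
    return minmax_hits, bloom_hits


def chunk(values, size):
    """Split a list into consecutive blocks of at most `size` rows."""
    return [values[i:i + size] for i in range(0, len(values), size)]


BLOCK_ROWS = 1000   # assumed rows per block
TARGET = 500_000    # assumed lookup value, guaranteed to occur at least once

rng = random.Random(0)
values = [rng.randrange(1_000_000) for _ in range(19_999)] + [TARGET]
rng.shuffle(values)

# High-cardinality column, random relative to the sort key.
random_blocks = chunk(values, BLOCK_ROWS)
# Same column when the data happens to be sorted on it.
sorted_blocks = chunk(sorted(values), BLOCK_ROWS)

minmax_random, bloom_random = blocks_to_scan(random_blocks, TARGET)
minmax_sorted, bloom_sorted = blocks_to_scan(sorted_blocks, TARGET)

print("random layout: min-max keeps", minmax_random,
      "blocks, bloom keeps", bloom_random)
print("sorted layout: min-max keeps", minmax_sorted,
      "blocks, bloom keeps", bloom_sorted)
```

in the random layout every block's range spans nearly the whole domain, so the min-max check cannot rule anything out, while the bloom filter can still exclude most blocks (at the cost of some false positives). the range only prunes well when the column is correlated with the sort order.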
> Implement Indexing in Hive
> --------------------------
>
> Key: HIVE-417
> URL: https://issues.apache.org/jira/browse/HIVE-417
> Project: Hadoop Hive
> Issue Type: New Feature
> Components: Metastore, Query Processor
> Affects Versions: 0.2.0, 0.3.0, 0.3.1, 0.4.0
> Reporter: Prasad Chakka
> Assignee: He Yongqiang
> Attachments: hive-417.proto.patch, hive-417-2009-07-18.patch
>
>
> Implement indexing on Hive so that lookup and range queries are efficient.