[ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734405#action_12734405
 ] 

He Yongqiang commented on HIVE-417:
-----------------------------------

1) For a given key, we use one sorted set per bucket to store positions at the 
reducer. I am worried that "one sorted set for each bucket" may cause an 
out-of-memory problem.
As you commented earlier: "List<bucketname, List<offset>> column, offsets are 
sorted". 
Consider one extreme case: a single file contains the same value a million 
times. The reducer would then hold a million positions in one sorted set. 
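
To make the memory concern concrete, here is a minimal sketch of the per-key 
structure as I understand it. The class and method names (KeyPositions, add, 
offsets, size) are hypothetical and are not taken from the patch.

```java
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical holder for the positions of one index key, grouped by bucket. */
class KeyPositions {
    // bucket name -> sorted offsets at which the key occurs in that bucket
    private final Map<String, SortedSet<Long>> offsetsByBucket = new TreeMap<>();

    /** Records one occurrence of the key at the given offset of a bucket. */
    void add(String bucketName, long offset) {
        offsetsByBucket.computeIfAbsent(bucketName, b -> new TreeSet<>()).add(offset);
    }

    /** Returns the sorted offsets for a bucket (empty if the bucket is unknown). */
    SortedSet<Long> offsets(String bucketName) {
        return offsetsByBucket.getOrDefault(bucketName, new TreeSet<>());
    }

    /** Total number of offsets currently held in memory for this key. */
    long size() {
        long n = 0;
        for (SortedSet<Long> s : offsetsByBucket.values()) {
            n += s.size();
        }
        return n;
    }
}
```

In the extreme case above, a single bucket's set ends up holding a million 
boxed Long entries for one key, all of them live in the reducer at once.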

>>So reducer can flush the offsets periodically to disk thus avoiding 
>>OutOfMemory exceptions in reducer. 
If we do this, how can we guarantee that the offsets stay sorted? I mean, how 
do we know that the offsets written after a flush are all greater than the 
offsets written in the previous flush?
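
One possible way around this, sketched below, is to treat each flush as a 
sorted run and merge the runs when the final list is written (an external 
merge, which is not what the current patch does). Everything here is an 
assumption for illustration: the class name SpillingOffsets, the maxInMemory 
threshold, and the in-memory lists that stand in for spill files on disk.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;
import java.util.TreeSet;

/** Hypothetical buffer that flushes sorted runs and merges them on read. */
class SpillingOffsets {
    private final int maxInMemory;
    private final TreeSet<Long> buffer = new TreeSet<>();
    // Each run is sorted on its own; a list stands in for a spill file on disk.
    private final List<List<Long>> runs = new ArrayList<>();

    SpillingOffsets(int maxInMemory) {
        this.maxInMemory = maxInMemory;
    }

    /** Buffers an offset and flushes once the threshold is reached. */
    void add(long offset) {
        buffer.add(offset);
        if (buffer.size() >= maxInMemory) {
            flush();
        }
    }

    /** Writes the buffered offsets out as one sorted run. */
    private void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        runs.add(new ArrayList<>(buffer));
        buffer.clear();
    }

    /** K-way merge of all runs; the result is sorted across flushes. */
    List<Long> mergedOffsets() {
        flush();
        // Heap entries are {value, run index, position within run}.
        PriorityQueue<long[]> heap =
            new PriorityQueue<>((a, b) -> Long.compare(a[0], b[0]));
        for (int r = 0; r < runs.size(); r++) {
            heap.add(new long[] {runs.get(r).get(0), r, 0});
        }
        List<Long> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            long[] top = heap.poll();
            out.add(top[0]);
            int r = (int) top[1];
            int next = (int) top[2] + 1;
            if (next < runs.get(r).size()) {
                heap.add(new long[] {runs.get(r).get(next), r, next});
            }
        }
        return out;
    }
}
```

With this approach a later flush may contain offsets smaller than an earlier 
one; the order is only restored at merge time, at the cost of one extra pass 
over the flushed data.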

2) What are the other options for the index output format?
I think there are no other options. We need to discard the key part, and I 
think in Hive only IgnoreKeyTextOutputFormat does that. Of course, all of 
Hive's custom HiveOutputFormat implementations can discard the key part, but 
they cannot be specified in the map-reduce jobconf, since they do not extend 
OutputFormat.

> Implement Indexing in Hive
> --------------------------
>
>                 Key: HIVE-417
>                 URL: https://issues.apache.org/jira/browse/HIVE-417
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Metastore, Query Processor
>    Affects Versions: 0.2.0, 0.3.0, 0.3.1, 0.4.0
>            Reporter: Prasad Chakka
>            Assignee: He Yongqiang
>         Attachments: hive-417.proto.patch, hive-417-2009-07-18.patch
>
>
> Implement indexing on Hive so that lookup and range queries are efficient.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
