[ 
https://issues.apache.org/jira/browse/HIVE-9188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14268176#comment-14268176
 ] 

Owen O'Malley commented on HIVE-9188:
-------------------------------------

[~gopalv] I don't understand your concern. The indexes are already stored in
ROW_INDEX streams. I'm just saying that the bloom filters, which are much
larger than the rest of the ROW_INDEX, should be split into a BLOOM_FILTER
stream instead of being bundled in with the ROW_INDEX stream. That would let
you load just the ROW_INDEX if you don't need the bloom filter.
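
To illustrate the point about separate streams, here is a minimal sketch in Python. It is not ORC reader code: the `Stream` class, the `ranges_to_read` helper, and the stream lengths are hypothetical. The only assumption about the format is that a stripe's streams are laid out back to back and described by kind, column, and length, so a reader can work out which byte ranges to fetch.

```python
from dataclasses import dataclass


@dataclass
class Stream:
    """Hypothetical description of one stream in a stripe."""
    kind: str     # e.g. "ROW_INDEX" or "BLOOM_FILTER"
    column: int   # column id the stream belongs to
    length: int   # stream length in bytes


def ranges_to_read(streams, wanted_kinds, stripe_offset=0):
    """Return (offset, length, stream) for each stream whose kind is wanted.

    Streams are assumed to be stored consecutively, so the offset of each
    stream is the running sum of the lengths before it.
    """
    ranges = []
    offset = stripe_offset
    for s in streams:
        if s.kind in wanted_kinds:
            ranges.append((offset, s.length, s))
        offset += s.length
    return ranges


# Made-up lengths: small row indexes next to much larger bloom filters.
streams = [
    Stream("ROW_INDEX", 1, 120),
    Stream("BLOOM_FILTER", 1, 7800),
    Stream("ROW_INDEX", 2, 90),
    Stream("BLOOM_FILTER", 2, 7800),
]

index_only = ranges_to_read(streams, {"ROW_INDEX"})
with_bloom = ranges_to_read(streams, {"ROW_INDEX", "BLOOM_FILTER"})

bytes_index_only = sum(length for _, length, _ in index_only)
bytes_with_bloom = sum(length for _, length, _ in with_bloom)
print(bytes_index_only, bytes_with_bloom)
```

With the two kinds kept apart, a query that cannot use the bloom filter only has to read the small ROW_INDEX ranges.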

The size of the bloom filter needs to scale with the number of items. You've
sized them for the default row group size (n = 10,000, p = 0.05) -> 7.8 KB.
To use them at the file level, you'd need to make the bloom filters much
larger. For a file with 100 million values in a column, you'd need a 74 MB
bloom filter. I'd propose that you only do the bloom filters at the row
group level and scale them to match the row index stride rather than just
using the default 10k.
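
The sizes quoted above can be checked with the standard bloom filter sizing formula, m = -n ln(p) / (ln 2)^2 bits for n items at false positive rate p. The comment does not say which formula the patch uses, so treat the sketch below as an assumption that happens to reproduce the quoted figures. The function names are illustrative only.

```python
import math


def bloom_bits(n, p):
    """Bits needed for n items at false positive probability p."""
    return math.ceil(-n * math.log(p) / (math.log(2) ** 2))


def bloom_hashes(n, m):
    """Number of hash functions that minimizes false positives for m bits."""
    return max(1, round(m / n * math.log(2)))


# Default row group size: n = 10,000, p = 0.05
row_group_bytes = bloom_bits(10_000, 0.05) / 8

# File-level filter over 100 million values at the same p
file_level_bytes = bloom_bits(100_000_000, 0.05) / 8

print(f"row group:  {row_group_bytes / 1000:.1f} kB")
print(f"file level: {file_level_bytes / 2**20:.1f} MiB")
```

Scaling to the row index stride then amounts to passing the configured stride as `n` instead of the fixed 10,000.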

> BloomFilter in ORC row group index
> ----------------------------------
>
>                 Key: HIVE-9188
>                 URL: https://issues.apache.org/jira/browse/HIVE-9188
>             Project: Hive
>          Issue Type: New Feature
>          Components: File Formats
>    Affects Versions: 0.15.0
>            Reporter: Prasanth Jayachandran
>            Assignee: Prasanth Jayachandran
>              Labels: orcfile
>         Attachments: HIVE-9188.1.patch, HIVE-9188.2.patch, HIVE-9188.3.patch, 
> HIVE-9188.4.patch
>
>
> Bloom filters are a well-known probabilistic data structure for set
> membership checking. We can use bloom filters in the ORC index for better
> row group pruning. Currently, the ORC row group index uses min/max
> statistics to eliminate row groups (stripes as well) that do not satisfy
> the predicate condition specified in the query. But in some cases, the
> efficiency of min/max based elimination is not optimal (unsorted columns
> with a wide range of entries). Bloom filters can be an effective and
> efficient alternative for row group/split elimination for point queries or
> queries with an IN clause.
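
A minimal sketch of the idea described in the issue, in Python. This is not the code from the attached patches: the `BloomFilter` class, its SHA-256 based double hashing, and `candidate_row_groups` are assumptions made for illustration. In the sample data the value 61 falls inside the min/max range of every row group, so min/max statistics alone could not skip any of them.

```python
import hashlib


class BloomFilter:
    """Toy bloom filter; the hashing scheme is an illustrative choice."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, value):
        # Derive k bit positions from two 64-bit halves of one digest.
        digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value):
        # False means definitely absent; True means possibly present.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(value))


def candidate_row_groups(filters, wanted_values):
    """Indexes of row groups that may hold any value of an IN list."""
    return [i for i, f in enumerate(filters)
            if any(f.might_contain(v) for v in wanted_values)]


# Unsorted column values, three per row group, with wide overlapping ranges.
row_groups = [[17, 903, 42], [5, 880, 61], [300, 12, 999]]

filters = []
for values in row_groups:
    f = BloomFilter(num_bits=1024, num_hashes=4)
    for v in values:
        f.add(v)
    filters.append(f)

# Equivalent of: WHERE col IN (61, 999)
hits = candidate_row_groups(filters, [61, 999])
print(hits)
```

A bloom filter never reports a false negative, so row groups 1 and 2 are always in `hits`. Row group 0 is skipped unless it produces a false positive.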



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)