[ 
https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017758#comment-16017758
 ] 

Ryan Blue commented on PARQUET-41:
----------------------------------

[~junjie], it sounds like your query column has 100% unique values, like a UUID 
column? How many values is your customer searching for in a typical query? What 
is your starting false-positive probability? And if you know how many row 
groups are false-positives, that would be great to know, too.

I'm a little worried that bloom filters work well only in very specific cases. 
For example, if you have 100% unique values and a 10% FPP, then the size of the 
bloom filter will be about 10% of the size of the data (assuming each ID or 
UUID is stored in just 6 bytes after compression). That means the expected 
amount of data read for a 1-item query is ~20%: 10% for the BF and 10% for 
false-positives. For a 2-item query, that goes up to about 30%. One problem is 
that these numbers don't match your telecom example. If you're seeing a 
6-minute query go down to 15s, that's better than a 20x speedup, while reading 
~20% of the data would suggest something closer to 5x. Any idea what the 
difference is?
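
The estimate above can be reproduced with a rough sketch. This assumes a 
classic, optimally sized bloom filter (about -ln(p) / (ln 2)^2 bits per 
distinct value), 100% unique values, and independent false positives per 
searched item. The function names are hypothetical, not part of parquet-mr, 
and the row group that really contains a match is ignored.

```python
import math


def bloom_bits_per_value(fpp):
    """Bits per distinct value for an optimally sized classic bloom filter."""
    return -math.log(fpp) / (math.log(2) ** 2)


def expected_read_fraction(fpp, value_bytes, items):
    """Estimate the fraction of column data read for a query on `items` values.

    Assumes every value is unique and takes `value_bytes` bytes after
    compression, and that every row group's filter is read.
    """
    # Filter size relative to the size of the data it describes.
    filter_fraction = bloom_bits_per_value(fpp) / (value_bytes * 8)
    # Chance that at least one of the searched items is a false positive
    # in a given row group, so that the row group is read needlessly.
    false_positive_fraction = 1 - (1 - fpp) ** items
    return filter_fraction + false_positive_fraction


if __name__ == "__main__":
    for items in (1, 2):
        fraction = expected_read_fraction(0.10, 6, items)
        print(f"{items}-item query: ~{fraction:.0%} of the data read")
```

Under these assumptions the filter needs roughly 4.8 bits per value, which is 
about 10% of a 6-byte value, and the read fraction comes out near 20% for one 
item and 29% for two. Reading 20% of the data corresponds to about a 5x 
speedup at best, compared with the 24x implied by 6 minutes going down to 15s.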

> Add bloom filters to parquet statistics
> ---------------------------------------
>
>                 Key: PARQUET-41
>                 URL: https://issues.apache.org/jira/browse/PARQUET-41
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-format, parquet-mr
>            Reporter: Alex Levenson
>            Assignee: Ferdinand Xu
>              Labels: filter2
>
> For row groups with no dictionary, we could still produce a bloom filter. 
> This could be very useful in filtering entire row groups.
> Pull request:
> https://github.com/apache/parquet-mr/pull/215



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
