[ https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017758#comment-16017758 ]
Ryan Blue commented on PARQUET-41:
----------------------------------

[~junjie], it sounds like your query column has 100% unique values, like a UUID column? How many values is your customer searching for in a typical query? What is your starting false-positive probability? And if you know how many row groups are false-positives, that would be great to know, too.

I'm a little worried that bloom filters work just for very specific cases. For example, if you have 100% unique values and a 10% FPP then the size of the bloom filter will be 10% of the size of the data (assuming each ID or UUID is stored in just 6 bytes after compression). That means the expected amount of data read for a 1-item query is ~20%: 10% for the BF and 10% for false-positives. For a 2-item query, that goes up to about 30%.

One problem is that these numbers don't match your telecom example. If you're seeing a 6m query go down to 15s, that's better than a 20x speedup. Any idea what the difference is?

> Add bloom filters to parquet statistics
> ---------------------------------------
>
>                 Key: PARQUET-41
>                 URL: https://issues.apache.org/jira/browse/PARQUET-41
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-format, parquet-mr
>            Reporter: Alex Levenson
>            Assignee: Ferdinand Xu
>              Labels: filter2
>
> For row groups with no dictionary, we could still produce a bloom filter.
> This could be very useful in filtering entire row groups.
> Pull request:
> https://github.com/apache/parquet-mr/pull/215

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
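
The size estimate in the comment above can be checked with a few lines of arithmetic. The sketch below is only an illustration under stated assumptions: it uses the classic Bloom filter sizing formula (bits per value = -ln(p) / (ln 2)^2), which may differ from whatever the patch implements, and it assumes every value is unique, the filter is stored uncompressed, row groups that truly contain a match are a negligible share of the data, and false positives are independent across lookups. The function names are hypothetical.

```python
import math


def bloom_bits_per_value(fpp: float) -> float:
    """Bits per distinct value for an optimally sized classic Bloom filter."""
    return -math.log(fpp) / (math.log(2) ** 2)


def expected_read_fraction(fpp: float, stored_bytes_per_value: float,
                           items_queried: int) -> float:
    """Rough fraction of the column's bytes read for a point query.

    Assumptions (not from the Parquet patch): all values are unique, the
    whole filter is read, and true matches are a negligible share of data.
    """
    # Filter size relative to the compressed column data.
    filter_fraction = bloom_bits_per_value(fpp) / (8 * stored_bytes_per_value)
    # A non-matching row group is still read if any lookup is a false positive.
    false_positive_fraction = 1 - (1 - fpp) ** items_queried
    return filter_fraction + false_positive_fraction


if __name__ == "__main__":
    # 10% FPP and 6 bytes per stored value, as in the comment.
    for items in (1, 2):
        fraction = expected_read_fraction(0.10, 6, items)
        print(f"{items}-item query: about {fraction:.0%} of the data read")
```

With a 10% FPP the formula gives a little under 5 bits per value, or roughly 10% of a 6-byte value, so the sketch lands near the 20% and 30% figures quoted in the comment.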