[ https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16093432#comment-16093432 ]
Jim Apple commented on PARQUET-41: ---------------------------------- We might want to consider "Cache-, Hash- and Space-Efficient Bloom Filters": http://algo2.iti.kit.edu/documents/cacheefficientbloomfilters-jea.pdf Here is an implementation using those block Bloom filters combined with so-called "split" Bloom filters as described in "Network Applications of Bloom Filters: A Survey", section 2.1: https://www.eecs.harvard.edu/~michaelm/postscripts/im2005b.pdf https://gist.github.com/jbapple-cloudera/e78460e641967e33d6b68877cff27202 This can be very fast for a couple of reasons: 1. Only touching one cache line can be much faster than touching `k` cache lines as in a standard Bloom filter. 2. Intel and AMD processors in the last 2-4 years have AVX2 instructions that can parallelize a number of the operations https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#CPUs_with_AVX2 As a result, I have seen speedups of 50% to 500% (depending on the operation and filter size) over the fastest other general-purpose Bloom filter I know of. > Add bloom filters to parquet statistics > --------------------------------------- > > Key: PARQUET-41 > URL: https://issues.apache.org/jira/browse/PARQUET-41 > Project: Parquet > Issue Type: New Feature > Components: parquet-format, parquet-mr > Reporter: Alex Levenson > Assignee: Ferdinand Xu > Labels: filter2 > > For row groups with no dictionary, we could still produce a bloom filter. > This could be very useful in filtering entire row groups. > Pull request: > https://github.com/apache/parquet-mr/pull/215 -- This message was sent by Atlassian JIRA (v6.4.14#64029)