[
https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16093432#comment-16093432
]
Jim Apple commented on PARQUET-41:
----------------------------------
We might want to consider "Cache-, Hash- and Space-Efficient Bloom Filters":
http://algo2.iti.kit.edu/documents/cacheefficientbloomfilters-jea.pdf
Here is an implementation using those block Bloom filters combined with
so-called "split" Bloom filters as described in "Network Applications of Bloom
Filters: A Survey", section 2.1:
https://www.eecs.harvard.edu/~michaelm/postscripts/im2005b.pdf
https://gist.github.com/jbapple-cloudera/e78460e641967e33d6b68877cff27202
This can be very fast for a couple of reasons:
1. Only touching one cache line can be much faster than touching `k` cache
lines as in a standard Bloom filter.
2. Intel and AMD processors in the last 2-4 years have AVX2 instructions that
can parallelize a number of the operations
https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#CPUs_with_AVX2
As a result, I have seen speedups of 50% to 500% (depending on the operation
and filter size) over the fastest other general-purpose Bloom filter I know of.
> Add bloom filters to parquet statistics
> ---------------------------------------
>
> Key: PARQUET-41
> URL: https://issues.apache.org/jira/browse/PARQUET-41
> Project: Parquet
> Issue Type: New Feature
> Components: parquet-format, parquet-mr
> Reporter: Alex Levenson
> Assignee: Ferdinand Xu
> Labels: filter2
>
> For row groups with no dictionary, we could still produce a bloom filter.
> This could be very useful in filtering entire row groups.
> Pull request:
> https://github.com/apache/parquet-mr/pull/215
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)