[ 
https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16093432#comment-16093432
 ] 

Jim Apple commented on PARQUET-41:
----------------------------------

We might want to consider "Cache-, Hash- and Space-Efficient Bloom Filters": 
http://algo2.iti.kit.edu/documents/cacheefficientbloomfilters-jea.pdf

Here is an implementation using those block Bloom filters combined with 
so-called "split" Bloom filters as described in "Network Applications of Bloom 
Filters: A Survey", section 2.1: 
https://www.eecs.harvard.edu/~michaelm/postscripts/im2005b.pdf

https://gist.github.com/jbapple-cloudera/e78460e641967e33d6b68877cff27202

This can be very fast for a couple of reasons:

1. Only touching one cache line can be much faster than touching `k` cache 
lines as in a standard Bloom filter.

2. Intel and AMD processors in the last 2-4 years have AVX2 instructions that 
can parallelize a number of the operations 
https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#CPUs_with_AVX2  

As a result, I have seen speedups of 50% to 500% (depending on the operation 
and filter size) over the fastest other general-purpose Bloom filter I know of.

> Add bloom filters to parquet statistics
> ---------------------------------------
>
>                 Key: PARQUET-41
>                 URL: https://issues.apache.org/jira/browse/PARQUET-41
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-format, parquet-mr
>            Reporter: Alex Levenson
>            Assignee: Ferdinand Xu
>              Labels: filter2
>
> For row groups with no dictionary, we could still produce a bloom filter. 
> This could be very useful in filtering entire row groups.
> Pull request:
> https://github.com/apache/parquet-mr/pull/215



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to