[ 
https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14353285#comment-14353285
 ] 

Julien Le Dem commented on PARQUET-41:
--------------------------------------

The first step is to specify how we would store the Bloom filter in 
parquet-format.
As [~alexlevenson] mentioned, it should be defined at the binary level.
A Bloom filter is just a byte array (or possibly a few), probably plus the spec 
of the hash functions used, so it should not be that hard to define.
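
To make "a byte array plus the spec of the hash functions" concrete, here is a 
minimal sketch. It is not a proposed parquet-format layout: the class name, 
the default sizes, and the choice of SHA-256 with double hashing are all 
assumptions made for illustration only.

```python
import hashlib


class BloomFilterSketch:
    """Illustrative only: a bit set stored as bytes plus a fixed hash scheme."""

    def __init__(self, num_bytes=128, num_hashes=4):
        # num_bytes and num_hashes are hypothetical defaults, not spec values.
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bytes)

    def _positions(self, value: bytes):
        # Derive all bit positions from one SHA-256 digest via double hashing.
        # A real spec would have to pin down the hash function exactly like this.
        digest = hashlib.sha256(value).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:16], "little") | 1  # force an odd step
        num_bits = len(self.bits) * 8
        return [(h1 + i * h2) % num_bits for i in range(self.num_hashes)]

    def add(self, value: bytes) -> None:
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value: bytes) -> bool:
        # False means definitely absent; True means possibly present.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(value)
        )

    def to_bytes(self) -> bytes:
        # The serialized form is nothing more than the raw byte array.
        return bytes(self.bits)

    @classmethod
    def from_bytes(cls, data: bytes, num_hashes=4):
        bf = cls(num_bytes=len(data), num_hashes=num_hashes)
        bf.bits = bytearray(data)
        return bf
```

A reader that knows only the byte array, the number of hashes, and the hash 
scheme can rebuild the filter with from_bytes and query it, which is why 
fixing those details at the binary level matters across implementations.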

Alternatively, there are custom properties that can be added for non-standard 
things, but I would recommend going the way defined above for Bloom filters, as 
they are a standard we want to define.

Listing a few folks who should probably review the format for this (from 
Impala, Drill, SparkSQL): [~marcelk] [~nongli] [~pipfiddle] [~jaltekruse] 
[~rdblue] [~marmbrus]

> Add bloom filters to parquet statistics
> ---------------------------------------
>
>                 Key: PARQUET-41
>                 URL: https://issues.apache.org/jira/browse/PARQUET-41
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-format, parquet-mr
>            Reporter: Alex Levenson
>            Assignee: ferdinand xu
>              Labels: filter2
>
> For row groups with no dictionary, we could still produce a bloom filter. 
> This could be very useful in filtering entire row groups.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
