[ 
https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14616251#comment-14616251
 ] 

Ferdinand Xu commented on PARQUET-41:
-------------------------------------

Hi [~rdblue], I have some thoughts on the space efficiency of the bloom 
filter.
First, I think we should define the level at which the bloom filter takes 
effect. The bloom filter complements the dictionary. At the page level, we 
already have the dictionary page, which helps us filter data pages. At the 
level above, we could use a bloom filter to filter the column chunk without 
parsing the dictionary page. To serve this purpose, we could make some changes 
to the current implementation. Right now the bloom filter statistics are part 
of the statistics stored in the data page header. That is not a good design, 
since it uses more space than expected. So I am thinking about making the bloom 
filter statistics part of ColumnChunk instead. One extra benefit is that we can 
postpone constructing the bloom filter and build it in the flush method. At 
that stage, we have a better understanding of what the data looks like (how 
many unique values there are). Any suggestions on this? We could have several 
rounds of discussion and do the POC work once the discussion is complete. 
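
To make the flush-time idea concrete, here is a rough sketch in Python (not 
parquet-mr code). It assumes the writer can buffer a chunk's values until 
flush, and it uses the standard bloom filter sizing formulas 
m = -n * ln(p) / (ln 2)^2 and k = (m / n) * ln 2. The names optimal_params, 
BloomFilter and ColumnChunkWriter, the SHA-256 based hashing, and the 1% 
false-positive target are all placeholders I made up for illustration.

```python
import hashlib
import math


def optimal_params(ndv, fpp):
    """Return (num_bits, num_hashes) for ndv distinct values at false-positive rate fpp."""
    num_bits = max(8, math.ceil(-ndv * math.log(fpp) / (math.log(2) ** 2)))
    num_hashes = max(1, round(num_bits / ndv * math.log(2)))
    return num_bits, num_hashes


class BloomFilter:
    """Plain bit-array bloom filter using double hashing (illustrative only)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, value):
        # Derive two 64-bit hashes from one digest, then combine them.
        digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:16], "little") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(value)
        )


class ColumnChunkWriter:
    """Hypothetical writer: buffers values, builds the filter only at flush."""

    def __init__(self, fpp=0.01):
        self.fpp = fpp
        self.values = []

    def write(self, value):
        self.values.append(value)

    def flush(self):
        # The distinct count is known here, so the filter is sized to the
        # data instead of to a guess made before any value was written.
        distinct = set(self.values)
        num_bits, num_hashes = optimal_params(max(1, len(distinct)), self.fpp)
        bloom = BloomFilter(num_bits, num_hashes)
        for value in distinct:
            bloom.add(value)
        return bloom
```

With this shape, a chunk holding 1000 rows but only 10 distinct values gets a 
filter sized for 10 entries rather than 1000, which is where the space saving 
would come from. A real writer would probably track the distinct count 
incrementally instead of keeping every value around.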

Regards,
Ferd

> Add bloom filters to parquet statistics
> ---------------------------------------
>
>                 Key: PARQUET-41
>                 URL: https://issues.apache.org/jira/browse/PARQUET-41
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-format, parquet-mr
>            Reporter: Alex Levenson
>            Assignee: Ferdinand Xu
>              Labels: filter2
>
> For row groups with no dictionary, we could still produce a bloom filter. 
> This could be very useful in filtering entire row groups.



