[ https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14353425#comment-14353425 ]
Alex Levenson commented on PARQUET-41:
--------------------------------------
Yes, I was just thinking that the hash functions for each data type need to be
well defined too, but Julien beat me to it :)
So I would say the next steps are:
1) Define a binary format for serializing a bloom filter into a byte array,
including any configuration data (like size), and decide whether that
configuration should be global (stored once) or repeated in each bloom filter.
2) Define the hash function used for each data type.
3) Create a Java implementation of the above, possibly borrowing code snippets
from Algebird or Guava.
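
To make the three steps concrete, here is a rough Java sketch of what they could look like together. Everything in it is an assumption for discussion, not a proposed spec: the class name BloomFilterSketch, the header layout (two little-endian ints for the hash count and the bitset size, repeated in every filter rather than stored globally), the choice of FNV-1a as the single hash applied to a canonical byte encoding of each type, and the use of double hashing to derive the k bit positions.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical sketch only; the layout and hash choices are assumptions. */
final class BloomFilterSketch {
    private final int numHashes; // k, stored in the serialized header
    private final byte[] bits;   // the bitset, 8 bits per byte

    BloomFilterSketch(int numBytes, int numHashes) {
        this(new byte[checkPositive(numBytes)], checkPositive(numHashes));
    }

    private BloomFilterSketch(byte[] bits, int numHashes) {
        this.bits = bits;
        this.numHashes = numHashes;
    }

    private static int checkPositive(int v) {
        if (v <= 0) throw new IllegalArgumentException("must be positive: " + v);
        return v;
    }

    // Step 2: one hash per data type, defined as FNV-1a over a fixed
    // little-endian byte encoding so every writer agrees on the input bytes.
    static long hashInt(int v) {
        return fnv1a64(ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN).putInt(v).array());
    }

    static long hashLong(long v) {
        return fnv1a64(ByteBuffer.allocate(8).order(ByteOrder.LITTLE_ENDIAN).putLong(v).array());
    }

    static long hashBinary(byte[] v) {
        return fnv1a64(v);
    }

    // 64-bit FNV-1a, used here only because it is short enough to inline.
    static long fnv1a64(byte[] data) {
        long h = 0xcbf29ce484222325L;
        for (byte b : data) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;
        }
        return h;
    }

    void insertHash(long hash) {
        for (int i = 0; i < numHashes; i++) {
            long bit = bitIndex(hash, i);
            bits[(int) (bit >>> 3)] |= (byte) (1 << (int) (bit & 7));
        }
    }

    boolean mightContainHash(long hash) {
        for (int i = 0; i < numHashes; i++) {
            long bit = bitIndex(hash, i);
            if ((bits[(int) (bit >>> 3)] & (1 << (int) (bit & 7))) == 0) {
                return false; // definitely absent
            }
        }
        return true; // possibly present (false positives are expected)
    }

    // Double hashing: position_i = (h1 + i * h2) mod numBits, with h1 and h2
    // taken from the low and high halves of the 64-bit hash.
    private long bitIndex(long hash, int i) {
        int h1 = (int) hash;
        int h2 = (int) (hash >>> 32);
        return Math.floorMod(h1 + (long) i * h2, (long) bits.length * 8L);
    }

    // Step 1: header (numHashes, numBytes) followed by the raw bitset.
    byte[] toBytes() {
        return ByteBuffer.allocate(8 + bits.length).order(ByteOrder.LITTLE_ENDIAN)
                .putInt(numHashes).putInt(bits.length).put(bits).array();
    }

    static BloomFilterSketch fromBytes(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        int k = checkPositive(buf.getInt());
        byte[] bits = new byte[checkPositive(buf.getInt())];
        buf.get(bits);
        return new BloomFilterSketch(bits, k);
    }
}
```

A writer would call insertHash(hashLong(value)) for each value in a column chunk and store toBytes(); a reader would call fromBytes() and skip the row group whenever mightContainHash() returns false. Storing the configuration globally instead would just mean dropping the 8-byte header and passing the two ints in from file-level metadata.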
> Add bloom filters to parquet statistics
> ---------------------------------------
>
> Key: PARQUET-41
> URL: https://issues.apache.org/jira/browse/PARQUET-41
> Project: Parquet
> Issue Type: New Feature
> Components: parquet-format, parquet-mr
> Reporter: Alex Levenson
> Assignee: ferdinand xu
> Labels: filter2
>
> For row groups with no dictionary, we could still produce a bloom filter.
> This could be very useful in filtering entire row groups.