[jira] [Commented] (PARQUET-41) Add bloom filters to parquet statistics

Alex Levenson (JIRA) Mon, 09 Mar 2015 12:21:31 -0700

    [ 
https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14353425#comment-14353425
 ]


Alex Levenson commented on PARQUET-41:
--------------------------------------

Yes, I was just thinking that the hash functions for each data type need to be 
well defined too, but Julien beat me to it :)

So I would say the next steps are:
1) Define a binary format for serializing a bloom filter into a byte array, 
including any configuration data (like size), and also discuss whether that 
configuration should be global (stored once) or repeated in each bloom filter.

2) Define the hash function used for each data type.

3) Create a Java implementation of the above, possibly borrowing code snippets 
from Algebird or Guava.
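
To make the three steps concrete, here is a rough sketch of what they might 
look like together. Everything in it is an assumption for discussion, not a 
proposed format: the class name BloomFilterSketch, the 8-byte little-endian 
header (bit count, then hash count) repeated in front of every filter, the 
use of 64-bit FNV-1a as the per-type hash, and the double-hashing scheme for 
picking bit positions. It does not borrow from Algebird or Guava.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch only; not the Parquet format or an existing API. */
final class BloomFilterSketch {
    private final int numBits;    // filter size in bits (a multiple of 8 here)
    private final int numHashes;  // number of bit positions probed per value
    private final byte[] bits;

    BloomFilterSketch(int numBits, int numHashes) {
        if (numBits <= 0 || numBits % 8 != 0 || numHashes <= 0) {
            throw new IllegalArgumentException("bad bloom filter configuration");
        }
        this.numBits = numBits;
        this.numHashes = numHashes;
        this.bits = new byte[numBits / 8];
    }

    // Step 2 (assumed): every data type is first turned into a fixed byte
    // encoding, then hashed with 64-bit FNV-1a so all writers agree.
    static long hashBytes(byte[] data) {
        long h = 0xcbf29ce484222325L;          // FNV-1a 64-bit offset basis
        for (byte b : data) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;               // FNV 64-bit prime
        }
        return h;
    }

    static long hashLong(long v) {
        // Assumed encoding for INT64: 8 bytes, little-endian.
        return hashBytes(ByteBuffer.allocate(8)
                .order(ByteOrder.LITTLE_ENDIAN).putLong(v).array());
    }

    static long hashString(String s) {
        // Assumed encoding for strings: UTF-8 bytes.
        return hashBytes(s.getBytes(StandardCharsets.UTF_8));
    }

    // Derive numHashes bit positions from one 64-bit hash (double hashing).
    private int position(long hash, int i) {
        int h1 = (int) hash;
        int h2 = (int) (hash >>> 32);
        return Math.floorMod(h1 + i * h2, numBits);
    }

    void addHash(long hash) {
        for (int i = 0; i < numHashes; i++) {
            int idx = position(hash, i);
            bits[idx >>> 3] |= (byte) (1 << (idx & 7));
        }
    }

    boolean mightContainHash(long hash) {
        for (int i = 0; i < numHashes; i++) {
            int idx = position(hash, i);
            if ((bits[idx >>> 3] & (1 << (idx & 7))) == 0) {
                return false;                  // definitely not present
            }
        }
        return true;                           // present, or a false positive
    }

    // Step 1 (assumed layout): [numBits:int32][numHashes:int32][bit array].
    byte[] toBytes() {
        return ByteBuffer.allocate(8 + bits.length)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putInt(numBits).putInt(numHashes).put(bits).array();
    }

    static BloomFilterSketch fromBytes(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        BloomFilterSketch f = new BloomFilterSketch(buf.getInt(), buf.getInt());
        buf.get(f.bits);                       // copy the serialized bit array
        return f;
    }
}
```

In this sketch the configuration is repeated in each filter, which costs 8 
bytes per filter but keeps every serialized filter self-describing. Storing 
it once globally would drop the header and require all filters to share the 
same size and hash count.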

> Add bloom filters to parquet statistics
> ---------------------------------------
>
>                 Key: PARQUET-41
>                 URL: https://issues.apache.org/jira/browse/PARQUET-41
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-format, parquet-mr
>            Reporter: Alex Levenson
>            Assignee: ferdinand xu
>              Labels: filter2
>
> For row groups with no dictionary, we could still produce a bloom filter. 
> This could be very useful in filtering entire row groups.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
