[ 
https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16018218#comment-16018218
 ] 

Junjie Chen edited comment on PARQUET-41 at 5/20/17 12:58 AM:
--------------------------------------------------------------

Hi [~rdblue],
In the telecom example, the query column is not unique: a subscriber can have 
two calls within a short window, such as 1 minute. The number of output rows 
depends on how frequently the subscriber was recorded. In my test, the false 
positive probability was set to 0.05.

As for the difference in speedup, I think it comes down to how many columns the 
table has. In your calculation, the bloom filter is 10% of the data, but that 
proportion would be significantly lower in a table with 90+ columns, as in the 
telecom example (the time spent reading columns may not drop significantly if 
Parquet vectorization is enabled, but the bloom filter's share of a row group 
shrinks significantly).
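
To make the proportion argument concrete, here is a rough sketch of the 
arithmetic. It is not taken from the test above: it assumes a classic Bloom 
filter sized by the standard formula (which may differ from what the patch 
implements), one filtered column whose values are all distinct, 
uncompressed 8-byte values, and equal-width columns. The function names are 
made up for illustration.

```python
import math


def bloom_bits_per_value(fpp: float) -> float:
    """Bits per distinct value for an optimally sized classic Bloom filter.

    Standard result: m / n = -ln(p) / (ln 2)^2.
    """
    return -math.log(fpp) / (math.log(2) ** 2)


def filter_share_of_data(fpp: float, value_bytes: int, num_columns: int) -> float:
    """Filter size divided by uncompressed row data size.

    Assumptions (illustrative only): a single filtered column, every value
    distinct, and every column storing value_bytes per row.
    """
    bits_per_row = 8 * value_bytes * num_columns
    return bloom_bits_per_value(fpp) / bits_per_row


if __name__ == "__main__":
    for cols in (1, 90):
        share = filter_share_of_data(fpp=0.05, value_bytes=8, num_columns=cols)
        print(f"{cols:>2} columns: filter is {share:.2%} of the data")
```

Under these assumptions the filter is roughly 10% of a single 8-byte column 
but only about 0.1% of a 90-column table, because the filter size stays fixed 
while the row data grows with the column count.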

Wide tables are very common among many customers. Based on your calculation, 
we can spend a very small amount of space on index statistics and achieve at 
least a 5x speedup in the HDFS scan stage for big, wide tables that have a 
nearly unique column.
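
As a back-of-the-envelope check on why a nearly unique column matters, the 
sketch below estimates how many row groups still have to be read when a 
bloom filter is consulted first. It is a simplified model, not a 
measurement: it assumes every row group has its own filter, that false 
positives are independent at the configured rate, and it ignores the cost of 
reading the filters themselves and any fixed job overhead. The function 
names are hypothetical.

```python
def expected_row_groups_read(total: int, matching: int, fpp: float) -> float:
    """Expected number of row groups scanned for a point lookup.

    Row groups that really contain the value are always read; each of the
    remaining ones is read only on a false positive.
    """
    return matching + fpp * (total - matching)


def scan_speedup(total: int, matching: int, fpp: float) -> float:
    """Ratio of row groups read without filters to row groups read with them."""
    return total / expected_row_groups_read(total, matching, fpp)


if __name__ == "__main__":
    # Nearly unique column: the value lives in 1 of 1000 row groups.
    print(scan_speedup(total=1000, matching=1, fpp=0.05))
    # Less selective column: the value appears in 200 of 1000 row groups.
    print(scan_speedup(total=1000, matching=200, fpp=0.05))
```

In this model the gain is bounded by 1 / fpp (20x at 0.05) and drops quickly 
as the value appears in more row groups, which is why the benefit is tied to 
a nearly unique column.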





was (Author: junjie):
Hi [~rdblue],
In the telecom example, the query column is not unique: a subscriber can have 
two calls within a short window, such as 1 minute. The number of output rows 
depends on how frequently the subscriber was recorded. In my test, the false 
positive probability was set to 0.05.

As for the difference in speedup, I think it comes down to how many columns the 
table has. In your calculation, the bloom filter is 10% of the data, but that 
proportion would be significantly lower in a table with 90+ columns, as in the 
telecom example (the time spent reading columns may not drop significantly if 
Parquet vectorization is enabled, but the bloom filter's share of a row group 
shrinks significantly).

Wide tables are very common among many customers. Based on your calculation, 
we can spend a very small amount of space on index statistics and achieve at 
least a 5x speedup in the HDFS scan stage for big, wide tables.




> Add bloom filters to parquet statistics
> ---------------------------------------
>
>                 Key: PARQUET-41
>                 URL: https://issues.apache.org/jira/browse/PARQUET-41
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-format, parquet-mr
>            Reporter: Alex Levenson
>            Assignee: Ferdinand Xu
>              Labels: filter2
>
> For row groups with no dictionary, we could still produce a bloom filter. 
> This could be very useful in filtering entire row groups.
> Pull request:
> https://github.com/apache/parquet-mr/pull/215



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
