[ https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16018218#comment-16018218 ]
Junjie Chen edited comment on PARQUET-41 at 5/20/17 12:58 AM:
--------------------------------------------------------------

Hi [~rdblue]

In the telecom example, the query column is not unique if someone makes two calls within a short time, such as 1 minute. The number of rows output depends on how frequently the subscriber was recorded. In my test, the false positive probability was set to 0.05.

As for the difference in speedup, I think it comes down to how many columns the table has. In your calculation, the bloom filter is 10% of the data, but that proportion would be significantly reduced in a table with 90+ columns, as in the telecom example (the time spent reading the column may not be significantly reduced if Parquet vectorization is enabled, but the size of the bloom filter relative to the row group can shrink significantly).

Tables with many columns are very common among customers. Based on your calculation, we can spend a very small amount of space on index statistics to achieve at least a 5x speedup in the HDFS scan stage for big tables with many columns and a ~unique column.

was (Author: junjie):
Hi [~rdblue]

In the telecom example, the query column is not unique if someone makes two calls within a short time, such as 1 minute. The number of rows output depends on how frequently the subscriber was recorded. In my test, the false positive probability was set to 0.05.

As for the difference in speedup, I think it comes down to how many columns the table has. In your calculation, the bloom filter is 10% of the data, but that proportion would be significantly reduced in a table with 90+ columns, as in the telecom example (the time spent reading the column may not be significantly reduced if Parquet vectorization is enabled, but the size of the bloom filter relative to the row group can shrink significantly).

Tables with many columns are very common among customers. Based on your calculation, we can spend a very small amount of space on index statistics to achieve at least a 5x speedup in the HDFS scan stage for big tables with many columns.
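
To make the column-count argument concrete, here is a rough back-of-the-envelope sketch. It assumes a classic Bloom filter sized with the textbook formula (bits per value = -ln(p) / (ln 2)^2), which may differ from whatever layout the pull request uses. The 8 bytes per value, the equal-width columns, and the function names are all hypothetical, and compression and encoding are ignored.

```python
import math


def bloom_bits_per_value(fpp: float) -> float:
    """Bits per distinct value for a classic Bloom filter with the optimal
    number of hash functions (textbook formula, not Parquet-specific)."""
    return -math.log(fpp) / (math.log(2) ** 2)


def bloom_fraction_of_row_group(fpp: float, bytes_per_value: float,
                                num_columns: int) -> float:
    """Filter size divided by row group size.

    Assumptions (hypothetical): the filtered column is unique, so there is
    one distinct value per row, and every column stores about
    bytes_per_value bytes per row.
    """
    filter_bytes_per_row = bloom_bits_per_value(fpp) / 8
    row_bytes = bytes_per_value * num_columns
    return filter_bytes_per_row / row_bytes


if __name__ == "__main__":
    fpp = 0.05  # false positive probability used in the test above
    print(f"bits per value: {bloom_bits_per_value(fpp):.2f}")
    for columns in (1, 90):
        share = bloom_fraction_of_row_group(fpp, bytes_per_value=8,
                                            num_columns=columns)
        print(f"{columns:>2} column(s): filter is {share:.2%} of the row group")
```

Under these assumptions the filter costs a little over 6 bits per value, which is roughly a tenth of a single 8-byte column, and its share of the row group falls in proportion to the number of columns.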

> Add bloom filters to parquet statistics
> ---------------------------------------
>
>                 Key: PARQUET-41
>                 URL: https://issues.apache.org/jira/browse/PARQUET-41
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-format, parquet-mr
>            Reporter: Alex Levenson
>            Assignee: Ferdinand Xu
>              Labels: filter2
>
> For row groups with no dictionary, we could still produce a bloom filter.
> This could be very useful in filtering entire row groups.
> Pull request:
> https://github.com/apache/parquet-mr/pull/215

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)