[ 
https://issues.apache.org/jira/browse/PARQUET-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17697463#comment-17697463
 ] 

Gang Wu commented on PARQUET-2254:
----------------------------------

There are two questions here: 1) creating bloom filters without explicit parameters, 
and 2) deciding which levels of filters to use for PPD (predicate pushdown). Both 
require additional statistics as input, such as the data distribution of the column 
and how effective the filter has been in the past. Therefore I think parquet 
itself does not have to be that smart, because it does not have those 
statistics. Users would obtain those statistics from elsewhere and configure the 
parquet writer to create bloom filters, and decide which level of filters to use 
for PPD. WDYT? [~gszadovszky] [~miracle] 
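
To make the "statistics come from outside" point concrete, here is a minimal sketch of sizing a filter from an externally supplied distinct-value count. It uses the textbook Bloom filter formula, not Parquet's split-block sizing, and the names `bloom_filter_bits`, `plan_bloom_filter`, and the `stats` dictionary are hypothetical, not part of any Parquet API.

```python
import math


def bloom_filter_bits(ndv, fpp):
    """Textbook sizing: m = -n * ln(p) / (ln 2)^2 bits.

    ndv: expected number of distinct values (from external statistics).
    fpp: target false-positive probability, strictly between 0 and 1.
    """
    if ndv <= 0 or not 0.0 < fpp < 1.0:
        raise ValueError("ndv must be positive and fpp must be in (0, 1)")
    return math.ceil(-ndv * math.log(fpp) / (math.log(2) ** 2))


def plan_bloom_filter(stats, max_bytes, fpp=0.01):
    """Return a filter size in bytes, or None to skip the filter.

    stats is a hypothetical dict such as {"ndv": 1000}; where it comes
    from (catalog, previous files, sampling) is up to the caller.
    """
    size_bytes = math.ceil(bloom_filter_bits(stats["ndv"], fpp) / 8)
    return size_bytes if size_bytes <= max_bytes else None
```

The caller would then pass the resulting size (or the decision to skip the filter) to the writer configuration; the policy of skipping when the size exceeds `max_bytes` is only one possible choice.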

> Build a BloomFilter with a more precise size
> --------------------------------------------
>
>                 Key: PARQUET-2254
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2254
>             Project: Parquet
>          Issue Type: Improvement
>            Reporter: Mars
>            Assignee: Mars
>            Priority: Major
>
> Currently the user must specify the size up front and then build the BloomFilter. 
> In general scenarios, the number of distinct values is not known in advance. 
> If the BloomFilter can be sized automatically according to the data, the file 
> size can be reduced and the reading efficiency can also be improved.
> My idea is that the user specifies a maximum BloomFilter size, 
> then we build multiple BloomFilters of different sizes at the same time, using 
> the largest BloomFilter as a counting tool (if there is no hit when inserting a 
> value, the counter is incremented by 1; this may be imprecise but is good enough).
> Then at the end of the write, we choose a BloomFilter of a more appropriate size 
> to write to the file.
> I want to implement this feature and hope to get your opinions, thank you.
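
The proposal quoted above can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the parquet-mr implementation: `SimpleBloom` is a plain Bloom filter (not Parquet's split-block layout), candidate sizes are successive halvings of the maximum, and the `bits_per_value` threshold of 10 is an arbitrary choice. All class and parameter names are hypothetical.

```python
import hashlib


class SimpleBloom:
    """Toy Bloom filter using double hashing over a SHA-256 digest."""

    def __init__(self, num_bits, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, value):
        digest = hashlib.sha256(value).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:16], "little") | 1  # keep the step odd
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def might_contain(self, value):
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(value))

    def insert(self, value):
        for p in self._positions(value):
            self.bits[p >> 3] |= 1 << (p & 7)


class AdaptiveBloomBuilder:
    """Fill several candidate filters at once and pick one at the end."""

    def __init__(self, max_bits, num_candidates=4, bits_per_value=10):
        # Candidates ordered largest first: max_bits, max_bits/2, ...
        self.candidates = [SimpleBloom(max_bits >> i) for i in range(num_candidates)]
        self.bits_per_value = bits_per_value  # assumed sizing target
        self.distinct_estimate = 0

    def add(self, value):
        # The largest filter doubles as an approximate distinct counter:
        # a miss means the value has not been seen before. False positives
        # make the count a slight underestimate.
        if not self.candidates[0].might_contain(value):
            self.distinct_estimate += 1
        for candidate in self.candidates:
            candidate.insert(value)

    def finish(self):
        # Choose the smallest candidate that still meets the bits-per-value
        # target; fall back to the largest if none does.
        needed = self.distinct_estimate * self.bits_per_value
        for candidate in reversed(self.candidates):
            if candidate.num_bits >= needed:
                return candidate
        return self.candidates[0]
```

A caller would invoke `add` for every value written to the column and `finish` when the column chunk is closed, serializing only the returned filter. The cost of this design is that every candidate is held in memory and updated on each insert until the write ends.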



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
