[ 
https://issues.apache.org/jira/browse/PARQUET-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17697510#comment-17697510
 ] 

Gabor Szadovszky commented on PARQUET-2254:
-------------------------------------------

1) I think that for creating bloom filters we have the statistics needed to 
decide how much space the bloom filter should occupy (we have the actual data). 
What we don't know is whether the bloom filter itself will be useful. (Will 
there be filtering on the related column, and will it use predicates like 
Eq/NotEq/IsIn?) That should be decided by the client through the already 
introduced properties. We do not write bloom filters by default anyway.
2) Of course it is hard to be smart about PPD (predicate pushdown), since we 
don't know the actual data (we are just about to read it). But the row group 
filters are checked in a fixed order: statistics, dictionary, bloom filter. 
Checking the statistics first is obviously correct. What I am not sure about is 
whether we want to check the dictionary first and then the bloom filter, or the 
other way around. Because of that question, I am also unsure whether it is good 
practice to skip writing bloom filters when the whole column chunk is 
dictionary encoded.
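
To make the ordering in point 2 concrete, here is a minimal sketch of a short-circuiting row group check for an equality predicate. It is not the parquet-mr implementation: `can_drop_row_group` and its parameters are hypothetical names, the statistics are reduced to a (min, max) pair, the dictionary is a plain set that is passed only when the whole column chunk is dictionary encoded, and the bloom filter is any callable that answers "might contain".

```python
def can_drop_row_group(value, stats, dictionary=None, bloom=None):
    """Decide whether a row group can be skipped for `column == value`.

    Returns (drop, reason). Checks run in the order discussed above:
    statistics, then dictionary, then bloom filter. All names here are
    hypothetical and only illustrate the ordering.
    """
    lo, hi = stats
    # 1. Statistics: cheapest check, the metadata is already in memory.
    if value < lo or value > hi:
        return True, "statistics"
    # 2. Dictionary: exact answer, but only usable when every page of the
    #    column chunk is dictionary encoded (modelled by dictionary != None).
    if dictionary is not None and value not in dictionary:
        return True, "dictionary"
    # 3. Bloom filter: probabilistic, can only prove absence.
    if bloom is not None and not bloom(value):
        return True, "bloom filter"
    return False, None
```

Swapping steps 2 and 3 never changes whether a row group is dropped, only which structure has to be read first, which is the trade-off in question.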

> Build a BloomFilter with a more precise size
> --------------------------------------------
>
>                 Key: PARQUET-2254
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2254
>             Project: Parquet
>          Issue Type: Improvement
>            Reporter: Mars
>            Assignee: Mars
>            Priority: Major
>
> Currently the user specifies the size up front and the BloomFilter is then 
> built. In general the number of distinct values is not known in advance. 
> If the BloomFilter could be sized automatically according to the data, the 
> file size could be reduced and read efficiency improved.
> My idea is that the user specifies a maximum BloomFilter size, and we then 
> build multiple BloomFilters at the same time. We can use the largest 
> BloomFilter as a counting tool (if there is no hit when inserting a value, 
> the counter is incremented by 1; this may be imprecise, but it is good enough).
> At the end of the write, we choose a BloomFilter of a more appropriate size 
> when the file is finally written.
> I want to implement this feature and hope to get your opinions. Thank you.
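
The proposal quoted above can be sketched roughly as follows. This is only an illustration under several assumptions, not the eventual parquet-mr design: `SimpleBloom` is a toy bit-array filter rather than Parquet's split-block bloom filter, candidate sizes are obtained by halving the maximum, and `bits_per_value=10` is an arbitrary sizing rule. All class and parameter names are hypothetical.

```python
import hashlib


class SimpleBloom:
    """Toy Bloom filter (NOT Parquet's split-block format)."""

    def __init__(self, num_bits, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, value):
        # Double hashing: derive all bit positions from one SHA-256 digest.
        digest = hashlib.sha256(repr(value).encode()).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:16], "little") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, value):
        """Insert value; return True if at least one bit was newly set."""
        was_new = False
        for p in self._positions(value):
            byte, mask = p // 8, 1 << (p % 8)
            if not self.bits[byte] & mask:
                was_new = True
                self.bits[byte] |= mask
        return was_new

    def might_contain(self, value):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(value))


class AdaptiveBloomWriter:
    """Fill several candidate filters at once; pick one when writing ends."""

    def __init__(self, max_bits, min_bits=64, bits_per_value=10):
        sizes, size = [], max_bits
        while size >= min_bits:
            sizes.append(size)
            size //= 2
        self.candidates = [SimpleBloom(s) for s in sizes]  # largest first
        self.bits_per_value = bits_per_value  # assumed sizing rule
        self.approx_distinct = 0

    def add(self, value):
        results = [c.add(value) for c in self.candidates]
        # The largest filter doubles as the counting tool: a miss on insert
        # means the value is new. A false positive makes this undercount.
        if results[0]:
            self.approx_distinct += 1

    def finish(self):
        needed = self.approx_distinct * self.bits_per_value
        fitting = [c for c in self.candidates if c.num_bits >= needed]
        # Smallest candidate that is still large enough, else the largest.
        return fitting[-1] if fitting else self.candidates[0]
```

For example, feeding 1000 values of which only 100 are distinct into `AdaptiveBloomWriter(max_bits=1 << 16)` and calling `finish()` should select a candidate far smaller than the 65536-bit maximum, while every inserted value still passes `might_contain`. The cost is the extra memory and hashing for all candidates during the write.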



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
