[
https://issues.apache.org/jira/browse/PARQUET-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17697301#comment-17697301
]
Gabor Szadovszky commented on PARQUET-2254:
-------------------------------------------
I think this is a good idea. At the same time, it would increase the memory footprint
of the writer. However, if you plan to keep the current logic where the user
decides which columns bloom filters are generated for, it should be acceptable.
However, I think we need to take a step back and investigate/synchronize the
efforts around row group filtering. Or maybe it is only me for whom the
following questions are not obvious? :)
* Is it always true that reading the dictionary for filtering is cheaper than
reading the bloom filter? Bloom filters are usually smaller than dictionaries
and faster to scan for a value.
* Based on the previous point, if we decide that it might be worth reading the
bloom filter before the dictionary, it also calls into question the logic of not
writing bloom filters when the whole column chunk is dictionary encoded. (A
rough sketch of such an ordering follows the list.)
* Meanwhile, if the whole column chunk is dictionary encoded but the dictionary
is still small (the redundancy is high), then it might not be worth writing a
bloom filter, since checking the dictionary might be cheaper.
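To make the cost comparison concrete, here is a minimal, purely illustrative
sketch of how a reader could pick which structure to consult first based on
their serialized sizes. The class names and the bytes-of-I/O heuristic are
assumptions for illustration only, not actual parquet-mr API:
{code:java}
// Hypothetical cost comparison for an equality (point-lookup) predicate.
// Sizes come from column chunk metadata; a negative size means the structure is absent.
final class RowGroupFilterChoice {

  enum Structure { BLOOM_FILTER, DICTIONARY, NONE }

  static Structure chooseFirst(long bloomFilterSizeBytes,
                               long dictionarySizeBytes,
                               boolean chunkFullyDictionaryEncoded) {
    boolean hasBloom = bloomFilterSizeBytes >= 0;
    boolean hasDict = dictionarySizeBytes >= 0;
    if (!hasBloom && !hasDict) {
      return Structure.NONE;
    }
    // A fully dictionary-encoded chunk with a small dictionary answers the
    // predicate exactly and cheaply, so prefer it over the probabilistic filter.
    if (chunkFullyDictionaryEncoded && hasDict
        && (!hasBloom || dictionarySizeBytes <= bloomFilterSizeBytes)) {
      return Structure.DICTIONARY;
    }
    // Otherwise consult whichever structure costs fewer bytes of I/O first.
    if (hasBloom && (!hasDict || bloomFilterSizeBytes <= dictionarySizeBytes)) {
      return Structure.BLOOM_FILTER;
    }
    return Structure.DICTIONARY;
  }
}
{code}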
What do you think?
> Build a BloomFilter with a more precise size
> --------------------------------------------
>
> Key: PARQUET-2254
> URL: https://issues.apache.org/jira/browse/PARQUET-2254
> Project: Parquet
> Issue Type: Improvement
> Reporter: Mars
> Assignee: Mars
> Priority: Major
>
> Currently the usage is to specify the size up front and then build the BloomFilter.
> In general scenarios, the number of distinct values is not known in advance.
> If the BloomFilter size can be chosen automatically according to the data, the file
> size can be reduced and reading efficiency can also be improved.
> My idea is that the user specifies a maximum BloomFilter size, and we build
> multiple BloomFilters of different sizes at the same time. The largest
> BloomFilter can be used as a counting tool: if a value is not already present
> when it is inserted, a counter is incremented (this may be imprecise, but it is
> good enough).
> Then, at the end of the write, a BloomFilter of a more appropriate size is chosen
> when the file is finally written.
> I would like to implement this feature and hope to get your opinions. Thank you.
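> Below is a rough, purely illustrative sketch of this idea as a toy model. The
> class names, the single-hash filter, and the 10-bits-per-distinct-value ratio
> are assumptions for illustration only, not the actual parquet-mr BloomFilter
> implementation:
> {code:java}
> import java.util.BitSet;
>
> // Maintain several candidate bloom filters of increasing size, use the largest
> // one to roughly count distinct values, and keep only the smallest candidate
> // that is big enough when the file is written.
> final class AdaptiveBloomFilterSketch {
>
>   /** Minimal bloom filter: a single hash function for brevity. */
>   static final class ToyFilter {
>     final BitSet bits;
>     final int numBits;
>
>     ToyFilter(int numBits) {
>       this.numBits = numBits;
>       this.bits = new BitSet(numBits);
>     }
>
>     /** Inserts the hash; returns true if the bit was not set before (a "miss"). */
>     boolean insert(long hash) {
>       int idx = (int) Long.remainderUnsigned(hash, numBits);
>       boolean wasAbsent = !bits.get(idx);
>       bits.set(idx);
>       return wasAbsent;
>     }
>   }
>
>   private final ToyFilter[] candidates;   // ordered smallest ... largest
>   private long estimatedDistinct = 0;     // counted via misses on the largest filter
>
>   AdaptiveBloomFilterSketch(int... candidateBitSizes) {
>     candidates = new ToyFilter[candidateBitSizes.length];
>     for (int i = 0; i < candidateBitSizes.length; i++) {
>       candidates[i] = new ToyFilter(candidateBitSizes[i]);
>     }
>   }
>
>   void insertHash(long hash) {
>     for (ToyFilter f : candidates) {
>       boolean miss = f.insert(hash);
>       // Only the largest candidate is precise enough to drive the counter.
>       if (f == candidates[candidates.length - 1] && miss) {
>         estimatedDistinct++;
>       }
>     }
>   }
>
>   /**
>    * At file-write time: pick the smallest candidate whose bit count covers the
>    * estimated distinct-value count at ~10 bits per value (arbitrary ratio here).
>    */
>   ToyFilter chooseFinalFilter() {
>     for (ToyFilter f : candidates) {
>       if (f.numBits >= estimatedDistinct * 10) {
>         return f;
>       }
>     }
>     return candidates[candidates.length - 1];
>   }
> }
> {code}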
--
This message was sent by Atlassian Jira
(v8.20.10#820010)