[
https://issues.apache.org/jira/browse/PARQUET-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17697301#comment-17697301
]
Gabor Szadovszky commented on PARQUET-2254:
-------------------------------------------
I think this is a good idea. At the same time, it would increase the memory footprint
of the writer. However, if you plan to keep the current logic where the user
decides which columns bloom filters are generated for, it should be acceptable.
However, I think we need to take a step back and investigate/synchronize the
efforts around row group filtering. Or maybe it is only me for whom the
following questions are not obvious? :)
* Is it always true that reading the dictionary for filtering is cheaper than
reading the bloom filter? Bloom filters are usually smaller than dictionaries
and faster to scan for a value.
* Based on the previous point, if we decide that it might be worth reading the
bloom filter before the dictionary, it also calls into question the logic of not
writing bloom filters when the whole column chunk is dictionary encoded. (A
rough sketch of such an ordering follows the list.)
* Meanwhile, if the whole column chunk is dictionary encoded but the dictionary
is still small (the redundancy is high), then it might not be worth writing a
bloom filter, since checking the dictionary might be cheaper.
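To make the cost comparison concrete, here is a minimal, purely illustrative
sketch of how a reader could pick which structure to consult first based on
their serialized sizes. The class names and the bytes-of-I/O heuristic are
assumptions for illustration only, not actual parquet-mr API:
{code:java}
// Hypothetical cost comparison for an equality (point-lookup) predicate.
// Sizes come from column chunk metadata; a negative size means the structure is absent.
final class RowGroupFilterChoice {

  enum Structure { BLOOM_FILTER, DICTIONARY, NONE }

  static Structure chooseFirst(long bloomFilterSizeBytes,
                               long dictionarySizeBytes,
                               boolean chunkFullyDictionaryEncoded) {
    boolean hasBloom = bloomFilterSizeBytes >= 0;
    boolean hasDict = dictionarySizeBytes >= 0;
    if (!hasBloom && !hasDict) {
      return Structure.NONE;
    }
    // A fully dictionary-encoded chunk with a small dictionary answers the
    // predicate exactly and cheaply, so prefer it over the probabilistic filter.
    if (chunkFullyDictionaryEncoded && hasDict
        && (!hasBloom || dictionarySizeBytes <= bloomFilterSizeBytes)) {
      return Structure.DICTIONARY;
    }
    // Otherwise consult whichever structure costs fewer bytes of I/O first.
    if (hasBloom && (!hasDict || bloomFilterSizeBytes <= dictionarySizeBytes)) {
      return Structure.BLOOM_FILTER;
    }
    return Structure.DICTIONARY;
  }
}
{code}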
What do you think?
> Build a BloomFilter with a more precise size
> --------------------------------------------
>
> Key: PARQUET-2254
> URL: https://issues.apache.org/jira/browse/PARQUET-2254
> Project: Parquet
> Issue Type: Improvement
> Reporter: Mars
> Assignee: Mars
> Priority: Major
>
> Currently the usage is to specify the size up front and then build the BloomFilter.
> In general scenarios, the number of distinct values is not known in advance.
> If the BloomFilter size can be chosen automatically according to the data, the file
> size can be reduced and reading efficiency can also be improved.
> My idea is that the user specifies a maximum BloomFilter size, and we build
> multiple BloomFilters of different sizes at the same time. The largest
> BloomFilter can be used as a counting tool: if a value is not already present
> when it is inserted, a counter is incremented (this may be imprecise, but it is
> good enough).
> Then, at the end of the write, a BloomFilter of a more appropriate size is chosen
> when the file is finally written.
> I would like to implement this feature and hope to get your opinions. Thank you.
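> Below is a rough, purely illustrative sketch of this idea as a toy model. The
> class names, the single-hash filter, and the 10-bits-per-distinct-value ratio
> are assumptions for illustration only, not the actual parquet-mr BloomFilter
> implementation:
> {code:java}
> import java.util.BitSet;
>
> // Maintain several candidate bloom filters of increasing size, use the largest
> // one to roughly count distinct values, and keep only the smallest candidate
> // that is big enough when the file is written.
> final class AdaptiveBloomFilterSketch {
>
>   /** Minimal bloom filter: a single hash function for brevity. */
>   static final class ToyFilter {
>     final BitSet bits;
>     final int numBits;
>
>     ToyFilter(int numBits) {
>       this.numBits = numBits;
>       this.bits = new BitSet(numBits);
>     }
>
>     /** Inserts the hash; returns true if the bit was not set before (a "miss"). */
>     boolean insert(long hash) {
>       int idx = (int) Long.remainderUnsigned(hash, numBits);
>       boolean wasAbsent = !bits.get(idx);
>       bits.set(idx);
>       return wasAbsent;
>     }
>   }
>
>   private final ToyFilter[] candidates;   // ordered smallest ... largest
>   private long estimatedDistinct = 0;     // counted via misses on the largest filter
>
>   AdaptiveBloomFilterSketch(int... candidateBitSizes) {
>     candidates = new ToyFilter[candidateBitSizes.length];
>     for (int i = 0; i < candidateBitSizes.length; i++) {
>       candidates[i] = new ToyFilter(candidateBitSizes[i]);
>     }
>   }
>
>   void insertHash(long hash) {
>     for (ToyFilter f : candidates) {
>       boolean miss = f.insert(hash);
>       // Only the largest candidate is precise enough to drive the counter.
>       if (f == candidates[candidates.length - 1] && miss) {
>         estimatedDistinct++;
>       }
>     }
>   }
>
>   /**
>    * At file-write time: pick the smallest candidate whose bit count covers the
>    * estimated distinct-value count at ~10 bits per value (arbitrary ratio here).
>    */
>   ToyFilter chooseFinalFilter() {
>     for (ToyFilter f : candidates) {
>       if (f.numBits >= estimatedDistinct * 10) {
>         return f;
>       }
>     }
>     return candidates[candidates.length - 1];
>   }
> }
> {code}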
--
This message was sent by Atlassian Jira
(v8.20.10#820010)