[ https://issues.apache.org/jira/browse/PARQUET-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17699223#comment-17699223 ]
Gang Wu commented on PARQUET-2254:
----------------------------------
The optimization in the filter makes sense to me.
As for the bloom filter writing logic, I don't fully follow the idea yet, but
I would be glad to discuss it in the PR once you are ready.
> Build a BloomFilter with a more precise size
> --------------------------------------------
>
> Key: PARQUET-2254
> URL: https://issues.apache.org/jira/browse/PARQUET-2254
> Project: Parquet
> Issue Type: Improvement
> Reporter: Mars
> Assignee: Mars
> Priority: Major
>
> Currently the user has to specify the BloomFilter size up front, before the
> filter is built. In general scenarios, the number of distinct values is not
> known in advance.
> If the BloomFilter could be sized automatically according to the data, the
> file size could be reduced and read efficiency could also be improved.
> My idea is to let the user specify a maximum BloomFilter size and then build
> multiple BloomFilters of different sizes at the same time. The largest
> BloomFilter doubles as a counting tool: if a value does not hit when it is
> inserted, a distinct-value counter is incremented by 1 (this may be
> imprecise, but it is good enough).
> Then, at the end of the write, we choose the BloomFilter with the most
> appropriate size when the file is finally written.
> I would like to implement this feature and hope to get your opinions. Thank
> you.
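
For discussion, the proposal quoted above can be sketched roughly as follows.
This is only an illustration under several assumptions, not the parquet-mr
implementation: SimpleBloomFilter is a plain k-hash filter rather than
Parquet's split-block format, and the names AdaptiveBloomFilterWriter,
capacity, max_bytes, fpp and num_candidates are hypothetical. Candidate sizes
are assumed to halve from the user-supplied maximum, and the largest candidate
serves as the approximate distinct-value counter.

```python
import hashlib
import math


class SimpleBloomFilter:
    """Plain k-hash Bloom filter (an assumption; NOT Parquet's split-block format)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, value):
        # Double hashing: derive all bit positions from two 64-bit hashes.
        digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:16], "little") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def insert(self, value):
        """Set the value's bits; return True if at least one bit was newly set."""
        was_new = False
        for pos in self._positions(value):
            byte, mask = pos >> 3, 1 << (pos & 7)
            if not self.bits[byte] & mask:
                self.bits[byte] |= mask
                was_new = True
        return was_new

    def might_contain(self, value):
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(value))


def capacity(num_bits, fpp):
    """Approximate number of distinct values num_bits can hold at the target fpp."""
    return int(num_bits * math.log(2) ** 2 / -math.log(fpp))


class AdaptiveBloomFilterWriter:
    """Hypothetical writer that fills several candidate filters at once."""

    def __init__(self, max_bytes, fpp=0.01, num_candidates=4):
        self.fpp = fpp
        num_hashes = max(1, round(-math.log2(fpp)))
        # Candidate sizes in bits, largest first: max, max/2, max/4, ...
        sizes = [(max_bytes * 8) >> i for i in range(num_candidates)]
        self.candidates = [SimpleBloomFilter(bits, num_hashes) for bits in sizes]
        self.distinct_estimate = 0

    def insert(self, value):
        results = [c.insert(value) for c in self.candidates]
        # No hit in the largest filter means the value is treated as new.
        if results[0]:
            self.distinct_estimate += 1

    def finish(self):
        # Pick the smallest candidate whose capacity covers the estimate;
        # fall back to the largest one if none does.
        for candidate in reversed(self.candidates):
            if capacity(candidate.num_bits, self.fpp) >= self.distinct_estimate:
                return candidate
        return self.candidates[0]


if __name__ == "__main__":
    writer = AdaptiveBloomFilterWriter(max_bytes=1024, fpp=0.01)
    for v in range(100):
        writer.insert(v)
    chosen = writer.finish()
    print(writer.distinct_estimate, chosen.num_bits // 8)
```

In this sketch the counter can only undercount: a repeated value never sets a
new bit, while a new value is missed only when all of its bits are already
set in the largest filter. The cost is that every insert touches every
candidate, so write-time work grows with the number of candidates.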
--
This message was sent by Atlassian Jira
(v8.20.10#820010)