[
https://issues.apache.org/jira/browse/PARQUET-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Gang Wu resolved PARQUET-2254.
------------------------------
Fix Version/s: 1.14.0
Resolution: Fixed
> Build a BloomFilter with a more precise size
> --------------------------------------------
>
> Key: PARQUET-2254
> URL: https://issues.apache.org/jira/browse/PARQUET-2254
> Project: Parquet
> Issue Type: Improvement
> Reporter: Mars
> Assignee: Mars
> Priority: Major
> Fix For: 1.14.0
>
>
> *Why are the changes needed?*
> Currently, a bloom filter is built by specifying either the NDV (number of
> distinct values) or the maximum number of bytes up front. In practice, the
> number of distinct values is usually not known in advance.
> If the BloomFilter can be sized automatically according to the data, the file
> size can be reduced and read efficiency can also be improved.
> *What changes were proposed in this pull request?*
> `AdaptiveBlockSplitBloomFilter` holds multiple `BlockSplitBloomFilter`
> instances as candidates and inserts each value into all of the candidates
> at the same time. At the end, the smallest candidate is chosen and
> written out.
> *Does this PR introduce any user-facing change?*
> Yes, it adds new configuration options:
> `parquet.bloom.filter.adaptive.enabled`: default false. Whether to enable
> writing an adaptive bloom filter.
> If true, the bloom filter is generated with the optimal bit size for the
> actual number of distinct values in the data. If false, the adaptive
> behavior does not take effect.
> Note that the size of the bloom filter will not exceed the
> `parquet.bloom.filter.max.bytes` configuration (if that value is
> set too small, the generated bloom filter will not be efficient).
> `parquet.bloom.filter.candidates.number`: default 5. The number of candidate
> bloom filters written at the same time.
> When `parquet.bloom.filter.adaptive.enabled` is true, values are inserted
> into multiple candidate bloom filters at the same time, and at the end the
> bloom filter with the optimal bit size is selected and written to the file.
>
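The candidate-selection idea in the description above can be illustrated with a small sketch. This is not the parquet-mr implementation: it assumes a plain (non-split-block) Bloom filter, halves the candidate size at each step, and uses the textbook sizing formula n = m (ln 2)^2 / -ln(p) to estimate how many distinct values each candidate can hold. The names `SimpleBloomFilter`, `capacity`, and `AdaptiveBloomFilterSketch` are hypothetical; `max_bytes` and `num_candidates` only mirror the roles of `parquet.bloom.filter.max.bytes` and `parquet.bloom.filter.candidates.number`.

```python
import hashlib
import math


class SimpleBloomFilter:
    """Plain Bloom filter used only for illustration (not split-block)."""

    def __init__(self, num_bytes, num_hashes=7):
        self.num_bits = num_bytes * 8
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bytes)

    def _positions(self, value):
        # Double hashing: derive k bit positions from two 64-bit hash halves.
        digest = hashlib.sha256(value).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:16], "little") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def insert(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(value))


def capacity(num_bytes, fpp):
    """Distinct values a standard Bloom filter of this size holds at target fpp."""
    return int(num_bytes * 8 * math.log(2) ** 2 / -math.log(fpp))


class AdaptiveBloomFilterSketch:
    def __init__(self, max_bytes, num_candidates=5, fpp=0.01):
        # Assumption: candidate sizes halve from max_bytes downward, largest first.
        sizes = [max_bytes >> i for i in range(num_candidates)]
        self.candidates = [(capacity(s, fpp), SimpleBloomFilter(s))
                           for s in sizes if s > 0]
        self.distinct = 0

    def insert(self, value):
        # Estimate distinctness with the largest (most accurate) candidate.
        if not self.candidates[0][1].might_contain(value):
            self.distinct += 1
        for _, bloom in self.candidates:
            bloom.insert(value)
        # Drop candidates that are now too small; always keep the largest.
        self.candidates = [self.candidates[0]] + [
            c for c in self.candidates[1:] if c[0] >= self.distinct
        ]

    def best(self):
        # Candidates stay ordered largest first, so the last one is the smallest.
        return self.candidates[-1][1]


adaptive = AdaptiveBloomFilterSketch(max_bytes=1024, num_candidates=5)
for i in range(100):
    adaptive.insert(f"value-{i}".encode())
chosen = adaptive.best()
print(len(chosen.bits), "bytes chosen for", adaptive.distinct, "distinct values")
```

Because every value goes into every surviving candidate, whichever candidate is finally chosen has seen all of the data, at the cost of extra memory and hashing work while writing.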