[ 
https://issues.apache.org/jira/browse/PARQUET-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gang Wu resolved PARQUET-2254.
------------------------------
    Fix Version/s: 1.14.0
       Resolution: Fixed

> Build a BloomFilter with a more precise size
> --------------------------------------------
>
>                 Key: PARQUET-2254
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2254
>             Project: Parquet
>          Issue Type: Improvement
>            Reporter: Mars
>            Assignee: Mars
>            Priority: Major
>             Fix For: 1.14.0
>
>
> *Why are the changes needed?*
> Currently, a bloom filter is built by specifying either the NDV (number of 
> distinct values) or the maximum number of bytes up front. In most scenarios, 
> the number of distinct values is not known in advance.
> If the BloomFilter can be sized automatically from the data, the file 
> size can be reduced and read efficiency can be improved.
> *What changes were proposed in this pull request?*
> `AdaptiveBlockSplitBloomFilter` holds multiple `BlockSplitBloomFilter` 
> instances as candidates and inserts each value into all of the candidates 
> at the same time. At the end, the smallest candidate is chosen and 
> written out.
> *Does this PR introduce any user-facing change?*
> Adds new configuration options:
> `parquet.bloom.filter.adaptive.enabled`: default false. Whether to enable 
> writing adaptive bloom filters.  
> If true, the bloom filter is generated with the optimal bit size for the 
> actual number of distinct values in the data. If false, the adaptive 
> behavior is disabled.
> Note that the bloom filter never exceeds the size set by 
> `parquet.bloom.filter.max.bytes` (if that value is 
> set too small, the generated bloom filter will not be efficient).
> `parquet.bloom.filter.candidates.number`: default 5. The number of candidate 
> bloom filters populated at the same time.  
> When `parquet.bloom.filter.adaptive.enabled` is true, values are inserted 
> into multiple candidate bloom filters at the same time, and the one with 
> the optimal bit size is selected and written to the file.
>  
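
The candidate scheme in the quoted description can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the parquet-java implementation: the class `AdaptiveSizingSketch`, the halving of candidate sizes, the `bytesPerDistinct` capacity rule, and the `HashSet` used to count distinct values are all hypothetical stand-ins chosen to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of adaptive bloom filter sizing (hypothetical, not parquet-java code). */
class AdaptiveSizingSketch {

    /** A candidate size and the distinct-value count it is assumed to support. */
    static final class Candidate {
        final int numBytes;
        final int maxDistinct;

        Candidate(int numBytes, int maxDistinct) {
            this.numBytes = numBytes;
            this.maxDistinct = maxDistinct;
        }
    }

    // Ordered from largest to smallest.
    private final List<Candidate> candidates = new ArrayList<>();
    // Assumption: exact distinct tracking stands in for real bloom filter bits.
    private final Set<Long> seenHashes = new HashSet<>();

    AdaptiveSizingSketch(int maxBytes, int candidateCount, int bytesPerDistinct) {
        // Assumption: the largest candidate uses maxBytes and each next one is half the size.
        int size = maxBytes;
        for (int i = 0; i < candidateCount && size >= bytesPerDistinct; i++) {
            candidates.add(new Candidate(size, size / bytesPerDistinct));
            size /= 2;
        }
    }

    /** Records a value and drops candidates that have become too small. */
    void insert(long hash) {
        seenHashes.add(hash);
        int distinct = seenHashes.size();
        Candidate largest = candidates.get(0);
        // The largest candidate is always kept so that something can be written.
        candidates.removeIf(c -> c != largest && c.maxDistinct < distinct);
    }

    /** Returns the smallest candidate that still fits the data seen so far. */
    Candidate chosen() {
        return candidates.get(candidates.size() - 1);
    }
}
```

With a 1024-byte cap, 5 candidates, and 1 byte per distinct value, the candidates in this model are 1024, 512, 256, 128, and 64 bytes. Inserting 100 distinct values eliminates only the 64-byte candidate, so `chosen().numBytes` is 128. Inserting more distinct values than the cap supports leaves just the 1024-byte candidate, which mirrors the note above that the filter never exceeds `parquet.bloom.filter.max.bytes`.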



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
