[ https://issues.apache.org/jira/browse/PARQUET-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17717527#comment-17717527 ]
ASF GitHub Bot commented on PARQUET-2254:
-----------------------------------------
yabola commented on code in PR #1042:
URL: https://github.com/apache/parquet-mr/pull/1042#discussion_r1180099251
##########
parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java:
##########
@@ -503,6 +523,30 @@ public Builder withBloomFilterEnabled(boolean enabled) {
return this;
}
+ /**
+ * Whether to use dynamic bloom filter to automatically adjust the bloom filter size according to
+ * `parquet.bloom.filter.max.bytes`.
+ * If NDV (number of distinct values) for a specified column is set, it will be ignored
+ *
+ * @param enabled whether to use dynamic bloom filter
+ */
+ public Builder withDynamicBloomFilterEnabled(boolean enabled) {
+ this.dynamicBloomFilterEnabled.withDefaultValue(enabled);
+ return this;
+ }
+
+ /**
+ * When `DynamicBloomFilter` is enabled, set how many bloomFilters to split as candidates.
Review Comment:
done
> Build a BloomFilter with a more precise size
> --------------------------------------------
>
> Key: PARQUET-2254
> URL: https://issues.apache.org/jira/browse/PARQUET-2254
> Project: Parquet
> Issue Type: Improvement
> Reporter: Mars
> Assignee: Mars
> Priority: Major
>
> h3. Why are the changes needed?
> Currently, using a bloom filter requires specifying the NDV (number of distinct
> values) up front, and the BloomFilter is then built from that value. In general,
> the number of distinct values is not actually known in advance.
> If the BloomFilter could be sized automatically according to the data, the file
> size could be reduced and read efficiency could also be improved.
> h3. What changes were proposed in this pull request?
> {{DynamicBlockBloomFilter}} holds multiple {{BlockSplitBloomFilter}} instances as
> candidates and inserts each value into all candidates at the same time. The
> largest bloom filter serves as an approximate deduplication counter, and
> candidates that are too small for the number of distinct values seen so far are
> removed during data insertion.
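
The candidate scheme described in the issue can be illustrated with a minimal, self-contained sketch. It is not the parquet-mr implementation: the class `DynamicBloomSketch`, the nested `Candidate` type, the halving of candidate sizes, the 10-bits-per-value sizing, and the four probe positions are all assumptions chosen for brevity, and a plain `java.util.BitSet` stands in for `BlockSplitBloomFilter`.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Illustrative sketch only; names and sizing rules are hypothetical. */
public class DynamicBloomSketch {

  /** A toy bloom filter sized for an expected number of distinct values. */
  static final class Candidate {
    final int expectedNdv; // NDV this candidate was sized for
    final int numBits;
    final BitSet bits;

    Candidate(int expectedNdv) {
      this.expectedNdv = expectedNdv;
      // Assumed sizing: roughly 10 bits per expected distinct value.
      this.numBits = Math.max(64, expectedNdv * 10);
      this.bits = new BitSet(numBits);
    }

    /** Sets the probe bits for a hash; returns true if any bit was newly set. */
    boolean insert(long hash) {
      long h1 = hash & 0xffffffffL;
      long h2 = (hash >>> 32) | 1L;
      boolean changed = false;
      for (int i = 0; i < 4; i++) {
        int idx = (int) ((h1 + i * h2) % numBits); // double hashing, non-negative
        if (!bits.get(idx)) {
          bits.set(idx);
          changed = true;
        }
      }
      return changed;
    }
  }

  private final List<Candidate> candidates = new ArrayList<>();
  private final Candidate largest;
  private long distinctEstimate = 0;

  /** Creates candidates sized for maxNdv, maxNdv/2, maxNdv/4, ... */
  public DynamicBloomSketch(int maxNdv, int numCandidates) {
    if (maxNdv < 1 || numCandidates < 1) {
      throw new IllegalArgumentException("maxNdv and numCandidates must be >= 1");
    }
    int ndv = maxNdv;
    for (int i = 0; i < numCandidates && ndv > 0; i++, ndv /= 2) {
      candidates.add(new Candidate(ndv));
    }
    largest = candidates.get(0);
  }

  /** Inserts a value into every remaining candidate, then prunes small ones. */
  public void insert(long value) {
    long hash = mix(value);
    boolean isNew = false;
    for (Candidate c : candidates) {
      boolean changed = c.insert(hash);
      if (c == largest) {
        isNew = changed; // the largest filter acts as the dedup counter
      }
    }
    if (isNew) {
      distinctEstimate++;
      // Drop candidates sized for fewer values than seen; always keep the largest.
      candidates.removeIf(c -> c != largest && c.expectedNdv < distinctEstimate);
    }
  }

  public int candidateCount() {
    return candidates.size();
  }

  public long distinctEstimate() {
    return distinctEstimate;
  }

  /** 64-bit finalizer from MurmurHash3, used here as a simple stand-in hash. */
  static long mix(long x) {
    x ^= x >>> 33;
    x *= 0xff51afd7ed558ccdL;
    x ^= x >>> 33;
    x *= 0xc4ceb9fe1a85ec53L;
    x ^= x >>> 33;
    return x;
  }
}
```

Because a value only counts as new when it flips at least one bit in the largest filter, the estimate can undercount slightly (on false positives) but never overcount. When writing finishes, a real implementation would presumably keep the smallest surviving candidate; that selection step is omitted here.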
--
This message was sent by Atlassian Jira
(v8.20.10#820010)