[ https://issues.apache.org/jira/browse/PARQUET-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17720917#comment-17720917 ]
ASF GitHub Bot commented on PARQUET-2254:
-----------------------------------------
yabola commented on code in PR #1042:
URL: https://github.com/apache/parquet-mr/pull/1042#discussion_r1188555345
##########
parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/AdaptiveBlockSplitBloomFilter.java:
##########
@@ -38,57 +38,51 @@
 import org.apache.parquet.io.api.Binary;
 /**
- * `DynamicBlockBloomFilter` contains multiple `BlockSplitBloomFilter` as candidates and inserts values in
+ * `AdaptiveBlockSplitBloomFilter` contains multiple `BlockSplitBloomFilter` as candidates and inserts values in
  * the candidates at the same time.
  * The purpose of this is to finally generate a bloom filter with the optimal bit size according to the number
  * of real data distinct values. Use the largest bloom filter as an approximate deduplication counter, and then
  * remove incapable bloom filter candidate during data insertion.
  */
-public class DynamicBlockBloomFilter implements BloomFilter {
+public class AdaptiveBlockSplitBloomFilter implements BloomFilter {
-  private static final Logger LOG = LoggerFactory.getLogger(DynamicBlockBloomFilter.class);
+  private static final Logger LOG = LoggerFactory.getLogger(AdaptiveBlockSplitBloomFilter.class);
   // multiple candidates, inserting data at the same time. If the distinct values are greater than the
   // expected NDV of candidates, it will be removed. Finally we will choose the smallest candidate to write out.
   private final List<BloomFilterCandidate> candidates = new ArrayList<>();
   // the largest among candidates and used as an approximate deduplication counter
-  private BloomFilterCandidate maxCandidate;
+  private BloomFilterCandidate largestCandidate;
   // the accumulator of the number of distinct values that have been inserted so far
-  private int distinctValueCounter = 0;
+  private long distinctValueCounter = 0;
   // indicates that the bloom filter candidate has been written out and new data should be no longer allowed to be inserted
   private boolean finalized = false;
+  // indicates the step size to find the NDV value corresponding to numBytes
+  private static final int NDV_STEP = 500;
   private int maximumBytes = UPPER_BOUND_BYTES;
   private int minimumBytes = LOWER_BOUND_BYTES;
   // the hash strategy used in this bloom filter.
   private final HashStrategy hashStrategy;
   // the column to build bloom filter
   private ColumnDescriptor column;
-  public DynamicBlockBloomFilter(int numBytes, int candidatesNum, double fpp, ColumnDescriptor column) {
-    this(numBytes, LOWER_BOUND_BYTES, UPPER_BOUND_BYTES, HashStrategy.XXH64, fpp, candidatesNum, column);
-  }
-
-  public DynamicBlockBloomFilter(int numBytes, int maximumBytes, int candidatesNum, double fpp, ColumnDescriptor column) {
-    this(numBytes, LOWER_BOUND_BYTES, maximumBytes, HashStrategy.XXH64, fpp, candidatesNum, column);
+  /**
+   * Given the maximum acceptable bytes size of bloom filter, generate candidates according it.
+   *
+   * @param maximumBytes the maximum bit size of candidate
Review Comment:
It means byte size, and I have changed the variable name. Thank you.
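For context on the NDV_STEP constant introduced in the diff: candidate sizes can be derived by stepping through NDV values and computing the bloom filter size each NDV would require, using the standard sizing formula m = -n * ln(p) / (ln 2)^2. The sketch below only illustrates that idea; the names NdvSizingSketch, optimalNumOfBits and findMaxNdvForBytes are assumptions for illustration, not the PR's exact code.

{code:java}
// Illustrative sketch only (not the PR's implementation): step through NDV values
// in increments of NDV_STEP to find the largest NDV whose optimal bloom filter
// still fits within a given byte budget.
public final class NdvSizingSketch {
  private static final int NDV_STEP = 500;

  // Standard optimal-bits estimate: m = -n * ln(p) / (ln 2)^2
  static long optimalNumOfBits(long n, double p) {
    return (long) (-n * Math.log(p) / (Math.log(2) * Math.log(2)));
  }

  // Largest NDV (in steps of NDV_STEP) whose optimal filter still fits in maxBytes.
  static long findMaxNdvForBytes(int maxBytes, double fpp) {
    long ndv = NDV_STEP;
    while (optimalNumOfBits(ndv + NDV_STEP, fpp) / 8 <= maxBytes) {
      ndv += NDV_STEP;
    }
    return ndv;
  }

  public static void main(String[] args) {
    // For a 1 MiB budget at 1% fpp this prints roughly 875000.
    System.out.println(findMaxNdvForBytes(1024 * 1024, 0.01));
  }
}
{code}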
> Build a BloomFilter with a more precise size
> --------------------------------------------
>
> Key: PARQUET-2254
> URL: https://issues.apache.org/jira/browse/PARQUET-2254
> Project: Parquet
> Issue Type: Improvement
> Reporter: Mars
> Assignee: Mars
> Priority: Major
>
> h3. Why are the changes needed?
> Currently, using a bloom filter requires specifying the NDV (number of
> distinct values) up front and then building the BloomFilter from it. In
> general scenarios, the number of distinct values is not known in advance.
> If the BloomFilter can be sized automatically according to the data, the file
> size can be reduced and reading efficiency can also be improved.
> h3. What changes were proposed in this pull request?
> {{DynamicBlockBloomFilter}} holds multiple {{BlockSplitBloomFilter}} instances
> as candidates and inserts values into all of them at the same time. The
> largest bloom filter is used as an approximate deduplication counter, and
> candidates that can no longer hold the observed number of distinct values are
> removed during data insertion.
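To make the candidate mechanism described above concrete, here is a minimal, self-contained sketch. It is not the PR's actual code: ToyFilter is a simplified stand-in for BlockSplitBloomFilter, and the expected-NDV bookkeeping, pruning rule and class names are illustrative assumptions. All candidates receive every hash, the largest candidate doubles as an approximate distinct counter, candidates whose expected NDV is exceeded are dropped, and the smallest survivor is the one that would be written out.

{code:java}
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: ToyFilter is a simplified stand-in for
// BlockSplitBloomFilter, and the pruning rule mirrors the description above
// rather than the PR's exact implementation.
public final class AdaptiveCandidateSketch {

  static final class ToyFilter {
    final long[] bits;        // single-probe bit set, enough to show the flow
    final long expectedNdv;   // how many distinct values this candidate is sized for

    ToyFilter(int numBytes, long expectedNdv) {
      this.bits = new long[Math.max(1, numBytes / 8)];
      this.expectedNdv = expectedNdv;
    }

    boolean contains(long hash) {
      int i = (int) Long.remainderUnsigned(hash, bits.length * 64L);
      return (bits[i / 64] & (1L << (i % 64))) != 0;
    }

    void insert(long hash) {
      int i = (int) Long.remainderUnsigned(hash, bits.length * 64L);
      bits[i / 64] |= (1L << (i % 64));
    }
  }

  private final List<ToyFilter> candidates = new ArrayList<>();
  private final ToyFilter largestCandidate;
  private long distinctValueCounter = 0;

  // candidateBytes must be sorted ascending; the expected NDV is a toy estimate here.
  AdaptiveCandidateSketch(int... candidateBytes) {
    for (int numBytes : candidateBytes) {
      candidates.add(new ToyFilter(numBytes, numBytes * 4L));
    }
    largestCandidate = candidates.get(candidates.size() - 1);
  }

  void insertHash(long hash) {
    // the largest candidate acts as an approximate deduplication counter
    if (!largestCandidate.contains(hash)) {
      distinctValueCounter++;
      // drop candidates that can no longer hold the observed distinct count
      candidates.removeIf(c -> c != largestCandidate && c.expectedNdv < distinctValueCounter);
    }
    for (ToyFilter c : candidates) {
      c.insert(hash);
    }
  }

  // the smallest remaining candidate is the one that would be written out
  ToyFilter optimalCandidate() {
    return candidates.get(0);
  }
}
{code}

In this sketch a writer would call insertHash for every value's hash and, at flush time, serialize optimalCandidate(), which is the smallest filter still sized for the observed number of distinct values.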
--
This message was sent by Atlassian Jira
(v8.20.10#820010)