[ https://issues.apache.org/jira/browse/PARQUET-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17720917#comment-17720917 ]
ASF GitHub Bot commented on PARQUET-2254:
-----------------------------------------
yabola commented on code in PR #1042:
URL: https://github.com/apache/parquet-mr/pull/1042#discussion_r1188555345
##########
parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/AdaptiveBlockSplitBloomFilter.java:
##########
@@ -38,57 +38,51 @@
 import org.apache.parquet.io.api.Binary;
 /**
- * `DynamicBlockBloomFilter` contains multiple `BlockSplitBloomFilter` as candidates and inserts values in
+ * `AdaptiveBlockSplitBloomFilter` contains multiple `BlockSplitBloomFilter` as candidates and inserts values in
  * the candidates at the same time.
  * The purpose of this is to finally generate a bloom filter with the optimal bit size according to the number
  * of real data distinct values. Use the largest bloom filter as an approximate deduplication counter, and then
  * remove incapable bloom filter candidate during data insertion.
  */
-public class DynamicBlockBloomFilter implements BloomFilter {
+public class AdaptiveBlockSplitBloomFilter implements BloomFilter {
-  private static final Logger LOG = LoggerFactory.getLogger(DynamicBlockBloomFilter.class);
+  private static final Logger LOG = LoggerFactory.getLogger(AdaptiveBlockSplitBloomFilter.class);
   // multiple candidates, inserting data at the same time. If the distinct values are greater than the
   // expected NDV of candidates, it will be removed. Finally we will choose the smallest candidate to write out.
   private final List<BloomFilterCandidate> candidates = new ArrayList<>();
   // the largest among candidates and used as an approximate deduplication counter
-  private BloomFilterCandidate maxCandidate;
+  private BloomFilterCandidate largestCandidate;
   // the accumulator of the number of distinct values that have been inserted so far
-  private int distinctValueCounter = 0;
+  private long distinctValueCounter = 0;
   // indicates that the bloom filter candidate has been written out and new data should be no longer allowed to be inserted
   private boolean finalized = false;
+  // indicates the step size to find the NDV value corresponding to numBytes
+  private static final int NDV_STEP = 500;
   private int maximumBytes = UPPER_BOUND_BYTES;
   private int minimumBytes = LOWER_BOUND_BYTES;
   // the hash strategy used in this bloom filter.
   private final HashStrategy hashStrategy;
   // the column to build bloom filter
   private ColumnDescriptor column;
-  public DynamicBlockBloomFilter(int numBytes, int candidatesNum, double fpp, ColumnDescriptor column) {
-    this(numBytes, LOWER_BOUND_BYTES, UPPER_BOUND_BYTES, HashStrategy.XXH64, fpp, candidatesNum, column);
-  }
-
-  public DynamicBlockBloomFilter(int numBytes, int maximumBytes, int candidatesNum, double fpp, ColumnDescriptor column) {
-    this(numBytes, LOWER_BOUND_BYTES, maximumBytes, HashStrategy.XXH64, fpp, candidatesNum, column);
+  /**
+   * Given the maximum acceptable bytes size of bloom filter, generate candidates according it.
+   *
+   * @param maximumBytes the maximum bit size of candidate
Review Comment:
It means byte size, and I have changed the variable name. Thank you.
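For context on the NDV_STEP constant introduced in the diff: candidate sizes can be derived by stepping through NDV values and computing the bloom filter size each NDV would require, using the standard sizing formula m = -n * ln(p) / (ln 2)^2. The sketch below only illustrates that idea; the names NdvSizingSketch, optimalNumOfBits and findMaxNdvForBytes are assumptions for illustration, not the PR's exact code.

{code:java}
// Illustrative sketch only (not the PR's implementation): step through NDV values
// in increments of NDV_STEP to find the largest NDV whose optimal bloom filter
// still fits within a given byte budget.
public final class NdvSizingSketch {
  private static final int NDV_STEP = 500;

  // Standard optimal-bits estimate: m = -n * ln(p) / (ln 2)^2
  static long optimalNumOfBits(long n, double p) {
    return (long) (-n * Math.log(p) / (Math.log(2) * Math.log(2)));
  }

  // Largest NDV (in steps of NDV_STEP) whose optimal filter still fits in maxBytes.
  static long findMaxNdvForBytes(int maxBytes, double fpp) {
    long ndv = NDV_STEP;
    while (optimalNumOfBits(ndv + NDV_STEP, fpp) / 8 <= maxBytes) {
      ndv += NDV_STEP;
    }
    return ndv;
  }

  public static void main(String[] args) {
    // For a 1 MiB budget at 1% fpp this prints roughly 875000.
    System.out.println(findMaxNdvForBytes(1024 * 1024, 0.01));
  }
}
{code}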
> Build a BloomFilter with a more precise size
> --------------------------------------------
>
> Key: PARQUET-2254
> URL: https://issues.apache.org/jira/browse/PARQUET-2254
> Project: Parquet
> Issue Type: Improvement
> Reporter: Mars
> Assignee: Mars
> Priority: Major
>
> h3. Why are the changes needed?
> Currently, using a bloom filter requires specifying the NDV (number of
> distinct values) up front and then building the BloomFilter from it. In
> general scenarios, the number of distinct values is not known in advance.
> If the BloomFilter can be sized automatically according to the data, the file
> size can be reduced and reading efficiency can also be improved.
> h3. What changes were proposed in this pull request?
> {{DynamicBlockBloomFilter}} holds multiple {{BlockSplitBloomFilter}} instances
> as candidates and inserts values into all of them at the same time. The
> largest bloom filter is used as an approximate deduplication counter, and
> candidates that can no longer hold the observed number of distinct values are
> removed during data insertion.
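To make the candidate mechanism described above concrete, here is a minimal, self-contained sketch. It is not the PR's actual code: ToyFilter is a simplified stand-in for BlockSplitBloomFilter, and the expected-NDV bookkeeping, pruning rule and class names are illustrative assumptions. All candidates receive every hash, the largest candidate doubles as an approximate distinct counter, candidates whose expected NDV is exceeded are dropped, and the smallest survivor is the one that would be written out.

{code:java}
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: ToyFilter is a simplified stand-in for
// BlockSplitBloomFilter, and the pruning rule mirrors the description above
// rather than the PR's exact implementation.
public final class AdaptiveCandidateSketch {

  static final class ToyFilter {
    final long[] bits;        // single-probe bit set, enough to show the flow
    final long expectedNdv;   // how many distinct values this candidate is sized for

    ToyFilter(int numBytes, long expectedNdv) {
      this.bits = new long[Math.max(1, numBytes / 8)];
      this.expectedNdv = expectedNdv;
    }

    boolean contains(long hash) {
      int i = (int) Long.remainderUnsigned(hash, bits.length * 64L);
      return (bits[i / 64] & (1L << (i % 64))) != 0;
    }

    void insert(long hash) {
      int i = (int) Long.remainderUnsigned(hash, bits.length * 64L);
      bits[i / 64] |= (1L << (i % 64));
    }
  }

  private final List<ToyFilter> candidates = new ArrayList<>();
  private final ToyFilter largestCandidate;
  private long distinctValueCounter = 0;

  // candidateBytes must be sorted ascending; the expected NDV is a toy estimate here.
  AdaptiveCandidateSketch(int... candidateBytes) {
    for (int numBytes : candidateBytes) {
      candidates.add(new ToyFilter(numBytes, numBytes * 4L));
    }
    largestCandidate = candidates.get(candidates.size() - 1);
  }

  void insertHash(long hash) {
    // the largest candidate acts as an approximate deduplication counter
    if (!largestCandidate.contains(hash)) {
      distinctValueCounter++;
      // drop candidates that can no longer hold the observed distinct count
      candidates.removeIf(c -> c != largestCandidate && c.expectedNdv < distinctValueCounter);
    }
    for (ToyFilter c : candidates) {
      c.insert(hash);
    }
  }

  // the smallest remaining candidate is the one that would be written out
  ToyFilter optimalCandidate() {
    return candidates.get(0);
  }
}
{code}

In this sketch a writer would call insertHash for every value's hash and, at flush time, serialize optimalCandidate(), which is the smallest filter still sized for the observed number of distinct values.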
--
This message was sent by Atlassian Jira
(v8.20.10#820010)