pitrou commented on code in PR #37400:
URL: https://github.com/apache/arrow/pull/37400#discussion_r2690191875
##########
cpp/src/parquet/properties.h:
##########
@@ -169,6 +169,38 @@ static constexpr bool DEFAULT_IS_PAGE_INDEX_ENABLED = true;
static constexpr SizeStatisticsLevel DEFAULT_SIZE_STATISTICS_LEVEL =
SizeStatisticsLevel::PageAndColumnChunk;
+struct PARQUET_EXPORT BloomFilterOptions {
+ /// Expected number of distinct values (NDV) in the bloom filter.
+ ///
+ /// Bloom filters are most effective for high-cardinality columns. A good default
+ /// is to set ndv equal to the number of rows. Lower values reduce disk usage but
+ /// may not be worthwhile for very small NDVs.
+ ///
+ /// Increasing ndv (without increasing fpp) increases disk and memory usage.
+ int32_t ndv = 1 << 20;
+
+ /// False-positive probability (FPP) of the bloom filter.
+ ///
+ /// Lower FPP values require more disk and memory space. For a fixed ndv, the
+ /// space requirement grows roughly proportional to log(1/fpp). Recommended
+ /// values are 0.1, 0.05, or 0.01. Very small values are counterproductive as
+ /// the bitset may exceed the size of the actual data. Set ndv appropriately
+ /// to minimize space usage.
+ ///
+ /// Below is a table to demonstrate estimated size using common values.
+ ///
+ /// | ndv | fpp | bits/key | theoretical | actual |
+ /// |:-----------|:------|:---------|:------------|:-------|
+ /// | 100,000 | 0.10 | ~6.0 | 75 KB | 128 KB |
Review Comment:
Do you want to update the table numbers to match those given by our
`OptimalNumOfBits` method? It's what matters here after all :)
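
For reference, here is a rough standalone sketch of how the "bits/key" and "actual" columns in the quoted table could be recomputed. It is not the Arrow implementation: the `Sketch*` helpers are hypothetical names, and the code assumes the split-block approximation `bits = -8 * ndv / ln(1 - fpp^(1/8))` followed by rounding up to a power of two. The real `OptimalNumOfBits` may also clamp to minimum and maximum sizes, which is left out here.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical helper, not part of Arrow: bits per distinct value for a
// split-block bloom filter, ASSUMING the approximation
//   bits/key = -8 / ln(1 - fpp^(1/8)).
double SketchBitsPerKey(double fpp) {
  return -8.0 / std::log(1.0 - std::pow(fpp, 1.0 / 8.0));
}

// Hypothetical helper: smallest power of two that is >= n (returns 1 for n <= 1).
uint64_t SketchNextPowerOfTwo(uint64_t n) {
  uint64_t p = 1;
  while (p < n) p <<= 1;
  return p;
}

// Hypothetical helper: estimated bitset size in bytes once the bit count is
// rounded up to a power of two. No minimum/maximum clamping is applied, so
// tiny inputs can yield 0 bytes; a real implementation would guard that.
uint64_t SketchBloomFilterBytes(uint32_t ndv, double fpp) {
  const double bits = SketchBitsPerKey(fpp) * static_cast<double>(ndv);
  const uint64_t rounded =
      SketchNextPowerOfTwo(static_cast<uint64_t>(std::ceil(bits)));
  return rounded / 8;
}
```

Under these assumptions, `SketchBitsPerKey(0.10)` comes out a little under 6, and `SketchBloomFilterBytes(100000, 0.10)` rounds up to 2^20 bits, i.e. 131072 bytes (128 KB), which agrees with the "actual" value in the first table row. The power-of-two rounding is what accounts for the gap between the "theoretical" and "actual" columns.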