pitrou commented on code in PR #37400:
URL: https://github.com/apache/arrow/pull/37400#discussion_r2689717725
##########
cpp/src/parquet/properties.h:
##########
@@ -169,6 +169,37 @@ static constexpr bool DEFAULT_IS_PAGE_INDEX_ENABLED = true;
static constexpr SizeStatisticsLevel DEFAULT_SIZE_STATISTICS_LEVEL =
SizeStatisticsLevel::PageAndColumnChunk;
+struct PARQUET_EXPORT BloomFilterOptions {
+ /// Expected number of distinct values (NDV) in the bloom filter.
+ ///
+ /// Bloom filters are most effective for high-cardinality columns. A good default
+ /// is to set ndv equal to the number of rows. Lower values reduce disk usage but
+ /// may not be worthwhile for very small NDVs.
+ ///
+ /// Increasing ndv (without increasing fpp) increases disk and memory usage.
+ int32_t ndv = 1 << 20;
+
+ /// False-positive probability (FPP) of the bloom filter.
+ ///
+ /// Lower FPP values require more disk and memory space. Recommended values are
+ /// 0.1, 0.05, or 0.001. Very small values are counterproductive as the bitset
+ /// may exceed the size of the actual data. Set ndv appropriately to minimize
+ /// space usage.
+ ///
+ /// Below is a table to demonstrate estimated size using common values.
+ ///
+ /// | ndv | fpp | bits/key | theoretical | actual (Po2) |
+ /// |:-----------|:------|:---------|:------------|:-------------|
+ /// | 100,000 | 0.10 | ~6.0 | 75 KB | **128 KB** |
+ /// | 100,000 | 0.01 | ~10.5 | 131 KB | **256 KB** |
Review Comment:
Hmm, that is unexpected. I'll open an issue on parquet-format. Regardless,
it seems this table should list the actual numbers as obtained through our own
`OptimalNumOfBits` function?
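
For reference, here is a minimal sketch of how the table rows could be recomputed. It assumes the commonly cited sizing formula for split-block Bloom filters with eight bits set per value, m = -8 * ndv / ln(1 - fpp^(1/8)), followed by rounding up to a power of two. The helper names `RawBloomFilterBits` and `RoundedBloomFilterBits` are hypothetical and not part of Arrow. The sketch also omits any minimum or maximum size clamping that a real implementation such as `OptimalNumOfBits` may apply, so its output should be checked against that function rather than taken as authoritative.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical helper (not Arrow's API): unrounded number of bits for a
// split-block Bloom filter, assuming 8 bits are set per inserted value.
inline double RawBloomFilterBits(uint32_t ndv, double fpp) {
  assert(fpp > 0.0 && fpp < 1.0);
  return -8.0 * ndv / std::log(1.0 - std::pow(fpp, 1.0 / 8.0));
}

// Hypothetical helper: round the raw bit count up to the next power of two.
// No minimum/maximum clamping is applied in this sketch.
inline uint64_t RoundedBloomFilterBits(uint32_t ndv, double fpp) {
  const auto raw =
      static_cast<uint64_t>(std::ceil(RawBloomFilterBits(ndv, fpp)));
  uint64_t bits = 1;
  while (bits < raw) bits <<= 1;
  return bits;
}
```

Dividing `RawBloomFilterBits(ndv, fpp)` by `ndv` would give the "bits/key" column, and dividing either result by 8 gives a size in bytes for the "theoretical" and "actual (Po2)" columns.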