wgtmac commented on code in PR #37400:
URL: https://github.com/apache/arrow/pull/37400#discussion_r2689410412
##########
cpp/src/parquet/properties.h:
##########
@@ -169,6 +169,25 @@ static constexpr bool DEFAULT_IS_PAGE_INDEX_ENABLED = true;
static constexpr SizeStatisticsLevel DEFAULT_SIZE_STATISTICS_LEVEL =
SizeStatisticsLevel::PageAndColumnChunk;
+struct PARQUET_EXPORT BloomFilterOptions {
+ // Expected number of distinct values (NDV) in the bloom filter.
+ //
+ // Bloom filters are most effective for high-cardinality columns. A good default
+ // is to set ndv equal to the number of rows. Lower values reduce disk usage but
+ // may not be worthwhile for very small NDVs.
+ //
+ // Increasing ndv (without increasing fpp) increases disk and memory usage.
Review Comment:
I'm poor at math, so I used Gemini to generate a table of filter sizes for common
settings, and verified the results by writing a simple test that prints the actual size:
```
ndv=100000, fpp=0.1 -> bytes=131072 (128 KiB)
ndv=100000, fpp=0.01 -> bytes=131072 (128 KiB)
ndv=1000000, fpp=0.1 -> bytes=1048576 (1024 KiB)
ndv=1000000, fpp=0.01 -> bytes=2097152 (2048 KiB)
ndv=10000000, fpp=0.1 -> bytes=8388608 (8192 KiB)
ndv=10000000, fpp=0.01 -> bytes=16777216 (16384 KiB)
```
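For reference, here is a rough sketch of how sizes like these can be estimated without running the writer. It assumes the filter is sized as `-8 * ndv / ln(1 - fpp^(1/8))` bits, rounded up to a power of two, with a 32-byte minimum and a 128 MiB maximum. The helper names `NextPowerOfTwo` and `EstimateBloomFilterBytes` are hypothetical and not part of the Parquet API, and the bounds are my assumption, so treat this as an approximation rather than the actual implementation.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical helper (not a Parquet API): smallest power of two >= n.
inline std::uint64_t NextPowerOfTwo(std::uint64_t n) {
  std::uint64_t p = 1;
  while (p < n) p <<= 1;
  return p;
}

// Hypothetical helper (not a Parquet API): estimated bloom filter bitset size
// in bytes for a given ndv and fpp, under the assumptions stated above.
inline std::uint64_t EstimateBloomFilterBytes(std::uint64_t ndv, double fpp) {
  constexpr std::uint64_t kMinBits = 32ULL * 8;                 // assumed lower bound
  constexpr std::uint64_t kMaxBits = 128ULL * 1024 * 1024 * 8;  // assumed upper bound

  // Assumed sizing formula for a split-block filter with 8 bits set per value.
  const double bits = -8.0 * static_cast<double>(ndv) /
                      std::log(1.0 - std::pow(fpp, 1.0 / 8.0));

  // Fall back to the maximum when the formula is out of range (or NaN).
  std::uint64_t num_bits = kMaxBits;
  if (bits >= 0.0 && bits <= static_cast<double>(kMaxBits)) {
    num_bits = static_cast<std::uint64_t>(bits);
  }
  if (num_bits < kMinBits) num_bits = kMinBits;

  // Round the bit count up to a power of two, then convert to bytes.
  return NextPowerOfTwo(num_bits) / 8;
}
```

Under these assumptions, `EstimateBloomFilterBytes(1000000, 0.01)` gives 2097152, matching the table. The power-of-two rounding would also explain why ndv=100000 yields the same 128 KiB for both fpp values: the raw bit counts differ, but both round up to the same power of two.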