pitrou commented on code in PR #37400:
URL: https://github.com/apache/arrow/pull/37400#discussion_r2689680147


##########
cpp/src/parquet/properties.h:
##########
@@ -169,6 +169,37 @@ static constexpr bool DEFAULT_IS_PAGE_INDEX_ENABLED = true;
 static constexpr SizeStatisticsLevel DEFAULT_SIZE_STATISTICS_LEVEL =
     SizeStatisticsLevel::PageAndColumnChunk;
 
+struct PARQUET_EXPORT BloomFilterOptions {
+  /// Expected number of distinct values (NDV) in the bloom filter.
+  ///
+  /// Bloom filters are most effective for high-cardinality columns. A good default
+  /// is to set ndv equal to the number of rows. Lower values reduce disk usage but
+  /// may not be worthwhile for very small NDVs.
+  ///
+  /// Increasing ndv (without increasing fpp) increases disk and memory usage.
+  int32_t ndv = 1 << 20;
+
+  /// False-positive probability (FPP) of the bloom filter.
+  ///
+  /// Lower FPP values require more disk and memory space. Recommended values are

Review Comment:
   (it's not precisely proportional to that, but it seems to be a reasonable rule of thumb)
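
For a rough feel of how `ndv` and `fpp` drive space usage, here is a minimal sketch using the textbook sizing formula for a classic Bloom filter with an optimal number of hash functions. This is an assumption made for illustration: Parquet's split-block Bloom filter may be sized differently (and rounded), and the helper names `ClassicBloomFilterBits` and `ClassicBloomFilterBytes` are hypothetical, not part of the Parquet C++ API.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Textbook estimate for a classic Bloom filter using the optimal number of
// hash functions: bits = -ndv * ln(fpp) / (ln 2)^2.
// Assumption: illustrative only; this is not necessarily the formula used
// by Parquet's split-block Bloom filter.
inline double ClassicBloomFilterBits(int64_t ndv, double fpp) {
  assert(ndv > 0);
  assert(fpp > 0.0 && fpp < 1.0);
  const double ln2 = std::log(2.0);
  return -static_cast<double>(ndv) * std::log(fpp) / (ln2 * ln2);
}

// Same estimate expressed in bytes (hypothetical helper).
inline double ClassicBloomFilterBytes(int64_t ndv, double fpp) {
  return ClassicBloomFilterBits(ndv, fpp) / 8.0;
}
```

Under this model the size grows linearly with `ndv` and with `ln(1 / fpp)`, so doubling `ndv` doubles the estimate, while dividing `fpp` by 10 adds a fixed number of bits per distinct value.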


