hiltontj commented on code in PR #5705:
URL: https://github.com/apache/arrow-rs/pull/5705#discussion_r1587027649


##########
parquet/src/bloom_filter/mod.rs:
##########
@@ -16,7 +16,61 @@
 // under the License.
 
 //! Bloom filter implementation specific to Parquet, as described
-//! in the [spec](https://github.com/apache/parquet-format/blob/master/BloomFilter.md).
+//! in the [spec][parquet-bf-spec].
+//!
+//! # Bloom Filter Size
+//!
+//! Parquet uses the [Split Block Bloom Filter][sbbf-paper] (SBBF) as its bloom filter
+//! implementation. For each column upon which bloom filters are enabled, there will be an SBBF
+//! stored in the metadata for each row group in the parquet file. The size of each filter is
+//! initialized using a calculation based on the desired number of distinct values (NDV) and false
+//! positive probability (FPP). The FPP for an SBBF can be approximated as<sup>[1][bf-formulae]</sup>:
+//!
+//! ```text
+//! f = (1 - e^(-k * n / m))^k
+//! ```
+//!
+//! Where, `f` is the FPP, `k` the number of hash functions, `n` the NDV, and `m` the total number
+//! of bits in the bloom filter. This can be re-arranged to determine the total number of bits
+//! required to achieve a given FPP and NDV:
+//!
+//! ```text
+//! m = -k * n / ln(1 - f^(1/k))
+//! ```
+//!
+//! SBBFs use eight hash functions to cleanly fit in SIMD lanes<sup>[2][sbbf-paper]</sup>, therefore
+//! `k` is set to 8. The SBBF will spread those `m` bits across a set of `b` blocks that
+//! are each 256 bits, i.e., 32 bytes, in size. The number of blocks is chosen as:
+//!
+//! ```text
+//! b = NP2(m/8) / 32
+//! ```
+//!
+//! Where, `NP2` denotes *the next power of two*, and `m` is divided by 8 to be represented as bytes.

Review Comment:
   There may be a more idiomatic way of expressing *the next power of two* as a function than `NP2`.
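
On naming: Rust's standard library already calls this operation `next_power_of_two` on its unsigned integer types, so the docs could borrow that name. Below is a minimal sketch of the sizing calculation as the doc comment describes it, with `k = 8`. The function names `num_bits` and `num_blocks` are hypothetical and not taken from the crate, and the clamp to at least one block is an assumption that the formula itself does not state.

```rust
/// Hypothetical helper: total bits `m = -k * n / ln(1 - f^(1/k))` with `k = 8`.
/// Assumes `fpp` lies strictly between 0 and 1.
fn num_bits(ndv: u64, fpp: f64) -> f64 {
    const K: f64 = 8.0; // eight hash functions, as stated in the doc comment
    -K * ndv as f64 / (1.0 - fpp.powf(1.0 / K)).ln()
}

/// Hypothetical helper: block count `b = next_power_of_two(m / 8) / 32`.
fn num_blocks(ndv: u64, fpp: f64) -> u64 {
    // Convert bits to bytes, rounding up so no bits are dropped.
    let bytes = (num_bits(ndv, fpp) / 8.0).ceil() as u64;
    // Each block is 32 bytes (256 bits). Clamping to one block is an
    // assumption added here so tiny inputs do not yield zero blocks.
    (bytes.next_power_of_two() / 32).max(1)
}
```

For example, `num_blocks(1_000_000, 0.01)` evaluates to 65536 blocks, i.e. a 2 MiB filter.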


