This is an automated email from the ASF dual-hosted git repository.

alamb pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow-rs.git


The following commit(s) were added to refs/heads/master by this push:
     new 2bdc9c1eece docs: add sizing explanation to bloom filter docs in 
parquet (#5705)
2bdc9c1eece is described below

commit 2bdc9c1eece2f734c114d0e5c8f04677c1c69576
Author: Trevor Hilton <[email protected]>
AuthorDate: Sat May 4 05:00:45 2024 -0700

    docs: add sizing explanation to bloom filter docs in parquet (#5705)
    
    * docs: add sizing explanation to bloom filter docs in parquet
    
    Added documentation detailing the sizing of bloom filters in the parquet
    crate.
    
    * docs: fix doc comment typo
    
    * docs: clarify doc comment on bloom filters in metadata
    
    Updated the bloom filter module doc comment to clarify that the metadata 
stores the offset/length of the bloom filter, and not the bloom filter in its 
entirety.
    
    Co-authored-by: Andrew Lamb <[email protected]>
    
    ---------
    
    Co-authored-by: Andrew Lamb <[email protected]>
---
 parquet/src/bloom_filter/mod.rs | 56 ++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 55 insertions(+), 1 deletion(-)

diff --git a/parquet/src/bloom_filter/mod.rs b/parquet/src/bloom_filter/mod.rs
index d99c7251902..ebdb3999424 100644
--- a/parquet/src/bloom_filter/mod.rs
+++ b/parquet/src/bloom_filter/mod.rs
@@ -16,7 +16,61 @@
 // under the License.
 
 //! Bloom filter implementation specific to Parquet, as described
-//! in the 
[spec](https://github.com/apache/parquet-format/blob/master/BloomFilter.md).
+//! in the [spec][parquet-bf-spec].
+//!
+//! # Bloom Filter Size
+//!
+//! Parquet uses the [Split Block Bloom Filter][sbbf-paper] (SBBF) as its 
bloom filter
+//! implementation. For each column upon which bloom filters are enabled, the 
offset and length of an SBBF
+//! is stored in  the metadata for each row group in the parquet file. The 
size of each filter is
+//! initialized using a calculation based on the desired number of distinct 
values (NDV) and false
+//! positive probability (FPP). The FPP for a SBBF can be approximated 
as<sup>[1][bf-formulae]</sup>:
+//!
+//! ```text
+//! f = (1 - e^(-k * n / m))^k
+//! ```
+//!
+//! Where, `f` is the FPP, `k` the number of hash functions, `n` the NDV, and 
`m` the total number
+//! of bits in the bloom filter. This can be re-arranged to determine the 
total number of bits
+//! required to achieve a given FPP and NDV:
+//!
+//! ```text
+//! m = -k * n / ln(1 - f^(1/k))
+//! ```
+//!
+//! SBBFs use eight hash functions to cleanly fit in SIMD 
lanes<sup>[2][sbbf-paper]</sup>, therefore
+//! `k` is set to 8. The SBBF will spread those `m` bits accross a set of `b` 
blocks that
+//! are each 256 bits, i.e., 32 bytes, in size. The number of blocks is chosen 
as:
+//!
+//! ```text
+//! b = NP2(m/8) / 32
+//! ```
+//!
+//! Where, `NP2` denotes *the next power of two*, and `m` is divided by 8 to 
be represented as bytes.
+//!
+//! Here is a table of calculated sizes for various FPP and NDV:
+//!
+//! | NDV       | FPP       | b       | Size (KB) |
+//! |-----------|-----------|---------|-----------|
+//! | 10,000    | 0.1       | 256     | 8         |
+//! | 10,000    | 0.01      | 512     | 16        |
+//! | 10,000    | 0.001     | 1,024   | 32        |
+//! | 10,000    | 0.0001    | 1,024   | 32        |
+//! | 100,000   | 0.1       | 4,096   | 128       |
+//! | 100,000   | 0.01      | 4,096   | 128       |
+//! | 100,000   | 0.001     | 8,192   | 256       |
+//! | 100,000   | 0.0001    | 16,384  | 512       |
+//! | 100,000   | 0.00001   | 16,384  | 512       |
+//! | 1,000,000 | 0.1       | 32,768  | 1,024     |
+//! | 1,000,000 | 0.01      | 65,536  | 2,048     |
+//! | 1,000,000 | 0.001     | 65,536  | 2,048     |
+//! | 1,000,000 | 0.0001    | 131,072 | 4,096     |
+//! | 1,000,000 | 0.00001   | 131,072 | 4,096     |
+//! | 1,000,000 | 0.000001  | 262,144 | 8,192     |
+//!
+//! [parquet-bf-spec]: 
https://github.com/apache/parquet-format/blob/master/BloomFilter.md
+//! [sbbf-paper]: https://arxiv.org/pdf/2101.01719
+//! [bf-formulae]: http://tfk.mit.edu/pdf/bloom.pdf
 
 use crate::data_type::AsBytes;
 use crate::errors::ParquetError;

Reply via email to