This is an automated email from the ASF dual-hosted git repository.
alamb pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow-rs.git
The following commit(s) were added to refs/heads/master by this push:
new 2bdc9c1eece docs: add sizing explanation to bloom filter docs in
parquet (#5705)
2bdc9c1eece is described below
commit 2bdc9c1eece2f734c114d0e5c8f04677c1c69576
Author: Trevor Hilton <[email protected]>
AuthorDate: Sat May 4 05:00:45 2024 -0700
docs: add sizing explanation to bloom filter docs in parquet (#5705)
* docs: add sizing explanation to bloom filter docs in parquet
Added documentation detailing the sizing of bloom filters in the parquet
crate.
* docs: fix doc comment typo
* docs: clarify doc comment on bloom filters in metadata
Updated the bloom filter module doc comment to clarify that the metadata
stores the offset/length of the bloom filter, and not the bloom filter in its
entirety.
Co-authored-by: Andrew Lamb <[email protected]>
---------
Co-authored-by: Andrew Lamb <[email protected]>
---
parquet/src/bloom_filter/mod.rs | 56 ++++++++++++++++++++++++++++++++++++++++-
1 file changed, 55 insertions(+), 1 deletion(-)
diff --git a/parquet/src/bloom_filter/mod.rs b/parquet/src/bloom_filter/mod.rs
index d99c7251902..ebdb3999424 100644
--- a/parquet/src/bloom_filter/mod.rs
+++ b/parquet/src/bloom_filter/mod.rs
@@ -16,7 +16,61 @@
// under the License.
//! Bloom filter implementation specific to Parquet, as described
-//! in the
[spec](https://github.com/apache/parquet-format/blob/master/BloomFilter.md).
+//! in the [spec][parquet-bf-spec].
+//!
+//! # Bloom Filter Size
+//!
+//! Parquet uses the [Split Block Bloom Filter][sbbf-paper] (SBBF) as its
bloom filter
+//! implementation. For each column upon which bloom filters are enabled, the
offset and length of an SBBF
+//! is stored in the metadata for each row group in the parquet file. The
size of each filter is
+//! initialized using a calculation based on the desired number of distinct
values (NDV) and false
+//! positive probability (FPP). The FPP for a SBBF can be approximated
as<sup>[1][bf-formulae]</sup>:
+//!
+//! ```text
+//! f = (1 - e^(-k * n / m))^k
+//! ```
+//!
+//! Where, `f` is the FPP, `k` the number of hash functions, `n` the NDV, and
`m` the total number
+//! of bits in the bloom filter. This can be re-arranged to determine the
total number of bits
+//! required to achieve a given FPP and NDV:
+//!
+//! ```text
+//! m = -k * n / ln(1 - f^(1/k))
+//! ```
+//!
+//! SBBFs use eight hash functions to cleanly fit in SIMD
lanes<sup>[2][sbbf-paper]</sup>, therefore
+//! `k` is set to 8. The SBBF will spread those `m` bits accross a set of `b`
blocks that
+//! are each 256 bits, i.e., 32 bytes, in size. The number of blocks is chosen
as:
+//!
+//! ```text
+//! b = NP2(m/8) / 32
+//! ```
+//!
+//! Where, `NP2` denotes *the next power of two*, and `m` is divided by 8 to
be represented as bytes.
+//!
+//! Here is a table of calculated sizes for various FPP and NDV:
+//!
+//! | NDV | FPP | b | Size (KB) |
+//! |-----------|-----------|---------|-----------|
+//! | 10,000 | 0.1 | 256 | 8 |
+//! | 10,000 | 0.01 | 512 | 16 |
+//! | 10,000 | 0.001 | 1,024 | 32 |
+//! | 10,000 | 0.0001 | 1,024 | 32 |
+//! | 100,000 | 0.1 | 4,096 | 128 |
+//! | 100,000 | 0.01 | 4,096 | 128 |
+//! | 100,000 | 0.001 | 8,192 | 256 |
+//! | 100,000 | 0.0001 | 16,384 | 512 |
+//! | 100,000 | 0.00001 | 16,384 | 512 |
+//! | 1,000,000 | 0.1 | 32,768 | 1,024 |
+//! | 1,000,000 | 0.01 | 65,536 | 2,048 |
+//! | 1,000,000 | 0.001 | 65,536 | 2,048 |
+//! | 1,000,000 | 0.0001 | 131,072 | 4,096 |
+//! | 1,000,000 | 0.00001 | 131,072 | 4,096 |
+//! | 1,000,000 | 0.000001 | 262,144 | 8,192 |
+//!
+//! [parquet-bf-spec]:
https://github.com/apache/parquet-format/blob/master/BloomFilter.md
+//! [sbbf-paper]: https://arxiv.org/pdf/2101.01719
+//! [bf-formulae]: http://tfk.mit.edu/pdf/bloom.pdf
use crate::data_type::AsBytes;
use crate::errors::ParquetError;