This is an automated email from the ASF dual-hosted git repository.

blue pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/parquet-format.git

The following commit(s) were added to refs/heads/master by this push:
     new 3fb10e0  PARQUET-1630: Update Bloom filter format (#146)
3fb10e0 is described below

commit 3fb10e00c2204bf1c6cc91e094c59e84cefcee33
Author: Chen, Junjie <jimmyjc...@tencent.com>
AuthorDate: Tue Aug 27 07:27:32 2019 +0800

    PARQUET-1630: Update Bloom filter format (#146)
---
 BloomFilter.md                        |  18 ++++++++++++++----
 doc/images/FileLayoutBloomFilter1.png | Bin 0 -> 44025 bytes
 doc/images/FileLayoutBloomFilter2.png | Bin 0 -> 34018 bytes
 3 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/BloomFilter.md b/BloomFilter.md
index b8208c8..2fa24e9 100644
--- a/BloomFilter.md
+++ b/BloomFilter.md
@@ -264,10 +264,13 @@ false positive rates:
 | 41 | 0.001 % |
 
 #### File Format
-The Bloom filter data of a column chunk, which contains the size of the filter in bytes, the
-algorithm, the hash function and the Bloom filter bitset, is stored near the footer. The Bloom
-filter data offset is stored in column chunk metadata. Here are Bloom filter definitions in
-thrift:
+
+Each multi-block Bloom filter is required to work for only one column chunk. The data of a multi-block
+bloom filter consists of the bloom filter header followed by the bloom filter bitset. The bloom filter
+header encodes the size of the bloom filter bit set in bytes that is used to read the bitset.
+
+Here are the Bloom filter definitions in thrift:
+
 
 ```
 /** Block-based algorithm type annotation. **/
@@ -323,6 +326,13 @@ struct ColumnMetaData {
 
 ```
 
+The Bloom filters are grouped by row group and with data for each column in the same order as the file schema.
+The Bloom filter data can be stored before the page indexes after all row groups. The file layout looks like:
+ ![File Layout - Bloom filter footer](doc/images/FileLayoutBloomFilter2.png)
+
+Or it can be stored between row groups, the file layout looks like:
+ ![File Layout - Bloom filter footer](doc/images/FileLayoutBloomFilter1.png)
+
 #### Encryption
 In the case of columns with sensitive data, the Bloom filter exposes a subset of sensitive
 information such as the presence of value. Therefore the Bloom filter of columns with sensitive
diff --git a/doc/images/FileLayoutBloomFilter1.png b/doc/images/FileLayoutBloomFilter1.png
new file mode 100644
index 0000000..3b21738
Binary files /dev/null and b/doc/images/FileLayoutBloomFilter1.png differ
diff --git a/doc/images/FileLayoutBloomFilter2.png b/doc/images/FileLayoutBloomFilter2.png
new file mode 100755
index 0000000..6bbf770
Binary files /dev/null and b/doc/images/FileLayoutBloomFilter2.png differ
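
The ordering rule added by this commit (filters grouped by row group, columns in file-schema order, each filter being a header followed by its bitset) can be illustrated with a small offset calculation. This is only a sketch for the layout where all Bloom filter data sits after the row groups. The function name `layout_bloom_filters`, its parameters, the back-to-back packing without padding, and the skipping of columns that have no filter are all assumptions made for illustration; none of them come from the spec text, and the header size is taken as an input because the real header is a Thrift structure whose encoded length varies.

```python
from typing import Dict, List, Tuple


def layout_bloom_filters(
    start_offset: int,
    schema_columns: List[str],
    filter_sizes: List[Dict[str, Tuple[int, int]]],
) -> Dict[Tuple[int, str], int]:
    """Assign a starting file offset to each Bloom filter (hypothetical helper).

    filter_sizes[i] maps a column name to (header_bytes, bitset_bytes) for
    row group i. Columns with no Bloom filter are absent from the mapping
    and are skipped (an assumption of this sketch).
    """
    offsets: Dict[Tuple[int, str], int] = {}
    position = start_offset
    # Outer loop: filters are grouped by row group.
    for row_group_index, sizes in enumerate(filter_sizes):
        # Inner loop: iterate in file-schema order, not dictionary order.
        for column in schema_columns:
            if column not in sizes:
                continue
            header_bytes, bitset_bytes = sizes[column]
            offsets[(row_group_index, column)] = position
            # Each filter is its header immediately followed by its bitset;
            # packing filters back to back is an assumption of this sketch.
            position += header_bytes + bitset_bytes
    return offsets


if __name__ == "__main__":
    example = layout_bloom_filters(
        start_offset=1000,
        schema_columns=["a", "b"],
        filter_sizes=[{"b": (10, 32), "a": (10, 64)}, {"a": (12, 32)}],
    )
    for key, offset in sorted(example.items()):
        print(key, offset)
```

In a real file a reader would not recompute these positions; per the removed text of the diff, the Bloom filter data offset is recorded in the column chunk metadata. The sketch only shows the relative order the commit describes.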