This is an automated email from the ASF dual-hosted git repository.

blue pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/parquet-format.git


The following commit(s) were added to refs/heads/master by this push:
     new 3fb10e0  PARQUET-1630: Update Bloom filter format (#146)
3fb10e0 is described below

commit 3fb10e00c2204bf1c6cc91e094c59e84cefcee33
Author: Chen, Junjie <jimmyjc...@tencent.com>
AuthorDate: Tue Aug 27 07:27:32 2019 +0800

    PARQUET-1630: Update Bloom filter format (#146)
---
 BloomFilter.md                        |  18 ++++++++++++++----
 doc/images/FileLayoutBloomFilter1.png | Bin 0 -> 44025 bytes
 doc/images/FileLayoutBloomFilter2.png | Bin 0 -> 34018 bytes
 3 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/BloomFilter.md b/BloomFilter.md
index b8208c8..2fa24e9 100644
--- a/BloomFilter.md
+++ b/BloomFilter.md
@@ -264,10 +264,13 @@ false positive rates:
 |                       41   |  0.001 %                   |
 
 #### File Format
-The Bloom filter data of a column chunk, which contains the size of the filter in bytes, the
-algorithm, the hash function and the Bloom filter bitset, is stored near the footer. The Bloom
-filter data offset is stored in column chunk metadata. Here are Bloom filter definitions in
-thrift:
+
+Each multi-block Bloom filter is required to work for only one column chunk. The data of a multi-block
+bloom filter consists of the bloom filter header followed by the bloom filter bitset. The bloom filter
+header encodes the size of the bloom filter bit set in bytes that is used to read the bitset.
+
+Here are the Bloom filter definitions in thrift:
+
 
 ```
 /** Block-based algorithm type annotation. **/
@@ -323,6 +326,13 @@ struct ColumnMetaData {
 
 ```
 
+The Bloom filters are grouped by row group and with data for each column in the same order as the file schema.
+The Bloom filter data can be stored before the page indexes after all row groups. The file layout looks like:
+ ![File Layout - Bloom filter footer](doc/images/FileLayoutBloomFilter2.png)
+
+Or it can be stored between row groups, the file layout looks like:
+ ![File Layout - Bloom filter footer](doc/images/FileLayoutBloomFilter1.png)
+
 #### Encryption
 In the case of columns with sensitive data, the Bloom filter exposes a subset of sensitive
 information such as the presence of value. Therefore the Bloom filter of columns with sensitive
diff --git a/doc/images/FileLayoutBloomFilter1.png b/doc/images/FileLayoutBloomFilter1.png
new file mode 100644
index 0000000..3b21738
Binary files /dev/null and b/doc/images/FileLayoutBloomFilter1.png differ
diff --git a/doc/images/FileLayoutBloomFilter2.png b/doc/images/FileLayoutBloomFilter2.png
new file mode 100755
index 0000000..6bbf770
Binary files /dev/null and b/doc/images/FileLayoutBloomFilter2.png differ
