[ 
https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16636560#comment-16636560
 ] 

ASF GitHub Bot commented on PARQUET-41:
---------------------------------------

cjjnjust closed pull request #99: PARQUET-41: add bloom filter
URL: https://github.com/apache/parquet-format/pull/99
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift
index 3b15cfef..66e676a0 100644
--- a/src/main/thrift/parquet.thrift
+++ b/src/main/thrift/parquet.thrift
@@ -473,6 +473,7 @@ enum PageType {
   INDEX_PAGE = 1;
   DICTIONARY_PAGE = 2;
   DATA_PAGE_V2 = 3;
+  BLOOM_FILTER_PAGE = 4;
 }
 
 /**
@@ -552,6 +553,44 @@ struct DataPageHeaderV2 {
   8: optional Statistics statistics;
 }
 
+/** Block-based algorithm type annotation. **/
+struct SplitBlockAlgorithm {}
+
+/** The algorithm used in Bloom filter. **/
+union BloomFilterAlgorithm {
+  /** Block-based Bloom filter. **/
+  1: SplitBlockAlgorithm BLOCK;
+}
+
+/** Hash strategy type annotation. It uses Murmur3Hash_x64_128 from the 
original SMHasher
+ * repo by Austin Appleby.
+ **/
+struct Murmur3 {}
+
+/** 
+ * The hash function used in Bloom filter. This function takes the hash of a 
column value
+ * using plain encoding.
+ **/
+union BloomFilterHash {
+  /** Murmur3 Hash Strategy. **/
+  1: Murmur3 MURMUR3;
+}
+
+/**
+  * Bloom filter header is stored at beginning of Bloom filter data of each 
column
+  * and followed by its bitset.
+  **/
+struct BloomFilterPageHeader {
+  /** The size of bitset in bytes **/
+  1: required i32 numBytes;
+
+  /** The algorithm for setting bits. **/
+  2: required BloomFilterAlgorithm algorithm;
+
+  /** The hash function used for Bloom filter. **/
+  3: required BloomFilterHash hash;
+}
+
 struct PageHeader {
   /** the type of the page: indicates which of the *_header fields is set **/
   1: required PageType type
@@ -572,6 +611,7 @@ struct PageHeader {
   6: optional IndexPageHeader index_page_header;
   7: optional DictionaryPageHeader dictionary_page_header;
   8: optional DataPageHeaderV2 data_page_header_v2;
+  9: optional BloomFilterPageHeader bloom_filter_page_header;
 }
 
 /**
@@ -658,6 +698,9 @@ struct ColumnMetaData {
    * This information can be used to determine if all data pages are
    * dictionary encoded for example **/
   13: optional list<PageEncodingStats> encoding_stats;
+
+  /** Byte offset from beginning of file to Bloom filter data. **/
+  14: optional i64 bloom_filter_offset;
 }
 
 struct ColumnChunk {


 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add bloom filters to parquet statistics
> ---------------------------------------
>
>                 Key: PARQUET-41
>                 URL: https://issues.apache.org/jira/browse/PARQUET-41
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-format, parquet-mr
>            Reporter: Alex Levenson
>            Assignee: Junjie Chen
>            Priority: Major
>              Labels: filter2, pull-request-available
>
> For row groups with no dictionary, we could still produce a bloom filter. 
> This could be very useful in filtering entire row groups.
> Pull request:
> https://github.com/apache/parquet-mr/pull/215



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to