Daniel Becker has uploaded a new patch set (#8). ( 
http://gerrit.cloudera.org:8080/17262 )

Change subject: WIP - IMPALA-10642: Write support for Parquet Bloom filters - 
most common types
......................................................................

WIP - IMPALA-10642: Write support for Parquet Bloom filters - most common types

This change adds support for writing Parquet Bloom filters for the types
for which read support was added in IMPALA-10640.

Writing of Parquet Bloom filters can be controlled by the
'parquet_bloom_filter_write' query option which has the following
possible values:
  NEVER      - never write Parquet Bloom filters
  IF_NO_DICT - write Parquet Bloom filters if specified in the table
               properties AND if the row group is not fully
               dictionary encoded
  ALWAYS     - always write Parquet Bloom filters if specified in the
               table properties, even if the row group is fully
               dictionary encoded

Introduced the 'parquet.bloom.filter.columns' table property. It is a
comma separated pairs of 'col_name:bytes' pairs. The 'bytes' part means
the size of the bitset of the Bloom filter, and is optional. If the size
is not given, it will be the maximal Bloom filter size
(ParquetBloomFilter::MAX_BYTES).
Example: "col1:1024,col2,col4:100'.

Testing:
  - Added a test in tests/query_test/test_parquet_bloom_filter.py that
    uses Impala to write the same table as in the test file
    'testdata/data/parquet-bloom-filtering.parquet' and checks whether the
    Parquet Bloom filter header and bitset are identical.
  - TODO: Test falling back from dict encoding to plain and using Bloom
    filters.
  - Test table properties.

Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
---
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-sink.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.h
M be/src/exec/parquet/parquet-bloom-filter-util.cc
M be/src/exec/parquet/parquet-bloom-filter-util.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/debug-util.cc
M be/src/util/debug-util.h
M be/src/util/dict-encoding.h
M be/src/util/parquet-bloom-filter-test.cc
M be/src/util/parquet-bloom-filter.cc
M be/src/util/parquet-bloom-filter.h
M common/thrift/DataSinks.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java
M tests/query_test/test_parquet_bloom_filter.py
20 files changed, 603 insertions(+), 30 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/62/17262/8
--
To view, visit http://gerrit.cloudera.org:8080/17262
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
Gerrit-Change-Number: 17262
Gerrit-PatchSet: 8
Gerrit-Owner: Daniel Becker <daniel.bec...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <daniel.bec...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>

Reply via email to