Daniel Becker has uploaded a new patch set (#13). ( http://gerrit.cloudera.org:8080/17262 )
Change subject: IMPALA-10642: Write support for Parquet Bloom filters - most common types
......................................................................

IMPALA-10642: Write support for Parquet Bloom filters - most common types

This change adds support for writing Parquet Bloom filters for the types
for which read support was added in IMPALA-10640.

Writing of Parquet Bloom filters can be controlled by the
'parquet_bloom_filter_write' query option and the
'parquet.bloom.filter.columns' table property. The query option has the
following possible values:
  NEVER      - never write Parquet Bloom filters
  IF_NO_DICT - write Parquet Bloom filters if specified in the table
               properties AND if the row group is not fully dictionary
               encoded (the number of distinct values exceeds the maximum
               dictionary size)
  ALWAYS     - always write Parquet Bloom filters if specified in the table
               properties, even if the row group is fully dictionary encoded

The 'parquet.bloom.filter.columns' table property is a comma-separated
list of 'col_name:bytes' pairs. The 'bytes' part is the size in bytes of
the Bloom filter's bitset and is optional. If the size is not given, the
maximum Bloom filter size (ParquetBloomFilter::MAX_BYTES) is used.
Example: 'col1:1024,col2,col4:100'.

Testing:
  - Added a test in tests/query_test/test_parquet_bloom_filter.py that uses
    Impala to write the same table as in the test file
    'testdata/data/parquet-bloom-filtering.parquet' and checks whether the
    Parquet Bloom filter header and bitset are identical.
  - 'test_fallback_from_dict' tests falling back from dict encoding to
    plain and using Bloom filters.
  - 'test_fallback_from_dict_if_no_bloom_tbl_props' tests falling back from
    dict encoding to plain when Bloom filters are NOT enabled.
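As an illustration only, the table property format and the three query option values described in the commit message could be modelled as below. This is a minimal Python sketch, not the C++ code from the patch: the function names, the handling of empty entries, and the ASSUMED_MAX_BYTES value are assumptions made for the example and are not taken from this message.

```python
from typing import Dict

# Placeholder standing in for ParquetBloomFilter::MAX_BYTES. The real constant
# lives in be/src/util/parquet-bloom-filter.h; this value is an assumption.
ASSUMED_MAX_BYTES = 128 * 1024 * 1024


def parse_bloom_filter_columns(prop: str,
                               max_bytes: int = ASSUMED_MAX_BYTES) -> Dict[str, int]:
    """Parse a comma-separated list of 'col_name[:bytes]' entries.

    Returns a mapping from column name to bitset size in bytes. Entries
    without an explicit size get max_bytes.
    """
    result: Dict[str, int] = {}
    for entry in prop.split(","):
        entry = entry.strip()
        if not entry:
            continue  # skipping empty entries is an assumption of this sketch
        name, sep, size = entry.partition(":")
        name = name.strip()
        if not name:
            raise ValueError("missing column name in entry: %r" % entry)
        # int() raises ValueError for a malformed or empty size such as 'col:'.
        result[name] = int(size) if sep else max_bytes
    return result


def should_write_bloom_filter(mode: str, col_in_tbl_props: bool,
                              fully_dict_encoded: bool) -> bool:
    """Decide whether to write a Bloom filter for one column of a row group."""
    if mode == "NEVER" or not col_in_tbl_props:
        return False
    if mode == "ALWAYS":
        return True
    if mode == "IF_NO_DICT":
        return not fully_dict_encoded
    raise ValueError("unknown parquet_bloom_filter_write value: %r" % mode)


if __name__ == "__main__":
    cols = parse_bloom_filter_columns("col1:1024,col2,col4:100")
    for col, size in cols.items():
        print(col, size, should_write_bloom_filter("IF_NO_DICT", True, False))
```

In this sketch a column that is absent from the table property never gets a Bloom filter, whatever the query option says, which matches the "if specified in the table properties" wording of all three option descriptions.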
Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
---
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-sink.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.h
M be/src/exec/parquet/parquet-bloom-filter-util.cc
M be/src/exec/parquet/parquet-bloom-filter-util.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/debug-util.cc
M be/src/util/debug-util.h
M be/src/util/dict-encoding.h
M be/src/util/parquet-bloom-filter-test.cc
M be/src/util/parquet-bloom-filter.cc
M be/src/util/parquet-bloom-filter.h
M common/thrift/DataSinks.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java
M tests/query_test/test_parquet_bloom_filter.py
20 files changed, 706 insertions(+), 30 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/62/17262/13
--
To view, visit http://gerrit.cloudera.org:8080/17262
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
Gerrit-Change-Number: 17262
Gerrit-PatchSet: 13
Gerrit-Owner: Daniel Becker <daniel.bec...@cloudera.com>
Gerrit-Reviewer: Amogh Margoor <amarg...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <daniel.bec...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>