Lars Volker has uploaded a new patch set (#7).

Change subject: IMPALA-4815, IMPALA-4817, IMPALA-4819: Write and Read Parquet Statistics for remaining types
......................................................................

IMPALA-4815, IMPALA-4817, IMPALA-4819: Write and Read Parquet Statistics for remaining types

This change adds functionality to write and read parquet::Statistics for Decimal, String, and Timestamp values. As an exception, we don't read statistics for CHAR columns, since CHAR support is broken in Impala (IMPALA-1652).

This change also switches from populating the deprecated fields 'min' and 'max' to populating the new fields 'min_value' and 'max_value' in parquet::Statistics, which were added in parquet-format pull request #46.

The HdfsParquetScanner prefers the new fields if they are populated and if the column order 'TypeDefinedOrder' has been used to compute the statistics. For columns without a column order set, or with only the deprecated fields populated, the scanner reads the statistics only if the column is of a simple numeric type, i.e. boolean, integer, or floating point.

This change removes from the tests the validation of the Parquet Statistics we write to Hive, since Hive does not write the new fields. Instead, it adds a Parquet file written by Hive that uses the deprecated fields for its statistics, and uses that file in a test to exercise the fallback logic for supported types.

This change also cleans up the interface of ParquetPlainEncoder in parquet-common.h.

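The read-side selection described above can be sketched roughly as follows. This is an illustration only, not the code in this patch: ColType, Stats, StatsSource, IsSimpleNumeric and ChooseStatsSource are hypothetical stand-ins for the real Impala and parquet.thrift types, and the sketch assumes both bounds are selected together and that, without 'TypeDefinedOrder', only the deprecated fields are considered.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical, simplified column types (not Impala's ColumnType).
enum class ColType {
  BOOLEAN, INT32, INT64, FLOAT, DOUBLE, DECIMAL, STRING, TIMESTAMP, CHAR
};

// Hypothetical stand-in for parquet::Statistics; values are opaque bytes.
struct Stats {
  std::optional<std::string> min, max;              // deprecated fields
  std::optional<std::string> min_value, max_value;  // new fields
};

enum class StatsSource { NONE, NEW_FIELDS, DEPRECATED_FIELDS };

// "Simple numeric" as used in the commit message: boolean, integer, float.
inline bool IsSimpleNumeric(ColType t) {
  switch (t) {
    case ColType::BOOLEAN:
    case ColType::INT32:
    case ColType::INT64:
    case ColType::FLOAT:
    case ColType::DOUBLE:
      return true;
    default:
      return false;
  }
}

// Decides which pair of fields, if any, a reader would trust.
inline StatsSource ChooseStatsSource(const Stats& s, ColType type,
                                     bool has_type_defined_order) {
  // Statistics for CHAR columns are never read (IMPALA-1652).
  if (type == ColType::CHAR) return StatsSource::NONE;
  // Prefer the new fields when they are populated and the order is known.
  if (has_type_defined_order && s.min_value.has_value() &&
      s.max_value.has_value()) {
    return StatsSource::NEW_FIELDS;
  }
  // Assumed fallback: deprecated fields, simple numeric types only.
  if (s.min.has_value() && s.max.has_value() && IsSimpleNumeric(type)) {
    return StatsSource::DEPRECATED_FIELDS;
  }
  return StatsSource::NONE;
}

// Example: a file with only deprecated fields, as Hive would write it.
inline void ExampleUsage() {
  Stats hive_stats;
  hive_stats.min = "a";
  hive_stats.max = "z";
  assert(ChooseStatsSource(hive_stats, ColType::INT32, false) ==
         StatsSource::DEPRECATED_FIELDS);
  assert(ChooseStatsSource(hive_stats, ColType::STRING, false) ==
         StatsSource::NONE);
}
```

Under these assumptions, a String column whose file carries only 'min' and 'max' yields no usable statistics, which is the case the Hive-written test file is meant to cover.
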
Change-Id: I3ef4a5d25a57c82577fd498d6d1c4297ecf39312
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/parquet-column-stats.cc
M be/src/exec/parquet-column-stats.h
M be/src/exec/parquet-column-stats.inline.h
M be/src/exec/parquet-common.h
M be/src/exec/parquet-metadata-utils.cc
M be/src/exec/parquet-metadata-utils.h
M be/src/exec/parquet-plain-test.cc
M be/src/util/dict-encoding.h
M common/thrift/parquet.thrift
M testdata/data/README
A testdata/data/deprecated_statistics.parquet
A testdata/workloads/functional-query/queries/QueryTest/parquet-deprecated-stats.test
M testdata/workloads/functional-query/queries/QueryTest/parquet_stats.test
M tests/query_test/test_insert_parquet.py
M tests/query_test/test_parquet_stats.py
18 files changed, 884 insertions(+), 259 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/63/6563/7
--
To view, visit http://gerrit.cloudera.org:8080/6563
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I3ef4a5d25a57c82577fd498d6d1c4297ecf39312
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker <l...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <atti...@cloudera.com>
Gerrit-Reviewer: Lars Volker <l...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <mar...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>