Csaba Ringhofer has uploaded a new patch set (#3). ( http://gerrit.cloudera.org:8080/19793 )
Change subject: IMPALA-6433: Add read support for PageHeaderV2 ...................................................................... IMPALA-6433: Add read support for PageHeaderV2 Parquet v2 means several changes in Parquet files compared to v1: 1. file version = 2 instead of 1 https://github.com/apache/parquet-format/blob/c185faf0c4fc0c7d3075d1abd4ed0985cdbe5d87/src/main/thrift/parquet.thrift#L1016 Before this patch Impala rejected Parquet files with version!=1. 2. possible use of DataPageHeaderV2 instead DataPageHeader https://github.com/apache/parquet-format/blob/c185faf0c4fc0c7d3075d1abd4ed0985cdbe5d87/src/main/thrift/parquet.thrift#L561 The main differences compared to V1 DataPageHeader: a. rep/def levels are not compressed, so the compressed part contains only the actual encoded values b. rep/def levels must be RLE encoded (Impala only supports RLE encoded levels even for V1 pages) c. compression can be turned on/off per page (member is_compressed) d. number of nulls (member num_nulls) is required - in v1 it was included in statistics which is optional e. number of rows is required (member num_rows) which can help with matching collection items with the top level collection Most of the BE changes deal with implementing a. Some of the rep/def level logic is moved from ParquetColumnReader to ParquetColumnChunkReader to allow handling all v1/v2 differences in ParquetColumnChunkReader. The patch adds support for understanding v2 data pages but does not implement some potential optimizations: a. would allow an optimization for queries that need only the nullness of a column but not the actual value: as the values are not needed the decompression of the page data can be skipped. This optimization is not implemented - currently Impala materializes both the null bit and the value for all columns regardless of whether the value is actually needed. d. could be also used for optimizations / addittional validity checks but it is not used currently e. could make skipping rows easier but is not used, as the existing scanner has to be able to skip rows efficiently also in v1 files so it can't rely on num_rows 3. possible use of new encodings (e.g. DELTA_BINARY_PACKED) No new encoding is added - when an unsupported encoding is encountered Impala returns an error. parquet-mr uses new encodings (DELTA_BINARY_PACKED, DELTA_BYTE_ARRAY) for most types if the file version is 2, so with this patch Impala is not yet able to read all v2 Parquet tables written by Hive. 4. Encoding PLAIN_DICTIONARY is deprecated and RLE_DICTIONARY is used instead. The semantics of the two encodingsare exactly the same. Addittional changes: Some responsibilites are moved from ParquetColumnReader to ParquetColumnChunkReader: - ParquetColumnChunkReader decodes rep/def level sizes to hide v1/v2 differences (see 2.a.) - ParquetColumnChunkReader skips empty data pages in ReadNextDataPageHeader - the state machine of ParquetColumnChunkReader is simplified by separating data page header reading / reading rest of the page Testing: - added 2 v2 Parquet test tables (written by Hive) - added EE and fuzz tests for the test tables above - manual tested v2 Parquet files written by pyarrow - ran core tests TODO: no test for is_compressed is added yet as generating files where some pages are compressed while some are not is tricky with existing writers. Change-Id: I282962a6e4611e2b662c04a81592af83ecaf08ca --- M be/src/exec/parquet/hdfs-parquet-table-writer.cc M be/src/exec/parquet/parquet-column-chunk-reader.cc M be/src/exec/parquet/parquet-column-chunk-reader.h M be/src/exec/parquet/parquet-column-readers.cc M be/src/exec/parquet/parquet-column-readers.h M be/src/exec/parquet/parquet-common.h M be/src/exec/parquet/parquet-level-decoder.cc M be/src/exec/parquet/parquet-level-decoder.h M be/src/exec/parquet/parquet-metadata-utils.cc M testdata/datasets/functional/functional_schema_template.sql M testdata/datasets/functional/schema_constraints.csv A testdata/workloads/functional-query/queries/QueryTest/parquet-v2.test M tests/query_test/test_scanners.py M tests/query_test/test_scanners_fuzz.py 14 files changed, 457 insertions(+), 178 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/93/19793/3 -- To view, visit http://gerrit.cloudera.org:8080/19793 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I282962a6e4611e2b662c04a81592af83ecaf08ca Gerrit-Change-Number: 19793 Gerrit-PatchSet: 3 Gerrit-Owner: Csaba Ringhofer <[email protected]>
