Hello Tamas Mate, Gergely Fürnstáhl, Csaba Ringhofer, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/18240

to look at the new patch set (#2).

Change subject: IMPALA-11053: Impala should be able to read migrated 
partitioned Iceberg tables
......................................................................

IMPALA-11053: Impala should be able to read migrated partitioned Iceberg tables

When Hive (and probably other engines as well) converts a legacy Hive
table to Iceberg, it doesn't rewrite the data files. This means that the
data files contain neither write ids nor partition column data. Currently
Impala expects the partition columns to be present in the data files,
so it is not able to read converted partitioned tables.

With this patch Impala loads partition values from the Iceberg metadata.
The extra metadata information is attached to the file descriptor
objects and propagated to the scanners. This metadata contains the
Iceberg data file format (later it could be used to handle mixed-format
tables), and partition data.

We use the partition data in the HdfsScanner to create the template
tuple that contains the partition values of identity-partitioned
columns. This applies not only to migrated tables but to all Iceberg
tables with identity partitions, which means we also save some IO
and CPU time for such columns. The partition information could also
be used for Dynamic Partition Pruning later.
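
As a rough illustration of the template tuple idea (not the actual
HdfsScanner code, which is C++), the sketch below pre-fills the
identity-partitioned columns from metadata and reads only the remaining
columns from the data file. The names build_template_row and
materialize_rows, and the list-based row layout, are hypothetical.

```python
from typing import Any, Dict, Iterable, List, Sequence


def build_template_row(schema: Sequence[str],
                       partition_values: Dict[str, Any]) -> List[Any]:
    """Pre-fill partition columns; columns stored in the file stay None."""
    return [partition_values.get(col) for col in schema]


def materialize_rows(schema: Sequence[str],
                     partition_values: Dict[str, Any],
                     file_rows: Iterable[Sequence[Any]]) -> List[List[Any]]:
    """Copy the template for every row and fill in only the file-backed slots.

    Assumption: each file row holds values for the non-partition columns,
    in schema order.
    """
    template = build_template_row(schema, partition_values)
    file_slots = [i for i, col in enumerate(schema)
                  if col not in partition_values]
    rows = []
    for file_row in file_rows:
        row = list(template)  # partition values come from metadata, not the file
        for slot, value in zip(file_slots, file_row):
            row[slot] = value
        rows.append(row)
    return rows


schema = ["id", "p_int", "name", "p_string"]
partition_values = {"p_int": 1, "p_string": "impala"}
rows = materialize_rows(schema, partition_values, [(10, "a"), (20, "b")])
```

Because the partition values are set once per file, the scanner in this
sketch never needs to decode those columns from the data file.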

We use the (human-readable) string representation of the partition data
when storing them in the flat buffers. This helps debugging, and it
also provides the flexibility needed when the partition columns
evolve (e.g. INT -> BIGINT, DECIMAL(4,2) -> DECIMAL(6,2)).
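
The following sketch shows why a string representation tolerates type
evolution: the stored text is parsed against the column's current type,
so "1" works for both INT and BIGINT. It is an illustration only and
not Impala's parser; parse_partition_value and the type-name strings
are assumptions, and NULL handling is left out.

```python
import re
from datetime import date
from decimal import Decimal

_DECIMAL_RE = re.compile(r"DECIMAL\((\d+),\s*(\d+)\)")


def parse_partition_value(text: str, column_type: str):
    """Parse a stored partition string using the column's *current* type."""
    column_type = column_type.upper()
    if column_type == "BOOLEAN":
        return text.lower() == "true"
    if column_type in ("INT", "BIGINT"):
        return int(text)
    if column_type in ("FLOAT", "DOUBLE"):
        return float(text)
    if column_type == "DATE":
        return date.fromisoformat(text)
    if column_type == "STRING":
        return text
    match = _DECIMAL_RE.fullmatch(column_type)
    if match:
        scale = int(match.group(2))
        # Rescale to the current column's scale, e.g. 2 -> Decimal("0.01").
        return Decimal(text).quantize(Decimal(1).scaleb(-scale))
    raise ValueError(f"unsupported partition column type: {column_type}")


# The same stored string remains valid after the column type is widened.
as_int = parse_partition_value("1", "INT")
as_bigint = parse_partition_value("1", "BIGINT")
as_decimal = parse_partition_value("12.34", "DECIMAL(6,2)")
```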

Testing
 * e2e test for all data types that can be used to partition a table
 * e2e test for migrated partitioned table + schema evolution (without
   renaming columns)
 * e2e test for a table where all columns are used as identity partitions

Change-Id: Iac11a02de709d43532056f71359c49d20c1be2b8
---
M be/src/exec/CMakeLists.txt
A be/src/exec/file-metadata-utils.cc
A be/src/exec/file-metadata-utils.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-scanner.h
M be/src/exec/orc-column-readers.cc
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/parquet-metadata-utils.h
M be/src/runtime/dml-exec-state.cc
M be/src/scheduling/scheduler.cc
M common/fbs/CatalogObjects.fbs
M common/fbs/IcebergObjects.fbs
M common/protobuf/planner.proto
M common/thrift/CatalogObjects.thrift
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/catalog/FeIcebergTable.java
M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/util/IcebergUtil.java
A testdata/data/iceberg_test/hadoop_catalog/ice/iceberg_alltypes_part/metadata/283c54cb-5a45-4a2c-bca8-4bfa0e61cdbd-m0.avro
A testdata/data/iceberg_test/hadoop_catalog/ice/iceberg_alltypes_part/metadata/snap-6167994413873848621-1-283c54cb-5a45-4a2c-bca8-4bfa0e61cdbd.avro
A testdata/data/iceberg_test/hadoop_catalog/ice/iceberg_alltypes_part/metadata/v1.metadata.json
A testdata/data/iceberg_test/hadoop_catalog/ice/iceberg_alltypes_part/metadata/v2.metadata.json
A testdata/data/iceberg_test/hadoop_catalog/ice/iceberg_alltypes_part/p_bool=true/p_int=1/p_bigint=11/p_float=1.1/p_double=2.222/p_decimal=123.321/p_date=2022-02-22/p_string=impala/000000_0
A testdata/data/iceberg_test/hadoop_catalog/ice/iceberg_alltypes_part_orc/metadata/db72fbf2-f9f6-4985-8a5f-fd9f632f2c77-m0.avro
A testdata/data/iceberg_test/hadoop_catalog/ice/iceberg_alltypes_part_orc/metadata/snap-7569365419257304230-1-db72fbf2-f9f6-4985-8a5f-fd9f632f2c77.avro
A testdata/data/iceberg_test/hadoop_catalog/ice/iceberg_alltypes_part_orc/metadata/v1.metadata.json
A testdata/data/iceberg_test/hadoop_catalog/ice/iceberg_alltypes_part_orc/metadata/v2.metadata.json
A testdata/data/iceberg_test/hadoop_catalog/ice/iceberg_alltypes_part_orc/metadata/version-hint.text
A testdata/data/iceberg_test/hadoop_catalog/ice/iceberg_alltypes_part_orc/p_bool=true/p_int=1/p_bigint=11/p_float=1.1/p_double=2.222/p_decimal=123.321/p_date=2022-02-22/p_string=impala/000000_0
A testdata/data/iceberg_test/hadoop_catalog/ice/iceberg_legacy_partition_schema_evolution/metadata/2d05a7d4-c229-44c3-860e-e77e46e71a19-m0.avro
A testdata/data/iceberg_test/hadoop_catalog/ice/iceberg_legacy_partition_schema_evolution/metadata/snap-6654673546382518186-1-2d05a7d4-c229-44c3-860e-e77e46e71a19.avro
A testdata/data/iceberg_test/hadoop_catalog/ice/iceberg_legacy_partition_schema_evolution/metadata/v1.metadata.json
A testdata/data/iceberg_test/hadoop_catalog/ice/iceberg_legacy_partition_schema_evolution/metadata/v2.metadata.json
A testdata/data/iceberg_test/hadoop_catalog/ice/iceberg_legacy_partition_schema_evolution/metadata/version-hint.text
A testdata/data/iceberg_test/hadoop_catalog/ice/iceberg_legacy_partition_schema_evolution/p_int_long=1/p_float_double=1.1/p_dec_dec=2.718/000000_0
A testdata/data/iceberg_test/hadoop_catalog/ice/iceberg_legacy_partition_schema_evolution/p_int_long=1/p_float_double=1.1/p_dec_dec=3.141/000000_0
A testdata/data/iceberg_test/hadoop_catalog/ice/iceberg_legacy_partition_schema_evolution_orc/metadata/8db62f0e-38e5-434b-94dc-c84210302ad8-m0.avro
A testdata/data/iceberg_test/hadoop_catalog/ice/iceberg_legacy_partition_schema_evolution_orc/metadata/snap-888589552112488046-1-8db62f0e-38e5-434b-94dc-c84210302ad8.avro
A testdata/data/iceberg_test/hadoop_catalog/ice/iceberg_legacy_partition_schema_evolution_orc/metadata/v1.metadata.json
A testdata/data/iceberg_test/hadoop_catalog/ice/iceberg_legacy_partition_schema_evolution_orc/metadata/v2.metadata.json
A testdata/data/iceberg_test/hadoop_catalog/ice/iceberg_legacy_partition_schema_evolution_orc/metadata/version-hint.text
A testdata/data/iceberg_test/hadoop_catalog/ice/iceberg_legacy_partition_schema_evolution_orc/p_int_long=1/p_float_double=1.1/p_dec_dec=2.718/000000_0
A testdata/data/iceberg_test/hadoop_catalog/ice/iceberg_legacy_partition_schema_evolution_orc/p_int_long=1/p_float_double=1.1/p_dec_dec=3.141/000000_0
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/workloads/functional-query/queries/QueryTest/iceberg-migrated-tables.test
M tests/query_test/test_iceberg.py
52 files changed, 1,751 insertions(+), 26 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/40/18240/2
--
To view, visit http://gerrit.cloudera.org:8080/18240
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Iac11a02de709d43532056f71359c49d20c1be2b8
Gerrit-Change-Number: 18240
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Gergely Fürnstáhl <gfurnst...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
