Gabor Kaszab created IMPALA-12861:
-------------------------------------

             Summary: File formats are confused when Iceberg tables has mixed formats
                 Key: IMPALA-12861
                 URL: https://issues.apache.org/jira/browse/IMPALA-12861
             Project: IMPALA
          Issue Type: Bug
          Components: Frontend
    Affects Versions: Impala 4.3.0
            Reporter: Gabor Kaszab

*Repro steps:*

create table mixed_ice (i int, year int) partitioned by spec (year) stored as iceberg tblproperties('format-version'='2');

1) populate one partition with Impala (Parquet):
insert into mixed_ice values (1, 2024), (2, 2024);

2) change the write format:
alter table mixed_ice set tblproperties ('write.format.default'='orc');

3) populate another partition with Hive (ORC):
insert into mixed_ice values (1, 2025), (2, 2025), (3, 2025);

4) then query just the Parquet partition:
explain select * from mixed_ice where year = 2024;

{code:java}
| F01:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1 |
| Per-Host Resources: mem-estimate=4.02MB mem-reservation=4.00MB thread-reservation=1 |
| PLAN-ROOT SINK |
| |  output exprs: default.mixed_ice.i, default.mixed_ice.year |
| |  mem-estimate=4.00MB mem-reservation=4.00MB spill-buffer=2.00MB thread-reservation=0 |
| | |
| 01:EXCHANGE [UNPARTITIONED] |
|    mem-estimate=16.00KB mem-reservation=0B thread-reservation=0 |
|    tuple-ids=0 row-size=8B cardinality=2 |
|    in pipelines: 00(GETNEXT) |
| |
| F00:PLAN FRAGMENT [RANDOM] hosts=1 instances=1 |
| Per-Host Resources: mem-estimate=64.05MB mem-reservation=32.00KB thread-reservation=2 |
| DATASTREAM SINK [FRAGMENT=F01, EXCHANGE=01, UNPARTITIONED] |
| |  mem-estimate=48.00KB mem-reservation=0B thread-reservation=0 |
| 00:SCAN HDFS [default.mixed_ice, RANDOM] |
|    HDFS partitions=1/1 files=1 size=602B |
|    Iceberg snapshot id: 4964066258730898133 |
|    skipped Iceberg predicates: `year` = CAST(2024 AS INT) |
|    stored statistics: |
|      table: rows=5 size=945B |
|      columns: unavailable |
|    extrapolated-rows=disabled max-scan-range-rows=5 |
|    file formats: [ORC, PARQUET] |
|    mem-estimate=64.00MB mem-reservation=32.00KB thread-reservation=1 |
|    tuple-ids=0 row-size=8B cardinality=2 |
|    in pipelines: 00(GETNEXT) |
+------------------------------------------------------------------------------------------+
{code}

Note the file formats: [ORC, PARQUET] part, even though this query only reads a
Parquet file.

*Some analysis:*

When IcebergScanNode [is created|https://github.com/apache/impala/blob/cc63757c10cdf70e511596c0ded7d20674af2c4b/fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java#L129], it holds the correct file format information (Parquet only).

Later on, the parent class, HdfsScanNode, also populates the file formats [here|https://github.com/apache/impala/blob/cc63757c10cdf70e511596c0ded7d20674af2c4b/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java#L513]. It uses what [getSampledOrRawPartition()|https://github.com/apache/impala/blob/cc63757c10cdf70e511596c0ded7d20674af2c4b/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java#L431] returns. In this use case 'sampledPartitions_' is null, so it returns 'partitions_'.

Apparently, this 'partitions_' member holds the partition with the ORC file, so ORC gets added to fileFormats_ on top of the Parquet entry that was already there.
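
The analysis above can be illustrated with a small, self-contained Java sketch. It is a toy model, not Impala code: the class MixedFormatSketch, the types DataFile and Partition, the methods formatsFromFiles() and addPartitionFormats(), and the file path are all hypothetical names made up for illustration. It only assumes what the analysis states: a first pass collects formats from the files the scan will actually read, and a second pass adds the format of every partition in a partition list that carries ORC.

```java
import java.util.Arrays;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

public class MixedFormatSketch {
    enum FileFormat { ORC, PARQUET }

    /** Hypothetical stand-in for a data file selected for the scan. */
    static final class DataFile {
        final String path;
        final FileFormat format;
        DataFile(String path, FileFormat format) {
            this.path = path;
            this.format = format;
        }
    }

    /** Hypothetical stand-in for a partition descriptor carrying one format. */
    static final class Partition {
        final FileFormat format;
        Partition(FileFormat format) { this.format = format; }
    }

    /** First pass: formats derived from the files that will actually be read. */
    static Set<FileFormat> formatsFromFiles(List<DataFile> files) {
        Set<FileFormat> formats = EnumSet.noneOf(FileFormat.class);
        for (DataFile f : files) {
            formats.add(f.format);
        }
        return formats;
    }

    /** Second pass: partition-level formats are added to the set collected so far. */
    static Set<FileFormat> addPartitionFormats(Set<FileFormat> formats,
                                               List<Partition> partitions) {
        Set<FileFormat> merged = EnumSet.noneOf(FileFormat.class);
        merged.addAll(formats);
        for (Partition p : partitions) {
            merged.add(p.format);
        }
        return merged;
    }

    public static void main(String[] args) {
        // Only the Parquet file matches year = 2024 (the path is made up).
        List<DataFile> selectedFiles =
            Arrays.asList(new DataFile("year=2024/data.parquet", FileFormat.PARQUET));
        // Assumption taken from the analysis: the partition list carries ORC.
        List<Partition> partitions = Arrays.asList(new Partition(FileFormat.ORC));

        Set<FileFormat> fromFiles = formatsFromFiles(selectedFiles);
        System.out.println(fromFiles);   // prints [PARQUET]
        System.out.println(addPartitionFormats(fromFiles, partitions));   // prints [ORC, PARQUET]
    }
}
```

In this model the misleading entry comes only from the second pass. One possible direction, offered as a suggestion rather than a confirmed fix, would be for the Iceberg scan node to keep the per-file result and skip the partition-based pass.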