Gabor Kaszab created IMPALA-12861:
-------------------------------------

             Summary: File formats are confused when an Iceberg table has mixed formats
                 Key: IMPALA-12861
                 URL: https://issues.apache.org/jira/browse/IMPALA-12861
             Project: IMPALA
          Issue Type: Bug
          Components: Frontend
    Affects Versions: Impala 4.3.0
            Reporter: Gabor Kaszab


*Repro steps:*
create table mixed_ice (i int, year int) partitioned by spec (year) stored as 
iceberg tblproperties('format-version'='2');
 
1) populate one partition with Impala (parquet)
insert into mixed_ice values (1, 2024), (2, 2024);
 
2) change the write format:
alter table mixed_ice set tblproperties ('write.format.default'='orc');
 
3) populate another partition with Hive (orc)
insert into mixed_ice values (1, 2025), (2, 2025), (3, 2025);
 
4) then query just the parquet partition:
explain select * from mixed_ice where year = 2024;
{code:java}
| F01:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1                                    |
| Per-Host Resources: mem-estimate=4.02MB mem-reservation=4.00MB thread-reservation=1      |
|   PLAN-ROOT SINK                                                                         |
|   |  output exprs: default.mixed_ice.i, default.mixed_ice.year                           |
|   |  mem-estimate=4.00MB mem-reservation=4.00MB spill-buffer=2.00MB thread-reservation=0 |
|   |                                                                                      |
|   01:EXCHANGE [UNPARTITIONED]                                                            |
|      mem-estimate=16.00KB mem-reservation=0B thread-reservation=0                        |
|      tuple-ids=0 row-size=8B cardinality=2                                               |
|      in pipelines: 00(GETNEXT)                                                           |
|                                                                                          |
| F00:PLAN FRAGMENT [RANDOM] hosts=1 instances=1                                           |
| Per-Host Resources: mem-estimate=64.05MB mem-reservation=32.00KB thread-reservation=2    |
|   DATASTREAM SINK [FRAGMENT=F01, EXCHANGE=01, UNPARTITIONED]                             |
|   |  mem-estimate=48.00KB mem-reservation=0B thread-reservation=0                        |
|   00:SCAN HDFS [default.mixed_ice, RANDOM]                                               |
|      HDFS partitions=1/1 files=1 size=602B                                               |
|      Iceberg snapshot id: 4964066258730898133                                            |
|      skipped Iceberg predicates: `year` = CAST(2024 AS INT)                              |
|      stored statistics:                                                                  |
|        table: rows=5 size=945B                                                           |
|        columns: unavailable                                                              |
|      extrapolated-rows=disabled max-scan-range-rows=5                                    |
|      file formats: [ORC, PARQUET]                                                        |
|      mem-estimate=64.00MB mem-reservation=32.00KB thread-reservation=1                   |
|      tuple-ids=0 row-size=8B cardinality=2                                               |
|      in pipelines: 00(GETNEXT)                                                           |
+------------------------------------------------------------------------------------------+
 {code}
Note the "file formats: [ORC, PARQUET]" line, even though this query reads only 
a single Parquet file.
 
*Some analysis:*
When IcebergScanNode [is 
created|https://github.com/apache/impala/blob/cc63757c10cdf70e511596c0ded7d20674af2c4b/fe/src/main/java/org/apache/impala/planner/IcebergScanNode.java#L129]
 it holds the correct information about file formats (Parquet).

Later on, the parent class, HdfsScanNode, also tries to populate the file formats 
[here|https://github.com/apache/impala/blob/cc63757c10cdf70e511596c0ded7d20674af2c4b/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java#L513].
 
It uses what 
[getSampledOrRawPartition()|https://github.com/apache/impala/blob/cc63757c10cdf70e511596c0ded7d20674af2c4b/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java#L431]
 returns. In this case 'sampledPartitions_' is null, so it returns 
'partitions_'.
 
Apparently, this 'partitions_' member holds the partition with the ORC file, so 
it adds ORC to 'fileFormats_'.
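
To illustrate the suspected mismatch, here is a toy model in Java. It is only a sketch: FileFormatSketch, DataFile, Partition and both helper methods are made-up names, not Impala classes, and it assumes the partition list seen by the parent scan node still references the ORC partition, as described above. It contrasts collecting formats from all referenced partitions with collecting them from the files that are actually scanned.

```java
import java.util.Arrays;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

// Toy model only: none of these types are Impala classes.
class FileFormatSketch {
    enum Format { ORC, PARQUET }

    static final class DataFile {
        final String path;
        final Format format;
        DataFile(String path, Format format) {
            this.path = path;
            this.format = format;
        }
    }

    static final class Partition {
        final List<DataFile> files;
        Partition(DataFile... files) {
            this.files = Arrays.asList(files);
        }
    }

    // Suspected behaviour: union the formats of every partition the scan node
    // still references, whether or not its files survived predicate pruning.
    static Set<Format> formatsFromPartitions(List<Partition> partitions) {
        Set<Format> formats = EnumSet.noneOf(Format.class);
        for (Partition p : partitions) {
            for (DataFile f : p.files) {
                formats.add(f.format);
            }
        }
        return formats;
    }

    // Alternative: derive the formats only from the files that will be scanned.
    static Set<Format> formatsFromScannedFiles(List<DataFile> scanned) {
        Set<Format> formats = EnumSet.noneOf(Format.class);
        for (DataFile f : scanned) {
            formats.add(f.format);
        }
        return formats;
    }

    public static void main(String[] args) {
        // Hypothetical file paths standing in for the two partitions of mixed_ice.
        DataFile parquet2024 = new DataFile("year=2024/data.parquet", Format.PARQUET);
        DataFile orc2025 = new DataFile("year=2025/data.orc", Format.ORC);

        List<Partition> partitions =
            Arrays.asList(new Partition(parquet2024), new Partition(orc2025));
        // Files left after applying the predicate year = 2024.
        List<DataFile> scanned = Arrays.asList(parquet2024);

        System.out.println(formatsFromPartitions(partitions)); // [ORC, PARQUET]
        System.out.println(formatsFromScannedFiles(scanned));  // [PARQUET]
    }
}
```

In this model the first call reproduces the "[ORC, PARQUET]" seen in the plan, while the second yields only the format of the file that the query reads. Whether the real fix should take this shape is left open here.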


