[I] Fail to merge schema field for partitioned table with dict. [datafusion]

via GitHub Thu, 04 Sep 2025 10:38:02 -0700


valkum opened a new issue, #17421:
URL: https://github.com/apache/datafusion/issues/17421


   ### Describe the bug
   
   I am trying out DataFusion for some refactoring and am testing with a table that has the following schema:
   ```
   
   +-----------------+------------------------------------------------------------------------------------------------------------------------------------------------+-------------+
   | column_name     | data_type                                                                                                                                      | is_nullable |
   +-----------------+------------------------------------------------------------------------------------------------------------------------------------------------+-------------+
   | name            | Utf8View                                                                                                                                       | NO          |
   | group           | Dictionary(UInt16, Utf8)                                                                                                                       | NO          |
   | market          | Struct([Field { name: "market", data_type: Dictionary(UInt16, Utf8), nullable: true, dict_id: 1, dict_is_ordered: false, metadata: {} }, ...]) | YES         |
   | ...             | ...                                                                                                                                            | ...         |
   +-----------------+------------------------------------------------------------------------------------------------------------------------------------------------+-------------+
   ```
   
   I am trying to use `group` as a partition key and store the table as Hive-partitioned Parquet.
   When reading the partitioned output, either with datafusion-cli or from Rust with
   ```
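   // Imports assumed for this snippet (paths may differ between DataFusion versions):
   use datafusion::arrow::datatypes::DataType;
   use datafusion::logical_expr::SortExpr;
   use datafusion::prelude::*;

   // `ctx` is a `SessionContext`; `src` is assumed to point at the partitioned Parquet directory.
   ctx.register_parquet(
       "test",
       src,
       ParquetReadOptions::default()
           // Files are sorted by `name`, then `group` (ascending, nulls last).
           .file_sort_order(vec![vec![
               SortExpr::new(col("name"), true, false),
               SortExpr::new(col("group"), true, false),
           ]])
           // Declare the partition column with the same dictionary type as in the table schema.
           .table_partition_cols(vec![(
               "group".to_string(),
               DataType::Dictionary(Box::new(DataType::UInt16), Box::new(DataType::Utf8)),
           )]),
   )
   .await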
   ```
   I am getting the following error:
   `Arrow error: Schema error: Fail to merge schema field 'market' because from dict_id = 1 does not match 0`.
   
   I assume this happens because the dictionaries are created in a different order, so the nested `market` field gets a different `dict_id` (1 vs. 0) in the schemas being merged.
   Note that the error does not occur with `datafusion.execution.keep_partition_by_columns = True`, but that setting runs into https://github.com/apache/datafusion/issues/17420 instead.
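
The error text suggests that the schema merge compares `dict_id` values on nested fields and rejects any mismatch. The std-only sketch below mimics that check on a toy type so the failure mode is easier to see. `ToyField` and its methods are hypothetical stand-ins written for illustration; they are not the arrow-rs `Field` API, and the real merge logic may differ.

```rust
/// Toy stand-in for a schema field (NOT the arrow-rs `Field` type).
#[derive(Debug, Clone, PartialEq)]
struct ToyField {
    name: String,
    /// Dictionary id assigned when the schema was built.
    dict_id: i64,
    /// Nested fields, e.g. the members of a struct column.
    children: Vec<ToyField>,
}

impl ToyField {
    fn new(name: &str, dict_id: i64, children: Vec<ToyField>) -> Self {
        ToyField { name: name.to_string(), dict_id, children }
    }

    /// Merge `from` into `self`, failing when dictionary ids disagree.
    /// The message format is copied from the error reported in this issue.
    fn try_merge(&mut self, from: &ToyField) -> Result<(), String> {
        if self.dict_id != from.dict_id {
            return Err(format!(
                "Fail to merge schema field '{}' because from dict_id = {} does not match {}",
                self.name, from.dict_id, self.dict_id
            ));
        }
        for from_child in &from.children {
            // Merge children that share a name; append children that are new.
            if let Some(pos) = self.children.iter().position(|c| c.name == from_child.name) {
                self.children[pos].try_merge(from_child)?;
            } else {
                self.children.push(from_child.clone());
            }
        }
        Ok(())
    }
}
```

In this model, merging two `market` struct fields whose nested `market` child has `dict_id` 0 on one side and 1 on the other fails with the same message as above, even though the data types are otherwise identical. Two schemas that assigned the same id merge cleanly.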
   
   ### To Reproduce
   
   Create a table named `test` with the schema shown above, then run:
   ```
   COPY test TO 'test_out' STORED AS PARQUET PARTITIONED BY (group);
   
   CREATE EXTERNAL TABLE test2
   STORED AS PARQUET
   PARTITIONED BY (group)
   LOCATION 'test_out';
   ```
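
For context, Hive-style partitioning encodes each partition value in a `column=value` directory name, so a reader takes the `group` value from the path rather than from the Parquet file contents. Below is a minimal std-only sketch of that naming convention. The helper `partition_values` and the file name `part-0.parquet` are hypothetical and only illustrate the layout; this is not how DataFusion implements partition discovery.

```rust
/// Extract `key=value` partition pairs from a Hive-style path.
/// Illustrative helper only: any path segment containing `=` is treated
/// as a partition directory, and no escaping or validation is handled.
fn partition_values(path: &str) -> Vec<(String, String)> {
    path.split('/')
        .filter_map(|segment| segment.split_once('='))
        .map(|(key, value)| (key.to_string(), value.to_string()))
        .collect()
}
```

For a path such as `test_out/group=a/part-0.parquet`, the helper yields the single pair `("group", "a")`. Because the value arrives as a path string, its type has to be declared separately, which is what `table_partition_cols` does in the Rust snippet earlier in this report.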
   
   ### Expected behavior
   
   The Hive-partitioned table should load without a schema merge error.
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

