valkum opened a new issue, #17421:
URL: https://github.com/apache/datafusion/issues/17421
### Describe the bug
I am trying out DataFusion for some refactoring and am testing with a table
that has the following schema:
```
+-------------+-------------------------------------------+-------------+
| column_name | data_type                                 | is_nullable |
+-------------+-------------------------------------------+-------------+
| name        | Utf8View                                  | NO          |
| group       | Dictionary(UInt16, Utf8)                  | NO          |
| market      | Struct([Field { name: "market",           | YES         |
|             |   data_type: Dictionary(UInt16, Utf8),    |             |
|             |   nullable: true, dict_id: 1,             |             |
|             |   dict_is_ordered: false, metadata: {} }, |             |
|             |   ...])                                   |             |
| ...         | ...                                       | ...         |
+-------------+-------------------------------------------+-------------+
```
I am trying to use `group` as a partition key and store the table as a
Hive-partitioned Parquet dataset.
When reading the partitioned dataset, either with datafusion-cli or from Rust with
```
ctx.register_parquet(
    "test",
    src,
    ParquetReadOptions::default()
        .file_sort_order(vec![vec![
            SortExpr::new(col("name"), true, false),
            SortExpr::new(col("group"), true, false),
        ]])
        .table_partition_cols(vec![(
            "group".to_string(),
            DataType::Dictionary(Box::new(DataType::UInt16), Box::new(DataType::Utf8)),
        )]),
)
.await
```
I am getting the following error:
`Arrow error: Schema error: Fail to merge schema field 'market' because from dict_id = 1 does not match 0`.
I assume this happens because the dictionaries are created in a different
order, so the `dict_id` values differ. Note that this does not occur with
`datafusion.execution.keep_partition_by_columns = True`, but then you run into
https://github.com/apache/datafusion/issues/17420
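
The failure mode suggested by the error message can be sketched with a small model. This is only a simplified illustration, assuming the schema merge requires both copies of a field to carry the same `dict_id`; `FieldModel` and `try_merge` are hypothetical names and not the arrow-rs API.

```rust
// Simplified, hypothetical model of a schema field.
// This is NOT the arrow-rs `Field` type; it only tracks what the error mentions.
#[derive(Debug, Clone, PartialEq)]
struct FieldModel {
    name: String,
    dict_id: i64,
}

// Merge `from` into `into`, rejecting mismatched dictionary ids.
// Assumption: the real merge performs a strict equality check on `dict_id`,
// as the wording of the reported error suggests.
fn try_merge(into: &FieldModel, from: &FieldModel) -> Result<FieldModel, String> {
    if into.name != from.name {
        return Err(format!(
            "cannot merge field '{}' with field '{}'",
            into.name, from.name
        ));
    }
    if into.dict_id != from.dict_id {
        return Err(format!(
            "Fail to merge schema field '{}' because from dict_id = {} does not match {}",
            into.name, from.dict_id, into.dict_id
        ));
    }
    Ok(into.clone())
}
```

Under this model, merging two `market` fields with `dict_id` 0 and 1 yields the same message as the reported error, while two fields with equal ids merge cleanly.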
### To Reproduce
Create a table `test` with the schema above, then run:
```
COPY test TO 'test_out' STORED AS PARQUET PARTITIONED BY group;
CREATE EXTERNAL TABLE test2
STORED AS PARQUET
PARTITIONED BY (group)
LOCATION 'test_out';
```
### Expected behavior
The Hive-partitioned table should load without errors.
### Additional context
_No response_