gabotechs opened a new issue, #20041:
URL: https://github.com/apache/datafusion/issues/20041
### Describe the bug
When reading parquet files with dictionary-encoded columns, if a file has
constant column values (detected from statistics where min == max), the scan
fails with a schema mismatch error:
ArrowError(InvalidArgumentError("column types must match schema types,
expected Dictionary(UInt16, Utf8) but found Utf8 at column index 1"))
The root cause is in `constant_value_from_stats()` in `opener.rs`. When
statistics indicate a column has a constant value, that value is used as a
literal replacement in the projection. However, the statistics store values
using the "unpacked" type (e.g., `Utf8`) rather than the dictionary type (e.g.,
`Dictionary(UInt16, Utf8)`), causing a type mismatch when constructing the output
batch.
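
To make the mismatch concrete, here is a minimal, self-contained sketch of the kind of check that rejects the batch. It is only a toy model: `ToyType`, `check_column`, and `dict_utf8` are hypothetical names invented for illustration, not the arrow-rs or DataFusion API, and the real validation may be implemented differently.

```rust
// Toy stand-in for Arrow data types (an assumption for illustration,
// NOT the real arrow-rs `DataType`).
#[derive(Debug, Clone, PartialEq)]
enum ToyType {
    Utf8,
    UInt16,
    Dictionary(Box<ToyType>, Box<ToyType>),
}

/// Hypothetical per-column check: the array's type must equal the
/// type declared in the output schema, otherwise the batch is rejected.
fn check_column(index: usize, expected: &ToyType, found: &ToyType) -> Result<(), String> {
    if expected == found {
        Ok(())
    } else {
        Err(format!(
            "column types must match schema types, expected {:?} but found {:?} at column index {}",
            expected, found, index
        ))
    }
}

/// The dictionary type the schema declares for the column.
fn dict_utf8() -> ToyType {
    ToyType::Dictionary(Box::new(ToyType::UInt16), Box::new(ToyType::Utf8))
}
```

Calling `check_column(1, &dict_utf8(), &ToyType::Utf8)` returns an `Err`, because the literal taken from statistics carries the unpacked type while the schema expects the dictionary type.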
### To Reproduce
Steps to reproduce are in @gene-bordegaray's PR:
https://github.com/datafusion-contrib/datafusion-distributed/pull/324
### Expected behavior
The query should succeed, with the constant value correctly cast to the
expected dictionary type before being used as a literal replacement in the
projection.
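
A rough sketch of the expected behavior, again using toy types rather than the real DataFusion ones: before the constant is substituted into the projection, it is aligned with the schema's type. `ToyType`, `ToyLiteral`, and `cast_constant_to_schema_type` are hypothetical names, and the actual fix in `opener.rs` may look quite different.

```rust
// Toy stand-ins (assumptions for illustration, not the arrow-rs/DataFusion API).
#[derive(Debug, Clone, PartialEq)]
enum ToyType {
    Utf8,
    UInt16,
    Dictionary(Box<ToyType>, Box<ToyType>),
}

/// A constant taken from min/max statistics, tagged with its type.
#[derive(Debug, Clone, PartialEq)]
struct ToyLiteral {
    value: String,
    data_type: ToyType,
}

/// Hypothetical helper: make the literal's type agree with the schema's
/// type before it replaces the column in the projection.
fn cast_constant_to_schema_type(
    lit: ToyLiteral,
    expected: &ToyType,
) -> Result<ToyLiteral, String> {
    // Types already agree: nothing to do.
    if &lit.data_type == expected {
        return Ok(lit);
    }
    match expected {
        // The schema wants a dictionary whose value type is the literal's
        // unpacked type: re-tag the literal with the dictionary type.
        ToyType::Dictionary(_, value_type) if **value_type == lit.data_type => Ok(ToyLiteral {
            value: lit.value,
            data_type: expected.clone(),
        }),
        // Anything else is out of scope for this sketch.
        _ => Err(format!("cannot cast {:?} to {:?}", lit.data_type, expected)),
    }
}
```

With this in place, a `Utf8` constant passed alongside an expected `Dictionary(UInt16, Utf8)` comes back tagged with the dictionary type, so the output batch would match the schema.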
### Additional context
_No response_