kosiew opened a new pull request, #17281:
URL: https://github.com/apache/datafusion/pull/17281
## Which issue does this PR close?
* Closes #16579
## Rationale for this change
Evolving data sources often have structural mismatches with the expected
table schema, especially when nested `Struct` types are involved. This PR
adds schema adaptation and column casting to Apache DataFusion that handle
nested structs, so that data written under such evolving schemas can still
be read compatibly and correctly.
## What changes are included in this PR?
* Introduces `cast_column` for recursively casting nested `StructArray`
fields to match the target schema.
* Adds compatibility checks to prevent casting nullable fields to
non-nullable targets.
* Updates the `SchemaAdapter` and `SchemaMapping` logic to leverage
`cast_column`.
* Adds thorough unit tests covering:
  * Casting structs with reordering, extra, and missing fields
  * Preserving parent nullability
  * Structs containing arrays and maps
  * Schema mapping and record batch transformation
* Fixes column casting behavior in `pruning_predicate` by using
`cast_column` instead of the generic Arrow cast.
* Updates documentation:
  * Adds new guide: `docs/source/library-user-guide/schema_adapter.md`
  * References this guide in the main user and API docs
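
To illustrate the idea behind recursive struct casting (match fields by name, reorder them, drop extras, fill missing ones with nulls), here is a minimal, dependency-free sketch. It is not the `cast_column` implementation from this PR and does not use the Arrow API: `Type`, `Field`, `Value`, and `cast_value` are hypothetical names that model a single value rather than a whole `StructArray`.

```rust
/// Toy stand-in for a data type (hypothetical, not Arrow's `DataType`).
#[derive(Debug, Clone, PartialEq)]
enum Type {
    Int32,
    Int64,
    Struct(Vec<Field>),
}

/// Toy stand-in for a named field inside a struct type.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    ty: Type,
}

/// Toy stand-in for one value; a real implementation works on whole arrays.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Null,
    Int32(i32),
    Int64(i64),
    /// Struct children as (field name, value) pairs in source order.
    Struct(Vec<(String, Value)>),
}

/// Cast `value` to `target`, recursing into nested structs.
fn cast_value(value: &Value, target: &Type) -> Result<Value, String> {
    match (value, target) {
        // Nulls stay null regardless of the target type in this sketch.
        (Value::Null, _) => Ok(Value::Null),
        (Value::Int32(v), Type::Int32) => Ok(Value::Int32(*v)),
        // Widening an integer is the only primitive cast modeled here.
        (Value::Int32(v), Type::Int64) => Ok(Value::Int64(i64::from(*v))),
        (Value::Int64(v), Type::Int64) => Ok(Value::Int64(*v)),
        (Value::Struct(children), Type::Struct(fields)) => {
            // Walk the target fields so the output follows the target order.
            let mut out = Vec::with_capacity(fields.len());
            for field in fields {
                let source = children.iter().find(|(name, _)| name == &field.name);
                let casted = match source {
                    // Field present in the source: cast it recursively.
                    Some((_, child)) => cast_value(child, &field.ty)?,
                    // Field missing from the source: fill with null.
                    None => Value::Null,
                };
                out.push((field.name.clone(), casted));
            }
            // Source fields absent from the target are dropped implicitly.
            Ok(Value::Struct(out))
        }
        (other, ty) => Err(format!("cannot cast {:?} to {:?}", other, ty)),
    }
}
```

Driving the loop from the target fields rather than the source fields is what makes reordering, dropping extras, and null-filling fall out of a single pass.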
## Are these changes tested?
Yes, extensive tests are included that:
* Validate the `cast_column` logic across multiple complex nested schemas.
* Verify compatibility validation logic.
* Test end-to-end behavior of `SchemaAdapter::map_batch()` with various
structural transformations.
* Ensure correct pruning predicate behavior when stats use structs with
different field types.
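
The compatibility validation exercised by these tests can be pictured with a similar toy model. The sketch below only encodes the one rule stated in this description (a nullable source field must not be cast to a non-nullable target) and applies it recursively to struct children. `Type`, `Field`, `validate_field`, and `validate_type` are hypothetical names, and the treatment of primitives and of fields missing from the source is an assumption of the sketch, not the PR's actual logic.

```rust
/// Toy stand-in for a data type (hypothetical, not Arrow's `DataType`).
#[derive(Debug, Clone)]
enum Type {
    Int32,
    Int64,
    Struct(Vec<Field>),
}

/// Toy stand-in for a field with a nullability flag.
#[derive(Debug, Clone)]
struct Field {
    name: String,
    ty: Type,
    nullable: bool,
}

/// Check that `source` may be cast to `target`.
fn validate_field(source: &Field, target: &Field) -> Result<(), String> {
    // The rule from the PR description: nullable -> non-nullable is rejected.
    if source.nullable && !target.nullable {
        return Err(format!(
            "cannot cast nullable field '{}' to a non-nullable target",
            source.name
        ));
    }
    validate_type(&source.ty, &target.ty)
}

/// Check type compatibility, recursing into struct children by name.
fn validate_type(source: &Type, target: &Type) -> Result<(), String> {
    match (source, target) {
        (Type::Struct(src), Type::Struct(tgt)) => {
            for t in tgt {
                // Assumption: target fields missing from the source are
                // skipped here, since they would be filled with nulls.
                if let Some(s) = src.iter().find(|s| s.name == t.name) {
                    validate_field(s, t)?;
                }
            }
            Ok(())
        }
        // Struct <-> non-struct casts are rejected in this sketch.
        (Type::Struct(_), _) | (_, Type::Struct(_)) => {
            Err("cannot cast between struct and non-struct types".to_string())
        }
        // Assumption: every primitive pair is treated as castable.
        _ => Ok(()),
    }
}
```

Running the check on the schema before touching any data lets an incompatible mapping fail early with a message naming the offending field.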
## Are there any user-facing changes?
Yes:
* Users benefit from improved compatibility when reading nested structured
data with evolving schemas.
* Documentation has been expanded to include a new section explaining how
schema adaptation works and how to use `cast_column`.
There are no breaking changes to public APIs.
---
This change enhances DataFusion's resilience to schema drift and paves the
way for more robust handling of semi-structured data. ✨