erratic-pattern commented on issue #19049:
URL: https://github.com/apache/datafusion/issues/19049#issuecomment-3643125781
It's not clear to me whether this is an actual bug. It seems reasonable to
expect metadata to be consistent for field names across union branches.
However, that expectation could be problematic for queries that either:
a. Shadow existing column names with new, unrelated columns. Variable
shadowing is very normal in Rust, but I'm not sure whether it is considered
reasonable in SQL queries.
b. Introduce constant literals for fields on one side of a union, as this
reproducer does.
Perhaps we need to adjust
[intersect_metadata_for_union](https://github.com/apache/datafusion/blob/7b4593f36e880ca1c43746d5c4465fff5a3901c3/datafusion/expr/src/expr.rs#L506-L520)
to either:
a. Avoid intersecting a branch that contains *empty metadata*, and instead
intersect only the branches that contain non-empty metadata. This avoids
destructive loss of metadata when one union branch has none.
b. Avoid intersecting a branch that contains *empty metadata* on a field
that is a *constant literal*. This is a more restrictive version of option a
that might result in fewer unintended consequences.
c. *Union* the metadata instead of *intersecting* it. This ensures that no
metadata is lost, but I am not sure what consequences this might have, since
it could populate metadata in the output schema where it was not intended to
be.
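
To make the trade-offs concrete, here is a rough, standalone sketch of the three options. It does not use DataFusion itself: `Metadata`, `Branch`, `is_literal`, and every function name below are hypothetical, and `intersect_all` is only my simplified reading of what the linked function does today. For option c, I assume the first branch wins when two branches disagree on a key, which is a choice that would need discussion.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for per-field metadata (string key/value pairs).
type Metadata = HashMap<String, String>;

// Hypothetical view of one union branch for a single output field.
struct Branch {
    metadata: Metadata,
    is_literal: bool, // assumption: the planner can tell us this
}

// Simplified model of the current behavior: keep only the key/value
// pairs that appear, with equal values, in every branch.
fn intersect_all(branches: &[Metadata]) -> Metadata {
    let mut iter = branches.iter();
    let Some(first) = iter.next() else {
        return Metadata::new();
    };
    let mut out = first.clone();
    for m in iter {
        out.retain(|k, v| m.get(k) == Some(&*v));
    }
    out
}

// Option a: ignore branches whose metadata is empty, intersect the rest.
fn intersect_non_empty(branches: &[Metadata]) -> Metadata {
    let kept: Vec<Metadata> = branches
        .iter()
        .filter(|m| !m.is_empty())
        .cloned()
        .collect();
    intersect_all(&kept)
}

// Option b: ignore a branch only if it is a literal AND has empty metadata.
fn intersect_skipping_empty_literals(branches: &[Branch]) -> Metadata {
    let kept: Vec<Metadata> = branches
        .iter()
        .filter(|b| !(b.is_literal && b.metadata.is_empty()))
        .map(|b| b.metadata.clone())
        .collect();
    intersect_all(&kept)
}

// Option c: union of all branches; on conflicting keys the first branch wins.
fn union_all(branches: &[Metadata]) -> Metadata {
    let mut out = Metadata::new();
    for m in branches {
        for (k, v) in m {
            out.entry(k.clone()).or_insert_with(|| v.clone());
        }
    }
    out
}

fn main() {
    let tagged: Metadata = HashMap::from([("unit".to_string(), "ms".to_string())]);
    let branches = vec![tagged.clone(), Metadata::new()];

    println!("intersect: {:?}", intersect_all(&branches));
    println!("option a:  {:?}", intersect_non_empty(&branches));
    println!("option c:  {:?}", union_all(&branches));

    let with_literal = vec![
        Branch { metadata: tagged, is_literal: false },
        Branch { metadata: Metadata::new(), is_literal: true },
    ];
    println!("option b:  {:?}", intersect_skipping_empty_literals(&with_literal));
}
```

In this toy case, plain intersection drops the metadata as soon as one branch is empty, while all three options keep it. The options only differ once a non-literal branch is empty (a and c keep the metadata, b drops it) or once branches carry different non-empty metadata (only c keeps the non-shared keys).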
I am curious if anyone has opinions about any of these approaches, or if
there is another way to look at this.