cetra3 opened a new issue, #22447:
URL: https://github.com/apache/datafusion/issues/22447
### Describe the bug
`recompute_schema()` has a pretty well documented contract that if you
change the schema that it should be recomputed.
However, the current check only checks the *length* of the schema, not the
types or names/aliases:
From `recompute_schema()`:
```rust
LogicalPlan::Union(Union { inputs, schema }) => {
let first_input_schema = inputs[0].schema();
if schema.fields().len() == first_input_schema.fields().len() {
// If inputs are not pruned do not change schema
Ok(LogicalPlan::Union(Union { inputs, schema }))
} else {
// A note on `Union`s constructed via `try_new_by_name`:
//
// At this point, the schema for each input should have
// the same width. Thus, we do not need to save whether a
// `Union` was created `BY NAME`, and can safely rely on the
// `try_new` initializer to derive the new schema based on
// column positions.
Ok(LogicalPlan::Union(Union::try_new(inputs)?))
}
}
```
### To Reproduce
```rust
#[test]
fn test_recompute_schema_union_type_mismatch() -> Result<()> {
let schema_i32 =
Schema::new(vec![Field::new("a", DataType::Int32, false)]);
let schema_i64 =
Schema::new(vec![Field::new("a", DataType::Int64, false)]);
// Build a Union whose schema starts out as Int32 (matching its inputs).
let original = Union::try_new(vec![
Arc::new(table_scan(Some("t1"), &schema_i32, None)?.build()?),
Arc::new(table_scan(Some("t2"), &schema_i32, None)?.build()?),
])?;
assert_eq!(
original.schema.field(0).data_type(),
&DataType::Int32,
"sanity: starting schema is Int32"
);
// Now simulate something like a type-coercion pass replacing the
// inputs with Int64-typed versions while leaving the Union's cached
// schema untouched. Same width, different types.
let stale = LogicalPlan::Union(Union {
inputs: vec![
Arc::new(table_scan(Some("t1"), &schema_i64, None)?.build()?),
Arc::new(table_scan(Some("t2"), &schema_i64, None)?.build()?),
],
schema: Arc::clone(&original.schema),
});
let recomputed = stale.recompute_schema()?;
assert_eq!(
recomputed.schema().field(0).data_type(),
&DataType::Int64,
"Union schema should track the new Int64 input types after \
recompute_schema(), but the width-only check left it stale"
);
Ok(())
}
```
### Expected behavior
The schema is updated
### Additional context
I'm trying to rewrite all exprs/aliases using a pattern for some
optimizations. Rewriting exprs to use a new alias, even if it's consistent,
you need to recompute the schema for *all* nodes.
Union is the only one that breaks that I've found but there might be others.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]