Jefffrey commented on code in PR #18286:
URL: https://github.com/apache/datafusion/pull/18286#discussion_r2493427812
##########
datafusion/expr/src/logical_plan/plan.rs:
##########
@@ -633,8 +633,46 @@ impl LogicalPlan {
LogicalPlan::Dml(_) => Ok(self),
LogicalPlan::Copy(_) => Ok(self),
LogicalPlan::Values(Values { schema, values }) => {
- // todo it isn't clear why the schema is not recomputed here
- Ok(LogicalPlan::Values(Values { schema, values }))
+ // We cannot compute the correct schema if we only use values.
+ //
+ // For example, given the following plan:
+ // Projection: col_1, col_2
+ // Values: (Float32(1), Float32(10)), (Float32(100),
Float32(10))
+ //
+ // We wouldn't know about `col_1`, and `col_2` if we only
relied on `values`.
+ // To correctly recompute the new schema, we also need to
retain some information
+ // from the original schema.
+ let new_plan =
LogicalPlanBuilder::values(values.clone())?.build()?;
+
+ let qualified_fields = schema
+ .iter()
+ .zip(new_plan.schema().fields())
+ .map(|((table_ref, old_field), new_field)| {
+ // `old_field`'s data type is unknown but
`new_field`'s is known
+ if old_field.data_type().is_null()
+ && !new_field.data_type().is_null()
+ {
+ let field = old_field
+ .as_ref()
+ .clone()
+ .with_data_type(new_field.data_type().clone());
+ (table_ref.cloned(), Arc::new(field))
Review Comment:
I'm still trying to wrap my head around what's happening here. From what I
can tell, there are two things to consider here:
1. When recomputing `new_plan`, we lose the column names on the old plan
(`self`) so this is a way of keeping that
2. If we ignore the nullability from `self` and go with `new_plan`
nullability that also runs into other issue? (Though from what I see this may
still cause an error, just later in the pipeline/has a different message)
Am I correct in my understanding here?
I wonder for point 1 if we can do this in a nicer way, perhaps by
introducing something like `LogicalPlanBuilder::values_with_names` to preserve
the column names from the start instead of amending after the fact.
For point 2 I'm not sure, perhaps an explicit check upfront is better 🤔
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]