Re: [PR] replace reassign_predicate_columns helper with PhysicalExpr::with_schema [datafusion]

via GitHub Thu, 24 Apr 2025 09:54:31 -0700


berkaysynnada commented on code in PR #15779:
URL: https://github.com/apache/datafusion/pull/15779#discussion_r2058262957



##########
datafusion/physical-expr-common/src/physical_expr.rs:
##########
@@ -333,6 +333,15 @@ pub trait PhysicalExpr: Send + Sync + Display + Debug + 
DynEq + DynHash {
         // This is a safe default behavior.
         Ok(None)
     }
+
+    /// Adapt this [`PhysicalExpr`] to a new schema.
+    /// For example, `Column("b", 1)` can be adapted to `Column("b", 0)`
+    /// given the schema `Schema::new(vec![("b", DataType::Int32)])`.
+    fn with_schema(&self, _schema: &Schema) -> Result<Option<Arc<dyn 
PhysicalExpr>>> {

Review Comment:
   > There's some missing types. Do you mean:
   
   Yes, exactly like that
   
   > I don't fully follow. What is this function expected to do for different 
operators, and how does it behave differently across operators?
   
   Every operator takes an input schema, and provides an output schema, 
possibly after doing some modifications on the input schema. So, that 
modification of schemas should be trackable by some API's. For example, this is 
an example schema tree:
   ```
   A: [a+b] + [Sum(x)] + [y]
   B: -- [a+b] + [Sum(x)] + [y]
   C: ---- [a+b] + [x] + [y]
   D: ------[a+b]
   E: --------[a] + [b] + [c]
   F: ------[x] + [y]
   G: --------[z] + [y] + [x]
   ```
   We should be able to ask, for example to D, "update `(a+b)@0*2` as if it's 
in your input schema", and it will say it is `(a@0+b@1)*2`. The inverse 
question should also be possible. This "foo" expression is written according to 
your input schema, give me as if it is in your output schema. 
   
   TLDR, what I ask from the ExecutionPlan is, operators should be able to 
track the projection mappings of their input to output columns.
   
   > Tangential but I generally don't like the idea of having a bool parameter 
like that and would rather it be two functions 
update_expression_for_input_schema and update_expression_for_output_schema or 
something like that.
   
   We can do that in 2 different methods for sure
   
   > Back on topic: how is this different than say 
PhysicalExpression::with_schema(expr, ExecutionPlan::schema(plan))?
   
   When you provide only 1 schema, and if you don't know the relation between 
the schema and expression, you cannot resolve the same name issues. Your 
suggestion seems to provide a flexibility, as you can just give a custom schema 
and an expression, but what I try to point out is, we cannot do that easily, 
because we need to track from the source of the expression, until its target. 
Therefore; I'm suggesting an ExecutionPlan API which helps tracking the all 
passes.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] replace reassign_predicate_columns helper with PhysicalExpr::with_schema [datafusion]

Reply via email to