alamb opened a new issue, #18845: URL: https://github.com/apache/datafusion/issues/18845
### Is your feature request related to a problem or challenge? Many parts of the planning process require knowing the type, nullability and recently the field information for each `Expr` * [`Expr::to_field`](https://docs.rs/datafusion/latest/datafusion/prelude/enum.Expr.html#method.to_field) * [`Expr::get_type`](https://docs.rs/datafusion/latest/datafusion/prelude/enum.Expr.html#method.get_type) * [`Expr::nullable`](https://docs.rs/datafusion/latest/datafusion/prelude/enum.Expr.html#method.nullable) The implementation of these methods recomputes the type information on demand, for example https://github.com/apache/datafusion/blob/4f95e6d70249d0d839939facc67f95cd7f54526e/datafusion/expr/src/expr_schema.rs#L109 This is slow for several reasons: 1. The computation is re-computed many [times](https://github.com/search?q=repo%3Aapache%2Fdatafusion%20get_type&type=code) [during](repo:apache/datafusion nullable) planning 2. The computations themselves are sometimes non trivial (e.g. what @pepijnve added in https://github.com/apache/datafusion/pull/17813 for case expressions) 3. The calculations have to walk the entire expression tree, each time even when they don't change I have some small evidence that this type computation takes a non trivial amount of planning time, as for example, when I removed some of the cloning in `to_field` in this PR, I saw a significant improvement in planning time (2-3%) - https://github.com/apache/datafusion/pull/18415 ### Describe the solution you'd like I would like to find some way to avoid recomputing the results of these calculations for Expressions ### Describe alternatives you've considered DataFusion already has a similar pattern for `ExecutionPlan` where the properties are computed once and then each ExecutionPlan must report the properties. * https://docs.rs/datafusion/latest/datafusion/physical_plan/trait.ExecutionPlan.html#tymethod.properties * https://docs.rs/datafusion/latest/datafusion/physical_plan/struct.PlanProperties.html We also need to be careful about rolling this change incrementally rather than as one giant PR out as it will be a potentially disruptive API change (as Exprs are used all over the place) So one way to cache this would be to add a method like ```rust /// Cached result of computing type for struct ExprProperties { field: FieldRef, table_reference: Option<TableReference> } ``` trait ExprSchemable { /// return cached properties for the Expr, if available fn try_properties(&self) -> Option<&ExprProperties>; } impl Expr { fn get_type(&self) -> DataType { // first check if we have pre-computed results if let Some(properties) = self.try_properties() { return properties.field.data_type() } // otherwise compute as normal } // add similar checks for nullable and to_field } Then we could add an `ExprProperties` field to various `Expr`s over multiple PRs and roll out the cached values ### Additional context _No response_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
