LiaCastaneda commented on code in PR #22853:
URL: https://github.com/apache/datafusion/pull/22853#discussion_r3518559616


##########
datafusion/expr/src/higher_order_function.rs:
##########
@@ -239,6 +239,15 @@ pub struct LambdaArgument {
     /// For example, for `array_transform([2], v -> -v)`,
     /// this will be `vec![Field::new("v", DataType::Int32, true)]`
     params: Vec<FieldRef>,
+    /// Indices into `params` of the parameters that are actually referenced
+    /// by `body` (taking nested-lambda shadowing into account).
+    ///
+    /// `None` means "no information, assume every declared parameter is used"
+    /// — that is the backwards-compatible behavior of [`Self::new`]. When set,
+    /// [`Self::evaluate`] skips evaluating and pushing the closures for the
+    /// parameters not listed here, so unused declared parameters do not shift
+    /// the columns the body's compressed indices expect.
+    used_param_indices: Option<Vec<usize>>,

Review Comment:
   If we pass `None` in `new()` and assume all parameters are used, would it be 
better to compute the indices directly from `params: Vec<FieldRef>`? That way 
this field doesn't need to be an `Option` it always holds the used parameter 
indices, which happen to be all of them when called via `new()`



##########
datafusion/expr/src/higher_order_function.rs:
##########
@@ -257,26 +266,64 @@ pub struct LambdaArgument {
 }
 
 impl LambdaArgument {
+    /// Build a [`LambdaArgument`] that treats every declared parameter as
+    /// used. This is the backwards-compatible behavior. Prefer
+    /// [`Self::new_with_used_params`] when the caller knows which subset of
+    /// the lambda's parameters the body actually references — otherwise the
+    /// merged batch will still contain columns for unused parameters.
     pub fn new(
         params: Vec<FieldRef>,
         body: Arc<dyn PhysicalExpr>,
         captures: Option<RecordBatch>,
     ) -> Self {
+        Self::new_with_used_params(params, body, captures, None)
+    }
+
+    /// Build a [`LambdaArgument`] knowing which subset of `params` (by name)
+    /// the lambda body actually references.
+    ///
+    /// When `used_params` is `Some(set)`, [`Self::evaluate`] only evaluates
+    /// and pushes the closures whose corresponding parameter name is in
+    /// `set`, in the original declaration order of `params`. Unused declared
+    /// parameters leave no slot in the merged batch, so the body's compressed
+    /// column indices line up directly. When `used_params` is `None`,
+    /// behavior is identical to [`Self::new`].

Review Comment:
   Should we just require `used_params` to be set in this function? Otherwise 
the behavior is the same as `new`, so I'm not sure why we'd need both for that 
case



##########
datafusion/physical-expr/src/expressions/lambda.rs:
##########
@@ -129,6 +141,7 @@ impl LambdaExpr {
             body,
             projected_body,
             projection,
+            used_params: used_param_names,

Review Comment:
   Since `used_param_names` is already computed here during construction, could 
the fix live entirely in `projected_body` instead? Rather than threading 
`used_params` through to `LambdaArgument`, then LambdaArgument stays unchanged 
and external callers don't need to think about used params at all 
   Left a rough idea here 
https://github.com/LiaCastaneda/datafusion/tree/lia/lambda-body-centric-fix



##########
datafusion/expr/src/higher_order_function.rs:
##########


Review Comment:
   Also, the body is dynamic, which technically means for higher order 
functions indices positions can differ for each query no? for example if you 
have a Higher Order function with parameters (x,y,x) for a given query you can 
use x,y or y or all of them. This essentially means external callers of 
`LambdaArgument::new_with_used_params` would have to walk the body themselves 
to figure out which params are referenced.



##########
datafusion/expr/src/higher_order_function.rs:
##########


Review Comment:
   Providing an api where we have to specify the indices of the parameters that 
will be used in the body feels a bit unergonomic. Have you considered somehow 
extracting this from the body and handling this internally so it's invisible to 
the caller?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to