gstvg commented on PR #21679: URL: https://github.com/apache/datafusion/pull/21679#issuecomment-4282272021
@rluvaton @pepijnve @LiaCastaneda Thanks for bringing this up. This also applies to the accumulator parameter of the first lambda.

In a local PoC I used the datatype of the start value. But thinking about it again, there are edge cases like `reduce([1, 2], [], (acc, v) -> array_concat(acc, [to_string(v)]), str_array -> str_array[0])`: the datatype of the start argument, `[]`, would be `List(Null)`, which is not what the user wants (`List(Utf8)`). Via SQL, that would require an explicit cast on the initial value to work, which is not great. Via the expr API I would expect the user to provide the start value with the correct type, and when consuming a Spark plan it should be of the correct type anyway.

On the other hand, DuckDB's [list_reduce](https://duckdb.org/docs/current/sql/functions/list#list_reducelist-lambdaxy-initial_value) without an initial value would be restricted to functions that reduce into the same type as the list values (or worse, to expressions that happen to plan correctly with an accumulator of `DataType::Null`). Some related DuckDB issues: [duckdb#17009](https://www.github.com/duckdb/duckdb/issues/17009), [duckdb#21032](https://github.com/duckdb/duckdb/pull/21032).

I checked ClickHouse and Snowflake, and both require an initial value, just like Spark. The Trino link from @LiaCastaneda also shows that it requires an initial value.

I'll experiment with an additional `HigherOrderUDF` method (we may keep both or use only the new one) and implement `reduce` with it:

```rust
enum LambdaParametersStatus {
    Finished,
    CallAgain,
}

// Default implementation: delegate to the existing single-step
// `lambda_parameters` and finish immediately.
fn multi_step_lambda_parameters(
    &self,
    value_fields: &[FieldRef],
    previous_step: Option<Vec<Vec<Field>>>,
) -> Result<(Vec<Vec<Field>>, LambdaParametersStatus)> {
    Ok((
        self.lambda_parameters(value_fields)?,
        LambdaParametersStatus::Finished,
    ))
}
```

> how can we do that in [...] when we are creating the physical expr ourself (like Comet does).

The Spark lambda variable includes its datatype and nullability, so I think `lambda_parameters` doesn't need to be invoked; after that, I assume the rest of the execution is the same as with SQL or the expr API.

@LiaCastaneda for invocation, the current version would indeed work for `reduce`/`list_reduce`, but the problem is during planning: the accumulator parameter datatype of the merge lambda depends on the lambda itself, and the parameter of the finish lambda depends on the output of the merge lambda, which is not possible to express with the current single-step `lambda_parameters`.
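To make that dependency chain concrete, here's a rough sketch of how `reduce` could implement such a method. Everything below is illustrative, not settled API: the `ReduceUdf` type, the `value_fields` layout (`[list element field, initial value field]`), and the idea that the planner appends the planned merge lambda's output field to `value_fields` on the second call are all assumptions.

```rust
use arrow_schema::{Field, FieldRef};

// Stand-ins so the sketch compiles on its own; the real code would use
// datafusion_common::Result and the `HigherOrderUDF` trait.
type Result<T> = std::result::Result<T, String>;

enum LambdaParametersStatus {
    Finished,
    CallAgain,
}

struct ReduceUdf;

impl ReduceUdf {
    fn multi_step_lambda_parameters(
        &self,
        value_fields: &[FieldRef],
        previous_step: Option<Vec<Vec<Field>>>,
    ) -> Result<(Vec<Vec<Field>>, LambdaParametersStatus)> {
        match previous_step {
            // Step 1: the merge lambda `(acc, v) -> ...` gets the initial
            // value's type for `acc` and the list element type for `v`.
            // The finish lambda can't be typed yet, so ask to be called
            // again once the merge lambda has been planned.
            None => {
                let element = value_fields[0].as_ref().clone();
                let init = value_fields[1].as_ref().clone();
                Ok((
                    vec![vec![init.with_name("acc"), element.with_name("v")]],
                    LambdaParametersStatus::CallAgain,
                ))
            }
            // Step 2: the finish lambda's single parameter takes whatever
            // type the planned merge lambda actually produces (assumed here
            // to arrive as the last entry of `value_fields`).
            Some(mut params) => {
                let merge_output = value_fields
                    .last()
                    .expect("merge lambda output field")
                    .as_ref()
                    .clone();
                params.push(vec![merge_output.with_name("acc")]);
                Ok((params, LambdaParametersStatus::Finished))
            }
        }
    }
}
```

The key point is the `CallAgain` round trip: the finish lambda's parameter type only becomes known after the merge lambda has been planned, which a single-step `lambda_parameters` has no way to express.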
