gstvg commented on PR #21679: URL: https://github.com/apache/datafusion/pull/21679#issuecomment-4282272021
@rluvaton @pepijnve @LiaCastaneda Thanks for bringing this up. This also applies to the accumulator parameter of the first lambda.

In a local PoC I used the datatype of the start value. But thinking about it again, there are edge cases like `reduce([1, 2], [], (acc, v) -> array_concat(acc, [to_string(v)]), str_array -> str_array[0])`: the datatype of the start argument, `[]`, would be `List(Null)`, which is not what the user wants (`List(Utf8)`). Via SQL, that would require an explicit cast on the initial value to work, which is not great. Via the expr API I would expect the user to provide the start value with the correct type, and when consuming a Spark plan it should be of the correct type anyway.

On the other hand, DuckDB's [list_reduce](https://duckdb.org/docs/current/sql/functions/list#list_reducelist-lambdaxy-initial_value) without an initial value would be restricted to functions that reduce into the same type as the list values (or worse, to expressions that happen to plan correctly with an accumulator of `DataType::Null`). Some related DuckDB issues: [duckdb#17009](https://www.github.com/duckdb/duckdb/issues/17009), [duckdb#21032](https://github.com/duckdb/duckdb/pull/21032).

I checked ClickHouse and Snowflake, and both require an initial value, just like Spark. The Trino link from @LiaCastaneda also shows that it requires an initial value.

I'll experiment with an additional `HigherOrderUDF` method (we may keep both or use only the new one) and implement `reduce` with it:

```rust
enum LambdaParametersStatus {
    Finished,
    CallAgain,
}

// Default implementation: delegate to the existing single-step
// `lambda_parameters` and finish immediately.
fn multi_step_lambda_parameters(
    &self,
    value_fields: &[FieldRef],
    previous_step: Option<Vec<Vec<Field>>>,
) -> Result<(Vec<Vec<Field>>, LambdaParametersStatus)> {
    Ok((
        self.lambda_parameters(value_fields)?,
        LambdaParametersStatus::Finished,
    ))
}
```

> how can we do that in [...] when we are creating the physical expr ourself (like Comet does).

The Spark lambda variable includes its datatype and nullability, so I think `lambda_parameters` doesn't need to be invoked; after that, I assume the rest of the execution is the same as with SQL or the expr API.

@LiaCastaneda for invocation, the current version would indeed work for `reduce`/`list_reduce`, but the problem is during planning: the accumulator parameter datatype of the merge lambda depends on the lambda itself, and the parameter of the finish lambda depends on the output of the merge lambda, which is not possible to express with the current single-step `lambda_parameters`.
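To make that dependency chain concrete, here's a rough sketch of how `reduce` could implement such a method. Everything below is illustrative, not settled API: the `ReduceUdf` type, the `value_fields` layout (`[list element field, initial value field]`), and the idea that the planner appends the planned merge lambda's output field to `value_fields` on the second call are all assumptions.

```rust
use arrow_schema::{Field, FieldRef};

// Stand-ins so the sketch compiles on its own; the real code would use
// datafusion_common::Result and the `HigherOrderUDF` trait.
type Result<T> = std::result::Result<T, String>;

enum LambdaParametersStatus {
    Finished,
    CallAgain,
}

struct ReduceUdf;

impl ReduceUdf {
    fn multi_step_lambda_parameters(
        &self,
        value_fields: &[FieldRef],
        previous_step: Option<Vec<Vec<Field>>>,
    ) -> Result<(Vec<Vec<Field>>, LambdaParametersStatus)> {
        match previous_step {
            // Step 1: the merge lambda `(acc, v) -> ...` gets the initial
            // value's type for `acc` and the list element type for `v`.
            // The finish lambda can't be typed yet, so ask to be called
            // again once the merge lambda has been planned.
            None => {
                let element = value_fields[0].as_ref().clone();
                let init = value_fields[1].as_ref().clone();
                Ok((
                    vec![vec![init.with_name("acc"), element.with_name("v")]],
                    LambdaParametersStatus::CallAgain,
                ))
            }
            // Step 2: the finish lambda's single parameter takes whatever
            // type the planned merge lambda actually produces (assumed here
            // to arrive as the last entry of `value_fields`).
            Some(mut params) => {
                let merge_output = value_fields
                    .last()
                    .expect("merge lambda output field")
                    .as_ref()
                    .clone();
                params.push(vec![merge_output.with_name("acc")]);
                Ok((params, LambdaParametersStatus::Finished))
            }
        }
    }
}
```

The key point is the `CallAgain` round trip: the finish lambda's parameter type only becomes known after the merge lambda has been planned, which a single-step `lambda_parameters` has no way to express.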
