andygrove commented on PR #19514:
URL: https://github.com/apache/datafusion/pull/19514#issuecomment-3694864572
> > Thanks @andygrove, the numbers make a lot of sense to me. What concerns me
> > is that DF uses `make_scalar_function` in lots of places, so the same
> > problem may be relevant for others.
> > Would you mind adding more details on the exact reason and which
> > scalar/array combinations cause the most trouble?
> > This info might be important for how DF treats builtin functions now.
>
> Basically, the `make_scalar_function` functions (there are three
> implementations, two of which are identical and one with slightly different
> behavior) convert scalar arguments into arrays. This adds unnecessary overhead
> in some cases because Arrow has specialized implementations of some kernels
> with fast paths for scalar arguments. For example, for `contains`, Arrow has
> `pub fn contains(left: &dyn Datum, right: &dyn Datum)` and `Datum` is
> implemented for arrays and scalars. The implementation has special handling for
> scalars.
>
> The three implementations of `make_scalar_function` can be found in these
> files:
>
> * `datafusion/functions/src/utils.rs`
>
> * `datafusion/functions-nested/src/utils.rs`
>
> * `datafusion/spark/src/function/functions_nested_utils.rs`

The issue is that `make_scalar_function` expects `ArrayRef`s:
```
F: Fn(&[ArrayRef]) -> Result<ArrayRef>,
```
It would perhaps be better if it accepted `ColumnarValue`, which supports
both arrays and scalars, although this is probably a large refactor:
```
F: Fn(&[ColumnarValue]) -> Result<ArrayRef>,
```
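
To make the difference concrete, here is a minimal, self-contained sketch. It
deliberately does not use the real Arrow or DataFusion types: `Value`,
`contains_arrays`, `contains_via_expansion`, and `contains_columnar` are
hypothetical stand-ins that model `ColumnarValue` and string columns with plain
`Vec<String>`, and the expansion logic is only a rough approximation of what
`make_scalar_function` does.

```rust
// Simplified stand-ins for illustration only. These are NOT the real
// DataFusion or Arrow types or functions.

/// Hypothetical stand-in for `ColumnarValue`: either a column or one value.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Array(Vec<String>),
    Scalar(String),
}

impl Value {
    /// Expand to a full column. For a scalar this copies the value once
    /// per row, which is the overhead described above.
    fn into_array(self, num_rows: usize) -> Vec<String> {
        match self {
            Value::Array(a) => a,
            Value::Scalar(s) => vec![s; num_rows],
        }
    }
}

/// Array-only kernel, in the style of `Fn(&[ArrayRef]) -> Result<ArrayRef>`.
fn contains_arrays(args: &[Vec<String>]) -> Result<Vec<bool>, String> {
    match args {
        [haystack, needle] if haystack.len() == needle.len() => Ok(haystack
            .iter()
            .zip(needle)
            .map(|(h, n)| h.contains(n.as_str()))
            .collect()),
        _ => Err("expected two arguments of equal length".to_string()),
    }
}

/// Roughly what an array-only wrapper has to do: expand every argument
/// to an array before calling the kernel (assumed behavior, simplified).
fn contains_via_expansion(args: &[Value]) -> Result<Vec<bool>, String> {
    // Use the longest array argument as the row count; 1 if all are scalars.
    let num_rows = args
        .iter()
        .filter_map(|a| match a {
            Value::Array(v) => Some(v.len()),
            Value::Scalar(_) => None,
        })
        .max()
        .unwrap_or(1);
    let arrays: Vec<Vec<String>> = args
        .iter()
        .cloned()
        .map(|a| a.into_array(num_rows))
        .collect();
    contains_arrays(&arrays)
}

/// Scalar-aware variant, in the style of `Fn(&[ColumnarValue]) -> ...`:
/// the scalar needle is borrowed once instead of being copied per row.
fn contains_columnar(args: &[Value]) -> Result<Vec<bool>, String> {
    match args {
        [Value::Array(haystack), Value::Scalar(needle)] => Ok(haystack
            .iter()
            .map(|h| h.contains(needle.as_str()))
            .collect()),
        // Every other combination falls back to the expansion path.
        _ => contains_via_expansion(args),
    }
}
```

With an array haystack and a scalar needle, `contains_via_expansion` copies the
needle once per row before calling the kernel, while `contains_columnar`
borrows it once. Both return the same result.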