This is a very important feature, thanks for working on it!

Spark uses codegen by default, and it's a bit unfortunate that codegen
support is treated as a non-goal. I think it's better not to ask UDF
implementations to provide two different code paths, one for interpreted
evaluation and one for codegen evaluation. The Expression API does exactly
that, and it's very painful: many bugs have been found due to
inconsistencies between the interpreted and codegen code paths.

Spark already has the infrastructure to call arbitrary Java functions in
both the interpreted and codegen code paths; see the StaticInvoke and
Invoke expressions. I think this lets us define the UDF API in the most
efficient way. For example, a UDF that takes a long and a string and
returns an int:

class MyUDF implements ... {
  int call(long arg1, UTF8String arg2) { ... }
}
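
To make that sketch a bit more concrete, here is a fuller (hypothetical)
version. The ScalarUDF interface name and its inputTypes/resultType methods
are made up for illustration; the actual API shape is what this proposal
would define:

import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.unsafe.types.UTF8String;

// Hypothetical interface: the UDF reports its types, and the
// evaluation method is found via reflection, not declared here.
interface ScalarUDF {
  DataType[] inputTypes();
  DataType resultType();
}

class MyUDF implements ScalarUDF {
  @Override
  public DataType[] inputTypes() {
    return new DataType[] { DataTypes.LongType, DataTypes.StringType };
  }

  @Override
  public DataType resultType() {
    return DataTypes.IntegerType;
  }

  // Primitive long in, UTF8String in, primitive int out:
  // no boxing and no intermediate InternalRow.
  public int call(long arg1, UTF8String arg2) {
    return (int) (arg1 % 100) + arg2.numChars();
  }
}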

There is no compile-time type safety, but there is also no boxing, no
extra InternalRow building, and no separate interpreted and codegen code
paths. The UDF reports its input data types and result data type, so the
analyzer can check via reflection that the call method is valid, and we
still get query-compile-time type safety. It also simplifies development,
as we can just use the Invoke expression to invoke UDFs.
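
For illustration, the analyzer-side check could look roughly like this
(a sketch against the hypothetical ScalarUDF interface above; the
Catalyst-type-to-JVM-class mapping is heavily simplified):

import java.lang.reflect.Method;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.unsafe.types.UTF8String;

class UDFValidator {
  // Simplified Catalyst-type -> JVM-class mapping; a real
  // implementation must cover all supported types.
  static Class<?> javaClassFor(DataType dt) {
    if (DataTypes.LongType.equals(dt)) return long.class;
    if (DataTypes.IntegerType.equals(dt)) return int.class;
    if (DataTypes.StringType.equals(dt)) return UTF8String.class;
    throw new IllegalArgumentException("Unsupported type: " + dt);
  }

  // Query-compile-time check: does the UDF declare a public `call`
  // method matching its reported input and result types?
  static Method resolveCallMethod(ScalarUDF udf) throws NoSuchMethodException {
    DataType[] inputs = udf.inputTypes();
    Class<?>[] paramClasses = new Class<?>[inputs.length];
    for (int i = 0; i < inputs.length; i++) {
      paramClasses[i] = javaClassFor(inputs[i]);
    }
    Method m = udf.getClass().getMethod("call", paramClasses);
    if (!m.getReturnType().equals(javaClassFor(udf.resultType()))) {
      throw new NoSuchMethodException("Return type mismatch for `call`");
    }
    return m;
  }
}

With the resolved Method in hand, the planner could wrap it in an
Invoke/StaticInvoke expression, so interpreted and codegen execution
share a single implementation.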

On Tue, Feb 9, 2021 at 2:52 AM Ryan Blue <b...@apache.org> wrote:

> Hi everyone,
>
> I'd like to start a discussion for adding a FunctionCatalog interface to
> catalog plugins. This will allow catalogs to expose functions to Spark,
> similar to how the TableCatalog interface allows a catalog to expose
> tables. The proposal doc is available here:
> https://docs.google.com/document/d/1PLBieHIlxZjmoUB0ERF-VozCRJ0xw2j3qKvUNWpWA2U/edit
>
> Here's a high-level summary of some of the main design choices:
> * Adds the ability to list and load functions, not to create or modify
> them in an external catalog
> * Supports scalar, aggregate, and partial aggregate functions
> * Uses load and bind steps for better error messages and simpler
> implementations
> * Like the DSv2 table read and write APIs, it uses InternalRow to pass data
> * Can be extended using mix-in interfaces to add vectorization, codegen,
> and other future features
>
> There is also a PR with the proposed API:
> https://github.com/apache/spark/pull/24559/files
>
> Let's discuss the proposal here rather than on that PR, to get better
> visibility. Also, please take the time to read the proposal first. That
> really helps clear up misconceptions.
>
> --
> Ryan Blue
>
