Wenchen,

There are a few issues with the Invoke approach, and I don't think it is
worth the additional complexity it adds to the API.

First, I think you're confusing codegen to *call* a function with codegen
to *implement* a function. The non-goal refers to supporting codegen to
*implement* a UDF; that is where the called version and the generated
version could differ. The Invoke option isn't any different in that case,
because Invoke codegen is only used to call a method, and we can codegen
int result = udfName(x, y) just as easily as we can codegen int result =
udfName(row).
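
To make that concrete, here is a rough sketch of the two call shapes the
generated code would take. The interface and method names below (ArgsUDF,
RowUDF, call, produceResult) are only illustrative, not actual Spark
codegen output or the proposed API:

import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.unsafe.types.UTF8String;

// Illustrative interfaces only, not Invoke internals or the proposal.
interface ArgsUDF { int call(long x, UTF8String y); }
interface RowUDF { int produceResult(InternalRow row); }

class GeneratedCallSketch {
  // Invoke-style call site: unpack the columns, then call the typed method.
  static int callWithArgs(ArgsUDF udf, InternalRow row) {
    return udf.call(row.getLong(0), row.getUTF8String(1));
  }

  // Row-based call site: pass the row straight through.
  static int callWithRow(RowUDF udf, InternalRow row) {
    return udf.produceResult(row);
  }
}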

The Invoke approach also has a problem with expressiveness. Consider a map
function that builds a map from its inputs as key/value pairs: map("x", r *
cos(theta), "y", r * sin(theta)). If this requires a concretely typed Java
method, then there must be many implementations: one for each number of
pairs, each combination of types, and so on:

public MapData buildMap(String k1, int v1);
...
public MapData buildMap(String k1, long v1);
...
public MapData buildMap(String k1, float v1);
...
public MapData buildMap(String k1, double v1);
public MapData buildMap(String k1, double v1, String k2, double v2);
public MapData buildMap(String k1, double v1, String k2, double v2,
String k3, double v3);
...

Clearly, this and many other use cases would fall back to varargs instead.
In that case, there is little benefit to using Invoke because all of the
arguments get collected into an Object[] anyway. That's basically the same
thing as using a row object, just without the convenience methods that
return specific types, like getString, so implementations are forced to
cast instead. And the Invoke approach has a performance *penalty* when
existing rows could simply be projected using a wrapper.
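
As a sketch of how the row-based API keeps this to a single implementation,
here is roughly what a variadic buildMap could look like against
InternalRow. The BuildMapSketch class, the produceResult method name, and
the string-key/double-value layout are assumptions for illustration, not
the proposal's exact API:

import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.catalyst.util.ArrayBasedMapData;
import org.apache.spark.sql.catalyst.util.GenericArrayData;
import org.apache.spark.sql.catalyst.util.MapData;

// Sketch: one implementation handles any number of key/value pairs.
class BuildMapSketch {
  // The input row holds alternating pairs: k1, v1, k2, v2, ...
  MapData produceResult(InternalRow input) {
    int numPairs = input.numFields() / 2;
    Object[] keys = new Object[numPairs];
    Object[] values = new Object[numPairs];
    for (int i = 0; i < numPairs; i++) {
      keys[i] = input.getUTF8String(2 * i);    // string keys at even ordinals
      values[i] = input.getDouble(2 * i + 1);  // double values at odd ordinals
    }
    return new ArrayBasedMapData(
        new GenericArrayData(keys), new GenericArrayData(values));
  }
}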

There's also a massive performance penalty for the Invoke approach when
falling back to non-codegen evaluation, because the function is looked up
and invoked via reflection each time eval is called. Calling a method on an
interface is much cheaper.
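
To illustrate the shape of that cost difference (a simplification, not
Invoke's actual eval implementation; the call signature below is just an
example), the fallback path goes through reflection for every row, while an
interface-based UDF is a plain virtual call:

import java.lang.reflect.Method;
import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.unsafe.types.UTF8String;

class EvalPathSketch {
  // Simplified reflection-based path: look up and invoke the method per row.
  static Object reflectiveEval(Object udf, InternalRow row) throws Exception {
    Method call = udf.getClass().getMethod("call", long.class, UTF8String.class);
    return call.invoke(udf, row.getLong(0), row.getUTF8String(1));
  }

  // Illustrative row-based interface; a direct call the JIT can inline.
  interface RowUDF { int produceResult(InternalRow row); }

  static int interfaceEval(RowUDF udf, InternalRow row) {
    return udf.produceResult(row);
  }
}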

Next, the Invoke approach is much more complicated for implementers to use.
Should they use String or UTF8String? What representations are supported,
and how will Spark detect and produce those representations? What if a
function uses both String *and* UTF8String? Will Spark detect this for each
parameter? Having one or two functions called by Spark is much easier to
maintain in Spark and avoids a lot of debugging headaches when something
goes wrong.
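
For example, a hypothetical implementation like the one below leaves it
open which overload Spark should bind to, and whether it must detect and
convert the representation per parameter:

import org.apache.spark.unsafe.types.UTF8String;

// Hypothetical UDF, just to show the ambiguity: both overloads are plausible.
class UpperUDF {
  String call(String str) { return str.toUpperCase(); }
  UTF8String call(UTF8String str) { return str.toUpperCase(); }
}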

On Mon, Feb 8, 2021 at 12:00 PM Wenchen Fan <cloud0...@gmail.com> wrote:

> This is a very important feature, thanks for working on it!
>
> Spark uses codegen by default, and it's a bit unfortunate to see that
> codegen support is treated as a non-goal. I think it's better to not ask
> the UDF implementations to provide two different code paths for interpreted
> evaluation and codegen evaluation. The Expression API does so and it's very
> painful. Many bugs were found due to inconsistencies between
> the interpreted and codegen code paths.
>
> Now, Spark has the infra to call arbitrary Java functions in
> both interpreted and codegen code paths, see StaticInvoke and Invoke. I
> think we are able to define the UDF API in the most efficient way.
> For example, a UDF that takes long and string, and returns int:
>
> class MyUDF implements ... {
>   int call(long arg1, UTF8String arg2) { ... }
> }
>
> There is no compile-time type-safety. But there is also no boxing, no
> extra InternalRow building, no separated interpreted and codegen code
> paths. The UDF will report input data types and result data type, so the
> analyzer can check if the call method is valid via reflection, and we
> still have query-compile-time type safety. It also simplifies development
> as we can just use the Invoke expression to invoke UDFs.
>
> On Tue, Feb 9, 2021 at 2:52 AM Ryan Blue <b...@apache.org> wrote:
>
>> Hi everyone,
>>
>> I'd like to start a discussion for adding a FunctionCatalog interface to
>> catalog plugins. This will allow catalogs to expose functions to Spark,
>> similar to how the TableCatalog interface allows a catalog to expose
>> tables. The proposal doc is available here:
>> https://docs.google.com/document/d/1PLBieHIlxZjmoUB0ERF-VozCRJ0xw2j3qKvUNWpWA2U/edit
>>
>> Here's a high-level summary of some of the main design choices:
>> * Adds the ability to list and load functions, not to create or modify
>> them in an external catalog
>> * Supports scalar, aggregate, and partial aggregate functions
>> * Uses load and bind steps for better error messages and simpler
>> implementations
>> * Like the DSv2 table read and write APIs, it uses InternalRow to pass
>> data
>> * Can be extended using mix-in interfaces to add vectorization, codegen,
>> and other future features
>>
>> There is also a PR with the proposed API:
>> https://github.com/apache/spark/pull/24559/files
>>
>> Let's discuss the proposal here rather than on that PR, to get better
>> visibility. Also, please take the time to read the proposal first. That
>> really helps clear up misconceptions.
>>
>>
>>
>> --
>> Ryan Blue
>>
--
Ryan Blue
