I don’t think that using Invoke really works. The usability is poor when
something goes wrong, and it can’t handle varargs or parameterized use cases
well. The performance penalty for passing data as a row isn’t significant
enough to justify making the API much more difficult and less expressive.
I don’t think it makes much sense to move forward with the idea.
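To illustrate the varargs point, here’s a rough sketch (names like
`ScalarFunction` and `produceResult` are illustrative stand-ins for the
proposed interface, and Object[] stands in for InternalRow — this isn’t the
actual Spark API):

```java
// Hypothetical row-based interface; Object[] stands in for InternalRow.
interface ScalarFunction<R> {
    R produceResult(Object[] row);
}

public class VarargsSketch {
    public static void main(String[] args) {
        // A variable-arity concat works with the single row-based signature.
        // An Invoke-style per-argument method would need a separate overload
        // for every arity the engine might call.
        ScalarFunction<String> concat = row -> {
            StringBuilder sb = new StringBuilder();
            for (Object o : row) sb.append(o);
            return sb.toString();
        };
        System.out.println(concat.produceResult(new Object[] {"a", "b", "c"}));  // abc
    }
}
```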

More replies inline.

On Tue, Feb 9, 2021 at 2:37 AM Wenchen Fan <cloud0...@gmail.com> wrote:

> There’s also a massive performance penalty for the Invoke approach when
> falling back to non-codegen because the function is loaded and invoked each
> time eval is called. It is much cheaper to use a method in an interface.
>
> Can you elaborate? Using the row parameter or individual parameters
> shouldn't change the life cycle of the UDF instance.
>
The eval method looks up the method via reflection and invokes it on every
call. That’s quite a bit slower than calling an interface method on a UDF
instance.
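To make the difference concrete, here’s a self-contained sketch (not Spark
code; `Udf` and `Doubler` are made-up names) of the two call paths a
non-codegen eval would take per row:

```java
import java.lang.reflect.Method;

public class EvalCost {
    // Interface path: the contract is a method on the UDF instance.
    interface Udf { int call(int x); }

    // Invoke path: the engine has to find this method reflectively.
    public static class Doubler {
        public int invoke(int x) { return 2 * x; }
    }

    public static void main(String[] args) throws Exception {
        Udf direct = x -> 2 * x;
        Doubler udf = new Doubler();

        // Interface path: one virtual call per row.
        int a = direct.call(21);

        // Reflective path: method lookup plus argument boxing on every eval.
        Method m = Doubler.class.getMethod("invoke", int.class);
        int b = (Integer) m.invoke(udf, 21);

        System.out.println(a + " " + b);  // prints "42 42"
    }
}
```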

> Should they use String or UTF8String? What representations are supported
> and how will Spark detect and produce those representations?
>
> It's the same as InternalRow. We can just copy-paste the document of
> InternalRow to explain the corresponding java type for each data type.
>
My point is that having a single method signature that uses InternalRow and
is inherited from an interface is much easier for users and Spark. If a
user forgets to use UTF8String and writes a method with String instead,
the UDF won’t work. What then? Does Spark detect that the wrong type was
used? It would need to, or else it would be difficult for a UDF developer
to tell what is wrong. And this is a runtime issue, so it is caught late.
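A minimal sketch of that failure mode (`Utf8Str` here is a made-up stand-in
for Spark’s internal string type, and `BadUdf` is hypothetical): the
reflective lookup simply doesn’t find a matching method, and nothing flags
the mistake until runtime.

```java
import java.lang.reflect.Method;

public class WrongType {
    // Stand-in for Spark's internal string representation (UTF8String).
    public static final class Utf8Str {
        final byte[] bytes;
        Utf8Str(String s) { this.bytes = s.getBytes(); }
    }

    // A UDF author who wrote String where the engine expects Utf8Str.
    public static class BadUdf {
        public int invoke(String s) { return s.length(); }
    }

    public static void main(String[] args) {
        try {
            // The engine looks up invoke(Utf8Str) reflectively...
            Method m = BadUdf.class.getMethod("invoke", Utf8Str.class);
            System.out.println("found: " + m);
        } catch (NoSuchMethodException e) {
            // ...and only discovers the mismatch at runtime.
            System.out.println("no matching method");  // prints "no matching method"
        }
    }
}
```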
-- 
Ryan Blue
