Okay, great. I'll update the SPIP doc and call a vote in the next day or two.
On Thu, Mar 4, 2021 at 8:26 AM Erik Krogen <xkro...@apache.org> wrote:

> +1 on Dongjoon's proposal. This is a very nice compromise between the reflective/magic-method approach and the InternalRow approach: it provides a lot of flexibility for our users and allows the more complicated reflection-based approach to evolve at its own pace, since you can always fall back to InternalRow for situations that aren't yet supported by reflection.
>
> We can even consider having Spark detect that you haven't overridden the default produceResult (IIRC this is discoverable via reflection) and raise an error at query analysis time, instead of at runtime, when it can't find a reflective method or an overridden produceResult.
>
> I'm very pleased we have found a compromise that everyone seems happy with! Big thanks to everyone who participated.
>
> On Wed, Mar 3, 2021 at 8:34 PM John Zhuge <jzh...@apache.org> wrote:
>
>> +1 Good plan to move forward.
>>
>> Thank you all for the constructive and comprehensive discussions in this thread! Decisions on this important feature will have ramifications for years to come.
>>
>> On Wed, Mar 3, 2021 at 7:42 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>>
>>> +1 to this proposal. If people don't like the ScalarFunction0, ScalarFunction1, ... variants and prefer the "magical methods", then we can have a single ScalarFunction interface that has the row-parameter API (with a default implementation that fails) plus documentation describing the "magical methods" (which can be done later).
>>>
>>> I'll start the PR review this week to check the naming, docs, etc.
>>>
>>> Thanks all for the discussion here, and let's move forward!
>>>
>>> On Thu, Mar 4, 2021 at 9:53 AM Ryan Blue <rb...@netflix.com> wrote:
>>>
>>>> Good point, Dongjoon. I think we can probably come to some compromise here:
>>>>
>>>> - Remove SupportsInvoke since it isn't really needed. We should always try to find the right method to invoke in the codegen path.
>>>> - Add a default implementation of produceResult so that implementations don't have to use it. If they don't implement it and a magic function can't be found, then it will throw UnsupportedOperationException.
>>>>
>>>> This is assuming that we can agree not to introduce all of the ScalarFunction interface variations, which would have limited utility because of type erasure.
>>>>
>>>> Does that sound like a good plan to everyone? If so, I'll update the SPIP doc so we can move forward.
>>>>
>>>> On Wed, Mar 3, 2021 at 4:36 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>>
>>>>> Hi, All.
>>>>>
>>>>> We shared many opinions from different perspectives. However, we didn't reach a consensus even on a partial merge that excludes something (on the PR by me, and on this mailing thread by Wenchen).
>>>>>
>>>>> For the following claim, we have an alternative that mitigates it:
>>>>>
>>>>> > I don't like it because it promotes the row-parameter API and forces users to implement it, even if the users want to only use the individual-parameters API.
>>>>>
>>>>> Why don't we merge the AS-IS PR by adding something instead of excluding something?
>>>>>
>>>>> - R produceResult(InternalRow input);
>>>>> + default R produceResult(InternalRow input) throws Exception {
>>>>> +   throw new UnsupportedOperationException();
>>>>> + }
>>>>>
>>>>> By providing the default implementation, it will not technically *force users to implement it*, and we can provide documentation about our expected usage. What do you think?
>>>>>
>>>>> Bests,
>>>>> Dongjoon.
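The shape of this compromise can be sketched in a few lines of Java. Everything below is a simplified stand-in for illustration only (Row abbreviates Spark's InternalRow, and the interface omits binding, naming, and codegen details); it is not the actual proposed API:

```java
// Hypothetical stand-ins for Spark's InternalRow and ScalarFunction,
// simplified for illustration; not the actual Spark API.
interface Row {
    long getLong(int ordinal);
}

interface ScalarFunction<R> {
    // The compromise: a default implementation, so UDFs that rely solely
    // on a "magic" method are not forced to implement the row-based API.
    // If Spark also fails to find a magic method, the call fails here at
    // runtime (or, per Erik's suggestion, at query analysis time).
    default R produceResult(Row input) {
        throw new UnsupportedOperationException(
            "Implement produceResult or provide a magic method");
    }
}

// A UDF that only provides the row-based API.
class RowBasedAdd implements ScalarFunction<Long> {
    @Override
    public Long produceResult(Row input) {
        return input.getLong(0) + input.getLong(1);
    }
}

// A UDF that relies on a magic method and keeps the default produceResult.
class MagicAdd implements ScalarFunction<Long> {
    // The codegen path would locate a method like this reflectively.
    public long invoke(long left, long right) {
        return left + right;
    }
}
```

The point of the default method is visible in MagicAdd: it compiles without touching the row-based API, and produceResult only fails if Spark could not find a magic method to invoke.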
>>>>>
>>>>> On Wed, Mar 3, 2021 at 10:28 AM Ryan Blue <rb...@netflix.com> wrote:
>>>>>
>>>>>> > Yes, GenericInternalRow is safe when types mismatch, with the cost of using Object[], and primitive types need to do boxing
>>>>>>
>>>>>> The question is not whether to use the magic functions, which would not need boxing. The question here is whether to use multiple ScalarFunction interfaces. Those interfaces would require boxing or using Object[], so there isn't a benefit.
>>>>>>
>>>>>> > If we do want to reuse one UDF for different types, using "magical methods" solves the problem
>>>>>>
>>>>>> Yes, that's correct. We agree that magic methods are a good option for this.
>>>>>>
>>>>>> Again, the question we need to decide is whether to use InternalRow or interfaces like ScalarFunction2 for non-codegen. The option to use multiple interfaces is limited by type erasure because you can only have one set of type parameters. If you wanted to support both ScalarFunction2<Integer, Integer> and ScalarFunction2<Long, Long>, you'd have to fall back to ScalarFunction2<Object, Object> and cast.
>>>>>>
>>>>>> The point is that type erasure will commonly lead either to many different implementation classes (one for each type combination) or to parameterizing by Object, which defeats the purpose.
>>>>>>
>>>>>> The alternative adds safety because correct types are produced by calls like getLong(0). Yes, this depends on the implementation making the correct calls, but it is better than using Object and casting.
>>>>>>
>>>>>> > I can't think of real use cases that will force the individual-parameters approach to use Object instead of concrete types.
>>>>>>
>>>>>> I think this is addressed by the type erasure discussion above. A simple Plus method would require Object or 12 implementations for 2 arguments and 4 numeric types.
>>>>>>
>>>>>> And basically all varargs cases would need to use Object[]. Consider a UDF to create a map that requires string keys and some consistent type for values. This would be easy with the InternalRow API because you can use getString(pos) and get(pos + 1, valueType) to get the key/value pairs. Use of UTF8String vs String will be checked at compile time.
>>>>>>
>>>>>> > I agree that Object[] is worse than InternalRow
>>>>>>
>>>>>> Yes, and if we are always using Object because of type erasure or using magic methods to get better performance, the utility of the parameterized interfaces is very limited.
>>>>>>
>>>>>> Because we want to expose the magic functions, the use of ScalarFunction2 and similar is extremely limited because it is only for non-codegen. Varargs is by far the more common case. The InternalRow interface is also a very simple way to get started and ensures that Spark can always find the right method after the function is bound to input types.
>>>>>>
>>>>>> On Tue, Mar 2, 2021 at 6:35 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>>>
>>>>>>> Yes, GenericInternalRow is safe when types mismatch, with the cost of using Object[], and primitive types need to do boxing. And this is a runtime failure, which is absolutely worse than query-compile-time checks. Also, don't forget my previous point: users need to specify the type and index, such as row.getLong(0), which is error-prone.
>>>>>>>
>>>>>>> > But we don't do that for any of the similar UDFs today so I'm skeptical that this would actually be a high enough priority to implement.
>>>>>>>
>>>>>>> I'd say this is a must-have if we go with the individual-parameters approach. The Scala UDF today checks the method signature at compile time, thanks to the Scala type tag.
>>>>>>> The Java UDF today doesn't check and is hard to use.
>>>>>>>
>>>>>>> > You can't implement ScalarFunction2<Integer, Integer> and ScalarFunction2<Long, Long>.
>>>>>>>
>>>>>>> Can you elaborate? We have function binding and we can use different UDFs for different input types. If we do want to reuse one UDF for different types, using "magical methods" solves the problem:
>>>>>>>
>>>>>>> class MyUDF {
>>>>>>>   def call(i: Int): Int = ...
>>>>>>>   def call(l: Long): Long = ...
>>>>>>> }
>>>>>>>
>>>>>>> On the other side, I don't think the row-parameter approach can solve this problem. The best I can think of is:
>>>>>>>
>>>>>>> class MyUDF extends ScalarFunction[Object] {
>>>>>>>   def call(row: InternalRow): Object = {
>>>>>>>     if (int input) row.getInt(0) ... else row.getLong(0) ...
>>>>>>>   }
>>>>>>> }
>>>>>>>
>>>>>>> This is worse because: 1) it needs if-else branches to check the different input types, and 2) the return type can only be Object, which causes boxing issues.
>>>>>>>
>>>>>>> I agree that Object[] is worse than InternalRow. But I can't think of real use cases that will force the individual-parameters approach to use Object instead of concrete types.
>>>>>>>
>>>>>>> On Tue, Mar 2, 2021 at 3:36 AM Ryan Blue <rb...@netflix.com> wrote:
>>>>>>>
>>>>>>>> Thanks for adding your perspective, Erik!
>>>>>>>>
>>>>>>>> > If the input is string type but the UDF implementation calls row.getLong(0), it returns wrong data
>>>>>>>>
>>>>>>>> I think this is misleading. It is true for UnsafeRow, but there is no reason why InternalRow should return incorrect values.
>>>>>>>>
>>>>>>>> The implementation in GenericInternalRow <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/rows.scala#L35> would throw a ClassCastException. I don't think that using a row is a bad option simply because UnsafeRow is unsafe.
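Wenchen's overloaded-call example above can also be written as a runnable Java sketch (the class and method names are illustrative only, not the proposed API): overload resolution gives each primitive type its own unboxed signature, and the codegen path would pick the overload matching the bound input types.

```java
// Illustrative sketch only: one UDF class serving both int and long
// inputs via overloaded "magic" methods, each with a primitive
// (unboxed) signature, so no boxing is involved on either path.
class IncrementUdf {
    int call(int i) { return i + 1; }
    long call(long l) { return l + 1L; }
}
```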
>>>>>>>>
>>>>>>>> It's unlikely that UnsafeRow would be used to pass the data. The implementation would evaluate each argument expression and set the result in a generic row, then pass that row to the UDF. We can use whatever implementation we choose to provide better guarantees than unsafe.
>>>>>>>>
>>>>>>>> > I think we should consider query-compile-time checks as nearly as good as Java-compile-time checks for the purposes of safety.
>>>>>>>>
>>>>>>>> I don't think I agree with this. A failure at query analysis time vs. runtime still requires going back to a separate project, fixing something, and rebuilding. The time needed to fix a problem goes up significantly vs. compile-time checks. And that is even worse if the UDF is maintained by someone else.
>>>>>>>>
>>>>>>>> I think we also need to consider how common it would be that a use case can have the query-compile-time checks. Going through this in more detail below makes me think that it is unlikely that these checks would be used often, because of the limitations of using an interface with type erasure.
>>>>>>>>
>>>>>>>> > I believe that Wenchen's proposal will provide stronger query-compile-time safety
>>>>>>>>
>>>>>>>> The proposal could have better safety for each argument, assuming that we detect failures by looking at the parameter types using reflection in the analyzer. But we don't do that for any of the similar UDFs today, so I'm skeptical that this would actually be a high enough priority to implement.
>>>>>>>>
>>>>>>>> As Erik pointed out, type erasure also limits the effectiveness. You can't implement ScalarFunction2<Integer, Integer> and ScalarFunction2<Long, Long>. You can handle those cases using InternalRow, or you can handle them using VarargScalarFunction<Object>. That forces many use cases into varargs with Object, where you don't get any of the proposed analyzer benefits and lose compile-time checks. The only time the additional checks (if implemented) would help is when only one set of argument types is needed, because implementing ScalarFunction<Object, Object> defeats the purpose.
>>>>>>>>
>>>>>>>> It's worth noting that safety for the magic methods would be identical between the two options, so the trade-off to consider is for the varargs and non-codegen cases. Combining the limitations discussed, this has better safety guarantees only if you need just one set of types for each number of arguments and are using the non-codegen path. Since varargs is one of the primary reasons to use this API, I don't think that it is a good idea to use Object[] instead of InternalRow.
>>>>>>>>
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>> Software Engineer
>>>>>>>> Netflix
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>
>> --
>> John Zhuge

--
Ryan Blue
Software Engineer
Netflix
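As a closing illustration of the type-erasure point argued throughout the thread, here is a hedged Java sketch; ScalarFunction2 below is a hypothetical stand-in mirroring the proposed interface, not Spark's actual definition:

```java
// Hypothetical interface mirroring the proposed ScalarFunction2;
// illustration only, not the actual Spark API.
interface ScalarFunction2<A, R> {
    R produceResult(A left, A right);
}

// Will NOT compile: both parameterizations erase to the same raw
// interface, so a single class cannot implement them together.
//
//   class Plus implements ScalarFunction2<Integer, Integer>,
//                         ScalarFunction2<Long, Long> { ... }

// The only single-class fallback erases to Object and must cast,
// which gives up exactly the compile-time safety the
// individual-parameters approach is meant to provide.
class Plus implements ScalarFunction2<Object, Object> {
    @Override
    public Object produceResult(Object left, Object right) {
        if (left instanceof Integer) {
            return (Integer) left + (Integer) right;  // result is boxed
        }
        return (Long) left + (Long) right;
    }
}
```

A wrong cast here surfaces only at runtime as a ClassCastException, which is the safety trade-off weighed against InternalRow's typed getters above.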