+1 on Dongjoon's proposal. Great to see this is getting moved forward and
thanks everyone for the insightful discussion!



On Thu, Mar 4, 2021 at 8:58 AM Ryan Blue <rb...@netflix.com> wrote:

> Okay, great. I'll update the SPIP doc and call a vote in the next day or
> two.
>
> On Thu, Mar 4, 2021 at 8:26 AM Erik Krogen <xkro...@apache.org> wrote:
>
>> +1 on Dongjoon's proposal. This is a very nice compromise between the
>> reflective/magic-method approach and the InternalRow approach, providing
>> a lot of flexibility for our users, and allowing for the more complicated
>> reflection-based approach to evolve at its own pace, since you can always
>> fall back to InternalRow for situations which aren't yet supported by
>> reflection.
>>
>> We can even consider having Spark code detect that you haven't overridden
>> the default produceResult (IIRC this is discoverable via reflection),
>> and raise an error at query analysis time instead of at runtime when it
>> can't find a reflective method or an overridden produceResult.
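A rough standalone sketch of that detection idea (the ScalarFunction interface below is a simplified stand-in for the proposed API, with Object[] in place of InternalRow; all names are illustrative):

```java
import java.lang.reflect.Method;

// Simplified stand-in for the proposed interface, with Object[] in place
// of InternalRow (illustrative only).
interface ScalarFunction<R> {
    default R produceResult(Object[] input) {
        throw new UnsupportedOperationException("produceResult not implemented");
    }
}

// Relies only on a "magic" method and never overrides produceResult.
class IntSum implements ScalarFunction<Integer> {
    public int invoke(int a, int b) { return a + b; }
}

public class AnalysisTimeCheck {
    // True if the implementation class overrides the default;
    // false if only the interface's failing default is inherited.
    static boolean overridesProduceResult(Class<?> impl) throws NoSuchMethodException {
        Method m = impl.getMethod("produceResult", Object[].class);
        return m.getDeclaringClass() != ScalarFunction.class;
    }

    public static void main(String[] args) throws Exception {
        // IntSum never overrides produceResult, so analysis could flag it
        // unless a usable magic method is also found.
        System.out.println(overridesProduceResult(IntSum.class)); // false
    }
}
```

If the lookup reports the interface itself as the declaring class, the default was never overridden, and the analyzer could reject the function early when no magic method matches either.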
>>
>> I'm very pleased we have found a compromise that everyone seems happy
>> with! Big thanks to everyone who participated.
>>
>> On Wed, Mar 3, 2021 at 8:34 PM John Zhuge <jzh...@apache.org> wrote:
>>
>>> +1 Good plan to move forward.
>>>
>>> Thank you all for the constructive and comprehensive discussions in this
>>> thread! Decisions on this important feature will have ramifications for
>>> years to come.
>>>
>>> On Wed, Mar 3, 2021 at 7:42 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>>>
>>>> +1 to this proposal. If people don't like the ScalarFunction0,1, ...
>>>> variants and prefer the "magical methods", then we can have a single
>>>> ScalarFunction interface which has the row-parameter API (with a default
>>>> implementation that fails) and documentation describing the "magical
>>>> methods" (which can be added later).
>>>>
>>>> I'll start the PR review this week to check the naming, doc, etc.
>>>>
>>>> Thanks all for the discussion here and let's move forward!
>>>>
>>>> On Thu, Mar 4, 2021 at 9:53 AM Ryan Blue <rb...@netflix.com> wrote:
>>>>
>>>>> Good point, Dongjoon. I think we can probably come to some compromise
>>>>> here:
>>>>>
>>>>>    - Remove SupportsInvoke since it isn’t really needed. We should
>>>>>    always try to find the right method to invoke in the codegen path.
>>>>>    - Add a default implementation of produceResult so that
>>>>>    implementations don’t have to override it. If they don’t implement it
>>>>>    and a magic function can’t be found, then it will throw
>>>>>    UnsupportedOperationException.
>>>>>
>>>>> This is assuming that we can agree not to introduce all of the
>>>>> ScalarFunction interface variations, which would have limited utility
>>>>> because of type erasure.
>>>>>
>>>>> Does that sound like a good plan to everyone? If so, I’ll update the
>>>>> SPIP doc so we can move forward.
>>>>>
>>>>> On Wed, Mar 3, 2021 at 4:36 PM Dongjoon Hyun <dongjoon.h...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi, All.
>>>>>>
>>>>>> We have shared many opinions from different perspectives.
>>>>>> However, we didn't reach a consensus even on a partial merge that
>>>>>> excludes the contested parts (proposed on the PR by me, and on this
>>>>>> mailing thread by Wenchen).
>>>>>>
>>>>>> For the following claims, we have another alternative to mitigate it.
>>>>>>
>>>>>>     > I don't like it because it promotes the row-parameter API and
>>>>>> forces users to implement it, even if the users want to only use the
>>>>>> individual-parameters API.
>>>>>>
>>>>>> Why don't we merge the AS-IS PR by adding something instead of
>>>>>> excluding something?
>>>>>>
>>>>>>     - R produceResult(InternalRow input);
>>>>>>     + default R produceResult(InternalRow input) throws Exception {
>>>>>>     +   throw new UnsupportedOperationException();
>>>>>>     + }
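Spelled out as a compilable sketch (with Object[] standing in for InternalRow, purely for illustration), the change behaves like this:

```java
// Illustrative stand-in for the proposed ScalarFunction, with Object[]
// in place of InternalRow.
interface ScalarFunction<R> {
    // Previously abstract; now a default that fails, so implementations
    // relying only on the individual-parameters API can skip it.
    default R produceResult(Object[] input) {
        throw new UnsupportedOperationException(
            "produceResult(InternalRow) is not implemented");
    }
}

public class DefaultDemo {
    public static void main(String[] args) {
        // An implementation that never overrides the default still compiles...
        ScalarFunction<Integer> f = new ScalarFunction<Integer>() {};
        try {
            // ...and only fails if the row-parameter path is actually taken.
            f.produceResult(new Object[] {1, 2});
        } catch (UnsupportedOperationException e) {
            System.out.println("fell through to UnsupportedOperationException");
        }
    }
}
```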
>>>>>>
>>>>>> By providing the default implementation, we will not technically be
>>>>>> *forcing users to implement it*.
>>>>>> And we can properly document the expected usage.
>>>>>> What do you think?
>>>>>>
>>>>>> Bests,
>>>>>> Dongjoon.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Mar 3, 2021 at 10:28 AM Ryan Blue <rb...@netflix.com> wrote:
>>>>>>
>>>>>>> Yes, GenericInternalRow is safe when types mismatch, at the cost of
>>>>>>> using Object[], and primitive types need boxing
>>>>>>>
>>>>>>> The question is not whether to use the magic functions, which would
>>>>>>> not need boxing. The question here is whether to use multiple
>>>>>>> ScalarFunction interfaces. Those interfaces would require boxing or
>>>>>>> using Object[] so there isn’t a benefit.
>>>>>>>
>>>>>>> If we do want to reuse one UDF for different types, using “magical
>>>>>>> methods” solves the problem
>>>>>>>
>>>>>>> Yes, that’s correct. We agree that magic methods are a good option
>>>>>>> for this.
>>>>>>>
>>>>>>> Again, the question we need to decide is whether to use InternalRow
>>>>>>> or interfaces like ScalarFunction2 for non-codegen. The option to use
>>>>>>> multiple interfaces is limited by type erasure because you can only
>>>>>>> have one set of type parameters. If you wanted to support both
>>>>>>> ScalarFunction2<Integer, Integer> and ScalarFunction2<Long, Long>,
>>>>>>> you’d have to fall back to ScalarFunction2<Object, Object> and cast.
>>>>>>>
>>>>>>> The point is that type erasure will commonly lead either to many
>>>>>>> different implementation classes (one for each type combination) or will
>>>>>>> lead to parameterizing by Object, which defeats the purpose.
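To make the erasure point concrete, here is a small standalone Java sketch (ScalarFunction2 here is a stand-in for the proposed interface): a class can implement it for only one pair of type parameters, and the compiler routes interface calls through an erased bridge method anyway.

```java
import java.lang.reflect.Method;

// Stand-in for the proposed interface (illustrative only).
interface ScalarFunction2<A, B> {
    Object call(A left, B right);
}

// Only one parameterization per class: also adding
// "implements ScalarFunction2<Long, Long>" here would not compile,
// because both erase to the same call(Object, Object) signature.
class IntPlus implements ScalarFunction2<Integer, Integer> {
    public Object call(Integer left, Integer right) {
        return left + right;
    }
}

public class ErasureDemo {
    public static void main(String[] args) throws Exception {
        // The compiler generates a bridge method with the erased signature,
        // so through the interface the JVM only ever sees Object parameters.
        Method bridge = IntPlus.class.getMethod("call", Object.class, Object.class);
        System.out.println(bridge.isBridge());                  // true
        System.out.println(bridge.invoke(new IntPlus(), 3, 4)); // 7
    }
}
```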
>>>>>>>
>>>>>>> The alternative adds safety because correct types are produced by
>>>>>>> calls like getLong(0). Yes, this depends on the implementation
>>>>>>> making the correct calls, but it is better than using Object and
>>>>>>> casting.
>>>>>>>
>>>>>>> I can’t think of real use cases that will force the
>>>>>>> individual-parameters approach to use Object instead of concrete types.
>>>>>>>
>>>>>>> I think this is addressed by the type erasure discussion above. A
>>>>>>> simple Plus method would require Object or 12 implementations for 2
>>>>>>> arguments and 4 numeric types.
>>>>>>>
>>>>>>> And basically all varargs cases would need to use Object[].
>>>>>>> Consider a UDF to create a map that requires string keys and some
>>>>>>> consistent type for values. This would be easy with the InternalRow
>>>>>>> API because you can use getString(pos) and get(pos + 1, valueType)
>>>>>>> to get the key/value pairs. Use of UTF8String vs String will be
>>>>>>> checked at compile time.
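A rough sketch of that map-building UDF, with a minimal Row class standing in for InternalRow (the accessor names mirror the ones mentioned above; none of this is Spark's actual API):

```java
import java.util.HashMap;
import java.util.Map;

// Minimal stand-in for InternalRow with typed positional accessors
// (illustrative only).
class Row {
    private final Object[] values;
    Row(Object... values) { this.values = values; }
    String getString(int pos) { return (String) values[pos]; }
    Object get(int pos) { return values[pos]; }
    int numFields() { return values.length; }
}

// Varargs-style UDF: alternating key/value arguments become a map.
class MakeMap {
    Map<String, Object> produceResult(Row input) {
        Map<String, Object> result = new HashMap<>();
        for (int pos = 0; pos < input.numFields(); pos += 2) {
            // getString casts, so a non-string key fails fast at position pos.
            result.put(input.getString(pos), input.get(pos + 1));
        }
        return result;
    }
}

public class MapUdfDemo {
    public static void main(String[] args) {
        Map<String, Object> m =
            new MakeMap().produceResult(new Row("a", 1, "b", 2));
        System.out.println(m.get("a")); // 1
        System.out.println(m.get("b")); // 2
    }
}
```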
>>>>>>>
>>>>>>> I agree that Object[] is worse than InternalRow
>>>>>>>
>>>>>>> Yes, and if we are always using Object because of type erasure or
>>>>>>> using magic methods to get better performance, the utility of the
>>>>>>> parameterized interfaces is very limited.
>>>>>>>
>>>>>>> Because we want to expose the magic functions, the use of
>>>>>>> ScalarFunction2 and similar is extremely limited because it is only
>>>>>>> for non-codegen. Varargs is by far the more common case. The
>>>>>>> InternalRow interface is also a very simple way to get started and
>>>>>>> ensures that Spark can always find the right method after the function
>>>>>>> is bound to input types.
>>>>>>>
>>>>>>> On Tue, Mar 2, 2021 at 6:35 AM Wenchen Fan <cloud0...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Yes, GenericInternalRow is safe when types mismatch, at the cost of
>>>>>>>> using Object[], and primitive types need boxing. And this is a runtime
>>>>>>>> failure, which is strictly worse than query-compile-time checks. Also,
>>>>>>>> don't forget my previous point: users need to specify the type and
>>>>>>>> index, such as row.getLong(0), which is error-prone.
>>>>>>>>
>>>>>>>> > But we don’t do that for any of the similar UDFs today so I’m
>>>>>>>> skeptical that this would actually be a high enough priority to 
>>>>>>>> implement.
>>>>>>>>
>>>>>>>> I'd say this is a must-have if we go with the individual-parameters
>>>>>>>> approach. The Scala UDF today checks the method signature at 
>>>>>>>> compile-time,
>>>>>>>> thanks to the Scala type tag. The Java UDF today doesn't check and is 
>>>>>>>> hard
>>>>>>>> to use.
>>>>>>>>
>>>>>>>> > You can’t implement ScalarFunction2<Integer, Integer> and
>>>>>>>> ScalarFunction2<Long, Long>.
>>>>>>>>
>>>>>>>> Can you elaborate? We have function binding and we can use
>>>>>>>> different UDFs for different input types. If we do want to reuse one 
>>>>>>>> UDF
>>>>>>>> for different types, using "magical methods" solves the problem:
>>>>>>>> class MyUDF {
>>>>>>>>   def call(i: Int): Int = ...
>>>>>>>>   def call(l: Long): Long = ...
>>>>>>>> }
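In Java terms, and purely as an illustration of how such overloads could be resolved after function binding (this is not Spark's actual resolution logic):

```java
import java.lang.reflect.Method;

// One UDF reused for several input types via overloaded "magic" methods.
class MyUDF {
    public int call(int i) { return i + 1; }
    public long call(long l) { return l + 1L; }
}

public class MagicDispatch {
    public static void main(String[] args) throws Exception {
        // After binding to the input types, the engine can look up the exact
        // overload by its primitive parameter types -- no Object, no casts.
        Method forInt = MyUDF.class.getMethod("call", int.class);
        Method forLong = MyUDF.class.getMethod("call", long.class);
        System.out.println(forInt.invoke(new MyUDF(), 41));   // 42
        System.out.println(forLong.invoke(new MyUDF(), 41L)); // 42
    }
}
```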
>>>>>>>>
>>>>>>>> On the other side, I don't think the row-parameter approach can
>>>>>>>> solve this problem. The best I can think of is:
>>>>>>>> class MyUDF extends ScalarFunction[Object] {
>>>>>>>>   def call(row: InternalRow): Object = {
>>>>>>>>     if (int input) row.getInt(0) ... else row.getLong(0) ...
>>>>>>>>   }
>>>>>>>> }
>>>>>>>>
>>>>>>>> This is worse because: 1) it needs if-else checks for the different
>>>>>>>> input types, and 2) the return type can only be Object, which causes
>>>>>>>> boxing issues.
>>>>>>>>
>>>>>>>> I agree that Object[] is worse than InternalRow. But I can't think
>>>>>>>> of real use cases that will force the individual-parameters approach 
>>>>>>>> to use
>>>>>>>> Object instead of concrete types.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Mar 2, 2021 at 3:36 AM Ryan Blue <rb...@netflix.com> wrote:
>>>>>>>>
>>>>>>>>> Thanks for adding your perspective, Erik!
>>>>>>>>>
>>>>>>>>> If the input is string type but the UDF implementation calls
>>>>>>>>> row.getLong(0), it returns wrong data
>>>>>>>>>
>>>>>>>>> I think this is misleading. It is true for UnsafeRow, but there
>>>>>>>>> is no reason why InternalRow should return incorrect values.
>>>>>>>>>
>>>>>>>>> The implementation in GenericInternalRow
>>>>>>>>> <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/rows.scala#L35>
>>>>>>>>> would throw a ClassCastException. I don’t think that using a row
>>>>>>>>> is a bad option simply because UnsafeRow is unsafe.
>>>>>>>>>
>>>>>>>>> It’s unlikely that UnsafeRow would be used to pass the data. The
>>>>>>>>> implementation would evaluate each argument expression and set the 
>>>>>>>>> result
>>>>>>>>> in a generic row, then pass that row to the UDF. We can use whatever
>>>>>>>>> implementation we choose to provide better guarantees than unsafe.
>>>>>>>>>
>>>>>>>>> I think we should consider query-compile-time checks as
>>>>>>>>> nearly-as-good as Java-compile-time checks for the purposes of safety.
>>>>>>>>>
>>>>>>>>> I don’t think I agree with this. A failure at query analysis time
>>>>>>>>> vs runtime still requires going back to a separate project, fixing
>>>>>>>>> something, and rebuilding. The time needed to fix a problem goes up
>>>>>>>>> significantly vs. compile-time checks. And that is even worse if the 
>>>>>>>>> UDF is
>>>>>>>>> maintained by someone else.
>>>>>>>>>
>>>>>>>>> I think we also need to consider how common it would be that a use
>>>>>>>>> case can have the query-compile-time checks. Going through this in 
>>>>>>>>> more
>>>>>>>>> detail below makes me think that it is unlikely that these checks 
>>>>>>>>> would be
>>>>>>>>> used often because of the limitations of using an interface with type
>>>>>>>>> erasure.
>>>>>>>>>
>>>>>>>>> I believe that Wenchen’s proposal will provide stronger
>>>>>>>>> query-compile-time safety
>>>>>>>>>
>>>>>>>>> The proposal could have better safety for each argument, assuming
>>>>>>>>> that we detect failures by looking at the parameter types using 
>>>>>>>>> reflection
>>>>>>>>> in the analyzer. But we don’t do that for any of the similar UDFs 
>>>>>>>>> today so
>>>>>>>>> I’m skeptical that this would actually be a high enough priority to
>>>>>>>>> implement.
>>>>>>>>>
>>>>>>>>> As Erik pointed out, type erasure also limits the effectiveness.
>>>>>>>>> You can’t implement both ScalarFunction2<Integer, Integer> and
>>>>>>>>> ScalarFunction2<Long, Long>. You can handle those cases using
>>>>>>>>> InternalRow or you can
>>>>>>>>> handle them using VarargScalarFunction<Object>. That forces many
>>>>>>>>> use cases into varargs with Object, where you don’t get any of
>>>>>>>>> the proposed analyzer benefits and lose compile-time checks. The only 
>>>>>>>>> time
>>>>>>>>> the additional checks (if implemented) would help is when only one 
>>>>>>>>> set of
>>>>>>>>> argument types is needed because implementing ScalarFunction<Object,
>>>>>>>>> Object> defeats the purpose.
>>>>>>>>>
>>>>>>>>> It’s worth noting that safety for the magic methods would be
>>>>>>>>> identical between the two options, so the trade-off to consider is for
>>>>>>>>> varargs and non-codegen cases. Combining the limitations discussed, 
>>>>>>>>> this
>>>>>>>>> has better safety guarantees only if you need just one set of types 
>>>>>>>>> for
>>>>>>>>> each number of arguments and are using the non-codegen path. Since 
>>>>>>>>> varargs
>>>>>>>>> is one of the primary reasons to use this API, then I don’t think 
>>>>>>>>> that it
>>>>>>>>> is a good idea to use Object[] instead of InternalRow.
>>>>>>>>> --
>>>>>>>>> Ryan Blue
>>>>>>>>> Software Engineer
>>>>>>>>> Netflix
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Software Engineer
>>>>>>> Netflix
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>>>>>
>>>>
>>>
>>> --
>>> John Zhuge
>>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
