+1 on Dongjoon's proposal. Great to see this moving forward, and thanks everyone for the insightful discussion!
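For anyone catching up, here is roughly how I read the compromise: keep a single ScalarFunction interface whose row-parameter method has a failing default, and let implementations provide an individual-parameters "magic" method instead. This is only a sketch based on my reading of the SPIP discussion; the interface shape and the magic-method name are not a finalized API. (A second sketch, on the type-erasure point, is at the very bottom of this mail, below the quoted thread.)

    import org.apache.spark.sql.catalyst.InternalRow;

    // Sketch only: a single interface, with a failing default for the
    // row-parameter API, as Dongjoon suggests below.
    interface ScalarFunction<R> {
      default R produceResult(InternalRow input) throws Exception {
        throw new UnsupportedOperationException(
            getClass().getName() + " does not implement produceResult");
      }
    }

    // A UDF that relies only on the individual-parameters "magic" method.
    // The method name and the reflective lookup are assumptions from the
    // discussion, not a finalized contract. Spark would locate and invoke
    // this directly (ideally via codegen) and only fall back to
    // produceResult when no matching method is found.
    class LongAdd implements ScalarFunction<Long> {
      public long invoke(long left, long right) {
        return left + right;
      }
    }

If Spark can additionally detect, as Erik suggests, that neither a magic method nor an overridden produceResult exists, the error could surface at analysis time rather than at execution.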
On Thu, Mar 4, 2021 at 8:58 AM Ryan Blue <rb...@netflix.com> wrote:

> Okay, great. I'll update the SPIP doc and call a vote in the next day or two.
>
> On Thu, Mar 4, 2021 at 8:26 AM Erik Krogen <xkro...@apache.org> wrote:
>
>> +1 on Dongjoon's proposal. This is a very nice compromise between the reflective/magic-method approach and the InternalRow approach, providing a lot of flexibility for our users and allowing the more complicated reflection-based approach to evolve at its own pace, since you can always fall back to InternalRow for situations that aren't yet supported by reflection.
>>
>> We can even consider having Spark detect that you haven't overridden the default produceResult (IIRC this is discoverable via reflection) and raise an error at query analysis time, instead of at runtime when it can't find a reflective method or an overridden produceResult.
>>
>> I'm very pleased we have found a compromise that everyone seems happy with! Big thanks to everyone who participated.
>>
>> On Wed, Mar 3, 2021 at 8:34 PM John Zhuge <jzh...@apache.org> wrote:
>>
>>> +1 Good plan to move forward.
>>>
>>> Thank you all for the constructive and comprehensive discussions in this thread! Decisions on this important feature will have ramifications for years to come.
>>>
>>> On Wed, Mar 3, 2021 at 7:42 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>>>
>>>> +1 to this proposal. If people don't like the ScalarFunction0, ScalarFunction1, ... variants and prefer the "magical methods", then we can have a single ScalarFunction interface that has the row-parameter API (with a default implementation that fails) and documentation describing the "magical methods" (which can be done later).
>>>>
>>>> I'll start the PR review this week to check the naming, docs, etc.
>>>>
>>>> Thanks all for the discussion here, and let's move forward!
>>>>
>>>> On Thu, Mar 4, 2021 at 9:53 AM Ryan Blue <rb...@netflix.com> wrote:
>>>>
>>>>> Good point, Dongjoon. I think we can probably come to some compromise here:
>>>>>
>>>>> - Remove SupportsInvoke since it isn't really needed. We should always try to find the right method to invoke in the codegen path.
>>>>> - Add a default implementation of produceResult so that implementations don't have to use it. If they don't implement it and a magic function can't be found, then it will throw UnsupportedOperationException.
>>>>>
>>>>> This is assuming that we can agree not to introduce all of the ScalarFunction interface variations, which would have limited utility because of type erasure.
>>>>>
>>>>> Does that sound like a good plan to everyone? If so, I'll update the SPIP doc so we can move forward.
>>>>>
>>>>> On Wed, Mar 3, 2021 at 4:36 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>>>
>>>>>> Hi, All.
>>>>>>
>>>>>> We shared many opinions from different perspectives. However, we didn't reach a consensus even on a partial merge that excludes something (on the PR by me, on this mailing thread by Wenchen).
>>>>>>
>>>>>> For the following claim, we have another alternative to mitigate it:
>>>>>>
>>>>>> > I don't like it because it promotes the row-parameter API and forces users to implement it, even if the users want to only use the individual-parameters API.
>>>>>>
>>>>>> Why don't we merge the AS-IS PR by adding something instead of excluding something?
>>>>>>
>>>>>>   - R produceResult(InternalRow input);
>>>>>>   + default R produceResult(InternalRow input) throws Exception {
>>>>>>   +   throw new UnsupportedOperationException();
>>>>>>   + }
>>>>>>
>>>>>> By providing the default implementation, it will not *force users to implement it*, technically. And we can properly document the expected usage. What do you think?
>>>>>>
>>>>>> Bests,
>>>>>> Dongjoon.
>>>>>>
>>>>>> On Wed, Mar 3, 2021 at 10:28 AM Ryan Blue <rb...@netflix.com> wrote:
>>>>>>
>>>>>>> > Yes, GenericInternalRow is safe when types mismatch, with the cost of using Object[], and primitive types need to do boxing.
>>>>>>>
>>>>>>> The question is not whether to use the magic functions, which would not need boxing. The question here is whether to use multiple ScalarFunction interfaces. Those interfaces would require boxing or using Object[], so there isn't a benefit.
>>>>>>>
>>>>>>> > If we do want to reuse one UDF for different types, using "magical methods" solves the problem.
>>>>>>>
>>>>>>> Yes, that's correct. We agree that magic methods are a good option for this.
>>>>>>>
>>>>>>> Again, the question we need to decide is whether to use InternalRow or interfaces like ScalarFunction2 for non-codegen. The option to use multiple interfaces is limited by type erasure because you can only have one set of type parameters. If you wanted to support both ScalarFunction2<Integer, Integer> and ScalarFunction2<Long, Long>, you'd have to fall back to ScalarFunction2<Object, Object> and cast.
>>>>>>>
>>>>>>> The point is that type erasure will commonly lead either to many different implementation classes (one for each type combination) or to parameterizing by Object, which defeats the purpose.
>>>>>>>
>>>>>>> The alternative adds safety because correct types are produced by calls like getLong(0). Yes, this depends on the implementation making the correct calls, but it is better than using Object and casting.
>>>>>>>
>>>>>>> > I can't think of real use cases that will force the individual-parameters approach to use Object instead of concrete types.
>>>>>>>
>>>>>>> I think this is addressed by the type erasure discussion above. A simple Plus method would require Object or 12 implementations for 2 arguments and 4 numeric types.
>>>>>>>
>>>>>>> And basically all varargs cases would need to use Object[]. Consider a UDF to create a map that requires string keys and some consistent type for values. This would be easy with the InternalRow API because you can use getString(pos) and get(pos + 1, valueType) to get the key/value pairs. Use of UTF8String vs String will be checked at compile time.
>>>>>>>
>>>>>>> > I agree that Object[] is worse than InternalRow.
>>>>>>>
>>>>>>> Yes, and if we are always using Object because of type erasure or using magic methods to get better performance, the utility of the parameterized interfaces is very limited.
>>>>>>>
>>>>>>> Because we want to expose the magic functions, the use of ScalarFunction2 and similar is extremely limited because it is only for non-codegen. Varargs is by far the more common case.
>>>>>>> The InternalRow interface is also a very simple way to get started and ensures that Spark can always find the right method after the function is bound to input types.
>>>>>>>
>>>>>>> On Tue, Mar 2, 2021 at 6:35 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Yes, GenericInternalRow is safe when types mismatch, with the cost of using Object[], and primitive types need to do boxing. And this is a runtime failure, which is absolutely worse than query-compile-time checks. Also, don't forget my previous point: users need to specify the type and index, such as row.getLong(0), which is error-prone.
>>>>>>>>
>>>>>>>> > But we don't do that for any of the similar UDFs today so I'm skeptical that this would actually be a high enough priority to implement.
>>>>>>>>
>>>>>>>> I'd say this is a must-have if we go with the individual-parameters approach. The Scala UDF today checks the method signature at compile time, thanks to the Scala type tag. The Java UDF today doesn't check and is hard to use.
>>>>>>>>
>>>>>>>> > You can't implement ScalarFunction2<Integer, Integer> and ScalarFunction2<Long, Long>.
>>>>>>>>
>>>>>>>> Can you elaborate? We have function binding and we can use different UDFs for different input types. If we do want to reuse one UDF for different types, using "magical methods" solves the problem:
>>>>>>>>
>>>>>>>>   class MyUDF {
>>>>>>>>     def call(i: Int): Int = ...
>>>>>>>>     def call(l: Long): Long = ...
>>>>>>>>   }
>>>>>>>>
>>>>>>>> On the other side, I don't think the row-parameter approach can solve this problem. The best I can think of is:
>>>>>>>>
>>>>>>>>   class MyUDF extends ScalarFunction[Object] {
>>>>>>>>     def call(row: InternalRow): Object = {
>>>>>>>>       if (int input) row.getInt(0) ... else row.getLong(0) ...
>>>>>>>>     }
>>>>>>>>   }
>>>>>>>>
>>>>>>>> This is worse because: 1) it needs an if-else to check the different input types; 2) the return type can only be Object, which causes boxing issues.
>>>>>>>>
>>>>>>>> I agree that Object[] is worse than InternalRow. But I can't think of real use cases that will force the individual-parameters approach to use Object instead of concrete types.
>>>>>>>>
>>>>>>>> On Tue, Mar 2, 2021 at 3:36 AM Ryan Blue <rb...@netflix.com> wrote:
>>>>>>>>
>>>>>>>>> Thanks for adding your perspective, Erik!
>>>>>>>>>
>>>>>>>>> > If the input is string type but the UDF implementation calls row.getLong(0), it returns wrong data.
>>>>>>>>>
>>>>>>>>> I think this is misleading. It is true for UnsafeRow, but there is no reason why InternalRow should return incorrect values.
>>>>>>>>>
>>>>>>>>> The implementation in GenericInternalRow <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/rows.scala#L35> would throw a ClassCastException. I don't think that using a row is a bad option simply because UnsafeRow is unsafe.
>>>>>>>>>
>>>>>>>>> It's unlikely that UnsafeRow would be used to pass the data. The implementation would evaluate each argument expression and set the result in a generic row, then pass that row to the UDF. We can use whatever implementation we choose to provide better guarantees than unsafe.
>>>>>>>>> > I think we should consider query-compile-time checks as nearly as good as Java-compile-time checks for the purposes of safety.
>>>>>>>>>
>>>>>>>>> I don't think I agree with this. A failure at query analysis time, just as at runtime, still requires going back to a separate project, fixing something, and rebuilding. The time needed to fix a problem goes up significantly vs. compile-time checks. And that is even worse if the UDF is maintained by someone else.
>>>>>>>>>
>>>>>>>>> I think we also need to consider how common it would be that a use case can have the query-compile-time checks. Going through this in more detail below makes me think that it is unlikely that these checks would be used often, because of the limitations of using an interface with type erasure.
>>>>>>>>>
>>>>>>>>> > I believe that Wenchen's proposal will provide stronger query-compile-time safety.
>>>>>>>>>
>>>>>>>>> The proposal could have better safety for each argument, assuming that we detect failures by looking at the parameter types using reflection in the analyzer. But we don't do that for any of the similar UDFs today, so I'm skeptical that this would actually be a high enough priority to implement.
>>>>>>>>>
>>>>>>>>> As Erik pointed out, type erasure also limits the effectiveness. You can't implement both ScalarFunction2<Integer, Integer> and ScalarFunction2<Long, Long>. You can handle those cases using InternalRow, or you can handle them using VarargScalarFunction<Object>. That forces many use cases into varargs with Object, where you don't get any of the proposed analyzer benefits and lose compile-time checks. The only time the additional checks (if implemented) would help is when only one set of argument types is needed, because implementing ScalarFunction2<Object, Object> defeats the purpose.
>>>>>>>>>
>>>>>>>>> It's worth noting that safety for the magic methods would be identical between the two options, so the trade-off to consider is for varargs and non-codegen cases. Combining the limitations discussed, this has better safety guarantees only if you need just one set of types for each number of arguments and are using the non-codegen path. Since varargs is one of the primary reasons to use this API, I don't think that it is a good idea to use Object[] instead of InternalRow.
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Ryan Blue
>>>>>>>>> Software Engineer
>>>>>>>>> Netflix
>>>>>>>
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Software Engineer
>>>>>>> Netflix
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>>>
>>> --
>>> John Zhuge
>
> --
> Ryan Blue
> Software Engineer
> Netflix
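One more sketch, as mentioned at the top. The type-erasure limitation Ryan and Erik describe can be made concrete with a hypothetical ScalarFunction2 interface (my approximation of the earlier multi-interface proposal, not an actual Spark API; the method name is illustrative): a single class cannot implement it twice with different type arguments, so a UDF that wants to handle both int and long inputs ends up parameterized by Object and casting.

    // Hypothetical typed interface from the earlier proposal; the exact
    // shape and method name are illustrative only.
    interface ScalarFunction2<T1, T2> {
      Object produceResult(T1 left, T2 right);
    }

    // After erasure, ScalarFunction2<Integer, Integer> and
    // ScalarFunction2<Long, Long> are the same interface, so one class
    // cannot implement both. The single-class workaround is Object
    // parameters plus instanceof checks and casts, which gives up the
    // compile-time checks the typed interfaces were meant to provide and
    // boxes every argument and result.
    class Plus implements ScalarFunction2<Object, Object> {
      @Override
      public Object produceResult(Object left, Object right) {
        if (left instanceof Integer && right instanceof Integer) {
          return (Integer) left + (Integer) right;
        }
        return ((Number) left).longValue() + ((Number) right).longValue();
      }
    }

This is the case where the magic methods (or the row-parameter API) avoid both the class explosion and the casts.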