Okay, great. I'll update the SPIP doc and call a vote in the next day or two.
On Thu, Mar 4, 2021 at 8:26 AM Erik Krogen <xkro...@apache.org> wrote:

> +1 on Dongjoon's proposal. This is a very nice compromise between the reflective/magic-method approach and the InternalRow approach: it provides a lot of flexibility for our users and allows the more complicated reflection-based approach to evolve at its own pace, since you can always fall back to InternalRow for situations that aren't yet supported by reflection.
>
> We can even consider having Spark detect that you haven't overridden the default produceResult (IIRC this is discoverable via reflection) and raise an error at query analysis time, instead of at runtime, when it can't find a reflective method or an overridden produceResult.
>
> I'm very pleased we have found a compromise that everyone seems happy with! Big thanks to everyone who participated.
>
> On Wed, Mar 3, 2021 at 8:34 PM John Zhuge <jzh...@apache.org> wrote:
>
>> +1 Good plan to move forward.
>>
>> Thank you all for the constructive and comprehensive discussions in this thread! Decisions on this important feature will have ramifications for years to come.
>>
>> On Wed, Mar 3, 2021 at 7:42 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>>
>>> +1 to this proposal. If people don't like the ScalarFunction0, ScalarFunction1, ... variants and prefer the "magical methods", then we can have a single ScalarFunction interface that has the row-parameter API (with a default implementation that fails) plus documentation describing the "magical methods" (which can be done later).
>>>
>>> I'll start the PR review this week to check the naming, docs, etc.
>>>
>>> Thanks all for the discussion here, and let's move forward!
>>>
>>> On Thu, Mar 4, 2021 at 9:53 AM Ryan Blue <rb...@netflix.com> wrote:
>>>
>>>> Good point, Dongjoon. I think we can probably come to some compromise here:
>>>>
>>>> - Remove SupportsInvoke since it isn't really needed. We should always try to find the right method to invoke in the codegen path.
>>>> - Add a default implementation of produceResult so that implementations don't have to use it. If they don't implement it and a magic function can't be found, then it will throw UnsupportedOperationException.
>>>>
>>>> This is assuming that we can agree not to introduce all of the ScalarFunction interface variations, which would have limited utility because of type erasure.
>>>>
>>>> Does that sound like a good plan to everyone? If so, I'll update the SPIP doc so we can move forward.
>>>>
>>>> On Wed, Mar 3, 2021 at 4:36 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>>
>>>>> Hi, All.
>>>>>
>>>>> We shared many opinions from different perspectives. However, we didn't reach a consensus even on a partial merge that excludes something (on the PR by me, and on this mailing thread by Wenchen).
>>>>>
>>>>> For the following claim, we have an alternative that mitigates it:
>>>>>
>>>>> > I don't like it because it promotes the row-parameter API and forces users to implement it, even if the users want to only use the individual-parameters API.
>>>>>
>>>>> Why don't we merge the AS-IS PR by adding something instead of excluding something?
>>>>>
>>>>> - R produceResult(InternalRow input);
>>>>> + default R produceResult(InternalRow input) throws Exception {
>>>>> +   throw new UnsupportedOperationException();
>>>>> + }
>>>>>
>>>>> By providing the default implementation, it will not technically *force users to implement it*, and we can provide documentation about our expected usage. What do you think?
>>>>>
>>>>> Bests,
>>>>> Dongjoon.
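The shape of this compromise can be sketched in a few lines of Java. Everything below is a simplified stand-in for illustration only (Row abbreviates Spark's InternalRow, and the interface omits binding, naming, and codegen details); it is not the actual proposed API:

```java
// Hypothetical stand-ins for Spark's InternalRow and ScalarFunction,
// simplified for illustration; not the actual Spark API.
interface Row {
    long getLong(int ordinal);
}

interface ScalarFunction<R> {
    // The compromise: a default implementation, so UDFs that rely solely
    // on a "magic" method are not forced to implement the row-based API.
    // If Spark also fails to find a magic method, the call fails here at
    // runtime (or, per Erik's suggestion, at query analysis time).
    default R produceResult(Row input) {
        throw new UnsupportedOperationException(
            "Implement produceResult or provide a magic method");
    }
}

// A UDF that only provides the row-based API.
class RowBasedAdd implements ScalarFunction<Long> {
    @Override
    public Long produceResult(Row input) {
        return input.getLong(0) + input.getLong(1);
    }
}

// A UDF that relies on a magic method and keeps the default produceResult.
class MagicAdd implements ScalarFunction<Long> {
    // The codegen path would locate a method like this reflectively.
    public long invoke(long left, long right) {
        return left + right;
    }
}
```

The point of the default method is visible in MagicAdd: it compiles without touching the row-based API, and produceResult only fails if Spark could not find a magic method to invoke.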
>>>>>
>>>>> On Wed, Mar 3, 2021 at 10:28 AM Ryan Blue <rb...@netflix.com> wrote:
>>>>>
>>>>>> > Yes, GenericInternalRow is safe when types mismatch, with the cost of using Object[], and primitive types need to do boxing
>>>>>>
>>>>>> The question is not whether to use the magic functions, which would not need boxing. The question here is whether to use multiple ScalarFunction interfaces. Those interfaces would require boxing or using Object[], so there isn't a benefit.
>>>>>>
>>>>>> > If we do want to reuse one UDF for different types, using "magical methods" solves the problem
>>>>>>
>>>>>> Yes, that's correct. We agree that magic methods are a good option for this.
>>>>>>
>>>>>> Again, the question we need to decide is whether to use InternalRow or interfaces like ScalarFunction2 for non-codegen. The option to use multiple interfaces is limited by type erasure because you can only have one set of type parameters. If you wanted to support both ScalarFunction2<Integer, Integer> and ScalarFunction2<Long, Long>, you'd have to fall back to ScalarFunction2<Object, Object> and cast.
>>>>>>
>>>>>> The point is that type erasure will commonly lead either to many different implementation classes (one for each type combination) or to parameterizing by Object, which defeats the purpose.
>>>>>>
>>>>>> The alternative adds safety because correct types are produced by calls like getLong(0). Yes, this depends on the implementation making the correct calls, but it is better than using Object and casting.
>>>>>>
>>>>>> > I can't think of real use cases that will force the individual-parameters approach to use Object instead of concrete types.
>>>>>>
>>>>>> I think this is addressed by the type erasure discussion above. A simple Plus method would require Object or 12 implementations for 2 arguments and 4 numeric types.
>>>>>>
>>>>>> And basically all varargs cases would need to use Object[]. Consider a UDF to create a map that requires string keys and some consistent type for values. This would be easy with the InternalRow API because you can use getString(pos) and get(pos + 1, valueType) to get the key/value pairs. Use of UTF8String vs String will be checked at compile time.
>>>>>>
>>>>>> > I agree that Object[] is worse than InternalRow
>>>>>>
>>>>>> Yes, and if we are always using Object because of type erasure or using magic methods to get better performance, the utility of the parameterized interfaces is very limited.
>>>>>>
>>>>>> Because we want to expose the magic functions, the use of ScalarFunction2 and similar is extremely limited because it is only for non-codegen. Varargs is by far the more common case. The InternalRow interface is also a very simple way to get started and ensures that Spark can always find the right method after the function is bound to input types.
>>>>>>
>>>>>> On Tue, Mar 2, 2021 at 6:35 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>>>
>>>>>>> Yes, GenericInternalRow is safe when types mismatch, with the cost of using Object[], and primitive types need to do boxing. And this is a runtime failure, which is absolutely worse than query-compile-time checks. Also, don't forget my previous point: users need to specify the type and index, such as row.getLong(0), which is error-prone.
>>>>>>>
>>>>>>> > But we don't do that for any of the similar UDFs today so I'm skeptical that this would actually be a high enough priority to implement.
>>>>>>>
>>>>>>> I'd say this is a must-have if we go with the individual-parameters approach. The Scala UDF today checks the method signature at compile time, thanks to the Scala type tag.
>>>>>>> The Java UDF today doesn't check and is hard to use.
>>>>>>>
>>>>>>> > You can't implement ScalarFunction2<Integer, Integer> and ScalarFunction2<Long, Long>.
>>>>>>>
>>>>>>> Can you elaborate? We have function binding and we can use different UDFs for different input types. If we do want to reuse one UDF for different types, using "magical methods" solves the problem:
>>>>>>>
>>>>>>> class MyUDF {
>>>>>>>   def call(i: Int): Int = ...
>>>>>>>   def call(l: Long): Long = ...
>>>>>>> }
>>>>>>>
>>>>>>> On the other side, I don't think the row-parameter approach can solve this problem. The best I can think of is:
>>>>>>>
>>>>>>> class MyUDF extends ScalarFunction[Object] {
>>>>>>>   def call(row: InternalRow): Object = {
>>>>>>>     if (int input) row.getInt(0) ... else row.getLong(0) ...
>>>>>>>   }
>>>>>>> }
>>>>>>>
>>>>>>> This is worse because: 1) it needs if-else branches to check the different input types, and 2) the return type can only be Object, which causes boxing issues.
>>>>>>>
>>>>>>> I agree that Object[] is worse than InternalRow. But I can't think of real use cases that will force the individual-parameters approach to use Object instead of concrete types.
>>>>>>>
>>>>>>> On Tue, Mar 2, 2021 at 3:36 AM Ryan Blue <rb...@netflix.com> wrote:
>>>>>>>
>>>>>>>> Thanks for adding your perspective, Erik!
>>>>>>>>
>>>>>>>> > If the input is string type but the UDF implementation calls row.getLong(0), it returns wrong data
>>>>>>>>
>>>>>>>> I think this is misleading. It is true for UnsafeRow, but there is no reason why InternalRow should return incorrect values.
>>>>>>>>
>>>>>>>> The implementation in GenericInternalRow <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/rows.scala#L35> would throw a ClassCastException. I don't think that using a row is a bad option simply because UnsafeRow is unsafe.
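Wenchen's overloaded-call example above can also be written as a runnable Java sketch (the class and method names are illustrative only, not the proposed API): overload resolution gives each primitive type its own unboxed signature, and the codegen path would pick the overload matching the bound input types.

```java
// Illustrative sketch only: one UDF class serving both int and long
// inputs via overloaded "magic" methods, each with a primitive
// (unboxed) signature, so no boxing is involved on either path.
class IncrementUdf {
    int call(int i) { return i + 1; }
    long call(long l) { return l + 1L; }
}
```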
>>>>>>>>
>>>>>>>> It's unlikely that UnsafeRow would be used to pass the data. The implementation would evaluate each argument expression and set the result in a generic row, then pass that row to the UDF. We can use whatever implementation we choose to provide better guarantees than unsafe.
>>>>>>>>
>>>>>>>> > I think we should consider query-compile-time checks as nearly as good as Java-compile-time checks for the purposes of safety.
>>>>>>>>
>>>>>>>> I don't think I agree with this. A failure at query analysis time vs. runtime still requires going back to a separate project, fixing something, and rebuilding. The time needed to fix a problem goes up significantly vs. compile-time checks. And that is even worse if the UDF is maintained by someone else.
>>>>>>>>
>>>>>>>> I think we also need to consider how common it would be that a use case can have the query-compile-time checks. Going through this in more detail below makes me think that it is unlikely that these checks would be used often, because of the limitations of using an interface with type erasure.
>>>>>>>>
>>>>>>>> > I believe that Wenchen's proposal will provide stronger query-compile-time safety
>>>>>>>>
>>>>>>>> The proposal could have better safety for each argument, assuming that we detect failures by looking at the parameter types using reflection in the analyzer. But we don't do that for any of the similar UDFs today, so I'm skeptical that this would actually be a high enough priority to implement.
>>>>>>>>
>>>>>>>> As Erik pointed out, type erasure also limits the effectiveness. You can't implement ScalarFunction2<Integer, Integer> and ScalarFunction2<Long, Long>. You can handle those cases using InternalRow, or you can handle them using VarargScalarFunction<Object>. That forces many use cases into varargs with Object, where you don't get any of the proposed analyzer benefits and lose compile-time checks. The only time the additional checks (if implemented) would help is when only one set of argument types is needed, because implementing ScalarFunction<Object, Object> defeats the purpose.
>>>>>>>>
>>>>>>>> It's worth noting that safety for the magic methods would be identical between the two options, so the trade-off to consider is for the varargs and non-codegen cases. Combining the limitations discussed, this has better safety guarantees only if you need just one set of types for each number of arguments and are using the non-codegen path. Since varargs is one of the primary reasons to use this API, I don't think that it is a good idea to use Object[] instead of InternalRow.
>>>>>>>>
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>> Software Engineer
>>>>>>>> Netflix
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>
>> --
>> John Zhuge

--
Ryan Blue
Software Engineer
Netflix
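As a closing illustration of the type-erasure point argued throughout the thread, here is a hedged Java sketch; ScalarFunction2 below is a hypothetical stand-in mirroring the proposed interface, not Spark's actual definition:

```java
// Hypothetical interface mirroring the proposed ScalarFunction2;
// illustration only, not the actual Spark API.
interface ScalarFunction2<A, R> {
    R produceResult(A left, A right);
}

// Will NOT compile: both parameterizations erase to the same raw
// interface, so a single class cannot implement them together.
//
//   class Plus implements ScalarFunction2<Integer, Integer>,
//                         ScalarFunction2<Long, Long> { ... }

// The only single-class fallback erases to Object and must cast,
// which gives up exactly the compile-time safety the
// individual-parameters approach is meant to provide.
class Plus implements ScalarFunction2<Object, Object> {
    @Override
    public Object produceResult(Object left, Object right) {
        if (left instanceof Integer) {
            return (Integer) left + (Integer) right;  // result is boxed
        }
        return (Long) left + (Long) right;
    }
}
```

A wrong cast here surfaces only at runtime as a ClassCastException, which is the safety trade-off weighed against InternalRow's typed getters above.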