Yes, GenericInternalRow is safe when types mismatch, at the cost of
using Object[], and primitive types need to be boxed

The question is not whether to use the magic functions, which would not
need boxing. The question here is whether to use multiple ScalarFunction
interfaces. Those interfaces would require boxing or using Object[], so
there isn’t a benefit over InternalRow.
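
To make the boxing point concrete, here’s a rough sketch. The
ScalarFunction2 trait below is just a stand-in for the proposed interface,
not the actual API:

// Stand-in for the proposed parameterized interface (not the real API).
trait ScalarFunction2[A1, A2, R] {
  def call(a1: A1, a2: A2): R
}

// With generic type parameters, erasure turns call into
// call(Object, Object): Object, so every argument and result is boxed.
class LongAdd extends ScalarFunction2[java.lang.Long, java.lang.Long, java.lang.Long] {
  override def call(a1: java.lang.Long, a2: java.lang.Long): java.lang.Long =
    java.lang.Long.valueOf(a1.longValue + a2.longValue)
}

// A magic method is resolved by its exact signature after binding, so it can
// use primitives end to end with no boxing at all.
class LongAddMagic {
  def invoke(a1: Long, a2: Long): Long = a1 + a2
}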

If we do want to reuse one UDF for different types, using “magical methods”
solves the problem.

Yes, that’s correct. We agree that magic methods are a good option for this.

Again, the question we need to decide is whether to use InternalRow or
interfaces like ScalarFunction2 for the non-codegen path. The option to use
multiple interfaces is limited by type erasure, because you can only have one
set of type parameters. If you wanted to support both ScalarFunction2<Integer,
Integer> and ScalarFunction2<Long, Long>, you’d have to fall back to
ScalarFunction2<Object, Object> and cast.
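
Here is what that fallback looks like, reusing the same stand-in trait. The
two typed implementations can’t coexist in one class, so you end up with the
Object version and runtime casts:

trait ScalarFunction2[A1, A2, R] { def call(a1: A1, a2: A2): R } // stand-in, as above

// Will not compile: after erasure both parents are the same trait, so a class
// cannot mix in ScalarFunction2[Integer, Integer, Integer] and
// ScalarFunction2[java.lang.Long, java.lang.Long, java.lang.Long] together.

// The fallback is to parameterize by Object and cast at runtime:
class MyUDF extends ScalarFunction2[Object, Object, Object] {
  override def call(a1: Object, a2: Object): Object = (a1, a2) match {
    case (i1: Integer, i2: Integer) =>
      Integer.valueOf(i1.intValue + i2.intValue)
    case (l1: java.lang.Long, l2: java.lang.Long) =>
      java.lang.Long.valueOf(l1.longValue + l2.longValue)
    case _ => throw new IllegalArgumentException("unexpected argument types")
  }
}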

The point is that type erasure will commonly lead either to many different
implementation classes (one for each type combination) or to parameterizing
by Object, which defeats the purpose.

The InternalRow alternative adds safety because correct types are produced
by calls like getLong(0). Yes, this depends on the implementation making the
correct calls, but it is better than using Object and casting.
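
For example, a minimal sketch of the InternalRow version, assuming a
produceResult entry point along the lines of the proposal:

import org.apache.spark.sql.catalyst.InternalRow

class LongAddRow {
  // After binding to (bigint, bigint) input, the typed getters return
  // primitives directly: no Object[], no boxing, no casting in user code.
  def produceResult(input: InternalRow): Long =
    input.getLong(0) + input.getLong(1)
}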

I can’t think of real use cases that will force the individual-parameters
approach to use Object instead of concrete types.

I think this is addressed by the type erasure discussion above. A simple
Plus method would require Object or 12 implementations for 2 arguments and
4 numeric types.
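
Spelled out with the same stand-in trait, and showing only two of the
combinations:

trait ScalarFunction2[A1, A2, R] { def call(a1: A1, a2: A2): R } // stand-in

class PlusIntInt extends ScalarFunction2[Integer, Integer, Integer] {
  override def call(a1: Integer, a2: Integer): Integer =
    Integer.valueOf(a1.intValue + a2.intValue)
}

class PlusLongLong extends ScalarFunction2[java.lang.Long, java.lang.Long, java.lang.Long] {
  override def call(a1: java.lang.Long, a2: java.lang.Long): java.lang.Long =
    java.lang.Long.valueOf(a1.longValue + a2.longValue)
}

// ...and so on for the remaining numeric combinations, or a single
// ScalarFunction2[Object, Object, Object] that casts.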

And basically all varargs cases would need to use Object[] with the
interface approach. Consider a UDF to create a map that requires string keys
and some consistent type for values. This would be easy with the InternalRow
API because you can use getString(pos) and get(pos + 1, valueType) to read
the key/value pairs, and the use of UTF8String vs String is checked at
compile time.
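
A rough sketch of that map-building UDF, assuming the produceResult entry
point and that the value type was captured when the function was bound; the
plain Scala map return type is a simplification of whatever internal map
representation would actually be produced:

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.types.DataType

class MapBuilder(valueType: DataType) {
  // Varargs input of alternating (string key, value) pairs.
  def produceResult(input: InternalRow): Map[String, Any] = {
    val pairs = for (pos <- 0 until input.numFields by 2) yield {
      // getString returns a Java String; if the function needed UTF8String
      // instead, getUTF8String's return type makes that visible at compile time.
      input.getString(pos) -> input.get(pos + 1, valueType)
    }
    pairs.toMap
  }
}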

I agree that Object[] is worse than InternalRow.

Yes, and if we are always using Object because of type erasure or using
magic methods to get better performance, the utility of the parameterized
interfaces is very limited.

Because we want to expose the magic functions, the use of ScalarFunction2
and similar interfaces is extremely limited: they apply only to the
non-codegen path. Varargs is by far the more common case. The InternalRow
interface is also a very simple way to get started and ensures that Spark
can always find the right method after the function is bound to input types.
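
Putting it together, this is roughly the shape I have in mind. The method
names reflect the proposal as I understand it, not a final API:

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.unsafe.types.UTF8String

class StrLen {
  // Non-codegen path: one entry point that Spark can always find and call
  // once the function is bound to its input types, for any number of arguments.
  def produceResult(input: InternalRow): Int = input.getUTF8String(0).numChars()

  // Optional magic method for better performance: resolved by exact signature,
  // so there is no InternalRow and no boxing.
  def invoke(str: UTF8String): Int = str.numChars()
}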

On Tue, Mar 2, 2021 at 6:35 AM Wenchen Fan <cloud0...@gmail.com> wrote:

> Yes, GenericInternalRow is safe when types mismatch, at the cost of
> using Object[], and primitive types need to be boxed. And this is a
> runtime failure, which is absolutely worse than query-compile-time checks.
> Also, don't forget my previous point: users need to specify the type and
> index such as row.getLong(0), which is error-prone.
>
> > But we don’t do that for any of the similar UDFs today so I’m skeptical
> > that this would actually be a high enough priority to implement.
>
> I'd say this is a must-have if we go with the individual-parameters
> approach. The Scala UDF today checks the method signature at compile-time,
> thanks to the Scala type tag. The Java UDF today doesn't check and is hard
> to use.
>
> > You can’t implement ScalarFunction2<Integer, Integer> and
> > ScalarFunction2<Long, Long>.
>
> Can you elaborate? We have function binding and we can use different UDFs
> for different input types. If we do want to reuse one UDF
> for different types, using "magical methods" solves the problem:
> class MyUDF {
>   def call(i: Int): Int = ...
>   def call(l: Long): Long = ...
> }
>
> On the other side, I don't think the row-parameter approach can solve this
> problem. The best I can think of is:
> class MyUDF extends ScalarFunction[Object] {
>   def call(row: InternalRow): Object = {
>     if (int input) row.getInt(0) ... else row.getLong(0) ...
>   }
> }
>
> This is worse because: 1) it needs to do if-else to check different input
> types. 2) the return type can only be Object and cause boxing issues.
>
> I agree that Object[] is worse than InternalRow. But I can't think of
> real use cases that will force the individual-parameters approach to use
> Object instead of concrete types.
>
>
> On Tue, Mar 2, 2021 at 3:36 AM Ryan Blue <rb...@netflix.com> wrote:
>
>> Thanks for adding your perspective, Erik!
>>
>> If the input is string type but the UDF implementation calls
>> row.getLong(0), it returns wrong data
>>
>> I think this is misleading. It is true for UnsafeRow, but there is no
>> reason why InternalRow should return incorrect values.
>>
>> The implementation in GenericInternalRow
>> <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/rows.scala#L35>
>> would throw a ClassCastException. I don’t think that using a row is a
>> bad option simply because UnsafeRow is unsafe.
>>
>> It’s unlikely that UnsafeRow would be used to pass the data. The
>> implementation would evaluate each argument expression and set the result
>> in a generic row, then pass that row to the UDF. We can use whatever
>> implementation we choose to provide better guarantees than unsafe.
>>
>> I think we should consider query-compile-time checks as nearly-as-good as
>> Java-compile-time checks for the purposes of safety.
>>
>> I don’t think I agree with this. A failure at query analysis time vs
>> runtime still requires going back to a separate project, fixing something,
>> and rebuilding. The time needed to fix a problem goes up significantly vs.
>> compile-time checks. And that is even worse if the UDF is maintained by
>> someone else.
>>
>> I think we also need to consider how common it would be that a use case
>> can have the query-compile-time checks. Going through this in more detail
>> below makes me think that it is unlikely that these checks would be used
>> often because of the limitations of using an interface with type erasure.
>>
>> I believe that Wenchen’s proposal will provide stronger
>> query-compile-time safety
>>
>> The proposal could have better safety for each argument, assuming that we
>> detect failures by looking at the parameter types using reflection in the
>> analyzer. But we don’t do that for any of the similar UDFs today so I’m
>> skeptical that this would actually be a high enough priority to implement.
>>
>> As Erik pointed out, type erasure also limits the effectiveness. You
>> can’t implement ScalarFunction2<Integer, Integer> and ScalarFunction2<Long,
>> Long>. You can handle those cases using InternalRow or you can handle
>> them using VarargScalarFunction<Object>. That forces many use cases into
>> varargs with Object, where you don’t get any of the proposed analyzer
>> benefits and lose compile-time checks. The only time the additional checks
>> (if implemented) would help is when only one set of argument types is
>> needed because implementing ScalarFunction<Object, Object> defeats the
>> purpose.
>>
>> It’s worth noting that safety for the magic methods would be identical
>> between the two options, so the trade-off to consider is for varargs and
>> non-codegen cases. Combining the limitations discussed, this has better
>> safety guarantees only if you need just one set of types for each number of
>> arguments and are using the non-codegen path. Since varargs is one of the
>> primary reasons to use this API, then I don’t think that it is a good idea
>> to use Object[] instead of InternalRow.
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>

-- 
Ryan Blue
Software Engineer
Netflix
