Thanks for adding your perspective, Erik!

> If the input is string type but the UDF implementation calls
> row.getLong(0), it returns wrong data

I think this is misleading. It is true for UnsafeRow, but there is no
reason why InternalRow should return incorrect values.

The implementation in GenericInternalRow
<https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/rows.scala#L35>
would throw a ClassCastException. I don’t think that using a row is a bad
option simply because UnsafeRow is unsafe.
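To illustrate the difference, here is a minimal stand-in for the boxed-row behavior (a sketch, not Spark's actual class): values live in an Array[Any], so a mismatched getter fails fast with a ClassCastException rather than silently reinterpreting bytes the way UnsafeRow would.

```scala
// Sketch of a GenericInternalRow-style boxed row (simplified, not Spark's code).
// A typed getter on the wrong column type throws ClassCastException at the
// call site instead of returning garbage data.
class SketchGenericRow(values: Array[Any]) {
  def getLong(ordinal: Int): Long = values(ordinal).asInstanceOf[Long]
  def getString(ordinal: Int): String = values(ordinal).asInstanceOf[String]
}

val row = new SketchGenericRow(Array[Any]("some string"))

val failedFast =
  try { row.getLong(0); false }
  catch { case _: ClassCastException => true }

println(s"getLong on a string column threw ClassCastException: $failedFast")
```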

It’s unlikely that UnsafeRow would be used to pass the data. The
implementation would evaluate each argument expression and set the result
in a generic row, then pass that row to the UDF. We can use whatever
implementation we choose to provide better guarantees than unsafe.
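The call path described above can be sketched roughly as follows (all names here are hypothetical, not Spark's actual classes): evaluate each argument expression, copy the results into a generic boxed row, and hand that row to the UDF, so UnsafeRow never appears in this path.

```scala
// Hypothetical sketch of the non-codegen invocation path: arguments are
// evaluated into a boxed generic row, which is then passed to the UDF.
trait Expr { def eval(): Any }
case class Literal(value: Any) extends Expr { def eval(): Any = value }

class GenericRow(val values: Array[Any]) {
  def getInt(ordinal: Int): Int = values(ordinal).asInstanceOf[Int]
}

def invokeUdf(args: Seq[Expr], udf: GenericRow => Any): Any = {
  // Evaluate each argument expression and set the result in a generic row.
  val row = new GenericRow(args.map(_.eval()).toArray)
  udf(row)
}

val result = invokeUdf(Seq(Literal(1), Literal(2)),
  row => row.getInt(0) + row.getInt(1))
println(result)  // 3
```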

> I think we should consider query-compile-time checks as nearly-as-good as
> Java-compile-time checks for the purposes of safety.

I don’t think I agree with this. A failure at query analysis time vs.
runtime still requires going back to a separate project, fixing something,
and rebuilding. The time needed to fix a problem goes up significantly
compared to compile-time checks. And that is even worse if the UDF is
maintained by someone else.

I think we also need to consider how common it would be that a use case can
have the query-compile-time checks. Going through this in more detail below
makes me think that it is unlikely that these checks would be used often
because of the limitations of using an interface with type erasure.

> I believe that Wenchen’s proposal will provide stronger query-compile-time
> safety

The proposal could provide better safety for each argument, assuming that
we detect failures by looking at the parameter types using reflection in
the analyzer. But we don’t do that for any of the similar UDFs today, so
I’m skeptical that this would actually be a high enough priority to
implement.

As Erik pointed out, type erasure also limits the effectiveness. You can’t
implement ScalarFunction2<Integer, Integer> and ScalarFunction2<Long, Long>.
You can handle those cases using InternalRow or you can handle them using
VarargScalarFunction<Object>. That forces many use cases into varargs with
Object, where you don’t get any of the proposed analyzer benefits and lose
compile-time checks. The only time the additional checks (if implemented)
would help is when only one set of argument types is needed, because
implementing ScalarFunction<Object, Object> defeats the purpose.
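The erasure problem can be shown with a small sketch (the interface names below are hypothetical stand-ins for the proposed API, not the actual Spark interfaces): both parameterizations erase to the same trait, so one class cannot implement both, and the escape hatch is an Object-based varargs form that discards per-argument types.

```scala
// Hypothetical stand-in for the proposed generic UDF interface.
trait ScalarFunction2[A, B] { def produceResult(a: A, b: B): Any }

// This does not compile, because both parameterizations have the same erasure:
// class AddEither extends ScalarFunction2[java.lang.Integer, java.lang.Integer]
//   with ScalarFunction2[java.lang.Long, java.lang.Long]

// The Object-based varargs fallback compiles, but loses all per-argument
// type information, so the analyzer can no longer check argument types.
class AddAnyNumbers {
  def produceResult(args: Array[Object]): Any =
    args.map(_.asInstanceOf[Number].longValue).sum
}

val add = new AddAnyNumbers
println(add.produceResult(Array[Object](Int.box(1), Long.box(2))))  // 3
```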

It’s worth noting that safety for the magic methods would be identical
between the two options, so the trade-off to consider is for varargs and
non-codegen cases. Combining the limitations discussed, the proposal has
better safety guarantees only if you need just one set of types for each
number of arguments and are using the non-codegen path. Since varargs is
one of the primary reasons to use this API, I don’t think that it is a
good idea to use Object[] instead of InternalRow.
-- 
Ryan Blue
Software Engineer
Netflix
