Thanks for adding your perspective, Erik!

> If the input is string type but the UDF implementation calls row.getLong(0), it returns wrong data
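A minimal sketch of how a row backed by Object[] behaves in this situation (a hand-rolled stand-in for illustration, not Spark's actual GenericInternalRow class):

```java
// Simplified stand-in for a generic row backed by Object[]
// (similar in spirit to Spark's GenericInternalRow, but not the real class).
public class SimpleGenericRow {
    private final Object[] values;

    public SimpleGenericRow(Object... values) {
        this.values = values;
    }

    public long getLong(int ordinal) {
        // Throws ClassCastException if the slot holds a non-Long, rather
        // than reinterpreting raw bytes the way an unsafe, off-heap row
        // layout could.
        return (Long) values[ordinal];
    }

    public String getString(int ordinal) {
        return (String) values[ordinal];
    }

    public static void main(String[] args) {
        SimpleGenericRow row = new SimpleGenericRow("hello", 42L);
        System.out.println(row.getLong(1));   // correctly typed access: prints 42
        try {
            row.getLong(0);                   // string slot read as a long
        } catch (ClassCastException e) {
            System.out.println("getLong(0) on a string column: " + e);
        }
    }
}
```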
I think this is misleading. It is true for UnsafeRow, but there is no reason why InternalRow should return incorrect values. The implementation in GenericInternalRow <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/rows.scala#L35> would throw a ClassCastException instead. I don’t think that using a row is a bad option simply because UnsafeRow is unsafe, and it’s unlikely that UnsafeRow would be used to pass the data anyway: the implementation would evaluate each argument expression, set the results in a generic row, and then pass that row to the UDF. We can use whatever row implementation we choose, and so provide better guarantees than the unsafe one.

> I think we should consider query-compile-time checks as nearly-as-good as Java-compile-time checks for the purposes of safety.

I don’t think I agree with this. A failure at query analysis time, vs. at runtime, still requires going back to a separate project, fixing something, and rebuilding, so the time needed to fix a problem goes up significantly compared to Java-compile-time checks. It is even worse if the UDF is maintained by someone else. We also need to consider how common it would be for a use case to actually get the query-compile-time checks. Going through this in more detail below makes me think these checks would rarely apply, because of the limitations of using an interface with type erasure.

> I believe that Wenchen’s proposal will provide stronger query-compile-time safety

The proposal could provide better safety for each argument, assuming that we detect failures by looking at the parameter types using reflection in the analyzer. But we don’t do that for any of the similar UDFs today, so I’m skeptical that this would actually be a high enough priority to implement. As Erik pointed out, type erasure also limits the effectiveness: you can’t implement both ScalarFunction2<Integer, Integer> and ScalarFunction2<Long, Long>.
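To make the erasure limitation concrete, here is a sketch using a stand-in interface (the exact shape of the proposed ScalarFunction2 may differ):

```java
// Stand-in for the proposed two-argument UDF interface.
interface ScalarFunction2<A1, A2> {
    Object produceResult(A1 a1, A2 a2);
}

// One specialization per class is fine:
class IntAdd implements ScalarFunction2<Integer, Integer> {
    public Object produceResult(Integer a1, Integer a2) {
        return a1 + a2;
    }
}

// This does NOT compile: both parents erase to the same raw
// ScalarFunction2, so one class cannot implement both specializations.
//
// class AddBoth implements ScalarFunction2<Integer, Integer>,
//                          ScalarFunction2<Long, Long> { ... }

public class ErasureDemo {
    public static void main(String[] args) {
        System.out.println(new IntAdd().produceResult(1, 2)); // 3

        // An analyzer using reflection can recover the type arguments from
        // the class metadata, but it gets only one parameterization per class:
        System.out.println(IntAdd.class.getGenericInterfaces()[0]);
    }
}
```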
You can handle those cases using InternalRow, or you can handle them using VarargScalarFunction<Object>. That forces many use cases into varargs with Object, where you don’t get any of the proposed analyzer benefits and you lose compile-time checks. The only time the additional checks (if implemented) would help is when exactly one set of argument types is needed, because implementing ScalarFunction<Object, Object> defeats the purpose.

It’s worth noting that safety for the magic methods would be identical between the two options, so the trade-off to consider applies only to the varargs and non-codegen cases. Combining the limitations discussed above, the proposal has better safety guarantees only when a function needs just one set of argument types for each number of arguments and is called through the non-codegen path. Since varargs is one of the primary reasons to use this API, I don’t think that it is a good idea to use Object[] instead of InternalRow.

-- 
Ryan Blue
Software Engineer
Netflix