Marius Feteanu created SPARK-20212:
--------------------------------------

             Summary: UDFs with Option[Primitive Type] don't work as expected
                 Key: SPARK-20212
                 URL: https://issues.apache.org/jira/browse/SPARK-20212
             Project: Spark
          Issue Type: Bug
          Components: Optimizer
    Affects Versions: 2.1.0
            Reporter: Marius Feteanu
            Priority: Minor


The documentation for ScalaUDF says:

{code:none}
Note that if you use primitive parameters, you are not able to check if it is
null or not, and the UDF will return null for you if the primitive input is
null. Use boxed type or [[Option]] if you wanna do the null-handling yourself.
{code}

This works with boxed types:

{code:none}
import org.apache.spark.sql.functions.{col, udf}
import spark.implicits._

def is_null_box(x:java.lang.Long):String = {
    x match {
        case _:java.lang.Long => "Yep"
        case null => "No man"
    }
}
val is_null_box_udf = udf(is_null_box _)
val sample = (1L to 5L).toList.map(x => new java.lang.Long(x)) ++ List[java.lang.Long](null, null)
val df = sample.toDF("col1")
df.select(is_null_box_udf(col("col1"))).show(10)
{code}

But it does not work as expected with Option\[Long\]:

{code:none}
import org.apache.spark.sql.functions.{col, udf}
import spark.implicits._

def is_null_opt(x:Option[Long]):String = {
    x match {
        case Some(_:Long) => "Yep"
        case None => "No man"
    }
}
val is_null_opt_udf = udf(is_null_opt _)
val sample = (1L to 5L)
// This does not help: val sample = (1L to 5L).map(Some(_)).toList
val df = sample.toDF("col1")
df.select(is_null_opt_udf(col("col1"))).show(10)
{code}

That throws:
{code:none}
Caused by: java.lang.ClassCastException: java.lang.Long cannot be cast to scala.Option
  at $anonfun$1.apply(<console>:56)
  at org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:89)
  at org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:88)
  at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1069)
{code}

The current workaround is to use boxed types, but it makes for awkward-looking code.
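
For reference, a minimal sketch of that workaround (illustrative only; the names is_null_wrap and is_null_wrap_udf are not from the code above): keep the boxed java.lang.Long parameter and wrap it in Option inside the UDF body, so the pattern match still reads Option-style:

{code:none}
import org.apache.spark.sql.functions.{col, udf}
import spark.implicits._

// Boxed parameter so null survives the call; Option(x) turns null into None.
def is_null_wrap(x: java.lang.Long): String = {
    Option(x) match {
        case Some(_) => "Yep"
        case None => "No man"
    }
}
val is_null_wrap_udf = udf(is_null_wrap _)
val sample = (1L to 5L).toList.map(x => java.lang.Long.valueOf(x)) ++ List[java.lang.Long](null, null)
val df = sample.toDF("col1")
df.select(is_null_wrap_udf(col("col1"))).show(10)
{code}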

If you just use a primitive Long instead of boxing, the code may break in subtle ways: it does not fail, it silently returns null. That behaviour is documented but easy to miss (it is not part of this bug, but if someone "corrects" a boxed function to use primitive types they may get surprising results):
{code:none}
import org.apache.spark.sql.functions.{col, udf, expr}
import spark.implicits._

def is_null_opt(x:Long):String = {
    Option(x) match {
        case Some(_:Long) => "Yep"
        case None => "No man"
    }
}
val is_null_opt_udf = udf(is_null_opt _)
val sample = (1L to 5L)
val df = sample.toDF("col3").select(expr("CASE WHEN col3=2 THEN NULL ELSE col3 END").alias("col3"))
df.printSchema
df.select(is_null_opt_udf(col("col3"))).show(10)
{code}




