Albert Meltzer created SPARK-20312:
--------------------------------------

             Summary: query optimizer calls udf with null values when it doesn't expect them
                 Key: SPARK-20312
                 URL: https://issues.apache.org/jira/browse/SPARK-20312
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.1.0
            Reporter: Albert Meltzer


When optimizing an outer join, Spark passes an empty (all-null) row to each side of the join condition to check whether null inputs would be filtered out (side comment: for left and right outer joins it subsequently ignores that assessment on the preserved side).

For some reason, a condition such as "x IS NOT NULL && udf(x) IS NOT NULL" may end up with its right side evaluated first, causing an exception when the udf does not expect a null input (evaluating the left side first would have short-circuited the null away).
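As a defensive workaround, the udf itself can be made null-tolerant so the optimizer's null probe cannot throw. A minimal sketch, assuming the input column is LongType as in the example below; `safeFunc` is a hypothetical null-tolerant stand-in for the report's `func`:

```scala
import org.apache.spark.sql.functions.udf

// Hypothetical sketch: accept a boxed Long so a null probe row is representable,
// and return an Option so Spark treats the result column as nullable.
val safeFunc = udf { (value: java.lang.Long) =>
  if (value == null) None            // the optimizer's null probe lands here safely
  else Some(value.toInt)             // stand-in for the report's func(...) body
}
```

Returning `Option[Int]` instead of a bare `Int` also avoids the non-null assumption that a primitive-returning udf otherwise carries.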

An example SIMILAR to the following (see actual query plans separately):

def func(value: Any): Int = ... // returns AnyVal, which probably causes an IS NOT NULL filter to be added on the result

val df1 = sparkSession
  .table(...)
  .select("col1", "col2") // both LongType
val df11 = df1
  .filter(df1("col1").isNotNull)
  .withColumn("col3", functions.udf(func)(df1("col1")))
  .repartition(df1("col2"))
  .sortWithinPartitions(df1("col2"))

val df2 = ... // load other data containing col2, similarly repartition and sort

val df3 =
  df11.join(df2, Seq("col2"), "left_outer")



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
