[ https://issues.apache.org/jira/browse/SPARK-20312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15986189#comment-15986189 ]
Takeshi Yamamuro commented on SPARK-20312:
------------------------------------------

I checked and found no issue in the current master. IIUC, this issue was fixed by SPARK-20359, and that fix has already been applied to branch-2.1.

> query optimizer calls udf with null values when it doesn't expect them
> ----------------------------------------------------------------------
>
>                 Key: SPARK-20312
>                 URL: https://issues.apache.org/jira/browse/SPARK-20312
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
> Affects Versions: 2.1.0
>        Reporter: Albert Meltzer
>
> When optimizing an outer join, Spark passes an empty row to both sides to see
> whether nulls would be ignored (side comment: for half-outer joins it subsequently
> ignores the assessment on the dominant side).
> For some reason, a condition such as {{xx IS NOT NULL && udf(xx) IS NOT NULL}}
> might result in the right side being checked first, causing an exception if the
> udf doesn't expect a null input (it receives null before the left side's null
> check has been applied).
> An example SIMILAR to the following (see actual query plans separately):
> {noformat}
> def func(value: Any): Int = ... // returns AnyVal, which probably causes an
>                                 // IS NOT NULL filter to be added on the result
> val df1 = sparkSession
>   .table(...)
>   .select("col1", "col2") // both LongType
> val df11 = df1
>   .filter(df1("col1").isNotNull)
>   .withColumn("col3", functions.udf(func)(df1("col1")))
>   .repartition(df1("col2"))
>   .sortWithinPartitions(df1("col2"))
> val df2 = ... // load other data containing col2, similarly repartitioned and sorted
> val df3 =
>   df11.join(df2, Seq("col2"), "left_outer")
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
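The failure mode in the report can be shown without Spark: a Scala function typed to return {{Int}} (an {{AnyVal}}, which cannot be null) blows up when the optimizer's empty-row probe hands it a null, whereas an {{Option}}-returning variant tolerates it. The sketch below is illustrative only; the names {{func}} and {{funcNullSafe}} are hypothetical and not from the original report.

{noformat}
// A UDF body that assumes a non-null Long input. When the outer-join
// optimizer probes it with null, the null matches no case and a
// scala.MatchError is thrown.
def func(value: Any): Int = value match {
  case l: Long => l.toInt
}

// A null-tolerant variant: null falls through to the wildcard case,
// so the optimizer's empty-row probe is harmless.
def funcNullSafe(value: Any): Option[Int] = value match {
  case l: Long => Some(l.toInt)
  case _       => None
}
{noformat}

Scala UDFs registered via {{functions.udf}} may, as I understand it, return {{Option}} types, which Spark treats as a nullable column, so the null-safe variant avoids depending on filter-evaluation order.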