Albert Meltzer created SPARK-20312:
--------------------------------------

             Summary: query optimizer calls udf with null values when it doesn't expect them
                 Key: SPARK-20312
                 URL: https://issues.apache.org/jira/browse/SPARK-20312
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.1.0
            Reporter: Albert Meltzer
When optimizing an outer join, Spark passes an empty (all-null) row to each side of the join to see whether that side would filter nulls out (side comment: for left and right outer joins it subsequently ignores the assessment on the dominant side).

For some reason, a condition such as "x IS NOT NULL && udf(x) IS NOT NULL" might result in the right side being checked first, causing an exception if the udf doesn't expect a null input (evaluating the left side first would have short-circuited the call).

An example SIMILAR to the following (see actual query plans separately):

def func(value: Any): Int = ... // returns AnyVal, which probably causes an IS NOT NULL filter to be added on the result

val df1 = sparkSession
  .table(...)
  .select("col1", "col2") // both LongType

val df11 = df1
  .filter(df1("col1").isNotNull)
  .withColumn("col3", functions.udf(func)(df1("col1")))
  .repartition(df1("col2"))
  .sortWithinPartitions(df1("col2"))

val df2 = ... // load other data containing col2, similarly repartitioned and sorted

val df3 = df11.join(df2, Seq("col2"), "left_outer")
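A possible workaround (not part of the original report; a minimal sketch under the assumption that func only ever needs a non-null Long) is to make the udf itself null-tolerant, so the optimizer's all-null probe row cannot crash it:

import org.apache.spark.sql.functions

// Hypothetical null-safe variant: take java.lang.Long so Spark can actually
// pass null in, and return Option[Int] so the result column is nullable.
val safeFunc = functions.udf { value: java.lang.Long =>
  if (value == null) None        // survives the optimizer's empty-row probe
  else Some(func(value))         // delegate to the original func otherwise
}

val df11 = df1
  .filter(df1("col1").isNotNull)
  .withColumn("col3", safeFunc(df1("col1")))
  .repartition(df1("col2"))
  .sortWithinPartitions(df1("col2"))

Since this variant returns Option rather than an AnyVal, it may also avoid the implicit IS NOT NULL filter on the result mentioned above, sidestepping the reordered condition entirely.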