[jira] [Updated] (SPARK-20312) query optimizer calls udf with null values when it doesn't expect them

2017-04-27 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-20312:
-
Fix Version/s: 2.3.0
   2.2.0
   2.1.1

> query optimizer calls udf with null values when it doesn't expect them
> --
>
> Key: SPARK-20312
> URL: https://issues.apache.org/jira/browse/SPARK-20312
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Albert Meltzer
> Fix For: 2.1.1, 2.2.0, 2.3.0
>
>
> When optimizing an outer join, Spark passes an empty row to both sides to
> see if nulls would be ignored (side note: for left and right outer joins it
> subsequently ignores the assessment on the preserved side).
> For some reason, a condition such as {{xx IS NOT NULL && udf(xx) IS NOT
> NULL}} might be evaluated with the right side first, raising an exception if
> the udf doesn't expect a null input (one that the left side, if evaluated
> first, would have filtered out).
> An example is SIMILAR to the following (see actual query plans separately):
> {noformat}
> // returns an AnyVal, which probably causes an IS NOT NULL filter to be
> // added on the result
> def func(value: Any): Int = ...
>
> val df1 = sparkSession
>   .table(...)
>   .select("col1", "col2") // both columns are LongType
> val df11 = df1
>   .filter(df1("col1").isNotNull)
>   .withColumn("col3", functions.udf(func)(df1("col1")))
>   .repartition(df1("col2"))
>   .sortWithinPartitions(df1("col2"))
>
> val df2 = ... // load other data containing col2, similarly repartitioned and sorted
>
> val df3 =
>   df11.join(df2, Seq("col2"), "left_outer")
> {noformat}
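
A user-level mitigation, until the optimizer is fixed, is to make the UDF
itself tolerate a null input. A minimal sketch, assuming a boxed Long input
column; safeFunc and its body are illustrative, not from the report:

{noformat}
import org.apache.spark.sql.functions.udf

// Accept a boxed (nullable) Long instead of a primitive, and return a boxed
// Integer, so the optimizer's all-null probe row yields null instead of an
// exception when the UDF is evaluated.
val safeFunc = udf { (value: java.lang.Long) =>
  if (value == null) null
  else java.lang.Integer.valueOf(value.intValue) // placeholder for real logic
}

val df11 = df1
  .filter(df1("col1").isNotNull)
  .withColumn("col3", safeFunc(df1("col1")))
{noformat}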





[jira] [Updated] (SPARK-20312) query optimizer calls udf with null values when it doesn't expect them

2017-04-13 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-20312:
-
Component/s: (was: Spark Core)
 SQL

> query optimizer calls udf with null values when it doesn't expect them
> --
>
> Key: SPARK-20312
> URL: https://issues.apache.org/jira/browse/SPARK-20312
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Albert Meltzer
>
> When optimizing an outer join, Spark passes an empty row to both sides to
> see if nulls would be ignored (side note: for left and right outer joins it
> subsequently ignores the assessment on the preserved side).
> For some reason, a condition such as {{xx IS NOT NULL && udf(xx) IS NOT
> NULL}} might be evaluated with the right side first, raising an exception if
> the udf doesn't expect a null input (one that the left side, if evaluated
> first, would have filtered out).
> An example is SIMILAR to the following (see actual query plans separately):
> {noformat}
> // returns an AnyVal, which probably causes an IS NOT NULL filter to be
> // added on the result
> def func(value: Any): Int = ...
>
> val df1 = sparkSession
>   .table(...)
>   .select("col1", "col2") // both columns are LongType
> val df11 = df1
>   .filter(df1("col1").isNotNull)
>   .withColumn("col3", functions.udf(func)(df1("col1")))
>   .repartition(df1("col2"))
>   .sortWithinPartitions(df1("col2"))
>
> val df2 = ... // load other data containing col2, similarly repartitioned and sorted
>
> val df3 =
>   df11.join(df2, Seq("col2"), "left_outer")
> {noformat}
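
To make the failure mode concrete, here is a plain-Scala illustration of why
the order of the two conjuncts matters when a null is probed (this is not
Catalyst code; xx and udfBody are invented names):

{noformat}
// A boxed value standing in for col1 in the optimizer's all-null probe row.
val xx: java.lang.Long = null

// Stands in for func above: unboxing a null Long throws a NullPointerException.
def udfBody(v: Long): Int = v.toInt

// Left to right, && short-circuits: udfBody is never reached.
val safe = (xx != null) && (udfBody(xx) != 0) // false, no exception

// Right side first, as the optimizer may effectively evaluate it:
// val unsafe = (udfBody(xx) != 0) && (xx != null) // NullPointerException
{noformat}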





[jira] [Updated] (SPARK-20312) query optimizer calls udf with null values when it doesn't expect them

2017-04-12 Thread Albert Meltzer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Albert Meltzer updated SPARK-20312:
---
Description: 
When optimizing an outer join, Spark passes an empty row to both sides to see
if nulls would be ignored (side note: for left and right outer joins it
subsequently ignores the assessment on the preserved side).

For some reason, a condition such as {{xx IS NOT NULL && udf(xx) IS NOT NULL}}
might be evaluated with the right side first, raising an exception if the udf
doesn't expect a null input (one that the left side, if evaluated first, would
have filtered out).

An example is SIMILAR to the following (see actual query plans separately):

{noformat}
// returns an AnyVal, which probably causes an IS NOT NULL filter to be
// added on the result
def func(value: Any): Int = ...

val df1 = sparkSession
  .table(...)
  .select("col1", "col2") // both columns are LongType
val df11 = df1
  .filter(df1("col1").isNotNull)
  .withColumn("col3", functions.udf(func)(df1("col1")))
  .repartition(df1("col2"))
  .sortWithinPartitions(df1("col2"))

val df2 = ... // load other data containing col2, similarly repartitioned and sorted

val df3 =
  df11.join(df2, Seq("col2"), "left_outer")
{noformat}
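
Another possible mitigation (a sketch using Spark's standard when expression,
not the reporter's code; toCol3 is an invented stand-in for func): decide the
null branch before the UDF runs, instead of relying on the order of the two
IS NOT NULL conjuncts.

{noformat}
import org.apache.spark.sql.functions.{udf, when}

// CaseWhen evaluates its value expressions lazily, so the UDF is only
// invoked on rows where col1 is non-null; all other rows get null.
val toCol3 = udf((v: Long) => v.toInt) // stand-in for the real logic
val df11 = df1
  .filter(df1("col1").isNotNull)
  .withColumn("col3", when(df1("col1").isNotNull, toCol3(df1("col1"))))
{noformat}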


  was:
When optimizing an outer join, Spark passes an empty row to both sides to see
if nulls would be ignored (side note: for left and right outer joins it
subsequently ignores the assessment on the preserved side).

For some reason, a condition such as "x IS NOT NULL && udf(x) IS NOT NULL"
might be evaluated with the right side first, raising an exception if the udf
doesn't expect a null input (one that the left side, if evaluated first, would
have filtered out).

An example is SIMILAR to the following (see actual query plans separately):

def func(value: Any): Int = ... // returns AnyVal, which probably causes an IS NOT NULL filter to be added on the result

val df1 = sparkSession
  .table(...)
  .select("col1", "col2") // both columns are LongType
val df11 = df1
  .filter(df1("col1").isNotNull)
  .withColumn("col3", functions.udf(func)(df1("col1")))
  .repartition(df1("col2"))
  .sortWithinPartitions(df1("col2"))

val df2 = ... // load other data containing col2, similarly repartitioned and sorted

val df3 =
  df11.join(df2, Seq("col2"), "left_outer")


> query optimizer calls udf with null values when it doesn't expect them
> --
>
> Key: SPARK-20312
> URL: https://issues.apache.org/jira/browse/SPARK-20312
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Albert Meltzer
>
> When optimizing an outer join, Spark passes an empty row to both sides to
> see if nulls would be ignored (side note: for left and right outer joins it
> subsequently ignores the assessment on the preserved side).
> For some reason, a condition such as {{xx IS NOT NULL && udf(xx) IS NOT
> NULL}} might be evaluated with the right side first, raising an exception if
> the udf doesn't expect a null input (one that the left side, if evaluated
> first, would have filtered out).
> An example is SIMILAR to the following (see actual query plans separately):
> {noformat}
> // returns an AnyVal, which probably causes an IS NOT NULL filter to be
> // added on the result
> def func(value: Any): Int = ...
>
> val df1 = sparkSession
>   .table(...)
>   .select("col1", "col2") // both columns are LongType
> val df11 = df1
>   .filter(df1("col1").isNotNull)
>   .withColumn("col3", functions.udf(func)(df1("col1")))
>   .repartition(df1("col2"))
>   .sortWithinPartitions(df1("col2"))
>
> val df2 = ... // load other data containing col2, similarly repartitioned and sorted
>
> val df3 =
>   df11.join(df2, Seq("col2"), "left_outer")
> {noformat}
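
Since the report refers to the actual query plans, the placement of the
inferred IS NOT NULL predicates relative to the UDF call can be checked with
Dataset.explain. A minimal sketch, continuing the reproduction above:

{noformat}
// Prints the parsed, analyzed, optimized, and physical plans; the optimized
// plan shows whether the UDF call was ordered before the null check.
df3.explain(true)
{noformat}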


