[ https://issues.apache.org/jira/browse/SPARK-40963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Bruce Robbins updated SPARK-40963:
----------------------------------
    Description:
Example:
{noformat}
select c1, explode(c4) as c5 from (
  select c1, array(c3) as c4 from (
    select c1, explode_outer(c2) as c3
    from values
    (1, array(1, 2)),
    (2, array(2, 3)),
    (3, null)
    as data(c1, c2)
  )
);

+---+---+
|c1 |c5 |
+---+---+
|1  |1  |
|1  |2  |
|2  |2  |
|2  |3  |
|3  |0  |
+---+---+
{noformat}
In the last row, {{c5}} is 0, but should be {{NULL}}.

Another example:
{noformat}
select c1, exists(c4, x -> x is null) as c5 from (
  select c1, array(c3) as c4 from (
    select c1, explode_outer(c2) as c3
    from values
    (1, array(1, 2)),
    (2, array(2, 3)),
    (3, null)
    as data(c1, c2)
  )
);

+---+-----+
|c1 |c5   |
+---+-----+
|1  |false|
|1  |false|
|2  |false|
|2  |false|
|3  |false|
+---+-----+
{noformat}
In the last row, {{false}} should be {{true}}.

In both cases, at the time {{CreateArray(c3)}} is instantiated, {{c3}}'s nullability is incorrect because the new projection created by {{ExtractGenerator}} uses {{generatorOutput}} from {{explode_outer(c2)}} as a projection list. {{generatorOutput}} doesn't take into account that {{explode_outer(c2)}} is an _outer_ explode, so the nullability setting is lost.

{{UpdateAttributeNullability}} will eventually fix the nullable setting for attributes referring to {{c3}}, but it doesn't fix the {{containsNull}} setting for {{c4}} in {{explode(c4)}} (first example) or {{exists(c4, x -> x is null)}} (second example).
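To make the ordering problem described above easier to follow, here is a minimal toy model in plain Python. It is not Spark code: {{Attribute}}, {{ArrayType}}, {{generator_output}}, {{create_array_type}}, {{plan_types}} and the {{respect_outer}} flag are all hypothetical names invented for this sketch, and the assumption is simply that the array's element nullability is captured once, when the array expression is built, and is not revisited when the attribute's {{nullable}} flag is corrected later.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Attribute:
    """Hypothetical stand-in for a resolved column reference."""
    name: str
    nullable: bool


@dataclass(frozen=True)
class ArrayType:
    """Hypothetical stand-in for an array type with a containsNull flag."""
    contains_null: bool


def generator_output(outer: bool, respect_outer: bool) -> Attribute:
    # The input arrays in the examples hold no null elements, so the element
    # column is only nullable because the explode is an *outer* explode.
    # respect_outer=False models the reported behaviour: the flag is ignored.
    return Attribute("c3", nullable=outer and respect_outer)


def create_array_type(children) -> ArrayType:
    # The element nullability is derived from the children as they look
    # at construction time.
    return ArrayType(contains_null=any(c.nullable for c in children))


def plan_types(respect_outer: bool):
    c3 = generator_output(outer=True, respect_outer=respect_outer)
    c4_type = create_array_type([c3])      # captured here, never recomputed
    # A later pass corrects the attribute itself, but not the captured type.
    c3_fixed = replace(c3, nullable=True)
    return c3_fixed.nullable, c4_type.contains_null


print(plan_types(respect_outer=False))  # reported behaviour
print(plan_types(respect_outer=True))   # behaviour if outer were honoured
```

In this model the first call leaves {{c4}} claiming it contains no nulls even though {{c3}} ends up nullable, which is the mismatch that downstream consumers of {{c4}} would then trust.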

The following example fails with a {{NullPointerException}}:
{noformat}
select c1, inline_outer(c4) from (
  select c1, array(c3) as c4 from (
    select c1, explode_outer(c2) as c3
    from values
    (1, array(named_struct('a', 1, 'b', 2))),
    (2, array(named_struct('a', 3, 'b', 4), named_struct('a', 5, 'b', 6))),
    (3, null)
    as data(c1, c2)
  )
);
22/10/27 11:53:20 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 2)
java.lang.NullPointerException
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.generate_doConsume_1$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.generate_doConsume_0$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:364)
{noformat}

> ExtractGenerator sets incorrect nullability in new Project
> ----------------------------------------------------------
>
> Key: SPARK-40963
> URL: https://issues.apache.org/jira/browse/SPARK-40963
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions:
3.1.3, 3.2.2, 3.4.0, 3.3.1
> Reporter: Bruce Robbins
> Priority: Major
> Labels: correctness