[ https://issues.apache.org/jira/browse/SPARK-40963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Bruce Robbins updated SPARK-40963:
----------------------------------
    Summary: ExtractGenerator sets incorrect nullability in new Project  (was: containsNull in array type attributes is not updated from child output)

> ExtractGenerator sets incorrect nullability in new Project
> ----------------------------------------------------------
>
>                 Key: SPARK-40963
>                 URL: https://issues.apache.org/jira/browse/SPARK-40963
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.1.3, 3.2.2, 3.4.0, 3.3.1
>            Reporter: Bruce Robbins
>            Priority: Major
>              Labels: correctness
>
> Example:
> {noformat}
> select c1, explode(c4) as c5 from (
>   select c1, array(c3) as c4 from (
>     select c1, explode_outer(c2) as c3
>     from values
>     (1, array(1, 2)),
>     (2, array(2, 3)),
>     (3, null)
>     as data(c1, c2)
>   )
> );
> +---+---+
> |c1 |c5 |
> +---+---+
> |1  |1  |
> |1  |2  |
> |2  |2  |
> |2  |3  |
> |3  |0  |
> +---+---+
> {noformat}
> In the last row, {{c5}} is 0, but should be {{NULL}}.
>
> The following description is the proximate cause of the issue, but may not be
> the _root_ cause (still looking):
>
> At the time {{ResolveGenerate.makeGeneratorOutput}} is called for
> {{explode(c4)}}, {{c3}} has nullable set to false, so {{c4}}'s data type has
> {{containsNull}} also set to false. Later, {{c3}}'s nullability is updated
> and {{c4}}'s data type reports {{containsNull}} = true, but two things go wrong:
> * The {{containsNull}} setting for {{c4}} is not propagated to parent
> operators (so the attribute {{c4}} in {{explode(c4)}} still has
> {{containsNull}} = false).
> * Even if it were propagated, {{generatorOutput}} for {{explode(c4)}} is
> already determined and won't be recalculated.
>
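The two failure points listed above can be illustrated with a small toy model. This is only a sketch in plain Python, not Spark code: {{Attribute}}, {{ArrayType}}, {{array_of}} and {{generator_output}} are hypothetical stand-ins invented for illustration, and the only assumption taken from the description is the order of events (the generator output is computed while {{c3}} is still marked non-nullable, and {{c3}} is corrected afterwards).

```python
from dataclasses import dataclass


@dataclass
class Attribute:
    """Hypothetical stand-in for an attribute: a name plus a nullable flag."""
    name: str
    nullable: bool


@dataclass(frozen=True)
class ArrayType:
    """Hypothetical stand-in for an array data type."""
    contains_null: bool


def array_of(child: Attribute) -> ArrayType:
    # array(c3) can hold a null element only if c3 itself is nullable.
    return ArrayType(contains_null=child.nullable)


def generator_output(input_type: ArrayType) -> Attribute:
    # In this toy model, the exploded column is nullable exactly when
    # the input array may contain nulls.
    return Attribute("c5", nullable=input_type.contains_null)


# Step 1: explode(c4) is resolved while c3 still reports nullable = false.
c3 = Attribute("c3", nullable=False)
c4_type_seen_by_parent = array_of(c3)          # snapshot held by the parent
c5 = generator_output(c4_type_seen_by_parent)  # computed once, never redone

# Step 2: c3's nullability is corrected later.
c3.nullable = True
c4_type_in_child = array_of(c3)                # the child now reports true

print(c4_type_in_child.contains_null)        # True
print(c4_type_seen_by_parent.contains_null)  # False (stale copy in the parent)
print(c5.nullable)                           # False (stale generator output)
```

In this model the child reports the corrected type, while both the parent's copy of {{c4}}'s type and the already-computed output for {{c5}} keep the stale value, which mirrors the two bullets above.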
> Another example:
> {noformat}
> select c1, inline_outer(c4) from (
>   select c1, array(c3) as c4 from (
>     select c1, explode_outer(c2) as c3
>     from values
>     (1, array(named_struct('a', 1, 'b', 2))),
>     (2, array(named_struct('a', 3, 'b', 4), named_struct('a', 5, 'b', 6))),
>     (3, null)
>     as data(c1, c2)
>   )
> );
> 22/10/27 11:53:20 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 2)
> java.lang.NullPointerException
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.generate_doConsume_1$(Unknown Source)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.generate_doConsume_0$(Unknown Source)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
>     at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>     at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
>     at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:364)
>     -
>     -
>     -
> {noformat}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)