[ https://issues.apache.org/jira/browse/SPARK-44132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Aerts updated SPARK-44132:
---------------------------------
    Description: 
We are seeing issues with the code generator when querying Java bean encoded 
data with two nested full outer joins.
{code:java}
dsA.join(dsB, seq("id"), "full_outer").join(dsC, seq("id"), "full_outer"); 
{code}
generates invalid code in the code generator and, depending on the data used, 
can produce stack traces like:
{code:java}
Caused by: java.lang.NullPointerException
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.smj_consumeFullOuterJoinRow_0$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.wholestagecodegen_findNextJoinRows_0$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
{code}
Or:
{code:java}
Caused by: java.lang.AssertionError: index (2) should < 2
        at org.apache.spark.sql.catalyst.expressions.UnsafeRow.assertIndexIsValid(UnsafeRow.java:118)
        at org.apache.spark.sql.catalyst.expressions.UnsafeRow.isNullAt(UnsafeRow.java:315)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.smj_consumeFullOuterJoinRow_0$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
{code}
When we look at the generated code we see that the code generator seems to be 
mixing up parameters.  For example:
{code:java}
if (smj_leftOutputRow_0 != null) {                          //<==== null check for wrong/left parameter
  boolean smj_isNull_12 = smj_rightOutputRow_0.isNullAt(1); //<==== causes NPE on right parameter here
{code}
It is as if the nesting of two full outer joins confuses the code generator, 
which then generates invalid code.
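
As a minimal sketch of why the guard shown above fails, the following plain Java example mimics the shape of the generated snippet without using Spark. The names `SketchRow`, `FullOuterJoinGuardSketch`, `mismatchedGuard` and `matchingGuard` are hypothetical and exist only for illustration; this is not the actual generated code, nor the fix applied in Spark. In a full outer join, either side of an output row can be missing, so a null check on the left row says nothing about the right row.

```java
// Hypothetical stand-in for a row; this is NOT Spark's UnsafeRow.
final class SketchRow {
    private final Object[] fields;

    SketchRow(Object... fields) {
        this.fields = fields;
    }

    boolean isNullAt(int ordinal) {
        // Bounds check modelled on the message seen in the second stack trace.
        if (ordinal < 0 || ordinal >= fields.length) {
            throw new AssertionError("index (" + ordinal + ") should < " + fields.length);
        }
        return fields[ordinal] == null;
    }
}

final class FullOuterJoinGuardSketch {
    // Same shape as the reported snippet: checks the left row, reads the right row.
    static boolean mismatchedGuard(SketchRow left, SketchRow right) {
        if (left != null) {
            return right.isNullAt(1); // throws NullPointerException when right == null
        }
        return true;
    }

    // What one would expect instead: check the row that is actually dereferenced.
    static boolean matchingGuard(SketchRow left, SketchRow right) {
        if (right != null) {
            return right.isNullAt(1);
        }
        return true; // no matching right row, so the column is treated as null
    }

    public static void main(String[] args) {
        SketchRow left = new SketchRow(1, "a");

        // A full outer join row with no match on the right side.
        System.out.println("matching guard, column is null: " + matchingGuard(left, null));

        try {
            mismatchedGuard(left, null);
        } catch (NullPointerException e) {
            System.out.println("mismatched guard: NullPointerException");
        }

        try {
            // Assumption: an ordinal meant for a wider row is applied to a two-field row.
            left.isNullAt(2);
        } catch (AssertionError e) {
            System.out.println("wrong ordinal: " + e.getMessage());
        }
    }
}
```

The last case is only a guess at how the `AssertionError` variant could arise: if the generator pairs an ordinal with the wrong row, a two-field row can be asked for a third field.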

There is one other strange thing.  We found this issue when using datasets 
that use the Java bean encoder.  We tried to reproduce it in the Spark shell 
and with Scala case classes, but were unable to do so.

We wrote a reproduction scenario as unit tests (one for each of the stack 
traces above) on the Spark code base and made it available as a [pull 
request|https://github.com/apache/spark/pull/41688] for this issue.

  was:
We are seeing issues with the code generator when querying Java bean encoded 
data with two nested full outer joins.
{code:java}
dsA.join(dsB, seq("id"), "full_outer").join(dsC, seq("id"), "full_outer"); 
{code}
generates invalid code in the code generator and, depending on the data used, 
can produce stack traces like:
{code:java}
Caused by: java.lang.NullPointerException
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.smj_consumeFullOuterJoinRow_0$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.wholestagecodegen_findNextJoinRows_0$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
{code}
Or:
{code:java}
Caused by: java.lang.AssertionError: index (2) should < 2
        at org.apache.spark.sql.catalyst.expressions.UnsafeRow.assertIndexIsValid(UnsafeRow.java:118)
        at org.apache.spark.sql.catalyst.expressions.UnsafeRow.isNullAt(UnsafeRow.java:315)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.smj_consumeFullOuterJoinRow_0$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
{code}
When we look at the generated code we see that the code generator seems to be 
mixing up parameters.  For example:
{code:java}
if (smj_leftOutputRow_0 != null) {                          //<==== null check for wrong/left parameter
  boolean smj_isNull_12 = smj_rightOutputRow_0.isNullAt(1); //<==== causes NPE on right parameter here
{code}
It is as if the nesting of two full outer joins confuses the code generator, 
which then generates invalid code.

There is one other strange thing.  We found this issue when using datasets 
that use the Java bean encoder.  We tried to reproduce it in the Spark shell 
and with Scala case classes, but were unable to do so.

We wrote a reproduction scenario as unit tests (one for each of the stack 
traces above) on the Spark code base and will make it available as a pull 
request for this issue.


> nesting full outer joins confuses code generator
> ------------------------------------------------
>
>                 Key: SPARK-44132
>                 URL: https://issues.apache.org/jira/browse/SPARK-44132
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.3.0, 3.4.0, 3.5.0
>         Environment: We verified that this bug exists in Spark 3.3 through 
> Spark 3.5.
>            Reporter: Steven Aerts
>            Priority: Major


