[jira] [Commented] (SPARK-31854) Different results of query execution with wholestage codegen on and off
[ https://issues.apache.org/jira/browse/SPARK-31854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17120801#comment-17120801 ] Apache Spark commented on SPARK-31854: -- User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/28691
> Different results of query execution with wholestage codegen on and off
> ---
>
> Key: SPARK-31854
> URL: https://issues.apache.org/jira/browse/SPARK-31854
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.5, 3.0.0
> Reporter: Pasha Finkeshteyn
> Assignee: Takeshi Yamamuro
> Priority: Major
> Fix For: 3.0.0
>
> Preface: I'm creating a Kotlin API for Spark to take the best parts from three worlds: Spark Scala, Spark Java, and Kotlin.
> What is nice: it works in most scenarios.
> But I've hit the following corner case:
> {code:scala}
> withSpark(props = mapOf("spark.sql.codegen.wholeStage" to true)) {
>     dsOf(1, null, 2)
>         .map { c(it) }
>         .debugCodegen()
>         .show()
> }
> {code}
> c(it) is the creation of an unnamed tuple.
> It fails with the following exception:
> {code}
> java.lang.NullPointerException: Null value appeared in non-nullable field:
> top level Product or row object
> If the schema is inferred from a Scala tuple/case class, or a Java bean,
> please try to use scala.Option[_] or other nullable types (e.g.
> java.lang.Integer instead of int/scala.Int).
> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.serializefromobject_doConsume_0$(Unknown Source)
> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.mapelements_doConsume_0$(Unknown Source)
> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.deserializetoobject_doConsume_0$(Unknown Source)
> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
> …
> {code}
> I know it won't work in Scala, so I could stop here. But it works in Kotlin if I turn wholestage codegen off!
> Moreover, if we dig into the generated code (when wholestage codegen is on), we'll see that the flow is basically the following: if one of the elements in the source dataset was null, we will throw an NPE no matter what.
> The flow is as follows:
> {code}
> private void serializefromobject_doConsume_0(org.jetbrains.spark.api.Arity1 serializefromobject_expr_0_0, boolean serializefromobject_exprIsNull_0_0) throws java.io.IOException {
> serializefromobject_doConsume_0(mapelements_value_1, mapelements_isNull_1);
> mapelements_isNull_1 = mapelements_resultIsNull_0;
> mapelements_resultIsNull_0 = mapelements_exprIsNull_0_0;
> private void mapelements_doConsume_0(java.lang.Integer mapelements_expr_0_0, boolean mapelements_exprIsNull_0_0) throws java.io.IOException {
> mapelements_doConsume_0(deserializetoobject_value_0, deserializetoobject_isNull_0);
> deserializetoobject_resultIsNull_0 = deserializetoobject_exprIsNull_0_0;
> private void deserializetoobject_doConsume_0(InternalRow localtablescan_row_0, int deserializetoobject_expr_0_0, boolean deserializetoobject_exprIsNull_0_0) throws java.io.IOException {
> deserializetoobject_doConsume_0(localtablescan_row_0, localtablescan_value_0, localtablescan_isNull_0);
> boolean localtablescan_isNull_0 = localtablescan_row_0.isNullAt(0);
> mapelements_isNull_1 = true;
> {code}
> You can find the generated code in its original form, plus a slightly simplified and refactored version, [here|https://gist.github.com/asm0dey/5c0fa4c985ab999b383d16257b515100].
> I believe that Spark should not behave differently with wholestage codegen on and off; the difference in behavior looks like a bug.
> My Spark version is 3.0.0-preview2
--
This message was sent by Atlassian Jira (v8.3.4#803005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
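The consume chain quoted above can be sketched in plain Java. This is a hypothetical simplification, not the actual Spark-generated class: the point it illustrates is that the null flag set by the table scan is forwarded unchanged through the map-elements stage into serialize-from-object, which rejects nulls for a non-nullable top-level object, so the NPE fires regardless of what the mapping lambda would produce.

```java
// Hypothetical sketch of the whole-stage-codegen consume chain described in
// the report. Names mirror the generated methods but the bodies are invented
// for illustration only.
public class ConsumeChainSketch {
    static String result;

    // serializefromobject_doConsume_0: throws for a null top-level object
    static void serializeFromObject(Object value, boolean isNull) {
        if (isNull) {
            throw new NullPointerException(
                "Null value appeared in non-nullable field: top level Product or row object");
        }
        result = value.toString();
    }

    // mapelements_doConsume_0: forwards the *input* null flag as the output
    // null flag, so a null source element poisons the whole chain
    static void mapElements(Integer value, boolean isNull) {
        Object mapped = isNull ? null : "(" + value + ")"; // stand-in for c(it)
        serializeFromObject(mapped, isNull);
    }

    // deserializetoobject_doConsume_0: receives the column value and null bit
    static void deserializeToObject(Integer column, boolean isNull) {
        mapElements(column, isNull);
    }

    public static void main(String[] args) {
        deserializeToObject(1, false);
        System.out.println(result); // (1)
        try {
            deserializeToObject(null, true); // the null element of dsOf(1, null, 2)
        } catch (NullPointerException e) {
            System.out.println("NPE: " + e.getMessage());
        }
    }
}
```

With codegen off, the interpreted path evaluates the lambda first and only then checks the serializer's nullability, which is why the Kotlin API (whose tuples tolerate null fields) succeeds there.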
[jira] [Commented] (SPARK-31854) Different results of query execution with wholestage codegen on and off
[ https://issues.apache.org/jira/browse/SPARK-31854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17120800#comment-17120800 ] Apache Spark commented on SPARK-31854: -- User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/28691
[jira] [Commented] (SPARK-31854) Different results of query execution with wholestage codegen on and off
[ https://issues.apache.org/jira/browse/SPARK-31854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17120750#comment-17120750 ] Takeshi Yamamuro commented on SPARK-31854: -- Thanks for the update, Dongjoon.
[jira] [Commented] (SPARK-31854) Different results of query execution with wholestage codegen on and off
[ https://issues.apache.org/jira/browse/SPARK-31854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17120746#comment-17120746 ] Dongjoon Hyun commented on SPARK-31854: --- I also verified this at 2.0.2 ~ 2.3.4 and updated the affected versions.
[jira] [Commented] (SPARK-31854) Different results of query execution with wholestage codegen on and off
[ https://issues.apache.org/jira/browse/SPARK-31854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17120413#comment-17120413 ] Apache Spark commented on SPARK-31854: -- User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/28681
[jira] [Commented] (SPARK-31854) Different results of query execution with wholestage codegen on and off
[ https://issues.apache.org/jira/browse/SPARK-31854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17120412#comment-17120412 ] Takeshi Yamamuro commented on SPARK-31854: -- Thanks for your report. Yes, that should be a bug in the whole-stage codegen, as you said.
[jira] [Commented] (SPARK-31854) Different results of query execution with wholestage codegen on and off
[ https://issues.apache.org/jira/browse/SPARK-31854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119352#comment-17119352 ] Paul Finkelshteyn commented on SPARK-31854: --- What do you mean by query? It is written in Kotlin and is at the top of the report. If you need an alternative query in Scala, it is as follows:
{code:java}
spark.conf.set("spark.sql.codegen.wholeStage", false)
Seq(1.asInstanceOf[Integer], null.asInstanceOf[Integer], 3.asInstanceOf[Integer]).toDS().map(v => (v, v)).show()
{code}
It also works when spark.sql.codegen.wholeStage is false and fails when it is on.
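The exception message's advice ("try to use scala.Option[_] or other nullable types, e.g. java.lang.Integer instead of int/scala.Int") can be illustrated outside Spark entirely. This hypothetical Java example shows the same failure mode: forcing a nullable boxed value into a primitive slot throws an NPE on unboxing, while a representation that admits null carries it through safely.

```java
import java.util.Optional;

public class NullableFieldDemo {
    // Treating the value as a primitive forces auto-unboxing, which throws
    // NullPointerException when the boxed value is null. This mirrors a
    // non-nullable field in an inferred schema.
    static int asPrimitive(Integer boxed) {
        return boxed; // auto-unboxing: NPE if boxed == null
    }

    // Keeping a nullable representation (here Optional, analogous to
    // scala.Option[_] or a plain java.lang.Integer column) propagates the
    // null without throwing.
    static Optional<Integer> asNullable(Integer boxed) {
        return Optional.ofNullable(boxed);
    }

    public static void main(String[] args) {
        System.out.println(asNullable(null).isPresent()); // false
        System.out.println(asPrimitive(1));               // 1
        try {
            asPrimitive(null);
        } catch (NullPointerException e) {
            System.out.println("NPE from unboxing a null Integer");
        }
    }
}
```

The Scala snippet above sidesteps the unboxing by casting to java.lang.Integer explicitly, which is why the interpreted (non-codegen) path tolerates the null element.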
[jira] [Commented] (SPARK-31854) Different results of query execution with wholestage codegen on and off
[ https://issues.apache.org/jira/browse/SPARK-31854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119334#comment-17119334 ] Takeshi Yamamuro commented on SPARK-31854: -- Could you show us a query to reproduce the issue in our env? If we don't have it, we cannot look into it. Anyway, thanks for the report.