[jira] [Commented] (SPARK-31854) Different results of query execution with wholestage codegen on and off

2020-06-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17120801#comment-17120801
 ] 

Apache Spark commented on SPARK-31854:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/28691

> Different results of query execution with wholestage codegen on and off
> ---
>
> Key: SPARK-31854
> URL: https://issues.apache.org/jira/browse/SPARK-31854
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.5, 3.0.0
>Reporter: Pasha Finkeshteyn
>Assignee: Takeshi Yamamuro
>Priority: Major
> Fix For: 3.0.0
>
>
> Preface: I'm creating a Kotlin API for Spark to take the best parts from three 
> worlds: Spark Scala, Spark Java, and Kotlin.
> The nice part: it works in most scenarios.
> But I've hit the following corner case:
> {code:scala}
> withSpark(props = mapOf("spark.sql.codegen.wholeStage" to true)) {
> dsOf(1, null, 2)
> .map { c(it) }
> .debugCodegen()
> .show()
> }
> {code}
> c(it) creates an unnamed tuple.
> It fails with this exception:
> {code}
> java.lang.NullPointerException: Null value appeared in non-nullable field:
> top level Product or row object
> If the schema is inferred from a Scala tuple/case class, or a Java bean, 
> please try to use scala.Option[_] or other nullable types (e.g. 
> java.lang.Integer instead of int/scala.Int).
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.serializefromobject_doConsume_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.mapelements_doConsume_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.deserializetoobject_doConsume_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
> …
> {code}
> I know this won't work in Scala, so I could stop here. But it works in Kotlin 
> if I turn whole-stage codegen off!
> Moreover, if we dig into the generated code (when whole-stage codegen is on), 
> we see that the flow boils down to this: if one of the elements in the source 
> dataset was null, we will throw an NPE no matter what.
> The flow is as follows:
> {code}
> private void serializefromobject_doConsume_0(org.jetbrains.spark.api.Arity1 serializefromobject_expr_0_0, boolean serializefromobject_exprIsNull_0_0) throws java.io.IOException {
> serializefromobject_doConsume_0(mapelements_value_1, mapelements_isNull_1);
> mapelements_isNull_1 = mapelements_resultIsNull_0;
> mapelements_resultIsNull_0 = mapelements_exprIsNull_0_0;
> private void mapelements_doConsume_0(java.lang.Integer mapelements_expr_0_0, boolean mapelements_exprIsNull_0_0) throws java.io.IOException {
> mapelements_doConsume_0(deserializetoobject_value_0, deserializetoobject_isNull_0);
> deserializetoobject_resultIsNull_0 = deserializetoobject_exprIsNull_0_0;
> private void deserializetoobject_doConsume_0(InternalRow localtablescan_row_0, int deserializetoobject_expr_0_0, boolean deserializetoobject_exprIsNull_0_0) throws java.io.IOException {
> deserializetoobject_doConsume_0(localtablescan_row_0, localtablescan_value_0, localtablescan_isNull_0);
> boolean localtablescan_isNull_0 = localtablescan_row_0.isNullAt(0);
> mapelements_isNull_1 = true;
> {code}
> You can find the generated code, in its original form and in a slightly 
> simplified and refactored version, 
> [here|https://gist.github.com/asm0dey/5c0fa4c985ab999b383d16257b515100]
> I believe Spark should not behave differently with whole-stage codegen on and 
> off; this difference in behavior looks like a bug.
> My Spark version is 3.0.0-preview2
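A note on the mechanism: the generated `deserializetoobject_doConsume_0` quoted above declares its value parameter as a primitive `int`, so the failure is consistent with plain Java auto-unboxing: a null `java.lang.Integer` unboxed to `int` throws an NPE before any user code runs. A minimal stand-alone sketch of that mechanism (the class and method names here are illustrative, not Spark's actual generated code):

```java
public class UnboxNpeDemo {
    // Stand-in for the generated deserializetoobject_doConsume_0, which
    // declares its value parameter as a primitive `int`.
    static int passThrough(int value) {
        return value;
    }

    public static void main(String[] args) {
        Integer fromRow = null; // like the null element in dsOf(1, null, 2)
        try {
            // Auto-unboxing a null Integer to int throws NullPointerException,
            // matching the codegen-on failure reported above.
            passThrough(fromRow);
            System.out.println("no exception");
        } catch (NullPointerException e) {
            System.out.println("NPE on unboxing");
        }
    }
}
```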



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org






[jira] [Commented] (SPARK-31854) Different results of query execution with wholestage codegen on and off

2020-05-31 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17120750#comment-17120750
 ] 

Takeshi Yamamuro commented on SPARK-31854:
--

Thanks for the update, Dongjoon.




[jira] [Commented] (SPARK-31854) Different results of query execution with wholestage codegen on and off

2020-05-31 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17120746#comment-17120746
 ] 

Dongjoon Hyun commented on SPARK-31854:
---

I also verified this at 2.0.2 ~ 2.3.4 and updated the affected versions.




[jira] [Commented] (SPARK-31854) Different results of query execution with wholestage codegen on and off

2020-05-30 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17120413#comment-17120413
 ] 

Apache Spark commented on SPARK-31854:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/28681




[jira] [Commented] (SPARK-31854) Different results of query execution with wholestage codegen on and off

2020-05-30 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17120412#comment-17120412
 ] 

Takeshi Yamamuro commented on SPARK-31854:
--

Thanks for your report. Yeah, that looks like a bug in whole-stage codegen, as 
you said.




[jira] [Commented] (SPARK-31854) Different results of query execution with wholestage codegen on and off

2020-05-29 Thread Paul Finkelshteyn (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119352#comment-17119352
 ] 

Paul Finkelshteyn commented on SPARK-31854:
---

What do you mean by query? It is written in Kotlin and is at the top of the 
report.

If you need an alternative query in Scala, it would be the following:


{code:java}
spark.conf.set("spark.sql.codegen.wholeStage", false)

Seq(1.asInstanceOf[Integer], null.asInstanceOf[Integer], 3.asInstanceOf[Integer]).toDS().map(v => (v, v)).show()
{code}

It also works when spark.sql.codegen.wholeStage is false and fails when it is 
on.
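The exception message quoted in the report suggests nullable types as a workaround. A minimal JVM-side sketch of that idea using `java.util.Optional` (illustrative only; in Scala the analogue would be wrapping the value in `scala.Option`):

```java
import java.util.Optional;

public class NullableFieldDemo {
    // Using a boxed/optional type, as the exception message recommends,
    // means no primitive unboxing can occur on a null value.
    static Optional<Integer> wrap(Integer v) {
        return Optional.ofNullable(v); // empty for null, present otherwise
    }

    public static void main(String[] args) {
        System.out.println(wrap(1));    // prints Optional[1]
        System.out.println(wrap(null)); // prints Optional.empty
    }
}
```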





[jira] [Commented] (SPARK-31854) Different results of query execution with wholestage codegen on and off

2020-05-29 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119334#comment-17119334
 ] 

Takeshi Yamamuro commented on SPARK-31854:
--

Could you show us a query that reproduces the issue in our environment? Without 
one, we cannot look into it. Anyway, thanks for the report.
