[jira] [Updated] (SPARK-31854) Different results of query execution with wholestage codegen on and off
[ https://issues.apache.org/jira/browse/SPARK-31854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-31854: Component/s: (was: Spark Core) SQL

> Different results of query execution with wholestage codegen on and off
> -----------------------------------------------------------------------
>
> Key: SPARK-31854
> URL: https://issues.apache.org/jira/browse/SPARK-31854
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.5, 3.0.0
> Reporter: Pasha Finkeshteyn
> Priority: Major
>
> Preface: I'm creating a Kotlin API for Spark to take the best parts from three
> worlds — Spark Scala, Spark Java, and Kotlin.
> What is nice — it works in most scenarios.
> But I've hit the following corner case:
> {code:scala}
> withSpark(props = mapOf("spark.sql.codegen.wholeStage" to true)) {
>     dsOf(1, null, 2)
>         .map { c(it) }
>         .debugCodegen()
>         .show()
> }
> {code}
> c(it) is the creation of an unnamed tuple.
> It fails with this exception:
> {code}
> java.lang.NullPointerException: Null value appeared in non-nullable field:
> top level Product or row object
> If the schema is inferred from a Scala tuple/case class, or a Java bean,
> please try to use scala.Option[_] or other nullable types (e.g.
> java.lang.Integer instead of int/scala.Int).
> 	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.serializefromobject_doConsume_0$(Unknown Source)
> 	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.mapelements_doConsume_0$(Unknown Source)
> 	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.deserializetoobject_doConsume_0$(Unknown Source)
> 	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
> 	…
> {code}
> I know it won't work in Scala, so I could stop here. But it works in Kotlin
> if I turn wholestage codegen off!
> Moreover, if we dig into the generated code (when wholestage codegen is on),
> we'll see that the flow is basically the following:
> if one of the elements in the source dataset was null, we will throw an NPE no matter what.
> The flow is as follows:
> {code}
> private void serializefromobject_doConsume_0(org.jetbrains.spark.api.Arity1 serializefromobject_expr_0_0, boolean serializefromobject_exprIsNull_0_0) throws java.io.IOException {
> serializefromobject_doConsume_0(mapelements_value_1, mapelements_isNull_1);
> mapelements_isNull_1 = mapelements_resultIsNull_0;
> mapelements_resultIsNull_0 = mapelements_exprIsNull_0_0;
> private void mapelements_doConsume_0(java.lang.Integer mapelements_expr_0_0, boolean mapelements_exprIsNull_0_0) throws java.io.IOException {
> mapelements_doConsume_0(deserializetoobject_value_0, deserializetoobject_isNull_0);
> deserializetoobject_resultIsNull_0 = deserializetoobject_exprIsNull_0_0;
> private void deserializetoobject_doConsume_0(InternalRow localtablescan_row_0, int deserializetoobject_expr_0_0, boolean deserializetoobject_exprIsNull_0_0) throws java.io.IOException {
> deserializetoobject_doConsume_0(localtablescan_row_0, localtablescan_value_0, localtablescan_isNull_0);
> boolean localtablescan_isNull_0 = localtablescan_row_0.isNullAt(0);
> mapelements_isNull_1 = true;
> {code}
> You can find the generated code in its original form, and in a slightly simplified and
> refactored version, [here|https://gist.github.com/asm0dey/5c0fa4c985ab999b383d16257b515100]
> I believe that Spark should not behave differently with wholestage codegen
> on and off; the difference in behavior looks like a bug.
> My Spark version is 3.0.0-preview2

--
This message was sent by Atlassian Jira (v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
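The divergence the reporter describes can be sketched as a toy Python model (not Spark's actual generated code; the function names and the exact failure ordering are invented for illustration): the interpreted path evaluates the map per row and lets a null element flow through, while a fused pipeline that asserts non-nullability at the serialization step fails the whole partition as soon as one null input appears.

```python
# Toy model of the two evaluation modes. All names are invented; this is
# not Spark's generated code, only an illustration of the reported behavior.

def interpreted_map(rows, f):
    # Null-safe per-row evaluation: a null input stays null in the output.
    return [None if r is None else f(r) for r in rows]

def fused_map(rows, f):
    # Fused pipeline with the reported behavior: the serialization step
    # treats the top-level object as non-nullable and raises instead of
    # propagating the null flag downstream.
    out = []
    for r in rows:
        value = None if r is None else f(r)
        if value is None:
            raise ValueError(
                "Null value appeared in non-nullable field: "
                "top level Product or row object")
        out.append(value)
    return out
```

With `[1, None, 2]` as input, `interpreted_map` yields three rows (one of them null), while `fused_map` raises on the null element, which mirrors the on/off difference described above.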
[jira] [Updated] (SPARK-31854) Different results of query execution with wholestage codegen on and off
[ https://issues.apache.org/jira/browse/SPARK-31854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-31854: Affects Version/s: 2.4.5
[jira] [Assigned] (SPARK-31854) Different results of query execution with wholestage codegen on and off
[ https://issues.apache.org/jira/browse/SPARK-31854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31854: Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-31854) Different results of query execution with wholestage codegen on and off
[ https://issues.apache.org/jira/browse/SPARK-31854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31854: Assignee: Apache Spark
[jira] [Commented] (SPARK-31854) Different results of query execution with wholestage codegen on and off
[ https://issues.apache.org/jira/browse/SPARK-31854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17120413#comment-17120413 ] Apache Spark commented on SPARK-31854: User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/28681
[jira] [Commented] (SPARK-31854) Different results of query execution with wholestage codegen on and off
[ https://issues.apache.org/jira/browse/SPARK-31854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17120412#comment-17120412 ] Takeshi Yamamuro commented on SPARK-31854: Thanks for your report. Yeah, that should be a bug in the whole-stage codegen, as you said.
[jira] [Commented] (SPARK-31836) input_file_name() gives wrong value following Python UDF usage
[ https://issues.apache.org/jira/browse/SPARK-31836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17120375#comment-17120375 ] Adam Binford commented on SPARK-31836: Confirmed this is also an issue on 2.4.5. I could also recreate it with just two files, without streaming, using
{code:java}
spark.sql.files.openCostInBytes 0{code}
to make sure both files ended up in a single partition. The behavior seems to be that, after a Python UDF, all rows in a partition get the input_file_name of the last row in the partition. But that's an assumption based on a tiny test. Doing
{code:java}
df = (df
    .withColumn('before', input_file_name())
    .withColumn('during', udf(lambda x: x)(input_file_name()))
    .withColumn('after', input_file_name())
)
{code}
'before' and 'during' are correct, while 'after' is incorrect (all values are the last file in the partition).

> input_file_name() gives wrong value following Python UDF usage
> --------------------------------------------------------------
>
> Key: SPARK-31836
> URL: https://issues.apache.org/jira/browse/SPARK-31836
> Project: Spark
> Issue Type: Bug
> Components: SQL, Structured Streaming
> Affects Versions: 3.0.0
> Reporter: Wesley Hildebrandt
> Priority: Major
>
> I'm using PySpark for Spark 3.0.0 RC1 with Python 3.6.8.
> The following commands demonstrate that the input_file_name() function
> sometimes returns the wrong filename following usage of a Python UDF:
> $ for i in `seq 5`; do echo $i > /tmp/test-file-$i; done
> $ pyspark
> >>> import pyspark.sql.functions as F
> >>> spark.readStream.text('file:///tmp/test-file-*', wholetext=True).withColumn('file1', F.input_file_name()).withColumn('udf', F.udf(lambda x:x)('value')).withColumn('file2', F.input_file_name()).writeStream.trigger(once=True).foreachBatch(lambda df,_: df.select('file1','file2').show(truncate=False, vertical=True)).start().awaitTermination()
> A few notes about this bug:
> * It happens with many different files, so it's not related to the file contents
> * It also happens loading files from HDFS, so storage location is not a factor
> * It also happens using .csv() to read the files instead of .text(), so input format is not a factor
> * I have not been able to cause the error without using readStream, so it seems to be related to streaming
> * The bug also happens using spark-submit to send a job to my cluster
> * I haven't tested an older version, but it's possible that pulls 24958 and 25321 ([https://github.com/apache/spark/pull/24958], [https://github.com/apache/spark/pull/25321]), which fixed SPARK-28153 (https://issues.apache.org/jira/browse/SPARK-28153), introduced this bug?
[jira] [Assigned] (SPARK-31874) Use `FastDateFormat` as the legacy fractional formatter
[ https://issues.apache.org/jira/browse/SPARK-31874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31874: Assignee: Apache Spark

> Use `FastDateFormat` as the legacy fractional formatter
> -------------------------------------------------------
>
> Key: SPARK-31874
> URL: https://issues.apache.org/jira/browse/SPARK-31874
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.1.0
> Reporter: Maxim Gekk
> Assignee: Apache Spark
> Priority: Major
>
> By default, {{HiveResult}}.{{hiveResultString}} retrieves timestamp values as
> instances of {{java.sql.Timestamp}} and uses the legacy parser
> {{SimpleDateFormat}} to convert the timestamps to strings. After the fix
> [#28024|https://github.com/apache/spark/pull/28024], the fractional formatter
> and its companion, the legacy formatter {{SimpleDateFormat}}, are created for
> every value. By switching from {{LegacySimpleTimestampFormatter}} to
> {{LegacyFastTimestampFormatter}}, we can utilize the internal cache of
> {{FastDateFormat}} and avoid re-parsing the default pattern {{yyyy-MM-dd HH:mm:ss}}.
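The caching idea behind this change can be sketched outside Spark (a hypothetical Python analogy; `get_formatter` and `to_strings` are invented names, and Python's strftime stands in for the Java pattern parser): build the formatter once per distinct pattern and reuse it across values, instead of interpreting the pattern for every value.

```python
from datetime import datetime
from functools import lru_cache

# Hypothetical analogy: FastDateFormat.getInstance keeps an internal cache
# keyed on the pattern, so the pattern is interpreted once; constructing a
# SimpleDateFormat per value re-parses it every time.

@lru_cache(maxsize=None)
def get_formatter(pattern):
    # Built once per distinct pattern string, then reused for every value.
    return lambda ts: ts.strftime(pattern)

def to_strings(timestamps, pattern="%Y-%m-%d %H:%M:%S"):
    fmt = get_formatter(pattern)  # cache hit on every call after the first
    return [fmt(ts) for ts in timestamps]
```

The design choice is the same in both worlds: the per-pattern work (parsing the format string) is hoisted out of the per-value loop.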
[jira] [Assigned] (SPARK-31874) Use `FastDateFormat` as the legacy fractional formatter
[ https://issues.apache.org/jira/browse/SPARK-31874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31874: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-31874) Use `FastDateFormat` as the legacy fractional formatter
[ https://issues.apache.org/jira/browse/SPARK-31874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17120368#comment-17120368 ] Apache Spark commented on SPARK-31874: User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/28678
[jira] [Created] (SPARK-31874) Use `FastDateFormat` as the legacy fractional formatter
Maxim Gekk created SPARK-31874: Summary: Use `FastDateFormat` as the legacy fractional formatter Key: SPARK-31874 URL: https://issues.apache.org/jira/browse/SPARK-31874 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Maxim Gekk
[jira] [Resolved] (SPARK-31866) Add partitioning hints in SQL reference
[ https://issues.apache.org/jira/browse/SPARK-31866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-31866. Fix Version/s: 3.0.0 Assignee: Huaxin Gao (was: Apache Spark) Resolution: Fixed Resolved by https://github.com/apache/spark/pull/28672

> Add partitioning hints in SQL reference
> ---------------------------------------
>
> Key: SPARK-31866
> URL: https://issues.apache.org/jira/browse/SPARK-31866
> Project: Spark
> Issue Type: Sub-task
> Components: Documentation, SQL
> Affects Versions: 3.0.0
> Reporter: Huaxin Gao
> Assignee: Huaxin Gao
> Priority: Major
> Fix For: 3.0.0
>
> Add the Coalesce/Repartition/Repartition_By_Range partitioning hints to the SQL reference
[jira] [Updated] (SPARK-31866) Add partitioning hints in SQL reference
[ https://issues.apache.org/jira/browse/SPARK-31866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-31866: - Priority: Minor (was: Major)
> Add partitioning hints in SQL reference > --- > > Key: SPARK-31866 > URL: https://issues.apache.org/jira/browse/SPARK-31866 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Minor > Fix For: 3.0.0
[jira] [Commented] (SPARK-31873) Spark Sql Function year does not extract year from date/timestamp
[ https://issues.apache.org/jira/browse/SPARK-31873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17120335#comment-17120335 ] Rakesh Raushan commented on SPARK-31873: Yeah, this is a problem with 2.4.5.
{code:java}
scala> val df = Seq(("1300-01-03 00:00:00")).toDF("date_val").withColumn("date_val_ts", to_timestamp(col("date_val"))).withColumn("year_val", year(to_timestamp(col("date_val"))))
df: org.apache.spark.sql.DataFrame = [date_val: string, date_val_ts: timestamp ... 1 more field]

scala> df.show
+-------------------+-------------------+--------+
|           date_val|        date_val_ts|year_val|
+-------------------+-------------------+--------+
|1300-01-03 00:00:00|1300-01-03 00:00:00|    1299|
+-------------------+-------------------+--------+
{code}
[~hyukjin.kwon] Does this need to be fixed in 2.4.5? If so, I can check this.
> Spark Sql Function year does not extract year from date/timestamp
[jira] [Commented] (SPARK-31873) Spark Sql Function year does not extract year from date/timestamp
[ https://issues.apache.org/jira/browse/SPARK-31873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17120310#comment-17120310 ] Deepak Shingavi commented on SPARK-31873: - Have you tested it on 2.4.5? [~rakson]
> Spark Sql Function year does not extract year from date/timestamp
[jira] [Comment Edited] (SPARK-31873) Spark Sql Function year does not extract year from date/timestamp
[ https://issues.apache.org/jira/browse/SPARK-31873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17120301#comment-17120301 ] Rakesh Raushan edited comment on SPARK-31873 at 5/30/20, 4:28 PM: --
{code:java}
scala> val df = Seq(("1300-01-03 00:00:00")).toDF("date_val").withColumn("date_val_ts", to_timestamp(col("date_val"))).withColumn("year_val", year(to_timestamp(col("date_val"))))
df: org.apache.spark.sql.DataFrame = [date_val: string, date_val_ts: timestamp ... 1 more field]

scala> df.show
+-------------------+-------------------+--------+
|           date_val|        date_val_ts|year_val|
+-------------------+-------------------+--------+
|1300-01-03 00:00:00|1300-01-03 00:00:00|    1300|
+-------------------+-------------------+--------+
{code}
This works fine with the master branch.
> Spark Sql Function year does not extract year from date/timestamp
[jira] [Commented] (SPARK-31873) Spark Sql Function year does not extract year from date/timestamp
[ https://issues.apache.org/jira/browse/SPARK-31873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17120301#comment-17120301 ] Rakesh Raushan commented on SPARK-31873:
{code:java}
scala> val df = Seq(("1300-01-03 00:00:00")).toDF("date_val").withColumn("date_val_ts", to_timestamp(col("date_val"))).withColumn("year_val", year(to_timestamp(col("date_val"))))
df: org.apache.spark.sql.DataFrame = [date_val: string, date_val_ts: timestamp ... 1 more field]

scala> df.show
+-------------------+-------------------+--------+
|           date_val|        date_val_ts|year_val|
+-------------------+-------------------+--------+
|1300-01-03 00:00:00|1300-01-03 00:00:00|    1300|
+-------------------+-------------------+--------+
{code}
This works fine with the master branch.
> Spark Sql Function year does not extract year from date/timestamp
[jira] [Created] (SPARK-31873) Spark Sql Function year does not extract year from date/timestamp
Deepak Shingavi created SPARK-31873: --- Summary: Spark Sql Function year does not extract year from date/timestamp Key: SPARK-31873 URL: https://issues.apache.org/jira/browse/SPARK-31873 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.5 Reporter: Deepak Shingavi
There is a Spark SQL function org.apache.spark.sql.functions.year which fails in the case below:
{code:java}
// Code to extract year from Timestamp
val df = Seq(
  ("1300-01-03 00:00:00")
).toDF("date_val")
  .withColumn("date_val_ts", to_timestamp(col("date_val")))
  .withColumn("year_val", year(to_timestamp(col("date_val"))))

df.show()

// Output of the above code
+-------------------+-------------------+--------+
|           date_val|        date_val_ts|year_val|
+-------------------+-------------------+--------+
|1300-01-03 00:00:00|1300-01-03 00:00:00|    1299|
+-------------------+-------------------+--------+
{code}
The above code works perfectly for all years greater than 1300.
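The likely explanation for the `1299` result is the calendar switch: Spark 2.4 works on the legacy hybrid (Julian, before 1582) calendar, while the input is interpreted as a proleptic Gregorian date, and Gregorian 1300-01-03 falls in the year 1299 of the Julian calendar. A sketch using the standard Julian Day Number conversion formulas (not Spark code) makes the eight-day shift visible:

```python
def gregorian_to_jdn(y, m, d):
    # Proleptic Gregorian calendar date -> Julian Day Number.
    a = (14 - m) // 12
    y2 = y + 4800 - a
    m2 = m + 12 * a - 3
    return d + (153 * m2 + 2) // 5 + 365 * y2 + y2 // 4 - y2 // 100 + y2 // 400 - 32045

def jdn_to_julian(jdn):
    # Julian Day Number -> date in the Julian calendar.
    c = jdn + 32082
    d2 = (4 * c + 3) // 1461
    e = c - (1461 * d2) // 4
    m = (5 * e + 2) // 153
    day = e - (153 * m + 2) // 5 + 1
    month = m + 3 - 12 * (m // 10)
    year = d2 - 4800 + m // 10
    return year, month, day

# Gregorian 1300-01-03 is Julian 1299-12-27, which is where year_val = 1299
# comes from in the legacy calendar.
print(jdn_to_julian(gregorian_to_jdn(1300, 1, 3)))  # (1299, 12, 27)
```

Spark 3.0 moved to the proleptic Gregorian calendar throughout, which is why the master branch returns 1300.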
[jira] [Created] (SPARK-31872) NotNullSafe to get complementary set
Xiaoju Wu created SPARK-31872: - Summary: NotNullSafe to get complementary set Key: SPARK-31872 URL: https://issues.apache.org/jira/browse/SPARK-31872 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0, 2.3.0, 3.0.0 Reporter: Xiaoju Wu
If we have a filter expression that selects a subset of rows and we then want the complementary set, Not(expression) cannot work: Not is NullIntolerant, so if expression.eval(row) is null, the filter predicate is false for Not(expression) as well, and the row appears in neither the subset nor the complementary set. So we may need a NotNullSafe implementation that evaluates to true when expression.eval(row) is null, in order to get the complementary set.
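The proposal above is about SQL's three-valued logic, which can be sketched in Python with `None` standing in for NULL. Under the current NullIntolerant `Not`, a row whose predicate evaluates to NULL is kept by neither the filter nor its negation; the proposed behavior (here under the hypothetical name `not_null_safe`) maps NULL to true so the complement picks the row up:

```python
def sql_not(v):
    # SQL three-valued NOT: NOT NULL is NULL.
    return None if v is None else (not v)

def not_null_safe(v):
    # Proposed null-safe NOT: NULL counts as "not in the subset".
    return True if v is None else (not v)

def pred(x):
    # NULL-propagating predicate, like `x = 1` in SQL.
    return None if x is None else x == 1

rows = [1, None, 2]
# SQL filters keep only rows whose predicate is TRUE (not FALSE, not NULL).
subset = [r for r in rows if pred(r) is True]
complement = [r for r in rows if sql_not(pred(r)) is True]
complement_safe = [r for r in rows if not_null_safe(pred(r)) is True]

print(subset)           # [1]
print(complement)       # [2]        the None row is lost
print(complement_safe)  # [None, 2]  true complementary set
```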
[jira] [Commented] (SPARK-31871) Display the canvas element icon for sorting column
[ https://issues.apache.org/jira/browse/SPARK-31871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17120219#comment-17120219 ] Apache Spark commented on SPARK-31871: -- User 'liucht-inspur' has created a pull request for this issue: https://github.com/apache/spark/pull/28680
> Display the canvas element icon for sorting column > -- > > Key: SPARK-31871 > URL: https://issues.apache.org/jira/browse/SPARK-31871 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 2.4.3, 2.4.4, 2.4.5 >Reporter: liucht-inspur >Priority: Minor > > On the History Server page and the Executor page, the sorting icon cannot be > displayed when a column header is clicked, due to a wrong canvas element > image path. The erroneous path is corrected to improve the user experience.
[jira] [Assigned] (SPARK-31871) Display the canvas element icon for sorting column
[ https://issues.apache.org/jira/browse/SPARK-31871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31871: Assignee: (was: Apache Spark)
> Display the canvas element icon for sorting column
[jira] [Commented] (SPARK-31871) Display the canvas element icon for sorting column
[ https://issues.apache.org/jira/browse/SPARK-31871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17120218#comment-17120218 ] Apache Spark commented on SPARK-31871: -- User 'liucht-inspur' has created a pull request for this issue: https://github.com/apache/spark/pull/28680
> Display the canvas element icon for sorting column
[jira] [Assigned] (SPARK-31871) Display the canvas element icon for sorting column
[ https://issues.apache.org/jira/browse/SPARK-31871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31871: Assignee: Apache Spark
> Display the canvas element icon for sorting column
[jira] [Created] (SPARK-31871) Display the canvas element icon for sorting column
liucht-inspur created SPARK-31871: - Summary: Display the canvas element icon for sorting column Key: SPARK-31871 URL: https://issues.apache.org/jira/browse/SPARK-31871 Project: Spark Issue Type: Bug Components: Spark Core, Web UI Affects Versions: 2.4.5, 2.4.4, 2.4.3 Reporter: liucht-inspur
On the History Server page and the Executor page, the sorting icon cannot be displayed when a column header is clicked, due to a wrong canvas element image path. The erroneous path is corrected to improve the user experience.
[jira] [Commented] (SPARK-31870) AdaptiveQueryExecSuite: "Do not optimize skew join if introduce additional shuffle" test has no skew join
[ https://issues.apache.org/jira/browse/SPARK-31870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17120205#comment-17120205 ] Apache Spark commented on SPARK-31870: -- User 'manuzhang' has created a pull request for this issue: https://github.com/apache/spark/pull/28679
> AdaptiveQueryExecSuite: "Do not optimize skew join if introduce additional > shuffle" test has no skew join > - > > Key: SPARK-31870 > URL: https://issues.apache.org/jira/browse/SPARK-31870 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Manu Zhang >Priority: Minor > > The test has no skew join due to incorrect configurations of > - spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes > - spark.sql.adaptive.advisoryPartitionSizeInBytes
[jira] [Assigned] (SPARK-31870) AdaptiveQueryExecSuite: "Do not optimize skew join if introduce additional shuffle" test has no skew join
[ https://issues.apache.org/jira/browse/SPARK-31870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31870: Assignee: (was: Apache Spark)
> AdaptiveQueryExecSuite: "Do not optimize skew join if introduce additional > shuffle" test has no skew join
[jira] [Assigned] (SPARK-31870) AdaptiveQueryExecSuite: "Do not optimize skew join if introduce additional shuffle" test has no skew join
[ https://issues.apache.org/jira/browse/SPARK-31870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31870: Assignee: Apache Spark
> AdaptiveQueryExecSuite: "Do not optimize skew join if introduce additional > shuffle" test has no skew join
[jira] [Updated] (SPARK-31870) AdaptiveQueryExecSuite: "Do not optimize skew join if introduce additional shuffle" test has no skew join
[ https://issues.apache.org/jira/browse/SPARK-31870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manu Zhang updated SPARK-31870: --- Summary: AdaptiveQueryExecSuite: "Do not optimize skew join if introduce additional shuffle" test has no skew join (was: AdaptiveQueryExecSuite: "Do not optimize skew join if introduce additional shuffle" test doesn't optimize skew join at all)
> AdaptiveQueryExecSuite: "Do not optimize skew join if introduce additional > shuffle" test has no skew join
[jira] [Created] (SPARK-31870) AdaptiveQueryExecSuite: "Do not optimize skew join if introduce additional shuffle" test doesn't optimize skew join at all
Manu Zhang created SPARK-31870: -- Summary: AdaptiveQueryExecSuite: "Do not optimize skew join if introduce additional shuffle" test doesn't optimize skew join at all Key: SPARK-31870 URL: https://issues.apache.org/jira/browse/SPARK-31870 Project: Spark Issue Type: Bug Components: SQL, Tests Affects Versions: 3.0.0 Reporter: Manu Zhang
The test has no skew join due to incorrect configurations of - spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes - spark.sql.adaptive.advisoryPartitionSizeInBytes
[jira] [Resolved] (SPARK-31864) Adjust AQE skew join trigger condition
[ https://issues.apache.org/jira/browse/SPARK-31864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31864. - Fix Version/s: 3.0.0 Assignee: Wei Xue Resolution: Fixed
> Adjust AQE skew join trigger condition > -- > > Key: SPARK-31864 > URL: https://issues.apache.org/jira/browse/SPARK-31864 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wei Xue >Assignee: Wei Xue >Priority: Minor > Fix For: 3.0.0 > > > Instead of using the raw partition sizes, we should use coalesced partition > sizes to test skew.
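The change above (test skew against coalesced rather than raw partition sizes) can be illustrated with a toy model, not Spark's actual code: treat a partition as skewed when it exceeds a factor times the median size. With many tiny raw partitions the median is dragged down and an ordinary partition looks skewed; after coalescing small partitions toward a target size, the median reflects the real post-shuffle layout. The threshold factor and target here are made-up numbers:

```python
from statistics import median

def skewed(sizes, skew_factor=5):
    # A partition is "skewed" if it is larger than skew_factor * median size.
    m = median(sizes)
    return [s for s in sizes if s > skew_factor * m]

def coalesce(sizes, target):
    # Greedily merge adjacent partitions until each bin reaches ~target bytes,
    # mimicking AQE's coalescing of small shuffle partitions.
    out, acc = [], 0
    for s in sizes:
        acc += s
        if acc >= target:
            out.append(acc)
            acc = 0
    if acc:
        out.append(acc)
    return out

raw = [2, 2, 2, 2, 2, 2, 2, 2, 100]   # many tiny partitions plus one larger one
print(skewed(raw))                     # [100] -> flagged as skew vs median 2
print(skewed(coalesce(raw, 16)))       # []    -> not skew vs coalesced sizes
```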
[jira] [Commented] (SPARK-31799) Spark Datasource Tables Creating Incorrect Hive Metadata
[ https://issues.apache.org/jira/browse/SPARK-31799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17120136#comment-17120136 ] L. C. Hsieh commented on SPARK-31799: - This happens when Spark SQL thinks it cannot save the data source table in a Hive-compatible way. Such data source tables are then readable only by Spark.
> Spark Datasource Tables Creating Incorrect Hive Metadata > > > Key: SPARK-31799 > URL: https://issues.apache.org/jira/browse/SPARK-31799 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5 >Reporter: Anoop Johnson >Priority: Major > > I found that if I create a CSV or JSON table using Spark SQL, it writes the > wrong Hive table metadata, breaking compatibility with other query engines > like Hive and Presto. Here is a very simple example:
> {code:sql}
> CREATE TABLE test_csv (id String, name String)
> USING csv
> LOCATION 's3://[...]'
> ;
> {code}
> If you describe the table using Presto, you will see:
> {code:sql}
> CREATE EXTERNAL TABLE `test_csv`(
>   `col` array<string> COMMENT 'from deserializer')
> ROW FORMAT SERDE
>   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
> WITH SERDEPROPERTIES (
>   'path'='s3://[...]')
> STORED AS INPUTFORMAT
>   'org.apache.hadoop.mapred.SequenceFileInputFormat'
> OUTPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat'
> LOCATION
>   's3://[...]/test_csv-__PLACEHOLDER__'
> TBLPROPERTIES (
>   'spark.sql.create.version'='2.4.4',
>   'spark.sql.sources.provider'='csv',
>   'spark.sql.sources.schema.numParts'='1',
>   'spark.sql.sources.schema.part.0'='{\"type\":\"struct\",\"fields\":[{\"name\":\"id\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"name\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}}]}',
>   'transient_lastDdlTime'='1590196086')
> ;
> {code}
> The table location is set to a placeholder value, and the schema is always set > to _col array<string>_. The serde/inputformat is wrong: it says > _SequenceFileInputFormat_ and _LazySimpleSerDe_ even though the requested > format is CSV. > All the right metadata is written to the custom table properties with the > prefix _spark.sql_. However, Hive and Presto do not understand these table > properties, and this breaks them. I could reproduce this with JSON too, but > not with Parquet. > I root-caused this issue to CSV and JSON tables not being handled > [here|https://github.com/apache/spark/blob/721cba540292d8d76102b18922dabe2a7d918dc5/sql/core/src/main/scala/org/apache/spark/sql/internal/HiveSerDe.scala#L31-L66] > in HiveSerDe.scala. As a result, these default values are written. > Is there a reason why CSV and JSON are not handled? I could send a patch to > fix this, but the caveat is that the CSV and JSON Hive serdes would need to be in > the Spark classpath, otherwise the table creation will fail.
[jira] [Assigned] (SPARK-31863) Thriftserver not setting active SparkSession, SQLConf.get not getting session configs correctly
[ https://issues.apache.org/jira/browse/SPARK-31863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-31863: --- Assignee: Juliusz Sompolski (was: Apache Spark)
> Thriftserver not setting active SparkSession, SQLConf.get not getting session > configs correctly > --- > > Key: SPARK-31863 > URL: https://issues.apache.org/jira/browse/SPARK-31863 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Juliusz Sompolski >Assignee: Juliusz Sompolski >Priority: Major > Fix For: 3.0.0 > > > Thriftserver is not setting the active SparkSession. > Because of that, configuration obtained with SQLConf.get is not the session > configuration. > As a result, many configs set via "set" in the session do not take effect correctly.
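The failure mode above can be modeled in a few lines. This is a toy model, not Spark's code: a thread-local "active session" holds per-session overrides, and a `SQLConf.get`-style lookup falls back to defaults whenever the caller's thread never registered the session, so `SET` values become invisible:

```python
import threading

DEFAULTS = {"spark.sql.shuffle.partitions": "200"}
_active = threading.local()

class Session:
    def __init__(self):
        self.conf = dict(DEFAULTS)
    def set(self, k, v):
        # Analogue of a session-level SET command.
        self.conf[k] = v

def set_active_session(s):
    _active.session = s

def sqlconf_get(key):
    # Analogue of SQLConf.get: uses the active session if one is set on
    # this thread, otherwise silently falls back to the defaults.
    s = getattr(_active, "session", None)
    return s.conf[key] if s is not None else DEFAULTS[key]

s = Session()
s.set("spark.sql.shuffle.partitions", "8")

# Without the active session registered (the Thriftserver bug), the
# session override is invisible:
print(sqlconf_get("spark.sql.shuffle.partitions"))  # 200
set_active_session(s)
print(sqlconf_get("spark.sql.shuffle.partitions"))  # 8
```

Because the registration is per-thread, a worker thread that never calls `set_active_session` still sees the defaults, which mirrors why the Thriftserver's execution threads missed session configs.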
[jira] [Resolved] (SPARK-31861) Thriftserver collecting timestamp not using spark.sql.session.timeZone
[ https://issues.apache.org/jira/browse/SPARK-31861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31861. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 28671 [https://github.com/apache/spark/pull/28671]
> Thriftserver collecting timestamp not using spark.sql.session.timeZone > -- > > Key: SPARK-31861 > URL: https://issues.apache.org/jira/browse/SPARK-31861 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Juliusz Sompolski >Assignee: Juliusz Sompolski >Priority: Major > Fix For: 3.0.0 > > > If the JDBC client is in TimeZone PST, sets spark.sql.session.timeZone to > PST, and sends the query "SELECT timestamp '2020-05-20 12:00:00'", and the JVM > timezone of the Spark cluster is e.g. CET, then: > - the timestamp literal in the query is interpreted as 12:00:00 PST, i.e. > 21:00:00 CET > - but currently, when it's returned, the timestamps are collected from the > query with a collect() in > https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala#L299, > and then in the end Timestamps are turned into strings using a t.toString() > in > https://github.com/apache/spark/blob/master/sql/hive-thriftserver/v2.3/src/main/java/org/apache/hive/service/cli/ColumnValue.java#L138 > This uses the Spark cluster TimeZone, so "21:00:00" is > returned to the JDBC application.
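The nine-hour shift described above can be reproduced with Python's `zoneinfo` (3.9+). This is only an illustration of the arithmetic, not Spark code; the zone names are stand-ins for the loosely-stated "PST" session zone and "CET" cluster JVM zone (on 2020-05-20 both are on daylight time, PDT/CEST):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

session_tz = ZoneInfo("America/Los_Angeles")   # stand-in for spark.sql.session.timeZone
cluster_tz = ZoneInfo("Europe/Berlin")         # stand-in for the cluster JVM zone

# The literal is interpreted as wall-clock 12:00 in the session zone...
ts = datetime(2020, 5, 20, 12, 0, 0, tzinfo=session_tz)
# ...but rendering the same instant in the cluster zone gives 21:00,
# which is the string the JDBC client receives.
print(ts.astimezone(cluster_tz).strftime("%Y-%m-%d %H:%M:%S"))  # 2020-05-20 21:00:00
```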
[jira] [Resolved] (SPARK-31863) Thriftserver not setting active SparkSession, SQLConf.get not getting session configs correctly
[ https://issues.apache.org/jira/browse/SPARK-31863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31863. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 28671 [https://github.com/apache/spark/pull/28671]
> Thriftserver not setting active SparkSession, SQLConf.get not getting session > configs correctly
[jira] [Assigned] (SPARK-31861) Thriftserver collecting timestamp not using spark.sql.session.timeZone
[ https://issues.apache.org/jira/browse/SPARK-31861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-31861: --- Assignee: Juliusz Sompolski
> Thriftserver collecting timestamp not using spark.sql.session.timeZone
[jira] [Resolved] (SPARK-31859) Thriftserver with spark.sql.datetime.java8API.enabled=true
[ https://issues.apache.org/jira/browse/SPARK-31859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31859. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 28671 [https://github.com/apache/spark/pull/28671]
> Thriftserver with spark.sql.datetime.java8API.enabled=true > -- > > Key: SPARK-31859 > URL: https://issues.apache.org/jira/browse/SPARK-31859 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Juliusz Sompolski >Assignee: Juliusz Sompolski >Priority: Major > Fix For: 3.0.0 > >
> {code}
> test("spark.sql.datetime.java8API.enabled=true") {
>   withJdbcStatement() { st =>
>     st.execute("set spark.sql.datetime.java8API.enabled=true")
>     val rs = st.executeQuery("select timestamp '2020-05-28 00:00:00'")
>     rs.next()
>     // scalastyle:off
>     println(rs.getObject(1))
>   }
> }
> {code}
> fails with
> {code}
> HiveThriftBinaryServerSuite:
> java.lang.IllegalArgumentException: Timestamp format must be yyyy-mm-dd hh:mm:ss[.fffffffff]
>   at java.sql.Timestamp.valueOf(Timestamp.java:204)
>   at org.apache.hive.jdbc.HiveBaseResultSet.evaluate(HiveBaseResultSet.java:444)
>   at org.apache.hive.jdbc.HiveBaseResultSet.getColumnValue(HiveBaseResultSet.java:424)
>   at org.apache.hive.jdbc.HiveBaseResultSet.getObject(HiveBaseResultSet.java:464)
> {code}
> It seems it might need to be handled in HiveResult.toHiveString?
> cc [~maxgekk]
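The stack trace suggests the shape of the failure: `java.sql.Timestamp.valueOf` only accepts `yyyy-mm-dd hh:mm:ss[.fffffffff]`, while `java.time` values render in ISO-8601 with a `T` separator, so if the server sends such a string the client-side parse blows up. A Python analogue of that mismatch (an illustration only, not the actual Hive JDBC code; the ISO-style input string is an assumption about what the server produced):

```python
from datetime import datetime

# A parser fixed to the "yyyy-MM-dd HH:mm:ss" shape, like Timestamp.valueOf.
PATTERN = "%Y-%m-%d %H:%M:%S"

# The space-separated rendering parses fine...
assert datetime.strptime("2020-05-28 00:00:00", PATTERN).year == 2020

# ...but an ISO-8601 rendering with 'T' (java.time-style) is rejected,
# mirroring the IllegalArgumentException above.
try:
    datetime.strptime("2020-05-28T00:00:00Z", PATTERN)
except ValueError as e:
    print("parse failed:", e)
```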
[jira] [Assigned] (SPARK-31859) Thriftserver with spark.sql.datetime.java8API.enabled=true
[ https://issues.apache.org/jira/browse/SPARK-31859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-31859: --- Assignee: Juliusz Sompolski
> Thriftserver with spark.sql.datetime.java8API.enabled=true