[jira] [Assigned] (SPARK-33904) Recognize `spark_catalog` in `saveAsTable()` and `insertInto()`
[ https://issues.apache.org/jira/browse/SPARK-33904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan reassigned SPARK-33904:
-----------------------------------

    Assignee: Maxim Gekk

> Recognize `spark_catalog` in `saveAsTable()` and `insertInto()`
> ---------------------------------------------------------------
>
> Key: SPARK-33904
> URL: https://issues.apache.org/jira/browse/SPARK-33904
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.2.0
> Reporter: Maxim Gekk
> Assignee: Maxim Gekk
> Priority: Major
>
> The v1 INSERT INTO command recognizes `spark_catalog` as the default session catalog:
> {code:sql}
> spark-sql> create table spark_catalog.ns.tbl (c int);
> spark-sql> insert into spark_catalog.ns.tbl select 0;
> spark-sql> select * from spark_catalog.ns.tbl;
> 0
> {code}
> but the `saveAsTable()` and `insertInto()` methods do not allow writing to a table whose catalog is explicitly specified as `spark_catalog`:
> {code:scala}
> scala> sql("CREATE NAMESPACE spark_catalog.ns")
> scala> Seq(0).toDF().write.saveAsTable("spark_catalog.ns.tbl")
> org.apache.spark.sql.AnalysisException: Couldn't find a catalog to handle the identifier spark_catalog.ns.tbl.
>   at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:629)
>   ... 47 elided
> scala> Seq(0).toDF().write.insertInto("spark_catalog.ns.tbl")
> org.apache.spark.sql.AnalysisException: Couldn't find a catalog to handle the identifier spark_catalog.ns.tbl.
>   at org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:498)
>   ... 47 elided
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33904) Recognize `spark_catalog` in `saveAsTable()` and `insertInto()`
[ https://issues.apache.org/jira/browse/SPARK-33904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-33904.
---------------------------------
    Fix Version/s: 3.2.0
       Resolution: Fixed

Issue resolved by pull request 30919
[https://github.com/apache/spark/pull/30919]

> Recognize `spark_catalog` in `saveAsTable()` and `insertInto()`
> ---------------------------------------------------------------
>
> Key: SPARK-33904
> URL: https://issues.apache.org/jira/browse/SPARK-33904
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.2.0
> Reporter: Maxim Gekk
> Assignee: Maxim Gekk
> Priority: Major
> Fix For: 3.2.0
>
> The v1 INSERT INTO command recognizes `spark_catalog` as the default session catalog:
> {code:sql}
> spark-sql> create table spark_catalog.ns.tbl (c int);
> spark-sql> insert into spark_catalog.ns.tbl select 0;
> spark-sql> select * from spark_catalog.ns.tbl;
> 0
> {code}
> but the `saveAsTable()` and `insertInto()` methods do not allow writing to a table whose catalog is explicitly specified as `spark_catalog`:
> {code:scala}
> scala> sql("CREATE NAMESPACE spark_catalog.ns")
> scala> Seq(0).toDF().write.saveAsTable("spark_catalog.ns.tbl")
> org.apache.spark.sql.AnalysisException: Couldn't find a catalog to handle the identifier spark_catalog.ns.tbl.
>   at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:629)
>   ... 47 elided
> scala> Seq(0).toDF().write.insertInto("spark_catalog.ns.tbl")
> org.apache.spark.sql.AnalysisException: Couldn't find a catalog to handle the identifier spark_catalog.ns.tbl.
>   at org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:498)
>   ... 47 elided
> {code}
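The fix makes the writer-side path agree with what the SQL path already did: a leading `spark_catalog` part should route to the default session catalog. As a toy illustration only (this is not Spark's actual resolution code; the `CatalogRouting` object and its behavior are invented for the sketch):

```scala
// Toy sketch of multi-part identifier routing, assuming only that
// "spark_catalog" names the default session catalog. Not Spark's real parser.
object CatalogRouting {
  val SessionCatalog = "spark_catalog"

  // Splits "catalog.namespace.table" into (catalog, remaining parts),
  // defaulting to the session catalog when no explicit catalog is given.
  def resolve(identifier: String): (String, Seq[String]) = {
    val parts = identifier.split('.').toSeq
    parts match {
      case head +: rest if head == SessionCatalog && rest.nonEmpty =>
        (SessionCatalog, rest) // explicit spark_catalog prefix is recognized
      case _ =>
        (SessionCatalog, parts) // no catalog part: fall back to the session catalog
    }
  }
}
```

Under this sketch, `"spark_catalog.ns.tbl"` and `"ns.tbl"` both land in the session catalog with namespace `ns` and table `tbl`, which is the behavior the issue asks `saveAsTable()`/`insertInto()` to share with INSERT INTO.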
[jira] [Resolved] (SPARK-33926) Improve the error message in resolving of DSv1 multi-part identifiers
[ https://issues.apache.org/jira/browse/SPARK-33926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-33926.
---------------------------------
    Fix Version/s: 3.2.0
       Resolution: Fixed

Issue resolved by pull request 30963
[https://github.com/apache/spark/pull/30963]

> Improve the error message in resolving of DSv1 multi-part identifiers
> ---------------------------------------------------------------------
>
> Key: SPARK-33926
> URL: https://issues.apache.org/jira/browse/SPARK-33926
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.2.0
> Reporter: Maxim Gekk
> Assignee: Maxim Gekk
> Priority: Major
> Fix For: 3.2.0
>
> This is a follow-up of https://github.com/apache/spark/pull/30915#discussion_r549240857
[jira] [Assigned] (SPARK-33926) Improve the error message in resolving of DSv1 multi-part identifiers
[ https://issues.apache.org/jira/browse/SPARK-33926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan reassigned SPARK-33926:
-----------------------------------

    Assignee: Maxim Gekk

> Improve the error message in resolving of DSv1 multi-part identifiers
> ---------------------------------------------------------------------
>
> Key: SPARK-33926
> URL: https://issues.apache.org/jira/browse/SPARK-33926
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.2.0
> Reporter: Maxim Gekk
> Assignee: Maxim Gekk
> Priority: Major
>
> This is a follow-up of https://github.com/apache/spark/pull/30915#discussion_r549240857
[jira] [Commented] (SPARK-33940) allow configuring the max column name length in csv writer
[ https://issues.apache.org/jira/browse/SPARK-33940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17256357#comment-17256357 ]

Apache Spark commented on SPARK-33940:
--------------------------------------

User 'CodingCat' has created a pull request for this issue:
https://github.com/apache/spark/pull/30972

> allow configuring the max column name length in csv writer
> ----------------------------------------------------------
>
> Key: SPARK-33940
> URL: https://issues.apache.org/jira/browse/SPARK-33940
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.1.0
> Reporter: Nan Zhu
> Priority: Major
>
> The CSV writer has an implicit limit on column name length due to univocity-parsers.
> When we initialize a writer
> (https://github.com/uniVocity/univocity-parsers/blob/e09114c6879fa6c2c15e7365abc02cda3e193ff7/src/main/java/com/univocity/parsers/common/AbstractWriter.java#L211)
> it calls toIdentifierGroupArray, which eventually calls valueOf in NormalizedString.java
> (https://github.com/uniVocity/univocity-parsers/blob/e09114c6879fa6c2c15e7365abc02cda3e193ff7/src/main/java/com/univocity/parsers/common/NormalizedString.java#L205-L209).
> In that stringCache.get there is a maxStringLength cap
> (https://github.com/uniVocity/univocity-parsers/blob/e09114c6879fa6c2c15e7365abc02cda3e193ff7/src/main/java/com/univocity/parsers/common/StringCache.java#L104)
> which is 1024 by default.
> We do not expose this as a configurable option, leading to an NPE when a column name is longer than 1024 characters:
> {code:java}
> [info] Cause: java.lang.NullPointerException:
> [info]   at com.univocity.parsers.common.AbstractWriter.submitRow(AbstractWriter.java:349)
> [info]   at com.univocity.parsers.common.AbstractWriter.writeHeaders(AbstractWriter.java:444)
> [info]   at com.univocity.parsers.common.AbstractWriter.writeHeaders(AbstractWriter.java:410)
> [info]   at org.apache.spark.sql.catalyst.csv.UnivocityGenerator.writeHeaders(UnivocityGenerator.scala:87)
> [info]   at org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter$.writeHeaders(CsvOutputWriter.scala:58)
> [info]   at org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter.<init>(CsvOutputWriter.scala:44)
> [info]   at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anon$1.newInstance(CSVFileFormat.scala:86)
> [info]   at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:126)
> [info]   at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:111)
> [info]   at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:269)
> [info]   at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:210)
> {code}
> It can be reproduced by a simple unit test:
> {code:scala}
> val row1 = Row("a")
> val superLongHeader = (0 until 1025).map(_ => "c").mkString("")
> val df = Seq(s"${row1.getString(0)}").toDF(superLongHeader)
> df.repartition(1)
>   .write
>   .option("header", "true")
>   .option("maxColumnNameLength", 1025)
>   .csv(dataPath)
> {code}
[jira] [Assigned] (SPARK-33940) allow configuring the max column name length in csv writer
[ https://issues.apache.org/jira/browse/SPARK-33940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-33940:
------------------------------------

    Assignee: (was: Apache Spark)

> allow configuring the max column name length in csv writer
> ----------------------------------------------------------
>
> Key: SPARK-33940
> URL: https://issues.apache.org/jira/browse/SPARK-33940
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.1.0
> Reporter: Nan Zhu
> Priority: Major
>
> The CSV writer has an implicit limit on column name length due to univocity-parsers.
> When we initialize a writer
> (https://github.com/uniVocity/univocity-parsers/blob/e09114c6879fa6c2c15e7365abc02cda3e193ff7/src/main/java/com/univocity/parsers/common/AbstractWriter.java#L211)
> it calls toIdentifierGroupArray, which eventually calls valueOf in NormalizedString.java
> (https://github.com/uniVocity/univocity-parsers/blob/e09114c6879fa6c2c15e7365abc02cda3e193ff7/src/main/java/com/univocity/parsers/common/NormalizedString.java#L205-L209).
> In that stringCache.get there is a maxStringLength cap
> (https://github.com/uniVocity/univocity-parsers/blob/e09114c6879fa6c2c15e7365abc02cda3e193ff7/src/main/java/com/univocity/parsers/common/StringCache.java#L104)
> which is 1024 by default.
> We do not expose this as a configurable option, leading to an NPE when a column name is longer than 1024 characters:
> {code:java}
> [info] Cause: java.lang.NullPointerException:
> [info]   at com.univocity.parsers.common.AbstractWriter.submitRow(AbstractWriter.java:349)
> [info]   at com.univocity.parsers.common.AbstractWriter.writeHeaders(AbstractWriter.java:444)
> [info]   at com.univocity.parsers.common.AbstractWriter.writeHeaders(AbstractWriter.java:410)
> [info]   at org.apache.spark.sql.catalyst.csv.UnivocityGenerator.writeHeaders(UnivocityGenerator.scala:87)
> [info]   at org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter$.writeHeaders(CsvOutputWriter.scala:58)
> [info]   at org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter.<init>(CsvOutputWriter.scala:44)
> [info]   at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anon$1.newInstance(CSVFileFormat.scala:86)
> [info]   at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:126)
> [info]   at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:111)
> [info]   at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:269)
> [info]   at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:210)
> {code}
> It can be reproduced by a simple unit test:
> {code:scala}
> val row1 = Row("a")
> val superLongHeader = (0 until 1025).map(_ => "c").mkString("")
> val df = Seq(s"${row1.getString(0)}").toDF(superLongHeader)
> df.repartition(1)
>   .write
>   .option("header", "true")
>   .option("maxColumnNameLength", 1025)
>   .csv(dataPath)
> {code}
[jira] [Assigned] (SPARK-33940) allow configuring the max column name length in csv writer
[ https://issues.apache.org/jira/browse/SPARK-33940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-33940:
------------------------------------

    Assignee: Apache Spark

> allow configuring the max column name length in csv writer
> ----------------------------------------------------------
>
> Key: SPARK-33940
> URL: https://issues.apache.org/jira/browse/SPARK-33940
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.1.0
> Reporter: Nan Zhu
> Assignee: Apache Spark
> Priority: Major
>
> The CSV writer has an implicit limit on column name length due to univocity-parsers.
> When we initialize a writer
> (https://github.com/uniVocity/univocity-parsers/blob/e09114c6879fa6c2c15e7365abc02cda3e193ff7/src/main/java/com/univocity/parsers/common/AbstractWriter.java#L211)
> it calls toIdentifierGroupArray, which eventually calls valueOf in NormalizedString.java
> (https://github.com/uniVocity/univocity-parsers/blob/e09114c6879fa6c2c15e7365abc02cda3e193ff7/src/main/java/com/univocity/parsers/common/NormalizedString.java#L205-L209).
> In that stringCache.get there is a maxStringLength cap
> (https://github.com/uniVocity/univocity-parsers/blob/e09114c6879fa6c2c15e7365abc02cda3e193ff7/src/main/java/com/univocity/parsers/common/StringCache.java#L104)
> which is 1024 by default.
> We do not expose this as a configurable option, leading to an NPE when a column name is longer than 1024 characters:
> {code:java}
> [info] Cause: java.lang.NullPointerException:
> [info]   at com.univocity.parsers.common.AbstractWriter.submitRow(AbstractWriter.java:349)
> [info]   at com.univocity.parsers.common.AbstractWriter.writeHeaders(AbstractWriter.java:444)
> [info]   at com.univocity.parsers.common.AbstractWriter.writeHeaders(AbstractWriter.java:410)
> [info]   at org.apache.spark.sql.catalyst.csv.UnivocityGenerator.writeHeaders(UnivocityGenerator.scala:87)
> [info]   at org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter$.writeHeaders(CsvOutputWriter.scala:58)
> [info]   at org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter.<init>(CsvOutputWriter.scala:44)
> [info]   at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anon$1.newInstance(CSVFileFormat.scala:86)
> [info]   at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:126)
> [info]   at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:111)
> [info]   at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:269)
> [info]   at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:210)
> {code}
> It can be reproduced by a simple unit test:
> {code:scala}
> val row1 = Row("a")
> val superLongHeader = (0 until 1025).map(_ => "c").mkString("")
> val df = Seq(s"${row1.getString(0)}").toDF(superLongHeader)
> df.repartition(1)
>   .write
>   .option("header", "true")
>   .option("maxColumnNameLength", 1025)
>   .csv(dataPath)
> {code}
[jira] [Created] (SPARK-33940) allow configuring the max column name length in csv writer
Nan Zhu created SPARK-33940:
-------------------------------

             Summary: allow configuring the max column name length in csv writer
                 Key: SPARK-33940
                 URL: https://issues.apache.org/jira/browse/SPARK-33940
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.1.0
            Reporter: Nan Zhu

The CSV writer has an implicit limit on column name length due to univocity-parsers.

When we initialize a writer
(https://github.com/uniVocity/univocity-parsers/blob/e09114c6879fa6c2c15e7365abc02cda3e193ff7/src/main/java/com/univocity/parsers/common/AbstractWriter.java#L211)
it calls toIdentifierGroupArray, which eventually calls valueOf in NormalizedString.java
(https://github.com/uniVocity/univocity-parsers/blob/e09114c6879fa6c2c15e7365abc02cda3e193ff7/src/main/java/com/univocity/parsers/common/NormalizedString.java#L205-L209).

In that stringCache.get there is a maxStringLength cap
(https://github.com/uniVocity/univocity-parsers/blob/e09114c6879fa6c2c15e7365abc02cda3e193ff7/src/main/java/com/univocity/parsers/common/StringCache.java#L104)
which is 1024 by default.

We do not expose this as a configurable option, leading to an NPE when a column name is longer than 1024 characters:

{code:java}
[info] Cause: java.lang.NullPointerException:
[info]   at com.univocity.parsers.common.AbstractWriter.submitRow(AbstractWriter.java:349)
[info]   at com.univocity.parsers.common.AbstractWriter.writeHeaders(AbstractWriter.java:444)
[info]   at com.univocity.parsers.common.AbstractWriter.writeHeaders(AbstractWriter.java:410)
[info]   at org.apache.spark.sql.catalyst.csv.UnivocityGenerator.writeHeaders(UnivocityGenerator.scala:87)
[info]   at org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter$.writeHeaders(CsvOutputWriter.scala:58)
[info]   at org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter.<init>(CsvOutputWriter.scala:44)
[info]   at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anon$1.newInstance(CSVFileFormat.scala:86)
[info]   at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:126)
[info]   at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:111)
[info]   at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:269)
[info]   at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:210)
{code}

It can be reproduced by a simple unit test:

{code:scala}
val row1 = Row("a")
val superLongHeader = (0 until 1025).map(_ => "c").mkString("")
val df = Seq(s"${row1.getString(0)}").toDF(superLongHeader)
df.repartition(1)
  .write
  .option("header", "true")
  .option("maxColumnNameLength", 1025)
  .csv(dataPath)
{code}
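The failure mode above boils down to a length cap on interned header strings. A minimal plain-Scala sketch of that check (the `HeaderLengthCheck` object and `oversized` helper are hypothetical, not a Spark or univocity API; only the 1024 default comes from the linked StringCache source):

```scala
// Sketch of the header-length cap described in the report. Hypothetical
// helper, not Spark or univocity code; 1024 is univocity-parsers'
// default StringCache maxStringLength.
object HeaderLengthCheck {
  val DefaultMaxStringLength = 1024

  // Returns the column names that exceed the cap; in the real writer these
  // are the names that surface as an NPE inside AbstractWriter.writeHeaders.
  def oversized(names: Seq[String], max: Int = DefaultMaxStringLength): Seq[String] =
    names.filter(_.length > max)
}
```

With the reporter's 1025-character header (`(0 until 1025).map(_ => "c").mkString("")`), `oversized` flags it, while any name of 1024 characters or fewer passes.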
[jira] [Assigned] (SPARK-33927) Fix Spark Release image
[ https://issues.apache.org/jira/browse/SPARK-33927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-33927:
------------------------------------

    Assignee: Hyukjin Kwon

> Fix Spark Release image
> -----------------------
>
> Key: SPARK-33927
> URL: https://issues.apache.org/jira/browse/SPARK-33927
> Project: Spark
> Issue Type: Bug
> Components: Project Infra
> Affects Versions: 3.1.0
> Reporter: Dongjoon Hyun
> Assignee: Hyukjin Kwon
> Priority: Blocker
>
> The release script seems to be broken. This is a blocker for the Apache Spark 3.1.0 release.
> {code}
> $ cd dev/create-release/spark-rm
> $ docker build -t spark-rm .
> ...
> exit code: 1
> {code}
[jira] [Resolved] (SPARK-33927) Fix Spark Release image
[ https://issues.apache.org/jira/browse/SPARK-33927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-33927.
----------------------------------
    Fix Version/s: 3.1.0
       Resolution: Fixed

Issue resolved by pull request 30971
[https://github.com/apache/spark/pull/30971]

> Fix Spark Release image
> -----------------------
>
> Key: SPARK-33927
> URL: https://issues.apache.org/jira/browse/SPARK-33927
> Project: Spark
> Issue Type: Bug
> Components: Project Infra
> Affects Versions: 3.1.0
> Reporter: Dongjoon Hyun
> Assignee: Hyukjin Kwon
> Priority: Blocker
> Fix For: 3.1.0
>
> The release script seems to be broken. This is a blocker for the Apache Spark 3.1.0 release.
> {code}
> $ cd dev/create-release/spark-rm
> $ docker build -t spark-rm .
> ...
> exit code: 1
> {code}
[jira] [Commented] (SPARK-33907) Only prune columns of from_json if parsing options is empty
[ https://issues.apache.org/jira/browse/SPARK-33907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17256336#comment-17256336 ]

Apache Spark commented on SPARK-33907:
--------------------------------------

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/30970

> Only prune columns of from_json if parsing options is empty
> -----------------------------------------------------------
>
> Key: SPARK-33907
> URL: https://issues.apache.org/jira/browse/SPARK-33907
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.1.0, 3.2.0
> Reporter: L. C. Hsieh
> Assignee: Apache Spark
> Priority: Major
> Fix For: 3.1.0
>
> For safety, we should only prune columns from the from_json expression if the parsing options are empty.
[jira] [Assigned] (SPARK-33927) Fix Spark Release image
[ https://issues.apache.org/jira/browse/SPARK-33927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-33927:
------------------------------------

    Assignee: (was: Apache Spark)

> Fix Spark Release image
> -----------------------
>
> Key: SPARK-33927
> URL: https://issues.apache.org/jira/browse/SPARK-33927
> Project: Spark
> Issue Type: Bug
> Components: Project Infra
> Affects Versions: 3.1.0
> Reporter: Dongjoon Hyun
> Priority: Blocker
>
> The release script seems to be broken. This is a blocker for the Apache Spark 3.1.0 release.
> {code}
> $ cd dev/create-release/spark-rm
> $ docker build -t spark-rm .
> ...
> exit code: 1
> {code}
[jira] [Assigned] (SPARK-33927) Fix Spark Release image
[ https://issues.apache.org/jira/browse/SPARK-33927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-33927:
------------------------------------

    Assignee: Apache Spark

> Fix Spark Release image
> -----------------------
>
> Key: SPARK-33927
> URL: https://issues.apache.org/jira/browse/SPARK-33927
> Project: Spark
> Issue Type: Bug
> Components: Project Infra
> Affects Versions: 3.1.0
> Reporter: Dongjoon Hyun
> Assignee: Apache Spark
> Priority: Blocker
>
> The release script seems to be broken. This is a blocker for the Apache Spark 3.1.0 release.
> {code}
> $ cd dev/create-release/spark-rm
> $ docker build -t spark-rm .
> ...
> exit code: 1
> {code}
[jira] [Commented] (SPARK-33927) Fix Spark Release image
[ https://issues.apache.org/jira/browse/SPARK-33927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17256335#comment-17256335 ]

Apache Spark commented on SPARK-33927:
--------------------------------------

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/30971

> Fix Spark Release image
> -----------------------
>
> Key: SPARK-33927
> URL: https://issues.apache.org/jira/browse/SPARK-33927
> Project: Spark
> Issue Type: Bug
> Components: Project Infra
> Affects Versions: 3.1.0
> Reporter: Dongjoon Hyun
> Priority: Blocker
>
> The release script seems to be broken. This is a blocker for the Apache Spark 3.1.0 release.
> {code}
> $ cd dev/create-release/spark-rm
> $ docker build -t spark-rm .
> ...
> exit code: 1
> {code}
[jira] [Resolved] (SPARK-33890) Improve the implement of trim/trimleft/trimright
[ https://issues.apache.org/jira/browse/SPARK-33890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-33890.
---------------------------------
    Fix Version/s: 3.2.0
       Resolution: Fixed

Issue resolved by pull request 30905
[https://github.com/apache/spark/pull/30905]

> Improve the implement of trim/trimleft/trimright
> ------------------------------------------------
>
> Key: SPARK-33890
> URL: https://issues.apache.org/jira/browse/SPARK-33890
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.2.0
> Reporter: jiaan.geng
> Assignee: jiaan.geng
> Priority: Major
> Fix For: 3.2.0
>
> The current implementations of trim/trimleft/trimright are somewhat redundant.
[jira] [Assigned] (SPARK-33890) Improve the implement of trim/trimleft/trimright
[ https://issues.apache.org/jira/browse/SPARK-33890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan reassigned SPARK-33890:
-----------------------------------

    Assignee: jiaan.geng

> Improve the implement of trim/trimleft/trimright
> ------------------------------------------------
>
> Key: SPARK-33890
> URL: https://issues.apache.org/jira/browse/SPARK-33890
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.2.0
> Reporter: jiaan.geng
> Assignee: jiaan.geng
> Priority: Major
>
> The current implementations of trim/trimleft/trimright are somewhat redundant.
[jira] [Assigned] (SPARK-32684) Add a test case for hive serde/default-serde mode's null value '\\N'
[ https://issues.apache.org/jira/browse/SPARK-32684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan reassigned SPARK-32684:
-----------------------------------

    Assignee: angerszhu

> Add a test case for hive serde/default-serde mode's null value '\\N'
> --------------------------------------------------------------------
>
> Key: SPARK-32684
> URL: https://issues.apache.org/jira/browse/SPARK-32684
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.2.0
> Reporter: angerszhu
> Assignee: angerszhu
> Priority: Minor
>
> Hive serde's default NULL value is '\N':
> {code:java}
> String nullString = tbl.getProperty(
>     serdeConstants.SERIALIZATION_NULL_FORMAT, "\\N");
> nullSequence = new Text(nullString);
> {code}
[jira] [Resolved] (SPARK-32684) Add a test case for hive serde/default-serde mode's null value '\\N'
[ https://issues.apache.org/jira/browse/SPARK-32684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-32684.
---------------------------------
    Fix Version/s: 3.2.0
       Resolution: Fixed

Issue resolved by pull request 30946
[https://github.com/apache/spark/pull/30946]

> Add a test case for hive serde/default-serde mode's null value '\\N'
> --------------------------------------------------------------------
>
> Key: SPARK-32684
> URL: https://issues.apache.org/jira/browse/SPARK-32684
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.2.0
> Reporter: angerszhu
> Assignee: angerszhu
> Priority: Minor
> Fix For: 3.2.0
>
> Hive serde's default NULL value is '\N':
> {code:java}
> String nullString = tbl.getProperty(
>     serdeConstants.SERIALIZATION_NULL_FORMAT, "\\N");
> nullSequence = new Text(nullString);
> {code}
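The quoted Hive snippet is a property lookup with a default. A plain-Scala sketch of the same pattern (the `NullFormat` object and `nullStringFor` helper are invented for illustration; the key string mirrors Hive's serdeConstants.SERIALIZATION_NULL_FORMAT):

```scala
// Sketch of Hive's default-null-format lookup in plain Scala.
// Hypothetical helper, not Hive or Spark code.
object NullFormat {
  // Assumed to match Hive's serdeConstants.SERIALIZATION_NULL_FORMAT.
  val SerializationNullFormat = "serialization.null.format"

  // Falls back to Hive's default null marker "\N" (written as "\\N" in source)
  // when the table properties do not override it.
  def nullStringFor(tableProps: Map[String, String]): String =
    tableProps.getOrElse(SerializationNullFormat, "\\N")
}
```

A table that sets the property gets its own marker; one that does not gets `\N`, which is the case the new test covers.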
[jira] [Resolved] (SPARK-33874) Spark may report PodRunning if there is a sidecar that has not exited
[ https://issues.apache.org/jira/browse/SPARK-33874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-33874.
----------------------------------
    Fix Version/s: 3.1.0
       Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/30892

> Spark may report PodRunning if there is a sidecar that has not exited
> ---------------------------------------------------------------------
>
> Key: SPARK-33874
> URL: https://issues.apache.org/jira/browse/SPARK-33874
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes
> Affects Versions: 3.0.2, 3.1.0, 3.2.0
> Reporter: Holden Karau
> Assignee: Holden Karau
> Priority: Major
> Fix For: 3.1.0
>
> This is a continuation of SPARK-30821, which handles the situation where Spark is still running but it may have sidecar containers that exited.
[jira] [Commented] (SPARK-33938) Optimize Like Any/All by LikeSimplification
[ https://issues.apache.org/jira/browse/SPARK-33938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17256285#comment-17256285 ]

jiaan.geng commented on SPARK-33938:
------------------------------------

I'm working on it.

> Optimize Like Any/All by LikeSimplification
> -------------------------------------------
>
> Key: SPARK-33938
> URL: https://issues.apache.org/jira/browse/SPARK-33938
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.2.0
> Reporter: Yuming Wang
> Priority: Major
>
> We should optimize Like Any/All by LikeSimplification to improve performance.
[jira] [Assigned] (SPARK-33932) Clean up KafkaOffsetReader API document
[ https://issues.apache.org/jira/browse/SPARK-33932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

L. C. Hsieh reassigned SPARK-33932:
-----------------------------------

    Assignee: L. C. Hsieh (was: Apache Spark)

> Clean up KafkaOffsetReader API document
> ---------------------------------------
>
> Key: SPARK-33932
> URL: https://issues.apache.org/jira/browse/SPARK-33932
> Project: Spark
> Issue Type: Improvement
> Components: Structured Streaming
> Affects Versions: 3.2.0
> Reporter: L. C. Hsieh
> Assignee: L. C. Hsieh
> Priority: Minor
> Fix For: 3.2.0
>
> The KafkaOffsetReader API docs are duplicated between KafkaOffsetReaderConsumer and KafkaOffsetReaderAdmin. It would be good to centralize them.
[jira] [Commented] (SPARK-31946) Failed to register SIGPWR handler on MacOS
[ https://issues.apache.org/jira/browse/SPARK-31946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17256249#comment-17256249 ] Apache Spark commented on SPARK-31946: -- User 'Ngone51' has created a pull request for this issue: https://github.com/apache/spark/pull/30968 > Failed to register SIGPWR handler on MacOS > -- > > Key: SPARK-31946 > URL: https://issues.apache.org/jira/browse/SPARK-31946 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.1.0 > Environment: macOS 10.14.6 >Reporter: wuyi >Priority: Major > > > {code:java} > 20/06/09 22:54:54 WARN SignalUtils: Failed to register SIGPWR handler - > disabling decommission feature. > java.lang.IllegalArgumentException: Unknown signal: PWR > at sun.misc.Signal.(Signal.java:143) > at > org.apache.spark.util.SignalUtils$.$anonfun$register$1(SignalUtils.scala:83) > at scala.collection.mutable.HashMap.getOrElseUpdate(HashMap.scala:86) > at org.apache.spark.util.SignalUtils$.register(SignalUtils.scala:81) > at > org.apache.spark.executor.CoarseGrainedExecutorBackend.onStart(CoarseGrainedExecutorBackend.scala:86) > at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:120) > at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203) > at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) > at > org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75) > at > org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > {code} > Seem like MacOS is *POSIX* compliant. 
But SIGPWR is not specified in the > *POSIX* specification. See [https://en.wikipedia.org/wiki/Signal_(IPC)#SIGPWR] > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
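The portability issue above can be checked directly from the JVM: constructing a `sun.misc.Signal` for a name the OS does not define throws exactly the `IllegalArgumentException` shown in the log. A minimal, hedged sketch of the defensive registration pattern (the helper name `registerOrWarn` is hypothetical, not Spark's `SignalUtils` API):

```java
import sun.misc.Signal;
import sun.misc.SignalHandler;

// Hypothetical helper (not Spark's SignalUtils): try to register a handler,
// and degrade to a warning when the platform does not define the signal,
// e.g. SIGPWR on macOS, which is absent because SIGPWR is not part of POSIX.
public class SignalDemo {
    static boolean registerOrWarn(String name, SignalHandler handler) {
        try {
            // new Signal(name) throws IllegalArgumentException for unknown signal names
            Signal.handle(new Signal(name), handler);
            return true;
        } catch (IllegalArgumentException e) {
            System.err.println("Failed to register SIG" + name + " handler: " + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) {
        // On Linux this succeeds; on macOS it only logs a warning and returns false.
        registerOrWarn("PWR", sig -> System.out.println("received " + sig));
    }
}
```

Whether "PWR" registers is platform-dependent, which is the crux of this ticket; only the registration attempt should fail, not executor startup.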
[jira] [Assigned] (SPARK-31946) Failed to register SIGPWR handler on MacOS
[ https://issues.apache.org/jira/browse/SPARK-31946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31946: Assignee: Apache Spark > Failed to register SIGPWR handler on MacOS > -- > > Key: SPARK-31946 > URL: https://issues.apache.org/jira/browse/SPARK-31946 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.1.0 > Environment: macOS 10.14.6 >Reporter: wuyi >Assignee: Apache Spark >Priority: Major > > > {code:java} > 20/06/09 22:54:54 WARN SignalUtils: Failed to register SIGPWR handler - > disabling decommission feature. > java.lang.IllegalArgumentException: Unknown signal: PWR > at sun.misc.Signal.<init>(Signal.java:143) > at > org.apache.spark.util.SignalUtils$.$anonfun$register$1(SignalUtils.scala:83) > at scala.collection.mutable.HashMap.getOrElseUpdate(HashMap.scala:86) > at org.apache.spark.util.SignalUtils$.register(SignalUtils.scala:81) > at > org.apache.spark.executor.CoarseGrainedExecutorBackend.onStart(CoarseGrainedExecutorBackend.scala:86) > at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:120) > at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203) > at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) > at > org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75) > at > org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > {code} > It seems like macOS is *POSIX* compliant. But SIGPWR is not specified in the > *POSIX* specification. 
See [https://en.wikipedia.org/wiki/Signal_(IPC)#SIGPWR]
[jira] [Assigned] (SPARK-31946) Failed to register SIGPWR handler on MacOS
[ https://issues.apache.org/jira/browse/SPARK-31946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31946: Assignee: (was: Apache Spark) > Failed to register SIGPWR handler on MacOS > -- > > Key: SPARK-31946 > URL: https://issues.apache.org/jira/browse/SPARK-31946 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.1.0 > Environment: macOS 10.14.6 >Reporter: wuyi >Priority: Major > > > {code:java} > 20/06/09 22:54:54 WARN SignalUtils: Failed to register SIGPWR handler - > disabling decommission feature. > java.lang.IllegalArgumentException: Unknown signal: PWR > at sun.misc.Signal.<init>(Signal.java:143) > at > org.apache.spark.util.SignalUtils$.$anonfun$register$1(SignalUtils.scala:83) > at scala.collection.mutable.HashMap.getOrElseUpdate(HashMap.scala:86) > at org.apache.spark.util.SignalUtils$.register(SignalUtils.scala:81) > at > org.apache.spark.executor.CoarseGrainedExecutorBackend.onStart(CoarseGrainedExecutorBackend.scala:86) > at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:120) > at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203) > at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) > at > org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75) > at > org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > {code} > It seems like macOS is *POSIX* compliant. But SIGPWR is not specified in the > *POSIX* specification. 
See [https://en.wikipedia.org/wiki/Signal_(IPC)#SIGPWR]
[jira] [Created] (SPARK-33939) Do not display the seed of uuid/shuffle if user not specify
ulysses you created SPARK-33939: --- Summary: Do not display the seed of uuid/shuffle if user not specify Key: SPARK-33939 URL: https://issues.apache.org/jira/browse/SPARK-33939 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: ulysses you Keep the display behavior the same as rand/randn.
[jira] [Created] (SPARK-33938) Optimize Like Any/All by LikeSimplification
Yuming Wang created SPARK-33938: --- Summary: Optimize Like Any/All by LikeSimplification Key: SPARK-33938 URL: https://issues.apache.org/jira/browse/SPARK-33938 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: Yuming Wang We should optimize Like Any/All by LikeSimplification to improve performance. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
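For context, here is a hedged toy model (not Spark source) of the idea behind LikeSimplification: LIKE patterns whose only wildcard is a leading and/or trailing `%` can be evaluated with cheap string operations (StartsWith/EndsWith/Contains/EqualTo) instead of a regex, and LIKE ANY is then just a disjunction of such per-pattern matches:

```java
// Toy sketch of the LikeSimplification idea. Assumes at most a leading
// and/or trailing '%' and no '_' wildcard; real patterns with interior
// wildcards are not handled by this simplification.
public class LikeSimplificationDemo {
    static boolean simpleLike(String value, String pattern) {
        boolean leading = pattern.startsWith("%");
        boolean trailing = pattern.endsWith("%");
        String body = pattern.replace("%", "");
        if (leading && trailing) return value.contains(body);   // '%abc%' -> Contains
        if (trailing) return value.startsWith(body);            // 'abc%'  -> StartsWith
        if (leading) return value.endsWith(body);               // '%abc'  -> EndsWith
        return value.equals(body);                              // 'abc'   -> EqualTo
    }

    // LIKE ANY: true if any pattern matches (LIKE ALL would require all to match).
    static boolean likeAny(String value, String... patterns) {
        for (String p : patterns) {
            if (simpleLike(value, p)) return true;
        }
        return false;
    }
}
```

Applying the same per-pattern rewrite inside Like Any/All is what this ticket proposes for performance.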
[jira] [Commented] (SPARK-33934) Support automatically identify the Python file and execute it
[ https://issues.apache.org/jira/browse/SPARK-33934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17256228#comment-17256228 ] angerszhu commented on SPARK-33934: --- raise a pr soon > Support automatically identify the Python file and execute it > - > > Key: SPARK-33934 > URL: https://issues.apache.org/jira/browse/SPARK-33934 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Priority: Major > > In Hive script transform, we can use `USING xxx.py`, but in Spark we get an > error > {code:java} > Job aborted due to stage failure: Task 17 in stage 530.0 failed 4 times, most > recent failure: Lost task 17.3 in stage 530.0 (TID 38639, host, executor > 339): org.apache.spark.SparkException: Subprocess exited with status 127. > Error: /bin/bash: xxx.py: can't find the command > {code}
[jira] [Resolved] (SPARK-33932) Clean up KafkaOffsetReader API document
[ https://issues.apache.org/jira/browse/SPARK-33932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-33932. -- Fix Version/s: 3.2.0 Resolution: Fixed Fixed in https://github.com/apache/spark/pull/30961 > Clean up KafkaOffsetReader API document > --- > > Key: SPARK-33932 > URL: https://issues.apache.org/jira/browse/SPARK-33932 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: L. C. Hsieh >Assignee: Apache Spark >Priority: Minor > Fix For: 3.2.0 > > > KafkaOffsetReader API documents are duplicated among > KafkaOffsetReaderConsumer and KafkaOffsetReaderAdmin. It seems to be good if > the doc is centralized. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33937) Move the old partition data to trash instead of deleting it when inserting rewrite hive table
[ https://issues.apache.org/jira/browse/SPARK-33937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17256196#comment-17256196 ] 黄海升 commented on SPARK-33937: - I'm sorry. It looks like it is. > Move the old partition data to trash instead of deleting it when inserting > rewrite hive table > - > > Key: SPARK-33937 > URL: https://issues.apache.org/jira/browse/SPARK-33937 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: 黄海升 >Priority: Minor > Fix For: 3.2.0 > > > InsertIntoHiveTable should move the old partition data to trash instead of > deleting it. > Because that's what we do in hive. > [https://github.com/apache/hive/blob/9c6f8b76123c88b0c8a98645874722ba80b3c2b0/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java] > `deleteOldPathForReplace` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33937) Move the old partition data to trash instead of deleting it when inserting rewrite hive table
[ https://issues.apache.org/jira/browse/SPARK-33937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] 黄海升 resolved SPARK-33937. - Resolution: Duplicate > Move the old partition data to trash instead of deleting it when inserting > rewrite hive table > - > > Key: SPARK-33937 > URL: https://issues.apache.org/jira/browse/SPARK-33937 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: 黄海升 >Priority: Minor > Fix For: 3.2.0 > > > InsertIntoHiveTable should move the old partition data to trash instead of > deleting it. > Because that's what we do in hive. > [https://github.com/apache/hive/blob/9c6f8b76123c88b0c8a98645874722ba80b3c2b0/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java] > `deleteOldPathForReplace` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33937) Move the old partition data to trash instead of deleting it when inserting rewrite hive table
[ https://issues.apache.org/jira/browse/SPARK-33937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17256192#comment-17256192 ] Chao Sun commented on SPARK-33937: -- This looks like a duplicate of SPARK-32480. > Move the old partition data to trash instead of deleting it when inserting > rewrite hive table > - > > Key: SPARK-33937 > URL: https://issues.apache.org/jira/browse/SPARK-33937 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: 黄海升 >Priority: Minor > Fix For: 3.2.0 > > > InsertIntoHiveTable should move the old partition data to trash instead of > deleting it. > Because that's what we do in hive. > [https://github.com/apache/hive/blob/9c6f8b76123c88b0c8a98645874722ba80b3c2b0/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java] > `deleteOldPathForReplace` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33937) Move the old partition data to trash instead of deleting it when inserting rewrite hive table
黄海升 created SPARK-33937: --- Summary: Move the old partition data to trash instead of deleting it when inserting rewrite hive table Key: SPARK-33937 URL: https://issues.apache.org/jira/browse/SPARK-33937 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: 黄海升 Fix For: 3.2.0 InsertIntoHiveTable should move the old partition data to trash instead of deleting it. Because that's what we do in hive. [https://github.com/apache/hive/blob/9c6f8b76123c88b0c8a98645874722ba80b3c2b0/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java] `deleteOldPathForReplace` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-33635) Performance regression in Kafka read
[ https://issues.apache.org/jira/browse/SPARK-33635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17256046#comment-17256046 ] David Wyles edited comment on SPARK-33635 at 12/29/20, 8:48 PM: [~gsomogyi] I now have my results. I was so unhappy about these results that I ran all the tests again. The only thing that changed between runs was the version of Spark on the cluster; everything else was static - the data input from Kafka was an unchanging, static set of data. Input -> *672733262* rows. +*Spark 2.4.5*:+ *440* seconds - *1,528,939* rows per second. +*Spark 3.0.1*:+ *990* seconds - *679,528* rows per second. These are multiple runs (I even took the best from Spark 3.0.1). I also captured the event logs for both versions of Spark, should anyone find them useful: [event logs|https://drive.google.com/drive/folders/1aElmzVWmJqRALQimdOYxdJu559_3EX_9?usp=sharing] So, no matter what I do, I can only conclude that Spark 2.4.5 was a lot faster in this test case. Is Spark SQL reading the source data twice, just as it would if there were an "order by" in the query? Sample code used:
val spark = SparkSession.builder.appName("Kafka Read Performance")
  .config("spark.executor.memory", "16g")
  .config("spark.cores.max", "10")
  .config("spark.eventLog.enabled", "true")
  .config("spark.eventLog.dir", "file:///tmp/spark-events")
  .config("spark.eventLog.overwrite", "true")
  .getOrCreate()
import spark.implicits._
val startTime = System.nanoTime()
val df = spark
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", config.brokers)
  .option("subscribe", config.inTopic)
  .option("startingOffsets", "earliest")
  .option("endingOffsets", "latest")
  .option("failOnDataLoss", "false")
  .load()
df
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", config.brokers)
  .option("topic", config.outTopic)
  .mode(SaveMode.Append)
  .save()
val endTime = System.nanoTime()
val elapsedSecs = (endTime - startTime) / 1E9 // static input sample was used, fixed row count.
println(s"Took $elapsedSecs secs")
spark.stop()
> Performance regression in Kafka read > > > Key: SPARK-33635 > URL: https://issues.apache.org/jira/browse/SPARK-33635 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1 > Environment: A simple 5 node system. A simple data row of csv data in > kafka, evenly distributed between the partitions. > Open JDK 1.8.0.252 > Spark in stand alone - 5 nodes, 10 workers (2 worker per node, each locked to > a distinct NUMA group) > kafka (v 2.3.1) cluster - 5 nodes (1 broker per node). > Centos 7.7.1908 > 1 topic, 10 partitions, 1 hour queue life > (this is just one of clusters we have, I have tested on all of them and > theyall
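As a sanity check, the throughput figures quoted in the comment above follow directly from the reported row count and elapsed times (integer rows per second):

```java
// Quick arithmetic check (illustrative only) of the reported Kafka read
// throughput: 672,733,262 rows processed in 440 s (Spark 2.4.5) vs 990 s
// (Spark 3.0.1).
public class ThroughputCheck {
    static long rowsPerSecond(long rows, long seconds) {
        return rows / seconds; // integer division matches the quoted figures
    }

    public static void main(String[] args) {
        long rows = 672_733_262L;
        System.out.println(rowsPerSecond(rows, 440)); // Spark 2.4.5: 1528939 rows/s
        System.out.println(rowsPerSecond(rows, 990)); // Spark 3.0.1: 679528 rows/s
    }
}
```

That is roughly a 2.25x slowdown (990 s / 440 s) on identical input, which is what the reporter is flagging.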
[jira] [Updated] (SPARK-33936) Add the version when connector methods and interfaces were added
[ https://issues.apache.org/jira/browse/SPARK-33936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-33936: -- Fix Version/s: (was: 3.2.0) 3.1.0 > Add the version when connector methods and interfaces were added > > > Key: SPARK-33936 > URL: https://issues.apache.org/jira/browse/SPARK-33936 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0, 3.2.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 3.1.0 > > > Add the *since 3.1.0 /3.2.0* tags to new methods and interfaces in the > *connector* package. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33936) Add the version when connector methods and interfaces were added
[ https://issues.apache.org/jira/browse/SPARK-33936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-33936: -- Fix Version/s: 3.2.0 > Add the version when connector methods and interfaces were added > > > Key: SPARK-33936 > URL: https://issues.apache.org/jira/browse/SPARK-33936 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0, 3.2.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 3.1.0, 3.2.0 > > > Add the *since 3.1.0 /3.2.0* tags to new methods and interfaces in the > *connector* package. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33936) Add the version when connector methods and interfaces were added
[ https://issues.apache.org/jira/browse/SPARK-33936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-33936. --- Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 30966 [https://github.com/apache/spark/pull/30966] > Add the version when connector methods and interfaces were added > > > Key: SPARK-33936 > URL: https://issues.apache.org/jira/browse/SPARK-33936 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0, 3.2.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 3.2.0 > > > Add the *since 3.1.0 /3.2.0* tags to new methods and interfaces in the > *connector* package. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33936) Add the version when connector methods and interfaces were added
[ https://issues.apache.org/jira/browse/SPARK-33936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-33936: - Assignee: Maxim Gekk > Add the version when connector methods and interfaces were added > > > Key: SPARK-33936 > URL: https://issues.apache.org/jira/browse/SPARK-33936 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0, 3.2.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > > Add the *since 3.1.0 /3.2.0* tags to new methods and interfaces in the > *connector* package. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33936) Add the version when connector methods and interfaces were added
[ https://issues.apache.org/jira/browse/SPARK-33936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17256116#comment-17256116 ] Apache Spark commented on SPARK-33936: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/30967 > Add the version when connector methods and interfaces were added > > > Key: SPARK-33936 > URL: https://issues.apache.org/jira/browse/SPARK-33936 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0, 3.2.0 >Reporter: Maxim Gekk >Priority: Minor > > Add the *since 3.1.0 /3.2.0* tags to new methods and interfaces in the > *connector* package. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-33915) Allow json expression to be pushable column
[ https://issues.apache.org/jira/browse/SPARK-33915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17255389#comment-17255389 ] Ted Yu edited comment on SPARK-33915 at 12/29/20, 6:51 PM: --- Here is the plan prior to predicate pushdown: {code} 2020-12-26 03:28:59,926 (Time-limited test) [DEBUG - org.apache.spark.internal.Logging.logDebug(Logging.scala:61)] Adaptive execution enabled for plan: Sort [id#34 ASC NULLS FIRST], true, 0 +- Project [id#34, address#35, phone#37, get_json_object(phone#37, $.code) AS phone#33] +- Filter (get_json_object(phone#37, $.phone) = 1200) +- BatchScan[id#34, address#35, phone#37] Cassandra Scan: test.person - Cassandra Filters: [] - Requested Columns: [id,address,phone] {code} Here is the plan with pushdown: {code} 2020-12-28 01:40:08,150 (Time-limited test) [DEBUG - org.apache.spark.internal.Logging.logDebug(Logging.scala:61)] Adaptive execution enabled for plan: Sort [id#34 ASC NULLS FIRST], true, 0 +- Project [id#34, address#35, phone#37, get_json_object(phone#37, $.code) AS phone#33] +- BatchScan[id#34, address#35, phone#37] Cassandra Scan: test.person - Cassandra Filters: [[phone->'phone' = ?, 1200]] - Requested Columns: [id,address,phone] {code} was (Author: yuzhih...@gmail.com): Here is the plan prior to predicate pushdown: {code} 2020-12-26 03:28:59,926 (Time-limited test) [DEBUG - org.apache.spark.internal.Logging.logDebug(Logging.scala:61)] Adaptive execution enabled for plan: Sort [id#34 ASC NULLS FIRST], true, 0 +- Project [id#34, address#35, phone#37, get_json_object(phone#37, $.code) AS phone#33] +- Filter (get_json_object(phone#37, $.phone) = 1200) +- BatchScan[id#34, address#35, phone#37] Cassandra Scan: test.person - Cassandra Filters: [] - Requested Columns: [id,address,phone] {code} Here is the plan with pushdown: {code} 2020-12-28 01:40:08,150 (Time-limited test) [DEBUG - org.apache.spark.internal.Logging.logDebug(Logging.scala:61)] Adaptive execution enabled for plan: Sort [id#34 ASC NULLS 
FIRST], true, 0 +- Project [id#34, address#35, phone#37, get_json_object(phone#37, $.code) AS phone#33] +- BatchScan[id#34, address#35, phone#37] Cassandra Scan: test.person - Cassandra Filters: [["`GetJsonObject(phone#37,$.phone)`" = ?, 1200]] - Requested Columns: [id,address,phone] {code} > Allow json expression to be pushable column > --- > > Key: SPARK-33915 > URL: https://issues.apache.org/jira/browse/SPARK-33915 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Ted Yu >Priority: Major > > Currently PushableColumnBase provides no support for json / jsonb expression. > Example of json expression: > {code} > get_json_object(phone, '$.code') = '1200' > {code} > If non-string literal is part of the expression, the presence of cast() would > complicate the situation. > Implication is that implementation of SupportsPushDownFilters doesn't have a > chance to perform pushdown even if third party DB engine supports json > expression pushdown. > This issue is for discussion and implementation of Spark core changes which > would allow json expression to be recognized as pushable column. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
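To make the request concrete, here is a toy model (hypothetical names; not Spark's `PushableColumnBase`) of the distinction this ticket is about: today only plain attribute references are treated as pushable columns, while the proposal would also let a JSON-extraction expression such as `get_json_object(phone, '$.code')` reach `SupportsPushDownFilters` implementations:

```java
import java.util.regex.Pattern;

// Illustrative sketch only. Models expressions as strings: a plain column
// reference is pushable today, and the proposed relaxation would also treat
// a get_json_object(col, '$.path') call as a pushable "column".
public class PushableDemo {
    private static final Pattern COLUMN = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");
    private static final Pattern JSON_EXTRACT =
        Pattern.compile("get_json_object\\(\\s*[A-Za-z_][A-Za-z0-9_]*\\s*,\\s*'\\$[^']*'\\s*\\)");

    // Current behavior: only a bare attribute reference counts as pushable.
    static boolean pushableToday(String expr) {
        return COLUMN.matcher(expr).matches();
    }

    // Proposed relaxation: also accept a JSON-extraction expression, so the
    // data source (e.g. Cassandra) can translate it into a native filter.
    static boolean pushableProposed(String expr) {
        return pushableToday(expr) || JSON_EXTRACT.matcher(expr).matches();
    }
}
```

This mirrors the two plans shown above: without the relaxation the `get_json_object` predicate stays as a Spark-side Filter; with it, the predicate reaches the Cassandra scan.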
[jira] [Assigned] (SPARK-33936) Add the version when connector methods and interfaces were added
[ https://issues.apache.org/jira/browse/SPARK-33936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33936: Assignee: (was: Apache Spark) > Add the version when connector methods and interfaces were added > > > Key: SPARK-33936 > URL: https://issues.apache.org/jira/browse/SPARK-33936 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0, 3.2.0 >Reporter: Maxim Gekk >Priority: Minor > > Add the *since 3.1.0 /3.2.0* tags to new methods and interfaces in the > *connector* package. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33936) Add the version when connector methods and interfaces were added
[ https://issues.apache.org/jira/browse/SPARK-33936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33936: Assignee: Apache Spark > Add the version when connector methods and interfaces were added > > > Key: SPARK-33936 > URL: https://issues.apache.org/jira/browse/SPARK-33936 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0, 3.2.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Minor > > Add the *since 3.1.0 /3.2.0* tags to new methods and interfaces in the > *connector* package. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33936) Add the version when connector methods and interfaces were added
[ https://issues.apache.org/jira/browse/SPARK-33936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17256107#comment-17256107 ] Apache Spark commented on SPARK-33936: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/30966 > Add the version when connector methods and interfaces were added > > > Key: SPARK-33936 > URL: https://issues.apache.org/jira/browse/SPARK-33936 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0, 3.2.0 >Reporter: Maxim Gekk >Priority: Minor > > Add the *since 3.1.0 /3.2.0* tags to new methods and interfaces in the > *connector* package. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33936) Add the version when connector methods and interfaces were added
Maxim Gekk created SPARK-33936: -- Summary: Add the version when connector methods and interfaces were added Key: SPARK-33936 URL: https://issues.apache.org/jira/browse/SPARK-33936 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0, 3.2.0 Reporter: Maxim Gekk Add the *since 3.1.0 /3.2.0* tags to new methods and interfaces in the *connector* package. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33935) Fix CBOs cost function
[ https://issues.apache.org/jira/browse/SPARK-33935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17256103#comment-17256103 ] Apache Spark commented on SPARK-33935: -- User 'tanelk' has created a pull request for this issue: https://github.com/apache/spark/pull/30965 > Fix CBOs cost function > --- > > Key: SPARK-33935 > URL: https://issues.apache.org/jira/browse/SPARK-33935 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Tanel Kiis >Priority: Major > > The parameter spark.sql.cbo.joinReorder.card.weight is documented as: > {code:title=spark.sql.cbo.joinReorder.card.weight} > The weight of cardinality (number of rows) for plan cost comparison in join > reorder: rows * weight + size * (1 - weight). > {code} > But in the implementation the formula is a bit different: > {code:title=Current implementation} > def betterThan(other: JoinPlan, conf: SQLConf): Boolean = { > if (other.planCost.card == 0 || other.planCost.size == 0) { > false > } else { > val relativeRows = BigDecimal(this.planCost.card) / > BigDecimal(other.planCost.card) > val relativeSize = BigDecimal(this.planCost.size) / > BigDecimal(other.planCost.size) > relativeRows * conf.joinReorderCardWeight + > relativeSize * (1 - conf.joinReorderCardWeight) < 1 > } > } > {code} > This change has an unfortunate consequence: > given two plans A and B, both A betterThan B and B betterThan A might give > the same results. This happens when one plan has many rows with small sizes and the > other has few rows with large sizes. > Example values that exhibit this phenomenon with the default weight value (0.7): > A.card = 500, B.card = 300 > A.size = 30, B.size = 80 > Both A betterThan B and B betterThan A would have a score above 1 and would > return false. 
> A new implementation is proposed that matches the documentation: > {code:title=Proposed implementation} > def betterThan(other: JoinPlan, conf: SQLConf): Boolean = { > val oldCost = BigDecimal(this.planCost.card) * > conf.joinReorderCardWeight + > BigDecimal(this.planCost.size) * (1 - conf.joinReorderCardWeight) > val newCost = BigDecimal(other.planCost.card) * > conf.joinReorderCardWeight + > BigDecimal(other.planCost.size) * (1 - conf.joinReorderCardWeight) > newCost < oldCost > } > {code}
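The asymmetry described in this ticket is easy to reproduce numerically. A standalone sketch of the current `betterThan` formula (plain doubles instead of Spark's BigDecimal) with the example values from the report:

```java
// Toy reproduction (not Spark source) of the current cost comparison:
// relativeRows * weight + relativeSize * (1 - weight) < 1.
public class CboCostDemo {
    static boolean betterThan(long cardA, long sizeA, long cardB, long sizeB, double weight) {
        if (cardB == 0 || sizeB == 0) return false;
        double relativeRows = (double) cardA / cardB;
        double relativeSize = (double) sizeA / sizeB;
        return relativeRows * weight + relativeSize * (1 - weight) < 1;
    }

    public static void main(String[] args) {
        // A.card = 500, A.size = 30; B.card = 300, B.size = 80; weight = 0.7.
        // A vs B: (500/300)*0.7 + (30/80)*0.3 ~= 1.28 -> not better.
        // B vs A: (300/500)*0.7 + (80/30)*0.3 ~= 1.22 -> not better either.
        System.out.println(betterThan(500, 30, 300, 80, 0.7)); // false
        System.out.println(betterThan(300, 80, 500, 30, 0.7)); // false
    }
}
```

Both calls return false, confirming that the comparison is not a total order for these plans, which is exactly the anomaly the report describes.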
[jira] [Commented] (SPARK-33935) Fix CBOs cost function
[ https://issues.apache.org/jira/browse/SPARK-33935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17256101#comment-17256101 ] Apache Spark commented on SPARK-33935: -- User 'tanelk' has created a pull request for this issue: https://github.com/apache/spark/pull/30965 > Fix CBOs cost function > --- > > Key: SPARK-33935 > URL: https://issues.apache.org/jira/browse/SPARK-33935 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Tanel Kiis >Priority: Major > > The parameter spark.sql.cbo.joinReorder.card.weight is documented as: > {code:title=spark.sql.cbo.joinReorder.card.weight} > The weight of cardinality (number of rows) for plan cost comparison in join > reorder: rows * weight + size * (1 - weight). > {code} > But in the implementation the formula is a bit different: > {code:title=Current implementation} > def betterThan(other: JoinPlan, conf: SQLConf): Boolean = { > if (other.planCost.card == 0 || other.planCost.size == 0) { > false > } else { > val relativeRows = BigDecimal(this.planCost.card) / > BigDecimal(other.planCost.card) > val relativeSize = BigDecimal(this.planCost.size) / > BigDecimal(other.planCost.size) > relativeRows * conf.joinReorderCardWeight + > relativeSize * (1 - conf.joinReorderCardWeight) < 1 > } > } > {code} > This change has an unfortunate consequence: > given two plans A and B, both A betterThan B and B betterThan A might give > the same results. This happens when one plan has many rows with small sizes and the > other has few rows with large sizes. > Example values that exhibit this phenomenon with the default weight value (0.7): > A.card = 500, B.card = 300 > A.size = 30, B.size = 80 > Both A betterThan B and B betterThan A would have a score above 1 and would > return false. 
> A new implementation is proposed, that matches the documentation: > {code:title=Proposed implementation} > def betterThan(other: JoinPlan, conf: SQLConf): Boolean = { > val oldCost = BigDecimal(this.planCost.card) * > conf.joinReorderCardWeight + > BigDecimal(this.planCost.size) * (1 - conf.joinReorderCardWeight) > val newCost = BigDecimal(other.planCost.card) * > conf.joinReorderCardWeight + > BigDecimal(other.planCost.size) * (1 - conf.joinReorderCardWeight) > newCost < oldCost > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
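The asymmetry the report describes is plain arithmetic and can be checked outside Spark. The sketch below is an illustration only - Python stand-ins for the two formulas quoted above, not the actual JoinPlan code - plugged with the example values A.card = 500, A.size = 30, B.card = 300, B.size = 80:

```python
# Stand-ins for the two cost comparisons quoted in the issue (not Spark code).

def better_than_current(card, size, other_card, other_size, w=0.7):
    # Current implementation: weighted *relative* cost compared against 1.
    if other_card == 0 or other_size == 0:
        return False
    return (card / other_card) * w + (size / other_size) * (1 - w) < 1

def better_than_proposed(card, size, other_card, other_size, w=0.7):
    # Proposed implementation: weighted *absolute* costs compared directly.
    return card * w + size * (1 - w) < other_card * w + other_size * (1 - w)

a, b = (500, 30), (300, 80)  # (card, size) for plans A and B
# Current formula: both directions score above 1, so neither plan "wins".
print(better_than_current(*a, *b), better_than_current(*b, *a))  # False False
# Proposed formula: absolute costs are 359 vs 234, so exactly one wins.
print(better_than_proposed(*a, *b), better_than_proposed(*b, *a))  # False True
```

With the current formula neither comparison succeeds, so the reorder outcome depends on which plan happens to be compared first; the proposed absolute-cost formula gives a total order, matching the documented `rows * weight + size * (1 - weight)`.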
[jira] [Assigned] (SPARK-33935) Fix CBOs cost function
[ https://issues.apache.org/jira/browse/SPARK-33935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33935: Assignee: (was: Apache Spark) > Fix CBOs cost function > --- > > Key: SPARK-33935 > URL: https://issues.apache.org/jira/browse/SPARK-33935 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Tanel Kiis >Priority: Major > > The parameter spark.sql.cbo.joinReorder.card.weight is documented as: > {code:title=spark.sql.cbo.joinReorder.card.weight} > The weight of cardinality (number of rows) for plan cost comparison in join > reorder: rows * weight + size * (1 - weight). > {code} > But in the implementation the formula is a bit different: > {code:title=Current implementation} > def betterThan(other: JoinPlan, conf: SQLConf): Boolean = { > if (other.planCost.card == 0 || other.planCost.size == 0) { > false > } else { > val relativeRows = BigDecimal(this.planCost.card) / > BigDecimal(other.planCost.card) > val relativeSize = BigDecimal(this.planCost.size) / > BigDecimal(other.planCost.size) > relativeRows * conf.joinReorderCardWeight + > relativeSize * (1 - conf.joinReorderCardWeight) < 1 > } > } > {code} > This change has an unfortunate consequence: > given two plans A and B, both A betterThan B and B betterThan A might give > the same result. This happens when one plan has many rows of small size and > the other has few rows of large size. > Example values that exhibit this phenomenon with the default weight value (0.7): > A.card = 500, B.card = 300 > A.size = 30, B.size = 80 > Both A betterThan B and B betterThan A would have a score above 1 and would > return false. 
> A new implementation is proposed, that matches the documentation: > {code:title=Proposed implementation} > def betterThan(other: JoinPlan, conf: SQLConf): Boolean = { > val oldCost = BigDecimal(this.planCost.card) * > conf.joinReorderCardWeight + > BigDecimal(this.planCost.size) * (1 - conf.joinReorderCardWeight) > val newCost = BigDecimal(other.planCost.card) * > conf.joinReorderCardWeight + > BigDecimal(other.planCost.size) * (1 - conf.joinReorderCardWeight) > newCost < oldCost > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33935) Fix CBOs cost function
[ https://issues.apache.org/jira/browse/SPARK-33935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33935: Assignee: Apache Spark > Fix CBOs cost function > --- > > Key: SPARK-33935 > URL: https://issues.apache.org/jira/browse/SPARK-33935 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Tanel Kiis >Assignee: Apache Spark >Priority: Major > > The parameter spark.sql.cbo.joinReorder.card.weight is documented as: > {code:title=spark.sql.cbo.joinReorder.card.weight} > The weight of cardinality (number of rows) for plan cost comparison in join > reorder: rows * weight + size * (1 - weight). > {code} > But in the implementation the formula is a bit different: > {code:title=Current implementation} > def betterThan(other: JoinPlan, conf: SQLConf): Boolean = { > if (other.planCost.card == 0 || other.planCost.size == 0) { > false > } else { > val relativeRows = BigDecimal(this.planCost.card) / > BigDecimal(other.planCost.card) > val relativeSize = BigDecimal(this.planCost.size) / > BigDecimal(other.planCost.size) > relativeRows * conf.joinReorderCardWeight + > relativeSize * (1 - conf.joinReorderCardWeight) < 1 > } > } > {code} > This change has an unfortunate consequence: > given two plans A and B, both A betterThan B and B betterThan A might give > the same result. This happens when one plan has many rows of small size and > the other has few rows of large size. > Example values that exhibit this phenomenon with the default weight value (0.7): > A.card = 500, B.card = 300 > A.size = 30, B.size = 80 > Both A betterThan B and B betterThan A would have a score above 1 and would > return false. 
> A new implementation is proposed, that matches the documentation: > {code:title=Proposed implementation} > def betterThan(other: JoinPlan, conf: SQLConf): Boolean = { > val oldCost = BigDecimal(this.planCost.card) * > conf.joinReorderCardWeight + > BigDecimal(this.planCost.size) * (1 - conf.joinReorderCardWeight) > val newCost = BigDecimal(other.planCost.card) * > conf.joinReorderCardWeight + > BigDecimal(other.planCost.size) * (1 - conf.joinReorderCardWeight) > newCost < oldCost > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33935) Fix CBOs cost function
[ https://issues.apache.org/jira/browse/SPARK-33935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tanel Kiis updated SPARK-33935: --- Issue Type: Bug (was: Improvement) > Fix CBOs cost function > --- > > Key: SPARK-33935 > URL: https://issues.apache.org/jira/browse/SPARK-33935 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Tanel Kiis >Priority: Major > > The parameter spark.sql.cbo.joinReorder.card.weight is documented as: > {code:title=spark.sql.cbo.joinReorder.card.weight} > The weight of cardinality (number of rows) for plan cost comparison in join > reorder: rows * weight + size * (1 - weight). > {code} > But in the implementation the formula is a bit different: > {code:title=Current implementation} > def betterThan(other: JoinPlan, conf: SQLConf): Boolean = { > if (other.planCost.card == 0 || other.planCost.size == 0) { > false } else { > val relativeRows = BigDecimal(this.planCost.card) / > BigDecimal(other.planCost.card) > val relativeSize = BigDecimal(this.planCost.size) / > BigDecimal(other.planCost.size) > relativeRows * conf.joinReorderCardWeight + > relativeSize * (1 - conf.joinReorderCardWeight) < 1 > } > } > {code} > This change has an unfortunate consequence: > given two plans A and B, both A betterThan B and B betterThan A might give > the same result. This happens when one plan has many rows of small size and > the other has few rows of large size. > Example values that exhibit this phenomenon with the default weight value (0.7): > A.card = 500, B.card = 300 > A.size = 30, B.size = 80 > Both A betterThan B and B betterThan A would have a score above 1 and would > return false. 
> A new implementation is proposed, that matches the documentation: > {code:title=Proposed implementation} > def betterThan(other: JoinPlan, conf: SQLConf): Boolean = { > val oldCost = BigDecimal(this.planCost.card) * > conf.joinReorderCardWeight + > BigDecimal(this.planCost.size) * (1 - conf.joinReorderCardWeight) > val newCost = BigDecimal(other.planCost.card) * > conf.joinReorderCardWeight + > BigDecimal(other.planCost.size) * (1 - conf.joinReorderCardWeight) > newCost < oldCost > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33935) Fix CBOs cost function
Tanel Kiis created SPARK-33935: -- Summary: Fix CBOs cost function Key: SPARK-33935 URL: https://issues.apache.org/jira/browse/SPARK-33935 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: Tanel Kiis The parameter spark.sql.cbo.joinReorder.card.weight is documented as: {code:title=spark.sql.cbo.joinReorder.card.weight} The weight of cardinality (number of rows) for plan cost comparison in join reorder: rows * weight + size * (1 - weight). {code} But in the implementation the formula is a bit different: {code:title=Current implementation} def betterThan(other: JoinPlan, conf: SQLConf): Boolean = { if (other.planCost.card == 0 || other.planCost.size == 0) { false } else { val relativeRows = BigDecimal(this.planCost.card) / BigDecimal(other.planCost.card) val relativeSize = BigDecimal(this.planCost.size) / BigDecimal(other.planCost.size) relativeRows * conf.joinReorderCardWeight + relativeSize * (1 - conf.joinReorderCardWeight) < 1 } } {code} This change has an unfortunate consequence: given two plans A and B, both A betterThan B and B betterThan A might give the same result. This happens when one plan has many rows of small size and the other has few rows of large size. Example values that exhibit this phenomenon with the default weight value (0.7): A.card = 500, B.card = 300 A.size = 30, B.size = 80 Both A betterThan B and B betterThan A would have a score above 1 and would return false. 
A new implementation is proposed, that matches the documentation: {code:title=Proposed implementation} def betterThan(other: JoinPlan, conf: SQLConf): Boolean = { val oldCost = BigDecimal(this.planCost.card) * conf.joinReorderCardWeight + BigDecimal(this.planCost.size) * (1 - conf.joinReorderCardWeight) val newCost = BigDecimal(other.planCost.card) * conf.joinReorderCardWeight + BigDecimal(other.planCost.size) * (1 - conf.joinReorderCardWeight) newCost < oldCost } {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-33635) Performance regression in Kafka read
[ https://issues.apache.org/jira/browse/SPARK-33635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17256046#comment-17256046 ] David Wyles edited comment on SPARK-33635 at 12/29/20, 4:54 PM: [~gsomogyi] I now have my results. I was so unhappy about these results that I ran all the tests again; the only thing that changed between them was the version of Spark running on the cluster - everything else was static, and the data input from Kafka was an unchanging set of data. Input-> *672733262* rows +*Spark 2.4.5*:+ *440* seconds - *1,528,939* rows per second. +*Spark 3.0.1*:+ *990* seconds - *679,528* rows per second. These are multiple runs (I even took the best from Spark 3.0.1). I also captured the event logs between these two versions of Spark - should anyone find them useful. [event logs|https://drive.google.com/drive/folders/1aElmzVWmJqRALQimdOYxdJu559_3EX_9?usp=sharing] So, no matter what I do, I can only conclude that Spark 2.4.5 was a lot faster in this test case. Is Spark SQL reading the source data twice, just as it would if there were an "order by" in the query? Sample code used: val spark = SparkSession.builder.appName("Kafka Read Performance") .config("spark.executor.memory","16g") .config("spark.cores.max", "10") .config("spark.eventLog.enabled","true") .config("spark.eventLog.dir","file:///tmp/spark-events") .config("spark.eventLog.overwrite","true") .getOrCreate() import spark.implicits._ val *startTime* = System.nanoTime() val df = spark .read .format("kafka") .option("kafka.bootstrap.servers", config.brokers) .option("subscribe", config.inTopic) .option("startingOffsets", "earliest") .option("endingOffsets", "latest") .option("failOnDataLoss","false") .load() df .write .format("kafka") .option("kafka.bootstrap.servers", config.brokers) .option("topic", config.outTopic) .mode(SaveMode.Append) .save() val *endTime* = System.nanoTime() val elapsedSecs = (endTime - startTime) / 1E9 // static input sample was used, fixed row count. 
println(s"Took $elapsedSecs secs") spark.stop() was (Author: david.wyles): [~gsomogyi] I now have my results. I was so unhappy about these results that I ran all the tests again; the only thing that changed between them was the version of Spark running on the cluster - everything else was static, and the data input from Kafka was an unchanging set of data. Input-> *672733262* rows +*Spark 2.4.5*:+ *440* seconds - *1,528,939* rows per second. +*Spark 3.0.1*:+ *990* seconds - *679,528* rows per second. These are multiple runs (I even took the best from Spark 3.0.1). I also captured the event logs between these two versions of Spark - should anyone find them useful. [event logs|https://drive.google.com/drive/folders/1aElmzVWmJqRALQimdOYxdJu559_3EX_9?usp=sharing] So, no matter what I do, I can only conclude that Spark 2.4.5 was a lot faster in this test case (in my production use case I'm just writing to parquet files in hdfs - which is where I noticed the degradation in performance). Is Spark SQL reading the source data twice, just as it would if there were an "order by" in the query? Sample code used: val spark = SparkSession.builder.appName("Kafka Read Performance") .config("spark.executor.memory","16g") .config("spark.cores.max", "10") .config("spark.eventLog.enabled","true") .config("spark.eventLog.dir","file:///tmp/spark-events") .config("spark.eventLog.overwrite","true") .getOrCreate() import spark.implicits._ val *startTime* = System.nanoTime() val df = spark .read .format("kafka") .option("kafka.bootstrap.servers", config.brokers) .option("subscribe", config.inTopic) .option("startingOffsets", "earliest") .option("endingOffsets", "latest") .option("failOnDataLoss","false") .load() df .write .format("kafka") .option("kafka.bootstrap.servers", config.brokers) .option("topic", config.outTopic) .mode(SaveMode.Append) .save() val *endTime* = System.nanoTime() val elapsedSecs = (endTime - startTime) / 1E9 // static input sample was used, fixed row count. 
println(s"Took $elapsedSecs secs") spark.stop() > Performance regression in Kafka read > > > Key: SPARK-33635 > URL: https://issues.apache.org/jira/browse/SPARK-33635 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1 > Environment: A simple 5 node system. A simple data row of csv data in > kafka, evenly distributed between the partitions. > Open JDK 1.8.0.252 > Spark in stand alone - 5 nodes, 10 workers (2 worker per node, each locked to > a distinct NUMA group) > kafka (v 2.3.1) cluster - 5 nodes (1 broker per node). > Centos 7.7.1908
[jira] [Comment Edited] (SPARK-33635) Performance regression in Kafka read
[ https://issues.apache.org/jira/browse/SPARK-33635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17256046#comment-17256046 ] David Wyles edited comment on SPARK-33635 at 12/29/20, 4:34 PM: [~gsomogyi] I now have my results. I was so unhappy about these results that I ran all the tests again; the only thing that changed between them was the version of Spark running on the cluster - everything else was static, and the data input from Kafka was an unchanging set of data. Input-> *672733262* rows +*Spark 2.4.5*:+ *440* seconds - *1,528,939* rows per second. +*Spark 3.0.1*:+ *990* seconds - *679,528* rows per second. These are multiple runs (I even took the best from Spark 3.0.1). I also captured the event logs between these two versions of Spark - should anyone find them useful. [event logs|https://drive.google.com/drive/folders/1aElmzVWmJqRALQimdOYxdJu559_3EX_9?usp=sharing] So, no matter what I do, I can only conclude that Spark 2.4.5 was a lot faster in this test case (in my production use case I'm just writing to parquet files in hdfs - which is where I noticed the degradation in performance). Is Spark SQL reading the source data twice, just as it would if there were an "order by" in the query? 
Sample code used: val spark = SparkSession.builder.appName("Kafka Read Performance") .config("spark.executor.memory","16g") .config("spark.cores.max", "10") .config("spark.eventLog.enabled","true") .config("spark.eventLog.dir","file:///tmp/spark-events") .config("spark.eventLog.overwrite","true") .getOrCreate() import spark.implicits._ val *startTime* = System.nanoTime() val df = spark .read .format("kafka") .option("kafka.bootstrap.servers", config.brokers) .option("subscribe", config.inTopic) .option("startingOffsets", "earliest") .option("endingOffsets", "latest") .option("failOnDataLoss","false") .load() df .write .format("kafka") .option("kafka.bootstrap.servers", config.brokers) .option("topic", config.outTopic) .mode(SaveMode.Append) .save() val *endTime* = System.nanoTime() val elapsedSecs = (endTime - startTime) / 1E9 // static input sample was used, fixed row count. println(s"Took $elapsedSecs secs") spark.stop() was (Author: david.wyles): [~gsomogyi] I now have my results. I was so unhappy about these results that I ran all the tests again; the only thing that changed between them was the version of Spark running on the cluster - everything else was static, and the data input from Kafka was an unchanging set of data. Input-> *672733262* rows +*Spark 2.4.5*:+ *440* seconds - *1,528,939* rows per second. +*Spark 3.0.1*:+ *990* seconds - *679,528* rows per second. These are multiple runs (I even took the best from Spark 3.0.1). I also captured the event logs between these two versions of Spark - should anyone find them useful. So, no matter what I do, I can only conclude that Spark 2.4.5 was a lot faster in this test case (in my production use case I'm just writing to parquet files in hdfs - which is where I noticed the degradation in performance). Is Spark SQL reading the source data twice, just as it would if there were an "order by" in the query? 
Sample code used: val spark = SparkSession.builder.appName("Kafka Read Performance") .config("spark.executor.memory","16g") .config("spark.cores.max", "10") .config("spark.eventLog.enabled","true") .config("spark.eventLog.dir","file:///tmp/spark-events") .config("spark.eventLog.overwrite","true") .getOrCreate() import spark.implicits._ val *startTime* = System.nanoTime() val df = spark .read .format("kafka") .option("kafka.bootstrap.servers", config.brokers) .option("subscribe", config.inTopic) .option("startingOffsets", "earliest") .option("endingOffsets", "latest") .option("failOnDataLoss","false") .load() df .write .format("kafka") .option("kafka.bootstrap.servers", config.brokers) .option("topic", config.outTopic) .mode(SaveMode.Append) .save() val *endTime* = System.nanoTime() val elapsedSecs = (endTime - startTime) / 1E9 // static input sample was used, fixed row count. println(s"Took $elapsedSecs secs") spark.stop() > Performance regression in Kafka read > > > Key: SPARK-33635 > URL: https://issues.apache.org/jira/browse/SPARK-33635 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1 > Environment: A simple 5 node system. A simple data row of csv data in > kafka, evenly distributed between the partitions. > Open JDK 1.8.0.252 > Spark in stand alone - 5 nodes, 10 workers (2 worker per node, each locked to > a distinct NUMA group) > kafka (v 2.3.1) cluster - 5 nodes (1 broker per node). >
[jira] [Commented] (SPARK-33635) Performance regression in Kafka read
[ https://issues.apache.org/jira/browse/SPARK-33635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17256046#comment-17256046 ] David Wyles commented on SPARK-33635: - [~gsomogyi] I now have my results. I was so unhappy about these results that I ran all the tests again; the only thing that changed between them was the version of Spark running on the cluster - everything else was static, and the data input from Kafka was an unchanging set of data. Input-> *672733262* rows +*Spark 2.4.5*:+ *440* seconds - *1,528,939* rows per second. +*Spark 3.0.1*:+ *990* seconds - *679,528* rows per second. These are multiple runs (I even took the best from Spark 3.0.1). I also captured the event logs between these two versions of Spark - should anyone find them useful. So, no matter what I do, I can only conclude that Spark 2.4.5 was a lot faster in this test case (in my production use case I'm just writing to parquet files in hdfs - which is where I noticed the degradation in performance). Is Spark SQL reading the source data twice, just as it would if there were an "order by" in the query? Sample code used: val spark = SparkSession.builder.appName("Kafka Read Performance") .config("spark.executor.memory","16g") .config("spark.cores.max", "10") .config("spark.eventLog.enabled","true") .config("spark.eventLog.dir","file:///tmp/spark-events") .config("spark.eventLog.overwrite","true") .getOrCreate() import spark.implicits._ val *startTime* = System.nanoTime() val df = spark .read .format("kafka") .option("kafka.bootstrap.servers", config.brokers) .option("subscribe", config.inTopic) .option("startingOffsets", "earliest") .option("endingOffsets", "latest") .option("failOnDataLoss","false") .load() df .write .format("kafka") .option("kafka.bootstrap.servers", config.brokers) .option("topic", config.outTopic) .mode(SaveMode.Append) .save() val *endTime* = System.nanoTime() val elapsedSecs = (endTime - startTime) / 1E9 // static input sample was used, fixed row count. 
println(s"Took $elapsedSecs secs") spark.stop() > Performance regression in Kafka read > > > Key: SPARK-33635 > URL: https://issues.apache.org/jira/browse/SPARK-33635 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1 > Environment: A simple 5 node system. A simple data row of csv data in > kafka, evenly distributed between the partitions. > Open JDK 1.8.0.252 > Spark in stand alone - 5 nodes, 10 workers (2 worker per node, each locked to > a distinct NUMA group) > kafka (v 2.3.1) cluster - 5 nodes (1 broker per node). > Centos 7.7.1908 > 1 topic, 10 partitions, 1 hour queue life > (this is just one of the clusters we have, I have tested on all of them and > they all exhibit the same performance degradation) >Reporter: David Wyles >Priority: Major > > I have observed a slowdown in the reading of data from kafka on all of our > systems when migrating from spark 2.4.5 to Spark 3.0.0 (and Spark 3.0.1) > I have created a sample project to isolate the problem as much as possible, > with just a read all data from a kafka topic (see > [https://github.com/codegorillauk/spark-kafka-read] ). > With 2.4.5, across multiple runs, > I get a stable read rate of 1,120,000 (1.12 mill) rows per second > With 3.0.0 or 3.0.1, across multiple runs, > I get a stable read rate of 632,000 (0.632 mil) rows per second > That represents a *44% loss in performance*. Which is a lot. > I have been working through the spark-sql-kafka-0-10 code base, but changes for > spark 3 have been ongoing for over a year and it's difficult to pinpoint an > exact change or reason for the degradation. > I am happy to help fix this problem, but will need some assistance as I am > unfamiliar with the spark-sql-kafka-0-10 project. 
> > A sample of the data my test reads (note: its not parsing csv - this is just > test data) > > 160692180,001e0610e532,lightsense,tsl250rd,intensity,21853,53.262,acceleration_z,651,ep,290,commit,913,pressure,138,pm1,799,uv_intensity,823,idletime,-372,count,-72,ir_intensity,185,concentration,-61,flags,-532,tx,694.36,ep_heatsink,-556.92,acceleration_x,-221.40,fw,910.53,sample_flow_rate,-959.60,uptime,-515.15,pm10,-768.03,powersupply,214.72,magnetic_field_y,-616.04,alphasense,606.73,AoT_Chicago,053,Racine > Ave & 18th St Chicago IL,41.857959,-87.6564270002,AoT Chicago (S) > [C],2017/12/15 00:00:00, -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
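The throughput figures quoted in the comments follow directly from the fixed row count and the wall-clock times. A quick sanity check of the arithmetic (plain Python, nothing Spark-specific):

```python
# Reproduce the reported throughput numbers: rows / elapsed seconds.
rows = 672_733_262                      # fixed input row count from the comment
spark_245_secs, spark_301_secs = 440, 990

rate_245 = rows // spark_245_secs       # Spark 2.4.5 rows/second
rate_301 = rows // spark_301_secs       # Spark 3.0.1 rows/second

drop = 1 - rate_301 / rate_245          # fractional throughput loss in these runs
print(rate_245, rate_301, f"{drop:.0%}")
```

These particular runs show roughly a 56% throughput drop, somewhat steeper than the 44% figure in the issue description, which was measured from different runs (1.12M vs 0.632M rows per second).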
[jira] [Resolved] (SPARK-33910) Simplify/Optimize conditional expressions
[ https://issues.apache.org/jira/browse/SPARK-33910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-33910. - Fix Version/s: 3.2.0 Resolution: Fixed > Simplify/Optimize conditional expressions > -- > > Key: SPARK-33910 > URL: https://issues.apache.org/jira/browse/SPARK-33910 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.2.0 > > > 1. Push down the foldable expressions through CaseWhen/If > 2. Simplify conditionals in predicates > 3. Push the UnaryExpression into (if / case) branches > 4. Simplify CaseWhen if elseValue is None > 5. Simplify CaseWhen clauses with (true and false) and (false and true) > Common use cases are: > {code:sql} > create table t1 using parquet as select * from range(100); > create table t2 using parquet as select * from range(200); > create temp view v1 as > select 'a' as event_type, * from t1 > union all > select CASE WHEN id = 1 THEN 'b' ELSE 'c' end as event_type, * from t2 > {code} > 1. Avoid reading the whole table. 
> {noformat} > explain select * from v1 where event_type = 'a'; > Before simplify: > == Physical Plan == > Union > :- *(1) Project [a AS event_type#7, id#9L] > : +- *(1) ColumnarToRow > : +- FileScan parquet default.t1[id#9L] Batched: true, DataFilters: [], > Format: Parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: > struct > +- *(2) Project [CASE WHEN (id#10L = 1) THEN b ELSE c END AS event_type#8, > id#10L] >+- *(2) Filter (CASE WHEN (id#10L = 1) THEN b ELSE c END = a) > +- *(2) ColumnarToRow > +- FileScan parquet default.t2[id#10L] Batched: true, DataFilters: > [(CASE WHEN (id#10L = 1) THEN b ELSE c END = a)], Format: Parquet, > PartitionFilters: [], PushedFilters: [], ReadSchema: struct > After simplify: > == Physical Plan == > *(1) Project [a AS event_type#8, id#4L] > +- *(1) ColumnarToRow >+- FileScan parquet default.t1[id#4L] Batched: true, DataFilters: [], > Format: Parquet > {noformat} > 2. Push down the conditional expressions to data source. > {noformat} > explain select * from v1 where event_type = 'b'; > Before simplify: > == Physical Plan == > Union > :- LocalTableScan , [event_type#7, id#9L] > +- *(1) Project [CASE WHEN (id#10L = 1) THEN b ELSE c END AS event_type#8, > id#10L] >+- *(1) Filter (CASE WHEN (id#10L = 1) THEN b ELSE c END = b) > +- *(1) ColumnarToRow > +- FileScan parquet default.t2[id#10L] Batched: true, DataFilters: > [(CASE WHEN (id#10L = 1) THEN b ELSE c END = b)], Format: Parquet, > PartitionFilters: [], PushedFilters: [], ReadSchema: struct > After simplify: > == Physical Plan == > *(1) Project [CASE WHEN (id#5L = 1) THEN b ELSE c END AS event_type#8, id#5L > AS id#4L] > +- *(1) Filter (isnotnull(id#5L) AND (id#5L = 1)) >+- *(1) ColumnarToRow > +- FileScan parquet default.t2[id#5L] Batched: true, DataFilters: > [isnotnull(id#5L), (id#5L = 1)], Format: Parquet, PartitionFilters: [], > PushedFilters: [IsNotNull(id), EqualTo(id,1)], ReadSchema: struct > {noformat} > 3. Reduce the amount of calculation. 
> {noformat} > Before simplify: > explain select event_type = 'e' from v1; > == Physical Plan == > Union > :- *(1) Project [false AS (event_type = e)#37] > : +- *(1) ColumnarToRow > : +- FileScan parquet default.t1[] Batched: true, DataFilters: [], > Format: Parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<> > +- *(2) Project [(CASE WHEN (id#21L = 1) THEN b ELSE c END = e) AS > (event_type = e)#38] >+- *(2) ColumnarToRow > +- FileScan parquet default.t2[id#21L] Batched: true, DataFilters: [], > Format: Parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: > struct > After simplify: > == Physical Plan == > Union > :- *(1) Project [false AS (event_type = e)#10] > : +- *(1) ColumnarToRow > : +- FileScan parquet default.t1[] Batched: true, DataFilters: [], > Format: Parquet, > +- *(2) Project [false AS (event_type = e)#14] >+- *(2) ColumnarToRow > +- FileScan parquet default.t2[] Batched: true, DataFilters: [], > Format: Parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<> > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
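The plan improvements above all come from folding a comparison against a CaseWhen into its branches. A toy sketch of that rewrite for `(CASE WHEN cond THEN v1 ELSE v2 END) = literal` - a hypothetical helper for illustration, not Spark's optimizer code; the string "COND" stands for the non-foldable branch condition (e.g. `id = 1`):

```python
# Toy constant-folding of (CASE WHEN cond THEN v1 ELSE v2 END) = literal.
# v1, v2 and literal are constants; the branch condition itself is opaque.

def simplify_case_eq(v1, v2, literal):
    then_matches, else_matches = (v1 == literal), (v2 == literal)
    if then_matches and else_matches:
        return True          # both branches equal the literal: always true
    if not then_matches and not else_matches:
        return False         # neither branch can match: always false
    if then_matches:
        return "COND"        # only THEN matches: predicate reduces to cond
    return "NOT COND"        # only ELSE matches: predicate reduces to NOT cond

print(simplify_case_eq('b', 'c', 'a'))  # False -> the t2 scan can be pruned
print(simplify_case_eq('b', 'c', 'b'))  # 'COND' -> pushed down as id = 1
```

This mirrors the two plans shown: `event_type = 'a'` folds to false on the t2 branch (only t1 is read), and `event_type = 'b'` reduces to `id = 1`, which is pushed to the data source as `EqualTo(id,1)`.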
[jira] [Assigned] (SPARK-33910) Simplify/Optimize conditional expressions
[ https://issues.apache.org/jira/browse/SPARK-33910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-33910: --- Assignee: Yuming Wang > Simplify/Optimize conditional expressions > -- > > Key: SPARK-33910 > URL: https://issues.apache.org/jira/browse/SPARK-33910 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > > 1. Push down the foldable expressions through CaseWhen/If > 2. Simplify conditionals in predicates > 3. Push the UnaryExpression into (if / case) branches > 4. Simplify CaseWhen if elseValue is None > 5. Simplify CaseWhen clauses with (true and false) and (false and true) > Common use cases are: > {code:sql} > create table t1 using parquet as select * from range(100); > create table t2 using parquet as select * from range(200); > create temp view v1 as > select 'a' as event_type, * from t1 > union all > select CASE WHEN id = 1 THEN 'b' ELSE 'c' end as event_type, * from t2 > {code} > 1. Avoid reading the whole table. 
> {noformat} > explain select * from v1 where event_type = 'a'; > Before simplify: > == Physical Plan == > Union > :- *(1) Project [a AS event_type#7, id#9L] > : +- *(1) ColumnarToRow > : +- FileScan parquet default.t1[id#9L] Batched: true, DataFilters: [], > Format: Parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: > struct > +- *(2) Project [CASE WHEN (id#10L = 1) THEN b ELSE c END AS event_type#8, > id#10L] >+- *(2) Filter (CASE WHEN (id#10L = 1) THEN b ELSE c END = a) > +- *(2) ColumnarToRow > +- FileScan parquet default.t2[id#10L] Batched: true, DataFilters: > [(CASE WHEN (id#10L = 1) THEN b ELSE c END = a)], Format: Parquet, > PartitionFilters: [], PushedFilters: [], ReadSchema: struct > After simplify: > == Physical Plan == > *(1) Project [a AS event_type#8, id#4L] > +- *(1) ColumnarToRow >+- FileScan parquet default.t1[id#4L] Batched: true, DataFilters: [], > Format: Parquet > {noformat} > 2. Push down the conditional expressions to data source. > {noformat} > explain select * from v1 where event_type = 'b'; > Before simplify: > == Physical Plan == > Union > :- LocalTableScan , [event_type#7, id#9L] > +- *(1) Project [CASE WHEN (id#10L = 1) THEN b ELSE c END AS event_type#8, > id#10L] >+- *(1) Filter (CASE WHEN (id#10L = 1) THEN b ELSE c END = b) > +- *(1) ColumnarToRow > +- FileScan parquet default.t2[id#10L] Batched: true, DataFilters: > [(CASE WHEN (id#10L = 1) THEN b ELSE c END = b)], Format: Parquet, > PartitionFilters: [], PushedFilters: [], ReadSchema: struct > After simplify: > == Physical Plan == > *(1) Project [CASE WHEN (id#5L = 1) THEN b ELSE c END AS event_type#8, id#5L > AS id#4L] > +- *(1) Filter (isnotnull(id#5L) AND (id#5L = 1)) >+- *(1) ColumnarToRow > +- FileScan parquet default.t2[id#5L] Batched: true, DataFilters: > [isnotnull(id#5L), (id#5L = 1)], Format: Parquet, PartitionFilters: [], > PushedFilters: [IsNotNull(id), EqualTo(id,1)], ReadSchema: struct > {noformat} > 3. Reduce the amount of calculation. 
> {noformat} > Before simplify: > explain select event_type = 'e' from v1; > == Physical Plan == > Union > :- *(1) Project [false AS (event_type = e)#37] > : +- *(1) ColumnarToRow > : +- FileScan parquet default.t1[] Batched: true, DataFilters: [], > Format: Parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<> > +- *(2) Project [(CASE WHEN (id#21L = 1) THEN b ELSE c END = e) AS > (event_type = e)#38] >+- *(2) ColumnarToRow > +- FileScan parquet default.t2[id#21L] Batched: true, DataFilters: [], > Format: Parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: > struct > After simplify: > == Physical Plan == > Union > :- *(1) Project [false AS (event_type = e)#10] > : +- *(1) ColumnarToRow > : +- FileScan parquet default.t1[] Batched: true, DataFilters: [], > Format: Parquet, > +- *(2) Project [false AS (event_type = e)#14] >+- *(2) ColumnarToRow > +- FileScan parquet default.t2[] Batched: true, DataFilters: [], > Format: Parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<> > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33934) Support automatically identify the Python file and execute it
[ https://issues.apache.org/jira/browse/SPARK-33934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-33934: -- Parent: SPARK-31936 Issue Type: Sub-task (was: Improvement) > Support automatically identify the Python file and execute it > - > > Key: SPARK-33934 > URL: https://issues.apache.org/jira/browse/SPARK-33934 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Priority: Major > > In Hive script transform, we can use `USING xxx.py` but in Spark we will got > error > {code:java} > Job aborted due to stage failure: Task 17 in stage 530.0 failed 4 times, most > recent failure: Lost task 17.3 in stage 530.0 (TID 38639, host, executor > 339): org.apache.spark.SparkException: Subprocess exited with status 127. > Error: /bin/bash: xxx.py: can't find the command > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33934) Support automatically identify the Python file and execute it
[ https://issues.apache.org/jira/browse/SPARK-33934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-33934: -- Description: In Hive script transform, we can use `USING xxx.py` but in Spark we will got error {code:java} Job aborted due to stage failure: Task 17 in stage 530.0 failed 4 times, most recent failure: Lost task 17.3 in stage 530.0 (TID 38639, host, executor 339): org.apache.spark.SparkException: Subprocess exited with status 127. Error: /bin/bash: xxx.py: can't find the command {code} was: In Hive script transform, we can use `USING xxx/py` but in Spark we will got error {code:java} Job aborted due to stage failure: Task 17 in stage 530.0 failed 4 times, most recent failure: Lost task 17.3 in stage 530.0 (TID 38639, host, executor 339): org.apache.spark.SparkException: Subprocess exited with status 127. Error: /bin/bash: xxx.py: can't find the command {code} > Support automatically identify the Python file and execute it > - > > Key: SPARK-33934 > URL: https://issues.apache.org/jira/browse/SPARK-33934 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Priority: Major > > In Hive script transform, we can use `USING xxx.py` but in Spark we will got > error > {code:java} > Job aborted due to stage failure: Task 17 in stage 530.0 failed 4 times, most > recent failure: Lost task 17.3 in stage 530.0 (TID 38639, host, executor > 339): org.apache.spark.SparkException: Subprocess exited with status 127. > Error: /bin/bash: xxx.py: can't find the command > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33934) Support automatically identify the Python file and execute it
angerszhu created SPARK-33934: - Summary: Support automatically identify the Python file and execute it Key: SPARK-33934 URL: https://issues.apache.org/jira/browse/SPARK-33934 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: angerszhu In Hive script transform, we can use `USING xxx.py`, but in Spark we get an error {code:java} Job aborted due to stage failure: Task 17 in stage 530.0 failed 4 times, most recent failure: Lost task 17.3 in stage 530.0 (TID 38639, host, executor 339): org.apache.spark.SparkException: Subprocess exited with status 127. Error: /bin/bash: xxx.py: can't find the command {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
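The exit status 127 above is bash reporting that the bare script name is not an executable command. One way to "automatically identify" the script, sketched below in plain Python, is to inspect the file extension and prepend an interpreter. The extension-to-interpreter mapping and the function name are illustrative assumptions, not Spark's actual implementation.

```python
import os

# Hypothetical mapping from script extension to interpreter command.
INTERPRETERS = {".py": "python", ".pl": "perl", ".rb": "ruby", ".sh": "bash"}

def wrap_script_command(command: str) -> str:
    """Prepend an interpreter when the command starts with a known script file."""
    # Look only at the first token; the rest are arguments to the script.
    _, ext = os.path.splitext(command.split()[0])
    interpreter = INTERPRETERS.get(ext)
    return f"{interpreter} {command}" if interpreter else command

print(wrap_script_command("xxx.py arg1"))  # python xxx.py arg1
print(wrap_script_command("cat"))          # cat
```

With a wrapper like this, `USING 'xxx.py'` would be handed to the shell as `python xxx.py` instead of as a bare file name.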
[jira] [Assigned] (SPARK-32684) Add a test case for hive serde/default-serde mode's null value '\\N'
[ https://issues.apache.org/jira/browse/SPARK-32684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32684: Assignee: (was: Apache Spark) > Add a test case for hive serde/default-serde mode's null value '\\N' > > > Key: SPARK-32684 > URL: https://issues.apache.org/jira/browse/SPARK-32684 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Priority: Minor > > Hive serde default NULL value is '\N' > {code:java} > String nullString = tbl.getProperty( > serdeConstants.SERIALIZATION_NULL_FORMAT, "\\N"); > nullSequence = new Text(nullString); > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32684) Add a test case for hive serde/default-serde mode's null value '\\N'
[ https://issues.apache.org/jira/browse/SPARK-32684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32684: Assignee: Apache Spark > Add a test case for hive serde/default-serde mode's null value '\\N' > > > Key: SPARK-32684 > URL: https://issues.apache.org/jira/browse/SPARK-32684 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Assignee: Apache Spark >Priority: Minor > > Hive serde default NULL value is '\N' > {code:java} > String nullString = tbl.getProperty( > serdeConstants.SERIALIZATION_NULL_FORMAT, "\\N"); > nullSequence = new Text(nullString); > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32684) Add a test case for hive serde/default-serde mode's null value '\\N'
[ https://issues.apache.org/jira/browse/SPARK-32684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-32684: - Priority: Minor (was: Major) > Add a test case for hive serde/default-serde mode's null value '\\N' > > > Key: SPARK-32684 > URL: https://issues.apache.org/jira/browse/SPARK-32684 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Priority: Minor > > Hive serde default NULL value is '\N' > {code:java} > String nullString = tbl.getProperty( > serdeConstants.SERIALIZATION_NULL_FORMAT, "\\N"); > nullSequence = new Text(nullString); > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32684) Add a test case for hive serde/default-serde mode's null value '\\N'
[ https://issues.apache.org/jira/browse/SPARK-32684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-32684: - Summary: Add a test case for hive serde/default-serde mode's null value '\\N' (was: Scrip transform hive serde/default-serde mode null value keep same with hive as '\\N') > Add a test case for hive serde/default-serde mode's null value '\\N' > > > Key: SPARK-32684 > URL: https://issues.apache.org/jira/browse/SPARK-32684 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: angerszhu >Priority: Major > > Hive serde default NULL value is '\N' > {code:java} > String nullString = tbl.getProperty( > serdeConstants.SERIALIZATION_NULL_FORMAT, "\\N"); > nullSequence = new Text(nullString); > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
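The Hive behavior the quoted Java snippet configures — `serialization.null.format` defaulting to the two-character sequence `\N` — can be modeled with a small sketch. This is a plain-Python illustration of the round-trip the test case should cover, not Hive's or Spark's serde code; the function names are hypothetical.

```python
# Hive's LazySimpleSerDe null marker: a literal backslash followed by N.
DEFAULT_NULL_FORMAT = "\\N"

def serialize_field(value, null_format=DEFAULT_NULL_FORMAT):
    """Write NULL as the null-format marker, everything else as text."""
    return null_format if value is None else str(value)

def deserialize_field(text, null_format=DEFAULT_NULL_FORMAT):
    """Read the null-format marker back as NULL."""
    return None if text == null_format else text

row = [1, None, "x"]
line = "\t".join(serialize_field(v) for v in row)
print(line)                                              # 1	\N	x
print([deserialize_field(f) for f in line.split("\t")])  # ['1', None, 'x']
```

A test asserting this round-trip for both the default-serde and Hive-serde script transform modes is what the ticket asks for.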
[jira] [Commented] (SPARK-33859) Support V2 ALTER TABLE .. RENAME PARTITION
[ https://issues.apache.org/jira/browse/SPARK-33859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17256015#comment-17256015 ] Apache Spark commented on SPARK-33859: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/30964 > Support V2 ALTER TABLE .. RENAME PARTITION > -- > > Key: SPARK-33859 > URL: https://issues.apache.org/jira/browse/SPARK-33859 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.2.0 > > > Need to implement v2 execution node for ALTER TABLE .. RENAME PARTITION > similar to v1 implementation: > https://github.com/apache/spark/blob/40c37d69fd003ed6079ee8c139dba5c15915c568/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala#L513 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33859) Support V2 ALTER TABLE .. RENAME PARTITION
[ https://issues.apache.org/jira/browse/SPARK-33859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17256014#comment-17256014 ] Apache Spark commented on SPARK-33859: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/30964 > Support V2 ALTER TABLE .. RENAME PARTITION > -- > > Key: SPARK-33859 > URL: https://issues.apache.org/jira/browse/SPARK-33859 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.2.0 > > > Need to implement v2 execution node for ALTER TABLE .. RENAME PARTITION > similar to v1 implementation: > https://github.com/apache/spark/blob/40c37d69fd003ed6079ee8c139dba5c15915c568/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala#L513 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32684) Add a test case for hive serde/default-serde mode's null value '\\N'
[ https://issues.apache.org/jira/browse/SPARK-32684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-32684: - Affects Version/s: (was: 3.0.0) 3.2.0 > Add a test case for hive serde/default-serde mode's null value '\\N' > > > Key: SPARK-32684 > URL: https://issues.apache.org/jira/browse/SPARK-32684 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Priority: Major > > Hive serde default NULL value is '\N' > {code:java} > String nullString = tbl.getProperty( > serdeConstants.SERIALIZATION_NULL_FORMAT, "\\N"); > nullSequence = new Text(nullString); > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-32684) Script transform hive serde/default-serde mode null value keep same with hive as '\\N'
[ https://issues.apache.org/jira/browse/SPARK-32684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reopened SPARK-32684: -- > Scrip transform hive serde/default-serde mode null value keep same with hive > as '\\N' > - > > Key: SPARK-32684 > URL: https://issues.apache.org/jira/browse/SPARK-32684 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: angerszhu >Priority: Major > > Hive serde default NULL value is '\N' > {code:java} > String nullString = tbl.getProperty( > serdeConstants.SERIALIZATION_NULL_FORMAT, "\\N"); > nullSequence = new Text(nullString); > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33930) Spark SQL no serde row format field delimit default is '\u0001'
[ https://issues.apache.org/jira/browse/SPARK-33930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-33930. -- Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 30958 [https://github.com/apache/spark/pull/30958] > Spark SQL no serde row format field delimit default is '\u0001' > --- > > Key: SPARK-33930 > URL: https://issues.apache.org/jira/browse/SPARK-33930 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Fix For: 3.2.0 > > > For same sql > {code:java} > SELECT TRANSFORM(a, b, c, null) > ROW FORMAT DELIMITED > USING 'cat' > ROW FORMAT DELIMITED > FIELDS TERMINATED BY '&' > FROM (select 1 as a, 2 as b, 3 as c) t > {code} > !image-2020-12-29-13-11-31-336.png! > > !image-2020-12-29-13-11-45-734.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33930) Spark SQL no serde row format field delimit default is '\u0001'
[ https://issues.apache.org/jira/browse/SPARK-33930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-33930: Assignee: angerszhu > Spark SQL no serde row format field delimit default is '\u0001' > --- > > Key: SPARK-33930 > URL: https://issues.apache.org/jira/browse/SPARK-33930 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > > For same sql > {code:java} > SELECT TRANSFORM(a, b, c, null) > ROW FORMAT DELIMITED > USING 'cat' > ROW FORMAT DELIMITED > FIELDS TERMINATED BY '&' > FROM (select 1 as a, 2 as b, 3 as c) t > {code} > !image-2020-12-29-13-11-31-336.png! > > !image-2020-12-29-13-11-45-734.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
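The discrepancy this ticket fixes is between two TRANSFORM queries that differ only in the `FIELDS TERMINATED BY` clause: with `ROW FORMAT DELIMITED` and no explicit terminator, Hive separates fields with `\u0001` (Ctrl-A), not a tab. A plain-Python sketch of that rule follows; it is illustrative only, and the helper name is an assumption.

```python
# Hive's default field delimiter for ROW FORMAT DELIMITED without an
# explicit FIELDS TERMINATED BY clause is Ctrl-A.
HIVE_DEFAULT_FIELD_DELIMITER = "\u0001"

def format_transform_row(fields, terminator=None):
    """Join script-transform fields, honoring an optional explicit terminator."""
    delimiter = terminator or HIVE_DEFAULT_FIELD_DELIMITER
    # NULLs are written with the default null format '\N' (see SPARK-32684).
    return delimiter.join("\\N" if f is None else str(f) for f in fields)

print(repr(format_transform_row([1, 2, 3, None])))      # '1\x012\x013\x01\\N'
print(format_transform_row([1, 2, 3], terminator="&"))  # 1&2&3
```

The second call corresponds to the `FIELDS TERMINATED BY '&'` clause in the SQL above; the first shows what the script feeding `cat` should see by default.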
[jira] [Resolved] (SPARK-33909) Check rand functions seed is legal at analyzer side
[ https://issues.apache.org/jira/browse/SPARK-33909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-33909. - Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 30923 [https://github.com/apache/spark/pull/30923] > Check rand functions seed is legal at analyzer side > -- > > Key: SPARK-33909 > URL: https://issues.apache.org/jira/browse/SPARK-33909 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: ulysses you >Assignee: ulysses you >Priority: Minor > Fix For: 3.2.0 > > > It's better to check that the seed expression is legal at the analyzer side instead of > at execution. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33909) Check rand functions seed is legal at analyzer side
[ https://issues.apache.org/jira/browse/SPARK-33909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-33909: --- Assignee: ulysses you > Check rand functions seed is legal at analyzer side > -- > > Key: SPARK-33909 > URL: https://issues.apache.org/jira/browse/SPARK-33909 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: ulysses you >Assignee: ulysses you >Priority: Minor > > It's better to check that the seed expression is legal at the analyzer side instead of > at execution. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
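The point of moving the check to the analyzer is to fail fast: `rand()`/`randn()` seeds must be integer literals, so an illegal seed can be rejected during analysis instead of midway through job execution. A plain-Python sketch of such a validation follows; the function name, default, and error message are illustrative assumptions, not Spark's code.

```python
def check_rand_seed(seed):
    """Reject an illegal rand()/randn() seed at analysis time (sketch).

    Raises ValueError (standing in for AnalysisException) unless the seed
    is an integer literal; a missing seed falls back to a default here.
    """
    if seed is None:
        return 0  # hypothetical default; Spark picks a seed when none is given
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(seed, bool) or not isinstance(seed, int):
        raise ValueError(
            f"Input argument to rand must be an integer literal, got {seed!r}")
    return seed

print(check_rand_seed(42))  # 42
try:
    check_rand_seed("abc")  # rejected before any job runs
except ValueError as e:
    print(e)
```

Running the check while resolving the expression tree means a query like `SELECT rand('abc')` errors immediately rather than after tasks are scheduled.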
[jira] [Assigned] (SPARK-33859) Support V2 ALTER TABLE .. RENAME PARTITION
[ https://issues.apache.org/jira/browse/SPARK-33859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-33859: --- Assignee: Maxim Gekk > Support V2 ALTER TABLE .. RENAME PARTITION > -- > > Key: SPARK-33859 > URL: https://issues.apache.org/jira/browse/SPARK-33859 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > > Need to implement v2 execution node for ALTER TABLE .. RENAME PARTITION > similar to v1 implementation: > https://github.com/apache/spark/blob/40c37d69fd003ed6079ee8c139dba5c15915c568/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala#L513 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33859) Support V2 ALTER TABLE .. RENAME PARTITION
[ https://issues.apache.org/jira/browse/SPARK-33859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-33859. - Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 30935 [https://github.com/apache/spark/pull/30935] > Support V2 ALTER TABLE .. RENAME PARTITION > -- > > Key: SPARK-33859 > URL: https://issues.apache.org/jira/browse/SPARK-33859 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.2.0 > > > Need to implement v2 execution node for ALTER TABLE .. RENAME PARTITION > similar to v1 implementation: > https://github.com/apache/spark/blob/40c37d69fd003ed6079ee8c139dba5c15915c568/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala#L513 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33871) Cannot access to column after left semi join and left join
[ https://issues.apache.org/jira/browse/SPARK-33871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17255978#comment-17255978 ] Hyukjin Kwon commented on SPARK-33871: -- +1 for [~viirya]'s advice here. > Cannot access to column after left semi join and left join > --- > > Key: SPARK-33871 > URL: https://issues.apache.org/jira/browse/SPARK-33871 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Evgenii Samusenko >Priority: Minor > > Cannot access to column after left semi join and left join > {code} > val col = "c1" > val df = Seq((1, "a"),(2, "a"),(3, "a"),(4, "a")).toDF(col, "c2") > val df2 = Seq(1).toDF(col) > val semiJoin = df.join(df2, df(col) === df2(col), "left_semi") > val left = df.join(semiJoin, df(col) === semiJoin(col), "left") > left.show > +---+---+++ > | c1| c2| c1| c2| > +---+---+++ > | 1| a| 1| a| > | 2| a|null|null| > | 3| a|null|null| > | 4| a|null|null| > +---+---+++ > left.select(semiJoin(col)) > +---+ > | c1| > +---+ > | 1| > | 2| > | 3| > | 4| > +---+ > left.select(df(col)) > +---+ > | c1| > +---+ > | 1| > | 2| > | 3| > | 4| > +---+ > {code} > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33927) Fix Spark Release image
[ https://issues.apache.org/jira/browse/SPARK-33927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17255977#comment-17255977 ] Hyukjin Kwon commented on SPARK-33927: -- Thanks for letting me know [~dongjoon]. I will likely have to take a look for this one this week :-). > Fix Spark Release image > --- > > Key: SPARK-33927 > URL: https://issues.apache.org/jira/browse/SPARK-33927 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Priority: Blocker > > The release script seems to be broken. This is a blocker for Apache Spark > 3.1.0 release. > {code} > $ cd dev/create-release/spark-rm > $ docker build -t spark-rm . > ... > exit code: 1 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33929) Spark-submit with --package deequ doesn't pull all jars
[ https://issues.apache.org/jira/browse/SPARK-33929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-33929: - Affects Version/s: (was: 2.3.4) (was: 2.3.3) (was: 2.3.2) (was: 2.3.1) (was: 2.3.0) 2.4.7 > Spark-submit with --package deequ doesn't pull all jars > --- > > Key: SPARK-33929 > URL: https://issues.apache.org/jira/browse/SPARK-33929 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.4.7 >Reporter: Dustin Smith >Priority: Major > > This issue was marked as solved SPARK-24074; however, another [~hyukjin.kwon] > pointed out in the comments that version 2.4x was experiencing this same > problem when using Amazon Deequ. > This problem exist in 2.3.x ecosystem as well for Deequ. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33929) Spark-submit with --package deequ doesn't pull all jars
[ https://issues.apache.org/jira/browse/SPARK-33929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-33929: - Target Version/s: (was: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4) > Spark-submit with --package deequ doesn't pull all jars > --- > > Key: SPARK-33929 > URL: https://issues.apache.org/jira/browse/SPARK-33929 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4 >Reporter: Dustin Smith >Priority: Major > > This issue was marked as solved SPARK-24074; however, another [~hyukjin.kwon] > pointed out in the comments that version 2.4x was experiencing this same > problem when using Amazon Deequ. > This problem exist in 2.3.x ecosystem as well for Deequ. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32968) Column pruning for CsvToStructs
[ https://issues.apache.org/jira/browse/SPARK-32968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-32968. -- Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 30912 [https://github.com/apache/spark/pull/30912] > Column pruning for CsvToStructs > --- > > Key: SPARK-32968 > URL: https://issues.apache.org/jira/browse/SPARK-32968 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > Fix For: 3.2.0 > > > We could do column pruning for CsvToStructs expression if we only require > some fields from it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
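The idea behind pruning `CsvToStructs` is that when a query reads only some fields of the parsed struct, the parser can be handed a required schema and materialize just those columns (field positions in the line must still be honored while splitting). The sketch below models that in plain Python with the standard `csv` module; it is illustrative, not Spark's parser, and the function name is an assumption.

```python
import csv
import io

def parse_csv_pruned(line, schema, required):
    """Parse one CSV line, materializing only the required fields.

    schema: ordered list of all field names in the CSV line.
    required: subset of names the query actually reads.
    """
    values = next(csv.reader(io.StringIO(line)))
    # All positions are split, but only required columns are converted/kept.
    return {name: value for name, value in zip(schema, values) if name in required}

print(parse_csv_pruned("1,alice,2020-01-01", ["id", "name", "date"], {"name"}))
# {'name': 'alice'}
```

In Spark the analogous win is larger than this sketch suggests, since skipping unneeded columns avoids type conversion and struct construction per row, not just dictionary entries.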
[jira] [Commented] (SPARK-33926) Improve the error message in resolving of DSv1 multi-part identifiers
[ https://issues.apache.org/jira/browse/SPARK-33926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17255953#comment-17255953 ] Apache Spark commented on SPARK-33926: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/30963 > Improve the error message in resolving of DSv1 multi-part identifiers > - > > Key: SPARK-33926 > URL: https://issues.apache.org/jira/browse/SPARK-33926 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Maxim Gekk >Priority: Major > > This is a follow up of > https://github.com/apache/spark/pull/30915#discussion_r549240857 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33933) Broadcast timeout happened unexpectedly in AQE
[ https://issues.apache.org/jira/browse/SPARK-33933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17255904#comment-17255904 ] Apache Spark commented on SPARK-33933: -- User 'zhongyu09' has created a pull request for this issue: https://github.com/apache/spark/pull/30962 > Broadcast timeout happened unexpectedly in AQE > --- > > Key: SPARK-33933 > URL: https://issues.apache.org/jira/browse/SPARK-33933 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1 >Reporter: Yu Zhong >Assignee: Apache Spark >Priority: Major > > In Spark 3.0, when AQE is enabled, broadcast timeouts often happen in normal > queries, as below. > > {code:java} > Could not execute broadcast in 300 secs. You can increase the timeout for > broadcasts via spark.sql.broadcastTimeout or disable broadcast join by > setting spark.sql.autoBroadcastJoinThreshold to -1 > {code} > > This usually happens when a broadcast join (with or without hint) follows a > long-running shuffle (more than 5 minutes). Disabling AQE makes the issue > disappear. > Increasing spark.sql.broadcastTimeout works around it, but > because the data to broadcast is very small, that doesn't make sense. > After investigation, the root cause should be this: when AQE is enabled, in > getFinalPhysicalPlan, Spark traverses the physical plan bottom-up, creates > query stages for the materialized parts via createQueryStages, and materializes those > newly created query stages to submit map stages or broadcasts. When > ShuffleQueryStages are materialized before BroadcastQueryStages, the map job > and broadcast job are submitted almost at the same time, but the map job will > hold all the computing resources. If the map job runs slowly (when lots of data > needs processing and resources are limited), the broadcast job cannot be > started (and finished) before spark.sql.broadcastTimeout, thus causing the whole job > to fail (introduced in SPARK-31475). 
> Code to reproduce: > > {code:java} > import java.util.UUID > import scala.util.Random > import org.apache.spark.sql.functions._ > import org.apache.spark.sql.SparkSession > val spark = SparkSession.builder() > .master("local[2]") > .appName("Test Broadcast").getOrCreate() > import spark.implicits._ > spark.conf.set("spark.sql.adaptive.enabled", "true") > val sc = spark.sparkContext > sc.setLogLevel("INFO") > val uuid = UUID.randomUUID > val df = sc.parallelize(Range(0, 1), 1).flatMap(x => { > for (i <- Range(0, 1 + Random.nextInt(1))) > yield (x % 26, x, Random.nextInt(10), UUID.randomUUID.toString) > }).toDF("index", "part", "pv", "uuid") > .withColumn("md5", md5($"uuid")) > val dim_data = Range(0, 26).map(x => (('a' + x).toChar.toString, x)) > val dim = dim_data.toDF("name", "index") > val result = df.groupBy("index") > .agg(sum($"pv").alias("pv"), countDistinct("uuid").alias("uv")) > .join(dim, Seq("index")) > .collect(){code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33933) Broadcast timeout happened unexpectedly in AQE
[ https://issues.apache.org/jira/browse/SPARK-33933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17255903#comment-17255903 ] Apache Spark commented on SPARK-33933: -- User 'zhongyu09' has created a pull request for this issue: https://github.com/apache/spark/pull/30962 > Broadcast timeout happened unexpectedly in AQE > --- > > Key: SPARK-33933 > URL: https://issues.apache.org/jira/browse/SPARK-33933 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1 >Reporter: Yu Zhong >Priority: Major > > In Spark 3.0, when AQE is enabled, there is often broadcast timeout in normal > queries as below. > > {code:java} > Could not execute broadcast in 300 secs. You can increase the timeout for > broadcasts via spark.sql.broadcastTimeout or disable broadcast join by > setting spark.sql.autoBroadcastJoinThreshold to -1 > {code} > > This is usually happens when broadcast join(with or without hint) after a > long running shuffle (more than 5 minutes). By disable AQE, the issues > disappear. > The workaround is to increase spark.sql.broadcastTimeout and it works. But > because the data to broadcast is very small, that doesn't make sense. > After investigation, the root cause should be like this: when enable AQE, in > getFinalPhysicalPlan, spark traversal the physical plan bottom up and create > query stage for materialized part by createQueryStages and materialize those > new created query stages to submit map stages or broadcasting. When > ShuffleQueryStage are materializing before BroadcastQueryStage, the map job > and broadcast job are submitted almost at the same time, but map job will > hold all the computing resources. If the map job runs slow (when lots of data > needs to process and the resource is limited), the broadcast job cannot be > started(and finished) before spark.sql.broadcastTimeout, thus cause whole job > failed (introduced in SPARK-31475). 
> Code to reproduce: > > {code:java} > import java.util.UUID > import scala.util.Random > import org.apache.spark.sql.functions._ > import org.apache.spark.sql.SparkSession > val spark = SparkSession.builder() > .master("local[2]") > .appName("Test Broadcast").getOrCreate() > import spark.implicits._ > spark.conf.set("spark.sql.adaptive.enabled", "true") > val sc = spark.sparkContext > sc.setLogLevel("INFO") > val uuid = UUID.randomUUID > val df = sc.parallelize(Range(0, 1), 1).flatMap(x => { > for (i <- Range(0, 1 + Random.nextInt(1))) > yield (x % 26, x, Random.nextInt(10), UUID.randomUUID.toString) > }).toDF("index", "part", "pv", "uuid") > .withColumn("md5", md5($"uuid")) > val dim_data = Range(0, 26).map(x => (('a' + x).toChar.toString, x)) > val dim = dim_data.toDF("name", "index") > val result = df.groupBy("index") > .agg(sum($"pv").alias("pv"), countDistinct("uuid").alias("uv")) > .join(dim, Seq("index")) > .collect(){code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33933) Broadcast timeout happened unexpectedly in AQE
[ https://issues.apache.org/jira/browse/SPARK-33933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33933: Assignee: (was: Apache Spark) > Broadcast timeout happened unexpectedly in AQE > --- > > Key: SPARK-33933 > URL: https://issues.apache.org/jira/browse/SPARK-33933 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1 >Reporter: Yu Zhong >Priority: Major > > In Spark 3.0, when AQE is enabled, there is often broadcast timeout in normal > queries as below. > > {code:java} > Could not execute broadcast in 300 secs. You can increase the timeout for > broadcasts via spark.sql.broadcastTimeout or disable broadcast join by > setting spark.sql.autoBroadcastJoinThreshold to -1 > {code} > > This is usually happens when broadcast join(with or without hint) after a > long running shuffle (more than 5 minutes). By disable AQE, the issues > disappear. > The workaround is to increase spark.sql.broadcastTimeout and it works. But > because the data to broadcast is very small, that doesn't make sense. > After investigation, the root cause should be like this: when enable AQE, in > getFinalPhysicalPlan, spark traversal the physical plan bottom up and create > query stage for materialized part by createQueryStages and materialize those > new created query stages to submit map stages or broadcasting. When > ShuffleQueryStage are materializing before BroadcastQueryStage, the map job > and broadcast job are submitted almost at the same time, but map job will > hold all the computing resources. If the map job runs slow (when lots of data > needs to process and the resource is limited), the broadcast job cannot be > started(and finished) before spark.sql.broadcastTimeout, thus cause whole job > failed (introduced in SPARK-31475). 
> Code to reproduce:
> {code:java}
> import java.util.UUID
> import scala.util.Random
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.SparkSession
>
> val spark = SparkSession.builder()
>   .master("local[2]")
>   .appName("Test Broadcast").getOrCreate()
> import spark.implicits._
> spark.conf.set("spark.sql.adaptive.enabled", "true")
> val sc = spark.sparkContext
> sc.setLogLevel("INFO")
> val uuid = UUID.randomUUID
> val df = sc.parallelize(Range(0, 1), 1).flatMap(x => {
>   for (i <- Range(0, 1 + Random.nextInt(1)))
>     yield (x % 26, x, Random.nextInt(10), UUID.randomUUID.toString)
> }).toDF("index", "part", "pv", "uuid")
>   .withColumn("md5", md5($"uuid"))
> val dim_data = Range(0, 26).map(x => (('a' + x).toChar.toString, x))
> val dim = dim_data.toDF("name", "index")
> val result = df.groupBy("index")
>   .agg(sum($"pv").alias("pv"), countDistinct("uuid").alias("uv"))
>   .join(dim, Seq("index"))
>   .collect()
> {code}
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
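The error message quoted in the report names two configuration knobs. As a minimal sketch of the workaround the reporter describes (the session setup and the timeout value here are illustrative assumptions, not part of the issue):

```scala
import org.apache.spark.sql.SparkSession

// Illustrative session; any existing SparkSession works the same way.
val spark = SparkSession.builder()
  .master("local[2]")
  .appName("Broadcast timeout workaround")
  .getOrCreate()

// Option 1: raise the broadcast timeout above the default 300s
// (the value 1200 is an arbitrary example).
spark.conf.set("spark.sql.broadcastTimeout", "1200")

// Option 2: disable broadcast joins entirely, so the planner falls
// back to a sort-merge join instead of broadcasting.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
```

Both settings come straight from the quoted error message; as the report notes, raising the timeout does work, but it is unsatisfying given how small the broadcast side actually is.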
[jira] [Assigned] (SPARK-33933) Broadcast timeout happened unexpectedly in AQE
[ https://issues.apache.org/jira/browse/SPARK-33933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33933: Assignee: Apache Spark > Broadcast timeout happened unexpectedly in AQE > --- > > Key: SPARK-33933 > URL: https://issues.apache.org/jira/browse/SPARK-33933 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1 >Reporter: Yu Zhong >Assignee: Apache Spark >Priority: Major > > In Spark 3.0, when AQE is enabled, broadcast timeouts often occur in otherwise normal queries, as below. > > {code:java} > Could not execute broadcast in 300 secs. You can increase the timeout for > broadcasts via spark.sql.broadcastTimeout or disable broadcast join by > setting spark.sql.autoBroadcastJoinThreshold to -1 > {code} > > This usually happens when a broadcast join (with or without a hint) follows a > long-running shuffle (more than 5 minutes). Disabling AQE makes the issue > disappear. > The workaround is to increase spark.sql.broadcastTimeout, and it works, but > since the data to broadcast is very small, that doesn't make sense. > After investigation, the root cause appears to be this: when AQE is enabled, in > getFinalPhysicalPlan, Spark traverses the physical plan bottom-up, creates > query stages for the materializable parts via createQueryStages, and > materializes those newly created query stages to submit map stages or > broadcasts. When a ShuffleQueryStage is materialized before a > BroadcastQueryStage, the map job and the broadcast job are submitted almost at > the same time, but the map job holds all the computing resources. If the map > job runs slowly (when there is a lot of data to process and resources are > limited), the broadcast job cannot start (and finish) before > spark.sql.broadcastTimeout, which causes the whole job to fail (introduced in > SPARK-31475).
> Code to reproduce:
> {code:java}
> import java.util.UUID
> import scala.util.Random
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.SparkSession
>
> val spark = SparkSession.builder()
>   .master("local[2]")
>   .appName("Test Broadcast").getOrCreate()
> import spark.implicits._
> spark.conf.set("spark.sql.adaptive.enabled", "true")
> val sc = spark.sparkContext
> sc.setLogLevel("INFO")
> val uuid = UUID.randomUUID
> val df = sc.parallelize(Range(0, 1), 1).flatMap(x => {
>   for (i <- Range(0, 1 + Random.nextInt(1)))
>     yield (x % 26, x, Random.nextInt(10), UUID.randomUUID.toString)
> }).toDF("index", "part", "pv", "uuid")
>   .withColumn("md5", md5($"uuid"))
> val dim_data = Range(0, 26).map(x => (('a' + x).toChar.toString, x))
> val dim = dim_data.toDF("name", "index")
> val result = df.groupBy("index")
>   .agg(sum($"pv").alias("pv"), countDistinct("uuid").alias("uv"))
>   .join(dim, Seq("index"))
>   .collect()
> {code}
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31936) Implement ScriptTransform in sql/core
[ https://issues.apache.org/jira/browse/SPARK-31936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-31936: -- Affects Version/s: 3.2.0 3.1.0 > Implement ScriptTransform in sql/core > - > > Key: SPARK-31936 > URL: https://issues.apache.org/jira/browse/SPARK-31936 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0, 3.1.0, 3.2.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > > ScriptTransformation currently relies on Hive internals. It'd be great if we > could implement a native ScriptTransformation in the sql/core module and remove > the extra Hive dependency. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33933) Broadcast timeout happened unexpectedly in AQE
[ https://issues.apache.org/jira/browse/SPARK-33933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yu Zhong updated SPARK-33933: - Description: In Spark 3.0, when AQE is enabled, broadcast timeouts often occur in otherwise normal queries, as below. {code:java} Could not execute broadcast in 300 secs. You can increase the timeout for broadcasts via spark.sql.broadcastTimeout or disable broadcast join by setting spark.sql.autoBroadcastJoinThreshold to -1 {code} This usually happens when a broadcast join (with or without a hint) follows a long-running shuffle (more than 5 minutes). Disabling AQE makes the issue disappear. The workaround is to increase spark.sql.broadcastTimeout, and it works, but since the data to broadcast is very small, that doesn't make sense. After investigation, the root cause appears to be this: when AQE is enabled, in getFinalPhysicalPlan, Spark traverses the physical plan bottom-up, creates query stages for the materializable parts via createQueryStages, and materializes those newly created query stages to submit map stages or broadcasts. When a ShuffleQueryStage is materialized before a BroadcastQueryStage, the map job and the broadcast job are submitted almost at the same time, but the map job holds all the computing resources. If the map job runs slowly (when there is a lot of data to process and resources are limited), the broadcast job cannot start (and finish) before spark.sql.broadcastTimeout, which causes the whole job to fail (introduced in SPARK-31475).
Code to reproduce:
{code:java}
import java.util.UUID
import scala.util.Random
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[2]")
  .appName("Test Broadcast").getOrCreate()
import spark.implicits._
spark.conf.set("spark.sql.adaptive.enabled", "true")
val sc = spark.sparkContext
sc.setLogLevel("INFO")
val uuid = UUID.randomUUID
val df = sc.parallelize(Range(0, 1), 1).flatMap(x => {
  for (i <- Range(0, 1 + Random.nextInt(1)))
    yield (x % 26, x, Random.nextInt(10), UUID.randomUUID.toString)
}).toDF("index", "part", "pv", "uuid")
  .withColumn("md5", md5($"uuid"))
val dim_data = Range(0, 26).map(x => (('a' + x).toChar.toString, x))
val dim = dim_data.toDF("name", "index")
val result = df.groupBy("index")
  .agg(sum($"pv").alias("pv"), countDistinct("uuid").alias("uv"))
  .join(dim, Seq("index"))
  .collect()
{code}
was: In Spark 3.0, when AQE is enabled, broadcast timeouts often occur in otherwise normal queries, as below. {code:java} Could not execute broadcast in 300 secs. You can increase the timeout for broadcasts via spark.sql.broadcastTimeout or disable broadcast join by setting spark.sql.autoBroadcastJoinThreshold to -1 {code} This usually happens when a broadcast join (with or without a hint) follows a long-running shuffle (more than 5 minutes). Disabling AQE makes the issue disappear. The workaround is to increase spark.sql.broadcastTimeout, and it works, but since the data to broadcast is very small, that doesn't make sense. After investigation, the root cause appears to be this: when AQE is enabled, in getFinalPhysicalPlan, Spark traverses the physical plan bottom-up, creates query stages for the materializable parts via createQueryStages, and materializes those newly created query stages to submit map stages or broadcasts. When a ShuffleQueryStage is materialized before a BroadcastQueryStage, the map job and the broadcast job are submitted almost at the same time, but the map job holds all the computing resources. If the map job runs slowly (when there is a lot of data to process and resources are limited), the broadcast job cannot start (and finish) before spark.sql.broadcastTimeout, which causes the whole job to fail (introduced in SPARK-31475).
Code to reproduce:
{code:java}
import java.util.UUID
import scala.util.Random
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[2]")
  .appName("Test Broadcast").getOrCreate()
import spark.implicits._
val sc = spark.sparkContext
sc.setLogLevel("INFO")
val uuid = UUID.randomUUID
val df = sc.parallelize(Range(0, 1), 1).flatMap(x => {
  for (i <- Range(0, 1 + Random.nextInt(1)))
    yield (x % 26, x, Random.nextInt(10), UUID.randomUUID.toString)
}).toDF("index", "part", "pv", "uuid")
  .withColumn("md5", md5($"uuid"))
val dim_data = Range(0, 26).map(x => (('a' + x).toChar.toString, x))
val dim = dim_data.toDF("name", "index")
val result = df.groupBy("index")
  .agg(sum($"pv").alias("pv"), countDistinct("uuid").alias("uv"))
  .join(dim, Seq("index"))
  .collect()
{code}
> Broadcast timeout happened unexpectedly in AQE > --- > > Key: SPARK-33933 > URL: https://issues.apache.org/jira/browse/SPARK-33933 > Project: Spark >
[jira] [Created] (SPARK-33933) Broadcast timeout happened unexpectedly in AQE
Yu Zhong created SPARK-33933: Summary: Broadcast timeout happened unexpectedly in AQE Key: SPARK-33933 URL: https://issues.apache.org/jira/browse/SPARK-33933 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.1, 3.0.0 Reporter: Yu Zhong In Spark 3.0, when AQE is enabled, broadcast timeouts often occur in otherwise normal queries, as below. {code:java} Could not execute broadcast in 300 secs. You can increase the timeout for broadcasts via spark.sql.broadcastTimeout or disable broadcast join by setting spark.sql.autoBroadcastJoinThreshold to -1 {code} This usually happens when a broadcast join (with or without a hint) follows a long-running shuffle (more than 5 minutes). Disabling AQE makes the issue disappear. The workaround is to increase spark.sql.broadcastTimeout, and it works, but since the data to broadcast is very small, that doesn't make sense. After investigation, the root cause appears to be this: when AQE is enabled, in getFinalPhysicalPlan, Spark traverses the physical plan bottom-up, creates query stages for the materializable parts via createQueryStages, and materializes those newly created query stages to submit map stages or broadcasts. When a ShuffleQueryStage is materialized before a BroadcastQueryStage, the map job and the broadcast job are submitted almost at the same time, but the map job holds all the computing resources. If the map job runs slowly (when there is a lot of data to process and resources are limited), the broadcast job cannot start (and finish) before spark.sql.broadcastTimeout, which causes the whole job to fail (introduced in SPARK-31475).
Code to reproduce:
{code:java}
import java.util.UUID
import scala.util.Random
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[2]")
  .appName("Test Broadcast").getOrCreate()
import spark.implicits._
val sc = spark.sparkContext
sc.setLogLevel("INFO")
val uuid = UUID.randomUUID
val df = sc.parallelize(Range(0, 1), 1).flatMap(x => {
  for (i <- Range(0, 1 + Random.nextInt(1)))
    yield (x % 26, x, Random.nextInt(10), UUID.randomUUID.toString)
}).toDF("index", "part", "pv", "uuid")
  .withColumn("md5", md5($"uuid"))
val dim_data = Range(0, 26).map(x => (('a' + x).toChar.toString, x))
val dim = dim_data.toDF("name", "index")
val result = df.groupBy("index")
  .agg(sum($"pv").alias("pv"), countDistinct("uuid").alias("uv"))
  .join(dim, Seq("index"))
  .collect()
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
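The report states that the timeout does not occur with AQE disabled, so the other mitigation it implies can be sketched as follows (a hedged sketch, not a verified fix: it assumes the `spark`, `df`, and `dim` values from the reproduction snippet above, and the config name comes from that snippet):

```scala
// Per the reporter's observation, the problem disappears when AQE is off,
// so sessions hitting this timeout can disable it before running the query.
spark.conf.set("spark.sql.adaptive.enabled", "false")

// Re-run the aggregation plus broadcast-join query from the reproduction
// with AQE disabled; the broadcast is no longer raced against the map job.
val result = df.groupBy("index")
  .agg(sum($"pv").alias("pv"), countDistinct("uuid").alias("uv"))
  .join(dim, Seq("index"))
  .collect()
```

This trades away AQE's runtime re-optimization for the whole session, so it is a stopgap rather than a substitute for the scheduling fix discussed in the issue.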
[jira] [Assigned] (SPARK-33932) Clean up KafkaOffsetReader API document
[ https://issues.apache.org/jira/browse/SPARK-33932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33932: Assignee: Apache Spark (was: L. C. Hsieh) > Clean up KafkaOffsetReader API document > --- > > Key: SPARK-33932 > URL: https://issues.apache.org/jira/browse/SPARK-33932 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: L. C. Hsieh >Assignee: Apache Spark >Priority: Minor > > The KafkaOffsetReader API documentation is duplicated between > KafkaOffsetReaderConsumer and KafkaOffsetReaderAdmin. It would be good to > centralize it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33932) Clean up KafkaOffsetReader API document
[ https://issues.apache.org/jira/browse/SPARK-33932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33932: Assignee: L. C. Hsieh (was: Apache Spark) > Clean up KafkaOffsetReader API document > --- > > Key: SPARK-33932 > URL: https://issues.apache.org/jira/browse/SPARK-33932 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Minor > > The KafkaOffsetReader API documentation is duplicated between > KafkaOffsetReaderConsumer and KafkaOffsetReaderAdmin. It would be good to > centralize it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33932) Clean up KafkaOffsetReader API document
[ https://issues.apache.org/jira/browse/SPARK-33932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17255862#comment-17255862 ] Apache Spark commented on SPARK-33932: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/30961 > Clean up KafkaOffsetReader API document > --- > > Key: SPARK-33932 > URL: https://issues.apache.org/jira/browse/SPARK-33932 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Minor > > The KafkaOffsetReader API documentation is duplicated between > KafkaOffsetReaderConsumer and KafkaOffsetReaderAdmin. It would be good to > centralize it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33932) Clean up KafkaOffsetReader API document
L. C. Hsieh created SPARK-33932: --- Summary: Clean up KafkaOffsetReader API document Key: SPARK-33932 URL: https://issues.apache.org/jira/browse/SPARK-33932 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 3.2.0 Reporter: L. C. Hsieh Assignee: L. C. Hsieh The KafkaOffsetReader API documentation is duplicated between KafkaOffsetReaderConsumer and KafkaOffsetReaderAdmin. It would be good to centralize it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org