[jira] [Assigned] (SPARK-35104) Fix ugly indentation of multiple JSON records in a single split file generated by JacksonGenerator when pretty option is true
[ https://issues.apache.org/jira/browse/SPARK-35104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35104: Assignee: Kousuke Saruta (was: Apache Spark) > Fix ugly indentation of multiple JSON records in a single split file > generated by JacksonGenerator when pretty option is true > - > > Key: SPARK-35104 > URL: https://issues.apache.org/jira/browse/SPARK-35104 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2, 3.1.1, 3.2.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > > When writing multiple JSON records into a single split file with pretty > option true, indentation will be broken except for the first JSON record. > {code:java} > // Run in the Spark Shell. > // Set spark.sql.leafNodeDefaultParallelism to 1 for the current master. > // Or set spark.default.parallelism for the previous releases. > spark.conf.set("spark.sql.leafNodeDefaultParallelism", 1) > val df = Seq("a", "b", "c").toDF > df.write.option("pretty", "true").json("/path/to/output") > # Run in a Shell > $ cat /path/to/output/*.json > { > "value" : "a" > } > { > "value" : "b" > } > { > "value" : "c" > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
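For context, correct output requires each top-level record to be pretty-printed independently, with records separated at the root so indentation state never leaks from one record into the next. A minimal Python sketch of that invariant (illustrative only; Spark's JacksonGenerator is Scala code built on Jackson, and this is not its implementation):

```python
import json

def write_pretty_records(records):
    """Pretty-print each record independently and separate root-level
    records with a newline, so indentation never leaks between records."""
    return "\n".join(json.dumps(r, indent=2) for r in records) + "\n"

out = write_pretty_records([{"value": v} for v in ("a", "b", "c")])
print(out)
```

The essence is that every record starts from indentation level zero; here that falls out of serializing each record separately before concatenation.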
[jira] [Commented] (SPARK-35104) Fix ugly indentation of multiple JSON records in a single split file generated by JacksonGenerator when pretty option is true
[ https://issues.apache.org/jira/browse/SPARK-35104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322647#comment-17322647 ] Apache Spark commented on SPARK-35104: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/32203
[jira] [Assigned] (SPARK-35104) Fix ugly indentation of multiple JSON records in a single split file generated by JacksonGenerator when pretty option is true
[ https://issues.apache.org/jira/browse/SPARK-35104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35104: Assignee: Apache Spark (was: Kousuke Saruta)
[jira] [Updated] (SPARK-35104) Fix ugly indentation of multiple JSON records in a single split file generated by JacksonGenerator when pretty option is true
[ https://issues.apache.org/jira/browse/SPARK-35104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-35104: --- Description: When writing multiple JSON records into a single split file with pretty option true, indentation will be broken except for the first JSON record.
{code:java}
// Run in the Spark Shell.
// Set spark.sql.leafNodeDefaultParallelism to 1 for the current master.
// Or set spark.default.parallelism for the previous releases.
spark.conf.set("spark.sql.leafNodeDefaultParallelism", 1)
val df = Seq("a", "b", "c").toDF
df.write.option("pretty", "true").json("/path/to/output")

# Run in a Shell
$ cat /path/to/output/*.json
{
"value" : "a"
}
{
"value" : "b"
}
{
"value" : "c"
}
{code}

was:
When writing multiple JSON records into a single split file with pretty option true, indentation will be broken except for the first JSON record.
{code}
// Run in the Spark Shell.
// Set spark.sql.leafNodeDefaultParallelism to 1 for the current master.
// Or set spark.default.parallelism for the previous releases.
spark.conf.set("spark.sql.leafNodeDefaultParallelism", 1)
val df = Seq("a", "b", "c").toDF
df.write.option("pretty", "true").json("/path/to/output")

# Run in the Shell
$ cat /path/to/output/*.json
{
"value" : "a"
}
{
"value" : "b"
}
{
"value" : "c"
}
{code}
[jira] [Created] (SPARK-35104) Fix ugly indentation of multiple JSON records in a single split file generated by JacksonGenerator when pretty option is true
Kousuke Saruta created SPARK-35104: -- Summary: Fix ugly indentation of multiple JSON records in a single split file generated by JacksonGenerator when pretty option is true Key: SPARK-35104 URL: https://issues.apache.org/jira/browse/SPARK-35104 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.1.1, 3.0.2, 3.2.0 Reporter: Kousuke Saruta Assignee: Kousuke Saruta

When writing multiple JSON records into a single split file with pretty option true, indentation will be broken except for the first JSON record.
{code}
// Run in the Spark Shell.
// Set spark.sql.leafNodeDefaultParallelism to 1 for the current master.
// Or set spark.default.parallelism for the previous releases.
spark.conf.set("spark.sql.leafNodeDefaultParallelism", 1)
val df = Seq("a", "b", "c").toDF
df.write.option("pretty", "true").json("/path/to/output")

# Run in the Shell
$ cat /path/to/output/*.json
{
"value" : "a"
}
{
"value" : "b"
}
{
"value" : "c"
}
{code}
[jira] [Commented] (SPARK-28098) Native ORC reader doesn't support subdirectories with Hive tables
[ https://issues.apache.org/jira/browse/SPARK-28098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322627#comment-17322627 ] Apache Spark commented on SPARK-28098: -- User 'FatalLin' has created a pull request for this issue: https://github.com/apache/spark/pull/32202
> Native ORC reader doesn't support subdirectories with Hive tables
> -----------------------------------------------------------------
>
> Key: SPARK-28098
> URL: https://issues.apache.org/jira/browse/SPARK-28098
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.1.0
> Reporter: Douglas Drinka
> Priority: Major
>
> The Hive ORC reader supports recursive directory reads from S3. Spark's native ORC reader supports recursive directory reads, but not when used with Hive.
>
> {code:java}
> val testData = List(1,2,3,4,5)
> val dataFrame = testData.toDF()
> dataFrame
>   .coalesce(1)
>   .write
>   .mode(SaveMode.Overwrite)
>   .format("orc")
>   .option("compression", "zlib")
>   .save("s3://ddrinka.sparkbug/dirTest/dir1/dir2/")
> spark.sql("DROP TABLE IF EXISTS ddrinka_sparkbug.dirTest")
> spark.sql("CREATE EXTERNAL TABLE ddrinka_sparkbug.dirTest (val INT) STORED AS ORC LOCATION 's3://ddrinka.sparkbug/dirTest/'")
> spark.conf.set("hive.mapred.supports.subdirectories","true")
> spark.conf.set("mapred.input.dir.recursive","true")
> spark.conf.set("mapreduce.input.fileinputformat.input.dir.recursive","true")
> spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")
> println(spark.sql("SELECT * FROM ddrinka_sparkbug.dirTest").count)
> //0
> spark.conf.set("spark.sql.hive.convertMetastoreOrc", "false")
> println(spark.sql("SELECT * FROM ddrinka_sparkbug.dirTest").count)
> //5
> {code}
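The 0-versus-5 counts above come down to whether file listing recurses below the table location: the data file lives two levels down, at dirTest/dir1/dir2/, while the table points at dirTest/. A small stdlib-Python illustration of the difference between top-level-only and recursive listing (a conceptual sketch, not Spark's actual file index):

```python
import os
import tempfile

# Recreate the layout from the report: the table location itself holds no
# files; the data file sits two directories down (dirTest/dir1/dir2/).
root = tempfile.mkdtemp()
nested = os.path.join(root, "dir1", "dir2")
os.makedirs(nested)
open(os.path.join(nested, "part-00000.orc"), "w").close()

def list_non_recursive(path):
    """What a reader scanning only the top level sees: zero data files."""
    return [e.name for e in os.scandir(path) if e.is_file()]

def list_recursive(path):
    """What a recursive reader sees: the file in the nested subdirectory."""
    return [os.path.join(d, f) for d, _, files in os.walk(path) for f in files]

print(len(list_non_recursive(root)), len(list_recursive(root)))  # 0 1
```

A reader that lists only the table's top-level directory finds nothing to scan, which matches the count of 0 when the native reader is used.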
[jira] [Assigned] (SPARK-28098) Native ORC reader doesn't support subdirectories with Hive tables
[ https://issues.apache.org/jira/browse/SPARK-28098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28098: Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-28098) Native ORC reader doesn't support subdirectories with Hive tables
[ https://issues.apache.org/jira/browse/SPARK-28098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28098: Assignee: Apache Spark
[jira] [Commented] (SPARK-28098) Native ORC reader doesn't support subdirectories with Hive tables
[ https://issues.apache.org/jira/browse/SPARK-28098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322626#comment-17322626 ] Apache Spark commented on SPARK-28098: -- User 'FatalLin' has created a pull request for this issue: https://github.com/apache/spark/pull/32202
[jira] [Created] (SPARK-35103) Improve type coercion rule performance
Yingyi Bu created SPARK-35103: - Summary: Improve type coercion rule performance Key: SPARK-35103 URL: https://issues.apache.org/jira/browse/SPARK-35103 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Yingyi Bu

Reduce the time spent on type coercion rules by running them together, one tree node at a time, in a combined rule.
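The idea can be sketched with hypothetical toy rules (Spark's real coercion rules operate on Catalyst expression trees): a single post-order traversal applies every rule at each node, instead of running one full tree traversal per rule.

```python
def promote_int(node):
    """Toy 'coercion' rule: widen bare ints to floats."""
    if isinstance(node, int) and not isinstance(node, bool):
        return float(node)
    return node

def fold_add(node):
    """Toy rule: fold ('+', a, b) once both children are floats."""
    if isinstance(node, tuple) and node[0] == "+" and all(isinstance(c, float) for c in node[1:]):
        return node[1] + node[2]
    return node

def apply_rules_combined(node, rules):
    """One post-order traversal; every rule fires at each visited node,
    so the tree is walked once rather than once per rule."""
    if isinstance(node, tuple):  # inner node: (op, child, child)
        node = (node[0],) + tuple(apply_rules_combined(c, rules) for c in node[1:])
    for rule in rules:
        node = rule(node)
    return node

print(apply_rules_combined(("+", 1, ("+", 2, 3)), [promote_int, fold_add]))  # 6.0
```

With N rules and a tree of M nodes, the combined pass visits M nodes once instead of N times, which is the performance win the issue describes.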
[jira] [Resolved] (SPARK-33731) Standardize exception types
[ https://issues.apache.org/jira/browse/SPARK-33731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-33731. -- Resolution: Duplicate > Standardize exception types > --- > > Key: SPARK-33731 > URL: https://issues.apache.org/jira/browse/SPARK-33731 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > > We should: > - have a better hierarchy for exception types > - or at least use the default type of exceptions correctly instead of just > throwing a plain Exception.
[jira] [Updated] (SPARK-32194) Standardize exceptions in PySpark
[ https://issues.apache.org/jira/browse/SPARK-32194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-32194: - Parent: SPARK-32195 Issue Type: Sub-task (was: Improvement) > Standardize exceptions in PySpark > - > > Key: SPARK-32194 > URL: https://issues.apache.org/jira/browse/SPARK-32194 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > > Currently, PySpark throws {{Exception}} or just {{RuntimeException}} in many > cases. We should standardize them.
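What "standardize" could look like in practice — a hedged sketch with hypothetical class names (not PySpark's actual error classes): a small hierarchy with a common base lets callers catch broadly or narrowly, instead of the library raising bare Exception or RuntimeException.

```python
class SparkClientError(Exception):
    """Hypothetical base class for all client-facing errors."""

class AnalysisError(SparkClientError):
    """Raised when a query fails analysis (e.g. an unresolved column)."""

class IllegalArgumentError(SparkClientError, ValueError):
    """Raised for invalid arguments; also catchable as a plain ValueError."""

def resolve_column(schema, name):
    # Raise a specific, documented type rather than a bare Exception.
    if name not in schema:
        raise AnalysisError(f"cannot resolve column {name!r}")
    return schema[name]

try:
    resolve_column({"id": "int"}, "value")
except SparkClientError as e:  # broad catch via the shared base class
    print(type(e).__name__, e)
```

Callers who only care that "Spark raised something" catch the base class; callers handling a specific failure mode catch the subclass, which is exactly what a flat `raise Exception(...)` style makes impossible.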
[jira] [Commented] (SPARK-35026) Support use CUBE/ROLLUP in GROUPING SETS
[ https://issues.apache.org/jira/browse/SPARK-35026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322621#comment-17322621 ] Apache Spark commented on SPARK-35026: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/32201 > Support use CUBE/ROLLUP in GROUPING SETS > > > Key: SPARK-35026 > URL: https://issues.apache.org/jira/browse/SPARK-35026 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Priority: Major > Attachments: screenshot-1.png > > > !screenshot-1.png|width=924,height=463!
[jira] [Assigned] (SPARK-35026) Support use CUBE/ROLLUP in GROUPING SETS
[ https://issues.apache.org/jira/browse/SPARK-35026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35026: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-35026) Support use CUBE/ROLLUP in GROUPING SETS
[ https://issues.apache.org/jira/browse/SPARK-35026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322620#comment-17322620 ] Apache Spark commented on SPARK-35026: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/32201
[jira] [Assigned] (SPARK-35026) Support use CUBE/ROLLUP in GROUPING SETS
[ https://issues.apache.org/jira/browse/SPARK-35026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35026: Assignee: Apache Spark
[jira] [Resolved] (SPARK-35099) Convert ANSI interval literals to SQL string
[ https://issues.apache.org/jira/browse/SPARK-35099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-35099. -- Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 32196 [https://github.com/apache/spark/pull/32196] > Convert ANSI interval literals to SQL string > > > Key: SPARK-35099 > URL: https://issues.apache.org/jira/browse/SPARK-35099 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.2.0 > > > Implement the sql() and toString() methods of Literal for ANSI interval > types: DayTimeIntervalType and YearMonthIntervalType.
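As a rough illustration of what rendering an ANSI day-time interval to SQL text involves (a hypothetical helper with assumed formatting; Spark's actual Literal.sql handles more cases, such as other start/end fields and different fractional-second rules):

```python
def day_time_interval_sql(micros: int) -> str:
    """Render a day-time interval, stored as a microsecond count, as an
    ANSI-style SQL literal string. Illustrative formatting only."""
    sign = "-" if micros < 0 else ""
    rest = abs(micros)
    days, rest = divmod(rest, 86_400_000_000)    # microseconds per day
    hours, rest = divmod(rest, 3_600_000_000)    # microseconds per hour
    minutes, rest = divmod(rest, 60_000_000)     # microseconds per minute
    seconds = rest / 1_000_000
    return f"INTERVAL '{sign}{days} {hours:02d}:{minutes:02d}:{seconds:09.6f}' DAY TO SECOND"

print(day_time_interval_sql(93_784_000_000))  # INTERVAL '1 02:03:04.000000' DAY TO SECOND
```

The point of the task is that such literals need a stable textual form so that plans and error messages can print them; the exact field widths and truncation rules are defined by the SQL standard's interval syntax.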
[jira] [Commented] (SPARK-35102) Make spark.sql.hive.version meaningful and not deprecated
[ https://issues.apache.org/jira/browse/SPARK-35102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322596#comment-17322596 ] Apache Spark commented on SPARK-35102: -- User 'yaooqinn' has created a pull request for this issue: https://github.com/apache/spark/pull/32200
[jira] [Commented] (SPARK-35102) Make spark.sql.hive.version meaningful and not deprecated
[ https://issues.apache.org/jira/browse/SPARK-35102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322595#comment-17322595 ] Apache Spark commented on SPARK-35102: -- User 'yaooqinn' has created a pull request for this issue: https://github.com/apache/spark/pull/32200
[jira] [Assigned] (SPARK-35102) Make spark.sql.hive.version meaningful and not deprecated
[ https://issues.apache.org/jira/browse/SPARK-35102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35102: Assignee: Apache Spark
[jira] [Assigned] (SPARK-35102) Make spark.sql.hive.version meaningful and not deprecated
[ https://issues.apache.org/jira/browse/SPARK-35102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35102: Assignee: (was: Apache Spark)
[jira] [Created] (SPARK-35102) Make spark.sql.hive.version meaningful and not deprecated
Kent Yao created SPARK-35102: Summary: Make spark.sql.hive.version meaningful and not deprecated Key: SPARK-35102 URL: https://issues.apache.org/jira/browse/SPARK-35102 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: Kent Yao

First, let's take a look at the definition and its comment.

{code:java}
// A fake config which is only here for backward compatibility reasons. This config has no effect
// to Spark, just for reporting the builtin Hive version of Spark to existing applications that
// already rely on this config.
val FAKE_HIVE_VERSION = buildConf("spark.sql.hive.version")
  .doc(s"deprecated, please use ${HIVE_METASTORE_VERSION.key} to get the Hive version in Spark.")
  .version("1.1.1")
  .fallbackConf(HIVE_METASTORE_VERSION)
{code}

It is used for reporting the built-in Hive version, but the current status is unsatisfactory: the value can be changed in many ways, e.g. via --conf or SET syntax.

It has been marked as deprecated but kept around until now. It is probably hard for us to remove, and removal is not even necessary.

On second thought, it is actually good for us to keep it working alongside `spark.sql.hive.metastore.version`: even when `spark.sql.hive.metastore.version` is changed, this config can still be used to report the compiled Hive version statically, which is useful when an error occurs in that case. So this parameter should be fixed to the compiled Hive version.
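The `.fallbackConf` behavior described above can be sketched as follows (a hypothetical miniature config API, not Spark's actual ConfigBuilder): a fallback key reads the other key's value unless it has been explicitly set itself.

```python
class Conf:
    """Tiny config store where a key may fall back to another key."""

    def __init__(self):
        self._settings = {}
        self._fallbacks = {}  # key -> key it falls back to

    def register_fallback(self, key, falls_back_to):
        self._fallbacks[key] = falls_back_to

    def set(self, key, value):
        self._settings[key] = value

    def get(self, key, default=None):
        # An explicit setting always wins; otherwise follow the fallback.
        if key in self._settings:
            return self._settings[key]
        if key in self._fallbacks:
            return self.get(self._fallbacks[key], default)
        return default

conf = Conf()
conf.register_fallback("spark.sql.hive.version", "spark.sql.hive.metastore.version")
conf.set("spark.sql.hive.metastore.version", "2.3.7")
print(conf.get("spark.sql.hive.version"))  # 2.3.7, via the fallback
conf.set("spark.sql.hive.version", "1.2.1")  # an explicit set overrides it
print(conf.get("spark.sql.hive.version"))  # 1.2.1
```

This is exactly why the issue calls the current status unsatisfactory: because the key is a fallback rather than a fixed constant, `--conf` or `SET` can change what it reports, so it cannot be trusted to name the compiled Hive version.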
[jira] [Commented] (SPARK-35093) [SQL] AQE columnar mismatch on exchange reuse
[ https://issues.apache.org/jira/browse/SPARK-35093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322568#comment-17322568 ] Dongjoon Hyun commented on SPARK-35093: --- Thank you for reporting and fixing this, [~andygrove]. I collect this into a subtask of SPARK-33828 to give more visibility. > [SQL] AQE columnar mismatch on exchange reuse > - > > Key: SPARK-35093 > URL: https://issues.apache.org/jira/browse/SPARK-35093 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.2, 3.1.1 >Reporter: Andy Grove >Priority: Major > > With AQE enabled, AdaptiveSparkPlanExec will attempt to reuse exchanges that > are semantically equal. > This is done by comparing the canonicalized plan for two Exchange nodes to > see if they are the same. > Unfortunately this does not take into account the fact that two exchanges > with the same canonical plan might be replaced by a plugin in a way that > makes them not compatible. For example, a plugin could create one version > with supportsColumnar=true and another with supportsColumnar=false. It is not > valid to re-use exchanges if there is a supportsColumnar mismatch. > I have tested a fix for this and will put up a PR once I figure out how to > write the tests. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35093) AQE columnar mismatch on exchange reuse
[ https://issues.apache.org/jira/browse/SPARK-35093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-35093: -- Summary: AQE columnar mismatch on exchange reuse (was: [SQL] AQE columnar mismatch on exchange reuse) > AQE columnar mismatch on exchange reuse > --- > > Key: SPARK-35093 > URL: https://issues.apache.org/jira/browse/SPARK-35093 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.2, 3.1.1 >Reporter: Andy Grove >Priority: Major > > With AQE enabled, AdaptiveSparkPlanExec will attempt to reuse exchanges that > are semantically equal. > This is done by comparing the canonicalized plan for two Exchange nodes to > see if they are the same. > Unfortunately this does not take into account the fact that two exchanges > with the same canonical plan might be replaced by a plugin in a way that > makes them not compatible. For example, a plugin could create one version > with supportsColumnar=true and another with supportsColumnar=false. It is not > valid to re-use exchanges if there is a supportsColumnar mismatch. > I have tested a fix for this and will put up a PR once I figure out how to > write the tests. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35093) [SQL] AQE columnar mismatch on exchange reuse
[ https://issues.apache.org/jira/browse/SPARK-35093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-35093: -- Parent: SPARK-33828 Issue Type: Sub-task (was: Bug) > [SQL] AQE columnar mismatch on exchange reuse > - > > Key: SPARK-35093 > URL: https://issues.apache.org/jira/browse/SPARK-35093 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.2, 3.1.1 >Reporter: Andy Grove >Priority: Major > > With AQE enabled, AdaptiveSparkPlanExec will attempt to reuse exchanges that > are semantically equal. > This is done by comparing the canonicalized plan for two Exchange nodes to > see if they are the same. > Unfortunately this does not take into account the fact that two exchanges > with the same canonical plan might be replaced by a plugin in a way that > makes them not compatible. For example, a plugin could create one version > with supportsColumnar=true and another with supportsColumnar=false. It is not > valid to re-use exchanges if there is a supportsColumnar mismatch. > I have tested a fix for this and will put up a PR once I figure out how to > write the tests. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
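A minimal sketch of the kind of check the reported fix implies (names are illustrative; the actual reuse logic lives in AdaptiveSparkPlanExec and is more involved):

```scala
import scala.collection.mutable

// Hypothetical sketch: reuse an exchange only when both the canonicalized plan
// AND the columnar output mode match, so a row-based exchange is never swapped
// in for a columnar one (or vice versa) by a plugin-modified plan.
def reuseExchanges(exchanges: Seq[Exchange]): Seq[Exchange] = {
  val cache = mutable.Map[SparkPlan, Exchange]()
  exchanges.map { e =>
    cache.get(e.canonicalized) match {
      case Some(cached) if cached.supportsColumnar == e.supportsColumnar =>
        cached // safe to reuse: same canonical plan, same output format
      case _ =>
        cache(e.canonicalized) = e
        e // keep the original: formats differ, or nothing cached yet
    }
  }
}
```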
[jira] [Assigned] (SPARK-35101) Add GitHub status check in PR instead of a comment
[ https://issues.apache.org/jira/browse/SPARK-35101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35101: Assignee: (was: Apache Spark) > Add GitHub status check in PR instead of a comment > -- > > Key: SPARK-35101 > URL: https://issues.apache.org/jira/browse/SPARK-35101 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.2.0 >Reporter: Hyukjin Kwon >Priority: Major > > We should create a Github Actions' Checks in PRs instead of relying on a > comment to make it easier to judge if a PR is mergeable or not. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35101) Add GitHub status check in PR instead of a comment
[ https://issues.apache.org/jira/browse/SPARK-35101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35101: Assignee: Apache Spark > Add GitHub status check in PR instead of a comment > -- > > Key: SPARK-35101 > URL: https://issues.apache.org/jira/browse/SPARK-35101 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.2.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > > We should create a Github Actions' Checks in PRs instead of relying on a > comment to make it easier to judge if a PR is mergeable or not. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35101) Add GitHub status check in PR instead of a comment
[ https://issues.apache.org/jira/browse/SPARK-35101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322565#comment-17322565 ] Apache Spark commented on SPARK-35101: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/32193 > Add GitHub status check in PR instead of a comment > -- > > Key: SPARK-35101 > URL: https://issues.apache.org/jira/browse/SPARK-35101 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.2.0 >Reporter: Hyukjin Kwon >Priority: Major > > We should create a Github Actions' Checks in PRs instead of relying on a > comment to make it easier to judge if a PR is mergeable or not. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-35048) Distribute GitHub Actions workflows to fork repositories to share the resources
[ https://issues.apache.org/jira/browse/SPARK-35048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-35048: - Comment: was deleted (was: User 'Yikun' has created a pull request for this issue: https://github.com/apache/spark/pull/32174) > Distribute GitHub Actions workflows to fork repositories to share the > resources > --- > > Key: SPARK-35048 > URL: https://issues.apache.org/jira/browse/SPARK-35048 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.2.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Critical > Fix For: 3.2.0 > > > The PR was opened first as a POC. Please refer to the PR description at > https://github.com/apache/spark/pull/32092. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35101) Add GitHub status check in PR instead of a comment
Hyukjin Kwon created SPARK-35101: Summary: Add GitHub status check in PR instead of a comment Key: SPARK-35101 URL: https://issues.apache.org/jira/browse/SPARK-35101 Project: Spark Issue Type: Improvement Components: Project Infra Affects Versions: 3.2.0 Reporter: Hyukjin Kwon We should create a Github Actions' Checks in PRs instead of relying on a comment to make it easier to judge if a PR is mergeable or not. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35100) Refactor AFT - support virtual centering
[ https://issues.apache.org/jira/browse/SPARK-35100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35100: Assignee: Apache Spark > Refactor AFT - support virtual centering > > > Key: SPARK-35100 > URL: https://issues.apache.org/jira/browse/SPARK-35100 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 3.2.0 >Reporter: zhengruifeng >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35100) Refactor AFT - support virtual centering
[ https://issues.apache.org/jira/browse/SPARK-35100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35100: Assignee: (was: Apache Spark) > Refactor AFT - support virtual centering > > > Key: SPARK-35100 > URL: https://issues.apache.org/jira/browse/SPARK-35100 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 3.2.0 >Reporter: zhengruifeng >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35100) Refactor AFT - support virtual centering
[ https://issues.apache.org/jira/browse/SPARK-35100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322556#comment-17322556 ] Apache Spark commented on SPARK-35100: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/32199 > Refactor AFT - support virtual centering > > > Key: SPARK-35100 > URL: https://issues.apache.org/jira/browse/SPARK-35100 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 3.2.0 >Reporter: zhengruifeng >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35100) Refactor AFT - support virtual centering
[ https://issues.apache.org/jira/browse/SPARK-35100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322555#comment-17322555 ] Apache Spark commented on SPARK-35100: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/32199 > Refactor AFT - support virtual centering > > > Key: SPARK-35100 > URL: https://issues.apache.org/jira/browse/SPARK-35100 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 3.2.0 >Reporter: zhengruifeng >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35100) Refactor AFT - support virtual centering
zhengruifeng created SPARK-35100: Summary: Refactor AFT - support virtual centering Key: SPARK-35100 URL: https://issues.apache.org/jira/browse/SPARK-35100 Project: Spark Issue Type: Sub-task Components: ML Affects Versions: 3.2.0 Reporter: zhengruifeng -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26164) [SQL] Allow FileFormatWriter to write multiple partitions/buckets without sort
[ https://issues.apache.org/jira/browse/SPARK-26164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322552#comment-17322552 ] Apache Spark commented on SPARK-26164: -- User 'c21' has created a pull request for this issue: https://github.com/apache/spark/pull/32198 > [SQL] Allow FileFormatWriter to write multiple partitions/buckets without sort > -- > > Key: SPARK-26164 > URL: https://issues.apache.org/jira/browse/SPARK-26164 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0, 3.0.0, 3.1.0 >Reporter: Cheng Su >Priority: Minor > > Problem: > Currently, Spark always requires a local sort on > partition/bucket columns before writing to an output table [1]. The disadvantage is that the sort might waste > reserved CPU time on the executor due to spill. Hive does not require a local > sort before writing an output table [2], and we saw performance regressions when > migrating Hive workloads to Spark. > > Proposal: > We can avoid the local sort by keeping a mapping between file path and > output writer. When writing a row to a new file path, we create a new > output writer. Otherwise, we re-use the existing output writer for that path > (the main change should be in FileFormatDataWriter.scala). This is very > similar to what Hive does in [2]. > Since the new behavior (i.e. avoiding the sort by keeping multiple output writers) > consumes more memory on the executor than the current behavior (multiple output writers need to be open > at the same time, instead of only one), we can add a config to switch between the current and new behavior. 
> > [1]: spark FileFormatWriter.scala - > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L123] > [2]: hive FileSinkOperator.java - > [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L510] > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
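The proposal above can be sketched as follows (a simplification; `partitionPathFor` and `newOutputWriter` are illustrative helpers, and the real change would live in FileFormatDataWriter.scala):

```scala
import scala.collection.mutable

// Hypothetical sketch: instead of requiring rows pre-sorted by partition/bucket,
// keep one open OutputWriter per target path and re-use it on later rows.
private val concurrentWriters = mutable.Map[String, OutputWriter]()

def write(record: InternalRow): Unit = {
  val path = partitionPathFor(record) // illustrative: derive the partition/bucket path
  // Open a writer the first time a path is seen; re-use it afterwards.
  val writer = concurrentWriters.getOrElseUpdate(path, newOutputWriter(path))
  writer.write(record)
}

def close(): Unit = {
  // All writers stay open until the task finishes; this extra memory is the
  // trade-off the proposed config would let users opt out of.
  concurrentWriters.values.foreach(_.close())
}
```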
[jira] [Commented] (SPARK-34995) Port/integrate Koalas remaining codes into PySpark
[ https://issues.apache.org/jira/browse/SPARK-34995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322551#comment-17322551 ] Apache Spark commented on SPARK-34995: -- User 'itholic' has created a pull request for this issue: https://github.com/apache/spark/pull/32197 > Port/integrate Koalas remaining codes into PySpark > -- > > Key: SPARK-34995 > URL: https://issues.apache.org/jira/browse/SPARK-34995 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Takuya Ueshin >Priority: Major > > There are some more commits remaining after the main codes were ported. > - > [https://github.com/databricks/koalas/commit/c8f803d6becb3accd767afdb3774c8656d0d0b47] > - > [https://github.com/databricks/koalas/commit/913d68868d38ee7158c640aceb837484f417267e] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35081) Add Data Source Option links to missing documents.
[ https://issues.apache.org/jira/browse/SPARK-35081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-35081: Description: We're missing the documents for options in detail such as `to_xxx`, `from_xxx`, `schema_of_csv` and `schema_of_json` functions. We should add a link to the Data Source Option for them. was: We're missing the documents for options in detail for `to_xxx` and `from_xxx` functions. We should add a link to the Data Source Option for them. > Add Data Source Option links to missing documents. > -- > > Key: SPARK-35081 > URL: https://issues.apache.org/jira/browse/SPARK-35081 > Project: Spark > Issue Type: Sub-task > Components: docs >Affects Versions: 3.2.0 >Reporter: Haejoon Lee >Priority: Major > > We're missing the documents for options in detail such as `to_xxx`, > `from_xxx`, `schema_of_csv` and `schema_of_json` functions. > We should add a link to the Data Source Option for them. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35081) Add Data Source Option links to missing documents.
[ https://issues.apache.org/jira/browse/SPARK-35081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-35081: Summary: Add Data Source Option links to missing documents. (was: Add Data Source Option links to `to_xxx`, `from_xxx`.) > Add Data Source Option links to missing documents. > -- > > Key: SPARK-35081 > URL: https://issues.apache.org/jira/browse/SPARK-35081 > Project: Spark > Issue Type: Sub-task > Components: docs >Affects Versions: 3.2.0 >Reporter: Haejoon Lee >Priority: Major > > We're missing the documents for options in detail for `to_xxx` and `from_xxx` > functions. > We should add a link to the Data Source Option for them. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-35032) Port Koalas Index unit tests into PySpark
[ https://issues.apache.org/jira/browse/SPARK-35032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-35032. -- Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 32139 [https://github.com/apache/spark/pull/32139] > Port Koalas Index unit tests into PySpark > - > > Key: SPARK-35032 > URL: https://issues.apache.org/jira/browse/SPARK-35032 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.2.0 > > > This JIRA aims to port Koalas Index unit tests to [PySpark > tests|https://github.com/apache/spark/tree/master/python/pyspark/tests]. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35032) Port Koalas Index unit tests into PySpark
[ https://issues.apache.org/jira/browse/SPARK-35032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-35032: Assignee: Xinrong Meng > Port Koalas Index unit tests into PySpark > - > > Key: SPARK-35032 > URL: https://issues.apache.org/jira/browse/SPARK-35032 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > > This JIRA aims to port Koalas Index unit tests to [PySpark > tests|https://github.com/apache/spark/tree/master/python/pyspark/tests]. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30602) SPIP: Support push-based shuffle to improve shuffle efficiency
[ https://issues.apache.org/jira/browse/SPARK-30602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322476#comment-17322476 ] Min Shen commented on SPARK-30602: -- We have published the production results of push-based shuffle on 100% of LinkedIn's offline Spark workloads in the following blog post. https://www.linkedin.com/pulse/bringing-next-gen-shuffle-architecture-data-linkedin-scale-min-shen/ > SPIP: Support push-based shuffle to improve shuffle efficiency > -- > > Key: SPARK-30602 > URL: https://issues.apache.org/jira/browse/SPARK-30602 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core >Affects Versions: 3.1.0 >Reporter: Min Shen >Priority: Major > Labels: release-notes > Attachments: Screen Shot 2020-06-23 at 11.31.22 AM.jpg, > vldb_magnet_final.pdf > > > In a large deployment of a Spark compute infrastructure, Spark shuffle is > becoming a potential scaling bottleneck and a source of inefficiency in the > cluster. When doing Spark on YARN for a large-scale deployment, people > usually enable the Spark external shuffle service and store the intermediate > shuffle files on HDD. Because the number of blocks generated for a particular > shuffle grows quadratically with the size of shuffled data (# mappers > and # reducers grow linearly with the size of shuffled data, but # blocks is # > mappers * # reducers), one general trend we have observed is that the more > data a Spark application processes, the smaller the block size becomes. In a > few production clusters we have seen, the average shuffle block size is only > 10s of KBs. Because of the inefficiency of performing random reads on HDD for > small amounts of data, the overall efficiency of the Spark external shuffle > services serving the shuffle blocks degrades as we see an increasing # of > Spark applications processing an increasing amount of data. 
In addition, > because the Spark external shuffle service is a shared service in a multi-tenancy > cluster, the inefficiency of one Spark application could propagate to other > applications as well. > In this ticket, we propose a solution to improve Spark shuffle efficiency in > the above-mentioned environments with push-based shuffle. With push-based > shuffle, shuffle is performed at the end of mappers, and blocks get pre-merged > and moved towards reducers. In our prototype implementation, we have seen > significant efficiency improvements when performing large shuffles. We take a > Spark-native approach to achieve this, i.e., extending Spark’s existing > shuffle netty protocol, and the behaviors of Spark mappers, reducers and > drivers. This way, we can bring the benefits of more efficient shuffle in > Spark without incurring the dependency or overhead of either a specialized > storage layer or external infrastructure pieces. > > Link to dev mailing list discussion: > http://apache-spark-developers-list.1001551.n3.nabble.com/Enabling-push-based-shuffle-in-Spark-td28732.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
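To make the quadratic block growth concrete, here is a back-of-the-envelope calculation (the 8-tasks-per-GB figure is an arbitrary assumption for illustration, not a LinkedIn number):

```scala
// If mapper and reducer counts scale linearly with data size, block count
// (# mappers * # reducers) grows quadratically, so average block size
// shrinks linearly as data grows.
for (gb <- Seq(100, 1000)) {
  val mappers  = gb * 8            // assumed parallelism: 8 tasks per GB
  val reducers = gb * 8
  val blocks   = mappers.toLong * reducers
  val avgKB    = gb.toLong * 1024 * 1024 / blocks
  println(s"${gb}GB shuffle -> $blocks blocks, ~${avgKB}KB per block")
}
// 10x more data -> 100x more blocks -> ~10x smaller blocks (here, ~163KB down
// to ~16KB), consistent with the "10s of KBs" block sizes seen in production.
```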
[jira] [Commented] (SPARK-35099) Convert ANSI interval literals to SQL string
[ https://issues.apache.org/jira/browse/SPARK-35099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322462#comment-17322462 ] Apache Spark commented on SPARK-35099: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/32196 > Convert ANSI interval literals to SQL string > > > Key: SPARK-35099 > URL: https://issues.apache.org/jira/browse/SPARK-35099 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > Implement the sql() and toString() methods of Literal for ANSI interval > types: DayTimeIntervalType and YearMonthIntervalType. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35099) Convert ANSI interval literals to SQL string
[ https://issues.apache.org/jira/browse/SPARK-35099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35099: Assignee: Max Gekk (was: Apache Spark) > Convert ANSI interval literals to SQL string > > > Key: SPARK-35099 > URL: https://issues.apache.org/jira/browse/SPARK-35099 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > Implement the sql() and toString() methods of Literal for ANSI interval > types: DayTimeIntervalType and YearMonthIntervalType. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35099) Convert ANSI interval literals to SQL string
[ https://issues.apache.org/jira/browse/SPARK-35099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35099: Assignee: Apache Spark (was: Max Gekk) > Convert ANSI interval literals to SQL string > > > Key: SPARK-35099 > URL: https://issues.apache.org/jira/browse/SPARK-35099 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > > Implement the sql() and toString() methods of Literal for ANSI interval > types: DayTimeIntervalType and YearMonthIntervalType. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35099) Convert ANSI interval literals to SQL string
[ https://issues.apache.org/jira/browse/SPARK-35099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322461#comment-17322461 ] Apache Spark commented on SPARK-35099: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/32196 > Convert ANSI interval literals to SQL string > > > Key: SPARK-35099 > URL: https://issues.apache.org/jira/browse/SPARK-35099 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > Implement the sql() and toString() methods of Literal for ANSI interval > types: DayTimeIntervalType and YearMonthIntervalType. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35093) [SQL] AQE columnar mismatch on exchange reuse
[ https://issues.apache.org/jira/browse/SPARK-35093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322453#comment-17322453 ] Apache Spark commented on SPARK-35093: -- User 'andygrove' has created a pull request for this issue: https://github.com/apache/spark/pull/32195 > [SQL] AQE columnar mismatch on exchange reuse > - > > Key: SPARK-35093 > URL: https://issues.apache.org/jira/browse/SPARK-35093 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2, 3.1.1 >Reporter: Andy Grove >Priority: Major > > With AQE enabled, AdaptiveSparkPlanExec will attempt to reuse exchanges that > are semantically equal. > This is done by comparing the canonicalized plan for two Exchange nodes to > see if they are the same. > Unfortunately this does not take into account the fact that two exchanges > with the same canonical plan might be replaced by a plugin in a way that > makes them not compatible. For example, a plugin could create one version > with supportsColumnar=true and another with supportsColumnar=false. It is not > valid to re-use exchanges if there is a supportsColumnar mismatch. > I have tested a fix for this and will put up a PR once I figure out how to > write the tests. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35093) [SQL] AQE columnar mismatch on exchange reuse
[ https://issues.apache.org/jira/browse/SPARK-35093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35093: Assignee: Apache Spark > [SQL] AQE columnar mismatch on exchange reuse > - > > Key: SPARK-35093 > URL: https://issues.apache.org/jira/browse/SPARK-35093 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2, 3.1.1 >Reporter: Andy Grove >Assignee: Apache Spark >Priority: Major > > With AQE enabled, AdaptiveSparkPlanExec will attempt to reuse exchanges that > are semantically equal. > This is done by comparing the canonicalized plan for two Exchange nodes to > see if they are the same. > Unfortunately this does not take into account the fact that two exchanges > with the same canonical plan might be replaced by a plugin in a way that > makes them not compatible. For example, a plugin could create one version > with supportsColumnar=true and another with supportsColumnar=false. It is not > valid to re-use exchanges if there is a supportsColumnar mismatch. > I have tested a fix for this and will put up a PR once I figure out how to > write the tests. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35093) [SQL] AQE columnar mismatch on exchange reuse
[ https://issues.apache.org/jira/browse/SPARK-35093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35093: Assignee: (was: Apache Spark) > [SQL] AQE columnar mismatch on exchange reuse > - > > Key: SPARK-35093 > URL: https://issues.apache.org/jira/browse/SPARK-35093 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2, 3.1.1 >Reporter: Andy Grove >Priority: Major > > With AQE enabled, AdaptiveSparkPlanExec will attempt to reuse exchanges that > are semantically equal. > This is done by comparing the canonicalized plan for two Exchange nodes to > see if they are the same. > Unfortunately this does not take into account the fact that two exchanges > with the same canonical plan might be replaced by a plugin in a way that > makes them not compatible. For example, a plugin could create one version > with supportsColumnar=true and another with supportsColumnar=false. It is not > valid to re-use exchanges if there is a supportsColumnar mismatch. > I have tested a fix for this and will put up a PR once I figure out how to > write the tests. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35093) [SQL] AQE columnar mismatch on exchange reuse
[ https://issues.apache.org/jira/browse/SPARK-35093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322452#comment-17322452 ] Apache Spark commented on SPARK-35093: -- User 'andygrove' has created a pull request for this issue: https://github.com/apache/spark/pull/32195 > [SQL] AQE columnar mismatch on exchange reuse > - > > Key: SPARK-35093 > URL: https://issues.apache.org/jira/browse/SPARK-35093 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2, 3.1.1 >Reporter: Andy Grove >Priority: Major > > With AQE enabled, AdaptiveSparkPlanExec will attempt to reuse exchanges that > are semantically equal. > This is done by comparing the canonicalized plan for two Exchange nodes to > see if they are the same. > Unfortunately this does not take into account the fact that two exchanges > with the same canonical plan might be replaced by a plugin in a way that > makes them not compatible. For example, a plugin could create one version > with supportsColumnar=true and another with supportsColumnar=false. It is not > valid to re-use exchanges if there is a supportsColumnar mismatch. > I have tested a fix for this and will put up a PR once I figure out how to > write the tests. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35099) Convert ANSI interval literals to SQL string
Max Gekk created SPARK-35099: Summary: Convert ANSI interval literals to SQL string Key: SPARK-35099 URL: https://issues.apache.org/jira/browse/SPARK-35099 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: Max Gekk Assignee: Max Gekk Implement the sql() and toString() methods of Literal for ANSI interval types: DayTimeIntervalType and YearMonthIntervalType. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
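For context, the ANSI year-month form these methods should emit looks like `INTERVAL '1-2' YEAR TO MONTH`. A minimal sketch of such a formatter follows; the helper name and the months-as-Int representation (which matches how YearMonthIntervalType values are stored) are illustrative, not Spark's internal Literal API.

```scala
// Hedged sketch: render a year-month interval, stored as a total month
// count, in the ANSI SQL literal form INTERVAL 'Y-M' YEAR TO MONTH.
def yearMonthIntervalSql(totalMonths: Int): String = {
  val sign = if (totalMonths < 0) "-" else ""
  val m = math.abs(totalMonths)
  s"INTERVAL '$sign${m / 12}-${m % 12}' YEAR TO MONTH"
}

println(yearMonthIntervalSql(14))   // INTERVAL '1-2' YEAR TO MONTH
println(yearMonthIntervalSql(-14))  // INTERVAL '-1-2' YEAR TO MONTH
```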
[jira] [Created] (SPARK-35098) Revisit pandas-on-Spark test cases that are disabled because of pandas nondeterministic return values
Xinrong Meng created SPARK-35098: Summary: Revisit pandas-on-Spark test cases that are disabled because of pandas nondeterministic return values Key: SPARK-35098 URL: https://issues.apache.org/jira/browse/SPARK-35098 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.2.0 Reporter: Xinrong Meng Some test cases have been disabled in the places as shown below because of pandas nondeterministic return values: * pandas returns `None` or `nan` randomly python/pyspark/pandas/tests/test_series.py test_astype * pandas returns `True` or `False` randomly python/pyspark/pandas/tests/indexes/test_base.py test_monotonic We should revisit them later. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35097) Add column name to SparkUpgradeException about ancient datetime
Max Gekk created SPARK-35097: Summary: Add column name to SparkUpgradeException about ancient datetime Key: SPARK-35097 URL: https://issues.apache.org/jira/browse/SPARK-35097 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: Max Gekk The error message: {code:java} org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: reading dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z from Parquet files can be ambiguous, as the files may be written by Spark 2.x or legacy versions of Hive, which uses a legacy hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian calendar. See more details in SPARK-31404. You can set spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'LEGACY' to rebase the datetime values w.r.t. the calendar difference during reading. Or set spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'CORRECTED' to read the datetime values as it is. {code} doesn't have any clues of which column causes the issue. Need to improve the message and add column name to it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
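For reference, the two remedies the error message offers are plain SQL configs; a minimal spark-defaults.conf fragment (config key and values taken verbatim from the message quoted above):

```
# Read the datetime values as they are, without rebasing:
spark.sql.legacy.parquet.datetimeRebaseModeInRead=CORRECTED
# Or rebase the values w.r.t. the calendar difference during reading:
# spark.sql.legacy.parquet.datetimeRebaseModeInRead=LEGACY
```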
[jira] [Commented] (SPARK-35096) foreachBatch throws ArrayIndexOutOfBoundsException if schema is case Insensitive
[ https://issues.apache.org/jira/browse/SPARK-35096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322376#comment-17322376 ] Apache Spark commented on SPARK-35096: -- User 'sandeep-katta' has created a pull request for this issue: https://github.com/apache/spark/pull/32194 > foreachBatch throws ArrayIndexOutOfBoundsException if schema is case > Insensitive > > > Key: SPARK-35096 > URL: https://issues.apache.org/jira/browse/SPARK-35096 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Sandeep Katta >Priority: Major > > Below code works fine before spark3, running on spark3 throws > java.lang.ArrayIndexOutOfBoundsException > {code:java} > val inputPath = "/Users/xyz/data/testcaseInsensitivity" > val output_path = "/Users/xyz/output" > spark.range(10).write.format("parquet").save(inputPath) > def process_row(microBatch: DataFrame, batchId: Long): Unit = { > val df = microBatch.select($"ID".alias("other")) // Doesn't work > df.write.format("parquet").mode("append").save(output_path) > } > val schema = new StructType().add("id", LongType) > val stream_df = > spark.readStream.schema(schema).format("parquet").load(inputPath) > stream_df.writeStream.trigger(Trigger.Once).foreachBatch(process_row _) > .start().awaitTermination() > {code} > Stack Trace: > {code:java} > Caused by: java.lang.ArrayIndexOutOfBoundsException: 0 > at org.apache.spark.sql.types.StructType.apply(StructType.scala:414) > at > org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$4.$anonfun$applyOrElse$3(objects.scala:216) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) > at scala.collection.immutable.List.foreach(List.scala:392) > at scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at scala.collection.immutable.List.map(List.scala:298) > at > 
org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$4.applyOrElse(objects.scala:215) > at > org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$4.applyOrElse(objects.scala:203) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:309) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:309) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:149) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:147) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:314) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:399) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:237) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:397) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:350) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:314) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:149) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:147) > at > 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:298) > at > org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$.apply(objects.scala:203) > at > org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$.apply(objects.scala:121) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:149) > at > scala.collection.IndexedSeqOptimized.foldLeft(IndexedSeqOptimized.scala:60) > at > scala.collection.IndexedSeqOptimized.foldLeft$(I
[jira] [Assigned] (SPARK-35096) foreachBatch throws ArrayIndexOutOfBoundsException if schema is case Insensitive
[ https://issues.apache.org/jira/browse/SPARK-35096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35096: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-35096) foreachBatch throws ArrayIndexOutOfBoundsException if schema is case Insensitive
[ https://issues.apache.org/jira/browse/SPARK-35096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322378#comment-17322378 ] Apache Spark commented on SPARK-35096: -- User 'sandeep-katta' has created a pull request for this issue: https://github.com/apache/spark/pull/32194
[jira] [Assigned] (SPARK-35096) foreachBatch throws ArrayIndexOutOfBoundsException if schema is case Insensitive
[ https://issues.apache.org/jira/browse/SPARK-35096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35096: Assignee: Apache Spark
[jira] [Commented] (SPARK-35096) foreachBatch throws ArrayIndexOutOfBoundsException if schema is case Insensitive
[ https://issues.apache.org/jira/browse/SPARK-35096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322360#comment-17322360 ] Sandeep Katta commented on SPARK-35096: --- Working on a fix; will raise a PR for this soon.
[jira] [Created] (SPARK-35096) foreachBatch throws ArrayIndexOutOfBoundsException if schema is case Insensitive
Sandeep Katta created SPARK-35096: - Summary: foreachBatch throws ArrayIndexOutOfBoundsException if schema is case Insensitive Key: SPARK-35096 URL: https://issues.apache.org/jira/browse/SPARK-35096 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.0 Reporter: Sandeep Katta Below code works fine before spark3, running on spark3 throws java.lang.ArrayIndexOutOfBoundsException {code:java} val inputPath = "/Users/xyz/data/testcaseInsensitivity" val output_path = "/Users/xyz/output" spark.range(10).write.format("parquet").save(inputPath) def process_row(microBatch: DataFrame, batchId: Long): Unit = { val df = microBatch.select($"ID".alias("other")) // Doesn't work df.write.format("parquet").mode("append").save(output_path) } val schema = new StructType().add("id", LongType) val stream_df = spark.readStream.schema(schema).format("parquet").load(inputPath) stream_df.writeStream.trigger(Trigger.Once).foreachBatch(process_row _) .start().awaitTermination() {code} Stack Trace: {code:java} Caused by: java.lang.ArrayIndexOutOfBoundsException: 0 at org.apache.spark.sql.types.StructType.apply(StructType.scala:414) at org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$4.$anonfun$applyOrElse$3(objects.scala:216) at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) at scala.collection.immutable.List.foreach(List.scala:392) at scala.collection.TraversableLike.map(TraversableLike.scala:238) at scala.collection.TraversableLike.map$(TraversableLike.scala:231) at scala.collection.immutable.List.map(List.scala:298) at org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$4.applyOrElse(objects.scala:215) at org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$4.applyOrElse(objects.scala:203) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:309) at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:309) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:149) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:147) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:314) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:399) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:237) at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:397) at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:350) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:314) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:149) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:147) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29) at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:298) at org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$.apply(objects.scala:203) at 
org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$.apply(objects.scala:121) at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:149) at scala.collection.IndexedSeqOptimized.foldLeft(IndexedSeqOptimized.scala:60) at scala.collection.IndexedSeqOptimized.foldLeft$(IndexedSeqOptimized.scala:68) at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:38) at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:146) at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:138) at scala.collection.immutable.List.foreach(List.scala:392) at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:138) at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.sca
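The failure mode can be illustrated without Spark; the names below are stand-ins, not Spark's internal API. A case-sensitive lookup of the selected column "ID" against a schema declaring "id" matches nothing, and indexing the empty result at 0 is what produces the ArrayIndexOutOfBoundsException in the stack trace, even though analysis resolved the column case-insensitively.

```scala
// Stand-in for the failing lookup inside ObjectSerializerPruning: resolving
// the pruned field name against the serializer schema. A case-sensitive
// comparison finds no candidate for "ID" in a schema declaring "id"; the
// case-insensitive comparison (spark.sql.caseSensitive=false, the default)
// resolves it as the user expects.
case class Field(name: String)

def findField(fields: Seq[Field], name: String, caseSensitive: Boolean): Option[Field] =
  fields.find(f => if (caseSensitive) f.name == name else f.name.equalsIgnoreCase(name))

val schema = Seq(Field("id"))
println(findField(schema, "ID", caseSensitive = true))   // None: nothing to index, hence the exception
println(findField(schema, "ID", caseSensitive = false))  // Some(Field(id))
```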
[jira] [Created] (SPARK-35095) Use ANSI intervals in streaming join tests
Max Gekk created SPARK-35095: Summary: Use ANSI intervals in streaming join tests Key: SPARK-35095 URL: https://issues.apache.org/jira/browse/SPARK-35095 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: Max Gekk Enable ANSI intervals in the tests: - StreamingOuterJoinSuite.right outer with watermark range condition - StreamingOuterJoinSuite.left outer with watermark range condition -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35094) Spark from_json(JsonToStruct) function return wrong value in permissive mode in case best effort
[ https://issues.apache.org/jira/browse/SPARK-35094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Hryhoriev updated SPARK-35094:
---
Description:

I use Spark 3.1.1 and 3.0.2. The function `from_json` returns a wrong value in PERMISSIVE mode, in a corner case:
1. The JSON message has a complex nested structure: the same field name (`problematicName` below) occurs both at the root level and inside a nested object.
2. A field inside that nested object (`badNestedField` below) has a schema that does not align with the value in the JSON, so parsing falls back to best effort.

Scala code to reproduce:
{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.IntegerType
import org.apache.spark.sql.types.StringType
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.types.StructType

object Main {
  def main(args: Array[String]): Unit = {
    implicit val spark: SparkSession = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    val schemaForFieldWhichWillHaveWrongValue =
      StructField("problematicName", StringType, nullable = true)
    val nestedFieldWhichNotSatisfyJsonMessage = StructField(
      "badNestedField",
      StructType(Seq(StructField("SomethingWhichNotInJsonMessage", IntegerType, nullable = true)))
    )
    val nestedFieldWithNestedFieldWhichNotSatisfyJsonMessage = StructField(
      "nestedField",
      StructType(Seq(nestedFieldWhichNotSatisfyJsonMessage, schemaForFieldWhichWillHaveWrongValue))
    )
    val customSchema = StructType(Seq(
      schemaForFieldWhichWillHaveWrongValue,
      nestedFieldWithNestedFieldWhichNotSatisfyJsonMessage
    ))

    val jsonStringToTest =
      """{"problematicName":"ThisValueWillBeOverwritten","nestedField":{"badNestedField":"14","problematicName":"thisValueInTwoPlaces"}}"""

    val df = List(jsonStringToTest)
      .toDF("json")
      // the issue happens only in permissive mode, during best-effort parsing
      .select(from_json($"json", customSchema).as("toBeFlatten"))
      .select("toBeFlatten.*")

    df.show(truncate = false)
    assert(
      df.select("problematicName").as[String].first() == "ThisValueWillBeOverwritten",
      "wrong value in root schema: the parser took the value from the column with the same name at another nesting level"
    )
  }
}
{code}

I was not able to find the exact root cause, but while debugging I found that in `org.apache.spark.sql.catalyst.util.FailureSafeParser`, at line 64, the code block `e.partialResult()` already holds the wrong value. I hope this helps to fix the issue. As a DIRTY HACK, I forked that function and hardcoded `None`, i.e. `Iterator(toResultRow(None, e.record))`. In my case it is better to have no values in the row than, potentially, a wrong value in some column.
[jira] [Commented] (SPARK-33000) cleanCheckpoints config does not clean all checkpointed RDDs on shutdown
[ https://issues.apache.org/jira/browse/SPARK-33000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322333#comment-17322333 ] Nicholas Chammas commented on SPARK-33000: -- Per the discussion [on the dev list|http://apache-spark-developers-list.1001551.n3.nabble.com/Shutdown-cleanup-of-disk-based-resources-that-Spark-creates-td30928.html] and [PR|https://github.com/apache/spark/pull/31742], it seems we just want to update the documentation to clarify that {{cleanCheckpoints}} does not impact shutdown behavior. i.e. Checkpoints are not meant to be cleaned up on shutdown (whether planned or unplanned), and the config is currently working as intended. > cleanCheckpoints config does not clean all checkpointed RDDs on shutdown > > > Key: SPARK-33000 > URL: https://issues.apache.org/jira/browse/SPARK-33000 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.6 >Reporter: Nicholas Chammas >Priority: Minor > > Maybe it's just that the documentation needs to be updated, but I found this > surprising: > {code:python} > $ pyspark > ... > >>> spark.conf.set('spark.cleaner.referenceTracking.cleanCheckpoints', 'true') > >>> spark.sparkContext.setCheckpointDir('/tmp/spark/checkpoint/') > >>> a = spark.range(10) > >>> a.checkpoint() > DataFrame[id: bigint] > > >>> exit(){code} > The checkpoint data is left behind in {{/tmp/spark/checkpoint/}}. I expected > Spark to clean it up on shutdown. > The documentation for {{spark.cleaner.referenceTracking.cleanCheckpoints}} > says: > {quote}Controls whether to clean checkpoint files if the reference is out of > scope. > {quote} > When Spark shuts down, everything goes out of scope, so I'd expect all > checkpointed RDDs to be cleaned up. > For the record, I see the same behavior in both the Scala and Python REPLs. 
> Evidence the current behavior is confusing: > * [https://stackoverflow.com/q/52630858/877069] > * [https://stackoverflow.com/q/60009856/877069] > * [https://stackoverflow.com/q/61454740/877069] > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
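Since checkpoints are working as intended and are not removed on shutdown, one user-side workaround is to clean the checkpoint directory yourself when the driver process exits. A minimal sketch in plain Python (the directory path is whatever was passed to `setCheckpointDir`; registering an exit hook this way is an assumption about your deployment, not a Spark feature):

```python
import atexit
import shutil

def remove_dir(path):
    """Recursively delete a checkpoint directory, ignoring errors
    (the directory may already be gone)."""
    shutil.rmtree(path, ignore_errors=True)

def clean_checkpoints_on_exit(checkpoint_dir):
    """Register an exit hook so the checkpoint directory is removed when
    the driver process exits normally. Note this cannot cover unplanned
    shutdowns (kill -9, OOM), which is part of why Spark leaves cleanup
    to the user."""
    atexit.register(remove_dir, checkpoint_dir)
```

For checkpoint directories on HDFS rather than the local filesystem, the same idea applies but the deletion would go through the Hadoop filesystem API instead of `shutil`.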
[jira] [Created] (SPARK-35094) Spark from_json(JsonToStruct) function return wrong value in permissive mode in case best effort
Nick Hryhoriev created SPARK-35094:
---
Summary: Spark from_json(JsonToStruct) function return wrong value in permissive mode in case best effort
Key: SPARK-35094
URL: https://issues.apache.org/jira/browse/SPARK-35094
Project: Spark
Issue Type: Bug
Components: Spark Core, SQL
Affects Versions: 3.1.1, 3.0.2
Reporter: Nick Hryhoriev

I use Spark 3.1.1 and 3.0.2. The function `from_json` returns a wrong value in PERMISSIVE mode, in a corner case:
1. The JSON message has a complex nested structure: the same field name (`problematicName` below) occurs both at the root level and inside a nested object.
2. A field inside that nested object (`badNestedField` below) has a schema that does not align with the value in the JSON, so parsing falls back to best effort.

Scala code to reproduce:
{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.IntegerType
import org.apache.spark.sql.types.StringType
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.types.StructType

object Main {
  def main(args: Array[String]): Unit = {
    implicit val spark: SparkSession = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    val schemaForFieldWhichWillHaveWrongValue =
      StructField("problematicName", StringType, nullable = true)
    val nestedFieldWhichNotSatisfyJsonMessage = StructField(
      "badNestedField",
      StructType(Seq(StructField("SomethingWhichNotInJsonMessage", IntegerType, nullable = true)))
    )
    val nestedFieldWithNestedFieldWhichNotSatisfyJsonMessage = StructField(
      "nestedField",
      StructType(Seq(nestedFieldWhichNotSatisfyJsonMessage, schemaForFieldWhichWillHaveWrongValue))
    )
    val customSchema = StructType(Seq(
      schemaForFieldWhichWillHaveWrongValue,
      nestedFieldWithNestedFieldWhichNotSatisfyJsonMessage
    ))

    val jsonStringToTest =
      """{"problematicName":"ThisValueWillBeOverwritten","nestedField":{"badNestedField":"14","problematicName":"thisValueInTwoPlaces"}}"""

    val df = List(jsonStringToTest)
      .toDF("json")
      // the issue happens only in permissive mode, during best-effort parsing
      .select(from_json($"json", customSchema).as("toBeFlatten"))
      .select("toBeFlatten.*")

    df.show(truncate = false)
    assert(
      df.select("problematicName").as[String].first() == "ThisValueWillBeOverwritten",
      "wrong value in root schema: the parser took the value from the column with the same name at another nesting level"
    )
  }
}
{code}

I was not able to find the exact root cause, but while debugging I found that in `org.apache.spark.sql.catalyst.util.FailureSafeParser`, at line 64, the code block `e.partialResult()` already holds the wrong value. I hope this helps to fix the issue. As a DIRTY HACK, I forked that function and hardcoded `None`, i.e. `Iterator(toResultRow(None, e.record))`. In my case it is better to have no values in the row than, potentially, a wrong value in some column.

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
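The "dirty hack" described in the report (forking `FailureSafeParser` and replacing `e.partialResult()` with `None`) can be sketched with a toy model. This is plain Python, not Spark's actual class; the names `to_result_row` and `partial` mirror the report's description and are illustrative only:

```python
def to_result_row(partial, record):
    """Build a permissive-mode output row. `partial` is a dict of the
    fields the parser managed to fill in (possibly with a value leaked
    from the wrong nesting level), or None. The raw input goes into the
    corrupt-record column."""
    row = dict(partial or {})
    row["_corrupt_record"] = record
    return row

def permissive(partial, record):
    # Behavior described in the report: trust the partial result,
    # which may already hold a wrong value.
    return to_result_row(partial, record)

def workaround(partial, record):
    # The reporter's workaround: drop the partial result entirely, so a
    # missing value can never be mistaken for a correct one.
    return to_result_row(None, record)
```

The trade-off is explicit: the workaround yields empty fields instead of best-effort fields, which the reporter considered safer than a silently wrong value.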
[jira] [Commented] (SPARK-34792) Restore previous behaviour of randomSplit from spark-2.4.7 in spark-3
[ https://issues.apache.org/jira/browse/SPARK-34792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322307#comment-17322307 ] Dongjoon Hyun commented on SPARK-34792:
---
Thank you for sharing your update, [~kondziolka9ld].

> Restore previous behaviour of randomSplit from spark-2.4.7 in spark-3
> ---------------------------------------------------------------------
>
> Key: SPARK-34792
> URL: https://issues.apache.org/jira/browse/SPARK-34792
> Project: Spark
> Issue Type: Question
> Components: Spark Core, SQL
> Affects Versions: 3.0.1
> Reporter: kondziolka9ld
> Priority: Major
>
> Hi,
> Please consider the following difference in the results of `randomSplit`, despite the same seed.
>
> {code:java}
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/ '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 2.4.7
>       /_/
>
> Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_282)
> Type in expressions to have them evaluated.
> Type :help for more information.
>
> scala> val Array(f, s) = Seq(1,2,3,4,5,6,7,8,9,10).toDF.randomSplit(Array(0.3, 0.7), 42)
> f: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int]
> s: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int]
>
> scala> f.show
> +-----+
> |value|
> +-----+
> |    4|
> +-----+
>
> scala> s.show
> +-----+
> |value|
> +-----+
> |    1|
> |    2|
> |    3|
> |    5|
> |    6|
> |    7|
> |    8|
> |    9|
> |   10|
> +-----+
> {code}
> whereas on Spark 3:
> {code:java}
> scala> val Array(f, s) = Seq(1,2,3,4,5,6,7,8,9,10).toDF.randomSplit(Array(0.3, 0.7), 42)
> f: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int]
> s: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int]
>
> scala> f.show
> +-----+
> |value|
> +-----+
> |    5|
> |   10|
> +-----+
>
> scala> s.show
> +-----+
> |value|
> +-----+
> |    1|
> |    2|
> |    3|
> |    4|
> |    6|
> |    7|
> |    8|
> |    9|
> +-----+
> {code}
> I guess that the implementation of the `sample` method changed.
> Is it possible to restore the previous behaviour?
> Thanks in advance!
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
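The behaviour change above comes down to the fact that a seeded split is deterministic only for a fixed implementation: the same seed reproduces the same split within one Spark version, but any change in how the random stream is consumed (such as a change to `sample`) produces a different, still statistically valid, split. A toy illustration in plain Python, not Spark's actual sampler:

```python
import random

def bernoulli_split(xs, first_weight, seed):
    """Per-element Bernoulli split in the spirit of randomSplit: each
    element goes to the first bucket with probability first_weight,
    driven by a PRNG seeded deterministically."""
    rng = random.Random(seed)
    first, second = [], []
    for x in xs:
        (first if rng.random() < first_weight else second).append(x)
    return first, second
```

Running this twice with the same seed yields identical splits; changing how `rng` is consumed (for example, drawing an extra random number per element) would change the split even with the seed fixed, which is the situation between Spark 2.4.7 and Spark 3.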
[jira] [Created] (SPARK-35093) [SQL] AQE columnar mismatch on exchange reuse
Andy Grove created SPARK-35093: -- Summary: [SQL] AQE columnar mismatch on exchange reuse Key: SPARK-35093 URL: https://issues.apache.org/jira/browse/SPARK-35093 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.1.1, 3.0.2 Reporter: Andy Grove With AQE enabled, AdaptiveSparkPlanExec will attempt to reuse exchanges that are semantically equal. This is done by comparing the canonicalized plan for two Exchange nodes to see if they are the same. Unfortunately this does not take into account the fact that two exchanges with the same canonical plan might be replaced by a plugin in a way that makes them not compatible. For example, a plugin could create one version with supportsColumnar=true and another with supportsColumnar=false. It is not valid to re-use exchanges if there is a supportsColumnar mismatch. I have tested a fix for this and will put up a PR once I figure out how to write the tests. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
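The compatibility check described in SPARK-35093 can be modelled in miniature. This is an illustrative sketch, not Spark's `AdaptiveSparkPlanExec` code; the field names are assumptions:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Exchange:
    """Toy stand-in for a physical Exchange node."""
    canonicalized: str        # canonical form of the child plan
    supports_columnar: bool   # may differ after a plugin replaces the node

def can_reuse_by_plan_only(a, b):
    # Behavior described in the issue: reuse is keyed only on the
    # canonicalized plan, so two semantically equal exchanges match.
    return a.canonicalized == b.canonicalized

def can_reuse(a, b):
    # The proposed fix: a columnar/row mismatch must also block reuse,
    # because the consumers expect different batch formats.
    return (a.canonicalized == b.canonicalized
            and a.supports_columnar == b.supports_columnar)
```

With a plugin that rewrote one exchange to columnar output and left the other row-based, the plan-only check would wrongly allow reuse while the stricter check rejects it.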
[jira] [Commented] (SPARK-34720) Incorrect star expansion logic MERGE INSERT * / UPDATE *
[ https://issues.apache.org/jira/browse/SPARK-34720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322285#comment-17322285 ] Apache Spark commented on SPARK-34720: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/32192 > Incorrect star expansion logic MERGE INSERT * / UPDATE * > > > Key: SPARK-34720 > URL: https://issues.apache.org/jira/browse/SPARK-34720 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2, 3.1.1 >Reporter: Tathagata Das >Priority: Major > > The natural expectation of INSERT * or UPDATE * in MERGE is to assign source > columns to target columns of the same name. However, the current logic here > generates the assignment by position. > https://github.com/apache/spark/commit/7cfd589868b8430bc79e28e4d547008b222781a5#diff-ed19f376a63eba52eea59ca71f3355d4495fad4fad4db9a3324aade0d4986a47R1214 > This can very easily lead to incorrect results without the user realizing > this odd behavior. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
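The by-position versus by-name distinction at the heart of SPARK-34720 can be shown in miniature. An illustrative sketch, not Spark's analyzer code:

```python
def assign_by_position(target_cols, source_cols):
    """Current behavior described in the issue: pair target and source
    columns by ordinal position, which silently produces wrong
    assignments when the two column orders differ."""
    return list(zip(target_cols, source_cols))

def assign_by_name(target_cols, source_cols):
    """Expected behavior for INSERT * / UPDATE *: assign each target
    column the source column of the same name."""
    return [(t, t) for t in target_cols if t in source_cols]
```

When source and target happen to list the same columns in the same order the two strategies agree, which is exactly why the positional behavior can go unnoticed until the orders diverge.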
[jira] [Assigned] (SPARK-34720) Incorrect star expansion logic MERGE INSERT * / UPDATE *
[ https://issues.apache.org/jira/browse/SPARK-34720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34720: Assignee: Apache Spark > Incorrect star expansion logic MERGE INSERT * / UPDATE * > > > Key: SPARK-34720 > URL: https://issues.apache.org/jira/browse/SPARK-34720 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2, 3.1.1 >Reporter: Tathagata Das >Assignee: Apache Spark >Priority: Major > > The natural expectation of INSERT * or UPDATE * in MERGE is to assign source > columns to target columns of the same name. However, the current logic here > generates the assignment by position. > https://github.com/apache/spark/commit/7cfd589868b8430bc79e28e4d547008b222781a5#diff-ed19f376a63eba52eea59ca71f3355d4495fad4fad4db9a3324aade0d4986a47R1214 > This can very easily lead to incorrect results without the user realizing > this odd behavior. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34720) Incorrect star expansion logic MERGE INSERT * / UPDATE *
[ https://issues.apache.org/jira/browse/SPARK-34720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34720: Assignee: (was: Apache Spark) > Incorrect star expansion logic MERGE INSERT * / UPDATE * > > > Key: SPARK-34720 > URL: https://issues.apache.org/jira/browse/SPARK-34720 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2, 3.1.1 >Reporter: Tathagata Das >Priority: Major > > The natural expectation of INSERT * or UPDATE * in MERGE is to assign source > columns to target columns of the same name. However, the current logic here > generates the assignment by position. > https://github.com/apache/spark/commit/7cfd589868b8430bc79e28e4d547008b222781a5#diff-ed19f376a63eba52eea59ca71f3355d4495fad4fad4db9a3324aade0d4986a47R1214 > This can very easily lead to incorrect results without the user realizing > this odd behavior. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35092) the auto-generated rdd's name in the storage tab should be truncated if it is too long.
[ https://issues.apache.org/jira/browse/SPARK-35092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] akiyamaneko updated SPARK-35092: Attachment: the rdd title in storage page shows too long.png > the auto-generated rdd's name in the storage tab should be truncated if it is > too long. > --- > > Key: SPARK-35092 > URL: https://issues.apache.org/jira/browse/SPARK-35092 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.0.1 >Reporter: akiyamaneko >Priority: Trivial > Attachments: the rdd title in storage page shows too long.png > > > the auto-generated rdd's name in the storage tab should be truncated if it is > too long. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35092) the auto-generated rdd's name in the storage tab should be truncated if it is too long.
[ https://issues.apache.org/jira/browse/SPARK-35092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] akiyamaneko updated SPARK-35092: Attachment: (was: the rdd title in storage page shows too long.png) > the auto-generated rdd's name in the storage tab should be truncated if it is > too long. > --- > > Key: SPARK-35092 > URL: https://issues.apache.org/jira/browse/SPARK-35092 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.0.1 >Reporter: akiyamaneko >Priority: Trivial > Attachments: the rdd title in storage page shows too long.png > > > the auto-generated rdd's name in the storage tab should be truncated if it is > too long. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35092) the auto-generated rdd's name in the storage tab should be truncated if it is too long.
akiyamaneko created SPARK-35092: --- Summary: the auto-generated rdd's name in the storage tab should be truncated if it is too long. Key: SPARK-35092 URL: https://issues.apache.org/jira/browse/SPARK-35092 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 3.0.1 Reporter: akiyamaneko Attachments: the rdd title in storage page shows too long.png the auto-generated rdd's name in the storage tab should be truncated if it is too long. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
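The truncation SPARK-35092 asks for is a one-liner; a minimal sketch follows. The 100-character limit and the `...` marker are assumptions for illustration, not values from the Spark UI:

```python
def truncate_name(name, max_len=100):
    """Cap an auto-generated RDD name for display, marking the cut with
    an ellipsis so the user can tell the name was shortened (the full
    name could still be exposed via a tooltip)."""
    if len(name) <= max_len:
        return name
    return name[: max_len - 3] + "..."
```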
[jira] [Updated] (SPARK-35091) Support ANSI intervals by date_part()
[ https://issues.apache.org/jira/browse/SPARK-35091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk updated SPARK-35091: - Description: Support year-month and day-time intervals by date_part(). For example: {code:sql} > SELECT date_part('days', interval '5 0:0:0' day to second); 5 > SELECT date_part('seconds', interval '10 11:12:30.001001' DAY TO SECOND); 30.001001 {code} was: Support field extraction from year-month and day-time intervals. For example: {code:sql} > SELECT extract(days FROM interval '5 0:0:0' day to second); 5 > SELECT extract(seconds FROM interval '10 11:12:30.001001' DAY TO SECOND); 30.001001 {code} > Support ANSI intervals by date_part() > - > > Key: SPARK-35091 > URL: https://issues.apache.org/jira/browse/SPARK-35091 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Max Gekk >Priority: Major > > Support year-month and day-time intervals by date_part(). For example: > {code:sql} > > SELECT date_part('days', interval '5 0:0:0' day to second); >5 > > SELECT date_part('seconds', interval '10 11:12:30.001001' DAY TO > SECOND); >30.001001 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35091) Support ANSI intervals by date_part()
Max Gekk created SPARK-35091: Summary: Support ANSI intervals by date_part() Key: SPARK-35091 URL: https://issues.apache.org/jira/browse/SPARK-35091 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: Max Gekk Support field extraction from year-month and day-time intervals. For example: {code:sql} > SELECT extract(days FROM interval '5 0:0:0' day to second); 5 > SELECT extract(seconds FROM interval '10 11:12:30.001001' DAY TO SECOND); 30.001001 {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35087) Some columns in table ` Aggregated Metrics by Executor` of stage-detail page shows incorrectly.
[ https://issues.apache.org/jira/browse/SPARK-35087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322146#comment-17322146 ] Apache Spark commented on SPARK-35087: -- User 'kyoty' has created a pull request for this issue: https://github.com/apache/spark/pull/32190 > Some columns in table ` Aggregated Metrics by Executor` of stage-detail page > shows incorrectly. > > > Key: SPARK-35087 > URL: https://issues.apache.org/jira/browse/SPARK-35087 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.0.1 > Environment: spark version: 3.1.1 >Reporter: akiyamaneko >Priority: Minor > Attachments: sort-result-incorrent.png > > > Some columns like 'Shuffle Read Size / Records', 'Output Size / Records' etc. > in the `Aggregated Metrics by Executor` table of the stage-detail page should be > sorted in numerical order instead of lexicographical order. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
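Numeric sorting of such cells means parsing the rendered size back into bytes before comparing. A sketch of the idea; the cell format ("64.5 KiB / 1000") and unit names are assumptions based on how the UI renders these columns:

```python
# Binary units as commonly rendered in the Spark UI (assumption).
UNITS = {"B": 1, "KiB": 1 << 10, "MiB": 1 << 20, "GiB": 1 << 30}

def size_to_bytes(cell):
    """Parse the size half of a 'Size / Records' cell back to bytes so
    cells can be compared numerically."""
    size_part = cell.split("/")[0].strip()   # e.g. "64.5 KiB"
    number, unit = size_part.split()
    return float(number) * UNITS[unit]

def sort_cells(cells):
    # Numeric order, not the string (lexicographic) order the issue reports.
    return sorted(cells, key=size_to_bytes)
```

Lexicographic order would rank "1.0 GiB / 2" before "64.5 KiB / 1000" because "1" sorts before "6"; the parsed comparison puts the gigabyte value last where it belongs.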
[jira] [Created] (SPARK-35090) Extract a field from ANSI interval
Max Gekk created SPARK-35090: Summary: Extract a field from ANSI interval Key: SPARK-35090 URL: https://issues.apache.org/jira/browse/SPARK-35090 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: Max Gekk Support field extraction from year-month and day-time intervals. For example: {code:sql} > SELECT extract(days FROM interval '5 0:0:0' day to second); 5 > SELECT extract(seconds FROM interval '10 11:12:30.001001' DAY TO SECOND); 30.001001 {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34787) Option variable in Spark historyServer log should be displayed as actual value instead of Some(XX)
[ https://issues.apache.org/jira/browse/SPARK-34787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322138#comment-17322138 ] Apache Spark commented on SPARK-34787: -- User 'kyoty' has created a pull request for this issue: https://github.com/apache/spark/pull/32189 > Option variable in Spark historyServer log should be displayed as actual > value instead of Some(XX) > -- > > Key: SPARK-34787 > URL: https://issues.apache.org/jira/browse/SPARK-34787 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: akiyamaneko >Priority: Minor > > Option variable in Spark historyServer log should be displayed as actual > value instead of Some(XX): > {code:html} > 21/02/25 10:10:45 INFO ApplicationCache: Failed to load application attempt > application_1613641231234_0421/Some(1) 21/02/25 10:10:52 INFO > FsHistoryProvider: Parsing > hdfs://graph-product-001:8020/system/spark2-history/application_1613641231234_0421 > for listing data... > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
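The fix SPARK-34787 asks for amounts to unwrapping the optional value before it reaches the log line (in Scala, something like `attemptId.getOrElse(...)` rather than interpolating the `Option` itself). A sketch of the intended rendering in plain Python, where `None` stands in for Scala's `None` and a string for `Some(...)`; the function name is illustrative:

```python
def attempt_label(app_id, attempt_id):
    """Render an application/attempt pair for logging as
    'application_xxx/1' rather than 'application_xxx/Some(1)'.
    attempt_id is None or a string, mirroring Scala's Option."""
    if attempt_id is None:
        return app_id
    return f"{app_id}/{attempt_id}"
```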
[jira] [Assigned] (SPARK-35087) Some columns in table `Aggregated Metrics by Executor` of stage-detail page show incorrectly.
[ https://issues.apache.org/jira/browse/SPARK-35087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35087: Assignee: Apache Spark > Some columns in table `Aggregated Metrics by Executor` of stage-detail page > show incorrectly. > > > Key: SPARK-35087 > URL: https://issues.apache.org/jira/browse/SPARK-35087 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.0.1 > Environment: spark version: 3.1.1 >Reporter: akiyamaneko >Assignee: Apache Spark >Priority: Minor > Attachments: sort-result-incorrent.png > > > Some columns like 'Shuffle Read Size / Records', 'Output Size / Records' etc. > in table `Aggregated Metrics by Executor` of stage-detail page should be > sorted in numerical order instead of lexicographical order. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
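The sorting defect is easy to reproduce outside the UI: formatted size strings compare lexicographically, so "10.0 MB" sorts before "9.0 MB". A small standalone illustration (the unit-parsing helper is ad hoc, not Spark's UI code):

```scala
// Illustrates the reported bug: size strings sorted as text put
// "9.0 MB" after "10.0 MB". A size column must sort on the numeric value.
val sizes = Seq("10.0 MB", "9.0 MB", "128.0 KB")

// Lexicographic: "10.0 MB" < "128.0 KB" < "9.0 MB" -- wrong for sizes.
val lexicographic = sizes.sorted

// Numeric: parse value and unit, sort on the byte count.
val numeric = sizes.sortBy { s =>
  val Array(value, unit) = s.split(" ")
  val factor = unit match {
    case "KB" => 1L << 10
    case "MB" => 1L << 20
    case "GB" => 1L << 30
    case _    => 1L
  }
  (value.toDouble * factor).toLong
}
// Numeric order: "128.0 KB" < "9.0 MB" < "10.0 MB"
```

On the UI side the usual fix is the same idea: sort the column on the underlying byte/record counts and only format the human-readable string for display.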
[jira] [Assigned] (SPARK-34995) Port/integrate Koalas remaining codes into PySpark
[ https://issues.apache.org/jira/browse/SPARK-34995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34995: Assignee: (was: Apache Spark) > Port/integrate Koalas remaining codes into PySpark > -- > > Key: SPARK-34995 > URL: https://issues.apache.org/jira/browse/SPARK-34995 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Takuya Ueshin >Priority: Major > > There are some more commits remaining after the main codes were ported. > - > [https://github.com/databricks/koalas/commit/c8f803d6becb3accd767afdb3774c8656d0d0b47] > - > [https://github.com/databricks/koalas/commit/913d68868d38ee7158c640aceb837484f417267e] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35087) Some columns in table `Aggregated Metrics by Executor` of stage-detail page show incorrectly.
[ https://issues.apache.org/jira/browse/SPARK-35087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322135#comment-17322135 ] Apache Spark commented on SPARK-35087: -- User 'kyoty' has created a pull request for this issue: https://github.com/apache/spark/pull/32187 > Some columns in table `Aggregated Metrics by Executor` of stage-detail page > show incorrectly. > > > Key: SPARK-35087 > URL: https://issues.apache.org/jira/browse/SPARK-35087 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.0.1 > Environment: spark version: 3.1.1 >Reporter: akiyamaneko >Priority: Minor > Attachments: sort-result-incorrent.png > > > Some columns like 'Shuffle Read Size / Records', 'Output Size / Records' etc. > in table `Aggregated Metrics by Executor` of stage-detail page should be > sorted in numerical order instead of lexicographical order. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34995) Port/integrate Koalas remaining codes into PySpark
[ https://issues.apache.org/jira/browse/SPARK-34995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34995: Assignee: Apache Spark > Port/integrate Koalas remaining codes into PySpark > -- > > Key: SPARK-34995 > URL: https://issues.apache.org/jira/browse/SPARK-34995 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Takuya Ueshin >Assignee: Apache Spark >Priority: Major > > There are some more commits remaining after the main codes were ported. > - > [https://github.com/databricks/koalas/commit/c8f803d6becb3accd767afdb3774c8656d0d0b47] > - > [https://github.com/databricks/koalas/commit/913d68868d38ee7158c640aceb837484f417267e] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35087) Some columns in table `Aggregated Metrics by Executor` of stage-detail page show incorrectly.
[ https://issues.apache.org/jira/browse/SPARK-35087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35087: Assignee: (was: Apache Spark) > Some columns in table `Aggregated Metrics by Executor` of stage-detail page > show incorrectly. > > > Key: SPARK-35087 > URL: https://issues.apache.org/jira/browse/SPARK-35087 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.0.1 > Environment: spark version: 3.1.1 >Reporter: akiyamaneko >Priority: Minor > Attachments: sort-result-incorrent.png > > > Some columns like 'Shuffle Read Size / Records', 'Output Size / Records' etc. > in table `Aggregated Metrics by Executor` of stage-detail page should be > sorted in numerical order instead of lexicographical order. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35089) Inconsistent count results for the same dataset after filter and lead window function
Domagoj created SPARK-35089: --- Summary: Inconsistent count results for the same dataset after filter and lead window function Key: SPARK-35089 URL: https://issues.apache.org/jira/browse/SPARK-35089 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.1.1, 3.0.1 Reporter: Domagoj I have found an inconsistency in count results after a lead window function and a filter. I have a dataframe (a simplified version, but enough to reproduce) with millions of records, with these columns:
* df1:
** start (timestamp)
** user_id (int)
** type (string)

I need to define the duration between two rows and filter on that duration and type. I used the lead window function to get the next event time (which defines the end of the current event), so every row now gets start and stop times. If NULL (the last row, for example), the next midnight is added as the stop. Data is stored in an ORC file (tried Parquet format, no difference). This only happens with multiple cluster nodes, for example an AWS EMR cluster or a local docker cluster setup. If I run it on a single instance (locally on a laptop), I get consistent results every time. The Spark version is 3.0.1, both in AWS and in the local docker setup. Here is some simple code you can use to reproduce it; I've used a JupyterLab notebook on AWS EMR.
{code:java}
import org.apache.spark.sql.expressions.Window

// This dataframe generation code should be executed only once; the data has
// to be saved and then re-opened from disk, so it's always the same.
val getRandomUser = udf(() => {
  val users = Seq("John", "Eve", "Anna", "Martin", "Joe", "Steve", "Katy")
  users(scala.util.Random.nextInt(7))
})
val getRandomType = udf(() => {
  val types = Seq("TypeA", "TypeB", "TypeC", "TypeD", "TypeE")
  types(scala.util.Random.nextInt(5))
})
val getRandomStart = udf((x: Int) => {
  x + scala.util.Random.nextInt(47)
})

// A for loop is used to avoid an out-of-memory error during creation of the dataframe
for (a <- 0 to 23) {
  // use iterator a to continue with next million, repeat 1 mil times
  val x = Range(a * 100, (a * 100) + 100).toDF("id").
    withColumn("start", getRandomStart(col("id"))).
    withColumn("user", getRandomUser()).
    withColumn("type", getRandomType()).
    drop("id")
  x.write.mode("append").orc("hdfs:///random.orc")
}
// The code above should be run only once; I used a cell in Jupyter.

// define window and lead
val w = Window.partitionBy("user").orderBy("start")

// if null, replace with 30.000.000
val ts_lead = coalesce(lead("start", 1).over(w), lit(3000))

// read data to a dataframe, create the stop column and calculate duration
val fox2 = spark.read.orc("hdfs:///random.orc").
  withColumn("end", ts_lead).
  withColumn("duration", col("end") - col("start"))

// Repeated executions of this line return different results for count.
// I have it in a separate cell in JupyterLab.
fox2.where("type='TypeA' and duration>4").count()
{code}
My results for three consecutive runs of the last line were:
* run 1: 2551259
* run 2: 2550756
* run 3: 2551279

It's very important to say that if I use the filter fox2.where("type='TypeA'") or fox2.where("duration>4"), each of them can be executed repeatedly and I get a consistent result every time. I can save the dataframe after creating the stop and duration columns, and after that I get consistent results every time. It is not a very practical workaround, as I need a lot of space and time to implement it. This dataset is really big (in my eyes at least, approx. 100,000,000 new records per day).
If I run this same example on my local machine using master = local[*], everything works as expected; it only happens with a cluster setup. I tried to create a cluster using docker on my local machine, created 3.0.1 and 3.1.1 clusters with one master and two workers, and successfully reproduced the issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
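One plausible explanation, offered here as an unconfirmed hypothesis: the generated `start` values are small random integers, so ties within a `user` partition are common, and `lead` over an ordering with ties is free to pick a different "next" row on each distributed execution. A sketch of a total, tie-broken ordering (the `row_id` column is hypothetical and would have to be persisted at generation time):

```scala
// Sketch, not a confirmed fix: make the window ordering total so that
// lead() is deterministic even when two events share the same "start".
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// At generation time, persist a stable tie-breaker once:
//   x.withColumn("row_id", monotonically_increasing_id())
//    .write.mode("append").orc("hdfs:///random.orc")

val w = Window.partitionBy("user").orderBy(col("start"), col("row_id"))
val ts_lead = coalesce(lead("start", 1).over(w), lit(3000))

val fox2 = spark.read.orc("hdfs:///random.orc").
  withColumn("end", ts_lead).
  withColumn("duration", col("end") - col("start"))

// With a total order, repeated counts should agree across runs.
fox2.where("type='TypeA' and duration>4").count()
```

Note that `monotonically_increasing_id` is itself partitioning-dependent when computed, but once written to the ORC file it becomes a stable column, which is all the tie-breaker needs.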
[jira] [Created] (SPARK-35088) Accept ANSI intervals by the Sequence expression
Max Gekk created SPARK-35088: Summary: Accept ANSI intervals by the Sequence expression Key: SPARK-35088 URL: https://issues.apache.org/jira/browse/SPARK-35088 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: Max Gekk Currently, the expression accepts only CalendarIntervalType as the step expression. It should support ANSI intervals as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
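The behavior this sub-task proposes can be sketched as follows; this is the anticipated shape of a call, not current Spark semantics at the time of the report, and the exact results are an assumption:

```scala
// Hedged sketch: sequence() currently accepts only CalendarIntervalType as
// the step; this sub-task proposes also accepting ANSI year-month and
// day-time intervals as steps.
spark.sql("""
  SELECT sequence(
    DATE'2021-01-01',
    DATE'2021-04-01',
    INTERVAL '1' MONTH   -- ANSI year-month interval as the step
  ) AS months
""").show(false)
// Anticipated once supported: [2021-01-01, 2021-02-01, 2021-03-01, 2021-04-01]
```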
[jira] [Updated] (SPARK-34995) Port/integrate Koalas remaining codes into PySpark
[ https://issues.apache.org/jira/browse/SPARK-34995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-34995: - Fix Version/s: (was: 3.2.0) > Port/integrate Koalas remaining codes into PySpark > -- > > Key: SPARK-34995 > URL: https://issues.apache.org/jira/browse/SPARK-34995 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Takuya Ueshin >Priority: Major > > There are some more commits remaining after the main codes were ported. > - > [https://github.com/databricks/koalas/commit/c8f803d6becb3accd767afdb3774c8656d0d0b47] > - > [https://github.com/databricks/koalas/commit/913d68868d38ee7158c640aceb837484f417267e] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-34995) Port/integrate Koalas remaining codes into PySpark
[ https://issues.apache.org/jira/browse/SPARK-34995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reopened SPARK-34995: -- Assignee: (was: Haejoon Lee) Reverted at https://github.com/apache/spark/commit/637f59360b3b10c7200daa29b15c80ce9b710850 > Port/integrate Koalas remaining codes into PySpark > -- > > Key: SPARK-34995 > URL: https://issues.apache.org/jira/browse/SPARK-34995 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Takuya Ueshin >Priority: Major > Fix For: 3.2.0 > > > There are some more commits remaining after the main codes were ported. > - > [https://github.com/databricks/koalas/commit/c8f803d6becb3accd767afdb3774c8656d0d0b47] > - > [https://github.com/databricks/koalas/commit/913d68868d38ee7158c640aceb837484f417267e] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35087) Some columns in table `Aggregated Metrics by Executor` of stage-detail page show incorrectly.
akiyamaneko created SPARK-35087: --- Summary: Some columns in table `Aggregated Metrics by Executor` of stage-detail page show incorrectly. Key: SPARK-35087 URL: https://issues.apache.org/jira/browse/SPARK-35087 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 3.0.1 Environment: spark version: 3.1.1 Reporter: akiyamaneko Attachments: sort-result-incorrent.png Some columns like 'Shuffle Read Size / Records', 'Output Size / Records' etc. in table `Aggregated Metrics by Executor` of stage-detail page should be sorted in numerical order instead of lexicographical order. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35087) Some columns in table `Aggregated Metrics by Executor` of stage-detail page show incorrectly.
[ https://issues.apache.org/jira/browse/SPARK-35087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] akiyamaneko updated SPARK-35087: Attachment: sort-result-incorrent.png > Some columns in table `Aggregated Metrics by Executor` of stage-detail page > show incorrectly. > > > Key: SPARK-35087 > URL: https://issues.apache.org/jira/browse/SPARK-35087 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.0.1 > Environment: spark version: 3.1.1 >Reporter: akiyamaneko >Priority: Minor > Attachments: sort-result-incorrent.png > > > Some columns like 'Shuffle Read Size / Records', 'Output Size / Records' etc. > in table `Aggregated Metrics by Executor` of stage-detail page should be > sorted in numerical order instead of lexicographical order. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-34995) Port/integrate Koalas remaining codes into PySpark
[ https://issues.apache.org/jira/browse/SPARK-34995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-34995. -- Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 32154 [https://github.com/apache/spark/pull/32154] > Port/integrate Koalas remaining codes into PySpark > -- > > Key: SPARK-34995 > URL: https://issues.apache.org/jira/browse/SPARK-34995 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Takuya Ueshin >Assignee: Haejoon Lee >Priority: Major > Fix For: 3.2.0 > > > There are some more commits remaining after the main codes were ported. > - > [https://github.com/databricks/koalas/commit/c8f803d6becb3accd767afdb3774c8656d0d0b47] > - > [https://github.com/databricks/koalas/commit/913d68868d38ee7158c640aceb837484f417267e] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34995) Port/integrate Koalas remaining codes into PySpark
[ https://issues.apache.org/jira/browse/SPARK-34995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-34995: Assignee: Haejoon Lee > Port/integrate Koalas remaining codes into PySpark > -- > > Key: SPARK-34995 > URL: https://issues.apache.org/jira/browse/SPARK-34995 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Takuya Ueshin >Assignee: Haejoon Lee >Priority: Major > > There are some more commits remaining after the main codes were ported. > - > [https://github.com/databricks/koalas/commit/c8f803d6becb3accd767afdb3774c8656d0d0b47] > - > [https://github.com/databricks/koalas/commit/913d68868d38ee7158c640aceb837484f417267e] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34792) Restore previous behaviour of randomSplit from spark-2.4.7 in spark-3
[ https://issues.apache.org/jira/browse/SPARK-34792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322050#comment-17322050 ] kondziolka9ld commented on SPARK-34792: --- I found the root cause of this difference: it is related to the https://issues.apache.org/jira/browse/SPARK-23643 change. Obviously, the input partitioning of the dataframe has an impact on the results as well. When we ensure that the input partitioning on spark-2.4.7 and spark-3 is the same, and when we revert the change in `core/src/main/scala/org/apache/spark/util/random/XORShiftRandom.scala` from [https://github.com/apache/spark/pull/20793/|https://github.com/apache/spark/pull/20793/f], we get consistent results between both versions. > Restore previous behaviour of randomSplit from spark-2.4.7 in spark-3 > - > > Key: SPARK-34792 > URL: https://issues.apache.org/jira/browse/SPARK-34792 > Project: Spark > Issue Type: Question > Components: Spark Core, SQL >Affects Versions: 3.0.1 >Reporter: kondziolka9ld >Priority: Major > > Hi, > Please consider the following difference in the `randomSplit` method, despite > using the same seed. > > {code:java}
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 2.4.7
>       /_/
>
> Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_282)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> val Array(f, s) = Seq(1,2,3,4,5,6,7,8,9,10).toDF.randomSplit(Array(0.3, 0.7), 42)
> f: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int]
> s: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int]
>
> scala> f.show
> +-----+
> |value|
> +-----+
> |    4|
> +-----+
>
> scala> s.show
> +-----+
> |value|
> +-----+
> |    1|
> |    2|
> |    3|
> |    5|
> |    6|
> |    7|
> |    8|
> |    9|
> |   10|
> +-----+
> {code}
> whereas on spark-3
> {code:java}
> scala> val Array(f, s) = Seq(1,2,3,4,5,6,7,8,9,10).toDF.randomSplit(Array(0.3, 0.7), 42)
> f: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int]
> s: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int]
>
> scala> f.show
> +-----+
> |value|
> +-----+
> |    5|
> |   10|
> +-----+
>
> scala> s.show
> +-----+
> |value|
> +-----+
> |    1|
> |    2|
> |    3|
> |    4|
> |    6|
> |    7|
> |    8|
> |    9|
> +-----+
> {code}
> I guess that the implementation of the `sample` method changed.
> Is it possible to restore the previous behaviour?
> Thanks in advance! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
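The mechanics behind the root-cause comment can be sketched as follows: `randomSplit` draws Bernoulli samples from an RNG seeded per partition (SPARK-23643 changed how `XORShiftRandom` hashes that seed), so the same seed only reproduces the same split when the physical partitioning of the input is identical. A hedged sketch using the public DataFrame API:

```scala
// Sketch: the split depends on the seed AND on the physical partitioning
// of the input, because each partition gets its own seeded RNG stream.
val df = Seq(1, 2, 3, 4, 5, 6, 7, 8, 9, 10).toDF("value")

// Same seed, same partitioning: reproducible within one Spark version.
val Array(a1, _) = df.coalesce(1).randomSplit(Array(0.3, 0.7), seed = 42)
val Array(a2, _) = df.coalesce(1).randomSplit(Array(0.3, 0.7), seed = 42)
// a1 and a2 select the same rows.

// Different partitioning: the same seed can select different rows.
val Array(b1, _) = df.repartition(4).randomSplit(Array(0.3, 0.7), seed = 42)
```

This is why matching the 2.4.7 behavior on 3.x requires both matching the input partitioning and reverting the seed-hashing change, as the comment observes.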
[jira] [Commented] (SPARK-34843) JDBCRelation columnPartition function improperly determines stride size. Upper bound is skewed due to stride alignment.
[ https://issues.apache.org/jira/browse/SPARK-34843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17321997#comment-17321997 ] Apache Spark commented on SPARK-34843: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/32186 > JDBCRelation columnPartition function improperly determines stride size. > Upper bound is skewed due to stride alignment. > --- > > Key: SPARK-34843 > URL: https://issues.apache.org/jira/browse/SPARK-34843 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Jason Yarbrough >Assignee: Jason Yarbrough >Priority: Minor > Fix For: 3.2.0 > > Attachments: SPARK-34843.patch > > > Currently, in JDBCRelation (line 123), the stride size is calculated as > follows: > val stride: Long = upperBound / numPartitions - lowerBound / numPartitions > > Due to truncation happening on both divisions, the stride size can fall short > of what it should be. This can lead to a big difference between the provided > upper bound and the actual start of the last partition. > I propose this formula, as it is much more accurate and leads to better > distribution: > val stride = (upperBound / numPartitions.toFloat - lowerBound / > numPartitions.toFloat).toLong > > An example (using a date column): > Say you're creating 1,000 partitions. If you provide a lower bound of > 1927-04-05 (this gets translated to -15611), and an upper bound of 2020-10-27 > (translated to 18563), Spark determines the stride size as follows: > > (18563L / 1000L) - (-15611 / 1000L) = 33 > Starting from the lower bound, doing strides of 33, you'll end up at > 2017-07-08. This is over 3 years of extra data that will go into the last > partition, and depending on the shape of the data could cause a very long > running task at the end of a job. 
> > Using the formula I'm proposing, you'd get: > ((18563L / 1000F) - (-15611 / 1000F)).toLong = 34 > This would put the upper bound at 2020-04-02, which is much closer to the > original supplied upper bound. This is the best you can do to get as close as > possible to the upper bound (without adjusting the number of partitions). For > example, a stride size of 35 would go well past the supplied upper bound > (over 2 years, 2022-11-22). > > In the above example, there is only a difference of 1 between the stride size > using the current formula and the stride size using the proposed formula, but > with greater distance between the lower and upper bounds, or a lower number > of partitions, the difference can be much greater. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
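The arithmetic in the report can be checked directly; this standalone snippet (plain Scala, no Spark required) reproduces both formulas:

```scala
// Reproduces the stride arithmetic from the report: integer division
// truncates both terms, while dividing as floats before truncating
// keeps the fractional parts.
val lowerBound = -15611L   // days since epoch for 1927-04-05
val upperBound = 18563L    // days since epoch for 2020-10-27
val numPartitions = 1000

// Current formula: truncation happens twice.
val currentStride = upperBound / numPartitions - lowerBound / numPartitions
// 18 - (-15) = 33

// Proposed formula: divide in floating point, truncate once.
val proposedStride =
  (upperBound / numPartitions.toFloat - lowerBound / numPartitions.toFloat).toLong
// (18.563 - (-15.611)).toLong = 34
```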
[jira] [Commented] (SPARK-34843) JDBCRelation columnPartition function improperly determines stride size. Upper bound is skewed due to stride alignment.
[ https://issues.apache.org/jira/browse/SPARK-34843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17321996#comment-17321996 ] Apache Spark commented on SPARK-34843: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/32186 > JDBCRelation columnPartition function improperly determines stride size. > Upper bound is skewed due to stride alignment. > --- > > Key: SPARK-34843 > URL: https://issues.apache.org/jira/browse/SPARK-34843 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Jason Yarbrough >Assignee: Jason Yarbrough >Priority: Minor > Fix For: 3.2.0 > > Attachments: SPARK-34843.patch > > > Currently, in JDBCRelation (line 123), the stride size is calculated as > follows: > val stride: Long = upperBound / numPartitions - lowerBound / numPartitions > > Due to truncation happening on both divisions, the stride size can fall short > of what it should be. This can lead to a big difference between the provided > upper bound and the actual start of the last partition. > I propose this formula, as it is much more accurate and leads to better > distribution: > val stride = (upperBound / numPartitions.toFloat - lowerBound / > numPartitions.toFloat).toLong > > An example (using a date column): > Say you're creating 1,000 partitions. If you provide a lower bound of > 1927-04-05 (this gets translated to -15611), and an upper bound of 2020-10-27 > (translated to 18563), Spark determines the stride size as follows: > > (18563L / 1000L) - (-15611 / 1000L) = 33 > Starting from the lower bound, doing strides of 33, you'll end up at > 2017-07-08. This is over 3 years of extra data that will go into the last > partition, and depending on the shape of the data could cause a very long > running task at the end of a job. 
> > Using the formula I'm proposing, you'd get: > ((18563L / 1000F) - (-15611 / 1000F)).toLong = 34 > This would put the upper bound at 2020-04-02, which is much closer to the > original supplied upper bound. This is the best you can do to get as close as > possible to the upper bound (without adjusting the number of partitions). For > example, a stride size of 35 would go well past the supplied upper bound > (over 2 years, 2022-11-22). > > In the above example, there is only a difference of 1 between the stride size > using the current formula and the stride size using the proposed formula, but > with greater distance between the lower and upper bounds, or a lower number > of partitions, the difference can be much greater. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org