[jira] [Assigned] (SPARK-35104) Fix ugly indentation of multiple JSON records in a single split file generated by JacksonGenerator when pretty option is true

2021-04-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35104:


Assignee: Kousuke Saruta  (was: Apache Spark)

> Fix ugly indentation of multiple JSON records in a single split file 
> generated by JacksonGenerator when pretty option is true
> -
>
> Key: SPARK-35104
> URL: https://issues.apache.org/jira/browse/SPARK-35104
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.1, 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> When writing multiple JSON records into a single split file with the pretty 
> option set to true, the indentation is broken for every record except the first.
> {code:java}
> // Run in the Spark Shell.
> // Set spark.sql.leafNodeDefaultParallelism to 1 for the current master.
> // Or set spark.default.parallelism for the previous releases.
> spark.conf.set("spark.sql.leafNodeDefaultParallelism", 1)
> val df = Seq("a", "b", "c").toDF
> df.write.option("pretty", "true").json("/path/to/output")
> # Run in a Shell
> $ cat /path/to/output/*.json
> {
>   "value" : "a"
> }
>  {
>   "value" : "b"
> }
>  {
>   "value" : "c"
> }
> {code}
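The stray leading space before the second and third records matches Jackson's 
default root value separator (a single space) rather than anything Spark adds 
on purpose. Below is a minimal sketch with plain Jackson that reproduces the 
intended layout under that assumption, by clearing the root separator and 
writing an explicit newline after each record; it illustrates the cause and is 
not necessarily the actual fix in JacksonGenerator.
{code:scala}
import java.io.StringWriter
import com.fasterxml.jackson.core.JsonFactory
import com.fasterxml.jackson.core.util.DefaultPrettyPrinter

val writer = new StringWriter()
val gen = new JsonFactory().createGenerator(writer)
// The default root separator is " ", which would produce the " {" shown above.
gen.setPrettyPrinter(new DefaultPrettyPrinter().withRootSeparator(""))

Seq("a", "b", "c").foreach { v =>
  gen.writeStartObject()
  gen.writeStringField("value", v)
  gen.writeEndObject()
  gen.writeRaw('\n')  // one pretty-printed record per line
}
gen.close()
print(writer)  // every record now starts at column 0
{code}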






[jira] [Commented] (SPARK-35104) Fix ugly indentation of multiple JSON records in a single split file generated by JacksonGenerator when pretty option is true

2021-04-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322647#comment-17322647
 ] 

Apache Spark commented on SPARK-35104:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/32203

> Fix ugly indentation of multiple JSON records in a single split file 
> generated by JacksonGenerator when pretty option is true
> -
>
> Key: SPARK-35104
> URL: https://issues.apache.org/jira/browse/SPARK-35104
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.1, 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> When writing multiple JSON records into a single split file with the pretty 
> option set to true, the indentation is broken for every record except the first.
> {code:java}
> // Run in the Spark Shell.
> // Set spark.sql.leafNodeDefaultParallelism to 1 for the current master.
> // Or set spark.default.parallelism for the previous releases.
> spark.conf.set("spark.sql.leafNodeDefaultParallelism", 1)
> val df = Seq("a", "b", "c").toDF
> df.write.option("pretty", "true").json("/path/to/output")
> # Run in a Shell
> $ cat /path/to/output/*.json
> {
>   "value" : "a"
> }
>  {
>   "value" : "b"
> }
>  {
>   "value" : "c"
> }
> {code}






[jira] [Assigned] (SPARK-35104) Fix ugly indentation of multiple JSON records in a single split file generated by JacksonGenerator when pretty option is true

2021-04-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35104:


Assignee: Apache Spark  (was: Kousuke Saruta)

> Fix ugly indentation of multiple JSON records in a single split file 
> generated by JacksonGenerator when pretty option is true
> -
>
> Key: SPARK-35104
> URL: https://issues.apache.org/jira/browse/SPARK-35104
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.1, 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Apache Spark
>Priority: Minor
>
> When writing multiple JSON records into a single split file with the pretty 
> option set to true, the indentation is broken for every record except the first.
> {code:java}
> // Run in the Spark Shell.
> // Set spark.sql.leafNodeDefaultParallelism to 1 for the current master.
> // Or set spark.default.parallelism for the previous releases.
> spark.conf.set("spark.sql.leafNodeDefaultParallelism", 1)
> val df = Seq("a", "b", "c").toDF
> df.write.option("pretty", "true").json("/path/to/output")
> # Run in a Shell
> $ cat /path/to/output/*.json
> {
>   "value" : "a"
> }
>  {
>   "value" : "b"
> }
>  {
>   "value" : "c"
> }
> {code}






[jira] [Updated] (SPARK-35104) Fix ugly indentation of multiple JSON records in a single split file generated by JacksonGenerator when pretty option is true

2021-04-15 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-35104:
---
Description: 
When writing multiple JSON records into a single split file with the pretty 
option set to true, the indentation is broken for every record except the first.
{code:java}
// Run in the Spark Shell.
// Set spark.sql.leafNodeDefaultParallelism to 1 for the current master.
// Or set spark.default.parallelism for the previous releases.
spark.conf.set("spark.sql.leafNodeDefaultParallelism", 1)
val df = Seq("a", "b", "c").toDF
df.write.option("pretty", "true").json("/path/to/output")

# Run in a Shell
$ cat /path/to/output/*.json
{
  "value" : "a"
}
 {
  "value" : "b"
}
 {
  "value" : "c"
}
{code}

  was:
When writing multiple JSON records into a single split file with the pretty 
option set to true, the indentation is broken for every record except the first.
{code}
// Run in the Spark Shell.
// Set spark.sql.leafNodeDefaultParallelism to 1 for the current master.
// Or set spark.default.parallelism for the previous releases.
spark.conf.set("spark.sql.leafNodeDefaultParallelism", 1)
val df = Seq("a", "b", "c").toDF
df.write.option("pretty", "true").json("/path/to/output")

# Run in the Shell
$ cat /path/to/output/*.json
{
  "value" : "a"
}
 {
  "value" : "b"
}
 {
  "value" : "c"
}
{code}



> Fix ugly indentation of multiple JSON records in a single split file 
> generated by JacksonGenerator when pretty option is true
> -
>
> Key: SPARK-35104
> URL: https://issues.apache.org/jira/browse/SPARK-35104
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.1, 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> When writing multiple JSON records into a single split file with the pretty 
> option set to true, the indentation is broken for every record except the first.
> {code:java}
> // Run in the Spark Shell.
> // Set spark.sql.leafNodeDefaultParallelism to 1 for the current master.
> // Or set spark.default.parallelism for the previous releases.
> spark.conf.set("spark.sql.leafNodeDefaultParallelism", 1)
> val df = Seq("a", "b", "c").toDF
> df.write.option("pretty", "true").json("/path/to/output")
> # Run in a Shell
> $ cat /path/to/output/*.json
> {
>   "value" : "a"
> }
>  {
>   "value" : "b"
> }
>  {
>   "value" : "c"
> }
> {code}






[jira] [Created] (SPARK-35104) Fix ugly indentation of multiple JSON records in a single split file generated by JacksonGenerator when pretty option is true

2021-04-15 Thread Kousuke Saruta (Jira)
Kousuke Saruta created SPARK-35104:
--

 Summary: Fix ugly indentation of multiple JSON records in a single 
split file generated by JacksonGenerator when pretty option is true
 Key: SPARK-35104
 URL: https://issues.apache.org/jira/browse/SPARK-35104
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.1, 3.0.2, 3.2.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta


When writing multiple JSON records into a single split file with the pretty 
option set to true, the indentation is broken for every record except the first.
{code}
// Run in the Spark Shell.
// Set spark.sql.leafNodeDefaultParallelism to 1 for the current master.
// Or set spark.default.parallelism for the previous releases.
spark.conf.set("spark.sql.leafNodeDefaultParallelism", 1)
val df = Seq("a", "b", "c").toDF
df.write.option("pretty", "true").json("/path/to/output")

# Run in the Shell
$ cat /path/to/output/*.json
{
  "value" : "a"
}
 {
  "value" : "b"
}
 {
  "value" : "c"
}
{code}







[jira] [Commented] (SPARK-28098) Native ORC reader doesn't support subdirectories with Hive tables

2021-04-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322627#comment-17322627
 ] 

Apache Spark commented on SPARK-28098:
--

User 'FatalLin' has created a pull request for this issue:
https://github.com/apache/spark/pull/32202

> Native ORC reader doesn't support subdirectories with Hive tables
> -
>
> Key: SPARK-28098
> URL: https://issues.apache.org/jira/browse/SPARK-28098
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Douglas Drinka
>Priority: Major
>
> The Hive ORC reader supports recursive directory reads from S3.  Spark's 
> native ORC reader supports recursive directory reads, but not when used with 
> Hive.
>  
> {code:java}
> val testData = List(1,2,3,4,5)
> val dataFrame = testData.toDF()
> dataFrame
> .coalesce(1)
> .write
> .mode(SaveMode.Overwrite)
> .format("orc")
> .option("compression", "zlib")
> .save("s3://ddrinka.sparkbug/dirTest/dir1/dir2/")
> spark.sql("DROP TABLE IF EXISTS ddrinka_sparkbug.dirTest")
> spark.sql("CREATE EXTERNAL TABLE ddrinka_sparkbug.dirTest (val INT) STORED AS 
> ORC LOCATION 's3://ddrinka.sparkbug/dirTest/'")
> spark.conf.set("hive.mapred.supports.subdirectories","true")
> spark.conf.set("mapred.input.dir.recursive","true")
> spark.conf.set("mapreduce.input.fileinputformat.input.dir.recursive","true")
> spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")
> println(spark.sql("SELECT * FROM ddrinka_sparkbug.dirTest").count)
> //0
> spark.conf.set("spark.sql.hive.convertMetastoreOrc", "false")
> println(spark.sql("SELECT * FROM ddrinka_sparkbug.dirTest").count)
> //5{code}
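Until the Hive-converted code path handles nested directories, one possible 
workaround (assuming the files can be read directly rather than through the 
Hive table) is the `recursiveFileLookup` data source option of the native 
reader, sketched below with the same S3 location as above.
{code:scala}
// Workaround sketch only; it bypasses the Hive table and the
// convertMetastoreOrc path entirely.
val df = spark.read
  .option("recursiveFileLookup", "true")  // descend into dir1/dir2/... under the location
  .orc("s3://ddrinka.sparkbug/dirTest/")
println(df.count)  // 5 with the data written above
{code}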






[jira] [Assigned] (SPARK-28098) Native ORC reader doesn't support subdirectories with Hive tables

2021-04-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28098:


Assignee: (was: Apache Spark)

> Native ORC reader doesn't support subdirectories with Hive tables
> -
>
> Key: SPARK-28098
> URL: https://issues.apache.org/jira/browse/SPARK-28098
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Douglas Drinka
>Priority: Major
>
> The Hive ORC reader supports recursive directory reads from S3.  Spark's 
> native ORC reader supports recursive directory reads, but not when used with 
> Hive.
>  
> {code:java}
> val testData = List(1,2,3,4,5)
> val dataFrame = testData.toDF()
> dataFrame
> .coalesce(1)
> .write
> .mode(SaveMode.Overwrite)
> .format("orc")
> .option("compression", "zlib")
> .save("s3://ddrinka.sparkbug/dirTest/dir1/dir2/")
> spark.sql("DROP TABLE IF EXISTS ddrinka_sparkbug.dirTest")
> spark.sql("CREATE EXTERNAL TABLE ddrinka_sparkbug.dirTest (val INT) STORED AS 
> ORC LOCATION 's3://ddrinka.sparkbug/dirTest/'")
> spark.conf.set("hive.mapred.supports.subdirectories","true")
> spark.conf.set("mapred.input.dir.recursive","true")
> spark.conf.set("mapreduce.input.fileinputformat.input.dir.recursive","true")
> spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")
> println(spark.sql("SELECT * FROM ddrinka_sparkbug.dirTest").count)
> //0
> spark.conf.set("spark.sql.hive.convertMetastoreOrc", "false")
> println(spark.sql("SELECT * FROM ddrinka_sparkbug.dirTest").count)
> //5{code}






[jira] [Assigned] (SPARK-28098) Native ORC reader doesn't support subdirectories with Hive tables

2021-04-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-28098:


Assignee: Apache Spark

> Native ORC reader doesn't support subdirectories with Hive tables
> -
>
> Key: SPARK-28098
> URL: https://issues.apache.org/jira/browse/SPARK-28098
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Douglas Drinka
>Assignee: Apache Spark
>Priority: Major
>
> The Hive ORC reader supports recursive directory reads from S3.  Spark's 
> native ORC reader supports recursive directory reads, but not when used with 
> Hive.
>  
> {code:java}
> val testData = List(1,2,3,4,5)
> val dataFrame = testData.toDF()
> dataFrame
> .coalesce(1)
> .write
> .mode(SaveMode.Overwrite)
> .format("orc")
> .option("compression", "zlib")
> .save("s3://ddrinka.sparkbug/dirTest/dir1/dir2/")
> spark.sql("DROP TABLE IF EXISTS ddrinka_sparkbug.dirTest")
> spark.sql("CREATE EXTERNAL TABLE ddrinka_sparkbug.dirTest (val INT) STORED AS 
> ORC LOCATION 's3://ddrinka.sparkbug/dirTest/'")
> spark.conf.set("hive.mapred.supports.subdirectories","true")
> spark.conf.set("mapred.input.dir.recursive","true")
> spark.conf.set("mapreduce.input.fileinputformat.input.dir.recursive","true")
> spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")
> println(spark.sql("SELECT * FROM ddrinka_sparkbug.dirTest").count)
> //0
> spark.conf.set("spark.sql.hive.convertMetastoreOrc", "false")
> println(spark.sql("SELECT * FROM ddrinka_sparkbug.dirTest").count)
> //5{code}






[jira] [Commented] (SPARK-28098) Native ORC reader doesn't support subdirectories with Hive tables

2021-04-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322626#comment-17322626
 ] 

Apache Spark commented on SPARK-28098:
--

User 'FatalLin' has created a pull request for this issue:
https://github.com/apache/spark/pull/32202

> Native ORC reader doesn't support subdirectories with Hive tables
> -
>
> Key: SPARK-28098
> URL: https://issues.apache.org/jira/browse/SPARK-28098
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Douglas Drinka
>Priority: Major
>
> The Hive ORC reader supports recursive directory reads from S3.  Spark's 
> native ORC reader supports recursive directory reads, but not when used with 
> Hive.
>  
> {code:java}
> val testData = List(1,2,3,4,5)
> val dataFrame = testData.toDF()
> dataFrame
> .coalesce(1)
> .write
> .mode(SaveMode.Overwrite)
> .format("orc")
> .option("compression", "zlib")
> .save("s3://ddrinka.sparkbug/dirTest/dir1/dir2/")
> spark.sql("DROP TABLE IF EXISTS ddrinka_sparkbug.dirTest")
> spark.sql("CREATE EXTERNAL TABLE ddrinka_sparkbug.dirTest (val INT) STORED AS 
> ORC LOCATION 's3://ddrinka.sparkbug/dirTest/'")
> spark.conf.set("hive.mapred.supports.subdirectories","true")
> spark.conf.set("mapred.input.dir.recursive","true")
> spark.conf.set("mapreduce.input.fileinputformat.input.dir.recursive","true")
> spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")
> println(spark.sql("SELECT * FROM ddrinka_sparkbug.dirTest").count)
> //0
> spark.conf.set("spark.sql.hive.convertMetastoreOrc", "false")
> println(spark.sql("SELECT * FROM ddrinka_sparkbug.dirTest").count)
> //5{code}






[jira] [Created] (SPARK-35103) Improve type coercion rule performance

2021-04-15 Thread Yingyi Bu (Jira)
Yingyi Bu created SPARK-35103:
-

 Summary: Improve type coercion rule performance
 Key: SPARK-35103
 URL: https://issues.apache.org/jira/browse/SPARK-35103
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Yingyi Bu


Reduce the time spent on type coercion rules by running them together 
one-tree-node-at-a-time in a combined rule.
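A rough sketch of the idea (not the actual patch): instead of one full tree 
traversal per coercion rule, walk the tree once and apply every rule's 
node-level rewrite at each node. `rules` below is an assumed list of per-node 
partial functions, not an existing API.
{code:scala}
import org.apache.spark.sql.catalyst.expressions.Expression

// One bottom-up traversal that applies all node-level rewrites at each node,
// instead of N traversals (one per rule).
def combinedCoercion(rules: Seq[PartialFunction[Expression, Expression]])
                    (expr: Expression): Expression =
  expr.transformUp { case e =>
    rules.foldLeft(e)((current, rule) => rule.applyOrElse(current, identity[Expression]))
  }
{code}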






[jira] [Resolved] (SPARK-33731) Standardize exception types

2021-04-15 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-33731.
--
Resolution: Duplicate

> Standardize exception types
> ---
>
> Key: SPARK-33731
> URL: https://issues.apache.org/jira/browse/SPARK-33731
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> We should:
> - have a better hierarchy for exception types
> - or at least use the default type of exceptions correctly instead of just 
> throwing a plain Exception.






[jira] [Updated] (SPARK-32194) Standardize exceptions in PySpark

2021-04-15 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-32194:
-
Parent: SPARK-32195
Issue Type: Sub-task  (was: Improvement)

> Standardize exceptions in PySpark
> -
>
> Key: SPARK-32194
> URL: https://issues.apache.org/jira/browse/SPARK-32194
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Currently, PySpark throws {{Exception}} or just {{RuntimeException}} in many 
> cases. We should standardize them.






[jira] [Commented] (SPARK-35026) Support use CUBE/ROLLUP in GROUPING SETS

2021-04-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322621#comment-17322621
 ] 

Apache Spark commented on SPARK-35026:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/32201

> Support use CUBE/ROLLUP in GROUPING SETS
> 
>
> Key: SPARK-35026
> URL: https://issues.apache.org/jira/browse/SPARK-35026
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
> Attachments: screenshot-1.png
>
>
> !screenshot-1.png|width=924,height=463!






[jira] [Assigned] (SPARK-35026) Support use CUBE/ROLLUP in GROUPING SETS

2021-04-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35026:


Assignee: (was: Apache Spark)

> Support use CUBE/ROLLUP in GROUPING SETS
> 
>
> Key: SPARK-35026
> URL: https://issues.apache.org/jira/browse/SPARK-35026
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
> Attachments: screenshot-1.png
>
>
> !screenshot-1.png|width=924,height=463!






[jira] [Commented] (SPARK-35026) Support use CUBE/ROLLUP in GROUPING SETS

2021-04-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322620#comment-17322620
 ] 

Apache Spark commented on SPARK-35026:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/32201

> Support use CUBE/ROLLUP in GROUPING SETS
> 
>
> Key: SPARK-35026
> URL: https://issues.apache.org/jira/browse/SPARK-35026
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
> Attachments: screenshot-1.png
>
>
> !screenshot-1.png|width=924,height=463!






[jira] [Assigned] (SPARK-35026) Support use CUBE/ROLLUP in GROUPING SETS

2021-04-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35026:


Assignee: Apache Spark

> Support use CUBE/ROLLUP in GROUPING SETS
> 
>
> Key: SPARK-35026
> URL: https://issues.apache.org/jira/browse/SPARK-35026
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: Apache Spark
>Priority: Major
> Attachments: screenshot-1.png
>
>
> !screenshot-1.png|width=924,height=463!






[jira] [Resolved] (SPARK-35099) Convert ANSI interval literals to SQL string

2021-04-15 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-35099.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32196
[https://github.com/apache/spark/pull/32196]

> Convert ANSI interval literals to SQL string
> 
>
> Key: SPARK-35099
> URL: https://issues.apache.org/jira/browse/SPARK-35099
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.2.0
>
>
> Implement the sql() and toString() methods of Literal for ANSI interval 
> types: DayTimeIntervalType and YearMonthIntervalType.
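For illustration only (the exact formatting is up to the implementation), the 
SQL text is expected to follow the ANSI interval syntax, e.g. 
INTERVAL '1 02:03:04' DAY TO SECOND. A standalone sketch of such a formatter 
for the day-time case, assuming java.time.Duration as the external value:
{code:scala}
import java.time.Duration

// Hedged sketch, not Spark's Literal.sql: format a Duration as an ANSI
// day-time interval literal of the form INTERVAL 'd hh:mm:ss' DAY TO SECOND.
def toDayTimeSql(d: Duration): String = {
  val sign = if (d.isNegative) "-" else ""
  val totalSecs = d.abs.getSeconds
  val days = totalSecs / 86400
  val hours = totalSecs % 86400 / 3600
  val mins = totalSecs % 3600 / 60
  val secs = totalSecs % 60
  f"INTERVAL '$sign$days $hours%02d:$mins%02d:$secs%02d' DAY TO SECOND"
}

println(toDayTimeSql(Duration.parse("PT26H3M4S")))  // INTERVAL '1 02:03:04' DAY TO SECOND
{code}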






[jira] [Commented] (SPARK-35102) Make spark.sql.hive.version meaningful and not deprecated

2021-04-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322596#comment-17322596
 ] 

Apache Spark commented on SPARK-35102:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/32200

> Make spark.sql.hive.version meaningful and not deprecated
> -
>
> Key: SPARK-35102
> URL: https://issues.apache.org/jira/browse/SPARK-35102
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Kent Yao
>Priority: Minor
>
> First, let's take a look at the definition and the comment.
> {code:java}
> // A fake config which is only here for backward compatibility reasons. This 
> config has no effect
> // to Spark, just for reporting the builtin Hive version of Spark to existing 
> applications that
> // already rely on this config.
> val FAKE_HIVE_VERSION = buildConf("spark.sql.hive.version")
>   .doc(s"deprecated, please use ${HIVE_METASTORE_VERSION.key} to get the Hive 
> version in Spark.")
>   .version("1.1.1")
>   .fallbackConf(HIVE_METASTORE_VERSION)
> {code}
> It is used for reporting the built-in Hive version, but the current status is 
> unsatisfactory because it can be changed in many ways, e.g. via --conf or the 
> SET syntax.
> It has been marked as deprecated for a long time; it is hard for us to remove 
> it, and removing it is not really necessary anyway.
> On second thought, it is actually useful to keep it alongside 
> `spark.sql.hive.metastore.version`: when `spark.sql.hive.metastore.version` is 
> changed, this config can still statically report the compiled Hive version, 
> which is helpful when an error occurs in that case. So this parameter should 
> be fixed to the compiled Hive version.






[jira] [Commented] (SPARK-35102) Make spark.sql.hive.version meaningful and not deprecated

2021-04-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322595#comment-17322595
 ] 

Apache Spark commented on SPARK-35102:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/32200

> Make spark.sql.hive.version meaningful and not deprecated
> -
>
> Key: SPARK-35102
> URL: https://issues.apache.org/jira/browse/SPARK-35102
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Kent Yao
>Priority: Minor
>
> First, let's take a look at the definition and the comment.
> {code:java}
> // A fake config which is only here for backward compatibility reasons. This 
> config has no effect
> // to Spark, just for reporting the builtin Hive version of Spark to existing 
> applications that
> // already rely on this config.
> val FAKE_HIVE_VERSION = buildConf("spark.sql.hive.version")
>   .doc(s"deprecated, please use ${HIVE_METASTORE_VERSION.key} to get the Hive 
> version in Spark.")
>   .version("1.1.1")
>   .fallbackConf(HIVE_METASTORE_VERSION)
> {code}
> It is used for reporting the built-in Hive version, but the current status is 
> unsatisfactory because it can be changed in many ways, e.g. via --conf or the 
> SET syntax.
> It has been marked as deprecated for a long time; it is hard for us to remove 
> it, and removing it is not really necessary anyway.
> On second thought, it is actually useful to keep it alongside 
> `spark.sql.hive.metastore.version`: when `spark.sql.hive.metastore.version` is 
> changed, this config can still statically report the compiled Hive version, 
> which is helpful when an error occurs in that case. So this parameter should 
> be fixed to the compiled Hive version.






[jira] [Assigned] (SPARK-35102) Make spark.sql.hive.version meaningful and not deprecated

2021-04-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35102:


Assignee: Apache Spark

> Make spark.sql.hive.version meaningful and not deprecated
> -
>
> Key: SPARK-35102
> URL: https://issues.apache.org/jira/browse/SPARK-35102
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Kent Yao
>Assignee: Apache Spark
>Priority: Minor
>
> First, let's take a look at the definition and the comment.
> {code:java}
> // A fake config which is only here for backward compatibility reasons. This 
> config has no effect
> // to Spark, just for reporting the builtin Hive version of Spark to existing 
> applications that
> // already rely on this config.
> val FAKE_HIVE_VERSION = buildConf("spark.sql.hive.version")
>   .doc(s"deprecated, please use ${HIVE_METASTORE_VERSION.key} to get the Hive 
> version in Spark.")
>   .version("1.1.1")
>   .fallbackConf(HIVE_METASTORE_VERSION)
> {code}
> It is used for reporting the built-in Hive version, but the current status is 
> unsatisfactory because it can be changed in many ways, e.g. via --conf or the 
> SET syntax.
> It has been marked as deprecated for a long time; it is hard for us to remove 
> it, and removing it is not really necessary anyway.
> On second thought, it is actually useful to keep it alongside 
> `spark.sql.hive.metastore.version`: when `spark.sql.hive.metastore.version` is 
> changed, this config can still statically report the compiled Hive version, 
> which is helpful when an error occurs in that case. So this parameter should 
> be fixed to the compiled Hive version.






[jira] [Assigned] (SPARK-35102) Make spark.sql.hive.version meaningful and not deprecated

2021-04-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35102:


Assignee: (was: Apache Spark)

> Make spark.sql.hive.version meaningful and not deprecated
> -
>
> Key: SPARK-35102
> URL: https://issues.apache.org/jira/browse/SPARK-35102
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Kent Yao
>Priority: Minor
>
> First, let's take a look at the definition and the comment.
> {code:java}
> // A fake config which is only here for backward compatibility reasons. This 
> config has no effect
> // to Spark, just for reporting the builtin Hive version of Spark to existing 
> applications that
> // already rely on this config.
> val FAKE_HIVE_VERSION = buildConf("spark.sql.hive.version")
>   .doc(s"deprecated, please use ${HIVE_METASTORE_VERSION.key} to get the Hive 
> version in Spark.")
>   .version("1.1.1")
>   .fallbackConf(HIVE_METASTORE_VERSION)
> {code}
> It is used for reporting the built-in Hive version, but the current status is 
> unsatisfactory because it can be changed in many ways, e.g. via --conf or the 
> SET syntax.
> It has been marked as deprecated for a long time; it is hard for us to remove 
> it, and removing it is not really necessary anyway.
> On second thought, it is actually useful to keep it alongside 
> `spark.sql.hive.metastore.version`: when `spark.sql.hive.metastore.version` is 
> changed, this config can still statically report the compiled Hive version, 
> which is helpful when an error occurs in that case. So this parameter should 
> be fixed to the compiled Hive version.






[jira] [Created] (SPARK-35102) Make spark.sql.hive.version meaningful and not deprecated

2021-04-15 Thread Kent Yao (Jira)
Kent Yao created SPARK-35102:


 Summary: Make spark.sql.hive.version meaningful and not deprecated
 Key: SPARK-35102
 URL: https://issues.apache.org/jira/browse/SPARK-35102
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Kent Yao


First, let's take a look at the definition and the comment.
{code:java}
// A fake config which is only here for backward compatibility reasons. This 
config has no effect
// to Spark, just for reporting the builtin Hive version of Spark to existing 
applications that
// already rely on this config.
val FAKE_HIVE_VERSION = buildConf("spark.sql.hive.version")
  .doc(s"deprecated, please use ${HIVE_METASTORE_VERSION.key} to get the Hive 
version in Spark.")
  .version("1.1.1")
  .fallbackConf(HIVE_METASTORE_VERSION)
{code}

It is used for reporting the built-in Hive version, but the current status is 
unsatisfactory because it can be changed in many ways, e.g. via --conf or the 
SET syntax.

It has been marked as deprecated for a long time; it is hard for us to remove 
it, and removing it is not really necessary anyway.

On second thought, it is actually useful to keep it alongside 
`spark.sql.hive.metastore.version`: when `spark.sql.hive.metastore.version` is 
changed, this config can still statically report the compiled Hive version, 
which is helpful when an error occurs in that case. So this parameter should be 
fixed to the compiled Hive version.
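A hedged sketch of what "fixed to the compiled Hive version" could look like in 
config-builder terms (an assumption about the direction, not the actual change): 
drop the fallback to spark.sql.hive.metastore.version and pin the value to the 
built-in version, with `builtinHiveVersion` standing in for the compiled-in 
version string.
{code:scala}
// Sketch only: a read-only report of the compiled Hive version instead of a
// fallback to spark.sql.hive.metastore.version.
val HIVE_VERSION = buildConf("spark.sql.hive.version")
  .doc("The compiled (builtin) Hive version of this Spark distribution.")
  .version("1.1.1")
  .stringConf
  .checkValue(_ == builtinHiveVersion,
    "spark.sql.hive.version only reports the compiled Hive version and cannot be changed.")
  .createWithDefault(builtinHiveVersion)
{code}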










[jira] [Commented] (SPARK-35093) [SQL] AQE columnar mismatch on exchange reuse

2021-04-15 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322568#comment-17322568
 ] 

Dongjoon Hyun commented on SPARK-35093:
---

Thank you for reporting and fixing this, [~andygrove].
I've collected this into a subtask of SPARK-33828 to give it more visibility.

> [SQL] AQE columnar mismatch on exchange reuse
> -
>
> Key: SPARK-35093
> URL: https://issues.apache.org/jira/browse/SPARK-35093
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.1
>Reporter: Andy Grove
>Priority: Major
>
> With AQE enabled, AdaptiveSparkPlanExec will attempt to reuse exchanges that 
> are semantically equal.
> This is done by comparing the canonicalized plan for two Exchange nodes to 
> see if they are the same.
> Unfortunately this does not take into account the fact that two exchanges 
> with the same canonical plan might be replaced by a plugin in a way that 
> makes them not compatible. For example, a plugin could create one version 
> with supportsColumnar=true and another with supportsColumnar=false. It is not 
> valid to re-use exchanges if there is a supportsColumnar mismatch.
> I have tested a fix for this and will put up a PR once I figure out how to 
> write the tests.
>  
>  
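A hedged sketch of the reuse condition being described (not the actual 
AdaptiveSparkPlanExec change): treat two exchanges as interchangeable only when 
their canonicalized plans match and they agree on columnar output.
{code:scala}
import org.apache.spark.sql.execution.exchange.Exchange

// Sketch only: reuse requires both semantic equality (canonicalized plans) and
// matching columnar output, so a row-based copy never replaces a columnar one.
def canReuse(a: Exchange, b: Exchange): Boolean =
  a.canonicalized == b.canonicalized && a.supportsColumnar == b.supportsColumnar
{code}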






[jira] [Updated] (SPARK-35093) AQE columnar mismatch on exchange reuse

2021-04-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-35093:
--
Summary: AQE columnar mismatch on exchange reuse  (was: [SQL] AQE columnar 
mismatch on exchange reuse)

> AQE columnar mismatch on exchange reuse
> ---
>
> Key: SPARK-35093
> URL: https://issues.apache.org/jira/browse/SPARK-35093
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.1
>Reporter: Andy Grove
>Priority: Major
>
> With AQE enabled, AdaptiveSparkPlanExec will attempt to reuse exchanges that 
> are semantically equal.
> This is done by comparing the canonicalized plan for two Exchange nodes to 
> see if they are the same.
> Unfortunately this does not take into account the fact that two exchanges 
> with the same canonical plan might be replaced by a plugin in a way that 
> makes them not compatible. For example, a plugin could create one version 
> with supportsColumnar=true and another with supportsColumnar=false. It is not 
> valid to re-use exchanges if there is a supportsColumnar mismatch.
> I have tested a fix for this and will put up a PR once I figure out how to 
> write the tests.
>  
>  






[jira] [Updated] (SPARK-35093) [SQL] AQE columnar mismatch on exchange reuse

2021-04-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-35093:
--
Parent: SPARK-33828
Issue Type: Sub-task  (was: Bug)

> [SQL] AQE columnar mismatch on exchange reuse
> -
>
> Key: SPARK-35093
> URL: https://issues.apache.org/jira/browse/SPARK-35093
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.1
>Reporter: Andy Grove
>Priority: Major
>
> With AQE enabled, AdaptiveSparkPlanExec will attempt to reuse exchanges that 
> are semantically equal.
> This is done by comparing the canonicalized plan for two Exchange nodes to 
> see if they are the same.
> Unfortunately this does not take into account the fact that two exchanges 
> with the same canonical plan might be replaced by a plugin in a way that 
> makes them not compatible. For example, a plugin could create one version 
> with supportsColumnar=true and another with supportsColumnar=false. It is not 
> valid to re-use exchanges if there is a supportsColumnar mismatch.
> I have tested a fix for this and will put up a PR once I figure out how to 
> write the tests.
>  
>  






[jira] [Assigned] (SPARK-35101) Add GitHub status check in PR instead of a comment

2021-04-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35101:


Assignee: (was: Apache Spark)

> Add GitHub status check in PR instead of a comment
> --
>
> Key: SPARK-35101
> URL: https://issues.apache.org/jira/browse/SPARK-35101
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> We should create a GitHub Actions status check in PRs instead of relying on a 
> comment, to make it easier to judge whether a PR is mergeable.






[jira] [Assigned] (SPARK-35101) Add GitHub status check in PR instead of a comment

2021-04-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35101:


Assignee: Apache Spark

> Add GitHub status check in PR instead of a comment
> --
>
> Key: SPARK-35101
> URL: https://issues.apache.org/jira/browse/SPARK-35101
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> We should create a GitHub Actions status check in PRs instead of relying on a 
> comment, to make it easier to judge whether a PR is mergeable.






[jira] [Commented] (SPARK-35101) Add GitHub status check in PR instead of a comment

2021-04-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322565#comment-17322565
 ] 

Apache Spark commented on SPARK-35101:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/32193

> Add GitHub status check in PR instead of a comment
> --
>
> Key: SPARK-35101
> URL: https://issues.apache.org/jira/browse/SPARK-35101
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> We should create a GitHub Actions status check in PRs instead of relying on a 
> comment, to make it easier to judge whether a PR is mergeable.






[jira] [Issue Comment Deleted] (SPARK-35048) Distribute GitHub Actions workflows to fork repositories to share the resources

2021-04-15 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-35048:
-
Comment: was deleted

(was: User 'Yikun' has created a pull request for this issue:
https://github.com/apache/spark/pull/32174)

> Distribute GitHub Actions workflows to fork repositories to share the 
> resources
> ---
>
> Key: SPARK-35048
> URL: https://issues.apache.org/jira/browse/SPARK-35048
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Critical
> Fix For: 3.2.0
>
>
> The PR was opened first as a POC. Please refer to the PR description at 
> https://github.com/apache/spark/pull/32092.






[jira] [Created] (SPARK-35101) Add GitHub status check in PR instead of a comment

2021-04-15 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-35101:


 Summary: Add GitHub status check in PR instead of a comment
 Key: SPARK-35101
 URL: https://issues.apache.org/jira/browse/SPARK-35101
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Affects Versions: 3.2.0
Reporter: Hyukjin Kwon


We should create a GitHub Actions status check in PRs instead of relying on a 
comment, to make it easier to judge whether a PR is mergeable.






[jira] [Assigned] (SPARK-35100) Refactor AFT - support virtual centering

2021-04-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35100:


Assignee: Apache Spark

> Refactor AFT - support virtual centering
> 
>
> Key: SPARK-35100
> URL: https://issues.apache.org/jira/browse/SPARK-35100
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.2.0
>Reporter: zhengruifeng
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Assigned] (SPARK-35100) Refactor AFT - support virtual centering

2021-04-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35100:


Assignee: (was: Apache Spark)

> Refactor AFT - support virtual centering
> 
>
> Key: SPARK-35100
> URL: https://issues.apache.org/jira/browse/SPARK-35100
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.2.0
>Reporter: zhengruifeng
>Priority: Major
>







[jira] [Commented] (SPARK-35100) Refactor AFT - support virtual centering

2021-04-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322556#comment-17322556
 ] 

Apache Spark commented on SPARK-35100:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/32199

> Refactor AFT - support virtual centering
> 
>
> Key: SPARK-35100
> URL: https://issues.apache.org/jira/browse/SPARK-35100
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.2.0
>Reporter: zhengruifeng
>Priority: Major
>







[jira] [Commented] (SPARK-35100) Refactor AFT - support virtual centering

2021-04-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322555#comment-17322555
 ] 

Apache Spark commented on SPARK-35100:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/32199

> Refactor AFT - support virtual centering
> 
>
> Key: SPARK-35100
> URL: https://issues.apache.org/jira/browse/SPARK-35100
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.2.0
>Reporter: zhengruifeng
>Priority: Major
>







[jira] [Created] (SPARK-35100) Refactor AFT - support virtual centering

2021-04-15 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-35100:


 Summary: Refactor AFT - support virtual centering
 Key: SPARK-35100
 URL: https://issues.apache.org/jira/browse/SPARK-35100
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 3.2.0
Reporter: zhengruifeng









[jira] [Commented] (SPARK-26164) [SQL] Allow FileFormatWriter to write multiple partitions/buckets without sort

2021-04-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322552#comment-17322552
 ] 

Apache Spark commented on SPARK-26164:
--

User 'c21' has created a pull request for this issue:
https://github.com/apache/spark/pull/32198

> [SQL] Allow FileFormatWriter to write multiple partitions/buckets without sort
> --
>
> Key: SPARK-26164
> URL: https://issues.apache.org/jira/browse/SPARK-26164
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0, 3.1.0
>Reporter: Cheng Su
>Priority: Minor
>
> Problem:
> Currently, Spark always requires a local sort on the partition/bucket columns 
> before writing to an output table [1]. The disadvantage is that the sort may 
> waste reserved CPU time on the executor due to spill. Hive does not require 
> this local sort before writing the output table [2], and we saw a performance 
> regression when migrating Hive workloads to Spark.
>  
> Proposal:
> We can avoid the local sort by keeping a mapping from file path to output 
> writer. When writing a row to a new file path, we create a new output writer; 
> otherwise, we reuse the writer that already exists for that path (the main 
> change would be in FileFormatDataWriter.scala). This is very similar to what 
> Hive does in [2].
> Because the new behavior (avoiding the sort by keeping multiple output writers 
> open at the same time) consumes more memory on the executor than the current 
> behavior (only one output writer open), we can add a config to switch between 
> the current and new behavior.
>  
> [1]: spark FileFormatWriter.scala - 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L123]
> [2]: hive FileSinkOperator.java - 
> [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L510]
>  
>  
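A hedged sketch of the proposal (not FileFormatDataWriter itself): keep one open 
writer per partition/bucket path and route each row to it, so no pre-sort is 
needed. `partitionPath` and `newOutputWriter` are assumed helpers for 
illustration, not existing APIs.
{code:scala}
import scala.collection.mutable
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.execution.datasources.OutputWriter

// Sketch of the "map of file path -> output writer" idea described above.
class MultiPathWriter(
    partitionPath: InternalRow => String,       // assumed: partition/bucket dir for a row
    newOutputWriter: String => OutputWriter) {  // assumed: factory for a writer at a path

  private val writers = mutable.HashMap.empty[String, OutputWriter]

  def write(row: InternalRow): Unit = {
    val path = partitionPath(row)
    // Reuse the writer if this path was seen before, otherwise open a new one.
    val writer = writers.getOrElseUpdate(path, newOutputWriter(path))
    writer.write(row)
  }

  def close(): Unit = writers.values.foreach(_.close())
}
{code}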






[jira] [Commented] (SPARK-34995) Port/integrate Koalas remaining codes into PySpark

2021-04-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322551#comment-17322551
 ] 

Apache Spark commented on SPARK-34995:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/32197

> Port/integrate Koalas remaining codes into PySpark
> --
>
> Key: SPARK-34995
> URL: https://issues.apache.org/jira/browse/SPARK-34995
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> There are some more commits remaining after the main code was ported.
> - 
> [https://github.com/databricks/koalas/commit/c8f803d6becb3accd767afdb3774c8656d0d0b47]
> - 
> [https://github.com/databricks/koalas/commit/913d68868d38ee7158c640aceb837484f417267e]






[jira] [Updated] (SPARK-35081) Add Data Source Option links to missing documents.

2021-04-15 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-35081:

Description: 
We're missing detailed documentation of the options for functions such as 
`to_xxx`, `from_xxx`, `schema_of_csv` and `schema_of_json`.

We should add a link to the Data Source Option for them.

  was:
We're missing detailed documentation of the options for the `to_xxx` and 
`from_xxx` functions.

We should add a link to the Data Source Option for them.


> Add Data Source Option links to missing documents.
> --
>
> Key: SPARK-35081
> URL: https://issues.apache.org/jira/browse/SPARK-35081
> Project: Spark
>  Issue Type: Sub-task
>  Components: docs
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Priority: Major
>
> We're missing detailed documentation of the options for functions such as 
> `to_xxx`, `from_xxx`, `schema_of_csv` and `schema_of_json`.
> We should add a link to the Data Source Option for them.






[jira] [Updated] (SPARK-35081) Add Data Source Option links to missing documents.

2021-04-15 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-35081:

Summary: Add Data Source Option links to missing documents.  (was: Add Data 
Source Option links to `to_xxx`, `from_xxx`.)

> Add Data Source Option links to missing documents.
> --
>
> Key: SPARK-35081
> URL: https://issues.apache.org/jira/browse/SPARK-35081
> Project: Spark
>  Issue Type: Sub-task
>  Components: docs
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Priority: Major
>
> We're missing detailed documentation of the options for the `to_xxx` and 
> `from_xxx` functions.
> We should add a link to the Data Source Option for them.






[jira] [Resolved] (SPARK-35032) Port Koalas Index unit tests into PySpark

2021-04-15 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-35032.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32139
[https://github.com/apache/spark/pull/32139]

> Port Koalas Index unit tests into PySpark
> -
>
> Key: SPARK-35032
> URL: https://issues.apache.org/jira/browse/SPARK-35032
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.2.0
>
>
> This JIRA aims to port Koalas Index unit tests to [PySpark 
> tests|https://github.com/apache/spark/tree/master/python/pyspark/tests].






[jira] [Assigned] (SPARK-35032) Port Koalas Index unit tests into PySpark

2021-04-15 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-35032:


Assignee: Xinrong Meng

> Port Koalas Index unit tests into PySpark
> -
>
> Key: SPARK-35032
> URL: https://issues.apache.org/jira/browse/SPARK-35032
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
>
> This JIRA aims to port Koalas Index unit tests to [PySpark 
> tests|https://github.com/apache/spark/tree/master/python/pyspark/tests].






[jira] [Commented] (SPARK-30602) SPIP: Support push-based shuffle to improve shuffle efficiency

2021-04-15 Thread Min Shen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322476#comment-17322476
 ] 

Min Shen commented on SPARK-30602:
--

We have published the production results of push-based shuffle on 100% of 
LinkedIn's offline Spark workloads in the following blog post.

https://www.linkedin.com/pulse/bringing-next-gen-shuffle-architecture-data-linkedin-scale-min-shen/

> SPIP: Support push-based shuffle to improve shuffle efficiency
> --
>
> Key: SPARK-30602
> URL: https://issues.apache.org/jira/browse/SPARK-30602
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Min Shen
>Priority: Major
>  Labels: release-notes
> Attachments: Screen Shot 2020-06-23 at 11.31.22 AM.jpg, 
> vldb_magnet_final.pdf
>
>
> In a large deployment of a Spark compute infrastructure, Spark shuffle is 
> becoming a potential scaling bottleneck and a source of inefficiency in the 
> cluster. When doing Spark on YARN for a large-scale deployment, people 
> usually enable Spark external shuffle service and store the intermediate 
> shuffle files on HDD. Because the number of blocks generated for a particular 
> shuffle grows quadratically with the size of the shuffled data (the numbers of 
> mappers and reducers grow linearly with the data size, but the number of 
> blocks is # mappers * # reducers), one general trend we have observed is that 
> the more data a Spark application processes, the smaller the block size 
> becomes. In a few production clusters we have seen, the average shuffle block 
> size is only 10s of KBs. Because of the inefficiency of performing random 
> reads on HDD for small amounts of data, the overall efficiency of the Spark 
> external shuffle 
> services serving the shuffle blocks degrades as we see an increasing # of 
> Spark applications processing an increasing amount of data. In addition, 
> because Spark external shuffle service is a shared service in a multi-tenancy 
> cluster, the inefficiency with one Spark application could propagate to other 
> applications as well.
> In this ticket, we propose a solution to improve Spark shuffle efficiency in 
> above mentioned environments with push-based shuffle. With push-based 
> shuffle, shuffle is performed at the end of mappers and blocks get pre-merged 
> and move towards reducers. In our prototype implementation, we have seen 
> significant efficiency improvements when performing large shuffles. We take a 
> Spark-native approach to achieve this, i.e., extending Spark’s existing 
> shuffle netty protocol, and the behaviors of Spark mappers, reducers and 
> drivers. This way, we can bring the benefits of more efficient shuffle in 
> Spark without incurring the dependency or overhead of either specialized 
> storage layer or external infrastructure pieces.
>  
> Link to dev mailing list discussion: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Enabling-push-based-shuffle-in-Spark-td28732.html
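
For illustration, a back-of-the-envelope sketch of the block-count argument above; all numbers are hypothetical and only show why the average block size shrinks as the shuffled data grows:

{code:java}
// blocks = mappers * reducers, while the data volume grows roughly linearly
// with the number of mappers/reducers.
val shuffledBytes = 10L * 1024 * 1024 * 1024 * 1024   // assume a 10 TB shuffle
val mappers = 20000                                   // assumed to grow linearly with data size
val reducers = 20000
val blocks = mappers.toLong * reducers                // 400 million blocks
val avgBlockSizeBytes = shuffledBytes / blocks        // ~27 KB per block
println(s"avg block size ~ $avgBlockSizeBytes bytes")
{code}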



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35099) Convert ANSI interval literals to SQL string

2021-04-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322462#comment-17322462
 ] 

Apache Spark commented on SPARK-35099:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/32196

> Convert ANSI interval literals to SQL string
> 
>
> Key: SPARK-35099
> URL: https://issues.apache.org/jira/browse/SPARK-35099
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> Implement the sql() and toString() methods of Literal for ANSI interval 
> types: DayTimeIntervalType and YearMonthIntervalType.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35099) Convert ANSI interval literals to SQL string

2021-04-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35099:


Assignee: Max Gekk  (was: Apache Spark)

> Convert ANSI interval literals to SQL string
> 
>
> Key: SPARK-35099
> URL: https://issues.apache.org/jira/browse/SPARK-35099
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> Implement the sql() and toString() methods of Literal for ANSI interval 
> types: DayTimeIntervalType and YearMonthIntervalType.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35099) Convert ANSI interval literals to SQL string

2021-04-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35099:


Assignee: Apache Spark  (was: Max Gekk)

> Convert ANSI interval literals to SQL string
> 
>
> Key: SPARK-35099
> URL: https://issues.apache.org/jira/browse/SPARK-35099
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Implement the sql() and toString() methods of Literal for ANSI interval 
> types: DayTimeIntervalType and YearMonthIntervalType.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35099) Convert ANSI interval literals to SQL string

2021-04-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322461#comment-17322461
 ] 

Apache Spark commented on SPARK-35099:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/32196

> Convert ANSI interval literals to SQL string
> 
>
> Key: SPARK-35099
> URL: https://issues.apache.org/jira/browse/SPARK-35099
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> Implement the sql() and toString() methods of Literal for ANSI interval 
> types: DayTimeIntervalType and YearMonthIntervalType.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35093) [SQL] AQE columnar mismatch on exchange reuse

2021-04-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322453#comment-17322453
 ] 

Apache Spark commented on SPARK-35093:
--

User 'andygrove' has created a pull request for this issue:
https://github.com/apache/spark/pull/32195

> [SQL] AQE columnar mismatch on exchange reuse
> -
>
> Key: SPARK-35093
> URL: https://issues.apache.org/jira/browse/SPARK-35093
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.1
>Reporter: Andy Grove
>Priority: Major
>
> With AQE enabled, AdaptiveSparkPlanExec will attempt to reuse exchanges that 
> are semantically equal.
> This is done by comparing the canonicalized plan for two Exchange nodes to 
> see if they are the same.
> Unfortunately this does not take into account the fact that two exchanges 
> with the same canonical plan might be replaced by a plugin in a way that 
> makes them not compatible. For example, a plugin could create one version 
> with supportsColumnar=true and another with supportsColumnar=false. It is not 
> valid to re-use exchanges if there is a supportsColumnar mismatch.
> I have tested a fix for this and will put up a PR once I figure out how to 
> write the tests.
>  
>  
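
For illustration only, a minimal sketch (not the actual fix in the PR) of the extra compatibility condition described above, assuming exchange reuse keys off plan equivalence:

{code:java}
import org.apache.spark.sql.execution.exchange.Exchange

// Two exchanges should only be reused when, besides being semantically equal,
// they agree on whether they produce columnar output.
def canReuse(e1: Exchange, e2: Exchange): Boolean =
  e1.sameResult(e2) && e1.supportsColumnar == e2.supportsColumnar
{code}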



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35093) [SQL] AQE columnar mismatch on exchange reuse

2021-04-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35093:


Assignee: Apache Spark

> [SQL] AQE columnar mismatch on exchange reuse
> -
>
> Key: SPARK-35093
> URL: https://issues.apache.org/jira/browse/SPARK-35093
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.1
>Reporter: Andy Grove
>Assignee: Apache Spark
>Priority: Major
>
> With AQE enabled, AdaptiveSparkPlanExec will attempt to reuse exchanges that 
> are semantically equal.
> This is done by comparing the canonicalized plan for two Exchange nodes to 
> see if they are the same.
> Unfortunately this does not take into account the fact that two exchanges 
> with the same canonical plan might be replaced by a plugin in a way that 
> makes them not compatible. For example, a plugin could create one version 
> with supportsColumnar=true and another with supportsColumnar=false. It is not 
> valid to re-use exchanges if there is a supportsColumnar mismatch.
> I have tested a fix for this and will put up a PR once I figure out how to 
> write the tests.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35093) [SQL] AQE columnar mismatch on exchange reuse

2021-04-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35093:


Assignee: (was: Apache Spark)

> [SQL] AQE columnar mismatch on exchange reuse
> -
>
> Key: SPARK-35093
> URL: https://issues.apache.org/jira/browse/SPARK-35093
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.1
>Reporter: Andy Grove
>Priority: Major
>
> With AQE enabled, AdaptiveSparkPlanExec will attempt to reuse exchanges that 
> are semantically equal.
> This is done by comparing the canonicalized plan for two Exchange nodes to 
> see if they are the same.
> Unfortunately this does not take into account the fact that two exchanges 
> with the same canonical plan might be replaced by a plugin in a way that 
> makes them not compatible. For example, a plugin could create one version 
> with supportsColumnar=true and another with supportsColumnar=false. It is not 
> valid to re-use exchanges if there is a supportsColumnar mismatch.
> I have tested a fix for this and will put up a PR once I figure out how to 
> write the tests.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35093) [SQL] AQE columnar mismatch on exchange reuse

2021-04-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322452#comment-17322452
 ] 

Apache Spark commented on SPARK-35093:
--

User 'andygrove' has created a pull request for this issue:
https://github.com/apache/spark/pull/32195

> [SQL] AQE columnar mismatch on exchange reuse
> -
>
> Key: SPARK-35093
> URL: https://issues.apache.org/jira/browse/SPARK-35093
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.1
>Reporter: Andy Grove
>Priority: Major
>
> With AQE enabled, AdaptiveSparkPlanExec will attempt to reuse exchanges that 
> are semantically equal.
> This is done by comparing the canonicalized plan for two Exchange nodes to 
> see if they are the same.
> Unfortunately this does not take into account the fact that two exchanges 
> with the same canonical plan might be replaced by a plugin in a way that 
> makes them not compatible. For example, a plugin could create one version 
> with supportsColumnar=true and another with supportsColumnar=false. It is not 
> valid to re-use exchanges if there is a supportsColumnar mismatch.
> I have tested a fix for this and will put up a PR once I figure out how to 
> write the tests.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35099) Convert ANSI interval literals to SQL string

2021-04-15 Thread Max Gekk (Jira)
Max Gekk created SPARK-35099:


 Summary: Convert ANSI interval literals to SQL string
 Key: SPARK-35099
 URL: https://issues.apache.org/jira/browse/SPARK-35099
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Max Gekk
Assignee: Max Gekk


Implement the sql() and toString() methods of Literal for ANSI interval types: 
DayTimeIntervalType and YearMonthIntervalType.
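
For context, a hedged sketch of what this could look like; it assumes the master-branch support for java.time.Period/Duration as external types of the ANSI interval types, and the interval strings in the comments only illustrate standard ANSI interval literal syntax, not the exact rendering the implementation will choose:

{code:java}
import java.time.{Duration, Period}
import org.apache.spark.sql.catalyst.expressions.Literal

val ym = Literal(Period.ofYears(1).plusMonths(2))   // YearMonthIntervalType
val dt = Literal(Duration.ofDays(1).plusHours(2))   // DayTimeIntervalType

// After this ticket, ym.sql and dt.sql should produce ANSI-style strings such as
// INTERVAL '1-2' YEAR TO MONTH and INTERVAL '1 02:00:00' DAY TO SECOND.
{code}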



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35098) Revisit pandas-on-Spark test cases that are disabled because of pandas nondeterministic return values

2021-04-15 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-35098:


 Summary: Revisit pandas-on-Spark test cases that are disabled 
because of pandas nondeterministic return values
 Key: SPARK-35098
 URL: https://issues.apache.org/jira/browse/SPARK-35098
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.2.0
Reporter: Xinrong Meng


Some test cases have been disabled in the places shown below because of 
pandas' nondeterministic return values:
 * pandas returns `None` or `nan` randomly
python/pyspark/pandas/tests/test_series.py test_astype

 * pandas returns `True` or `False` randomly
python/pyspark/pandas/tests/indexes/test_base.py test_monotonic

We should revisit them later.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35097) Add column name to SparkUpgradeException about ancient datetime

2021-04-15 Thread Max Gekk (Jira)
Max Gekk created SPARK-35097:


 Summary: Add column name to SparkUpgradeException about ancient 
datetime
 Key: SPARK-35097
 URL: https://issues.apache.org/jira/browse/SPARK-35097
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Max Gekk


The error message:

{code:java}
org.apache.spark.SparkUpgradeException: You may get a different result due to 
the upgrading of Spark 3.0: reading dates before 1582-10-15 or timestamps 
before 1900-01-01T00:00:00Z from Parquet files can be ambiguous, as the files 
may be written by Spark 2.x or legacy versions of Hive, which uses a legacy 
hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian 
calendar. See more details in SPARK-31404. You can set 
spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'LEGACY' to rebase the 
datetime values w.r.t. the calendar difference during reading. Or set 
spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'CORRECTED' to read the 
datetime values as it is.
{code}

doesn't give any clue about which column causes the issue. We need to improve 
the message by adding the column name to it.
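
For reference, a minimal sketch of the two workarounds the message itself suggests (this ticket is only about improving the message, not about changing the behavior):

{code:java}
// Rebase the datetime values w.r.t. the calendar difference during reading:
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "LEGACY")
// Or read the datetime values as they are:
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED")
{code}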



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35096) foreachBatch throws ArrayIndexOutOfBoundsException if schema is case Insensitive

2021-04-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322376#comment-17322376
 ] 

Apache Spark commented on SPARK-35096:
--

User 'sandeep-katta' has created a pull request for this issue:
https://github.com/apache/spark/pull/32194

> foreachBatch throws ArrayIndexOutOfBoundsException if schema is case 
> Insensitive
> 
>
> Key: SPARK-35096
> URL: https://issues.apache.org/jira/browse/SPARK-35096
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Sandeep Katta
>Priority: Major
>
> The code below works fine before Spark 3; running on Spark 3, it throws 
> java.lang.ArrayIndexOutOfBoundsException:
> {code:java}
> val inputPath = "/Users/xyz/data/testcaseInsensitivity"
> val output_path = "/Users/xyz/output"
> spark.range(10).write.format("parquet").save(inputPath)
> def process_row(microBatch: DataFrame, batchId: Long): Unit = {
>   val df = microBatch.select($"ID".alias("other")) // Doesn't work
>   df.write.format("parquet").mode("append").save(output_path)
> }
> val schema = new StructType().add("id", LongType)
> val stream_df = 
> spark.readStream.schema(schema).format("parquet").load(inputPath)
> stream_df.writeStream.trigger(Trigger.Once).foreachBatch(process_row _)
>   .start().awaitTermination()
> {code}
> Stack Trace:
> {code:java}
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
>   at org.apache.spark.sql.types.StructType.apply(StructType.scala:414)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$4.$anonfun$applyOrElse$3(objects.scala:216)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.immutable.List.map(List.scala:298)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$4.applyOrElse(objects.scala:215)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$4.applyOrElse(objects.scala:203)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:309)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:309)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:149)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:147)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:314)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:399)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:237)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:397)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:350)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:314)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:149)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:147)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$.apply(objects.scala:203)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$.apply(objects.scala:121)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:149)
>   at 
> scala.collection.IndexedSeqOptimized.foldLeft(IndexedSeqOptimized.scala:60)
>   at 
> scala.collection.IndexedSeqOptimized.foldLeft$(I

[jira] [Assigned] (SPARK-35096) foreachBatch throws ArrayIndexOutOfBoundsException if schema is case Insensitive

2021-04-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35096:


Assignee: (was: Apache Spark)

> foreachBatch throws ArrayIndexOutOfBoundsException if schema is case 
> Insensitive
> 
>
> Key: SPARK-35096
> URL: https://issues.apache.org/jira/browse/SPARK-35096
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Sandeep Katta
>Priority: Major
>
> The code below works fine before Spark 3; running on Spark 3, it throws 
> java.lang.ArrayIndexOutOfBoundsException:
> {code:java}
> val inputPath = "/Users/xyz/data/testcaseInsensitivity"
> val output_path = "/Users/xyz/output"
> spark.range(10).write.format("parquet").save(inputPath)
> def process_row(microBatch: DataFrame, batchId: Long): Unit = {
>   val df = microBatch.select($"ID".alias("other")) // Doesn't work
>   df.write.format("parquet").mode("append").save(output_path)
> }
> val schema = new StructType().add("id", LongType)
> val stream_df = 
> spark.readStream.schema(schema).format("parquet").load(inputPath)
> stream_df.writeStream.trigger(Trigger.Once).foreachBatch(process_row _)
>   .start().awaitTermination()
> {code}
> Stack Trace:
> {code:java}
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
>   at org.apache.spark.sql.types.StructType.apply(StructType.scala:414)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$4.$anonfun$applyOrElse$3(objects.scala:216)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.immutable.List.map(List.scala:298)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$4.applyOrElse(objects.scala:215)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$4.applyOrElse(objects.scala:203)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:309)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:309)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:149)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:147)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:314)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:399)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:237)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:397)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:350)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:314)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:149)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:147)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$.apply(objects.scala:203)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$.apply(objects.scala:121)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:149)
>   at 
> scala.collection.IndexedSeqOptimized.foldLeft(IndexedSeqOptimized.scala:60)
>   at 
> scala.collection.IndexedSeqOptimized.foldLeft$(IndexedSeqOptimized.scala:68)
>   at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:38)
>   at 
> org.

[jira] [Commented] (SPARK-35096) foreachBatch throws ArrayIndexOutOfBoundsException if schema is case Insensitive

2021-04-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322378#comment-17322378
 ] 

Apache Spark commented on SPARK-35096:
--

User 'sandeep-katta' has created a pull request for this issue:
https://github.com/apache/spark/pull/32194

> foreachBatch throws ArrayIndexOutOfBoundsException if schema is case 
> Insensitive
> 
>
> Key: SPARK-35096
> URL: https://issues.apache.org/jira/browse/SPARK-35096
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Sandeep Katta
>Priority: Major
>
> The code below works fine before Spark 3; running on Spark 3, it throws 
> java.lang.ArrayIndexOutOfBoundsException:
> {code:java}
> val inputPath = "/Users/xyz/data/testcaseInsensitivity"
> val output_path = "/Users/xyz/output"
> spark.range(10).write.format("parquet").save(inputPath)
> def process_row(microBatch: DataFrame, batchId: Long): Unit = {
>   val df = microBatch.select($"ID".alias("other")) // Doesn't work
>   df.write.format("parquet").mode("append").save(output_path)
> }
> val schema = new StructType().add("id", LongType)
> val stream_df = 
> spark.readStream.schema(schema).format("parquet").load(inputPath)
> stream_df.writeStream.trigger(Trigger.Once).foreachBatch(process_row _)
>   .start().awaitTermination()
> {code}
> Stack Trace:
> {code:java}
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
>   at org.apache.spark.sql.types.StructType.apply(StructType.scala:414)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$4.$anonfun$applyOrElse$3(objects.scala:216)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.immutable.List.map(List.scala:298)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$4.applyOrElse(objects.scala:215)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$4.applyOrElse(objects.scala:203)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:309)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:309)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:149)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:147)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:314)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:399)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:237)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:397)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:350)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:314)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:149)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:147)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$.apply(objects.scala:203)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$.apply(objects.scala:121)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:149)
>   at 
> scala.collection.IndexedSeqOptimized.foldLeft(IndexedSeqOptimized.scala:60)
>   at 
> scala.collection.IndexedSeqOptimized.foldLeft$(I

[jira] [Assigned] (SPARK-35096) foreachBatch throws ArrayIndexOutOfBoundsException if schema is case Insensitive

2021-04-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35096:


Assignee: Apache Spark

> foreachBatch throws ArrayIndexOutOfBoundsException if schema is case 
> Insensitive
> 
>
> Key: SPARK-35096
> URL: https://issues.apache.org/jira/browse/SPARK-35096
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Sandeep Katta
>Assignee: Apache Spark
>Priority: Major
>
> The code below works fine before Spark 3; running on Spark 3, it throws 
> java.lang.ArrayIndexOutOfBoundsException:
> {code:java}
> val inputPath = "/Users/xyz/data/testcaseInsensitivity"
> val output_path = "/Users/xyz/output"
> spark.range(10).write.format("parquet").save(inputPath)
> def process_row(microBatch: DataFrame, batchId: Long): Unit = {
>   val df = microBatch.select($"ID".alias("other")) // Doesn't work
>   df.write.format("parquet").mode("append").save(output_path)
> }
> val schema = new StructType().add("id", LongType)
> val stream_df = 
> spark.readStream.schema(schema).format("parquet").load(inputPath)
> stream_df.writeStream.trigger(Trigger.Once).foreachBatch(process_row _)
>   .start().awaitTermination()
> {code}
> Stack Trace:
> {code:java}
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
>   at org.apache.spark.sql.types.StructType.apply(StructType.scala:414)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$4.$anonfun$applyOrElse$3(objects.scala:216)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.immutable.List.map(List.scala:298)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$4.applyOrElse(objects.scala:215)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$4.applyOrElse(objects.scala:203)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:309)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:309)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:149)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:147)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:314)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:399)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:237)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:397)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:350)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:314)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:149)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:147)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$.apply(objects.scala:203)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$.apply(objects.scala:121)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:149)
>   at 
> scala.collection.IndexedSeqOptimized.foldLeft(IndexedSeqOptimized.scala:60)
>   at 
> scala.collection.IndexedSeqOptimized.foldLeft$(IndexedSeqOptimized.scala:68)
>   at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray

[jira] [Commented] (SPARK-35096) foreachBatch throws ArrayIndexOutOfBoundsException if schema is case Insensitive

2021-04-15 Thread Sandeep Katta (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322360#comment-17322360
 ] 

Sandeep Katta commented on SPARK-35096:
---

Working on a fix, will raise a PR for this soon.

> foreachBatch throws ArrayIndexOutOfBoundsException if schema is case 
> Insensitive
> 
>
> Key: SPARK-35096
> URL: https://issues.apache.org/jira/browse/SPARK-35096
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Sandeep Katta
>Priority: Major
>
> The code below works fine before Spark 3; running on Spark 3, it throws 
> java.lang.ArrayIndexOutOfBoundsException:
> {code:java}
> val inputPath = "/Users/xyz/data/testcaseInsensitivity"
> val output_path = "/Users/xyz/output"
> spark.range(10).write.format("parquet").save(inputPath)
> def process_row(microBatch: DataFrame, batchId: Long): Unit = {
>   val df = microBatch.select($"ID".alias("other")) // Doesn't work
>   df.write.format("parquet").mode("append").save(output_path)
> }
> val schema = new StructType().add("id", LongType)
> val stream_df = 
> spark.readStream.schema(schema).format("parquet").load(inputPath)
> stream_df.writeStream.trigger(Trigger.Once).foreachBatch(process_row _)
>   .start().awaitTermination()
> {code}
> Stack Trace:
> {code:java}
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
>   at org.apache.spark.sql.types.StructType.apply(StructType.scala:414)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$4.$anonfun$applyOrElse$3(objects.scala:216)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.immutable.List.map(List.scala:298)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$4.applyOrElse(objects.scala:215)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$4.applyOrElse(objects.scala:203)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:309)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:309)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:149)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:147)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:314)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:399)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:237)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:397)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:350)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:314)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:149)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:147)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:298)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$.apply(objects.scala:203)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$.apply(objects.scala:121)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:149)
>   at 
> scala.collection.IndexedSeqOptimized.foldLeft(IndexedSeqOptimized.scala:60)
>   at 
> scala.collection.IndexedSeqOptimized.foldLeft$(IndexedSeqOptimized.scala:68)
>   at scala.collection.mutable.Wrap

[jira] [Created] (SPARK-35096) foreachBatch throws ArrayIndexOutOfBoundsException if schema is case Insensitive

2021-04-15 Thread Sandeep Katta (Jira)
Sandeep Katta created SPARK-35096:
-

 Summary: foreachBatch throws ArrayIndexOutOfBoundsException if 
schema is case Insensitive
 Key: SPARK-35096
 URL: https://issues.apache.org/jira/browse/SPARK-35096
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Sandeep Katta


The code below works fine before Spark 3; running on Spark 3, it throws
java.lang.ArrayIndexOutOfBoundsException:
{code:java}
val inputPath = "/Users/xyz/data/testcaseInsensitivity"
val output_path = "/Users/xyz/output"

spark.range(10).write.format("parquet").save(inputPath)

def process_row(microBatch: DataFrame, batchId: Long): Unit = {
  val df = microBatch.select($"ID".alias("other")) // Doesn't work
  df.write.format("parquet").mode("append").save(output_path)

}

val schema = new StructType().add("id", LongType)

val stream_df = 
spark.readStream.schema(schema).format("parquet").load(inputPath)
stream_df.writeStream.trigger(Trigger.Once).foreachBatch(process_row _)
  .start().awaitTermination()
{code}
Stack Trace:
{code:java}
Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
  at org.apache.spark.sql.types.StructType.apply(StructType.scala:414)
  at 
org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$4.$anonfun$applyOrElse$3(objects.scala:216)
  at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
  at scala.collection.immutable.List.foreach(List.scala:392)
  at scala.collection.TraversableLike.map(TraversableLike.scala:238)
  at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
  at scala.collection.immutable.List.map(List.scala:298)
  at 
org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$4.applyOrElse(objects.scala:215)
  at 
org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$4.applyOrElse(objects.scala:203)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:309)
  at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:309)
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
  at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:149)
  at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:147)
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:314)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:399)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:237)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:397)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:350)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:314)
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
  at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:149)
  at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:147)
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:298)
  at 
org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$.apply(objects.scala:203)
  at 
org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$.apply(objects.scala:121)
  at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:149)
  at scala.collection.IndexedSeqOptimized.foldLeft(IndexedSeqOptimized.scala:60)
  at 
scala.collection.IndexedSeqOptimized.foldLeft$(IndexedSeqOptimized.scala:68)
  at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:38)
  at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:146)
  at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:138)
  at scala.collection.immutable.List.foreach(List.scala:392)
  at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:138)
  at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.sca

[jira] [Created] (SPARK-35095) Use ANSI intervals in streaming join tests

2021-04-15 Thread Max Gekk (Jira)
Max Gekk created SPARK-35095:


 Summary: Use ANSI intervals in streaming join tests
 Key: SPARK-35095
 URL: https://issues.apache.org/jira/browse/SPARK-35095
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Max Gekk


Enable ANSI intervals in the tests:
- StreamingOuterJoinSuite.right outer with watermark range condition
- StreamingOuterJoinSuite.left outer with watermark range condition



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35094) Spark from_json(JsonToStruct) function return wrong value in permissive mode in case best effort

2021-04-15 Thread Nick Hryhoriev (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Hryhoriev updated SPARK-35094:
---
Description: 
I use Spark 3.1.1 and 3.0.2.

 The `from_json` function returns a wrong result in permissive mode.
 In a corner case:
 1. The JSON message has a complex nested structure
 \{sameNameField)damaged, nestedVal:{badSchemaNestedVal, 
sameNameField_WhichValueWillAppearInwrongPlace}}
 2. Nested -> nested field: the schema does not fully align with the value in 
the JSON.

scala code to reproduce:
{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.IntegerType
import org.apache.spark.sql.types.StringType
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.types.StructType

object Main {

  def main(args: Array[String]): Unit = {
implicit val spark: SparkSession = 
SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val schemaForFieldWhichWillHaveWrongValue = StructField("problematicName", 
StringType, nullable = true)
val nestedFieldWhichNotSatisfyJsonMessage = StructField(
  "badNestedField",
  StructType(Seq(StructField("SomethingWhichNotInJsonMessage", IntegerType, 
nullable = true)))
)
val nestedFieldWithNestedFieldWhichNotSatisfyJsonMessage =
  StructField(
"nestedField",
StructType(Seq(nestedFieldWhichNotSatisfyJsonMessage, 
schemaForFieldWhichWillHaveWrongValue))
  )
val customSchema = StructType(Seq(
  schemaForFieldWhichWillHaveWrongValue,
  nestedFieldWithNestedFieldWhichNotSatisfyJsonMessage
))

val jsonStringToTest =
  
"""{"problematicName":"ThisValueWillBeOverwritten","nestedField":{"badNestedField":"14","problematicName":"thisValueInTwoPlaces"}}"""
val df = List(jsonStringToTest)
  .toDF("json")
  // the issue happens only in permissive mode during best effort
  .select(from_json($"json", customSchema).as("toBeFlatten"))
  .select("toBeFlatten.*")
df.show(truncate = false)

assert(
  df.select("problematicName").as[String].first() == 
"ThisValueWillBeOverwritten",
  "wrong value in root schema, parser take value from column with same name 
but in another nested elvel"
)
  }

}

{code}
I was not able to debug this issue deeply enough to find the exact root cause.
 But what I found while debugging is that in 
`org.apache.spark.sql.catalyst.util.FailureSafeParser`, at line 64, the code block 
`e.partialResult()` already has a wrong value.

I hope this will help to fix the issue.

I did a DIRTY HACK to fix the issue.
 I just forked this function and hardcoded `None`, i.e. `Iterator(toResultRow(None, 
e.record))`.
 In my case, it is better to have no values in the row at all than to 
theoretically have a wrong value in some column.

  was:
I use spark 3.1.1 and 3.0.2.
 Function `from_json` return wrong result with Permissive mode.
 In corner case:
 1. Json message has complex nested structure
 \{sameNameField)damaged, nestedVal:{badSchemaNestedVal, 
sameNameField_WhichValueWillAppearInwrongPlace}}
 2. Nested -> Nested Field: Schema is satisfy align with value in json.

scala code to reproduce:
{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.IntegerType
import org.apache.spark.sql.types.StringType
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.types.StructType

object Main {

  def main(args: Array[String]): Unit = {
implicit val spark: SparkSession = 
SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val schemaForFieldWhichWillHaveWrongValue = StructField("problematicName", 
StringType, nullable = true)
val nestedFieldWhichNotSatisfyJsonMessage = StructField(
  "badNestedField",
  StructType(Seq(StructField("SomethingWhichNotInJsonMessage", IntegerType, 
nullable = true)))
)
val nestedFieldWithNestedFieldWhichNotSatisfyJsonMessage =
  StructField(
"nestedField",
StructType(Seq(nestedFieldWhichNotSatisfyJsonMessage, 
schemaForFieldWhichWillHaveWrongValue))
  )
val customSchema = StructType(Seq(
  schemaForFieldWhichWillHaveWrongValue,
  nestedFieldWithNestedFieldWhichNotSatisfyJsonMessage
))

val jsonStringToTest =
  
"""{"problematicName":"ThisValueWillBeOverwritten","nestedField":{"badNestedField":"14","problematicName":"thisValueInTwoPlaces"}}"""
val df = List(jsonStringToTest)
  .toDF("json")
  // issue happen only in permissive mode during best effort
  .select(from_json($"json", customSchema).as("toBeFlatten"))
  .select("toBeFlatten.*")
df.show(truncate = false)

assert(
  df.select("problematicName").as[String].first() == 
"ThisValueWillBeOverwritten",
  "wrong value in root schema, parser take value from column with same name 
but in another ne

[jira] [Updated] (SPARK-35094) Spark from_json(JsonToStruct) function return wrong value in permissive mode in case best effort

2021-04-15 Thread Nick Hryhoriev (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Hryhoriev updated SPARK-35094:
---
Description: 
I use Spark 3.1.1 and 3.0.2.
 The `from_json` function returns a wrong result in permissive mode.
 In a corner case:
 1. The JSON message has a complex nested structure
 \{sameNameField)damaged, nestedVal:{badSchemaNestedVal, 
sameNameField_WhichValueWillAppearInwrongPlace}}
 2. Nested -> nested field: the schema does not fully align with the value in 
the JSON.

scala code to reproduce:
{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.IntegerType
import org.apache.spark.sql.types.StringType
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.types.StructType

object Main {

  def main(args: Array[String]): Unit = {
implicit val spark: SparkSession = 
SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val schemaForFieldWhichWillHaveWrongValue = StructField("problematicName", 
StringType, nullable = true)
val nestedFieldWhichNotSatisfyJsonMessage = StructField(
  "badNestedField",
  StructType(Seq(StructField("SomethingWhichNotInJsonMessage", IntegerType, 
nullable = true)))
)
val nestedFieldWithNestedFieldWhichNotSatisfyJsonMessage =
  StructField(
"nestedField",
StructType(Seq(nestedFieldWhichNotSatisfyJsonMessage, 
schemaForFieldWhichWillHaveWrongValue))
  )
val customSchema = StructType(Seq(
  schemaForFieldWhichWillHaveWrongValue,
  nestedFieldWithNestedFieldWhichNotSatisfyJsonMessage
))

val jsonStringToTest =
  
"""{"problematicName":"ThisValueWillBeOverwritten","nestedField":{"badNestedField":"14","problematicName":"thisValueInTwoPlaces"}}"""
val df = List(jsonStringToTest)
  .toDF("json")
  // the issue happens only in permissive mode during best effort
  .select(from_json($"json", customSchema).as("toBeFlatten"))
  .select("toBeFlatten.*")
df.show(truncate = false)

assert(
  df.select("problematicName").as[String].first() == 
"ThisValueWillBeOverwritten",
  "wrong value in root schema, parser take value from column with same name 
but in another nested elvel"
)
  }

}

{code}
I was not able to debug this issue deeply enough to find the exact root cause.
 But what I found while debugging is that in 
`org.apache.spark.sql.catalyst.util.FailureSafeParser`, at line 64, the code block 
`e.partialResult()` already has a wrong value.

I hope this will help to fix the issue.

I did a DIRTY HACK to fix the issue.
 I just forked this function and hardcoded `None`, i.e. `Iterator(toResultRow(None, 
e.record))`.
In my case, it is better to have no values in the row at all than to 
theoretically have a wrong value in some column.

  was:
I use spark 3.1.1 and 3.0.2.
Function `from_json` return wrong result with Permissive mode.
In corner case:
1. Json message has complex nested structure
 \{sameNameField)damaged, nestedVal:{badSchemaNestedVal, 
sameNameField_WhichValueWillAppearInwrongPlace}}
2. Nested -> Nested Field: Schema is satisfy align with value in json.

scala code to reproduce:
{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.IntegerType
import org.apache.spark.sql.types.StringType
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.types.StructType

object Main {

  def main(args: Array[String]): Unit = {
implicit val spark: SparkSession = 
SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val schemaForFieldWhichWillHaveWrongValue = StructField("problematicName", 
StringType, nullable = true)
val nestedFieldWhichNotSatisfyJsonMessage = StructField(
  "badNestedField",
  StructType(Seq(StructField("SomethingWhichNotInJsonMessage", IntegerType, 
nullable = true)))
)
val nestedFieldWithNestedFieldWhichNotSatisfyJsonMessage =
  StructField(
"nestedField",
StructType(Seq(nestedFieldWhichNotSatisfyJsonMessage, 
schemaForFieldWhichWillHaveWrongValue))
  )
val customSchema = StructType(Seq(
  schemaForFieldWhichWillHaveWrongValue,
  nestedFieldWithNestedFieldWhichNotSatisfyJsonMessage
))

val jsonStringToTest =
  
"""{"problematicName":"ThisValueWillBeOverwritten","nestedField":{"badNestedField":"14","problematicName":"thisValueInTwoPlaces"}}"""
val df = List(jsonStringToTest)
  .toDF("json")
  // issue happen only in permissive mode during best effort
  .select(from_json($"json", customSchema).as("toBeFlatten"))
  .select("toBeFlatten.*")
df.show(truncate = false)

assert(
  df.select("problematicName").as[String].first() == 
"ThisValueWillBeOverwritten",
  "wrong value in root schema, parser take value from column with same name 
but in another nested e

[jira] [Commented] (SPARK-33000) cleanCheckpoints config does not clean all checkpointed RDDs on shutdown

2021-04-15 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322333#comment-17322333
 ] 

Nicholas Chammas commented on SPARK-33000:
--

Per the discussion [on the dev 
list|http://apache-spark-developers-list.1001551.n3.nabble.com/Shutdown-cleanup-of-disk-based-resources-that-Spark-creates-td30928.html]
 and [PR|https://github.com/apache/spark/pull/31742], it seems we just want to 
update the documentation to clarify that {{cleanCheckpoints}} does not impact 
shutdown behavior. i.e. Checkpoints are not meant to be cleaned up on shutdown 
(whether planned or unplanned), and the config is currently working as intended.

> cleanCheckpoints config does not clean all checkpointed RDDs on shutdown
> 
>
> Key: SPARK-33000
> URL: https://issues.apache.org/jira/browse/SPARK-33000
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.6
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Maybe it's just that the documentation needs to be updated, but I found this 
> surprising:
> {code:python}
> $ pyspark
> ...
> >>> spark.conf.set('spark.cleaner.referenceTracking.cleanCheckpoints', 'true')
> >>> spark.sparkContext.setCheckpointDir('/tmp/spark/checkpoint/')
> >>> a = spark.range(10)
> >>> a.checkpoint()
> DataFrame[id: bigint] 
>   
> >>> exit(){code}
> The checkpoint data is left behind in {{/tmp/spark/checkpoint/}}. I expected 
> Spark to clean it up on shutdown.
> The documentation for {{spark.cleaner.referenceTracking.cleanCheckpoints}} 
> says:
> {quote}Controls whether to clean checkpoint files if the reference is out of 
> scope.
> {quote}
> When Spark shuts down, everything goes out of scope, so I'd expect all 
> checkpointed RDDs to be cleaned up.
> For the record, I see the same behavior in both the Scala and Python REPLs.
> Evidence the current behavior is confusing:
>  * [https://stackoverflow.com/q/52630858/877069]
>  * [https://stackoverflow.com/q/60009856/877069]
>  * [https://stackoverflow.com/q/61454740/877069]
>  
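
For users who do want the files gone on a planned shutdown, here is a hedged sketch (not part of Spark; the path and the shutdown hook are illustrative) of cleaning the checkpoint directory manually from the spark-shell:

{code:java}
import org.apache.hadoop.fs.{FileSystem, Path}

val checkpointDir = new Path("/tmp/spark/checkpoint/")   // hypothetical location
sc.setCheckpointDir(checkpointDir.toString)

sys.addShutdownHook {
  // Best-effort cleanup on JVM exit; Spark itself leaves checkpoints behind by design.
  val fs = FileSystem.get(sc.hadoopConfiguration)
  fs.delete(checkpointDir, true) // recursive
}
{code}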



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35094) Spark from_json(JsonToStruct) function return wrong value in permissive mode in case best effort

2021-04-15 Thread Nick Hryhoriev (Jira)
Nick Hryhoriev created SPARK-35094:
--

 Summary: Spark from_json(JsonToStruct)  function return wrong 
value in permissive mode in case best effort
 Key: SPARK-35094
 URL: https://issues.apache.org/jira/browse/SPARK-35094
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 3.1.1, 3.0.2
Reporter: Nick Hryhoriev


I use Spark 3.1.1 and 3.0.2.
The `from_json` function returns a wrong result in PERMISSIVE mode in a corner case:
 1. The JSON message has a complex nested structure, roughly
 \{sameNameField: damaged, nestedVal: {badSchemaNestedVal, sameNameField_WhichValueWillAppearInwrongPlace}}
 2. The schema of the nested -> nested field does not align with the value in the JSON message.

Scala code to reproduce:
{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.IntegerType
import org.apache.spark.sql.types.StringType
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.types.StructType

object Main {

  def main(args: Array[String]): Unit = {
    implicit val spark: SparkSession =
      SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    val schemaForFieldWhichWillHaveWrongValue =
      StructField("problematicName", StringType, nullable = true)
    val nestedFieldWhichNotSatisfyJsonMessage = StructField(
      "badNestedField",
      StructType(Seq(StructField("SomethingWhichNotInJsonMessage", IntegerType, nullable = true)))
    )
    val nestedFieldWithNestedFieldWhichNotSatisfyJsonMessage =
      StructField(
        "nestedField",
        StructType(Seq(nestedFieldWhichNotSatisfyJsonMessage, schemaForFieldWhichWillHaveWrongValue))
      )
    val customSchema = StructType(Seq(
      schemaForFieldWhichWillHaveWrongValue,
      nestedFieldWithNestedFieldWhichNotSatisfyJsonMessage
    ))

    val jsonStringToTest =
      """{"problematicName":"ThisValueWillBeOverwritten","nestedField":{"badNestedField":"14","problematicName":"thisValueInTwoPlaces"}}"""
    val df = List(jsonStringToTest)
      .toDF("json")
      // the issue happens only in PERMISSIVE mode, during best-effort parsing
      .select(from_json($"json", customSchema).as("toBeFlatten"))
      .select("toBeFlatten.*")
    df.show(truncate = false)

    assert(
      df.select("problematicName").as[String].first() == "ThisValueWillBeOverwritten",
      "wrong value in root schema: the parser takes the value from a column with the same name at another nesting level"
    )
  }

}

{code}
I was not able to debug this issue far enough to find the exact root cause.
But what I did find while debugging is that in 
`org.apache.spark.sql.catalyst.util.FailureSafeParser`, at line 64, 
`e.partialResult()` already holds the wrong value.

I hope this will help to fix the issue.

As a DIRTY HACK, I forked this function and hardcoded `None`, i.e. 
`Iterator(toResultRow(None, e.record))`.
In my case it's better to have no values in the row than to theoretically have a 
wrong value in some column.
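
Until this is fixed, a hedged sketch of a user-level sanity check (it reuses {{jsonStringToTest}} and {{customSchema}} from the reproduction above and is not part of any official fix): compare the root value parsed by {{from_json}} with the value extracted directly from the raw JSON, and flag rows where they differ.

{code:java}
import org.apache.spark.sql.functions.{col, from_json, get_json_object}

val checked = List(jsonStringToTest)
  .toDF("json")
  .select(
    col("json"),
    from_json(col("json"), customSchema).getField("problematicName").as("parsedRoot"),
    get_json_object(col("json"), "$.problematicName").as("rawRoot"))
  .withColumn("suspicious", col("parsedRoot") =!= col("rawRoot"))

checked.show(truncate = false)   // rows hit by this bug come back with suspicious = true
{code}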

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34792) Restore previous behaviour of randomSplit from spark-2.4.7 in spark-3

2021-04-15 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322307#comment-17322307
 ] 

Dongjoon Hyun commented on SPARK-34792:
---

Thank you for sharing your update, [~kondziolka9ld].

> Restore previous behaviour of randomSplit from spark-2.4.7 in spark-3
> -
>
> Key: SPARK-34792
> URL: https://issues.apache.org/jira/browse/SPARK-34792
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core, SQL
>Affects Versions: 3.0.1
>Reporter: kondziolka9ld
>Priority: Major
>
> Hi, 
> Please consider the following difference in the `randomSplit` method, despite 
> using the same seed.
>  
> {code:java}
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.4.7
>   /_/
>  
> Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_282)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> val Array(f, s) =  
> Seq(1,2,3,4,5,6,7,8,9,10).toDF.randomSplit(Array(0.3, 0.7), 42)
> f: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int]
> s: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int]
> scala> f.show
> +-+
> |value|
> +-+
> |4|
> +-+
> scala> s.show
> +-+
> |value|
> +-+
> |1|
> |2|
> |3|
> |5|
> |6|
> |7|
> |8|
> |9|
> |   10|
> +-+
> {code}
> while as on spark-3
> {code:java}
> scala> val Array(f, s) =  
> Seq(1,2,3,4,5,6,7,8,9,10).toDF.randomSplit(Array(0.3, 0.7), 42)
> f: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int]
> s: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int]
> scala> f.show
> +-+
> |value|
> +-+
> |5|
> |   10|
> +-+
> scala> s.show
> +-+
> |value|
> +-+
> |1|
> |2|
> |3|
> |4|
> |6|
> |7|
> |8|
> |9|
> +-+
> {code}
> I guess that implementation of `sample` method changed.
> Is it possible to restore previous behaviour?
> Thanks in advance!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35093) [SQL] AQE columnar mismatch on exchange reuse

2021-04-15 Thread Andy Grove (Jira)
Andy Grove created SPARK-35093:
--

 Summary: [SQL] AQE columnar mismatch on exchange reuse
 Key: SPARK-35093
 URL: https://issues.apache.org/jira/browse/SPARK-35093
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.1, 3.0.2
Reporter: Andy Grove


With AQE enabled, AdaptiveSparkPlanExec will attempt to reuse exchanges that 
are semantically equal.

This is done by comparing the canonicalized plan for two Exchange nodes to see 
if they are the same.

Unfortunately this does not take into account the fact that two exchanges with 
the same canonical plan might be replaced by a plugin in a way that makes them 
not compatible. For example, a plugin could create one version with 
supportsColumnar=true and another with supportsColumnar=false. It is not valid 
to re-use exchanges if there is a supportsColumnar mismatch.

I have tested a fix for this and will put up a PR once I figure out how to 
write the tests.
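
Until that PR is up, a hedged sketch of the kind of guard being described (an illustration of the idea, not the actual fix): two exchanges should only be treated as interchangeable when both their canonicalized plans and their columnar output mode match.

{code:java}
import org.apache.spark.sql.execution.SparkPlan

// Sketch only: reuse requires the same canonical form *and* the same columnar output.
def safeToReuse(candidate: SparkPlan, existing: SparkPlan): Boolean =
  candidate.sameResult(existing) &&
    candidate.supportsColumnar == existing.supportsColumnar
{code}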

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34720) Incorrect star expansion logic MERGE INSERT * / UPDATE *

2021-04-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322285#comment-17322285
 ] 

Apache Spark commented on SPARK-34720:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/32192

> Incorrect star expansion logic MERGE INSERT * / UPDATE *
> 
>
> Key: SPARK-34720
> URL: https://issues.apache.org/jira/browse/SPARK-34720
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.1
>Reporter: Tathagata Das
>Priority: Major
>
> The natural expectation of INSERT * or UPDATE * in MERGE is to assign source 
> columns to target columns of the same name. However, the current logic here 
> generates the assignment by position. 
> https://github.com/apache/spark/commit/7cfd589868b8430bc79e28e4d547008b222781a5#diff-ed19f376a63eba52eea59ca71f3355d4495fad4fad4db9a3324aade0d4986a47R1214
> This can very easily lead to incorrect results without the user realizing 
> this odd behavior.
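
A hedged illustration of the by-name expectation (table names are made up and assume a catalog/data source that supports MERGE INTO):

{code:java}
// target columns: (id INT, name STRING); source columns deliberately reordered: (name STRING, id INT)
spark.sql("""
  MERGE INTO target t
  USING source s
  ON t.id = s.id
  WHEN NOT MATCHED THEN INSERT *
""")
// Expected: INSERT * assigns s.id -> t.id and s.name -> t.name (by name).
// With the positional logic linked above, s.name would instead be assigned to t.id.
{code}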



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34720) Incorrect star expansion logic MERGE INSERT * / UPDATE *

2021-04-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322284#comment-17322284
 ] 

Apache Spark commented on SPARK-34720:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/32192

> Incorrect star expansion logic MERGE INSERT * / UPDATE *
> 
>
> Key: SPARK-34720
> URL: https://issues.apache.org/jira/browse/SPARK-34720
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.1
>Reporter: Tathagata Das
>Priority: Major
>
> The natural expectation of INSERT * or UPDATE * in MERGE is to assign source 
> columns to target columns of the same name. However, the current logic here 
> generates the assignment by position. 
> https://github.com/apache/spark/commit/7cfd589868b8430bc79e28e4d547008b222781a5#diff-ed19f376a63eba52eea59ca71f3355d4495fad4fad4db9a3324aade0d4986a47R1214
> This can very easily lead to incorrect results without the user realizing 
> this odd behavior.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34720) Incorrect star expansion logic MERGE INSERT * / UPDATE *

2021-04-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34720:


Assignee: Apache Spark

> Incorrect star expansion logic MERGE INSERT * / UPDATE *
> 
>
> Key: SPARK-34720
> URL: https://issues.apache.org/jira/browse/SPARK-34720
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.1
>Reporter: Tathagata Das
>Assignee: Apache Spark
>Priority: Major
>
> The natural expectation of INSERT * or UPDATE * in MERGE is to assign source 
> columns to target columns of the same name. However, the current logic here 
> generates the assignment by position. 
> https://github.com/apache/spark/commit/7cfd589868b8430bc79e28e4d547008b222781a5#diff-ed19f376a63eba52eea59ca71f3355d4495fad4fad4db9a3324aade0d4986a47R1214
> This can very easily lead to incorrect results without the user realizing 
> this odd behavior.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34720) Incorrect star expansion logic MERGE INSERT * / UPDATE *

2021-04-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34720:


Assignee: (was: Apache Spark)

> Incorrect star expansion logic MERGE INSERT * / UPDATE *
> 
>
> Key: SPARK-34720
> URL: https://issues.apache.org/jira/browse/SPARK-34720
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.1
>Reporter: Tathagata Das
>Priority: Major
>
> The natural expectation of INSERT * or UPDATE * in MERGE is to assign source 
> columns to target columns of the same name. However, the current logic here 
> generates the assignment by position. 
> https://github.com/apache/spark/commit/7cfd589868b8430bc79e28e4d547008b222781a5#diff-ed19f376a63eba52eea59ca71f3355d4495fad4fad4db9a3324aade0d4986a47R1214
> This can very easily lead to incorrect results without the user realizing 
> this odd behavior.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35092) the auto-generated rdd's name in the storage tab should be truncated if it is too long.

2021-04-15 Thread akiyamaneko (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

akiyamaneko updated SPARK-35092:

Attachment: the rdd title in storage page shows too long.png

> the auto-generated rdd's name in the storage tab should be truncated if it is 
> too long.
> ---
>
> Key: SPARK-35092
> URL: https://issues.apache.org/jira/browse/SPARK-35092
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.1
>Reporter: akiyamaneko
>Priority: Trivial
> Attachments: the rdd title in storage page shows too long.png
>
>
> the auto-generated rdd's name in the storage tab should be truncated if it is 
> too long.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35092) the auto-generated rdd's name in the storage tab should be truncated if it is too long.

2021-04-15 Thread akiyamaneko (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

akiyamaneko updated SPARK-35092:

Attachment: (was: the rdd title in storage page shows too long.png)

> the auto-generated rdd's name in the storage tab should be truncated if it is 
> too long.
> ---
>
> Key: SPARK-35092
> URL: https://issues.apache.org/jira/browse/SPARK-35092
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.1
>Reporter: akiyamaneko
>Priority: Trivial
> Attachments: the rdd title in storage page shows too long.png
>
>
> the auto-generated rdd's name in the storage tab should be truncated if it is 
> too long.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35092) the auto-generated rdd's name in the storage tab should be truncated if it is too long.

2021-04-15 Thread akiyamaneko (Jira)
akiyamaneko created SPARK-35092:
---

 Summary: the auto-generated rdd's name in the storage tab should 
be truncated if it is too long.
 Key: SPARK-35092
 URL: https://issues.apache.org/jira/browse/SPARK-35092
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 3.0.1
Reporter: akiyamaneko
 Attachments: the rdd title in storage page shows too long.png

the auto-generated rdd's name in the storage tab should be truncated if it is 
too long.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35092) the auto-generated rdd's name in the storage tab should be truncated if it is too long.

2021-04-15 Thread akiyamaneko (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

akiyamaneko updated SPARK-35092:

Attachment: the rdd title in storage page shows too long.png

> the auto-generated rdd's name in the storage tab should be truncated if it is 
> too long.
> ---
>
> Key: SPARK-35092
> URL: https://issues.apache.org/jira/browse/SPARK-35092
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.1
>Reporter: akiyamaneko
>Priority: Trivial
> Attachments: the rdd title in storage page shows too long.png
>
>
> the auto-generated rdd's name in the storage tab should be truncated if it is 
> too long.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35091) Support ANSI intervals by date_part()

2021-04-15 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-35091:
-
Description: 
Support year-month and day-time intervals by date_part(). For example:

{code:sql}
  > SELECT date_part('days', interval '5 0:0:0' day to second);
   5
  > SELECT date_part('seconds', interval '10 11:12:30.001001' DAY TO 
SECOND);
   30.001001
{code}


  was:
Support field extraction from year-month and day-time intervals. For example:

{code:sql}
  > SELECT extract(days FROM interval '5 0:0:0' day to second);
   5
  > SELECT extract(seconds FROM interval '10 11:12:30.001001' DAY TO 
SECOND);
   30.001001
{code}



> Support ANSI intervals by date_part()
> -
>
> Key: SPARK-35091
> URL: https://issues.apache.org/jira/browse/SPARK-35091
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Priority: Major
>
> Support year-month and day-time intervals by date_part(). For example:
> {code:sql}
>   > SELECT date_part('days', interval '5 0:0:0' day to second);
>5
>   > SELECT date_part('seconds', interval '10 11:12:30.001001' DAY TO 
> SECOND);
>30.001001
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35091) Support ANSI intervals by date_part()

2021-04-15 Thread Max Gekk (Jira)
Max Gekk created SPARK-35091:


 Summary: Support ANSI intervals by date_part()
 Key: SPARK-35091
 URL: https://issues.apache.org/jira/browse/SPARK-35091
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Max Gekk


Support field extraction from year-month and day-time intervals. For example:

{code:sql}
  > SELECT extract(days FROM interval '5 0:0:0' day to second);
   5
  > SELECT extract(seconds FROM interval '10 11:12:30.001001' DAY TO 
SECOND);
   30.001001
{code}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35087) Some columns in table ` Aggregated Metrics by Executor` of stage-detail page shows incorrectly.

2021-04-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322146#comment-17322146
 ] 

Apache Spark commented on SPARK-35087:
--

User 'kyoty' has created a pull request for this issue:
https://github.com/apache/spark/pull/32190

> Some columns  in table ` Aggregated Metrics by Executor` of stage-detail page 
> shows incorrectly.
> 
>
> Key: SPARK-35087
> URL: https://issues.apache.org/jira/browse/SPARK-35087
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.1
> Environment: spark version: 3.1.1
>Reporter: akiyamaneko
>Priority: Minor
> Attachments: sort-result-incorrent.png
>
>
> Some columns like 'Shuffle Read Size / Records', 'Output Size / Records', etc. 
> in the table `Aggregated Metrics by Executor` of the stage-detail page should be 
> sorted in numerical order instead of lexicographical order.
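
For context, a tiny plain-Scala illustration of why lexicographic sorting of the formatted values looks wrong (assuming all values share the same unit):

{code:java}
val sizes = Seq("10.0 KiB", "2.0 KiB", "9.0 KiB")
println(sizes.sorted)                             // List(10.0 KiB, 2.0 KiB, 9.0 KiB) -- "10" < "2" as strings
println(sizes.sortBy(_.split(" ")(0).toDouble))   // List(2.0 KiB, 9.0 KiB, 10.0 KiB) -- numeric sort key
{code}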



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35090) Extract a field from ANSI interval

2021-04-15 Thread Max Gekk (Jira)
Max Gekk created SPARK-35090:


 Summary: Extract a field from ANSI interval
 Key: SPARK-35090
 URL: https://issues.apache.org/jira/browse/SPARK-35090
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Max Gekk


Support field extraction from year-month and day-time intervals. For example:

{code:sql}
  > SELECT extract(days FROM interval '5 0:0:0' day to second);
   5
  > SELECT extract(seconds FROM interval '10 11:12:30.001001' DAY TO 
SECOND);
   30.001001
{code}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34787) Option variable in Spark historyServer log should be displayed as actual value instead of Some(XX)

2021-04-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322138#comment-17322138
 ] 

Apache Spark commented on SPARK-34787:
--

User 'kyoty' has created a pull request for this issue:
https://github.com/apache/spark/pull/32189

> Option variable in Spark historyServer log should be displayed as actual 
> value instead of Some(XX)
> --
>
> Key: SPARK-34787
> URL: https://issues.apache.org/jira/browse/SPARK-34787
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: akiyamaneko
>Priority: Minor
>
> Option variable in Spark historyServer log should be displayed as actual 
> value instead of Some(XX):
> {code:html}
> 21/02/25 10:10:45 INFO ApplicationCache: Failed to load application attempt 
> application_1613641231234_0421/Some(1) 21/02/25 10:10:52 INFO 
> FsHistoryProvider: Parsing 
> hdfs://graph-product-001:8020/system/spark2-history/application_1613641231234_0421
>  for listing data...
> {code}
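
A minimal Scala illustration of the formatting nit (the variable name is made up):

{code:java}
val attemptId: Option[Int] = Some(1)
println(s"attempt $attemptId")                                       // attempt Some(1)
println(s"attempt ${attemptId.map(_.toString).getOrElse("none")}")   // attempt 1
{code}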



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35087) Some columns in table ` Aggregated Metrics by Executor` of stage-detail page shows incorrectly.

2021-04-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35087:


Assignee: Apache Spark

> Some columns  in table ` Aggregated Metrics by Executor` of stage-detail page 
> shows incorrectly.
> 
>
> Key: SPARK-35087
> URL: https://issues.apache.org/jira/browse/SPARK-35087
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.1
> Environment: spark version: 3.1.1
>Reporter: akiyamaneko
>Assignee: Apache Spark
>Priority: Minor
> Attachments: sort-result-incorrent.png
>
>
> Some columns like 'Shuffle Read Size / Records', 'Output Size / Records', etc. 
> in the table `Aggregated Metrics by Executor` of the stage-detail page should be 
> sorted in numerical order instead of lexicographical order.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34995) Port/integrate Koalas remaining codes into PySpark

2021-04-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34995:


Assignee: (was: Apache Spark)

> Port/integrate Koalas remaining codes into PySpark
> --
>
> Key: SPARK-34995
> URL: https://issues.apache.org/jira/browse/SPARK-34995
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> There are some more commits remaining after the main codes were ported.
> - 
> [https://github.com/databricks/koalas/commit/c8f803d6becb3accd767afdb3774c8656d0d0b47]
> - 
> [https://github.com/databricks/koalas/commit/913d68868d38ee7158c640aceb837484f417267e]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35087) Some columns in table ` Aggregated Metrics by Executor` of stage-detail page shows incorrectly.

2021-04-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322135#comment-17322135
 ] 

Apache Spark commented on SPARK-35087:
--

User 'kyoty' has created a pull request for this issue:
https://github.com/apache/spark/pull/32187

> Some columns  in table ` Aggregated Metrics by Executor` of stage-detail page 
> shows incorrectly.
> 
>
> Key: SPARK-35087
> URL: https://issues.apache.org/jira/browse/SPARK-35087
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.1
> Environment: spark version: 3.1.1
>Reporter: akiyamaneko
>Priority: Minor
> Attachments: sort-result-incorrent.png
>
>
> Some columns like 'Shuffle Read Size / Records', 'Output Size / Records', etc. 
> in the table `Aggregated Metrics by Executor` of the stage-detail page should be 
> sorted in numerical order instead of lexicographical order.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34995) Port/integrate Koalas remaining codes into PySpark

2021-04-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34995:


Assignee: Apache Spark

> Port/integrate Koalas remaining codes into PySpark
> --
>
> Key: SPARK-34995
> URL: https://issues.apache.org/jira/browse/SPARK-34995
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Assignee: Apache Spark
>Priority: Major
>
> There are some more commits remaining after the main codes were ported.
> - 
> [https://github.com/databricks/koalas/commit/c8f803d6becb3accd767afdb3774c8656d0d0b47]
> - 
> [https://github.com/databricks/koalas/commit/913d68868d38ee7158c640aceb837484f417267e]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35087) Some columns in table ` Aggregated Metrics by Executor` of stage-detail page shows incorrectly.

2021-04-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35087:


Assignee: (was: Apache Spark)

> Some columns  in table ` Aggregated Metrics by Executor` of stage-detail page 
> shows incorrectly.
> 
>
> Key: SPARK-35087
> URL: https://issues.apache.org/jira/browse/SPARK-35087
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.1
> Environment: spark version: 3.1.1
>Reporter: akiyamaneko
>Priority: Minor
> Attachments: sort-result-incorrent.png
>
>
> Some columns like 'Shuffle Read Size / Records', 'Output Size / Records', etc. 
> in the table `Aggregated Metrics by Executor` of the stage-detail page should be 
> sorted in numerical order instead of lexicographical order.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35089) non consistent results running count for same dataset after filter and lead window function

2021-04-15 Thread Domagoj (Jira)
Domagoj created SPARK-35089:
---

 Summary: non consistent results running count for same dataset 
after filter and lead window function
 Key: SPARK-35089
 URL: https://issues.apache.org/jira/browse/SPARK-35089
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.1, 3.0.1
Reporter: Domagoj


I have found an inconsistency in count() results after a lead window function and a 
filter.

I have a dataframe (this is a simplified version, but it's enough to reproduce) 
with millions of records, with these columns:
 * df1:
 ** start (timestamp)
 ** user_id (int)
 ** type (string)

I need to define the duration between two rows and filter on that duration and 
type. I used the lead window function to get the next event time (which defines the 
end of the current event), so every row now gets start and stop times. If the lead 
is NULL (the last row, for example), the next midnight is added as the stop. The 
data is stored in an ORC file (tried Parquet format, no difference).

This only happens with multiple cluster nodes, for example an AWS EMR cluster or a 
local Docker cluster setup. If I run it on a single instance (locally on a laptop), 
I get consistent results every time. The Spark version is 3.0.1, both in AWS and in 
the local Docker setup.

Here is some simple code that you can use to reproduce it; I've used a JupyterLab 
notebook on AWS EMR. The Spark version is 3.0.1.

 

 
{code:java}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

// This dataframe generation code should be executed only once; the data has to be
// saved and then reopened from disk, so it is always the same.

val getRandomUser = udf(() => {
  val users = Seq("John", "Eve", "Anna", "Martin", "Joe", "Steve", "Katy")
  users(scala.util.Random.nextInt(7))
})

val getRandomType = udf(() => {
  val types = Seq("TypeA", "TypeB", "TypeC", "TypeD", "TypeE")
  types(scala.util.Random.nextInt(5))
})

val getRandomStart = udf((x: Int) => {
  x + scala.util.Random.nextInt(47)
})

// the for loop is used to avoid an out-of-memory error during creation of the dataframe
for (a <- 0 to 23) {
  // use iterator a to continue with the next million; one million rows per iteration
  val x = Range(a * 1000000, (a * 1000000) + 1000000).toDF("id").
    withColumn("start", getRandomStart(col("id"))).
    withColumn("user", getRandomUser()).
    withColumn("type", getRandomType()).
    drop("id")

  x.write.mode("append").orc("hdfs:///random.orc")
}

// the code above should be run only once; I used a separate cell in Jupyter

// define window and lead
val w = Window.partitionBy("user").orderBy("start")
// if null, replace with 30.000.000
val ts_lead = coalesce(lead("start", 1).over(w), lit(30000000))

// read data into a dataframe, create the stop column and calculate the duration
val fox2 = spark.read.orc("hdfs:///random.orc").
  withColumn("end", ts_lead).
  withColumn("duration", col("end") - col("start"))

// repeated executions of this line return different results for the count
// (I have it in a separate cell in JupyterLab)
fox2.where("type='TypeA' and duration>4").count()
{code}
My results for three consecutive runs of the last line were:
 * run 1: 2551259
 * run 2: 2550756
 * run 3: 2551279

It's important to note that if I use either filter on its own:

fox2.where("type='TypeA' ")

or

fox2.where("duration>4"),

each of them can be executed repeatedly and I get a consistent result every time.

I can save the dataframe after creating the stop and duration columns, and after 
that I get consistent results every time.

It is not a very practical workaround, as I need a lot of space and time to 
implement it.
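
For reference, a minimal sketch of that workaround, reusing {{fox2}} from the code above (the output path is just an example):

{code:java}
// Materialize the derived end/duration columns once, then query the saved copy.
fox2.write.mode("overwrite").orc("hdfs:///random_with_duration.orc")

val stable = spark.read.orc("hdfs:///random_with_duration.orc")
stable.where("type='TypeA' and duration>4").count()   // repeatable result
{code}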

This dataset is really big (in my eyes at least, approx. 100.000.000 new records 
per day).

If I run this same example on my local machine using master = local[*], everything 
works as expected; the problem appears only in a cluster setup. I created 3.0.1 and 
3.1.1 clusters with Docker on my local machine, each with one master and two 
workers, and successfully reproduced the issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35088) Accept ANSI intervals by the Sequence expression

2021-04-15 Thread Max Gekk (Jira)
Max Gekk created SPARK-35088:


 Summary: Accept ANSI intervals by the Sequence expression
 Key: SPARK-35088
 URL: https://issues.apache.org/jira/browse/SPARK-35088
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Max Gekk


Currently, the expression accepts only CalendarIntervalType as the step 
expression. It should support ANSI intervals as well.
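
For illustration, what is accepted today versus the goal of this ticket (the ANSI form is the aspiration, not something this sketch claims already works):

{code:java}
// Accepted today: a CalendarIntervalType step.
spark.sql("SELECT sequence(DATE'2021-01-01', DATE'2021-04-01', interval 1 month)").show(false)

// Goal described above: also accept an ANSI interval step, e.g.
//   SELECT sequence(DATE'2021-01-01', DATE'2021-04-01', INTERVAL '1' MONTH)
{code}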



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34995) Port/integrate Koalas remaining codes into PySpark

2021-04-15 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-34995:
-
Fix Version/s: (was: 3.2.0)

> Port/integrate Koalas remaining codes into PySpark
> --
>
> Key: SPARK-34995
> URL: https://issues.apache.org/jira/browse/SPARK-34995
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> There are some more commits remaining after the main codes were ported.
> - 
> [https://github.com/databricks/koalas/commit/c8f803d6becb3accd767afdb3774c8656d0d0b47]
> - 
> [https://github.com/databricks/koalas/commit/913d68868d38ee7158c640aceb837484f417267e]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-34995) Port/integrate Koalas remaining codes into PySpark

2021-04-15 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-34995:
--
  Assignee: (was: Haejoon Lee)

Reverted at 
https://github.com/apache/spark/commit/637f59360b3b10c7200daa29b15c80ce9b710850

> Port/integrate Koalas remaining codes into PySpark
> --
>
> Key: SPARK-34995
> URL: https://issues.apache.org/jira/browse/SPARK-34995
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Priority: Major
> Fix For: 3.2.0
>
>
> There are some more commits remaining after the main codes were ported.
> - 
> [https://github.com/databricks/koalas/commit/c8f803d6becb3accd767afdb3774c8656d0d0b47]
> - 
> [https://github.com/databricks/koalas/commit/913d68868d38ee7158c640aceb837484f417267e]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35087) Some columns in table ` Aggregated Metrics by Executor` of stage-detail page shows incorrectly.

2021-04-15 Thread akiyamaneko (Jira)
akiyamaneko created SPARK-35087:
---

 Summary: Some columns  in table ` Aggregated Metrics by Executor` 
of stage-detail page shows incorrectly.
 Key: SPARK-35087
 URL: https://issues.apache.org/jira/browse/SPARK-35087
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 3.0.1
 Environment: spark version: 3.1.1
Reporter: akiyamaneko
 Attachments: sort-result-incorrent.png

Some columns like 'Shuffle Read Size / Records', 'Output Size / Records', etc. in 
the table `Aggregated Metrics by Executor` of the stage-detail page should be 
sorted in numerical order instead of lexicographical order.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35087) Some columns in table ` Aggregated Metrics by Executor` of stage-detail page shows incorrectly.

2021-04-15 Thread akiyamaneko (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

akiyamaneko updated SPARK-35087:

Attachment: sort-result-incorrent.png

> Some columns  in table ` Aggregated Metrics by Executor` of stage-detail page 
> shows incorrectly.
> 
>
> Key: SPARK-35087
> URL: https://issues.apache.org/jira/browse/SPARK-35087
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.1
> Environment: spark version: 3.1.1
>Reporter: akiyamaneko
>Priority: Minor
> Attachments: sort-result-incorrent.png
>
>
> Some columns like 'Shuffle Read Size / Records', 'Output Size / Records', etc. 
> in the table `Aggregated Metrics by Executor` of the stage-detail page should be 
> sorted in numerical order instead of lexicographical order.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34995) Port/integrate Koalas remaining codes into PySpark

2021-04-15 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-34995.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32154
[https://github.com/apache/spark/pull/32154]

> Port/integrate Koalas remaining codes into PySpark
> --
>
> Key: SPARK-34995
> URL: https://issues.apache.org/jira/browse/SPARK-34995
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.2.0
>
>
> There are some more commits remaining after the main codes were ported.
> - 
> [https://github.com/databricks/koalas/commit/c8f803d6becb3accd767afdb3774c8656d0d0b47]
> - 
> [https://github.com/databricks/koalas/commit/913d68868d38ee7158c640aceb837484f417267e]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34995) Port/integrate Koalas remaining codes into PySpark

2021-04-15 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-34995:


Assignee: Haejoon Lee

> Port/integrate Koalas remaining codes into PySpark
> --
>
> Key: SPARK-34995
> URL: https://issues.apache.org/jira/browse/SPARK-34995
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Assignee: Haejoon Lee
>Priority: Major
>
> There are some more commits remaining after the main codes were ported.
> - 
> [https://github.com/databricks/koalas/commit/c8f803d6becb3accd767afdb3774c8656d0d0b47]
> - 
> [https://github.com/databricks/koalas/commit/913d68868d38ee7158c640aceb837484f417267e]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34792) Restore previous behaviour of randomSplit from spark-2.4.7 in spark-3

2021-04-15 Thread kondziolka9ld (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322050#comment-17322050
 ] 

kondziolka9ld commented on SPARK-34792:
---

I found the root cause of this difference: it is related to the 
https://issues.apache.org/jira/browse/SPARK-23643 change.

Obviously, the input partitioning of the dataframe has an impact on the results as well.

When we ensure that the input partitioning on spark-2.4.7 and spark-3 is the same, 
and when we revert the change in 
`core/src/main/scala/org/apache/spark/util/random/XORShiftRandom.scala` from 
[https://github.com/apache/spark/pull/20793/|https://github.com/apache/spark/pull/20793/f],
 we get consistent results between both versions. 
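
A small spark-shell sketch of how to check the partitioning side of this (only the mechanics are shown; the actual rows returned will still differ between 2.4.7 and 3.x because of the seeding change):

{code:java}
// Pin the input partitioning before splitting so both versions start from the same layout.
val base = Seq(1,2,3,4,5,6,7,8,9,10).toDF.repartition(1)
val Array(f, s) = base.randomSplit(Array(0.3, 0.7), 42)
f.show()
s.show()
// Even with the same seed, the per-partition seeding change in XORShiftRandom
// (SPARK-23643) means the sampled rows can differ across Spark versions.
{code}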

> Restore previous behaviour of randomSplit from spark-2.4.7 in spark-3
> -
>
> Key: SPARK-34792
> URL: https://issues.apache.org/jira/browse/SPARK-34792
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core, SQL
>Affects Versions: 3.0.1
>Reporter: kondziolka9ld
>Priority: Major
>
> Hi, 
> Please consider the following difference in the `randomSplit` method, despite 
> using the same seed.
>  
> {code:java}
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.4.7
>   /_/
>  
> Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_282)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> val Array(f, s) =  
> Seq(1,2,3,4,5,6,7,8,9,10).toDF.randomSplit(Array(0.3, 0.7), 42)
> f: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int]
> s: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int]
> scala> f.show
> +-+
> |value|
> +-+
> |4|
> +-+
> scala> s.show
> +-+
> |value|
> +-+
> |1|
> |2|
> |3|
> |5|
> |6|
> |7|
> |8|
> |9|
> |   10|
> +-+
> {code}
> while as on spark-3
> {code:java}
> scala> val Array(f, s) =  
> Seq(1,2,3,4,5,6,7,8,9,10).toDF.randomSplit(Array(0.3, 0.7), 42)
> f: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int]
> s: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int]
> scala> f.show
> +-+
> |value|
> +-+
> |5|
> |   10|
> +-+
> scala> s.show
> +-+
> |value|
> +-+
> |1|
> |2|
> |3|
> |4|
> |6|
> |7|
> |8|
> |9|
> +-+
> {code}
> I guess that implementation of `sample` method changed.
> Is it possible to restore previous behaviour?
> Thanks in advance!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34843) JDBCRelation columnPartition function improperly determines stride size. Upper bound is skewed due to stride alignment.

2021-04-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17321997#comment-17321997
 ] 

Apache Spark commented on SPARK-34843:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/32186

> JDBCRelation columnPartition function improperly determines stride size. 
> Upper bound is skewed due to stride alignment.
> ---
>
> Key: SPARK-34843
> URL: https://issues.apache.org/jira/browse/SPARK-34843
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Jason Yarbrough
>Assignee: Jason Yarbrough
>Priority: Minor
> Fix For: 3.2.0
>
> Attachments: SPARK-34843.patch
>
>
> Currently, in JDBCRelation (line 123), the stride size is calculated as 
> follows:
> val stride: Long = upperBound / numPartitions - lowerBound / numPartitions
>  
> Due to truncation happening on both divisions, the stride size can fall short 
> of what it should be. This can lead to a big difference between the provided 
> upper bound and the actual start of the last partition.
> I propose this formula, as it is much more accurate and leads to better 
> distribution:
> val stride = (upperBound / numPartitions.toFloat - lowerBound / 
> numPartitions.toFloat).toLong
>  
> An example (using a date column):
> Say you're creating 1,000 partitions. If you provide a lower bound of 
> 1927-04-05 (this gets translated to -15611), and an upper bound of 2020-10-27 
> (translated to 18563), Spark determines the stride size as follows:
>  
> (18563L / 1000L) - (-15611 / 1000L) = 33
> Starting from the lower bound, doing strides of 33, you'll end up at 
> 2017-07-08. This is over 3 years of extra data that will go into the last 
> partition, and depending on the shape of the data could cause a very long 
> running task at the end of a job.
>  
> Using the formula I'm proposing, you'd get:
> ((18563L / 1000F) - (-15611 / 1000F)).toLong = 34
> This would put the upper bound at 2020-04-02, which is much closer to the 
> original supplied upper bound. This is the best you can do to get as close as 
> possible to the upper bound (without adjusting the number of partitions). For 
> example, a stride size of 35 would go well past the supplied upper bound 
> (over 2 years, 2022-11-22).
>  
> In the above example, there is only a difference of 1 between the stride size 
> using the current formula and the stride size using the proposed formula, but 
> with greater distance between the lower and upper bounds, or a lower number 
> of partitions, the difference can be much greater. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34843) JDBCRelation columnPartition function improperly determines stride size. Upper bound is skewed due to stride alignment.

2021-04-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17321996#comment-17321996
 ] 

Apache Spark commented on SPARK-34843:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/32186

> JDBCRelation columnPartition function improperly determines stride size. 
> Upper bound is skewed due to stride alignment.
> ---
>
> Key: SPARK-34843
> URL: https://issues.apache.org/jira/browse/SPARK-34843
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Jason Yarbrough
>Assignee: Jason Yarbrough
>Priority: Minor
> Fix For: 3.2.0
>
> Attachments: SPARK-34843.patch
>
>
> Currently, in JDBCRelation (line 123), the stride size is calculated as 
> follows:
> val stride: Long = upperBound / numPartitions - lowerBound / numPartitions
>  
> Due to truncation happening on both divisions, the stride size can fall short 
> of what it should be. This can lead to a big difference between the provided 
> upper bound and the actual start of the last partition.
> I propose this formula, as it is much more accurate and leads to better 
> distribution:
> val stride = (upperBound / numPartitions.toFloat - lowerBound / 
> numPartitions.toFloat).toLong
>  
> An example (using a date column):
> Say you're creating 1,000 partitions. If you provide a lower bound of 
> 1927-04-05 (this gets translated to -15611), and an upper bound of 2020-10-27 
> (translated to 18563), Spark determines the stride size as follows:
>  
> (18563L / 1000L) - (-15611 / 1000L) = 33
> Starting from the lower bound, doing strides of 33, you'll end up at 
> 2017-07-08. This is over 3 years of extra data that will go into the last 
> partition, and depending on the shape of the data could cause a very long 
> running task at the end of a job.
>  
> Using the formula I'm proposing, you'd get:
> ((18563L / 1000F) - (-15611 / 1000F)).toLong = 34
> This would put the upper bound at 2020-04-02, which is much closer to the 
> original supplied upper bound. This is the best you can do to get as close as 
> possible to the upper bound (without adjusting the number of partitions). For 
> example, a stride size of 35 would go well past the supplied upper bound 
> (over 2 years, 2022-11-22).
>  
> In the above example, there is only a difference of 1 between the stride size 
> using the current formula and the stride size using the proposed formula, but 
> with greater distance between the lower and upper bounds, or a lower number 
> of partitions, the difference can be much greater. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org