[jira] [Updated] (SPARK-30602) SPIP: Support push-based shuffle to improve shuffle efficiency

2021-04-16 Thread Lianhui Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lianhui Wang updated SPARK-30602:
-
Description: 
白月山火禾In a large deployment of a Spark compute infrastructure, Spark shuffle is 
becoming a potential scaling bottleneck and a source of inefficiency in the 
cluster. When doing Spark on YARN for a large-scale deployment, people usually 
enable Spark external shuffle service and store the intermediate shuffle files 
on HDD. Because the number of blocks generated for a particular shuffle grows 
quadratically compared to the size of shuffled data (# mappers and reducers 
grows linearly with the size of shuffled data, but # blocks is # mappers * # 
reducers), one general trend we have observed is that the more data a Spark 
application processes, the smaller the block size becomes. In a few production 
clusters we have seen, the average shuffle block size is only 10s of KBs. 
Because of the inefficiency of performing random reads on HDD for small amount 
of data, the overall efficiency of the Spark external shuffle services serving 
the shuffle blocks degrades as we see an increasing # of Spark applications 
processing an increasing amount of data. In addition, because Spark external 
shuffle service is a shared service in a multi-tenancy cluster, the 
inefficiency with one Spark application could propagate to other applications 
as well.
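For illustration only (the numbers below are hypothetical, not measurements from the clusters mentioned above), the block-size arithmetic plays out as follows:

{code:language=scala}
// Mappers and reducers scale roughly linearly with input size, but the number of
// shuffle blocks is their product, so the average block size shrinks as data grows.
def avgBlockSizeBytes(shuffleBytes: Long, mappers: Long, reducers: Long): Long =
  shuffleBytes / (mappers * reducers)

avgBlockSizeBytes(1L << 40, 1000L, 1000L)     // ~1 TB over 10^6 blocks  => ~1 MB per block
avgBlockSizeBytes(10L << 40, 10000L, 10000L)  // ~10 TB over 10^8 blocks => ~110 KB per block
{code}

Ten times the data with ten times the mappers and reducers yields blocks roughly ten times smaller, which is the trend described above.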

In this ticket, we propose a solution to improve Spark shuffle efficiency in 
above mentioned environments with push-based shuffle. With push-based shuffle, 
shuffle is performed at the end of mappers and blocks get pre-merged and move 
towards reducers. In our prototype implementation, we have seen significant 
efficiency improvements when performing large shuffles. We take a Spark-native 
approach to achieve this, i.e., extending Spark’s existing shuffle netty 
protocol, and the behaviors of Spark mappers, reducers and drivers. This way, 
we can bring the benefits of more efficient shuffle in Spark without incurring 
the dependency or overhead of either specialized storage layer or external 
infrastructure pieces.

 

Link to dev mailing list discussion: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Enabling-push-based-shuffle-in-Spark-td28732.html

  was:
In a large deployment of a Spark compute infrastructure, Spark shuffle is 
becoming a potential scaling bottleneck and a source of inefficiency in the 
cluster. When doing Spark on YARN for a large-scale deployment, people usually 
enable Spark external shuffle service and store the intermediate shuffle files 
on HDD. Because the number of blocks generated for a particular shuffle grows 
quadratically compared to the size of shuffled data (# mappers and reducers 
grows linearly with the size of shuffled data, but # blocks is # mappers * # 
reducers), one general trend we have observed is that the more data a Spark 
application processes, the smaller the block size becomes. In a few production 
clusters we have seen, the average shuffle block size is only 10s of KBs. 
Because of the inefficiency of performing random reads on HDD for small amount 
of data, the overall efficiency of the Spark external shuffle services serving 
the shuffle blocks degrades as we see an increasing # of Spark applications 
processing an increasing amount of data. In addition, because Spark external 
shuffle service is a shared service in a multi-tenancy cluster, the 
inefficiency with one Spark application could propagate to other applications 
as well.

In this ticket, we propose a solution to improve Spark shuffle efficiency in 
above mentioned environments with push-based shuffle. With push-based shuffle, 
shuffle is performed at the end of mappers and blocks get pre-merged and move 
towards reducers. In our prototype implementation, we have seen significant 
efficiency improvements when performing large shuffles. We take a Spark-native 
approach to achieve this, i.e., extending Spark’s existing shuffle netty 
protocol, and the behaviors of Spark mappers, reducers and drivers. This way, 
we can bring the benefits of more efficient shuffle in Spark without incurring 
the dependency or overhead of either specialized storage layer or external 
infrastructure pieces.

 

Link to dev mailing list discussion: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Enabling-push-based-shuffle-in-Spark-td28732.html


> SPIP: Support push-based shuffle to improve shuffle efficiency
> --
>
> Key: SPARK-30602
> URL: https://issues.apache.org/jira/browse/SPARK-30602
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Min Shen
>Priority: Major
>  Labels: release-notes
> Attachments: Screen Shot 2020-

[jira] [Resolved] (SPARK-35083) Support remote scheduler pool file

2021-04-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-35083.
---
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32184
[https://github.com/apache/spark/pull/32184]

> Support remote scheduler pool file
> --
>
> Key: SPARK-35083
> URL: https://issues.apache.org/jira/browse/SPARK-35083
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: ulysses you
>Assignee: ulysses you
>Priority: Minor
> Fix For: 3.2.0
>
>
> Make `spark.scheduler.allocation.file` support remote files. When using Spark 
> as a server (e.g. SparkThriftServer), it's hard for a user to specify a local 
> path as the scheduler pool file.
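For context, a minimal usage sketch of what this enables; {{spark.scheduler.mode}} and {{spark.scheduler.allocation.file}} are existing Spark settings, while accepting a remote URI for the allocation file is the behaviour added here, and the HDFS path below is a placeholder:

{code:language=scala}
import org.apache.spark.sql.SparkSession

// Hypothetical example: point the fair-scheduler pool definitions at a remote file
// instead of a path on the server's local disk.
val spark = SparkSession.builder()
  .appName("long-running-sql-server")
  .config("spark.scheduler.mode", "FAIR")
  .config("spark.scheduler.allocation.file",
    "hdfs://namenode:8020/conf/fairscheduler.xml")
  .getOrCreate()
{code}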






[jira] [Assigned] (SPARK-35083) Support remote scheduler pool file

2021-04-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-35083:
-

Assignee: ulysses you

> Support remote scheduler pool file
> --
>
> Key: SPARK-35083
> URL: https://issues.apache.org/jira/browse/SPARK-35083
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: ulysses you
>Assignee: ulysses you
>Priority: Minor
>
> Make `spark.scheduler.allocation.file` support remote files. When using Spark 
> as a server (e.g. SparkThriftServer), it's hard for a user to specify a local 
> path as the scheduler pool file.






[jira] [Updated] (SPARK-34494) move JSON data source options from Python and Scala into a single page.

2021-04-16 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-34494:

Summary: move JSON data source options from Python and Scala into a single 
page.  (was: Move other data source options than Parquet from Python and Scala 
into a single page.)

> move JSON data source options from Python and Scala into a single page.
> ---
>
> Key: SPARK-34494
> URL: https://issues.apache.org/jira/browse/SPARK-34494
> Project: Spark
>  Issue Type: Sub-task
>  Components: docs
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Refer to https://issues.apache.org/jira/browse/SPARK-34491






[jira] [Updated] (SPARK-34494) Move JSON data source options from Python and Scala into a single page.

2021-04-16 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-34494:

Summary: Move JSON data source options from Python and Scala into a single 
page.  (was: move JSON data source options from Python and Scala into a single 
page.)

> Move JSON data source options from Python and Scala into a single page.
> ---
>
> Key: SPARK-34494
> URL: https://issues.apache.org/jira/browse/SPARK-34494
> Project: Spark
>  Issue Type: Sub-task
>  Components: docs
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Refer to https://issues.apache.org/jira/browse/SPARK-34491






[jira] [Resolved] (SPARK-35104) Fix ugly indentation of multiple JSON records in a single split file generated by JacksonGenerator when pretty option is true

2021-04-16 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-35104.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32203
[https://github.com/apache/spark/pull/32203]

> Fix ugly indentation of multiple JSON records in a single split file 
> generated by JacksonGenerator when pretty option is true
> -
>
> Key: SPARK-35104
> URL: https://issues.apache.org/jira/browse/SPARK-35104
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.1, 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 3.2.0
>
>
> When writing multiple JSON records into a single split file with the pretty 
> option set to true, the indentation is broken for all but the first JSON record.
> {code:java}
> // Run in the Spark Shell.
> // Set spark.sql.leafNodeDefaultParallelism to 1 for the current master.
> // Or set spark.default.parallelism for the previous releases.
> spark.conf.set("spark.sql.leafNodeDefaultParallelism", 1)
> val df = Seq("a", "b", "c").toDF
> df.write.option("pretty", "true").json("/path/to/output")
> # Run in a Shell
> $ cat /path/to/output/*.json
> {
>   "value" : "a"
> }
>  {
>   "value" : "b"
> }
>  {
>   "value" : "c"
> }
> {code}






[jira] [Assigned] (SPARK-34494) Move JSON data source options from Python and Scala into a single page.

2021-04-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34494:


Assignee: (was: Apache Spark)

> Move JSON data source options from Python and Scala into a single page.
> ---
>
> Key: SPARK-34494
> URL: https://issues.apache.org/jira/browse/SPARK-34494
> Project: Spark
>  Issue Type: Sub-task
>  Components: docs
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Refer to https://issues.apache.org/jira/browse/SPARK-34491






[jira] [Commented] (SPARK-34494) Move JSON data source options from Python and Scala into a single page.

2021-04-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322718#comment-17322718
 ] 

Apache Spark commented on SPARK-34494:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/32204

> Move JSON data source options from Python and Scala into a single page.
> ---
>
> Key: SPARK-34494
> URL: https://issues.apache.org/jira/browse/SPARK-34494
> Project: Spark
>  Issue Type: Sub-task
>  Components: docs
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Refer to https://issues.apache.org/jira/browse/SPARK-34491






[jira] [Assigned] (SPARK-34494) Move JSON data source options from Python and Scala into a single page.

2021-04-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34494:


Assignee: Apache Spark

> Move JSON data source options from Python and Scala into a single page.
> ---
>
> Key: SPARK-34494
> URL: https://issues.apache.org/jira/browse/SPARK-34494
> Project: Spark
>  Issue Type: Sub-task
>  Components: docs
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Assignee: Apache Spark
>Priority: Major
>
> Refer to https://issues.apache.org/jira/browse/SPARK-34491






[jira] [Assigned] (SPARK-34995) Port/integrate Koalas remaining codes into PySpark

2021-04-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-34995:


Assignee: Haejoon Lee

> Port/integrate Koalas remaining codes into PySpark
> --
>
> Key: SPARK-34995
> URL: https://issues.apache.org/jira/browse/SPARK-34995
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Assignee: Haejoon Lee
>Priority: Major
>
> There are some more commits remaining after the main code was ported.
> - 
> [https://github.com/databricks/koalas/commit/c8f803d6becb3accd767afdb3774c8656d0d0b47]
> - 
> [https://github.com/databricks/koalas/commit/913d68868d38ee7158c640aceb837484f417267e]






[jira] [Resolved] (SPARK-34995) Port/integrate Koalas remaining codes into PySpark

2021-04-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-34995.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32197
[https://github.com/apache/spark/pull/32197]

> Port/integrate Koalas remaining codes into PySpark
> --
>
> Key: SPARK-34995
> URL: https://issues.apache.org/jira/browse/SPARK-34995
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.2.0
>
>
> There are some more commits remaining after the main code was ported.
> - 
> [https://github.com/databricks/koalas/commit/c8f803d6becb3accd767afdb3774c8656d0d0b47]
> - 
> [https://github.com/databricks/koalas/commit/913d68868d38ee7158c640aceb837484f417267e]






[jira] [Created] (SPARK-35105) Support multiple paths for ADD FILE/JAR/ARCHIVE commands

2021-04-16 Thread Kousuke Saruta (Jira)
Kousuke Saruta created SPARK-35105:
--

 Summary: Support multiple paths for ADD FILE/JAR/ARCHIVE commands
 Key: SPARK-35105
 URL: https://issues.apache.org/jira/browse/SPARK-35105
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta


In the current master, ADD FILE/JAR/ARCHIVE don't support multiple path 
arguments.
It would be great if those commands could take multiple paths, as Hive does.
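A rough sketch of the proposed form (the multi-path syntax below is assumed from the Hive behaviour referenced above, and the jar paths are placeholders):

{code:language=scala}
// Today, each ADD JAR statement takes a single path:
spark.sql("ADD JAR /path/to/udfs.jar")
spark.sql("ADD JAR /path/to/deps.jar")

// Proposed (assumed syntax): several paths in one statement, as Hive allows.
spark.sql("ADD JAR /path/to/udfs.jar /path/to/deps.jar")
{code}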






[jira] [Commented] (SPARK-35105) Support multiple paths for ADD FILE/JAR/ARCHIVE commands

2021-04-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17323618#comment-17323618
 ] 

Apache Spark commented on SPARK-35105:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/32205

> Support multiple paths for ADD FILE/JAR/ARCHIVE commands
> 
>
> Key: SPARK-35105
> URL: https://issues.apache.org/jira/browse/SPARK-35105
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> In the current master, ADD FILE/JAR/ARCHIVE don't support multiple path 
> arguments.
> It would be great if those commands could take multiple paths, as Hive does.






[jira] [Assigned] (SPARK-35105) Support multiple paths for ADD FILE/JAR/ARCHIVE commands

2021-04-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35105:


Assignee: Apache Spark  (was: Kousuke Saruta)

> Support multiple paths for ADD FILE/JAR/ARCHIVE commands
> 
>
> Key: SPARK-35105
> URL: https://issues.apache.org/jira/browse/SPARK-35105
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Apache Spark
>Priority: Major
>
> In the current master, ADD FILE/JAR/ARCHIVE don't support multiple path 
> arguments.
> It would be great if those commands could take multiple paths, as Hive does.






[jira] [Assigned] (SPARK-35105) Support multiple paths for ADD FILE/JAR/ARCHIVE commands

2021-04-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35105:


Assignee: Kousuke Saruta  (was: Apache Spark)

> Support multiple paths for ADD FILE/JAR/ARCHIVE commands
> 
>
> Key: SPARK-35105
> URL: https://issues.apache.org/jira/browse/SPARK-35105
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> In the current master, ADD FILE/JAR/ARCHIVE don't support multiple path 
> arguments.
> It would be great if those commands could take multiple paths, as Hive does.






[jira] [Updated] (SPARK-30602) SPIP: Support push-based shuffle to improve shuffle efficiency

2021-04-16 Thread Lianhui Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lianhui Wang updated SPARK-30602:
-
Description: 
In a large deployment of Spark compute infrastructure, Spark shuffle is 
becoming a potential scaling bottleneck and a source of inefficiency in the 
cluster. When running Spark on YARN at large scale, people usually enable the 
Spark external shuffle service and store the intermediate shuffle files on HDD. 
Because the number of blocks generated for a particular shuffle grows 
quadratically with the size of the shuffled data (# mappers and # reducers grow 
linearly with the size of shuffled data, but # blocks is # mappers * # 
reducers), one general trend we have observed is that the more data a Spark 
application processes, the smaller the block size becomes. In a few production 
clusters we have seen, the average shuffle block size is only 10s of KBs. 
Because of the inefficiency of performing random reads on HDD for small amounts 
of data, the overall efficiency of the Spark external shuffle services serving 
the shuffle blocks degrades as an increasing # of Spark applications process an 
increasing amount of data. In addition, because the Spark external shuffle 
service is a shared service in a multi-tenancy cluster, the inefficiency of one 
Spark application can propagate to other applications as well.

In this ticket, we propose a solution to improve Spark shuffle efficiency in 
the above-mentioned environments with push-based shuffle. With push-based 
shuffle, shuffle is performed at the end of the map tasks, and blocks are 
pre-merged and pushed towards the reducers. In our prototype implementation, we 
have seen significant efficiency improvements when performing large shuffles. 
We take a Spark-native approach to achieve this, i.e., extending Spark's 
existing Netty-based shuffle protocol and the behaviors of Spark mappers, 
reducers, and drivers. This way, we can bring the benefits of more efficient 
shuffle to Spark without incurring the dependency or overhead of either a 
specialized storage layer or external infrastructure pieces.

 

Link to dev mailing list discussion: 
[http://apache-spark-developers-list.1001551.n3.nabble.com/Enabling-push-based-shuffle-in-Spark-td28732.html]

  was:
白月山火禾In a large deployment of a Spark compute infrastructure, Spark shuffle is 
becoming a potential scaling bottleneck and a source of inefficiency in the 
cluster. When doing Spark on YARN for a large-scale deployment, people usually 
enable Spark external shuffle service and store the intermediate shuffle files 
on HDD. Because the number of blocks generated for a particular shuffle grows 
quadratically compared to the size of shuffled data (# mappers and reducers 
grows linearly with the size of shuffled data, but # blocks is # mappers * # 
reducers), one general trend we have observed is that the more data a Spark 
application processes, the smaller the block size becomes. In a few production 
clusters we have seen, the average shuffle block size is only 10s of KBs. 
Because of the inefficiency of performing random reads on HDD for small amount 
of data, the overall efficiency of the Spark external shuffle services serving 
the shuffle blocks degrades as we see an increasing # of Spark applications 
processing an increasing amount of data. In addition, because Spark external 
shuffle service is a shared service in a multi-tenancy cluster, the 
inefficiency with one Spark application could propagate to other applications 
as well.

In this ticket, we propose a solution to improve Spark shuffle efficiency in 
above mentioned environments with push-based shuffle. With push-based shuffle, 
shuffle is performed at the end of mappers and blocks get pre-merged and move 
towards reducers. In our prototype implementation, we have seen significant 
efficiency improvements when performing large shuffles. We take a Spark-native 
approach to achieve this, i.e., extending Spark’s existing shuffle netty 
protocol, and the behaviors of Spark mappers, reducers and drivers. This way, 
we can bring the benefits of more efficient shuffle in Spark without incurring 
the dependency or overhead of either specialized storage layer or external 
infrastructure pieces.

 

Link to dev mailing list discussion: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Enabling-push-based-shuffle-in-Spark-td28732.html


> SPIP: Support push-based shuffle to improve shuffle efficiency
> --
>
> Key: SPARK-30602
> URL: https://issues.apache.org/jira/browse/SPARK-30602
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Min Shen
>Priority: Major
>  Labels: release-notes
> Attachments: Screen Shot 202

[jira] [Commented] (SPARK-19842) Informational Referential Integrity Constraints Support in Spark

2021-04-16 Thread Ste Millington (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-19842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17323773#comment-17323773
 ] 

Ste Millington commented on SPARK-19842:


Is there any progress on this? Has anything landed in Spark 3? It feels like 
there is potential for some massive query speedups here if hints are allowed to 
specify unique and/or referential constraints.

> Informational Referential Integrity Constraints Support in Spark
> 
>
> Key: SPARK-19842
> URL: https://issues.apache.org/jira/browse/SPARK-19842
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Ioana Delaney
>Priority: Major
> Attachments: InformationalRIConstraints.doc
>
>
> *Informational Referential Integrity Constraints Support in Spark*
> This work proposes support for _informational primary key_ and _foreign key 
> (referential integrity) constraints_ in Spark. The main purpose is to open up 
> an area of query optimization techniques that rely on referential integrity 
> constraints semantics. 
> An _informational_ or _statistical constraint_ is a constraint such as a 
> _unique_, _primary key_, _foreign key_, or _check constraint_, that can be 
> used by Spark to improve query performance. Informational constraints are not 
> enforced by the Spark SQL engine; rather, they are used by Catalyst to 
> optimize the query processing. They provide semantics information that allows 
> Catalyst to rewrite queries to eliminate joins, push down aggregates, remove 
> unnecessary Distinct operations, and perform a number of other optimizations. 
> Informational constraints are primarily targeted to applications that load 
> and analyze data that originated from a data warehouse. For such 
> applications, the conditions for a given constraint are known to be true, so 
> the constraint does not need to be enforced during data load operations. 
> The attached document covers constraint definition, metastore storage, 
> constraint validation, and maintenance. The document shows many examples of 
> query performance improvements that utilize referential integrity constraints 
> and can be implemented in Spark.
> Link to the google doc: 
> [InformationalRIConstraints|https://docs.google.com/document/d/17r-cOqbKF7Px0xb9L7krKg2-RQB_gD2pxOmklm-ehsw/edit]
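As a concrete illustration of the kind of rewrite described above (the tables and the constraint are hypothetical): assume {{fact.store_id}} is a non-null informational foreign key referencing the primary key {{dim_store.id}}. Since the query selects no {{dim_store}} column and every fact row is guaranteed to match exactly one dimension row, a constraint-aware optimizer could eliminate the join and scan {{fact}} alone.

{code:language=scala}
// Hypothetical schema: fact(sale_id, amount, store_id NOT NULL) with an informational
// foreign key store_id -> dim_store.id (primary key). The join below only re-checks that
// relationship, so with the constraint known it could be removed entirely.
val q = spark.sql(
  """SELECT f.sale_id, f.amount
    |FROM fact f
    |JOIN dim_store d ON f.store_id = d.id""".stripMargin)
{code}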






[jira] [Created] (SPARK-35106) HadoopMapReduceCommitProtocol performs bad rename when dynamic partition overwrite is used

2021-04-16 Thread Erik Krogen (Jira)
Erik Krogen created SPARK-35106:
---

 Summary: HadoopMapReduceCommitProtocol performs bad rename when 
dynamic partition overwrite is used
 Key: SPARK-35106
 URL: https://issues.apache.org/jira/browse/SPARK-35106
 Project: Spark
  Issue Type: Bug
  Components: Input/Output, Spark Core
Affects Versions: 3.1.1
Reporter: Erik Krogen


Recently, while evaluating the code in 
{{HadoopMapReduceCommitProtocol#commitJob}}, I found a bad code path under 
the {{dynamicPartitionOverwrite == true}} scenario:

{code:language=scala}
  // BLOCK 1
  if (dynamicPartitionOverwrite) {
    val absPartitionPaths = filesToMove.values.map(new Path(_).getParent).toSet
    logDebug(s"Clean up absolute partition directories for overwriting: $absPartitionPaths")
    absPartitionPaths.foreach(fs.delete(_, true))
  }

  // BLOCK 2
  for ((src, dst) <- filesToMove) {
    fs.rename(new Path(src), new Path(dst))
  }

  // BLOCK 3
  if (dynamicPartitionOverwrite) {
    val partitionPaths = allPartitionPaths.foldLeft(Set[String]())(_ ++ _)
    logDebug(s"Clean up default partition directories for overwriting: $partitionPaths")
    for (part <- partitionPaths) {
      val finalPartPath = new Path(path, part)
      if (!fs.delete(finalPartPath, true) && !fs.exists(finalPartPath.getParent)) {
        // According to the official hadoop FileSystem API spec, delete op should assume
        // the destination is no longer present regardless of return value, thus we do not
        // need to double check if finalPartPath exists before rename.
        // Also in our case, based on the spec, delete returns false only when finalPartPath
        // does not exist. When this happens, we need to take action if parent of finalPartPath
        // also does not exist (e.g. the scenario described on SPARK-23815), because
        // FileSystem API spec on rename op says the rename dest (finalPartPath) must have
        // a parent that exists, otherwise we may get unexpected result on the rename.
        fs.mkdirs(finalPartPath.getParent)
      }
      fs.rename(new Path(stagingDir, part), finalPartPath)
    }
  }
{code}

Assuming {{dynamicPartitionOverwrite == true}}, we have the following sequence 
of events:
# Block 1 deletes all parent directories of {{filesToMove.values}}
# Block 2 attempts to rename all {{filesToMove.keys}} to {{filesToMove.values}}
# Block 3 does directory-level renames to place files into their final locations

All renames in Block 2 will always fail, since all parent directories of 
{{filesToMove.values}} were just deleted in Block 1. Under a normal HDFS 
scenario, the contract of {{fs.rename}} is to return {{false}} under such a 
failure scenario, as opposed to throwing an exception. There is a separate 
issue here that Block 2 should probably be checking for those {{false}} return 
values -- but this allows for {{dynamicPartitionOverwrite}} to "work", albeit 
with a bunch of failed renames in the middle. Really, we should only run Block 
2 in the {{dynamicPartitionOverwrite == false}} case, and consolidate Blocks 1 
and 3 to run in the {{true}} case.

We discovered this issue when testing against a {{FileSystem}} implementation 
which was throwing an exception for this failed rename scenario instead of 
returning false, escalating the silent/ignored rename failures into actual 
failures.
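A minimal sketch of the consolidation suggested above (it literally follows the suggestion in the previous paragraph, not the actual patch; error handling is omitted, and the variables are the ones from the snippet earlier in this description):

{code:language=scala}
if (dynamicPartitionOverwrite) {
  // Blocks 1 and 3 consolidated: clean up the target partition directories, then move
  // whole staging partition directories into place.
  val absPartitionPaths = filesToMove.values.map(new Path(_).getParent).toSet
  absPartitionPaths.foreach(fs.delete(_, true))
  val partitionPaths = allPartitionPaths.foldLeft(Set[String]())(_ ++ _)
  for (part <- partitionPaths) {
    val finalPartPath = new Path(path, part)
    if (!fs.delete(finalPartPath, true) && !fs.exists(finalPartPath.getParent)) {
      fs.mkdirs(finalPartPath.getParent)
    }
    fs.rename(new Path(stagingDir, part), finalPartPath)
  }
} else {
  // Block 2 only runs when the file-level destination directories still exist.
  for ((src, dst) <- filesToMove) {
    fs.rename(new Path(src), new Path(dst))
  }
}
{code}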






[jira] [Assigned] (SPARK-35106) HadoopMapReduceCommitProtocol performs bad rename when dynamic partition overwrite is used

2021-04-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35106:


Assignee: Apache Spark

> HadoopMapReduceCommitProtocol performs bad rename when dynamic partition 
> overwrite is used
> --
>
> Key: SPARK-35106
> URL: https://issues.apache.org/jira/browse/SPARK-35106
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, Spark Core
>Affects Versions: 3.1.1
>Reporter: Erik Krogen
>Assignee: Apache Spark
>Priority: Major
>
> Recently when evaluating the code in 
> {{HadoopMapReduceCommitProtocol#commitJob}}, I found some bad codepath under 
> the {{dynamicPartitionOverwrite == true}} scenario:
> {code:language=scala}
>   # BLOCK 1
>   if (dynamicPartitionOverwrite) {
> val absPartitionPaths = filesToMove.values.map(new 
> Path(_).getParent).toSet
> logDebug(s"Clean up absolute partition directories for overwriting: 
> $absPartitionPaths")
> absPartitionPaths.foreach(fs.delete(_, true))
>   }
>   # BLOCK 2
>   for ((src, dst) <- filesToMove) {
> fs.rename(new Path(src), new Path(dst))
>   }
>   # BLOCK 3
>   if (dynamicPartitionOverwrite) {
> val partitionPaths = allPartitionPaths.foldLeft(Set[String]())(_ ++ _)
> logDebug(s"Clean up default partition directories for overwriting: 
> $partitionPaths")
> for (part <- partitionPaths) {
>   val finalPartPath = new Path(path, part)
>   if (!fs.delete(finalPartPath, true) && 
> !fs.exists(finalPartPath.getParent)) {
> // According to the official hadoop FileSystem API spec, delete 
> op should assume
> // the destination is no longer present regardless of return 
> value, thus we do not
> // need to double check if finalPartPath exists before rename.
> // Also in our case, based on the spec, delete returns false only 
> when finalPartPath
> // does not exist. When this happens, we need to take action if 
> parent of finalPartPath
> // also does not exist(e.g. the scenario described on 
> SPARK-23815), because
> // FileSystem API spec on rename op says the rename 
> dest(finalPartPath) must have
> // a parent that exists, otherwise we may get unexpected result 
> on the rename.
> fs.mkdirs(finalPartPath.getParent)
>   }
>   fs.rename(new Path(stagingDir, part), finalPartPath)
> }
>   }
> {code}
> Assuming {{dynamicPartitionOverwrite == true}}, we have the following 
> sequence of events:
> # Block 1 deletes all parent directories of {{filesToMove.values}}
> # Block 2 attempts to rename all {{filesToMove.keys}} to 
> {{filesToMove.values}}
> # Block 3 does directory-level renames to place files into their final 
> locations
> All renames in Block 2 will always fail, since all parent directories of 
> {{filesToMove.values}} were just deleted in Block 1. Under a normal HDFS 
> scenario, the contract of {{fs.rename}} is to return {{false}} under such a 
> failure scenario, as opposed to throwing an exception. There is a separate 
> issue here that Block 2 should probably be checking for those {{false}} 
> return values -- but this allows for {{dynamicPartitionOverwrite}} to "work", 
> albeit with a bunch of failed renames in the middle. Really, we should only 
> run Block 2 in the {{dynamicPartitionOverwrite == false}} case, and 
> consolidate Blocks 1 and 3 to run in the {{true}} case.
> We discovered this issue when testing against a {{FileSystem}} implementation 
> which was throwing an exception for this failed rename scenario instead of 
> returning false, escalating the silent/ignored rename failures into actual 
> failures.






[jira] [Assigned] (SPARK-35106) HadoopMapReduceCommitProtocol performs bad rename when dynamic partition overwrite is used

2021-04-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35106:


Assignee: (was: Apache Spark)

> HadoopMapReduceCommitProtocol performs bad rename when dynamic partition 
> overwrite is used
> --
>
> Key: SPARK-35106
> URL: https://issues.apache.org/jira/browse/SPARK-35106
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, Spark Core
>Affects Versions: 3.1.1
>Reporter: Erik Krogen
>Priority: Major
>
> Recently when evaluating the code in 
> {{HadoopMapReduceCommitProtocol#commitJob}}, I found some bad codepath under 
> the {{dynamicPartitionOverwrite == true}} scenario:
> {code:language=scala}
>   # BLOCK 1
>   if (dynamicPartitionOverwrite) {
> val absPartitionPaths = filesToMove.values.map(new 
> Path(_).getParent).toSet
> logDebug(s"Clean up absolute partition directories for overwriting: 
> $absPartitionPaths")
> absPartitionPaths.foreach(fs.delete(_, true))
>   }
>   # BLOCK 2
>   for ((src, dst) <- filesToMove) {
> fs.rename(new Path(src), new Path(dst))
>   }
>   # BLOCK 3
>   if (dynamicPartitionOverwrite) {
> val partitionPaths = allPartitionPaths.foldLeft(Set[String]())(_ ++ _)
> logDebug(s"Clean up default partition directories for overwriting: 
> $partitionPaths")
> for (part <- partitionPaths) {
>   val finalPartPath = new Path(path, part)
>   if (!fs.delete(finalPartPath, true) && 
> !fs.exists(finalPartPath.getParent)) {
> // According to the official hadoop FileSystem API spec, delete 
> op should assume
> // the destination is no longer present regardless of return 
> value, thus we do not
> // need to double check if finalPartPath exists before rename.
> // Also in our case, based on the spec, delete returns false only 
> when finalPartPath
> // does not exist. When this happens, we need to take action if 
> parent of finalPartPath
> // also does not exist(e.g. the scenario described on 
> SPARK-23815), because
> // FileSystem API spec on rename op says the rename 
> dest(finalPartPath) must have
> // a parent that exists, otherwise we may get unexpected result 
> on the rename.
> fs.mkdirs(finalPartPath.getParent)
>   }
>   fs.rename(new Path(stagingDir, part), finalPartPath)
> }
>   }
> {code}
> Assuming {{dynamicPartitionOverwrite == true}}, we have the following 
> sequence of events:
> # Block 1 deletes all parent directories of {{filesToMove.values}}
> # Block 2 attempts to rename all {{filesToMove.keys}} to 
> {{filesToMove.values}}
> # Block 3 does directory-level renames to place files into their final 
> locations
> All renames in Block 2 will always fail, since all parent directories of 
> {{filesToMove.values}} were just deleted in Block 1. Under a normal HDFS 
> scenario, the contract of {{fs.rename}} is to return {{false}} under such a 
> failure scenario, as opposed to throwing an exception. There is a separate 
> issue here that Block 2 should probably be checking for those {{false}} 
> return values -- but this allows for {{dynamicPartitionOverwrite}} to "work", 
> albeit with a bunch of failed renames in the middle. Really, we should only 
> run Block 2 in the {{dynamicPartitionOverwrite == false}} case, and 
> consolidate Blocks 1 and 3 to run in the {{true}} case.
> We discovered this issue when testing against a {{FileSystem}} implementation 
> which was throwing an exception for this failed rename scenario instead of 
> returning false, escalating the silent/ignored rename failures into actual 
> failures.






[jira] [Commented] (SPARK-35106) HadoopMapReduceCommitProtocol performs bad rename when dynamic partition overwrite is used

2021-04-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17323924#comment-17323924
 ] 

Apache Spark commented on SPARK-35106:
--

User 'xkrogen' has created a pull request for this issue:
https://github.com/apache/spark/pull/32207

> HadoopMapReduceCommitProtocol performs bad rename when dynamic partition 
> overwrite is used
> --
>
> Key: SPARK-35106
> URL: https://issues.apache.org/jira/browse/SPARK-35106
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, Spark Core
>Affects Versions: 3.1.1
>Reporter: Erik Krogen
>Priority: Major
>
> Recently when evaluating the code in 
> {{HadoopMapReduceCommitProtocol#commitJob}}, I found some bad codepath under 
> the {{dynamicPartitionOverwrite == true}} scenario:
> {code:language=scala}
>   # BLOCK 1
>   if (dynamicPartitionOverwrite) {
> val absPartitionPaths = filesToMove.values.map(new 
> Path(_).getParent).toSet
> logDebug(s"Clean up absolute partition directories for overwriting: 
> $absPartitionPaths")
> absPartitionPaths.foreach(fs.delete(_, true))
>   }
>   # BLOCK 2
>   for ((src, dst) <- filesToMove) {
> fs.rename(new Path(src), new Path(dst))
>   }
>   # BLOCK 3
>   if (dynamicPartitionOverwrite) {
> val partitionPaths = allPartitionPaths.foldLeft(Set[String]())(_ ++ _)
> logDebug(s"Clean up default partition directories for overwriting: 
> $partitionPaths")
> for (part <- partitionPaths) {
>   val finalPartPath = new Path(path, part)
>   if (!fs.delete(finalPartPath, true) && 
> !fs.exists(finalPartPath.getParent)) {
> // According to the official hadoop FileSystem API spec, delete 
> op should assume
> // the destination is no longer present regardless of return 
> value, thus we do not
> // need to double check if finalPartPath exists before rename.
> // Also in our case, based on the spec, delete returns false only 
> when finalPartPath
> // does not exist. When this happens, we need to take action if 
> parent of finalPartPath
> // also does not exist(e.g. the scenario described on 
> SPARK-23815), because
> // FileSystem API spec on rename op says the rename 
> dest(finalPartPath) must have
> // a parent that exists, otherwise we may get unexpected result 
> on the rename.
> fs.mkdirs(finalPartPath.getParent)
>   }
>   fs.rename(new Path(stagingDir, part), finalPartPath)
> }
>   }
> {code}
> Assuming {{dynamicPartitionOverwrite == true}}, we have the following 
> sequence of events:
> # Block 1 deletes all parent directories of {{filesToMove.values}}
> # Block 2 attempts to rename all {{filesToMove.keys}} to 
> {{filesToMove.values}}
> # Block 3 does directory-level renames to place files into their final 
> locations
> All renames in Block 2 will always fail, since all parent directories of 
> {{filesToMove.values}} were just deleted in Block 1. Under a normal HDFS 
> scenario, the contract of {{fs.rename}} is to return {{false}} under such a 
> failure scenario, as opposed to throwing an exception. There is a separate 
> issue here that Block 2 should probably be checking for those {{false}} 
> return values -- but this allows for {{dynamicPartitionOverwrite}} to "work", 
> albeit with a bunch of failed renames in the middle. Really, we should only 
> run Block 2 in the {{dynamicPartitionOverwrite == false}} case, and 
> consolidate Blocks 1 and 3 to run in the {{true}} case.
> We discovered this issue when testing against a {{FileSystem}} implementation 
> which was throwing an exception for this failed rename scenario instead of 
> returning false, escalating the silent/ignored rename failures into actual 
> failures.






[jira] [Commented] (SPARK-35097) Add column name to SparkUpgradeException about ancient datetime

2021-04-16 Thread Xiao Li (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17323943#comment-17323943
 ] 

Xiao Li commented on SPARK-35097:
-

Can you help fix this, [~angerszhuuu] ?

> Add column name to SparkUpgradeException about ancient datetime
> ---
>
> Key: SPARK-35097
> URL: https://issues.apache.org/jira/browse/SPARK-35097
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Priority: Major
>
> The error message:
> {code:java}
> org.apache.spark.SparkUpgradeException: You may get a different result due to 
> the upgrading of Spark 3.0: reading dates before 1582-10-15 or timestamps 
> before 1900-01-01T00:00:00Z from Parquet files can be ambiguous, as the files 
> may be written by Spark 2.x or legacy versions of Hive, which uses a legacy 
> hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian 
> calendar. See more details in SPARK-31404. You can set 
> spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'LEGACY' to rebase the 
> datetime values w.r.t. the calendar difference during reading. Or set 
> spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'CORRECTED' to read the 
> datetime values as it is.
> {code}
> doesn't give any clue as to which column causes the issue. We need to improve 
> the message and add the column name to it.
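For reference, the two workarounds that the quoted message points to look like this in practice (session-level settings taken verbatim from the message; they do not address the missing column name that this ticket is about):

{code:language=scala}
// Rebase ancient dates/timestamps from the legacy hybrid calendar while reading Parquet:
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "LEGACY")

// Or read the stored values as-is, without rebasing:
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED")
{code}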






[jira] [Commented] (SPARK-35103) Improve type coercion rule performance

2021-04-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17323984#comment-17323984
 ] 

Apache Spark commented on SPARK-35103:
--

User 'sigmod' has created a pull request for this issue:
https://github.com/apache/spark/pull/32208

> Improve type coercion rule performance
> --
>
> Key: SPARK-35103
> URL: https://issues.apache.org/jira/browse/SPARK-35103
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yingyi Bu
>Priority: Major
>
> Reduce the time spent on type coercion rules by running them together 
> one-tree-node-at-a-time in a combined rule.






[jira] [Assigned] (SPARK-35103) Improve type coercion rule performance

2021-04-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35103:


Assignee: Apache Spark

> Improve type coercion rule performance
> --
>
> Key: SPARK-35103
> URL: https://issues.apache.org/jira/browse/SPARK-35103
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yingyi Bu
>Assignee: Apache Spark
>Priority: Major
>
> Reduce the time spent on type coercion rules by running them together 
> one-tree-node-at-a-time in a combined rule.






[jira] [Assigned] (SPARK-35103) Improve type coercion rule performance

2021-04-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35103:


Assignee: (was: Apache Spark)

> Improve type coercion rule performance
> --
>
> Key: SPARK-35103
> URL: https://issues.apache.org/jira/browse/SPARK-35103
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yingyi Bu
>Priority: Major
>
> Reduce the time spent on type coercion rules by running them together 
> one-tree-node-at-a-time in a combined rule.






[jira] [Commented] (SPARK-35103) Improve type coercion rule performance

2021-04-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17323986#comment-17323986
 ] 

Apache Spark commented on SPARK-35103:
--

User 'sigmod' has created a pull request for this issue:
https://github.com/apache/spark/pull/32208

> Improve type coercion rule performance
> --
>
> Key: SPARK-35103
> URL: https://issues.apache.org/jira/browse/SPARK-35103
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yingyi Bu
>Priority: Major
>
> Reduce the time spent on type coercion rules by running them together 
> one-tree-node-at-a-time in a combined rule.






[jira] [Commented] (SPARK-35097) Add column name to SparkUpgradeException about ancient datetime

2021-04-16 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17323998#comment-17323998
 ] 

angerszhu commented on SPARK-35097:
---

[~smilegator] Sure, working on this.

> Add column name to SparkUpgradeException about ancient datetime
> ---
>
> Key: SPARK-35097
> URL: https://issues.apache.org/jira/browse/SPARK-35097
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Priority: Major
>
> The error message:
> {code:java}
> org.apache.spark.SparkUpgradeException: You may get a different result due to 
> the upgrading of Spark 3.0: reading dates before 1582-10-15 or timestamps 
> before 1900-01-01T00:00:00Z from Parquet files can be ambiguous, as the files 
> may be written by Spark 2.x or legacy versions of Hive, which uses a legacy 
> hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian 
> calendar. See more details in SPARK-31404. You can set 
> spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'LEGACY' to rebase the 
> datetime values w.r.t. the calendar difference during reading. Or set 
> spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'CORRECTED' to read the 
> datetime values as it is.
> {code}
> doesn't give any clue as to which column causes the issue. We need to improve 
> the message and add the column name to it.






[jira] [Created] (SPARK-35107) Parse unit-to-unit interval literals to ANSI intervals

2021-04-16 Thread Max Gekk (Jira)
Max Gekk created SPARK-35107:


 Summary: Parse unit-to-unit interval literals to ANSI intervals
 Key: SPARK-35107
 URL: https://issues.apache.org/jira/browse/SPARK-35107
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Max Gekk
Assignee: Max Gekk


Parse unit-to-unit intervals like INTERVAL '1-1' YEAR TO MONTH to either 
YearMonthIntervalType or DayTimeIntervalType by default. But when 
spark.sql.legacy.interval.enabled is set to true, parse them to 
CalendarInterval.
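A small sketch of the behaviour described above (the resulting types reflect this proposal for Spark 3.2 and are not verified against a released build):

{code:language=scala}
// Default (proposed): the literal parses to an ANSI year-month interval type.
spark.sql("SELECT INTERVAL '1-1' YEAR TO MONTH AS i").printSchema()

// Legacy behaviour: with the flag below set, the same literal would parse to
// CalendarInterval instead.
spark.conf.set("spark.sql.legacy.interval.enabled", "true")
spark.sql("SELECT INTERVAL '1-1' YEAR TO MONTH AS i").printSchema()
{code}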






[jira] [Commented] (SPARK-35107) Parse unit-to-unit interval literals to ANSI intervals

2021-04-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17324034#comment-17324034
 ] 

Apache Spark commented on SPARK-35107:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/32209

> Parse unit-to-unit interval literals to ANSI intervals
> --
>
> Key: SPARK-35107
> URL: https://issues.apache.org/jira/browse/SPARK-35107
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> Parse unit-to-unit intervals like INTERVAL '1-1' YEAR TO MONTH to either 
> YearMonthIntervalType or DayTimeIntervalType by default. But when 
> spark.sql.legacy.interval.enabled is set to true, parse them to 
> CalendarInterval.






[jira] [Assigned] (SPARK-35107) Parse unit-to-unit interval literals to ANSI intervals

2021-04-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35107:


Assignee: Apache Spark  (was: Max Gekk)

> Parse unit-to-unit interval literals to ANSI intervals
> --
>
> Key: SPARK-35107
> URL: https://issues.apache.org/jira/browse/SPARK-35107
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Parse unit-to-unit intervals like INTERVAL '1-1' YEAR TO MONTH to either 
> YearMonthIntervalType or DayTimeIntervalType by default. But when 
> spark.sql.legacy.interval.enabled is set to true, parse them to 
> CalendarInterval.






[jira] [Assigned] (SPARK-35107) Parse unit-to-unit interval literals to ANSI intervals

2021-04-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35107:


Assignee: Max Gekk  (was: Apache Spark)

> Parse unit-to-unit interval literals to ANSI intervals
> --
>
> Key: SPARK-35107
> URL: https://issues.apache.org/jira/browse/SPARK-35107
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> Parse unit-to-unit intervals like INTERVAL '1-1' YEAR TO MONTH to either 
> YearMonthIntervalType or DayTimeIntervalType by default. But when 
> spark.sql.legacy.interval.enabled is set to true, parse them to 
> CalendarInterval.






[jira] [Created] (SPARK-35108) Pickle produces incorrect key labels for GenericRowWithSchema (data corruption)

2021-04-16 Thread Robert Joseph Evans (Jira)
Robert Joseph Evans created SPARK-35108:
---

 Summary: Pickle produces incorrect key labels for 
GenericRowWithSchema (data corruption)
 Key: SPARK-35108
 URL: https://issues.apache.org/jira/browse/SPARK-35108
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.2, 3.0.1
Reporter: Robert Joseph Evans
 Attachments: test.py, test.sh

I think this also shows up for all versions of Spark that pickle the data when 
doing a collect from Python.

When you do a collect from Python, the JVM side does a collect and converts the 
UnsafeRows into GenericRowWithSchema instances before it sends them to the 
Pickler. The Pickler, by default, will try to dedupe objects using hashCode and 
.equals on the object. But .equals and .hashCode for GenericRowWithSchema only 
look at the data, not the schema, whereas when a row is pickled the keys from 
the schema are written out.

This can result in data corruption, of sorts, in a few cases: when a row has the 
same number of elements as a struct within the row, or as a sub-struct within 
another struct.

If the data happens to be the same, the keys for the resulting row or struct 
can be wrong.

My repro case is a bit convoluted, but it does happen.
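A minimal Scala sketch of the equality gap described above (it uses Spark's internal {{GenericRowWithSchema}} class directly, purely for illustration):

{code:language=scala}
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Two rows with identical values but different field names.
val schemaA = StructType(Seq(StructField("x", IntegerType), StructField("y", IntegerType)))
val schemaB = StructType(Seq(StructField("p", IntegerType), StructField("q", IntegerType)))
val rowA = new GenericRowWithSchema(Array[Any](1, 2), schemaA)
val rowB = new GenericRowWithSchema(Array[Any](1, 2), schemaB)

// equals/hashCode only compare the values, so a value-deduplicating pickler can treat
// rowB as "the same object" as rowA and emit rowA's field names for it.
assert(rowA == rowB && rowA.hashCode == rowB.hashCode)
{code}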






[jira] [Updated] (SPARK-35108) Pickle produces incorrect key labels for GenericRowWithSchema (data corruption)

2021-04-16 Thread Robert Joseph Evans (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Joseph Evans updated SPARK-35108:

Attachment: test.sh
test.py

> Pickle produces incorrect key labels for GenericRowWithSchema (data 
> corruption)
> ---
>
> Key: SPARK-35108
> URL: https://issues.apache.org/jira/browse/SPARK-35108
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1, 3.0.2
>Reporter: Robert Joseph Evans
>Priority: Major
> Attachments: test.py, test.sh
>
>
> I think this also shows up for all versions of Spark that pickle the data 
> when doing a collect from python.
> When you do a collect in python java will do a collect and convert the 
> UnsafeRows into GenericRowWithSchema instances before it sends them to the 
> Pickler. The Pickler, by default, will try to dedupe objects using hashCode 
> and .equals for the object.  But .equals and .hashCode for 
> GenericRowWithSchema only looks at the data, not the schema. But when we 
> pickle the row the keys from the schema are written out.
> This can result in data corruption, sort of, in a few cases where a row has 
> the same number of elements as a struct within the row does, or a sub-struct 
> within another struct. 
> If the data happens to be the same, the keys for the resulting row or struct 
> can be wrong.
> My repro case is a bit convoluted, but it does happen.






[jira] [Commented] (SPARK-35108) Pickle produces incorrect key labels for GenericRowWithSchema (data corruption)

2021-04-16 Thread Robert Joseph Evans (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17324088#comment-17324088
 ] 

Robert Joseph Evans commented on SPARK-35108:
-

If you have SPARK_HOME set when you run test.sh on a system with 6 or more 
cores, it should reproduce the issue.

I was able to mitigate the issue by adding .equals and .hashCode to 
GenericRowWithSchema so that they take the schema into account. But we could 
also try to turn off the dedupe, or the value-compare dedupe (the Pickler has 
options to disable these things). I am not sure what the proper fix would be, 
because the code for all of these is shared with other code paths.
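
For context, here is a rough sketch of the kind of schema-aware equality described above (a hypothetical subclass for illustration only, not the actual patch):

{code:scala}
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import org.apache.spark.sql.types.StructType

// Hypothetical subclass, only to illustrate the idea: include the schema in
// equality so the Pickler no longer dedupes rows that differ only in schema.
class SchemaAwareRow(values: Array[Any], rowSchema: StructType)
    extends GenericRowWithSchema(values, rowSchema) {

  override def equals(o: Any): Boolean = o match {
    case other: GenericRowWithSchema =>
      super.equals(other) && schema == other.schema
    case _ => false
  }

  override def hashCode(): Int = 31 * super.hashCode() + schema.hashCode()
}
{code}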

> Pickle produces incorrect key labels for GenericRowWithSchema (data 
> corruption)
> ---
>
> Key: SPARK-35108
> URL: https://issues.apache.org/jira/browse/SPARK-35108
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1, 3.0.2
>Reporter: Robert Joseph Evans
>Priority: Major
> Attachments: test.py, test.sh
>
>
> I think this also shows up for all versions of Spark that pickle the data 
> when doing a collect from Python.
> When you do a collect in Python, the JVM does a collect and converts the 
> UnsafeRows into GenericRowWithSchema instances before sending them to the 
> Pickler. The Pickler, by default, tries to dedupe objects using hashCode 
> and .equals for the object. But .equals and .hashCode for 
> GenericRowWithSchema only look at the data, not the schema, while the keys 
> from the schema are written out when the row is pickled.
> This can result in a form of data corruption in a few cases: when a row has 
> the same number of elements as a struct within that row, or as a sub-struct 
> nested inside another struct.
> If the data happens to be the same, the keys for the resulting row or struct 
> can be wrong.
> My repro case is a bit convoluted, but it does happen.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34738) Upgrade Minikube and kubernetes and move to docker 'virtualization' layer

2021-04-16 Thread Shane Knapp (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17324131#comment-17324131
 ] 

Shane Knapp commented on SPARK-34738:
-

Alright, my canary build (with the PV integration test skipped) passed with the 
docker driver: 
https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-k8s-clone/20/

I'll put together a PR for this over the weekend (it's a one-liner), and once we 
merge I can get the remaining workers upgraded early next week.

> Upgrade Minikube and kubernetes and move to docker 'virtualization' layer
> -
>
> Key: SPARK-34738
> URL: https://issues.apache.org/jira/browse/SPARK-34738
> Project: Spark
>  Issue Type: Task
>  Components: jenkins, Kubernetes
>Affects Versions: 3.2.0
>Reporter: Attila Zsolt Piros
>Assignee: Shane Knapp
>Priority: Major
> Attachments: integration-tests.log
>
>
> [~shaneknapp] as we discussed [on the mailing 
> list|http://apache-spark-developers-list.1001551.n3.nabble.com/minikube-and-kubernetes-cluster-versions-for-integration-testing-td30856.html]
>  Minikube can be upgraded to the latest (v1.18.1) and kubernetes version 
> should be v1.17.3 (`minikube config set kubernetes-version v1.17.3`).
> [Here|https://github.com/apache/spark/pull/31829] is my PR, which uses a new 
> method to configure the kubernetes client. Thanks in advance for using it for 
> testing on Jenkins after the Minikube version is updated.
>  
> Added by Shane:
> we also need to move from the kvm2 virtualization layer to docker.  docker is 
> a recommended driver w/the latest versions of minikube, and this will allow 
> devs to more easily recreate the minikube/k8s env on their local workstations 
> and run the integration tests in an identical environment as jenkins.
> the TL;DR is that upgrading to docker works, except that the PV integration 
> tests are failing due to a couple of possible reasons:
> 1) the 'spark-kubernetes-driver' isn't properly being loaded 
> (https://issues.apache.org/jira/browse/SPARK-34738?focusedCommentId=17312517&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17312517)
> 2) during the PV test run, the error message 'Given path 
> (/opt/spark/pv-tests/tmp4595937990978494271.txt) does not exist' shows up in 
> the logs.  however, the mk cluster *does* mount successfully to the local 
> bare-metal filesystem *and* if i 'minikube ssh' in to it, i can see the mount 
> and read/write successfully to it 
> (https://issues.apache.org/jira/browse/SPARK-34738?focusedCommentId=17312548&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17312548)
> i could really use some help, and if it's useful, i can create some local 
> accounts manually and allow ssh access for a couple of people to assist me.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34738) Upgrade Minikube and kubernetes and move to docker 'virtualization' layer

2021-04-16 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17324165#comment-17324165
 ] 

Attila Zsolt Piros commented on SPARK-34738:


@shane I can modify my PR to skip the PV tests by simply removing the k8s 
label from them:
https://github.com/apache/spark/blob/master/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/PVTestsSuite.scala#L125
and
https://github.com/apache/spark/blob/master/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/KubernetesSuite.scala#L567


The tests are selected via that label; see *-Dtest.include.tags=k8s* in 
the mvn command:

{{+ /home/jenkins/workspace/SparkPullRequestBuilder-K8s/build/mvn install -f 
/home/jenkins/workspace/SparkPullRequestBuilder-K8s/pom.xml -pl 
resource-managers/kubernetes/integration-tests -am -Pscala-2.12 -Phadoop-3.2 
-Pkubernetes -Pkubernetes-integration-tests -Djava.version=8 
-Dspark.kubernetes.test.sparkTgz=/home/jenkins/workspace/SparkPullRequestBuilder-K8s/spark-3.2.0-SNAPSHOT-bin-20210408-a823d0e7c36.tgz
 -Dspark.kubernetes.test.imageTag=N/A 
-Dspark.kubernetes.test.imageRepo=docker.io/kubespark 
-Dspark.kubernetes.test.deployMode=minikube -Dtest.include.tags=k8s 
-Dspark.kubernetes.test.jvmImage=spark 
-Dspark.kubernetes.test.pythonImage=spark-py 
-Dspark.kubernetes.test.rImage=spark-r -Dlog4j.logger.org.apache.spark=DEBUG}}

I can use a different, brand new label. That way, if you can make the PR build 
controlled by text in the PR title (as is already done for the Maven version), 
we can switch the test on only for those PRs where we try to fix the problem.
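
For reference, here is a small, self-contained ScalaTest sketch of how such a tag gates a test; the suite, test names, and tag objects below are illustrative, not the actual Spark suites:

{code:scala}
import org.scalatest.Tag
import org.scalatest.funsuite.AnyFunSuite

// Illustrative tag objects; the real ones live in Spark's k8s integration tests.
object K8sTag extends Tag("k8s")
object PersistentVolumeTag extends Tag("persistentVolume")

class ExampleTaggedSuite extends AnyFunSuite {

  // Selected by the regular run, which passes -Dtest.include.tags=k8s
  // (the build forwards include/exclude tags to ScalaTest).
  test("an ordinary k8s integration test (sketch)", K8sTag) {
    assert(1 + 1 == 2)
  }

  // Only selected when a run explicitly includes the new tag, e.g.
  // -Dtest.include.tags=persistentVolume.
  test("PVs with local storage (sketch)", PersistentVolumeTag) {
    assert(2 * 2 == 4)
  }
}
{code}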

Regarding:

> i could really use some help, and if it's useful, i can create some local 
> accounts manually and allow ssh access for a couple of people to assist me.

I am happy to help; I just need access to the logs. As I said earlier, it is 
probably enough if the PR builder already uses Docker for testing and the PV 
tests are running. That is what I was referring to:
> This way I can run experiments via the PR builder and see the logs with my 
> own eyes.

That way, in my PR (or the next one), I can temporarily log out what I need. 
But to be on the safe side, you can give me that ssh access too. 
The only problem is time: I can do the analysis part only at the end of next 
week.



> Upgrade Minikube and kubernetes and move to docker 'virtualization' layer
> -
>
> Key: SPARK-34738
> URL: https://issues.apache.org/jira/browse/SPARK-34738
> Project: Spark
>  Issue Type: Task
>  Components: jenkins, Kubernetes
>Affects Versions: 3.2.0
>Reporter: Attila Zsolt Piros
>Assignee: Shane Knapp
>Priority: Major
> Attachments: integration-tests.log
>
>
> [~shaneknapp] as we discussed [on the mailing 
> list|http://apache-spark-developers-list.1001551.n3.nabble.com/minikube-and-kubernetes-cluster-versions-for-integration-testing-td30856.html]
>  Minikube can be upgraded to the latest (v1.18.1) and kubernetes version 
> should be v1.17.3 (`minikube config set kubernetes-version v1.17.3`).
> [Here|https://github.com/apache/spark/pull/31829] is my PR, which uses a new 
> method to configure the kubernetes client. Thanks in advance for using it for 
> testing on Jenkins after the Minikube version is updated.
>  
> Added by Shane:
> we also need to move from the kvm2 virtualization layer to docker.  docker is 
> a recommended driver w/the latest versions of minikube, and this will allow 
> devs to more easily recreate the minikube/k8s env on their local workstations 
> and run the integration tests in an identical environment as jenkins.
> the TL;DR is that upgrading to docker works, except that the PV integration 
> tests are failing due to a couple of possible reasons:
> 1) the 'spark-kubernetes-driver' isn't properly being loaded 
> (https://issues.apache.org/jira/browse/SPARK-34738?focusedCommentId=17312517&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17312517)
> 2) during the PV test run, the error message 'Given path 
> (/opt/spark/pv-tests/tmp4595937990978494271.txt) does not exist' shows up in 
> the logs.  however, the mk cluster *does* mount successfully to the local 
> bare-metal filesystem *and* if i 'minikube ssh' in to it, i can see the mount 
> and read/write successfully to it 
> (https://issues.apache.org/jira/browse/SPARK-34738?focusedCommentId=17312548&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17312548)
> i could really use some help, and if it's useful, i can create some local 
> accounts manually and allow ssh access for a couple of people to assist me.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Comment Edited] (SPARK-34738) Upgrade Minikube and kubernetes and move to docker 'virtualization' layer

2021-04-16 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17324165#comment-17324165
 ] 

Attila Zsolt Piros edited comment on SPARK-34738 at 4/17/21, 3:56 AM:
--

@shane I can modify my PR to skip the PV tests by simply removing the k8s 
label from them:
https://github.com/apache/spark/blob/master/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/PVTestsSuite.scala#L125
and
https://github.com/apache/spark/blob/master/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/KubernetesSuite.scala#L567


The tests are selected via that label; see *-Dtest.include.tags=k8s* in 
the mvn command:

{{+ /home/jenkins/workspace/SparkPullRequestBuilder-K8s/build/mvn install -f 
/home/jenkins/workspace/SparkPullRequestBuilder-K8s/pom.xml -pl 
resource-managers/kubernetes/integration-tests -am -Pscala-2.12 -Phadoop-3.2 
-Pkubernetes -Pkubernetes-integration-tests -Djava.version=8 
-Dspark.kubernetes.test.sparkTgz=/home/jenkins/workspace/SparkPullRequestBuilder-K8s/spark-3.2.0-SNAPSHOT-bin-20210408-a823d0e7c36.tgz
 -Dspark.kubernetes.test.imageTag=N/A 
-Dspark.kubernetes.test.imageRepo=docker.io/kubespark 
-Dspark.kubernetes.test.deployMode=minikube -Dtest.include.tags=k8s 
-Dspark.kubernetes.test.jvmImage=spark 
-Dspark.kubernetes.test.pythonImage=spark-py 
-Dspark.kubernetes.test.rImage=spark-r -Dlog4j.logger.org.apache.spark=DEBUG}}

I can use a different, brand new label. That way, if you can make the PR build 
controlled by text in the PR title (as is already done for the Maven version), 
we can switch the test on only for those PRs where we try to fix the problem. I 
mean the section "Running Different Test Permutations on Jenkins" at 
https://spark.apache.org/developer-tools.html.


Regarding:

> i could really use some help, and if it's useful, i can create some local 
> accounts manually and allow ssh access for a couple of people to assist me.

I am happy to help; I just need access to the logs. As I said earlier, it is 
probably enough if the PR builder already uses Docker for testing and the PV 
tests are running. That is what I was referring to:
> This way I can run experiments via the PR builder and see the logs with my 
> own eyes.

That way, in my PR (or the next one), I can temporarily log out what I need. 
But to be on the safe side, you can give me that ssh access too. 
The only problem is time: I can do the analysis part only at the end of next 
week.




was (Author: attilapiros):
@shane I can modify my PR to skip the PV tests by simply removing the k8s 
label from them:
https://github.com/apache/spark/blob/master/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/PVTestsSuite.scala#L125
and
https://github.com/apache/spark/blob/master/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/KubernetesSuite.scala#L567


The tests are selected via that label; see *-Dtest.include.tags=k8s* in 
the mvn command:

{{+ /home/jenkins/workspace/SparkPullRequestBuilder-K8s/build/mvn install -f 
/home/jenkins/workspace/SparkPullRequestBuilder-K8s/pom.xml -pl 
resource-managers/kubernetes/integration-tests -am -Pscala-2.12 -Phadoop-3.2 
-Pkubernetes -Pkubernetes-integration-tests -Djava.version=8 
-Dspark.kubernetes.test.sparkTgz=/home/jenkins/workspace/SparkPullRequestBuilder-K8s/spark-3.2.0-SNAPSHOT-bin-20210408-a823d0e7c36.tgz
 -Dspark.kubernetes.test.imageTag=N/A 
-Dspark.kubernetes.test.imageRepo=docker.io/kubespark 
-Dspark.kubernetes.test.deployMode=minikube -Dtest.include.tags=k8s 
-Dspark.kubernetes.test.jvmImage=spark 
-Dspark.kubernetes.test.pythonImage=spark-py 
-Dspark.kubernetes.test.rImage=spark-r -Dlog4j.logger.org.apache.spark=DEBUG}}

I can use a different, brand new label. That way, if you can make the PR build 
controlled by text in the PR title (as is already done for the Maven version), 
we can switch the test on only for those PRs where we try to fix the problem.

Regarding:

> i could really use some help, and if it's useful, i can create some local 
> accounts manually and allow ssh access for a couple of people to assist me.

I am happy to help; I just need access to the logs. As I said earlier, it is 
probably enough if the PR builder already uses Docker for testing and the PV 
tests are running. That is what I was referring to:
> This way I can run experiments via the PR builder and see the logs with my 
> own eyes.

That way, in my PR (or the next one), I can temporarily log out what I need. 
But to be on the safe side, you can give me that ssh access too. 
The only problem is time: I can do the analysis part only at the end of next 
week.



> Upgrade Minikube and kubernetes and move to docker 'virtualization' layer
> ---

[jira] [Commented] (SPARK-34738) Upgrade Minikube and kubernetes and move to docker 'virtualization' layer

2021-04-16 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17324167#comment-17324167
 ] 

Attila Zsolt Piros commented on SPARK-34738:


My new commit:
https://github.com/apache/spark/pull/31829/commits/1f594aa40061da96545d29d28ca2b2a410655c34

So the new tag is "persistentVolume".

This PR on its own should switch off the "PVs with local storage" unit test, 
but to be on the safe side, please run a test with it.

> Upgrade Minikube and kubernetes and move to docker 'virtualization' layer
> -
>
> Key: SPARK-34738
> URL: https://issues.apache.org/jira/browse/SPARK-34738
> Project: Spark
>  Issue Type: Task
>  Components: jenkins, Kubernetes
>Affects Versions: 3.2.0
>Reporter: Attila Zsolt Piros
>Assignee: Shane Knapp
>Priority: Major
> Attachments: integration-tests.log
>
>
> [~shaneknapp] as we discussed [on the mailing 
> list|http://apache-spark-developers-list.1001551.n3.nabble.com/minikube-and-kubernetes-cluster-versions-for-integration-testing-td30856.html]
>  Minikube can be upgraded to the latest (v1.18.1) and kubernetes version 
> should be v1.17.3 (`minikube config set kubernetes-version v1.17.3`).
> [Here|https://github.com/apache/spark/pull/31829] is my PR, which uses a new 
> method to configure the kubernetes client. Thanks in advance for using it for 
> testing on Jenkins after the Minikube version is updated.
>  
> Added by Shane:
> we also need to move from the kvm2 virtualization layer to docker.  docker is 
> a recommended driver w/the latest versions of minikube, and this will allow 
> devs to more easily recreate the minikube/k8s env on their local workstations 
> and run the integration tests in an identical environment as jenkins.
> the TL;DR is that upgrading to docker works, except that the PV integration 
> tests are failing due to a couple of possible reasons:
> 1) the 'spark-kubernetes-driver' isn't properly being loaded 
> (https://issues.apache.org/jira/browse/SPARK-34738?focusedCommentId=17312517&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17312517)
> 2) during the PV test run, the error message 'Given path 
> (/opt/spark/pv-tests/tmp4595937990978494271.txt) does not exist' shows up in 
> the logs.  however, the mk cluster *does* mount successfully to the local 
> bare-metal filesystem *and* if i 'minikube ssh' in to it, i can see the mount 
> and read/write successfully to it 
> (https://issues.apache.org/jira/browse/SPARK-34738?focusedCommentId=17312548&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17312548)
> i could really use some help, and if it's useful, i can create some local 
> accounts manually and allow ssh access for a couple of people to assist me.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-34738) Upgrade Minikube and kubernetes and move to docker 'virtualization' layer

2021-04-16 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17324167#comment-17324167
 ] 

Attila Zsolt Piros edited comment on SPARK-34738 at 4/17/21, 4:27 AM:
--

My new commit:
https://github.com/apache/spark/pull/31829/commits/1f594aa40061da96545d29d28ca2b2a410655c34

So the new tag is "persistentVolume".

This PR on its own should switch off the "PVs with local storage" unit test, 
but to be on the safe side, please run a test with it.

Regarding controlling the build via the PR title: we have to set one flag for 
dev-run-integration-tests.sh, namely "--include-tags"; see:
https://github.com/apache/spark/blob/master/resource-managers/kubernetes/integration-tests/dev/dev-run-integration-tests.sh#L88-L91
 


was (Author: attilapiros):
My new commit:
https://github.com/apache/spark/pull/31829/commits/1f594aa40061da96545d29d28ca2b2a410655c34

So the new tag is "persistentVolume".

This PR on its own should switch off the "PVs with local storage" unit test, 
but to be on the safe side, please run a test with it.

> Upgrade Minikube and kubernetes and move to docker 'virtualization' layer
> -
>
> Key: SPARK-34738
> URL: https://issues.apache.org/jira/browse/SPARK-34738
> Project: Spark
>  Issue Type: Task
>  Components: jenkins, Kubernetes
>Affects Versions: 3.2.0
>Reporter: Attila Zsolt Piros
>Assignee: Shane Knapp
>Priority: Major
> Attachments: integration-tests.log
>
>
> [~shaneknapp] as we discussed [on the mailing 
> list|http://apache-spark-developers-list.1001551.n3.nabble.com/minikube-and-kubernetes-cluster-versions-for-integration-testing-td30856.html]
>  Minikube can be upgraded to the latest (v1.18.1) and kubernetes version 
> should be v1.17.3 (`minikube config set kubernetes-version v1.17.3`).
> [Here|https://github.com/apache/spark/pull/31829] is my PR, which uses a new 
> method to configure the kubernetes client. Thanks in advance for using it for 
> testing on Jenkins after the Minikube version is updated.
>  
> Added by Shane:
> we also need to move from the kvm2 virtualization layer to docker.  docker is 
> a recommended driver w/the latest versions of minikube, and this will allow 
> devs to more easily recreate the minikube/k8s env on their local workstations 
> and run the integration tests in an identical environment as jenkins.
> the TL;DR is that upgrading to docker works, except that the PV integration 
> tests are failing due to a couple of possible reasons:
> 1) the 'spark-kubernetes-driver' isn't properly being loaded 
> (https://issues.apache.org/jira/browse/SPARK-34738?focusedCommentId=17312517&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17312517)
> 2) during the PV test run, the error message 'Given path 
> (/opt/spark/pv-tests/tmp4595937990978494271.txt) does not exist' shows up in 
> the logs.  however, the mk cluster *does* mount successfully to the local 
> bare-metal filesystem *and* if i 'minikube ssh' in to it, i can see the mount 
> and read/write successfully to it 
> (https://issues.apache.org/jira/browse/SPARK-34738?focusedCommentId=17312548&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17312548)
> i could really use some help, and if it's useful, i can create some local 
> accounts manually and allow ssh access for a couple of people to assist me.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32634) Introduce sort-based fallback mechanism for shuffled hash join

2021-04-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17324173#comment-17324173
 ] 

Apache Spark commented on SPARK-32634:
--

User 'c21' has created a pull request for this issue:
https://github.com/apache/spark/pull/32210

> Introduce sort-based fallback mechanism for shuffled hash join 
> ---
>
> Key: SPARK-32634
> URL: https://issues.apache.org/jira/browse/SPARK-32634
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Priority: Minor
>
> A major pain point that keeps Spark users away from shuffled hash join is the 
> risk of out-of-memory errors. Shuffled hash join tends to OOM because it 
> allocates an in-memory hashed relation (`UnsafeHashedRelation` or 
> `LongHashedRelation`) for the build side, and there is no recovery (e.g. 
> fallback/spill) once the hashed relation grows and can no longer fit in 
> memory. On the other hand, shuffled hash join is more CPU and IO efficient 
> than sort merge join when joining one large table and a small table (where 
> the small table is still too large to be broadcast), because SHJ does not 
> sort the large table while SMJ has to.
> To improve the reliability of shuffled hash join, a fallback mechanism can be 
> introduced to avoid the OOM issue completely; similarly, we already have a 
> fallback to sort-based aggregation for hash aggregate. The idea is:
> (1). Build the hashed relation as today, but monitor its size while inserting 
> each build-side row. If the size stays below a configurable threshold, go to 
> (2.1); otherwise go to (2.2).
> (2.1). Current shuffled hash join logic: read stream-side rows and probe the 
> hashed relation.
> (2.2). Fall back to sort merge join: sort the stream-side rows, and sort the 
> build-side rows (iterate the rows already in the hashed relation, e.g. through 
> `BytesToBytesMap.destructiveIterator`, then iterate the remaining un-read 
> build-side rows). Then do a sort merge join over the stream and build sides.
>  
> Note:
> (1). The fallback is dynamic and happens per task: task 0 may incur the 
> fallback (e.g. if it has a big build side) while tasks 1 and 2 do not, 
> depending on the size of their hashed relations.
> (2). There is no major code change for SHJ or SMJ. The main change is around 
> HashedRelation, introducing some new methods, e.g. 
> `HashedRelation.destructiveValues()` to return an Iterator of the build-side 
> rows in the hashed relation while cleaning it up along the way.
> (3). We have run this feature by default in our internal fork for more than 2 
> years, and we benefit a lot from it: users can choose SHJ, and we don't need 
> to worry about SHJ reliability (see 
> https://issues.apache.org/jira/browse/SPARK-21505 for the original proposal 
> from our side; I tweaked it here to make it less intrusive and more 
> acceptable, e.g. not introducing a separate join operator but doing the 
> fallback automatically inside the SHJ operator itself).
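
To make the flow above concrete, here is a self-contained, hedged Scala sketch of steps (1), (2.1) and (2.2), with a plain Scala map standing in for Spark's HashedRelation; none of the names below are Spark APIs:

{code:scala}
import scala.collection.mutable

object ShuffledHashJoinFallbackSketch {

  // Hedged sketch of the per-task fallback described above; Int keys and
  // String payloads are arbitrary choices for illustration.
  def joinWithFallback(
      streamRows: Seq[(Int, String)],
      buildRows: Iterator[(Int, String)],
      maxBuildRows: Int): Seq[(String, String)] = {

    // (1) Build the hashed relation, monitoring its size on every insert.
    val hashed = mutable.Map.empty[Int, List[String]].withDefaultValue(Nil)
    var inserted = 0
    var overflowed = false
    while (buildRows.hasNext && !overflowed) {
      val (k, v) = buildRows.next()
      hashed(k) = v :: hashed(k)
      inserted += 1
      overflowed = inserted > maxBuildRows
    }

    if (!overflowed) {
      // (2.1) Regular shuffled hash join: probe the in-memory relation.
      for ((k, sv) <- streamRows; bv <- hashed(k)) yield (sv, bv)
    } else {
      // (2.2) Fallback: combine the rows already inserted with the remaining
      // un-read build rows, sort both sides by key, and join them (a real
      // implementation would stream a sort merge join here).
      val buildSide =
        (hashed.toSeq.flatMap { case (k, vs) => vs.map(k -> _) } ++ buildRows).sortBy(_._1)
      val streamSide = streamRows.sortBy(_._1)
      for ((k, sv) <- streamSide; (bk, bv) <- buildSide; if bk == k) yield (sv, bv)
    }
  }

  def main(args: Array[String]): Unit = {
    val stream = Seq(1 -> "s1", 2 -> "s2")
    val build = Iterator(1 -> "b1", 2 -> "b2", 2 -> "b3")
    // With maxBuildRows = 2 the third build row triggers the fallback path.
    println(joinWithFallback(stream, build, maxBuildRows = 2))
  }
}
{code}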



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32634) Introduce sort-based fallback mechanism for shuffled hash join

2021-04-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32634:


Assignee: (was: Apache Spark)

> Introduce sort-based fallback mechanism for shuffled hash join 
> ---
>
> Key: SPARK-32634
> URL: https://issues.apache.org/jira/browse/SPARK-32634
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Priority: Minor
>
> A major pain point that keeps Spark users away from shuffled hash join is the 
> risk of out-of-memory errors. Shuffled hash join tends to OOM because it 
> allocates an in-memory hashed relation (`UnsafeHashedRelation` or 
> `LongHashedRelation`) for the build side, and there is no recovery (e.g. 
> fallback/spill) once the hashed relation grows and can no longer fit in 
> memory. On the other hand, shuffled hash join is more CPU and IO efficient 
> than sort merge join when joining one large table and a small table (where 
> the small table is still too large to be broadcast), because SHJ does not 
> sort the large table while SMJ has to.
> To improve the reliability of shuffled hash join, a fallback mechanism can be 
> introduced to avoid the OOM issue completely; similarly, we already have a 
> fallback to sort-based aggregation for hash aggregate. The idea is:
> (1). Build the hashed relation as today, but monitor its size while inserting 
> each build-side row. If the size stays below a configurable threshold, go to 
> (2.1); otherwise go to (2.2).
> (2.1). Current shuffled hash join logic: read stream-side rows and probe the 
> hashed relation.
> (2.2). Fall back to sort merge join: sort the stream-side rows, and sort the 
> build-side rows (iterate the rows already in the hashed relation, e.g. through 
> `BytesToBytesMap.destructiveIterator`, then iterate the remaining un-read 
> build-side rows). Then do a sort merge join over the stream and build sides.
>  
> Note:
> (1). The fallback is dynamic and happens per task: task 0 may incur the 
> fallback (e.g. if it has a big build side) while tasks 1 and 2 do not, 
> depending on the size of their hashed relations.
> (2). There is no major code change for SHJ or SMJ. The main change is around 
> HashedRelation, introducing some new methods, e.g. 
> `HashedRelation.destructiveValues()` to return an Iterator of the build-side 
> rows in the hashed relation while cleaning it up along the way.
> (3). We have run this feature by default in our internal fork for more than 2 
> years, and we benefit a lot from it: users can choose SHJ, and we don't need 
> to worry about SHJ reliability (see 
> https://issues.apache.org/jira/browse/SPARK-21505 for the original proposal 
> from our side; I tweaked it here to make it less intrusive and more 
> acceptable, e.g. not introducing a separate join operator but doing the 
> fallback automatically inside the SHJ operator itself).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32634) Introduce sort-based fallback mechanism for shuffled hash join

2021-04-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32634:


Assignee: Apache Spark

> Introduce sort-based fallback mechanism for shuffled hash join 
> ---
>
> Key: SPARK-32634
> URL: https://issues.apache.org/jira/browse/SPARK-32634
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Assignee: Apache Spark
>Priority: Minor
>
> A major pain point that keeps Spark users away from shuffled hash join is the 
> risk of out-of-memory errors. Shuffled hash join tends to OOM because it 
> allocates an in-memory hashed relation (`UnsafeHashedRelation` or 
> `LongHashedRelation`) for the build side, and there is no recovery (e.g. 
> fallback/spill) once the hashed relation grows and can no longer fit in 
> memory. On the other hand, shuffled hash join is more CPU and IO efficient 
> than sort merge join when joining one large table and a small table (where 
> the small table is still too large to be broadcast), because SHJ does not 
> sort the large table while SMJ has to.
> To improve the reliability of shuffled hash join, a fallback mechanism can be 
> introduced to avoid the OOM issue completely; similarly, we already have a 
> fallback to sort-based aggregation for hash aggregate. The idea is:
> (1). Build the hashed relation as today, but monitor its size while inserting 
> each build-side row. If the size stays below a configurable threshold, go to 
> (2.1); otherwise go to (2.2).
> (2.1). Current shuffled hash join logic: read stream-side rows and probe the 
> hashed relation.
> (2.2). Fall back to sort merge join: sort the stream-side rows, and sort the 
> build-side rows (iterate the rows already in the hashed relation, e.g. through 
> `BytesToBytesMap.destructiveIterator`, then iterate the remaining un-read 
> build-side rows). Then do a sort merge join over the stream and build sides.
>  
> Note:
> (1). The fallback is dynamic and happens per task: task 0 may incur the 
> fallback (e.g. if it has a big build side) while tasks 1 and 2 do not, 
> depending on the size of their hashed relations.
> (2). There is no major code change for SHJ or SMJ. The main change is around 
> HashedRelation, introducing some new methods, e.g. 
> `HashedRelation.destructiveValues()` to return an Iterator of the build-side 
> rows in the hashed relation while cleaning it up along the way.
> (3). We have run this feature by default in our internal fork for more than 2 
> years, and we benefit a lot from it: users can choose SHJ, and we don't need 
> to worry about SHJ reliability (see 
> https://issues.apache.org/jira/browse/SPARK-21505 for the original proposal 
> from our side; I tweaked it here to make it less intrusive and more 
> acceptable, e.g. not introducing a separate join operator but doing the 
> fallback automatically inside the SHJ operator itself).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35109) Fix minor exception messages of HashedRelation and HashedJoin

2021-04-16 Thread Cheng Su (Jira)
Cheng Su created SPARK-35109:


 Summary: Fix minor exception messages of HashedRelation and 
HashedJoin
 Key: SPARK-35109
 URL: https://issues.apache.org/jira/browse/SPARK-35109
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Cheng Su


It seems that we missed classifying one `SparkOutOfMemoryError` in 
`HashedRelation`; add the error classification for it. In addition, clean up 
two error definitions in `HashJoin`, as they are not used.
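
For illustration, here is a hedged sketch of what moving an inline exception into a central error-classification object looks like; all names below are illustrative, not Spark's actual QueryExecutionErrors API or the actual patch:

{code:scala}
// Hedged sketch: a central errors object so exception messages are defined in
// one place, mirroring the error-classification style referred to above.
object ExampleErrors {
  def cannotAcquireMemoryToBuildHashedRelationError(requested: Long): Throwable =
    new OutOfMemoryError(s"Cannot acquire $requested bytes to build the hashed relation")
}

object HashedRelationSketch {
  def allocate(requestedBytes: Long, availableBytes: Long): Unit = {
    if (requestedBytes > availableBytes) {
      // Before: an inline throw with an ad-hoc message at the call site.
      // After: route through the shared errors object.
      throw ExampleErrors.cannotAcquireMemoryToBuildHashedRelationError(requestedBytes)
    }
    // ... allocation would proceed here ...
  }
}
{code}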



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35109) Fix minor exception messages of HashedRelation and HashedJoin

2021-04-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35109:


Assignee: Apache Spark

> Fix minor exception messages of HashedRelation and HashedJoin
> -
>
> Key: SPARK-35109
> URL: https://issues.apache.org/jira/browse/SPARK-35109
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Cheng Su
>Assignee: Apache Spark
>Priority: Trivial
>
> It seems that we missed classifying one `SparkOutOfMemoryError` in 
> `HashedRelation`; add the error classification for it. In addition, clean up 
> two error definitions in `HashJoin`, as they are not used.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35109) Fix minor exception messages of HashedRelation and HashedJoin

2021-04-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17324180#comment-17324180
 ] 

Apache Spark commented on SPARK-35109:
--

User 'c21' has created a pull request for this issue:
https://github.com/apache/spark/pull/32211

> Fix minor exception messages of HashedRelation and HashedJoin
> -
>
> Key: SPARK-35109
> URL: https://issues.apache.org/jira/browse/SPARK-35109
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Cheng Su
>Priority: Trivial
>
> It seems that we missed classifying one `SparkOutOfMemoryError` in 
> `HashedRelation`; add the error classification for it. In addition, clean up 
> two error definitions in `HashJoin`, as they are not used.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35109) Fix minor exception messages of HashedRelation and HashedJoin

2021-04-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35109:


Assignee: (was: Apache Spark)

> Fix minor exception messages of HashedRelation and HashedJoin
> -
>
> Key: SPARK-35109
> URL: https://issues.apache.org/jira/browse/SPARK-35109
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Cheng Su
>Priority: Trivial
>
> It seems that we missed classifying one `SparkOutOfMemoryError` in 
> `HashedRelation`; add the error classification for it. In addition, clean up 
> two error definitions in `HashJoin`, as they are not used.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35109) Fix minor exception messages of HashedRelation and HashJoin

2021-04-16 Thread Cheng Su (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Su updated SPARK-35109:
-
Summary: Fix minor exception messages of HashedRelation and HashJoin  (was: 
Fix minor exception messages of HashedRelation and HashedJoin)

> Fix minor exception messages of HashedRelation and HashJoin
> ---
>
> Key: SPARK-35109
> URL: https://issues.apache.org/jira/browse/SPARK-35109
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Cheng Su
>Priority: Trivial
>
> It seems that we missed classifying one `SparkOutOfMemoryError` in 
> `HashedRelation`; add the error classification for it. In addition, clean up 
> two error definitions in `HashJoin`, as they are not used.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org