[jira] [Resolved] (SPARK-33179) Switch default Hadoop profile in run-tests.py
[ https://issues.apache.org/jira/browse/SPARK-33179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-33179. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30090 [https://github.com/apache/spark/pull/30090] > Switch default Hadoop profile in run-tests.py > - > > Key: SPARK-33179 > URL: https://issues.apache.org/jira/browse/SPARK-33179 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.1.0 >Reporter: William Hyun >Assignee: William Hyun >Priority: Major > Fix For: 3.1.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33179) Switch default Hadoop profile in run-tests.py
[ https://issues.apache.org/jira/browse/SPARK-33179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-33179: Assignee: William Hyun > Switch default Hadoop profile in run-tests.py > - > > Key: SPARK-33179 > URL: https://issues.apache.org/jira/browse/SPARK-33179 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.1.0 >Reporter: William Hyun >Assignee: William Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-33128) mismatched input since Spark 3.0
[ https://issues.apache.org/jira/browse/SPARK-33128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17216485#comment-17216485 ] Yang Jie edited comment on SPARK-33128 at 10/19/20, 6:43 AM: - [~yumwang] I found that without SPARK-21136, this case can pass, and there are some related reminders in the sql-migration-guide as follows: {code:java} ### Query Engine - In Spark version 2.4 and below, SQL queries such as `FROM <table>` or `FROM <table> UNION ALL FROM <table>` are supported by accident. In hive-style `FROM <table> SELECT <expr>`, the `SELECT` clause is not negligible. Neither Hive nor Presto support this syntax. These queries are treated as invalid in Spark 3.0. {code} but it looks like a mistake for "SELECT 1 UNION ALL SELECT 1" was (Author: luciferyang): [~yumwang] I found that without SPARK-21136, this case can pass, and there are some related reminders in the sql-migration-guide as follows: {code:java} ### Query Engine - In Spark version 2.4 and below, SQL queries such as `FROM <table>` or `FROM <table> UNION ALL FROM <table>` are supported by accident. In hive-style `FROM <table> SELECT <expr>`, the `SELECT` clause is not negligible. Neither Hive nor Presto support this syntax. These queries are treated as invalid in Spark 3.0. {code} but it looks like a mistake > mismatched input since Spark 3.0 > > > Key: SPARK-33128 > URL: https://issues.apache.org/jira/browse/SPARK-33128 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1, 3.1.0 >Reporter: Yuming Wang >Priority: Major > > Spark 2.4: > {noformat} > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 2.4.4 > /_/ > Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java > 1.8.0_221) > Type in expressions to have them evaluated. > Type :help for more information. > scala> spark.sql("SELECT 1 UNION SELECT 1 UNION ALL SELECT 1").show > +---+ > | 1| > +---+ > | 1| > | 1| > +---+ > {noformat} > Spark 3.x: > {noformat} > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 3.1.0-SNAPSHOT > /_/ > Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 14.0.1) > Type in expressions to have them evaluated. > Type :help for more information. > scala> spark.sql("SELECT 1 UNION SELECT 1 UNION ALL SELECT 1").show > org.apache.spark.sql.catalyst.parser.ParseException: > mismatched input 'SELECT' expecting {<EOF>, ';'}(line 1, pos 15) > == SQL == > SELECT 1 UNION SELECT 1 UNION ALL SELECT 1 > ---^^^ > at > org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:263) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:130) > at > org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:51) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:81) > at org.apache.spark.sql.SparkSession.$anonfun$sql$2(SparkSession.scala:610) > at > org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111) > at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:610) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:769) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:607) > ... 47 elided > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33128) mismatched input since Spark 3.0
[ https://issues.apache.org/jira/browse/SPARK-33128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17216485#comment-17216485 ] Yang Jie commented on SPARK-33128: -- [~yumwang] I found that without SPARK-21136, this case can pass, and there are some related reminders in the sql-migration-guide as follows: {code:java} ### Query Engine - In Spark version 2.4 and below, SQL queries such as `FROM <table>` or `FROM <table> UNION ALL FROM <table>` are supported by accident. In hive-style `FROM <table> SELECT <expr>`, the `SELECT` clause is not negligible. Neither Hive nor Presto support this syntax. These queries are treated as invalid in Spark 3.0. {code} but it looks like a mistake > mismatched input since Spark 3.0 > > > Key: SPARK-33128 > URL: https://issues.apache.org/jira/browse/SPARK-33128 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1, 3.1.0 >Reporter: Yuming Wang >Priority: Major > > Spark 2.4: > {noformat} > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 2.4.4 > /_/ > Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java > 1.8.0_221) > Type in expressions to have them evaluated. > Type :help for more information. > scala> spark.sql("SELECT 1 UNION SELECT 1 UNION ALL SELECT 1").show > +---+ > | 1| > +---+ > | 1| > | 1| > +---+ > {noformat} > Spark 3.x: > {noformat} > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 3.1.0-SNAPSHOT > /_/ > Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 14.0.1) > Type in expressions to have them evaluated. > Type :help for more information. > scala> spark.sql("SELECT 1 UNION SELECT 1 UNION ALL SELECT 1").show > org.apache.spark.sql.catalyst.parser.ParseException: > mismatched input 'SELECT' expecting {<EOF>, ';'}(line 1, pos 15) > == SQL == > SELECT 1 UNION SELECT 1 UNION ALL SELECT 1 > ---^^^ > at > org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:263) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:130) > at > org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:51) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:81) > at org.apache.spark.sql.SparkSession.$anonfun$sql$2(SparkSession.scala:610) > at > org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111) > at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:610) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:769) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:607) > ... 47 elided > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
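To make the quoted migration-guide entry concrete, here is a minimal sketch; it assumes a running SparkSession named spark and a hypothetical temp view t, neither of which comes from the ticket. A bare FROM query was accepted by accident in Spark 2.4 and below and is rejected since 3.0, while hive-style FROM with an explicit SELECT clause, and the standard form shown below, still parse.

{code:scala}
// Minimal sketch (hypothetical temp view `t`, assumes a running SparkSession `spark`).
spark.range(3).createOrReplaceTempView("t")

// Accepted by accident in Spark 2.4 and below, treated as invalid since 3.0:
//   spark.sql("FROM t")
// Hive-style FROM keeps its SELECT clause and remains supported; the standard
// form certainly does:
spark.sql("SELECT id FROM t").show()
{code}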
[jira] [Updated] (SPARK-32557) Logging and Swallowing the Exception Per Entry in History Server
[ https://issues.apache.org/jira/browse/SPARK-32557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim updated SPARK-32557: - Fix Version/s: 3.0.2 > Logging and Swallowing the Exception Per Entry in History Server > > > Key: SPARK-32557 > URL: https://issues.apache.org/jira/browse/SPARK-32557 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Yan Xiaole >Assignee: Yan Xiaole >Priority: Major > Fix For: 3.0.2, 3.1.0 > > > As discussed in [https://github.com/apache/spark/pull/29350] > To keep any single entry from affecting others while the History Server scans the log dir, we > would like to add a try/catch to log and swallow the exception per entry. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
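The change described above boils down to a per-entry try/catch while scanning the log directory. A rough sketch of that pattern follows; entries and processEntry are made-up stand-ins for the real history-provider code, not the merged change.

{code:scala}
import scala.util.control.NonFatal

// Illustrative only -- not the actual history-server change. `entries` and
// `processEntry` are hypothetical stand-ins for the log-dir listing and the
// per-entry work.
def scanLogDir(entries: Seq[String], processEntry: String => Unit): Unit = {
  entries.foreach { entry =>
    try {
      processEntry(entry)
    } catch {
      // Log and swallow, so one bad entry cannot abort the whole scan.
      case NonFatal(e) =>
        println(s"Failed to process entry $entry: ${e.getMessage}")
    }
  }
}
{code}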
[jira] [Updated] (SPARK-33146) Encountering an invalid rolling event log folder prevents loading other applications in SHS
[ https://issues.apache.org/jira/browse/SPARK-33146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim updated SPARK-33146: - Fix Version/s: 3.0.2 > Encountering an invalid rolling event log folder prevents loading other > applications in SHS > --- > > Key: SPARK-33146 > URL: https://issues.apache.org/jira/browse/SPARK-33146 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Adam Binford >Assignee: Adam Binford >Priority: Major > Fix For: 3.0.2, 3.1.0 > > > A follow-on issue from https://issues.apache.org/jira/browse/SPARK-33133 > If an invalid rolling event log folder is encountered by the Spark History > Server upon startup, it crashes the whole loading process and prevents any > valid applications from loading. We should simply catch the error, log it, > and continue loading other applications. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33139) protect setActiveSession and clearActiveSession
[ https://issues.apache.org/jira/browse/SPARK-33139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17216468#comment-17216468 ] Apache Spark commented on SPARK-33139: -- User 'leanken' has created a pull request for this issue: https://github.com/apache/spark/pull/30092 > protect setActiveSession and clearActiveSession > --- > > Key: SPARK-33139 > URL: https://issues.apache.org/jira/browse/SPARK-33139 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Leanken.Lin >Assignee: Leanken.Lin >Priority: Major > Fix For: 3.1.0 > > > This PR is a sub-task of > [SPARK-33138](https://issues.apache.org/jira/browse/SPARK-33138). In order to > make SQLConf.get reliable and stable, we need to make sure users can't pollute > the SQLConf and SparkSession context by calling setActiveSession and > clearActiveSession. > Changes in the PR: > * add a legacy config, spark.sql.legacy.allowModifyActiveSession, to fall back to > the old behavior if users do need to call these two APIs. > * by default, calling these two APIs throws an exception > * add two extra internal and private APIs, setActiveSessionInternal and > clearActiveSessionInternal, for current internal usage > * change all internal references to the new internal APIs, except for > SQLContext.setActive and SQLContext.clearActive -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33139) protect setActiveSession and clearActiveSession
[ https://issues.apache.org/jira/browse/SPARK-33139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17216467#comment-17216467 ] Apache Spark commented on SPARK-33139: -- User 'leanken' has created a pull request for this issue: https://github.com/apache/spark/pull/30092 > protect setActiveSession and clearActiveSession > --- > > Key: SPARK-33139 > URL: https://issues.apache.org/jira/browse/SPARK-33139 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Leanken.Lin >Assignee: Leanken.Lin >Priority: Major > Fix For: 3.1.0 > > > This PR is a sub-task of > [SPARK-33138](https://issues.apache.org/jira/browse/SPARK-33138). In order to > make SQLConf.get reliable and stable, we need to make sure users can't pollute > the SQLConf and SparkSession context by calling setActiveSession and > clearActiveSession. > Changes in the PR: > * add a legacy config, spark.sql.legacy.allowModifyActiveSession, to fall back to > the old behavior if users do need to call these two APIs. > * by default, calling these two APIs throws an exception > * add two extra internal and private APIs, setActiveSessionInternal and > clearActiveSessionInternal, for current internal usage > * change all internal references to the new internal APIs, except for > SQLContext.setActive and SQLContext.clearActive -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
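A condensed sketch of the guard the ticket describes, assuming made-up names and a plain String standing in for the session object; the real change lives in SparkSession/SQLContext and reads the legacy flag from SQLConf.

{code:scala}
// Hypothetical, simplified illustration of the guard; not the actual Spark code.
object ActiveSessionGuard {
  // stand-in for the legacy flag spark.sql.legacy.allowModifyActiveSession
  @volatile var allowModifyActiveSession: Boolean = false

  private val active = new ThreadLocal[Option[String]] {
    override def initialValue(): Option[String] = None
  }

  def getActiveSession: Option[String] = active.get()

  // public API: throws by default, honoured only when the legacy flag is set
  def setActiveSession(session: String): Unit = {
    if (!allowModifyActiveSession) {
      throw new UnsupportedOperationException(
        "Modifying the active session is disallowed; set " +
          "spark.sql.legacy.allowModifyActiveSession to restore the old behavior.")
    }
    setActiveSessionInternal(session)
  }

  def clearActiveSession(): Unit = {
    if (!allowModifyActiveSession) {
      throw new UnsupportedOperationException(
        "Clearing the active session is disallowed; set " +
          "spark.sql.legacy.allowModifyActiveSession to restore the old behavior.")
    }
    clearActiveSessionInternal()
  }

  // internal variants (package-private in the real code) that Spark itself keeps using
  def setActiveSessionInternal(session: String): Unit = active.set(Some(session))
  def clearActiveSessionInternal(): Unit = active.set(None)
}
{code}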
[jira] [Commented] (SPARK-33180) Enables 'fail_if_no_tests' when reporting test results
[ https://issues.apache.org/jira/browse/SPARK-33180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17216461#comment-17216461 ] Apache Spark commented on SPARK-33180: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/30091 > Enables 'fail_if_no_tests' when reporting test results > -- > > Key: SPARK-33180 > URL: https://issues.apache.org/jira/browse/SPARK-33180 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 2.4.8, 3.0.2, 3.1.0 >Reporter: Hyukjin Kwon >Priority: Minor > > SPARK-33069 skipped because it raises a false alarm when there are no test > cases. This is now fixed in > https://github.com/ScaCap/action-surefire-report/issues/29 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33180) Enables 'fail_if_no_tests' when reporting test results
[ https://issues.apache.org/jira/browse/SPARK-33180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17216460#comment-17216460 ] Apache Spark commented on SPARK-33180: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/30091 > Enables 'fail_if_no_tests' when reporting test results > -- > > Key: SPARK-33180 > URL: https://issues.apache.org/jira/browse/SPARK-33180 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 2.4.8, 3.0.2, 3.1.0 >Reporter: Hyukjin Kwon >Priority: Minor > > SPARK-33069 skipped because it raises a false alarm when there are no test > cases. This is now fixed in > https://github.com/ScaCap/action-surefire-report/issues/29 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33180) Enables 'fail_if_no_tests' when reporting test results
[ https://issues.apache.org/jira/browse/SPARK-33180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33180: Assignee: (was: Apache Spark) > Enables 'fail_if_no_tests' when reporting test results > -- > > Key: SPARK-33180 > URL: https://issues.apache.org/jira/browse/SPARK-33180 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 2.4.8, 3.0.2, 3.1.0 >Reporter: Hyukjin Kwon >Priority: Minor > > SPARK-33069 skipped because it raises a false alarm when there are no test > cases. This is now fixed in > https://github.com/ScaCap/action-surefire-report/issues/29 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33180) Enables 'fail_if_no_tests' when reporting test results
[ https://issues.apache.org/jira/browse/SPARK-33180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33180: Assignee: Apache Spark > Enables 'fail_if_no_tests' when reporting test results > -- > > Key: SPARK-33180 > URL: https://issues.apache.org/jira/browse/SPARK-33180 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 2.4.8, 3.0.2, 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Minor > > SPARK-33069 skipped because it raises a false alarm when there are no test > cases. This is now fixed in > https://github.com/ScaCap/action-surefire-report/issues/29 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33180) Enables 'fail_if_no_tests' when reporting test results
[ https://issues.apache.org/jira/browse/SPARK-33180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-33180: - Summary: Enables 'fail_if_no_tests' when reporting test results (was: Enables 'fail_if_no_tests' in GitHub Actions instead of manually skipping) > Enables 'fail_if_no_tests' when reporting test results > -- > > Key: SPARK-33180 > URL: https://issues.apache.org/jira/browse/SPARK-33180 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 2.4.8, 3.0.2, 3.1.0 >Reporter: Hyukjin Kwon >Priority: Minor > > SPARK-33069 skipped because it raises a false alarm when there are no test > cases. This is now fixed in > https://github.com/ScaCap/action-surefire-report/issues/29 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33180) Enables 'fail_if_no_tests' in GitHub Actions instead of manually skipping
Hyukjin Kwon created SPARK-33180: Summary: Enables 'fail_if_no_tests' in GitHub Actions instead of manually skipping Key: SPARK-33180 URL: https://issues.apache.org/jira/browse/SPARK-33180 Project: Spark Issue Type: Sub-task Components: Project Infra Affects Versions: 2.4.8, 3.0.2, 3.1.0 Reporter: Hyukjin Kwon SPARK-33069 skipped because it raises a false alarm when there are no test cases. This is now fixed in https://github.com/ScaCap/action-surefire-report/issues/29 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33123) Ignore `GitHub Action file` change in Amplab Jenkins
[ https://issues.apache.org/jira/browse/SPARK-33123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-33123. -- Fix Version/s: 2.4.8 3.0.2 3.1.0 Resolution: Fixed Issue resolved by pull request 30020 [https://github.com/apache/spark/pull/30020] > Ignore `GitHub Action file` change in Amplab Jenkins > > > Key: SPARK-33123 > URL: https://issues.apache.org/jira/browse/SPARK-33123 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.1.0 >Reporter: William Hyun >Assignee: William Hyun >Priority: Major > Fix For: 3.1.0, 3.0.2, 2.4.8 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33123) Ignore `GitHub Action file` change in Amplab Jenkins
[ https://issues.apache.org/jira/browse/SPARK-33123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-33123: Assignee: William Hyun > Ignore `GitHub Action file` change in Amplab Jenkins > > > Key: SPARK-33123 > URL: https://issues.apache.org/jira/browse/SPARK-33123 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.1.0 >Reporter: William Hyun >Assignee: William Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33179) Switch default Hadoop profile in run-tests.py
[ https://issues.apache.org/jira/browse/SPARK-33179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17216444#comment-17216444 ] Apache Spark commented on SPARK-33179: -- User 'williamhyun' has created a pull request for this issue: https://github.com/apache/spark/pull/30090 > Switch default Hadoop profile in run-tests.py > - > > Key: SPARK-33179 > URL: https://issues.apache.org/jira/browse/SPARK-33179 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.1.0 >Reporter: William Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33179) Switch default Hadoop profile in run-tests.py
[ https://issues.apache.org/jira/browse/SPARK-33179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33179: Assignee: Apache Spark > Switch default Hadoop profile in run-tests.py > - > > Key: SPARK-33179 > URL: https://issues.apache.org/jira/browse/SPARK-33179 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.1.0 >Reporter: William Hyun >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33179) Switch default Hadoop profile in run-tests.py
[ https://issues.apache.org/jira/browse/SPARK-33179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17216445#comment-17216445 ] Apache Spark commented on SPARK-33179: -- User 'williamhyun' has created a pull request for this issue: https://github.com/apache/spark/pull/30090 > Switch default Hadoop profile in run-tests.py > - > > Key: SPARK-33179 > URL: https://issues.apache.org/jira/browse/SPARK-33179 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.1.0 >Reporter: William Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33179) Switch default Hadoop profile in run-tests.py
[ https://issues.apache.org/jira/browse/SPARK-33179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33179: Assignee: (was: Apache Spark) > Switch default Hadoop profile in run-tests.py > - > > Key: SPARK-33179 > URL: https://issues.apache.org/jira/browse/SPARK-33179 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.1.0 >Reporter: William Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33179) Switch default Hadoop profile in run-tests.py
[ https://issues.apache.org/jira/browse/SPARK-33179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] William Hyun updated SPARK-33179: - Summary: Switch default Hadoop profile in run-tests.py (was: Switch default Hadoop version in run-tests.py) > Switch default Hadoop profile in run-tests.py > - > > Key: SPARK-33179 > URL: https://issues.apache.org/jira/browse/SPARK-33179 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.1.0 >Reporter: William Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33179) Switch default Hadoop version in run-tests.py
William Hyun created SPARK-33179: Summary: Switch default Hadoop version in run-tests.py Key: SPARK-33179 URL: https://issues.apache.org/jira/browse/SPARK-33179 Project: Spark Issue Type: Improvement Components: Tests Affects Versions: 3.1.0 Reporter: William Hyun -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32069) Improve error message on reading unexpected directory which is not a table partition
[ https://issues.apache.org/jira/browse/SPARK-32069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-32069: - Assignee: angerszhu > Improve error message on reading unexpected directory which is not a table > partition > > > Key: SPARK-32069 > URL: https://issues.apache.org/jira/browse/SPARK-32069 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Assignee: angerszhu >Priority: Minor > Labels: starter > Fix For: 3.1.0 > > > To reproduce: > {code:java} > spark-sql> create table test(i long); > spark-sql> insert into test values(1); > {code} > {code:java} > bash $ mkdir ./spark-warehouse/test/data > {code} > There will be such error messge > {code:java} > java.io.IOException: Not a file: > file:/Users/gengliang.wang/projects/spark/spark-warehouse/test/data > at > org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:322) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:205) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:296) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:296) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:296) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:296) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:296) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2173) > at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:414) > at org.apache.spark.rdd.RDD.collect(RDD.scala:1029) > at > org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:385) > at > org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:412) > at > org.apache.spark.sql.execution.HiveResult$.hiveResultString(HiveResult.scala:76) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.$anonfun$run$1(SparkSQLDriver.scala:65) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) > at > 
org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:65) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:377) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1(SparkSQLCLIDriver.scala:496) > at scala.collection.Iterator.foreach(Iterator.scala:941) > at scala.collection.Iterator.foreach$(Iterator.scala:941) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1429) > at scala.collection.IterableLike.foreach(IterableLike.scala:74) > at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > at scala.collection.AbstractIterable.foreach(Iterable.scala:56) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processLine(SparkSQLCLIDriver.scala:490) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:282) > at > org.apache.spark.sql.hive.thriftserver.Spar
[jira] [Resolved] (SPARK-32069) Improve error message on reading unexpected directory which is not a table partition
[ https://issues.apache.org/jira/browse/SPARK-32069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-32069. --- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30027 [https://github.com/apache/spark/pull/30027] > Improve error message on reading unexpected directory which is not a table > partition > > > Key: SPARK-32069 > URL: https://issues.apache.org/jira/browse/SPARK-32069 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Priority: Minor > Labels: starter > Fix For: 3.1.0 > > > To reproduce: > {code:java} > spark-sql> create table test(i long); > spark-sql> insert into test values(1); > {code} > {code:java} > bash $ mkdir ./spark-warehouse/test/data > {code} > There will be such error messge > {code:java} > java.io.IOException: Not a file: > file:/Users/gengliang.wang/projects/spark/spark-warehouse/test/data > at > org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:322) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:205) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:296) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:296) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:296) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:296) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:296) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2173) > at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:414) > at org.apache.spark.rdd.RDD.collect(RDD.scala:1029) > at > org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:385) > at > org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:412) > at > org.apache.spark.sql.execution.HiveResult$.hiveResultString(HiveResult.scala:76) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.$anonfun$run$1(SparkSQLDriver.scala:65) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) 
> at > org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:65) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:377) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1(SparkSQLCLIDriver.scala:496) > at scala.collection.Iterator.foreach(Iterator.scala:941) > at scala.collection.Iterator.foreach$(Iterator.scala:941) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1429) > at scala.collection.IterableLike.foreach(IterableLike.scala:74) > at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > at scala.collection.AbstractIterable.foreach(Iterable.scala:56) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processLine(SparkSQLCLIDriver.scala:490) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriv
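One way to turn the raw IOException above into a clearer message is a pre-check on the table location. The sketch below is purely illustrative: it is not the fix merged in the pull request referenced above, and the method name and message are invented.

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Illustrative only: reject nested directories under a non-partitioned table
// location with a message that explains the situation, instead of Hadoop's
// bare "Not a file" IOException. `tableLocation` is a hypothetical input.
def assertNoUnexpectedDirectories(tableLocation: String): Unit = {
  val path = new Path(tableLocation)
  val fs = FileSystem.get(new Configuration())
  val dirs = fs.listStatus(path).filter(_.isDirectory)
  if (dirs.nonEmpty) {
    throw new IllegalStateException(
      s"Path $tableLocation contains sub-directories " +
        s"(${dirs.map(_.getPath.getName).mkString(", ")}) but the table is not " +
        "partitioned; remove them or recreate the table with partitions.")
  }
}
{code}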
[jira] [Resolved] (SPARK-33177) CollectList and CollectSet should not be nullable
[ https://issues.apache.org/jira/browse/SPARK-33177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-33177. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30087 [https://github.com/apache/spark/pull/30087] > CollectList and CollectSet should not be nullable > - > > Key: SPARK-33177 > URL: https://issues.apache.org/jira/browse/SPARK-33177 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Tanel Kiis >Assignee: Tanel Kiis >Priority: Minor > Fix For: 3.1.0 > > > CollectList and CollectSet SQL expressions never return null value. Marking > them as non-nullable can have some performance benefits, because some > optimizer rules apply only to non-nullable expressions -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
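The non-nullability being codified here can be seen from the user-facing behaviour: on empty input the two aggregates produce empty collections rather than NULL. A small sketch, assuming a running SparkSession named spark (the output is not copied from the ticket):

{code:scala}
import org.apache.spark.sql.functions.{collect_list, collect_set}

// Sketch only: on an empty input, collect_list/collect_set are expected to
// return empty arrays rather than NULL, which is what makes marking the
// expressions non-nullable safe.
val empty = spark.range(0)
empty.agg(collect_list("id"), collect_set("id")).show()
{code}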
[jira] [Assigned] (SPARK-33177) CollectList and CollectSet should not be nullable
[ https://issues.apache.org/jira/browse/SPARK-33177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-33177: Assignee: Tanel Kiis > CollectList and CollectSet should not be nullable > - > > Key: SPARK-33177 > URL: https://issues.apache.org/jira/browse/SPARK-33177 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Tanel Kiis >Assignee: Tanel Kiis >Priority: Minor > > CollectList and CollectSet SQL expressions never return null value. Marking > them as non-nullable can have some performance benefits, because some > optimizer rules apply only to non-nullable expressions -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17333) Make pyspark interface friendly with mypy static analysis
[ https://issues.apache.org/jira/browse/SPARK-17333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-17333: - Parent: SPARK-32681 Affects Version/s: 3.1.0 Issue Type: Sub-task (was: Improvement) > Make pyspark interface friendly with mypy static analysis > - > > Key: SPARK-17333 > URL: https://issues.apache.org/jira/browse/SPARK-17333 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Assaf Mendelson >Priority: Major > > Static analysis tools such as those common to IDEs for auto-completion and > error marking tend to have poor results with pyspark. > This is caused by two separate issues: > The first is that many elements are created programmatically, such as the max > function in pyspark.sql.functions. > The second is that we tend to use pyspark in a functional manner, meaning > that we chain many actions (e.g. df.filter().groupby().agg()) and since > python has no type information this can become difficult to understand. > I would suggest changing the interface to improve it. > The way I see it, we can either change the interface or provide interface > enhancements. > Changing the interface means defining (when possible) all functions directly, > i.e. instead of having a __functions__ dictionary in pyspark.sql.functions.py > and then generating the functions programmatically by using _create_function, > create the function directly. > def max(col): >""" >docstring >""" >_create_function(max,"docstring") > Second, we can add type indications to all functions as defined in PEP 484 or > pycharm's legacy type hinting > (https://www.jetbrains.com/help/pycharm/2016.1/type-hinting-in-pycharm.html#legacy). > So for example max might look like this: > def max(col): >""" >does a max. > :type col: Column > :rtype Column >""" > This would provide a wide range of support, as these types of hints, while old, > are pretty common. > A second option is to use PEP 3107 to define interfaces (pyi files); in this > case we might have a functions.pyi file which would contain something > like: > def max(col: Column) -> Column: > """ > Aggregate function: returns the maximum value of the expression in a > group. > """ > ... > This has the advantage of easier-to-understand types and not touching the > code (only supported code) but has the disadvantage of being separately > managed (i.e. a greater chance of making a mistake) and the fact that some > configuration would be needed in the IDE/static analysis tool instead of > working out of the box. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17333) Make pyspark interface friendly with mypy static analysis
[ https://issues.apache.org/jira/browse/SPARK-17333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-17333: - Priority: Major (was: Trivial) > Make pyspark interface friendly with mypy static analysis > - > > Key: SPARK-17333 > URL: https://issues.apache.org/jira/browse/SPARK-17333 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Assaf Mendelson >Priority: Major > > Static analysis tools such as those common to IDEs for auto-completion and > error marking tend to have poor results with pyspark. > This is caused by two separate issues: > The first is that many elements are created programmatically, such as the max > function in pyspark.sql.functions. > The second is that we tend to use pyspark in a functional manner, meaning > that we chain many actions (e.g. df.filter().groupby().agg()) and since > python has no type information this can become difficult to understand. > I would suggest changing the interface to improve it. > The way I see it, we can either change the interface or provide interface > enhancements. > Changing the interface means defining (when possible) all functions directly, > i.e. instead of having a __functions__ dictionary in pyspark.sql.functions.py > and then generating the functions programmatically by using _create_function, > create the function directly. > def max(col): >""" >docstring >""" >_create_function(max,"docstring") > Second, we can add type indications to all functions as defined in PEP 484 or > pycharm's legacy type hinting > (https://www.jetbrains.com/help/pycharm/2016.1/type-hinting-in-pycharm.html#legacy). > So for example max might look like this: > def max(col): >""" >does a max. > :type col: Column > :rtype Column >""" > This would provide a wide range of support, as these types of hints, while old, > are pretty common. > A second option is to use PEP 3107 to define interfaces (pyi files); in this > case we might have a functions.pyi file which would contain something > like: > def max(col: Column) -> Column: > """ > Aggregate function: returns the maximum value of the expression in a > group. > """ > ... > This has the advantage of easier-to-understand types and not touching the > code (only supported code) but has the disadvantage of being separately > managed (i.e. a greater chance of making a mistake) and the fact that some > configuration would be needed in the IDE/static analysis tool instead of > working out of the box. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33137) Support ALTER TABLE in JDBC v2 Table Catalog: update type and nullability of columns (PostgreSQL dialect)
[ https://issues.apache.org/jira/browse/SPARK-33137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17216386#comment-17216386 ] Apache Spark commented on SPARK-33137: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/30089 > Support ALTER TABLE in JDBC v2 Table Catalog: update type and nullability of > columns (PostgreSQL dialect) > - > > Key: SPARK-33137 > URL: https://issues.apache.org/jira/browse/SPARK-33137 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Priority: Major > > Override the default SQL strings for: > ALTER TABLE UPDATE COLUMN TYPE > ALTER TABLE UPDATE COLUMN NULLABILITY > in the following PostgreSQL JDBC dialect according to official documentation. > Write PostgreSQL integration tests for JDBC. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33137) Support ALTER TABLE in JDBC v2 Table Catalog: update type and nullability of columns (PostgreSQL dialect)
[ https://issues.apache.org/jira/browse/SPARK-33137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33137: Assignee: (was: Apache Spark) > Support ALTER TABLE in JDBC v2 Table Catalog: update type and nullability of > columns (PostgreSQL dialect) > - > > Key: SPARK-33137 > URL: https://issues.apache.org/jira/browse/SPARK-33137 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Priority: Major > > Override the default SQL strings for: > ALTER TABLE UPDATE COLUMN TYPE > ALTER TABLE UPDATE COLUMN NULLABILITY > in the following PostgreSQL JDBC dialect according to official documentation. > Write PostgreSQL integration tests for JDBC. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33137) Support ALTER TABLE in JDBC v2 Table Catalog: update type and nullability of columns (PostgreSQL dialect)
[ https://issues.apache.org/jira/browse/SPARK-33137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33137: Assignee: Apache Spark > Support ALTER TABLE in JDBC v2 Table Catalog: update type and nullability of > columns (PostgreSQL dialect) > - > > Key: SPARK-33137 > URL: https://issues.apache.org/jira/browse/SPARK-33137 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Assignee: Apache Spark >Priority: Major > > Override the default SQL strings for: > ALTER TABLE UPDATE COLUMN TYPE > ALTER TABLE UPDATE COLUMN NULLABILITY > in the following PostgreSQL JDBC dialect according to official documentation. > Write PostgreSQL integration tests for JDBC. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33137) Support ALTER TABLE in JDBC v2 Table Catalog: update type and nullability of columns (PostgreSQL dialect)
[ https://issues.apache.org/jira/browse/SPARK-33137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17216385#comment-17216385 ] Apache Spark commented on SPARK-33137: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/30089 > Support ALTER TABLE in JDBC v2 Table Catalog: update type and nullability of > columns (PostgreSQL dialect) > - > > Key: SPARK-33137 > URL: https://issues.apache.org/jira/browse/SPARK-33137 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Priority: Major > > Override the default SQL strings for: > ALTER TABLE UPDATE COLUMN TYPE > ALTER TABLE UPDATE COLUMN NULLABILITY > in the following PostgreSQL JDBC dialect according to official documentation. > Write PostgreSQL integration tests for JDBC. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
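For reference, the two statements named in the SPARK-33137 description map onto PostgreSQL's documented ALTER TABLE ... ALTER COLUMN syntax. The helper below is only a sketch of the SQL strings the dialect needs to emit; the object and method names are made up and are not the Spark JdbcDialect API.

{code:scala}
// Hypothetical helpers illustrating the PostgreSQL-style SQL strings this
// ticket asks the dialect to generate; not the actual Spark JdbcDialect code.
object PostgresAlterTableSql {
  // ALTER TABLE UPDATE COLUMN TYPE
  def updateColumnType(table: String, column: String, newType: String): String =
    s"ALTER TABLE $table ALTER COLUMN $column TYPE $newType"

  // ALTER TABLE UPDATE COLUMN NULLABILITY
  def updateColumnNullability(table: String, column: String, nullable: Boolean): String = {
    val action = if (nullable) "DROP NOT NULL" else "SET NOT NULL"
    s"ALTER TABLE $table ALTER COLUMN $column $action"
  }
}
{code}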
[jira] [Commented] (SPARK-21065) Spark Streaming concurrentJobs + StreamingJobProgressListener conflict
[ https://issues.apache.org/jira/browse/SPARK-21065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17216375#comment-17216375 ] Shixiong Zhu commented on SPARK-21065: -- If you are seeing many active batches, it's likely your streaming application is too slow. You can try to look at UI and see if there are anything obvious that you can optimize. > Spark Streaming concurrentJobs + StreamingJobProgressListener conflict > -- > > Key: SPARK-21065 > URL: https://issues.apache.org/jira/browse/SPARK-21065 > Project: Spark > Issue Type: Bug > Components: DStreams, Scheduler, Spark Core, Web UI >Affects Versions: 2.1.0 >Reporter: Dan Dutrow >Priority: Major > > My streaming application has 200+ output operations, many of them stateful > and several of them windowed. In an attempt to reduce the processing times, I > set "spark.streaming.concurrentJobs" to 2+. Initial results are very > positive, cutting our processing time from ~3 minutes to ~1 minute, but > eventually we encounter an exception as follows: > (Note that 149697756 ms is 2017-06-09 03:06:00, so it's trying to get a > batch from 45 minutes before the exception is thrown.) > 2017-06-09 03:50:28,259 [Spark Listener Bus] ERROR > org.apache.spark.streaming.scheduler.StreamingListenerBus - Listener > StreamingJobProgressListener threw an exception > java.util.NoSuchElementException: key not found 149697756 ms > at scala.collection.MalLike$class.default(MapLike.scala:228) > at scala.collection.AbstractMap.default(Map.scala:59) > at scala.collection.mutable.HashMap.apply(HashMap.scala:65) > at > org.apache.spark.streaming.ui.StreamingJobProgressListener.onOutputOperationCompleted(StreamingJobProgressListener.scala:128) > at > org.apache.spark.streaming.scheduler.StreamingListenerBus.doPostEvent(StreamingListenerBus.scala:67) > at > org.apache.spark.streaming.scheduler.StreamingListenerBus.doPostEvent(StreamingListenerBus.scala:29) > at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63) > at > org.apache.spark.streaming.scheduler.StreamingListenerBus.postToAll(StreamingListenerBus.scala:29) > at > org.apache.spark.streaming.scheduler.StreamingListenerBus.onOtherEvent(StreamingListenerBus.scala:43) > ... > The Spark code causing the exception is here: > https://github.com/apache/spark/blob/branch-2.1/streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingJobProgressListener.scala#LC125 > override def onOutputOperationCompleted( > outputOperationCompleted: StreamingListenerOutputOperationCompleted): > Unit = synchronized { > // This method is called before onBatchCompleted > {color:red}runningBatchUIData(outputOperationCompleted.outputOperationInfo.batchTime).{color} > updateOutputOperationInfo(outputOperationCompleted.outputOperationInfo) > } > It seems to me that it may be caused by that batch being removed earlier. 
> https://github.com/apache/spark/blob/branch-2.1/streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingJobProgressListener.scala#LC102 > override def onBatchCompleted(batchCompleted: > StreamingListenerBatchCompleted): Unit = { > synchronized { > waitingBatchUIData.remove(batchCompleted.batchInfo.batchTime) > > {color:red}runningBatchUIData.remove(batchCompleted.batchInfo.batchTime){color} > val batchUIData = BatchUIData(batchCompleted.batchInfo) > completedBatchUIData.enqueue(batchUIData) > if (completedBatchUIData.size > batchUIDataLimit) { > val removedBatch = completedBatchUIData.dequeue() > batchTimeToOutputOpIdSparkJobIdPair.remove(removedBatch.batchTime) > } > totalCompletedBatches += 1L > totalProcessedRecords += batchUIData.numRecords > } > } > What is the solution here? Should I make my spark streaming context remember > duration a lot longer? ssc.remember(batchDuration * rememberMultiple) > Otherwise, it seems like there should be some kind of existence check on > runningBatchUIData before dereferencing it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
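The "existence check" suggested at the end of the quoted description amounts to replacing HashMap.apply with get. A tiny self-contained sketch of the difference follows; the map and key below are made up for illustration and are not Spark code.

{code:scala}
import scala.collection.mutable

// Sketch only: mutable.HashMap.apply throws NoSuchElementException for a
// missing key, while get(...).foreach(...) simply skips the update.
val runningBatches = mutable.HashMap[Long, String]()

// runningBatches(42L)  // would throw java.util.NoSuchElementException
runningBatches.get(42L).foreach { info =>
  println(s"updating $info")  // not executed when the key is absent
}
{code}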
[jira] [Updated] (SPARK-32797) Install mypy on the Jenkins CI workers
[ https://issues.apache.org/jira/browse/SPARK-32797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fokko Driesprong updated SPARK-32797: - Affects Version/s: (was: 3.0.0) 3.1.0 > Install mypy on the Jenkins CI workers > -- > > Key: SPARK-32797 > URL: https://issues.apache.org/jira/browse/SPARK-32797 > Project: Spark > Issue Type: Improvement > Components: jenkins, PySpark >Affects Versions: 3.1.0 >Reporter: Fokko Driesprong >Priority: Major > > We want to check the types of the PySpark code. This requires mypy to be > installed on the CI. Can you do this [~shaneknapp]? > Related PR: [https://github.com/apache/spark/pull/29180] > You can install this using pip: [https://pypi.org/project/mypy/] Should be > similar to flake8 and sphinx. The latest version is ok! Thanks! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17333) Make pyspark interface friendly with mypy static analysis
[ https://issues.apache.org/jira/browse/SPARK-17333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17216345#comment-17216345 ] Apache Spark commented on SPARK-17333: -- User 'Fokko' has created a pull request for this issue: https://github.com/apache/spark/pull/30088 > Make pyspark interface friendly with mypy static analysis > - > > Key: SPARK-17333 > URL: https://issues.apache.org/jira/browse/SPARK-17333 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Assaf Mendelson >Priority: Trivial > > Static analysis tools such as those common to IDEs for auto-completion and > error marking tend to have poor results with pyspark. > This is caused by two separate issues: > The first is that many elements are created programmatically, such as the max > function in pyspark.sql.functions. > The second is that we tend to use pyspark in a functional manner, meaning > that we chain many actions (e.g. df.filter().groupby().agg()) and since > python has no type information this can become difficult to understand. > I would suggest changing the interface to improve it. > The way I see it, we can either change the interface or provide interface > enhancements. > Changing the interface means defining (when possible) all functions directly, > i.e. instead of having a __functions__ dictionary in pyspark.sql.functions.py > and then generating the functions programmatically by using _create_function, > create the function directly. > def max(col): >""" >docstring >""" >_create_function(max,"docstring") > Second, we can add type indications to all functions as defined in PEP 484 or > pycharm's legacy type hinting > (https://www.jetbrains.com/help/pycharm/2016.1/type-hinting-in-pycharm.html#legacy). > So for example max might look like this: > def max(col): >""" >does a max. > :type col: Column > :rtype Column >""" > This would provide a wide range of support, as these types of hints, while old, > are pretty common. > A second option is to use PEP 3107 to define interfaces (pyi files); in this > case we might have a functions.pyi file which would contain something > like: > def max(col: Column) -> Column: > """ > Aggregate function: returns the maximum value of the expression in a > group. > """ > ... > This has the advantage of easier-to-understand types and not touching the > code (only supported code) but has the disadvantage of being separately > managed (i.e. a greater chance of making a mistake) and the fact that some > configuration would be needed in the IDE/static analysis tool instead of > working out of the box. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21065) Spark Streaming concurrentJobs + StreamingJobProgressListener conflict
[ https://issues.apache.org/jira/browse/SPARK-21065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17216342#comment-17216342 ] Sachit Murarka commented on SPARK-21065: [~zsxwing] Thanks for quick response , any suggestion on optimizing many active batches. (Probably I should reduce the processing time or increase the batch interval). Correct? Any other thing? > Spark Streaming concurrentJobs + StreamingJobProgressListener conflict > -- > > Key: SPARK-21065 > URL: https://issues.apache.org/jira/browse/SPARK-21065 > Project: Spark > Issue Type: Bug > Components: DStreams, Scheduler, Spark Core, Web UI >Affects Versions: 2.1.0 >Reporter: Dan Dutrow >Priority: Major > > My streaming application has 200+ output operations, many of them stateful > and several of them windowed. In an attempt to reduce the processing times, I > set "spark.streaming.concurrentJobs" to 2+. Initial results are very > positive, cutting our processing time from ~3 minutes to ~1 minute, but > eventually we encounter an exception as follows: > (Note that 149697756 ms is 2017-06-09 03:06:00, so it's trying to get a > batch from 45 minutes before the exception is thrown.) > 2017-06-09 03:50:28,259 [Spark Listener Bus] ERROR > org.apache.spark.streaming.scheduler.StreamingListenerBus - Listener > StreamingJobProgressListener threw an exception > java.util.NoSuchElementException: key not found 149697756 ms > at scala.collection.MalLike$class.default(MapLike.scala:228) > at scala.collection.AbstractMap.default(Map.scala:59) > at scala.collection.mutable.HashMap.apply(HashMap.scala:65) > at > org.apache.spark.streaming.ui.StreamingJobProgressListener.onOutputOperationCompleted(StreamingJobProgressListener.scala:128) > at > org.apache.spark.streaming.scheduler.StreamingListenerBus.doPostEvent(StreamingListenerBus.scala:67) > at > org.apache.spark.streaming.scheduler.StreamingListenerBus.doPostEvent(StreamingListenerBus.scala:29) > at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63) > at > org.apache.spark.streaming.scheduler.StreamingListenerBus.postToAll(StreamingListenerBus.scala:29) > at > org.apache.spark.streaming.scheduler.StreamingListenerBus.onOtherEvent(StreamingListenerBus.scala:43) > ... > The Spark code causing the exception is here: > https://github.com/apache/spark/blob/branch-2.1/streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingJobProgressListener.scala#LC125 > override def onOutputOperationCompleted( > outputOperationCompleted: StreamingListenerOutputOperationCompleted): > Unit = synchronized { > // This method is called before onBatchCompleted > {color:red}runningBatchUIData(outputOperationCompleted.outputOperationInfo.batchTime).{color} > updateOutputOperationInfo(outputOperationCompleted.outputOperationInfo) > } > It seems to me that it may be caused by that batch being removed earlier. 
> https://github.com/apache/spark/blob/branch-2.1/streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingJobProgressListener.scala#LC102 > override def onBatchCompleted(batchCompleted: > StreamingListenerBatchCompleted): Unit = { > synchronized { > waitingBatchUIData.remove(batchCompleted.batchInfo.batchTime) > > {color:red}runningBatchUIData.remove(batchCompleted.batchInfo.batchTime){color} > val batchUIData = BatchUIData(batchCompleted.batchInfo) > completedBatchUIData.enqueue(batchUIData) > if (completedBatchUIData.size > batchUIDataLimit) { > val removedBatch = completedBatchUIData.dequeue() > batchTimeToOutputOpIdSparkJobIdPair.remove(removedBatch.batchTime) > } > totalCompletedBatches += 1L > totalProcessedRecords += batchUIData.numRecords > } > } > What is the solution here? Should I make my spark streaming context remember > duration a lot longer? ssc.remember(batchDuration * rememberMultiple) > Otherwise, it seems like there should be some kind of existence check on > runningBatchUIData before dereferencing it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
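For context, a small PySpark sketch of the workaround the reporter mentions (ssc.remember), assuming a DStreams application; whether a longer remember duration actually avoids the listener race is not confirmed here, and the multiplier is an arbitrary placeholder.
{code:python}
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

batch_interval = 10  # seconds; placeholder value
sc = SparkContext(appName="remember-example")
ssc = StreamingContext(sc, batchDuration=batch_interval)

# Keep RDDs (and related batch metadata) around much longer than one batch,
# along the lines of ssc.remember(batchDuration * rememberMultiple) above.
ssc.remember(batch_interval * 10)
{code}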
[jira] [Commented] (SPARK-21065) Spark Streaming concurrentJobs + StreamingJobProgressListener conflict
[ https://issues.apache.org/jira/browse/SPARK-21065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17216341#comment-17216341 ] Shixiong Zhu commented on SPARK-21065: -- `spark.streaming.concurrentJobs` is not safe. Fixing it requires fundamental system changes. We don't have any plan for this. > Spark Streaming concurrentJobs + StreamingJobProgressListener conflict > -- > > Key: SPARK-21065 > URL: https://issues.apache.org/jira/browse/SPARK-21065 > Project: Spark > Issue Type: Bug > Components: DStreams, Scheduler, Spark Core, Web UI >Affects Versions: 2.1.0 >Reporter: Dan Dutrow >Priority: Major > > My streaming application has 200+ output operations, many of them stateful > and several of them windowed. In an attempt to reduce the processing times, I > set "spark.streaming.concurrentJobs" to 2+. Initial results are very > positive, cutting our processing time from ~3 minutes to ~1 minute, but > eventually we encounter an exception as follows: > (Note that 149697756 ms is 2017-06-09 03:06:00, so it's trying to get a > batch from 45 minutes before the exception is thrown.) > 2017-06-09 03:50:28,259 [Spark Listener Bus] ERROR > org.apache.spark.streaming.scheduler.StreamingListenerBus - Listener > StreamingJobProgressListener threw an exception > java.util.NoSuchElementException: key not found 149697756 ms > at scala.collection.MalLike$class.default(MapLike.scala:228) > at scala.collection.AbstractMap.default(Map.scala:59) > at scala.collection.mutable.HashMap.apply(HashMap.scala:65) > at > org.apache.spark.streaming.ui.StreamingJobProgressListener.onOutputOperationCompleted(StreamingJobProgressListener.scala:128) > at > org.apache.spark.streaming.scheduler.StreamingListenerBus.doPostEvent(StreamingListenerBus.scala:67) > at > org.apache.spark.streaming.scheduler.StreamingListenerBus.doPostEvent(StreamingListenerBus.scala:29) > at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63) > at > org.apache.spark.streaming.scheduler.StreamingListenerBus.postToAll(StreamingListenerBus.scala:29) > at > org.apache.spark.streaming.scheduler.StreamingListenerBus.onOtherEvent(StreamingListenerBus.scala:43) > ... > The Spark code causing the exception is here: > https://github.com/apache/spark/blob/branch-2.1/streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingJobProgressListener.scala#LC125 > override def onOutputOperationCompleted( > outputOperationCompleted: StreamingListenerOutputOperationCompleted): > Unit = synchronized { > // This method is called before onBatchCompleted > {color:red}runningBatchUIData(outputOperationCompleted.outputOperationInfo.batchTime).{color} > updateOutputOperationInfo(outputOperationCompleted.outputOperationInfo) > } > It seems to me that it may be caused by that batch being removed earlier. 
> https://github.com/apache/spark/blob/branch-2.1/streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingJobProgressListener.scala#LC102 > override def onBatchCompleted(batchCompleted: > StreamingListenerBatchCompleted): Unit = { > synchronized { > waitingBatchUIData.remove(batchCompleted.batchInfo.batchTime) > > {color:red}runningBatchUIData.remove(batchCompleted.batchInfo.batchTime){color} > val batchUIData = BatchUIData(batchCompleted.batchInfo) > completedBatchUIData.enqueue(batchUIData) > if (completedBatchUIData.size > batchUIDataLimit) { > val removedBatch = completedBatchUIData.dequeue() > batchTimeToOutputOpIdSparkJobIdPair.remove(removedBatch.batchTime) > } > totalCompletedBatches += 1L > totalProcessedRecords += batchUIData.numRecords > } > } > What is the solution here? Should I make my spark streaming context remember > duration a lot longer? ssc.remember(batchDuration * rememberMultiple) > Otherwise, it seems like there should be some kind of existence check on > runningBatchUIData before dereferencing it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21065) Spark Streaming concurrentJobs + StreamingJobProgressListener conflict
[ https://issues.apache.org/jira/browse/SPARK-21065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17216330#comment-17216330 ] Sachit Murarka commented on SPARK-21065: [~zsxwing] , Any idea if concurrentJobs still causes issue in 2.4 release of Spark as well > Spark Streaming concurrentJobs + StreamingJobProgressListener conflict > -- > > Key: SPARK-21065 > URL: https://issues.apache.org/jira/browse/SPARK-21065 > Project: Spark > Issue Type: Bug > Components: DStreams, Scheduler, Spark Core, Web UI >Affects Versions: 2.1.0 >Reporter: Dan Dutrow >Priority: Major > > My streaming application has 200+ output operations, many of them stateful > and several of them windowed. In an attempt to reduce the processing times, I > set "spark.streaming.concurrentJobs" to 2+. Initial results are very > positive, cutting our processing time from ~3 minutes to ~1 minute, but > eventually we encounter an exception as follows: > (Note that 149697756 ms is 2017-06-09 03:06:00, so it's trying to get a > batch from 45 minutes before the exception is thrown.) > 2017-06-09 03:50:28,259 [Spark Listener Bus] ERROR > org.apache.spark.streaming.scheduler.StreamingListenerBus - Listener > StreamingJobProgressListener threw an exception > java.util.NoSuchElementException: key not found 149697756 ms > at scala.collection.MalLike$class.default(MapLike.scala:228) > at scala.collection.AbstractMap.default(Map.scala:59) > at scala.collection.mutable.HashMap.apply(HashMap.scala:65) > at > org.apache.spark.streaming.ui.StreamingJobProgressListener.onOutputOperationCompleted(StreamingJobProgressListener.scala:128) > at > org.apache.spark.streaming.scheduler.StreamingListenerBus.doPostEvent(StreamingListenerBus.scala:67) > at > org.apache.spark.streaming.scheduler.StreamingListenerBus.doPostEvent(StreamingListenerBus.scala:29) > at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63) > at > org.apache.spark.streaming.scheduler.StreamingListenerBus.postToAll(StreamingListenerBus.scala:29) > at > org.apache.spark.streaming.scheduler.StreamingListenerBus.onOtherEvent(StreamingListenerBus.scala:43) > ... > The Spark code causing the exception is here: > https://github.com/apache/spark/blob/branch-2.1/streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingJobProgressListener.scala#LC125 > override def onOutputOperationCompleted( > outputOperationCompleted: StreamingListenerOutputOperationCompleted): > Unit = synchronized { > // This method is called before onBatchCompleted > {color:red}runningBatchUIData(outputOperationCompleted.outputOperationInfo.batchTime).{color} > updateOutputOperationInfo(outputOperationCompleted.outputOperationInfo) > } > It seems to me that it may be caused by that batch being removed earlier. 
> https://github.com/apache/spark/blob/branch-2.1/streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingJobProgressListener.scala#LC102 > override def onBatchCompleted(batchCompleted: > StreamingListenerBatchCompleted): Unit = { > synchronized { > waitingBatchUIData.remove(batchCompleted.batchInfo.batchTime) > > {color:red}runningBatchUIData.remove(batchCompleted.batchInfo.batchTime){color} > val batchUIData = BatchUIData(batchCompleted.batchInfo) > completedBatchUIData.enqueue(batchUIData) > if (completedBatchUIData.size > batchUIDataLimit) { > val removedBatch = completedBatchUIData.dequeue() > batchTimeToOutputOpIdSparkJobIdPair.remove(removedBatch.batchTime) > } > totalCompletedBatches += 1L > totalProcessedRecords += batchUIData.numRecords > } > } > What is the solution here? Should I make my spark streaming context remember > duration a lot longer? ssc.remember(batchDuration * rememberMultiple) > Otherwise, it seems like there should be some kind of existence check on > runningBatchUIData before dereferencing it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-33175) Detect duplicated mountPath and fail at Spark side
[ https://issues.apache.org/jira/browse/SPARK-33175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-33175: -- Comment: was deleted (was: User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/30080) > Detect duplicated mountPath and fail at Spark side > -- > > Key: SPARK-33175 > URL: https://issues.apache.org/jira/browse/SPARK-33175 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 2.4.7, 3.0.2, 3.1.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.1.0 > > > If there is a mountPath conflict, the pod is created and repeats the > following error messages and keep running. This should not keep running and > we had better fail at Spark side. > {code} > $ k get pod -l 'spark-role in (driver,executor)' > NAMEREADY STATUSRESTARTS AGE > tpcds 1/1 Running 0 14m > {code} > {code} > 20/10/18 05:09:26 WARN ExecutorPodsSnapshotsStoreImpl: Exception when > notifying snapshot subscriber. > io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: > POST at: ... > Message: Pod "tpcds-exec-1" is invalid: > spec.containers[0].volumeMounts[1].mountPath: Invalid value: "/data1": must > be unique. > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33175) Detect duplicated mountPath and fail at Spark side
[ https://issues.apache.org/jira/browse/SPARK-33175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17216301#comment-17216301 ] Dongjoon Hyun commented on SPARK-33175: --- This is resolved via https://github.com/apache/spark/pull/30084 > Detect duplicated mountPath and fail at Spark side > -- > > Key: SPARK-33175 > URL: https://issues.apache.org/jira/browse/SPARK-33175 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 2.4.7, 3.0.2, 3.1.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.1.0 > > > If there is a mountPath conflict, the pod is created and repeats the > following error messages and keep running. This should not keep running and > we had better fail at Spark side. > {code} > $ k get pod -l 'spark-role in (driver,executor)' > NAMEREADY STATUSRESTARTS AGE > tpcds 1/1 Running 0 14m > {code} > {code} > 20/10/18 05:09:26 WARN ExecutorPodsSnapshotsStoreImpl: Exception when > notifying snapshot subscriber. > io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: > POST at: ... > Message: Pod "tpcds-exec-1" is invalid: > spec.containers[0].volumeMounts[1].mountPath: Invalid value: "/data1": must > be unique. > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33175) Detect duplicated mountPath and fail at Spark side
[ https://issues.apache.org/jira/browse/SPARK-33175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-33175. --- Fix Version/s: 3.1.0 Resolution: Fixed > Detect duplicated mountPath and fail at Spark side > -- > > Key: SPARK-33175 > URL: https://issues.apache.org/jira/browse/SPARK-33175 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 2.4.7, 3.0.2, 3.1.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.1.0 > > > If there is a mountPath conflict, the pod is created and repeats the > following error messages and keep running. This should not keep running and > we had better fail at Spark side. > {code} > $ k get pod -l 'spark-role in (driver,executor)' > NAMEREADY STATUSRESTARTS AGE > tpcds 1/1 Running 0 14m > {code} > {code} > 20/10/18 05:09:26 WARN ExecutorPodsSnapshotsStoreImpl: Exception when > notifying snapshot subscriber. > io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: > POST at: ... > Message: Pod "tpcds-exec-1" is invalid: > spec.containers[0].volumeMounts[1].mountPath: Invalid value: "/data1": must > be unique. > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
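As an illustration of the validation idea only (Spark's actual check lives in its Kubernetes-specific Scala code, not in Python), a short sketch that fails fast when a mount path repeats instead of letting Kubernetes reject the pod repeatedly:
{code:python}
from collections import Counter


def check_unique_mount_paths(mount_paths):
    """Raise if any container mountPath appears more than once."""
    duplicates = [p for p, n in Counter(mount_paths).items() if n > 1]
    if duplicates:
        raise ValueError("Found duplicated mountPath(s): " + ", ".join(duplicates))


# '/data1' appears twice, mirroring the Kubernetes error message above.
try:
    check_unique_mount_paths(["/opt/spark/work-dir", "/data1", "/data1"])
except ValueError as e:
    print(e)
{code}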
[jira] [Updated] (SPARK-33176) Use 11-jre-slim as default in K8s Dockerfile
[ https://issues.apache.org/jira/browse/SPARK-33176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-33176: -- Affects Version/s: 3.0.1 > Use 11-jre-slim as default in K8s Dockerfile > > > Key: SPARK-33176 > URL: https://issues.apache.org/jira/browse/SPARK-33176 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 3.0.1, 3.1.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.0.2, 3.1.0 > > > Apache Spark supports both Java8/Java11. However, there is a difference. > 1. Java8-built distribution can run both Java8/Java11 > 2. Java11-built distribution can run on Java11, but not Java8. > In short, we had better use Java11 in Dockerfile to embrace both cases > without any issues. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33176) Use 11-jre-slim as default in K8s Dockerfile
[ https://issues.apache.org/jira/browse/SPARK-33176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-33176: -- Fix Version/s: 3.0.2 > Use 11-jre-slim as default in K8s Dockerfile > > > Key: SPARK-33176 > URL: https://issues.apache.org/jira/browse/SPARK-33176 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.0.2, 3.1.0 > > > Apache Spark supports both Java8/Java11. However, there is a difference. > 1. Java8-built distribution can run both Java8/Java11 > 2. Java11-built distribution can run on Java11, but not Java8. > In short, we had better use Java11 in Dockerfile to embrace both cases > without any issues. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33176) Use 11-jre-slim as default in K8s Dockerfile
[ https://issues.apache.org/jira/browse/SPARK-33176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-33176. --- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30083 [https://github.com/apache/spark/pull/30083] > Use 11-jre-slim as default in K8s Dockerfile > > > Key: SPARK-33176 > URL: https://issues.apache.org/jira/browse/SPARK-33176 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.1.0 > > > Apache Spark supports both Java8/Java11. However, there is a difference. > 1. Java8-built distribution can run both Java8/Java11 > 2. Java11-built distribution can run on Java11, but not Java8. > In short, we had better use Java11 in Dockerfile to embrace both cases > without any issues. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33178) Dynamic Scaling In/Scaling out of executors - Kubernetes
Vishnu G Singhal created SPARK-33178: Summary: Dynamic Scaling In/Scaling out of executors - Kubernetes Key: SPARK-33178 URL: https://issues.apache.org/jira/browse/SPARK-33178 Project: Spark Issue Type: New Feature Components: Kubernetes, Spark Core Affects Versions: 3.0.1, 3.0.0 Environment: Spark deployment on Kubernetes. Reporter: Vishnu G Singhal Can we have dynamic scaling in/out of executors in a Kubernetes Spark deployment, based on the load? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
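For reference, executor scaling already exists in Spark 3.0 on Kubernetes through dynamic allocation with shuffle tracking; whether that covers this request is a separate question. A PySpark sketch with placeholder executor bounds:
{code:python}
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("dynamic-allocation-on-k8s")
    .config("spark.dynamicAllocation.enabled", "true")
    # There is no external shuffle service on Kubernetes, so shuffle tracking
    # is what lets executors be released safely (Spark 3.0+).
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "1")    # placeholder
    .config("spark.dynamicAllocation.maxExecutors", "10")   # placeholder
    .getOrCreate()
)
{code}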
[jira] [Commented] (SPARK-33177) CollectList and CollectSet should not be nullable
[ https://issues.apache.org/jira/browse/SPARK-33177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17216276#comment-17216276 ] Apache Spark commented on SPARK-33177: -- User 'tanelk' has created a pull request for this issue: https://github.com/apache/spark/pull/30087 > CollectList and CollectSet should not be nullable > - > > Key: SPARK-33177 > URL: https://issues.apache.org/jira/browse/SPARK-33177 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Tanel Kiis >Priority: Minor > > CollectList and CollectSet SQL expressions never return null value. Marking > them as non-nullable can have some performance benefits, because some > optimizer rules apply only to non-nullable expressions -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33177) CollectList and CollectSet should not be nullable
[ https://issues.apache.org/jira/browse/SPARK-33177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33177: Assignee: Apache Spark > CollectList and CollectSet should not be nullable > - > > Key: SPARK-33177 > URL: https://issues.apache.org/jira/browse/SPARK-33177 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Tanel Kiis >Assignee: Apache Spark >Priority: Minor > > CollectList and CollectSet SQL expressions never return null value. Marking > them as non-nullable can have some performance benefits, because some > optimizer rules apply only to non-nullable expressions -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33177) CollectList and CollectSet should not be nullable
[ https://issues.apache.org/jira/browse/SPARK-33177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33177: Assignee: (was: Apache Spark) > CollectList and CollectSet should not be nullable > - > > Key: SPARK-33177 > URL: https://issues.apache.org/jira/browse/SPARK-33177 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Tanel Kiis >Priority: Minor > > CollectList and CollectSet SQL expressions never return null value. Marking > them as non-nullable can have some performance benefits, because some > optimizer rules apply only to non-nullable expressions -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33177) CollectList and CollectSet should not be nullable
[ https://issues.apache.org/jira/browse/SPARK-33177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17216275#comment-17216275 ] Apache Spark commented on SPARK-33177: -- User 'tanelk' has created a pull request for this issue: https://github.com/apache/spark/pull/30087 > CollectList and CollectSet should not be nullable > - > > Key: SPARK-33177 > URL: https://issues.apache.org/jira/browse/SPARK-33177 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Tanel Kiis >Priority: Minor > > CollectList and CollectSet SQL expressions never return null value. Marking > them as non-nullable can have some performance benefits, because some > optimizer rules apply only to non-nullable expressions -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33177) CollectList and CollectSet should not be nullable
Tanel Kiis created SPARK-33177: -- Summary: CollectList and CollectSet should not be nullable Key: SPARK-33177 URL: https://issues.apache.org/jira/browse/SPARK-33177 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Tanel Kiis The CollectList and CollectSet SQL expressions never return a null value. Marking them as non-nullable can have some performance benefits, because some optimizer rules apply only to non-nullable expressions. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
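A small PySpark illustration of the behavior behind this proposal: collect_list and collect_set skip nulls and produce an empty array for a group rather than returning null, which is why marking the expressions non-nullable is safe. The data and session setup are placeholders.
{code:python}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("collect-nullability").getOrCreate()
df = spark.createDataFrame([(1, None), (1, None)], "k INT, v STRING")

df.groupBy("k").agg(
    F.collect_list("v").alias("vals"),           # [] rather than null
    F.collect_set("v").alias("distinct_vals"),   # [] rather than null
).show()
{code}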
[jira] [Commented] (SPARK-33164) SPIP: add SQL support to "SELECT * (EXCEPT someColumn) FROM .." equivalent to DataSet.dropColumn(someColumn)
[ https://issues.apache.org/jira/browse/SPARK-33164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17216174#comment-17216174 ] Arnaud Nauwynck commented on SPARK-33164: - Notice that there is also the feature "REPLACE" that might be implemented as in BigQuery {noformat} select * (REPLACE expr as name) {noformat} see : https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#select_replace Finally, another syntaxic sugar might be {noformat} select * (RENAME oldname as newname) {noformat} ... which may be (apart from column order) equivalent to {noformat} select * (EXCEPT oldname) FROM (select *, oldname as new name) .. {noformat} > SPIP: add SQL support to "SELECT * (EXCEPT someColumn) FROM .." equivalent to > DataSet.dropColumn(someColumn) > > > Key: SPARK-33164 > URL: https://issues.apache.org/jira/browse/SPARK-33164 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.5, 2.4.6, 2.4.7, 3.0.0, 3.0.1 >Reporter: Arnaud Nauwynck >Priority: Minor > Original Estimate: 120h > Remaining Estimate: 120h > > *Q1.* What are you trying to do? Articulate your objectives using absolutely > no jargon. > I would like to have the extended SQL syntax "SELECT * EXCEPT someColumn FROM > .." > to be able to select all columns except some in a SELECT clause. > It would be similar to SQL syntax from some databases, like Google BigQuery > or PostgresQL. > https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax > Google question "select * EXCEPT one column", and you will see many > developpers have the same problems. > example posts: > https://blog.jooq.org/2018/05/14/selecting-all-columns-except-one-in-postgresql/ > https://www.thetopsites.net/article/53001825.shtml > There are several typicall examples where is is very helpfull : > use-case1: > you add "count ( * ) countCol" column, and then filter on it using for > example "having countCol = 1" > ... and then you want to select all columns EXCEPT this dummy column which > always is "1" > {noformat} > select * (EXCEPT countCol) > from ( > select count(*) countCol, * >from MyTable >where ... >group by ... having countCol = 1 > ) > {noformat} > > use-case 2: > same with analytical function "partition over(...) rankCol ... where > rankCol=1" > For example to get the latest row before a given time, in a time series > table. > This is "Time-Travel" queries addressed by framework like "DeltaLake" > {noformat} > CREATE table t_updates (update_time timestamp, id string, col1 type1, col2 > type2, ... col42) > pastTime=.. > SELECT * (except rankCol) > FROM ( >SELECT *, > RANK() OVER (PARTITION BY id ORDER BY update_time) rankCol >FROM t_updates >where update_time < pastTime > ) WHERE rankCol = 1 > > {noformat} > > use-case 3: > copy some data from table "t" to corresponding table "t_snapshot", and back > to "t" > {noformat} >CREATE TABLE t (col1 type1, col2 type2, col3 type3, ... col42 type42) ... > >/* create corresponding table: (snap_id string, col1 type1, col2 type2, > col3 type3, ... col42 type42) */ >CREATE TABLE t_snapshot >AS SELECT '' as snap_id, * FROM t WHERE 1=2 >/* insert data from t to some snapshot */ >INSERT INTO t_snapshot >SELECT 'snap1' as snap_id, * from t > >/* select some data from snapshot table (without snap_id column) .. */ >SELECT * (EXCEPT snap_id) FROM t_snapshot where snap_id='snap1' > > {noformat} > > > *Q2.* What problem is this proposal NOT designed to solve? > It is only a SQL syntaxic sugar. 
> It does not change SQL execution plan or anything complex. > *Q3.* How is it done today, and what are the limits of current practice? > > Today, you can either use the DataSet API, with .dropColumn(someColumn) > or you need to HARD-CODE manually all columns in your SQL. Therefore your > code is NOT generic (or you are using a SQL meta-code generator?) > *Q4.* What is new in your approach and why do you think it will be successful? > It is NOT new... it is already a proven solution from DataSet.dropColumn(), > PostgreSQL, BigQuery > > *Q5.* Who cares? If you are successful, what difference will it make? > It simplifies the life of developers, DBAs, data analysts, and end users. > It simplifies development of SQL code, in a more generic way for many tasks. > *Q6.* What are the risks? > There is VERY limited risk on Spark SQL, because it already exists in the DataSet > API. > It is an extension of SQL syntax, so the risk is annoying some IDE SQL > editors with a new SQL syntax. > *Q7.* How long will it take? > No idea. I guess someone experience
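For comparison, the DataFrame-API route mentioned in Q3 today, shown as a PySpark sketch of use-case 1 (the method is spelled drop in the DataFrame API; the table and column names here are made up):
{code:python}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("drop-column-example").getOrCreate()
my_table = spark.createDataFrame([(1, "a"), (1, "a"), (2, "b")], "id INT, grp STRING")

deduped = (
    my_table.groupBy("id", "grp")
    .agg(F.count("*").alias("countCol"))
    .where(F.col("countCol") == 1)
    .drop("countCol")  # what "SELECT * (EXCEPT countCol)" would express in SQL
)
deduped.show()
{code}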
[jira] [Updated] (SPARK-33143) Make SocketAuthServer socket timeout configurable
[ https://issues.apache.org/jira/browse/SPARK-33143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-33143: - Component/s: PySpark > Make SocketAuthServer socket timeout configurable > - > > Key: SPARK-33143 > URL: https://issues.apache.org/jira/browse/SPARK-33143 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Affects Versions: 2.4.7, 3.0.1 >Reporter: Miklos Szurap >Priority: Major > > In SPARK-21551 the socket timeout for the Pyspark applications has been > increased from 3 to 15 seconds. However it is still hardcoded. > In certain situations even the 15 seconds is not enough, so it should be made > configurable. > This is requested after seeing it in real-life workload failures. > Also it has been suggested and requested in an earlier comment in > [SPARK-18649|https://issues.apache.org/jira/browse/SPARK-18649?focusedCommentId=16493498&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16493498] > In > Spark 2.4 it is under > [PythonRDD.scala|https://github.com/apache/spark/blob/branch-2.4/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L899] > in Spark 3.x the code has been moved to > [SocketAuthServer.scala|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/security/SocketAuthServer.scala#L51] > {code} > serverSocket.setSoTimeout(15000) > {code} > Please include this in both 2.4 and 3.x branches. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33168) spark REST API Unable to get JobDescription
[ https://issues.apache.org/jira/browse/SPARK-33168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-33168. -- Resolution: Cannot Reproduce > spark REST API Unable to get JobDescription > --- > > Key: SPARK-33168 > URL: https://issues.apache.org/jira/browse/SPARK-33168 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.4 >Reporter: zhaoyachao >Priority: Major > > After setting a job description, the Spark REST API > (localhost:4040/api/v1/applications/xxx/jobs) is unable to return the job > description, but it is displayed at localhost:4040/jobs: > spark.sparkContext.setJobDescription("test_count") > spark.range(100).count() -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33168) spark REST API Unable to get JobDescription
[ https://issues.apache.org/jira/browse/SPARK-33168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17216154#comment-17216154 ] Hyukjin Kwon commented on SPARK-33168: -- I can retrieve the job description as below: {code:java} localhost:4040/api/v1/applications/local-1603014688155/jobs [ { "jobId" : 0, "name" : "count at :24", "description" : "test_count", "submissionTime" : "2020-10-18T09:51:32.690GMT", "completionTime" : "2020-10-18T09:51:33.473GMT", "stageIds" : [ 0, 1 ], "status" : "SUCCEEDED", "numTasks" : 17, "numActiveTasks" : 0, "numCompletedTasks" : 17, "numSkippedTasks" : 0, "numFailedTasks" : 0, "numKilledTasks" : 0, "numCompletedIndices" : 17, "numActiveStages" : 0, "numCompletedStages" : 2, "numSkippedStages" : 0, "numFailedStages" : 0, "killedTasksSummary" : { } } ]% {code} > spark REST API Unable to get JobDescription > --- > > Key: SPARK-33168 > URL: https://issues.apache.org/jira/browse/SPARK-33168 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.4 >Reporter: zhaoyachao >Priority: Major > > spark set job description ,use spark REST API > (localhost:4040/api/v1/applications/xxx/jobs)unable to get job > description,but it can be displayed at localhost:4040/jobs > spark.sparkContext.setJobDescription({color:#6a8759}"test_count"{color}) > spark.range({color:#6897bb}100{color}).count() -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
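For completeness, a minimal sketch of querying that endpoint from Python; the application id is a placeholder and the UI is assumed to be reachable on localhost:4040.
{code:python}
import json
import urllib.request

app_id = "local-1603014688155"  # placeholder; list ids via /api/v1/applications
url = "http://localhost:4040/api/v1/applications/{}/jobs".format(app_id)

with urllib.request.urlopen(url) as resp:
    jobs = json.load(resp)

for job in jobs:
    # 'description' is only present when setJobDescription was called before
    # the action that triggered the job.
    print(job["jobId"], job.get("description", "<no description>"))
{code}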
[jira] [Commented] (SPARK-32323) Javascript/HTML bug in spark application UI
[ https://issues.apache.org/jira/browse/SPARK-32323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17216145#comment-17216145 ] Ihor Bobak commented on SPARK-32323: Firefox 81.0.2 64bit, the latest one. But the bug appeared long ago, and until then Firefox updated multiple times on my VM. I believe this is not about > Javascript/HTML bug in spark application UI > --- > > Key: SPARK-32323 > URL: https://issues.apache.org/jira/browse/SPARK-32323 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 3.0.0 > Environment: Ubuntu 18, Spark 3.0.0 standalone cluster >Reporter: Ihor Bobak >Priority: Major > Attachments: 2020-07-15 16_36_31-pyspark-shell - Spark Jobs.png > > > I attached screeenshot - everything is written on it. > This appeared in Spark 3.0.0 in the Firefox browser (latest version) > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-32323) Javascript/HTML bug in spark application UI
[ https://issues.apache.org/jira/browse/SPARK-32323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17216145#comment-17216145 ] Ihor Bobak edited comment on SPARK-32323 at 10/18/20, 9:17 AM: --- Firefox 81.0.2 64bit, the latest one. But the bug appeared long ago, and until then Firefox updated multiple times on my VM. I believe this is not related to the version of the browser was (Author: ibobak): Firefox 81.0.2 64bit, the latest one. But the bug appeared long ago, and until then Firefox updated multiple times on my VM. I believe this is not about > Javascript/HTML bug in spark application UI > --- > > Key: SPARK-32323 > URL: https://issues.apache.org/jira/browse/SPARK-32323 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 3.0.0 > Environment: Ubuntu 18, Spark 3.0.0 standalone cluster >Reporter: Ihor Bobak >Priority: Major > Attachments: 2020-07-15 16_36_31-pyspark-shell - Spark Jobs.png > > > I attached screeenshot - everything is written on it. > This appeared in Spark 3.0.0 in the Firefox browser (latest version) > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-32323) Javascript/HTML bug in spark application UI
[ https://issues.apache.org/jira/browse/SPARK-32323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] akiyamaneko updated SPARK-32323: Comment: was deleted (was: [~ibobak] hi, can you provide your browser info>) > Javascript/HTML bug in spark application UI > --- > > Key: SPARK-32323 > URL: https://issues.apache.org/jira/browse/SPARK-32323 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 3.0.0 > Environment: Ubuntu 18, Spark 3.0.0 standalone cluster >Reporter: Ihor Bobak >Priority: Major > Attachments: 2020-07-15 16_36_31-pyspark-shell - Spark Jobs.png > > > I attached screeenshot - everything is written on it. > This appeared in Spark 3.0.0 in the Firefox browser (latest version) > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33172) Spark SQL CodeGenerator does not check for UserDefined type
[ https://issues.apache.org/jira/browse/SPARK-33172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-33172: - Target Version/s: (was: 2.4.8, 3.0.2) > Spark SQL CodeGenerator does not check for UserDefined type > --- > > Key: SPARK-33172 > URL: https://issues.apache.org/jira/browse/SPARK-33172 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.4.7, 3.0.1 >Reporter: David Rabinowitz >Priority: Minor > > The CodeGenerator takes the DataType given to {{getValueFromVector()}} as > is, and generates code based on its type. The generated code is not aware of > the actual type, and therefore cannot be compiled. For example, using a > DataFrame with a Spark ML Vector (VectorUDT) the generated code is: > {{InternalRow datasourcev2scan_value_2 = datasourcev2scan_isNull_2 ? null : > (datasourcev2scan_mutableStateArray_2[2].getStruct(datasourcev2scan_rowIdx_0, > 4));}} > {{ Which leads to a runtime error of}} > {{20/10/14 13:20:51 ERROR CodeGenerator: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 153, Column 126: No applicable constructor/method found for actual parameters > "int, int"; candidates are: "public > org.apache.spark.sql.vectorized.ColumnarRow > org.apache.spark.sql.vectorized.ColumnVector.getStruct(int)"}} > {{ org.codehaus.commons.compiler.CompileException: File 'generated.java', > Line 153, Column 126: No applicable constructor/method found for actual > parameters "int, int"; candidates are: "public > org.apache.spark.sql.vectorized.ColumnarRow > org.apache.spark.sql.vectorized.ColumnVector.getStruct(int)"}} > {{ at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:12124)}} > {{...}} > {{ which then throws Spark to an infinite loop of this error.}} > The solution is quite simple, {{getValueFromVector()}} should match nad > handle UserDefinedType the same as {{CodeGenerator.javaType()}} is doing. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33109) Upgrade to SBT 1.4 and support `dependencyTree` back
[ https://issues.apache.org/jira/browse/SPARK-33109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17216127#comment-17216127 ] Apache Spark commented on SPARK-33109: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/30085 > Upgrade to SBT 1.4 and support `dependencyTree` back > > > Key: SPARK-33109 > URL: https://issues.apache.org/jira/browse/SPARK-33109 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Assignee: Denis Pyshev >Priority: Major > Fix For: 3.1.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33131) Fix grouping sets with having clause can not resolve qualified col name
[ https://issues.apache.org/jira/browse/SPARK-33131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-33131: -- Fix Version/s: 2.4.8 > Fix grouping sets with having clause can not resolve qualified col name > --- > > Key: SPARK-33131 > URL: https://issues.apache.org/jira/browse/SPARK-33131 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5, 2.4.6, 2.4.7, > 3.1.0 >Reporter: ulysses you >Assignee: ulysses you >Priority: Minor > Fix For: 2.4.8, 3.0.2, 3.1.0 > > > Grouping sets construct new aggregate lost the qualified name of grouping > expression. Here is a example: > {code:java} > -- Works resolved by ResolveReferences > select c1 from values (1) as t1(c1) group by grouping sets(t1.c1) having c1 = > 1 > -- Works because of the extra expression c1 > select c1 as c2 from values (1) as t1(c1) group by grouping sets(t1.c1) > having t1.c1 = 1 > -- Failed > select c1 from values (1) as t1(c1) group by grouping sets(t1.c1) having > t1.c1 = 1{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33131) Fix grouping sets with having clause can not resolve qualified col name
[ https://issues.apache.org/jira/browse/SPARK-33131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-33131: -- Priority: Major (was: Minor) > Fix grouping sets with having clause can not resolve qualified col name > --- > > Key: SPARK-33131 > URL: https://issues.apache.org/jira/browse/SPARK-33131 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5, 2.4.6, 2.4.7, > 3.1.0 >Reporter: ulysses you >Assignee: ulysses you >Priority: Major > Fix For: 2.4.8, 3.0.2, 3.1.0 > > > Grouping sets construct new aggregate lost the qualified name of grouping > expression. Here is a example: > {code:java} > -- Works resolved by ResolveReferences > select c1 from values (1) as t1(c1) group by grouping sets(t1.c1) having c1 = > 1 > -- Works because of the extra expression c1 > select c1 as c2 from values (1) as t1(c1) group by grouping sets(t1.c1) > having t1.c1 = 1 > -- Failed > select c1 from values (1) as t1(c1) group by grouping sets(t1.c1) having > t1.c1 = 1{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
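A PySpark reproduction sketch of the case from the description, run through spark.sql; on affected versions the second statement fails to resolve the qualified name t1.c1.
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("grouping-sets-having").getOrCreate()

# Works: HAVING references the unqualified column.
spark.sql(
    "select c1 from values (1) as t1(c1) "
    "group by grouping sets(t1.c1) having c1 = 1"
).show()

# Fails before the fix: HAVING references the qualified column t1.c1.
spark.sql(
    "select c1 from values (1) as t1(c1) "
    "group by grouping sets(t1.c1) having t1.c1 = 1"
).show()
{code}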
[jira] [Commented] (SPARK-33175) Detect duplicated mountPath and fail at Spark side
[ https://issues.apache.org/jira/browse/SPARK-33175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17216114#comment-17216114 ] Apache Spark commented on SPARK-33175: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/30084 > Detect duplicated mountPath and fail at Spark side > -- > > Key: SPARK-33175 > URL: https://issues.apache.org/jira/browse/SPARK-33175 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 2.4.7, 3.0.2, 3.1.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > > If there is a mountPath conflict, the pod is created and repeats the > following error messages and keep running. This should not keep running and > we had better fail at Spark side. > {code} > $ k get pod -l 'spark-role in (driver,executor)' > NAMEREADY STATUSRESTARTS AGE > tpcds 1/1 Running 0 14m > {code} > {code} > 20/10/18 05:09:26 WARN ExecutorPodsSnapshotsStoreImpl: Exception when > notifying snapshot subscriber. > io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: > POST at: ... > Message: Pod "tpcds-exec-1" is invalid: > spec.containers[0].volumeMounts[1].mountPath: Invalid value: "/data1": must > be unique. > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org