[jira] [Closed] (SPARK-12312) Support JDBC Kerberos w/ keytab

2021-02-09 Thread Gabor Somogyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-12312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Somogyi closed SPARK-12312.
-

> Support JDBC Kerberos w/ keytab
> ---
>
> Key: SPARK-12312
> URL: https://issues.apache.org/jira/browse/SPARK-12312
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 2.4.2
>Reporter: nabacg
>Assignee: Gabor Somogyi
>Priority: Minor
> Fix For: 3.1.0
>
>
> When loading DataFrames from a JDBC datasource with Kerberos authentication, 
> remote executors (yarn-client/cluster etc. modes) fail to establish a 
> connection because they lack a Kerberos ticket and the ability to generate one. 
> This is a real issue when trying to ingest data from kerberized data sources 
> (SQL Server, Oracle) in enterprise environments where exposing simple 
> authentication access is not an option due to IT policy.
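For illustration, a minimal sketch of the keytab-based JDBC read this ticket enables (fixed in 3.1.0). The URL, table, keytab file and principal below are placeholders; the keytab and principal JDBC options are the ones added by this work, and the keytab file is expected to be pre-deployed on, or shipped to, the executors (e.g. via --files).

{code:scala}
// Hypothetical connection details; only the option names reflect the feature.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db.example.com:5432/sales")
  .option("dbtable", "public.orders")
  .option("keytab", "client.keytab")            // shipped with --files client.keytab
  .option("principal", "client@EXAMPLE.COM")    // Kerberos principal matching the keytab
  .load()
{code}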






[jira] [Resolved] (SPARK-12312) Support JDBC Kerberos w/ keytab

2021-02-09 Thread Gabor Somogyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-12312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Somogyi resolved SPARK-12312.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

> Support JDBC Kerberos w/ keytab
> ---
>
> Key: SPARK-12312
> URL: https://issues.apache.org/jira/browse/SPARK-12312
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 2.4.2
>Reporter: nabacg
>Assignee: Gabor Somogyi
>Priority: Minor
> Fix For: 3.1.0
>
>
> When loading DataFrames from a JDBC datasource with Kerberos authentication, 
> remote executors (yarn-client/cluster etc. modes) fail to establish a 
> connection because they lack a Kerberos ticket and the ability to generate one. 
> This is a real issue when trying to ingest data from kerberized data sources 
> (SQL Server, Oracle) in enterprise environments where exposing simple 
> authentication access is not an option due to IT policy.






[jira] [Closed] (SPARK-31857) Support Azure SQLDB Kerberos login in JDBC connector

2021-02-09 Thread Gabor Somogyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Somogyi closed SPARK-31857.
-

> Support Azure SQLDB Kerberos login in JDBC connector
> 
>
> Key: SPARK-31857
> URL: https://issues.apache.org/jira/browse/SPARK-31857
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: jobit mathew
>Priority: Minor
>







[jira] [Resolved] (SPARK-31884) Support MongoDB Kerberos login in JDBC connector

2021-02-09 Thread Gabor Somogyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Somogyi resolved SPARK-31884.
---
Resolution: Won't Do

> Support MongoDB Kerberos login in JDBC connector
> 
>
> Key: SPARK-31884
> URL: https://issues.apache.org/jira/browse/SPARK-31884
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: jobit mathew
>Priority: Minor
>







[jira] [Closed] (SPARK-31884) Support MongoDB Kerberos login in JDBC connector

2021-02-09 Thread Gabor Somogyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Somogyi closed SPARK-31884.
-

> Support MongoDB Kerberos login in JDBC connector
> 
>
> Key: SPARK-31884
> URL: https://issues.apache.org/jira/browse/SPARK-31884
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: jobit mathew
>Priority: Minor
>







[jira] [Commented] (SPARK-31884) Support MongoDB Kerberos login in JDBC connector

2021-02-09 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17282280#comment-17282280
 ] 

Gabor Somogyi commented on SPARK-31884:
---

Since the API is implemented, I think it's now possible to add the mentioned 
feature as an external plugin, so I'm closing this Jira.
If committers or PMCs think we can add this as a built-in provider, feel free to 
re-open this Jira.
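For context, a minimal sketch of what such an external plugin could look like, assuming the JDBC connection provider developer API shape shipped in Spark 3.1 (org.apache.spark.sql.jdbc.JdbcConnectionProvider, discovered through Java's ServiceLoader). The class name, option handling and MongoDB specifics are hypothetical; verify the abstract members against the Spark version you build for.

{code:scala}
import java.sql.{Connection, Driver}
import java.util.Properties

import org.apache.spark.sql.jdbc.JdbcConnectionProvider

// Hypothetical third-party provider, not part of Spark or this Jira.
class MongoKerberosConnectionProvider extends JdbcConnectionProvider {
  override val name: String = "mongoKerberos"

  override def canHandle(driver: Driver, options: Map[String, String]): Boolean =
    options.getOrElse("url", "").startsWith("jdbc:mongodb:")

  override def getConnection(driver: Driver, options: Map[String, String]): Connection = {
    // Perform the Kerberos login with the configured keytab/principal here,
    // then hand back a plain JDBC connection.
    driver.connect(options("url"), new Properties())
  }

  // The provider changes the JVM security context (JAAS login), so Spark must
  // synchronize connection creation across providers.
  override def modifiesSecurityContext(driver: Driver, options: Map[String, String]): Boolean = true
}

// Registered by listing the class name in
// META-INF/services/org.apache.spark.sql.jdbc.JdbcConnectionProvider
{code}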


> Support MongoDB Kerberos login in JDBC connector
> 
>
> Key: SPARK-31884
> URL: https://issues.apache.org/jira/browse/SPARK-31884
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: jobit mathew
>Priority: Minor
>







[jira] [Resolved] (SPARK-31857) Support Azure SQLDB Kerberos login in JDBC connector

2021-02-09 Thread Gabor Somogyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Somogyi resolved SPARK-31857.
---
Resolution: Won't Do

> Support Azure SQLDB Kerberos login in JDBC connector
> 
>
> Key: SPARK-31857
> URL: https://issues.apache.org/jira/browse/SPARK-31857
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: jobit mathew
>Priority: Minor
>







[jira] [Closed] (SPARK-31815) Support Hive Kerberos login in JDBC connector

2021-02-09 Thread Gabor Somogyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Somogyi closed SPARK-31815.
-

> Support Hive Kerberos login in JDBC connector
> -
>
> Key: SPARK-31815
> URL: https://issues.apache.org/jira/browse/SPARK-31815
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Priority: Major
>







[jira] [Commented] (SPARK-31857) Support Azure SQLDB Kerberos login in JDBC connector

2021-02-09 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17282279#comment-17282279
 ] 

Gabor Somogyi commented on SPARK-31857:
---

Since the API is implemented, I think it's now possible to add the mentioned 
feature as an external plugin, so I'm closing this Jira.
If committers or PMCs think we can add this as a built-in provider, feel free to 
re-open this Jira.


> Support Azure SQLDB Kerberos login in JDBC connector
> 
>
> Key: SPARK-31857
> URL: https://issues.apache.org/jira/browse/SPARK-31857
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: jobit mathew
>Priority: Minor
>







[jira] [Resolved] (SPARK-31815) Support Hive Kerberos login in JDBC connector

2021-02-09 Thread Gabor Somogyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Somogyi resolved SPARK-31815.
---
Resolution: Won't Do

> Support Hive Kerberos login in JDBC connector
> -
>
> Key: SPARK-31815
> URL: https://issues.apache.org/jira/browse/SPARK-31815
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Priority: Major
>







[jira] [Commented] (SPARK-31815) Support Hive Kerberos login in JDBC connector

2021-02-09 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17282277#comment-17282277
 ] 

Gabor Somogyi commented on SPARK-31815:
---

Since the API is implemented, I think it's now possible to add the mentioned 
feature as an external plugin, so I'm closing this Jira.
If committers or PMCs think we can add this as a built-in provider, feel free to 
re-open this Jira.


> Support Hive Kerberos login in JDBC connector
> -
>
> Key: SPARK-31815
> URL: https://issues.apache.org/jira/browse/SPARK-31815
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Priority: Major
>







[jira] [Assigned] (SPARK-34414) OptimizeMetadataOnlyQuery should only apply for deterministic filters

2021-02-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34414:


Assignee: (was: Apache Spark)

> OptimizeMetadataOnlyQuery should only apply for deterministic filters
> -
>
> Key: SPARK-34414
> URL: https://issues.apache.org/jira/browse/SPARK-34414
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Yesheng Ma
>Priority: Major
>
> Similar to FileSourcePartitionPruning, OptimizeMetadataOnlyQuery should only 
> apply for deterministic filters. If filters are non-deterministic, they have 
> to be evaluated against partitions separately.
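For illustration (table and column names are hypothetical, not from this Jira), the distinction the rule has to make: a metadata-only plan may answer a DISTINCT over partition columns from catalog partition metadata when the filter is deterministic, but a non-deterministic filter such as rand() has to be evaluated per partition value, so the rule must not fire.

{code:scala}
// The optimization is off by default; enabled here only for the sketch.
spark.conf.set("spark.sql.optimizer.metadataOnly", "true")

spark.sql("SELECT DISTINCT part_col FROM events WHERE part_col > 0").explain()   // deterministic: eligible
spark.sql("SELECT DISTINCT part_col FROM events WHERE rand() < 0.5").explain()   // non-deterministic: must not be rewritten
{code}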






[jira] [Assigned] (SPARK-34414) OptimizeMetadataOnlyQuery should only apply for deterministic filters

2021-02-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34414:


Assignee: Apache Spark

> OptimizeMetadataOnlyQuery should only apply for deterministic filters
> -
>
> Key: SPARK-34414
> URL: https://issues.apache.org/jira/browse/SPARK-34414
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Yesheng Ma
>Assignee: Apache Spark
>Priority: Major
>
> Similar to FileSourcePartitionPruning, OptimizeMetadataOnlyQuery should only 
> apply for deterministic filters. If filters are non-deterministic, they have 
> to be evaluated against partitions separately.






[jira] [Commented] (SPARK-34414) OptimizeMetadataOnlyQuery should only apply for deterministic filters

2021-02-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17282273#comment-17282273
 ] 

Apache Spark commented on SPARK-34414:
--

User 'yeshengm' has created a pull request for this issue:
https://github.com/apache/spark/pull/31542

> OptimizeMetadataOnlyQuery should only apply for deterministic filters
> -
>
> Key: SPARK-34414
> URL: https://issues.apache.org/jira/browse/SPARK-34414
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Yesheng Ma
>Priority: Major
>
> Similar to FileSourcePartitionPruning, OptimizeMetadataOnlyQuery should only 
> apply for deterministic filters. If filters are non-deterministic, they have 
> to be evaluated against partitions separately.






[jira] [Created] (SPARK-34414) OptimizeMetadataOnlyQuery should only apply for deterministic filters

2021-02-09 Thread Yesheng Ma (Jira)
Yesheng Ma created SPARK-34414:
--

 Summary: OptimizeMetadataOnlyQuery should only apply for 
deterministic filters
 Key: SPARK-34414
 URL: https://issues.apache.org/jira/browse/SPARK-34414
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.1
Reporter: Yesheng Ma


Similar to FileSourcePartitionPruning, OptimizeMetadataOnlyQuery should only 
apply for deterministic filters. If filters are non-deterministic, they have to 
be evaluated against partitions separately.






[jira] [Updated] (SPARK-34414) OptimizeMetadataOnlyQuery should only apply for deterministic filters

2021-02-09 Thread Yesheng Ma (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yesheng Ma updated SPARK-34414:
---
Issue Type: Bug  (was: Improvement)

> OptimizeMetadataOnlyQuery should only apply for deterministic filters
> -
>
> Key: SPARK-34414
> URL: https://issues.apache.org/jira/browse/SPARK-34414
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Yesheng Ma
>Priority: Major
>
> Similar to FileSourcePartitionPruning, OptimizeMetadataOnlyQuery should only 
> apply for deterministic filters. If filters are non-deterministic, they have 
> to be evaluated against partitions separately.






[jira] [Commented] (SPARK-34209) Allow multiple namespaces with session catalog

2021-02-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17282259#comment-17282259
 ] 

Apache Spark commented on SPARK-34209:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/31541

> Allow multiple namespaces with session catalog
> --
>
> Key: SPARK-34209
> URL: https://issues.apache.org/jira/browse/SPARK-34209
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0, 3.2.0, 3.1.1
>Reporter: Holden Karau
>Assignee: Holden Karau
>Priority: Trivial
> Fix For: 3.2.0
>
>
> SPARK-30885 removed the ability for tables in the session catalog, when queried 
> with SQL, to have multiple namespaces. This restriction seems to have been added as a 
> follow-up, not as part of the core change. We should explore whether it 
> can be relaxed.
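A hypothetical illustration of the restriction (catalog, namespace and table names are made up): with the built-in session catalog a single-part namespace is accepted while a deeper one is rejected, and this ticket asks whether that check can be relaxed.

{code:scala}
spark.sql("SELECT * FROM spark_catalog.default.my_table")   // single-part namespace: accepted
spark.sql("SELECT * FROM spark_catalog.ns1.ns2.my_table")   // nested namespace: currently rejected for the session catalog
{code}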






[jira] [Commented] (SPARK-34209) Allow multiple namespaces with session catalog

2021-02-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17282258#comment-17282258
 ] 

Apache Spark commented on SPARK-34209:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/31541

> Allow multiple namespaces with session catalog
> --
>
> Key: SPARK-34209
> URL: https://issues.apache.org/jira/browse/SPARK-34209
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0, 3.2.0, 3.1.1
>Reporter: Holden Karau
>Assignee: Holden Karau
>Priority: Trivial
> Fix For: 3.2.0
>
>
> SPARK-30885 removed the ability for tables in the session catalog, when queried 
> with SQL, to have multiple namespaces. This restriction seems to have been added as a 
> follow-up, not as part of the core change. We should explore whether it 
> can be relaxed.






[jira] [Resolved] (SPARK-34234) Remove TreeNodeException that didn't work

2021-02-09 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-34234.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31337
[https://github.com/apache/spark/pull/31337]

> Remove TreeNodeException that didn't work
> -
>
> Key: SPARK-34234
> URL: https://issues.apache.org/jira/browse/SPARK-34234
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.2.0
>
>
> TreeNodeException makes the error message unclear and does not work well.
> Because TreeNodeException looks redundant, we can remove it.






[jira] [Assigned] (SPARK-34234) Remove TreeNodeException that didn't work

2021-02-09 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-34234:
---

Assignee: jiaan.geng

> Remove TreeNodeException that didn't work
> -
>
> Key: SPARK-34234
> URL: https://issues.apache.org/jira/browse/SPARK-34234
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
>
> TreeNodeException makes the error message unclear and does not work well.
> Because TreeNodeException looks redundant, we can remove it.






[jira] [Resolved] (SPARK-34404) Support Avro datasource options to control datetime rebasing in read

2021-02-09 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-34404.
-
Resolution: Fixed

Issue resolved by pull request 31529
[https://github.com/apache/spark/pull/31529]

> Support Avro datasource options to control datetime rebasing in read
> 
>
> Key: SPARK-34404
> URL: https://issues.apache.org/jira/browse/SPARK-34404
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.2.0
>
>
> Add a new Avro option similar to the SQL config 
> {{spark.sql.legacy.parquet.datetimeRebaseModeInRead}}.
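A minimal sketch of the per-read option, assuming the option name datetimeRebaseMode that this work introduces (the path is hypothetical); it overrides the session-wide rebase configuration for a single Avro read.

{code:scala}
val df = spark.read
  .format("avro")
  .option("datetimeRebaseMode", "CORRECTED")   // or "LEGACY" / "EXCEPTION"
  .load("/data/events_1582.avro")
{code}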






[jira] [Resolved] (SPARK-34347) CatalogImpl.uncacheTable should invalidate in cascade for temp views

2021-02-09 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh resolved SPARK-34347.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31462
[https://github.com/apache/spark/pull/31462]

> CatalogImpl.uncacheTable should invalidate in cascade for temp views 
> -
>
> Key: SPARK-34347
> URL: https://issues.apache.org/jira/browse/SPARK-34347
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
> Fix For: 3.2.0
>
>
> When {{spark.sql.legacy.storeAnalyzedPlanForView}} is false, 
> {{CatalogImpl.uncacheTable}} should invalidate caches for temp views in 
> cascade.
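A small sketch of the cascade being asked for (view names are hypothetical): with spark.sql.legacy.storeAnalyzedPlanForView=false, uncaching a temp view should also invalidate caches built on top of it.

{code:scala}
spark.range(10).createOrReplaceTempView("v1")
spark.sql("CACHE TABLE v1")
spark.sql("SELECT id * 2 AS id2 FROM v1").createOrReplaceTempView("v2")
spark.sql("CACHE TABLE v2")

spark.catalog.uncacheTable("v1")   // with the fix, v2's cache entry is invalidated as well
{code}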






[jira] [Assigned] (SPARK-34347) CatalogImpl.uncacheTable should invalidate in cascade for temp views

2021-02-09 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh reassigned SPARK-34347:
---

Assignee: Chao Sun

> CatalogImpl.uncacheTable should invalidate in cascade for temp views 
> -
>
> Key: SPARK-34347
> URL: https://issues.apache.org/jira/browse/SPARK-34347
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
>
> When {{spark.sql.legacy.storeAnalyzedPlanForView}} is false, 
> {{CatalogImpl.uncacheTable}} should invalidate caches for temp views in 
> cascade.






[jira] [Resolved] (SPARK-31816) Create high level description about JDBC connection providers for users/developers

2021-02-09 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-31816.
--
Fix Version/s: 3.2.0
 Assignee: Gabor Somogyi
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/31384

> Create high level description about JDBC connection providers for 
> users/developers
> --
>
> Key: SPARK-31816
> URL: https://issues.apache.org/jira/browse/SPARK-31816
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Assignee: Gabor Somogyi
>Priority: Major
> Fix For: 3.2.0
>
>







[jira] [Commented] (SPARK-33763) Add metrics for better tracking of dynamic allocation

2021-02-09 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17282201#comment-17282201
 ] 

Attila Zsolt Piros commented on SPARK-33763:


I am not positive about the "stage re-submitted because of fetch failure" 
solution either, as "stages.failedStages.count" is already available and most 
failed stages are retried.

When the tests on my PR (which contains the counter metrics for the different 
loss reasons) are finished, I will reopen it as a non-WIP PR (or remove the WIP 
label).

> Add metrics for better tracking of dynamic allocation
> -
>
> Key: SPARK-33763
> URL: https://issues.apache.org/jira/browse/SPARK-33763
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Holden Karau
>Priority: Major
>
> We should add metrics to track the following:
> 1- Graceful decommissions & DA scheduled deletes
> 2- Jobs resubmitted
> 3- Fetch failures
> 4- Unexpected (e.g. non-Spark triggered) executor removals.






[jira] [Comment Edited] (SPARK-33763) Add metrics for better tracking of dynamic allocation

2021-02-09 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17282201#comment-17282201
 ] 

Attila Zsolt Piros edited comment on SPARK-33763 at 2/10/21, 3:30 AM:
--

I am not positive about the "stage re-submitted because of fetch failure" 
solution either, as "stages.failedStages.count" is already available and most 
failed stages are retried.

When the tests on my PR (which contains the counter metrics for the different 
loss reasons) are finished, I will reopen it as a non-WIP PR (or remove the WIP 
label).


was (Author: attilapiros):
I am not positive about the "stage re-submitted because of fetch failure" 
solution too as "stages.failedStages.count" is already available and most 
failed stages are retried.

When the tests on my PR (which contains the counter metrics for the different 
loss reasons) are finished I will reopen it as non-WIP PR (or remove the WIP 
label).{{}}

> Add metrics for better tracking of dynamic allocation
> -
>
> Key: SPARK-33763
> URL: https://issues.apache.org/jira/browse/SPARK-33763
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Holden Karau
>Priority: Major
>
> We should add metrics to track the following:
> 1- Graceful decommissions & DA scheduled deletes
> 2- Jobs resubmitted
> 3- Fetch failures
> 4- Unexpected (e.g. non-Spark triggered) executor removals.






[jira] [Resolved] (SPARK-33995) Make datetime addition easier for years, weeks, hours, minutes, and seconds

2021-02-09 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-33995.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31073
[https://github.com/apache/spark/pull/31073]

> Make datetime addition easier for years, weeks, hours, minutes, and seconds
> ---
>
> Key: SPARK-33995
> URL: https://issues.apache.org/jira/browse/SPARK-33995
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Matthew Powers
>Assignee: Matthew Powers
>Priority: Minor
> Fix For: 3.2.0
>
>
> There are add_months and date_add functions that make it easy to perform 
> datetime addition with months and days, but there isn't an easy way to 
> perform datetime addition with years, weeks, hours, minutes, or seconds with 
> the Scala/Python/R APIs.
> Users need to write code like expr("first_datetime + INTERVAL 2 hours") to 
> add two hours to a timestamp with the Scala API, which isn't desirable.  We 
> don't want to make Scala users manipulate SQL strings.
> We can expose the [make_interval SQL 
> function|https://github.com/apache/spark/pull/26446/files] to make any 
> combination of datetime addition possible.  That'll make tons of different 
> datetime addition operations possible and will be valuable for a wide array 
> of users.
> make_interval takes 7 arguments: years, months, weeks, days, hours, mins, and 
> secs.
> There are different ways to expose the make_interval functionality to 
> Scala/Python/R users:
>  * Option 1: Single make_interval function that takes 7 arguments
>  * Option 2: expose a few interval functions
>  ** make_date_interval function that takes years, months, days
>  ** make_time_interval function that takes hours, minutes, seconds
>  ** make_datetime_interval function that takes years, months, days, hours, 
> minutes, seconds
>  * Option 3: expose add_years, add_months, add_days, add_weeks, add_hours, 
> add_minutes, and add_seconds as Column methods.  
>  * Option 4: Expose the add_years, add_hours, etc. as column functions.  
> add_weeks and date_add have already been exposed in this manner.  
> Option 1 is nice from a maintenance perspective because it's a single function, 
> but it's not standard from a user perspective.  Most languages support 
> datetime instantiation with these arguments: years, months, days, hours, 
> minutes, seconds.  Mixing weeks into the equation is not standard.
> As a user, Option 3 would be my preference.  
> col("first_datetime").addHours(2).addSeconds(30) is easy for me to remember 
> and type.  col("first_datetime") + make_time_interval(lit(2), lit(0), 
> lit(30)) isn't as nice.  col("first_datetime") + make_interval(lit(0), 
> lit(0), lit(0), lit(0), lit(2), lit(0), lit(30)) is harder still.
> Any of these options is an improvement to the status quo.  Let me know what 
> option you think is best and then I'll make a PR to implement it, building 
> off of Max's foundational work of course ;)
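To make the status quo concrete, a short sketch (df and its first_datetime column are hypothetical): today the Scala API goes through a SQL string, while the existing make_interval SQL function takes all seven components positionally.

{code:scala}
import org.apache.spark.sql.functions.{col, expr}

// Current workaround: datetime arithmetic via a SQL string.
val viaSqlString = df.withColumn("plus_2h", expr("first_datetime + INTERVAL 2 hours"))

// Same idea with make_interval(years, months, weeks, days, hours, mins, secs).
val viaMakeInterval = df.withColumn(
  "plus_2h_30s",
  col("first_datetime") + expr("make_interval(0, 0, 0, 0, 2, 0, 30)"))
{code}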






[jira] [Assigned] (SPARK-33995) Make datetime addition easier for years, weeks, hours, minutes, and seconds

2021-02-09 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-33995:
---

Assignee: Matthew Powers

> Make datetime addition easier for years, weeks, hours, minutes, and seconds
> ---
>
> Key: SPARK-33995
> URL: https://issues.apache.org/jira/browse/SPARK-33995
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Matthew Powers
>Assignee: Matthew Powers
>Priority: Minor
>
> There are add_months and date_add functions that make it easy to perform 
> datetime addition with months and days, but there isn't an easy way to 
> perform datetime addition with years, weeks, hours, minutes, or seconds with 
> the Scala/Python/R APIs.
> Users need to write code like expr("first_datetime + INTERVAL 2 hours") to 
> add two hours to a timestamp with the Scala API, which isn't desirable.  We 
> don't want to make Scala users manipulate SQL strings.
> We can expose the [make_interval SQL 
> function|https://github.com/apache/spark/pull/26446/files] to make any 
> combination of datetime addition possible.  That'll make tons of different 
> datetime addition operations possible and will be valuable for a wide array 
> of users.
> make_interval takes 7 arguments: years, months, weeks, days, hours, mins, and 
> secs.
> There are different ways to expose the make_interval functionality to 
> Scala/Python/R users:
>  * Option 1: Single make_interval function that takes 7 arguments
>  * Option 2: expose a few interval functions
>  ** make_date_interval function that takes years, months, days
>  ** make_time_interval function that takes hours, minutes, seconds
>  ** make_datetime_interval function that takes years, months, days, hours, 
> minutes, seconds
>  * Option 3: expose add_years, add_months, add_days, add_weeks, add_hours, 
> add_minutes, and add_seconds as Column methods.  
>  * Option 4: Expose the add_years, add_hours, etc. as column functions.  
> add_weeks and date_add have already been exposed in this manner.  
> Option 1 is nice from a maintenance perspective because it's a single function, 
> but it's not standard from a user perspective.  Most languages support 
> datetime instantiation with these arguments: years, months, days, hours, 
> minutes, seconds.  Mixing weeks into the equation is not standard.
> As a user, Option 3 would be my preference.  
> col("first_datetime").addHours(2).addSeconds(30) is easy for me to remember 
> and type.  col("first_datetime") + make_time_interval(lit(2), lit(0), 
> lit(30)) isn't as nice.  col("first_datetime") + make_interval(lit(0), 
> lit(0), lit(0), lit(0), lit(2), lit(0), lit(30)) is harder still.
> Any of these options is an improvement to the status quo.  Let me know what 
> option you think is best and then I'll make a PR to implement it, building 
> off of Max's foundational work of course ;)






[jira] [Resolved] (SPARK-34137) The tree string does not contain statistics for nested scalar sub queries

2021-02-09 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-34137.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31485
[https://github.com/apache/spark/pull/31485]

> The tree string does not contain statistics for nested scalar sub queries
> -
>
> Key: SPARK-34137
> URL: https://issues.apache.org/jira/browse/SPARK-34137
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.2.0
>
>
> How to reproduce:
> {code:scala}
> spark.sql("create table t1 using parquet as select id as a, id as b from 
> range(1000)")
> spark.sql("create table t2 using parquet as select id as c, id as d from 
> range(2000)")
> spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR ALL COLUMNS")
> spark.sql("ANALYZE TABLE t2 COMPUTE STATISTICS FOR ALL COLUMNS")
> spark.sql("set spark.sql.cbo.enabled=true")
> spark.sql(
>   """
> |WITH max_store_sales AS
> |  (SELECT max(csales) tpcds_cmax
> |  FROM (SELECT
> |sum(b) csales
> |  FROM t1 WHERE a < 100 ) x),
> |best_ss_customer AS
> |  (SELECT
> |c
> |  FROM t2
> |  WHERE d > (SELECT * FROM max_store_sales))
> |
> |SELECT c FROM best_ss_customer
> |""".stripMargin).explain("cost")
> {code}
> Output:
> {noformat}
> == Optimized Logical Plan ==
> Project [c#4263L], Statistics(sizeInBytes=31.3 KiB, rowCount=2.00E+3)
> +- Filter (isnotnull(d#4264L) AND (d#4264L > scalar-subquery#4262 [])), 
> Statistics(sizeInBytes=46.9 KiB, rowCount=2.00E+3)
>:  +- Aggregate [max(csales#4260L) AS tpcds_cmax#4261L]
>: +- Aggregate [sum(b#4266L) AS csales#4260L]
>:+- Project [b#4266L]
>:   +- Filter ((a#4265L < 100) AND isnotnull(a#4265L))
>:  +- Relation default.t1[a#4265L,b#4266L] parquet, 
> Statistics(sizeInBytes=23.4 KiB, rowCount=1.00E+3)
>+- Relation default.t2[c#4263L,d#4264L] parquet, 
> Statistics(sizeInBytes=46.9 KiB, rowCount=2.00E+3)
> {noformat}
> Another case is TPC-DS q23a.






[jira] [Assigned] (SPARK-34137) The tree string does not contain statistics for nested scalar sub queries

2021-02-09 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-34137:
---

Assignee: Yuming Wang

> The tree string does not contain statistics for nested scalar sub queries
> -
>
> Key: SPARK-34137
> URL: https://issues.apache.org/jira/browse/SPARK-34137
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>
> How to reproduce:
> {code:scala}
> spark.sql("create table t1 using parquet as select id as a, id as b from 
> range(1000)")
> spark.sql("create table t2 using parquet as select id as c, id as d from 
> range(2000)")
> spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR ALL COLUMNS")
> spark.sql("ANALYZE TABLE t2 COMPUTE STATISTICS FOR ALL COLUMNS")
> spark.sql("set spark.sql.cbo.enabled=true")
> spark.sql(
>   """
> |WITH max_store_sales AS
> |  (SELECT max(csales) tpcds_cmax
> |  FROM (SELECT
> |sum(b) csales
> |  FROM t1 WHERE a < 100 ) x),
> |best_ss_customer AS
> |  (SELECT
> |c
> |  FROM t2
> |  WHERE d > (SELECT * FROM max_store_sales))
> |
> |SELECT c FROM best_ss_customer
> |""".stripMargin).explain("cost")
> {code}
> Output:
> {noformat}
> == Optimized Logical Plan ==
> Project [c#4263L], Statistics(sizeInBytes=31.3 KiB, rowCount=2.00E+3)
> +- Filter (isnotnull(d#4264L) AND (d#4264L > scalar-subquery#4262 [])), 
> Statistics(sizeInBytes=46.9 KiB, rowCount=2.00E+3)
>:  +- Aggregate [max(csales#4260L) AS tpcds_cmax#4261L]
>: +- Aggregate [sum(b#4266L) AS csales#4260L]
>:+- Project [b#4266L]
>:   +- Filter ((a#4265L < 100) AND isnotnull(a#4265L))
>:  +- Relation default.t1[a#4265L,b#4266L] parquet, 
> Statistics(sizeInBytes=23.4 KiB, rowCount=1.00E+3)
>+- Relation default.t2[c#4263L,d#4264L] parquet, 
> Statistics(sizeInBytes=46.9 KiB, rowCount=2.00E+3)
> {noformat}
> Another case is TPC-DS q23a.






[jira] [Resolved] (SPARK-34240) Unify output of SHOW TBLPROPERTIES pass output attributes properly

2021-02-09 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-34240.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31378
[https://github.com/apache/spark/pull/31378]

> Unify output of SHOW TBLPROPERTIES pass output attributes properly
> --
>
> Key: SPARK-34240
> URL: https://issues.apache.org/jira/browse/SPARK-34240
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.2.0
>
>
> Unify the output of SHOW TBLPROPERTIES and pass output attributes properly.






[jira] [Assigned] (SPARK-34240) Unify output of SHOW TBLPROPERTIES pass output attributes properly

2021-02-09 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-34240:
---

Assignee: angerszhu

> Unify output of SHOW TBLPROPERTIES pass output attributes properly
> --
>
> Key: SPARK-34240
> URL: https://issues.apache.org/jira/browse/SPARK-34240
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
>
> Unify the output of SHOW TBLPROPERTIES and pass output attributes properly.






[jira] [Commented] (SPARK-34392) Invalid ID for offset-based ZoneId since Spark 3.0

2021-02-09 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17282194#comment-17282194
 ] 

Yuming Wang commented on SPARK-34392:
-

PostgreSQL:
{noformat}
postgres=# SELECT TIMESTAMP WITH TIME ZONE '2020-02-07 16:00:00' AT TIME ZONE 
'GMT+8:00', current_timestamp, current_timestamp AT TIME ZONE 'GMT+8:00', 
version();
      timezone       |       current_timestamp       |          timezone          |  version
---------------------+-------------------------------+----------------------------+-----------
 2020-02-07 08:00:00 | 2021-02-10 03:10:48.407459+00 | 2021-02-09 19:10:48.407459 | PostgreSQL 11.3 (Debian 11.3-1.pgdg90+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516, 64-bit
(1 row)
{noformat}

Hive:
{noformat}
hive> SELECT current_timestamp, to_utc_timestamp(current_timestamp, 
"GMT+8:00"), version();
2021-02-09 20:09:16.201 2021-02-09 12:09:16.201 2.3.7 
rcb213d88304034393d68cc31a95be24f5aac62b6
Time taken: 0.095 seconds, Fetched: 1 row(s)
hive>
{noformat}

 Presto/Trino
{noformat}
trino:sf1> SELECT current_timestamp, current_timestamp AT TIME ZONE 'GMT+8:00';
            _col0            |             _col1
-----------------------------+--------------------------------
 2021-02-10 03:07:27.807 UTC | 2021-02-10 11:07:27.807 +08:00
(1 row)

Query 20210210_030727_00015_2i5r6, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
0.22 [0 rows, 0B] [0 rows/s, 0B/s]
{noformat}
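For reference, the underlying java.time behaviour (a minimal check, not from this Jira): after the GMT prefix, ZoneId.of only accepts the zero-padded offset forms documented for ZoneOffset.of, which is why Spark 3.x, which resolves time zones through ZoneId.of, rejects the un-padded form.

{code:scala}
import java.time.ZoneId

ZoneId.of("GMT+08:00")   // OK: "+hh:mm" is a valid offset form
ZoneId.of("GMT+8:00")    // throws java.time.DateTimeException: Invalid ID for offset-based ZoneId
{code}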


> Invalid ID for offset-based ZoneId since Spark 3.0
> --
>
> Key: SPARK-34392
> URL: https://issues.apache.org/jira/browse/SPARK-34392
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Yuming Wang
>Priority: Major
>
> How to reproduce this issue:
> {code:sql}
> select to_utc_timestamp("2020-02-07 16:00:00", "GMT+8:00");
> {code}
> Spark 2.4:
> {noformat}
> spark-sql> select to_utc_timestamp("2020-02-07 16:00:00", "GMT+8:00");
> 2020-02-07 08:00:00
> Time taken: 0.089 seconds, Fetched 1 row(s)
> {noformat}
> Spark 3.x:
> {noformat}
> spark-sql> select to_utc_timestamp("2020-02-07 16:00:00", "GMT+8:00");
> 21/02/07 01:24:32 ERROR SparkSQLDriver: Failed in [select 
> to_utc_timestamp("2020-02-07 16:00:00", "GMT+8:00")]
> java.time.DateTimeException: Invalid ID for offset-based ZoneId: GMT+8:00
>   at java.time.ZoneId.ofWithPrefix(ZoneId.java:437)
>   at java.time.ZoneId.of(ZoneId.java:407)
>   at java.time.ZoneId.of(ZoneId.java:359)
>   at java.time.ZoneId.of(ZoneId.java:315)
>   at 
> org.apache.spark.sql.catalyst.util.DateTimeUtils$.getZoneId(DateTimeUtils.scala:53)
>   at 
> org.apache.spark.sql.catalyst.util.DateTimeUtils$.toUTCTime(DateTimeUtils.scala:814)
> {noformat}






[jira] [Created] (SPARK-34413) Clean up AnsiTypeCoercionSuite and TypeCoercionSuite

2021-02-09 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-34413:
--

 Summary: Clean up AnsiTypeCoercionSuite and TypeCoercionSuite
 Key: SPARK-34413
 URL: https://issues.apache.org/jira/browse/SPARK-34413
 Project: Spark
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 3.2.0
Reporter: Gengliang Wang


Create an abstract class and remove duplicated code from the two suites.






[jira] [Resolved] (SPARK-34248) Allow implicit casting string literal to other data types under ANSI mode

2021-02-09 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-34248.

  Assignee: Gengliang Wang
Resolution: Fixed

> Allow implicit casting string literal to other data types under ANSI mode
> -
>
> Key: SPARK-34248
> URL: https://issues.apache.org/jira/browse/SPARK-34248
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> As the default type coercion rules allow converting String type to other 
> primitive types, we can specially allow converting string literals to other 
> primitive types in ANSI mode, so that users won't get many SQL execution 
> failures when switching to ANSI mode.
> This approach is learned from PostgreSQL.






[jira] [Resolved] (SPARK-34250) Add rule WindowFrameCoercion into ANSI implicit cast rules

2021-02-09 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-34250.

  Assignee: Gengliang Wang
Resolution: Fixed

> Add rule WindowFrameCoercion into ANSI implicit cast rules
> --
>
> Key: SPARK-34250
> URL: https://issues.apache.org/jira/browse/SPARK-34250
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 3.1.2
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> To make review easier, the rule WindowFrameCoercion was not added to the ANSI 
> implicit cast rules in PR https://github.com/apache/spark/pull/31349
> We should add it back and add new test cases.






[jira] [Resolved] (SPARK-34311) PostgresDialect can't treat arrays of some types

2021-02-09 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-34311.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/31419

> PostgresDialect can't treat arrays of some types
> 
>
> Key: SPARK-34311
> URL: https://issues.apache.org/jira/browse/SPARK-34311
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.2.0
>
>
> PostgresDialect implements logic to handle arrays of some types, but it's 
> not enough.
> Currently, arrays of the following types can't be handled:
> * xml
> * tsvector
> * tsquery
> * macaddr
> * txid_snapshot
> * point
> * line
> * lseg
> * box
> * path
> * polygon
> * circle
> * pg_lsn
>  






[jira] [Assigned] (SPARK-20977) NPE in CollectionAccumulator

2021-02-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-20977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20977:


Assignee: (was: Apache Spark)

> NPE in CollectionAccumulator
> 
>
> Key: SPARK-20977
> URL: https://issues.apache.org/jira/browse/SPARK-20977
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: sharkd tu
>Priority: Major
>
> {code:java}
> 17/06/03 13:39:31 ERROR Utils: Uncaught exception in thread 
> heartbeat-receiver-event-loop-thread
> java.lang.NullPointerException
>   at 
> org.apache.spark.util.CollectionAccumulator.value(AccumulatorV2.scala:464)
>   at 
> org.apache.spark.util.CollectionAccumulator.value(AccumulatorV2.scala:439)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$6$$anonfun$7.apply(TaskSchedulerImpl.scala:408)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$6$$anonfun$7.apply(TaskSchedulerImpl.scala:408)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$6.apply(TaskSchedulerImpl.scala:408)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$6.apply(TaskSchedulerImpl.scala:407)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:186)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.executorHeartbeatReceived(TaskSchedulerImpl.scala:407)
>   at 
> org.apache.spark.HeartbeatReceiver$$anonfun$receiveAndReply$1$$anon$2$$anonfun$run$2.apply$mcV$sp(HeartbeatReceiver.scala:129)
>   at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1283)
>   at 
> org.apache.spark.HeartbeatReceiver$$anonfun$receiveAndReply$1$$anon$2.run(HeartbeatReceiver.scala:128)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Is this a bug in Spark? Has anybody else hit this problem?






[jira] [Assigned] (SPARK-20977) NPE in CollectionAccumulator

2021-02-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-20977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20977:


Assignee: Apache Spark

> NPE in CollectionAccumulator
> 
>
> Key: SPARK-20977
> URL: https://issues.apache.org/jira/browse/SPARK-20977
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: sharkd tu
>Assignee: Apache Spark
>Priority: Major
>
> {code:java}
> 17/06/03 13:39:31 ERROR Utils: Uncaught exception in thread 
> heartbeat-receiver-event-loop-thread
> java.lang.NullPointerException
>   at 
> org.apache.spark.util.CollectionAccumulator.value(AccumulatorV2.scala:464)
>   at 
> org.apache.spark.util.CollectionAccumulator.value(AccumulatorV2.scala:439)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$6$$anonfun$7.apply(TaskSchedulerImpl.scala:408)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$6$$anonfun$7.apply(TaskSchedulerImpl.scala:408)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$6.apply(TaskSchedulerImpl.scala:408)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$6.apply(TaskSchedulerImpl.scala:407)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:186)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.executorHeartbeatReceived(TaskSchedulerImpl.scala:407)
>   at 
> org.apache.spark.HeartbeatReceiver$$anonfun$receiveAndReply$1$$anon$2$$anonfun$run$2.apply$mcV$sp(HeartbeatReceiver.scala:129)
>   at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1283)
>   at 
> org.apache.spark.HeartbeatReceiver$$anonfun$receiveAndReply$1$$anon$2.run(HeartbeatReceiver.scala:128)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Is this a bug in Spark? Has anybody else hit this problem?






[jira] [Commented] (SPARK-20977) NPE in CollectionAccumulator

2021-02-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-20977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17282157#comment-17282157
 ] 

Apache Spark commented on SPARK-20977:
--

User 'gerashegalov' has created a pull request for this issue:
https://github.com/apache/spark/pull/31540

> NPE in CollectionAccumulator
> 
>
> Key: SPARK-20977
> URL: https://issues.apache.org/jira/browse/SPARK-20977
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: sharkd tu
>Priority: Major
>
> {code:java}
> 17/06/03 13:39:31 ERROR Utils: Uncaught exception in thread 
> heartbeat-receiver-event-loop-thread
> java.lang.NullPointerException
>   at 
> org.apache.spark.util.CollectionAccumulator.value(AccumulatorV2.scala:464)
>   at 
> org.apache.spark.util.CollectionAccumulator.value(AccumulatorV2.scala:439)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$6$$anonfun$7.apply(TaskSchedulerImpl.scala:408)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$6$$anonfun$7.apply(TaskSchedulerImpl.scala:408)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$6.apply(TaskSchedulerImpl.scala:408)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$6.apply(TaskSchedulerImpl.scala:407)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:186)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.executorHeartbeatReceived(TaskSchedulerImpl.scala:407)
>   at 
> org.apache.spark.HeartbeatReceiver$$anonfun$receiveAndReply$1$$anon$2$$anonfun$run$2.apply$mcV$sp(HeartbeatReceiver.scala:129)
>   at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1283)
>   at 
> org.apache.spark.HeartbeatReceiver$$anonfun$receiveAndReply$1$$anon$2.run(HeartbeatReceiver.scala:128)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Is this a bug in Spark? Has anybody ever hit this problem?
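
For context, the NPE points at CollectionAccumulator.value being read from the heartbeat thread while a task thread is still mutating the accumulator. Below is a minimal, hedged sketch of a null-safe, synchronized accessor; the field name `_list` and the copy-on-read behaviour are illustrative assumptions, not the exact contents of the linked pull request.
{code:scala}
import java.util.{ArrayList => JArrayList, Collections, List => JList}

// Sketch only: a collection accumulator whose value() never returns null and
// never exposes the internal buffer to a concurrently running reader thread
// (e.g. the heartbeat reporter). Names are illustrative.
class SafeCollectionAccumulator[T] extends Serializable {
  private var _list: JList[T] = _   // lazily created, may be null before first add

  private def getOrCreate: JList[T] = {
    if (_list == null) {
      _list = Collections.synchronizedList(new JArrayList[T]())
    }
    _list
  }

  def add(v: T): Unit = this.synchronized { getOrCreate.add(v) }

  // Return a defensive copy under the lock, so a reader never sees null and
  // never iterates a list that another thread is mutating.
  def value: JList[T] = this.synchronized {
    new JArrayList[T](getOrCreate)
  }
}
{code}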



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34389) Spark job on Kubernetes scheduled For Zero or less than minimum number of executors and Wait indefinitely under resource starvation

2021-02-09 Thread Ranju (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17282111#comment-17282111
 ] 

Ranju commented on SPARK-34389:
---

Thanks for this information

> Spark job on Kubernetes scheduled For Zero or less than minimum number of 
> executors and Wait indefinitely under resource starvation
> ---
>
> Key: SPARK-34389
> URL: https://issues.apache.org/jira/browse/SPARK-34389
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.1
>Reporter: Ranju
>Priority: Major
> Attachments: DriverLogs_ExecutorLaunchedLessThanMinExecutor.txt, 
> Steps to reproduce.docx
>
>
> If the cluster does not have sufficient resources (CPU/memory) for the minimum 
> number of executors, the executors stay in the Pending state for an indefinite time 
> until resources are freed.
> Suppose the cluster configuration is:
> total memory = 204Gi
> used memory = 200Gi
> free memory = 4Gi
> SPARK.EXECUTOR.MEMORY=10G
> SPARK.DYNAMICALLOCATION.MINEXECUTORS=4
> SPARK.DYNAMICALLOCATION.MAXEXECUTORS=8
> Rather, the job should be cancelled if the requested minimum number of executors 
> is not available at that point in time because of resource unavailability.
> Currently Spark does partial scheduling or no scheduling and waits 
> indefinitely, and the job gets stuck.
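
Until a cancel-on-starvation behaviour exists, one way to avoid waiting forever is to make the driver require a minimum fraction of the requested executors before it starts scheduling and to bound how long it waits for them. A hedged sketch using existing scheduler settings follows; the keys are real Spark configs, but the values are examples only, not recommendations.
{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch: bound the wait for executor registration instead of scheduling with
// too few executors. Config keys exist in Spark; values are examples.
val spark = SparkSession.builder()
  .appName("k8s-min-resources-example")
  // Do not start scheduling until 80% of the requested executors registered.
  .config("spark.scheduler.minRegisteredResourcesRatio", "0.8")
  // Stop waiting for registrations after 5 minutes; application code can then
  // check the executor count and fail early if it is still too low.
  .config("spark.scheduler.maxRegisteredResourcesWaitingTime", "300s")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
  .config("spark.dynamicAllocation.minExecutors", "4")
  .config("spark.dynamicAllocation.maxExecutors", "8")
  .getOrCreate()
{code}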



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34105) In addition to killing exlcuded/flakey executors which should support decommissioning

2021-02-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17282110#comment-17282110
 ] 

Apache Spark commented on SPARK-34105:
--

User 'holdenk' has created a pull request for this issue:
https://github.com/apache/spark/pull/31539

> In addition to killing exlcuded/flakey executors which should support 
> decommissioning
> -
>
> Key: SPARK-34105
> URL: https://issues.apache.org/jira/browse/SPARK-34105
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Holden Karau
>Assignee: Holden Karau
>Priority: Major
>
> Decommissioning will give the executor a chance to migrate its files to a 
> more stable node.
>  
> Note: we want SPARK-34104 to be integrated as well so that flaky executors 
> which cannot decommission are eventually killed.
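
For reference, the block-migration behaviour the description relies on is controlled by the storage decommissioning settings that already exist in Spark 3.1+. A minimal sketch follows; values are illustrative, not recommendations.
{code:scala}
import org.apache.spark.SparkConf

// Sketch: enable executor/storage decommissioning so a decommissioned executor
// tries to migrate its cached RDD blocks and shuffle files before exiting.
// These keys exist in Spark 3.1+.
val conf = new SparkConf()
  .set("spark.decommission.enabled", "true")
  .set("spark.storage.decommission.enabled", "true")
  .set("spark.storage.decommission.rddBlocks.enabled", "true")
  .set("spark.storage.decommission.shuffleBlocks.enabled", "true")
{code}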



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34105) In addition to killing exlcuded/flakey executors which should support decommissioning

2021-02-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17282109#comment-17282109
 ] 

Apache Spark commented on SPARK-34105:
--

User 'holdenk' has created a pull request for this issue:
https://github.com/apache/spark/pull/31539

> In addition to killing exlcuded/flakey executors which should support 
> decommissioning
> -
>
> Key: SPARK-34105
> URL: https://issues.apache.org/jira/browse/SPARK-34105
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Holden Karau
>Assignee: Holden Karau
>Priority: Major
>
> Decommissioning will give the executor a chance to migrate its files to a 
> more stable node.
>  
> Note: we want SPARK-34104 to be integrated as well so that flaky executors 
> which cannot decommission are eventually killed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34104) Allow users to specify a maximum decommissioning time

2021-02-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17282108#comment-17282108
 ] 

Apache Spark commented on SPARK-34104:
--

User 'holdenk' has created a pull request for this issue:
https://github.com/apache/spark/pull/31539

> Allow users to specify a maximum decommissioning time
> -
>
> Key: SPARK-34104
> URL: https://issues.apache.org/jira/browse/SPARK-34104
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0, 3.2.0, 3.1.1
>Reporter: Holden Karau
>Assignee: Holden Karau
>Priority: Major
>
> We currently have the ability for users to set the predicted time at which the 
> cluster manager or cloud provider will terminate a decommissioning executor, 
> but for nodes where Spark itself is triggering decommissioning we should 
> add the ability for users to specify a maximum time we want to allow the 
> executor to decommission.
>  
> This is especially important if we start to trigger decommissioning in more 
> places (like with excluded executors that are found to be flaky and may or 
> may not be able to decommission successfully).
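
For context, the "predicted time" the description refers to is the existing `spark.executor.decommission.killInterval` setting; the maximum-decommission-time knob proposed here is shown with a placeholder key, since the thread does not fix a final config name. A hedged sketch:
{code:scala}
import org.apache.spark.SparkConf

// Sketch only. killInterval is an existing 3.1+ setting (time after which an
// external service is expected to kill a decommissioned executor). The last
// key is a PLACEHOLDER for the maximum decommission time discussed above; the
// real name may differ.
val conf = new SparkConf()
  .set("spark.decommission.enabled", "true")
  .set("spark.executor.decommission.killInterval", "120s")
  .set("spark.executor.decommission.forceKillTimeout", "300s") // assumed/placeholder key
{code}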



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34105) In addition to killing exlcuded/flakey executors which should support decommissioning

2021-02-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34105:


Assignee: Apache Spark  (was: Holden Karau)

> In addition to killing exlcuded/flakey executors which should support 
> decommissioning
> -
>
> Key: SPARK-34105
> URL: https://issues.apache.org/jira/browse/SPARK-34105
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Holden Karau
>Assignee: Apache Spark
>Priority: Major
>
> Decommissioning will give the executor a chance to migrate its files to a 
> more stable node.
>  
> Note: we want SPARK-34104 to be integrated as well so that flaky executors 
> which cannot decommission are eventually killed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34104) Allow users to specify a maximum decommissioning time

2021-02-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34104:


Assignee: Holden Karau  (was: Apache Spark)

> Allow users to specify a maximum decommissioning time
> -
>
> Key: SPARK-34104
> URL: https://issues.apache.org/jira/browse/SPARK-34104
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0, 3.2.0, 3.1.1
>Reporter: Holden Karau
>Assignee: Holden Karau
>Priority: Major
>
> We currently have the ability for users to set the predicted time at which the 
> cluster manager or cloud provider will terminate a decommissioning executor, 
> but for nodes where Spark itself is triggering decommissioning we should 
> add the ability for users to specify a maximum time we want to allow the 
> executor to decommission.
>  
> This is especially important if we start to trigger decommissioning in more 
> places (like with excluded executors that are found to be flaky and may or 
> may not be able to decommission successfully).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34105) In addition to killing exlcuded/flakey executors which should support decommissioning

2021-02-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34105:


Assignee: Holden Karau  (was: Apache Spark)

> In addition to killing exlcuded/flakey executors which should support 
> decommissioning
> -
>
> Key: SPARK-34105
> URL: https://issues.apache.org/jira/browse/SPARK-34105
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Holden Karau
>Assignee: Holden Karau
>Priority: Major
>
> Decommissioning will give the executor a chance to migrate its files to a 
> more stable node.
>  
> Note: we want SPARK-34104 to be integrated as well so that flaky executors 
> which cannot decommission are eventually killed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34104) Allow users to specify a maximum decommissioning time

2021-02-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34104:


Assignee: Apache Spark  (was: Holden Karau)

> Allow users to specify a maximum decommissioning time
> -
>
> Key: SPARK-34104
> URL: https://issues.apache.org/jira/browse/SPARK-34104
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0, 3.2.0, 3.1.1
>Reporter: Holden Karau
>Assignee: Apache Spark
>Priority: Major
>
> We currently have the ability for users to set the predicted time at which the 
> cluster manager or cloud provider will terminate a decommissioning executor, 
> but for nodes where Spark itself is triggering decommissioning we should 
> add the ability for users to specify a maximum time we want to allow the 
> executor to decommission.
>  
> This is especially important if we start to trigger decommissioning in more 
> places (like with excluded executors that are found to be flaky and may or 
> may not be able to decommission successfully).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34104) Allow users to specify a maximum decommissioning time

2021-02-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-34104:
-
Target Version/s:   (was: 3.2.0)

> Allow users to specify a maximum decommissioning time
> -
>
> Key: SPARK-34104
> URL: https://issues.apache.org/jira/browse/SPARK-34104
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0, 3.2.0, 3.1.1
>Reporter: Holden Karau
>Assignee: Holden Karau
>Priority: Major
>
> We currently have the ability for users to set the predicted time at which the 
> cluster manager or cloud provider will terminate a decommissioning executor, 
> but for nodes where Spark itself is triggering decommissioning we should 
> add the ability for users to specify a maximum time we want to allow the 
> executor to decommission.
>  
> This is especially important if we start to trigger decommissioning in more 
> places (like with excluded executors that are found to be flaky and may or 
> may not be able to decommission successfully).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34104) Allow users to specify a maximum decommissioning time

2021-02-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-34104:
-
Fix Version/s: (was: 3.2.0)

> Allow users to specify a maximum decommissioning time
> -
>
> Key: SPARK-34104
> URL: https://issues.apache.org/jira/browse/SPARK-34104
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0, 3.2.0, 3.1.1
>Reporter: Holden Karau
>Assignee: Holden Karau
>Priority: Major
>
> We currently have the ability for users to set the predicted time at which the 
> cluster manager or cloud provider will terminate a decommissioning executor, 
> but for nodes where Spark itself is triggering decommissioning we should 
> add the ability for users to specify a maximum time we want to allow the 
> executor to decommission.
>  
> This is especially important if we start to trigger decommissioning in more 
> places (like with excluded executors that are found to be flaky and may or 
> may not be able to decommission successfully).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-34104) Allow users to specify a maximum decommissioning time

2021-02-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-34104:
--

Reverted in 
https://github.com/apache/spark/commit/c8628c943cd12bbad7561bdc297cea9ff23becc7

> Allow users to specify a maximum decommissioning time
> -
>
> Key: SPARK-34104
> URL: https://issues.apache.org/jira/browse/SPARK-34104
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0, 3.2.0, 3.1.1
>Reporter: Holden Karau
>Assignee: Holden Karau
>Priority: Major
>
> We currently have the ability for users to set the predicted time at which the 
> cluster manager or cloud provider will terminate a decommissioning executor, 
> but for nodes where Spark itself is triggering decommissioning we should 
> add the ability for users to specify a maximum time we want to allow the 
> executor to decommission.
>  
> This is especially important if we start to trigger decommissioning in more 
> places (like with excluded executors that are found to be flaky and may or 
> may not be able to decommission successfully).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34105) In addition to killing exlcuded/flakey executors which should support decommissioning

2021-02-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-34105:
-
Fix Version/s: (was: 3.2.0)

> In addition to killing exlcuded/flakey executors which should support 
> decommissioning
> -
>
> Key: SPARK-34105
> URL: https://issues.apache.org/jira/browse/SPARK-34105
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Holden Karau
>Assignee: Holden Karau
>Priority: Major
>
> Decommissioning will give the executor a chance to migrate its files to a 
> more stable node.
>  
> Note: we want SPARK-34104 to be integrated as well so that flaky executors 
> which cannot decommission are eventually killed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-34105) In addition to killing exlcuded/flakey executors which should support decommissioning

2021-02-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-34105:
--

Reverted in 
https://github.com/apache/spark/commit/c8628c943cd12bbad7561bdc297cea9ff23becc7

> In addition to killing exlcuded/flakey executors which should support 
> decommissioning
> -
>
> Key: SPARK-34105
> URL: https://issues.apache.org/jira/browse/SPARK-34105
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Holden Karau
>Assignee: Holden Karau
>Priority: Major
> Fix For: 3.2.0
>
>
> Decommissioning will give the executor a chance to migrate its files to a 
> more stable node.
>  
> Note: we want SPARK-34104 to be integrated as well so that flaky executors 
> which cannot decommission are eventually killed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34105) In addition to killing exlcuded/flakey executors which should support decommissioning

2021-02-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-34105:
-
Target Version/s:   (was: 3.2.0)

> In addition to killing exlcuded/flakey executors which should support 
> decommissioning
> -
>
> Key: SPARK-34105
> URL: https://issues.apache.org/jira/browse/SPARK-34105
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Holden Karau
>Assignee: Holden Karau
>Priority: Major
>
> Decommissioning will give the executor a chance to migrate it's files to a 
> more stable node.
>  
> Note: we want SPARK-34104 to be integrated as well so that flaky executors 
> which can not decommission are eventually killed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34411) Remove python2 support

2021-02-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-34411.
--
Resolution: Duplicate

> Remove python2 support
> --
>
> Key: SPARK-34411
> URL: https://issues.apache.org/jira/browse/SPARK-34411
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, PySpark
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Björn Boschman
>Priority: Major
>
> Not sure about PySpark itself.
> The provided k8s Dockerfile still installs Python 2.7, which has been EOL 
> since January 1st, 2020.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34412) RemoveNoopOperators can remove non-trivial projects

2021-02-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17282089#comment-17282089
 ] 

Apache Spark commented on SPARK-34412:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/31538

> RemoveNoopOperators can remove non-trivial projects
> ---
>
> Key: SPARK-34412
> URL: https://issues.apache.org/jira/browse/SPARK-34412
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Herman van Hövell
>Assignee: Herman van Hövell
>Priority: Major
>
> RemoveNoopOperators can remove non-trivial projects. This can happen when the 
> top project has a non-trivial expression that reuses the expression ID of the 
> child project.
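
A hedged illustration of the plan shape being described (not a confirmed reproduction): the outer projection computes a new expression but keeps the child's output name, so an optimizer rule that only compares output attributes could wrongly treat the top project as a no-op and drop it.
{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Sketch of the shape described above; whether this exact query triggers the
// bug depends on how expression IDs are assigned, so treat it as an
// illustration rather than a reproduction.
val spark = SparkSession.builder().master("local[1]").appName("noop-project").getOrCreate()
import spark.implicits._

val base  = Seq(1, 2, 3).toDF("a")
val child = base.select(col("a").as("b"))          // trivial rename
val top   = child.select((col("b") + 1).as("b"))   // non-trivial expression, same name
top.explain(true)  // if the top Project were removed, the "+ 1" would be lost
{code}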



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34412) RemoveNoopOperators can remove non-trivial projects

2021-02-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34412:


Assignee: Apache Spark  (was: Herman van Hövell)

> RemoveNoopOperators can remove non-trivial projects
> ---
>
> Key: SPARK-34412
> URL: https://issues.apache.org/jira/browse/SPARK-34412
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Herman van Hövell
>Assignee: Apache Spark
>Priority: Major
>
> RemoveNoopOperators can remove non-trivial projects. This can happen when the 
> top project has a non-trivial expression that reuses the expression ID of the 
> child project.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34412) RemoveNoopOperators can remove non-trivial projects

2021-02-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34412:


Assignee: Herman van Hövell  (was: Apache Spark)

> RemoveNoopOperators can remove non-trivial projects
> ---
>
> Key: SPARK-34412
> URL: https://issues.apache.org/jira/browse/SPARK-34412
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Herman van Hövell
>Assignee: Herman van Hövell
>Priority: Major
>
> RemoveNoopOperators can remove non-trivial projects. This can happen when the 
> top project has a non-trivial expression that reuses the expression ID of the 
> child project.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34412) RemoveNoopOperators can remove non-trivial projects

2021-02-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17282090#comment-17282090
 ] 

Apache Spark commented on SPARK-34412:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/31538

> RemoveNoopOperators can remove non-trivial projects
> ---
>
> Key: SPARK-34412
> URL: https://issues.apache.org/jira/browse/SPARK-34412
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Herman van Hövell
>Assignee: Herman van Hövell
>Priority: Major
>
> RemoveNoopOperators can remove non-trivial projects. This can happen when the 
> top project has a non-trivial expression that reuses the expression ID of the 
> child project.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34063) Major slowdown in spark streaming after 6 days

2021-02-09 Thread Calvin Pietersen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17282081#comment-17282081
 ] 

Calvin Pietersen commented on SPARK-34063:
--

So this happened consistently every 6 days, about 8 times consecutively. After 
downgrading to Spark 2.4.4 and EMR 6.0.0 the issue was resolved. I haven't 
compared memory dumps between Spark 2.4.4 and 3.0.0; however, I'd be willing to 
if someone wanted to investigate. 

> Major slowdown in spark streaming after 6 days
> --
>
> Key: SPARK-34063
> URL: https://issues.apache.org/jira/browse/SPARK-34063
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 3.0.0
> Environment: AWS EMR 6.1.0
> Spark 3.0.0
> Kinesis
>Reporter: Calvin Pietersen
>Priority: Major
> Attachments: 2020-12-29.pdf, normal-job, slow-job
>
>
> The Spark streaming application runs at 60s batch intervals.
> The application runs fine, processing batches in around 40s. After ~8600 batches 
> (around 6 days), the application suddenly hits a wall: processing 
> time jumps to 2-2.4 minutes, and it eventually dies with exit code 137. This 
> happens consistently every 6 days, regardless of data. 
> Looking at the application logs, it seems that when the issue begins, tasks 
> are being completed by executors, but the driver takes a while to 
> acknowledge them. I have taken numerous memory dumps of the driver (before it hits 
> the 6 day wall) using *jcmd* and can see that 
> *org.apache.spark.scheduler.AsyncEventQueue* is growing in size despite the 
> fact that the application is able to keep up with batches. I have yet to take 
> a snapshot of the application in the broken state.
>  
>  
>  
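
Since the growing structure is the listener bus's AsyncEventQueue, one thing worth checking while debugging (a hedged suggestion, not a confirmed fix for this report) is whether the event queues are bounded. The capacity key below is an existing Spark setting; the value is only an example.
{code:scala}
import org.apache.spark.SparkConf

// Sketch: bound the listener event queues so a slow listener drops events
// (with a warning in the driver log) instead of letting driver memory grow
// without limit while events pile up.
val conf = new SparkConf()
  .set("spark.scheduler.listenerbus.eventqueue.capacity", "20000")
{code}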



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34104) Allow users to specify a maximum decommissioning time

2021-02-09 Thread Holden Karau (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Holden Karau resolved SPARK-34104.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

> Allow users to specify a maximum decommissioning time
> -
>
> Key: SPARK-34104
> URL: https://issues.apache.org/jira/browse/SPARK-34104
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0, 3.2.0, 3.1.1
>Reporter: Holden Karau
>Assignee: Holden Karau
>Priority: Major
> Fix For: 3.2.0
>
>
> We currently have the ability for users to set the predicted time at which the 
> cluster manager or cloud provider will terminate a decommissioning executor, 
> but for nodes where Spark itself is triggering decommissioning we should 
> add the ability for users to specify a maximum time we want to allow the 
> executor to decommission.
>  
> This is especially important if we start to trigger decommissioning in more 
> places (like with excluded executors that are found to be flaky and may or 
> may not be able to decommission successfully).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34105) In addition to killing exlcuded/flakey executors which should support decommissioning

2021-02-09 Thread Holden Karau (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Holden Karau resolved SPARK-34105.
--
Fix Version/s: 3.2.0
 Assignee: Holden Karau
   Resolution: Fixed

> In addition to killing exlcuded/flakey executors which should support 
> decommissioning
> -
>
> Key: SPARK-34105
> URL: https://issues.apache.org/jira/browse/SPARK-34105
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Holden Karau
>Assignee: Holden Karau
>Priority: Major
> Fix For: 3.2.0
>
>
> Decommissioning will give the executor a chance to migrate its files to a 
> more stable node.
>  
> Note: we want SPARK-34104 to be integrated as well so that flaky executors 
> which cannot decommission are eventually killed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34412) RemoveNoopOperators can remove non-trivial projects

2021-02-09 Thread Jira
Herman van Hövell created SPARK-34412:
-

 Summary: RemoveNoopOperators can remove non-trivial projects
 Key: SPARK-34412
 URL: https://issues.apache.org/jira/browse/SPARK-34412
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.1
Reporter: Herman van Hövell
Assignee: Herman van Hövell


RemoveNoopOperators can remove non-trivial projects. This can happen when the 
top project has a non-trivial expression that reuses the expression ID of the child 
project.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34372) Speculation results in broken CSV files in Amazon S3

2021-02-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-34372.
--
Resolution: Invalid

> Speculation results in broken CSV files in Amazon S3
> 
>
> Key: SPARK-34372
> URL: https://issues.apache.org/jira/browse/SPARK-34372
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.4.7
> Environment: Amazon EMR with AMI version 5.32.0
>Reporter: Daehee Han
>Priority: Minor
>  Labels: csv, s3, spark, speculation
>
> Hi, we've been experiencing some rows getting corrupted while partitioned CSV 
> files are written to Amazon S3. Some records were found broken without any 
> error on Spark. Digging into the root cause, we found out that Spark speculation 
> re-ran a partition that was being uploaded slowly, and the second attempt ended 
> up uploading only a part of the partition, leaving broken data in S3.
> Here are the stack traces we've found. There are two executors involved - A: the 
> first executor, which tried to upload the file but took much longer than 
> the other executor (yet still succeeded), which made Spark speculation cut in and 
> kick off another executor B. Executor B started to upload the file too, but 
> was interrupted during uploading (killed: another attempt succeeded), and 
> ended up uploading only a part of the whole file. You can see in the logs that the 
> file executor A uploaded (8461990 bytes originally) was overwritten by 
> executor B (which uploaded only 3145728 bytes).
>  
> Executor A:
> {quote}21/01/28 17:22:21 INFO Executor: Running task 426.0 in stage 45.0 (TID 
> 13201) 
>  21/01/28 17:22:21 INFO ShuffleBlockFetcherIterator: Getting 470 non-empty 
> blocks including 10 local blocks and 460 remote blocks 
>  21/01/28 17:22:21 INFO ShuffleBlockFetcherIterator: Started 46 remote 
> fetches in 18 ms 
>  21/01/28 17:22:21 INFO FileOutputCommitter: File Output Committer Algorithm 
> version is 2 
>  21/01/28 17:22:21 INFO FileOutputCommitter: FileOutputCommitter skip cleanup 
> _temporary folders under output directory:false, ignore cleanup failures: 
> true 
>  21/01/28 17:22:21 INFO DirectFileOutputCommitter: Direct Write: ENABLED 
>  21/01/28 17:22:21 INFO SQLConfCommitterProvider: Using output committer class
>  21/01/28 17:22:21 INFO  INFO CSEMultipartUploadOutputStream: close 
> closed:false 
> s3://\{obfuscated}/part-00426-7d5677a9-f740-4db6-9d3c-dc589d75e965-c000.csv
>  21/01/28 17:22:31 INFO DefaultMultipartUploadDispatcher: Completed multipart 
> upload of 1 parts 8461990 bytes 
>  21/01/28 17:22:31 INFO CSEMultipartUploadOutputStream: Finished uploading 
> \{obfuscated}/part-00426-7d5677a9-f740-4db6-9d3c-dc589d75e965-c000.csv. 
> Elapsed seconds: 10. 
>  21/01/28 17:22:31 INFO SparkHadoopMapRedUtil: No need to commit output of 
> task because needsTaskCommit=false: 
> attempt_20210128172219_0045_m_000426_13201 
>  21/01/28 17:22:31 INFO Executor: Finished task 426.0 in stage 45.0 (TID 
> 13201). 8782 bytes result sent to driver
> {quote}
> Executor B:
> {quote}21/01/28 17:22:31 INFO CoarseGrainedExecutorBackend: Got assigned task 
> 13245 21/01/28 17:22:31 INFO Executor: Running task 426.1 in stage 45.0 (TID 
> 13245) 
>  21/01/28 17:22:31 INFO ShuffleBlockFetcherIterator: Getting 470 non-empty 
> blocks including 11 local blocks and 459 remote blocks 
>  21/01/28 17:22:31 INFO ShuffleBlockFetcherIterator: Started 46 remote 
> fetches in 2 ms 
>  21/01/28 17:22:31 INFO FileOutputCommitter: File Output Committer Algorithm 
> version is 2 
>  21/01/28 17:22:31 INFO FileOutputCommitter: FileOutputCommitter skip cleanup 
> _temporary folders under output directory:false, ignore cleanup failures: 
> true 
>  21/01/28 17:22:31 INFO DirectFileOutputCommitter: Direct Write: ENABLED 
>  21/01/28 17:22:31 INFO SQLConfCommitterProvider: Using output committer 
> class org.apache.hadoop.mapreduce.lib.output.DirectFileOutputCommitter 
>  21/01/28 17:22:31 INFO Executor: Executor is trying to kill task 426.1 in 
> stage 45.0 (TID 13245), reason: another attempt succeeded 
>  21/01/28 17:22:31 INFO CSEMultipartUploadOutputStream: close closed:false 
> s3://\{obfuscated}/part-00426-7d5677a9-f740-4db6-9d3c-dc589d75e965-c000.csv 
>  21/01/28 17:22:32 INFO DefaultMultipartUploadDispatcher: Completed multipart 
> upload of 1 parts 3145728 bytes 
>  21/01/28 17:22:32 INFO CSEMultipartUploadOutputStream: Finished uploading 
> \{obfuscated}/part-00426-7d5677a9-f740-4db6-9d3c-dc589d75e965-c000.csv. 
> Elapsed seconds: 0. 
>  21/01/28 17:22:32 ERROR Utils: Aborting task 
> com.univocity.parsers.common.TextWritingException: Error writing row. 
> Internal state when error was thrown: recordCount=18449, recordData=[
> Unknown macro: \{obfuscated}
> ] at 
> com.univocity.parsers.common.AbstractWriter.throwExcept
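
Until the committer/speculation interaction is sorted out, a common workaround (a hedged suggestion, not an endorsed fix of the committer behaviour described above) is simply to turn speculative execution off for jobs that write directly to S3 with a direct output committer:
{code:scala}
import org.apache.spark.SparkConf

// Sketch: disable speculative execution so a killed speculative attempt cannot
// overwrite a completed upload with a partial one. spark.speculation is an
// existing setting (false by default unless the platform enables it).
val conf = new SparkConf()
  .set("spark.speculation", "false")
{code}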

[jira] [Resolved] (SPARK-34372) Speculation results in broken CSV files in Amazon S3

2021-02-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-34372.
--
Resolution: Won't Fix

> Speculation results in broken CSV files in Amazon S3
> 
>
> Key: SPARK-34372
> URL: https://issues.apache.org/jira/browse/SPARK-34372
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.4.7
> Environment: Amazon EMR with AMI version 5.32.0
>Reporter: Daehee Han
>Priority: Minor
>  Labels: csv, s3, spark, speculation
>
> Hi, we've been experiencing some rows getting corrupted while partitioned CSV 
> files are written to Amazon S3. Some records were found broken without any 
> error on Spark. Digging into the root cause, we found out that Spark speculation 
> re-ran a partition that was being uploaded slowly, and the second attempt ended 
> up uploading only a part of the partition, leaving broken data in S3.
> Here are the stack traces we've found. There are two executors involved - A: the 
> first executor, which tried to upload the file but took much longer than 
> the other executor (yet still succeeded), which made Spark speculation cut in and 
> kick off another executor B. Executor B started to upload the file too, but 
> was interrupted during uploading (killed: another attempt succeeded), and 
> ended up uploading only a part of the whole file. You can see in the logs that the 
> file executor A uploaded (8461990 bytes originally) was overwritten by 
> executor B (which uploaded only 3145728 bytes).
>  
> Executor A:
> {quote}21/01/28 17:22:21 INFO Executor: Running task 426.0 in stage 45.0 (TID 
> 13201) 
>  21/01/28 17:22:21 INFO ShuffleBlockFetcherIterator: Getting 470 non-empty 
> blocks including 10 local blocks and 460 remote blocks 
>  21/01/28 17:22:21 INFO ShuffleBlockFetcherIterator: Started 46 remote 
> fetches in 18 ms 
>  21/01/28 17:22:21 INFO FileOutputCommitter: File Output Committer Algorithm 
> version is 2 
>  21/01/28 17:22:21 INFO FileOutputCommitter: FileOutputCommitter skip cleanup 
> _temporary folders under output directory:false, ignore cleanup failures: 
> true 
>  21/01/28 17:22:21 INFO DirectFileOutputCommitter: Direct Write: ENABLED 
>  21/01/28 17:22:21 INFO SQLConfCommitterProvider: Using output committer class
>  21/01/28 17:22:21 INFO  INFO CSEMultipartUploadOutputStream: close 
> closed:false 
> s3://\{obfuscated}/part-00426-7d5677a9-f740-4db6-9d3c-dc589d75e965-c000.csv
>  21/01/28 17:22:31 INFO DefaultMultipartUploadDispatcher: Completed multipart 
> upload of 1 parts 8461990 bytes 
>  21/01/28 17:22:31 INFO CSEMultipartUploadOutputStream: Finished uploading 
> \{obfuscated}/part-00426-7d5677a9-f740-4db6-9d3c-dc589d75e965-c000.csv. 
> Elapsed seconds: 10. 
>  21/01/28 17:22:31 INFO SparkHadoopMapRedUtil: No need to commit output of 
> task because needsTaskCommit=false: 
> attempt_20210128172219_0045_m_000426_13201 
>  21/01/28 17:22:31 INFO Executor: Finished task 426.0 in stage 45.0 (TID 
> 13201). 8782 bytes result sent to driver
> {quote}
> Executor B:
> {quote}21/01/28 17:22:31 INFO CoarseGrainedExecutorBackend: Got assigned task 
> 13245 21/01/28 17:22:31 INFO Executor: Running task 426.1 in stage 45.0 (TID 
> 13245) 
>  21/01/28 17:22:31 INFO ShuffleBlockFetcherIterator: Getting 470 non-empty 
> blocks including 11 local blocks and 459 remote blocks 
>  21/01/28 17:22:31 INFO ShuffleBlockFetcherIterator: Started 46 remote 
> fetches in 2 ms 
>  21/01/28 17:22:31 INFO FileOutputCommitter: File Output Committer Algorithm 
> version is 2 
>  21/01/28 17:22:31 INFO FileOutputCommitter: FileOutputCommitter skip cleanup 
> _temporary folders under output directory:false, ignore cleanup failures: 
> true 
>  21/01/28 17:22:31 INFO DirectFileOutputCommitter: Direct Write: ENABLED 
>  21/01/28 17:22:31 INFO SQLConfCommitterProvider: Using output committer 
> class org.apache.hadoop.mapreduce.lib.output.DirectFileOutputCommitter 
>  21/01/28 17:22:31 INFO Executor: Executor is trying to kill task 426.1 in 
> stage 45.0 (TID 13245), reason: another attempt succeeded 
>  21/01/28 17:22:31 INFO CSEMultipartUploadOutputStream: close closed:false 
> s3://\{obfuscated}/part-00426-7d5677a9-f740-4db6-9d3c-dc589d75e965-c000.csv 
>  21/01/28 17:22:32 INFO DefaultMultipartUploadDispatcher: Completed multipart 
> upload of 1 parts 3145728 bytes 
>  21/01/28 17:22:32 INFO CSEMultipartUploadOutputStream: Finished uploading 
> \{obfuscated}/part-00426-7d5677a9-f740-4db6-9d3c-dc589d75e965-c000.csv. 
> Elapsed seconds: 0. 
>  21/01/28 17:22:32 ERROR Utils: Aborting task 
> com.univocity.parsers.common.TextWritingException: Error writing row. 
> Internal state when error was thrown: recordCount=18449, recordData=[
> Unknown macro: \{obfuscated}
> ] at 
> com.univocity.parsers.common.AbstractWriter.throwExce

[jira] [Reopened] (SPARK-34372) Speculation results in broken CSV files in Amazon S3

2021-02-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-34372:
--

> Speculation results in broken CSV files in Amazon S3
> 
>
> Key: SPARK-34372
> URL: https://issues.apache.org/jira/browse/SPARK-34372
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.4.7
> Environment: Amazon EMR with AMI version 5.32.0
>Reporter: Daehee Han
>Priority: Minor
>  Labels: csv, s3, spark, speculation
>
> Hi, we've been experiencing some rows getting corrupted while partitioned CSV 
> files are written to Amazon S3. Some records were found broken without any 
> error on Spark. Digging into the root cause, we found out that Spark speculation 
> re-ran a partition that was being uploaded slowly, and the second attempt ended 
> up uploading only a part of the partition, leaving broken data in S3.
> Here are the stack traces we've found. There are two executors involved - A: the 
> first executor, which tried to upload the file but took much longer than 
> the other executor (yet still succeeded), which made Spark speculation cut in and 
> kick off another executor B. Executor B started to upload the file too, but 
> was interrupted during uploading (killed: another attempt succeeded), and 
> ended up uploading only a part of the whole file. You can see in the logs that the 
> file executor A uploaded (8461990 bytes originally) was overwritten by 
> executor B (which uploaded only 3145728 bytes).
>  
> Executor A:
> {quote}21/01/28 17:22:21 INFO Executor: Running task 426.0 in stage 45.0 (TID 
> 13201) 
>  21/01/28 17:22:21 INFO ShuffleBlockFetcherIterator: Getting 470 non-empty 
> blocks including 10 local blocks and 460 remote blocks 
>  21/01/28 17:22:21 INFO ShuffleBlockFetcherIterator: Started 46 remote 
> fetches in 18 ms 
>  21/01/28 17:22:21 INFO FileOutputCommitter: File Output Committer Algorithm 
> version is 2 
>  21/01/28 17:22:21 INFO FileOutputCommitter: FileOutputCommitter skip cleanup 
> _temporary folders under output directory:false, ignore cleanup failures: 
> true 
>  21/01/28 17:22:21 INFO DirectFileOutputCommitter: Direct Write: ENABLED 
>  21/01/28 17:22:21 INFO SQLConfCommitterProvider: Using output committer class
>  21/01/28 17:22:21 INFO  INFO CSEMultipartUploadOutputStream: close 
> closed:false 
> s3://\{obfuscated}/part-00426-7d5677a9-f740-4db6-9d3c-dc589d75e965-c000.csv
>  21/01/28 17:22:31 INFO DefaultMultipartUploadDispatcher: Completed multipart 
> upload of 1 parts 8461990 bytes 
>  21/01/28 17:22:31 INFO CSEMultipartUploadOutputStream: Finished uploading 
> \{obfuscated}/part-00426-7d5677a9-f740-4db6-9d3c-dc589d75e965-c000.csv. 
> Elapsed seconds: 10. 
>  21/01/28 17:22:31 INFO SparkHadoopMapRedUtil: No need to commit output of 
> task because needsTaskCommit=false: 
> attempt_20210128172219_0045_m_000426_13201 
>  21/01/28 17:22:31 INFO Executor: Finished task 426.0 in stage 45.0 (TID 
> 13201). 8782 bytes result sent to driver
> {quote}
> Executor B:
> {quote}21/01/28 17:22:31 INFO CoarseGrainedExecutorBackend: Got assigned task 
> 13245 21/01/28 17:22:31 INFO Executor: Running task 426.1 in stage 45.0 (TID 
> 13245) 
>  21/01/28 17:22:31 INFO ShuffleBlockFetcherIterator: Getting 470 non-empty 
> blocks including 11 local blocks and 459 remote blocks 
>  21/01/28 17:22:31 INFO ShuffleBlockFetcherIterator: Started 46 remote 
> fetches in 2 ms 
>  21/01/28 17:22:31 INFO FileOutputCommitter: File Output Committer Algorithm 
> version is 2 
>  21/01/28 17:22:31 INFO FileOutputCommitter: FileOutputCommitter skip cleanup 
> _temporary folders under output directory:false, ignore cleanup failures: 
> true 
>  21/01/28 17:22:31 INFO DirectFileOutputCommitter: Direct Write: ENABLED 
>  21/01/28 17:22:31 INFO SQLConfCommitterProvider: Using output committer 
> class org.apache.hadoop.mapreduce.lib.output.DirectFileOutputCommitter 
>  21/01/28 17:22:31 INFO Executor: Executor is trying to kill task 426.1 in 
> stage 45.0 (TID 13245), reason: another attempt succeeded 
>  21/01/28 17:22:31 INFO CSEMultipartUploadOutputStream: close closed:false 
> s3://\{obfuscated}/part-00426-7d5677a9-f740-4db6-9d3c-dc589d75e965-c000.csv 
>  21/01/28 17:22:32 INFO DefaultMultipartUploadDispatcher: Completed multipart 
> upload of 1 parts 3145728 bytes 
>  21/01/28 17:22:32 INFO CSEMultipartUploadOutputStream: Finished uploading 
> \{obfuscated}/part-00426-7d5677a9-f740-4db6-9d3c-dc589d75e965-c000.csv. 
> Elapsed seconds: 0. 
>  21/01/28 17:22:32 ERROR Utils: Aborting task 
> com.univocity.parsers.common.TextWritingException: Error writing row. 
> Internal state when error was thrown: recordCount=18449, recordData=[
> Unknown macro: \{obfuscated}
> ] at 
> com.univocity.parsers.common.AbstractWriter.throwExceptionAndClose(AbstractWrit

[jira] [Created] (SPARK-34411) Remove python2 support

2021-02-09 Thread Jira
Björn Boschman created SPARK-34411:
--

 Summary: Remove python2 support
 Key: SPARK-34411
 URL: https://issues.apache.org/jira/browse/SPARK-34411
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes, PySpark
Affects Versions: 3.0.1, 3.0.0
Reporter: Björn Boschman


Not sure about PySpark itself.

The provided k8s Dockerfile still installs Python 2.7, which has been EOL since 
January 1st, 2020.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27658) Catalog API to load functions

2021-02-09 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17281984#comment-17281984
 ] 

Thomas Graves commented on SPARK-27658:
---

Can we link the SPIP to this?

> Catalog API to load functions
> -
>
> Key: SPARK-27658
> URL: https://issues.apache.org/jira/browse/SPARK-27658
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Ryan Blue
>Priority: Major
>
> SPARK-24252 added an API that catalog plugins can implement to expose table 
> operations. Catalogs should also be able to provide function implementations 
> to Spark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33716) Decommissioning Race Condition during Pod Snapshot

2021-02-09 Thread Holden Karau (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Holden Karau updated SPARK-33716:
-
Fix Version/s: (was: 3.1.0)
   3.2.0

> Decommissioning Race Condition during Pod Snapshot
> --
>
> Key: SPARK-33716
> URL: https://issues.apache.org/jira/browse/SPARK-33716
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.1.0, 3.2.0
>Reporter: Holden Karau
>Assignee: Holden Karau
>Priority: Major
> Fix For: 3.2.0
>
>
> Some versions of Kubernetes may create the deletion timestamp field before 
> changing the pod status to terminating, so a decommissioning node may have a 
> deletion timestamp and a state of running. Depending on when the K8s snapshot 
> comes back, this can cause a race condition in which Spark believes the pod 
> has been deleted before it actually has been.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34209) Allow multiple namespaces with session catalog

2021-02-09 Thread Holden Karau (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Holden Karau resolved SPARK-34209.
--
Fix Version/s: 3.2.0
 Assignee: Holden Karau
   Resolution: Fixed

> Allow multiple namespaces with session catalog
> --
>
> Key: SPARK-34209
> URL: https://issues.apache.org/jira/browse/SPARK-34209
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0, 3.2.0, 3.1.1
>Reporter: Holden Karau
>Assignee: Holden Karau
>Priority: Trivial
> Fix For: 3.2.0
>
>
> SPARK-30885 removed the ability for tables in session catalogs that are queried 
> with SQL to have multiple namespaces. This restriction seems to have been added 
> as a follow-up, not as part of the core change. We should explore whether it 
> can be relaxed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34334) ExecutorPodsAllocator fails to identify some excess requests during downscaling

2021-02-09 Thread Holden Karau (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Holden Karau updated SPARK-34334:
-
Affects Version/s: 3.1.2
   3.1.1
   3.0.0
   3.0.1

> ExecutorPodsAllocator fails to identify some excess requests during 
> downscaling
> ---
>
> Key: SPARK-34334
> URL: https://issues.apache.org/jira/browse/SPARK-34334
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.0.0, 3.0.1, 3.1.0, 3.2.0, 3.1.1, 3.1.2
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Minor
> Fix For: 3.2.0, 3.1.2
>
>
> During downscaling there are two kinds of POD requests which can be identified 
> as excess requests by dynamic allocation and therefore lead to POD deletion:
>  * timed-out newly created POD requests (which have not even reached the Pod 
> Pending state yet) 
>  * timed-out pending POD requests
> The current implementation fails to delete a timed-out pending POD request 
> when there are not enough timed-out newly created POD requests to delete but 
> there are some non-timed-out ones.
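
A purely illustrative model of the classification described above, to make the failure mode concrete; this is not the ExecutorPodsAllocator code, and all names here are invented for the sketch.
{code:scala}
// Model of the two request categories and the excess-request selection.
sealed trait PodRequest { def creationTime: Long }
case class NewlyCreated(creationTime: Long) extends PodRequest // not yet Pending
case class PendingReq(creationTime: Long) extends PodRequest   // reached Pending

// Correct selection considers *all* timed-out requests, regardless of category;
// the bug described above is missing timed-out pending requests when enough
// non-timed-out newly created requests exist.
def excess(requests: Seq[PodRequest], now: Long, timeoutMs: Long, toDelete: Int): Seq[PodRequest] =
  requests.filter(r => now - r.creationTime > timeoutMs).take(toDelete)
{code}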



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34334) ExecutorPodsAllocator fails to identify some excess requests during downscaling

2021-02-09 Thread Holden Karau (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Holden Karau resolved SPARK-34334.
--
Fix Version/s: 3.1.2
   3.2.0
 Assignee: Attila Zsolt Piros
   Resolution: Fixed

> ExecutorPodsAllocator fails to identify some excess requests during 
> downscaling
> ---
>
> Key: SPARK-34334
> URL: https://issues.apache.org/jira/browse/SPARK-34334
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.1.0, 3.2.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Minor
> Fix For: 3.2.0, 3.1.2
>
>
> During downscaling there are two kinds of POD requests which can be identified 
> as excess requests by dynamic allocation and therefore lead to POD deletion:
>  * timed-out newly created POD requests (which have not even reached the Pod 
> Pending state yet) 
>  * timed-out pending POD requests
> The current implementation fails to delete a timed-out pending POD request 
> when there are not enough timed-out newly created POD requests to delete but 
> there are some non-timed-out ones.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34334) ExecutorPodsAllocator fails to identify some excess requests during downscaling

2021-02-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-34334:
--
Parent: SPARK-33005
Issue Type: Sub-task  (was: Bug)

> ExecutorPodsAllocator fails to identify some excess requests during 
> downscaling
> ---
>
> Key: SPARK-34334
> URL: https://issues.apache.org/jira/browse/SPARK-34334
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.1.0, 3.2.0
>Reporter: Attila Zsolt Piros
>Priority: Minor
>
> During downscaling there are two kinds of POD requests which can be identified 
> as excess requests by dynamic allocation and therefore lead to POD deletion:
>  * timed-out newly created POD requests (which have not even reached the Pod 
> Pending state yet) 
>  * timed-out pending POD requests
> The current implementation fails to delete a timed-out pending POD request 
> when there are not enough timed-out newly created POD requests to delete but 
> there are some non-timed-out ones.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28344) fail the query if detect ambiguous self join

2021-02-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17281852#comment-17281852
 ] 

Apache Spark commented on SPARK-28344:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/27417

> fail the query if detect ambiguous self join
> 
>
> Key: SPARK-28344
> URL: https://issues.apache.org/jira/browse/SPARK-28344
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.2, 1.6.3, 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.4, 
> 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: correctness
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28344) fail the query if detect ambiguous self join

2021-02-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17281851#comment-17281851
 ] 

Apache Spark commented on SPARK-28344:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/27417

> fail the query if detect ambiguous self join
> 
>
> Key: SPARK-28344
> URL: https://issues.apache.org/jira/browse/SPARK-28344
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.2, 1.6.3, 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.4, 
> 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: correctness
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27816) make TreeNode tag type safe

2021-02-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17281850#comment-17281850
 ] 

Apache Spark commented on SPARK-27816:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/27417

> make TreeNode tag type safe
> ---
>
> Key: SPARK-27816
> URL: https://issues.apache.org/jira/browse/SPARK-27816
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27816) make TreeNode tag type safe

2021-02-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17281849#comment-17281849
 ] 

Apache Spark commented on SPARK-27816:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/27417

> make TreeNode tag type safe
> ---
>
> Key: SPARK-27816
> URL: https://issues.apache.org/jira/browse/SPARK-27816
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-34392) Invalid ID for offset-based ZoneId since Spark 3.0

2021-02-09 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17281838#comment-17281838
 ] 

Maxim Gekk edited comment on SPARK-34392 at 2/9/21, 3:26 PM:
-

The "GMT+8:00" string is unsupported format in 3.0, see docs for the 
to_utc_timestamp() function 
(https://github.com/apache/spark/blob/30468a901577e82c855fbc4cb78e1b869facb44c/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3397-L3402):
{code:scala}
@param tz A string detailing the time zone ID that the input should be adjusted to. It should
  be in the format of either region-based zone IDs or zone offsets. Region IDs must
  have the form 'area/city', such as 'America/Los_Angeles'. Zone offsets must be in
  the format '(+|-)HH:mm', for example '-08:00' or '+01:00'. Also 'UTC' and 'Z' are
  supported as aliases of '+00:00'. Other short names are not recommended to use
  because they can be ambiguous.
{code}
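
As a minimal sketch of the fix on 3.x (assuming a plain spark-shell session; the 
timestamp value is copied from the reproduction below), the documented offset form 
replaces the rejected "GMT+8:00":
{code:scala}
// "+08:00" follows the documented '(+|-)HH:mm' offset form; a region-based ID
// such as "Asia/Shanghai" would also be accepted.
spark.sql("""SELECT to_utc_timestamp("2020-02-07 16:00:00", "+08:00")""").show()
{code}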


was (Author: maxgekk):
The "GMT+8:00" string is unsupported format in 3.0, see docs for the 
to_utc_timestamp() function:
{code:scala}
   * @param tz A string detailing the time zone ID that the input should be 
adjusted to. It should
   *   be in the format of either region-based zone IDs or zone 
offsets. Region IDs must
   *   have the form 'area/city', such as 'America/Los_Angeles'. Zone 
offsets must be in
   *   the format '(+|-)HH:mm', for example '-08:00' or '+01:00'. Also 
'UTC' and 'Z' are
   *   supported as aliases of '+00:00'. Other short names are not 
recommended to use
   *   because they can be ambiguous.
{code}

> Invalid ID for offset-based ZoneId since Spark 3.0
> --
>
> Key: SPARK-34392
> URL: https://issues.apache.org/jira/browse/SPARK-34392
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Yuming Wang
>Priority: Major
>
> How to reproduce this issue:
> {code:sql}
> select to_utc_timestamp("2020-02-07 16:00:00", "GMT+8:00");
> {code}
> Spark 2.4:
> {noformat}
> spark-sql> select to_utc_timestamp("2020-02-07 16:00:00", "GMT+8:00");
> 2020-02-07 08:00:00
> Time taken: 0.089 seconds, Fetched 1 row(s)
> {noformat}
> Spark 3.x:
> {noformat}
> spark-sql> select to_utc_timestamp("2020-02-07 16:00:00", "GMT+8:00");
> 21/02/07 01:24:32 ERROR SparkSQLDriver: Failed in [select 
> to_utc_timestamp("2020-02-07 16:00:00", "GMT+8:00")]
> java.time.DateTimeException: Invalid ID for offset-based ZoneId: GMT+8:00
>   at java.time.ZoneId.ofWithPrefix(ZoneId.java:437)
>   at java.time.ZoneId.of(ZoneId.java:407)
>   at java.time.ZoneId.of(ZoneId.java:359)
>   at java.time.ZoneId.of(ZoneId.java:315)
>   at 
> org.apache.spark.sql.catalyst.util.DateTimeUtils$.getZoneId(DateTimeUtils.scala:53)
>   at 
> org.apache.spark.sql.catalyst.util.DateTimeUtils$.toUTCTime(DateTimeUtils.scala:814)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34392) Invalid ID for offset-based ZoneId since Spark 3.0

2021-02-09 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17281838#comment-17281838
 ] 

Maxim Gekk commented on SPARK-34392:


The "GMT+8:00" string is unsupported format in 3.0, see docs for the 
to_utc_timestamp() function:
{code:scala}
   * @param tz A string detailing the time zone ID that the input should be adjusted to. It should
   *   be in the format of either region-based zone IDs or zone offsets. Region IDs must
   *   have the form 'area/city', such as 'America/Los_Angeles'. Zone offsets must be in
   *   the format '(+|-)HH:mm', for example '-08:00' or '+01:00'. Also 'UTC' and 'Z' are
   *   supported as aliases of '+00:00'. Other short names are not recommended to use
   *   because they can be ambiguous.
{code}

> Invalid ID for offset-based ZoneId since Spark 3.0
> --
>
> Key: SPARK-34392
> URL: https://issues.apache.org/jira/browse/SPARK-34392
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Yuming Wang
>Priority: Major
>
> How to reproduce this issue:
> {code:sql}
> select to_utc_timestamp("2020-02-07 16:00:00", "GMT+8:00");
> {code}
> Spark 2.4:
> {noformat}
> spark-sql> select to_utc_timestamp("2020-02-07 16:00:00", "GMT+8:00");
> 2020-02-07 08:00:00
> Time taken: 0.089 seconds, Fetched 1 row(s)
> {noformat}
> Spark 3.x:
> {noformat}
> spark-sql> select to_utc_timestamp("2020-02-07 16:00:00", "GMT+8:00");
> 21/02/07 01:24:32 ERROR SparkSQLDriver: Failed in [select 
> to_utc_timestamp("2020-02-07 16:00:00", "GMT+8:00")]
> java.time.DateTimeException: Invalid ID for offset-based ZoneId: GMT+8:00
>   at java.time.ZoneId.ofWithPrefix(ZoneId.java:437)
>   at java.time.ZoneId.of(ZoneId.java:407)
>   at java.time.ZoneId.of(ZoneId.java:359)
>   at java.time.ZoneId.of(ZoneId.java:315)
>   at 
> org.apache.spark.sql.catalyst.util.DateTimeUtils$.getZoneId(DateTimeUtils.scala:53)
>   at 
> org.apache.spark.sql.catalyst.util.DateTimeUtils$.toUTCTime(DateTimeUtils.scala:814)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34372) Speculation results in broken CSV files in Amazon S3

2021-02-09 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17281815#comment-17281815
 ] 

Attila Zsolt Piros commented on SPARK-34372:


A direct output committer combined with speculation can lead to this kind of 
problem, and even to data loss.

Please check this out:
 https://issues.apache.org/jira/browse/SPARK-10063

Although DirectParquetOutputCommitter has been removed, you are using 
DirectFileOutputCommitter.

There should be a warning in the logs:
[https://github.com/apache/spark/blob/18b30107adb37d3c7a767a20cc02813f0fdb86da/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L1050-L1057]
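
As a hedged sketch of the usual mitigation (the session setup is illustrative, and 
switching the committer itself back to the default FileOutputCommitter is 
EMR-specific and not shown here), disabling speculation removes the duplicate 
attempt that overwrites the object:
{code:scala}
import org.apache.spark.sql.SparkSession

// With a direct (in-place) committer, two task attempts can race to write the same
// S3 key; turning speculation off avoids the second attempt seen in the logs below.
val spark = SparkSession.builder()
  .appName("csv-write-without-speculation")   // illustrative name
  .config("spark.speculation", "false")
  .getOrCreate()
{code}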

> Speculation results in broken CSV files in Amazon S3
> 
>
> Key: SPARK-34372
> URL: https://issues.apache.org/jira/browse/SPARK-34372
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.4.7
> Environment: Amazon EMR with AMI version 5.32.0
>Reporter: Daehee Han
>Priority: Minor
>  Labels: csv, s3, spark, speculation
>
> Hi, we've been experiencing rows getting corrupted while partitioned CSV files 
> were written to Amazon S3. Some records were found broken without any error 
> reported by Spark. Digging into the root cause, we found that Spark speculation 
> re-attempted a partition that was being uploaded slowly and ended up uploading 
> only a part of the partition, leaving broken data in S3.
> Here are the stack traces we've found. There are two executors involved. A is 
> the first executor, which tried to upload the file but took much longer than the 
> other executors (and still succeeded); this made Spark speculation cut in and 
> kick off another executor, B. Executor B started to upload the file too, but 
> was interrupted during the upload (killed: another attempt succeeded) and 
> ended up uploading only a part of the whole file. You can see in the log that 
> the file executor A uploaded (8461990 bytes originally) was overwritten by 
> executor B (which uploaded only 3145728 bytes).
>  
> Executor A:
> {quote}21/01/28 17:22:21 INFO Executor: Running task 426.0 in stage 45.0 (TID 
> 13201) 
>  21/01/28 17:22:21 INFO ShuffleBlockFetcherIterator: Getting 470 non-empty 
> blocks including 10 local blocks and 460 remote blocks 
>  21/01/28 17:22:21 INFO ShuffleBlockFetcherIterator: Started 46 remote 
> fetches in 18 ms 
>  21/01/28 17:22:21 INFO FileOutputCommitter: File Output Committer Algorithm 
> version is 2 
>  21/01/28 17:22:21 INFO FileOutputCommitter: FileOutputCommitter skip cleanup 
> _temporary folders under output directory:false, ignore cleanup failures: 
> true 
>  21/01/28 17:22:21 INFO DirectFileOutputCommitter: Direct Write: ENABLED 
>  21/01/28 17:22:21 INFO SQLConfCommitterProvider: Using output committer class
>  21/01/28 17:22:21 INFO  INFO CSEMultipartUploadOutputStream: close 
> closed:false 
> s3://\{obfuscated}/part-00426-7d5677a9-f740-4db6-9d3c-dc589d75e965-c000.csv
>  21/01/28 17:22:31 INFO DefaultMultipartUploadDispatcher: Completed multipart 
> upload of 1 parts 8461990 bytes 
>  21/01/28 17:22:31 INFO CSEMultipartUploadOutputStream: Finished uploading 
> \{obfuscated}/part-00426-7d5677a9-f740-4db6-9d3c-dc589d75e965-c000.csv. 
> Elapsed seconds: 10. 
>  21/01/28 17:22:31 INFO SparkHadoopMapRedUtil: No need to commit output of 
> task because needsTaskCommit=false: 
> attempt_20210128172219_0045_m_000426_13201 
>  21/01/28 17:22:31 INFO Executor: Finished task 426.0 in stage 45.0 (TID 
> 13201). 8782 bytes result sent to driver
> {quote}
> Executor B:
> {quote}21/01/28 17:22:31 INFO CoarseGrainedExecutorBackend: Got assigned task 
> 13245 21/01/28 17:22:31 INFO Executor: Running task 426.1 in stage 45.0 (TID 
> 13245) 
>  21/01/28 17:22:31 INFO ShuffleBlockFetcherIterator: Getting 470 non-empty 
> blocks including 11 local blocks and 459 remote blocks 
>  21/01/28 17:22:31 INFO ShuffleBlockFetcherIterator: Started 46 remote 
> fetches in 2 ms 
>  21/01/28 17:22:31 INFO FileOutputCommitter: File Output Committer Algorithm 
> version is 2 
>  21/01/28 17:22:31 INFO FileOutputCommitter: FileOutputCommitter skip cleanup 
> _temporary folders under output directory:false, ignore cleanup failures: 
> true 
>  21/01/28 17:22:31 INFO DirectFileOutputCommitter: Direct Write: ENABLED 
>  21/01/28 17:22:31 INFO SQLConfCommitterProvider: Using output committer 
> class org.apache.hadoop.mapreduce.lib.output.DirectFileOutputCommitter 
>  21/01/28 17:22:31 INFO Executor: Executor is trying to kill task 426.1 in 
> stage 45.0 (TID 13245), reason: another attempt succeeded 
>  21/01/28 17:22:31 INFO CSEMultipartUploadOutputStream: close closed:false 
> s3://\{obfuscated}/part-00426-7d5677a9-f740-4db6-9d3c-dc589d75e965-c000.csv 
>  21/01/28 17:22:32 INFO DefaultMultipartUploadDispatcher: Completed multipart

[jira] [Commented] (SPARK-34408) Refactor spark.udf.register to share the same path to generate UDF instance

2021-02-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17281784#comment-17281784
 ] 

Apache Spark commented on SPARK-34408:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/31537

> Refactor spark.udf.register to share the same path to generate UDF instance
> ---
>
> Key: SPARK-34408
> URL: https://issues.apache.org/jira/browse/SPARK-34408
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> Currently we have two places that create UDFs: one is _create_udf, and the 
> other is {{UserDefinedFunction(...)}}. We should combine the code paths and 
> have a single place that creates the UDF instance.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34408) Refactor spark.udf.register to share the same path to generate UDF instance

2021-02-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34408:


Assignee: (was: Apache Spark)

> Refactor spark.udf.register to share the same path to generate UDF instance
> ---
>
> Key: SPARK-34408
> URL: https://issues.apache.org/jira/browse/SPARK-34408
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> Currently we have two places that create UDFs: one is _create_udf, and the 
> other is {{UserDefinedFunction(...)}}. We should combine the code paths and 
> have a single place that creates the UDF instance.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34408) Refactor spark.udf.register to share the same path to generate UDF instance

2021-02-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34408:


Assignee: Apache Spark

> Refactor spark.udf.register to share the same path to generate UDF instance
> ---
>
> Key: SPARK-34408
> URL: https://issues.apache.org/jira/browse/SPARK-34408
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Minor
>
> Currently we have two places that create UDFs: one is _create_udf, and the 
> other is {{UserDefinedFunction(...)}}. We should combine the code paths and 
> have a single place that creates the UDF instance.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34409) haversine distance spark function

2021-02-09 Thread Darshan Jani (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Darshan Jani updated SPARK-34409:
-
Affects Version/s: (was: 3.1.2)
   3.2.0

> haversine distance spark function
> -
>
> Key: SPARK-34409
> URL: https://issues.apache.org/jira/browse/SPARK-34409
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Darshan Jani
>Priority: Major
>
> Add a new Haversine function:
> Compute the great-circle distance between two points on a sphere given their 
> longitudes and latitudes.
> [https://en.wikipedia.org/wiki/Haversine_formula]
>  
>  
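
Until such a built-in exists, a minimal Scala UDF sketch of the formula (the name 
haversineKm, the column names in the usage comment, and the 6371.0088 km mean 
Earth radius are illustrative assumptions, not part of this proposal):
{code:scala}
import org.apache.spark.sql.functions.udf

// Haversine great-circle distance in kilometres; latitude/longitude are in degrees.
val haversineKm = udf { (lat1: Double, lon1: Double, lat2: Double, lon2: Double) =>
  val r = 6371.0088                      // assumed mean Earth radius in km
  val dLat = math.toRadians(lat2 - lat1)
  val dLon = math.toRadians(lon2 - lon1)
  val a = math.pow(math.sin(dLat / 2), 2) +
    math.cos(math.toRadians(lat1)) * math.cos(math.toRadians(lat2)) *
      math.pow(math.sin(dLon / 2), 2)
  2 * r * math.asin(math.sqrt(a))
}

// Hypothetical usage:
// df.withColumn("dist_km", haversineKm($"lat1", $"lon1", $"lat2", $"lon2"))
{code}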



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34410) Vincenty distance spark function

2021-02-09 Thread Darshan Jani (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Darshan Jani updated SPARK-34410:
-
Affects Version/s: (was: 3.1.2)
   3.2.0

> Vincenty distance spark function
> 
>
> Key: SPARK-34410
> URL: https://issues.apache.org/jira/browse/SPARK-34410
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Darshan Jani
>Priority: Major
>
> Add a new Spark function to compute the distance between two geo-points using 
> the Vincenty distance formula.
> Vincenty uses an oblate spheroid whereas haversine uses a unit sphere, which 
> gives roughly 22 m better accuracy (in the worst case) than haversine.
> https://en.wikipedia.org/wiki/Vincenty%27s_formulae



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34410) Vincenty distance spark function

2021-02-09 Thread Darshan Jani (Jira)
Darshan Jani created SPARK-34410:


 Summary: Vincenty distance spark function
 Key: SPARK-34410
 URL: https://issues.apache.org/jira/browse/SPARK-34410
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.1.2
Reporter: Darshan Jani


Add a new Spark function to compute the distance between two geo-points using 
the Vincenty distance formula.
Vincenty uses an oblate spheroid whereas haversine uses a unit sphere, which 
gives roughly 22 m better accuracy (in the worst case) than haversine.
https://en.wikipedia.org/wiki/Vincenty%27s_formulae



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34409) haversine distance spark function

2021-02-09 Thread Darshan Jani (Jira)
Darshan Jani created SPARK-34409:


 Summary: haversine distance spark function
 Key: SPARK-34409
 URL: https://issues.apache.org/jira/browse/SPARK-34409
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.1.2
Reporter: Darshan Jani


Add a new Haversine function:
Compute the great-circle distance between two points on a sphere given their 
longitudes and latitudes.
[https://en.wikipedia.org/wiki/Haversine_formula]

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34408) Refactor spark.udf.register to share the same path to generate UDF instance

2021-02-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-34408:
-
Issue Type: Improvement  (was: Documentation)

> Refactor spark.udf.register to share the same path to generate UDF instance
> ---
>
> Key: SPARK-34408
> URL: https://issues.apache.org/jira/browse/SPARK-34408
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> Currently we have two places that create UDFs: one is _create_udf, and the 
> other is {{UserDefinedFunction(...)}}. We should combine the code paths and 
> have a single place that creates the UDF instance.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34408) Refactor spark.udf.register to share the same path to generate UDF instance

2021-02-09 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-34408:


 Summary: Refactor spark.udf.register to share the same path to 
generate UDF instance
 Key: SPARK-34408
 URL: https://issues.apache.org/jira/browse/SPARK-34408
 Project: Spark
  Issue Type: Documentation
  Components: PySpark
Affects Versions: 3.2.0
Reporter: Hyukjin Kwon


Currently we have two places that create UDFs: one is _create_udf, and the other 
is {{UserDefinedFunction(...)}}. We should combine the code paths and have a 
single place that creates the UDF instance.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34309) Use Caffeine instead of Guava Cache

2021-02-09 Thread Ben Manes (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17281677#comment-17281677
 ] 

Ben Manes commented on SPARK-34309:
---

FYI, your usage is with Guava's default concurrency level (4), whereas the 
benchmarks are generous/fair by setting it to 64. This means you would have a 
lower maximum in the existing code, but that might be acceptable. An increase can 
cause issues with weighted caches by lowering the per-segment threshold (max / level).

You might get better timings by using a same-thread executor 
(Caffeine.executor(Runnable::run)) rather than deferring to the ForkJoinPool. This 
would better match Guava.

The improved eviction policy will likely have the most impact on your 
performance.
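
For illustration (the key/value types and the size bound are assumptions, not 
taken from the Spark code), the same-thread executor suggestion looks roughly 
like this:
{code:scala}
import com.github.benmanes.caffeine.cache.{Cache, Caffeine}

// Run Caffeine's maintenance work inline, as Guava does, instead of on the common
// ForkJoinPool; the 10000-entry bound is only an example.
val cache: Cache[String, AnyRef] = Caffeine.newBuilder()
  .maximumSize(10000)
  .executor((task: Runnable) => task.run())
  .build[String, AnyRef]()

cache.put("key", "value")
{code}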

> Use Caffeine instead of Guava Cache
> ---
>
> Key: SPARK-34309
> URL: https://issues.apache.org/jira/browse/SPARK-34309
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Priority: Minor
> Attachments: image-2021-02-05-18-08-48-852.png, screenshot-1.png
>
>
> Caffeine is a high-performance, near-optimal caching library based on Java 8. It 
> is used in a similar way to Guava Cache, but with better performance. Comparison 
> results are available on the [caffeine 
> benchmarks|https://github.com/ben-manes/caffeine/wiki/Benchmarks] page.
> At the same time, Caffeine is already used in open source projects such as 
> Cassandra, HBase, Neo4j, Druid, Spring, and so on.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34389) Spark job on Kubernetes scheduled For Zero or less than minimum number of executors and Wait indefinitely under resource starvation

2021-02-09 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17281645#comment-17281645
 ] 

Attila Zsolt Piros commented on SPARK-34389:


> is it possible to get the available resources of the cluster and match it 
> with the required executor resources and if it satisfies then submits the job

[~ranju] This is out of scope for Spark. A quick search gave me this: 
https://github.com/kubernetes/kubernetes/issues/17512. You can try out the 
suggestions, and if you find one that is good enough for your case, you can use 
it in a script that starts Spark only when the resources are available. 

> Spark job on Kubernetes scheduled For Zero or less than minimum number of 
> executors and Wait indefinitely under resource starvation
> ---
>
> Key: SPARK-34389
> URL: https://issues.apache.org/jira/browse/SPARK-34389
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.1
>Reporter: Ranju
>Priority: Major
> Attachments: DriverLogs_ExecutorLaunchedLessThanMinExecutor.txt, 
> Steps to reproduce.docx
>
>
> If the cluster does not have sufficient resources (CPU/memory) for the minimum 
> number of executors, the executors stay in the Pending state for an indefinite 
> time until the resources get freed.
> Suppose the cluster configuration is:
> total Memory=204Gi
> used Memory=200Gi
> free memory=4Gi
> SPARK.EXECUTOR.MEMORY=10G
> SPARK.DYNAMICALLOCATION.MINEXECUTORS=4
> SPARK.DYNAMICALLOCATION.MAXEXECUTORS=8
> Rather, the job should be cancelled if the requested minimum number of executors 
> is not available at that point in time because of resource unavailability.
> Currently Spark does partial scheduling or no scheduling and waits indefinitely, 
> and the job gets stuck.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org