[jira] [Assigned] (SPARK-36850) Migrate CreateTableStatement to v2 command framework

2021-09-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36850:


Assignee: (was: Apache Spark)

> Migrate CreateTableStatement to v2 command framework
> 
>
> Key: SPARK-36850
> URL: https://issues.apache.org/jira/browse/SPARK-36850
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Huaxin Gao
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36850) Migrate CreateTableStatement to v2 command framework

2021-09-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36850:


Assignee: Apache Spark

> Migrate CreateTableStatement to v2 command framework
> 
>
> Key: SPARK-36850
> URL: https://issues.apache.org/jira/browse/SPARK-36850
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Huaxin Gao
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36850) Migrate CreateTableStatement to v2 command framework

2021-09-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17420048#comment-17420048
 ] 

Apache Spark commented on SPARK-36850:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/34060

> Migrate CreateTableStatement to v2 command framework
> 
>
> Key: SPARK-36850
> URL: https://issues.apache.org/jira/browse/SPARK-36850
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Huaxin Gao
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36850) Migrate CreateTableStatement to v2 command framework

2021-09-24 Thread Huaxin Gao (Jira)
Huaxin Gao created SPARK-36850:
--

 Summary: Migrate CreateTableStatement to v2 command framework
 Key: SPARK-36850
 URL: https://issues.apache.org/jira/browse/SPARK-36850
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.3.0
Reporter: Huaxin Gao






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36849) Migrate UseStatement to v2 command framework

2021-09-24 Thread Huaxin Gao (Jira)
Huaxin Gao created SPARK-36849:
--

 Summary: Migrate UseStatement to v2 command framework
 Key: SPARK-36849
 URL: https://issues.apache.org/jira/browse/SPARK-36849
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.3.0
Reporter: Huaxin Gao






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36849) Migrate UseStatement to v2 command framework

2021-09-24 Thread Huaxin Gao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17420041#comment-17420041
 ] 

Huaxin Gao commented on SPARK-36849:


I am working on this

> Migrate UseStatement to v2 command framework
> 
>
> Key: SPARK-36849
> URL: https://issues.apache.org/jira/browse/SPARK-36849
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Huaxin Gao
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36848) Migrate ShowCurrentNamespaceStatement to v2 command framework

2021-09-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17420040#comment-17420040
 ] 

Apache Spark commented on SPARK-36848:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/34104

> Migrate ShowCurrentNamespaceStatement to v2 command framework
> -
>
> Key: SPARK-36848
> URL: https://issues.apache.org/jira/browse/SPARK-36848
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Huaxin Gao
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36848) Migrate ShowCurrentNamespaceStatement to v2 command framework

2021-09-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36848:


Assignee: (was: Apache Spark)

> Migrate ShowCurrentNamespaceStatement to v2 command framework
> -
>
> Key: SPARK-36848
> URL: https://issues.apache.org/jira/browse/SPARK-36848
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Huaxin Gao
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36848) Migrate ShowCurrentNamespaceStatement to v2 command framework

2021-09-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17420039#comment-17420039
 ] 

Apache Spark commented on SPARK-36848:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/34104

> Migrate ShowCurrentNamespaceStatement to v2 command framework
> -
>
> Key: SPARK-36848
> URL: https://issues.apache.org/jira/browse/SPARK-36848
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Huaxin Gao
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36848) Migrate ShowCurrentNamespaceStatement to v2 command framework

2021-09-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36848:


Assignee: Apache Spark

> Migrate ShowCurrentNamespaceStatement to v2 command framework
> -
>
> Key: SPARK-36848
> URL: https://issues.apache.org/jira/browse/SPARK-36848
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Huaxin Gao
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36848) Migrate ShowCurrentNamespaceStatement to v2 command framework

2021-09-24 Thread Huaxin Gao (Jira)
Huaxin Gao created SPARK-36848:
--

 Summary: Migrate ShowCurrentNamespaceStatement to v2 command 
framework
 Key: SPARK-36848
 URL: https://issues.apache.org/jira/browse/SPARK-36848
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.3.0
Reporter: Huaxin Gao






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32712) Support writing Hive non-ORC/Parquet bucketed table

2021-09-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17420037#comment-17420037
 ] 

Apache Spark commented on SPARK-32712:
--

User 'c21' has created a pull request for this issue:
https://github.com/apache/spark/pull/34103

> Support writing Hive non-ORC/Parquet bucketed table 
> 
>
> Key: SPARK-32712
> URL: https://issues.apache.org/jira/browse/SPARK-32712
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Priority: Minor
>
> Hive non-ORC/Parquet write code path is original Hive table write path 
> (InsertIntoHiveTable). This JIRA is to support write hivehash bucketed table 
> (for Hive 1.x.y and 2.x.y), and hive murmur3hash bucketed table (for Hive 
> 3.x.y), for these non-ORC/Parquet-serde Hive bucketed table.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32712) Support writing Hive non-ORC/Parquet bucketed table

2021-09-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32712:


Assignee: (was: Apache Spark)

> Support writing Hive non-ORC/Parquet bucketed table 
> 
>
> Key: SPARK-32712
> URL: https://issues.apache.org/jira/browse/SPARK-32712
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Priority: Minor
>
> Hive non-ORC/Parquet write code path is original Hive table write path 
> (InsertIntoHiveTable). This JIRA is to support write hivehash bucketed table 
> (for Hive 1.x.y and 2.x.y), and hive murmur3hash bucketed table (for Hive 
> 3.x.y), for these non-ORC/Parquet-serde Hive bucketed table.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32712) Support writing Hive non-ORC/Parquet bucketed table

2021-09-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32712:


Assignee: Apache Spark

> Support writing Hive non-ORC/Parquet bucketed table 
> 
>
> Key: SPARK-32712
> URL: https://issues.apache.org/jira/browse/SPARK-32712
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Assignee: Apache Spark
>Priority: Minor
>
> Hive non-ORC/Parquet write code path is original Hive table write path 
> (InsertIntoHiveTable). This JIRA is to support write hivehash bucketed table 
> (for Hive 1.x.y and 2.x.y), and hive murmur3hash bucketed table (for Hive 
> 3.x.y), for these non-ORC/Parquet-serde Hive bucketed table.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32712) Support writing Hive non-ORC/Parquet bucketed table

2021-09-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17420036#comment-17420036
 ] 

Apache Spark commented on SPARK-32712:
--

User 'c21' has created a pull request for this issue:
https://github.com/apache/spark/pull/34103

> Support writing Hive non-ORC/Parquet bucketed table 
> 
>
> Key: SPARK-32712
> URL: https://issues.apache.org/jira/browse/SPARK-32712
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Priority: Minor
>
> Hive non-ORC/Parquet write code path is original Hive table write path 
> (InsertIntoHiveTable). This JIRA is to support write hivehash bucketed table 
> (for Hive 1.x.y and 2.x.y), and hive murmur3hash bucketed table (for Hive 
> 3.x.y), for these non-ORC/Parquet-serde Hive bucketed table.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36847) Explicitly specify error codes when ignoring type hint errors

2021-09-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17420029#comment-17420029
 ] 

Apache Spark commented on SPARK-36847:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/34102

> Explicitly specify error codes when ignoring type hint errors
> -
>
> Key: SPARK-36847
> URL: https://issues.apache.org/jira/browse/SPARK-36847
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> We use a lot of {{type: ignore}} annotation to ignore type hint errors in 
> pandas-on-Spark.
> We should explicitly specify the error codes to make it clear what kind of 
> error is being ignored, then the type hint checker can check more cases.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36847) Explicitly specify error codes when ignoring type hint errors

2021-09-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17420027#comment-17420027
 ] 

Apache Spark commented on SPARK-36847:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/34102

> Explicitly specify error codes when ignoring type hint errors
> -
>
> Key: SPARK-36847
> URL: https://issues.apache.org/jira/browse/SPARK-36847
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> We use a lot of {{type: ignore}} annotation to ignore type hint errors in 
> pandas-on-Spark.
> We should explicitly specify the error codes to make it clear what kind of 
> error is being ignored, then the type hint checker can check more cases.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36847) Explicitly specify error codes when ignoring type hint errors

2021-09-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36847:


Assignee: Apache Spark

> Explicitly specify error codes when ignoring type hint errors
> -
>
> Key: SPARK-36847
> URL: https://issues.apache.org/jira/browse/SPARK-36847
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Takuya Ueshin
>Assignee: Apache Spark
>Priority: Major
>
> We use a lot of {{type: ignore}} annotation to ignore type hint errors in 
> pandas-on-Spark.
> We should explicitly specify the error codes to make it clear what kind of 
> error is being ignored, then the type hint checker can check more cases.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36847) Explicitly specify error codes when ignoring type hint errors

2021-09-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36847:


Assignee: (was: Apache Spark)

> Explicitly specify error codes when ignoring type hint errors
> -
>
> Key: SPARK-36847
> URL: https://issues.apache.org/jira/browse/SPARK-36847
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> We use a lot of {{type: ignore}} annotation to ignore type hint errors in 
> pandas-on-Spark.
> We should explicitly specify the error codes to make it clear what kind of 
> error is being ignored, then the type hint checker can check more cases.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36847) Explicitly specify error codes when ignoring type hint errors

2021-09-24 Thread Takuya Ueshin (Jira)
Takuya Ueshin created SPARK-36847:
-

 Summary: Explicitly specify error codes when ignoring type hint 
errors
 Key: SPARK-36847
 URL: https://issues.apache.org/jira/browse/SPARK-36847
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.3.0
Reporter: Takuya Ueshin


We use a lot of {{type: ignore}} annotation to ignore type hint errors in 
pandas-on-Spark.

We should explicitly specify the error codes to make it clear what kind of 
error is being ignored, then the type hint checker can check more cases.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36846) Inline most of type hint files under pyspark/sql/pandas folder

2021-09-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36846:


Assignee: (was: Apache Spark)

> Inline most of type hint files under pyspark/sql/pandas folder
> --
>
> Key: SPARK-36846
> URL: https://issues.apache.org/jira/browse/SPARK-36846
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> Inline type hint files under {{pyspark/sql/pandas}} folder, except for 
> {{pyspark/sql/pandas/functions.pyi}} and files under 
> {{pyspark/sql/pandas/_typing}}.
>  * Since the file contains a lot of overloads, we should revisit and manage 
> it separately.
>  * We can't inline files under {{pyspark/sql/pandas/_typing}} because it 
> includes new syntax for type hints.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36846) Inline most of type hint files under pyspark/sql/pandas folder

2021-09-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36846:


Assignee: (was: Apache Spark)

> Inline most of type hint files under pyspark/sql/pandas folder
> --
>
> Key: SPARK-36846
> URL: https://issues.apache.org/jira/browse/SPARK-36846
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> Inline type hint files under {{pyspark/sql/pandas}} folder, except for 
> {{pyspark/sql/pandas/functions.pyi}} and files under 
> {{pyspark/sql/pandas/_typing}}.
>  * Since the file contains a lot of overloads, we should revisit and manage 
> it separately.
>  * We can't inline files under {{pyspark/sql/pandas/_typing}} because it 
> includes new syntax for type hints.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36846) Inline most of type hint files under pyspark/sql/pandas folder

2021-09-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17419991#comment-17419991
 ] 

Apache Spark commented on SPARK-36846:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/34101

> Inline most of type hint files under pyspark/sql/pandas folder
> --
>
> Key: SPARK-36846
> URL: https://issues.apache.org/jira/browse/SPARK-36846
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> Inline type hint files under {{pyspark/sql/pandas}} folder, except for 
> {{pyspark/sql/pandas/functions.pyi}} and files under 
> {{pyspark/sql/pandas/_typing}}.
>  * Since the file contains a lot of overloads, we should revisit and manage 
> it separately.
>  * We can't inline files under {{pyspark/sql/pandas/_typing}} because it 
> includes new syntax for type hints.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36846) Inline most of type hint files under pyspark/sql/pandas folder

2021-09-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36846:


Assignee: Apache Spark

> Inline most of type hint files under pyspark/sql/pandas folder
> --
>
> Key: SPARK-36846
> URL: https://issues.apache.org/jira/browse/SPARK-36846
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Takuya Ueshin
>Assignee: Apache Spark
>Priority: Major
>
> Inline type hint files under {{pyspark/sql/pandas}} folder, except for 
> {{pyspark/sql/pandas/functions.pyi}} and files under 
> {{pyspark/sql/pandas/_typing}}.
>  * Since the file contains a lot of overloads, we should revisit and manage 
> it separately.
>  * We can't inline files under {{pyspark/sql/pandas/_typing}} because it 
> includes new syntax for type hints.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36846) Inline most of type hint files under pyspark/sql/pandas folder

2021-09-24 Thread Takuya Ueshin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin updated SPARK-36846:
--
Description: 
Inline type hint files under {{pyspark/sql/pandas}} folder, except for 
{{pyspark/sql/pandas/functions.pyi}} and files under 
{{pyspark/sql/pandas/_typing}}.

- Since the file contains a lot of overloads, we should revisit and manage it 
separately.
- We can't inline files under {{pyspark/sql/pandas/_typing}} because it 
includes new syntax for type hints.

  was:Inline type hint files under {{pyspark/sql/pandas}} folder, except for 
{{pyspark/sql/pandas/functions.pyi}} and files under 
{{pyspark/sql/pandas/_typing}}.


> Inline most of type hint files under pyspark/sql/pandas folder
> --
>
> Key: SPARK-36846
> URL: https://issues.apache.org/jira/browse/SPARK-36846
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> Inline type hint files under {{pyspark/sql/pandas}} folder, except for 
> {{pyspark/sql/pandas/functions.pyi}} and files under 
> {{pyspark/sql/pandas/_typing}}.
> - Since the file contains a lot of overloads, we should revisit and manage it 
> separately.
> - We can't inline files under {{pyspark/sql/pandas/_typing}} because it 
> includes new syntax for type hints.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36846) Inline most of type hint files under pyspark/sql/pandas folder

2021-09-24 Thread Takuya Ueshin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin updated SPARK-36846:
--
Description: 
Inline type hint files under {{pyspark/sql/pandas}} folder, except for 
{{pyspark/sql/pandas/functions.pyi}} and files under 
{{pyspark/sql/pandas/_typing}}.
 * Since the file contains a lot of overloads, we should revisit and manage it 
separately.
 * We can't inline files under {{pyspark/sql/pandas/_typing}} because it 
includes new syntax for type hints.

  was:
Inline type hint files under {{pyspark/sql/pandas}} folder, except for 
{{pyspark/sql/pandas/functions.pyi}} and files under 
{{pyspark/sql/pandas/_typing}}.

- Since the file contains a lot of overloads, we should revisit and manage it 
separately.
- We can't inline files under {{pyspark/sql/pandas/_typing}} because it 
includes new syntax for type hints.


> Inline most of type hint files under pyspark/sql/pandas folder
> --
>
> Key: SPARK-36846
> URL: https://issues.apache.org/jira/browse/SPARK-36846
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> Inline type hint files under {{pyspark/sql/pandas}} folder, except for 
> {{pyspark/sql/pandas/functions.pyi}} and files under 
> {{pyspark/sql/pandas/_typing}}.
>  * Since the file contains a lot of overloads, we should revisit and manage 
> it separately.
>  * We can't inline files under {{pyspark/sql/pandas/_typing}} because it 
> includes new syntax for type hints.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36846) Inline most of type hint files under pyspark/sql/pandas folder

2021-09-24 Thread Takuya Ueshin (Jira)
Takuya Ueshin created SPARK-36846:
-

 Summary: Inline most of type hint files under pyspark/sql/pandas 
folder
 Key: SPARK-36846
 URL: https://issues.apache.org/jira/browse/SPARK-36846
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.3.0
Reporter: Takuya Ueshin


Inline type hint files under {{pyspark/sql/pandas}} folder, except for 
{{pyspark/sql/pandas/functions.pyi}} and files under 
{{pyspark/sql/pandas/_typing}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36845) Inline type hint files

2021-09-24 Thread Takuya Ueshin (Jira)
Takuya Ueshin created SPARK-36845:
-

 Summary: Inline type hint files
 Key: SPARK-36845
 URL: https://issues.apache.org/jira/browse/SPARK-36845
 Project: Spark
  Issue Type: Umbrella
  Components: PySpark, SQL
Affects Versions: 3.3.0
Reporter: Takuya Ueshin


Currently there are type hint stub files ({{*.pyi}}) to show the expected types 
for functions, but we can also take advantage of static type checking within 
the functions by inlining the type hints.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35174) Avoid opening watch when waitAppCompletion is false

2021-09-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-35174.
---
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34095
[https://github.com/apache/spark/pull/34095]

> Avoid opening watch when waitAppCompletion is false
> ---
>
> Key: SPARK-35174
> URL: https://issues.apache.org/jira/browse/SPARK-35174
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.1
>Reporter: Jonathan Lafleche
>Priority: Minor
> Fix For: 3.3.0
>
>
> In spark-submit, we currently [open a pod watch for any spark 
> submission|https://github.com/apache/spark/blame/0494dc90af48ce7da0625485a4dc6917a244d580/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/KubernetesClientApplication.scala#L150-L167].
>  If WAIT_FOR_APP_COMPLETION is false, we then immediately ignore the result 
> of the watcher and break out of the watcher.
> When submitting spark applications at scale, this is a source of operational 
> pain, since opening the watch relies on opening a websocket, which tends to 
> run into subtle networking issues around negotiating the websocket connection.
> I'd like to change this behaviour so that we eagerly check whether we are 
> waiting on app completion, and avoid opening the watch altogether when 
> WAIT_FOR_APP_COMPLETION is false.
> Would you accept a contribution for that change, or are there any concerns 
> I've overlooked?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36844) Window function "first" (unboundedFollowing) appears significantly slower than "last" (unboundedPreceding) in identical circumstances

2021-09-24 Thread Alain Bryden (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alain Bryden updated SPARK-36844:
-
Summary: Window function "first" (unboundedFollowing) appears significantly 
slower than "last" (unboundedPreceding) in identical circumstances  (was: 
"first" Window function is significantly slower than "last" in identical 
circumstances)

> Window function "first" (unboundedFollowing) appears significantly slower 
> than "last" (unboundedPreceding) in identical circumstances
> -
>
> Key: SPARK-36844
> URL: https://issues.apache.org/jira/browse/SPARK-36844
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Windows
>Affects Versions: 3.1.1
>Reporter: Alain Bryden
>Priority: Minor
> Attachments: Physical Plan 2 - workaround.png, Pysical Plan.png
>
>
> I originally posted a question on SO because I thought perhaps I was doing 
> something wrong:
> [https://stackoverflow.com/questions/69308560|https://stackoverflow.com/questions/69308560/spark-first-window-function-is-taking-much-longer-than-last?noredirect=1#comment122505685_69308560]
> Perhaps I am, but I'm now fairly convinced that there's something wonky with 
> the implementation of `first` that's causing it to unnecessarily have a much 
> worse complexity than `last`.
>  
> More or less copy-pasted from the above post:
> I was working on a pyspark routine to interpolate the missing values in a 
> configuration table.
> Imagine a table of configuration values that go from 0 to 50,000. The user 
> specifies a few data points in between (say at 0, 50, 100, 500, 2000, 50) 
> and we interpolate the remainder. My solution mostly follows [this blog 
> post|https://walkenho.github.io/interpolating-time-series-p2-spark/] quite 
> closely, except I'm not using any UDFs.
> In troubleshooting the performance of this (takes ~3 minutes) I found that 
> one particular window function is taking all of the time, and everything else 
> I'm doing takes mere seconds.
> Here is the main area of interest - where I use window functions to fill in 
> the previous and next user-supplied configuration values:
> {code:python}
> from pyspark.sql import Window, functions as F
> # Create partition windows that are required to generate new rows from the 
> ones provided
> win_last = Window.partitionBy('PORT_TYPE', 
> 'loss_process').orderBy('rank').rowsBetween(Window.unboundedPreceding, 0)
> win_next = Window.partitionBy('PORT_TYPE', 
> 'loss_process').orderBy('rank').rowsBetween(0, Window.unboundedFollowing)
> # Join back in the provided config table to populate the "known" scale factors
> df_part1 = (df_scale_factors_template
>   .join(df_users_config, ['PORT_TYPE', 'loss_process', 'rank'], 'leftouter')
>   # Add computed columns that can lookup the prior config and next config for 
> each missing value
>   .withColumn('last_rank', F.last( F.col('rank'), 
> ignorenulls=True).over(win_last))
>   .withColumn('last_sf',   F.last( F.col('scale_factor'), 
> ignorenulls=True).over(win_last))
> ).cache()
> debug_log_dataframe(df_part1 , 'df_part1') # Force a .count() and time Part1
> df_part2 = (df_part1
>   .withColumn('next_rank', F.first(F.col('rank'), 
> ignorenulls=True).over(win_next))
>   .withColumn('next_sf',   F.first(F.col('scale_factor'), 
> ignorenulls=True).over(win_next))
> ).cache()
> debug_log_dataframe(df_part2 , 'df_part2') # Force a .count() and time Part2
> df_part3 = (df_part2
>   # Implements standard linear interpolation: y = y1 + ((y2-y1)/(x2-x1)) * 
> (x-x1)
>   .withColumn('scale_factor', 
>   F.when(F.col('last_rank')==F.col('next_rank'), 
> F.col('last_sf')) # Handle div/0 case
>   .otherwise(F.col('last_sf') + 
> ((F.col('next_sf')-F.col('last_sf'))/(F.col('next_rank')-F.col('last_rank'))) 
> * (F.col('rank')-F.col('last_rank'
>   .select('PORT_TYPE', 'loss_process', 'rank', 'scale_factor')
> ).cache()
> debug_log_dataframe(df_part3, 'df_part3', explain: True)
> {code}
>  
> The above used to be a single chained dataframe statement, but I've since 
> split it into 3 parts so that I could isolate the part that's taking so long. 
> The results are:
>  * {{Part 1: Generated 8 columns and 36 rows in 0.65 seconds}}
>  * {{Part 2: Generated 10 columns and 36 rows in 189.55 seconds}}
>  * {{Part 3: Generated 4 columns and 36 rows in 0.24 seconds}}
>  
> In trying various things to speed up my routine, it occurred to me to try 
> re-rewriting my usages of {{first()}} to just be usages of {{last()}} with a 
> reversed sort order.
> So rewriting this:
> {code:python}
> win_next = (Window.partitionBy('PORT_TYPE', 'loss_process')
>   .orderBy('rank').rowsBetween(0, 

[jira] [Assigned] (SPARK-36721) Simplify boolean equalities if one side is literal

2021-09-24 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh reassigned SPARK-36721:
---

Assignee: Kazuyuki Tanimura

> Simplify boolean equalities if one side is literal
> --
>
> Key: SPARK-36721
> URL: https://issues.apache.org/jira/browse/SPARK-36721
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0, 3.3.0
>Reporter: Kazuyuki Tanimura
>Assignee: Kazuyuki Tanimura
>Priority: Major
>
> The following query does not push down the filter 
> ```
> SELECT * FROM t WHERE (a AND b) = true
> ```
> although the following equivalent query pushes down the filter as expected.
> ```
> SELECT * FROM t WHERE (a AND b) 
> ```



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36721) Simplify boolean equalities if one side is literal

2021-09-24 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh resolved SPARK-36721.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34055
[https://github.com/apache/spark/pull/34055]

> Simplify boolean equalities if one side is literal
> --
>
> Key: SPARK-36721
> URL: https://issues.apache.org/jira/browse/SPARK-36721
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0, 3.3.0
>Reporter: Kazuyuki Tanimura
>Assignee: Kazuyuki Tanimura
>Priority: Major
> Fix For: 3.3.0
>
>
> The following query does not push down the filter 
> ```
> SELECT * FROM t WHERE (a AND b) = true
> ```
> although the following equivalent query pushes down the filter as expected.
> ```
> SELECT * FROM t WHERE (a AND b) 
> ```



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36835) Spark 3.2.0 POMs are no longer "dependency reduced"

2021-09-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17419884#comment-17419884
 ] 

Apache Spark commented on SPARK-36835:
--

User 'sunchao' has created a pull request for this issue:
https://github.com/apache/spark/pull/34100

> Spark 3.2.0 POMs are no longer "dependency reduced"
> ---
>
> Key: SPARK-36835
> URL: https://issues.apache.org/jira/browse/SPARK-36835
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Josh Rosen
>Assignee: Chao Sun
>Priority: Blocker
> Fix For: 3.2.0
>
>
> It looks like Spark 3.2.0's POMs are no longer "dependency reduced". As a 
> result, applications may pull in additional unnecessary dependencies when 
> depending on Spark.
> Spark uses the Maven Shade plugin to create effective POMs and to bundle 
> shaded versions of certain libraries with Spark (namely, Jetty, Guava, and 
> JPPML). [By 
> default|https://maven.apache.org/plugins/maven-shade-plugin/shade-mojo.html#createDependencyReducedPom],
>  the Maven Shade plugin generates simplified POMs which remove dependencies 
> on artifacts that have been shaded.
> SPARK-33212 / 
> [b6f46ca29742029efea2790af7fdefbc2fcf52de|https://github.com/apache/spark/commit/b6f46ca29742029efea2790af7fdefbc2fcf52de]
>  changed the configuration of the Maven Shade plugin, setting 
> {{createDependencyReducedPom}} to {{false}}.
> As a result, the generated POMs now include compile-scope dependencies on the 
> shaded libraries. For example, compare the {{org.eclipse.jetty}} dependencies 
> in:
>  * Spark 3.1.2: 
> [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.12/3.1.2/spark-core_2.12-3.1.2.pom]
>  * Spark 3.2.0 RC2: 
> [https://repository.apache.org/content/repositories/orgapachespark-1390/org/apache/spark/spark-core_2.12/3.2.0/spark-core_2.12-3.2.0.pom]
> I think we should revert back to generating "dependency reduced" POMs to 
> ensure that Spark declares a proper set of dependencies and to avoid "unknown 
> unknown" consequences of changing our generated POM format.
> /cc [~csun]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36844) "first" Window function is significantly slower than "last" in identical circumstances

2021-09-24 Thread Alain Bryden (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17419833#comment-17419833
 ] 

Alain Bryden commented on SPARK-36844:
--

I've attached the physical plan from the initial implementation. `last` ends up 
using "RunningWindowFunction" whereas `first` just says "Window"

!Pysical Plan.png! 

Here it is after I've re-written things to explicitly reverse the window sort 
order and use `last` instead of `first`.  Now "RunningWindowFunction" is used 
in both cases and the dataframe evaluates several orders of magnitude faster 
(1-2 seconds instead of ~3 minutes):

!Physical Plan 2 - workaround.png! 

> "first" Window function is significantly slower than "last" in identical 
> circumstances
> --
>
> Key: SPARK-36844
> URL: https://issues.apache.org/jira/browse/SPARK-36844
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Windows
>Affects Versions: 3.1.1
>Reporter: Alain Bryden
>Priority: Minor
> Attachments: Physical Plan 2 - workaround.png, Pysical Plan.png
>
>
> I originally posted a question on SO because I thought perhaps I was doing 
> something wrong:
> [https://stackoverflow.com/questions/69308560|https://stackoverflow.com/questions/69308560/spark-first-window-function-is-taking-much-longer-than-last?noredirect=1#comment122505685_69308560]
> Perhaps I am, but I'm now fairly convinced that there's something wonky with 
> the implementation of `first` that's causing it to unnecessarily have a much 
> worse complexity than `last`.
>  
> More or less copy-pasted from the above post:
> I was working on a pyspark routine to interpolate the missing values in a 
> configuration table.
> Imagine a table of configuration values that go from 0 to 50,000. The user 
> specifies a few data points in between (say at 0, 50, 100, 500, 2000, 50) 
> and we interpolate the remainder. My solution mostly follows [this blog 
> post|https://walkenho.github.io/interpolating-time-series-p2-spark/] quite 
> closely, except I'm not using any UDFs.
> In troubleshooting the performance of this (takes ~3 minutes) I found that 
> one particular window function is taking all of the time, and everything else 
> I'm doing takes mere seconds.
> Here is the main area of interest - where I use window functions to fill in 
> the previous and next user-supplied configuration values:
> {code:python}
> from pyspark.sql import Window, functions as F
> # Create partition windows that are required to generate new rows from the 
> ones provided
> win_last = Window.partitionBy('PORT_TYPE', 
> 'loss_process').orderBy('rank').rowsBetween(Window.unboundedPreceding, 0)
> win_next = Window.partitionBy('PORT_TYPE', 
> 'loss_process').orderBy('rank').rowsBetween(0, Window.unboundedFollowing)
> # Join back in the provided config table to populate the "known" scale factors
> df_part1 = (df_scale_factors_template
>   .join(df_users_config, ['PORT_TYPE', 'loss_process', 'rank'], 'leftouter')
>   # Add computed columns that can lookup the prior config and next config for 
> each missing value
>   .withColumn('last_rank', F.last( F.col('rank'), 
> ignorenulls=True).over(win_last))
>   .withColumn('last_sf',   F.last( F.col('scale_factor'), 
> ignorenulls=True).over(win_last))
> ).cache()
> debug_log_dataframe(df_part1 , 'df_part1') # Force a .count() and time Part1
> df_part2 = (df_part1
>   .withColumn('next_rank', F.first(F.col('rank'), 
> ignorenulls=True).over(win_next))
>   .withColumn('next_sf',   F.first(F.col('scale_factor'), 
> ignorenulls=True).over(win_next))
> ).cache()
> debug_log_dataframe(df_part2 , 'df_part2') # Force a .count() and time Part2
> df_part3 = (df_part2
>   # Implements standard linear interpolation: y = y1 + ((y2-y1)/(x2-x1)) * 
> (x-x1)
>   .withColumn('scale_factor', 
>   F.when(F.col('last_rank')==F.col('next_rank'), 
> F.col('last_sf')) # Handle div/0 case
>   .otherwise(F.col('last_sf') + 
> ((F.col('next_sf')-F.col('last_sf'))/(F.col('next_rank')-F.col('last_rank'))) 
> * (F.col('rank')-F.col('last_rank'
>   .select('PORT_TYPE', 'loss_process', 'rank', 'scale_factor')
> ).cache()
> debug_log_dataframe(df_part3, 'df_part3', explain: True)
> {code}
>  
> The above used to be a single chained dataframe statement, but I've since 
> split it into 3 parts so that I could isolate the part that's taking so long. 
> The results are:
>  * {{Part 1: Generated 8 columns and 36 rows in 0.65 seconds}}
>  * {{Part 2: Generated 10 columns and 36 rows in 189.55 seconds}}
>  * {{Part 3: Generated 4 columns and 36 rows in 0.24 seconds}}
>  
> In trying various things to speed up my routine, it occurred to me to try 
> re-rewriting my usages of {{first()}} to just be usages of {{last()}} 

[jira] [Updated] (SPARK-36844) "first" Window function is significantly slower than "last" in identical circumstances

2021-09-24 Thread Alain Bryden (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alain Bryden updated SPARK-36844:
-
Attachment: Physical Plan 2 - workaround.png

> "first" Window function is significantly slower than "last" in identical 
> circumstances
> --
>
> Key: SPARK-36844
> URL: https://issues.apache.org/jira/browse/SPARK-36844
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Windows
>Affects Versions: 3.1.1
>Reporter: Alain Bryden
>Priority: Minor
> Attachments: Physical Plan 2 - workaround.png, Pysical Plan.png
>
>
> I originally posted a question on SO because I thought perhaps I was doing 
> something wrong:
> [https://stackoverflow.com/questions/69308560|https://stackoverflow.com/questions/69308560/spark-first-window-function-is-taking-much-longer-than-last?noredirect=1#comment122505685_69308560]
> Perhaps I am, but I'm now fairly convinced that there's something wonky with 
> the implementation of `first` that's causing it to unnecessarily have a much 
> worse complexity than `last`.
>  
> More or less copy-pasted from the above post:
> I was working on a pyspark routine to interpolate the missing values in a 
> configuration table.
> Imagine a table of configuration values that go from 0 to 50,000. The user 
> specifies a few data points in between (say at 0, 50, 100, 500, 2000, 50) 
> and we interpolate the remainder. My solution mostly follows [this blog 
> post|https://walkenho.github.io/interpolating-time-series-p2-spark/] quite 
> closely, except I'm not using any UDFs.
> In troubleshooting the performance of this (takes ~3 minutes) I found that 
> one particular window function is taking all of the time, and everything else 
> I'm doing takes mere seconds.
> Here is the main area of interest - where I use window functions to fill in 
> the previous and next user-supplied configuration values:
> {code:python}
> from pyspark.sql import Window, functions as F
> # Create partition windows that are required to generate new rows from the 
> ones provided
> win_last = Window.partitionBy('PORT_TYPE', 
> 'loss_process').orderBy('rank').rowsBetween(Window.unboundedPreceding, 0)
> win_next = Window.partitionBy('PORT_TYPE', 
> 'loss_process').orderBy('rank').rowsBetween(0, Window.unboundedFollowing)
> # Join back in the provided config table to populate the "known" scale factors
> df_part1 = (df_scale_factors_template
>   .join(df_users_config, ['PORT_TYPE', 'loss_process', 'rank'], 'leftouter')
>   # Add computed columns that can lookup the prior config and next config for 
> each missing value
>   .withColumn('last_rank', F.last( F.col('rank'), 
> ignorenulls=True).over(win_last))
>   .withColumn('last_sf',   F.last( F.col('scale_factor'), 
> ignorenulls=True).over(win_last))
> ).cache()
> debug_log_dataframe(df_part1 , 'df_part1') # Force a .count() and time Part1
> df_part2 = (df_part1
>   .withColumn('next_rank', F.first(F.col('rank'), 
> ignorenulls=True).over(win_next))
>   .withColumn('next_sf',   F.first(F.col('scale_factor'), 
> ignorenulls=True).over(win_next))
> ).cache()
> debug_log_dataframe(df_part2 , 'df_part2') # Force a .count() and time Part2
> df_part3 = (df_part2
>   # Implements standard linear interpolation: y = y1 + ((y2-y1)/(x2-x1)) * 
> (x-x1)
>   .withColumn('scale_factor', 
>   F.when(F.col('last_rank')==F.col('next_rank'), 
> F.col('last_sf')) # Handle div/0 case
>   .otherwise(F.col('last_sf') + 
> ((F.col('next_sf')-F.col('last_sf'))/(F.col('next_rank')-F.col('last_rank'))) 
> * (F.col('rank')-F.col('last_rank'
>   .select('PORT_TYPE', 'loss_process', 'rank', 'scale_factor')
> ).cache()
> debug_log_dataframe(df_part3, 'df_part3', explain: True)
> {code}
>  
> The above used to be a single chained dataframe statement, but I've since 
> split it into 3 parts so that I could isolate the part that's taking so long. 
> The results are:
>  * {{Part 1: Generated 8 columns and 36 rows in 0.65 seconds}}
>  * {{Part 2: Generated 10 columns and 36 rows in 189.55 seconds}}
>  * {{Part 3: Generated 4 columns and 36 rows in 0.24 seconds}}
>  
> In trying various things to speed up my routine, it occurred to me to try 
> re-rewriting my usages of {{first()}} to just be usages of {{last()}} with a 
> reversed sort order.
> So rewriting this:
> {code:python}
> win_next = (Window.partitionBy('PORT_TYPE', 'loss_process')
>   .orderBy('rank').rowsBetween(0, Window.unboundedFollowing))
> df_part2 = (df_part1
>   .withColumn('next_rank', F.first(F.col('rank'), 
> ignorenulls=True).over(win_next))
>   .withColumn('next_sf',   F.first(F.col('scale_factor'), 
> ignorenulls=True).over(win_next))
> )
> {code}
>  
> As this:
> {code:python}
> win_next = 

[jira] [Updated] (SPARK-36844) "first" Window function is significantly slower than "last" in identical circumstances

2021-09-24 Thread Alain Bryden (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alain Bryden updated SPARK-36844:
-
Attachment: Pysical Plan.png

> "first" Window function is significantly slower than "last" in identical 
> circumstances
> --
>
> Key: SPARK-36844
> URL: https://issues.apache.org/jira/browse/SPARK-36844
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Windows
>Affects Versions: 3.1.1
>Reporter: Alain Bryden
>Priority: Minor
> Attachments: Pysical Plan.png
>
>
> I originally posted a question on SO because I thought perhaps I was doing 
> something wrong:
> [https://stackoverflow.com/questions/69308560|https://stackoverflow.com/questions/69308560/spark-first-window-function-is-taking-much-longer-than-last?noredirect=1#comment122505685_69308560]
> Perhaps I am, but I'm now fairly convinced that there's something wonky with 
> the implementation of `first` that's causing it to unnecessarily have a much 
> worse complexity than `last`.
>  
> More or less copy-pasted from the above post:
> I was working on a pyspark routine to interpolate the missing values in a 
> configuration table.
> Imagine a table of configuration values that go from 0 to 50,000. The user 
> specifies a few data points in between (say at 0, 50, 100, 500, 2000, 50) 
> and we interpolate the remainder. My solution mostly follows [this blog 
> post|https://walkenho.github.io/interpolating-time-series-p2-spark/] quite 
> closely, except I'm not using any UDFs.
> In troubleshooting the performance of this (takes ~3 minutes) I found that 
> one particular window function is taking all of the time, and everything else 
> I'm doing takes mere seconds.
> Here is the main area of interest - where I use window functions to fill in 
> the previous and next user-supplied configuration values:
> {code:python}
> from pyspark.sql import Window, functions as F
> # Create partition windows that are required to generate new rows from the 
> ones provided
> win_last = Window.partitionBy('PORT_TYPE', 
> 'loss_process').orderBy('rank').rowsBetween(Window.unboundedPreceding, 0)
> win_next = Window.partitionBy('PORT_TYPE', 
> 'loss_process').orderBy('rank').rowsBetween(0, Window.unboundedFollowing)
> # Join back in the provided config table to populate the "known" scale factors
> df_part1 = (df_scale_factors_template
>   .join(df_users_config, ['PORT_TYPE', 'loss_process', 'rank'], 'leftouter')
>   # Add computed columns that can lookup the prior config and next config for 
> each missing value
>   .withColumn('last_rank', F.last( F.col('rank'), 
> ignorenulls=True).over(win_last))
>   .withColumn('last_sf',   F.last( F.col('scale_factor'), 
> ignorenulls=True).over(win_last))
> ).cache()
> debug_log_dataframe(df_part1 , 'df_part1') # Force a .count() and time Part1
> df_part2 = (df_part1
>   .withColumn('next_rank', F.first(F.col('rank'), 
> ignorenulls=True).over(win_next))
>   .withColumn('next_sf',   F.first(F.col('scale_factor'), 
> ignorenulls=True).over(win_next))
> ).cache()
> debug_log_dataframe(df_part2 , 'df_part2') # Force a .count() and time Part2
> df_part3 = (df_part2
>   # Implements standard linear interpolation: y = y1 + ((y2-y1)/(x2-x1)) * 
> (x-x1)
>   .withColumn('scale_factor', 
>   F.when(F.col('last_rank')==F.col('next_rank'), 
> F.col('last_sf')) # Handle div/0 case
>   .otherwise(F.col('last_sf') + 
> ((F.col('next_sf')-F.col('last_sf'))/(F.col('next_rank')-F.col('last_rank'))) 
> * (F.col('rank')-F.col('last_rank'
>   .select('PORT_TYPE', 'loss_process', 'rank', 'scale_factor')
> ).cache()
> debug_log_dataframe(df_part3, 'df_part3', explain: True)
> {code}
>  
> The above used to be a single chained dataframe statement, but I've since 
> split it into 3 parts so that I could isolate the part that's taking so long. 
> The results are:
>  * {{Part 1: Generated 8 columns and 36 rows in 0.65 seconds}}
>  * {{Part 2: Generated 10 columns and 36 rows in 189.55 seconds}}
>  * {{Part 3: Generated 4 columns and 36 rows in 0.24 seconds}}
>  
> In trying various things to speed up my routine, it occurred to me to try 
> re-rewriting my usages of {{first()}} to just be usages of {{last()}} with a 
> reversed sort order.
> So rewriting this:
> {code:python}
> win_next = (Window.partitionBy('PORT_TYPE', 'loss_process')
>   .orderBy('rank').rowsBetween(0, Window.unboundedFollowing))
> df_part2 = (df_part1
>   .withColumn('next_rank', F.first(F.col('rank'), 
> ignorenulls=True).over(win_next))
>   .withColumn('next_sf',   F.first(F.col('scale_factor'), 
> ignorenulls=True).over(win_next))
> )
> {code}
>  
> As this:
> {code:python}
> win_next = (Window.partitionBy('PORT_TYPE', 'loss_process')
>   

[jira] [Created] (SPARK-36844) "first" Window function is significantly slower than "last" in identical circumstances

2021-09-24 Thread Alain Bryden (Jira)
Alain Bryden created SPARK-36844:


 Summary: "first" Window function is significantly slower than 
"last" in identical circumstances
 Key: SPARK-36844
 URL: https://issues.apache.org/jira/browse/SPARK-36844
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Windows
Affects Versions: 3.1.1
Reporter: Alain Bryden


I originally posted a question on SO because I thought perhaps I was doing 
something wrong:

[https://stackoverflow.com/questions/69308560|https://stackoverflow.com/questions/69308560/spark-first-window-function-is-taking-much-longer-than-last?noredirect=1#comment122505685_69308560]

Perhaps I am, but I'm now fairly convinced that there's something wonky with 
the implementation of `first` that's causing it to unnecessarily have a much 
worse complexity than `last`.

 

More or less copy-pasted from the above post:

I was working on a pyspark routine to interpolate the missing values in a 
configuration table.

Imagine a table of configuration values that go from 0 to 50,000. The user 
specifies a few data points in between (say at 0, 50, 100, 500, 2000, 50) 
and we interpolate the remainder. My solution mostly follows [this blog 
post|https://walkenho.github.io/interpolating-time-series-p2-spark/] quite 
closely, except I'm not using any UDFs.

In troubleshooting the performance of this (takes ~3 minutes) I found that one 
particular window function is taking all of the time, and everything else I'm 
doing takes mere seconds.

Here is the main area of interest - where I use window functions to fill in the 
previous and next user-supplied configuration values:
{code:python}
from pyspark.sql import Window, functions as F

# Create partition windows that are required to generate new rows from the ones 
provided
win_last = Window.partitionBy('PORT_TYPE', 
'loss_process').orderBy('rank').rowsBetween(Window.unboundedPreceding, 0)
win_next = Window.partitionBy('PORT_TYPE', 
'loss_process').orderBy('rank').rowsBetween(0, Window.unboundedFollowing)

# Join back in the provided config table to populate the "known" scale factors
df_part1 = (df_scale_factors_template
  .join(df_users_config, ['PORT_TYPE', 'loss_process', 'rank'], 'leftouter')
  # Add computed columns that can lookup the prior config and next config for 
each missing value
  .withColumn('last_rank', F.last( F.col('rank'), 
ignorenulls=True).over(win_last))
  .withColumn('last_sf',   F.last( F.col('scale_factor'), 
ignorenulls=True).over(win_last))
).cache()
debug_log_dataframe(df_part1 , 'df_part1') # Force a .count() and time Part1

df_part2 = (df_part1
  .withColumn('next_rank', F.first(F.col('rank'), 
ignorenulls=True).over(win_next))
  .withColumn('next_sf',   F.first(F.col('scale_factor'), 
ignorenulls=True).over(win_next))
).cache()
debug_log_dataframe(df_part2 , 'df_part2') # Force a .count() and time Part2

df_part3 = (df_part2
  # Implements standard linear interpolation: y = y1 + ((y2-y1)/(x2-x1)) * 
(x-x1)
  .withColumn('scale_factor', 
  F.when(F.col('last_rank')==F.col('next_rank'), F.col('last_sf')) 
# Handle div/0 case
  .otherwise(F.col('last_sf') + 
((F.col('next_sf')-F.col('last_sf'))/(F.col('next_rank')-F.col('last_rank'))) * 
(F.col('rank')-F.col('last_rank'
  .select('PORT_TYPE', 'loss_process', 'rank', 'scale_factor')
).cache()
debug_log_dataframe(df_part3, 'df_part3', explain: True)
{code}
 

The above used to be a single chained dataframe statement, but I've since split 
it into 3 parts so that I could isolate the part that's taking so long. The 
results are:
 * {{Part 1: Generated 8 columns and 36 rows in 0.65 seconds}}
 * {{Part 2: Generated 10 columns and 36 rows in 189.55 seconds}}
 * {{Part 3: Generated 4 columns and 36 rows in 0.24 seconds}}

 

In trying various things to speed up my routine, it occurred to me to try 
re-rewriting my usages of {{first()}} to just be usages of {{last()}} with a 
reversed sort order.

So rewriting this:

{code:python}
win_next = (Window.partitionBy('PORT_TYPE', 'loss_process')
  .orderBy('rank').rowsBetween(0, Window.unboundedFollowing))

df_part2 = (df_part1
  .withColumn('next_rank', F.first(F.col('rank'), 
ignorenulls=True).over(win_next))
  .withColumn('next_sf',   F.first(F.col('scale_factor'), 
ignorenulls=True).over(win_next))
)
{code}
 
As this:

{code:python}
win_next = (Window.partitionBy('PORT_TYPE', 'loss_process')
  .orderBy(F.desc('rank')).rowsBetween(Window.unboundedPreceding, 0))

df_part2 = (df_part1
  .withColumn('next_rank', F.last(F.col('rank'), 
ignorenulls=True).over(win_next))
  .withColumn('next_sf',   F.last(F.col('scale_factor'), 
ignorenulls=True).over(win_next))
)
{code}
 
Much to my amazement, this actually solved the performance problem, and now the 
entire dataframe is generated in just 3 seconds.

I don't know anything about the internals, but 

[jira] [Commented] (SPARK-35672) Spark fails to launch executors with very large user classpath lists on YARN

2021-09-24 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17419830#comment-17419830
 ] 

Erik Krogen commented on SPARK-35672:
-

Thanks [~petertoth] [~hyukjin.kwon] [~Gengliang.Wang] for reporting and dealing 
with the issue.

I'll work on submitting a new PR to master with the changes from PRs #31810 
(original) and #34084 (environment variable fix) incorporated.

> Spark fails to launch executors with very large user classpath lists on YARN
> 
>
> Key: SPARK-35672
> URL: https://issues.apache.org/jira/browse/SPARK-35672
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 3.1.2
> Environment: Linux RHEL7
> Spark 3.1.1
>Reporter: Erik Krogen
>Assignee: Erik Krogen
>Priority: Major
>
> When running Spark on YARN, the {{user-class-path}} argument to 
> {{CoarseGrainedExecutorBackend}} is used to pass a list of user JAR URIs to 
> executor processes. The argument is specified once for each JAR, and the URIs 
> are fully-qualified, so the paths can be quite long. With large user JAR 
> lists (say 1000+), this can result in system-level argument length limits 
> being exceeded, typically manifesting as the error message:
> {code}
> /bin/bash: Argument list too long
> {code}
> A [Google 
> search|https://www.google.com/search?q=spark%20%22%2Fbin%2Fbash%3A%20argument%20list%20too%20long%22=spark%20%22%2Fbin%2Fbash%3A%20argument%20list%20too%20long%22]
>  indicates that this is not a theoretical problem and afflicts real users, 
> including ours. This issue was originally observed on Spark 2.3, but has been 
> confirmed to exist in the master branch as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36821) Create a test to extend ColumnarBatch

2021-09-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36821:


Assignee: (was: Apache Spark)

> Create a test to extend ColumnarBatch
> -
>
> Key: SPARK-36821
> URL: https://issues.apache.org/jira/browse/SPARK-36821
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yufei Gu
>Priority: Major
>
> As a followup of Spark-36814, to create a test to extend ColumnarBatch to 
> prevent future changes to break it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36821) Create a test to extend ColumnarBatch

2021-09-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17419815#comment-17419815
 ] 

Apache Spark commented on SPARK-36821:
--

User 'flyrain' has created a pull request for this issue:
https://github.com/apache/spark/pull/34087

> Create a test to extend ColumnarBatch
> -
>
> Key: SPARK-36821
> URL: https://issues.apache.org/jira/browse/SPARK-36821
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yufei Gu
>Priority: Major
>
> As a followup of Spark-36814, to create a test to extend ColumnarBatch to 
> prevent future changes to break it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36821) Create a test to extend ColumnarBatch

2021-09-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36821:


Assignee: Apache Spark

> Create a test to extend ColumnarBatch
> -
>
> Key: SPARK-36821
> URL: https://issues.apache.org/jira/browse/SPARK-36821
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yufei Gu
>Assignee: Apache Spark
>Priority: Major
>
> As a followup of Spark-36814, to create a test to extend ColumnarBatch to 
> prevent future changes to break it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36843) Add an iterator method to Dataset

2021-09-24 Thread Li Xian (Jira)
Li Xian created SPARK-36843:
---

 Summary: Add an iterator method to Dataset
 Key: SPARK-36843
 URL: https://issues.apache.org/jira/browse/SPARK-36843
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: Li Xian


The current org.apache.spark.sql.Dataset#toLocalIterator will submit multiple 
jobs for multiple partitions. 

In my case, I would like to collect all partition at once to save the job 
scheduling cost and also has an iterator to save the memory on deserialization 
(instead of deserialize all rows at once, I want only one row is deserialized 
during the iteration)

. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36842) Stop task result getter properly on spark context stopping

2021-09-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17419728#comment-17419728
 ] 

Apache Spark commented on SPARK-36842:
--

User 'lxian' has created a pull request for this issue:
https://github.com/apache/spark/pull/34098

> Stop task result getter properly on spark context stopping
> --
>
> Key: SPARK-36842
> URL: https://issues.apache.org/jira/browse/SPARK-36842
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Li Xian
>Priority: Major
>
> org.apache.spark.scheduler.TaskSchedulerImpl#stop doesn't handle exception 
> properly. If one component throws exceptions on stopping, the exception is 
> thrown and TaskSchedulerImpl.stop() will not be executed completely.
> For example if backend.stop() fails, then taskResultGetter.stop() won't be 
> executed. The result is that after a couple of restart of the spark context, 
> there will be a lot of '
> task-result-getter' threads retained.
>  
> !image-2021-09-24-18-50-57-072.png!
> !image-2021-09-24-18-51-03-837.png!
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36842) Stop task result getter properly on spark context stopping

2021-09-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36842:


Assignee: (was: Apache Spark)

> Stop task result getter properly on spark context stopping
> --
>
> Key: SPARK-36842
> URL: https://issues.apache.org/jira/browse/SPARK-36842
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Li Xian
>Priority: Major
>
> org.apache.spark.scheduler.TaskSchedulerImpl#stop doesn't handle exception 
> properly. If one component throws exceptions on stopping, the exception is 
> thrown and TaskSchedulerImpl.stop() will not be executed completely.
> For example if backend.stop() fails, then taskResultGetter.stop() won't be 
> executed. The result is that after a couple of restart of the spark context, 
> there will be a lot of '
> task-result-getter' threads retained.
>  
> !image-2021-09-24-18-50-57-072.png!
> !image-2021-09-24-18-51-03-837.png!
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36842) Stop task result getter properly on spark context stopping

2021-09-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36842:


Assignee: Apache Spark

> Stop task result getter properly on spark context stopping
> --
>
> Key: SPARK-36842
> URL: https://issues.apache.org/jira/browse/SPARK-36842
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Li Xian
>Assignee: Apache Spark
>Priority: Major
>
> org.apache.spark.scheduler.TaskSchedulerImpl#stop doesn't handle exception 
> properly. If one component throws exceptions on stopping, the exception is 
> thrown and TaskSchedulerImpl.stop() will not be executed completely.
> For example if backend.stop() fails, then taskResultGetter.stop() won't be 
> executed. The result is that after a couple of restart of the spark context, 
> there will be a lot of '
> task-result-getter' threads retained.
>  
> !image-2021-09-24-18-50-57-072.png!
> !image-2021-09-24-18-51-03-837.png!
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36842) Stop task result getter properly on spark context stopping

2021-09-24 Thread Li Xian (Jira)
Li Xian created SPARK-36842:
---

 Summary: Stop task result getter properly on spark context stopping
 Key: SPARK-36842
 URL: https://issues.apache.org/jira/browse/SPARK-36842
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.3.0
Reporter: Li Xian


org.apache.spark.scheduler.TaskSchedulerImpl#stop doesn't handle exception 
properly. If one component throws exceptions on stopping, the exception is 
thrown and TaskSchedulerImpl.stop() will not be executed completely.

For example if backend.stop() fails, then taskResultGetter.stop() won't be 
executed. The result is that after a couple of restart of the spark context, 
there will be a lot of '

task-result-getter' threads retained.

 

!image-2021-09-24-18-50-57-072.png!

!image-2021-09-24-18-51-03-837.png!

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36792) Inset should handle Double.NaN and Float.NaN

2021-09-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17419711#comment-17419711
 ] 

Apache Spark commented on SPARK-36792:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/34097

> Inset should handle Double.NaN and Float.NaN
> 
>
> Key: SPARK-36792
> URL: https://issues.apache.org/jira/browse/SPARK-36792
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.2, 3.2.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
>  Labels: correctness
> Fix For: 3.2.0, 3.1.3, 3.0.4
>
>
> Inset(Double.NaN, Seq(Double.NaN, 1d)) return false



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36792) Inset should handle Double.NaN and Float.NaN

2021-09-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17419710#comment-17419710
 ] 

Apache Spark commented on SPARK-36792:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/34097

> Inset should handle Double.NaN and Float.NaN
> 
>
> Key: SPARK-36792
> URL: https://issues.apache.org/jira/browse/SPARK-36792
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.2, 3.2.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
>  Labels: correctness
> Fix For: 3.2.0, 3.1.3, 3.0.4
>
>
> Inset(Double.NaN, Seq(Double.NaN, 1d)) return false



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36831) Read/write dataframes with ANSI intervals from/to CSV files

2021-09-24 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-36831:


Assignee: (was: Max Gekk)

> Read/write dataframes with ANSI intervals from/to CSV files
> ---
>
> Key: SPARK-36831
> URL: https://issues.apache.org/jira/browse/SPARK-36831
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>
> Implement writing and reading ANSI intervals (year-month and day-time 
> intervals) columns in dataframes to Parquet datasources.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36841) Provide ansi syntax `set catalog xxx` to change the current catalog

2021-09-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17419705#comment-17419705
 ] 

Apache Spark commented on SPARK-36841:
--

User 'Peng-Lei' has created a pull request for this issue:
https://github.com/apache/spark/pull/34096

> Provide ansi syntax  `set catalog xxx` to change the current catalog  
> --
>
> Key: SPARK-36841
> URL: https://issues.apache.org/jira/browse/SPARK-36841
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: PengLei
>Priority: Major
> Fix For: 3.3.0
>
>
> !SET-CATALOG.PNG!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36841) Provide ansi syntax `set catalog xxx` to change the current catalog

2021-09-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36841:


Assignee: (was: Apache Spark)

> Provide ansi syntax  `set catalog xxx` to change the current catalog  
> --
>
> Key: SPARK-36841
> URL: https://issues.apache.org/jira/browse/SPARK-36841
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: PengLei
>Priority: Major
> Fix For: 3.3.0
>
>
> !SET-CATALOG.PNG!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36841) Provide ansi syntax `set catalog xxx` to change the current catalog

2021-09-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17419703#comment-17419703
 ] 

Apache Spark commented on SPARK-36841:
--

User 'Peng-Lei' has created a pull request for this issue:
https://github.com/apache/spark/pull/34096

> Provide ansi syntax  `set catalog xxx` to change the current catalog  
> --
>
> Key: SPARK-36841
> URL: https://issues.apache.org/jira/browse/SPARK-36841
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: PengLei
>Priority: Major
> Fix For: 3.3.0
>
>
> !SET-CATALOG.PNG!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36841) Provide ansi syntax `set catalog xxx` to change the current catalog

2021-09-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36841:


Assignee: Apache Spark

> Provide ansi syntax  `set catalog xxx` to change the current catalog  
> --
>
> Key: SPARK-36841
> URL: https://issues.apache.org/jira/browse/SPARK-36841
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: PengLei
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.3.0
>
>
> !SET-CATALOG.PNG!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36841) Provide ansi syntax `set catalog xxx` to change the current catalog

2021-09-24 Thread PengLei (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

PengLei updated SPARK-36841:

Description: !SET-CATALOG.PNG!  (was: !截图.PNG!)

> Provide ansi syntax  `set catalog xxx` to change the current catalog  
> --
>
> Key: SPARK-36841
> URL: https://issues.apache.org/jira/browse/SPARK-36841
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: PengLei
>Priority: Major
> Fix For: 3.3.0
>
>
> !SET-CATALOG.PNG!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36841) Provide ansi syntax `set catalog xxx` to change the current catalog

2021-09-24 Thread PengLei (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

PengLei updated SPARK-36841:

Description: !截图.PNG!

> Provide ansi syntax  `set catalog xxx` to change the current catalog  
> --
>
> Key: SPARK-36841
> URL: https://issues.apache.org/jira/browse/SPARK-36841
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: PengLei
>Priority: Major
> Fix For: 3.3.0
>
>
> !截图.PNG!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36841) Provide ansi syntax `set catalog xxx` to change the current catalog

2021-09-24 Thread PengLei (Jira)
PengLei created SPARK-36841:
---

 Summary: Provide ansi syntax  `set catalog xxx` to change the 
current catalog  
 Key: SPARK-36841
 URL: https://issues.apache.org/jira/browse/SPARK-36841
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: PengLei
 Fix For: 3.3.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35174) Avoid opening watch when waitAppCompletion is false

2021-09-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35174:


Assignee: (was: Apache Spark)

> Avoid opening watch when waitAppCompletion is false
> ---
>
> Key: SPARK-35174
> URL: https://issues.apache.org/jira/browse/SPARK-35174
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.1
>Reporter: Jonathan Lafleche
>Priority: Minor
>
> In spark-submit, we currently [open a pod watch for any spark 
> submission|https://github.com/apache/spark/blame/0494dc90af48ce7da0625485a4dc6917a244d580/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/KubernetesClientApplication.scala#L150-L167].
>  If WAIT_FOR_APP_COMPLETION is false, we then immediately ignore the result 
> of the watcher and break out of the watcher.
> When submitting spark applications at scale, this is a source of operational 
> pain, since opening the watch relies on opening a websocket, which tends to 
> run into subtle networking issues around negotiating the websocket connection.
> I'd like to change this behaviour so that we eagerly check whether we are 
> waiting on app completion, and avoid opening the watch altogether when 
> WAIT_FOR_APP_COMPLETION is false.
> Would you accept a contribution for that change, or are there any concerns 
> I've overlooked?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36827) Task/Stage/Job data remain in memory leads memory leak

2021-09-24 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-36827.

Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 34092
[https://github.com/apache/spark/pull/34092]

> Task/Stage/Job data remain in memory leads memory leak
> --
>
> Key: SPARK-36827
> URL: https://issues.apache.org/jira/browse/SPARK-36827
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.2
>Reporter: Kohki Nishio
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: mem1.txt, worker.txt
>
>
> Noticing memory-leak like behavior, steady increase of heap after GC and 
> eventually it leads to a service failure. 
> The GC histogram shows very high number of Task/Data/Job data
> {code}
>  num #instances #bytes  class name 
> -- 
>6:   7835346 2444627952  org.apache.spark.status.TaskDataWrapper 
>   25:   3765152  180727296  org.apache.spark.status.StageDataWrapper 
>   88:2322559290200  org.apache.spark.status.JobDataWrapper 
> {code}
> Thread dumps show clearly the clean up thread is always doing cleanupStages
> {code}
> "element-tracking-store-worker" #355 daemon prio=5 os_prio=0 
> tid=0x7f31b0014800 nid=0x409 runnable [0x7f2f25783000]
>java.lang.Thread.State: RUNNABLE
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.util.kvstore.KVTypeInfo$MethodAccessor.get(KVTypeInfo.java:162)
>   at 
> org.apache.spark.util.kvstore.InMemoryStore$InMemoryView.compare(InMemoryStore.java:434)
>   at 
> org.apache.spark.util.kvstore.InMemoryStore$InMemoryView.lambda$iterator$0(InMemoryStore.java:375)
>   at 
> org.apache.spark.util.kvstore.InMemoryStore$InMemoryView$$Lambda$9000/574018760.compare(Unknown
>  Source)
>   at java.util.TimSort.gallopLeft(TimSort.java:542)
>   at java.util.TimSort.mergeLo(TimSort.java:752)
>   at java.util.TimSort.mergeAt(TimSort.java:514)
>   at java.util.TimSort.mergeCollapse(TimSort.java:439)
>   at java.util.TimSort.sort(TimSort.java:245)
>   at java.util.Arrays.sort(Arrays.java:1512)
>   at java.util.ArrayList.sort(ArrayList.java:1464)
>   at 
> org.apache.spark.util.kvstore.InMemoryStore$InMemoryView.iterator(InMemoryStore.java:375)
>   at 
> org.apache.spark.util.kvstore.KVStoreView.closeableIterator(KVStoreView.java:117)
>   at 
> org.apache.spark.status.AppStatusListener.$anonfun$cleanupStages$2(AppStatusListener.scala:1269)
>   at 
> org.apache.spark.status.AppStatusListener$$Lambda$9126/608388595.apply(Unknown
>  Source)
>   at scala.collection.immutable.List.map(List.scala:297)
>   at 
> org.apache.spark.status.AppStatusListener.cleanupStages(AppStatusListener.scala:1260)
>   at 
> org.apache.spark.status.AppStatusListener.$anonfun$new$3(AppStatusListener.scala:98)
>   at 
> org.apache.spark.status.AppStatusListener$$Lambda$646/596139882.apply$mcVJ$sp(Unknown
>  Source)
>   at 
> org.apache.spark.status.ElementTrackingStore.$anonfun$write$3(ElementTrackingStore.scala:135)
>   at 
> org.apache.spark.status.ElementTrackingStore.$anonfun$write$3$adapted(ElementTrackingStore.scala:133)
>   at 
> org.apache.spark.status.ElementTrackingStore$$Lambda$986/162337848.apply(Unknown
>  Source)
>   at scala.collection.immutable.List.foreach(List.scala:431)
>   at 
> org.apache.spark.status.ElementTrackingStore.$anonfun$write$2(ElementTrackingStore.scala:133)
>   at 
> org.apache.spark.status.ElementTrackingStore.$anonfun$write$2$adapted(ElementTrackingStore.scala:131)
>   at 
> org.apache.spark.status.ElementTrackingStore$$Lambda$984/600376389.apply(Unknown
>  Source)
>   at 
> org.apache.spark.status.ElementTrackingStore$LatchedTriggers.$anonfun$fireOnce$1(ElementTrackingStore.scala:58)
>   at 
> org.apache.spark.status.ElementTrackingStore$LatchedTriggers$$Lambda$985/1187323214.apply$mcV$sp(Unknown
>  Source)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at org.apache.spark.util.Utils$.tryLog(Utils.scala:2013)
>   at 
> org.apache.spark.status.ElementTrackingStore$$anon$1.run(ElementTrackingStore.scala:117)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at 

[jira] [Assigned] (SPARK-36827) Task/Stage/Job data remain in memory leads memory leak

2021-09-24 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang reassigned SPARK-36827:
--

Assignee: Gengliang Wang

> Task/Stage/Job data remain in memory leads memory leak
> --
>
> Key: SPARK-36827
> URL: https://issues.apache.org/jira/browse/SPARK-36827
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.2
>Reporter: Kohki Nishio
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: mem1.txt, worker.txt
>
>
> Noticing memory-leak like behavior, steady increase of heap after GC and 
> eventually it leads to a service failure. 
> The GC histogram shows very high number of Task/Data/Job data
> {code}
>  num #instances #bytes  class name 
> -- 
>6:   7835346 2444627952  org.apache.spark.status.TaskDataWrapper 
>   25:   3765152  180727296  org.apache.spark.status.StageDataWrapper 
>   88:2322559290200  org.apache.spark.status.JobDataWrapper 
> {code}
> Thread dumps show clearly the clean up thread is always doing cleanupStages
> {code}
> "element-tracking-store-worker" #355 daemon prio=5 os_prio=0 
> tid=0x7f31b0014800 nid=0x409 runnable [0x7f2f25783000]
>java.lang.Thread.State: RUNNABLE
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.util.kvstore.KVTypeInfo$MethodAccessor.get(KVTypeInfo.java:162)
>   at 
> org.apache.spark.util.kvstore.InMemoryStore$InMemoryView.compare(InMemoryStore.java:434)
>   at 
> org.apache.spark.util.kvstore.InMemoryStore$InMemoryView.lambda$iterator$0(InMemoryStore.java:375)
>   at 
> org.apache.spark.util.kvstore.InMemoryStore$InMemoryView$$Lambda$9000/574018760.compare(Unknown
>  Source)
>   at java.util.TimSort.gallopLeft(TimSort.java:542)
>   at java.util.TimSort.mergeLo(TimSort.java:752)
>   at java.util.TimSort.mergeAt(TimSort.java:514)
>   at java.util.TimSort.mergeCollapse(TimSort.java:439)
>   at java.util.TimSort.sort(TimSort.java:245)
>   at java.util.Arrays.sort(Arrays.java:1512)
>   at java.util.ArrayList.sort(ArrayList.java:1464)
>   at 
> org.apache.spark.util.kvstore.InMemoryStore$InMemoryView.iterator(InMemoryStore.java:375)
>   at 
> org.apache.spark.util.kvstore.KVStoreView.closeableIterator(KVStoreView.java:117)
>   at 
> org.apache.spark.status.AppStatusListener.$anonfun$cleanupStages$2(AppStatusListener.scala:1269)
>   at 
> org.apache.spark.status.AppStatusListener$$Lambda$9126/608388595.apply(Unknown
>  Source)
>   at scala.collection.immutable.List.map(List.scala:297)
>   at 
> org.apache.spark.status.AppStatusListener.cleanupStages(AppStatusListener.scala:1260)
>   at 
> org.apache.spark.status.AppStatusListener.$anonfun$new$3(AppStatusListener.scala:98)
>   at 
> org.apache.spark.status.AppStatusListener$$Lambda$646/596139882.apply$mcVJ$sp(Unknown
>  Source)
>   at 
> org.apache.spark.status.ElementTrackingStore.$anonfun$write$3(ElementTrackingStore.scala:135)
>   at 
> org.apache.spark.status.ElementTrackingStore.$anonfun$write$3$adapted(ElementTrackingStore.scala:133)
>   at 
> org.apache.spark.status.ElementTrackingStore$$Lambda$986/162337848.apply(Unknown
>  Source)
>   at scala.collection.immutable.List.foreach(List.scala:431)
>   at 
> org.apache.spark.status.ElementTrackingStore.$anonfun$write$2(ElementTrackingStore.scala:133)
>   at 
> org.apache.spark.status.ElementTrackingStore.$anonfun$write$2$adapted(ElementTrackingStore.scala:131)
>   at 
> org.apache.spark.status.ElementTrackingStore$$Lambda$984/600376389.apply(Unknown
>  Source)
>   at 
> org.apache.spark.status.ElementTrackingStore$LatchedTriggers.$anonfun$fireOnce$1(ElementTrackingStore.scala:58)
>   at 
> org.apache.spark.status.ElementTrackingStore$LatchedTriggers$$Lambda$985/1187323214.apply$mcV$sp(Unknown
>  Source)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at org.apache.spark.util.Utils$.tryLog(Utils.scala:2013)
>   at 
> org.apache.spark.status.ElementTrackingStore$$anon$1.run(ElementTrackingStore.scala:117)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by 

[jira] [Assigned] (SPARK-35174) Avoid opening watch when waitAppCompletion is false

2021-09-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35174:


Assignee: Apache Spark

> Avoid opening watch when waitAppCompletion is false
> ---
>
> Key: SPARK-35174
> URL: https://issues.apache.org/jira/browse/SPARK-35174
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.1
>Reporter: Jonathan Lafleche
>Assignee: Apache Spark
>Priority: Minor
>
> In spark-submit, we currently [open a pod watch for any spark 
> submission|https://github.com/apache/spark/blame/0494dc90af48ce7da0625485a4dc6917a244d580/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/KubernetesClientApplication.scala#L150-L167].
>  If WAIT_FOR_APP_COMPLETION is false, we then immediately ignore the result 
> of the watcher and break out of the watcher.
> When submitting spark applications at scale, this is a source of operational 
> pain, since opening the watch relies on opening a websocket, which tends to 
> run into subtle networking issues around negotiating the websocket connection.
> I'd like to change this behaviour so that we eagerly check whether we are 
> waiting on app completion, and avoid opening the watch altogether when 
> WAIT_FOR_APP_COMPLETION is false.
> Would you accept a contribution for that change, or are there any concerns 
> I've overlooked?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35174) Avoid opening watch when waitAppCompletion is false

2021-09-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17419681#comment-17419681
 ] 

Apache Spark commented on SPARK-35174:
--

User 'slothspot' has created a pull request for this issue:
https://github.com/apache/spark/pull/34095

> Avoid opening watch when waitAppCompletion is false
> ---
>
> Key: SPARK-35174
> URL: https://issues.apache.org/jira/browse/SPARK-35174
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.1
>Reporter: Jonathan Lafleche
>Priority: Minor
>
> In spark-submit, we currently [open a pod watch for any spark 
> submission|https://github.com/apache/spark/blame/0494dc90af48ce7da0625485a4dc6917a244d580/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/KubernetesClientApplication.scala#L150-L167].
>  If WAIT_FOR_APP_COMPLETION is false, we then immediately ignore the result 
> of the watcher and break out of the watcher.
> When submitting spark applications at scale, this is a source of operational 
> pain, since opening the watch relies on opening a websocket, which tends to 
> run into subtle networking issues around negotiating the websocket connection.
> I'd like to change this behaviour so that we eagerly check whether we are 
> waiting on app completion, and avoid opening the watch altogether when 
> WAIT_FOR_APP_COMPLETION is false.
> Would you accept a contribution for that change, or are there any concerns 
> I've overlooked?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36742) Fix ps.to_datetime with plurals of keys like years, months, days

2021-09-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17419678#comment-17419678
 ] 

Apache Spark commented on SPARK-36742:
--

User 'dgd-contributor' has created a pull request for this issue:
https://github.com/apache/spark/pull/34094

> Fix ps.to_datetime with plurals of keys like years, months, days
> 
>
> Key: SPARK-36742
> URL: https://issues.apache.org/jira/browse/SPARK-36742
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: dgd_contributor
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36838) Keep In/InSet use same nullSafeEval and refactor

2021-09-24 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-36838:
--
Summary: Keep In/InSet use same nullSafeEval and refactor  (was: Keep 
In/InSet use same nullSafeEval)

> Keep In/InSet use same nullSafeEval and refactor
> 
>
> Key: SPARK-36838
> URL: https://issues.apache.org/jira/browse/SPARK-36838
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.2, 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> In current code, In/InSet return null when value is null, this behavior is 
> not correct



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36656) CollapseProject should not collapse correlated scalar subqueries

2021-09-24 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-36656.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 33903
[https://github.com/apache/spark/pull/33903]

> CollapseProject should not collapse correlated scalar subqueries
> 
>
> Key: SPARK-36656
> URL: https://issues.apache.org/jira/browse/SPARK-36656
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Allison Wang
>Assignee: Allison Wang
>Priority: Major
> Fix For: 3.3.0
>
>
> Currently, the optimizer rule `CollapseProject` inlines expressions generated 
> from correlated scalar subqueries, which can create unnecessary left outer 
> joins.
> {code:sql}
> select c1, s, s * 10 from (
> select c1, (select first(c2) from t2 where t1.c1 = t2.c1) s from t1)
> {code}
> {code:scala}
> // Before
> Project [c1, s, (s * 10)]
> +- Project [c1, scalar-subquery [c1] AS s]
>:  +- Aggregate [c1], [first(c2), c1] 
>:  +- LocalRelation [c1, c2]
>+- LocalRelation [c1, c2]
> // After (scalar subqueries are inlined)
> Project [c1, scalar-subquery [c1], (scalar-subquery [c1] * 10)]
> :  +- Aggregate [c1], [first(c2), c1] 
> :  +- LocalRelation [c1, c2]
> :  +- Aggregate [c1], [first(c2), c1] 
> :  +- LocalRelation [c1, c2]
> +- LocalRelation [c1, c2]
> {code}
> Then this query will have two LeftOuter joins created. We should only 
> collapse projects after correlated subqueries are rewritten as joins.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36656) CollapseProject should not collapse correlated scalar subqueries

2021-09-24 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-36656:
---

Assignee: Allison Wang

> CollapseProject should not collapse correlated scalar subqueries
> 
>
> Key: SPARK-36656
> URL: https://issues.apache.org/jira/browse/SPARK-36656
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Allison Wang
>Assignee: Allison Wang
>Priority: Major
>
> Currently, the optimizer rule `CollapseProject` inlines expressions generated 
> from correlated scalar subqueries, which can create unnecessary left outer 
> joins.
> {code:sql}
> select c1, s, s * 10 from (
> select c1, (select first(c2) from t2 where t1.c1 = t2.c1) s from t1)
> {code}
> {code:scala}
> // Before
> Project [c1, s, (s * 10)]
> +- Project [c1, scalar-subquery [c1] AS s]
>:  +- Aggregate [c1], [first(c2), c1] 
>:  +- LocalRelation [c1, c2]
>+- LocalRelation [c1, c2]
> // After (scalar subqueries are inlined)
> Project [c1, scalar-subquery [c1], (scalar-subquery [c1] * 10)]
> :  +- Aggregate [c1], [first(c2), c1] 
> :  +- LocalRelation [c1, c2]
> :  +- Aggregate [c1], [first(c2), c1] 
> :  +- LocalRelation [c1, c2]
> +- LocalRelation [c1, c2]
> {code}
> Then this query will have two LeftOuter joins created. We should only 
> collapse projects after correlated subqueries are rewritten as joins.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36792) Inset should handle Double.NaN and Float.NaN

2021-09-24 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-36792:
---

Assignee: angerszhu

> Inset should handle Double.NaN and Float.NaN
> 
>
> Key: SPARK-36792
> URL: https://issues.apache.org/jira/browse/SPARK-36792
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.2, 3.2.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
>  Labels: correctness
>
> Inset(Double.NaN, Seq(Double.NaN, 1d)) return false



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36792) Inset should handle Double.NaN and Float.NaN

2021-09-24 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-36792.
-
Fix Version/s: 3.1.3
   3.2.0
   3.0.4
   Resolution: Fixed

Issue resolved by pull request 34033
[https://github.com/apache/spark/pull/34033]

> Inset should handle Double.NaN and Float.NaN
> 
>
> Key: SPARK-36792
> URL: https://issues.apache.org/jira/browse/SPARK-36792
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.2, 3.2.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
>  Labels: correctness
> Fix For: 3.0.4, 3.2.0, 3.1.3
>
>
> Inset(Double.NaN, Seq(Double.NaN, 1d)) return false



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36747) Do not collapse Project with Aggregate when correlated subqueries are present in the project list

2021-09-24 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-36747:
---

Assignee: Allison Wang

> Do not collapse Project with Aggregate when correlated subqueries are present 
> in the project list
> -
>
> Key: SPARK-36747
> URL: https://issues.apache.org/jira/browse/SPARK-36747
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Allison Wang
>Assignee: Allison Wang
>Priority: Major
>
> Currently CollapseProject combines Project with Aggregate when the shared 
> attributes are deterministic. But if there are correlated scalar subqueries 
> in the project list that uses the output of the aggregate, they cannot be 
> combined. Otherwise, the plan after rewrite will not be valid:
> {code}
> select (select sum(c2) from t where c1 = cast(s as int)) from (select sum(c2) 
> s from t)
> == Optimized Logical Plan ==
> Aggregate [sum(c2)#10L AS scalarsubquery(s)#11L]
> +- Project [sum(c2)#10L]
>+- Join LeftOuter, (c1#2 = cast(sum(c2#3) as int))
>   :- LocalRelation [c2#3]
>   +- Aggregate [c1#2], [sum(c2#3) AS sum(c2)#10L, c1#2]
>  +- LocalRelation [c1#2, c2#3]
> java.lang.UnsupportedOperationException: Cannot generate code for expression: 
> sum(input[0, int, false])
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36747) Do not collapse Project with Aggregate when correlated subqueries are present in the project list

2021-09-24 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-36747.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 34081
[https://github.com/apache/spark/pull/34081]

> Do not collapse Project with Aggregate when correlated subqueries are present 
> in the project list
> -
>
> Key: SPARK-36747
> URL: https://issues.apache.org/jira/browse/SPARK-36747
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Allison Wang
>Assignee: Allison Wang
>Priority: Major
> Fix For: 3.2.0
>
>
> Currently CollapseProject combines Project with Aggregate when the shared 
> attributes are deterministic. But if there are correlated scalar subqueries 
> in the project list that uses the output of the aggregate, they cannot be 
> combined. Otherwise, the plan after rewrite will not be valid:
> {code}
> select (select sum(c2) from t where c1 = cast(s as int)) from (select sum(c2) 
> s from t)
> == Optimized Logical Plan ==
> Aggregate [sum(c2)#10L AS scalarsubquery(s)#11L]
> +- Project [sum(c2)#10L]
>+- Join LeftOuter, (c1#2 = cast(sum(c2#3) as int))
>   :- LocalRelation [c2#3]
>   +- Aggregate [c1#2], [sum(c2#3) AS sum(c2)#10L, c1#2]
>  +- LocalRelation [c1#2, c2#3]
> java.lang.UnsupportedOperationException: Cannot generate code for expression: 
> sum(input[0, int, false])
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36827) Task/Stage/Job data remain in memory leads memory leak

2021-09-24 Thread tenglei (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17419631#comment-17419631
 ] 

tenglei commented on SPARK-36827:
-

I met the same problem in spark 2.3.x, and find a pull request from 
https://github.com/apache/spark/pull/24616 it seem to relate to this problem 
and it should work.How much time you spend to occur this problem?

> Task/Stage/Job data remain in memory leads memory leak
> --
>
> Key: SPARK-36827
> URL: https://issues.apache.org/jira/browse/SPARK-36827
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.2
>Reporter: Kohki Nishio
>Priority: Major
> Attachments: mem1.txt, worker.txt
>
>
> Noticing memory-leak like behavior, steady increase of heap after GC and 
> eventually it leads to a service failure. 
> The GC histogram shows very high number of Task/Data/Job data
> {code}
>  num #instances #bytes  class name 
> -- 
>6:   7835346 2444627952  org.apache.spark.status.TaskDataWrapper 
>   25:   3765152  180727296  org.apache.spark.status.StageDataWrapper 
>   88:2322559290200  org.apache.spark.status.JobDataWrapper 
> {code}
> Thread dumps show clearly the clean up thread is always doing cleanupStages
> {code}
> "element-tracking-store-worker" #355 daemon prio=5 os_prio=0 
> tid=0x7f31b0014800 nid=0x409 runnable [0x7f2f25783000]
>java.lang.Thread.State: RUNNABLE
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.util.kvstore.KVTypeInfo$MethodAccessor.get(KVTypeInfo.java:162)
>   at 
> org.apache.spark.util.kvstore.InMemoryStore$InMemoryView.compare(InMemoryStore.java:434)
>   at 
> org.apache.spark.util.kvstore.InMemoryStore$InMemoryView.lambda$iterator$0(InMemoryStore.java:375)
>   at 
> org.apache.spark.util.kvstore.InMemoryStore$InMemoryView$$Lambda$9000/574018760.compare(Unknown
>  Source)
>   at java.util.TimSort.gallopLeft(TimSort.java:542)
>   at java.util.TimSort.mergeLo(TimSort.java:752)
>   at java.util.TimSort.mergeAt(TimSort.java:514)
>   at java.util.TimSort.mergeCollapse(TimSort.java:439)
>   at java.util.TimSort.sort(TimSort.java:245)
>   at java.util.Arrays.sort(Arrays.java:1512)
>   at java.util.ArrayList.sort(ArrayList.java:1464)
>   at 
> org.apache.spark.util.kvstore.InMemoryStore$InMemoryView.iterator(InMemoryStore.java:375)
>   at 
> org.apache.spark.util.kvstore.KVStoreView.closeableIterator(KVStoreView.java:117)
>   at 
> org.apache.spark.status.AppStatusListener.$anonfun$cleanupStages$2(AppStatusListener.scala:1269)
>   at 
> org.apache.spark.status.AppStatusListener$$Lambda$9126/608388595.apply(Unknown
>  Source)
>   at scala.collection.immutable.List.map(List.scala:297)
>   at 
> org.apache.spark.status.AppStatusListener.cleanupStages(AppStatusListener.scala:1260)
>   at 
> org.apache.spark.status.AppStatusListener.$anonfun$new$3(AppStatusListener.scala:98)
>   at 
> org.apache.spark.status.AppStatusListener$$Lambda$646/596139882.apply$mcVJ$sp(Unknown
>  Source)
>   at 
> org.apache.spark.status.ElementTrackingStore.$anonfun$write$3(ElementTrackingStore.scala:135)
>   at 
> org.apache.spark.status.ElementTrackingStore.$anonfun$write$3$adapted(ElementTrackingStore.scala:133)
>   at 
> org.apache.spark.status.ElementTrackingStore$$Lambda$986/162337848.apply(Unknown
>  Source)
>   at scala.collection.immutable.List.foreach(List.scala:431)
>   at 
> org.apache.spark.status.ElementTrackingStore.$anonfun$write$2(ElementTrackingStore.scala:133)
>   at 
> org.apache.spark.status.ElementTrackingStore.$anonfun$write$2$adapted(ElementTrackingStore.scala:131)
>   at 
> org.apache.spark.status.ElementTrackingStore$$Lambda$984/600376389.apply(Unknown
>  Source)
>   at 
> org.apache.spark.status.ElementTrackingStore$LatchedTriggers.$anonfun$fireOnce$1(ElementTrackingStore.scala:58)
>   at 
> org.apache.spark.status.ElementTrackingStore$LatchedTriggers$$Lambda$985/1187323214.apply$mcV$sp(Unknown
>  Source)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at org.apache.spark.util.Utils$.tryLog(Utils.scala:2013)
>   at 
> org.apache.spark.status.ElementTrackingStore$$anon$1.run(ElementTrackingStore.scala:117)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> 

[jira] [Issue Comment Deleted] (SPARK-36827) Task/Stage/Job data remain in memory leads memory leak

2021-09-24 Thread tenglei (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tenglei updated SPARK-36827:

Comment: was deleted

(was: I met the same problem in spark 2.3.x, and find a pull request from 
[https://github.com/apache/spark/pull/24616|https://github.com/apache/spark/pull/24616,]
 it seem to relate to this problem and it should work.How much time you spend 
to occur this problem?)

> Task/Stage/Job data remain in memory leads memory leak
> --
>
> Key: SPARK-36827
> URL: https://issues.apache.org/jira/browse/SPARK-36827
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.2
>Reporter: Kohki Nishio
>Priority: Major
> Attachments: mem1.txt, worker.txt
>
>
> Noticing memory-leak like behavior, steady increase of heap after GC and 
> eventually it leads to a service failure. 
> The GC histogram shows very high number of Task/Data/Job data
> {code}
>  num #instances #bytes  class name 
> -- 
>6:   7835346 2444627952  org.apache.spark.status.TaskDataWrapper 
>   25:   3765152  180727296  org.apache.spark.status.StageDataWrapper 
>   88:2322559290200  org.apache.spark.status.JobDataWrapper 
> {code}
> Thread dumps show clearly the clean up thread is always doing cleanupStages
> {code}
> "element-tracking-store-worker" #355 daemon prio=5 os_prio=0 
> tid=0x7f31b0014800 nid=0x409 runnable [0x7f2f25783000]
>java.lang.Thread.State: RUNNABLE
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.util.kvstore.KVTypeInfo$MethodAccessor.get(KVTypeInfo.java:162)
>   at 
> org.apache.spark.util.kvstore.InMemoryStore$InMemoryView.compare(InMemoryStore.java:434)
>   at 
> org.apache.spark.util.kvstore.InMemoryStore$InMemoryView.lambda$iterator$0(InMemoryStore.java:375)
>   at 
> org.apache.spark.util.kvstore.InMemoryStore$InMemoryView$$Lambda$9000/574018760.compare(Unknown
>  Source)
>   at java.util.TimSort.gallopLeft(TimSort.java:542)
>   at java.util.TimSort.mergeLo(TimSort.java:752)
>   at java.util.TimSort.mergeAt(TimSort.java:514)
>   at java.util.TimSort.mergeCollapse(TimSort.java:439)
>   at java.util.TimSort.sort(TimSort.java:245)
>   at java.util.Arrays.sort(Arrays.java:1512)
>   at java.util.ArrayList.sort(ArrayList.java:1464)
>   at 
> org.apache.spark.util.kvstore.InMemoryStore$InMemoryView.iterator(InMemoryStore.java:375)
>   at 
> org.apache.spark.util.kvstore.KVStoreView.closeableIterator(KVStoreView.java:117)
>   at 
> org.apache.spark.status.AppStatusListener.$anonfun$cleanupStages$2(AppStatusListener.scala:1269)
>   at 
> org.apache.spark.status.AppStatusListener$$Lambda$9126/608388595.apply(Unknown
>  Source)
>   at scala.collection.immutable.List.map(List.scala:297)
>   at 
> org.apache.spark.status.AppStatusListener.cleanupStages(AppStatusListener.scala:1260)
>   at 
> org.apache.spark.status.AppStatusListener.$anonfun$new$3(AppStatusListener.scala:98)
>   at 
> org.apache.spark.status.AppStatusListener$$Lambda$646/596139882.apply$mcVJ$sp(Unknown
>  Source)
>   at 
> org.apache.spark.status.ElementTrackingStore.$anonfun$write$3(ElementTrackingStore.scala:135)
>   at 
> org.apache.spark.status.ElementTrackingStore.$anonfun$write$3$adapted(ElementTrackingStore.scala:133)
>   at 
> org.apache.spark.status.ElementTrackingStore$$Lambda$986/162337848.apply(Unknown
>  Source)
>   at scala.collection.immutable.List.foreach(List.scala:431)
>   at 
> org.apache.spark.status.ElementTrackingStore.$anonfun$write$2(ElementTrackingStore.scala:133)
>   at 
> org.apache.spark.status.ElementTrackingStore.$anonfun$write$2$adapted(ElementTrackingStore.scala:131)
>   at 
> org.apache.spark.status.ElementTrackingStore$$Lambda$984/600376389.apply(Unknown
>  Source)
>   at 
> org.apache.spark.status.ElementTrackingStore$LatchedTriggers.$anonfun$fireOnce$1(ElementTrackingStore.scala:58)
>   at 
> org.apache.spark.status.ElementTrackingStore$LatchedTriggers$$Lambda$985/1187323214.apply$mcV$sp(Unknown
>  Source)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at org.apache.spark.util.Utils$.tryLog(Utils.scala:2013)
>   at 
> org.apache.spark.status.ElementTrackingStore$$anon$1.run(ElementTrackingStore.scala:117)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> 

[jira] [Comment Edited] (SPARK-36827) Task/Stage/Job data remain in memory leads memory leak

2021-09-24 Thread tenglei (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17419629#comment-17419629
 ] 

tenglei edited comment on SPARK-36827 at 9/24/21, 7:50 AM:
---

I met the same problem in Spark 2.3.x and found a pull request, 
[https://github.com/apache/spark/pull/24616], that seems related to this 
problem and should fix it. How long did it take for this problem to occur 
for you?


was (Author: tenglei):
I met the same problem in Spark 2.3.x and found a pull request, 
[https://github.com/apache/spark/pull/24616], that seems related to this 
problem and should fix it. How long did it take for this problem to occur 
for you?
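
As a stopgap while a fix lands, one mitigation is to lower how much 
Task/Stage/Job data the live UI retains, which should bound the work 
cleanupStages has to do. A minimal sketch, assuming the heap growth is driven 
by retained UI entities; the config keys are standard Spark settings, but the 
values below are illustrative, not recommendations:

{code:scala}
import org.apache.spark.sql.SparkSession

// Cap the live UI's retained entities so the status store holds fewer
// TaskDataWrapper/StageDataWrapper/JobDataWrapper instances.
val spark = SparkSession.builder()
  .appName("capped-ui-retention")
  .config("spark.ui.retainedJobs", "200")    // default 1000
  .config("spark.ui.retainedStages", "200")  // default 1000
  .config("spark.ui.retainedTasks", "20000") // default 100000
  .getOrCreate()
{code}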

> Task/Stage/Job data remain in memory leads memory leak
> --
>
> Key: SPARK-36827
> URL: https://issues.apache.org/jira/browse/SPARK-36827
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.2
>Reporter: Kohki Nishio
>Priority: Major
> Attachments: mem1.txt, worker.txt
>
>
> Noticing memory-leak-like behavior: a steady increase of heap after GC that 
> eventually leads to a service failure. 
> The GC histogram shows a very high number of Task/Stage/Job data wrappers:
> {code}
>  num #instances #bytes  class name 
> -- 
>6:   7835346 2444627952  org.apache.spark.status.TaskDataWrapper 
>   25:   3765152  180727296  org.apache.spark.status.StageDataWrapper 
>   88:2322559290200  org.apache.spark.status.JobDataWrapper 
> {code}
> Thread dumps clearly show that the cleanup thread is always running cleanupStages:
> {code}
> "element-tracking-store-worker" #355 daemon prio=5 os_prio=0 
> tid=0x7f31b0014800 nid=0x409 runnable [0x7f2f25783000]
>java.lang.Thread.State: RUNNABLE
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.util.kvstore.KVTypeInfo$MethodAccessor.get(KVTypeInfo.java:162)
>   at 
> org.apache.spark.util.kvstore.InMemoryStore$InMemoryView.compare(InMemoryStore.java:434)
>   at 
> org.apache.spark.util.kvstore.InMemoryStore$InMemoryView.lambda$iterator$0(InMemoryStore.java:375)
>   at 
> org.apache.spark.util.kvstore.InMemoryStore$InMemoryView$$Lambda$9000/574018760.compare(Unknown
>  Source)
>   at java.util.TimSort.gallopLeft(TimSort.java:542)
>   at java.util.TimSort.mergeLo(TimSort.java:752)
>   at java.util.TimSort.mergeAt(TimSort.java:514)
>   at java.util.TimSort.mergeCollapse(TimSort.java:439)
>   at java.util.TimSort.sort(TimSort.java:245)
>   at java.util.Arrays.sort(Arrays.java:1512)
>   at java.util.ArrayList.sort(ArrayList.java:1464)
>   at 
> org.apache.spark.util.kvstore.InMemoryStore$InMemoryView.iterator(InMemoryStore.java:375)
>   at 
> org.apache.spark.util.kvstore.KVStoreView.closeableIterator(KVStoreView.java:117)
>   at 
> org.apache.spark.status.AppStatusListener.$anonfun$cleanupStages$2(AppStatusListener.scala:1269)
>   at 
> org.apache.spark.status.AppStatusListener$$Lambda$9126/608388595.apply(Unknown
>  Source)
>   at scala.collection.immutable.List.map(List.scala:297)
>   at 
> org.apache.spark.status.AppStatusListener.cleanupStages(AppStatusListener.scala:1260)
>   at 
> org.apache.spark.status.AppStatusListener.$anonfun$new$3(AppStatusListener.scala:98)
>   at 
> org.apache.spark.status.AppStatusListener$$Lambda$646/596139882.apply$mcVJ$sp(Unknown
>  Source)
>   at 
> org.apache.spark.status.ElementTrackingStore.$anonfun$write$3(ElementTrackingStore.scala:135)
>   at 
> org.apache.spark.status.ElementTrackingStore.$anonfun$write$3$adapted(ElementTrackingStore.scala:133)
>   at 
> org.apache.spark.status.ElementTrackingStore$$Lambda$986/162337848.apply(Unknown
>  Source)
>   at scala.collection.immutable.List.foreach(List.scala:431)
>   at 
> org.apache.spark.status.ElementTrackingStore.$anonfun$write$2(ElementTrackingStore.scala:133)
>   at 
> org.apache.spark.status.ElementTrackingStore.$anonfun$write$2$adapted(ElementTrackingStore.scala:131)
>   at 
> org.apache.spark.status.ElementTrackingStore$$Lambda$984/600376389.apply(Unknown
>  Source)
>   at 
> org.apache.spark.status.ElementTrackingStore$LatchedTriggers.$anonfun$fireOnce$1(ElementTrackingStore.scala:58)
>   at 
> org.apache.spark.status.ElementTrackingStore$LatchedTriggers$$Lambda$985/1187323214.apply$mcV$sp(Unknown
>  Source)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at org.apache.spark.util.Utils$.tryLog(Utils.scala:2013)
>   at 
> 

[jira] [Commented] (SPARK-36827) Task/Stage/Job data remain in memory leads memory leak

2021-09-24 Thread tenglei (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17419629#comment-17419629
 ] 

tenglei commented on SPARK-36827:
-

I met the same problem in Spark 2.3.x and found a pull request, 
[https://github.com/apache/spark/pull/24616], that seems related to this 
problem and should fix it. How long did it take for this problem to occur 
for you?

> Task/Stage/Job data remain in memory leads memory leak
> --
>
> Key: SPARK-36827
> URL: https://issues.apache.org/jira/browse/SPARK-36827
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.2
>Reporter: Kohki Nishio
>Priority: Major
> Attachments: mem1.txt, worker.txt
>
>
> Noticing memory-leak-like behavior: a steady increase of heap after GC that 
> eventually leads to a service failure. 
> The GC histogram shows a very high number of Task/Stage/Job data wrappers:
> {code}
>  num #instances #bytes  class name 
> -- 
>6:   7835346 2444627952  org.apache.spark.status.TaskDataWrapper 
>   25:   3765152  180727296  org.apache.spark.status.StageDataWrapper 
>   88:2322559290200  org.apache.spark.status.JobDataWrapper 
> {code}
> Thread dumps clearly show that the cleanup thread is always running cleanupStages:
> {code}
> "element-tracking-store-worker" #355 daemon prio=5 os_prio=0 
> tid=0x7f31b0014800 nid=0x409 runnable [0x7f2f25783000]
>java.lang.Thread.State: RUNNABLE
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.util.kvstore.KVTypeInfo$MethodAccessor.get(KVTypeInfo.java:162)
>   at 
> org.apache.spark.util.kvstore.InMemoryStore$InMemoryView.compare(InMemoryStore.java:434)
>   at 
> org.apache.spark.util.kvstore.InMemoryStore$InMemoryView.lambda$iterator$0(InMemoryStore.java:375)
>   at 
> org.apache.spark.util.kvstore.InMemoryStore$InMemoryView$$Lambda$9000/574018760.compare(Unknown
>  Source)
>   at java.util.TimSort.gallopLeft(TimSort.java:542)
>   at java.util.TimSort.mergeLo(TimSort.java:752)
>   at java.util.TimSort.mergeAt(TimSort.java:514)
>   at java.util.TimSort.mergeCollapse(TimSort.java:439)
>   at java.util.TimSort.sort(TimSort.java:245)
>   at java.util.Arrays.sort(Arrays.java:1512)
>   at java.util.ArrayList.sort(ArrayList.java:1464)
>   at 
> org.apache.spark.util.kvstore.InMemoryStore$InMemoryView.iterator(InMemoryStore.java:375)
>   at 
> org.apache.spark.util.kvstore.KVStoreView.closeableIterator(KVStoreView.java:117)
>   at 
> org.apache.spark.status.AppStatusListener.$anonfun$cleanupStages$2(AppStatusListener.scala:1269)
>   at 
> org.apache.spark.status.AppStatusListener$$Lambda$9126/608388595.apply(Unknown
>  Source)
>   at scala.collection.immutable.List.map(List.scala:297)
>   at 
> org.apache.spark.status.AppStatusListener.cleanupStages(AppStatusListener.scala:1260)
>   at 
> org.apache.spark.status.AppStatusListener.$anonfun$new$3(AppStatusListener.scala:98)
>   at 
> org.apache.spark.status.AppStatusListener$$Lambda$646/596139882.apply$mcVJ$sp(Unknown
>  Source)
>   at 
> org.apache.spark.status.ElementTrackingStore.$anonfun$write$3(ElementTrackingStore.scala:135)
>   at 
> org.apache.spark.status.ElementTrackingStore.$anonfun$write$3$adapted(ElementTrackingStore.scala:133)
>   at 
> org.apache.spark.status.ElementTrackingStore$$Lambda$986/162337848.apply(Unknown
>  Source)
>   at scala.collection.immutable.List.foreach(List.scala:431)
>   at 
> org.apache.spark.status.ElementTrackingStore.$anonfun$write$2(ElementTrackingStore.scala:133)
>   at 
> org.apache.spark.status.ElementTrackingStore.$anonfun$write$2$adapted(ElementTrackingStore.scala:131)
>   at 
> org.apache.spark.status.ElementTrackingStore$$Lambda$984/600376389.apply(Unknown
>  Source)
>   at 
> org.apache.spark.status.ElementTrackingStore$LatchedTriggers.$anonfun$fireOnce$1(ElementTrackingStore.scala:58)
>   at 
> org.apache.spark.status.ElementTrackingStore$LatchedTriggers$$Lambda$985/1187323214.apply$mcV$sp(Unknown
>  Source)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at org.apache.spark.util.Utils$.tryLog(Utils.scala:2013)
>   at 
> org.apache.spark.status.ElementTrackingStore$$anon$1.run(ElementTrackingStore.scala:117)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> 

[jira] [Commented] (SPARK-36294) Refactor fifth set of 20 query execution errors to use error classes

2021-09-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17419623#comment-17419623
 ] 

Apache Spark commented on SPARK-36294:
--

User 'Peng-Lei' has created a pull request for this issue:
https://github.com/apache/spark/pull/34093

> Refactor fifth set of 20 query execution errors to use error classes
> 
>
> Key: SPARK-36294
> URL: https://issues.apache.org/jira/browse/SPARK-36294
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: Karen Feng
>Priority: Major
>
> Refactor some exceptions in 
> [QueryExecutionErrors|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala]
>  to use error classes.
> There are currently ~350 exceptions in this file, so this PR focuses only on 
> the fifth set of 20.
> {code:java}
> createStreamingSourceNotSpecifySchemaError
> streamedOperatorUnsupportedByDataSourceError
> multiplePathsSpecifiedError
> failedToFindDataSourceError
> removedClassInSpark2Error
> incompatibleDataSourceRegisterError
> unrecognizedFileFormatError
> sparkUpgradeInReadingDatesError
> sparkUpgradeInWritingDatesError
> buildReaderUnsupportedForFileFormatError
> jobAbortedError
> taskFailedWhileWritingRowsError
> readCurrentFileNotFoundError
> unsupportedSaveModeError
> cannotClearOutputDirectoryError
> cannotClearPartitionDirectoryError
> failedToCastValueToDataTypeForPartitionColumnError
> endOfStreamError
> fallbackV1RelationReportsInconsistentSchemaError
> cannotDropNonemptyNamespaceError
> {code}
> For more detail, see the parent ticket SPARK-36094.
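
To make the target pattern concrete, here is a hedged sketch for one of the 
listed methods. The error-class name and the exact constructor shape are 
assumptions for illustration (the real names live in error-classes.json and 
the SparkThrowable framework described in SPARK-36094), not the final code:

{code:scala}
import java.io.IOException
import org.apache.hadoop.fs.Path
import org.apache.spark.SparkException

// Before: the message text is hard-coded at the throw site.
def cannotClearOutputDirectoryError(path: Path): Throwable = {
  new IOException(s"Unable to clear output directory $path")
}

// After (sketch): the same failure routed through an error class, so the
// message template lives in error-classes.json under a stable identifier.
def cannotClearOutputDirectoryErrorV2(path: Path): Throwable = {
  new SparkException(
    errorClass = "CANNOT_CLEAR_OUTPUT_DIRECTORY", // hypothetical class name
    messageParameters = Array(path.toString),
    cause = null)
}
{code}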



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36294) Refactor fifth set of 20 query execution errors to use error classes

2021-09-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36294:


Assignee: Apache Spark

> Refactor fifth set of 20 query execution errors to use error classes
> 
>
> Key: SPARK-36294
> URL: https://issues.apache.org/jira/browse/SPARK-36294
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: Karen Feng
>Assignee: Apache Spark
>Priority: Major
>
> Refactor some exceptions in 
> [QueryExecutionErrors|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala]
>  to use error classes.
> There are currently ~350 exceptions in this file, so this PR focuses only on 
> the fifth set of 20.
> {code:java}
> createStreamingSourceNotSpecifySchemaError
> streamedOperatorUnsupportedByDataSourceError
> multiplePathsSpecifiedError
> failedToFindDataSourceError
> removedClassInSpark2Error
> incompatibleDataSourceRegisterError
> unrecognizedFileFormatError
> sparkUpgradeInReadingDatesError
> sparkUpgradeInWritingDatesError
> buildReaderUnsupportedForFileFormatError
> jobAbortedError
> taskFailedWhileWritingRowsError
> readCurrentFileNotFoundError
> unsupportedSaveModeError
> cannotClearOutputDirectoryError
> cannotClearPartitionDirectoryError
> failedToCastValueToDataTypeForPartitionColumnError
> endOfStreamError
> fallbackV1RelationReportsInconsistentSchemaError
> cannotDropNonemptyNamespaceError
> {code}
> For more detail, see the parent ticket SPARK-36094.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36294) Refactor fifth set of 20 query execution errors to use error classes

2021-09-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17419622#comment-17419622
 ] 

Apache Spark commented on SPARK-36294:
--

User 'Peng-Lei' has created a pull request for this issue:
https://github.com/apache/spark/pull/34093

> Refactor fifth set of 20 query execution errors to use error classes
> 
>
> Key: SPARK-36294
> URL: https://issues.apache.org/jira/browse/SPARK-36294
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: Karen Feng
>Priority: Major
>
> Refactor some exceptions in 
> [QueryExecutionErrors|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala]
>  to use error classes.
> There are currently ~350 exceptions in this file, so this PR focuses only on 
> the fifth set of 20.
> {code:java}
> createStreamingSourceNotSpecifySchemaError
> streamedOperatorUnsupportedByDataSourceError
> multiplePathsSpecifiedError
> failedToFindDataSourceError
> removedClassInSpark2Error
> incompatibleDataSourceRegisterError
> unrecognizedFileFormatError
> sparkUpgradeInReadingDatesError
> sparkUpgradeInWritingDatesError
> buildReaderUnsupportedForFileFormatError
> jobAbortedError
> taskFailedWhileWritingRowsError
> readCurrentFileNotFoundError
> unsupportedSaveModeError
> cannotClearOutputDirectoryError
> cannotClearPartitionDirectoryError
> failedToCastValueToDataTypeForPartitionColumnError
> endOfStreamError
> fallbackV1RelationReportsInconsistentSchemaError
> cannotDropNonemptyNamespaceError
> {code}
> For more detail, see the parent ticket SPARK-36094.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36294) Refactor fifth set of 20 query execution errors to use error classes

2021-09-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36294:


Assignee: (was: Apache Spark)

> Refactor fifth set of 20 query execution errors to use error classes
> 
>
> Key: SPARK-36294
> URL: https://issues.apache.org/jira/browse/SPARK-36294
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: Karen Feng
>Priority: Major
>
> Refactor some exceptions in 
> [QueryExecutionErrors|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala]
>  to use error classes.
> There are currently ~350 exceptions in this file, so this PR focuses only on 
> the fifth set of 20.
> {code:java}
> createStreamingSourceNotSpecifySchemaError
> streamedOperatorUnsupportedByDataSourceError
> multiplePathsSpecifiedError
> failedToFindDataSourceError
> removedClassInSpark2Error
> incompatibleDataSourceRegisterError
> unrecognizedFileFormatError
> sparkUpgradeInReadingDatesError
> sparkUpgradeInWritingDatesError
> buildReaderUnsupportedForFileFormatError
> jobAbortedError
> taskFailedWhileWritingRowsError
> readCurrentFileNotFoundError
> unsupportedSaveModeError
> cannotClearOutputDirectoryError
> cannotClearPartitionDirectoryError
> failedToCastValueToDataTypeForPartitionColumnError
> endOfStreamError
> fallbackV1RelationReportsInconsistentSchemaError
> cannotDropNonemptyNamespaceError
> {code}
> For more detail, see the parent ticket SPARK-36094.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36825) Read/write dataframes with ANSI intervals from/to parquet files

2021-09-24 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-36825.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34057
[https://github.com/apache/spark/pull/34057]

> Read/write dataframes with ANSI intervals from/to parquet files
> ---
>
> Key: SPARK-36825
> URL: https://issues.apache.org/jira/browse/SPARK-36825
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.3.0
>
>
> Implement writing and reading of ANSI interval columns (year-month and 
> day-time intervals) in dataframes to and from Parquet datasources.
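
A minimal usage sketch of what this enables, assuming the Spark 3.3 behavior 
described above (runs in spark-shell; the path is illustrative):

{code:scala}
import java.time.{Duration, Period}
import spark.implicits._

// java.time.Period maps to YearMonthIntervalType and
// java.time.Duration maps to DayTimeIntervalType.
val df = Seq((Period.ofYears(1).plusMonths(2), Duration.ofDays(3).plusHours(4)))
  .toDF("ym", "dt")

df.write.mode("overwrite").parquet("/tmp/ansi_intervals")
spark.read.parquet("/tmp/ansi_intervals").printSchema()
// root
//  |-- ym: interval year to month (nullable = true)
//  |-- dt: interval day to second (nullable = true)
{code}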



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36840) Support DPP if there is no selective predicate on the filtering side

2021-09-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17419604#comment-17419604
 ] 

Apache Spark commented on SPARK-36840:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/34070

> Support DPP if there is no selective predicate on the filtering side
> 
>
> Key: SPARK-36840
> URL: https://issues.apache.org/jira/browse/SPARK-36840
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yuming Wang
>Priority: Major
>
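
For context, a sketch of the join shape this targets (table and column names 
are invented): dynamic partition pruning currently requires a selective 
predicate on the filtering side, so a plain key join like the one below is 
not pruned today; the proposal is to allow it, subject to the usual cost 
heuristics.

{code:scala}
// `fact` is partitioned by dt; `dim` carries no selective filter.
spark.sql("""
  SELECT f.*
  FROM fact f
  JOIN dim d ON f.dt = d.dt
""").explain()
{code}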




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36840) Support DPP if there is no selective predicate on the filtering side

2021-09-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36840:


Assignee: Apache Spark

> Support DPP if there is no selective predicate on the filtering side
> 
>
> Key: SPARK-36840
> URL: https://issues.apache.org/jira/browse/SPARK-36840
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36840) Support DPP if there is no selective predicate on the filtering side

2021-09-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36840:


Assignee: (was: Apache Spark)

> Support DPP if there is no selective predicate on the filtering side
> 
>
> Key: SPARK-36840
> URL: https://issues.apache.org/jira/browse/SPARK-36840
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36838) Keep In/InSet use same nullSafeEval

2021-09-24 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-36838:
--
Summary: Keep In/InSet use same nullSafeEval  (was: In/InSet should handle 
NULL correctly )

> Keep In/InSet use same nullSafeEval
> ---
>
> Key: SPARK-36838
> URL: https://issues.apache.org/jira/browse/SPARK-36838
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.2, 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> In the current code, In/InSet returns null when the value is null; this 
> behavior is not correct.
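
For reference, a small sketch of the evaluation under discussion, following 
SQL three-valued logic as Spark applies it (spark-shell, illustrative 
literals):

{code:scala}
// NULL on either side of IN yields NULL unless a definite match exists.
spark.sql("SELECT null IN (1, 2)").show()  // NULL
spark.sql("SELECT 1 IN (null, 2)").show()  // NULL: no match, null candidate
spark.sql("SELECT 1 IN (1, null)").show()  // true: a definite match wins
{code}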



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-36838) Keep In/InSet use same nullSafeEval

2021-09-24 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu reopened SPARK-36838:
---

> Keep In/InSet use same nullSafeEval
> ---
>
> Key: SPARK-36838
> URL: https://issues.apache.org/jira/browse/SPARK-36838
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.2, 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> In the current code, In/InSet returns null when the value is null; this 
> behavior is not correct.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36840) Support DPP if there is no selective predicate on the filtering side

2021-09-24 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-36840:
---

 Summary: Support DPP if there is no selective predicate on the 
filtering side
 Key: SPARK-36840
 URL: https://issues.apache.org/jira/browse/SPARK-36840
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: Yuming Wang






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org