[jira] [Assigned] (SPARK-36850) Migrate CreateTableStatement to v2 command framework
[ https://issues.apache.org/jira/browse/SPARK-36850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-36850:
------------------------------------

    Assignee:     (was: Apache Spark)

> Migrate CreateTableStatement to v2 command framework
> ----------------------------------------------------
>
>                 Key: SPARK-36850
>                 URL: https://issues.apache.org/jira/browse/SPARK-36850
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Huaxin Gao
>            Priority: Major
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36850) Migrate CreateTableStatement to v2 command framework
[ https://issues.apache.org/jira/browse/SPARK-36850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-36850:
------------------------------------

    Assignee: Apache Spark

> Migrate CreateTableStatement to v2 command framework
> ----------------------------------------------------
>
>                 Key: SPARK-36850
>                 URL: https://issues.apache.org/jira/browse/SPARK-36850
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Huaxin Gao
>            Assignee: Apache Spark
>            Priority: Major
>
[jira] [Commented] (SPARK-36850) Migrate CreateTableStatement to v2 command framework
[ https://issues.apache.org/jira/browse/SPARK-36850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17420048#comment-17420048 ]

Apache Spark commented on SPARK-36850:
--------------------------------------

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/34060

> Migrate CreateTableStatement to v2 command framework
> ----------------------------------------------------
>
>                 Key: SPARK-36850
>                 URL: https://issues.apache.org/jira/browse/SPARK-36850
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Huaxin Gao
>            Priority: Major
>
[jira] [Created] (SPARK-36850) Migrate CreateTableStatement to v2 command framework
Huaxin Gao created SPARK-36850:
----------------------------------

             Summary: Migrate CreateTableStatement to v2 command framework
                 Key: SPARK-36850
                 URL: https://issues.apache.org/jira/browse/SPARK-36850
             Project: Spark
          Issue Type: Sub-task
          Components: SQL
    Affects Versions: 3.3.0
            Reporter: Huaxin Gao
[jira] [Created] (SPARK-36849) Migrate UseStatement to v2 command framework
Huaxin Gao created SPARK-36849:
----------------------------------

             Summary: Migrate UseStatement to v2 command framework
                 Key: SPARK-36849
                 URL: https://issues.apache.org/jira/browse/SPARK-36849
             Project: Spark
          Issue Type: Sub-task
          Components: SQL
    Affects Versions: 3.3.0
            Reporter: Huaxin Gao
[jira] [Commented] (SPARK-36849) Migrate UseStatement to v2 command framework
[ https://issues.apache.org/jira/browse/SPARK-36849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17420041#comment-17420041 ]

Huaxin Gao commented on SPARK-36849:
------------------------------------

I am working on this

> Migrate UseStatement to v2 command framework
> --------------------------------------------
>
>                 Key: SPARK-36849
>                 URL: https://issues.apache.org/jira/browse/SPARK-36849
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Huaxin Gao
>            Priority: Major
>
[jira] [Commented] (SPARK-36848) Migrate ShowCurrentNamespaceStatement to v2 command framework
[ https://issues.apache.org/jira/browse/SPARK-36848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17420040#comment-17420040 ]

Apache Spark commented on SPARK-36848:
--------------------------------------

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/34104

> Migrate ShowCurrentNamespaceStatement to v2 command framework
> -------------------------------------------------------------
>
>                 Key: SPARK-36848
>                 URL: https://issues.apache.org/jira/browse/SPARK-36848
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Huaxin Gao
>            Priority: Major
>
[jira] [Assigned] (SPARK-36848) Migrate ShowCurrentNamespaceStatement to v2 command framework
[ https://issues.apache.org/jira/browse/SPARK-36848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-36848:
------------------------------------

    Assignee:     (was: Apache Spark)

> Migrate ShowCurrentNamespaceStatement to v2 command framework
> -------------------------------------------------------------
>
>                 Key: SPARK-36848
>                 URL: https://issues.apache.org/jira/browse/SPARK-36848
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Huaxin Gao
>            Priority: Major
>
[jira] [Commented] (SPARK-36848) Migrate ShowCurrentNamespaceStatement to v2 command framework
[ https://issues.apache.org/jira/browse/SPARK-36848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17420039#comment-17420039 ]

Apache Spark commented on SPARK-36848:
--------------------------------------

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/34104

> Migrate ShowCurrentNamespaceStatement to v2 command framework
> -------------------------------------------------------------
>
>                 Key: SPARK-36848
>                 URL: https://issues.apache.org/jira/browse/SPARK-36848
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Huaxin Gao
>            Priority: Major
>
[jira] [Assigned] (SPARK-36848) Migrate ShowCurrentNamespaceStatement to v2 command framework
[ https://issues.apache.org/jira/browse/SPARK-36848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-36848:
------------------------------------

    Assignee: Apache Spark

> Migrate ShowCurrentNamespaceStatement to v2 command framework
> -------------------------------------------------------------
>
>                 Key: SPARK-36848
>                 URL: https://issues.apache.org/jira/browse/SPARK-36848
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Huaxin Gao
>            Assignee: Apache Spark
>            Priority: Major
>
[jira] [Created] (SPARK-36848) Migrate ShowCurrentNamespaceStatement to v2 command framework
Huaxin Gao created SPARK-36848:
----------------------------------

             Summary: Migrate ShowCurrentNamespaceStatement to v2 command framework
                 Key: SPARK-36848
                 URL: https://issues.apache.org/jira/browse/SPARK-36848
             Project: Spark
          Issue Type: Sub-task
          Components: SQL
    Affects Versions: 3.3.0
            Reporter: Huaxin Gao
[jira] [Commented] (SPARK-32712) Support writing Hive non-ORC/Parquet bucketed table
[ https://issues.apache.org/jira/browse/SPARK-32712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17420037#comment-17420037 ]

Apache Spark commented on SPARK-32712:
--------------------------------------

User 'c21' has created a pull request for this issue:
https://github.com/apache/spark/pull/34103

> Support writing Hive non-ORC/Parquet bucketed table
> ---------------------------------------------------
>
>                 Key: SPARK-32712
>                 URL: https://issues.apache.org/jira/browse/SPARK-32712
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Cheng Su
>            Priority: Minor
>
> The Hive non-ORC/Parquet write code path is the original Hive table write
> path (InsertIntoHiveTable). This JIRA is to support writing hivehash
> bucketed tables (for Hive 1.x.y and 2.x.y) and murmur3hash bucketed tables
> (for Hive 3.x.y) for these non-ORC/Parquet-serde Hive bucketed tables.
>
[jira] [Assigned] (SPARK-32712) Support writing Hive non-ORC/Parquet bucketed table
[ https://issues.apache.org/jira/browse/SPARK-32712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-32712:
------------------------------------

    Assignee:     (was: Apache Spark)

> Support writing Hive non-ORC/Parquet bucketed table
> ---------------------------------------------------
>
>                 Key: SPARK-32712
>                 URL: https://issues.apache.org/jira/browse/SPARK-32712
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Cheng Su
>            Priority: Minor
>
> The Hive non-ORC/Parquet write code path is the original Hive table write
> path (InsertIntoHiveTable). This JIRA is to support writing hivehash
> bucketed tables (for Hive 1.x.y and 2.x.y) and murmur3hash bucketed tables
> (for Hive 3.x.y) for these non-ORC/Parquet-serde Hive bucketed tables.
>
[jira] [Assigned] (SPARK-32712) Support writing Hive non-ORC/Parquet bucketed table
[ https://issues.apache.org/jira/browse/SPARK-32712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-32712:
------------------------------------

    Assignee: Apache Spark

> Support writing Hive non-ORC/Parquet bucketed table
> ---------------------------------------------------
>
>                 Key: SPARK-32712
>                 URL: https://issues.apache.org/jira/browse/SPARK-32712
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Cheng Su
>            Assignee: Apache Spark
>            Priority: Minor
>
> The Hive non-ORC/Parquet write code path is the original Hive table write
> path (InsertIntoHiveTable). This JIRA is to support writing hivehash
> bucketed tables (for Hive 1.x.y and 2.x.y) and murmur3hash bucketed tables
> (for Hive 3.x.y) for these non-ORC/Parquet-serde Hive bucketed tables.
>
[jira] [Commented] (SPARK-32712) Support writing Hive non-ORC/Parquet bucketed table
[ https://issues.apache.org/jira/browse/SPARK-32712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17420036#comment-17420036 ]

Apache Spark commented on SPARK-32712:
--------------------------------------

User 'c21' has created a pull request for this issue:
https://github.com/apache/spark/pull/34103

> Support writing Hive non-ORC/Parquet bucketed table
> ---------------------------------------------------
>
>                 Key: SPARK-32712
>                 URL: https://issues.apache.org/jira/browse/SPARK-32712
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Cheng Su
>            Priority: Minor
>
> The Hive non-ORC/Parquet write code path is the original Hive table write
> path (InsertIntoHiveTable). This JIRA is to support writing hivehash
> bucketed tables (for Hive 1.x.y and 2.x.y) and murmur3hash bucketed tables
> (for Hive 3.x.y) for these non-ORC/Parquet-serde Hive bucketed tables.
>
[jira] [Commented] (SPARK-36847) Explicitly specify error codes when ignoring type hint errors
[ https://issues.apache.org/jira/browse/SPARK-36847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17420029#comment-17420029 ]

Apache Spark commented on SPARK-36847:
--------------------------------------

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/34102

> Explicitly specify error codes when ignoring type hint errors
> -------------------------------------------------------------
>
>                 Key: SPARK-36847
>                 URL: https://issues.apache.org/jira/browse/SPARK-36847
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 3.3.0
>            Reporter: Takuya Ueshin
>            Priority: Major
>
> We use a lot of {{type: ignore}} annotations to ignore type hint errors in
> pandas-on-Spark.
> We should explicitly specify the error codes to make it clear what kind of
> error is being ignored; then the type hint checker can check more cases.
>
[jira] [Commented] (SPARK-36847) Explicitly specify error codes when ignoring type hint errors
[ https://issues.apache.org/jira/browse/SPARK-36847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17420027#comment-17420027 ]

Apache Spark commented on SPARK-36847:
--------------------------------------

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/34102

> Explicitly specify error codes when ignoring type hint errors
> -------------------------------------------------------------
>
>                 Key: SPARK-36847
>                 URL: https://issues.apache.org/jira/browse/SPARK-36847
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 3.3.0
>            Reporter: Takuya Ueshin
>            Priority: Major
>
> We use a lot of {{type: ignore}} annotations to ignore type hint errors in
> pandas-on-Spark.
> We should explicitly specify the error codes to make it clear what kind of
> error is being ignored; then the type hint checker can check more cases.
>
[jira] [Assigned] (SPARK-36847) Explicitly specify error codes when ignoring type hint errors
[ https://issues.apache.org/jira/browse/SPARK-36847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-36847:
------------------------------------

    Assignee: Apache Spark

> Explicitly specify error codes when ignoring type hint errors
> -------------------------------------------------------------
>
>                 Key: SPARK-36847
>                 URL: https://issues.apache.org/jira/browse/SPARK-36847
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 3.3.0
>            Reporter: Takuya Ueshin
>            Assignee: Apache Spark
>            Priority: Major
>
> We use a lot of {{type: ignore}} annotations to ignore type hint errors in
> pandas-on-Spark.
> We should explicitly specify the error codes to make it clear what kind of
> error is being ignored; then the type hint checker can check more cases.
>
[jira] [Assigned] (SPARK-36847) Explicitly specify error codes when ignoring type hint errors
[ https://issues.apache.org/jira/browse/SPARK-36847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-36847:
------------------------------------

    Assignee:     (was: Apache Spark)

> Explicitly specify error codes when ignoring type hint errors
> -------------------------------------------------------------
>
>                 Key: SPARK-36847
>                 URL: https://issues.apache.org/jira/browse/SPARK-36847
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 3.3.0
>            Reporter: Takuya Ueshin
>            Priority: Major
>
> We use a lot of {{type: ignore}} annotations to ignore type hint errors in
> pandas-on-Spark.
> We should explicitly specify the error codes to make it clear what kind of
> error is being ignored; then the type hint checker can check more cases.
>
[jira] [Created] (SPARK-36847) Explicitly specify error codes when ignoring type hint errors
Takuya Ueshin created SPARK-36847:
-------------------------------------

             Summary: Explicitly specify error codes when ignoring type hint errors
                 Key: SPARK-36847
                 URL: https://issues.apache.org/jira/browse/SPARK-36847
             Project: Spark
          Issue Type: Improvement
          Components: PySpark
    Affects Versions: 3.3.0
            Reporter: Takuya Ueshin

We use a lot of {{type: ignore}} annotations to ignore type hint errors in pandas-on-Spark.

We should explicitly specify the error codes to make it clear what kind of error is being ignored; then the type hint checker can check more cases.
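The difference the ticket asks for can be sketched with a toy example (illustrative variables, not the actual pandas-on-Spark call sites; the error-code syntax is mypy's):

```python
# Bare ignore: silences *every* error mypy reports on this line,
# including unrelated mistakes introduced later.
x: int = "oops"  # type: ignore

# Scoped ignore: silences only the named error code ("assignment" here),
# so mypy can still flag any other kind of error on the same line.
y: int = "oops"  # type: ignore[assignment]
```

At runtime Python does not enforce annotations, so both lines execute; the difference only matters to the type checker, which lists its error codes when run with `--show-error-codes`.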
[jira] [Assigned] (SPARK-36846) Inline most of type hint files under pyspark/sql/pandas folder
[ https://issues.apache.org/jira/browse/SPARK-36846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-36846:
------------------------------------

    Assignee:     (was: Apache Spark)

> Inline most of type hint files under pyspark/sql/pandas folder
> --------------------------------------------------------------
>
>                 Key: SPARK-36846
>                 URL: https://issues.apache.org/jira/browse/SPARK-36846
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark
>    Affects Versions: 3.3.0
>            Reporter: Takuya Ueshin
>            Priority: Major
>
> Inline type hint files under {{pyspark/sql/pandas}} folder, except for
> {{pyspark/sql/pandas/functions.pyi}} and files under
> {{pyspark/sql/pandas/_typing}}.
> * Since the file contains a lot of overloads, we should revisit and manage
> it separately.
> * We can't inline files under {{pyspark/sql/pandas/_typing}} because it
> includes new syntax for type hints.
>
[jira] [Commented] (SPARK-36846) Inline most of type hint files under pyspark/sql/pandas folder
[ https://issues.apache.org/jira/browse/SPARK-36846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17419991#comment-17419991 ]

Apache Spark commented on SPARK-36846:
--------------------------------------

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/34101

> Inline most of type hint files under pyspark/sql/pandas folder
> --------------------------------------------------------------
>
>                 Key: SPARK-36846
>                 URL: https://issues.apache.org/jira/browse/SPARK-36846
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark
>    Affects Versions: 3.3.0
>            Reporter: Takuya Ueshin
>            Priority: Major
>
> Inline type hint files under {{pyspark/sql/pandas}} folder, except for
> {{pyspark/sql/pandas/functions.pyi}} and files under
> {{pyspark/sql/pandas/_typing}}.
> * Since the file contains a lot of overloads, we should revisit and manage
> it separately.
> * We can't inline files under {{pyspark/sql/pandas/_typing}} because it
> includes new syntax for type hints.
>
[jira] [Assigned] (SPARK-36846) Inline most of type hint files under pyspark/sql/pandas folder
[ https://issues.apache.org/jira/browse/SPARK-36846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-36846:
------------------------------------

    Assignee: Apache Spark

> Inline most of type hint files under pyspark/sql/pandas folder
> --------------------------------------------------------------
>
>                 Key: SPARK-36846
>                 URL: https://issues.apache.org/jira/browse/SPARK-36846
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark
>    Affects Versions: 3.3.0
>            Reporter: Takuya Ueshin
>            Assignee: Apache Spark
>            Priority: Major
>
> Inline type hint files under {{pyspark/sql/pandas}} folder, except for
> {{pyspark/sql/pandas/functions.pyi}} and files under
> {{pyspark/sql/pandas/_typing}}.
> * Since the file contains a lot of overloads, we should revisit and manage
> it separately.
> * We can't inline files under {{pyspark/sql/pandas/_typing}} because it
> includes new syntax for type hints.
>
[jira] [Updated] (SPARK-36846) Inline most of type hint files under pyspark/sql/pandas folder
[ https://issues.apache.org/jira/browse/SPARK-36846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takuya Ueshin updated SPARK-36846:
----------------------------------
    Description:
Inline type hint files under {{pyspark/sql/pandas}} folder, except for {{pyspark/sql/pandas/functions.pyi}} and files under {{pyspark/sql/pandas/_typing}}.
- Since the file contains a lot of overloads, we should revisit and manage it separately.
- We can't inline files under {{pyspark/sql/pandas/_typing}} because it includes new syntax for type hints.

  was: Inline type hint files under {{pyspark/sql/pandas}} folder, except for {{pyspark/sql/pandas/functions.pyi}} and files under {{pyspark/sql/pandas/_typing}}.

> Inline most of type hint files under pyspark/sql/pandas folder
> --------------------------------------------------------------
>
>                 Key: SPARK-36846
>                 URL: https://issues.apache.org/jira/browse/SPARK-36846
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark
>    Affects Versions: 3.3.0
>            Reporter: Takuya Ueshin
>            Priority: Major
>
> Inline type hint files under {{pyspark/sql/pandas}} folder, except for
> {{pyspark/sql/pandas/functions.pyi}} and files under
> {{pyspark/sql/pandas/_typing}}.
> - Since the file contains a lot of overloads, we should revisit and manage it
> separately.
> - We can't inline files under {{pyspark/sql/pandas/_typing}} because it
> includes new syntax for type hints.
>
[jira] [Updated] (SPARK-36846) Inline most of type hint files under pyspark/sql/pandas folder
[ https://issues.apache.org/jira/browse/SPARK-36846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takuya Ueshin updated SPARK-36846:
----------------------------------
    Description:
Inline type hint files under {{pyspark/sql/pandas}} folder, except for {{pyspark/sql/pandas/functions.pyi}} and files under {{pyspark/sql/pandas/_typing}}.
* Since the file contains a lot of overloads, we should revisit and manage it separately.
* We can't inline files under {{pyspark/sql/pandas/_typing}} because it includes new syntax for type hints.

  was:
Inline type hint files under {{pyspark/sql/pandas}} folder, except for {{pyspark/sql/pandas/functions.pyi}} and files under {{pyspark/sql/pandas/_typing}}.
- Since the file contains a lot of overloads, we should revisit and manage it separately.
- We can't inline files under {{pyspark/sql/pandas/_typing}} because it includes new syntax for type hints.

> Inline most of type hint files under pyspark/sql/pandas folder
> --------------------------------------------------------------
>
>                 Key: SPARK-36846
>                 URL: https://issues.apache.org/jira/browse/SPARK-36846
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark
>    Affects Versions: 3.3.0
>            Reporter: Takuya Ueshin
>            Priority: Major
>
> Inline type hint files under {{pyspark/sql/pandas}} folder, except for
> {{pyspark/sql/pandas/functions.pyi}} and files under
> {{pyspark/sql/pandas/_typing}}.
> * Since the file contains a lot of overloads, we should revisit and manage
> it separately.
> * We can't inline files under {{pyspark/sql/pandas/_typing}} because it
> includes new syntax for type hints.
>
[jira] [Created] (SPARK-36846) Inline most of type hint files under pyspark/sql/pandas folder
Takuya Ueshin created SPARK-36846:
-------------------------------------

             Summary: Inline most of type hint files under pyspark/sql/pandas folder
                 Key: SPARK-36846
                 URL: https://issues.apache.org/jira/browse/SPARK-36846
             Project: Spark
          Issue Type: Sub-task
          Components: PySpark
    Affects Versions: 3.3.0
            Reporter: Takuya Ueshin

Inline type hint files under {{pyspark/sql/pandas}} folder, except for {{pyspark/sql/pandas/functions.pyi}} and files under {{pyspark/sql/pandas/_typing}}.
[jira] [Created] (SPARK-36845) Inline type hint files
Takuya Ueshin created SPARK-36845:
-------------------------------------

             Summary: Inline type hint files
                 Key: SPARK-36845
                 URL: https://issues.apache.org/jira/browse/SPARK-36845
             Project: Spark
          Issue Type: Umbrella
          Components: PySpark, SQL
    Affects Versions: 3.3.0
            Reporter: Takuya Ueshin

Currently there are type hint stub files ({{*.pyi}}) to show the expected types for functions, but we can also take advantage of static type checking within the functions by inlining the type hints.
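What "inlining" buys can be sketched with a hypothetical toy class (not the real pyspark modules): with a separate stub, the checker sees only the signatures, so only callers are verified; once the hints live on the implementation, the function bodies are type-checked too.

```python
# Before (hypothetical stub file, e.g. frame.pyi):
#     class Frame:
#         def head(self, n: int = ...) -> "Frame": ...
# The unannotated frame.py implementation is a black box to the checker.

# After inlining, the same signature sits on the implementation itself,
# so a checker like mypy verifies the body against these annotations:
class Frame:
    def __init__(self, rows: list) -> None:
        self.rows = rows

    def head(self, n: int = 5) -> "Frame":
        # A bug here (say, returning self.rows) would now be caught
        # statically, because the declared return type is Frame.
        return Frame(self.rows[:n])
```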
[jira] [Resolved] (SPARK-35174) Avoid opening watch when waitAppCompletion is false
[ https://issues.apache.org/jira/browse/SPARK-35174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-35174.
-----------------------------------
    Fix Version/s: 3.3.0
       Resolution: Fixed

Issue resolved by pull request 34095
[https://github.com/apache/spark/pull/34095]

> Avoid opening watch when waitAppCompletion is false
> ---------------------------------------------------
>
>                 Key: SPARK-35174
>                 URL: https://issues.apache.org/jira/browse/SPARK-35174
>             Project: Spark
>          Issue Type: Improvement
>          Components: Kubernetes
>    Affects Versions: 3.1.1
>            Reporter: Jonathan Lafleche
>            Priority: Minor
>             Fix For: 3.3.0
>
> In spark-submit, we currently [open a pod watch for any spark submission|https://github.com/apache/spark/blame/0494dc90af48ce7da0625485a4dc6917a244d580/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/KubernetesClientApplication.scala#L150-L167].
> If WAIT_FOR_APP_COMPLETION is false, we then immediately ignore the result
> of the watcher and break out of the watcher.
> When submitting spark applications at scale, this is a source of operational
> pain, since opening the watch relies on opening a websocket, which tends to
> run into subtle networking issues around negotiating the websocket connection.
> I'd like to change this behaviour so that we eagerly check whether we are
> waiting on app completion, and avoid opening the watch altogether when
> WAIT_FOR_APP_COMPLETION is false.
> Would you accept a contribution for that change, or are there any concerns
> I've overlooked?
>
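The proposed control flow can be sketched as follows; function and parameter names are illustrative, not the actual KubernetesClientApplication API:

```python
def submit(create_driver_pod, open_watch, wait_app_completion: bool):
    """Submit the driver pod; open the (websocket-backed) watch only
    when the caller will actually consume its result."""
    pod = create_driver_pod()
    if not wait_app_completion:
        # Previous behaviour: a watch was opened here and immediately
        # discarded, paying the websocket negotiation cost for nothing.
        return pod
    watcher = open_watch(pod)
    try:
        return watcher.wait_for_completion()
    finally:
        watcher.close()
```

The eager flag check is the whole fix: in the fire-and-forget case no websocket is ever negotiated.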
[jira] [Updated] (SPARK-36844) Window function "first" (unboundedFollowing) appears significantly slower than "last" (unboundedPreceding) in identical circumstances
[ https://issues.apache.org/jira/browse/SPARK-36844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alain Bryden updated SPARK-36844:
---------------------------------
    Summary: Window function "first" (unboundedFollowing) appears significantly slower than "last" (unboundedPreceding) in identical circumstances  (was: "first" Window function is significantly slower than "last" in identical circumstances)

> Window function "first" (unboundedFollowing) appears significantly slower
> than "last" (unboundedPreceding) in identical circumstances
> -------------------------------------------------------------------------
>
>                 Key: SPARK-36844
>                 URL: https://issues.apache.org/jira/browse/SPARK-36844
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, Windows
>    Affects Versions: 3.1.1
>            Reporter: Alain Bryden
>            Priority: Minor
>         Attachments: Physical Plan 2 - workaround.png, Pysical Plan.png
>
> I originally posted a question on SO because I thought perhaps I was doing something wrong:
> [https://stackoverflow.com/questions/69308560|https://stackoverflow.com/questions/69308560/spark-first-window-function-is-taking-much-longer-than-last?noredirect=1#comment122505685_69308560]
> Perhaps I am, but I'm now fairly convinced that there's something wonky with the implementation of {{first}} that's causing it to have unnecessarily worse complexity than {{last}}.
>
> More or less copy-pasted from the above post:
> I was working on a pyspark routine to interpolate the missing values in a configuration table.
> Imagine a table of configuration values that go from 0 to 50,000. The user specifies a few data points in between (say at 0, 50, 100, 500, 2000, 50000) and we interpolate the remainder. My solution mostly follows [this blog post|https://walkenho.github.io/interpolating-time-series-p2-spark/] quite closely, except I'm not using any UDFs.
> In troubleshooting the performance of this (takes ~3 minutes) I found that one particular window function is taking all of the time, and everything else I'm doing takes mere seconds.
> Here is the main area of interest - where I use window functions to fill in the previous and next user-supplied configuration values:
> {code:python}
> from pyspark.sql import Window, functions as F
> # Create partition windows that are required to generate new rows from the ones provided
> win_last = Window.partitionBy('PORT_TYPE', 'loss_process').orderBy('rank').rowsBetween(Window.unboundedPreceding, 0)
> win_next = Window.partitionBy('PORT_TYPE', 'loss_process').orderBy('rank').rowsBetween(0, Window.unboundedFollowing)
> # Join back in the provided config table to populate the "known" scale factors
> df_part1 = (df_scale_factors_template
>     .join(df_users_config, ['PORT_TYPE', 'loss_process', 'rank'], 'leftouter')
>     # Add computed columns that can look up the prior config and next config for each missing value
>     .withColumn('last_rank', F.last(F.col('rank'), ignorenulls=True).over(win_last))
>     .withColumn('last_sf', F.last(F.col('scale_factor'), ignorenulls=True).over(win_last))
> ).cache()
> debug_log_dataframe(df_part1, 'df_part1')  # Force a .count() and time Part1
> df_part2 = (df_part1
>     .withColumn('next_rank', F.first(F.col('rank'), ignorenulls=True).over(win_next))
>     .withColumn('next_sf', F.first(F.col('scale_factor'), ignorenulls=True).over(win_next))
> ).cache()
> debug_log_dataframe(df_part2, 'df_part2')  # Force a .count() and time Part2
> df_part3 = (df_part2
>     # Implements standard linear interpolation: y = y1 + ((y2-y1)/(x2-x1)) * (x-x1)
>     .withColumn('scale_factor',
>         F.when(F.col('last_rank') == F.col('next_rank'), F.col('last_sf'))  # Handle div/0 case
>         .otherwise(F.col('last_sf') + ((F.col('next_sf') - F.col('last_sf')) / (F.col('next_rank') - F.col('last_rank'))) * (F.col('rank') - F.col('last_rank'))))
>     .select('PORT_TYPE', 'loss_process', 'rank', 'scale_factor')
> ).cache()
> debug_log_dataframe(df_part3, 'df_part3', explain=True)
> {code}
>
> The above used to be a single chained dataframe statement, but I've since split it into 3 parts so that I could isolate the part that's taking so long. The results are:
> * {{Part 1: Generated 8 columns and 36 rows in 0.65 seconds}}
> * {{Part 2: Generated 10 columns and 36 rows in 189.55 seconds}}
> * {{Part 3: Generated 4 columns and 36 rows in 0.24 seconds}}
>
> In trying various things to speed up my routine, it occurred to me to try rewriting my usages of {{first()}} to just be usages of {{last()}} with a reversed sort order.
> So rewriting this:
> {code:python}
> win_next = (Window.partitionBy('PORT_TYPE', 'loss_process')
>     .orderBy('rank').rowsBetween(0, Window.unboundedFollowing))
> {code}
[jira] [Assigned] (SPARK-36721) Simplify boolean equalities if one side is literal
[ https://issues.apache.org/jira/browse/SPARK-36721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh reassigned SPARK-36721: --- Assignee: Kazuyuki Tanimura > Simplify boolean equalities if one side is literal > -- > > Key: SPARK-36721 > URL: https://issues.apache.org/jira/browse/SPARK-36721 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2, 3.2.0, 3.3.0 >Reporter: Kazuyuki Tanimura >Assignee: Kazuyuki Tanimura >Priority: Major > > The following query does not push down the filter > ``` > SELECT * FROM t WHERE (a AND b) = true > ``` > although the following equivalent query pushes down the filter as expected. > ``` > SELECT * FROM t WHERE (a AND b) > ``` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36721) Simplify boolean equalities if one side is literal
[ https://issues.apache.org/jira/browse/SPARK-36721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh resolved SPARK-36721. - Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34055 [https://github.com/apache/spark/pull/34055] > Simplify boolean equalities if one side is literal > -- > > Key: SPARK-36721 > URL: https://issues.apache.org/jira/browse/SPARK-36721 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2, 3.2.0, 3.3.0 >Reporter: Kazuyuki Tanimura >Assignee: Kazuyuki Tanimura >Priority: Major > Fix For: 3.3.0 > > > The following query does not push down the filter > ``` > SELECT * FROM t WHERE (a AND b) = true > ``` > although the following equivalent query pushes down the filter as expected. > ``` > SELECT * FROM t WHERE (a AND b) > ``` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
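The rewrite behind this fix can be sketched in plain Python (a toy expression tree with assumed semantics, not Spark's actual Catalyst rule): under SQL's three-valued logic, a `WHERE` clause keeps only rows where the predicate is TRUE, so `e = TRUE` filters the same rows as `e`, and `e = FALSE` the same rows as `NOT e`. Once the equality is stripped, the conjuncts of `(a AND b)` can be split and pushed down as usual.

```python
# Toy sketch of the boolean-equality simplification (assumed semantics,
# not Spark's actual Catalyst classes): in a WHERE clause,
# `e = TRUE` filters the same rows as `e`, and `e = FALSE` as `NOT e`.

def simplify(expr):
    """Rewrite ('=', e, True) -> e and ('=', e, False) -> ('not', e)."""
    if isinstance(expr, tuple) and expr[0] == '=':
        _, left, right = expr
        if right is True:
            return simplify(left)
        if right is False:
            return ('not', simplify(left))
    return expr

# ('=', ('and', 'a', 'b'), True) becomes ('and', 'a', 'b'), whose
# conjuncts can then be split and pushed past the scan.
print(simplify(('=', ('and', 'a', 'b'), True)))
```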
[jira] [Commented] (SPARK-36835) Spark 3.2.0 POMs are no longer "dependency reduced"
[ https://issues.apache.org/jira/browse/SPARK-36835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17419884#comment-17419884 ] Apache Spark commented on SPARK-36835: -- User 'sunchao' has created a pull request for this issue: https://github.com/apache/spark/pull/34100 > Spark 3.2.0 POMs are no longer "dependency reduced" > --- > > Key: SPARK-36835 > URL: https://issues.apache.org/jira/browse/SPARK-36835 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.2.0 >Reporter: Josh Rosen >Assignee: Chao Sun >Priority: Blocker > Fix For: 3.2.0 > > > It looks like Spark 3.2.0's POMs are no longer "dependency reduced". As a > result, applications may pull in additional unnecessary dependencies when > depending on Spark. > Spark uses the Maven Shade plugin to create effective POMs and to bundle > shaded versions of certain libraries with Spark (namely, Jetty, Guava, and > JPPML). [By > default|https://maven.apache.org/plugins/maven-shade-plugin/shade-mojo.html#createDependencyReducedPom], > the Maven Shade plugin generates simplified POMs which remove dependencies > on artifacts that have been shaded. > SPARK-33212 / > [b6f46ca29742029efea2790af7fdefbc2fcf52de|https://github.com/apache/spark/commit/b6f46ca29742029efea2790af7fdefbc2fcf52de] > changed the configuration of the Maven Shade plugin, setting > {{createDependencyReducedPom}} to {{false}}. > As a result, the generated POMs now include compile-scope dependencies on the > shaded libraries. 
For example, compare the {{org.eclipse.jetty}} dependencies > in: > * Spark 3.1.2: > [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.12/3.1.2/spark-core_2.12-3.1.2.pom] > * Spark 3.2.0 RC2: > [https://repository.apache.org/content/repositories/orgapachespark-1390/org/apache/spark/spark-core_2.12/3.2.0/spark-core_2.12-3.2.0.pom] > I think we should revert back to generating "dependency reduced" POMs to > ensure that Spark declares a proper set of dependencies and to avoid "unknown > unknown" consequences of changing our generated POM format. > /cc [~csun] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
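For reference, the Shade-plugin switch at the center of this issue looks roughly like this in a `pom.xml` (a minimal sketch; Spark's actual plugin configuration carries many more settings and relocations):

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <!-- true (the plugin default) makes Maven publish a simplified POM with
         shaded artifacts removed from the dependency list; SPARK-33212 set
         this to false, which is what this issue proposes to revert. -->
    <createDependencyReducedPom>true</createDependencyReducedPom>
  </configuration>
</plugin>
```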
[jira] [Commented] (SPARK-36844) "first" Window function is significantly slower than "last" in identical circumstances
[ https://issues.apache.org/jira/browse/SPARK-36844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17419833#comment-17419833 ] Alain Bryden commented on SPARK-36844: -- I've attached the physical plan from the initial implementation. `last` ends up using "RunningWindowFunction" whereas `first` just says "Window" !Pysical Plan.png! Here it is after I've re-written things to explicitly reverse the window sort order and use `last` instead of `first`. Now "RunningWindowFunction" is used in both cases and the dataframe evaluates several orders of magnitude faster (1-2 seconds instead of ~3 minutes): !Physical Plan 2 - workaround.png! > "first" Window function is significantly slower than "last" in identical > circumstances > -- > > Key: SPARK-36844 > URL: https://issues.apache.org/jira/browse/SPARK-36844 > Project: Spark > Issue Type: Bug > Components: PySpark, Windows >Affects Versions: 3.1.1 >Reporter: Alain Bryden >Priority: Minor > Attachments: Physical Plan 2 - workaround.png, Pysical Plan.png
[jira] [Updated] (SPARK-36844) "first" Window function is significantly slower than "last" in identical circumstances
[ https://issues.apache.org/jira/browse/SPARK-36844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alain Bryden updated SPARK-36844: - Attachment: Physical Plan 2 - workaround.png > "first" Window function is significantly slower than "last" in identical > circumstances > -- > > Key: SPARK-36844 > URL: https://issues.apache.org/jira/browse/SPARK-36844 > Project: Spark > Issue Type: Bug > Components: PySpark, Windows >Affects Versions: 3.1.1 >Reporter: Alain Bryden >Priority: Minor > Attachments: Physical Plan 2 - workaround.png, Pysical Plan.png
[jira] [Updated] (SPARK-36844) "first" Window function is significantly slower than "last" in identical circumstances
[ https://issues.apache.org/jira/browse/SPARK-36844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alain Bryden updated SPARK-36844: - Attachment: Pysical Plan.png > "first" Window function is significantly slower than "last" in identical > circumstances > -- > > Key: SPARK-36844 > URL: https://issues.apache.org/jira/browse/SPARK-36844 > Project: Spark > Issue Type: Bug > Components: PySpark, Windows >Affects Versions: 3.1.1 >Reporter: Alain Bryden >Priority: Minor > Attachments: Pysical Plan.png
[jira] [Created] (SPARK-36844) "first" Window function is significantly slower than "last" in identical circumstances
Alain Bryden created SPARK-36844: Summary: "first" Window function is significantly slower than "last" in identical circumstances Key: SPARK-36844 URL: https://issues.apache.org/jira/browse/SPARK-36844 Project: Spark Issue Type: Bug Components: PySpark, Windows Affects Versions: 3.1.1 Reporter: Alain Bryden I originally posted a question on SO because I thought perhaps I was doing something wrong: [https://stackoverflow.com/questions/69308560|https://stackoverflow.com/questions/69308560/spark-first-window-function-is-taking-much-longer-than-last?noredirect=1#comment122505685_69308560] Perhaps I am, but I'm now fairly convinced that there's something wonky with the implementation of `first` that's causing it to unnecessarily have a much worse complexity than `last`. More or less copy-pasted from the above post: I was working on a pyspark routine to interpolate the missing values in a configuration table. Imagine a table of configuration values that go from 0 to 50,000. The user specifies a few data points in between (say at 0, 50, 100, 500, 2000, 50,000) and we interpolate the remainder. My solution mostly follows [this blog post|https://walkenho.github.io/interpolating-time-series-p2-spark/] quite closely, except I'm not using any UDFs. In troubleshooting the performance of this (takes ~3 minutes) I found that one particular window function is taking all of the time, and everything else I'm doing takes mere seconds. 
Here is the main area of interest - where I use window functions to fill in the previous and next user-supplied configuration values:
{code:python}
from pyspark.sql import Window, functions as F

# Create partition windows that are required to generate new rows from the ones provided
win_last = Window.partitionBy('PORT_TYPE', 'loss_process').orderBy('rank').rowsBetween(Window.unboundedPreceding, 0)
win_next = Window.partitionBy('PORT_TYPE', 'loss_process').orderBy('rank').rowsBetween(0, Window.unboundedFollowing)

# Join back in the provided config table to populate the "known" scale factors
df_part1 = (df_scale_factors_template
    .join(df_users_config, ['PORT_TYPE', 'loss_process', 'rank'], 'leftouter')
    # Add computed columns that can lookup the prior config and next config for each missing value
    .withColumn('last_rank', F.last(F.col('rank'), ignorenulls=True).over(win_last))
    .withColumn('last_sf', F.last(F.col('scale_factor'), ignorenulls=True).over(win_last))
).cache()
debug_log_dataframe(df_part1, 'df_part1')  # Force a .count() and time Part1

df_part2 = (df_part1
    .withColumn('next_rank', F.first(F.col('rank'), ignorenulls=True).over(win_next))
    .withColumn('next_sf', F.first(F.col('scale_factor'), ignorenulls=True).over(win_next))
).cache()
debug_log_dataframe(df_part2, 'df_part2')  # Force a .count() and time Part2

df_part3 = (df_part2
    # Implements standard linear interpolation: y = y1 + ((y2-y1)/(x2-x1)) * (x-x1)
    .withColumn('scale_factor',
        F.when(F.col('last_rank') == F.col('next_rank'), F.col('last_sf'))  # Handle div/0 case
        .otherwise(F.col('last_sf') + ((F.col('next_sf') - F.col('last_sf')) / (F.col('next_rank') - F.col('last_rank'))) * (F.col('rank') - F.col('last_rank'))))
    .select('PORT_TYPE', 'loss_process', 'rank', 'scale_factor')
).cache()
debug_log_dataframe(df_part3, 'df_part3', explain=True)
{code}
The above used to be a single chained dataframe statement, but I've since split it into 3 parts so that I could isolate the part that's taking so long. 
The results are:
* {{Part 1: Generated 8 columns and 36 rows in 0.65 seconds}}
* {{Part 2: Generated 10 columns and 36 rows in 189.55 seconds}}
* {{Part 3: Generated 4 columns and 36 rows in 0.24 seconds}}

In trying various things to speed up my routine, it occurred to me to try rewriting my usages of {{first()}} to just be usages of {{last()}} with a reversed sort order.

So rewriting this:
{code:python}
win_next = (Window.partitionBy('PORT_TYPE', 'loss_process')
    .orderBy('rank').rowsBetween(0, Window.unboundedFollowing))

df_part2 = (df_part1
    .withColumn('next_rank', F.first(F.col('rank'), ignorenulls=True).over(win_next))
    .withColumn('next_sf', F.first(F.col('scale_factor'), ignorenulls=True).over(win_next))
)
{code}

As this:
{code:python}
win_next = (Window.partitionBy('PORT_TYPE', 'loss_process')
    .orderBy(F.desc('rank')).rowsBetween(Window.unboundedPreceding, 0))

df_part2 = (df_part1
    .withColumn('next_rank', F.last(F.col('rank'), ignorenulls=True).over(win_next))
    .withColumn('next_sf', F.last(F.col('scale_factor'), ignorenulls=True).over(win_next))
)
{code}

Much to my amazement, this actually solved the performance problem, and now the entire dataframe is generated in just 3 seconds. I don't know anything about the internals, but
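The equivalence the reporter exploited can be checked in plain Python, independent of Spark (the helper names below are illustrative, not Spark APIs): taking `first(x, ignorenulls=True)` over a frame from the current row to the end of an ascending partition yields the next non-null value at or after each row, which is exactly `last(x, ignorenulls=True)` over an unbounded-preceding frame when the partition is ordered descending.

```python
# Pure-Python sketch (not Spark) of why the workaround is equivalent.

def next_nonnull_via_first(values):
    """first() over rowsBetween(0, unboundedFollowing), ascending order."""
    out = []
    for i in range(len(values)):
        nxt = next((v for v in values[i:] if v is not None), None)
        out.append(nxt)
    return out

def next_nonnull_via_last(values):
    """last() over rowsBetween(unboundedPreceding, 0), descending order."""
    out = []
    last_seen = None
    for v in reversed(values):   # walk in descending rank order
        if v is not None:
            last_seen = v
        out.append(last_seen)
    out.reverse()                # restore ascending order
    return out

vals = [None, 3, None, None, 7, None]
assert next_nonnull_via_first(vals) == next_nonnull_via_last(vals) == [3, 3, 7, 7, 7, None]
```

Both formulations are a single backward-fill per partition, which is why one would expect them to perform identically; the reported gap comes from which physical window operator Spark selects, not from the computation itself.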
[jira] [Commented] (SPARK-35672) Spark fails to launch executors with very large user classpath lists on YARN
[ https://issues.apache.org/jira/browse/SPARK-35672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17419830#comment-17419830 ] Erik Krogen commented on SPARK-35672: - Thanks [~petertoth] [~hyukjin.kwon] [~Gengliang.Wang] for reporting and dealing with the issue. I'll work on submitting a new PR to master with the changes from PRs #31810 (original) and #34084 (environment variable fix) incorporated. > Spark fails to launch executors with very large user classpath lists on YARN > > > Key: SPARK-35672 > URL: https://issues.apache.org/jira/browse/SPARK-35672 > Project: Spark > Issue Type: Bug > Components: Spark Core, YARN >Affects Versions: 3.1.2 > Environment: Linux RHEL7 > Spark 3.1.1 >Reporter: Erik Krogen >Assignee: Erik Krogen >Priority: Major > > When running Spark on YARN, the {{user-class-path}} argument to > {{CoarseGrainedExecutorBackend}} is used to pass a list of user JAR URIs to > executor processes. The argument is specified once for each JAR, and the URIs > are fully-qualified, so the paths can be quite long. With large user JAR > lists (say 1000+), this can result in system-level argument length limits > being exceeded, typically manifesting as the error message: > {code} > /bin/bash: Argument list too long > {code} > A [Google > search|https://www.google.com/search?q=spark%20%22%2Fbin%2Fbash%3A%20argument%20list%20too%20long%22=spark%20%22%2Fbin%2Fbash%3A%20argument%20list%20too%20long%22] > indicates that this is not a theoretical problem and afflicts real users, > including ours. This issue was originally observed on Spark 2.3, but has been > confirmed to exist in the master branch as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
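The scaling problem described above can be sketched in a few lines (illustrative Python, not Spark's launcher code; the `--user-class-path-file` flag is a hypothetical name for the file-based alternative, and the jar URIs are made up): passing one argument per jar grows the command line linearly with the jar count, while writing the list to a file keeps the command line a constant size.

```python
# Illustrative sketch of the argument-length problem and one mitigation.
# Jar URIs and the file-based flag name are assumptions for illustration.
import os
import tempfile

jars = [f"hdfs://namenode:8020/app/libs/dep-{i}.jar" for i in range(1000)]

# One argument per jar: argv size grows with the jar list and can exceed
# OS limits, surfacing as "/bin/bash: Argument list too long".
argv_style = []
for j in jars:
    argv_style += ["--user-class-path", j]

# File-based alternative: a single short argument regardless of list length.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("\n".join(jars))
    listing = f.name
file_style = ["--user-class-path-file", listing]

print(len(argv_style), len(file_style))
os.remove(listing)
```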
[jira] [Assigned] (SPARK-36821) Create a test to extend ColumnarBatch
[ https://issues.apache.org/jira/browse/SPARK-36821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36821: Assignee: (was: Apache Spark) > Create a test to extend ColumnarBatch > - > > Key: SPARK-36821 > URL: https://issues.apache.org/jira/browse/SPARK-36821 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yufei Gu >Priority: Major > > As a followup of Spark-36814, to create a test to extend ColumnarBatch to > prevent future changes to break it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36821) Create a test to extend ColumnarBatch
[ https://issues.apache.org/jira/browse/SPARK-36821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17419815#comment-17419815 ] Apache Spark commented on SPARK-36821: -- User 'flyrain' has created a pull request for this issue: https://github.com/apache/spark/pull/34087 > Create a test to extend ColumnarBatch > - > > Key: SPARK-36821 > URL: https://issues.apache.org/jira/browse/SPARK-36821 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yufei Gu >Priority: Major > > As a followup of Spark-36814, to create a test to extend ColumnarBatch to > prevent future changes to break it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36821) Create a test to extend ColumnarBatch
[ https://issues.apache.org/jira/browse/SPARK-36821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36821: Assignee: Apache Spark > Create a test to extend ColumnarBatch > - > > Key: SPARK-36821 > URL: https://issues.apache.org/jira/browse/SPARK-36821 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yufei Gu >Assignee: Apache Spark >Priority: Major > > As a followup of Spark-36814, to create a test to extend ColumnarBatch to > prevent future changes to break it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36843) Add an iterator method to Dataset
Li Xian created SPARK-36843: --- Summary: Add an iterator method to Dataset Key: SPARK-36843 URL: https://issues.apache.org/jira/browse/SPARK-36843 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.0 Reporter: Li Xian The current org.apache.spark.sql.Dataset#toLocalIterator submits multiple jobs, one per partition. In my case, I would like to collect all partitions at once to save the job-scheduling cost, while still having an iterator to save memory on deserialization (instead of deserializing all rows at once, only one row is deserialized at a time during iteration). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
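The memory trade-off being requested can be sketched in plain Python (the `deserialize` helper below is a hypothetical stand-in, not Spark's row deserializer): fetch all serialized partitions in one job, but deserialize rows lazily during iteration instead of materializing every deserialized row up front.

```python
# Pure-Python sketch of eager vs. lazy deserialization after a one-shot
# collect. `deserialize` is an illustrative stand-in for a row decoder.
import json

def deserialize(blob):
    return json.loads(blob)

def eager_rows(serialized):
    # every deserialized object is held in memory simultaneously
    return [deserialize(b) for b in serialized]

def lazy_rows(serialized):
    # generator: only the current row is deserialized at any moment
    for b in serialized:
        yield deserialize(b)

serialized = ['{"id": %d}' % i for i in range(3)]
assert list(lazy_rows(serialized)) == eager_rows(serialized)
```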
[jira] [Commented] (SPARK-36842) Stop task result getter properly on spark context stopping
[ https://issues.apache.org/jira/browse/SPARK-36842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17419728#comment-17419728 ] Apache Spark commented on SPARK-36842: -- User 'lxian' has created a pull request for this issue: https://github.com/apache/spark/pull/34098 > Stop task result getter properly on spark context stopping > -- > > Key: SPARK-36842 > URL: https://issues.apache.org/jira/browse/SPARK-36842 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Li Xian >Priority: Major > > org.apache.spark.scheduler.TaskSchedulerImpl#stop doesn't handle exception > properly. If one component throws exceptions on stopping, the exception is > thrown and TaskSchedulerImpl.stop() will not be executed completely. > For example if backend.stop() fails, then taskResultGetter.stop() won't be > executed. The result is that after a couple of restart of the spark context, > there will be a lot of ' > task-result-getter' threads retained. > > !image-2021-09-24-18-50-57-072.png! > !image-2021-09-24-18-51-03-837.png! > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36842) Stop task result getter properly on spark context stopping
[ https://issues.apache.org/jira/browse/SPARK-36842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36842: Assignee: (was: Apache Spark) > Stop task result getter properly on spark context stopping > -- > > Key: SPARK-36842 > URL: https://issues.apache.org/jira/browse/SPARK-36842 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Li Xian >Priority: Major > > org.apache.spark.scheduler.TaskSchedulerImpl#stop doesn't handle exception > properly. If one component throws exceptions on stopping, the exception is > thrown and TaskSchedulerImpl.stop() will not be executed completely. > For example if backend.stop() fails, then taskResultGetter.stop() won't be > executed. The result is that after a couple of restart of the spark context, > there will be a lot of ' > task-result-getter' threads retained. > > !image-2021-09-24-18-50-57-072.png! > !image-2021-09-24-18-51-03-837.png! > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36842) Stop task result getter properly on spark context stopping
[ https://issues.apache.org/jira/browse/SPARK-36842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36842: Assignee: Apache Spark > Stop task result getter properly on spark context stopping > -- > > Key: SPARK-36842 > URL: https://issues.apache.org/jira/browse/SPARK-36842 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Li Xian >Assignee: Apache Spark >Priority: Major > > org.apache.spark.scheduler.TaskSchedulerImpl#stop doesn't handle exception > properly. If one component throws exceptions on stopping, the exception is > thrown and TaskSchedulerImpl.stop() will not be executed completely. > For example if backend.stop() fails, then taskResultGetter.stop() won't be > executed. The result is that after a couple of restart of the spark context, > there will be a lot of ' > task-result-getter' threads retained. > > !image-2021-09-24-18-50-57-072.png! > !image-2021-09-24-18-51-03-837.png! > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36842) Stop task result getter properly on spark context stopping
Li Xian created SPARK-36842: --- Summary: Stop task result getter properly on spark context stopping Key: SPARK-36842 URL: https://issues.apache.org/jira/browse/SPARK-36842 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.3.0 Reporter: Li Xian org.apache.spark.scheduler.TaskSchedulerImpl#stop doesn't handle exceptions properly. If one component throws an exception on stopping, the exception propagates and TaskSchedulerImpl.stop() will not run to completion. For example, if backend.stop() fails, then taskResultGetter.stop() won't be executed. The result is that after a couple of restarts of the spark context, a lot of 'task-result-getter' threads are retained. !image-2021-09-24-18-50-57-072.png! !image-2021-09-24-18-51-03-837.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
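The failure mode described in SPARK-36842 (one component's stop() aborting the rest of the shutdown sequence) can be sketched in a few lines. This is a generic Java illustration with hypothetical names, not Spark's actual TaskSchedulerImpl code:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical shutdown coordinator. If each stop() can throw, a naive
// sequential shutdown aborts at the first failure, leaking resources
// (e.g. thread pools) held by components later in the list.
public class SafeShutdown {
    public interface Stoppable { void stop() throws Exception; }

    // Naive version: the first exception aborts the loop, so later
    // components are never stopped.
    public static void stopAll(List<Stoppable> components) throws Exception {
        for (Stoppable c : components) {
            c.stop();
        }
    }

    // Exception-safe version: every component gets a chance to stop;
    // failures are collected instead of propagating immediately.
    public static List<Exception> stopAllSafely(List<Stoppable> components) {
        List<Exception> failures = new ArrayList<>();
        for (Stoppable c : components) {
            try {
                c.stop();
            } catch (Exception e) {
                failures.add(e); // log-and-continue in a real system
            }
        }
        return failures;
    }
}
```

With the second variant, a failing backend.stop() analogue no longer prevents a later taskResultGetter.stop() analogue from running, which is the behavior the issue asks for.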
[jira] [Commented] (SPARK-36792) Inset should handle Double.NaN and Float.NaN
[ https://issues.apache.org/jira/browse/SPARK-36792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17419711#comment-17419711 ] Apache Spark commented on SPARK-36792: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/34097 > Inset should handle Double.NaN and Float.NaN > > > Key: SPARK-36792 > URL: https://issues.apache.org/jira/browse/SPARK-36792 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.2, 3.1.2, 3.2.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Labels: correctness > Fix For: 3.2.0, 3.1.3, 3.0.4 > > > Inset(Double.NaN, Seq(Double.NaN, 1d)) return false -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
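The semantics behind SPARK-36792 are a JVM/IEEE-754 quirk rather than anything Spark-specific: under primitive `==`, NaN is not equal to itself, while `Double.compare` uses a total order in which NaN does equal NaN. A minimal Java analogue of the `Inset(Double.NaN, Seq(Double.NaN, 1d))` case (hypothetical helper names, not Spark's InSet implementation):

```java
// Why a naive membership test misses NaN: primitive == is IEEE-754
// equality, under which NaN != NaN. Double.compare uses a total order
// in which NaN equals NaN, so comparison-based lookups do find it.
public class NanMembership {
    public static boolean naiveContains(double needle, double[] haystack) {
        for (double d : haystack) {
            if (d == needle) return true; // never true when needle is NaN
        }
        return false;
    }

    public static boolean totalOrderContains(double needle, double[] haystack) {
        for (double d : haystack) {
            if (Double.compare(d, needle) == 0) return true; // NaN matches NaN
        }
        return false;
    }
}
```

The naive version returns false for NaN even when NaN is present, which mirrors the incorrect `Inset` result reported in the issue; the total-order version finds it.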
[jira] [Assigned] (SPARK-36831) Read/write dataframes with ANSI intervals from/to CSV files
[ https://issues.apache.org/jira/browse/SPARK-36831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-36831: Assignee: (was: Max Gekk) > Read/write dataframes with ANSI intervals from/to CSV files > --- > > Key: SPARK-36831 > URL: https://issues.apache.org/jira/browse/SPARK-36831 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Priority: Major > > Implement writing and reading ANSI intervals (year-month and day-time > intervals) columns in dataframes to Parquet datasources. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36841) Provide ansi syntax `set catalog xxx` to change the current catalog
[ https://issues.apache.org/jira/browse/SPARK-36841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17419705#comment-17419705 ] Apache Spark commented on SPARK-36841: -- User 'Peng-Lei' has created a pull request for this issue: https://github.com/apache/spark/pull/34096 > Provide ansi syntax `set catalog xxx` to change the current catalog > -- > > Key: SPARK-36841 > URL: https://issues.apache.org/jira/browse/SPARK-36841 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: PengLei >Priority: Major > Fix For: 3.3.0 > > > !SET-CATALOG.PNG! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36841) Provide ansi syntax `set catalog xxx` to change the current catalog
[ https://issues.apache.org/jira/browse/SPARK-36841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36841: Assignee: (was: Apache Spark) > Provide ansi syntax `set catalog xxx` to change the current catalog > -- > > Key: SPARK-36841 > URL: https://issues.apache.org/jira/browse/SPARK-36841 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: PengLei >Priority: Major > Fix For: 3.3.0 > > > !SET-CATALOG.PNG! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36841) Provide ansi syntax `set catalog xxx` to change the current catalog
[ https://issues.apache.org/jira/browse/SPARK-36841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17419703#comment-17419703 ] Apache Spark commented on SPARK-36841: -- User 'Peng-Lei' has created a pull request for this issue: https://github.com/apache/spark/pull/34096 > Provide ansi syntax `set catalog xxx` to change the current catalog > -- > > Key: SPARK-36841 > URL: https://issues.apache.org/jira/browse/SPARK-36841 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: PengLei >Priority: Major > Fix For: 3.3.0 > > > !SET-CATALOG.PNG! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36841) Provide ansi syntax `set catalog xxx` to change the current catalog
[ https://issues.apache.org/jira/browse/SPARK-36841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36841: Assignee: Apache Spark > Provide ansi syntax `set catalog xxx` to change the current catalog > -- > > Key: SPARK-36841 > URL: https://issues.apache.org/jira/browse/SPARK-36841 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: PengLei >Assignee: Apache Spark >Priority: Major > Fix For: 3.3.0 > > > !SET-CATALOG.PNG! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36841) Provide ansi syntax `set catalog xxx` to change the current catalog
[ https://issues.apache.org/jira/browse/SPARK-36841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] PengLei updated SPARK-36841: Description: !SET-CATALOG.PNG! (was: !截图.PNG!) > Provide ansi syntax `set catalog xxx` to change the current catalog > -- > > Key: SPARK-36841 > URL: https://issues.apache.org/jira/browse/SPARK-36841 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: PengLei >Priority: Major > Fix For: 3.3.0 > > > !SET-CATALOG.PNG! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36841) Provide ansi syntax `set catalog xxx` to change the current catalog
[ https://issues.apache.org/jira/browse/SPARK-36841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] PengLei updated SPARK-36841: Description: !截图.PNG! > Provide ansi syntax `set catalog xxx` to change the current catalog > -- > > Key: SPARK-36841 > URL: https://issues.apache.org/jira/browse/SPARK-36841 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: PengLei >Priority: Major > Fix For: 3.3.0 > > > !截图.PNG! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36841) Provide ansi syntax `set catalog xxx` to change the current catalog
PengLei created SPARK-36841: --- Summary: Provide ansi syntax `set catalog xxx` to change the current catalog Key: SPARK-36841 URL: https://issues.apache.org/jira/browse/SPARK-36841 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.0 Reporter: PengLei Fix For: 3.3.0 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35174) Avoid opening watch when waitAppCompletion is false
[ https://issues.apache.org/jira/browse/SPARK-35174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35174: Assignee: (was: Apache Spark) > Avoid opening watch when waitAppCompletion is false > --- > > Key: SPARK-35174 > URL: https://issues.apache.org/jira/browse/SPARK-35174 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.1.1 >Reporter: Jonathan Lafleche >Priority: Minor > > In spark-submit, we currently [open a pod watch for any spark > submission|https://github.com/apache/spark/blame/0494dc90af48ce7da0625485a4dc6917a244d580/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/KubernetesClientApplication.scala#L150-L167]. > If WAIT_FOR_APP_COMPLETION is false, we then immediately ignore the result > of the watcher and break out of the watcher. > When submitting spark applications at scale, this is a source of operational > pain, since opening the watch relies on opening a websocket, which tends to > run into subtle networking issues around negotiating the websocket connection. > I'd like to change this behaviour so that we eagerly check whether we are > waiting on app completion, and avoid opening the watch altogether when > WAIT_FOR_APP_COMPLETION is false. > Would you accept a contribution for that change, or are there any concerns > I've overlooked? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36827) Task/Stage/Job data remain in memory leads memory leak
[ https://issues.apache.org/jira/browse/SPARK-36827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-36827. Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 34092 [https://github.com/apache/spark/pull/34092] > Task/Stage/Job data remain in memory leads memory leak > -- > > Key: SPARK-36827 > URL: https://issues.apache.org/jira/browse/SPARK-36827 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.2 >Reporter: Kohki Nishio >Priority: Major > Fix For: 3.2.0 > > Attachments: mem1.txt, worker.txt > > > Noticing memory-leak like behavior, steady increase of heap after GC and > eventually it leads to a service failure. > The GC histogram shows very high number of Task/Data/Job data > {code} > num #instances #bytes class name > -- >6: 7835346 2444627952 org.apache.spark.status.TaskDataWrapper > 25: 3765152 180727296 org.apache.spark.status.StageDataWrapper > 88:2322559290200 org.apache.spark.status.JobDataWrapper > {code} > Thread dumps show clearly the clean up thread is always doing cleanupStages > {code} > "element-tracking-store-worker" #355 daemon prio=5 os_prio=0 > tid=0x7f31b0014800 nid=0x409 runnable [0x7f2f25783000] >java.lang.Thread.State: RUNNABLE > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.util.kvstore.KVTypeInfo$MethodAccessor.get(KVTypeInfo.java:162) > at > org.apache.spark.util.kvstore.InMemoryStore$InMemoryView.compare(InMemoryStore.java:434) > at > org.apache.spark.util.kvstore.InMemoryStore$InMemoryView.lambda$iterator$0(InMemoryStore.java:375) > at > org.apache.spark.util.kvstore.InMemoryStore$InMemoryView$$Lambda$9000/574018760.compare(Unknown > Source) > at java.util.TimSort.gallopLeft(TimSort.java:542) > at java.util.TimSort.mergeLo(TimSort.java:752) > at java.util.TimSort.mergeAt(TimSort.java:514) > at 
java.util.TimSort.mergeCollapse(TimSort.java:439) > at java.util.TimSort.sort(TimSort.java:245) > at java.util.Arrays.sort(Arrays.java:1512) > at java.util.ArrayList.sort(ArrayList.java:1464) > at > org.apache.spark.util.kvstore.InMemoryStore$InMemoryView.iterator(InMemoryStore.java:375) > at > org.apache.spark.util.kvstore.KVStoreView.closeableIterator(KVStoreView.java:117) > at > org.apache.spark.status.AppStatusListener.$anonfun$cleanupStages$2(AppStatusListener.scala:1269) > at > org.apache.spark.status.AppStatusListener$$Lambda$9126/608388595.apply(Unknown > Source) > at scala.collection.immutable.List.map(List.scala:297) > at > org.apache.spark.status.AppStatusListener.cleanupStages(AppStatusListener.scala:1260) > at > org.apache.spark.status.AppStatusListener.$anonfun$new$3(AppStatusListener.scala:98) > at > org.apache.spark.status.AppStatusListener$$Lambda$646/596139882.apply$mcVJ$sp(Unknown > Source) > at > org.apache.spark.status.ElementTrackingStore.$anonfun$write$3(ElementTrackingStore.scala:135) > at > org.apache.spark.status.ElementTrackingStore.$anonfun$write$3$adapted(ElementTrackingStore.scala:133) > at > org.apache.spark.status.ElementTrackingStore$$Lambda$986/162337848.apply(Unknown > Source) > at scala.collection.immutable.List.foreach(List.scala:431) > at > org.apache.spark.status.ElementTrackingStore.$anonfun$write$2(ElementTrackingStore.scala:133) > at > org.apache.spark.status.ElementTrackingStore.$anonfun$write$2$adapted(ElementTrackingStore.scala:131) > at > org.apache.spark.status.ElementTrackingStore$$Lambda$984/600376389.apply(Unknown > Source) > at > org.apache.spark.status.ElementTrackingStore$LatchedTriggers.$anonfun$fireOnce$1(ElementTrackingStore.scala:58) > at > org.apache.spark.status.ElementTrackingStore$LatchedTriggers$$Lambda$985/1187323214.apply$mcV$sp(Unknown > Source) > at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at org.apache.spark.util.Utils$.tryLog(Utils.scala:2013) > at > 
org.apache.spark.status.ElementTrackingStore$$anon$1.run(ElementTrackingStore.scala:117) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at
[jira] [Assigned] (SPARK-36827) Task/Stage/Job data remain in memory leads memory leak
[ https://issues.apache.org/jira/browse/SPARK-36827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang reassigned SPARK-36827: -- Assignee: Gengliang Wang > Task/Stage/Job data remain in memory leads memory leak > -- > > Key: SPARK-36827 > URL: https://issues.apache.org/jira/browse/SPARK-36827 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.2 >Reporter: Kohki Nishio >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.2.0 > > Attachments: mem1.txt, worker.txt > > > Noticing memory-leak like behavior, steady increase of heap after GC and > eventually it leads to a service failure. > The GC histogram shows very high number of Task/Data/Job data > {code} > num #instances #bytes class name > -- >6: 7835346 2444627952 org.apache.spark.status.TaskDataWrapper > 25: 3765152 180727296 org.apache.spark.status.StageDataWrapper > 88:2322559290200 org.apache.spark.status.JobDataWrapper > {code} > Thread dumps show clearly the clean up thread is always doing cleanupStages > {code} > "element-tracking-store-worker" #355 daemon prio=5 os_prio=0 > tid=0x7f31b0014800 nid=0x409 runnable [0x7f2f25783000] >java.lang.Thread.State: RUNNABLE > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.util.kvstore.KVTypeInfo$MethodAccessor.get(KVTypeInfo.java:162) > at > org.apache.spark.util.kvstore.InMemoryStore$InMemoryView.compare(InMemoryStore.java:434) > at > org.apache.spark.util.kvstore.InMemoryStore$InMemoryView.lambda$iterator$0(InMemoryStore.java:375) > at > org.apache.spark.util.kvstore.InMemoryStore$InMemoryView$$Lambda$9000/574018760.compare(Unknown > Source) > at java.util.TimSort.gallopLeft(TimSort.java:542) > at java.util.TimSort.mergeLo(TimSort.java:752) > at java.util.TimSort.mergeAt(TimSort.java:514) > at java.util.TimSort.mergeCollapse(TimSort.java:439) > at 
java.util.TimSort.sort(TimSort.java:245) > at java.util.Arrays.sort(Arrays.java:1512) > at java.util.ArrayList.sort(ArrayList.java:1464) > at > org.apache.spark.util.kvstore.InMemoryStore$InMemoryView.iterator(InMemoryStore.java:375) > at > org.apache.spark.util.kvstore.KVStoreView.closeableIterator(KVStoreView.java:117) > at > org.apache.spark.status.AppStatusListener.$anonfun$cleanupStages$2(AppStatusListener.scala:1269) > at > org.apache.spark.status.AppStatusListener$$Lambda$9126/608388595.apply(Unknown > Source) > at scala.collection.immutable.List.map(List.scala:297) > at > org.apache.spark.status.AppStatusListener.cleanupStages(AppStatusListener.scala:1260) > at > org.apache.spark.status.AppStatusListener.$anonfun$new$3(AppStatusListener.scala:98) > at > org.apache.spark.status.AppStatusListener$$Lambda$646/596139882.apply$mcVJ$sp(Unknown > Source) > at > org.apache.spark.status.ElementTrackingStore.$anonfun$write$3(ElementTrackingStore.scala:135) > at > org.apache.spark.status.ElementTrackingStore.$anonfun$write$3$adapted(ElementTrackingStore.scala:133) > at > org.apache.spark.status.ElementTrackingStore$$Lambda$986/162337848.apply(Unknown > Source) > at scala.collection.immutable.List.foreach(List.scala:431) > at > org.apache.spark.status.ElementTrackingStore.$anonfun$write$2(ElementTrackingStore.scala:133) > at > org.apache.spark.status.ElementTrackingStore.$anonfun$write$2$adapted(ElementTrackingStore.scala:131) > at > org.apache.spark.status.ElementTrackingStore$$Lambda$984/600376389.apply(Unknown > Source) > at > org.apache.spark.status.ElementTrackingStore$LatchedTriggers.$anonfun$fireOnce$1(ElementTrackingStore.scala:58) > at > org.apache.spark.status.ElementTrackingStore$LatchedTriggers$$Lambda$985/1187323214.apply$mcV$sp(Unknown > Source) > at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at org.apache.spark.util.Utils$.tryLog(Utils.scala:2013) > at > 
org.apache.spark.status.ElementTrackingStore$$anon$1.run(ElementTrackingStore.scala:117) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > {code} -- This message was sent by
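The thread dump above shows the cleanup worker re-sorting the whole in-memory view (TimSort over millions of wrappers) on each cleanupStages pass, so eviction can fall behind the write rate and wrapper objects accumulate. One general way to avoid that failure mode is to bound the store and evict incrementally on insert instead of via a periodic full-sort pass. This is a generic sketch of that idea, not Spark's ElementTrackingStore API:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical bounded store: eviction happens incrementally on each
// write, so no periodic pass ever has to sort the full data set.
public class BoundedStore<T> {
    private final Deque<T> entries = new ArrayDeque<>(); // insertion order = age order
    private final int maxEntries;

    public BoundedStore(int maxEntries) { this.maxEntries = maxEntries; }

    public void write(T entry) {
        entries.addLast(entry);
        while (entries.size() > maxEntries) {
            entries.removeFirst(); // evict oldest; O(1) per write
        }
    }

    public int size() { return entries.size(); }
}
```

Because age order coincides with insertion order here, eviction never needs the O(n log n) sort that dominates the stack trace above.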
[jira] [Assigned] (SPARK-35174) Avoid opening watch when waitAppCompletion is false
[ https://issues.apache.org/jira/browse/SPARK-35174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35174: Assignee: Apache Spark > Avoid opening watch when waitAppCompletion is false > --- > > Key: SPARK-35174 > URL: https://issues.apache.org/jira/browse/SPARK-35174 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.1.1 >Reporter: Jonathan Lafleche >Assignee: Apache Spark >Priority: Minor > > In spark-submit, we currently [open a pod watch for any spark > submission|https://github.com/apache/spark/blame/0494dc90af48ce7da0625485a4dc6917a244d580/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/KubernetesClientApplication.scala#L150-L167]. > If WAIT_FOR_APP_COMPLETION is false, we then immediately ignore the result > of the watcher and break out of the watcher. > When submitting spark applications at scale, this is a source of operational > pain, since opening the watch relies on opening a websocket, which tends to > run into subtle networking issues around negotiating the websocket connection. > I'd like to change this behaviour so that we eagerly check whether we are > waiting on app completion, and avoid opening the watch altogether when > WAIT_FOR_APP_COMPLETION is false. > Would you accept a contribution for that change, or are there any concerns > I've overlooked? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35174) Avoid opening watch when waitAppCompletion is false
[ https://issues.apache.org/jira/browse/SPARK-35174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17419681#comment-17419681 ] Apache Spark commented on SPARK-35174: -- User 'slothspot' has created a pull request for this issue: https://github.com/apache/spark/pull/34095 > Avoid opening watch when waitAppCompletion is false > --- > > Key: SPARK-35174 > URL: https://issues.apache.org/jira/browse/SPARK-35174 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.1.1 >Reporter: Jonathan Lafleche >Priority: Minor > > In spark-submit, we currently [open a pod watch for any spark > submission|https://github.com/apache/spark/blame/0494dc90af48ce7da0625485a4dc6917a244d580/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/KubernetesClientApplication.scala#L150-L167]. > If WAIT_FOR_APP_COMPLETION is false, we then immediately ignore the result > of the watcher and break out of the watcher. > When submitting spark applications at scale, this is a source of operational > pain, since opening the watch relies on opening a websocket, which tends to > run into subtle networking issues around negotiating the websocket connection. > I'd like to change this behaviour so that we eagerly check whether we are > waiting on app completion, and avoid opening the watch altogether when > WAIT_FOR_APP_COMPLETION is false. > Would you accept a contribution for that change, or are there any concerns > I've overlooked? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
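The change proposed in SPARK-35174 is simple control flow: check the flag before constructing the watch, rather than opening the watch and then discarding its result. A sketch with hypothetical interface names (the real code lives in KubernetesClientApplication and uses the fabric8 client):

```java
// Hypothetical submission client illustrating the proposed reordering:
// only open the network-backed watch if we will actually consume it.
public class SubmitFlow {
    interface Watch extends AutoCloseable { void awaitCompletion(); }
    interface Client { Watch openPodWatch(); }

    static int watchesOpened = 0; // instrumentation for the example only

    public static void submit(Client client, boolean waitForAppCompletion) throws Exception {
        // Eager check: skip the websocket-backed watch entirely when the
        // caller does not care about completion (fire-and-forget submit).
        if (!waitForAppCompletion) {
            return;
        }
        try (Watch w = client.openPodWatch()) {
            w.awaitCompletion();
        }
    }
}
```

Skipping the watch when `waitAppCompletion` is false removes the websocket negotiation, which the reporter identifies as the flaky step at scale.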
[jira] [Commented] (SPARK-36742) Fix ps.to_datetime with plurals of keys like years, months, days
[ https://issues.apache.org/jira/browse/SPARK-36742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17419678#comment-17419678 ] Apache Spark commented on SPARK-36742: -- User 'dgd-contributor' has created a pull request for this issue: https://github.com/apache/spark/pull/34094 > Fix ps.to_datetime with plurals of keys like years, months, days > > > Key: SPARK-36742 > URL: https://issues.apache.org/jira/browse/SPARK-36742 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: dgd_contributor >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36838) Keep In/InSet use same nullSafeEval and refactor
[ https://issues.apache.org/jira/browse/SPARK-36838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-36838: -- Summary: Keep In/InSet use same nullSafeEval and refactor (was: Keep In/InSet use same nullSafeEval) > Keep In/InSet use same nullSafeEval and refactor > > > Key: SPARK-36838 > URL: https://issues.apache.org/jira/browse/SPARK-36838 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.3, 3.1.2, 3.2.0 >Reporter: angerszhu >Priority: Major > > In current code, In/InSet return null when value is null, this behavior is > not correct -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36656) CollapseProject should not collapse correlated scalar subqueries
[ https://issues.apache.org/jira/browse/SPARK-36656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-36656. - Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 33903 [https://github.com/apache/spark/pull/33903] > CollapseProject should not collapse correlated scalar subqueries > > > Key: SPARK-36656 > URL: https://issues.apache.org/jira/browse/SPARK-36656 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Allison Wang >Assignee: Allison Wang >Priority: Major > Fix For: 3.3.0 > > > Currently, the optimizer rule `CollapseProject` inlines expressions generated > from correlated scalar subqueries, which can create unnecessary left outer > joins. > {code:sql} > select c1, s, s * 10 from ( > select c1, (select first(c2) from t2 where t1.c1 = t2.c1) s from t1) > {code} > {code:scala} > // Before > Project [c1, s, (s * 10)] > +- Project [c1, scalar-subquery [c1] AS s] >: +- Aggregate [c1], [first(c2), c1] >: +- LocalRelation [c1, c2] >+- LocalRelation [c1, c2] > // After (scalar subqueries are inlined) > Project [c1, scalar-subquery [c1], (scalar-subquery [c1] * 10)] > : +- Aggregate [c1], [first(c2), c1] > : +- LocalRelation [c1, c2] > : +- Aggregate [c1], [first(c2), c1] > : +- LocalRelation [c1, c2] > +- LocalRelation [c1, c2] > {code} > Then this query will have two LeftOuter joins created. We should only > collapse projects after correlated subqueries are rewritten as joins. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36656) CollapseProject should not collapse correlated scalar subqueries
[ https://issues.apache.org/jira/browse/SPARK-36656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-36656: --- Assignee: Allison Wang > CollapseProject should not collapse correlated scalar subqueries > > > Key: SPARK-36656 > URL: https://issues.apache.org/jira/browse/SPARK-36656 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Allison Wang >Assignee: Allison Wang >Priority: Major > > Currently, the optimizer rule `CollapseProject` inlines expressions generated > from correlated scalar subqueries, which can create unnecessary left outer > joins. > {code:sql} > select c1, s, s * 10 from ( > select c1, (select first(c2) from t2 where t1.c1 = t2.c1) s from t1) > {code} > {code:scala} > // Before > Project [c1, s, (s * 10)] > +- Project [c1, scalar-subquery [c1] AS s] >: +- Aggregate [c1], [first(c2), c1] >: +- LocalRelation [c1, c2] >+- LocalRelation [c1, c2] > // After (scalar subqueries are inlined) > Project [c1, scalar-subquery [c1], (scalar-subquery [c1] * 10)] > : +- Aggregate [c1], [first(c2), c1] > : +- LocalRelation [c1, c2] > : +- Aggregate [c1], [first(c2), c1] > : +- LocalRelation [c1, c2] > +- LocalRelation [c1, c2] > {code} > Then this query will have two LeftOuter joins created. We should only > collapse projects after correlated subqueries are rewritten as joins. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
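The cost of collapsing in SPARK-36656 is classic common-subexpression duplication: once `s` is inlined into both `s` and `s * 10`, whatever produces `s` (here the correlated scalar subquery, later rewritten as a left outer join) is evaluated once per use site. A language-level Java analogue with hypothetical names:

```java
// Analogue of inlining a shared expensive expression: referencing the
// result twice is cheap; inlining the producer duplicates its work.
public class InlineCost {
    static int evaluations = 0;

    // Stands in for the correlated scalar subquery.
    static int expensiveSubquery() {
        evaluations++;
        return 7;
    }

    // "Before": compute s once, reference it twice.
    static int[] factored() {
        int s = expensiveSubquery();
        return new int[]{s, s * 10};
    }

    // "After collapsing": the producer is inlined at every use site.
    static int[] inlined() {
        return new int[]{expensiveSubquery(), expensiveSubquery() * 10};
    }
}
```

Both versions return the same values, but the inlined one does the expensive work twice, just as the collapsed plan builds two left outer joins.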
[jira] [Assigned] (SPARK-36792) Inset should handle Double.NaN and Float.NaN
[ https://issues.apache.org/jira/browse/SPARK-36792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-36792: --- Assignee: angerszhu > Inset should handle Double.NaN and Float.NaN > > > Key: SPARK-36792 > URL: https://issues.apache.org/jira/browse/SPARK-36792 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.2, 3.1.2, 3.2.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Labels: correctness > > Inset(Double.NaN, Seq(Double.NaN, 1d)) return false -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36792) Inset should handle Double.NaN and Float.NaN
[ https://issues.apache.org/jira/browse/SPARK-36792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-36792. - Fix Version/s: 3.1.3 3.2.0 3.0.4 Resolution: Fixed Issue resolved by pull request 34033 [https://github.com/apache/spark/pull/34033] > Inset should handle Double.NaN and Float.NaN > > > Key: SPARK-36792 > URL: https://issues.apache.org/jira/browse/SPARK-36792 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.2, 3.1.2, 3.2.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Labels: correctness > Fix For: 3.0.4, 3.2.0, 3.1.3 > > > Inset(Double.NaN, Seq(Double.NaN, 1d)) return false -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36747) Do not collapse Project with Aggregate when correlated subqueries are present in the project list
[ https://issues.apache.org/jira/browse/SPARK-36747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-36747: --- Assignee: Allison Wang > Do not collapse Project with Aggregate when correlated subqueries are present > in the project list > - > > Key: SPARK-36747 > URL: https://issues.apache.org/jira/browse/SPARK-36747 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Allison Wang >Assignee: Allison Wang >Priority: Major > > Currently CollapseProject combines Project with Aggregate when the shared > attributes are deterministic. But if there are correlated scalar subqueries > in the project list that uses the output of the aggregate, they cannot be > combined. Otherwise, the plan after rewrite will not be valid: > {code} > select (select sum(c2) from t where c1 = cast(s as int)) from (select sum(c2) > s from t) > == Optimized Logical Plan == > Aggregate [sum(c2)#10L AS scalarsubquery(s)#11L] > +- Project [sum(c2)#10L] >+- Join LeftOuter, (c1#2 = cast(sum(c2#3) as int)) > :- LocalRelation [c2#3] > +- Aggregate [c1#2], [sum(c2#3) AS sum(c2)#10L, c1#2] > +- LocalRelation [c1#2, c2#3] > java.lang.UnsupportedOperationException: Cannot generate code for expression: > sum(input[0, int, false]) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36747) Do not collapse Project with Aggregate when correlated subqueries are present in the project list
[ https://issues.apache.org/jira/browse/SPARK-36747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-36747. - Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 34081 [https://github.com/apache/spark/pull/34081] > Do not collapse Project with Aggregate when correlated subqueries are present > in the project list > - > > Key: SPARK-36747 > URL: https://issues.apache.org/jira/browse/SPARK-36747 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Allison Wang >Assignee: Allison Wang >Priority: Major > Fix For: 3.2.0 > > > Currently CollapseProject combines Project with Aggregate when the shared > attributes are deterministic. But if there are correlated scalar subqueries > in the project list that uses the output of the aggregate, they cannot be > combined. Otherwise, the plan after rewrite will not be valid: > {code} > select (select sum(c2) from t where c1 = cast(s as int)) from (select sum(c2) > s from t) > == Optimized Logical Plan == > Aggregate [sum(c2)#10L AS scalarsubquery(s)#11L] > +- Project [sum(c2)#10L] >+- Join LeftOuter, (c1#2 = cast(sum(c2#3) as int)) > :- LocalRelation [c2#3] > +- Aggregate [c1#2], [sum(c2#3) AS sum(c2)#10L, c1#2] > +- LocalRelation [c1#2, c2#3] > java.lang.UnsupportedOperationException: Cannot generate code for expression: > sum(input[0, int, false]) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36827) Task/Stage/Job data remaining in memory leads to a memory leak

[ https://issues.apache.org/jira/browse/SPARK-36827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17419631#comment-17419631 ] tenglei commented on SPARK-36827: - I met the same problem in spark 2.3.x, and find a pull request from https://github.com/apache/spark/pull/24616 it seem to relate to this problem and it should work.How much time you spend to occur this problem? > Task/Stage/Job data remain in memory leads memory leak > -- > > Key: SPARK-36827 > URL: https://issues.apache.org/jira/browse/SPARK-36827 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.2 >Reporter: Kohki Nishio >Priority: Major > Attachments: mem1.txt, worker.txt > > > Noticing memory-leak like behavior, steady increase of heap after GC and > eventually it leads to a service failure. > The GC histogram shows very high number of Task/Data/Job data > {code} > num #instances #bytes class name > -- >6: 7835346 2444627952 org.apache.spark.status.TaskDataWrapper > 25: 3765152 180727296 org.apache.spark.status.StageDataWrapper > 88:2322559290200 org.apache.spark.status.JobDataWrapper > {code} > Thread dumps show clearly the clean up thread is always doing cleanupStages > {code} > "element-tracking-store-worker" #355 daemon prio=5 os_prio=0 > tid=0x7f31b0014800 nid=0x409 runnable [0x7f2f25783000] >java.lang.Thread.State: RUNNABLE > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.util.kvstore.KVTypeInfo$MethodAccessor.get(KVTypeInfo.java:162) > at > org.apache.spark.util.kvstore.InMemoryStore$InMemoryView.compare(InMemoryStore.java:434) > at > org.apache.spark.util.kvstore.InMemoryStore$InMemoryView.lambda$iterator$0(InMemoryStore.java:375) > at > org.apache.spark.util.kvstore.InMemoryStore$InMemoryView$$Lambda$9000/574018760.compare(Unknown > Source) > at java.util.TimSort.gallopLeft(TimSort.java:542) > at 
java.util.TimSort.mergeLo(TimSort.java:752) > at java.util.TimSort.mergeAt(TimSort.java:514) > at java.util.TimSort.mergeCollapse(TimSort.java:439) > at java.util.TimSort.sort(TimSort.java:245) > at java.util.Arrays.sort(Arrays.java:1512) > at java.util.ArrayList.sort(ArrayList.java:1464) > at > org.apache.spark.util.kvstore.InMemoryStore$InMemoryView.iterator(InMemoryStore.java:375) > at > org.apache.spark.util.kvstore.KVStoreView.closeableIterator(KVStoreView.java:117) > at > org.apache.spark.status.AppStatusListener.$anonfun$cleanupStages$2(AppStatusListener.scala:1269) > at > org.apache.spark.status.AppStatusListener$$Lambda$9126/608388595.apply(Unknown > Source) > at scala.collection.immutable.List.map(List.scala:297) > at > org.apache.spark.status.AppStatusListener.cleanupStages(AppStatusListener.scala:1260) > at > org.apache.spark.status.AppStatusListener.$anonfun$new$3(AppStatusListener.scala:98) > at > org.apache.spark.status.AppStatusListener$$Lambda$646/596139882.apply$mcVJ$sp(Unknown > Source) > at > org.apache.spark.status.ElementTrackingStore.$anonfun$write$3(ElementTrackingStore.scala:135) > at > org.apache.spark.status.ElementTrackingStore.$anonfun$write$3$adapted(ElementTrackingStore.scala:133) > at > org.apache.spark.status.ElementTrackingStore$$Lambda$986/162337848.apply(Unknown > Source) > at scala.collection.immutable.List.foreach(List.scala:431) > at > org.apache.spark.status.ElementTrackingStore.$anonfun$write$2(ElementTrackingStore.scala:133) > at > org.apache.spark.status.ElementTrackingStore.$anonfun$write$2$adapted(ElementTrackingStore.scala:131) > at > org.apache.spark.status.ElementTrackingStore$$Lambda$984/600376389.apply(Unknown > Source) > at > org.apache.spark.status.ElementTrackingStore$LatchedTriggers.$anonfun$fireOnce$1(ElementTrackingStore.scala:58) > at > org.apache.spark.status.ElementTrackingStore$LatchedTriggers$$Lambda$985/1187323214.apply$mcV$sp(Unknown > Source) > at > 
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at org.apache.spark.util.Utils$.tryLog(Utils.scala:2013) > at > org.apache.spark.status.ElementTrackingStore$$anon$1.run(ElementTrackingStore.scala:117) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at >
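Until the fix from the linked pull request is available in a deployed version, heap pressure from retained `TaskDataWrapper`/`StageDataWrapper`/`JobDataWrapper` objects can usually be reduced by lowering the live-UI retention limits. A sketch: the property names below are standard Spark configs, but the values are illustrative and whether they suffice depends on the workload:

```scala
import org.apache.spark.SparkConf

// Sketch: shrink the number of objects the ElementTrackingStore retains,
// which bounds how many Task/Stage/Job wrappers can accumulate on the heap.
// Values are illustrative, not recommendations.
val conf = new SparkConf()
  .set("spark.ui.retainedJobs", "100")          // down from the default of 1000
  .set("spark.ui.retainedStages", "100")        // down from the default of 1000
  .set("spark.ui.retainedTasks", "10000")       // down from the default of 100000
  .set("spark.sql.ui.retainedExecutions", "50")
```

Lower limits make the periodic cleanup in `AppStatusListener.cleanupStages` cheaper as well, since the in-memory views it sorts stay smaller.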
[jira] [Issue Comment Deleted] (SPARK-36827) Task/Stage/Job data remaining in memory leads to a memory leak
[ https://issues.apache.org/jira/browse/SPARK-36827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tenglei updated SPARK-36827: Comment: was deleted (was: I met the same problem in spark 2.3.x, and find a pull request from [https://github.com/apache/spark/pull/24616|https://github.com/apache/spark/pull/24616,] it seem to relate to this problem and it should work.How much time you spend to occur this problem?) > Task/Stage/Job data remain in memory leads memory leak > -- > > Key: SPARK-36827 > URL: https://issues.apache.org/jira/browse/SPARK-36827 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.2 >Reporter: Kohki Nishio >Priority: Major > Attachments: mem1.txt, worker.txt > > > Noticing memory-leak like behavior, steady increase of heap after GC and > eventually it leads to a service failure. > The GC histogram shows very high number of Task/Data/Job data > {code} > num #instances #bytes class name > -- >6: 7835346 2444627952 org.apache.spark.status.TaskDataWrapper > 25: 3765152 180727296 org.apache.spark.status.StageDataWrapper > 88:2322559290200 org.apache.spark.status.JobDataWrapper > {code} > Thread dumps show clearly the clean up thread is always doing cleanupStages > {code} > "element-tracking-store-worker" #355 daemon prio=5 os_prio=0 > tid=0x7f31b0014800 nid=0x409 runnable [0x7f2f25783000] >java.lang.Thread.State: RUNNABLE > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.util.kvstore.KVTypeInfo$MethodAccessor.get(KVTypeInfo.java:162) > at > org.apache.spark.util.kvstore.InMemoryStore$InMemoryView.compare(InMemoryStore.java:434) > at > org.apache.spark.util.kvstore.InMemoryStore$InMemoryView.lambda$iterator$0(InMemoryStore.java:375) > at > org.apache.spark.util.kvstore.InMemoryStore$InMemoryView$$Lambda$9000/574018760.compare(Unknown > Source) > at 
java.util.TimSort.gallopLeft(TimSort.java:542) > at java.util.TimSort.mergeLo(TimSort.java:752) > at java.util.TimSort.mergeAt(TimSort.java:514) > at java.util.TimSort.mergeCollapse(TimSort.java:439) > at java.util.TimSort.sort(TimSort.java:245) > at java.util.Arrays.sort(Arrays.java:1512) > at java.util.ArrayList.sort(ArrayList.java:1464) > at > org.apache.spark.util.kvstore.InMemoryStore$InMemoryView.iterator(InMemoryStore.java:375) > at > org.apache.spark.util.kvstore.KVStoreView.closeableIterator(KVStoreView.java:117) > at > org.apache.spark.status.AppStatusListener.$anonfun$cleanupStages$2(AppStatusListener.scala:1269) > at > org.apache.spark.status.AppStatusListener$$Lambda$9126/608388595.apply(Unknown > Source) > at scala.collection.immutable.List.map(List.scala:297) > at > org.apache.spark.status.AppStatusListener.cleanupStages(AppStatusListener.scala:1260) > at > org.apache.spark.status.AppStatusListener.$anonfun$new$3(AppStatusListener.scala:98) > at > org.apache.spark.status.AppStatusListener$$Lambda$646/596139882.apply$mcVJ$sp(Unknown > Source) > at > org.apache.spark.status.ElementTrackingStore.$anonfun$write$3(ElementTrackingStore.scala:135) > at > org.apache.spark.status.ElementTrackingStore.$anonfun$write$3$adapted(ElementTrackingStore.scala:133) > at > org.apache.spark.status.ElementTrackingStore$$Lambda$986/162337848.apply(Unknown > Source) > at scala.collection.immutable.List.foreach(List.scala:431) > at > org.apache.spark.status.ElementTrackingStore.$anonfun$write$2(ElementTrackingStore.scala:133) > at > org.apache.spark.status.ElementTrackingStore.$anonfun$write$2$adapted(ElementTrackingStore.scala:131) > at > org.apache.spark.status.ElementTrackingStore$$Lambda$984/600376389.apply(Unknown > Source) > at > org.apache.spark.status.ElementTrackingStore$LatchedTriggers.$anonfun$fireOnce$1(ElementTrackingStore.scala:58) > at > org.apache.spark.status.ElementTrackingStore$LatchedTriggers$$Lambda$985/1187323214.apply$mcV$sp(Unknown > Source) > at > 
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at org.apache.spark.util.Utils$.tryLog(Utils.scala:2013) > at > org.apache.spark.status.ElementTrackingStore$$anon$1.run(ElementTrackingStore.scala:117) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at >
[jira] [Comment Edited] (SPARK-36827) Task/Stage/Job data remaining in memory leads to a memory leak
[ https://issues.apache.org/jira/browse/SPARK-36827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17419629#comment-17419629 ] tenglei edited comment on SPARK-36827 at 9/24/21, 7:50 AM: --- I met the same problem in spark 2.3.x, and find a pull request from [https://github.com/apache/spark/pull/24616|https://github.com/apache/spark/pull/24616,] it seem to relate to this problem and it should work.How much time you spend to occur this problem? was (Author: tenglei): I met the same problem in spark 2.3.x, and find a pull request from [https://github.com/apache/spark/pull/24616,] it seem to relate to this problem and it should work.How much time you spend to occur this problem? > Task/Stage/Job data remain in memory leads memory leak > -- > > Key: SPARK-36827 > URL: https://issues.apache.org/jira/browse/SPARK-36827 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.2 >Reporter: Kohki Nishio >Priority: Major > Attachments: mem1.txt, worker.txt > > > Noticing memory-leak like behavior, steady increase of heap after GC and > eventually it leads to a service failure. 
> The GC histogram shows very high number of Task/Data/Job data > {code} > num #instances #bytes class name > -- >6: 7835346 2444627952 org.apache.spark.status.TaskDataWrapper > 25: 3765152 180727296 org.apache.spark.status.StageDataWrapper > 88:2322559290200 org.apache.spark.status.JobDataWrapper > {code} > Thread dumps show clearly the clean up thread is always doing cleanupStages > {code} > "element-tracking-store-worker" #355 daemon prio=5 os_prio=0 > tid=0x7f31b0014800 nid=0x409 runnable [0x7f2f25783000] >java.lang.Thread.State: RUNNABLE > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.util.kvstore.KVTypeInfo$MethodAccessor.get(KVTypeInfo.java:162) > at > org.apache.spark.util.kvstore.InMemoryStore$InMemoryView.compare(InMemoryStore.java:434) > at > org.apache.spark.util.kvstore.InMemoryStore$InMemoryView.lambda$iterator$0(InMemoryStore.java:375) > at > org.apache.spark.util.kvstore.InMemoryStore$InMemoryView$$Lambda$9000/574018760.compare(Unknown > Source) > at java.util.TimSort.gallopLeft(TimSort.java:542) > at java.util.TimSort.mergeLo(TimSort.java:752) > at java.util.TimSort.mergeAt(TimSort.java:514) > at java.util.TimSort.mergeCollapse(TimSort.java:439) > at java.util.TimSort.sort(TimSort.java:245) > at java.util.Arrays.sort(Arrays.java:1512) > at java.util.ArrayList.sort(ArrayList.java:1464) > at > org.apache.spark.util.kvstore.InMemoryStore$InMemoryView.iterator(InMemoryStore.java:375) > at > org.apache.spark.util.kvstore.KVStoreView.closeableIterator(KVStoreView.java:117) > at > org.apache.spark.status.AppStatusListener.$anonfun$cleanupStages$2(AppStatusListener.scala:1269) > at > org.apache.spark.status.AppStatusListener$$Lambda$9126/608388595.apply(Unknown > Source) > at scala.collection.immutable.List.map(List.scala:297) > at > org.apache.spark.status.AppStatusListener.cleanupStages(AppStatusListener.scala:1260) > at > 
org.apache.spark.status.AppStatusListener.$anonfun$new$3(AppStatusListener.scala:98) > at > org.apache.spark.status.AppStatusListener$$Lambda$646/596139882.apply$mcVJ$sp(Unknown > Source) > at > org.apache.spark.status.ElementTrackingStore.$anonfun$write$3(ElementTrackingStore.scala:135) > at > org.apache.spark.status.ElementTrackingStore.$anonfun$write$3$adapted(ElementTrackingStore.scala:133) > at > org.apache.spark.status.ElementTrackingStore$$Lambda$986/162337848.apply(Unknown > Source) > at scala.collection.immutable.List.foreach(List.scala:431) > at > org.apache.spark.status.ElementTrackingStore.$anonfun$write$2(ElementTrackingStore.scala:133) > at > org.apache.spark.status.ElementTrackingStore.$anonfun$write$2$adapted(ElementTrackingStore.scala:131) > at > org.apache.spark.status.ElementTrackingStore$$Lambda$984/600376389.apply(Unknown > Source) > at > org.apache.spark.status.ElementTrackingStore$LatchedTriggers.$anonfun$fireOnce$1(ElementTrackingStore.scala:58) > at > org.apache.spark.status.ElementTrackingStore$LatchedTriggers$$Lambda$985/1187323214.apply$mcV$sp(Unknown > Source) > at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at org.apache.spark.util.Utils$.tryLog(Utils.scala:2013) > at >
[jira] [Commented] (SPARK-36827) Task/Stage/Job data remaining in memory leads to a memory leak
[ https://issues.apache.org/jira/browse/SPARK-36827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17419629#comment-17419629 ] tenglei commented on SPARK-36827: - I met the same problem in spark 2.3.x, and find a pull request from [https://github.com/apache/spark/pull/24616,] it seem to relate to this problem and it should work.How much time you spend to occur this problem? > Task/Stage/Job data remain in memory leads memory leak > -- > > Key: SPARK-36827 > URL: https://issues.apache.org/jira/browse/SPARK-36827 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.2 >Reporter: Kohki Nishio >Priority: Major > Attachments: mem1.txt, worker.txt > > > Noticing memory-leak like behavior, steady increase of heap after GC and > eventually it leads to a service failure. > The GC histogram shows very high number of Task/Data/Job data > {code} > num #instances #bytes class name > -- >6: 7835346 2444627952 org.apache.spark.status.TaskDataWrapper > 25: 3765152 180727296 org.apache.spark.status.StageDataWrapper > 88:2322559290200 org.apache.spark.status.JobDataWrapper > {code} > Thread dumps show clearly the clean up thread is always doing cleanupStages > {code} > "element-tracking-store-worker" #355 daemon prio=5 os_prio=0 > tid=0x7f31b0014800 nid=0x409 runnable [0x7f2f25783000] >java.lang.Thread.State: RUNNABLE > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.util.kvstore.KVTypeInfo$MethodAccessor.get(KVTypeInfo.java:162) > at > org.apache.spark.util.kvstore.InMemoryStore$InMemoryView.compare(InMemoryStore.java:434) > at > org.apache.spark.util.kvstore.InMemoryStore$InMemoryView.lambda$iterator$0(InMemoryStore.java:375) > at > org.apache.spark.util.kvstore.InMemoryStore$InMemoryView$$Lambda$9000/574018760.compare(Unknown > Source) > at java.util.TimSort.gallopLeft(TimSort.java:542) > at 
java.util.TimSort.mergeLo(TimSort.java:752) > at java.util.TimSort.mergeAt(TimSort.java:514) > at java.util.TimSort.mergeCollapse(TimSort.java:439) > at java.util.TimSort.sort(TimSort.java:245) > at java.util.Arrays.sort(Arrays.java:1512) > at java.util.ArrayList.sort(ArrayList.java:1464) > at > org.apache.spark.util.kvstore.InMemoryStore$InMemoryView.iterator(InMemoryStore.java:375) > at > org.apache.spark.util.kvstore.KVStoreView.closeableIterator(KVStoreView.java:117) > at > org.apache.spark.status.AppStatusListener.$anonfun$cleanupStages$2(AppStatusListener.scala:1269) > at > org.apache.spark.status.AppStatusListener$$Lambda$9126/608388595.apply(Unknown > Source) > at scala.collection.immutable.List.map(List.scala:297) > at > org.apache.spark.status.AppStatusListener.cleanupStages(AppStatusListener.scala:1260) > at > org.apache.spark.status.AppStatusListener.$anonfun$new$3(AppStatusListener.scala:98) > at > org.apache.spark.status.AppStatusListener$$Lambda$646/596139882.apply$mcVJ$sp(Unknown > Source) > at > org.apache.spark.status.ElementTrackingStore.$anonfun$write$3(ElementTrackingStore.scala:135) > at > org.apache.spark.status.ElementTrackingStore.$anonfun$write$3$adapted(ElementTrackingStore.scala:133) > at > org.apache.spark.status.ElementTrackingStore$$Lambda$986/162337848.apply(Unknown > Source) > at scala.collection.immutable.List.foreach(List.scala:431) > at > org.apache.spark.status.ElementTrackingStore.$anonfun$write$2(ElementTrackingStore.scala:133) > at > org.apache.spark.status.ElementTrackingStore.$anonfun$write$2$adapted(ElementTrackingStore.scala:131) > at > org.apache.spark.status.ElementTrackingStore$$Lambda$984/600376389.apply(Unknown > Source) > at > org.apache.spark.status.ElementTrackingStore$LatchedTriggers.$anonfun$fireOnce$1(ElementTrackingStore.scala:58) > at > org.apache.spark.status.ElementTrackingStore$LatchedTriggers$$Lambda$985/1187323214.apply$mcV$sp(Unknown > Source) > at > 
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at org.apache.spark.util.Utils$.tryLog(Utils.scala:2013) > at > org.apache.spark.status.ElementTrackingStore$$anon$1.run(ElementTrackingStore.scala:117) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at >
[jira] [Commented] (SPARK-36294) Refactor fifth set of 20 query execution errors to use error classes
[ https://issues.apache.org/jira/browse/SPARK-36294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17419623#comment-17419623 ] Apache Spark commented on SPARK-36294: -- User 'Peng-Lei' has created a pull request for this issue: https://github.com/apache/spark/pull/34093 > Refactor fifth set of 20 query execution errors to use error classes > > > Key: SPARK-36294 > URL: https://issues.apache.org/jira/browse/SPARK-36294 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, SQL >Affects Versions: 3.2.0 >Reporter: Karen Feng >Priority: Major > > Refactor some exceptions in > [QueryExecutionErrors|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala] > to use error classes. > There are currently ~350 exceptions in this file; so this PR only focuses on > the fifth set of 20. > {code:java} > createStreamingSourceNotSpecifySchemaError > streamedOperatorUnsupportedByDataSourceError > multiplePathsSpecifiedError > failedToFindDataSourceError > removedClassInSpark2Error > incompatibleDataSourceRegisterError > unrecognizedFileFormatError > sparkUpgradeInReadingDatesError > sparkUpgradeInWritingDatesError > buildReaderUnsupportedForFileFormatError > jobAbortedError > taskFailedWhileWritingRowsError > readCurrentFileNotFoundError > unsupportedSaveModeError > cannotClearOutputDirectoryError > cannotClearPartitionDirectoryError > failedToCastValueToDataTypeForPartitionColumnError > endOfStreamError > fallbackV1RelationReportsInconsistentSchemaError > cannotDropNonemptyNamespaceError > {code} > For more detail, see the parent ticket SPARK-36094. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36294) Refactor fifth set of 20 query execution errors to use error classes
[ https://issues.apache.org/jira/browse/SPARK-36294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36294: Assignee: Apache Spark > Refactor fifth set of 20 query execution errors to use error classes > > > Key: SPARK-36294 > URL: https://issues.apache.org/jira/browse/SPARK-36294 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, SQL >Affects Versions: 3.2.0 >Reporter: Karen Feng >Assignee: Apache Spark >Priority: Major > > Refactor some exceptions in > [QueryExecutionErrors|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala] > to use error classes. > There are currently ~350 exceptions in this file; so this PR only focuses on > the fifth set of 20. > {code:java} > createStreamingSourceNotSpecifySchemaError > streamedOperatorUnsupportedByDataSourceError > multiplePathsSpecifiedError > failedToFindDataSourceError > removedClassInSpark2Error > incompatibleDataSourceRegisterError > unrecognizedFileFormatError > sparkUpgradeInReadingDatesError > sparkUpgradeInWritingDatesError > buildReaderUnsupportedForFileFormatError > jobAbortedError > taskFailedWhileWritingRowsError > readCurrentFileNotFoundError > unsupportedSaveModeError > cannotClearOutputDirectoryError > cannotClearPartitionDirectoryError > failedToCastValueToDataTypeForPartitionColumnError > endOfStreamError > fallbackV1RelationReportsInconsistentSchemaError > cannotDropNonemptyNamespaceError > {code} > For more detail, see the parent ticket SPARK-36094. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36294) Refactor fifth set of 20 query execution errors to use error classes
[ https://issues.apache.org/jira/browse/SPARK-36294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17419622#comment-17419622 ] Apache Spark commented on SPARK-36294: -- User 'Peng-Lei' has created a pull request for this issue: https://github.com/apache/spark/pull/34093 > Refactor fifth set of 20 query execution errors to use error classes > > > Key: SPARK-36294 > URL: https://issues.apache.org/jira/browse/SPARK-36294 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, SQL >Affects Versions: 3.2.0 >Reporter: Karen Feng >Priority: Major > > Refactor some exceptions in > [QueryExecutionErrors|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala] > to use error classes. > There are currently ~350 exceptions in this file; so this PR only focuses on > the fifth set of 20. > {code:java} > createStreamingSourceNotSpecifySchemaError > streamedOperatorUnsupportedByDataSourceError > multiplePathsSpecifiedError > failedToFindDataSourceError > removedClassInSpark2Error > incompatibleDataSourceRegisterError > unrecognizedFileFormatError > sparkUpgradeInReadingDatesError > sparkUpgradeInWritingDatesError > buildReaderUnsupportedForFileFormatError > jobAbortedError > taskFailedWhileWritingRowsError > readCurrentFileNotFoundError > unsupportedSaveModeError > cannotClearOutputDirectoryError > cannotClearPartitionDirectoryError > failedToCastValueToDataTypeForPartitionColumnError > endOfStreamError > fallbackV1RelationReportsInconsistentSchemaError > cannotDropNonemptyNamespaceError > {code} > For more detail, see the parent ticket SPARK-36094. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36294) Refactor fifth set of 20 query execution errors to use error classes
[ https://issues.apache.org/jira/browse/SPARK-36294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36294: Assignee: (was: Apache Spark) > Refactor fifth set of 20 query execution errors to use error classes > > > Key: SPARK-36294 > URL: https://issues.apache.org/jira/browse/SPARK-36294 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, SQL >Affects Versions: 3.2.0 >Reporter: Karen Feng >Priority: Major > > Refactor some exceptions in > [QueryExecutionErrors|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala] > to use error classes. > There are currently ~350 exceptions in this file; so this PR only focuses on > the fifth set of 20. > {code:java} > createStreamingSourceNotSpecifySchemaError > streamedOperatorUnsupportedByDataSourceError > multiplePathsSpecifiedError > failedToFindDataSourceError > removedClassInSpark2Error > incompatibleDataSourceRegisterError > unrecognizedFileFormatError > sparkUpgradeInReadingDatesError > sparkUpgradeInWritingDatesError > buildReaderUnsupportedForFileFormatError > jobAbortedError > taskFailedWhileWritingRowsError > readCurrentFileNotFoundError > unsupportedSaveModeError > cannotClearOutputDirectoryError > cannotClearPartitionDirectoryError > failedToCastValueToDataTypeForPartitionColumnError > endOfStreamError > fallbackV1RelationReportsInconsistentSchemaError > cannotDropNonemptyNamespaceError > {code} > For more detail, see the parent ticket SPARK-36094. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
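For context, this refactoring replaces free-form exception messages with error classes resolved from Spark's error-class registry. A rough sketch of the pattern; the class definition and method bodies below are illustrative stand-ins, not the actual Spark source:

```scala
// Minimal stand-in for an error-class-bearing exception type, so the
// sketch is self-contained. Spark's real types resolve the message
// template from its error-class registry instead.
class SparkUnsupportedOperationException(
    val errorClass: String,
    val messageParameters: Map[String, String])
  extends UnsupportedOperationException(
    s"[$errorClass] " +
      messageParameters.map { case (k, v) => s"$k=$v" }.mkString(", "))

object ErrorRefactorSketch {
  // Before: the message text is inlined at the throw site.
  def unsupportedSaveModeErrorOld(mode: String): Throwable =
    new UnsupportedOperationException(s"Save mode $mode is not supported")

  // After: the throw site supplies only a stable error class and the
  // message parameters; the human-readable text lives in the registry.
  def unsupportedSaveModeErrorNew(mode: String): Throwable =
    new SparkUnsupportedOperationException(
      errorClass = "UNSUPPORTED_SAVE_MODE",
      messageParameters = Map("saveMode" -> mode))
}
```

Centralizing the text behind an error class gives every occurrence of the same failure a stable, machine-readable identifier, which is the goal of the parent ticket SPARK-36094.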
[jira] [Resolved] (SPARK-36825) Read/write dataframes with ANSI intervals from/to parquet files
[ https://issues.apache.org/jira/browse/SPARK-36825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-36825. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34057 [https://github.com/apache/spark/pull/34057] > Read/write dataframes with ANSI intervals from/to parquet files > --- > > Key: SPARK-36825 > URL: https://issues.apache.org/jira/browse/SPARK-36825 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.3.0 > > > Implement writing and reading ANSI intervals (year-month and day-time > intervals) columns in dataframes to Parquet datasources. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
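A sketch of the round trip this change enables; it assumes Spark 3.3.0 or later (where the fix landed), and the output path is illustrative:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: write a DataFrame with ANSI year-month and day-time interval
// columns to Parquet, then read it back with the interval types preserved.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("ansi-interval-parquet")
  .getOrCreate()

val df = spark.sql(
  """SELECT INTERVAL '1-2' YEAR TO MONTH      AS ym,
    |       INTERVAL '1 02:03:04' DAY TO SECOND AS dt""".stripMargin)

df.write.mode("overwrite").parquet("/tmp/ansi_intervals")

val back = spark.read.parquet("/tmp/ansi_intervals")
back.printSchema() // ym: interval year to month, dt: interval day to second
```

Before this change, attempting the write step on an interval column would fail because the Parquet datasource did not accept ANSI interval types.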
[jira] [Commented] (SPARK-36840) Support DPP if there is no selective predicate on the filtering side
[ https://issues.apache.org/jira/browse/SPARK-36840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17419604#comment-17419604 ] Apache Spark commented on SPARK-36840: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/34070 > Support DPP if there is no selective predicate on the filtering side > > > Key: SPARK-36840 > URL: https://issues.apache.org/jira/browse/SPARK-36840 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36840) Support DPP if there is no selective predicate on the filtering side
[ https://issues.apache.org/jira/browse/SPARK-36840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36840: Assignee: Apache Spark > Support DPP if there is no selective predicate on the filtering side > > > Key: SPARK-36840 > URL: https://issues.apache.org/jira/browse/SPARK-36840 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36840) Support DPP if there is no selective predicate on the filtering side
[ https://issues.apache.org/jira/browse/SPARK-36840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36840: Assignee: (was: Apache Spark) > Support DPP if there is no selective predicate on the filtering side > > > Key: SPARK-36840 > URL: https://issues.apache.org/jira/browse/SPARK-36840 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36838) Keep In/InSet using the same nullSafeEval
[ https://issues.apache.org/jira/browse/SPARK-36838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-36838: -- Summary: Keep In/InSet use same nullSafeEval (was: In/InSet should handle NULL correctly ) > Keep In/InSet use same nullSafeEval > --- > > Key: SPARK-36838 > URL: https://issues.apache.org/jira/browse/SPARK-36838 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.3, 3.1.2, 3.2.0 >Reporter: angerszhu >Priority: Major > > In current code, In/InSet return null when value is null, this behavior is > not correct -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-36838) Keep In/InSet using the same nullSafeEval
[ https://issues.apache.org/jira/browse/SPARK-36838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu reopened SPARK-36838: --- > Keep In/InSet use same nullSafeEval > --- > > Key: SPARK-36838 > URL: https://issues.apache.org/jira/browse/SPARK-36838 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.3, 3.1.2, 3.2.0 >Reporter: angerszhu >Priority: Major > > In current code, In/InSet return null when value is null, this behavior is > not correct -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
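For reference, both `In` and `InSet` are expected to follow standard SQL three-valued logic for IN, where a NULL on either side yields NULL rather than false — hence the retitled goal of sharing one `nullSafeEval`. A sketch, assuming a local `SparkSession` built as shown:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: standard SQL three-valued logic for IN, which In and InSet
// should implement identically.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("in-null-semantics")
  .getOrCreate()

spark.sql("SELECT 1 IN (1, 2)").show()    // true:  value found
spark.sql("SELECT 3 IN (1, 2)").show()    // false: no match, no NULLs involved
spark.sql("SELECT NULL IN (1, 2)").show() // NULL:  NULL probe value
spark.sql("SELECT 1 IN (2, NULL)").show() // NULL:  no match, but list has NULL
spark.sql("SELECT 1 IN (1, NULL)").show() // true:  a definite match wins
```

Spark rewrites `In` to `InSet` when the literal list is long enough (governed by `spark.sql.optimizer.inSetConversionThreshold`), so any divergence between the two evaluators would make results depend on the number of list elements.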
[jira] [Created] (SPARK-36840) Support DPP if there is no selective predicate on the filtering side
Yuming Wang created SPARK-36840: --- Summary: Support DPP if there is no selective predicate on the filtering side Key: SPARK-36840 URL: https://issues.apache.org/jira/browse/SPARK-36840 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.0 Reporter: Yuming Wang -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org