[jira] [Assigned] (SPARK-37332) Check adding of ANSI interval columns to v1/v2 tables
[ https://issues.apache.org/jira/browse/SPARK-37332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-37332: Assignee: Max Gekk > Check adding of ANSI interval columns to v1/v2 tables > - > > Key: SPARK-37332 > URL: https://issues.apache.org/jira/browse/SPARK-37332 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > Write tests that check adding ANSI interval column to a table -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37332) Check adding of ANSI interval columns to v1/v2 tables
[ https://issues.apache.org/jira/browse/SPARK-37332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-37332. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34600 [https://github.com/apache/spark/pull/34600] > Check adding of ANSI interval columns to v1/v2 tables > - > > Key: SPARK-37332 > URL: https://issues.apache.org/jira/browse/SPARK-37332 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.3.0 > > > Write tests that check adding ANSI interval column to a table -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37277) Support DayTimeIntervalType in Arrow
[ https://issues.apache.org/jira/browse/SPARK-37277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37277: Assignee: Apache Spark > Support DayTimeIntervalType in Arrow > > > Key: SPARK-37277 > URL: https://issues.apache.org/jira/browse/SPARK-37277 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > > Implements the support of DayTimeIntervalType in Arrow code path: > - pandas UDFs > - pandas functions APIs > - createDataFrame/toPandas when Arrow is enabled -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37277) Support DayTimeIntervalType in Arrow
[ https://issues.apache.org/jira/browse/SPARK-37277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37277: Assignee: (was: Apache Spark) > Support DayTimeIntervalType in Arrow > > > Key: SPARK-37277 > URL: https://issues.apache.org/jira/browse/SPARK-37277 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Priority: Major > > Implements the support of DayTimeIntervalType in Arrow code path: > - pandas UDFs > - pandas functions APIs > - createDataFrame/toPandas when Arrow is enabled -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37277) Support DayTimeIntervalType in Arrow
[ https://issues.apache.org/jira/browse/SPARK-37277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444258#comment-17444258 ] Apache Spark commented on SPARK-37277: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/34614 > Support DayTimeIntervalType in Arrow > > > Key: SPARK-37277 > URL: https://issues.apache.org/jira/browse/SPARK-37277 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Priority: Major > > Implements the support of DayTimeIntervalType in Arrow code path: > - pandas UDFs > - pandas functions APIs > - createDataFrame/toPandas when Arrow is enabled -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
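[Editor's note] The Arrow code path discussed above surfaces DayTimeIntervalType values as timedelta objects on the pandas side. As a minimal sketch of the correspondence (stdlib only, no Spark or Arrow required; Spark documents day-time intervals with microsecond precision):

```python
from datetime import timedelta

# An ANSI SQL day-time interval such as INTERVAL '1 2:03:04' DAY TO SECOND
# carries days, hours, minutes, and seconds with microsecond precision.
# On the pandas side of the Arrow code path it surfaces as a timedelta.
interval = timedelta(days=1, hours=2, minutes=3, seconds=4)

# Day-time intervals are documented as a single microsecond count
# internally; timedelta round-trips through the same representation.
micros = interval // timedelta(microseconds=1)
assert micros == ((24 + 2) * 3600 + 3 * 60 + 4) * 1_000_000
assert timedelta(microseconds=micros) == interval
```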
[jira] [Commented] (SPARK-37181) pyspark.pandas.read_csv() should support latin-1 encoding
[ https://issues.apache.org/jira/browse/SPARK-37181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444256#comment-17444256 ] pralabhkumar commented on SPARK-37181: -- [~yikunkero] [~chconnell] . I'll work on this and will create a PR > pyspark.pandas.read_csv() should support latin-1 encoding > - > > Key: SPARK-37181 > URL: https://issues.apache.org/jira/browse/SPARK-37181 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Chuck Connell >Priority: Major > > {{In regular pandas, you can say read_csv(encoding='latin-1'). This encoding > is not recognized in pyspark.pandas. You have to use Windows-1252 instead, > which is almost the same but not identical. }} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
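[Editor's note] The "almost the same but not identical" claim in the issue description is easy to verify with the Python stdlib alone (no Spark needed): latin-1 and Windows-1252 agree on every byte except the 0x80-0x9F range.

```python
# Compare the latin-1 and Windows-1252 codecs over all 256 byte values.
# latin-1 maps every byte 0xNN to U+00NN; cp1252 remaps the 0x80-0x9F
# range to printable characters (curly quotes, dashes, the euro sign, ...)
# and leaves five bytes in that range undefined.
diverging = [
    b
    for b in range(256)
    if bytes([b]).decode("latin-1") != bytes([b]).decode("cp1252", errors="replace")
]

# Every divergence sits in 0x80-0x9F; all 32 bytes there differ.
assert diverging == list(range(0x80, 0xA0))

# Example: 0x93 is the C1 control U+0093 in latin-1,
# but a left curly quote U+201C in Windows-1252.
assert bytes([0x93]).decode("latin-1") == "\x93"
assert bytes([0x93]).decode("cp1252") == "\u201c"
```

So treating the two as interchangeable silently changes those 32 code points, which is why a dedicated latin-1 path is worth supporting.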
[jira] [Assigned] (SPARK-37337) Improve the API of Spark DataFrame to pandas-on-Spark DataFrame conversion
[ https://issues.apache.org/jira/browse/SPARK-37337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37337: Assignee: Apache Spark > Improve the API of Spark DataFrame to pandas-on-Spark DataFrame conversion > -- > > Key: SPARK-37337 > URL: https://issues.apache.org/jira/browse/SPARK-37337 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Assignee: Apache Spark >Priority: Major > > - Undeprecate (Spark)DataFrame.to_koalas > - Deprecate (Spark)DataFrame.to_pandas_like and introduce > (Spark)DataFrame.pandas_api instead. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37337) Improve the API of Spark DataFrame to pandas-on-Spark DataFrame conversion
[ https://issues.apache.org/jira/browse/SPARK-37337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444247#comment-17444247 ] Apache Spark commented on SPARK-37337: -- User 'xinrong-databricks' has created a pull request for this issue: https://github.com/apache/spark/pull/34608 > Improve the API of Spark DataFrame to pandas-on-Spark DataFrame conversion > -- > > Key: SPARK-37337 > URL: https://issues.apache.org/jira/browse/SPARK-37337 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Priority: Major > > - Undeprecate (Spark)DataFrame.to_koalas > - Deprecate (Spark)DataFrame.to_pandas_like and introduce > (Spark)DataFrame.pandas_api instead. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37337) Improve the API of Spark DataFrame to pandas-on-Spark DataFrame conversion
[ https://issues.apache.org/jira/browse/SPARK-37337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37337: Assignee: (was: Apache Spark) > Improve the API of Spark DataFrame to pandas-on-Spark DataFrame conversion > -- > > Key: SPARK-37337 > URL: https://issues.apache.org/jira/browse/SPARK-37337 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Priority: Major > > - Undeprecate (Spark)DataFrame.to_koalas > - Deprecate (Spark)DataFrame.to_pandas_like and introduce > (Spark)DataFrame.pandas_api instead. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37337) Improve the API of Spark DataFrame to pandas-on-Spark DataFrame conversion
[ https://issues.apache.org/jira/browse/SPARK-37337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-37337: - Description: - Undeprecate (Spark)DataFrame.to_koalas - Deprecate (Spark)DataFrame.to_pandas_like and introduce (Spark)DataFrame.pandas_api instead. was: Undeprecate (Spark)DataFrame.to_koalas Rename (Spark)DataFrame.to_pandas_like to (Spark)DataFrame.pandas_api > Improve the API of Spark DataFrame to pandas-on-Spark DataFrame conversion > -- > > Key: SPARK-37337 > URL: https://issues.apache.org/jira/browse/SPARK-37337 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Priority: Major > > - Undeprecate (Spark)DataFrame.to_koalas > - Deprecate (Spark)DataFrame.to_pandas_like and introduce > (Spark)DataFrame.pandas_api instead. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37338) Rename (Spark)DataFrame.to_pandas_on_spark to (Spark)DataFrame.pandas_api
[ https://issues.apache.org/jira/browse/SPARK-37338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng resolved SPARK-37338. -- Resolution: Duplicate > Rename (Spark)DataFrame.to_pandas_on_spark to (Spark)DataFrame.pandas_api > - > > Key: SPARK-37338 > URL: https://issues.apache.org/jira/browse/SPARK-37338 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Priority: Major > > Currently, (Spark)DataFrame.to_pandas_on_spark is too long to memorize and > inconvenient to call. > So we wanted to rename to_pandas_on_spark to pandas_api for API usability -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-37344) split function behave differently between spark 2.3 and spark 3.2
[ https://issues.apache.org/jira/browse/SPARK-37344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444225#comment-17444225 ] angerszhu edited comment on SPARK-37344 at 11/16/21, 2:51 AM: -- In latest master branch
{code}
== Parsed Logical Plan ==
'Project [unresolvedalias('split('name, \;), None)]
+- 'UnresolvedRelation [split_test], [], false

== Analyzed Logical Plan ==
split(name, \;, -1): array
Project [split(name#225, \;, -1) AS split(name, \;, -1)#226]
+- SubqueryAlias spark_catalog.default.split_test
   +- Relation default.split_test[id#224,name#225] parquet

== Optimized Logical Plan ==
Project [split(name#225, \;, -1) AS split(name, \;, -1)#226]
+- Relation default.split_test[id#224,name#225] parquet

== Physical Plan ==
*(1) Project [split(name#225, \;, -1) AS split(name, \;, -1)#226]
+- *(1) ColumnarToRow
   +- FileScan parquet default.split_test[name#225] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/Users/yi.zhu/Documents/project/Angerszh/spark/sql/core/spark..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct
{code}
was (Author: angerszhuuu): Work on this > split function behave differently between spark 2.3 and spark 3.2 > - > > Key: SPARK-37344 > URL: https://issues.apache.org/jira/browse/SPARK-37344 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.1, 3.1.2, 3.2.0 >Reporter: ocean >Priority: Major > Labels: incorrect > > while use split function in sql, it behave differently between 2.3 and 3.2, > which cause incorrect problem.
> we can use this sql to reproduce this problem: > > create table split_test ( id int,name string) > insert into split_test values(1,"abc;def") > explain extended select split(name,';') from split_test > > spark3: > spark-sql> Explain extended select split(name,';') from split_test; > == Parsed Logical Plan == > 'Project [unresolvedalias('split('name, \\;), None)] > +- 'UnresolvedRelation [split_test], [], false > > spark2: > > spark-sql> Explain extended select split(name,';') from split_test; > == Parsed Logical Plan == > 'Project [unresolvedalias('split('name, \;), None)] > +- 'UnresolvedRelation split_test > > It looks like the deal of escape is different -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37181) pyspark.pandas.read_csv() should support latin-1 encoding
[ https://issues.apache.org/jira/browse/SPARK-37181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444235#comment-17444235 ] Yikun Jiang commented on SPARK-37181: - Actually, latin-1 also does the same conversion in Python's internal implementation, so I think it's good to do the same conversion. [1] https://github.com/python/cpython/blob/9bf2cbc4c498812e14f20d86acb61c53928a5a57/Lib/encodings/latin_1.py#L43 > pyspark.pandas.read_csv() should support latin-1 encoding > - > > Key: SPARK-37181 > URL: https://issues.apache.org/jira/browse/SPARK-37181 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Chuck Connell >Priority: Major > > {{In regular pandas, you can say read_csv(encoding='latin-1'). This encoding > is not recognized in pyspark.pandas. You have to use Windows-1252 instead, > which is almost the same but not identical. }} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-37181) pyspark.pandas.read_csv() should support latin-1 encoding
[ https://issues.apache.org/jira/browse/SPARK-37181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444235#comment-17444235 ] Yikun Jiang edited comment on SPARK-37181 at 11/16/21, 2:41 AM: Agree. Actually, latin-1 also does the same conversion in Python's internal implementation, so I think it's good to do the same conversion. [1] [https://github.com/python/cpython/blob/9bf2cbc4c498812e14f20d86acb61c53928a5a57/Lib/encodings/latin_1.py#L43] was (Author: yikunkero): Actually, latin-1 also does the same conversion in Python's internal implementation, so I think it's good to do the same conversion. [1] https://github.com/python/cpython/blob/9bf2cbc4c498812e14f20d86acb61c53928a5a57/Lib/encodings/latin_1.py#L43 > pyspark.pandas.read_csv() should support latin-1 encoding > - > > Key: SPARK-37181 > URL: https://issues.apache.org/jira/browse/SPARK-37181 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Chuck Connell >Priority: Major > > {{In regular pandas, you can say read_csv(encoding='latin-1'). This encoding > is not recognized in pyspark.pandas. You have to use Windows-1252 instead, > which is almost the same but not identical. }} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37344) split function behave differently between spark 2.3 and spark 3.2
[ https://issues.apache.org/jira/browse/SPARK-37344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444225#comment-17444225 ] angerszhu commented on SPARK-37344: --- Work on this > split function behave differently between spark 2.3 and spark 3.2 > - > > Key: SPARK-37344 > URL: https://issues.apache.org/jira/browse/SPARK-37344 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.1, 3.1.2, 3.2.0 >Reporter: ocean >Priority: Major > Labels: incorrect > > while use split function in sql, it behave differently between 2.3 and 3.2, > which cause incorrect problem. > we can use this sql to reproduce this problem: > > create table split_test ( id int,name string) > insert into split_test values(1,"abc;def") > explain extended select split(name,';') from split_test > > spark3: > spark-sql> Explain extended select split(name,';') from split_test; > == Parsed Logical Plan == > 'Project [unresolvedalias('split('name, \\;), None)] > +- 'UnresolvedRelation [split_test], [], false > > spark2: > > spark-sql> Explain extended select split(name,';') from split_test; > == Parsed Logical Plan == > 'Project [unresolvedalias('split('name, \;), None)] > +- 'UnresolvedRelation split_test > > It looks like the deal of escape is different -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37344) split function behave differently between spark 2.3 and spark 3.2
[ https://issues.apache.org/jira/browse/SPARK-37344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ocean updated SPARK-37344: -- Labels: incorrect (was: ) > split function behave differently between spark 2.3 and spark 3.2 > - > > Key: SPARK-37344 > URL: https://issues.apache.org/jira/browse/SPARK-37344 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.1, 3.1.2, 3.2.0 >Reporter: ocean >Priority: Major > Labels: incorrect > > while use split function in sql, it behave differently between 2.3 and 3.2, > which cause incorrect problem. > we can use this sql to reproduce this problem: > > create table split_test ( id int,name string) > insert into split_test values(1,"abc;def") > explain extended select split(name,';') from split_test > > spark3: > spark-sql> Explain extended select split(name,';') from split_test; > == Parsed Logical Plan == > 'Project [unresolvedalias('split('name, \\;), None)] > +- 'UnresolvedRelation [split_test], [], false > > spark2: > > spark-sql> Explain extended select split(name,';') from split_test; > == Parsed Logical Plan == > 'Project [unresolvedalias('split('name, \;), None)] > +- 'UnresolvedRelation split_test > > It looks like the deal of escape is different -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37344) split function behave differently between spark 2.3 and spark 3.2
ocean created SPARK-37344: - Summary: split function behave differently between spark 2.3 and spark 3.2 Key: SPARK-37344 URL: https://issues.apache.org/jira/browse/SPARK-37344 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.0, 3.1.2, 3.1.1 Reporter: ocean When the split function is used in SQL, it behaves differently between 2.3 and 3.2, which causes incorrect results. We can use this SQL to reproduce the problem:

create table split_test (id int, name string)
insert into split_test values (1, "abc;def")
explain extended select split(name, ';') from split_test

spark3:
spark-sql> Explain extended select split(name,';') from split_test;
== Parsed Logical Plan ==
'Project [unresolvedalias('split('name, \\;), None)]
+- 'UnresolvedRelation [split_test], [], false

spark2:
spark-sql> Explain extended select split(name,';') from split_test;
== Parsed Logical Plan ==
'Project [unresolvedalias('split('name, \;), None)]
+- 'UnresolvedRelation split_test

It looks like the handling of escapes is different. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
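[Editor's note] The plans above differ only in how the string literal is unescaped before it reaches the regex engine (\\; vs \;), not in regex semantics: ';' is not a regex metacharacter, so a single-backslash escape of it is redundant. A quick stdlib check (Python's re here, standing in for the Java regex engine that Spark's split actually uses; both accept an escaped ';' as a literal ';'):

```python
import re

# ';' is not a regex metacharacter, so the pattern ';' and the
# redundantly escaped pattern r'\;' split "abc;def" identically.
assert re.split(r";", "abc;def") == ["abc", "def"]
assert re.split(r"\;", "abc;def") == ["abc", "def"]

# The behavior change reported in this ticket is therefore about how
# the SQL string literal is unescaped before reaching the regex engine,
# not about the regex meaning of ';' itself.
```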
[jira] [Commented] (SPARK-37343) Implement createIndex and IndexExists in JDBC (Postgres dialect)
[ https://issues.apache.org/jira/browse/SPARK-37343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444210#comment-17444210 ] dch nguyen commented on SPARK-37343: I'm working on this. > Implement createIndex and IndexExists in JDBC (Postgres dialect) > > > Key: SPARK-37343 > URL: https://issues.apache.org/jira/browse/SPARK-37343 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: dch nguyen >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37343) Implement createIndex and IndexExists in JDBC (Postgres dialect)
dch nguyen created SPARK-37343: -- Summary: Implement createIndex and IndexExists in JDBC (Postgres dialect) Key: SPARK-37343 URL: https://issues.apache.org/jira/browse/SPARK-37343 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.3.0 Reporter: dch nguyen -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37335) Clarify output of FPGrowth
[ https://issues.apache.org/jira/browse/SPARK-37335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-37335: Assignee: Nicholas Chammas > Clarify output of FPGrowth > -- > > Key: SPARK-37335 > URL: https://issues.apache.org/jira/browse/SPARK-37335 > Project: Spark > Issue Type: Improvement > Components: Documentation, ML >Affects Versions: 3.2.0 >Reporter: Nicholas Chammas >Assignee: Nicholas Chammas >Priority: Minor > > The association rules returned by FPGrow include more columns than are > documented, like {{{}lift{}}}: > [https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html] > We should offer a basic description of these columns. An _itemset_ should > also be briefly defined. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37335) Clarify output of FPGrowth
[ https://issues.apache.org/jira/browse/SPARK-37335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-37335. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34605 [https://github.com/apache/spark/pull/34605] > Clarify output of FPGrowth > -- > > Key: SPARK-37335 > URL: https://issues.apache.org/jira/browse/SPARK-37335 > Project: Spark > Issue Type: Improvement > Components: Documentation, ML >Affects Versions: 3.2.0 >Reporter: Nicholas Chammas >Assignee: Nicholas Chammas >Priority: Minor > Fix For: 3.3.0 > > > The association rules returned by FPGrow include more columns than are > documented, like {{{}lift{}}}: > [https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html] > We should offer a basic description of these columns. An _itemset_ should > also be briefly defined. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
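[Editor's note] Pending the documentation fix tracked above, the undocumented columns can be described from first principles. A minimal sketch with toy data (not Spark's FPGrowth implementation) of how support, confidence, and lift relate for one association rule:

```python
# Toy transactions; each is an itemset (a set of items bought together).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
n = len(transactions)

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / n

# Rule {bread} => {milk}: confidence is P(milk | bread);
# lift is confidence / P(milk), so lift > 1 means positive association.
antecedent, consequent = {"bread"}, {"milk"}
confidence = support(antecedent | consequent) / support(antecedent)
lift = confidence / support(consequent)

assert support(antecedent) == 3 / 4
assert confidence == (2 / 4) / (3 / 4)   # = 2/3
assert lift == (2 / 3) / (3 / 4)         # = 8/9: a mild negative association
```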
[jira] [Commented] (SPARK-37342) Upgrade Apache Arrow to 6.0.0
[ https://issues.apache.org/jira/browse/SPARK-37342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444195#comment-17444195 ] Apache Spark commented on SPARK-37342: -- User 'sunchao' has created a pull request for this issue: https://github.com/apache/spark/pull/34613 > Upgrade Apache Arrow to 6.0.0 > - > > Key: SPARK-37342 > URL: https://issues.apache.org/jira/browse/SPARK-37342 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.3.0 >Reporter: Chao Sun >Priority: Major > > Spark is still using Apache Arrow 2.0.0 while 6.0.0 was already released last > month. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37342) Upgrade Apache Arrow to 6.0.0
[ https://issues.apache.org/jira/browse/SPARK-37342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37342: Assignee: (was: Apache Spark) > Upgrade Apache Arrow to 6.0.0 > - > > Key: SPARK-37342 > URL: https://issues.apache.org/jira/browse/SPARK-37342 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.3.0 >Reporter: Chao Sun >Priority: Major > > Spark is still using Apache Arrow 2.0.0 while 6.0.0 was already released last > month. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37342) Upgrade Apache Arrow to 6.0.0
[ https://issues.apache.org/jira/browse/SPARK-37342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444194#comment-17444194 ] Apache Spark commented on SPARK-37342: -- User 'sunchao' has created a pull request for this issue: https://github.com/apache/spark/pull/34613 > Upgrade Apache Arrow to 6.0.0 > - > > Key: SPARK-37342 > URL: https://issues.apache.org/jira/browse/SPARK-37342 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.3.0 >Reporter: Chao Sun >Priority: Major > > Spark is still using Apache Arrow 2.0.0 while 6.0.0 was already released last > month. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37342) Upgrade Apache Arrow to 6.0.0
[ https://issues.apache.org/jira/browse/SPARK-37342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37342: Assignee: Apache Spark > Upgrade Apache Arrow to 6.0.0 > - > > Key: SPARK-37342 > URL: https://issues.apache.org/jira/browse/SPARK-37342 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.3.0 >Reporter: Chao Sun >Assignee: Apache Spark >Priority: Major > > Spark is still using Apache Arrow 2.0.0 while 6.0.0 was already released last > month. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37342) Upgrade Apache Arrow to 6.0.0
[ https://issues.apache.org/jira/browse/SPARK-37342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-37342: - Component/s: Build (was: Spark Core) > Upgrade Apache Arrow to 6.0.0 > - > > Key: SPARK-37342 > URL: https://issues.apache.org/jira/browse/SPARK-37342 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.3.0 >Reporter: Chao Sun >Priority: Major > > Spark is still using Apache Arrow 2.0.0 while 6.0.0 was already released last > month. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37342) Upgrade Apache Arrow to 6.0.0
Chao Sun created SPARK-37342: Summary: Upgrade Apache Arrow to 6.0.0 Key: SPARK-37342 URL: https://issues.apache.org/jira/browse/SPARK-37342 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.3.0 Reporter: Chao Sun Spark is still using Apache Arrow 2.0.0 while 6.0.0 was already released last month. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37341) Avoid unnecessary buffer and copy in full outer sort merge join
[ https://issues.apache.org/jira/browse/SPARK-37341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444190#comment-17444190 ] Apache Spark commented on SPARK-37341: -- User 'c21' has created a pull request for this issue: https://github.com/apache/spark/pull/34612 > Avoid unnecessary buffer and copy in full outer sort merge join > --- > > Key: SPARK-37341 > URL: https://issues.apache.org/jira/browse/SPARK-37341 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Cheng Su >Priority: Minor > > FULL OUTER sort merge join (non-code-gen path) copies join keys and buffers > input rows, even when rows from both sides do have matched keys > ([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala#L1637-L1641] > ). This is unnecessary, as we can just output the row with smaller join > keys, and only buffer when both sides have matched keys. This would save us > from unnecessary copy and buffer, when both join sides have a lot of rows not > matched with each other. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37341) Avoid unnecessary buffer and copy in full outer sort merge join
[ https://issues.apache.org/jira/browse/SPARK-37341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444188#comment-17444188 ] Apache Spark commented on SPARK-37341: -- User 'c21' has created a pull request for this issue: https://github.com/apache/spark/pull/34612 > Avoid unnecessary buffer and copy in full outer sort merge join > --- > > Key: SPARK-37341 > URL: https://issues.apache.org/jira/browse/SPARK-37341 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Cheng Su >Priority: Minor > > FULL OUTER sort merge join (non-code-gen path) copies join keys and buffers > input rows, even when rows from both sides do have matched keys > ([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala#L1637-L1641] > ). This is unnecessary, as we can just output the row with smaller join > keys, and only buffer when both sides have matched keys. This would save us > from unnecessary copy and buffer, when both join sides have a lot of rows not > matched with each other. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37341) Avoid unnecessary buffer and copy in full outer sort merge join
[ https://issues.apache.org/jira/browse/SPARK-37341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37341: Assignee: Apache Spark > Avoid unnecessary buffer and copy in full outer sort merge join > --- > > Key: SPARK-37341 > URL: https://issues.apache.org/jira/browse/SPARK-37341 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Cheng Su >Assignee: Apache Spark >Priority: Minor > > FULL OUTER sort merge join (non-code-gen path) copies join keys and buffers > input rows, even when rows from both sides do have matched keys > ([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala#L1637-L1641] > ). This is unnecessary, as we can just output the row with smaller join > keys, and only buffer when both sides have matched keys. This would save us > from unnecessary copy and buffer, when both join sides have a lot of rows not > matched with each other. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37341) Avoid unnecessary buffer and copy in full outer sort merge join
[ https://issues.apache.org/jira/browse/SPARK-37341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37341: Assignee: (was: Apache Spark) > Avoid unnecessary buffer and copy in full outer sort merge join > --- > > Key: SPARK-37341 > URL: https://issues.apache.org/jira/browse/SPARK-37341 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Cheng Su >Priority: Minor > > FULL OUTER sort merge join (non-code-gen path) copies join keys and buffers > input rows, even when rows from both sides do not have matched keys > ([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala#L1637-L1641] > ). This is unnecessary, as we can just output the row with the smaller join > keys, and only buffer when both sides have matched keys. This would save us > from unnecessary copying and buffering when both join sides have a lot of rows not > matched with each other. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37341) Avoid unnecessary buffer and copy in full outer sort merge join
Cheng Su created SPARK-37341: Summary: Avoid unnecessary buffer and copy in full outer sort merge join Key: SPARK-37341 URL: https://issues.apache.org/jira/browse/SPARK-37341 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.0 Reporter: Cheng Su FULL OUTER sort merge join (non-code-gen path) copies join keys and buffers input rows, even when rows from both sides do not have matched keys ([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala#L1637-L1641] ). This is unnecessary, as we can just output the row with the smaller join keys, and only buffer when both sides have matched keys. This would save us from unnecessary copying and buffering when both join sides have a lot of rows not matched with each other. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
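For illustration, the buffering strategy proposed in SPARK-37341 can be sketched in plain Python (a hypothetical simplification, not the actual SortMergeJoinExec code; assumes both inputs are already sorted by a single comparable key):

```python
def full_outer_merge_join(left, right):
    """Full outer merge join over two key-sorted lists of (key, value) pairs.

    Rows whose keys have no match on the other side are emitted immediately,
    with no buffering or copying; only runs of equal keys are buffered to
    produce the cross product of matching rows.
    """
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            out.append((left[i], None))   # unmatched left row: emit directly
            i += 1
        elif lk > rk:
            out.append((None, right[j]))  # unmatched right row: emit directly
            j += 1
        else:
            # Matched keys: buffer the equal-key run from each side.
            li, rj = i, j
            while i < len(left) and left[i][0] == lk:
                i += 1
            while j < len(right) and right[j][0] == rk:
                j += 1
            for lrow in left[li:i]:
                for rrow in right[rj:j]:
                    out.append((lrow, rrow))
    out.extend((lrow, None) for lrow in left[i:])   # leftover left rows
    out.extend((None, rrow) for rrow in right[j:])  # leftover right rows
    return out
```

When most keys on each side have no counterpart, the fast path above never buffers at all, which is exactly the saving the issue describes.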
[jira] [Comment Edited] (SPARK-31646) Remove unused registeredConnections counter from ShuffleMetrics
[ https://issues.apache.org/jira/browse/SPARK-31646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444173#comment-17444173 ] Yongjun Zhang edited comment on SPARK-31646 at 11/15/21, 11:17 PM: --- HI [~mauzhang] , wonder if you have been monitoring the metrics activeConnections and registeredConnections, somehow I observed registeredConnections is smaller than activeConnections, I thought it should be the opposite. I also asked here: https://issues.apache.org/jira/browse/SPARK-25642?focusedCommentId=17442924&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17442924 Thanks. was (Author: yzhangal): HI [~mauzhang] , wonder if you have been monitoring the metrics activeConnections and registeredConnections, somehow I observed registeredConnections is smaller than activeConenctions, I thought it should be the opposite. I also asked here: https://issues.apache.org/jira/browse/SPARK-25642?focusedCommentId=17442924&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17442924 Thanks. > Remove unused registeredConnections counter from ShuffleMetrics > --- > > Key: SPARK-31646 > URL: https://issues.apache.org/jira/browse/SPARK-31646 > Project: Spark > Issue Type: Improvement > Components: Deploy, Shuffle, Spark Core >Affects Versions: 3.0.0 >Reporter: Manu Zhang >Assignee: Manu Zhang >Priority: Minor > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31646) Remove unused registeredConnections counter from ShuffleMetrics
[ https://issues.apache.org/jira/browse/SPARK-31646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444173#comment-17444173 ] Yongjun Zhang commented on SPARK-31646: --- Hi [~mauzhang], wonder if you have been monitoring the metrics activeConnections and registeredConnections; somehow I observed registeredConnections is smaller than activeConnections, and I thought it should be the opposite. I also asked here: https://issues.apache.org/jira/browse/SPARK-25642?focusedCommentId=17442924&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17442924 Thanks. > Remove unused registeredConnections counter from ShuffleMetrics > --- > > Key: SPARK-31646 > URL: https://issues.apache.org/jira/browse/SPARK-31646 > Project: Spark > Issue Type: Improvement > Components: Deploy, Shuffle, Spark Core >Affects Versions: 3.0.0 >Reporter: Manu Zhang >Assignee: Manu Zhang >Priority: Minor > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37340) Display StageIds in Operators for SQL UI
[ https://issues.apache.org/jira/browse/SPARK-37340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444168#comment-17444168 ] Yian Liou commented on SPARK-37340: --- Will be working on this issue and opening a pull request. > Display StageIds in Operators for SQL UI > > > Key: SPARK-37340 > URL: https://issues.apache.org/jira/browse/SPARK-37340 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.2.0 >Reporter: Yian Liou >Priority: Major > > This proposes a more generalized solution to > https://issues.apache.org/jira/browse/SPARK-30209, where a stageId -> operator > mapping is done with the following algorithm. > 1. Read the SparkGraph to get every Node's name and respective AccumulatorIDs. > 2. Get each stage's AccumulatorIDs. > 3. Map Operators to stages by checking for a non-zero intersection of Step 1 > and Step 2's AccumulatorIDs. > 4. Connect SparkGraphNodes to their respective StageIDs for rendering in the SQL UI. > As a result, some operators without max metrics values will also have > stageIds in the UI. This Jira also aims to add minor enhancements to the SQL > UI tab. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37340) Display StageIds in Operators for SQL UI
Yian Liou created SPARK-37340: - Summary: Display StageIds in Operators for SQL UI Key: SPARK-37340 URL: https://issues.apache.org/jira/browse/SPARK-37340 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 3.2.0 Reporter: Yian Liou This proposes a more generalized solution to https://issues.apache.org/jira/browse/SPARK-30209, where a stageId -> operator mapping is done with the following algorithm. 1. Read the SparkGraph to get every Node's name and respective AccumulatorIDs. 2. Get each stage's AccumulatorIDs. 3. Map Operators to stages by checking for a non-zero intersection of Step 1 and Step 2's AccumulatorIDs. 4. Connect SparkGraphNodes to their respective StageIDs for rendering in the SQL UI. As a result, some operators without max metrics values will also have stageIds in the UI. This Jira also aims to add minor enhancements to the SQL UI tab. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
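The four-step algorithm above boils down to a set intersection over accumulator IDs. A minimal Python sketch of the idea (hypothetical names and shapes, not the actual Spark UI code):

```python
def map_operators_to_stages(node_accums, stage_accums):
    """Attribute SQL-plan operators to stage IDs via accumulator-ID overlap.

    node_accums:  {operator name: set of accumulator IDs from the plan graph}
    stage_accums: {stage ID: set of accumulator IDs reported by that stage}
    An operator maps to every stage whose accumulator IDs intersect its own
    (the "non-zero intersection" check in step 3).
    """
    return {
        op: sorted(stage for stage, sids in stage_accums.items() if ids & sids)
        for op, ids in node_accums.items()
    }
```

Because the mapping keys off accumulator IDs rather than max metric values, operators that never reported a max metric still get stage IDs, which is the generalization over SPARK-30209 the description mentions.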
[jira] [Resolved] (SPARK-37339) Add `spark-version` label to driver and executor pods
[ https://issues.apache.org/jira/browse/SPARK-37339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-37339. --- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34609 [https://github.com/apache/spark/pull/34609] > Add `spark-version` label to driver and executor pods > - > > Key: SPARK-37339 > URL: https://issues.apache.org/jira/browse/SPARK-37339 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37339) Add `spark-version` label to driver and executor pods
[ https://issues.apache.org/jira/browse/SPARK-37339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-37339: - Assignee: Dongjoon Hyun > Add `spark-version` label to driver and executor pods > - > > Key: SPARK-37339 > URL: https://issues.apache.org/jira/browse/SPARK-37339 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35867) Enable vectorized read for VectorizedPlainValuesReader.readBooleans
[ https://issues.apache.org/jira/browse/SPARK-35867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35867: Assignee: Apache Spark > Enable vectorized read for VectorizedPlainValuesReader.readBooleans > --- > > Key: SPARK-35867 > URL: https://issues.apache.org/jira/browse/SPARK-35867 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Chao Sun >Assignee: Apache Spark >Priority: Minor > > Currently we decode PLAIN encoded booleans as follow: > {code:java} > public final void readBooleans(int total, WritableColumnVector c, int > rowId) { > // TODO: properly vectorize this > for (int i = 0; i < total; i++) { > c.putBoolean(rowId + i, readBoolean()); > } > } > {code} > Ideally we should vectorize this. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35867) Enable vectorized read for VectorizedPlainValuesReader.readBooleans
[ https://issues.apache.org/jira/browse/SPARK-35867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444135#comment-17444135 ] Apache Spark commented on SPARK-35867: -- User 'kazuyukitanimura' has created a pull request for this issue: https://github.com/apache/spark/pull/34611 > Enable vectorized read for VectorizedPlainValuesReader.readBooleans > --- > > Key: SPARK-35867 > URL: https://issues.apache.org/jira/browse/SPARK-35867 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Chao Sun >Priority: Minor > > Currently we decode PLAIN encoded booleans as follow: > {code:java} > public final void readBooleans(int total, WritableColumnVector c, int > rowId) { > // TODO: properly vectorize this > for (int i = 0; i < total; i++) { > c.putBoolean(rowId + i, readBoolean()); > } > } > {code} > Ideally we should vectorize this. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35867) Enable vectorized read for VectorizedPlainValuesReader.readBooleans
[ https://issues.apache.org/jira/browse/SPARK-35867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35867: Assignee: (was: Apache Spark) > Enable vectorized read for VectorizedPlainValuesReader.readBooleans > --- > > Key: SPARK-35867 > URL: https://issues.apache.org/jira/browse/SPARK-35867 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Chao Sun >Priority: Minor > > Currently we decode PLAIN encoded booleans as follow: > {code:java} > public final void readBooleans(int total, WritableColumnVector c, int > rowId) { > // TODO: properly vectorize this > for (int i = 0; i < total; i++) { > c.putBoolean(rowId + i, readBoolean()); > } > } > {code} > Ideally we should vectorize this. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
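Parquet's PLAIN encoding stores booleans bit-packed, least-significant bit first, so a vectorized reader can unpack a whole byte (up to 8 values) per iteration instead of calling readBoolean() once per value. A Python sketch of the idea (illustrative only; the real fix would live in the Java reader quoted above):

```python
def read_booleans_vectorized(buf, total):
    """Unpack `total` bit-packed booleans (LSB first) from a bytes buffer.

    Processes one byte, i.e. up to 8 values, per outer iteration rather
    than decoding a single bit per call.
    """
    out = []
    for byte_idx in range((total + 7) // 8):
        b = buf[byte_idx]
        n = min(8, total - 8 * byte_idx)  # the last byte may be partial
        out.extend(bool((b >> bit) & 1) for bit in range(n))
    return out
```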
[jira] [Commented] (SPARK-37181) pyspark.pandas.read_csv() should support latin-1 encoding
[ https://issues.apache.org/jira/browse/SPARK-37181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444117#comment-17444117 ] Chuck Connell commented on SPARK-37181: --- That would be a good solution, just convert latin-1 silently to ISO-8859-1. > pyspark.pandas.read_csv() should support latin-1 encoding > - > > Key: SPARK-37181 > URL: https://issues.apache.org/jira/browse/SPARK-37181 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Chuck Connell >Priority: Major > > {{In regular pandas, you can say read_csv(encoding='latin-1'). This encoding > is not recognized in pyspark.pandas. You have to use Windows-1252 instead, > which is almost the same but not identical. }} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
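Worth noting: in CPython itself, latin-1 is already a registered alias of ISO-8859-1, so the silent conversion the comment suggests amounts to a codec-alias lookup; Windows-1252 remains a distinct codec. A quick check:

```python
import codecs

# 'latin-1', 'latin1' and 'iso-8859-1' all resolve to the same codec.
assert codecs.lookup("latin-1").name == "iso8859-1"

# Windows-1252 is a different codec: it fills the 0x80-0x9F range with
# printable characters where ISO-8859-1 has control characters.
assert codecs.lookup("windows-1252").name == "cp1252"
assert b"\x93".decode("cp1252") != b"\x93".decode("iso8859-1")
```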
[jira] [Assigned] (SPARK-34332) Unify v1 and v2 ALTER TABLE .. SET LOCATION tests
[ https://issues.apache.org/jira/browse/SPARK-34332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34332: Assignee: Max Gekk (was: Apache Spark) > Unify v1 and v2 ALTER TABLE .. SET LOCATION tests > - > > Key: SPARK-34332 > URL: https://issues.apache.org/jira/browse/SPARK-34332 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.3.0 > > > Extract ALTER TABLE .. SET LOCATION tests to a common place to run them for > v1 and v2 datasources. Some tests can be placed in v1- and v2-specific test > suites. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34332) Unify v1 and v2 ALTER TABLE .. SET LOCATION tests
[ https://issues.apache.org/jira/browse/SPARK-34332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34332: Assignee: Apache Spark (was: Max Gekk) > Unify v1 and v2 ALTER TABLE .. SET LOCATION tests > - > > Key: SPARK-34332 > URL: https://issues.apache.org/jira/browse/SPARK-34332 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > Fix For: 3.3.0 > > > Extract ALTER TABLE .. SET LOCATION tests to a common place to run them for > v1 and v2 datasources. Some tests can be placed in v1- and v2-specific test > suites. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34332) Unify v1 and v2 ALTER TABLE .. SET LOCATION tests
[ https://issues.apache.org/jira/browse/SPARK-34332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444095#comment-17444095 ] Apache Spark commented on SPARK-34332: -- User 'imback82' has created a pull request for this issue: https://github.com/apache/spark/pull/34610 > Unify v1 and v2 ALTER TABLE .. SET LOCATION tests > - > > Key: SPARK-34332 > URL: https://issues.apache.org/jira/browse/SPARK-34332 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.3.0 > > > Extract ALTER TABLE .. SET LOCATION tests to a common place to run them for > v1 and v2 datasources. Some tests can be placed in v1- and v2-specific test > suites. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37339) Add `spark-version` label to driver and executor pods
[ https://issues.apache.org/jira/browse/SPARK-37339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37339: Assignee: Apache Spark > Add `spark-version` label to driver and executor pods > - > > Key: SPARK-37339 > URL: https://issues.apache.org/jira/browse/SPARK-37339 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37339) Add `spark-version` label to driver and executor pods
[ https://issues.apache.org/jira/browse/SPARK-37339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444087#comment-17444087 ] Apache Spark commented on SPARK-37339: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/34609 > Add `spark-version` label to driver and executor pods > - > > Key: SPARK-37339 > URL: https://issues.apache.org/jira/browse/SPARK-37339 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37324) Support Decimal RoundingMode.UP, DOWN, HALF_DOWN
[ https://issues.apache.org/jira/browse/SPARK-37324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sathiya Kumar updated SPARK-37324: -- Description: Currently we support only the Decimal RoundingModes HALF_UP (round) and HALF_EVEN (bround), but we have use cases that need RoundingMode.UP and RoundingMode.DOWN. In our projects we use a UDF; I also see a few people doing complex operations to achieve the same with Spark native methods. [https://stackoverflow.com/questions/34888419/round-down-double-in-spark/40476117] [https://stackoverflow.com/questions/54683066/is-there-a-rounddown-function-in-sql-as-there-is-in-excel] [https://stackoverflow.com/questions/48279641/oracle-sql-round-half] Adding support for the other rounding modes would benefit a lot of use cases. *SAP Hana Sql ROUND function does it :* {code:java} ROUND( [, [, ]]){code} REF : [https://help.sap.com/viewer/7c78579ce9b14a669c1f3295b0d8ca16/Cloud/en-US/20e6a27575191014bd54a07fd86c585d.html] *Sql Server does something similar to this* : {code:java} ROUND ( numeric_expression , length [ ,function ] ){code} REF : [https://docs.microsoft.com/en-us/sql/t-sql/functions/round-transact-sql?view=sql-server-ver15] was: Currently we support only Decimal RoundingModes : HALF_UP (round) and HALF_EVEN (bround). But we have use cases that needs RoundingMode.UP and RoundingMode.DOWN. In our projects we use UDF, i also see few people do complex operations to do the same with spark native methods. [https://stackoverflow.com/questions/34888419/round-down-double-in-spark/40476117] [https://stackoverflow.com/questions/54683066/is-there-a-rounddown-function-in-sql-as-there-is-in-excel] [https://stackoverflow.com/questions/48279641/oracle-sql-round-half] Opening support for the other rounding modes might interest a lot of use cases. 
Sql Server does something similar to this : [https://docs.microsoft.com/en-us/sql/t-sql/functions/round-transact-sql?view=sql-server-ver15] > Support Decimal RoundingMode.UP, DOWN, HALF_DOWN > > > Key: SPARK-37324 > URL: https://issues.apache.org/jira/browse/SPARK-37324 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.2.0 >Reporter: Sathiya Kumar >Priority: Minor > > Currently we support only the Decimal RoundingModes HALF_UP (round) and > HALF_EVEN (bround), but we have use cases that need RoundingMode.UP and > RoundingMode.DOWN. In our projects we use a UDF; I also see a few people doing > complex operations to achieve the same with Spark native methods. > [https://stackoverflow.com/questions/34888419/round-down-double-in-spark/40476117] > [https://stackoverflow.com/questions/54683066/is-there-a-rounddown-function-in-sql-as-there-is-in-excel] > [https://stackoverflow.com/questions/48279641/oracle-sql-round-half] > > Adding support for the other rounding modes would benefit a lot of use > cases. > *SAP Hana Sql ROUND function does it :* > {code:java} > ROUND( [, [, ]]){code} > REF : > [https://help.sap.com/viewer/7c78579ce9b14a669c1f3295b0d8ca16/Cloud/en-US/20e6a27575191014bd54a07fd86c585d.html] > *Sql Server does something similar to this* : > {code:java} > ROUND ( numeric_expression , length [ ,function ] ){code} > REF : > [https://docs.microsoft.com/en-us/sql/t-sql/functions/round-transact-sql?view=sql-server-ver15] > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
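The requested modes mirror the standard java.math.RoundingMode family; Python's decimal module exposes the same semantics, which makes the differences easy to illustrate (illustrative only, not Spark code):

```python
from decimal import (Decimal, ROUND_UP, ROUND_DOWN,
                     ROUND_HALF_UP, ROUND_HALF_DOWN, ROUND_HALF_EVEN)

one = Decimal("1")  # target scale: whole numbers

# The two modes Spark already exposes:
assert Decimal("2.5").quantize(one, rounding=ROUND_HALF_UP) == Decimal("3")    # round()
assert Decimal("2.5").quantize(one, rounding=ROUND_HALF_EVEN) == Decimal("2")  # bround()

# The requested modes:
assert Decimal("2.5").quantize(one, rounding=ROUND_HALF_DOWN) == Decimal("2")  # ties toward zero
assert Decimal("2.1").quantize(one, rounding=ROUND_UP) == Decimal("3")         # away from zero
assert Decimal("-2.1").quantize(one, rounding=ROUND_UP) == Decimal("-3")
assert Decimal("2.9").quantize(one, rounding=ROUND_DOWN) == Decimal("2")       # toward zero
```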
[jira] [Assigned] (SPARK-37339) Add `spark-version` label to driver and executor pods
[ https://issues.apache.org/jira/browse/SPARK-37339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37339: Assignee: (was: Apache Spark) > Add `spark-version` label to driver and executor pods > - > > Key: SPARK-37339 > URL: https://issues.apache.org/jira/browse/SPARK-37339 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37339) Add `spark-version` label to driver and executor pods
Dongjoon Hyun created SPARK-37339: - Summary: Add `spark-version` label to driver and executor pods Key: SPARK-37339 URL: https://issues.apache.org/jira/browse/SPARK-37339 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 3.3.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37338) Rename (Spark)DataFrame.to_pandas_on_spark to (Spark)DataFrame.pandas_api
[ https://issues.apache.org/jira/browse/SPARK-37338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37338: Assignee: (was: Apache Spark) > Rename (Spark)DataFrame.to_pandas_on_spark to (Spark)DataFrame.pandas_api > - > > Key: SPARK-37338 > URL: https://issues.apache.org/jira/browse/SPARK-37338 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Priority: Major > > Currently, (Spark)DataFrame.to_pandas_on_spark is too long to memorize and > inconvenient to call. > So we wanted to rename to_pandas_on_spark to pandas_api for API usability -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37338) Rename (Spark)DataFrame.to_pandas_on_spark to (Spark)DataFrame.pandas_api
[ https://issues.apache.org/jira/browse/SPARK-37338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37338: Assignee: Apache Spark > Rename (Spark)DataFrame.to_pandas_on_spark to (Spark)DataFrame.pandas_api > - > > Key: SPARK-37338 > URL: https://issues.apache.org/jira/browse/SPARK-37338 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Assignee: Apache Spark >Priority: Major > > Currently, (Spark)DataFrame.to_pandas_on_spark is too long to memorize and > inconvenient to call. > So we wanted to rename to_pandas_on_spark to pandas_api for API usability -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37338) Rename (Spark)DataFrame.to_pandas_on_spark to (Spark)DataFrame.pandas_api
[ https://issues.apache.org/jira/browse/SPARK-37338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444081#comment-17444081 ] Apache Spark commented on SPARK-37338: -- User 'xinrong-databricks' has created a pull request for this issue: https://github.com/apache/spark/pull/34608 > Rename (Spark)DataFrame.to_pandas_on_spark to (Spark)DataFrame.pandas_api > - > > Key: SPARK-37338 > URL: https://issues.apache.org/jira/browse/SPARK-37338 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Priority: Major > > Currently, (Spark)DataFrame.to_pandas_on_spark is too long to memorize and > inconvenient to call. > So we wanted to rename to_pandas_on_spark to pandas_api for API usability -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37338) Rename (Spark)DataFrame.to_pandas_on_spark to (Spark)DataFrame.pandas_api
[ https://issues.apache.org/jira/browse/SPARK-37338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-37338: - Summary: Rename (Spark)DataFrame.to_pandas_on_spark to (Spark)DataFrame.pandas_api (was: Rename to_pandas_on_spark to pandas_api) > Rename (Spark)DataFrame.to_pandas_on_spark to (Spark)DataFrame.pandas_api > - > > Key: SPARK-37338 > URL: https://issues.apache.org/jira/browse/SPARK-37338 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Priority: Major > > Currently, (Spark)DataFrame.to_pandas_on_spark is too long to memorize and > inconvenient to call. > So we wanted to rename to_pandas_on_spark to pandas_api for API usability -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37338) Rename to_pandas_on_spark to pandas_api
[ https://issues.apache.org/jira/browse/SPARK-37338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-37338: - Description: Currently, (Spark)DataFrame.to_pandas_on_spark is too long to memorize and inconvenient to call. So we wanted to rename to_pandas_on_spark to pandas_api for API usability was:Rename to_pandas_on_spark to pandas_api for API usability > Rename to_pandas_on_spark to pandas_api > --- > > Key: SPARK-37338 > URL: https://issues.apache.org/jira/browse/SPARK-37338 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Priority: Major > > Currently, (Spark)DataFrame.to_pandas_on_spark is too long to memorize and > inconvenient to call. > So we wanted to rename to_pandas_on_spark to pandas_api for API usability -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37338) Rename to_pandas_on_spark to pandas_api
Xinrong Meng created SPARK-37338: Summary: Rename to_pandas_on_spark to pandas_api Key: SPARK-37338 URL: https://issues.apache.org/jira/browse/SPARK-37338 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.3.0 Reporter: Xinrong Meng Rename to_pandas_on_spark to pandas_api for API usability -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37337) Improve the API of Spark DataFrame to pandas-on-Spark DataFrame conversion
Xinrong Meng created SPARK-37337: Summary: Improve the API of Spark DataFrame to pandas-on-Spark DataFrame conversion Key: SPARK-37337 URL: https://issues.apache.org/jira/browse/SPARK-37337 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.3.0 Reporter: Xinrong Meng Undeprecate (Spark)DataFrame.to_koalas; rename (Spark)DataFrame.to_pandas_like to (Spark)DataFrame.pandas_api. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36038) Basic speculation metrics at stage level
[ https://issues.apache.org/jira/browse/SPARK-36038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444046#comment-17444046 ] Apache Spark commented on SPARK-36038: -- User 'thejdeep' has created a pull request for this issue: https://github.com/apache/spark/pull/34607 > Basic speculation metrics at stage level > > > Key: SPARK-36038 > URL: https://issues.apache.org/jira/browse/SPARK-36038 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.2 >Reporter: Venkata krishnan Sowrirajan >Priority: Major > Fix For: 3.3.0 > > > Currently there are no speculation metrics available either at application > level or at stage level. Within our platform, we have added speculation > metrics at stage level as a summary, similar to the other stage-level metrics, > tracking numTotalSpeculated, numCompleted (successful), numFailed, numKilled, > etc. This enables us to effectively understand the speculative execution feature > at an application level and helps in further tuning the speculation configs. > cc [~ron8hu] -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37320) Delete py_container_checks.zip after the test in DepsTestsSuite finishes
[ https://issues.apache.org/jira/browse/SPARK-37320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-37320: - Component/s: Kubernetes (was: k8) > Delete py_container_checks.zip after the test in DepsTestsSuite finishes > > > Key: SPARK-37320 > URL: https://issues.apache.org/jira/browse/SPARK-37320 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Tests >Affects Versions: 3.2.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > Fix For: 3.1.3, 3.2.1, 3.3.0 > > > When the K8s integration tests run, py_container_checks.zip still remains in > resource-managers/kubernetes/integration-tests/tests/. > It is created in the test "Launcher python client dependencies using a zip > file" in DepsTestsSuite. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37336) Migrate _java2py to SparkSession
[ https://issues.apache.org/jira/browse/SPARK-37336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443991#comment-17443991 ] Apache Spark commented on SPARK-37336: -- User 'nchammas' has created a pull request for this issue: https://github.com/apache/spark/pull/34606 > Migrate _java2py to SparkSession > > > Key: SPARK-37336 > URL: https://issues.apache.org/jira/browse/SPARK-37336 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 3.2.0 >Reporter: Nicholas Chammas >Priority: Minor > > {{_java2py()}} uses a deprecated method to create a SparkSession. > > https://github.com/apache/spark/blob/2fe9af8b2b91d0a46782dd6fff57eca8609be105/python/pyspark/ml/common.py#L99 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37336) Migrate _java2py to SparkSession
[ https://issues.apache.org/jira/browse/SPARK-37336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37336: Assignee: (was: Apache Spark) > Migrate _java2py to SparkSession > > > Key: SPARK-37336 > URL: https://issues.apache.org/jira/browse/SPARK-37336 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 3.2.0 >Reporter: Nicholas Chammas >Priority: Minor > > {{_java2py()}} uses a deprecated method to create a SparkSession. > > https://github.com/apache/spark/blob/2fe9af8b2b91d0a46782dd6fff57eca8609be105/python/pyspark/ml/common.py#L99 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37336) Migrate _java2py to SparkSession
[ https://issues.apache.org/jira/browse/SPARK-37336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37336: Assignee: Apache Spark > Migrate _java2py to SparkSession > > > Key: SPARK-37336 > URL: https://issues.apache.org/jira/browse/SPARK-37336 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 3.2.0 >Reporter: Nicholas Chammas >Assignee: Apache Spark >Priority: Minor > > {{_java2py()}} uses a deprecated method to create a SparkSession. > > https://github.com/apache/spark/blob/2fe9af8b2b91d0a46782dd6fff57eca8609be105/python/pyspark/ml/common.py#L99 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37336) Migrate _java2py to SparkSession
[ https://issues.apache.org/jira/browse/SPARK-37336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443990#comment-17443990 ] Apache Spark commented on SPARK-37336: -- User 'nchammas' has created a pull request for this issue: https://github.com/apache/spark/pull/34606 > Migrate _java2py to SparkSession > > > Key: SPARK-37336 > URL: https://issues.apache.org/jira/browse/SPARK-37336 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 3.2.0 >Reporter: Nicholas Chammas >Priority: Minor > > {{_java2py()}} uses a deprecated method to create a SparkSession. > > https://github.com/apache/spark/blob/2fe9af8b2b91d0a46782dd6fff57eca8609be105/python/pyspark/ml/common.py#L99 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37336) Migrate _java2py to SparkSession
[ https://issues.apache.org/jira/browse/SPARK-37336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-37336: - Summary: Migrate _java2py to SparkSession (was: Migrate common ML utils to SparkSession) > Migrate _java2py to SparkSession > > > Key: SPARK-37336 > URL: https://issues.apache.org/jira/browse/SPARK-37336 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 3.2.0 >Reporter: Nicholas Chammas >Priority: Minor > > {{_java2py()}} uses a deprecated method to create a SparkSession. > > https://github.com/apache/spark/blob/2fe9af8b2b91d0a46782dd6fff57eca8609be105/python/pyspark/ml/common.py#L99 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37336) Migrate common ML utils to SparkSession
Nicholas Chammas created SPARK-37336: Summary: Migrate common ML utils to SparkSession Key: SPARK-37336 URL: https://issues.apache.org/jira/browse/SPARK-37336 Project: Spark Issue Type: Improvement Components: ML, PySpark Affects Versions: 3.2.0 Reporter: Nicholas Chammas {{_java2py()}} uses a deprecated method to create a SparkSession. https://github.com/apache/spark/blob/2fe9af8b2b91d0a46782dd6fff57eca8609be105/python/pyspark/ml/common.py#L99 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37335) Clarify output of FPGrowth
[ https://issues.apache.org/jira/browse/SPARK-37335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443973#comment-17443973 ] Apache Spark commented on SPARK-37335: -- User 'nchammas' has created a pull request for this issue: https://github.com/apache/spark/pull/34605 > Clarify output of FPGrowth > -- > > Key: SPARK-37335 > URL: https://issues.apache.org/jira/browse/SPARK-37335 > Project: Spark > Issue Type: Improvement > Components: Documentation, ML >Affects Versions: 3.2.0 >Reporter: Nicholas Chammas >Priority: Minor > > The association rules returned by FPGrowth include more columns than are > documented, like {{lift}}: > [https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html] > We should offer a basic description of these columns. An _itemset_ should > also be briefly defined. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37335) Clarify output of FPGrowth
[ https://issues.apache.org/jira/browse/SPARK-37335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37335: Assignee: Apache Spark > Clarify output of FPGrowth > -- > > Key: SPARK-37335 > URL: https://issues.apache.org/jira/browse/SPARK-37335 > Project: Spark > Issue Type: Improvement > Components: Documentation, ML >Affects Versions: 3.2.0 >Reporter: Nicholas Chammas >Assignee: Apache Spark >Priority: Minor > > The association rules returned by FPGrowth include more columns than are > documented, like {{lift}}: > [https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html] > We should offer a basic description of these columns. An _itemset_ should > also be briefly defined. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37335) Clarify output of FPGrowth
[ https://issues.apache.org/jira/browse/SPARK-37335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443970#comment-17443970 ] Apache Spark commented on SPARK-37335: -- User 'nchammas' has created a pull request for this issue: https://github.com/apache/spark/pull/34605 > Clarify output of FPGrowth > -- > > Key: SPARK-37335 > URL: https://issues.apache.org/jira/browse/SPARK-37335 > Project: Spark > Issue Type: Improvement > Components: Documentation, ML >Affects Versions: 3.2.0 >Reporter: Nicholas Chammas >Priority: Minor > > The association rules returned by FPGrowth include more columns than are > documented, like {{lift}}: > [https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html] > We should offer a basic description of these columns. An _itemset_ should > also be briefly defined. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37335) Clarify output of FPGrowth
[ https://issues.apache.org/jira/browse/SPARK-37335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37335: Assignee: (was: Apache Spark) > Clarify output of FPGrowth > -- > > Key: SPARK-37335 > URL: https://issues.apache.org/jira/browse/SPARK-37335 > Project: Spark > Issue Type: Improvement > Components: Documentation, ML >Affects Versions: 3.2.0 >Reporter: Nicholas Chammas >Priority: Minor > > The association rules returned by FPGrowth include more columns than are > documented, like {{lift}}: > [https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html] > We should offer a basic description of these columns. An _itemset_ should > also be briefly defined. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37335) Clarify output of FPGrowth
[ https://issues.apache.org/jira/browse/SPARK-37335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-37335: - Description: The association rules returned by FPGrowth include more columns than are documented, like {{lift}}: [https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html] We should offer a basic description of these columns. An _itemset_ should also be briefly defined. was: The association rules returned by FPGrow include more columns than are documented: [https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html] We should offer a basic description of these columns. > Clarify output of FPGrowth > -- > > Key: SPARK-37335 > URL: https://issues.apache.org/jira/browse/SPARK-37335 > Project: Spark > Issue Type: Improvement > Components: Documentation, ML >Affects Versions: 3.2.0 >Reporter: Nicholas Chammas >Priority: Minor > > The association rules returned by FPGrowth include more columns than are > documented, like {{lift}}: > [https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html] > We should offer a basic description of these columns. An _itemset_ should > also be briefly defined. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37335) Clarify output of FPGrowth
Nicholas Chammas created SPARK-37335: -- Summary: Clarify output of FPGrowth Key: SPARK-37335 URL: https://issues.apache.org/jira/browse/SPARK-37335 Project: Spark Issue Type: Improvement Components: Documentation, ML Affects Versions: 3.2.0 Reporter: Nicholas Chammas The association rules returned by FPGrowth include more columns than are documented: [https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html] We should offer a basic description of these columns. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
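For the documentation work requested above, the extra columns have standard definitions: support is the fraction of transactions containing an itemset, confidence is support(antecedent ∪ consequent) / support(antecedent), and lift is confidence / support(consequent). A small pure-Python sketch with made-up transactions (not actual FPGrowth output) shows the arithmetic:

```python
# Hypothetical transaction data to illustrate the association-rule
# columns (support, confidence, lift) the docs should describe.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
n = len(transactions)

def support(itemset):
    # Fraction of transactions that contain every item in the itemset.
    return sum(itemset <= t for t in transactions) / n

antecedent, consequent = {"bread"}, {"milk"}
confidence = support(antecedent | consequent) / support(antecedent)
lift = confidence / support(consequent)
# confidence = 2/3, lift = 8/9 for this toy data; lift > 1 would mean
# the antecedent makes the consequent more likely than its base rate.
print(confidence, lift)
```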
[jira] [Assigned] (SPARK-37329) File system delegation tokens are leaked
[ https://issues.apache.org/jira/browse/SPARK-37329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37329: Assignee: Apache Spark > File system delegation tokens are leaked > > > Key: SPARK-37329 > URL: https://issues.apache.org/jira/browse/SPARK-37329 > Project: Spark > Issue Type: Bug > Components: Security, YARN >Affects Versions: 2.4.0 >Reporter: Wei-Chiu Chuang >Assignee: Apache Spark >Priority: Major > > On a very busy Hadoop cluster (with HDFS at-rest encryption), we found KMS > accumulated millions of delegation tokens that are not cancelled even after > jobs are finished, and KMS goes out of memory within a day because of the > delegation token leak. > We were able to reproduce the bug in a smaller test cluster, and realized that > when a Spark job starts, it acquires two delegation tokens, and only one is > cancelled properly after the job finishes. The other one is left over and > lingers around for up to 7 days (the default Hadoop delegation token lifetime). > YARN handles the lifecycle of a delegation token properly if its renewer is > 'yarn'. However, Spark intentionally (a hack?) acquires a second delegation > token with the job issuer as the renewer, simply to get the token renewal > interval. The token is then ignored but not cancelled. > Proposal: cancel the delegation token immediately after the token renewal > interval is obtained. > Environment: CDH6.3.2 (based on Apache Spark 2.4.0), but the bug has probably > existed since day 1. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37329) File system delegation tokens are leaked
[ https://issues.apache.org/jira/browse/SPARK-37329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37329: Assignee: (was: Apache Spark) > File system delegation tokens are leaked > > > Key: SPARK-37329 > URL: https://issues.apache.org/jira/browse/SPARK-37329 > Project: Spark > Issue Type: Bug > Components: Security, YARN >Affects Versions: 2.4.0 >Reporter: Wei-Chiu Chuang >Priority: Major > > On a very busy Hadoop cluster (with HDFS at-rest encryption), we found KMS > accumulated millions of delegation tokens that are not cancelled even after > jobs are finished, and KMS goes out of memory within a day because of the > delegation token leak. > We were able to reproduce the bug in a smaller test cluster, and realized that > when a Spark job starts, it acquires two delegation tokens, and only one is > cancelled properly after the job finishes. The other one is left over and > lingers around for up to 7 days (the default Hadoop delegation token lifetime). > YARN handles the lifecycle of a delegation token properly if its renewer is > 'yarn'. However, Spark intentionally (a hack?) acquires a second delegation > token with the job issuer as the renewer, simply to get the token renewal > interval. The token is then ignored but not cancelled. > Proposal: cancel the delegation token immediately after the token renewal > interval is obtained. > Environment: CDH6.3.2 (based on Apache Spark 2.4.0), but the bug has probably > existed since day 1. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37329) File system delegation tokens are leaked
[ https://issues.apache.org/jira/browse/SPARK-37329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443890#comment-17443890 ] Apache Spark commented on SPARK-37329: -- User 'jojochuang' has created a pull request for this issue: https://github.com/apache/spark/pull/34604 > File system delegation tokens are leaked > > > Key: SPARK-37329 > URL: https://issues.apache.org/jira/browse/SPARK-37329 > Project: Spark > Issue Type: Bug > Components: Security, YARN >Affects Versions: 2.4.0 >Reporter: Wei-Chiu Chuang >Priority: Major > > On a very busy Hadoop cluster (with HDFS at-rest encryption), we found KMS > accumulated millions of delegation tokens that are not cancelled even after > jobs are finished, and KMS goes out of memory within a day because of the > delegation token leak. > We were able to reproduce the bug in a smaller test cluster, and realized that > when a Spark job starts, it acquires two delegation tokens, and only one is > cancelled properly after the job finishes. The other one is left over and > lingers around for up to 7 days (the default Hadoop delegation token lifetime). > YARN handles the lifecycle of a delegation token properly if its renewer is > 'yarn'. However, Spark intentionally (a hack?) acquires a second delegation > token with the job issuer as the renewer, simply to get the token renewal > interval. The token is then ignored but not cancelled. > Proposal: cancel the delegation token immediately after the token renewal > interval is obtained. > Environment: CDH6.3.2 (based on Apache Spark 2.4.0), but the bug has probably > existed since day 1. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-37329) File system delegation tokens are leaked
[ https://issues.apache.org/jira/browse/SPARK-37329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443618#comment-17443618 ] Wei-Chiu Chuang edited comment on SPARK-37329 at 11/15/21, 2:57 PM: PR: https://github.com/apache/spark/pull/34604 was (Author: jojochuang): I'll provide a PR. > File system delegation tokens are leaked > > > Key: SPARK-37329 > URL: https://issues.apache.org/jira/browse/SPARK-37329 > Project: Spark > Issue Type: Bug > Components: Security, YARN >Affects Versions: 2.4.0 >Reporter: Wei-Chiu Chuang >Priority: Major > > On a very busy Hadoop cluster (with HDFS at-rest encryption), we found KMS > accumulated millions of delegation tokens that are not cancelled even after > jobs are finished, and KMS goes out of memory within a day because of the > delegation token leak. > We were able to reproduce the bug in a smaller test cluster, and realized that > when a Spark job starts, it acquires two delegation tokens, and only one is > cancelled properly after the job finishes. The other one is left over and > lingers around for up to 7 days (the default Hadoop delegation token lifetime). > YARN handles the lifecycle of a delegation token properly if its renewer is > 'yarn'. However, Spark intentionally (a hack?) acquires a second delegation > token with the job issuer as the renewer, simply to get the token renewal > interval. The token is then ignored but not cancelled. > Proposal: cancel the delegation token immediately after the token renewal > interval is obtained. > Environment: CDH6.3.2 (based on Apache Spark 2.4.0), but the bug has probably > existed since day 1. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
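The fix proposed in SPARK-37329 can be modeled in a few lines. This is a toy simulation, not the Hadoop or Spark token API (every class and method name here is hypothetical): the throwaway token is obtained only so its renewal interval can be read, and is cancelled immediately instead of being left to expire after the 7-day default lifetime:

```python
# Toy model of delegation-token bookkeeping. A real KMS tracks live
# tokens; leaking one per job accumulates until the service dies.
class TokenService:
    def __init__(self):
        self.live_tokens = set()

    def obtain_token(self, renewer):
        token = (renewer, 24 * 3600)  # (renewer, renewal interval in seconds)
        self.live_tokens.add(token)
        return token

    def cancel(self, token):
        self.live_tokens.discard(token)

def renewal_interval(service, job_user):
    # Acquire the extra token with the job user as renewer, purely to
    # learn the renewal interval...
    token = service.obtain_token(renewer=job_user)
    try:
        return token[1]
    finally:
        # ...and cancel it right away so it is never leaked.
        service.cancel(token)

svc = TokenService()
interval = renewal_interval(svc, "alice")
print(interval, len(svc.live_tokens))  # 86400 0
```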
[jira] [Resolved] (SPARK-37266) View text can only be SELECT queries
[ https://issues.apache.org/jira/browse/SPARK-37266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-37266. - Fix Version/s: 3.3.0 Assignee: jiaan.geng Resolution: Fixed > View text can only be SELECT queries > > > Key: SPARK-37266 > URL: https://issues.apache.org/jira/browse/SPARK-37266 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > Fix For: 3.3.0 > > > The current implementation of persistent views is to create a Hive table with the > view text. > The view text is just a query string, so hackers may tamper with it > through various means. > For example: > {code:java} > select * from tab1 > {code} > could be tampered with to become > > {code:java} > drop table tab1 > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-37181) pyspark.pandas.read_csv() should support latin-1 encoding
[ https://issues.apache.org/jira/browse/SPARK-37181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443764#comment-17443764 ] pralabhkumar edited comment on SPARK-37181 at 11/15/21, 2:10 PM: - However, from the user's point of view, if the user mentions latin-1 in pyspark.pandas then, instead of throwing "pyspark.sql.utils.IllegalArgumentException: latin-1", Spark can internally convert it to ISO-8859-1. cc [~hyukjin.kwon], [~yikunkero] Let me know if my understanding is correct. If yes, then I can work on this. was (Author: pralabhkumar): However from users point of view , if user mention latin-1 in pyspark.pandas then instead of throwing "pyspark.sql.utils.IllegalArgumentException: latin-1" , spark can internally convert it to ISO-8859-1 cc [~hyukjin.kwon] , [~yikunkero] Let me know , if I can work on this h1. > pyspark.pandas.read_csv() should support latin-1 encoding > - > > Key: SPARK-37181 > URL: https://issues.apache.org/jira/browse/SPARK-37181 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Chuck Connell >Priority: Major > > {{In regular pandas, you can say read_csv(encoding='latin-1'). This encoding > is not recognized in pyspark.pandas. You have to use Windows-1252 instead, > which is almost the same but not identical.}} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-37181) pyspark.pandas.read_csv() should support latin-1 encoding
[ https://issues.apache.org/jira/browse/SPARK-37181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443764#comment-17443764 ] pralabhkumar edited comment on SPARK-37181 at 11/15/21, 2:10 PM: - However, from the user's point of view, if the user mentions latin-1 in pyspark.pandas then, instead of throwing "pyspark.sql.utils.IllegalArgumentException: latin-1", Spark can internally convert it to ISO-8859-1. cc [~hyukjin.kwon], [~yikunkero] Let me know if I can work on this. was (Author: pralabhkumar): However from users point of view , if user mention latin-1 in pyspark.pandas then instead of throwing "pyspark.sql.utils.IllegalArgumentException: latin-1" , spark can internally convert it to ISO-8859-1 cc [~hyukjin.kwon] > pyspark.pandas.read_csv() should support latin-1 encoding > - > > Key: SPARK-37181 > URL: https://issues.apache.org/jira/browse/SPARK-37181 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Chuck Connell >Priority: Major > > {{In regular pandas, you can say read_csv(encoding='latin-1'). This encoding > is not recognized in pyspark.pandas. You have to use Windows-1252 instead, > which is almost the same but not identical.}} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
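Supporting the request above: in CPython's codec registry, latin-1 is already an alias of ISO-8859-1, which is presumably why the suggested internal conversion is safe, while Windows-1252 differs in the 0x80-0x9F range. A quick standard-library check:

```python
import codecs

# "latin-1" and "iso-8859-1" resolve to the same canonical codec in
# Python, so mapping one name to the other loses nothing.
assert codecs.lookup("latin-1").name == codecs.lookup("iso-8859-1").name

# Windows-1252 is close but not identical: bytes 0x80-0x9F decode to
# printable characters (e.g. the euro sign) instead of C1 controls.
print(bytes([0x80]).decode("windows-1252"))        # €
print(bytes([0x80]).decode("latin-1") == "\x80")   # True
```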
[jira] [Created] (SPARK-37334) pandas `convert_dtypes` method support
Ali Amin-Nejad created SPARK-37334: -- Summary: pandas `convert_dtypes` method support Key: SPARK-37334 URL: https://issues.apache.org/jira/browse/SPARK-37334 Project: Spark Issue Type: New Feature Components: PySpark Affects Versions: 3.2.0 Reporter: Ali Amin-Nejad Support for the {{convert_dtypes}} method as part of the new pandas API in pyspark? [https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.convert_dtypes.html] -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37328) SPARK-33832 brings the bug that OptimizeSkewedJoin may not work since it was applied on whole plan instead of new stage plan
[ https://issues.apache.org/jira/browse/SPARK-37328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37328: Assignee: (was: Apache Spark) > SPARK-33832 brings the bug that OptimizeSkewedJoin may not work since it was > applied on whole plan innstead of new stage plan > - > > Key: SPARK-37328 > URL: https://issues.apache.org/jira/browse/SPARK-37328 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Lietong Liu >Priority: Major > > Since OptimizeSkewedJoin was moved from queryStageOptimizerRules to > queryStagePreparationRules, the position OptimizeSkewedJoin was applied has > been moved from newQueryStage() to reOptimize(). The plan OptimizeSkewedJoin > applied on changed from plan of new stage which is about to submit to whole > spark plan. > In the cases where skewedJoin is not last stage, OptimizeSkewedJoin may not > work because the number of collected shuffleStages is more than 2. > The following test will prove it: > > > {code:java} > test("OptimizeSkewJoin may not work") { > withSQLConf( > SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true", > SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1", > SQLConf.SKEW_JOIN_SKEWED_PARTITION_THRESHOLD.key -> "100", > SQLConf.ADVISORY_PARTITION_SIZE_IN_BYTES.key -> "100", > SQLConf.COALESCE_PARTITIONS_MIN_PARTITION_NUM.key -> "1", > SQLConf.SHUFFLE_PARTITIONS.key -> "10") { > withTempView("skewData1", "skewData2", "skewData3") { > spark > .range(0, 1000, 1, 10) > .selectExpr("id % 3 as key1", "id % 3 as value1") > .createOrReplaceTempView("skewData1") > spark > .range(0, 1000, 1, 10) > .selectExpr("id % 1 as key2", "id as value2") > .createOrReplaceTempView("skewData2") > spark > .range(0, 1000, 1, 10) > .selectExpr("id % 1 as key3", "id as value3") > .createOrReplaceTempView("skewData3") > // Query has two skewedJoin in two continuous stages. 
> val (_, adaptive1) = > runAdaptiveAndVerifyResult( > """ > |SELECT key1 FROM skewData1 s1 > |JOIN skewData2 s2 > |ON s1.key1 = s2.key2 > |JOIN skewData3 > |ON s1.value1 = value3 > |""".stripMargin) > val shuffles1 = collect(adaptive1) { > case s: ShuffleExchangeExec => s > } > assert(shuffles1.size == 4) > val smj1 = findTopLevelSortMergeJoin(adaptive1) > assert(smj1.size == 2 && smj1.forall(_.isSkewJoin)) > } > } > } {code} > I'll open a PR shortly to fix this issue > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37328) SPARK-33832 brings the bug that OptimizeSkewedJoin may not work since it was applied on whole plan instead of new stage plan
[ https://issues.apache.org/jira/browse/SPARK-37328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37328: Assignee: Apache Spark > SPARK-33832 brings the bug that OptimizeSkewedJoin may not work since it was > applied on whole plan innstead of new stage plan > - > > Key: SPARK-37328 > URL: https://issues.apache.org/jira/browse/SPARK-37328 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Lietong Liu >Assignee: Apache Spark >Priority: Major > > Since OptimizeSkewedJoin was moved from queryStageOptimizerRules to > queryStagePreparationRules, the position OptimizeSkewedJoin was applied has > been moved from newQueryStage() to reOptimize(). The plan OptimizeSkewedJoin > applied on changed from plan of new stage which is about to submit to whole > spark plan. > In the cases where skewedJoin is not last stage, OptimizeSkewedJoin may not > work because the number of collected shuffleStages is more than 2. > The following test will prove it: > > > {code:java} > test("OptimizeSkewJoin may not work") { > withSQLConf( > SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true", > SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1", > SQLConf.SKEW_JOIN_SKEWED_PARTITION_THRESHOLD.key -> "100", > SQLConf.ADVISORY_PARTITION_SIZE_IN_BYTES.key -> "100", > SQLConf.COALESCE_PARTITIONS_MIN_PARTITION_NUM.key -> "1", > SQLConf.SHUFFLE_PARTITIONS.key -> "10") { > withTempView("skewData1", "skewData2", "skewData3") { > spark > .range(0, 1000, 1, 10) > .selectExpr("id % 3 as key1", "id % 3 as value1") > .createOrReplaceTempView("skewData1") > spark > .range(0, 1000, 1, 10) > .selectExpr("id % 1 as key2", "id as value2") > .createOrReplaceTempView("skewData2") > spark > .range(0, 1000, 1, 10) > .selectExpr("id % 1 as key3", "id as value3") > .createOrReplaceTempView("skewData3") > // Query has two skewedJoin in two continuous stages. 
> val (_, adaptive1) = > runAdaptiveAndVerifyResult( > """ > |SELECT key1 FROM skewData1 s1 > |JOIN skewData2 s2 > |ON s1.key1 = s2.key2 > |JOIN skewData3 > |ON s1.value1 = value3 > |""".stripMargin) > val shuffles1 = collect(adaptive1) { > case s: ShuffleExchangeExec => s > } > assert(shuffles1.size == 4) > val smj1 = findTopLevelSortMergeJoin(adaptive1) > assert(smj1.size == 2 && smj1.forall(_.isSkewJoin)) > } > } > } {code} > I'll open a PR shortly to fix this issue > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37328) SPARK-33832 brings the bug that OptimizeSkewedJoin may not work since it was applied on whole plan instead of new stage plan
[ https://issues.apache.org/jira/browse/SPARK-37328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443801#comment-17443801 ] Apache Spark commented on SPARK-37328: -- User 'Liulietong' has created a pull request for this issue: https://github.com/apache/spark/pull/34602 > SPARK-33832 brings the bug that OptimizeSkewedJoin may not work since it was > applied on whole plan innstead of new stage plan > - > > Key: SPARK-37328 > URL: https://issues.apache.org/jira/browse/SPARK-37328 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Lietong Liu >Priority: Major > > Since OptimizeSkewedJoin was moved from queryStageOptimizerRules to > queryStagePreparationRules, the position OptimizeSkewedJoin was applied has > been moved from newQueryStage() to reOptimize(). The plan OptimizeSkewedJoin > applied on changed from plan of new stage which is about to submit to whole > spark plan. > In the cases where skewedJoin is not last stage, OptimizeSkewedJoin may not > work because the number of collected shuffleStages is more than 2. 
> The following test will prove it: > > > {code:java} > test("OptimizeSkewJoin may not work") { > withSQLConf( > SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true", > SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1", > SQLConf.SKEW_JOIN_SKEWED_PARTITION_THRESHOLD.key -> "100", > SQLConf.ADVISORY_PARTITION_SIZE_IN_BYTES.key -> "100", > SQLConf.COALESCE_PARTITIONS_MIN_PARTITION_NUM.key -> "1", > SQLConf.SHUFFLE_PARTITIONS.key -> "10") { > withTempView("skewData1", "skewData2", "skewData3") { > spark > .range(0, 1000, 1, 10) > .selectExpr("id % 3 as key1", "id % 3 as value1") > .createOrReplaceTempView("skewData1") > spark > .range(0, 1000, 1, 10) > .selectExpr("id % 1 as key2", "id as value2") > .createOrReplaceTempView("skewData2") > spark > .range(0, 1000, 1, 10) > .selectExpr("id % 1 as key3", "id as value3") > .createOrReplaceTempView("skewData3") > // Query has two skewedJoin in two continuous stages. > val (_, adaptive1) = > runAdaptiveAndVerifyResult( > """ > |SELECT key1 FROM skewData1 s1 > |JOIN skewData2 s2 > |ON s1.key1 = s2.key2 > |JOIN skewData3 > |ON s1.value1 = value3 > |""".stripMargin) > val shuffles1 = collect(adaptive1) { > case s: ShuffleExchangeExec => s > } > assert(shuffles1.size == 4) > val smj1 = findTopLevelSortMergeJoin(adaptive1) > assert(smj1.size == 2 && smj1.forall(_.isSkewJoin)) > } > } > } {code} > I'll open a PR shortly to fix this issue > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37333) Specify the required distribution at V1Write
XiDuo You created SPARK-37333: - Summary: Specify the required distribution at V1Write Key: SPARK-37333 URL: https://issues.apache.org/jira/browse/SPARK-37333 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.0 Reporter: XiDuo You An improvement of SPARK-37287. We can specify the required distribution at V1Write, e.g., when the write uses dynamic partitioning, we may expect an output partitioning based on the dynamic partition columns. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37328) SPARK-33832 brings the bug that OptimizeSkewedJoin may not work since it was applied on whole plan instead of new stage plan
[ https://issues.apache.org/jira/browse/SPARK-37328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lietong Liu updated SPARK-37328: Summary: SPARK-33832 brings the bug that OptimizeSkewedJoin may not work since it was applied on whole plan instead of new stage plan (was: SPARK-33832 brings the bug that OptimizeSkewedJoin may not work since it was applied onn whole plan innstead of new stage plan) > SPARK-33832 brings the bug that OptimizeSkewedJoin may not work since it was > applied on whole plan instead of new stage plan > - > > Key: SPARK-37328 > URL: https://issues.apache.org/jira/browse/SPARK-37328 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Lietong Liu >Priority: Major > > Since OptimizeSkewedJoin was moved from queryStageOptimizerRules to > queryStagePreparationRules, the point at which OptimizeSkewedJoin is applied > has moved from newQueryStage() to reOptimize(). The plan that > OptimizeSkewedJoin is applied to has changed from the plan of the new stage > about to be submitted to the whole Spark plan. > In cases where the skewed join is not in the last stage, OptimizeSkewedJoin > may not work because the number of collected shuffle stages is more than two. 
> The following test will prove it:
>
> {code:java}
> test("OptimizeSkewJoin may not work") {
>   withSQLConf(
>     SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true",
>     SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1",
>     SQLConf.SKEW_JOIN_SKEWED_PARTITION_THRESHOLD.key -> "100",
>     SQLConf.ADVISORY_PARTITION_SIZE_IN_BYTES.key -> "100",
>     SQLConf.COALESCE_PARTITIONS_MIN_PARTITION_NUM.key -> "1",
>     SQLConf.SHUFFLE_PARTITIONS.key -> "10") {
>     withTempView("skewData1", "skewData2", "skewData3") {
>       spark
>         .range(0, 1000, 1, 10)
>         .selectExpr("id % 3 as key1", "id % 3 as value1")
>         .createOrReplaceTempView("skewData1")
>       spark
>         .range(0, 1000, 1, 10)
>         .selectExpr("id % 1 as key2", "id as value2")
>         .createOrReplaceTempView("skewData2")
>       spark
>         .range(0, 1000, 1, 10)
>         .selectExpr("id % 1 as key3", "id as value3")
>         .createOrReplaceTempView("skewData3")
>       // The query has two skewed joins in two consecutive stages.
>       val (_, adaptive1) =
>         runAdaptiveAndVerifyResult(
>           """
>             |SELECT key1 FROM skewData1 s1
>             |JOIN skewData2 s2
>             |ON s1.key1 = s2.key2
>             |JOIN skewData3
>             |ON s1.value1 = value3
>             |""".stripMargin)
>       val shuffles1 = collect(adaptive1) {
>         case s: ShuffleExchangeExec => s
>       }
>       assert(shuffles1.size == 4)
>       val smj1 = findTopLevelSortMergeJoin(adaptive1)
>       assert(smj1.size == 2 && smj1.forall(_.isSkewJoin))
>     }
>   }
> }
> {code}
> I'll open a PR shortly to fix this issue.
-- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37316) Add code-gen for existence sort merge join
[ https://issues.apache.org/jira/browse/SPARK-37316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443794#comment-17443794 ] Apache Spark commented on SPARK-37316: -- User 'c21' has created a pull request for this issue: https://github.com/apache/spark/pull/34601 > Add code-gen for existence sort merge join > -- > > Key: SPARK-37316 > URL: https://issues.apache.org/jira/browse/SPARK-37316 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Cheng Su >Priority: Minor > > This Jira is to track the progress to add code-gen support for existence sort > merge join. See motivation in SPARK-34705. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37316) Add code-gen for existence sort merge join
[ https://issues.apache.org/jira/browse/SPARK-37316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37316: Assignee: (was: Apache Spark) > Add code-gen for existence sort merge join > -- > > Key: SPARK-37316 > URL: https://issues.apache.org/jira/browse/SPARK-37316 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Cheng Su >Priority: Minor > > This Jira is to track the progress to add code-gen support for existence sort > merge join. See motivation in SPARK-34705. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37316) Add code-gen for existence sort merge join
[ https://issues.apache.org/jira/browse/SPARK-37316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37316: Assignee: Apache Spark > Add code-gen for existence sort merge join > -- > > Key: SPARK-37316 > URL: https://issues.apache.org/jira/browse/SPARK-37316 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Cheng Su >Assignee: Apache Spark >Priority: Minor > > This Jira is to track the progress to add code-gen support for existence sort > merge join. See motivation in SPARK-34705. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37316) Add code-gen for existence sort merge join
[ https://issues.apache.org/jira/browse/SPARK-37316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443793#comment-17443793 ] Apache Spark commented on SPARK-37316: -- User 'c21' has created a pull request for this issue: https://github.com/apache/spark/pull/34601 > Add code-gen for existence sort merge join > -- > > Key: SPARK-37316 > URL: https://issues.apache.org/jira/browse/SPARK-37316 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Cheng Su >Priority: Minor > > This Jira is to track the progress to add code-gen support for existence sort > merge join. See motivation in SPARK-34705. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-35352) Add code-gen for full outer sort merge join
[ https://issues.apache.org/jira/browse/SPARK-35352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-35352. - Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34581 [https://github.com/apache/spark/pull/34581] > Add code-gen for full outer sort merge join > --- > > Key: SPARK-35352 > URL: https://issues.apache.org/jira/browse/SPARK-35352 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0, 3.3.0 >Reporter: Cheng Su >Assignee: Cheng Su >Priority: Minor > Fix For: 3.3.0 > > > This Jira is to track the progress to add code-gen support for full outer > sort merge join. See motivation in SPARK-34705. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35352) Add code-gen for full outer sort merge join
[ https://issues.apache.org/jira/browse/SPARK-35352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-35352: --- Assignee: Cheng Su > Add code-gen for full outer sort merge join > --- > > Key: SPARK-35352 > URL: https://issues.apache.org/jira/browse/SPARK-35352 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0, 3.3.0 >Reporter: Cheng Su >Assignee: Cheng Su >Priority: Minor > > This Jira is to track the progress to add code-gen support for full outer > sort merge join. See motivation in SPARK-34705. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37332) Check adding of ANSI interval columns to v1/v2 tables
[ https://issues.apache.org/jira/browse/SPARK-37332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37332: Assignee: (was: Apache Spark) > Check adding of ANSI interval columns to v1/v2 tables > - > > Key: SPARK-37332 > URL: https://issues.apache.org/jira/browse/SPARK-37332 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Priority: Major > > Write tests that check adding ANSI interval column to a table -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37332) Check adding of ANSI interval columns to v1/v2 tables
[ https://issues.apache.org/jira/browse/SPARK-37332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443768#comment-17443768 ] Apache Spark commented on SPARK-37332: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/34600 > Check adding of ANSI interval columns to v1/v2 tables > - > > Key: SPARK-37332 > URL: https://issues.apache.org/jira/browse/SPARK-37332 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Priority: Major > > Write tests that check adding ANSI interval column to a table -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37332) Check adding of ANSI interval columns to v1/v2 tables
[ https://issues.apache.org/jira/browse/SPARK-37332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37332: Assignee: Apache Spark > Check adding of ANSI interval columns to v1/v2 tables > - > > Key: SPARK-37332 > URL: https://issues.apache.org/jira/browse/SPARK-37332 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > > Write tests that check adding ANSI interval column to a table -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37181) pyspark.pandas.read_csv() should support latin-1 encoding
[ https://issues.apache.org/jira/browse/SPARK-37181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443764#comment-17443764 ] pralabhkumar commented on SPARK-37181: -- However, from the user's point of view, if the user specifies latin-1 in pyspark.pandas, then instead of throwing "pyspark.sql.utils.IllegalArgumentException: latin-1", Spark could internally convert it to ISO-8859-1. cc [~hyukjin.kwon] > pyspark.pandas.read_csv() should support latin-1 encoding > - > > Key: SPARK-37181 > URL: https://issues.apache.org/jira/browse/SPARK-37181 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Chuck Connell >Priority: Major > > {{In regular pandas, you can say read_csv(encoding='latin-1'). This encoding > is not recognized in pyspark.pandas. You have to use Windows-1252 instead, > which is almost the same but not identical.}} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
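The internal conversion proposed in the comment above can be sketched with Python's standard codec registry. This is only an illustration, not Spark code: the helper name `normalize_encoding` is hypothetical, and Spark's actual fix (if any) may work differently.

```python
import codecs

def normalize_encoding(name: str) -> str:
    """Hypothetical helper sketching the proposed behavior: map any
    Python codec alias (e.g. 'latin-1') to its canonical name before
    handing it to the CSV reader, instead of raising an error."""
    try:
        # codecs.lookup resolves aliases: "latin-1" -> "iso8859-1"
        return codecs.lookup(name).name
    except LookupError:
        # Unknown names pass through unchanged for the reader to reject.
        return name
```

For example, `normalize_encoding("latin-1")` resolves to the canonical name `"iso8859-1"`, while an unrecognized string is returned as-is.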
[jira] [Comment Edited] (SPARK-37181) pyspark.pandas.read_csv() should support latin-1 encoding
[ https://issues.apache.org/jira/browse/SPARK-37181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443720#comment-17443720 ] pralabhkumar edited comment on SPARK-37181 at 11/15/21, 10:32 AM: -- from pyspark import pandas as ps The latin-1 encoding is the same as ISO-8859-1, so you can specify that instead: ps.read_csv("<>", encoding='ISO-8859-1') [~chconnell] was (Author: pralabhkumar): from pyspark import pandas as ps latin-1 encoding is same as ISO-8859-1. You can mentioned the same . ps.read_csv("/Users/pralkuma/Desktop/rk_scaas/spark/a.txt", encoding ='ISO-8859-1') > pyspark.pandas.read_csv() should support latin-1 encoding > - > > Key: SPARK-37181 > URL: https://issues.apache.org/jira/browse/SPARK-37181 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Chuck Connell >Priority: Major > > {{In regular pandas, you can say read_csv(encoding='latin-1'). This encoding > is not recognized in pyspark.pandas. You have to use Windows-1252 instead, > which is almost the same but not identical.}} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37181) pyspark.pandas.read_csv() should support latin-1 encoding
[ https://issues.apache.org/jira/browse/SPARK-37181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443720#comment-17443720 ] pralabhkumar commented on SPARK-37181: -- from pyspark import pandas as ps The latin-1 encoding is the same as ISO-8859-1, so you can specify that instead: ps.read_csv("/Users/pralkuma/Desktop/rk_scaas/spark/a.txt", encoding='ISO-8859-1') > pyspark.pandas.read_csv() should support latin-1 encoding > - > > Key: SPARK-37181 > URL: https://issues.apache.org/jira/browse/SPARK-37181 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Chuck Connell >Priority: Major > > {{In regular pandas, you can say read_csv(encoding='latin-1'). This encoding > is not recognized in pyspark.pandas. You have to use Windows-1252 instead, > which is almost the same but not identical.}} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
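The equivalence claimed in the comment above can be verified with plain Python (no Spark needed): in the standard library's codec registry, latin-1 and ISO-8859-1 are two aliases for the same single-byte codec, whereas Windows-1252 is genuinely different.

```python
import codecs

# Both names resolve to the same canonical codec in Python's registry.
assert codecs.lookup("latin-1").name == codecs.lookup("iso-8859-1").name

# The two names decode every possible byte value identically.
data = bytes(range(256))
assert data.decode("latin-1") == data.decode("iso-8859-1")

# Windows-1252 is NOT identical: it remaps part of the 0x80-0x9F range
# (e.g. byte 0x80 is the euro sign in Windows-1252, a control character
# in latin-1), which is why it is "almost the same but not identical".
assert b"\x80".decode("windows-1252") != b"\x80".decode("latin-1")
```

So, assuming the workaround in the comment, reading latin-1 data with `encoding='ISO-8859-1'` is lossless, while falling back to Windows-1252 can silently change characters in the 0x80-0x9F range.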
[jira] [Commented] (SPARK-37332) Check adding of ANSI interval columns to v1/v2 tables
[ https://issues.apache.org/jira/browse/SPARK-37332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443715#comment-17443715 ] Max Gekk commented on SPARK-37332: -- I am working on this. > Check adding of ANSI interval columns to v1/v2 tables > - > > Key: SPARK-37332 > URL: https://issues.apache.org/jira/browse/SPARK-37332 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Priority: Major > > Write tests that check adding ANSI interval column to a table -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37332) Check adding of ANSI interval columns to v1/v2 tables
[ https://issues.apache.org/jira/browse/SPARK-37332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk updated SPARK-37332: - Summary: Check adding of ANSI interval columns to v1/v2 tables (was: Check adding of ANSI interval columns) > Check adding of ANSI interval columns to v1/v2 tables > - > > Key: SPARK-37332 > URL: https://issues.apache.org/jira/browse/SPARK-37332 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Priority: Major > > Write tests that check adding ANSI interval column to a table -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37332) Check adding of ANSI interval columns
Max Gekk created SPARK-37332: Summary: Check adding of ANSI interval columns Key: SPARK-37332 URL: https://issues.apache.org/jira/browse/SPARK-37332 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.3.0 Reporter: Max Gekk Write tests that check adding ANSI interval column to a table -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37283) Don't try to store a V1 table which contains ANSI intervals in Hive compatible format
[ https://issues.apache.org/jira/browse/SPARK-37283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-37283. - Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34551 [https://github.com/apache/spark/pull/34551] > Don't try to store a V1 table which contains ANSI intervals in Hive > compatible format > - > > Key: SPARK-37283 > URL: https://issues.apache.org/jira/browse/SPARK-37283 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > Fix For: 3.3.0 > > > If a table being created contains a column of ANSI interval types and the > underlying file format has a corresponding Hive SerDe (e.g. Parquet), > `HiveExternalCatalog` tries to store the table in a Hive compatible format. > But, since ANSI interval types in Spark and interval types in Hive are not > compatible (Hive only supports interval_year_month and interval_day_time), > the following warning with a stack trace will be logged.
> {code}
> spark-sql> CREATE TABLE tbl1(a INTERVAL YEAR TO MONTH) USING Parquet;
> 21/11/11 14:39:29 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
> 21/11/11 14:39:29 WARN HiveExternalCatalog: Could not persist `default`.`tbl1` in a Hive compatible way. Persisting it into Hive metastore in Spark SQL specific format.
> org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.IllegalArgumentException: Error: type expected at the position 0 of 'interval year to month' but 'interval year to month' is found. 
> at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:869)
> at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:874)
> at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$createTable$1(HiveClientImpl.scala:553)
> at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
> at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:303)
> at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:234)
> at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:233)
> at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:283)
> at org.apache.spark.sql.hive.client.HiveClientImpl.createTable(HiveClientImpl.scala:551)
> at org.apache.spark.sql.hive.HiveExternalCatalog.saveTableIntoHive(HiveExternalCatalog.scala:499)
> at org.apache.spark.sql.hive.HiveExternalCatalog.createDataSourceTable(HiveExternalCatalog.scala:397)
> at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$createTable$1(HiveExternalCatalog.scala:274)
> at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
> at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:102)
> at org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:245)
> at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.createTable(ExternalCatalogWithListener.scala:94)
> at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:376)
> at org.apache.spark.sql.execution.command.CreateDataSourceTableCommand.run(createDataSourceTables.scala:120)
> at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
> at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
> at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
> at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:97)
> at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
> at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
> at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
> at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
> at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
> at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:97)
> at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCom