[jira] [Commented] (SPARK-37329) File system delegation tokens are leaked

2021-11-15 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443618#comment-17443618
 ] 

Wei-Chiu Chuang commented on SPARK-37329:
-

I'll provide a PR.

> File system delegation tokens are leaked
> 
>
> Key: SPARK-37329
> URL: https://issues.apache.org/jira/browse/SPARK-37329
> Project: Spark
>  Issue Type: Bug
>  Components: Security, YARN
>Affects Versions: 2.4.0
>Reporter: Wei-Chiu Chuang
>Priority: Major
>
> On a very busy Hadoop cluster (with HDFS at-rest encryption) we found that
> the KMS accumulated millions of delegation tokens that were not cancelled
> even after jobs finished, and the KMS went out of memory within a day
> because of the delegation token leak.
> We were able to reproduce the bug in a smaller test cluster, and realized
> that when a Spark job starts it acquires two delegation tokens, of which
> only one is cancelled properly after the job finishes. The other is left
> over and lingers for up to 7 days (the default Hadoop delegation token
> lifetime).
> YARN handles the lifecycle of a delegation token properly if its renewer is
> 'yarn'. However, Spark intentionally (a hack?) acquires a second delegation
> token with the job issuer as the renewer, simply to get the token renewal
> interval. That token is then ignored but never cancelled.
> Proposal: cancel the delegation token immediately after the token renewal
> interval is obtained.
> Environment: CDH 6.3.2 (based on Apache Spark 2.4.0), but the bug has
> probably been present since day 1.
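A minimal sketch of the proposal, assuming Hadoop's Token API (the helper
name is hypothetical; the real change would land in Spark's delegation token
provider for Hadoop file systems):

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.security.token.Token

// Hypothetical helper: the second token exists only to probe the renewal
// interval, so cancel it as soon as renew() has been called instead of
// letting it linger for the full 7-day lifetime.
def probeRenewalIntervalAndCancel(token: Token[_], conf: Configuration): Long = {
  try {
    token.renew(conf) // returns the next expiration time
  } finally {
    token.cancel(conf) // the missing cancel is what leaks tokens
  }
}
{code}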






[jira] [Resolved] (SPARK-36223) TPCDSQueryTestSuite should run with different config set

2021-11-15 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-36223.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 33510
[https://github.com/apache/spark/pull/33510]

> TPCDSQueryTestSuite should run with different config set
> 
>
> Key: SPARK-36223
> URL: https://issues.apache.org/jira/browse/SPARK-36223
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Linhong Liu
>Priority: Major
> Fix For: 3.3.0
>
>
> In the current GitHub Actions workflow we run TPCDSQueryTestSuite for the
> TPC-DS benchmark, but only under the default configuration. Since we have
> added the `spark.sql.join.forceApplyShuffledHashJoin` config, we can now
> test all three join strategies in TPC-DS to improve the coverage.
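For illustration, a sketch of the configuration sets such a run could cover
(only `spark.sql.join.forceApplyShuffledHashJoin` comes from this work; the
other entries are standard Spark SQL confs):

{code:scala}
// One configuration per join strategy; a sketch, not the actual suite change.
val joinConfSets: Seq[Map[String, String]] = Seq(
  // default: broadcast hash join is picked where one side is small enough
  Map.empty,
  // disable broadcast so sort merge join is used
  Map("spark.sql.autoBroadcastJoinThreshold" -> "-1"),
  // disable broadcast and force shuffled hash join
  Map(
    "spark.sql.autoBroadcastJoinThreshold" -> "-1",
    "spark.sql.join.forceApplyShuffledHashJoin" -> "true"))
{code}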






[jira] [Assigned] (SPARK-36223) TPCDSQueryTestSuite should run with different config set

2021-11-15 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-36223:
---

Assignee: roryqi

> TPCDSQueryTestSuite should run with different config set
> 
>
> Key: SPARK-36223
> URL: https://issues.apache.org/jira/browse/SPARK-36223
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Linhong Liu
>Assignee: roryqi
>Priority: Major
> Fix For: 3.3.0
>
>
> In the current GitHub Actions workflow we run TPCDSQueryTestSuite for the
> TPC-DS benchmark, but only under the default configuration. Since we have
> added the `spark.sql.join.forceApplyShuffledHashJoin` config, we can now
> test all three join strategies in TPC-DS to improve the coverage.






[jira] [Assigned] (SPARK-37327) Silence the to_pandas() advice log for internal usage

2021-11-15 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-37327:


Assignee: Haejoon Lee

> Silence the to_pandas() advice log for internal usage
> -
>
> Key: SPARK-37327
> URL: https://issues.apache.org/jira/browse/SPARK-37327
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>
> Raised from comment 
> [https://github.com/apache/spark/pull/34389#discussion_r741733023].
> The advice warning that the pandas API on Spark issues for to_pandas()
> produces too many messages, e.g. when a user runs the plotting functions,
> so we want to silence the warning when to_pandas() is called for internal
> purposes.






[jira] [Resolved] (SPARK-37327) Silence the to_pandas() advice log for internal usage

2021-11-15 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-37327.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34598
[https://github.com/apache/spark/pull/34598]

> Silence the to_pandas() advice log for internal usage
> -
>
> Key: SPARK-37327
> URL: https://issues.apache.org/jira/browse/SPARK-37327
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.3.0
>
>
> Raised from comment 
> [https://github.com/apache/spark/pull/34389#discussion_r741733023].
> The advice warning that the pandas API on Spark issues for to_pandas()
> produces too many messages, e.g. when a user runs the plotting functions,
> so we want to silence the warning when to_pandas() is called for internal
> purposes.






[jira] [Created] (SPARK-37331) Add the ability to create resources before the driver pod

2021-11-15 Thread Yikun Jiang (Jira)
Yikun Jiang created SPARK-37331:
---

 Summary: Add the ability to create resources before the driver pod
 Key: SPARK-37331
 URL: https://issues.apache.org/jira/browse/SPARK-37331
 Project: Spark
  Issue Type: Sub-task
  Components: Kubernetes
Affects Versions: 3.3.0
Reporter: Yikun Jiang









[jira] [Assigned] (SPARK-37331) Add the ability to create resources before the driver pod

2021-11-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37331:


Assignee: (was: Apache Spark)

> Add the ability to create resources before the driver pod
> ---
>
> Key: SPARK-37331
> URL: https://issues.apache.org/jira/browse/SPARK-37331
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Yikun Jiang
>Priority: Major
>







[jira] [Assigned] (SPARK-37331) Add the ability to create resources before the driver pod

2021-11-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37331:


Assignee: Apache Spark

> Add the ability to create resources before the driver pod
> ---
>
> Key: SPARK-37331
> URL: https://issues.apache.org/jira/browse/SPARK-37331
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Yikun Jiang
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Commented] (SPARK-37331) Add the ability to create resources before the driver pod

2021-11-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443673#comment-17443673
 ] 

Apache Spark commented on SPARK-37331:
--

User 'Yikun' has created a pull request for this issue:
https://github.com/apache/spark/pull/34599

> Add the ability to create resources before the driver pod
> ---
>
> Key: SPARK-37331
> URL: https://issues.apache.org/jira/browse/SPARK-37331
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Yikun Jiang
>Priority: Major
>







[jira] [Commented] (SPARK-37331) Add the ability to create resources before the driver pod

2021-11-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443674#comment-17443674
 ] 

Apache Spark commented on SPARK-37331:
--

User 'Yikun' has created a pull request for this issue:
https://github.com/apache/spark/pull/34599

> Add the ability to create resources before the driver pod
> ---
>
> Key: SPARK-37331
> URL: https://issues.apache.org/jira/browse/SPARK-37331
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Yikun Jiang
>Priority: Major
>







[jira] [Resolved] (SPARK-37283) Don't try to store a V1 table which contains ANSI intervals in Hive compatible format

2021-11-15 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-37283.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34551
[https://github.com/apache/spark/pull/34551]

> Don't try to store a V1 table which contains ANSI intervals in Hive 
> compatible format
> -
>
> Key: SPARK-37283
> URL: https://issues.apache.org/jira/browse/SPARK-37283
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.3.0
>
>
> If a table being created contains a column of an ANSI interval type and the
> underlying file format has a corresponding Hive SerDe (e.g. Parquet),
> `HiveExternalCatalog` tries to store the table in a Hive compatible format.
> But, as the ANSI interval types in Spark and the interval types in Hive are
> not compatible (Hive only supports interval_year_month and
> interval_day_time), the following warning with a stack trace is logged.
> {code}
> spark-sql> CREATE TABLE tbl1(a INTERVAL YEAR TO MONTH) USING Parquet;
> 21/11/11 14:39:29 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, 
> since hive.security.authorization.manager is set to instance of 
> HiveAuthorizerFactory.
> 21/11/11 14:39:29 WARN HiveExternalCatalog: Could not persist 
> `default`.`tbl1` in a Hive compatible way. Persisting it into Hive metastore 
> in Spark SQL specific format.
> org.apache.hadoop.hive.ql.metadata.HiveException: 
> java.lang.IllegalArgumentException: Error: type expected at the position 0 of 
> 'interval year to month' but 'interval year to month' is found.
>   at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:869)
>   at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:874)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$createTable$1(HiveClientImpl.scala:553)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:303)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:234)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:233)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:283)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.createTable(HiveClientImpl.scala:551)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.saveTableIntoHive(HiveExternalCatalog.scala:499)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.createDataSourceTable(HiveExternalCatalog.scala:397)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$createTable$1(HiveExternalCatalog.scala:274)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:102)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:245)
>   at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.createTable(ExternalCatalogWithListener.scala:94)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:376)
>   at 
> org.apache.spark.sql.execution.command.CreateDataSourceTableCommand.run(createDataSourceTables.scala:120)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:97)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:97)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCom

[jira] [Created] (SPARK-37332) Check adding of ANSI interval columns

2021-11-15 Thread Max Gekk (Jira)
Max Gekk created SPARK-37332:


 Summary: Check adding of ANSI interval columns
 Key: SPARK-37332
 URL: https://issues.apache.org/jira/browse/SPARK-37332
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.3.0
Reporter: Max Gekk


Write tests that check adding ANSI interval columns to a table
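A sketch of the kind of test intended, assuming Spark's SQL test helpers
(withTable, sql, checkAnswer); the actual tests should cover both v1 and v2
catalogs:

{code:scala}
import java.time.Period
import org.apache.spark.sql.Row

// Hypothetical test body: add an ANSI interval column, write to it, read back.
test("SPARK-37332: add an ANSI interval column to a table") {
  withTable("tbl") {
    sql("CREATE TABLE tbl (id INT) USING parquet")
    sql("ALTER TABLE tbl ADD COLUMNS (ym INTERVAL YEAR TO MONTH)")
    sql("INSERT INTO tbl SELECT 0, INTERVAL '0-7' YEAR TO MONTH")
    checkAnswer(sql("SELECT ym FROM tbl"), Row(Period.ofMonths(7)))
  }
}
{code}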






[jira] [Updated] (SPARK-37332) Check adding of ANSI interval columns to v1/v2 tables

2021-11-15 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-37332:
-
Summary: Check adding of ANSI interval columns to v1/v2 tables  (was: Check 
adding of ANSI interval columns)

> Check adding of ANSI interval columns to v1/v2 tables
> -
>
> Key: SPARK-37332
> URL: https://issues.apache.org/jira/browse/SPARK-37332
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>
> Write tests that check adding ANSI interval columns to a table






[jira] [Commented] (SPARK-37332) Check adding of ANSI interval columns to v1/v2 tables

2021-11-15 Thread Max Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443715#comment-17443715
 ] 

Max Gekk commented on SPARK-37332:
--

I am working on this.

> Check adding of ANSI interval columns to v1/v2 tables
> -
>
> Key: SPARK-37332
> URL: https://issues.apache.org/jira/browse/SPARK-37332
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>
> Write tests that check adding ANSI interval columns to a table






[jira] [Commented] (SPARK-37181) pyspark.pandas.read_csv() should support latin-1 encoding

2021-11-15 Thread pralabhkumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443720#comment-17443720
 ] 

pralabhkumar commented on SPARK-37181:
--

from pyspark import pandas as ps

The latin-1 encoding is the same as ISO-8859-1, so you can specify that instead:

ps.read_csv("/Users/pralkuma/Desktop/rk_scaas/spark/a.txt", encoding='ISO-8859-1')

> pyspark.pandas.read_csv() should support latin-1 encoding
> -
>
> Key: SPARK-37181
> URL: https://issues.apache.org/jira/browse/SPARK-37181
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Chuck Connell
>Priority: Major
>
> {{In regular pandas, you can say read_csv(encoding='latin-1'). This encoding 
> is not recognized in pyspark.pandas. You have to use Windows-1252 instead, 
> which is almost the same but not identical. }}






[jira] [Comment Edited] (SPARK-37181) pyspark.pandas.read_csv() should support latin-1 encoding

2021-11-15 Thread pralabhkumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443720#comment-17443720
 ] 

pralabhkumar edited comment on SPARK-37181 at 11/15/21, 10:32 AM:
--

from pyspark import pandas as ps

The latin-1 encoding is the same as ISO-8859-1, so you can specify that instead:

ps.read_csv("<>", encoding='ISO-8859-1')

[~chconnell]


was (Author: pralabhkumar):
from pyspark import pandas as ps

The latin-1 encoding is the same as ISO-8859-1, so you can specify that instead:

ps.read_csv("/Users/pralkuma/Desktop/rk_scaas/spark/a.txt", encoding='ISO-8859-1')

> pyspark.pandas.read_csv() should support latin-1 encoding
> -
>
> Key: SPARK-37181
> URL: https://issues.apache.org/jira/browse/SPARK-37181
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Chuck Connell
>Priority: Major
>
> {{In regular pandas, you can say read_csv(encoding='latin-1'). This encoding 
> is not recognized in pyspark.pandas. You have to use Windows-1252 instead, 
> which is almost the same but not identical. }}






[jira] [Commented] (SPARK-37181) pyspark.pandas.read_csv() should support latin-1 encoding

2021-11-15 Thread pralabhkumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443764#comment-17443764
 ] 

pralabhkumar commented on SPARK-37181:
--

However, from the user's point of view, if latin-1 is specified in
pyspark.pandas then, instead of throwing
"pyspark.sql.utils.IllegalArgumentException: latin-1", Spark could
internally convert it to ISO-8859-1.
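A sketch of that conversion (the alias map and helper below are
hypothetical, not existing Spark code):

{code:scala}
import java.nio.charset.Charset
import java.util.Locale

// Map pandas-style aliases that the JVM may not accept (e.g. "latin-1" with
// the hyphen) onto canonical charset names before calling Charset.forName.
val pandasEncodingAliases = Map("latin-1" -> "ISO-8859-1")

def resolveCharset(name: String): Charset = {
  val canonical =
    pandasEncodingAliases.getOrElse(name.toLowerCase(Locale.ROOT), name)
  Charset.forName(canonical)
}
{code}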

 

cc [~hyukjin.kwon] 

> pyspark.pandas.read_csv() should support latin-1 encoding
> -
>
> Key: SPARK-37181
> URL: https://issues.apache.org/jira/browse/SPARK-37181
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Chuck Connell
>Priority: Major
>
> {{In regular pandas, you can say read_csv(encoding='latin-1'). This encoding 
> is not recognized in pyspark.pandas. You have to use Windows-1252 instead, 
> which is almost the same but not identical. }}






[jira] [Assigned] (SPARK-37332) Check adding of ANSI interval columns to v1/v2 tables

2021-11-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37332:


Assignee: Apache Spark

> Check adding of ANSI interval columns to v1/v2 tables
> -
>
> Key: SPARK-37332
> URL: https://issues.apache.org/jira/browse/SPARK-37332
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Write tests that check adding ANSI interval columns to a table






[jira] [Commented] (SPARK-37332) Check adding of ANSI interval columns to v1/v2 tables

2021-11-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443768#comment-17443768
 ] 

Apache Spark commented on SPARK-37332:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/34600

> Check adding of ANSI interval columns to v1/v2 tables
> -
>
> Key: SPARK-37332
> URL: https://issues.apache.org/jira/browse/SPARK-37332
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>
> Write tests that check adding ANSI interval columns to a table






[jira] [Assigned] (SPARK-37332) Check adding of ANSI interval columns to v1/v2 tables

2021-11-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37332:


Assignee: (was: Apache Spark)

> Check adding of ANSI interval columns to v1/v2 tables
> -
>
> Key: SPARK-37332
> URL: https://issues.apache.org/jira/browse/SPARK-37332
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>
> Write tests that check adding ANSI interval columns to a table






[jira] [Resolved] (SPARK-35352) Add code-gen for full outer sort merge join

2021-11-15 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-35352.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34581
[https://github.com/apache/spark/pull/34581]

> Add code-gen for full outer sort merge join
> ---
>
> Key: SPARK-35352
> URL: https://issues.apache.org/jira/browse/SPARK-35352
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Cheng Su
>Assignee: Cheng Su
>Priority: Minor
> Fix For: 3.3.0
>
>
> This Jira is to track the progress of adding code-gen support for full
> outer sort merge join. See the motivation in SPARK-34705.






[jira] [Assigned] (SPARK-35352) Add code-gen for full outer sort merge join

2021-11-15 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-35352:
---

Assignee: Cheng Su

> Add code-gen for full outer sort merge join
> ---
>
> Key: SPARK-35352
> URL: https://issues.apache.org/jira/browse/SPARK-35352
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Cheng Su
>Assignee: Cheng Su
>Priority: Minor
>
> This Jira is to track the progress of adding code-gen support for full
> outer sort merge join. See the motivation in SPARK-34705.






[jira] [Commented] (SPARK-37316) Add code-gen for existence sort merge join

2021-11-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443793#comment-17443793
 ] 

Apache Spark commented on SPARK-37316:
--

User 'c21' has created a pull request for this issue:
https://github.com/apache/spark/pull/34601

> Add code-gen for existence sort merge join
> --
>
> Key: SPARK-37316
> URL: https://issues.apache.org/jira/browse/SPARK-37316
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Cheng Su
>Priority: Minor
>
> This Jira is to track the progress of adding code-gen support for
> existence sort merge join. See the motivation in SPARK-34705.






[jira] [Assigned] (SPARK-37316) Add code-gen for existence sort merge join

2021-11-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37316:


Assignee: (was: Apache Spark)

> Add code-gen for existence sort merge join
> --
>
> Key: SPARK-37316
> URL: https://issues.apache.org/jira/browse/SPARK-37316
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Cheng Su
>Priority: Minor
>
> This Jira is to track the progress of adding code-gen support for
> existence sort merge join. See the motivation in SPARK-34705.






[jira] [Assigned] (SPARK-37316) Add code-gen for existence sort merge join

2021-11-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37316:


Assignee: Apache Spark

> Add code-gen for existence sort merge join
> --
>
> Key: SPARK-37316
> URL: https://issues.apache.org/jira/browse/SPARK-37316
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Cheng Su
>Assignee: Apache Spark
>Priority: Minor
>
> This Jira is to track the progress of adding code-gen support for
> existence sort merge join. See the motivation in SPARK-34705.






[jira] [Commented] (SPARK-37316) Add code-gen for existence sort merge join

2021-11-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443794#comment-17443794
 ] 

Apache Spark commented on SPARK-37316:
--

User 'c21' has created a pull request for this issue:
https://github.com/apache/spark/pull/34601

> Add code-gen for existence sort merge join
> --
>
> Key: SPARK-37316
> URL: https://issues.apache.org/jira/browse/SPARK-37316
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Cheng Su
>Priority: Minor
>
> This Jira is to track the progress of adding code-gen support for
> existence sort merge join. See the motivation in SPARK-34705.






[jira] [Updated] (SPARK-37328) SPARK-33832 brings the bug that OptimizeSkewedJoin may not work since it was applied on whole plan instead of new stage plan

2021-11-15 Thread Lietong Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lietong Liu updated SPARK-37328:

Summary: SPARK-33832 brings the bug that OptimizeSkewedJoin may not work
since it was applied on whole plan instead of new stage plan  (was:
SPARK-33832 brings the bug that OptimizeSkewedJoin may not work since it was
applied onn whole plan instead of new stage plan)

> SPARK-33832 brings the bug that OptimizeSkewedJoin may not work since it was
> applied on whole plan instead of new stage plan
> -
>
> Key: SPARK-37328
> URL: https://issues.apache.org/jira/browse/SPARK-37328
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Lietong Liu
>Priority: Major
>
> Since OptimizeSkewedJoin was moved from queryStageOptimizerRules to
> queryStagePreparationRules, the point at which OptimizeSkewedJoin is
> applied has moved from newQueryStage() to reOptimize(), and the plan it is
> applied to has changed from the plan of the new stage about to be submitted
> to the whole Spark plan.
> In cases where the skewed join is not in the last stage, OptimizeSkewedJoin
> may not work because the number of collected shuffle stages is more than 2.
> The following test demonstrates it:
>  
>  
> {code:java}
> test("OptimizeSkewJoin may not work") {
>   withSQLConf(
> SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true",
> SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1",
> SQLConf.SKEW_JOIN_SKEWED_PARTITION_THRESHOLD.key -> "100",
> SQLConf.ADVISORY_PARTITION_SIZE_IN_BYTES.key -> "100",
> SQLConf.COALESCE_PARTITIONS_MIN_PARTITION_NUM.key -> "1",
> SQLConf.SHUFFLE_PARTITIONS.key -> "10") {
> withTempView("skewData1", "skewData2", "skewData3") {
>   spark
> .range(0, 1000, 1, 10)
> .selectExpr("id % 3 as key1", "id % 3 as value1")
> .createOrReplaceTempView("skewData1")
>   spark
> .range(0, 1000, 1, 10)
> .selectExpr("id % 1 as key2", "id as value2")
> .createOrReplaceTempView("skewData2")
>   spark
> .range(0, 1000, 1, 10)
> .selectExpr("id % 1 as key3", "id as value3")
> .createOrReplaceTempView("skewData3")
>   // The query has two skewed joins in two consecutive stages.
>   val (_, adaptive1) =
> runAdaptiveAndVerifyResult(
>   """
> |SELECT key1 FROM skewData1 s1
> |JOIN skewData2 s2
> |ON s1.key1 = s2.key2
> |JOIN skewData3
> |ON s1.value1 = value3
> |""".stripMargin)
>   val shuffles1 = collect(adaptive1) {
> case s: ShuffleExchangeExec => s
>   }
>   assert(shuffles1.size == 4)
>   val smj1 = findTopLevelSortMergeJoin(adaptive1)
>   assert(smj1.size == 2 && smj1.forall(_.isSkewJoin))
> }
>   }
> } {code}
> I'll open a PR shortly to fix this issue
>  






[jira] [Created] (SPARK-37333) Specify the required distribution at V1Write

2021-11-15 Thread XiDuo You (Jira)
XiDuo You created SPARK-37333:
-

 Summary: Specify the required distribution at V1Write
 Key: SPARK-37333
 URL: https://issues.apache.org/jira/browse/SPARK-37333
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: XiDuo You


An improvement of SPARK-37287.

We can specify the required distribution at V1Write. For example, if the
write uses dynamic partitioning, we may expect an output partitioning based
on the dynamic partition columns.
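As an illustration with the public API (the data and output path are made
up), this is the effect a required distribution on V1Write would arrange
automatically for a dynamic partition write:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq((2021, 11, "a"), (2021, 12, "b")).toDF("year", "month", "value")

// Clustering the data by the dynamic partition columns before a partitioned
// write keeps each task writing to few partitions; a required distribution
// on V1Write would let the planner insert this shuffle itself.
df.repartition($"year", $"month")
  .write
  .partitionBy("year", "month")
  .mode("overwrite")
  .parquet("/tmp/spark-37333-demo")
{code}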






[jira] [Assigned] (SPARK-37328) SPARK-33832 brings the bug that OptimizeSkewedJoin may not work since it was applied on whole plan instead of new stage plan

2021-11-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37328:


Assignee: Apache Spark

> SPARK-33832 brings the bug that OptimizeSkewedJoin may not work since it was
> applied on whole plan instead of new stage plan
> -
>
> Key: SPARK-37328
> URL: https://issues.apache.org/jira/browse/SPARK-37328
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Lietong Liu
>Assignee: Apache Spark
>Priority: Major
>
> Since OptimizeSkewedJoin was moved from queryStageOptimizerRules to
> queryStagePreparationRules, the point at which OptimizeSkewedJoin is
> applied has moved from newQueryStage() to reOptimize(), and the plan it is
> applied to has changed from the plan of the new stage about to be submitted
> to the whole Spark plan.
> In cases where the skewed join is not in the last stage, OptimizeSkewedJoin
> may not work because the number of collected shuffle stages is more than 2.
> The following test demonstrates it:
>  
>  
> {code:java}
> test("OptimizeSkewJoin may not work") {
>   withSQLConf(
> SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true",
> SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1",
> SQLConf.SKEW_JOIN_SKEWED_PARTITION_THRESHOLD.key -> "100",
> SQLConf.ADVISORY_PARTITION_SIZE_IN_BYTES.key -> "100",
> SQLConf.COALESCE_PARTITIONS_MIN_PARTITION_NUM.key -> "1",
> SQLConf.SHUFFLE_PARTITIONS.key -> "10") {
> withTempView("skewData1", "skewData2", "skewData3") {
>   spark
> .range(0, 1000, 1, 10)
> .selectExpr("id % 3 as key1", "id % 3 as value1")
> .createOrReplaceTempView("skewData1")
>   spark
> .range(0, 1000, 1, 10)
> .selectExpr("id % 1 as key2", "id as value2")
> .createOrReplaceTempView("skewData2")
>   spark
> .range(0, 1000, 1, 10)
> .selectExpr("id % 1 as key3", "id as value3")
> .createOrReplaceTempView("skewData3")
>   // The query has two skewed joins in two consecutive stages.
>   val (_, adaptive1) =
> runAdaptiveAndVerifyResult(
>   """
> |SELECT key1 FROM skewData1 s1
> |JOIN skewData2 s2
> |ON s1.key1 = s2.key2
> |JOIN skewData3
> |ON s1.value1 = value3
> |""".stripMargin)
>   val shuffles1 = collect(adaptive1) {
> case s: ShuffleExchangeExec => s
>   }
>   assert(shuffles1.size == 4)
>   val smj1 = findTopLevelSortMergeJoin(adaptive1)
>   assert(smj1.size == 2 && smj1.forall(_.isSkewJoin))
> }
>   }
> } {code}
> I'll open a PR shortly to fix this issue
>  






[jira] [Commented] (SPARK-37328) SPARK-33832 brings the bug that OptimizeSkewedJoin may not work since it was applied on whole plan instead of new stage plan

2021-11-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443801#comment-17443801
 ] 

Apache Spark commented on SPARK-37328:
--

User 'Liulietong' has created a pull request for this issue:
https://github.com/apache/spark/pull/34602

> SPARK-33832 brings the bug that OptimizeSkewedJoin may not work since it was
> applied on whole plan instead of new stage plan
> -
>
> Key: SPARK-37328
> URL: https://issues.apache.org/jira/browse/SPARK-37328
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Lietong Liu
>Priority: Major
>
> Since OptimizeSkewedJoin was moved from queryStageOptimizerRules to
> queryStagePreparationRules, the point at which OptimizeSkewedJoin is
> applied has moved from newQueryStage() to reOptimize(), and the plan it is
> applied to has changed from the plan of the new stage about to be submitted
> to the whole Spark plan.
> In cases where the skewed join is not in the last stage, OptimizeSkewedJoin
> may not work because the number of collected shuffle stages is more than 2.
> The following test demonstrates it:
>  
>  
> {code:java}
> test("OptimizeSkewJoin may not work") {
>   withSQLConf(
> SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true",
> SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1",
> SQLConf.SKEW_JOIN_SKEWED_PARTITION_THRESHOLD.key -> "100",
> SQLConf.ADVISORY_PARTITION_SIZE_IN_BYTES.key -> "100",
> SQLConf.COALESCE_PARTITIONS_MIN_PARTITION_NUM.key -> "1",
> SQLConf.SHUFFLE_PARTITIONS.key -> "10") {
> withTempView("skewData1", "skewData2", "skewData3") {
>   spark
> .range(0, 1000, 1, 10)
> .selectExpr("id % 3 as key1", "id % 3 as value1")
> .createOrReplaceTempView("skewData1")
>   spark
> .range(0, 1000, 1, 10)
> .selectExpr("id % 1 as key2", "id as value2")
> .createOrReplaceTempView("skewData2")
>   spark
> .range(0, 1000, 1, 10)
> .selectExpr("id % 1 as key3", "id as value3")
> .createOrReplaceTempView("skewData3")
>   // The query has two skewed joins in two consecutive stages.
>   val (_, adaptive1) =
> runAdaptiveAndVerifyResult(
>   """
> |SELECT key1 FROM skewData1 s1
> |JOIN skewData2 s2
> |ON s1.key1 = s2.key2
> |JOIN skewData3
> |ON s1.value1 = value3
> |""".stripMargin)
>   val shuffles1 = collect(adaptive1) {
> case s: ShuffleExchangeExec => s
>   }
>   assert(shuffles1.size == 4)
>   val smj1 = findTopLevelSortMergeJoin(adaptive1)
>   assert(smj1.size == 2 && smj1.forall(_.isSkewJoin))
> }
>   }
> } {code}
> I'll open a PR shortly to fix this issue
>  






[jira] [Assigned] (SPARK-37328) SPARK-33832 brings the bug that OptimizeSkewedJoin may not work since it was applied on whole plan instead of new stage plan

2021-11-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37328:


Assignee: (was: Apache Spark)

> SPARK-33832 brings the bug that OptimizeSkewedJoin may not work since it was
> applied on whole plan instead of new stage plan
> -
>
> Key: SPARK-37328
> URL: https://issues.apache.org/jira/browse/SPARK-37328
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Lietong Liu
>Priority: Major
>
> Since OptimizeSkewedJoin was moved from queryStageOptimizerRules to
> queryStagePreparationRules, the point at which OptimizeSkewedJoin is
> applied has moved from newQueryStage() to reOptimize(), and the plan it is
> applied to has changed from the plan of the new stage about to be submitted
> to the whole Spark plan.
> In cases where the skewed join is not in the last stage, OptimizeSkewedJoin
> may not work because the number of collected shuffle stages is more than 2.
> The following test demonstrates it:
>  
>  
> {code:java}
> test("OptimizeSkewJoin may not work") {
>   withSQLConf(
> SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true",
> SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1",
> SQLConf.SKEW_JOIN_SKEWED_PARTITION_THRESHOLD.key -> "100",
> SQLConf.ADVISORY_PARTITION_SIZE_IN_BYTES.key -> "100",
> SQLConf.COALESCE_PARTITIONS_MIN_PARTITION_NUM.key -> "1",
> SQLConf.SHUFFLE_PARTITIONS.key -> "10") {
> withTempView("skewData1", "skewData2", "skewData3") {
>   spark
> .range(0, 1000, 1, 10)
> .selectExpr("id % 3 as key1", "id % 3 as value1")
> .createOrReplaceTempView("skewData1")
>   spark
> .range(0, 1000, 1, 10)
> .selectExpr("id % 1 as key2", "id as value2")
> .createOrReplaceTempView("skewData2")
>   spark
> .range(0, 1000, 1, 10)
> .selectExpr("id % 1 as key3", "id as value3")
> .createOrReplaceTempView("skewData3")
>   // The query has two skewed joins in two consecutive stages.
>   val (_, adaptive1) =
> runAdaptiveAndVerifyResult(
>   """
> |SELECT key1 FROM skewData1 s1
> |JOIN skewData2 s2
> |ON s1.key1 = s2.key2
> |JOIN skewData3
> |ON s1.value1 = value3
> |""".stripMargin)
>   val shuffles1 = collect(adaptive1) {
> case s: ShuffleExchangeExec => s
>   }
>   assert(shuffles1.size == 4)
>   val smj1 = findTopLevelSortMergeJoin(adaptive1)
>   assert(smj1.size == 2 && smj1.forall(_.isSkewJoin))
> }
>   }
> } {code}
> I'll open a PR shortly to fix this issue
>  






[jira] [Created] (SPARK-37334) pandas `convert_dtypes` method support

2021-11-15 Thread Ali Amin-Nejad (Jira)
Ali Amin-Nejad created SPARK-37334:
--

 Summary: pandas `convert_dtypes` method support
 Key: SPARK-37334
 URL: https://issues.apache.org/jira/browse/SPARK-37334
 Project: Spark
  Issue Type: New Feature
  Components: PySpark
Affects Versions: 3.2.0
Reporter: Ali Amin-Nejad


Add support for the {{convert_dtypes}} method as part of the new pandas API
in PySpark.

[https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.convert_dtypes.html]






[jira] [Comment Edited] (SPARK-37181) pyspark.pandas.read_csv() should support latin-1 encoding

2021-11-15 Thread pralabhkumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443764#comment-17443764
 ] 

pralabhkumar edited comment on SPARK-37181 at 11/15/21, 2:10 PM:
-

However, from the user's point of view, if latin-1 is specified in
pyspark.pandas then, instead of throwing
"pyspark.sql.utils.IllegalArgumentException: latin-1", Spark could
internally convert it to ISO-8859-1.

cc [~hyukjin.kwon], [~yikunkero]

Let me know if I can work on this.


was (Author: pralabhkumar):
However, from the user's point of view, if latin-1 is specified in
pyspark.pandas then, instead of throwing
"pyspark.sql.utils.IllegalArgumentException: latin-1", Spark could
internally convert it to ISO-8859-1.

cc [~hyukjin.kwon]

> pyspark.pandas.read_csv() should support latin-1 encoding
> -
>
> Key: SPARK-37181
> URL: https://issues.apache.org/jira/browse/SPARK-37181
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Chuck Connell
>Priority: Major
>
> {{In regular pandas, you can say read_csv(encoding='latin-1'). This encoding 
> is not recognized in pyspark.pandas. You have to use Windows-1252 instead, 
> which is almost the same but not identical. }}






[jira] [Comment Edited] (SPARK-37181) pyspark.pandas.read_csv() should support latin-1 encoding

2021-11-15 Thread pralabhkumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443764#comment-17443764
 ] 

pralabhkumar edited comment on SPARK-37181 at 11/15/21, 2:10 PM:
-

However, from the user's point of view, if latin-1 is specified in
pyspark.pandas then, instead of throwing
"pyspark.sql.utils.IllegalArgumentException: latin-1", Spark could
internally convert it to ISO-8859-1.

cc [~hyukjin.kwon], [~yikunkero]

Let me know if my understanding is correct. If yes, I can work on this.


was (Author: pralabhkumar):
However, from the user's point of view, if latin-1 is specified in
pyspark.pandas then, instead of throwing
"pyspark.sql.utils.IllegalArgumentException: latin-1", Spark could
internally convert it to ISO-8859-1.

cc [~hyukjin.kwon], [~yikunkero]

Let me know if I can work on this.

> pyspark.pandas.read_csv() should support latin-1 encoding
> -
>
> Key: SPARK-37181
> URL: https://issues.apache.org/jira/browse/SPARK-37181
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Chuck Connell
>Priority: Major
>
> {{In regular pandas, you can say read_csv(encoding='latin-1'). This encoding 
> is not recognized in pyspark.pandas. You have to use Windows-1252 instead, 
> which is almost the same but not identical. }}






[jira] [Resolved] (SPARK-37266) View text can only be SELECT queries

2021-11-15 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-37266.
-
Fix Version/s: 3.3.0
 Assignee: jiaan.geng
   Resolution: Fixed

> View text can only be SELECT queries
> 
>
> Key: SPARK-37266
> URL: https://issues.apache.org/jira/browse/SPARK-37266
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.3.0
>
>
> The current implementation of persistent views creates a Hive table with
> the view text.
> The view text is just a query string, so attackers may tamper with it
> through various means.
> For example:
> {code:java}
> select * from tab1
> {code}
> could be tampered with to become:
> {code:java}
> drop table tab1
> {code}
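A sketch of the kind of defensive check this implies, assuming access to the
session's parser (the actual fix belongs in Spark's view resolution, not
user code):

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.Command

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Hypothetical guard: reject stored view text whose parsed plan is a
// command (e.g. DROP TABLE) rather than a query.
def assertIsQuery(viewText: String): Unit = {
  val plan = spark.sessionState.sqlParser.parsePlan(viewText)
  require(!plan.isInstanceOf[Command],
    s"Invalid view text, expected a query: $viewText")
}
{code}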






[jira] [Comment Edited] (SPARK-37329) File system delegation tokens are leaked

2021-11-15 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443618#comment-17443618
 ] 

Wei-Chiu Chuang edited comment on SPARK-37329 at 11/15/21, 2:57 PM:


PR: https://github.com/apache/spark/pull/34604


was (Author: jojochuang):
I'll provide a PR.

> File system delegation tokens are leaked
> 
>
> Key: SPARK-37329
> URL: https://issues.apache.org/jira/browse/SPARK-37329
> Project: Spark
>  Issue Type: Bug
>  Components: Security, YARN
>Affects Versions: 2.4.0
>Reporter: Wei-Chiu Chuang
>Priority: Major
>
> On a very busy Hadoop cluster (with HDFS at-rest encryption) we found that
> the KMS accumulated millions of delegation tokens that were not cancelled
> even after jobs finished, and the KMS went out of memory within a day
> because of the delegation token leak.
> We were able to reproduce the bug in a smaller test cluster, and realized
> that when a Spark job starts it acquires two delegation tokens, of which
> only one is cancelled properly after the job finishes. The other is left
> over and lingers for up to 7 days (the default Hadoop delegation token
> lifetime).
> YARN handles the lifecycle of a delegation token properly if its renewer is
> 'yarn'. However, Spark intentionally (a hack?) acquires a second delegation
> token with the job issuer as the renewer, simply to get the token renewal
> interval. That token is then ignored but never cancelled.
> Proposal: cancel the delegation token immediately after the token renewal
> interval is obtained.
> Environment: CDH 6.3.2 (based on Apache Spark 2.4.0), but the bug has
> probably been present since day 1.






[jira] [Commented] (SPARK-37329) File system delegation tokens are leaked

2021-11-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443890#comment-17443890
 ] 

Apache Spark commented on SPARK-37329:
--

User 'jojochuang' has created a pull request for this issue:
https://github.com/apache/spark/pull/34604

> File system delegation tokens are leaked
> 
>
> Key: SPARK-37329
> URL: https://issues.apache.org/jira/browse/SPARK-37329
> Project: Spark
>  Issue Type: Bug
>  Components: Security, YARN
>Affects Versions: 2.4.0
>Reporter: Wei-Chiu Chuang
>Priority: Major
>
> On a very busy Hadoop cluster (with HDFS at-rest encryption) we found that
> the KMS accumulated millions of delegation tokens that were not cancelled
> even after jobs finished, and the KMS went out of memory within a day
> because of the delegation token leak.
> We were able to reproduce the bug in a smaller test cluster, and realized
> that when a Spark job starts it acquires two delegation tokens, of which
> only one is cancelled properly after the job finishes. The other is left
> over and lingers for up to 7 days (the default Hadoop delegation token
> lifetime).
> YARN handles the lifecycle of a delegation token properly if its renewer is
> 'yarn'. However, Spark intentionally (a hack?) acquires a second delegation
> token with the job issuer as the renewer, simply to get the token renewal
> interval. That token is then ignored but never cancelled.
> Proposal: cancel the delegation token immediately after the token renewal
> interval is obtained.
> Environment: CDH 6.3.2 (based on Apache Spark 2.4.0), but the bug has
> probably been present since day 1.






[jira] [Assigned] (SPARK-37329) File system delegation tokens are leaked

2021-11-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37329:


Assignee: (was: Apache Spark)

> File system delegation tokens are leaked
> 
>
> Key: SPARK-37329
> URL: https://issues.apache.org/jira/browse/SPARK-37329
> Project: Spark
>  Issue Type: Bug
>  Components: Security, YARN
>Affects Versions: 2.4.0
>Reporter: Wei-Chiu Chuang
>Priority: Major
>
> On a very busy Hadoop cluster (with HDFS at-rest encryption) we found that
> the KMS accumulated millions of delegation tokens that were not cancelled
> even after jobs finished, and the KMS went out of memory within a day
> because of the delegation token leak.
> We were able to reproduce the bug in a smaller test cluster, and realized
> that when a Spark job starts it acquires two delegation tokens, of which
> only one is cancelled properly after the job finishes. The other is left
> over and lingers for up to 7 days (the default Hadoop delegation token
> lifetime).
> YARN handles the lifecycle of a delegation token properly if its renewer is
> 'yarn'. However, Spark intentionally (a hack?) acquires a second delegation
> token with the job issuer as the renewer, simply to get the token renewal
> interval. That token is then ignored but never cancelled.
> Proposal: cancel the delegation token immediately after the token renewal
> interval is obtained.
> Environment: CDH 6.3.2 (based on Apache Spark 2.4.0), but the bug has
> probably been present since day 1.






[jira] [Assigned] (SPARK-37329) File system delegation tokens are leaked

2021-11-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37329:


Assignee: Apache Spark

> File system delegation tokens are leaked
> 
>
> Key: SPARK-37329
> URL: https://issues.apache.org/jira/browse/SPARK-37329
> Project: Spark
>  Issue Type: Bug
>  Components: Security, YARN
>Affects Versions: 2.4.0
>Reporter: Wei-Chiu Chuang
>Assignee: Apache Spark
>Priority: Major
>
> On a very busy Hadoop cluster (with HDFS at-rest encryption) we found that
> the KMS accumulated millions of delegation tokens that were not cancelled
> even after jobs finished, and the KMS went out of memory within a day
> because of the delegation token leak.
> We were able to reproduce the bug in a smaller test cluster, and realized
> that when a Spark job starts it acquires two delegation tokens, of which
> only one is cancelled properly after the job finishes. The other is left
> over and lingers for up to 7 days (the default Hadoop delegation token
> lifetime).
> YARN handles the lifecycle of a delegation token properly if its renewer is
> 'yarn'. However, Spark intentionally (a hack?) acquires a second delegation
> token with the job issuer as the renewer, simply to get the token renewal
> interval. That token is then ignored but never cancelled.
> Proposal: cancel the delegation token immediately after the token renewal
> interval is obtained.
> Environment: CDH 6.3.2 (based on Apache Spark 2.4.0), but the bug has
> probably been present since day 1.






[jira] [Created] (SPARK-37335) Clarify output of FPGrowth

2021-11-15 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-37335:


 Summary: Clarify output of FPGrowth
 Key: SPARK-37335
 URL: https://issues.apache.org/jira/browse/SPARK-37335
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, ML
Affects Versions: 3.2.0
Reporter: Nicholas Chammas


The association rules returned by FPGrowth include more columns than are
documented:

[https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html]

We should offer a basic description of these columns.
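For reference, a minimal example using the documented FPGrowth API that
makes the extra columns visible (the exact column set depends on the Spark
version):

{code:scala}
import org.apache.spark.ml.fpm.FPGrowth
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val items = Seq("1 2 5", "1 2 3 5", "1 2").map(_.split(" ")).toDF("items")
val model = new FPGrowth()
  .setItemsCol("items")
  .setMinSupport(0.5)
  .setMinConfidence(0.6)
  .fit(items)

// Prints antecedent, consequent, confidence, plus the columns the guide
// does not yet describe (e.g. lift).
model.associationRules.printSchema()
{code}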






[jira] [Updated] (SPARK-37335) Clarify output of FPGrowth

2021-11-15 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-37335:
-
Description: 
The association rules returned by FPGrowth include more columns than are
documented, like {{lift}}:

[https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html]

We should offer a basic description of these columns. An _itemset_ should also 
be briefly defined.

  was:
The association rules returned by FPGrowth include more columns than are
documented:

[https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html]

We should offer a basic description of these columns.


> Clarify output of FPGrowth
> --
>
> Key: SPARK-37335
> URL: https://issues.apache.org/jira/browse/SPARK-37335
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML
>Affects Versions: 3.2.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> The association rules returned by FPGrowth include more columns than are 
> documented, like {{lift}}:
> [https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html]
> We should offer a basic description of these columns. An _itemset_ should 
> also be briefly defined.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37335) Clarify output of FPGrowth

2021-11-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443970#comment-17443970
 ] 

Apache Spark commented on SPARK-37335:
--

User 'nchammas' has created a pull request for this issue:
https://github.com/apache/spark/pull/34605

> Clarify output of FPGrowth
> --
>
> Key: SPARK-37335
> URL: https://issues.apache.org/jira/browse/SPARK-37335
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML
>Affects Versions: 3.2.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> The association rules returned by FPGrowth include more columns than are 
> documented, like {{lift}}:
> [https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html]
> We should offer a basic description of these columns. An _itemset_ should 
> also be briefly defined.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37335) Clarify output of FPGrowth

2021-11-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37335:


Assignee: (was: Apache Spark)

> Clarify output of FPGrowth
> --
>
> Key: SPARK-37335
> URL: https://issues.apache.org/jira/browse/SPARK-37335
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML
>Affects Versions: 3.2.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> The association rules returned by FPGrowth include more columns than are 
> documented, like {{lift}}:
> [https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html]
> We should offer a basic description of these columns. An _itemset_ should 
> also be briefly defined.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37335) Clarify output of FPGrowth

2021-11-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37335:


Assignee: Apache Spark

> Clarify output of FPGrowth
> --
>
> Key: SPARK-37335
> URL: https://issues.apache.org/jira/browse/SPARK-37335
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML
>Affects Versions: 3.2.0
>Reporter: Nicholas Chammas
>Assignee: Apache Spark
>Priority: Minor
>
> The association rules returned by FPGrowth include more columns than are 
> documented, like {{lift}}:
> [https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html]
> We should offer a basic description of these columns. An _itemset_ should 
> also be briefly defined.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37335) Clarify output of FPGrowth

2021-11-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443973#comment-17443973
 ] 

Apache Spark commented on SPARK-37335:
--

User 'nchammas' has created a pull request for this issue:
https://github.com/apache/spark/pull/34605

> Clarify output of FPGrowth
> --
>
> Key: SPARK-37335
> URL: https://issues.apache.org/jira/browse/SPARK-37335
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML
>Affects Versions: 3.2.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> The association rules returned by FPGrowth include more columns than are 
> documented, like {{lift}}:
> [https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html]
> We should offer a basic description of these columns. An _itemset_ should 
> also be briefly defined.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37336) Migrate common ML utils to SparkSession

2021-11-15 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-37336:


 Summary: Migrate common ML utils to SparkSession
 Key: SPARK-37336
 URL: https://issues.apache.org/jira/browse/SPARK-37336
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Affects Versions: 3.2.0
Reporter: Nicholas Chammas


{{_java2py()}} uses a deprecated method to create a SparkSession.
 
https://github.com/apache/spark/blob/2fe9af8b2b91d0a46782dd6fff57eca8609be105/python/pyspark/ml/common.py#L99
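
As a rough sketch of the intended migration (assuming the deprecated call in 
question is SQLContext.getOrCreate, which Spark 3.0 deprecated in favor of the 
builder; the actual change may differ):

{code:python}
from pyspark.sql import SparkSession

# Before (deprecated, assumed): obtain a session through the old SQLContext API.
#   session = SQLContext.getOrCreate(sc).sparkSession

# After: go through the SparkSession builder directly.
session = SparkSession.builder.getOrCreate()
{code}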



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37336) Migrate _java2py to SparkSession

2021-11-15 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-37336:
-
Summary: Migrate _java2py to SparkSession  (was: Migrate common ML utils to 
SparkSession)

> Migrate _java2py to SparkSession
> 
>
> Key: SPARK-37336
> URL: https://issues.apache.org/jira/browse/SPARK-37336
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 3.2.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> {{_java2py()}} uses a deprecated method to create a SparkSession.
>  
> https://github.com/apache/spark/blob/2fe9af8b2b91d0a46782dd6fff57eca8609be105/python/pyspark/ml/common.py#L99



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37336) Migrate _java2py to SparkSession

2021-11-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443990#comment-17443990
 ] 

Apache Spark commented on SPARK-37336:
--

User 'nchammas' has created a pull request for this issue:
https://github.com/apache/spark/pull/34606

> Migrate _java2py to SparkSession
> 
>
> Key: SPARK-37336
> URL: https://issues.apache.org/jira/browse/SPARK-37336
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 3.2.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> {{_java2py()}} uses a deprecated method to create a SparkSession.
>  
> https://github.com/apache/spark/blob/2fe9af8b2b91d0a46782dd6fff57eca8609be105/python/pyspark/ml/common.py#L99



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37336) Migrate _java2py to SparkSession

2021-11-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37336:


Assignee: (was: Apache Spark)

> Migrate _java2py to SparkSession
> 
>
> Key: SPARK-37336
> URL: https://issues.apache.org/jira/browse/SPARK-37336
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 3.2.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> {{_java2py()}} uses a deprecated method to create a SparkSession.
>  
> https://github.com/apache/spark/blob/2fe9af8b2b91d0a46782dd6fff57eca8609be105/python/pyspark/ml/common.py#L99



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37336) Migrate _java2py to SparkSession

2021-11-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37336:


Assignee: Apache Spark

> Migrate _java2py to SparkSession
> 
>
> Key: SPARK-37336
> URL: https://issues.apache.org/jira/browse/SPARK-37336
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 3.2.0
>Reporter: Nicholas Chammas
>Assignee: Apache Spark
>Priority: Minor
>
> {{_java2py()}} uses a deprecated method to create a SparkSession.
>  
> https://github.com/apache/spark/blob/2fe9af8b2b91d0a46782dd6fff57eca8609be105/python/pyspark/ml/common.py#L99



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37336) Migrate _java2py to SparkSession

2021-11-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443991#comment-17443991
 ] 

Apache Spark commented on SPARK-37336:
--

User 'nchammas' has created a pull request for this issue:
https://github.com/apache/spark/pull/34606

> Migrate _java2py to SparkSession
> 
>
> Key: SPARK-37336
> URL: https://issues.apache.org/jira/browse/SPARK-37336
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 3.2.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> {{_java2py()}} uses a deprecated method to create a SparkSession.
>  
> https://github.com/apache/spark/blob/2fe9af8b2b91d0a46782dd6fff57eca8609be105/python/pyspark/ml/common.py#L99



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37320) Delete py_container_checks.zip after the test in DepsTestsSuite finishes

2021-11-15 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-37320:
-
Component/s: Kubernetes
 (was: k8)

> Delete py_container_checks.zip after the test in DepsTestsSuite finishes
> 
>
> Key: SPARK-37320
> URL: https://issues.apache.org/jira/browse/SPARK-37320
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Tests
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 3.1.3, 3.2.1, 3.3.0
>
>
> When the K8s integration tests run, py_container_checks.zip still remains in 
> resource-managers/kubernetes/integration-tests/tests/.
> It is created in the test "Launcher python client dependencies using a zip 
> file" in DepsTestsSuite.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36038) Basic speculation metrics at stage level

2021-11-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444046#comment-17444046
 ] 

Apache Spark commented on SPARK-36038:
--

User 'thejdeep' has created a pull request for this issue:
https://github.com/apache/spark/pull/34607

> Basic speculation metrics at stage level
> 
>
> Key: SPARK-36038
> URL: https://issues.apache.org/jira/browse/SPARK-36038
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.2
>Reporter: Venkata krishnan Sowrirajan
>Priority: Major
> Fix For: 3.3.0
>
>
> Currently there are no speculation metrics available at either the 
> application level or the stage level. Within our platform, we have added 
> speculation metrics at the stage level as a summary, similar to the existing 
> stage-level metrics, tracking numTotalSpeculated, numCompleted (successful), 
> numFailed, numKilled, etc. This enables us to effectively understand the 
> speculative execution feature at an application level and helps in further 
> tuning the speculation configs.
> cc [~ron8hu]
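
For reference, these are the standard speculation knobs that such metrics 
would help tune; a sketch with arbitrary values:

{code:python}
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Enable speculative re-execution of straggler tasks.
    .config("spark.speculation", "true")
    # How often to check for stragglers.
    .config("spark.speculation.interval", "100ms")
    # A task is a straggler if it runs this many times slower than the median.
    .config("spark.speculation.multiplier", "1.5")
    # Only check once this fraction of tasks in the stage has finished.
    .config("spark.speculation.quantile", "0.75")
    .getOrCreate()
)
{code}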



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37337) Improve the API of Spark DataFrame to pandas-on-Spark DataFrame conversion

2021-11-15 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-37337:


 Summary: Improve the API of Spark DataFrame to pandas-on-Spark 
DataFrame conversion
 Key: SPARK-37337
 URL: https://issues.apache.org/jira/browse/SPARK-37337
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.3.0
Reporter: Xinrong Meng


Undeprecate (Spark)DataFrame.to_koalas 

Rename (Spark)DataFrame.to_pandas_like to (Spark)DataFrame.pandas_api
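
A sketch of the proposed surface (hypothetical until the rename lands; 
to_pandas_on_spark is the existing Spark 3.2 name):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sdf = spark.range(3)                 # a plain Spark DataFrame

psdf = sdf.to_pandas_on_spark()      # existing, verbose name
# psdf = sdf.pandas_api()            # proposed, shorter name
print(type(psdf))                    # pyspark.pandas.frame.DataFrame
{code}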



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37338) Rename to_pandas_on_spark to pandas_api

2021-11-15 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-37338:


 Summary: Rename to_pandas_on_spark to pandas_api
 Key: SPARK-37338
 URL: https://issues.apache.org/jira/browse/SPARK-37338
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.3.0
Reporter: Xinrong Meng


Rename to_pandas_on_spark to pandas_api for API usability



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37338) Rename to_pandas_on_spark to pandas_api

2021-11-15 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-37338:
-
Description: 
Currently, (Spark)DataFrame.to_pandas_on_spark is too long to memorize and 
inconvenient to call.

So we want to rename to_pandas_on_spark to pandas_api for better API usability.

  was:Rename to_pandas_on_spark to pandas_api for API usability


> Rename to_pandas_on_spark to pandas_api
> ---
>
> Key: SPARK-37338
> URL: https://issues.apache.org/jira/browse/SPARK-37338
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Currently, (Spark)DataFrame.to_pandas_on_spark is too long to memorize and 
> inconvenient to call.
> So we want to rename to_pandas_on_spark to pandas_api for better API usability.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37338) Rename (Spark)DataFrame.to_pandas_on_spark to (Spark)DataFrame.pandas_api

2021-11-15 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-37338:
-
Summary: Rename (Spark)DataFrame.to_pandas_on_spark to 
(Spark)DataFrame.pandas_api  (was: Rename to_pandas_on_spark to pandas_api)

> Rename (Spark)DataFrame.to_pandas_on_spark to (Spark)DataFrame.pandas_api
> -
>
> Key: SPARK-37338
> URL: https://issues.apache.org/jira/browse/SPARK-37338
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Currently, (Spark)DataFrame.to_pandas_on_spark is too long to memorize and 
> inconvenient to call.
> So we want to rename to_pandas_on_spark to pandas_api for better API usability.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37338) Rename (Spark)DataFrame.to_pandas_on_spark to (Spark)DataFrame.pandas_api

2021-11-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444081#comment-17444081
 ] 

Apache Spark commented on SPARK-37338:
--

User 'xinrong-databricks' has created a pull request for this issue:
https://github.com/apache/spark/pull/34608

> Rename (Spark)DataFrame.to_pandas_on_spark to (Spark)DataFrame.pandas_api
> -
>
> Key: SPARK-37338
> URL: https://issues.apache.org/jira/browse/SPARK-37338
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Currently, (Spark)DataFrame.to_pandas_on_spark is too long to memorize and 
> inconvenient to call.
> So we want to rename to_pandas_on_spark to pandas_api for better API usability.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37338) Rename (Spark)DataFrame.to_pandas_on_spark to (Spark)DataFrame.pandas_api

2021-11-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37338:


Assignee: Apache Spark

> Rename (Spark)DataFrame.to_pandas_on_spark to (Spark)DataFrame.pandas_api
> -
>
> Key: SPARK-37338
> URL: https://issues.apache.org/jira/browse/SPARK-37338
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Assignee: Apache Spark
>Priority: Major
>
> Currently, (Spark)DataFrame.to_pandas_on_spark is too long to memorize and 
> inconvenient to call.
> So we want to rename to_pandas_on_spark to pandas_api for better API usability.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37338) Rename (Spark)DataFrame.to_pandas_on_spark to (Spark)DataFrame.pandas_api

2021-11-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37338:


Assignee: (was: Apache Spark)

> Rename (Spark)DataFrame.to_pandas_on_spark to (Spark)DataFrame.pandas_api
> -
>
> Key: SPARK-37338
> URL: https://issues.apache.org/jira/browse/SPARK-37338
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Currently, (Spark)DataFrame.to_pandas_on_spark is too long to memorize and 
> inconvenient to call.
> So we want to rename to_pandas_on_spark to pandas_api for better API usability.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37339) Add `spark-version` label to driver and executor pods

2021-11-15 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-37339:
-

 Summary: Add `spark-version` label to driver and executor pods
 Key: SPARK-37339
 URL: https://issues.apache.org/jira/browse/SPARK-37339
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 3.3.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37324) Support Decimal RoundingMode.UP, DOWN, HALF_DOWN

2021-11-15 Thread Sathiya Kumar (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sathiya Kumar updated SPARK-37324:
--
Description: 
Currently we support only the Decimal rounding modes HALF_UP (round) and 
HALF_EVEN (bround), but we have use cases that need RoundingMode.UP and 
RoundingMode.DOWN. In our projects we use a UDF; I also see a few people doing 
complex operations to achieve the same with Spark native methods.

[https://stackoverflow.com/questions/34888419/round-down-double-in-spark/40476117]

[https://stackoverflow.com/questions/54683066/is-there-a-rounddown-function-in-sql-as-there-is-in-excel]

[https://stackoverflow.com/questions/48279641/oracle-sql-round-half]

 

Opening up support for the other rounding modes might interest a lot of use 
cases.

*SAP HANA's SQL ROUND function supports this:*
{code:java}
ROUND(<n> [, <pos> [, <rounding_mode>]]){code}
REF : 
[https://help.sap.com/viewer/7c78579ce9b14a669c1f3295b0d8ca16/Cloud/en-US/20e6a27575191014bd54a07fd86c585d.html]


*SQL Server does something similar:*
{code:java}
ROUND ( numeric_expression , length [ ,function ] ){code}
REF : 
[https://docs.microsoft.com/en-us/sql/t-sql/functions/round-transact-sql?view=sql-server-ver15]
 

 

  was:
Currently we support only Decimal RoundingModes : HALF_UP (round) and HALF_EVEN 
(bround). But we have use cases that needs RoundingMode.UP and 
RoundingMode.DOWN. In our projects we use UDF, i also see few people do complex 
operations to do the same with spark native methods.

[https://stackoverflow.com/questions/34888419/round-down-double-in-spark/40476117]

[https://stackoverflow.com/questions/54683066/is-there-a-rounddown-function-in-sql-as-there-is-in-excel]

[https://stackoverflow.com/questions/48279641/oracle-sql-round-half]

 

Opening support for the other rounding modes might interest a lot of use cases. 
Sql Server does something similar to this : 
[https://docs.microsoft.com/en-us/sql/t-sql/functions/round-transact-sql?view=sql-server-ver15]
 


> Support Decimal RoundingMode.UP, DOWN, HALF_DOWN
> 
>
> Key: SPARK-37324
> URL: https://issues.apache.org/jira/browse/SPARK-37324
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Sathiya Kumar
>Priority: Minor
>
> Currently we support only the Decimal rounding modes HALF_UP (round) and 
> HALF_EVEN (bround), but we have use cases that need RoundingMode.UP and 
> RoundingMode.DOWN. In our projects we use a UDF; I also see a few people 
> doing complex operations to achieve the same with Spark native methods.
> [https://stackoverflow.com/questions/34888419/round-down-double-in-spark/40476117]
> [https://stackoverflow.com/questions/54683066/is-there-a-rounddown-function-in-sql-as-there-is-in-excel]
> [https://stackoverflow.com/questions/48279641/oracle-sql-round-half]
>  
> Opening up support for the other rounding modes might interest a lot of use 
> cases.
> *SAP HANA's SQL ROUND function supports this:*
> {code:java}
> ROUND(<n> [, <pos> [, <rounding_mode>]]){code}
> REF : 
> [https://help.sap.com/viewer/7c78579ce9b14a669c1f3295b0d8ca16/Cloud/en-US/20e6a27575191014bd54a07fd86c585d.html]
> *SQL Server does something similar:*
> {code:java}
> ROUND ( numeric_expression , length [ ,function ] ){code}
> REF : 
> [https://docs.microsoft.com/en-us/sql/t-sql/functions/round-transact-sql?view=sql-server-ver15]
>  
>  
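
For illustration, a PySpark sketch of what exists today (round/bround) next to 
the kind of scaling workaround the description alludes to; note that floor and 
ceil only match RoundingMode.DOWN and RoundingMode.UP for non-negative values:

{code:python}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(2.345,), (2.355,)], ["x"])

df.select(
    F.round("x", 2).alias("half_up"),     # RoundingMode.HALF_UP
    F.bround("x", 2).alias("half_even"),  # RoundingMode.HALF_EVEN
    # Workaround for RoundingMode.DOWN at 2 decimal places: scale, floor, unscale.
    (F.floor(F.col("x") * 100) / 100).alias("down"),
    # Workaround for RoundingMode.UP at 2 decimal places.
    (F.ceil(F.col("x") * 100) / 100).alias("up"),
).show()
{code}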



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37339) Add `spark-version` label to driver and executor pods

2021-11-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37339:


Assignee: (was: Apache Spark)

> Add `spark-version` label to driver and executor pods
> -
>
> Key: SPARK-37339
> URL: https://issues.apache.org/jira/browse/SPARK-37339
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37339) Add `spark-version` label to driver and executor pods

2021-11-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444087#comment-17444087
 ] 

Apache Spark commented on SPARK-37339:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/34609

> Add `spark-version` label to driver and executor pods
> -
>
> Key: SPARK-37339
> URL: https://issues.apache.org/jira/browse/SPARK-37339
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37339) Add `spark-version` label to driver and executor pods

2021-11-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37339:


Assignee: Apache Spark

> Add `spark-version` label to driver and executor pods
> -
>
> Key: SPARK-37339
> URL: https://issues.apache.org/jira/browse/SPARK-37339
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34332) Unify v1 and v2 ALTER TABLE .. SET LOCATION tests

2021-11-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34332:


Assignee: Max Gekk  (was: Apache Spark)

> Unify v1 and v2 ALTER TABLE .. SET LOCATION tests
> -
>
> Key: SPARK-34332
> URL: https://issues.apache.org/jira/browse/SPARK-34332
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.3.0
>
>
> Extract ALTER TABLE .. SET LOCATION tests to the common place to run them for 
> V1 and v2 datasources. Some tests can be places to V1 and V2 specific test 
> suites.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34332) Unify v1 and v2 ALTER TABLE .. SET LOCATION tests

2021-11-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34332:


Assignee: Apache Spark  (was: Max Gekk)

> Unify v1 and v2 ALTER TABLE .. SET LOCATION tests
> -
>
> Key: SPARK-34332
> URL: https://issues.apache.org/jira/browse/SPARK-34332
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.3.0
>
>
> Extract ALTER TABLE .. SET LOCATION tests to the common place to run them for 
> V1 and v2 datasources. Some tests can be places to V1 and V2 specific test 
> suites.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34332) Unify v1 and v2 ALTER TABLE .. SET LOCATION tests

2021-11-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444095#comment-17444095
 ] 

Apache Spark commented on SPARK-34332:
--

User 'imback82' has created a pull request for this issue:
https://github.com/apache/spark/pull/34610

> Unify v1 and v2 ALTER TABLE .. SET LOCATION tests
> -
>
> Key: SPARK-34332
> URL: https://issues.apache.org/jira/browse/SPARK-34332
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.3.0
>
>
> Extract ALTER TABLE .. SET LOCATION tests to the common place to run them for 
> V1 and v2 datasources. Some tests can be places to V1 and V2 specific test 
> suites.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37181) pyspark.pandas.read_csv() should support latin-1 encoding

2021-11-15 Thread Chuck Connell (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444117#comment-17444117
 ] 

Chuck Connell commented on SPARK-37181:
---

That would be a good solution, just convert latin-1 silently to ISO-8859-1. 
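
In the meantime, a sketch of the workaround (assuming a local data.csv and 
that pyspark.pandas forwards the encoding option to Spark's CSV reader; 
latin-1 and ISO-8859-1 name the same byte-for-byte encoding):

{code:python}
import pyspark.pandas as ps

# Works today: the Java charset name is recognized.
df = ps.read_csv("data.csv", encoding="ISO-8859-1")

# Desired but currently rejected: the pandas-style alias.
# df = ps.read_csv("data.csv", encoding="latin-1")
{code}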

> pyspark.pandas.read_csv() should support latin-1 encoding
> -
>
> Key: SPARK-37181
> URL: https://issues.apache.org/jira/browse/SPARK-37181
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Chuck Connell
>Priority: Major
>
> In regular pandas, you can say read_csv(encoding='latin-1'). This encoding 
> is not recognized in pyspark.pandas. You have to use Windows-1252 instead, 
> which is almost the same but not identical.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35867) Enable vectorized read for VectorizedPlainValuesReader.readBooleans

2021-11-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35867:


Assignee: (was: Apache Spark)

> Enable vectorized read for VectorizedPlainValuesReader.readBooleans
> ---
>
> Key: SPARK-35867
> URL: https://issues.apache.org/jira/browse/SPARK-35867
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Priority: Minor
>
> Currently we decode PLAIN encoded booleans as follows:
> {code:java}
>   public final void readBooleans(int total, WritableColumnVector c, int 
> rowId) {
> // TODO: properly vectorize this
> for (int i = 0; i < total; i++) {
>   c.putBoolean(rowId + i, readBoolean());
> }
>   }
> {code}
> Ideally we should vectorize this.
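
For intuition: Parquet's PLAIN encoding bit-packs booleans LSB-first, so a 
vectorized reader can unpack a whole byte at a time instead of decoding one 
bit per call. An illustrative Python sketch of the unpacking idea (not the 
Java implementation):

{code:python}
def read_booleans(data: bytes, total: int) -> list:
    # Unpack `total` bit-packed PLAIN booleans (LSB first) from `data`.
    out = []
    for byte_index in range((total + 7) // 8):
        b = data[byte_index]
        # Up to 8 values per byte; the final byte may be partially filled.
        for bit in range(min(8, total - 8 * byte_index)):
            out.append(bool((b >> bit) & 1))
    return out

# 0b00000101 encodes [True, False, True] when total=3.
assert read_booleans(bytes([0b00000101]), 3) == [True, False, True]
{code}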



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35867) Enable vectorized read for VectorizedPlainValuesReader.readBooleans

2021-11-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444135#comment-17444135
 ] 

Apache Spark commented on SPARK-35867:
--

User 'kazuyukitanimura' has created a pull request for this issue:
https://github.com/apache/spark/pull/34611

> Enable vectorized read for VectorizedPlainValuesReader.readBooleans
> ---
>
> Key: SPARK-35867
> URL: https://issues.apache.org/jira/browse/SPARK-35867
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Priority: Minor
>
> Currently we decode PLAIN encoded booleans as follows:
> {code:java}
>   public final void readBooleans(int total, WritableColumnVector c, int 
> rowId) {
> // TODO: properly vectorize this
> for (int i = 0; i < total; i++) {
>   c.putBoolean(rowId + i, readBoolean());
> }
>   }
> {code}
> Ideally we should vectorize this.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35867) Enable vectorized read for VectorizedPlainValuesReader.readBooleans

2021-11-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35867:


Assignee: Apache Spark

> Enable vectorized read for VectorizedPlainValuesReader.readBooleans
> ---
>
> Key: SPARK-35867
> URL: https://issues.apache.org/jira/browse/SPARK-35867
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Assignee: Apache Spark
>Priority: Minor
>
> Currently we decode PLAIN encoded booleans as follows:
> {code:java}
>   public final void readBooleans(int total, WritableColumnVector c, int 
> rowId) {
> // TODO: properly vectorize this
> for (int i = 0; i < total; i++) {
>   c.putBoolean(rowId + i, readBoolean());
> }
>   }
> {code}
> Ideally we should vectorize this.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37339) Add `spark-version` label to driver and executor pods

2021-11-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-37339:
-

Assignee: Dongjoon Hyun

> Add `spark-version` label to driver and executor pods
> -
>
> Key: SPARK-37339
> URL: https://issues.apache.org/jira/browse/SPARK-37339
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37339) Add `spark-version` label to driver and executor pods

2021-11-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-37339.
---
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34609
[https://github.com/apache/spark/pull/34609]

> Add `spark-version` label to driver and executor pods
> -
>
> Key: SPARK-37339
> URL: https://issues.apache.org/jira/browse/SPARK-37339
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37340) Display StageIds in Operators for SQL UI

2021-11-15 Thread Yian Liou (Jira)
Yian Liou created SPARK-37340:
-

 Summary: Display StageIds in Operators for SQL UI
 Key: SPARK-37340
 URL: https://issues.apache.org/jira/browse/SPARK-37340
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 3.2.0
Reporter: Yian Liou


This proposes a more generalized solution to 
https://issues.apache.org/jira/browse/SPARK-30209, where a stageId -> operator 
mapping is built with the following algorithm (see the sketch below):

 1. Read the SparkGraph to get every node's name and its AccumulatorIDs.
 2. Get each stage's AccumulatorIDs.
 3. Map operators to stages by checking for a non-empty intersection of the 
AccumulatorIDs from steps 1 and 2.
 4. Connect SparkGraphNodes to their respective StageIDs for rendering in the 
SQL UI.

As a result, some operators without max metrics values will also have stageIds 
in the UI. This Jira also aims to add minor enhancements to the SQL UI tab.
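
An illustrative sketch of step 3 in plain Python over made-up IDs (not the 
Spark implementation):

{code:python}
def map_operators_to_stages(node_accums, stage_accums):
    # node_accums:  {node name -> set of accumulator IDs}   (step 1)
    # stage_accums: {stage ID  -> set of accumulator IDs}   (step 2)
    # Returns {node name -> set of stage IDs} via non-empty intersections (step 3).
    return {
        node: {stage for stage, accs in stage_accums.items() if ids & accs}
        for node, ids in node_accums.items()
    }

# Made-up example: the Exchange node shares accumulators with stages 0 and 1.
print(map_operators_to_stages(
    {"HashAggregate": {1, 2}, "Exchange": {3, 4}},
    {0: {1, 3}, 1: {4, 5}},
))
# {'HashAggregate': {0}, 'Exchange': {0, 1}}
{code}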



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37340) Display StageIds in Operators for SQL UI

2021-11-15 Thread Yian Liou (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444168#comment-17444168
 ] 

Yian Liou commented on SPARK-37340:
---

Will be working on this issue and opening a pull request.

> Display StageIds in Operators for SQL UI
> 
>
> Key: SPARK-37340
> URL: https://issues.apache.org/jira/browse/SPARK-37340
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.2.0
>Reporter: Yian Liou
>Priority: Major
>
> This proposes a more generalized solution to 
> https://issues.apache.org/jira/browse/SPARK-30209, where a stageId -> 
> operator mapping is built with the following algorithm.
>  1. Read the SparkGraph to get every node's name and its AccumulatorIDs.
>  2. Get each stage's AccumulatorIDs.
>  3. Map operators to stages by checking for a non-empty intersection of the 
> AccumulatorIDs from steps 1 and 2.
>  4. Connect SparkGraphNodes to their respective StageIDs for rendering in 
> the SQL UI.
> As a result, some operators without max metrics values will also have 
> stageIds in the UI. This Jira also aims to add minor enhancements to the SQL 
> UI tab.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31646) Remove unused registeredConnections counter from ShuffleMetrics

2021-11-15 Thread Yongjun Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444173#comment-17444173
 ] 

Yongjun Zhang commented on SPARK-31646:
---

HI [~mauzhang] , wonder if you have been monitoring the metrics 
activeConnections and registeredConnections, somehow I observed 
registeredConnections is smaller than activeConenctions, I thought it should be 
the opposite. I also asked here:

https://issues.apache.org/jira/browse/SPARK-25642?focusedCommentId=17442924&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17442924

Thanks.

> Remove unused registeredConnections counter from ShuffleMetrics
> ---
>
> Key: SPARK-31646
> URL: https://issues.apache.org/jira/browse/SPARK-31646
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Shuffle, Spark Core
>Affects Versions: 3.0.0
>Reporter: Manu Zhang
>Assignee: Manu Zhang
>Priority: Minor
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-31646) Remove unused registeredConnections counter from ShuffleMetrics

2021-11-15 Thread Yongjun Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444173#comment-17444173
 ] 

Yongjun Zhang edited comment on SPARK-31646 at 11/15/21, 11:17 PM:
---

Hi [~mauzhang], I wonder if you have been monitoring the metrics 
activeConnections and registeredConnections. Somehow I observed that 
registeredConnections is smaller than activeConnections; I thought it should 
be the opposite. I also asked here:

https://issues.apache.org/jira/browse/SPARK-25642?focusedCommentId=17442924&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17442924

Thanks.


was (Author: yzhangal):
HI [~mauzhang] , wonder if you have been monitoring the metrics 
activeConnections and registeredConnections, somehow I observed 
registeredConnections is smaller than activeConenctions, I thought it should be 
the opposite. I also asked here:

https://issues.apache.org/jira/browse/SPARK-25642?focusedCommentId=17442924&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17442924

Thanks.

> Remove unused registeredConnections counter from ShuffleMetrics
> ---
>
> Key: SPARK-31646
> URL: https://issues.apache.org/jira/browse/SPARK-31646
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Shuffle, Spark Core
>Affects Versions: 3.0.0
>Reporter: Manu Zhang
>Assignee: Manu Zhang
>Priority: Minor
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37341) Avoid unnecessary buffer and copy in full outer sort merge join

2021-11-15 Thread Cheng Su (Jira)
Cheng Su created SPARK-37341:


 Summary: Avoid unnecessary buffer and copy in full outer sort 
merge join
 Key: SPARK-37341
 URL: https://issues.apache.org/jira/browse/SPARK-37341
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: Cheng Su


FULL OUTER sort merge join (non-code-gen path) copies join keys and buffers 
input rows even when the rows from the two sides do not have matching keys 
([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala#L1637-L1641]
 ). This is unnecessary: we can just output the row with the smaller join key, 
and only buffer when both sides have matching keys. This would save us from 
unnecessary copying and buffering when the two join sides have many rows that 
do not match each other.
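
An illustrative Python sketch of the idea on pre-sorted lists (not Spark's 
implementation): rows with unmatched keys are streamed out immediately, and 
buffering happens only for groups with equal keys.

{code:python}
def full_outer_merge_join(left, right, key=lambda row: row[0]):
    # Full outer sort-merge join over rows pre-sorted by their join key.
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        kl, kr = key(left[i]), key(right[j])
        if kl < kr:
            out.append((left[i], None))   # unmatched: emit now, no buffering
            i += 1
        elif kl > kr:
            out.append((None, right[j]))  # unmatched: emit now, no buffering
            j += 1
        else:
            # Equal keys: the only case that needs buffering, because every
            # left row in the group pairs with every right row in the group.
            li, rj = i, j
            while i < len(left) and key(left[i]) == kl:
                i += 1
            while j < len(right) and key(right[j]) == kr:
                j += 1
            out.extend((l, r) for l in left[li:i] for r in right[rj:j])
    out.extend((l, None) for l in left[i:])
    out.extend((None, r) for r in right[j:])
    return out

# Mostly non-overlapping keys: only the key-3 rows are buffered.
print(full_outer_merge_join([(1, "a"), (3, "b")], [(2, "x"), (3, "y")]))
# [((1, 'a'), None), (None, (2, 'x')), ((3, 'b'), (3, 'y'))]
{code}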



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37341) Avoid unnecessary buffer and copy in full outer sort merge join

2021-11-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37341:


Assignee: Apache Spark

> Avoid unnecessary buffer and copy in full outer sort merge join
> ---
>
> Key: SPARK-37341
> URL: https://issues.apache.org/jira/browse/SPARK-37341
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Cheng Su
>Assignee: Apache Spark
>Priority: Minor
>
> FULL OUTER sort merge join (non-code-gen path) copies join keys and buffers 
> input rows even when the rows from the two sides do not have matching keys 
> ([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala#L1637-L1641]
>  ). This is unnecessary: we can just output the row with the smaller join 
> key, and only buffer when both sides have matching keys. This would save us 
> from unnecessary copying and buffering when the two join sides have many 
> rows that do not match each other.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37341) Avoid unnecessary buffer and copy in full outer sort merge join

2021-11-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37341:


Assignee: (was: Apache Spark)

> Avoid unnecessary buffer and copy in full outer sort merge join
> ---
>
> Key: SPARK-37341
> URL: https://issues.apache.org/jira/browse/SPARK-37341
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Cheng Su
>Priority: Minor
>
> FULL OUTER sort merge join (non-code-gen path) copies join keys and buffers 
> input rows even when the rows from the two sides do not have matching keys 
> ([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala#L1637-L1641]
>  ). This is unnecessary: we can just output the row with the smaller join 
> key, and only buffer when both sides have matching keys. This would save us 
> from unnecessary copying and buffering when the two join sides have many 
> rows that do not match each other.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37341) Avoid unnecessary buffer and copy in full outer sort merge join

2021-11-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444188#comment-17444188
 ] 

Apache Spark commented on SPARK-37341:
--

User 'c21' has created a pull request for this issue:
https://github.com/apache/spark/pull/34612

> Avoid unnecessary buffer and copy in full outer sort merge join
> ---
>
> Key: SPARK-37341
> URL: https://issues.apache.org/jira/browse/SPARK-37341
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Cheng Su
>Priority: Minor
>
> FULL OUTER sort merge join (non-code-gen path) copies join keys and buffers 
> input rows even when the rows from the two sides do not have matching keys 
> ([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala#L1637-L1641]
>  ). This is unnecessary: we can just output the row with the smaller join 
> key, and only buffer when both sides have matching keys. This would save us 
> from unnecessary copying and buffering when the two join sides have many 
> rows that do not match each other.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37341) Avoid unnecessary buffer and copy in full outer sort merge join

2021-11-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444190#comment-17444190
 ] 

Apache Spark commented on SPARK-37341:
--

User 'c21' has created a pull request for this issue:
https://github.com/apache/spark/pull/34612

> Avoid unnecessary buffer and copy in full outer sort merge join
> ---
>
> Key: SPARK-37341
> URL: https://issues.apache.org/jira/browse/SPARK-37341
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Cheng Su
>Priority: Minor
>
> FULL OUTER sort merge join (non-code-gen path) copies join keys and buffers 
> input rows even when the rows from the two sides do not have matching keys 
> ([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala#L1637-L1641]
>  ). This is unnecessary: we can just output the row with the smaller join 
> key, and only buffer when both sides have matching keys. This would save us 
> from unnecessary copying and buffering when the two join sides have many 
> rows that do not match each other.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37342) Upgrade Apache Arrow to 6.0.0

2021-11-15 Thread Chao Sun (Jira)
Chao Sun created SPARK-37342:


 Summary: Upgrade Apache Arrow to 6.0.0
 Key: SPARK-37342
 URL: https://issues.apache.org/jira/browse/SPARK-37342
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.3.0
Reporter: Chao Sun


Spark is still using Apache Arrow 2.0.0 while 6.0.0 was already released last 
month.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37342) Upgrade Apache Arrow to 6.0.0

2021-11-15 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-37342:
-
Component/s: Build
 (was: Spark Core)

> Upgrade Apache Arrow to 6.0.0
> -
>
> Key: SPARK-37342
> URL: https://issues.apache.org/jira/browse/SPARK-37342
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Chao Sun
>Priority: Major
>
> Spark is still using Apache Arrow 2.0.0 while 6.0.0 was already released last 
> month.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37342) Upgrade Apache Arrow to 6.0.0

2021-11-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37342:


Assignee: Apache Spark

> Upgrade Apache Arrow to 6.0.0
> -
>
> Key: SPARK-37342
> URL: https://issues.apache.org/jira/browse/SPARK-37342
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Chao Sun
>Assignee: Apache Spark
>Priority: Major
>
> Spark is still using Apache Arrow 2.0.0 while 6.0.0 was already released last 
> month.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37342) Upgrade Apache Arrow to 6.0.0

2021-11-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444194#comment-17444194
 ] 

Apache Spark commented on SPARK-37342:
--

User 'sunchao' has created a pull request for this issue:
https://github.com/apache/spark/pull/34613

> Upgrade Apache Arrow to 6.0.0
> -
>
> Key: SPARK-37342
> URL: https://issues.apache.org/jira/browse/SPARK-37342
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Chao Sun
>Priority: Major
>
> Spark is still using Apache Arrow 2.0.0 while 6.0.0 was already released last 
> month.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37342) Upgrade Apache Arrow to 6.0.0

2021-11-15 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37342:


Assignee: (was: Apache Spark)

> Upgrade Apache Arrow to 6.0.0
> -
>
> Key: SPARK-37342
> URL: https://issues.apache.org/jira/browse/SPARK-37342
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Chao Sun
>Priority: Major
>
> Spark is still using Apache Arrow 2.0.0 while 6.0.0 was already released last 
> month.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37342) Upgrade Apache Arrow to 6.0.0

2021-11-15 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444195#comment-17444195
 ] 

Apache Spark commented on SPARK-37342:
--

User 'sunchao' has created a pull request for this issue:
https://github.com/apache/spark/pull/34613

> Upgrade Apache Arrow to 6.0.0
> -
>
> Key: SPARK-37342
> URL: https://issues.apache.org/jira/browse/SPARK-37342
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Chao Sun
>Priority: Major
>
> Spark is still using Apache Arrow 2.0.0 while 6.0.0 was already released last 
> month.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37335) Clarify output of FPGrowth

2021-11-15 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-37335.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34605
[https://github.com/apache/spark/pull/34605]

> Clarify output of FPGrowth
> --
>
> Key: SPARK-37335
> URL: https://issues.apache.org/jira/browse/SPARK-37335
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML
>Affects Versions: 3.2.0
>Reporter: Nicholas Chammas
>Assignee: Nicholas Chammas
>Priority: Minor
> Fix For: 3.3.0
>
>
> The association rules returned by FPGrowth include more columns than are 
> documented, like {{lift}}:
> [https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html]
> We should offer a basic description of these columns. An _itemset_ should 
> also be briefly defined.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37335) Clarify output of FPGrowth

2021-11-15 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-37335:


Assignee: Nicholas Chammas

> Clarify output of FPGrowth
> --
>
> Key: SPARK-37335
> URL: https://issues.apache.org/jira/browse/SPARK-37335
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML
>Affects Versions: 3.2.0
>Reporter: Nicholas Chammas
>Assignee: Nicholas Chammas
>Priority: Minor
>
> The association rules returned by FPGrowth include more columns than are 
> documented, like {{lift}}:
> [https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html]
> We should offer a basic description of these columns. An _itemset_ should 
> also be briefly defined.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37343) Implement createIndex and IndexExists in JDBC (Postgres dialect)

2021-11-15 Thread dch nguyen (Jira)
dch nguyen created SPARK-37343:
--

 Summary: Implement createIndex and IndexExists in JDBC (Postgres 
dialect)
 Key: SPARK-37343
 URL: https://issues.apache.org/jira/browse/SPARK-37343
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.3.0
Reporter: dch nguyen






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37343) Implement createIndex and IndexExists in JDBC (Postgres dialect)

2021-11-15 Thread dch nguyen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444210#comment-17444210
 ] 

dch nguyen commented on SPARK-37343:


I'm working on this.

> Implement createIndex and IndexExists in JDBC (Postgres dialect)
> 
>
> Key: SPARK-37343
> URL: https://issues.apache.org/jira/browse/SPARK-37343
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: dch nguyen
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37344) split function behave differently between spark 2.3 and spark 3.2

2021-11-15 Thread ocean (Jira)
ocean created SPARK-37344:
-

 Summary: split function behave differently between spark 2.3 and 
spark 3.2
 Key: SPARK-37344
 URL: https://issues.apache.org/jira/browse/SPARK-37344
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.0, 3.1.2, 3.1.1
Reporter: ocean


When using the split function in SQL, it behaves differently between 2.3 and 
3.2, which causes incorrect results.

We can use this SQL to reproduce the problem:

create table split_test ( id int,name string)
insert into split_test values(1,"abc;def")
explain extended select split(name,';') from split_test

spark3:

spark-sql> Explain extended select split(name,';') from split_test;
== Parsed Logical Plan ==
'Project [unresolvedalias('split('name, \\;), None)]
+- 'UnresolvedRelation [split_test], [], false

spark2:

spark-sql> Explain extended select split(name,';') from split_test;
== Parsed Logical Plan ==
'Project [unresolvedalias('split('name, \;), None)]
+- 'UnresolvedRelation split_test

It looks like the handling of escape characters is different.
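
One way to probe where the divergence comes from (a hedged sketch, not a confirmed root cause: spark.sql.parser.escapedStringLiterals toggles how the SQL parser escapes string literals, which regex arguments such as split's second parameter are sensitive to):

{code}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("CREATE TABLE IF NOT EXISTS split_test (id INT, name STRING)")
spark.sql("INSERT INTO split_test VALUES (1, 'abc;def')")

# Compare both literal-escaping modes; if the results differ, the
# 2.3-vs-3.2 gap is down to string-literal escape handling.
for flag in ("false", "true"):
    spark.conf.set("spark.sql.parser.escapedStringLiterals", flag)
    row = spark.sql("SELECT split(name, ';') FROM split_test").first()
    print(flag, row[0])
{code}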



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37344) split function behave differently between spark 2.3 and spark 3.2

2021-11-15 Thread ocean (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ocean updated SPARK-37344:
--
Labels: incorrect  (was: )

> split function behave differently between spark 2.3 and spark 3.2
> -
>
> Key: SPARK-37344
> URL: https://issues.apache.org/jira/browse/SPARK-37344
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1, 3.1.2, 3.2.0
>Reporter: ocean
>Priority: Major
>  Labels: incorrect
>
> When using the split function in SQL, it behaves differently between 2.3 and 
> 3.2, which causes incorrect results.
> We can use this SQL to reproduce the problem:
>  
> create table split_test ( id int,name string)
> insert into split_test values(1,"abc;def")
> explain extended select split(name,';') from split_test
>  
> spark3:
> spark-sql> Explain extended select split(name,';') from split_test;
> == Parsed Logical Plan ==
> 'Project [unresolvedalias('split('name, \\;), None)]
> +- 'UnresolvedRelation [split_test], [], false
>  
> spark2:
>  
> spark-sql> Explain extended select split(name,';') from split_test;
> == Parsed Logical Plan ==
> 'Project [unresolvedalias('split('name, \;), None)]
> +- 'UnresolvedRelation split_test
>  
> It looks like the handling of escape characters is different.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37344) split function behave differently between spark 2.3 and spark 3.2

2021-11-15 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444225#comment-17444225
 ] 

angerszhu commented on SPARK-37344:
---

Work on this


> split function behave differently between spark 2.3 and spark 3.2
> -
>
> Key: SPARK-37344
> URL: https://issues.apache.org/jira/browse/SPARK-37344
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1, 3.1.2, 3.2.0
>Reporter: ocean
>Priority: Major
>  Labels: incorrect
>
> When using the split function in SQL, it behaves differently between 2.3 and 
> 3.2, which causes incorrect results.
> We can use this SQL to reproduce the problem:
>  
> create table split_test ( id int,name string)
> insert into split_test values(1,"abc;def")
> explain extended select split(name,';') from split_test
>  
> spark3:
> spark-sql> Explain extended select split(name,';') from split_test;
> == Parsed Logical Plan ==
> 'Project [unresolvedalias('split('name, \\;), None)]
> +- 'UnresolvedRelation [split_test], [], false
>  
> spark2:
>  
> spark-sql> Explain extended select split(name,';') from split_test;
> == Parsed Logical Plan ==
> 'Project [unresolvedalias('split('name, \;), None)]
> +- 'UnresolvedRelation split_test
>  
> It looks like the handling of escape characters is different.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-37181) pyspark.pandas.read_csv() should support latin-1 encoding

2021-11-15 Thread Yikun Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444235#comment-17444235
 ] 

Yikun Jiang edited comment on SPARK-37181 at 11/16/21, 2:41 AM:


Agree. Actually, CPython's internal implementation does the same conversion 
for latin-1, so I think it's fine to do the same conversion here.

 

[1] 
[https://github.com/python/cpython/blob/9bf2cbc4c498812e14f20d86acb61c53928a5a57/Lib/encodings/latin_1.py#L43]


was (Author: yikunkero):
Actually, CPython's internal implementation does the same conversion for 
latin-1, so I think it's fine to do the same conversion here.

 

[1] 
https://github.com/python/cpython/blob/9bf2cbc4c498812e14f20d86acb61c53928a5a57/Lib/encodings/latin_1.py#L43

> pyspark.pandas.read_csv() should support latin-1 encoding
> -
>
> Key: SPARK-37181
> URL: https://issues.apache.org/jira/browse/SPARK-37181
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Chuck Connell
>Priority: Major
>
> {{In regular pandas, you can say read_csv(encoding='latin-1'). This encoding 
> is not recognized in pyspark.pandas. You have to use Windows-1252 instead, 
> which is almost the same but not identical. }}
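
To make "almost the same but not identical" concrete: the two encodings diverge only in bytes 0x80-0x9F (plain Python, no Spark needed):

{code}
# latin-1 maps 0x80-0x9F to C1 control characters, while windows-1252
# maps most of them to printable characters such as the euro sign.
data = b"caf\xe9 \x80"

print(data.decode("latin-1"))  # 'café \x80' (U+0080, a control char)
print(data.decode("cp1252"))   # 'café €'    (0x80 is the euro sign)
{code}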



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37181) pyspark.pandas.read_csv() should support latin-1 encoding

2021-11-15 Thread Yikun Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444235#comment-17444235
 ] 

Yikun Jiang commented on SPARK-37181:
-

Actually, CPython's internal implementation does the same conversion for 
latin-1, so I think it's fine to do the same conversion here.

 

[1] 
https://github.com/python/cpython/blob/9bf2cbc4c498812e14f20d86acb61c53928a5a57/Lib/encodings/latin_1.py#L43

> pyspark.pandas.read_csv() should support latin-1 encoding
> -
>
> Key: SPARK-37181
> URL: https://issues.apache.org/jira/browse/SPARK-37181
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Chuck Connell
>Priority: Major
>
> {{In regular pandas, you can say read_csv(encoding='latin-1'). This encoding 
> is not recognized in pyspark.pandas. You have to use Windows-1252 instead, 
> which is almost the same but not identical. }}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-37344) split function behave differently between spark 2.3 and spark 3.2

2021-11-15 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444225#comment-17444225
 ] 

angerszhu edited comment on SPARK-37344 at 11/16/21, 2:51 AM:
--

In latest master branch 
{code}
== Parsed Logical Plan ==
'Project [unresolvedalias('split('name, \;), None)]
+- 'UnresolvedRelation [split_test], [], false

== Analyzed Logical Plan ==
split(name, \;, -1): array<string>
Project [split(name#225, \;, -1) AS split(name, \;, -1)#226]
+- SubqueryAlias spark_catalog.default.split_test
   +- Relation default.split_test[id#224,name#225] parquet

== Optimized Logical Plan ==
Project [split(name#225, \;, -1) AS split(name, \;, -1)#226]
+- Relation default.split_test[id#224,name#225] parquet

== Physical Plan ==
*(1) Project [split(name#225, \;, -1) AS split(name, \;, -1)#226]
+- *(1) ColumnarToRow
   +- FileScan parquet default.split_test[name#225] Batched: true, DataFilters: 
[], Format: Parquet, Location: InMemoryFileIndex(1 
paths)[file:/Users/yi.zhu/Documents/project/Angerszh/spark/sql/core/spark...,
 PartitionFilters: [], PushedFilters: [], ReadSchema: struct<name:string>
{code}



was (Author: angerszhuuu):
Work on this


> split function behave differently between spark 2.3 and spark 3.2
> -
>
> Key: SPARK-37344
> URL: https://issues.apache.org/jira/browse/SPARK-37344
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1, 3.1.2, 3.2.0
>Reporter: ocean
>Priority: Major
>  Labels: incorrect
>
> When using the split function in SQL, it behaves differently between 2.3 and 
> 3.2, which causes incorrect results.
> We can use this SQL to reproduce the problem:
>  
> create table split_test ( id int,name string)
> insert into split_test values(1,"abc;def")
> explain extended select split(name,';') from split_test
>  
> spark3:
> spark-sql> Explain extended select split(name,';') from split_test;
> == Parsed Logical Plan ==
> 'Project [unresolvedalias('split('name, \\;), None)]
> +- 'UnresolvedRelation [split_test], [], false
>  
> spark2:
>  
> spark-sql> Explain extended select split(name,';') from split_test;
> == Parsed Logical Plan ==
> 'Project [unresolvedalias('split('name, \;), None)]
> +- 'UnresolvedRelation split_test
>  
> It looks like the handling of escape characters is different.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37338) Rename (Spark)DataFrame.to_pandas_on_spark to (Spark)DataFrame.pandas_api

2021-11-15 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng resolved SPARK-37338.
--
Resolution: Duplicate

> Rename (Spark)DataFrame.to_pandas_on_spark to (Spark)DataFrame.pandas_api
> -
>
> Key: SPARK-37338
> URL: https://issues.apache.org/jira/browse/SPARK-37338
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Currently, (Spark)DataFrame.to_pandas_on_spark is too long to remember and 
> inconvenient to call.
> We want to rename to_pandas_on_spark to pandas_api to improve API usability.
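
A minimal sketch of the intended usage (this ticket was resolved as a duplicate and the rename landed under the companion ticket, so the short spelling below assumes Spark 3.3+):

{code}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sdf = spark.range(3)

# Older, longer spelling being replaced:
# psdf = sdf.to_pandas_on_spark()

# Proposed shorter spelling (assumes Spark 3.3+):
psdf = sdf.pandas_api()
print(type(psdf))  # pyspark.pandas.frame.DataFrame
{code}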



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


