[jira] [Created] (SPARK-41446) Make `createDataFrame` support schema and more input dataset type

2022-12-07 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-41446:
-

 Summary: Make `createDataFrame` support schema and more input 
dataset type
 Key: SPARK-41446
 URL: https://issues.apache.org/jira/browse/SPARK-41446
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, PySpark
Affects Versions: 3.4.0
Reporter: Ruifeng Zheng









[jira] [Commented] (SPARK-39948) exclude velocity 1.5 jar

2022-12-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17644645#comment-17644645
 ] 

Apache Spark commented on SPARK-39948:
--

User 'zhouyifan279' has created a pull request for this issue:
https://github.com/apache/spark/pull/38978

> exclude velocity 1.5 jar
> 
>
> Key: SPARK-39948
> URL: https://issues.apache.org/jira/browse/SPARK-39948
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0, 3.4.0
>Reporter: melin
>Priority: Major
>
> hive-exec transitively depends on Velocity. The Velocity version it pulls in 
> is old and has many known security issues:
> https://issues.apache.org/jira/browse/HIVE-25726
>  
> !image-2022-08-02-14-05-55-756.png!
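
For illustration, a minimal sketch of such an exclusion in an sbt build 
definition (Spark's own build uses Maven, where the equivalent is an 
<exclusions> entry; the hive-exec version below is only a placeholder, and the 
Velocity coordinates are assumed to be org.apache.velocity:velocity):

// build.sbt sketch: pull in hive-exec but drop the transitive Velocity jar
libraryDependencies += ("org.apache.hive" % "hive-exec" % "2.3.9")
  .exclude("org.apache.velocity", "velocity")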






[jira] [Commented] (SPARK-39948) exclude velocity 1.5 jar

2022-12-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17644643#comment-17644643
 ] 

Apache Spark commented on SPARK-39948:
--

User 'zhouyifan279' has created a pull request for this issue:
https://github.com/apache/spark/pull/38978

> exclude velocity 1.5 jar
> 
>
> Key: SPARK-39948
> URL: https://issues.apache.org/jira/browse/SPARK-39948
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0, 3.4.0
>Reporter: melin
>Priority: Major
>
> hive-exec transitively depends on Velocity. The Velocity version it pulls in 
> is old and has many known security issues:
> https://issues.apache.org/jira/browse/HIVE-25726
>  
> !image-2022-08-02-14-05-55-756.png!






[jira] [Assigned] (SPARK-41366) DF.groupby.agg() API should be compatible

2022-12-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-41366:


Assignee: Martin Grund

> DF.groupby.agg() API should be compatible
> -
>
> Key: SPARK-41366
> URL: https://issues.apache.org/jira/browse/SPARK-41366
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Martin Grund
>Assignee: Martin Grund
>Priority: Major
> Fix For: 3.4.0
>
>







[jira] [Resolved] (SPARK-38277) Clear write batch after RocksDB state store's commit

2022-12-07 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-38277.
--
Fix Version/s: 3.3.2
   3.4.0
   Resolution: Fixed

Issue resolved by pull request 38880
[https://github.com/apache/spark/pull/38880]

> Clear write batch after RocksDB state store's commit
> 
>
> Key: SPARK-38277
> URL: https://issues.apache.org/jira/browse/SPARK-38277
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.2.1
>Reporter: Yun Tang
>Assignee: Yun Tang
>Priority: Minor
> Fix For: 3.3.2, 3.4.0
>
>
> Currently the write batch is cleared only when the next batch is loaded; 
> clearing it right after the batch is committed would release the unused 
> memory earlier.
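
As a schematic illustration of the idea, here is a plain-Scala sketch (not the 
actual RocksDB state store code; the names are made up) in which the buffered 
writes are released in commit() instead of waiting for the next load():

object WriteBatchClearSketch {
  import scala.collection.mutable

  final class Store {
    // stands in for the RocksDB write batch that buffers the updates of one batch
    private val writeBatch = mutable.Map.empty[String, String]
    private val committed  = mutable.Map.empty[String, String]

    def put(key: String, value: String): Unit = writeBatch.put(key, value)

    def commit(): Unit = {
      committed ++= writeBatch
      writeBatch.clear()  // proposed: release the buffered writes right after commit
    }

    def load(): Unit = {
      // previous behavior: writeBatch.clear() only happened here, so the buffer of
      // the committed batch stayed in memory until the next batch was loaded
    }
  }

  def main(args: Array[String]): Unit = {
    val store = new Store
    store.put("key", "value")
    store.commit()  // memory held by the write batch is released here, not in load()
  }
}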






[jira] [Assigned] (SPARK-38277) Clear write batch after RocksDB state store's commit

2022-12-07 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim reassigned SPARK-38277:


Assignee: Yun Tang

> Clear write batch after RocksDB state store's commit
> 
>
> Key: SPARK-38277
> URL: https://issues.apache.org/jira/browse/SPARK-38277
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.2.1
>Reporter: Yun Tang
>Assignee: Yun Tang
>Priority: Minor
>
> Currently the write batch is cleared only when the next batch is loaded; 
> clearing it right after the batch is committed would release the unused 
> memory earlier.






[jira] [Commented] (SPARK-41445) Implement DataFrameReader.parquet

2022-12-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17644623#comment-17644623
 ] 

Apache Spark commented on SPARK-41445:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/38977

> Implement DataFrameReader.parquet
> -
>
> Key: SPARK-41445
> URL: https://issues.apache.org/jira/browse/SPARK-41445
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>







[jira] [Assigned] (SPARK-41445) Implement DataFrameReader.parquet

2022-12-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41445:


Assignee: Apache Spark

> Implement DataFrameReader.parquet
> -
>
> Key: SPARK-41445
> URL: https://issues.apache.org/jira/browse/SPARK-41445
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Assigned] (SPARK-41445) Implement DataFrameReader.parquet

2022-12-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41445:


Assignee: (was: Apache Spark)

> Implement DataFrameReader.parquet
> -
>
> Key: SPARK-41445
> URL: https://issues.apache.org/jira/browse/SPARK-41445
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>







[jira] [Commented] (SPARK-41444) Implement DataFrameReader.json

2022-12-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17644620#comment-17644620
 ] 

Apache Spark commented on SPARK-41444:
--

User 'amaliujia' has created a pull request for this issue:
https://github.com/apache/spark/pull/38975

> Implement DataFrameReader.json
> --
>
> Key: SPARK-41444
> URL: https://issues.apache.org/jira/browse/SPARK-41444
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Assignee: Rui Wang
>Priority: Major
> Fix For: 3.4.0
>
>







[jira] [Updated] (SPARK-41444) Implement DataFrameReader.json

2022-12-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-41444:
-
Summary: Implement DataFrameReader.json  (was: Support read.json)

> Implement DataFrameReader.json
> --
>
> Key: SPARK-41444
> URL: https://issues.apache.org/jira/browse/SPARK-41444
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Assignee: Rui Wang
>Priority: Major
> Fix For: 3.4.0
>
>







[jira] [Created] (SPARK-41445) Implement DataFrameReader.parquet

2022-12-07 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-41445:


 Summary: Implement DataFrameReader.parquet
 Key: SPARK-41445
 URL: https://issues.apache.org/jira/browse/SPARK-41445
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Hyukjin Kwon









[jira] (SPARK-41284) Feature parity: I/O in Spark Connect

2022-12-07 Thread Hyukjin Kwon (Jira)


[ https://issues.apache.org/jira/browse/SPARK-41284 ]


Hyukjin Kwon deleted comment on SPARK-41284:
--

was (Author: gurwls223):
Issue resolved by pull request 38975
[https://github.com/apache/spark/pull/38975]

> Feature parity: I/O in Spark Connect
> 
>
> Key: SPARK-41284
> URL: https://issues.apache.org/jira/browse/SPARK-41284
> Project: Spark
>  Issue Type: Umbrella
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Rui Wang
>Priority: Critical
> Fix For: 3.4.0
>
>
> Implement I/O API such as DataFrameReader/Writer






[jira] [Commented] (SPARK-41444) Support read.json

2022-12-07 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17644619#comment-17644619
 ] 

Hyukjin Kwon commented on SPARK-41444:
--

Fixed in https://github.com/apache/spark/pull/38975

> Support read.json
> -
>
> Key: SPARK-41444
> URL: https://issues.apache.org/jira/browse/SPARK-41444
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Assignee: Rui Wang
>Priority: Major
>







[jira] (SPARK-41284) Feature parity: I/O in Spark Connect

2022-12-07 Thread Hyukjin Kwon (Jira)


[ https://issues.apache.org/jira/browse/SPARK-41284 ]


Hyukjin Kwon deleted comment on SPARK-41284:
--

was (Author: apachespark):
User 'amaliujia' has created a pull request for this issue:
https://github.com/apache/spark/pull/38975

> Feature parity: I/O in Spark Connect
> 
>
> Key: SPARK-41284
> URL: https://issues.apache.org/jira/browse/SPARK-41284
> Project: Spark
>  Issue Type: Umbrella
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Rui Wang
>Priority: Critical
> Fix For: 3.4.0
>
>
> Implement I/O API such as DataFrameReader/Writer






[jira] [Resolved] (SPARK-41444) Support read.json

2022-12-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-41444.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

> Support read.json
> -
>
> Key: SPARK-41444
> URL: https://issues.apache.org/jira/browse/SPARK-41444
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Assignee: Rui Wang
>Priority: Major
> Fix For: 3.4.0
>
>







[jira] (SPARK-41284) Feature parity: I/O in Spark Connect

2022-12-07 Thread Hyukjin Kwon (Jira)


[ https://issues.apache.org/jira/browse/SPARK-41284 ]


Hyukjin Kwon deleted comment on SPARK-41284:
--

was (Author: apachespark):
User 'amaliujia' has created a pull request for this issue:
https://github.com/apache/spark/pull/38975

> Feature parity: I/O in Spark Connect
> 
>
> Key: SPARK-41284
> URL: https://issues.apache.org/jira/browse/SPARK-41284
> Project: Spark
>  Issue Type: Umbrella
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Rui Wang
>Priority: Critical
> Fix For: 3.4.0
>
>
> Implement I/O API such as DataFrameReader/Writer






[jira] [Reopened] (SPARK-41284) Feature parity: I/O in Spark Connect

2022-12-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-41284:
--

> Feature parity: I/O in Spark Connect
> 
>
> Key: SPARK-41284
> URL: https://issues.apache.org/jira/browse/SPARK-41284
> Project: Spark
>  Issue Type: Umbrella
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Rui Wang
>Priority: Critical
> Fix For: 3.4.0
>
>
> Implement I/O API such as DataFrameReader/Writer






[jira] [Resolved] (SPARK-41284) Feature parity: I/O in Spark Connect

2022-12-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-41284.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 38975
[https://github.com/apache/spark/pull/38975]

> Feature parity: I/O in Spark Connect
> 
>
> Key: SPARK-41284
> URL: https://issues.apache.org/jira/browse/SPARK-41284
> Project: Spark
>  Issue Type: Umbrella
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Rui Wang
>Priority: Critical
> Fix For: 3.4.0
>
>
> Implement I/O API such as DataFrameReader/Writer






[jira] [Resolved] (SPARK-41442) Only update SQLMetric value if merging with valid metric

2022-12-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-41442.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 38969
[https://github.com/apache/spark/pull/38969]

> Only update SQLMetric value if merging with valid metric
> 
>
> Key: SPARK-41442
> URL: https://issues.apache.org/jira/browse/SPARK-41442
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Minor
> Fix For: 3.4.0
>
>
> We use -1 as the initial value of a SQLMetric and change it to 0 while merging 
> it with other SQLMetric instances. A SQLMetric whose value is still -1 is 
> treated as invalid and filtered out later.
> When developing with Spark, it is troublesome that merging two invalid 
> SQLMetric instances produces a valid SQLMetric, because merging sets the value 
> to 0.
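
As a simplified plain-Scala model of the behavior described above (not the 
actual SQLMetric class, which lives in org.apache.spark.sql.execution.metric):

object SQLMetricMergeSketch {
  final class Metric(var value: Long = -1L) {      // -1 marks "not yet set"
    def isValid: Boolean = value >= 0
    def merge(other: Metric): Unit = {
      if (value < 0) value = 0                     // merging resets -1 to 0 ...
      value += math.max(other.value, 0L)           // ... then adds the other value
    }
  }

  def main(args: Array[String]): Unit = {
    val a = new Metric()
    val b = new Metric()
    a.merge(b)
    // ... so merging two invalid metrics yields 0, which then looks valid:
    println(s"value=${a.value}, isValid=${a.isValid}")  // value=0, isValid=true
  }
}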






[jira] [Assigned] (SPARK-41442) Only update SQLMetric value if merging with valid metric

2022-12-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-41442:
-

Assignee: L. C. Hsieh

> Only update SQLMetric value if merging with valid metric
> 
>
> Key: SPARK-41442
> URL: https://issues.apache.org/jira/browse/SPARK-41442
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Minor
>
> We use -1 as the initial value of a SQLMetric and change it to 0 while merging 
> it with other SQLMetric instances. A SQLMetric whose value is still -1 is 
> treated as invalid and filtered out later.
> When developing with Spark, it is troublesome that merging two invalid 
> SQLMetric instances produces a valid SQLMetric, because merging sets the value 
> to 0.






[jira] [Commented] (SPARK-41366) DF.groupby.agg() API should be compatible

2022-12-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17644604#comment-17644604
 ] 

Apache Spark commented on SPARK-41366:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/38976

> DF.groupby.agg() API should be compatible
> -
>
> Key: SPARK-41366
> URL: https://issues.apache.org/jira/browse/SPARK-41366
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Martin Grund
>Priority: Major
> Fix For: 3.4.0
>
>







[jira] [Commented] (SPARK-41366) DF.groupby.agg() API should be compatible

2022-12-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17644603#comment-17644603
 ] 

Apache Spark commented on SPARK-41366:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/38976

> DF.groupby.agg() API should be compatible
> -
>
> Key: SPARK-41366
> URL: https://issues.apache.org/jira/browse/SPARK-41366
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Martin Grund
>Priority: Major
> Fix For: 3.4.0
>
>







[jira] (SPARK-41439) Implement `DataFrame.melt`

2022-12-07 Thread jiaan.geng (Jira)


[ https://issues.apache.org/jira/browse/SPARK-41439 ]


jiaan.geng deleted comment on SPARK-41439:


was (Author: beliefer):
I'm working on it.

> Implement `DataFrame.melt`
> --
>
> Key: SPARK-41439
> URL: https://issues.apache.org/jira/browse/SPARK-41439
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>







[jira] [Updated] (SPARK-41350) allow simple name access of using join hidden columns after subquery alias

2022-12-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-41350:
--
Fix Version/s: 3.3.2

> allow simple name access of using join hidden columns after subquery alias
> --
>
> Key: SPARK-41350
> URL: https://issues.apache.org/jira/browse/SPARK-41350
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.1
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.3.2, 3.4.0
>
>







[jira] [Commented] (SPARK-41284) Feature parity: I/O in Spark Connect

2022-12-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17644593#comment-17644593
 ] 

Apache Spark commented on SPARK-41284:
--

User 'amaliujia' has created a pull request for this issue:
https://github.com/apache/spark/pull/38975

> Feature parity: I/O in Spark Connect
> 
>
> Key: SPARK-41284
> URL: https://issues.apache.org/jira/browse/SPARK-41284
> Project: Spark
>  Issue Type: Umbrella
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Rui Wang
>Priority: Critical
>
> Implement I/O API such as DataFrameReader/Writer






[jira] [Assigned] (SPARK-41284) Feature parity: I/O in Spark Connect

2022-12-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41284:


Assignee: Rui Wang  (was: Apache Spark)

> Feature parity: I/O in Spark Connect
> 
>
> Key: SPARK-41284
> URL: https://issues.apache.org/jira/browse/SPARK-41284
> Project: Spark
>  Issue Type: Umbrella
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Rui Wang
>Priority: Critical
>
> Implement I/O API such as DataFrameReader/Writer






[jira] [Commented] (SPARK-41284) Feature parity: I/O in Spark Connect

2022-12-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17644594#comment-17644594
 ] 

Apache Spark commented on SPARK-41284:
--

User 'amaliujia' has created a pull request for this issue:
https://github.com/apache/spark/pull/38975

> Feature parity: I/O in Spark Connect
> 
>
> Key: SPARK-41284
> URL: https://issues.apache.org/jira/browse/SPARK-41284
> Project: Spark
>  Issue Type: Umbrella
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Rui Wang
>Priority: Critical
>
> Implement I/O API such as DataFrameReader/Writer






[jira] [Assigned] (SPARK-41284) Feature parity: I/O in Spark Connect

2022-12-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41284:


Assignee: Apache Spark  (was: Rui Wang)

> Feature parity: I/O in Spark Connect
> 
>
> Key: SPARK-41284
> URL: https://issues.apache.org/jira/browse/SPARK-41284
> Project: Spark
>  Issue Type: Umbrella
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Critical
>
> Implement I/O API such as DataFrameReader/Writer






[jira] [Created] (SPARK-41444) Support read.json

2022-12-07 Thread Rui Wang (Jira)
Rui Wang created SPARK-41444:


 Summary: Support read.json
 Key: SPARK-41444
 URL: https://issues.apache.org/jira/browse/SPARK-41444
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Rui Wang
Assignee: Rui Wang









[jira] [Assigned] (SPARK-41439) Implement `DataFrame.melt`

2022-12-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41439:


Assignee: Apache Spark

> Implement `DataFrame.melt`
> --
>
> Key: SPARK-41439
> URL: https://issues.apache.org/jira/browse/SPARK-41439
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Assigned] (SPARK-41439) Implement `DataFrame.melt`

2022-12-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41439:


Assignee: (was: Apache Spark)

> Implement `DataFrame.melt`
> --
>
> Key: SPARK-41439
> URL: https://issues.apache.org/jira/browse/SPARK-41439
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>







[jira] [Commented] (SPARK-41439) Implement `DataFrame.melt`

2022-12-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17644576#comment-17644576
 ] 

Apache Spark commented on SPARK-41439:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/38973

> Implement `DataFrame.melt`
> --
>
> Key: SPARK-41439
> URL: https://issues.apache.org/jira/browse/SPARK-41439
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>







[jira] [Commented] (SPARK-41443) Assign a name to the error class _LEGACY_ERROR_TEMP_1061

2022-12-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17644575#comment-17644575
 ] 

Apache Spark commented on SPARK-41443:
--

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/38972

> Assign a name to the error class _LEGACY_ERROR_TEMP_1061
> 
>
> Key: SPARK-41443
> URL: https://issues.apache.org/jira/browse/SPARK-41443
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Priority: Minor
>







[jira] [Assigned] (SPARK-41443) Assign a name to the error class _LEGACY_ERROR_TEMP_1061

2022-12-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41443:


Assignee: Apache Spark

> Assign a name to the error class _LEGACY_ERROR_TEMP_1061
> 
>
> Key: SPARK-41443
> URL: https://issues.apache.org/jira/browse/SPARK-41443
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Assignee: Apache Spark
>Priority: Minor
>







[jira] [Commented] (SPARK-41443) Assign a name to the error class _LEGACY_ERROR_TEMP_1061

2022-12-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17644574#comment-17644574
 ] 

Apache Spark commented on SPARK-41443:
--

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/38972

> Assign a name to the error class _LEGACY_ERROR_TEMP_1061
> 
>
> Key: SPARK-41443
> URL: https://issues.apache.org/jira/browse/SPARK-41443
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Priority: Minor
>







[jira] [Assigned] (SPARK-41443) Assign a name to the error class _LEGACY_ERROR_TEMP_1061

2022-12-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41443:


Assignee: (was: Apache Spark)

> Assign a name to the error class _LEGACY_ERROR_TEMP_1061
> 
>
> Key: SPARK-41443
> URL: https://issues.apache.org/jira/browse/SPARK-41443
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Priority: Minor
>







[jira] [Created] (SPARK-41443) Assign a name to the error class _LEGACY_ERROR_TEMP_1061

2022-12-07 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-41443:
---

 Summary: Assign a name to the error class _LEGACY_ERROR_TEMP_1061
 Key: SPARK-41443
 URL: https://issues.apache.org/jira/browse/SPARK-41443
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: BingKun Pan









[jira] [Commented] (SPARK-41433) Make Max Arrow BatchSize configurable

2022-12-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17644566#comment-17644566
 ] 

Apache Spark commented on SPARK-41433:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/38971

> Make Max Arrow BatchSize configurable
> -
>
> Key: SPARK-41433
> URL: https://issues.apache.org/jira/browse/SPARK-41433
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>







[jira] [Updated] (SPARK-41376) Executor netty direct memory check should respect spark.shuffle.io.preferDirectBufs

2022-12-07 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-41376:
-
Priority: Minor  (was: Major)

> Executor netty direct memory check should respect 
> spark.shuffle.io.preferDirectBufs
> ---
>
> Key: SPARK-41376
> URL: https://issues.apache.org/jira/browse/SPARK-41376
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Cheng Pan
>Assignee: Cheng Pan
>Priority: Minor
> Fix For: 3.4.0
>
>







[jira] [Resolved] (SPARK-41376) Executor netty direct memory check should respect spark.shuffle.io.preferDirectBufs

2022-12-07 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-41376.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 38901
[https://github.com/apache/spark/pull/38901]

> Executor netty direct memory check should respect 
> spark.shuffle.io.preferDirectBufs
> ---
>
> Key: SPARK-41376
> URL: https://issues.apache.org/jira/browse/SPARK-41376
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Cheng Pan
>Assignee: Cheng Pan
>Priority: Major
> Fix For: 3.4.0
>
>







[jira] [Assigned] (SPARK-41376) Executor netty direct memory check should respect spark.shuffle.io.preferDirectBufs

2022-12-07 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-41376:


Assignee: Cheng Pan

> Executor netty direct memory check should respect 
> spark.shuffle.io.preferDirectBufs
> ---
>
> Key: SPARK-41376
> URL: https://issues.apache.org/jira/browse/SPARK-41376
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Cheng Pan
>Assignee: Cheng Pan
>Priority: Major
>







[jira] [Resolved] (SPARK-41378) Support Column Stats in DS V2

2022-12-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-41378.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 38904
[https://github.com/apache/spark/pull/38904]

> Support Column Stats in DS V2
> -
>
> Key: SPARK-41378
> URL: https://issues.apache.org/jira/browse/SPARK-41378
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 3.4.0
>
>







[jira] [Assigned] (SPARK-41378) Support Column Stats in DS V2

2022-12-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-41378:
-

Assignee: Huaxin Gao

> Support Column Stats in DS V2
> -
>
> Key: SPARK-41378
> URL: https://issues.apache.org/jira/browse/SPARK-41378
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Major
>







[jira] [Assigned] (SPARK-41412) Implement `Cast`

2022-12-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41412:


Assignee: Rui Wang  (was: Apache Spark)

> Implement `Cast`
> 
>
> Key: SPARK-41412
> URL: https://issues.apache.org/jira/browse/SPARK-41412
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Assignee: Rui Wang
>Priority: Major
>







[jira] [Commented] (SPARK-41412) Implement `Cast`

2022-12-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17644513#comment-17644513
 ] 

Apache Spark commented on SPARK-41412:
--

User 'amaliujia' has created a pull request for this issue:
https://github.com/apache/spark/pull/38970

> Implement `Cast`
> 
>
> Key: SPARK-41412
> URL: https://issues.apache.org/jira/browse/SPARK-41412
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Assignee: Rui Wang
>Priority: Major
>







[jira] [Assigned] (SPARK-41412) Implement `Cast`

2022-12-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41412:


Assignee: Apache Spark  (was: Rui Wang)

> Implement `Cast`
> 
>
> Key: SPARK-41412
> URL: https://issues.apache.org/jira/browse/SPARK-41412
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Commented] (SPARK-41442) Only update SQLMetric value if merging with valid metric

2022-12-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17644512#comment-17644512
 ] 

Apache Spark commented on SPARK-41442:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/38969

> Only update SQLMetric value if merging with valid metric
> 
>
> Key: SPARK-41442
> URL: https://issues.apache.org/jira/browse/SPARK-41442
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: L. C. Hsieh
>Priority: Minor
>
> We use -1 as the initial value of a SQLMetric and change it to 0 while merging 
> it with other SQLMetric instances. A SQLMetric whose value is still -1 is 
> treated as invalid and filtered out later.
> When developing with Spark, it is troublesome that merging two invalid 
> SQLMetric instances produces a valid SQLMetric, because merging sets the value 
> to 0.






[jira] [Assigned] (SPARK-41442) Only update SQLMetric value if merging with valid metric

2022-12-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41442:


Assignee: (was: Apache Spark)

> Only update SQLMetric value if merging with valid metric
> 
>
> Key: SPARK-41442
> URL: https://issues.apache.org/jira/browse/SPARK-41442
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: L. C. Hsieh
>Priority: Minor
>
> We use -1 as the initial value of a SQLMetric and change it to 0 while merging 
> it with other SQLMetric instances. A SQLMetric whose value is still -1 is 
> treated as invalid and filtered out later.
> When developing with Spark, it is troublesome that merging two invalid 
> SQLMetric instances produces a valid SQLMetric, because merging sets the value 
> to 0.






[jira] [Assigned] (SPARK-41442) Only update SQLMetric value if merging with valid metric

2022-12-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41442:


Assignee: Apache Spark

> Only update SQLMetric value if merging with valid metric
> 
>
> Key: SPARK-41442
> URL: https://issues.apache.org/jira/browse/SPARK-41442
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: L. C. Hsieh
>Assignee: Apache Spark
>Priority: Minor
>
> We use -1 as the initial value of a SQLMetric and change it to 0 while merging 
> it with other SQLMetric instances. A SQLMetric whose value is still -1 is 
> treated as invalid and filtered out later.
> When developing with Spark, it is troublesome that merging two invalid 
> SQLMetric instances produces a valid SQLMetric, because merging sets the value 
> to 0.






[jira] [Commented] (SPARK-41442) Only update SQLMetric value if merging with valid metric

2022-12-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17644511#comment-17644511
 ] 

Apache Spark commented on SPARK-41442:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/38969

> Only update SQLMetric value if merging with valid metric
> 
>
> Key: SPARK-41442
> URL: https://issues.apache.org/jira/browse/SPARK-41442
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: L. C. Hsieh
>Priority: Minor
>
> We use -1 as the initial value of a SQLMetric and change it to 0 while merging 
> it with other SQLMetric instances. A SQLMetric whose value is still -1 is 
> treated as invalid and filtered out later.
> When developing with Spark, it is troublesome that merging two invalid 
> SQLMetric instances produces a valid SQLMetric, because merging sets the value 
> to 0.






[jira] [Created] (SPARK-41442) Only update SQLMetric value if merging with valid metric

2022-12-07 Thread L. C. Hsieh (Jira)
L. C. Hsieh created SPARK-41442:
---

 Summary: Only update SQLMetric value if merging with valid metric
 Key: SPARK-41442
 URL: https://issues.apache.org/jira/browse/SPARK-41442
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: L. C. Hsieh


We use -1 as the initial value of a SQLMetric and change it to 0 while merging it 
with other SQLMetric instances. A SQLMetric whose value is still -1 is treated as 
invalid and filtered out later.

When developing with Spark, it is troublesome that merging two invalid SQLMetric 
instances produces a valid SQLMetric, because merging sets the value to 0.








[jira] [Commented] (SPARK-41233) High-order function: array_prepend

2022-12-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17644500#comment-17644500
 ] 

Apache Spark commented on SPARK-41233:
--

User 'navinvishy' has created a pull request for this issue:
https://github.com/apache/spark/pull/38947

> High-order function: array_prepend
> --
>
> Key: SPARK-41233
> URL: https://issues.apache.org/jira/browse/SPARK-41233
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>
> refer to 
> https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/api/snowflake.snowpark.functions.array_prepend.html
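
For reference, the intended semantics (following the Snowflake array_prepend 
linked above; this is not an existing Spark API at this point) amount to, in 
plain Scala:

// assumed semantics of the proposed function
def arrayPrepend[T](arr: Seq[T], elem: T): Seq[T] = elem +: arr

// arrayPrepend(Seq(1, 2, 3), 0) == Seq(0, 1, 2, 3)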






[jira] [Assigned] (SPARK-41233) High-order function: array_prepend

2022-12-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41233:


Assignee: Apache Spark

> High-order function: array_prepend
> --
>
> Key: SPARK-41233
> URL: https://issues.apache.org/jira/browse/SPARK-41233
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Apache Spark
>Priority: Major
>
> refer to 
> https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/api/snowflake.snowpark.functions.array_prepend.html






[jira] [Commented] (SPARK-41233) High-order function: array_prepend

2022-12-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17644499#comment-17644499
 ] 

Apache Spark commented on SPARK-41233:
--

User 'navinvishy' has created a pull request for this issue:
https://github.com/apache/spark/pull/38947

> High-order function: array_prepend
> --
>
> Key: SPARK-41233
> URL: https://issues.apache.org/jira/browse/SPARK-41233
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>
> refer to 
> https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/api/snowflake.snowpark.functions.array_prepend.html






[jira] [Assigned] (SPARK-41233) High-order function: array_prepend

2022-12-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41233:


Assignee: (was: Apache Spark)

> High-order function: array_prepend
> --
>
> Key: SPARK-41233
> URL: https://issues.apache.org/jira/browse/SPARK-41233
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>
> refer to 
> https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/api/snowflake.snowpark.functions.array_prepend.html






[jira] [Comment Edited] (SPARK-41344) Reading V2 datasource masks underlying error

2022-12-07 Thread Kevin Cheung (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17644498#comment-17644498
 ] 

Kevin Cheung edited comment on SPARK-41344 at 12/7/22 7:53 PM:
---

[~wforget] I believe he just means duplicating CatalogV2Util.loadTable as a new 
function with signature CatalogV2Util.loadTableThrowsException : Table. The 
only difference would be that it does not catch the exceptions. Then change this 
call site to the new function ({*}CatalogV2Util.loadTableThrowsException(catalog, 
ident, timeTravel){*}, Some(catalog), Some(ident)). This solves the problem of 
masking the original exception.


was (Author: kecheung):
[~wforget] I believe he just means duplicate CatalogV2Util.loadTable as a new 
function with signature CatalogV2Util.loadTableThrowsException : Table. Then 
change this to your new function 
({*}CatalogV2Util.loadTableThrowsException(catalog, ident, timeTravel){*}, 
Some(catalog), Some(ident)). This solves the problem of masking the original 
exception

> Reading V2 datasource masks underlying error
> 
>
> Key: SPARK-41344
> URL: https://issues.apache.org/jira/browse/SPARK-41344
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.3.1, 3.4.0
>Reporter: Kevin Cheung
>Priority: Critical
> Attachments: image-2022-12-03-09-24-43-285.png
>
>
> In Spark 3.3,
>  # In DataSourceV2Utils, loadV2Source calls 
> {*}(CatalogV2Util.loadTable(catalog, ident, timeTravel).get{*}, 
> Some(catalog), Some(ident)).
>  # In CatalogV2Util.scala, when *loadTable(x,x,x)* fails with any of 
> NoSuchTableException, NoSuchDatabaseException, or NoSuchNamespaceException, 
> it returns None.
>  # Back in DataSourceV2Utils, calling .get on that None produces a cryptic 
> error that is technically "correct", but the *original exceptions 
> NoSuchTableException, NoSuchDatabaseException, and NoSuchNamespaceException 
> are thrown away.*
>  
> *Ask:*
> Retain the original error and propagate it to the user. Prior to Spark 3.3, 
> the *original error* was shown; losing it seems like a design flaw.
>  
> *Sample user facing error:*
> None.get
> java.util.NoSuchElementException: None.get
>     at scala.None$.get(Option.scala:529)
>     at scala.None$.get(Option.scala:527)
>     at 
> org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils$.loadV2Source(DataSourceV2Utils.scala:129)
>     at 
> org.apache.spark.sql.DataFrameReader.$anonfun$load$1(DataFrameReader.scala:209)
>     at scala.Option.flatMap(Option.scala:271)
>     at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:207)
>     at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:171)
>  
> *DataSourceV2Utils.scala - CatalogV2Util.loadTable(x,x,x).get*
> [https://github.com/apache/spark/blob/7fd654c0142ab9e4002882da4e65d3b25bebd26c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Utils.scala#L137]
> *CatalogV2Util.scala - Option(catalog.asTableCatalog.loadTable(ident))*
> [https://github.com/apache/spark/blob/7fd654c0142ab9e4002882da4e65d3b25bebd26c/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala#L341]
> *CatalogV2Util.scala - catching the exceptions and return None*
> [https://github.com/apache/spark/blob/7fd654c0142ab9e4002882da4e65d3b25bebd26c/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala#L344]
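
A self-contained plain-Scala sketch of the masking behavior and of the variant 
suggested in the comment above (the names here are simplified stand-ins, not 
the actual Spark classes):

object LoadTableSketch {
  final class NoSuchTableException(msg: String) extends RuntimeException(msg)

  // stand-in for the underlying catalog lookup
  private def lookup(ident: String): String =
    throw new NoSuchTableException(s"Table or view not found: $ident")

  // current behavior: swallow the exception and return None, so a later
  // None.get can only report "None.get"
  def loadTable(ident: String): Option[String] =
    try Some(lookup(ident)) catch { case _: NoSuchTableException => None }

  // suggested variant: same lookup, but let the original exception propagate
  def loadTableThrowsException(ident: String): String = lookup(ident)

  def main(args: Array[String]): Unit = {
    // loadTable("t").get            // java.util.NoSuchElementException: None.get
    // loadTableThrowsException("t") // fails with the informative NoSuchTableException
  }
}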






[jira] [Comment Edited] (SPARK-41344) Reading V2 datasource masks underlying error

2022-12-07 Thread Kevin Cheung (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17644498#comment-17644498
 ] 

Kevin Cheung edited comment on SPARK-41344 at 12/7/22 7:52 PM:
---

[~wforget] I believe he just means duplicate CatalogV2Util.loadTable as a new 
function with signature CatalogV2Util.loadTableThrowsException : Table. Then 
change this to your new function 
({*}CatalogV2Util.loadTableThrowsException(catalog, ident, timeTravel){*}, 
Some(catalog), Some(ident)). This solves the problem of masking the original 
exception


was (Author: kecheung):
[~wforget] I believe he just means duplicate CatalogV2Util.loadTable as a new 
function with signature CatalogV2Util.loadTableThrowsException : Table. Then 
change this to your new function 
({*}CatalogV2Util.loadTableThrowsException(catalog, ident, timeTravel){*}, 
Some(catalog), Some(ident))

> Reading V2 datasource masks underlying error
> 
>
> Key: SPARK-41344
> URL: https://issues.apache.org/jira/browse/SPARK-41344
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.3.1, 3.4.0
>Reporter: Kevin Cheung
>Priority: Critical
> Attachments: image-2022-12-03-09-24-43-285.png
>
>
> In Spark 3.3,
>  # In DataSourceV2Utils, loadV2Source calls 
> {*}(CatalogV2Util.loadTable(catalog, ident, timeTravel).get{*}, 
> Some(catalog), Some(ident)).
>  # In CatalogV2Util.scala, when *loadTable(x,x,x)* fails with any of 
> NoSuchTableException, NoSuchDatabaseException, or NoSuchNamespaceException, 
> it returns None.
>  # Back in DataSourceV2Utils, calling .get on that None produces a cryptic 
> error that is technically "correct", but the *original exceptions 
> NoSuchTableException, NoSuchDatabaseException, and NoSuchNamespaceException 
> are thrown away.*
>  
> *Ask:*
> Retain the original error and propagate it to the user. Prior to Spark 3.3, 
> the *original error* was shown; losing it seems like a design flaw.
>  
> *Sample user facing error:*
> None.get
> java.util.NoSuchElementException: None.get
>     at scala.None$.get(Option.scala:529)
>     at scala.None$.get(Option.scala:527)
>     at 
> org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils$.loadV2Source(DataSourceV2Utils.scala:129)
>     at 
> org.apache.spark.sql.DataFrameReader.$anonfun$load$1(DataFrameReader.scala:209)
>     at scala.Option.flatMap(Option.scala:271)
>     at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:207)
>     at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:171)
>  
> *DataSourceV2Utils.scala - CatalogV2Util.loadTable(x,x,x).get*
> [https://github.com/apache/spark/blob/7fd654c0142ab9e4002882da4e65d3b25bebd26c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Utils.scala#L137]
> *CatalogV2Util.scala - Option(catalog.asTableCatalog.loadTable(ident))*
> [https://github.com/apache/spark/blob/7fd654c0142ab9e4002882da4e65d3b25bebd26c/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala#L341]
> *CatalogV2Util.scala - catching the exceptions and return None*
> [https://github.com/apache/spark/blob/7fd654c0142ab9e4002882da4e65d3b25bebd26c/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala#L344]






[jira] [Commented] (SPARK-41344) Reading V2 datasource masks underlying error

2022-12-07 Thread Kevin Cheung (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17644498#comment-17644498
 ] 

Kevin Cheung commented on SPARK-41344:
--

[~wforget] I believe he just means duplicate CatalogV2Util.loadTable as a new 
function with signature CatalogV2Util.loadTableThrowsException : Table. Then 
change this to your new function 
({*}CatalogV2Util.loadTableThrowsException(catalog, ident, timeTravel){*}, 
Some(catalog), Some(ident))

> Reading V2 datasource masks underlying error
> 
>
> Key: SPARK-41344
> URL: https://issues.apache.org/jira/browse/SPARK-41344
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.3.1, 3.4.0
>Reporter: Kevin Cheung
>Priority: Critical
> Attachments: image-2022-12-03-09-24-43-285.png
>
>
> In Spark 3.3, 
>  # DataSourceV2Utils, the loadV2Source calls: 
> {*}(CatalogV2Util.loadTable(catalog, ident, timeTravel).get{*}, 
> Some(catalog), Some(ident)).
>  # CatalogV2Util.scala, when it tries to *loadTable(x,x,x)* and it fails with 
> any of these exceptions NoSuchTableException, NoSuchDatabaseException, 
> NoSuchNamespaceException, it would return None
>  # Coming back to DataSourceV2Utils, None was previously returned and calling 
> None.get results in a cryptic error technically "correct", but the *original 
> exceptions NoSuchTableException, NoSuchDatabaseException, 
> NoSuchNamespaceException are thrown away.*
>  
> *Ask:*
> Retain the original error and propagate this to the user. Prior to Spark 3.3, 
> the *original error* was shown and this seems like a design flaw.
>  
> *Sample user facing error:*
> None.get
> java.util.NoSuchElementException: None.get
>     at scala.None$.get(Option.scala:529)
>     at scala.None$.get(Option.scala:527)
>     at 
> org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils$.loadV2Source(DataSourceV2Utils.scala:129)
>     at 
> org.apache.spark.sql.DataFrameReader.$anonfun$load$1(DataFrameReader.scala:209)
>     at scala.Option.flatMap(Option.scala:271)
>     at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:207)
>     at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:171)
>  
> *DataSourceV2Utils.scala - CatalogV2Util.loadTable(x,x,x).get*
> [https://github.com/apache/spark/blob/7fd654c0142ab9e4002882da4e65d3b25bebd26c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Utils.scala#L137]
> *CatalogV2Util.scala - Option(catalog.asTableCatalog.loadTable(ident))*
> [https://github.com/apache/spark/blob/7fd654c0142ab9e4002882da4e65d3b25bebd26c/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala#L341]
> *CatalogV2Util.scala - catching the exceptions and return None*
> [https://github.com/apache/spark/blob/7fd654c0142ab9e4002882da4e65d3b25bebd26c/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala#L344]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41344) Reading V2 datasource masks underlying error

2022-12-07 Thread Kevin Cheung (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17644492#comment-17644492
 ] 

Kevin Cheung commented on SPARK-41344:
--

+1 [~planga82]. I like this approach of having another function so that the 
real exception can be propagated.

> Reading V2 datasource masks underlying error
> 
>
> Key: SPARK-41344
> URL: https://issues.apache.org/jira/browse/SPARK-41344
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.3.1, 3.4.0
>Reporter: Kevin Cheung
>Priority: Critical
> Attachments: image-2022-12-03-09-24-43-285.png
>
>
> In Spark 3.3, 
>  # DataSourceV2Utils, the loadV2Source calls: 
> {*}(CatalogV2Util.loadTable(catalog, ident, timeTravel).get{*}, 
> Some(catalog), Some(ident)).
>  # CatalogV2Util.scala, when it tries to *loadTable(x,x,x)* and it fails with 
> any of these exceptions NoSuchTableException, NoSuchDatabaseException, 
> NoSuchNamespaceException, it would return None
>  # Coming back to DataSourceV2Utils, None was previously returned and calling 
> None.get results in a cryptic error technically "correct", but the *original 
> exceptions NoSuchTableException, NoSuchDatabaseException, 
> NoSuchNamespaceException are thrown away.*
>  
> *Ask:*
> Retain the original error and propagate this to the user. Prior to Spark 3.3, 
> the *original error* was shown and this seems like a design flaw.
>  
> *Sample user facing error:*
> None.get
> java.util.NoSuchElementException: None.get
>     at scala.None$.get(Option.scala:529)
>     at scala.None$.get(Option.scala:527)
>     at 
> org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils$.loadV2Source(DataSourceV2Utils.scala:129)
>     at 
> org.apache.spark.sql.DataFrameReader.$anonfun$load$1(DataFrameReader.scala:209)
>     at scala.Option.flatMap(Option.scala:271)
>     at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:207)
>     at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:171)
>  
> *DataSourceV2Utils.scala - CatalogV2Util.loadTable(x,x,x).get*
> [https://github.com/apache/spark/blob/7fd654c0142ab9e4002882da4e65d3b25bebd26c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Utils.scala#L137]
> *CatalogV2Util.scala - Option(catalog.asTableCatalog.loadTable(ident))*
> [https://github.com/apache/spark/blob/7fd654c0142ab9e4002882da4e65d3b25bebd26c/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala#L341]
> *CatalogV2Util.scala - catching the exceptions and return None*
> [https://github.com/apache/spark/blob/7fd654c0142ab9e4002882da4e65d3b25bebd26c/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Util.scala#L344]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41349) Implement `DataFrame.hint`

2022-12-07 Thread Rui Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17644463#comment-17644463
 ] 

Rui Wang commented on SPARK-41349:
--

Keeping this issue open given that there is Python-side work left.
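
For context, the API being mirrored here is the existing DataFrame.hint; below 
is a minimal usage sketch in plain PySpark (it assumes an active SparkSession 
bound to `spark` and is not the Connect client code):

{code:python}
# Ask the optimizer to broadcast the right-hand side of the join.
left = spark.range(100)
right = spark.range(100)
joined = left.join(right.hint("broadcast"), "id")
joined.explain()  # the plan should show a broadcast join being chosen
{code}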

> Implement `DataFrame.hint`
> --
>
> Key: SPARK-41349
> URL: https://issues.apache.org/jira/browse/SPARK-41349
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Deng Ziming
>Priority: Major
> Fix For: 3.4.0
>
>
> implement DataFrame.hint with the proto message added in 
> https://issues.apache.org/jira/browse/SPARK-41345



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-41349) Implement `DataFrame.hint`

2022-12-07 Thread Rui Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Wang reopened SPARK-41349:
--

> Implement `DataFrame.hint`
> --
>
> Key: SPARK-41349
> URL: https://issues.apache.org/jira/browse/SPARK-41349
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Deng Ziming
>Priority: Major
> Fix For: 3.4.0
>
>
> implement DataFrame.hint with the proto message added in 
> https://issues.apache.org/jira/browse/SPARK-41345



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41441) Allow Generate with no required child output to host outer references

2022-12-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41441:


Assignee: (was: Apache Spark)

> Allow Generate with no required child output to host outer references
> -
>
> Key: SPARK-41441
> URL: https://issues.apache.org/jira/browse/SPARK-41441
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Allison Wang
>Priority: Major
>
> Currently, in CheckAnalysis, Spark disallows Generate from hosting any outer 
> references when its required child output is not empty. But when the required 
> child output is empty, Generate can host outer references, which 
> DecorrelateInnerQuery does not handle.
> For example,
> {code:java}
> select * from t, lateral (select explode(array(c1, c2))){code}
> This throws an internal error:
> {code:java}
> Caused by: java.lang.AssertionError: assertion failed: Correlated column is 
> not allowed in Generate explode(array(outer(c1#219), outer(c2#220))), false, 
> [col#221] +- OneRowRelation{code}
> We should allow Generate to host outer references when its required child 
> output is empty.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41441) Allow Generate with no required child output to host outer references

2022-12-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41441:


Assignee: Apache Spark

> Allow Generate with no required child output to host outer references
> -
>
> Key: SPARK-41441
> URL: https://issues.apache.org/jira/browse/SPARK-41441
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Allison Wang
>Assignee: Apache Spark
>Priority: Major
>
> Currently, in CheckAnalysis, Spark disallows Generate from hosting any outer 
> references when its required child output is not empty. But when the required 
> child output is empty, Generate can host outer references, which 
> DecorrelateInnerQuery does not handle.
> For example,
> {code:java}
> select * from t, lateral (select explode(array(c1, c2))){code}
> This throws an internal error:
> {code:java}
> Caused by: java.lang.AssertionError: assertion failed: Correlated column is 
> not allowed in Generate explode(array(outer(c1#219), outer(c2#220))), false, 
> [col#221] +- OneRowRelation{code}
> We should allow Generate to host outer references when its required child 
> output is empty.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41441) Allow Generate with no required child output to host outer references

2022-12-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17644459#comment-17644459
 ] 

Apache Spark commented on SPARK-41441:
--

User 'allisonwang-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/38968

> Allow Generate with no required child output to host outer references
> -
>
> Key: SPARK-41441
> URL: https://issues.apache.org/jira/browse/SPARK-41441
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Allison Wang
>Priority: Major
>
> Currently, in CheckAnalysis, Spark disallows Generate from hosting any outer 
> references when its required child output is not empty. But when the required 
> child output is empty, Generate can host outer references, which 
> DecorrelateInnerQuery does not handle.
> For example,
> {code:java}
> select * from t, lateral (select explode(array(c1, c2))){code}
> This throws an internal error:
> {code:java}
> Caused by: java.lang.AssertionError: assertion failed: Correlated column is 
> not allowed in Generate explode(array(outer(c1#219), outer(c2#220))), false, 
> [col#221] +- OneRowRelation{code}
> We should allow Generate to host outer references when its required child 
> output is empty.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41369) Refactor connect directory structure

2022-12-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1765#comment-1765
 ] 

Apache Spark commented on SPARK-41369:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/38967

> Refactor connect directory structure
> 
>
> Key: SPARK-41369
> URL: https://issues.apache.org/jira/browse/SPARK-41369
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.3.2, 3.4.0
>Reporter: Venkata Sai Akhil Gudesa
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.4.0
>
>
> Currently, `spark/connector/connect/` is a single module that contains both 
> the "server"/service and the protobuf definitions.
> However, this module can be split into multiple modules - "server" and 
> "common". This brings the advantage of separating out the protobuf generation 
> from the core "server" module for efficient reuse.
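
Concretely, the split described above could look roughly like the following 
layout (directory and module names are assumptions based on this description, 
not the final structure):

{noformat}
connector/connect/
  common/   <- protobuf definitions and generated code, reusable by clients
  server/   <- the Spark Connect service implementation, depends on "common"
{noformat}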



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41369) Refactor connect directory structure

2022-12-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1764#comment-1764
 ] 

Apache Spark commented on SPARK-41369:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/38967

> Refactor connect directory structure
> 
>
> Key: SPARK-41369
> URL: https://issues.apache.org/jira/browse/SPARK-41369
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.3.2, 3.4.0
>Reporter: Venkata Sai Akhil Gudesa
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.4.0
>
>
> Currently, `spark/connector/connect/` is a single module that contains both 
> the "server"/service and the protobuf definitions.
> However, this module can be split into multiple modules - "server" and 
> "common". This brings the advantage of separating out the protobuf generation 
> from the core "server" module for efficient reuse.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41441) Allow Generate with no required child output to host outer references

2022-12-07 Thread Allison Wang (Jira)
Allison Wang created SPARK-41441:


 Summary: Allow Generate with no required child output to host 
outer references
 Key: SPARK-41441
 URL: https://issues.apache.org/jira/browse/SPARK-41441
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: Allison Wang


Currently, in CheckAnalysis, Spark disallows Generate from hosting any outer 
references when its required child output is not empty. But when the required 
child output is empty, Generate can host outer references, which 
DecorrelateInnerQuery does not handle.

For example,
{code:java}
select * from t, lateral (select explode(array(c1, c2))){code}
This throws an internal error:
{code:java}
Caused by: java.lang.AssertionError: assertion failed: Correlated column is not 
allowed in Generate explode(array(outer(c1#219), outer(c2#220))), false, 
[col#221] +- OneRowRelation{code}
We should allow Generate to host outer references when its required child 
output is empty.
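
In code terms, the relaxed condition amounts to a one-line predicate like the 
following (an illustrative sketch against the catalyst Generate operator, not 
the actual CheckAnalysis change):

{code:scala}
import org.apache.spark.sql.catalyst.plans.logical.Generate

// Sketch only: a Generate may host outer references as long as it does not
// also need any columns from its own child.
def mayHostOuterReferences(g: Generate): Boolean = g.requiredChildOutput.isEmpty
{code}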



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40801) Upgrade Apache Commons Text to 1.10

2022-12-07 Thread Kevin Appel (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17644430#comment-17644430
 ] 

Kevin Appel commented on SPARK-40801:
-

Thank you for working on this.

> Upgrade Apache Commons Text to 1.10
> ---
>
> Key: SPARK-40801
> URL: https://issues.apache.org/jira/browse/SPARK-40801
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Bjørn Jørgensen
>Assignee: Bjørn Jørgensen
>Priority: Minor
> Fix For: 3.2.3, 3.3.2, 3.4.0
>
>
> [CVE-2022-42889|https://nvd.nist.gov/vuln/detail/CVE-2022-42889]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41008) Isotonic regression result differs from sklearn implementation

2022-12-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17644314#comment-17644314
 ] 

Apache Spark commented on SPARK-41008:
--

User 'ahmed-mahran' has created a pull request for this issue:
https://github.com/apache/spark/pull/38966

> Isotonic regression result differs from sklearn implementation
> --
>
> Key: SPARK-41008
> URL: https://issues.apache.org/jira/browse/SPARK-41008
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 3.3.1
>Reporter: Arne Koopman
>Priority: Minor
>
>  
> {code:python}
> import pandas as pd
> from pyspark.sql.types import DoubleType
> from pyspark.sql import functions as F
> from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn
> from pyspark.ml.regression import IsotonicRegression as 
> IsotonicRegression_pyspark
> # The P(positives | model_score):
> # 0.6 -> 0.5 (1 out of the 2 labels is positive)
> # 0.333 -> 0.333 (1 out of the 3 labels is positive)
> # 0.20 -> 0.25 (1 out of the 4 labels is positive)
> tc_pd = pd.DataFrame({
> "model_score": [0.6, 0.6, 0.333, 0.333, 0.333, 0.20, 0.20, 0.20, 0.20],   
>       
> "label": [1, 0, 0, 1, 0, 1, 0, 0, 0],         
> "weight": 1,     }
> )
> # The fraction of positives for each of the distinct model_scores would be 
> the best fit.
> # Resulting in the following expected calibrated model_scores:
> # "calibrated_model_score": [0.5, 0.5, 0.333, 0.333, 0.333, 0.25, 0.25, 0.25, 
> 0.25]
> # The sklearn implementation of Isotonic Regression. 
> from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn
> tc_regressor_sklearn = 
> IsotonicRegression_sklearn().fit(X=tc_pd['model_score'], y=tc_pd['label'], 
> sample_weight=tc_pd['weight'])
> print("sklearn:", tc_regressor_sklearn.predict(tc_pd['model_score']))
> # >> sklearn: [0.5 0.5 0. 0. 0. 0.25 0.25 0.25 0.25 ]
> # The pyspark implementation of Isotonic Regression. 
> tc_df = spark.createDataFrame(tc_pd)
> tc_df = tc_df.withColumn('model_score', 
> F.col('model_score').cast(DoubleType()))
> isotonic_regressor_pyspark = 
> IsotonicRegression_pyspark(featuresCol='model_score', labelCol='label', 
> weightCol='weight')
> tc_model = isotonic_regressor_pyspark.fit(tc_df)
> tc_pd = tc_model.transform(tc_df).toPandas()
> print("pyspark:", tc_pd['prediction'].values)
> # >> pyspark: [0.5 0.5 0. 0. 0. 0. 0. 0. 0. ]
> # The result from the pyspark implementation seems unclear. Similar small toy 
> examples lead to similarly unexpected results for the pyspark implementation. 
> # Strangely enough, for 'large' datasets, the difference between calibrated 
> model_scores generated by both implementations disappears. 
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41008) Isotonic regression result differs from sklearn implementation

2022-12-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17644313#comment-17644313
 ] 

Apache Spark commented on SPARK-41008:
--

User 'ahmed-mahran' has created a pull request for this issue:
https://github.com/apache/spark/pull/38966

> Isotonic regression result differs from sklearn implementation
> --
>
> Key: SPARK-41008
> URL: https://issues.apache.org/jira/browse/SPARK-41008
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 3.3.1
>Reporter: Arne Koopman
>Priority: Minor
>
>  
> {code:python}
> import pandas as pd
> from pyspark.sql.types import DoubleType
> from pyspark.sql import functions as F
> from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn
> from pyspark.ml.regression import IsotonicRegression as 
> IsotonicRegression_pyspark
> # The P(positives | model_score):
> # 0.6 -> 0.5 (1 out of the 2 labels is positive)
> # 0.333 -> 0.333 (1 out of the 3 labels is positive)
> # 0.20 -> 0.25 (1 out of the 4 labels is positive)
> tc_pd = pd.DataFrame({
> "model_score": [0.6, 0.6, 0.333, 0.333, 0.333, 0.20, 0.20, 0.20, 0.20],   
>       
> "label": [1, 0, 0, 1, 0, 1, 0, 0, 0],         
> "weight": 1,     }
> )
> # The fraction of positives for each of the distinct model_scores would be 
> the best fit.
> # Resulting in the following expected calibrated model_scores:
> # "calibrated_model_score": [0.5, 0.5, 0.333, 0.333, 0.333, 0.25, 0.25, 0.25, 
> 0.25]
> # The sklearn implementation of Isotonic Regression. 
> from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn
> tc_regressor_sklearn = 
> IsotonicRegression_sklearn().fit(X=tc_pd['model_score'], y=tc_pd['label'], 
> sample_weight=tc_pd['weight'])
> print("sklearn:", tc_regressor_sklearn.predict(tc_pd['model_score']))
> # >> sklearn: [0.5 0.5 0. 0. 0. 0.25 0.25 0.25 0.25 ]
> # The pyspark implementation of Isotonic Regression. 
> tc_df = spark.createDataFrame(tc_pd)
> tc_df = tc_df.withColumn('model_score', 
> F.col('model_score').cast(DoubleType()))
> isotonic_regressor_pyspark = 
> IsotonicRegression_pyspark(featuresCol='model_score', labelCol='label', 
> weightCol='weight')
> tc_model = isotonic_regressor_pyspark.fit(tc_df)
> tc_pd = tc_model.transform(tc_df).toPandas()
> print("pyspark:", tc_pd['prediction'].values)
> # >> pyspark: [0.5 0.5 0. 0. 0. 0. 0. 0. 0. ]
> # The result from the pyspark implementation seems unclear. Similar small toy 
> examples lead to similarly unexpected results for the pyspark implementation. 
> # Strangely enough, for 'large' datasets, the difference between calibrated 
> model_scores generated by both implementations disappears. 
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41008) Isotonic regression result differs from sklearn implementation

2022-12-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41008:


Assignee: Apache Spark

> Isotonic regression result differs from sklearn implementation
> --
>
> Key: SPARK-41008
> URL: https://issues.apache.org/jira/browse/SPARK-41008
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 3.3.1
>Reporter: Arne Koopman
>Assignee: Apache Spark
>Priority: Minor
>
>  
> {code:python}
> import pandas as pd
> from pyspark.sql.types import DoubleType
> from pyspark.sql import functions as F
> from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn
> from pyspark.ml.regression import IsotonicRegression as 
> IsotonicRegression_pyspark
> # The P(positives | model_score):
> # 0.6 -> 0.5 (1 out of the 2 labels is positive)
> # 0.333 -> 0.333 (1 out of the 3 labels is positive)
> # 0.20 -> 0.25 (1 out of the 4 labels is positive)
> tc_pd = pd.DataFrame({
> "model_score": [0.6, 0.6, 0.333, 0.333, 0.333, 0.20, 0.20, 0.20, 0.20],   
>       
> "label": [1, 0, 0, 1, 0, 1, 0, 0, 0],         
> "weight": 1,     }
> )
> # The fraction of positives for each of the distinct model_scores would be 
> the best fit.
> # Resulting in the following expected calibrated model_scores:
> # "calibrated_model_score": [0.5, 0.5, 0.333, 0.333, 0.333, 0.25, 0.25, 0.25, 
> 0.25]
> # The sklearn implementation of Isotonic Regression. 
> from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn
> tc_regressor_sklearn = 
> IsotonicRegression_sklearn().fit(X=tc_pd['model_score'], y=tc_pd['label'], 
> sample_weight=tc_pd['weight'])
> print("sklearn:", tc_regressor_sklearn.predict(tc_pd['model_score']))
> # >> sklearn: [0.5 0.5 0. 0. 0. 0.25 0.25 0.25 0.25 ]
> # The pyspark implementation of Isotonic Regression. 
> tc_df = spark.createDataFrame(tc_pd)
> tc_df = tc_df.withColumn('model_score', 
> F.col('model_score').cast(DoubleType()))
> isotonic_regressor_pyspark = 
> IsotonicRegression_pyspark(featuresCol='model_score', labelCol='label', 
> weightCol='weight')
> tc_model = isotonic_regressor_pyspark.fit(tc_df)
> tc_pd = tc_model.transform(tc_df).toPandas()
> print("pyspark:", tc_pd['prediction'].values)
> # >> pyspark: [0.5 0.5 0. 0. 0. 0. 0. 0. 0. ]
> # The result from the pyspark implementation seems unclear. Similar small toy 
> examples lead to similarly unexpected results for the pyspark implementation. 
> # Strangely enough, for 'large' datasets, the difference between calibrated 
> model_scores generated by both implementations disappears. 
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41008) Isotonic regression result differs from sklearn implementation

2022-12-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41008:


Assignee: (was: Apache Spark)

> Isotonic regression result differs from sklearn implementation
> --
>
> Key: SPARK-41008
> URL: https://issues.apache.org/jira/browse/SPARK-41008
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 3.3.1
>Reporter: Arne Koopman
>Priority: Minor
>
>  
> {code:python}
> import pandas as pd
> from pyspark.sql.types import DoubleType
> from pyspark.sql import functions as F
> from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn
> from pyspark.ml.regression import IsotonicRegression as 
> IsotonicRegression_pyspark
> # The P(positives | model_score):
> # 0.6 -> 0.5 (1 out of the 2 labels is positive)
> # 0.333 -> 0.333 (1 out of the 3 labels is positive)
> # 0.20 -> 0.25 (1 out of the 4 labels is positive)
> tc_pd = pd.DataFrame({
> "model_score": [0.6, 0.6, 0.333, 0.333, 0.333, 0.20, 0.20, 0.20, 0.20],   
>       
> "label": [1, 0, 0, 1, 0, 1, 0, 0, 0],         
> "weight": 1,     }
> )
> # The fraction of positives for each of the distinct model_scores would be 
> the best fit.
> # Resulting in the following expected calibrated model_scores:
> # "calibrated_model_score": [0.5, 0.5, 0.333, 0.333, 0.333, 0.25, 0.25, 0.25, 
> 0.25]
> # The sklearn implementation of Isotonic Regression. 
> from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn
> tc_regressor_sklearn = 
> IsotonicRegression_sklearn().fit(X=tc_pd['model_score'], y=tc_pd['label'], 
> sample_weight=tc_pd['weight'])
> print("sklearn:", tc_regressor_sklearn.predict(tc_pd['model_score']))
> # >> sklearn: [0.5 0.5 0. 0. 0. 0.25 0.25 0.25 0.25 ]
> # The pyspark implementation of Isotonic Regression. 
> tc_df = spark.createDataFrame(tc_pd)
> tc_df = tc_df.withColumn('model_score', 
> F.col('model_score').cast(DoubleType()))
> isotonic_regressor_pyspark = 
> IsotonicRegression_pyspark(featuresCol='model_score', labelCol='label', 
> weightCol='weight')
> tc_model = isotonic_regressor_pyspark.fit(tc_df)
> tc_pd = tc_model.transform(tc_df).toPandas()
> print("pyspark:", tc_pd['prediction'].values)
> # >> pyspark: [0.5 0.5 0. 0. 0. 0. 0. 0. 0. ]
> # The result from the pyspark implementation seems unclear. Similar small toy 
> examples lead to similarly unexpected results for the pyspark implementation. 
> # Strangely enough, for 'large' datasets, the difference between calibrated 
> model_scores generated by both implementations disappears. 
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41437) Do not optimize the input query twice for v1 write fallback

2022-12-07 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-41437.
-
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 38942
[https://github.com/apache/spark/pull/38942]

> Do not optimize the input query twice for v1 write fallback
> ---
>
> Key: SPARK-41437
> URL: https://issues.apache.org/jira/browse/SPARK-41437
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41437) Do not optimize the input query twice for v1 write fallback

2022-12-07 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-41437:
---

Assignee: Wenchen Fan

> Do not optimize the input query twice for v1 write fallback
> ---
>
> Key: SPARK-41437
> URL: https://issues.apache.org/jira/browse/SPARK-41437
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-41418) Upgrade scala-maven-plugin from 4.7.2 to 4.8.0

2022-12-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-41418.
-

> Upgrade scala-maven-plugin from 4.7.2 to 4.8.0
> --
>
> Key: SPARK-41418
> URL: https://issues.apache.org/jira/browse/SPARK-41418
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41418) Upgrade scala-maven-plugin from 4.7.2 to 4.8.0

2022-12-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-41418.
---
Resolution: Duplicate

> Upgrade scala-maven-plugin from 4.7.2 to 4.8.0
> --
>
> Key: SPARK-41418
> URL: https://issues.apache.org/jira/browse/SPARK-41418
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41439) Implement `DataFrame.melt`

2022-12-07 Thread jiaan.geng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17644258#comment-17644258
 ] 

jiaan.geng commented on SPARK-41439:


I'm working on it.

> Implement `DataFrame.melt`
> --
>
> Key: SPARK-41439
> URL: https://issues.apache.org/jira/browse/SPARK-41439
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-41438) Implement DataFrame. colRegex

2022-12-07 Thread jiaan.geng (Jira)


[ https://issues.apache.org/jira/browse/SPARK-41438 ]


jiaan.geng deleted comment on SPARK-41438:


was (Author: beliefer):
I'm working on.

> Implement DataFrame. colRegex
> -
>
> Key: SPARK-41438
> URL: https://issues.apache.org/jira/browse/SPARK-41438
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41438) Implement DataFrame. colRegex

2022-12-07 Thread jiaan.geng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17644242#comment-17644242
 ] 

jiaan.geng commented on SPARK-41438:


I'm working on it.

> Implement DataFrame. colRegex
> -
>
> Key: SPARK-41438
> URL: https://issues.apache.org/jira/browse/SPARK-41438
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41440) Implement DataFrame.randomSplit

2022-12-07 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-41440:
-

 Summary: Implement DataFrame.randomSplit
 Key: SPARK-41440
 URL: https://issues.apache.org/jira/browse/SPARK-41440
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, PySpark
Affects Versions: 3.4.0
Reporter: Ruifeng Zheng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41439) Implement `DataFrame.melt`

2022-12-07 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-41439:
-

 Summary: Implement `DataFrame.melt`
 Key: SPARK-41439
 URL: https://issues.apache.org/jira/browse/SPARK-41439
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, PySpark
Affects Versions: 3.4.0
Reporter: Ruifeng Zheng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41438) Implement DataFrame. colRegex

2022-12-07 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-41438:
-

 Summary: Implement DataFrame. colRegex
 Key: SPARK-41438
 URL: https://issues.apache.org/jira/browse/SPARK-41438
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, PySpark
Affects Versions: 3.4.0
Reporter: Ruifeng Zheng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41386) There are some small files when using rebalance(column)

2022-12-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17644224#comment-17644224
 ] 

Apache Spark commented on SPARK-41386:
--

User 'Juerin-Dong' has created a pull request for this issue:
https://github.com/apache/spark/pull/38965

> There are some small files when using rebalance(column)
> ---
>
> Key: SPARK-41386
> URL: https://issues.apache.org/jira/browse/SPARK-41386
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.1
>Reporter: Zhe Dong
>Priority: Minor
>
> *Problem (REBALANCE(column)):*
>  SparkSession config:
> {noformat}
> config("spark.sql.adaptive.optimizeSkewsInRebalancePartitions.enabled", 
> "true") 
> config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "20m") 
> config("spark.sql.adaptive.rebalancePartitionsSmallPartitionFactor", 
> "0.5"){noformat}
> So, we expect that file sizes should be at least 20m*0.5=10m.
> But in fact, we got some small files like the following:
> {noformat}
> -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
> .../part-0-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
> .../part-1-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
> .../part-2-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
> .../part-3-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff      9.1 M 2022-12-07 13:13 
> .../part-4-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff      3.0 M 2022-12-07 13:13 
> .../part-5-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet{noformat}
> 9.1 M and 3.0 M are smaller than 10 M. We have to handle these small files in 
> another way.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41386) There are some small files when using rebalance(column)

2022-12-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17644223#comment-17644223
 ] 

Apache Spark commented on SPARK-41386:
--

User 'Juerin-Dong' has created a pull request for this issue:
https://github.com/apache/spark/pull/38965

> There are some small files when using rebalance(column)
> ---
>
> Key: SPARK-41386
> URL: https://issues.apache.org/jira/browse/SPARK-41386
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.1
>Reporter: Zhe Dong
>Priority: Minor
>
> *Problem (REBALANCE(column)):*
>  SparkSession config:
> {noformat}
> config("spark.sql.adaptive.optimizeSkewsInRebalancePartitions.enabled", 
> "true") 
> config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "20m") 
> config("spark.sql.adaptive.rebalancePartitionsSmallPartitionFactor", 
> "0.5"){noformat}
> So, we expect that file sizes should be at least 20m*0.5=10m.
> But in fact, we got some small files like the following:
> {noformat}
> -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
> .../part-0-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
> .../part-1-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
> .../part-2-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
> .../part-3-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff      9.1 M 2022-12-07 13:13 
> .../part-4-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff      3.0 M 2022-12-07 13:13 
> .../part-5-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet{noformat}
> 9.1 M and 3.0 M are smaller than 10 M. We have to handle these small files in 
> another way.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41386) There are some small files when using rebalance(column)

2022-12-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41386:


Assignee: (was: Apache Spark)

> There are some small files when using rebalance(column)
> ---
>
> Key: SPARK-41386
> URL: https://issues.apache.org/jira/browse/SPARK-41386
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.1
>Reporter: Zhe Dong
>Priority: Minor
>
> *Problem (REBALANCE(column)):*
>  SparkSession config:
> {noformat}
> config("spark.sql.adaptive.optimizeSkewsInRebalancePartitions.enabled", 
> "true") 
> config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "20m") 
> config("spark.sql.adaptive.rebalancePartitionsSmallPartitionFactor", 
> "0.5"){noformat}
> So, we expect that file sizes should be at least 20m*0.5=10m.
> But in fact, we got some small files like the following:
> {noformat}
> -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
> .../part-0-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
> .../part-1-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
> .../part-2-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
> .../part-3-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff      9.1 M 2022-12-07 13:13 
> .../part-4-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff      3.0 M 2022-12-07 13:13 
> .../part-5-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet{noformat}
> 9.1 M and 3.0 M are smaller than 10 M. We have to handle these small files in 
> another way.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41386) There are some small files when using rebalance(column)

2022-12-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41386:


Assignee: Apache Spark

> There are some small files when using rebalance(column)
> ---
>
> Key: SPARK-41386
> URL: https://issues.apache.org/jira/browse/SPARK-41386
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.1
>Reporter: Zhe Dong
>Assignee: Apache Spark
>Priority: Minor
>
> *Problem (REBALANCE(column)):*
>  SparkSession config:
> {noformat}
> config("spark.sql.adaptive.optimizeSkewsInRebalancePartitions.enabled", 
> "true") 
> config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "20m") 
> config("spark.sql.adaptive.rebalancePartitionsSmallPartitionFactor", 
> "0.5"){noformat}
> So, we expect that file sizes should be at least 20m*0.5=10m.
> But in fact, we got some small files like the following:
> {noformat}
> -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
> .../part-0-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
> .../part-1-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
> .../part-2-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 
> .../part-3-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff      9.1 M 2022-12-07 13:13 
> .../part-4-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff      3.0 M 2022-12-07 13:13 
> .../part-5-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet{noformat}
> 9.1 M and 3.0 M are smaller than 10 M. We have to handle these small files in 
> another way.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-41386) There are some small files when using rebalance(column)

2022-12-07 Thread Zhe Dong (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17644141#comment-17644141
 ] 

Zhe Dong edited comment on SPARK-41386 at 12/7/22 8:31 AM:
---

OptimizeSkewInRebalancePartitions.scala
{code:java}
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.sql.execution.adaptive

import org.apache.spark.sql.execution.{CoalescedPartitionSpec, 
ShufflePartitionSpec, SparkPlan}
import org.apache.spark.sql.execution.exchange.{REBALANCE_PARTITIONS_BY_COL, 
REBALANCE_PARTITIONS_BY_NONE, ShuffleOrigin}
import org.apache.spark.sql.internal.SQLConf

/**
 * A rule to optimize the skewed shuffle partitions in [[RebalancePartitions]] 
based on the map
 * output statistics, which can avoid data skew that hurt performance.
 *
 * We use ADVISORY_PARTITION_SIZE_IN_BYTES size to decide if a partition should 
be optimized.
 * Let's say we have 3 maps with 3 shuffle partitions, and assuming r1 has data 
skew issue.
 * the map side looks like:
 *   m0:[b0, b1, b2], m1:[b0, b1, b2], m2:[b0, b1, b2]
 * and the reduce side looks like:
 *(without this rule) r1[m0-b1, m1-b1, m2-b1]
 *  / \
 *   r0:[m0-b0, m1-b0, m2-b0], r1-0:[m0-b1], r1-1:[m1-b1], r1-2:[m2-b1], 
r2[m0-b2, m1-b2, m2-b2]
 */
object OptimizeSkewInRebalancePartitions extends AQEShuffleReadRule {

  override val supportedShuffleOrigins: Seq[ShuffleOrigin] =
Seq(REBALANCE_PARTITIONS_BY_NONE, REBALANCE_PARTITIONS_BY_COL)

  /**
   * Splits the skewed partition based on the map size and the target partition 
size
   * after split. Create a list of `PartialReducerPartitionSpec` for skewed 
partition and
   * create `CoalescedPartition` for normal partition.
   */
  private def optimizeSkewedPartitions(
  shuffleId: Int,
  bytesByPartitionId: Array[Long],
  targetSize: Long,
  smallPartitionFactor: Double): Seq[ShufflePartitionSpec] = {
bytesByPartitionId.indices.flatMap { reduceIndex =>
  val bytes = bytesByPartitionId(reduceIndex)
  if (bytes > targetSize) {
val newPartitionSpec = ShufflePartitionsUtil.createSkewPartitionSpecs(
  shuffleId, reduceIndex, targetSize, smallPartitionFactor)
if (newPartitionSpec.isEmpty) {
  CoalescedPartitionSpec(reduceIndex, reduceIndex + 1, bytes) :: Nil
} else {
  logDebug(s"For shuffle $shuffleId, partition $reduceIndex is skew, " +
s"split it into ${newPartitionSpec.get.size} parts.")
  newPartitionSpec.get
}
  } else if (bytes < targetSize * smallPartitionFactor) {
CoalescedPartitionSpec(reduceIndex, reduceIndex + 1, bytes) :: Nil
  } else {
CoalescedPartitionSpec(reduceIndex, reduceIndex, bytes) :: Nil
  }
}
  }

  private def tryOptimizeSkewedPartitions(shuffle: ShuffleQueryStageExec): 
SparkPlan = {
val advisorySize = conf.getConf(SQLConf.ADVISORY_PARTITION_SIZE_IN_BYTES)
val smallPartitionFactor =
  conf.getConf(SQLConf.ADAPTIVE_REBALANCE_PARTITIONS_SMALL_PARTITION_FACTOR)
val mapStats = shuffle.mapStats
if (mapStats.isEmpty ||
  mapStats.get.bytesByPartitionId.forall(
r => r <= advisorySize && r >= advisorySize * smallPartitionFactor)) {
  return shuffle
}

val newPartitionsSpec = optimizeSkewedPartitions(
  mapStats.get.shuffleId, mapStats.get.bytesByPartitionId, advisorySize, 
smallPartitionFactor)
// return origin plan if we can not optimize partitions
if (newPartitionsSpec.length == mapStats.get.bytesByPartitionId.length) {
  shuffle
} else {
  AQEShuffleReadExec(shuffle, newPartitionsSpec)
}
  }

  override def apply(plan: SparkPlan): SparkPlan = {
if 
(!conf.getConf(SQLConf.ADAPTIVE_OPTIMIZE_SKEWS_IN_REBALANCE_PARTITIONS_ENABLED))
 {
  return plan
}

plan transformUp {
  case stage: ShuffleQueryStageExec if isSupported(stage.shuffle) =>
tryOptimizeSkewedPartitions(stage)
}
  }
}
 {code}
 

 


was (Author: JIRAUSER298432):
OptimizeSkewInRebalancePartitions.scala
{noformat}
/*
 * Licensed to the 

[jira] [Assigned] (SPARK-41433) Make Max Arrow BatchSize configurable

2022-12-07 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-41433:
-

Assignee: Ruifeng Zheng

> Make Max Arrow BatchSize configurable
> -
>
> Key: SPARK-41433
> URL: https://issues.apache.org/jira/browse/SPARK-41433
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41433) Make Max Arrow BatchSize configurable

2022-12-07 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-41433.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 38958
[https://github.com/apache/spark/pull/38958]

> Make Max Arrow BatchSize configurable
> -
>
> Key: SPARK-41433
> URL: https://issues.apache.org/jira/browse/SPARK-41433
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-41386) There are some small files when using rebalance(column)

2022-12-07 Thread Zhe Dong (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17644141#comment-17644141
 ] 

Zhe Dong edited comment on SPARK-41386 at 12/7/22 8:29 AM:
---

OptimizeSkewInRebalancePartitions.scala
{noformat}
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */package org.apache.spark.sql.execution.adaptiveimport 
org.apache.spark.sql.execution.{CoalescedPartitionSpec, ShufflePartitionSpec, SparkPlan}
import org.apache.spark.sql.execution.exchange.{REBALANCE_PARTITIONS_BY_COL, REBALANCE_PARTITIONS_BY_NONE, ShuffleOrigin}
import org.apache.spark.sql.internal.SQLConf

/**
 * A rule to optimize the skewed shuffle partitions in [[RebalancePartitions]] based on the map
 * output statistics, which can avoid data skew that hurt performance.
 *
 * We use ADVISORY_PARTITION_SIZE_IN_BYTES size to decide if a partition should be optimized.
 * Let's say we have 3 maps with 3 shuffle partitions, and assuming r1 has data skew issue.
 * the map side looks like:
 *   m0:[b0, b1, b2], m1:[b0, b1, b2], m2:[b0, b1, b2]
 * and the reduce side looks like:
 *                            (without this rule) r1[m0-b1, m1-b1, m2-b1]
 *                              /                                     \
 *   r0:[m0-b0, m1-b0, m2-b0], r1-0:[m0-b1], r1-1:[m1-b1], r1-2:[m2-b1], r2[m0-b2, m1-b2, m2-b2]
 */
object OptimizeSkewInRebalancePartitions extends AQEShuffleReadRule {

  override val supportedShuffleOrigins: Seq[ShuffleOrigin] =
    Seq(REBALANCE_PARTITIONS_BY_NONE, REBALANCE_PARTITIONS_BY_COL)

  /**
   * Splits the skewed partition based on the map size and the target partition size
   * after split. Create a list of `PartialReducerPartitionSpec` for skewed partition and
   * create `CoalescedPartition` for normal partition.
   */
  private def optimizeSkewedPartitions(
      shuffleId: Int,
      bytesByPartitionId: Array[Long],
      targetSize: Long,
      smallPartitionFactor: Double): Seq[ShufflePartitionSpec] = {
    bytesByPartitionId.indices.flatMap { reduceIndex =>
      val bytes = bytesByPartitionId(reduceIndex)
      if (bytes > targetSize) {
        val newPartitionSpec = ShufflePartitionsUtil.createSkewPartitionSpecs(
          shuffleId, reduceIndex, targetSize, smallPartitionFactor)
        if (newPartitionSpec.isEmpty) {
          CoalescedPartitionSpec(reduceIndex, reduceIndex + 1, bytes) :: Nil
        } else {
          logDebug(s"For shuffle $shuffleId, partition $reduceIndex is skew, " +
            s"split it into ${newPartitionSpec.get.size} parts.")
          newPartitionSpec.get
        }
      } else if (bytes < targetSize * smallPartitionFactor) {
        CoalescedPartitionSpec(reduceIndex, reduceIndex + 1, bytes) :: Nil
      } else {
        CoalescedPartitionSpec(reduceIndex, reduceIndex, bytes) :: Nil
      }
    }
  }

  private def tryOptimizeSkewedPartitions(shuffle: ShuffleQueryStageExec): SparkPlan = {
    val advisorySize = conf.getConf(SQLConf.ADVISORY_PARTITION_SIZE_IN_BYTES)
    val smallPartitionFactor =
      conf.getConf(SQLConf.ADAPTIVE_REBALANCE_PARTITIONS_SMALL_PARTITION_FACTOR)
    val mapStats = shuffle.mapStats
    if (mapStats.isEmpty ||
      mapStats.get.bytesByPartitionId.forall(
        r => r <= advisorySize && r >= advisorySize * smallPartitionFactor)) {
      return shuffle
    }

    val newPartitionsSpec = optimizeSkewedPartitions(
      mapStats.get.shuffleId, mapStats.get.bytesByPartitionId, advisorySize, smallPartitionFactor)
    // return origin plan if we can not optimize partitions
    if (newPartitionsSpec.length == mapStats.get.bytesByPartitionId.length) {
      shuffle
    } else {
      AQEShuffleReadExec(shuffle, newPartitionsSpec)
    }
  }

  override def apply(plan: SparkPlan): SparkPlan = {
    if (!conf.getConf(SQLConf.ADAPTIVE_OPTIMIZE_SKEWS_IN_REBALANCE_PARTITIONS_ENABLED)) {
      return plan
    }

    plan transformUp {
      case stage: ShuffleQueryStageExec if isSupported(stage.shuffle) =>
        tryOptimizeSkewedPartitions(stage)
    }
  }
}
{noformat}
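
For context, a minimal sketch of how this rule is typically exercised from user code, assuming the string forms of the SQLConf entries referenced above; the table name, output path, and size values are illustrative only:

{noformat}
// Sketch only: enable AQE and the skew optimization for RebalancePartitions.
// The config key strings are assumed to match the SQLConf entries used by the rule.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.optimizeSkewsInRebalancePartitions.enabled", "true")
// ADVISORY_PARTITION_SIZE_IN_BYTES: the target size used to decide whether a reduce
// partition is skewed (above the target) or small (below target * smallPartitionFactor).
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "64MB")
spark.conf.set("spark.sql.adaptive.rebalancePartitionsSmallPartitionFactor", "0.2")

// A REBALANCE hint produces RebalancePartitions, whose shuffle output this rule can
// then split (skewed partitions) or keep as-is (normal and small partitions).
val df = spark.sql("SELECT /*+ REBALANCE(key) */ key, value FROM events")  // hypothetical table
df.write.mode("overwrite").parquet("/tmp/events_rebalanced")               // hypothetical path
{noformat}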
 

 


was (Author: JIRAUSER298432):
 
{noformat}
    if (mapStats.isEmpty ||
      mapStats.get.bytesByPartitionId.forall(_

[jira] [Assigned] (SPARK-41403) Implement DataFrame.describe

2022-12-07 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-41403:
-

Assignee: jiaan.geng

> Implement DataFrame.describe
> 
>
> Key: SPARK-41403
> URL: https://issues.apache.org/jira/browse/SPARK-41403
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: jiaan.geng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41403) Implement DataFrame.describe

2022-12-07 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-41403.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 38938
[https://github.com/apache/spark/pull/38938]

> Implement DataFrame.describe
> 
>
> Key: SPARK-41403
> URL: https://issues.apache.org/jira/browse/SPARK-41403
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41349) Implement `DataFrame.hint`

2022-12-07 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-41349.
-
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 38899
[https://github.com/apache/spark/pull/38899]

> Implement `DataFrame.hint`
> --
>
> Key: SPARK-41349
> URL: https://issues.apache.org/jira/browse/SPARK-41349
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Deng Ziming
>Priority: Major
> Fix For: 3.4.0
>
>
> implement DataFrame.hint with the proto message added in 
> https://issues.apache.org/jira/browse/SPARK-41345
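
For reference, a minimal sketch of the existing non-Connect Dataset.hint behaviour that this sub-task mirrors, assuming the current Scala API; table and column names are illustrative only:

{noformat}
// Sketch of existing Dataset.hint usage that the Connect implementation should match.
// Table and column names below are hypothetical.
val small = spark.table("dim_country")
val large = spark.table("fact_visits")

// hint(name, parameters*) attaches an unresolved hint to the plan; "broadcast"
// asks the planner to broadcast this side of the join.
val joined = large.join(small.hint("broadcast"), "country_id")

// Hints may also carry parameters, e.g. a repartition hint with a partition count.
val repartitioned = large.hint("repartition", 100)
{noformat}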



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41349) Implement `DataFrame.hint`

2022-12-07 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-41349:
---

Assignee: Deng Ziming

> Implement `DataFrame.hint`
> --
>
> Key: SPARK-41349
> URL: https://issues.apache.org/jira/browse/SPARK-41349
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Deng Ziming
>Priority: Major
>
> implement DataFrame.hint with the proto message added in 
> https://issues.apache.org/jira/browse/SPARK-41345



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org