[jira] [Updated] (SPARK-44381) How to specify parameters in spark-submit to make HiveDelegationTokenProvider refresh token regularly

2023-07-11 Thread qingbo jiao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

qingbo jiao updated SPARK-44381:

Description: 
export KRB5CCNAME=FILE:/tmp/krb5cc_1001
./bin/spark-submit --master yarn --deploy-mode client --proxy-user  
--conf spark.app.name=spark-hive-test --conf 
spark.security.credentials.renewalRatio=0.58 --conf 
spark.kerberos.renewal.credentials=ccache --class 
org.apache.spark.examples.sql.hive.SparkHiveExample 
/examples/jars/spark-examples_2.12-3.1.1.jar

Spark version 3.1.1. I configured it to refresh every 5 seconds.

I have tried --deploy-mode client/cluster, with and without --proxy-user, but 
none of them works.

Am I missing any configuration parameters?

  was:
export KRB5CCNAME=FILE:/tmp/krb5cc_1001
./bin/spark-submit --master yarn --deploy-mode client --proxy-user ocdp --conf 
spark.app.name=spark-hive-test --conf 
spark.security.credentials.renewalRatio=0.58 --conf 
spark.kerberos.renewal.credentials=ccache  --class 
org.apache.spark.examples.sql.hive.SparkHiveExample 
/examples/jars/spark-examples_2.12-3.1.1.jar

spark version 3.1.1,I configured it to refresh every 5 seconds。

--deploy-mode client/cluster wtih/without --proxy-user have all been tried, but 
none of them will work

Missing any configuration parameters?


> How to specify parameters in spark-submit to make HiveDelegationTokenProvider 
> refresh token regularly
> -
>
> Key: SPARK-44381
> URL: https://issues.apache.org/jira/browse/SPARK-44381
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: qingbo jiao
>Priority: Minor
>
> export KRB5CCNAME=FILE:/tmp/krb5cc_1001
> ./bin/spark-submit --master yarn --deploy-mode client --proxy-user  
> --conf spark.app.name=spark-hive-test --conf 
> spark.security.credentials.renewalRatio=0.58 --conf 
> spark.kerberos.renewal.credentials=ccache --class 
> org.apache.spark.examples.sql.hive.SparkHiveExample 
> /examples/jars/spark-examples_2.12-3.1.1.jar
> Spark version 3.1.1. I configured it to refresh every 5 seconds.
> I have tried --deploy-mode client/cluster, with and without --proxy-user, 
> but none of them works.
> Am I missing any configuration parameters?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44381) How to specify parameters in spark-submit to make HiveDelegationTokenProvider refresh token regularly

2023-07-11 Thread qingbo jiao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17742293#comment-17742293
 ] 

qingbo jiao commented on SPARK-44381:
-

cc [~jshao], thanks

> How to specify parameters in spark-submit to make HiveDelegationTokenProvider 
> refresh token regularly
> -
>
> Key: SPARK-44381
> URL: https://issues.apache.org/jira/browse/SPARK-44381
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: qingbo jiao
>Priority: Minor
>
> export KRB5CCNAME=FILE:/tmp/krb5cc_1001
> ./bin/spark-submit --master yarn --deploy-mode client --proxy-user ocdp 
> --conf spark.app.name=spark-hive-test --conf 
> spark.security.credentials.renewalRatio=0.58 --conf 
> spark.kerberos.renewal.credentials=ccache  --class 
> org.apache.spark.examples.sql.hive.SparkHiveExample 
> /examples/jars/spark-examples_2.12-3.1.1.jar
> Spark version 3.1.1. I configured it to refresh every 5 seconds.
> I have tried --deploy-mode client/cluster, with and without --proxy-user, 
> but none of them works.
> Am I missing any configuration parameters?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44353) Remove toAttributes from StructType

2023-07-11 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-44353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hövell resolved SPARK-44353.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

> Remove toAttributes from StructType
> ---
>
> Key: SPARK-44353
> URL: https://issues.apache.org/jira/browse/SPARK-44353
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect, SQL
>Affects Versions: 3.4.1
>Reporter: Herman van Hövell
>Assignee: Herman van Hövell
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44373) Wrap withActive for Dataset API w/ parse logic

2023-07-11 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-44373.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41938
[https://github.com/apache/spark/pull/41938]

> Wrap withActive for Dataset API w/ parse logic
> --
>
> Key: SPARK-44373
> URL: https://issues.apache.org/jira/browse/SPARK-44373
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44373) Wrap withActive for Dataset API w/ parse logic

2023-07-11 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao reassigned SPARK-44373:


Assignee: Kent Yao

> Wrap withActive for Dataset API w/ parse logic
> --
>
> Key: SPARK-44373
> URL: https://issues.apache.org/jira/browse/SPARK-44373
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44334) Status of execution w/ error and w/o jobs shall be FAILED not COMPLETED

2023-07-11 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-44334.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41891
[https://github.com/apache/spark/pull/41891]

> Status of execution w/ error and w/o jobs shall be FAILED not COMPLETED
> ---
>
> Key: SPARK-44334
> URL: https://issues.apache.org/jira/browse/SPARK-44334
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Web UI
>Affects Versions: 3.3.2, 3.4.1, 3.5.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44334) Status of execution w/ error and w/o jobs shall be FAILED not COMPLETED

2023-07-11 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao reassigned SPARK-44334:


Assignee: Kent Yao

> Status of execution w/ error and w/o jobs shall be FAILED not COMPLETED
> ---
>
> Key: SPARK-44334
> URL: https://issues.apache.org/jira/browse/SPARK-44334
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Web UI
>Affects Versions: 3.3.2, 3.4.1, 3.5.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44370) Migrate Buf remote generation alpha to remote plugins

2023-07-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-44370.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41933
[https://github.com/apache/spark/pull/41933]

> Migrate Buf remote generation alpha to remote plugins
> -
>
> Key: SPARK-44370
> URL: https://issues.apache.org/jira/browse/SPARK-44370
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.4.1
>Reporter: Jia Fan
>Assignee: Jia Fan
>Priority: Major
> Fix For: 3.5.0
>
>
> Buf no longer supports remote generation alpha. Please refer to 
> [https://buf.build/docs/migration-guides/migrate-remote-generation-alpha/]. 
> We should migrate from Buf remote generation alpha to remote plugins by 
> following the guide.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43755) Spark Connect - decouple query execution from RPC handler

2023-07-11 Thread Snoot.io (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17742252#comment-17742252
 ] 

Snoot.io commented on SPARK-43755:
--

User 'juliuszsompolski' has created a pull request for this issue:
https://github.com/apache/spark/pull/41315

> Spark Connect - decouple query execution from RPC handler
> -
>
> Key: SPARK-43755
> URL: https://issues.apache.org/jira/browse/SPARK-43755
> Project: Spark
>  Issue Type: Story
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Juliusz Sompolski
>Priority: Major
>
> Move actual query execution out of the RPC handler callback. This allows:
>  * (immediately) better control over query cancellation, by interrupting the 
> execution thread.
>  * design changes to the RPC interface to allow different execution models 
> than stream-push from server.
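
For illustration only, here is a minimal Python sketch of the decoupling idea described above: the RPC handler submits the query to a thread pool and returns a handle, so cancellation can interrupt the execution rather than the RPC callback. This is not the actual Spark Connect server code; all names below are hypothetical.

{code:python}
from concurrent.futures import ThreadPoolExecutor
import threading

executor = ThreadPoolExecutor(max_workers=4)

class QueryExecution:
    """Holds the state of one running query, independent of any RPC thread."""

    def __init__(self, run_query):
        self._cancelled = threading.Event()
        # The query runs on the pool, not on the RPC handler's thread.
        self._future = executor.submit(run_query, self._cancelled)

    def interrupt(self):
        # Cooperative cancellation: a running query checks the flag;
        # a still-queued query is cancelled before it starts.
        self._cancelled.set()
        self._future.cancel()

    def result(self, timeout=None):
        return self._future.result(timeout=timeout)

def handle_execute_rpc(run_query):
    # The RPC handler returns a handle immediately instead of blocking
    # until the query finishes and streaming results from this callback.
    return QueryExecution(run_query)
{code}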



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44381) How to specify parameters in spark-submit to make HiveDelegationTokenProvider refresh token regularly

2023-07-11 Thread qingbo jiao (Jira)
qingbo jiao created SPARK-44381:
---

 Summary: How to specify parameters in spark-submit to make 
HiveDelegationTokenProvider refresh token regularly
 Key: SPARK-44381
 URL: https://issues.apache.org/jira/browse/SPARK-44381
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.1
Reporter: qingbo jiao


export KRB5CCNAME=FILE:/tmp/krb5cc_1001
./bin/spark-submit --master yarn --deploy-mode client --proxy-user ocdp --conf 
spark.app.name=spark-hive-test --conf 
spark.security.credentials.renewalRatio=0.58 --conf 
spark.kerberos.renewal.credentials=ccache  --class 
org.apache.spark.examples.sql.hive.SparkHiveExample 
/examples/jars/spark-examples_2.12-3.1.1.jar

Spark version 3.1.1. I configured it to refresh every 5 seconds.

I have tried --deploy-mode client/cluster, with and without --proxy-user, but 
none of them works.

Am I missing any configuration parameters?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44340) Define the computing logic through PartitionEvaluator API and use it in WindowGroupLimitExec

2023-07-11 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-44340.
-
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41899
[https://github.com/apache/spark/pull/41899]

> Define the computing logic through PartitionEvaluator API and use it in 
> WindowGroupLimitExec
> 
>
> Key: SPARK-44340
> URL: https://issues.apache.org/jira/browse/SPARK-44340
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.5.0
>
>
> Define the computing logic through PartitionEvaluator API and use it in 
> WindowGroupLimitExec



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44340) Define the computing logic through PartitionEvaluator API and use it in WindowGroupLimitExec

2023-07-11 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-44340:
---

Assignee: jiaan.geng

> Define the computing logic through PartitionEvaluator API and use it in 
> WindowGroupLimitExec
> 
>
> Key: SPARK-44340
> URL: https://issues.apache.org/jira/browse/SPARK-44340
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
>
> Define the computing logic through PartitionEvaluator API and use it in 
> WindowGroupLimitExec



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44340) Define the computing logic through PartitionEvaluator API and use it in WindowGroupLimitExec

2023-07-11 Thread Snoot.io (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17742246#comment-17742246
 ] 

Snoot.io commented on SPARK-44340:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/41899

> Define the computing logic through PartitionEvaluator API and use it in 
> WindowGroupLimitExec
> 
>
> Key: SPARK-44340
> URL: https://issues.apache.org/jira/browse/SPARK-44340
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: jiaan.geng
>Priority: Major
>
> Define the computing logic through PartitionEvaluator API and use it in 
> WindowGroupLimitExec



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44340) Define the computing logic through PartitionEvaluator API and use it in WindowGroupLimitExec

2023-07-11 Thread Snoot.io (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17742247#comment-17742247
 ] 

Snoot.io commented on SPARK-44340:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/41899

> Define the computing logic through PartitionEvaluator API and use it in 
> WindowGroupLimitExec
> 
>
> Key: SPARK-44340
> URL: https://issues.apache.org/jira/browse/SPARK-44340
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: jiaan.geng
>Priority: Major
>
> Define the computing logic through PartitionEvaluator API and use it in 
> WindowGroupLimitExec



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43665) Enable PandasSQLStringFormatter.vformat to work with Spark Connect

2023-07-11 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-43665.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41931
[https://github.com/apache/spark/pull/41931]

> Enable PandasSQLStringFormatter.vformat to work with Spark Connect
> --
>
> Key: SPARK-43665
> URL: https://issues.apache.org/jira/browse/SPARK-43665
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Pandas API on Spark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.5.0
>
>
> Enable PandasSQLStringFormatter.vformat to work with Spark Connect



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43665) Enable PandasSQLStringFormatter.vformat to work with Spark Connect

2023-07-11 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-43665:
-

Assignee: Haejoon Lee

> Enable PandasSQLStringFormatter.vformat to work with Spark Connect
> --
>
> Key: SPARK-43665
> URL: https://issues.apache.org/jira/browse/SPARK-43665
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Pandas API on Spark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>
> Enable PandasSQLStringFormatter.vformat to work with Spark Connect



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44325) Define the computing logic through PartitionEvaluator API and use it in SortMergeJoinExec

2023-07-11 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-44325.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41884
[https://github.com/apache/spark/pull/41884]

> Define the computing logic through PartitionEvaluator API and use it in 
> SortMergeJoinExec
> -
>
> Key: SPARK-44325
> URL: https://issues.apache.org/jira/browse/SPARK-44325
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Vinod KC
>Assignee: Vinod KC
>Priority: Major
> Fix For: 3.5.0
>
>
> Define the computing logic through PartitionEvaluator API and use it in 
> SortMergeJoinExec



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44325) Define the computing logic through PartitionEvaluator API and use it in SortMergeJoinExec

2023-07-11 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao reassigned SPARK-44325:


Assignee: Vinod KC

> Define the computing logic through PartitionEvaluator API and use it in 
> SortMergeJoinExec
> -
>
> Key: SPARK-44325
> URL: https://issues.apache.org/jira/browse/SPARK-44325
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Vinod KC
>Assignee: Vinod KC
>Priority: Major
>
> Define the computing logic through PartitionEvaluator API and use it in 
> SortMergeJoinExec



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44377) exclude junit5 deps from jersey-test-framework-provider-simple

2023-07-11 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie resolved SPARK-44377.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41944
[https://github.com/apache/spark/pull/41944]

> exclude junit5 deps from jersey-test-framework-provider-simple
> --
>
> Key: SPARK-44377
> URL: https://issues.apache.org/jira/browse/SPARK-44377
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
> Fix For: 3.5.0
>
>
> SPARK-44316 upgraded Jersey from 2.36 to 2.40. Jersey 2.38 started to use 
> [JUnit 5 instead of JUnit 4|https://github.com/eclipse-ee4j/jersey/pull/5123], and the 
> Spark core module uses 
> `org.glassfish.jersey.test-framework.providers:jersey-test-framework-provider-simple:2.40`,
> which transitively pulls in the JUnit 5 dependencies. This causes Java tests 
> to no longer be executed when running Maven tests on the core module.
> Run `mvn clean install -pl core -am`:
>  
> {code:java}
> [INFO] --- maven-surefire-plugin:3.1.2:test (default-test) @ spark-core_2.12 
> ---
> [INFO] Using auto detected provider 
> org.apache.maven.surefire.junitplatform.JUnitPlatformProvider
> [INFO] 
> [INFO] ---
> [INFO]  T E S T S
> [INFO] ---
> [INFO] 
> [INFO] Results:
> [INFO] 
> [INFO] Tests run: 0, Failures: 0, Errors: 0, Skipped: 0
> [INFO] 
> [INFO] 
> [INFO] --- maven-surefire-plugin:3.1.2:test (test) @ spark-core_2.12 ---
> [INFO] Skipping execution of surefire because it has already been run for 
> this configuration{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44377) exclude junit5 deps from jersey-test-framework-provider-simple

2023-07-11 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17742234#comment-17742234
 ] 

Yang Jie commented on SPARK-44377:
--

Fixed

> exclude junit5 deps from jersey-test-framework-provider-simple
> --
>
> Key: SPARK-44377
> URL: https://issues.apache.org/jira/browse/SPARK-44377
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
> Fix For: 3.5.0
>
>
> SPARK-44316 upgraded Jersey from 2.36 to 2.40. Jersey 2.38 started to use 
> [JUnit 5 instead of JUnit 4|https://github.com/eclipse-ee4j/jersey/pull/5123], and the 
> Spark core module uses 
> `org.glassfish.jersey.test-framework.providers:jersey-test-framework-provider-simple:2.40`,
> which transitively pulls in the JUnit 5 dependencies. This causes Java tests 
> to no longer be executed when running Maven tests on the core module.
> Run `mvn clean install -pl core -am`:
>  
> {code:java}
> [INFO] --- maven-surefire-plugin:3.1.2:test (default-test) @ spark-core_2.12 
> ---
> [INFO] Using auto detected provider 
> org.apache.maven.surefire.junitplatform.JUnitPlatformProvider
> [INFO] 
> [INFO] ---
> [INFO]  T E S T S
> [INFO] ---
> [INFO] 
> [INFO] Results:
> [INFO] 
> [INFO] Tests run: 0, Failures: 0, Errors: 0, Skipped: 0
> [INFO] 
> [INFO] 
> [INFO] --- maven-surefire-plugin:3.1.2:test (test) @ spark-core_2.12 ---
> [INFO] Skipping execution of surefire because it has already been run for 
> this configuration{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44377) exclude junit5 deps from jersey-test-framework-provider-simple

2023-07-11 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie reassigned SPARK-44377:


Assignee: Yang Jie

> exclude junit5 deps from jersey-test-framework-provider-simple
> --
>
> Key: SPARK-44377
> URL: https://issues.apache.org/jira/browse/SPARK-44377
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>
> SPARK-44316 upgraded Jersey from 2.36 to 2.40. Jersey 2.38 started to use 
> [JUnit 5 instead of JUnit 4|https://github.com/eclipse-ee4j/jersey/pull/5123], and the 
> Spark core module uses 
> `org.glassfish.jersey.test-framework.providers:jersey-test-framework-provider-simple:2.40`,
> which transitively pulls in the JUnit 5 dependencies. This causes Java tests 
> to no longer be executed when running Maven tests on the core module.
> Run `mvn clean install -pl core -am`:
>  
> {code:java}
> [INFO] --- maven-surefire-plugin:3.1.2:test (default-test) @ spark-core_2.12 
> ---
> [INFO] Using auto detected provider 
> org.apache.maven.surefire.junitplatform.JUnitPlatformProvider
> [INFO] 
> [INFO] ---
> [INFO]  T E S T S
> [INFO] ---
> [INFO] 
> [INFO] Results:
> [INFO] 
> [INFO] Tests run: 0, Failures: 0, Errors: 0, Skipped: 0
> [INFO] 
> [INFO] 
> [INFO] --- maven-surefire-plugin:3.1.2:test (test) @ spark-core_2.12 ---
> [INFO] Skipping execution of surefire because it has already been run for 
> this configuration{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44374) Add example code

2023-07-11 Thread Weichen Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu updated SPARK-44374:
---
Fix Version/s: 3.5.0

> Add example code
> 
>
> Key: SPARK-44374
> URL: https://issues.apache.org/jira/browse/SPARK-44374
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, ML, PySpark
>Affects Versions: 3.5.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Major
> Fix For: 3.5.0
>
>
> Add example code for distributed ML <> spark connect .



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44374) Add example code

2023-07-11 Thread Weichen Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu resolved SPARK-44374.

Resolution: Done

> Add example code
> 
>
> Key: SPARK-44374
> URL: https://issues.apache.org/jira/browse/SPARK-44374
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, ML, PySpark
>Affects Versions: 3.5.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Major
>
> Add example code for distributed ML <> spark connect .



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44362) Use PartitionEvaluator API in AggregateInPandasExec,EvalPythonExec,AttachDistributedSequenceExec

2023-07-11 Thread jiaan.geng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1774#comment-1774
 ] 

jiaan.geng commented on SPARK-44362:


Thank you.

> Use  PartitionEvaluator API in 
> AggregateInPandasExec,EvalPythonExec,AttachDistributedSequenceExec
> -
>
> Key: SPARK-44362
> URL: https://issues.apache.org/jira/browse/SPARK-44362
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Vinod KC
>Priority: Major
>
> Use  PartitionEvaluator API in
> AggregateInPandasExec
> EvalPythonExec
> AttachDistributedSequenceExec



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44380) Support for UDTF to analyze in Python

2023-07-11 Thread Takuya Ueshin (Jira)
Takuya Ueshin created SPARK-44380:
-

 Summary: Support for UDTF to analyze in Python
 Key: SPARK-44380
 URL: https://issues.apache.org/jira/browse/SPARK-44380
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.5.0
Reporter: Takuya Ueshin






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44217) Allow custom precision for fp approx equality

2023-07-11 Thread Amanda Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amanda Liu updated SPARK-44217:
---
Summary: Allow custom precision for fp approx equality  (was: Add 
assert_approx_df_equality util function)

> Allow custom precision for fp approx equality
> -
>
> Key: SPARK-44217
> URL: https://issues.apache.org/jira/browse/SPARK-44217
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Amanda Liu
>Priority: Major
>
> SPIP: 
> https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v
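
For context, the change is about passing a tolerance when comparing floating-point columns. A rough sketch of how that might look is below; the {{assertDataFrameEqual}} helper and the {{rtol}}/{{atol}} keyword names are assumptions based on the ticket summary and SPIP, not a confirmed API.

{code:python}
from pyspark.sql import SparkSession
from pyspark.testing import assertDataFrameEqual  # assumed import path

spark = SparkSession.builder.master("local[1]").getOrCreate()
actual = spark.createDataFrame([(1, 0.1000001)], ["id", "score"])
expected = spark.createDataFrame([(1, 0.1)], ["id", "score"])

# Custom precision for approximate float equality (parameter name assumed).
assertDataFrameEqual(actual, expected, rtol=1e-4)
{code}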



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44264) DeepSpeed Distributor

2023-07-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-44264.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41770
[https://github.com/apache/spark/pull/41770]

> DeepSpeed Distributor
> -
>
> Key: SPARK-44264
> URL: https://issues.apache.org/jira/browse/SPARK-44264
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 3.4.1
>Reporter: Lu Wang
>Priority: Critical
> Fix For: 3.5.0
>
>
> Make it easier for PySpark users to run distributed training and inference 
> with DeepSpeed on Spark clusters. This was a project determined 
> by the Databricks ML Training Team.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43513) withColumnRenamed duplicates columns if new column already exists

2023-07-11 Thread Frederik Paradis (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17742175#comment-17742175
 ] 

Frederik Paradis commented on SPARK-43513:
--

Hi [~wenxin]. Thank you for your comment.

1. I didn't know what to put there, so I just put the latest version.
2. I see what you mean. However, it seems counter-intuitive to me to allow 
columns with the same name and no other way to differentiate them other 
than their positions. Especially with the fact that joins do move columns 
around and that (mostly?) all operations in Spark do not support referring to 
columns by their positions. Beyond that, I guess it's more of a question of 
engineering design and vision of the software.

> withColumnRenamed duplicates columns if new column already exists
> -
>
> Key: SPARK-43513
> URL: https://issues.apache.org/jira/browse/SPARK-43513
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Frederik Paradis
>Priority: Major
>
> withColumnRenamed should either replace the column when new column already 
> exists or should specify the specificity in the documentation. See the code 
> below as an example of the current state.
> {code:python}
> from pyspark.sql import SparkSession
> spark = 
> SparkSession.builder.master("local[1]").appName("local-spark-session").getOrCreate()
> df = spark.createDataFrame([(1, 0.5, 0.4), (2, 0.5, 0.8)], ["id", "score", 
> "test_score"])
> r = df.withColumnRenamed("test_score", "score")
> print(r)  # DataFrame[id: bigint, score: double, score: double]
> # pyspark.sql.utils.AnalysisException: Reference 'score' is ambiguous, could 
> be: score, score.
> print(r.select("score"))
> {code}
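
A possible workaround sketch, assuming the intent is for the renamed column to replace the existing one, is to drop the old column before renaming:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("local-spark-session").getOrCreate()
df = spark.createDataFrame([(1, 0.5, 0.4), (2, 0.5, 0.8)], ["id", "score", "test_score"])

# Drop the pre-existing "score" column first, then rename, so the result
# keeps a single, unambiguous "score" column.
r = df.drop("score").withColumnRenamed("test_score", "score")
print(r)  # DataFrame[id: bigint, score: double]
print(r.select("score").count())  # no ambiguity error
{code}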



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-43513) withColumnRenamed duplicates columns if new column already exists

2023-07-11 Thread Frederik Paradis (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17742175#comment-17742175
 ] 

Frederik Paradis edited comment on SPARK-43513 at 7/11/23 9:02 PM:
---

Hi [~wenxin]. Thank you for your comment.

1. Didn't know what to put there so I just put the latest version.
2. I see what you mean. However, it seems counter-intuitive to me to allow 
columns with the same name and no other way to differentiate them other 
than their positions. Especially with the fact that joins do move columns 
around and that (mostly?) all operations in Spark do not support referring to 
columns by their positions. Beyond that, I guess it's more of a question of 
engineering design and vision of the software.


was (Author: JIRAUSER300280):
Hi [~wenxin]. Thank you for your comment.

1. Didn't what to put there so I just put the latest version.
2. I see what you mean. However, it seems counter-intuitive to me to allow to 
allow columns with the same name and no other ways to differentiate them other 
than their positions. Especially with the fact that joins do move columns 
around and that (mostly?) all operations in Spark do not support referring to 
columns by their positions. Beyond that, I guess it's more of a question of 
engineering design and vision of the software.

> withColumnRenamed duplicates columns if new column already exists
> -
>
> Key: SPARK-43513
> URL: https://issues.apache.org/jira/browse/SPARK-43513
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Frederik Paradis
>Priority: Major
>
> withColumnRenamed should either replace the column when new column already 
> exists or should specify the specificity in the documentation. See the code 
> below as an example of the current state.
> {code:python}
> from pyspark.sql import SparkSession
> spark = 
> SparkSession.builder.master("local[1]").appName("local-spark-session").getOrCreate()
> df = spark.createDataFrame([(1, 0.5, 0.4), (2, 0.5, 0.8)], ["id", "score", 
> "test_score"])
> r = df.withColumnRenamed("test_score", "score")
> print(r)  # DataFrame[id: bigint, score: double, score: double]
> # pyspark.sql.utils.AnalysisException: Reference 'score' is ambiguous, could 
> be: score, score.
> print(r.select("score"))
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44279) Upgrade word-wrap

2023-07-11 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-44279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17742146#comment-17742146
 ] 

Bjørn Jørgensen commented on SPARK-44279:
-

have a look at https://github.com/apache/spark/pull/35628 and 
https://github.com/apache/spark/pull/39143

> Upgrade word-wrap
> -
>
> Key: SPARK-44279
> URL: https://issues.apache.org/jira/browse/SPARK-44279
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.4.1, 3.5.0
>Reporter: Bjørn Jørgensen
>Priority: Major
>
> [Regular Expression Denial of Service (ReDoS) - 
> CVE-2023-26115|https://github.com/jonschlinkert/word-wrap/issues/32]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44279) Upgrade word-wrap

2023-07-11 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17742139#comment-17742139
 ] 

Sean R. Owen commented on SPARK-44279:
--

This is a dumb question, but what is that file? Packages used by what part of 
Spark? I have never seen it.

> Upgrade word-wrap
> -
>
> Key: SPARK-44279
> URL: https://issues.apache.org/jira/browse/SPARK-44279
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.4.1, 3.5.0
>Reporter: Bjørn Jørgensen
>Priority: Major
>
> [Regular Expression Denial of Service (ReDoS) - 
> CVE-2023-26115|https://github.com/jonschlinkert/word-wrap/issues/32]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44279) Upgrade word-wrap

2023-07-11 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-44279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17742137#comment-17742137
 ] 

Bjørn Jørgensen commented on SPARK-44279:
-

[~srowen] 
https://github.com/apache/spark/blob/37aa62f629e652ed70505620473530cd9611018e/dev/package-lock.json#L2226
 

[word-wrap vulnerable to Regular Expression Denial of 
Service|https://github.com/jonschlinkert/word-wrap/issues/40]


> Upgrade word-wrap
> -
>
> Key: SPARK-44279
> URL: https://issues.apache.org/jira/browse/SPARK-44279
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.4.1, 3.5.0
>Reporter: Bjørn Jørgensen
>Priority: Major
>
> [Regular Expression Denial of Service (ReDoS) - 
> CVE-2023-26115|https://github.com/jonschlinkert/word-wrap/issues/32]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44262) JdbcUtils hardcodes some SQL statements

2023-07-11 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-44262:
-
Issue Type: Improvement  (was: Bug)
  Priority: Minor  (was: Major)

> JdbcUtils hardcodes some SQL statements
> ---
>
> Key: SPARK-44262
> URL: https://issues.apache.org/jira/browse/SPARK-44262
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Florent BIVILLE
>Priority: Minor
>
> I am currently investigating an integration with the [Neo4j JDBC 
> driver|https://github.com/neo4j-contrib/neo4j-jdbc] and a Spark-based cloud 
> vendor SDK.
>  
> This SDK relies on Spark's {{JdbcUtils}} to run queries and insert data.
> While {{JdbcUtils}} partly delegates to 
> {{org.apache.spark.sql.jdbc.JdbcDialect}} for some queries, some others are 
> hardcoded to SQL, see:
>  * {{org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils#dropTable}}
>  * 
> {{org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils#getInsertStatement}}
>  
> This works fine for relational databases but breaks for NOSQL stores that do 
> not support SQL translation (like Neo4j).
> Is there a plan to augment the {{JdbcDialect}} surface so that it is also 
> responsible for these currently-hardcoded queries?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43439) Drop does not work when passed a string with an alias

2023-07-11 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-43439:
-
Priority: Minor  (was: Major)

> Drop does not work when passed a string with an alias
> -
>
> Key: SPARK-43439
> URL: https://issues.apache.org/jira/browse/SPARK-43439
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Frederik Paradis
>Priority: Minor
>
> When passing a string to the drop method, if the string contains an alias, 
> the column is not dropped. However, when a column object with the same 
> name and alias is passed, it works.
> {code:python}
> from pyspark.sql import SparkSession
> import pyspark.sql.functions as F
> spark = 
> SparkSession.builder.master("local[1]").appName("local-spark-session").getOrCreate()
> df = spark.createDataFrame([(1, 10)], ["any", "hour"]).alias("a")
> j = df.drop("a.hour")
> print(j)  # DataFrame[any: bigint, hour: bigint]
> jj = df.drop(F.col("a.hour"))
> print(jj)  # DataFrame[any: bigint]
> {code}
>  
> Related issues:
> https://issues.apache.org/jira/browse/SPARK-31123
> https://issues.apache.org/jira/browse/SPARK-14759
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44058) Remove deprecated API usage in HiveShim.scala

2023-07-11 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-44058.
--
Resolution: Not A Problem

> Remove deprecated API usage in HiveShim.scala
> -
>
> Key: SPARK-44058
> URL: https://issues.apache.org/jira/browse/SPARK-44058
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 3.4.0
>Reporter: Aman Raj
>Priority: Major
>
> Spark's HiveShim.scala calls this particular method in Hive:
> {code:scala}
> createPartitionMethod.invoke(
>   hive,
>   table,
>   spec,
>   location,
>   params, // partParams
>   null, // inputFormat
>   null, // outputFormat
>   -1: JInteger, // numBuckets
>   null, // cols
>   null, // serializationLib
>   null, // serdeParams
>   null, // bucketCols
>   null) // sortCols
> {code}
>  
> We do not have any such implementation of createPartition in Hive. We only 
> have this definition:
> {code:java}
> public Partition createPartition(Table tbl, Map<String, String> partSpec) throws HiveException {
>   try {
>     org.apache.hadoop.hive.metastore.api.Partition part =
>         Partition.createMetaPartitionObject(tbl, partSpec, null);
>     AcidUtils.TableSnapshot tableSnapshot = AcidUtils.getTableSnapshot(conf, tbl);
>     part.setWriteId(tableSnapshot != null ? tableSnapshot.getWriteId() : 0);
>     return new Partition(tbl, getMSC().add_partition(part));
>   } catch (Exception e) {
>     LOG.error(StringUtils.stringifyException(e));
>     throw new HiveException(e);
>   }
> }
> {code}
> *The 12-parameter implementation was removed in HIVE-5951.*
>  
> The issue is that this 12-parameter implementation of the createPartition 
> method was added in Hive 0.12 and then removed in Hive 0.13. When Hive 0.12 
> was used in Spark, the SPARK-15334 commit added this 12-parameter 
> implementation. But after Hive migrated to newer APIs, this was somehow not 
> changed in Spark OSS, and it looks to us like a bug on the Spark side.
>  
> We need to migrate to the newest implementation of the Hive createPartition 
> method, otherwise this flow can break.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44379) Broadcast Joins taking up too much memory

2023-07-11 Thread Shardul Mahadik (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shardul Mahadik updated SPARK-44379:

Description: 
Context: After migrating to Spark 3 with AQE, we saw a significant increase in 
driver and executor memory usage in our jobs which contain star joins. By 
analyzing a heap dump, we saw that the majority of the memory was being taken up by 
{{UnsafeHashedRelation}} used for broadcast joins; in this case there were 92 
broadcast joins in the query.

!screenshot-1.png|width=851,height=70!

This took up over 6GB of total memory, even though every table being 
broadcasted was around ~1MB and hence should only have been ~100MB total. I 
found that this is because {{BytesToBytesMap}} used within 
{{UnsafeHashedRelation}} allocates memory in ["pageSize" 
increments|https://github.com/apache/spark/blob/37aa62f629e652ed70505620473530cd9611018e/core/src/main/java/org/apache/spark/memory/MemoryConsumer.java#L117]
 which in our case was 64MB. Based on the [default page size 
calculation|https://github.com/apache/spark/blob/37aa62f629e652ed70505620473530cd9611018e/core/src/main/scala/org/apache/spark/memory/MemoryManager.scala#L251],
 this should be the case for any container with > 1 GB of memory (assuming 
executor.cores = 1) which is far too common. Thus in our case, most of the 
memory requested by {{BytesToBytesMap}} was un-utilized with just trailing 0s.

!screenshot-2.png|width=389,height=101!

I think this is a major inefficiency for broadcast joins (especially star 
joins). I think there are a few ways to tackle the problem.
1) Reduce {{spark.buffer.pageSize}} globally to a lower value. This does reduce 
the memory consumption of broadcast joins, but I am not sure what it implies 
for the rest of the Spark machinery.
2) Add a "finalize" operation to {{BytesToBytesMap}} which is called after all 
values are added to the map and allocates a new page only for the required 
bytes. 
3) Enhance the serialization of {{BytesToBytesMap}} to record the number of 
keys and values, and use those during deserialization to only request the 
required memory.
4) Use a lower page size for certain {{BytesToBytesMap}} based on the estimated 
data size of broadcast joins.

I believe Option 3 would be simple enough to implement and I have a POC PR 
which I will post soon, but I am interested in knowing other people's thoughts 
here. 

  was:
Context: After migrating to Spark 3 with AQE, we saw a significant increase in 
driver and executor memory usage in our jobs which contains star joins. By 
analyzing heapdump, we saw that majority of the memory was being taken up by 
{{UnsafeHashedRelation}} used for broadcast joins; in this case there were 92 
broadcast joins in the query.

!screenshot-1.png|width=851,height=70!

This took up over 6GB of total memory, even though every table being 
broadcasted was around ~1MB and hence should only have been ~100MB total. I 
found that this is because {{BytesToBytesMap}} used within 
{{UnsafeHashedRelation}} allocates memory in ["pageSize" 
increments|https://github.com/apache/spark/blob/37aa62f629e652ed70505620473530cd9611018e/core/src/main/java/org/apache/spark/memory/MemoryConsumer.java#L117]
 which in our case was 64MB. Based on the [default page size 
calculation|https://github.com/apache/spark/blob/37aa62f629e652ed70505620473530cd9611018e/core/src/main/scala/org/apache/spark/memory/MemoryManager.scala#L251],
 this should be the case for any container with > 1 GB of memory (assuming 
executor.cores = 1) which is far too common. Thus in our case, most of the 
memory requested by {{BytesToBytesMap}} was un-utilized with just trailing 0s.

!screenshot-2.png|width=389,height=101!

I think this is a major inefficiency for broadcast joins (especially star 
joins). I think there are a few ways to tackle the problem.
1) Reduce {{spark.buffer.pageSize}} globally to a lower value. This does reduce 
the memory consumption of broadcast joins, but I am not sure what it implies 
for the rest of Spark machinery
2) Add a "finalize" operation to {{BytesToBytesMap}} which is called after all 
values are added to the map and allocates a new page only for the required 
bytes. 
3) Enhance the serialization of {{BytesToBytesMap}} to record the number of 
keys and values, and use those during deserialization to only request the 
required memory.
4) Use a lower page size for certain {{BytesToBytesMap}}s based on the 
estimated data size of broadcast joins.

I believe Option 3 would be simple enough to implement and I have a POC PR 
which I will post soon, but I am interested in knowing other people's thoughts 
here. 


> Broadcast Joins taking up too much memory
> -
>
> Key: SPARK-44379
> URL: https://issues.apache.org/jira/browse/SPARK-44379
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>  

[jira] [Commented] (SPARK-44379) Broadcast Joins taking up too much memory

2023-07-11 Thread Shardul Mahadik (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17742126#comment-17742126
 ] 

Shardul Mahadik commented on SPARK-44379:
-

cc: [~cloud_fan] [~joshrosen] [~mridul] Would be interested in knowing your 
thoughts here.

> Broadcast Joins taking up too much memory
> -
>
> Key: SPARK-44379
> URL: https://issues.apache.org/jira/browse/SPARK-44379
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.1
>Reporter: Shardul Mahadik
>Priority: Major
> Attachments: screenshot-1.png, screenshot-2.png
>
>
> Context: After migrating to Spark 3 with AQE, we saw a significant increase 
> in driver and executor memory usage in our jobs which contains star joins. By 
> analyzing heapdump, we saw that majority of the memory was being taken up by 
> {{UnsafeHashedRelation}} used for broadcast joins; in this case there were 92 
> broadcast joins in the query.
> !screenshot-1.png|width=851,height=70!
> This took up over 6GB of total memory, even though every table being 
> broadcasted was around ~1MB and hence should only have been ~100MB total. I 
> found that this is because {{BytesToBytesMap}} used within 
> {{UnsafeHashedRelation}} allocates memory in ["pageSize" 
> increments|https://github.com/apache/spark/blob/37aa62f629e652ed70505620473530cd9611018e/core/src/main/java/org/apache/spark/memory/MemoryConsumer.java#L117]
>  which in our case was 64MB. Based on the [default page size 
> calculation|https://github.com/apache/spark/blob/37aa62f629e652ed70505620473530cd9611018e/core/src/main/scala/org/apache/spark/memory/MemoryManager.scala#L251],
>  this should be the case for any container with > 1 GB of memory (assuming 
> executor.cores = 1) which is far too common. Thus in our case, most of the 
> memory requested by {{BytesToBytesMap}} was un-utilized with just trailing 0s.
> !screenshot-2.png|width=389,height=101!
> I think this is a major inefficiency for broadcast joins (especially star 
> joins). I think there are a few ways to tackle the problem.
> 1) Reduce {{spark.buffer.pageSize}} globally to a lower value. This does 
> reduce the memory consumption of broadcast joins, but I am not sure what it 
> implies for the rest of Spark machinery
> 2) Add a "finalize" operation to {{BytesToBytesMap}} which is called after 
> all values are added to the map and allocates a new page only for the 
> required bytes. 
> 3) Enhance the serialization of {{BytesToBytesMap}} to record the number of 
> keys and values, and use those during deserialization to only request the 
> required memory.
> 4) Use a lower page size for certain {{BytesToBytesMap}}s based on the 
> estimated data size of broadcast joins.
> I believe Option 3 would be simple enough to implement and I have a POC PR 
> which I will post soon, but I am interested in knowing other people's 
> thoughts here. 
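
For illustration, Option 1 above amounts to a one-line configuration change, as in the minimal sketch below; the "2m" value is only an example that would need tuning, and lowering the page size globally affects all consumers of task memory, not just broadcast joins.

{code:python}
from pyspark.sql import SparkSession

# Option 1 from the description: lower spark.buffer.pageSize globally so each
# BytesToBytesMap page is far smaller than the 64MB computed for large
# containers. "2m" is an illustrative value, not a recommendation.
spark = (
    SparkSession.builder
    .appName("broadcast-join-page-size-test")
    .config("spark.buffer.pageSize", "2m")
    .getOrCreate()
)
{code}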



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44379) Broadcast Joins taking up too much memory

2023-07-11 Thread Shardul Mahadik (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shardul Mahadik updated SPARK-44379:

Description: 
Context: After migrating to Spark 3 with AQE, we saw a significant increase in 
driver and executor memory usage in our jobs which contain star joins. By 
analyzing a heap dump, we saw that the majority of the memory was being taken up by 
{{UnsafeHashedRelation}} used for broadcast joins; in this case there were 92 
broadcast joins in the query.

!screenshot-1.png|width=851,height=70!

This took up over 6GB of total memory, even though every table being 
broadcasted was around ~1MB and hence should only have been ~100MB total. I 
found that this is because {{BytesToBytesMap}} used within 
{{UnsafeHashedRelation}} allocates memory in ["pageSize" 
increments|https://github.com/apache/spark/blob/37aa62f629e652ed70505620473530cd9611018e/core/src/main/java/org/apache/spark/memory/MemoryConsumer.java#L117]
 which in our case was 64MB. Based on the [default page size 
calculation|https://github.com/apache/spark/blob/37aa62f629e652ed70505620473530cd9611018e/core/src/main/scala/org/apache/spark/memory/MemoryManager.scala#L251],
 this should be the case for any container with > 1 GB of memory (assuming 
executor.cores = 1) which is far too common. Thus in our case, most of the 
memory requested by {{BytesToBytesMap}} was un-utilized with just trailing 0s.

!screenshot-2.png|width=389,height=101!

I think this is a major inefficiency for broadcast joins (especially star 
joins). I think there are a few ways to tackle the problem.
1) Reduce {{spark.buffer.pageSize}} globally to a lower value. This does reduce 
the memory consumption of broadcast joins, but I am not sure what it implies 
for the rest of the Spark machinery.
2) Add a "finalize" operation to {{BytesToBytesMap}} which is called after all 
values are added to the map and allocates a new page only for the required 
bytes. 
3) Enhance the serialization of {{BytesToBytesMap}} to record the number of 
keys and values, and use those during deserialization to only request the 
required memory.
4) Use a lower page size for certain {{BytesToBytesMap}}s based on the 
estimated data size of broadcast joins.

I believe Option 3 would be simple enough to implement and I have a POC PR 
which I will post soon, but I am interested in knowing other people's thoughts 
here. 

  was:
Context: After migrating to Spark 3 with AQE, we saw a significant increase in 
driver and executor memory usage in our jobs which contains star joins. By 
analyzing heapdump, we saw that majority of the memory was being taken up by 
{{UnsafeHashedRelation}} used for broadcast joins; in this case there were 92 
broadcast joins in the query.

!image-2023-07-11-10-41-02-251.png|width=851,height=70!

This took up over 6GB of total memory, even though every table being 
broadcasted was around ~1MB and hence should only have been ~100MB total. I 
found that this is because {{BytesToBytesMap}} used within 
{{UnsafeHashedRelation}} allocates memory in ["pageSize" 
increments|https://github.com/apache/spark/blob/37aa62f629e652ed70505620473530cd9611018e/core/src/main/java/org/apache/spark/memory/MemoryConsumer.java#L117]
 which in our case was 64MB. Based on the [default page size 
calculation|https://github.com/apache/spark/blob/37aa62f629e652ed70505620473530cd9611018e/core/src/main/scala/org/apache/spark/memory/MemoryManager.scala#L251],
 this should be the case for any container with > 1 GB of memory (assuming 
executor.cores = 1) which is far too common. Thus in our case, most of the 
memory requested by {{BytesToBytesMap}} was un-utilized with just trailing 0s.

!image-2023-07-11-10-52-59-553.png|width=389,height=101!

I think this is a major inefficiency for broadcast joins (especially star 
joins). I think there are a few ways to tackle the problem.
1) Reduce {{spark.buffer.pageSize}} globally to a lower value. This does reduce 
the memory consumption of broadcast joins, but I am not sure what it implies 
for the rest of Spark machinery
2) Add a "finalize" operation to {{BytesToBytesMap}} which is called after all 
values are added to the map and allocates a new page only for the required 
bytes. 
3) Enhance the serialization of {{BytesToBytesMap}} to record the number of 
keys and values, and use those during deserialization to only request the 
required memory.
4) Use a lower page size for certain {{BytesToBytesMap}}s based on the 
estimated data size of broadcast joins.

I believe Option 3 would be simple enough to implement and I have a POC PR 
which I will post soon, but I am interested in knowing other people's thoughts 
here. 


> Broadcast Joins taking up too much memory
> -
>
> Key: SPARK-44379
> URL: https://issues.apache.org/jira/browse/SPARK-44379
> Project: Spark
>  Issue Type: Improve

[jira] [Updated] (SPARK-44379) Broadcast Joins taking up too much memory

2023-07-11 Thread Shardul Mahadik (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shardul Mahadik updated SPARK-44379:

Attachment: screenshot-1.png

> Broadcast Joins taking up too much memory
> -
>
> Key: SPARK-44379
> URL: https://issues.apache.org/jira/browse/SPARK-44379
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.1
>Reporter: Shardul Mahadik
>Priority: Major
> Attachments: screenshot-1.png, screenshot-2.png
>
>
> Context: After migrating to Spark 3 with AQE, we saw a significant increase 
> in driver and executor memory usage in our jobs which contains star joins. By 
> analyzing heapdump, we saw that majority of the memory was being taken up by 
> {{UnsafeHashedRelation}} used for broadcast joins; in this case there were 92 
> broadcast joins in the query.
> !image-2023-07-11-10-41-02-251.png|width=851,height=70!
> This took up over 6GB of total memory, even though every table being 
> broadcasted was around ~1MB and hence should only have been ~100MB total. I 
> found that this is because {{BytesToBytesMap}} used within 
> {{UnsafeHashedRelation}} allocates memory in ["pageSize" 
> increments|https://github.com/apache/spark/blob/37aa62f629e652ed70505620473530cd9611018e/core/src/main/java/org/apache/spark/memory/MemoryConsumer.java#L117]
>  which in our case was 64MB. Based on the [default page size 
> calculation|https://github.com/apache/spark/blob/37aa62f629e652ed70505620473530cd9611018e/core/src/main/scala/org/apache/spark/memory/MemoryManager.scala#L251],
>  this should be the case for any container with > 1 GB of memory (assuming 
> executor.cores = 1) which is far too common. Thus in our case, most of the 
> memory requested by {{BytesToBytesMap}} was un-utilized with just trailing 0s.
> !image-2023-07-11-10-52-59-553.png|width=389,height=101!
> I think this is a major inefficiency for broadcast joins (especially star 
> joins). I think there are a few ways to tackle the problem.
> 1) Reduce {{spark.buffer.pageSize}} globally to a lower value. This does 
> reduce the memory consumption of broadcast joins, but I am not sure what it 
> implies for the rest of Spark machinery
> 2) Add a "finalize" operation to {{BytesToBytesMap}} which is called after 
> all values are added to the map and allocates a new page only for the 
> required bytes. 
> 3) Enhance the serialization of {{BytesToBytesMap}} to record the number of 
> keys and values, and use those during deserialization to only request the 
> required memory.
> 4) Use a lower page size for certain {{BytesToBytesMap}}s based on the 
> estimated data size of broadcast joins.
> I believe Option 3 would be simple enough to implement and I have a POC PR 
> which I will post soon, but I am interested in knowing other people's 
> thoughts here. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44379) Broadcast Joins taking up too much memory

2023-07-11 Thread Shardul Mahadik (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shardul Mahadik updated SPARK-44379:

Attachment: screenshot-2.png

> Broadcast Joins taking up too much memory
> -
>
> Key: SPARK-44379
> URL: https://issues.apache.org/jira/browse/SPARK-44379
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.1
>Reporter: Shardul Mahadik
>Priority: Major
> Attachments: screenshot-1.png, screenshot-2.png
>
>
> Context: After migrating to Spark 3 with AQE, we saw a significant increase 
> in driver and executor memory usage in our jobs which contains star joins. By 
> analyzing heapdump, we saw that majority of the memory was being taken up by 
> {{UnsafeHashedRelation}} used for broadcast joins; in this case there were 92 
> broadcast joins in the query.
> !image-2023-07-11-10-41-02-251.png|width=851,height=70!
> This took up over 6GB of total memory, even though every table being 
> broadcasted was around ~1MB and hence should only have been ~100MB total. I 
> found that this is because {{BytesToBytesMap}} used within 
> {{UnsafeHashedRelation}} allocates memory in ["pageSize" 
> increments|https://github.com/apache/spark/blob/37aa62f629e652ed70505620473530cd9611018e/core/src/main/java/org/apache/spark/memory/MemoryConsumer.java#L117]
>  which in our case was 64MB. Based on the [default page size 
> calculation|https://github.com/apache/spark/blob/37aa62f629e652ed70505620473530cd9611018e/core/src/main/scala/org/apache/spark/memory/MemoryManager.scala#L251],
>  this should be the case for any container with > 1 GB of memory (assuming 
> executor.cores = 1) which is far too common. Thus in our case, most of the 
> memory requested by {{BytesToBytesMap}} was un-utilized with just trailing 0s.
> !image-2023-07-11-10-52-59-553.png|width=389,height=101!
> I think this is a major inefficiency for broadcast joins (especially star 
> joins). I think there are a few ways to tackle the problem.
> 1) Reduce {{spark.buffer.pageSize}} globally to a lower value. This does 
> reduce the memory consumption of broadcast joins, but I am not sure what it 
> implies for the rest of Spark machinery
> 2) Add a "finalize" operation to {{BytesToBytesMap}} which is called after 
> all values are added to the map and allocates a new page only for the 
> required bytes. 
> 3) Enhance the serialization of {{BytesToBytesMap}} to record the number of 
> keys and values, and use those during deserialization to only request the 
> required memory.
> 4) Use a lower page size for certain {{BytesToBytesMap}}s based on the 
> estimated data size of broadcast joins.
> I believe Option 3 would be simple enough to implement and I have a POC PR 
> which I will post soon, but I am interested in knowing other people's 
> thoughts here. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44379) Broadcast Joins taking up too much memory

2023-07-11 Thread Shardul Mahadik (Jira)
Shardul Mahadik created SPARK-44379:
---

 Summary: Broadcast Joins taking up too much memory
 Key: SPARK-44379
 URL: https://issues.apache.org/jira/browse/SPARK-44379
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.1
Reporter: Shardul Mahadik


Context: After migrating to Spark 3 with AQE, we saw a significant increase in 
driver and executor memory usage in our jobs which contains star joins. By 
analyzing heapdump, we saw that majority of the memory was being taken up by 
{{UnsafeHashedRelation}} used for broadcast joins; in this case there were 92 
broadcast joins in the query.

!image-2023-07-11-10-41-02-251.png|width=851,height=70!

This took up over 6GB of total memory, even though every table being 
broadcasted was around ~1MB and hence should only have been ~100MB total. I 
found that this is because {{BytesToBytesMap}} used within 
{{UnsafeHashedRelation}} allocates memory in ["pageSize" 
increments|https://github.com/apache/spark/blob/37aa62f629e652ed70505620473530cd9611018e/core/src/main/java/org/apache/spark/memory/MemoryConsumer.java#L117]
 which in our case was 64MB. Based on the [default page size 
calculation|https://github.com/apache/spark/blob/37aa62f629e652ed70505620473530cd9611018e/core/src/main/scala/org/apache/spark/memory/MemoryManager.scala#L251],
 this should be the case for any container with > 1 GB of memory (assuming 
executor.cores = 1) which is far too common. Thus in our case, most of the 
memory requested by {{BytesToBytesMap}} was un-utilized with just trailing 0s.

!image-2023-07-11-10-52-59-553.png|width=389,height=101!

I think this is a major inefficiency for broadcast joins (especially star 
joins). I think there are a few ways to tackle the problem.
1) Reduce {{spark.buffer.pageSize}} globally to a lower value. This does reduce 
the memory consumption of broadcast joins, but I am not sure what it implies 
for the rest of Spark machinery
2) Add a "finalize" operation to {{BytesToBytesMap}} which is called after all 
values are added to the map and allocates a new page only for the required 
bytes. 
3) Enhance the serialization of {{BytesToBytesMap}} to record the number of 
keys and values, and use those during deserialization to only request the 
required memory.
4) Use a lower page size for certain {{BytesToBytesMap}}s based on the 
estimated data size of broadcast joins.

I believe Option 3 would be simple enough to implement and I have a POC PR 
which I will post soon, but I am interested in knowing other people's thoughts 
here. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44279) Upgrade word-wrap

2023-07-11 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17742120#comment-17742120
 ] 

Sean R. Owen commented on SPARK-44279:
--

Is this a library that's used in spark? I couldn't find it

> Upgrade word-wrap
> -
>
> Key: SPARK-44279
> URL: https://issues.apache.org/jira/browse/SPARK-44279
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.4.1, 3.5.0
>Reporter: Bjørn Jørgensen
>Priority: Major
>
> [Regular Expression Denial of Service (ReDoS) - 
> CVE-2023-26115|https://github.com/jonschlinkert/word-wrap/issues/32]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44304) Broadcast operation is not required when no parameters are specified

2023-07-11 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-44304.
--
Resolution: Duplicate

> Broadcast operation is not required when no parameters are specified
> 
>
> Key: SPARK-44304
> URL: https://issues.apache.org/jira/browse/SPARK-44304
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: 7mming7
>Priority: Minor
>
> With the ability introduced by SPARK-14912, we can broadcast the data source parameters 
> to the read and write operations. However, even when the user does not specify any 
> parameter, the broadcast is still performed, which has a significant performance impact. 
> We should therefore avoid broadcasting the full Hadoop configuration when the user does 
> not specify any specific parameter.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44377) exclude junit5 deps from jersey-test-framework-provider-simple

2023-07-11 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17742105#comment-17742105
 ] 

Sean R. Owen commented on SPARK-44377:
--

Sure can you open a PR?

> exclude junit5 deps from jersey-test-framework-provider-simple
> --
>
> Key: SPARK-44377
> URL: https://issues.apache.org/jira/browse/SPARK-44377
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Priority: Major
>
> SPARK-44316 upgrade Jersey from 2.36 to 2.40. Jersey 2.38 start to use 
> [Junit5 instead of Junit4|https://github.com/eclipse-ee4j/jersey/pull/5123], 
> Spark core module uses 
> `org.glassfish.jersey.test-framework.providers:jersey-test-framework-provider-simple:2.40`,
>  which cascades and introduces the dependencies of Junit5, this causes Java 
> tests no longer be executed when performing maven tests on the core module.
> run `mvn clean install -pl core -am`
>  
> {code:java}
> [INFO] --- maven-surefire-plugin:3.1.2:test (default-test) @ spark-core_2.12 
> ---
> [INFO] Using auto detected provider 
> org.apache.maven.surefire.junitplatform.JUnitPlatformProvider
> [INFO] 
> [INFO] ---
> [INFO]  T E S T S
> [INFO] ---
> [INFO] 
> [INFO] Results:
> [INFO] 
> [INFO] Tests run: 0, Failures: 0, Errors: 0, Skipped: 0
> [INFO] 
> [INFO] 
> [INFO] --- maven-surefire-plugin:3.1.2:test (test) @ spark-core_2.12 ---
> [INFO] Skipping execution of surefire because it has already been run for 
> this configuration{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44376) Build using maven is broken using 2.13 and Java 11 and Java 17

2023-07-11 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17742104#comment-17742104
 ] 

Sean R. Owen commented on SPARK-44376:
--

Did you run dev/change-scala-version.sh 2.13 ?

> Build using maven is broken using 2.13 and Java 11 and Java 17
> --
>
> Key: SPARK-44376
> URL: https://issues.apache.org/jira/browse/SPARK-44376
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Emil Ejbyfeldt
>Priority: Major
>
> Fails with
> ```
> $ ./build/mvn compile -Pscala-2.13 -Djava.version=11 -X
> ...
> [WARNING] [Warn] : [deprecation @  | origin= | version=] -target is 
> deprecated: Use -release instead to compile against the correct platform API.
> [ERROR] [Error] : target platform version 8 is older than the release version 
> 11
> [WARNING] one warning found
> [ERROR] one error found
> ...
> ```
> if setting the `java.version` property or
> ```
> $ ./build/mvn compile -Pscala-2.13
> ...
> [WARNING] [Warn] : [deprecation @  | origin= | version=] -target is 
> deprecated: Use -release instead to compile against the correct platform API.
> [ERROR] [Error] 
> /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/serializer/SerializationDebugger.scala:71:
>  not found: value sun
> [ERROR] [Error] 
> /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:26:
>  not found: object sun
> [ERROR] [Error] 
> /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:27:
>  not found: object sun
> [ERROR] [Error] 
> /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:206:
>  not found: type DirectBuffer
> [ERROR] [Error] 
> /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:210:
>  not found: type Unsafe
> [ERROR] [Error] 
> /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:212:
>  not found: type Unsafe
> [ERROR] [Error] 
> /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:213:
>  not found: type DirectBuffer
> [ERROR] [Error] 
> /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:216:
>  not found: type DirectBuffer
> [ERROR] [Error] 
> /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:236:
>  not found: type DirectBuffer
> [ERROR] [Error] 
> /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:26:
>  Unused import
> [ERROR] [Error] 
> /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:27:
>  Unused import
> [ERROR] [Error] 
> /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/util/ClosureCleaner.scala:452:
>  not found: value sun
> [ERROR] [Error] 
> /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:26:
>  not found: object sun
> [ERROR] [Error] 
> /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:99:
>  not found: type SignalHandler
> [ERROR] [Error] 
> /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:99:
>  not found: type Signal
> [ERROR] [Error] 
> /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:83:
>  not found: type Signal
> [ERROR] [Error] 
> /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:108:
>  not found: type SignalHandler
> [ERROR] [Error] 
> /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:108:
>  not found: value Signal
> [ERROR] [Error] 
> /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:114:
>  not found: type Signal
> [ERROR] [Error] 
> /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:116:
>  not found: value Signal
> [ERROR] [Error] 
> /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:128:
>  not found: value Signal
> [ERROR] [Error] 
> /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:26:
>  Unused import
> [ERROR] [Error] 
> /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:26:
>  Unused import
> [WARNING] one warning found
> [ERROR] 23 errors found
> ...
> ```



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-un

[jira] [Updated] (SPARK-44378) Jobs that have join & have .rdd calls get executed 2x when AQE is enabled.

2023-07-11 Thread Priyanka Raju (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Priyanka Raju updated SPARK-44378:
--
Attachment: image2.png

> Jobs that have join & have .rdd calls get executed 2x when AQE is enabled.
> --
>
> Key: SPARK-44378
> URL: https://issues.apache.org/jira/browse/SPARK-44378
> Project: Spark
>  Issue Type: Question
>  Components: Spark Submit
>Affects Versions: 3.1.2
>Reporter: Priyanka Raju
>Priority: Major
>  Labels: aqe
> Attachments: Screenshot 2023-07-11 at 9.36.14 AM.png, Screenshot 
> 2023-07-11 at 9.36.19 AM.png, image2.png
>
>
> We have a few spark scala jobs that are currently running in production. Most 
> jobs typically use Dataset, Dataframes. There is a small code in our custom 
> library code, that makes rdd calls example to check if the dataframe is 
> empty: df.rdd.getNumPartitions == 0
> When I enable aqe for these jobs, this .rdd is converted into a separate job 
> of it's own and the entire dag is executed 2x, taking 2x more time. This does 
> not happen when AQE is disabled. Why does this happen and what is the best 
> way to fix the issue?
>  
> Sample code to reproduce the issue:
>  
>  
> {code:java}
> import org.apache.spark.sql._ 
>   case class Record(
> id: Int,
> name: String
>  )
>  
> val partCount = 4
> val input1 = (0 until 100).map(part => Record(part, "a"))
>  
> val input2 = (100 until 110).map(part => Record(part, "c"))
>  
> implicit val enc: Encoder[Record] = Encoders.product[Record]
>  
> val ds1 = spark.createDataset(
>   spark.sparkContext
> .parallelize(input1, partCount)
> )
>  
> val ds2 = spark.createDataset(
>   spark.sparkContext
> .parallelize(input2, partCount)
> )
>  
> val ds3 = ds1.join(ds2, Seq("id"))
> val l = ds3.count()
>  
> val incomingPartitions = ds3.rdd.getNumPartitions
> log.info(s"Num partitions ${incomingPartitions}")
>   {code}
>  
> Spark UI for the same job with AQE,  !Screenshot 2023-07-11 at 9.36.14 AM.png!
>  
> Spark UI for the same job without AQE:
>  
> !Screenshot 2023-07-11 at 9.36.19 AM.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44378) Jobs that have join & have .rdd calls get executed 2x when AQE is enabled.

2023-07-11 Thread Priyanka Raju (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Priyanka Raju updated SPARK-44378:
--
Description: 
We have a few Spark Scala jobs currently running in production. Most jobs typically use Datasets and DataFrames. There is a small piece of code in our custom library that makes RDD calls, for example to check whether the dataframe is empty: df.rdd.getNumPartitions == 0

When I enable AQE for these jobs, the .rdd call becomes a separate job of its own and the entire DAG is executed twice, taking 2x more time. This does not happen when AQE is disabled. Why does this happen, and what is the best way to fix the issue?

 

Sample code to reproduce the issue:

 

 
{code:java}
import org.apache.spark.sql._ 
  case class Record(
id: Int,
name: String
 )
 
val partCount = 4
val input1 = (0 until 100).map(part => Record(part, "a"))
 
val input2 = (100 until 110).map(part => Record(part, "c"))
 
implicit val enc: Encoder[Record] = Encoders.product[Record]
 
val ds1 = spark.createDataset(
  spark.sparkContext
.parallelize(input1, partCount)
)
 
val ds2 = spark.createDataset(
  spark.sparkContext
.parallelize(input2, partCount)
)
 
val ds3 = ds1.join(ds2, Seq("id"))
val l = ds3.count()
 
val incomingPartitions = ds3.rdd.getNumPartitions
log.info(s"Num partitions ${incomingPartitions}")
  {code}
 

Spark UI for the same job with AQE,  !Screenshot 2023-07-11 at 9.36.14 AM.png!

 

Spark UI for the same job without AQE:

 

!Screenshot 2023-07-11 at 9.36.19 AM.png!

 

This is causing an unexpected regression when we try to enable AQE for our jobs in production. We use Spark 3.1 in production, but I can see the same behavior in Spark 3.2 from the console as well.

 

!image2.png!
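
For what it is worth, here is a minimal sketch of an emptiness check that stays in the Dataset API (assuming the only goal of the {{.rdd}} call is to test for rows; whether this preserves the original intent depends on why {{getNumPartitions}} was used):

{code:java}
import org.apache.spark.sql.Dataset

object EmptinessCheck {
  // head(1) scans at most one row and reuses the already-planned query, so AQE
  // does not re-execute the whole join just for the check.
  def hasNoRows[T](ds: Dataset[T]): Boolean = ds.head(1).isEmpty
}
{code}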

  was:
We have a few spark scala jobs that are currently running in production. Most 
jobs typically use Dataset, Dataframes. There is a small code in our custom 
library code, that makes rdd calls example to check if the dataframe is empty: 
df.rdd.getNumPartitions == 0

When I enable aqe for these jobs, this .rdd is converted into a separate job of 
it's own and the entire dag is executed 2x, taking 2x more time. This does not 
happen when AQE is disabled. Why does this happen and what is the best way to 
fix the issue?

 

Sample code to reproduce the issue:

 

 
{code:java}
import org.apache.spark.sql._ 
  case class Record(
id: Int,
name: String
 )
 
val partCount = 4
val input1 = (0 until 100).map(part => Record(part, "a"))
 
val input2 = (100 until 110).map(part => Record(part, "c"))
 
implicit val enc: Encoder[Record] = Encoders.product[Record]
 
val ds1 = spark.createDataset(
  spark.sparkContext
.parallelize(input1, partCount)
)
 
val ds2 = spark.createDataset(
  spark.sparkContext
.parallelize(input2, partCount)
)
 
val ds3 = ds1.join(ds2, Seq("id"))
val l = ds3.count()
 
val incomingPartitions = ds3.rdd.getNumPartitions
log.info(s"Num partitions ${incomingPartitions}")
  {code}
 

Spark UI for the same job with AQE,  !Screenshot 2023-07-11 at 9.36.14 AM.png!

 

Spark UI for the same job without AQE:

 

!Screenshot 2023-07-11 at 9.36.19 AM.png!


> Jobs that have join & have .rdd calls get executed 2x when AQE is enabled.
> --
>
> Key: SPARK-44378
> URL: https://issues.apache.org/jira/browse/SPARK-44378
> Project: Spark
>  Issue Type: Question
>  Components: Spark Submit
>Affects Versions: 3.1.2
>Reporter: Priyanka Raju
>Priority: Major
>  Labels: aqe
> Attachments: Screenshot 2023-07-11 at 9.36.14 AM.png, Screenshot 
> 2023-07-11 at 9.36.19 AM.png, image2.png
>
>
> We have a few spark scala jobs that are currently running in production. Most 
> jobs typically use Dataset, Dataframes. There is a small code in our custom 
> library code, that makes rdd calls example to check if the dataframe is 
> empty: df.rdd.getNumPartitions == 0
> When I enable aqe for these jobs, this .rdd is converted into a separate job 
> of it's own and the entire dag is executed 2x, taking 2x more time. This does 
> not happen when AQE is disabled. Why does this happen and what is the best 
> way to fix the issue?
>  
> Sample code to reproduce the issue:
>  
>  
> {code:java}
> import org.apache.spark.sql._ 
>   case class Record(
> id: Int,
> name: String
>  )
>  
> val partCount = 4
> val input1 = (0 until 100).map(part => Record(part, "a"))
>  
> val input2 = (100 until 110).map(part => Record(part, "c"))
>  
> implicit val enc: Encoder[Record] = Encoders.product[Record]
>  
> val ds1 = spark.createDataset(
>   spark.sparkContext
> .parallelize(inp

[jira] [Updated] (SPARK-44362) Use PartitionEvaluator API in AggregateInPandasExec,EvalPythonExec,AttachDistributedSequenceExec

2023-07-11 Thread Vinod KC (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod KC updated SPARK-44362:
-
Summary: Use  PartitionEvaluator API in 
AggregateInPandasExec,EvalPythonExec,AttachDistributedSequenceExec  (was: Use  
PartitionEvaluator API in AggregateInPandasExec, 
WindowInPandasExec,EvalPythonExec,AttachDistributedSequenceExec)

> Use  PartitionEvaluator API in 
> AggregateInPandasExec,EvalPythonExec,AttachDistributedSequenceExec
> -
>
> Key: SPARK-44362
> URL: https://issues.apache.org/jira/browse/SPARK-44362
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Vinod KC
>Priority: Major
>
> Use  PartitionEvaluator API in
> AggregateInPandasExec
> EvalPythonExec
> AttachDistributedSequenceExec



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44362) Use PartitionEvaluator API in AggregateInPandasExec, WindowInPandasExec,EvalPythonExec,AttachDistributedSequenceExec

2023-07-11 Thread Vinod KC (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod KC updated SPARK-44362:
-
Description: 
Use the PartitionEvaluator API in the following operators (see the rough sketch after this list):

AggregateInPandasExec

EvalPythonExec

AttachDistributedSequenceExec
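
A rough sketch of the pattern these operators would move to (the trait shapes below only mirror my understanding of the PartitionEvaluator API from SPARK-43061 and are redefined locally for illustration; the real interfaces in {{org.apache.spark}} are authoritative):

{code:java}
// Locally redefined stand-ins for the PartitionEvaluator API (assumption: the
// real traits have this general shape). The operator ships a serializable
// factory; each task builds an evaluator and runs it over its partition.
trait PartitionEvaluator[T, U] {
  def eval(partitionIndex: Int, inputs: Iterator[T]*): Iterator[U]
}

trait PartitionEvaluatorFactory[T, U] extends Serializable {
  def createEvaluator(): PartitionEvaluator[T, U]
}

// Hypothetical evaluator that tags each row with its partition index, standing
// in for the per-partition work done by e.g. EvalPythonExec.
class TagWithPartitionEvaluator extends PartitionEvaluator[String, (Int, String)] {
  override def eval(partitionIndex: Int, inputs: Iterator[String]*): Iterator[(Int, String)] =
    inputs.head.map(row => (partitionIndex, row))
}

class TagWithPartitionFactory extends PartitionEvaluatorFactory[String, (Int, String)] {
  override def createEvaluator(): PartitionEvaluator[String, (Int, String)] =
    new TagWithPartitionEvaluator
}
{code}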

  was:
Use  PartitionEvaluator API in

AggregateInPandasExec

WindowInPandasExec

EvalPythonExec

AttachDistributedSequenceExec


> Use  PartitionEvaluator API in AggregateInPandasExec, 
> WindowInPandasExec,EvalPythonExec,AttachDistributedSequenceExec
> -
>
> Key: SPARK-44362
> URL: https://issues.apache.org/jira/browse/SPARK-44362
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Vinod KC
>Priority: Major
>
> Use  PartitionEvaluator API in
> AggregateInPandasExec
> EvalPythonExec
> AttachDistributedSequenceExec



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44362) Use PartitionEvaluator API in AggregateInPandasExec, WindowInPandasExec,EvalPythonExec,AttachDistributedSequenceExec

2023-07-11 Thread Vinod KC (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17742099#comment-17742099
 ] 

Vinod KC commented on SPARK-44362:
--

yes, please go ahead

> Use  PartitionEvaluator API in AggregateInPandasExec, 
> WindowInPandasExec,EvalPythonExec,AttachDistributedSequenceExec
> -
>
> Key: SPARK-44362
> URL: https://issues.apache.org/jira/browse/SPARK-44362
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Vinod KC
>Priority: Major
>
> Use  PartitionEvaluator API in
> AggregateInPandasExec
> WindowInPandasExec
> EvalPythonExec
> AttachDistributedSequenceExec



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44378) Jobs that have join & have .rdd calls get executed 2x when AQE is enabled.

2023-07-11 Thread Priyanka Raju (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Priyanka Raju updated SPARK-44378:
--
Description: 
We have a few Spark Scala jobs currently running in production. Most jobs typically use Datasets and DataFrames. There is a small piece of code in our custom library that makes RDD calls, for example to check whether the dataframe is empty: df.rdd.getNumPartitions == 0

When I enable AQE for these jobs, the .rdd call becomes a separate job of its own and the entire DAG is executed twice, taking 2x more time. This does not happen when AQE is disabled. Why does this happen, and what is the best way to fix the issue?

 

Sample code to reproduce the issue:

 

 
{code:java}
import org.apache.spark.sql._ 
  case class Record(
id: Int,
name: String
 )
 
val partCount = 4
val input1 = (0 until 100).map(part => Record(part, "a"))
 
val input2 = (100 until 110).map(part => Record(part, "c"))
 
implicit val enc: Encoder[Record] = Encoders.product[Record]
 
val ds1 = spark.createDataset(
  spark.sparkContext
.parallelize(input1, partCount)
)
 
val ds2 = spark.createDataset(
  spark.sparkContext
.parallelize(input2, partCount)
)
 
val ds3 = ds1.join(ds2, Seq("id"))
val l = ds3.count()
 
val incomingPartitions = ds3.rdd.getNumPartitions
log.info(s"Num partitions ${incomingPartitions}")
  {code}
 

Spark UI for the same job with AQE,  !Screenshot 2023-07-11 at 9.36.14 AM.png!

 

Spark UI for the same job without AQE:

 

!Screenshot 2023-07-11 at 9.36.19 AM.png!

  was:
We have a few spark scala jobs that are currently running in production. Most 
jobs typically use Dataset, Dataframes. There is a small code in our custom 
library code, that makes rdd calls example to check if the dataframe is empty: 
df.rdd.getNumPartitions == 0

When I enable aqe for these jobs, this .rdd is converted into a separate job of 
it's own and the entire dag is executed 2x, taking 2x more time. This does not 
happen when AQE is disabled. Why does this happen and what is the best way to 
fix the issue?

 

Sample code to reproduce the issue:

 

 
{code:java}
import org.apache.spark.sql._ 
  case class Record(
id: Int,
name: String
 )
 
val partCount = 4
val input1 = (0 until 100).map(part => Record(part, "a"))
 
val input2 = (100 until 110).map(part => Record(part, "c"))
 
implicit val enc: Encoder[Record] = Encoders.product[Record]
 
val ds1 = spark.createDataset(
  spark.sparkContext
.parallelize(input1, partCount)
)
 
val ds2 = spark.createDataset(
  spark.sparkContext
.parallelize(input2, partCount)
)
 
val ds3 = ds1.join(ds2, Seq("id"))
val l = ds3.count()
 
val incomingPartitions = ds3.rdd.getNumPartitions
log.info(s"Num partitions ${incomingPartitions}")
  {code}
 

Spark UI for the same job with AQE,  !Screenshot 2023-07-11 at 9.36.14 AM.png!

 

 


> Jobs that have join & have .rdd calls get executed 2x when AQE is enabled.
> --
>
> Key: SPARK-44378
> URL: https://issues.apache.org/jira/browse/SPARK-44378
> Project: Spark
>  Issue Type: Question
>  Components: Spark Submit
>Affects Versions: 3.1.2
>Reporter: Priyanka Raju
>Priority: Major
>  Labels: aqe
> Attachments: Screenshot 2023-07-11 at 9.36.14 AM.png, Screenshot 
> 2023-07-11 at 9.36.19 AM.png
>
>
> We have a few spark scala jobs that are currently running in production. Most 
> jobs typically use Dataset, Dataframes. There is a small code in our custom 
> library code, that makes rdd calls example to check if the dataframe is 
> empty: df.rdd.getNumPartitions == 0
> When I enable aqe for these jobs, this .rdd is converted into a separate job 
> of it's own and the entire dag is executed 2x, taking 2x more time. This does 
> not happen when AQE is disabled. Why does this happen and what is the best 
> way to fix the issue?
>  
> Sample code to reproduce the issue:
>  
>  
> {code:java}
> import org.apache.spark.sql._ 
>   case class Record(
> id: Int,
> name: String
>  )
>  
> val partCount = 4
> val input1 = (0 until 100).map(part => Record(part, "a"))
>  
> val input2 = (100 until 110).map(part => Record(part, "c"))
>  
> implicit val enc: Encoder[Record] = Encoders.product[Record]
>  
> val ds1 = spark.createDataset(
>   spark.sparkContext
> .parallelize(input1, partCount)
> )
>  
> val ds2 = spark.createDataset(
>   spark.sparkContext
> .parallelize(input2, partCount)
> )
>  
> val ds3 = ds1.join(ds2, Seq("id"))
> val l = ds3.count()
>  
> val incomingPartitions = ds3.rdd.getNumPartitions
> log.info(s"Num partitions ${incomingPartit

[jira] [Updated] (SPARK-44378) Jobs that have join & have .rdd calls get executed 2x when AQE is enabled.

2023-07-11 Thread Priyanka Raju (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Priyanka Raju updated SPARK-44378:
--
Attachment: Screenshot 2023-07-11 at 9.36.19 AM.png

> Jobs that have join & have .rdd calls get executed 2x when AQE is enabled.
> --
>
> Key: SPARK-44378
> URL: https://issues.apache.org/jira/browse/SPARK-44378
> Project: Spark
>  Issue Type: Question
>  Components: Spark Submit
>Affects Versions: 3.1.2
>Reporter: Priyanka Raju
>Priority: Major
>  Labels: aqe
> Attachments: Screenshot 2023-07-11 at 9.36.14 AM.png, Screenshot 
> 2023-07-11 at 9.36.19 AM.png
>
>
> We have a few spark scala jobs that are currently running in production. Most 
> jobs typically use Dataset, Dataframes. There is a small code in our custom 
> library code, that makes rdd calls example to check if the dataframe is 
> empty: df.rdd.getNumPartitions == 0
> When I enable aqe for these jobs, this .rdd is converted into a separate job 
> of it's own and the entire dag is executed 2x, taking 2x more time. This does 
> not happen when AQE is disabled. Why does this happen and what is the best 
> way to fix the issue?
>  
> Sample code to reproduce the issue:
>  
>  
> {code:java}
> import org.apache.spark.sql._ 
>   case class Record(
> id: Int,
> name: String
>  )
>  
> val partCount = 4
> val input1 = (0 until 100).map(part => Record(part, "a"))
>  
> val input2 = (100 until 110).map(part => Record(part, "c"))
>  
> implicit val enc: Encoder[Record] = Encoders.product[Record]
>  
> val ds1 = spark.createDataset(
>   spark.sparkContext
> .parallelize(input1, partCount)
> )
>  
> val ds2 = spark.createDataset(
>   spark.sparkContext
> .parallelize(input2, partCount)
> )
>  
> val ds3 = ds1.join(ds2, Seq("id"))
> val l = ds3.count()
>  
> val incomingPartitions = ds3.rdd.getNumPartitions
> log.info(s"Num partitions ${incomingPartitions}")
>   {code}
>  
> Spark UI for the same job with AQE,  !Screenshot 2023-07-11 at 9.36.14 AM.png!
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44378) Jobs that have join & have .rdd calls get executed 2x when AQE is enabled.

2023-07-11 Thread Priyanka Raju (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Priyanka Raju updated SPARK-44378:
--
Attachment: Screenshot 2023-07-11 at 9.36.14 AM.png

> Jobs that have join & have .rdd calls get executed 2x when AQE is enabled.
> --
>
> Key: SPARK-44378
> URL: https://issues.apache.org/jira/browse/SPARK-44378
> Project: Spark
>  Issue Type: Question
>  Components: Spark Submit
>Affects Versions: 3.1.2
>Reporter: Priyanka Raju
>Priority: Major
>  Labels: aqe
> Attachments: Screenshot 2023-07-11 at 9.36.14 AM.png
>
>
> We have a few spark scala jobs that are currently running in production. Most 
> jobs typically use Dataset, Dataframes. There is a small code in our custom 
> library code, that makes rdd calls example to check if the dataframe is 
> empty: df.rdd.getNumPartitions == 0
> When I enable aqe for these jobs, this .rdd is converted into a separate job 
> of it's own and the entire dag is executed 2x, taking 2x more time. This does 
> not happen when AQE is disabled. Why does this happen and what is the best 
> way to fix the issue?
>  
> Sample code to reproduce the issue:
>  
>  
> {code:java}
> import org.apache.spark.sql._ 
>   case class Record(
> id: Int,
> name: String
>  )
>  
> val partCount = 4
> val input1 = (0 until 100).map(part => Record(part, "a"))
>  
> val input2 = (100 until 110).map(part => Record(part, "c"))
>  
> implicit val enc: Encoder[Record] = Encoders.product[Record]
>  
> val ds1 = spark.createDataset(
>   spark.sparkContext
> .parallelize(input1, partCount)
> )
>  
> val ds2 = spark.createDataset(
>   spark.sparkContext
> .parallelize(input2, partCount)
> )
>  
> val ds3 = ds1.join(ds2, Seq("id"))
> val l = ds3.count()
>  
> val incomingPartitions = ds3.rdd.getNumPartitions
> log.info(s"Num partitions ${incomingPartitions}")
>   {code}
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44378) Jobs that have join & have .rdd calls get executed 2x when AQE is enabled.

2023-07-11 Thread Priyanka Raju (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Priyanka Raju updated SPARK-44378:
--
Description: 
We have a few Spark Scala jobs currently running in production. Most jobs typically use Datasets and DataFrames. There is a small piece of code in our custom library that makes RDD calls, for example to check whether the dataframe is empty: df.rdd.getNumPartitions == 0

When I enable AQE for these jobs, the .rdd call becomes a separate job of its own and the entire DAG is executed twice, taking 2x more time. This does not happen when AQE is disabled. Why does this happen, and what is the best way to fix the issue?

 

Sample code to reproduce the issue:

 

 
{code:java}
import org.apache.spark.sql._ 
  case class Record(
id: Int,
name: String
 )
 
val partCount = 4
val input1 = (0 until 100).map(part => Record(part, "a"))
 
val input2 = (100 until 110).map(part => Record(part, "c"))
 
implicit val enc: Encoder[Record] = Encoders.product[Record]
 
val ds1 = spark.createDataset(
  spark.sparkContext
.parallelize(input1, partCount)
)
 
val ds2 = spark.createDataset(
  spark.sparkContext
.parallelize(input2, partCount)
)
 
val ds3 = ds1.join(ds2, Seq("id"))
val l = ds3.count()
 
val incomingPartitions = ds3.rdd.getNumPartitions
log.info(s"Num partitions ${incomingPartitions}")
  {code}
 

Spark UI for the same job with AQE,  !Screenshot 2023-07-11 at 9.36.14 AM.png!

 

 

  was:
We have a few spark scala jobs that are currently running in production. Most 
jobs typically use Dataset, Dataframes. There is a small code in our custom 
library code, that makes rdd calls example to check if the dataframe is empty: 
df.rdd.getNumPartitions == 0

When I enable aqe for these jobs, this .rdd is converted into a separate job of 
it's own and the entire dag is executed 2x, taking 2x more time. This does not 
happen when AQE is disabled. Why does this happen and what is the best way to 
fix the issue?

 

Sample code to reproduce the issue:

 

 
{code:java}
import org.apache.spark.sql._ 
  case class Record(
id: Int,
name: String
 )
 
val partCount = 4
val input1 = (0 until 100).map(part => Record(part, "a"))
 
val input2 = (100 until 110).map(part => Record(part, "c"))
 
implicit val enc: Encoder[Record] = Encoders.product[Record]
 
val ds1 = spark.createDataset(
  spark.sparkContext
.parallelize(input1, partCount)
)
 
val ds2 = spark.createDataset(
  spark.sparkContext
.parallelize(input2, partCount)
)
 
val ds3 = ds1.join(ds2, Seq("id"))
val l = ds3.count()
 
val incomingPartitions = ds3.rdd.getNumPartitions
log.info(s"Num partitions ${incomingPartitions}")
  {code}
 

 

 


> Jobs that have join & have .rdd calls get executed 2x when AQE is enabled.
> --
>
> Key: SPARK-44378
> URL: https://issues.apache.org/jira/browse/SPARK-44378
> Project: Spark
>  Issue Type: Question
>  Components: Spark Submit
>Affects Versions: 3.1.2
>Reporter: Priyanka Raju
>Priority: Major
>  Labels: aqe
> Attachments: Screenshot 2023-07-11 at 9.36.14 AM.png
>
>
> We have a few spark scala jobs that are currently running in production. Most 
> jobs typically use Dataset, Dataframes. There is a small code in our custom 
> library code, that makes rdd calls example to check if the dataframe is 
> empty: df.rdd.getNumPartitions == 0
> When I enable aqe for these jobs, this .rdd is converted into a separate job 
> of it's own and the entire dag is executed 2x, taking 2x more time. This does 
> not happen when AQE is disabled. Why does this happen and what is the best 
> way to fix the issue?
>  
> Sample code to reproduce the issue:
>  
>  
> {code:java}
> import org.apache.spark.sql._ 
>   case class Record(
> id: Int,
> name: String
>  )
>  
> val partCount = 4
> val input1 = (0 until 100).map(part => Record(part, "a"))
>  
> val input2 = (100 until 110).map(part => Record(part, "c"))
>  
> implicit val enc: Encoder[Record] = Encoders.product[Record]
>  
> val ds1 = spark.createDataset(
>   spark.sparkContext
> .parallelize(input1, partCount)
> )
>  
> val ds2 = spark.createDataset(
>   spark.sparkContext
> .parallelize(input2, partCount)
> )
>  
> val ds3 = ds1.join(ds2, Seq("id"))
> val l = ds3.count()
>  
> val incomingPartitions = ds3.rdd.getNumPartitions
> log.info(s"Num partitions ${incomingPartitions}")
>   {code}
>  
> Spark UI for the same job with AQE,  !Screenshot 2023-07-11 at 9.36.14 AM.png!
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-

[jira] [Created] (SPARK-44378) Jobs that have join & have .rdd calls get executed 2x when AQE is enabled.

2023-07-11 Thread Priyanka Raju (Jira)
Priyanka Raju created SPARK-44378:
-

 Summary: Jobs that have join & have .rdd calls get executed 2x 
when AQE is enabled.
 Key: SPARK-44378
 URL: https://issues.apache.org/jira/browse/SPARK-44378
 Project: Spark
  Issue Type: Question
  Components: Spark Submit
Affects Versions: 3.1.2
Reporter: Priyanka Raju


We have a few Spark Scala jobs currently running in production. Most jobs typically use Datasets and DataFrames. There is a small piece of code in our custom library that makes RDD calls, for example to check whether the dataframe is empty: df.rdd.getNumPartitions == 0

When I enable AQE for these jobs, the .rdd call becomes a separate job of its own and the entire DAG is executed twice, taking 2x more time. This does not happen when AQE is disabled. Why does this happen, and what is the best way to fix the issue?

 

Sample code to reproduce the issue:

 

 
{code:java}
import org.apache.spark.sql._ 
  case class Record(
id: Int,
name: String
 )
 
val partCount = 4
val input1 = (0 until 100).map(part => Record(part, "a"))
 
val input2 = (100 until 110).map(part => Record(part, "c"))
 
implicit val enc: Encoder[Record] = Encoders.product[Record]
 
val ds1 = spark.createDataset(
  spark.sparkContext
.parallelize(input1, partCount)
)
 
val ds2 = spark.createDataset(
  spark.sparkContext
.parallelize(input2, partCount)
)
 
val ds3 = ds1.join(ds2, Seq("id"))
val l = ds3.count()
 
val incomingPartitions = ds3.rdd.getNumPartitions
log.info(s"Num partitions ${incomingPartitions}")
  {code}
 

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44360) Support schema pruning in delta-based MERGE operations

2023-07-11 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-44360.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41930
[https://github.com/apache/spark/pull/41930]

> Support schema pruning in delta-based MERGE operations
> --
>
> Key: SPARK-44360
> URL: https://issues.apache.org/jira/browse/SPARK-44360
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Anton Okolnychyi
>Assignee: Anton Okolnychyi
>Priority: Major
> Fix For: 3.5.0
>
>
> We need to support schema pruning in delta-based MERGE operations.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44360) Support schema pruning in delta-based MERGE operations

2023-07-11 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-44360:
-

Assignee: Anton Okolnychyi

> Support schema pruning in delta-based MERGE operations
> --
>
> Key: SPARK-44360
> URL: https://issues.apache.org/jira/browse/SPARK-44360
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Anton Okolnychyi
>Assignee: Anton Okolnychyi
>Priority: Major
>
> We need to support schema pruning in delta-based MERGE operations.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44377) exclude junit5 deps from jersey-test-framework-provider-simple

2023-07-11 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-44377:
-
Description: 
SPARK-44316 upgraded Jersey from 2.36 to 2.40. Jersey 2.38 started to use [Junit5 instead of Junit4|https://github.com/eclipse-ee4j/jersey/pull/5123], and the Spark core module uses `org.glassfish.jersey.test-framework.providers:jersey-test-framework-provider-simple:2.40`, which transitively pulls in the Junit5 dependencies. This causes Java tests to no longer be executed when running maven tests on the core module.

run `mvn clean install -pl core -am`

 
{code:java}
[INFO] --- maven-surefire-plugin:3.1.2:test (default-test) @ spark-core_2.12 ---
[INFO] Using auto detected provider 
org.apache.maven.surefire.junitplatform.JUnitPlatformProvider
[INFO] 
[INFO] ---
[INFO]  T E S T S
[INFO] ---
[INFO] 
[INFO] Results:
[INFO] 
[INFO] Tests run: 0, Failures: 0, Errors: 0, Skipped: 0
[INFO] 
[INFO] 
[INFO] --- maven-surefire-plugin:3.1.2:test (test) @ spark-core_2.12 ---
[INFO] Skipping execution of surefire because it has already been run for this 
configuration{code}

  was:
SPARK-44316 upgrade Jersey from 2.36 to 2.40. Jersey 2.38 start to use Junit5 
instead of Junit4, Spark core module uses 
`org.glassfish.jersey.test-framework.providers:jersey-test-framework-provider-simple:2.40`,
 which cascades and introduces the dependencies of Junit5, this causes Java 
tests no longer be executed when performing maven tests on the core module.

run `mvn clean install -pl core -am`

 
{code:java}
[INFO] --- maven-surefire-plugin:3.1.2:test (default-test) @ spark-core_2.12 ---
[INFO] Using auto detected provider 
org.apache.maven.surefire.junitplatform.JUnitPlatformProvider
[INFO] 
[INFO] ---
[INFO]  T E S T S
[INFO] ---
[INFO] 
[INFO] Results:
[INFO] 
[INFO] Tests run: 0, Failures: 0, Errors: 0, Skipped: 0
[INFO] 
[INFO] 
[INFO] --- maven-surefire-plugin:3.1.2:test (test) @ spark-core_2.12 ---
[INFO] Skipping execution of surefire because it has already been run for this 
configuration{code}


> exclude junit5 deps from jersey-test-framework-provider-simple
> --
>
> Key: SPARK-44377
> URL: https://issues.apache.org/jira/browse/SPARK-44377
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Priority: Major
>
> SPARK-44316 upgrade Jersey from 2.36 to 2.40. Jersey 2.38 start to use 
> [Junit5 instead of Junit4|https://github.com/eclipse-ee4j/jersey/pull/5123], 
> Spark core module uses 
> `org.glassfish.jersey.test-framework.providers:jersey-test-framework-provider-simple:2.40`,
>  which cascades and introduces the dependencies of Junit5, this causes Java 
> tests no longer be executed when performing maven tests on the core module.
> run `mvn clean install -pl core -am`
>  
> {code:java}
> [INFO] --- maven-surefire-plugin:3.1.2:test (default-test) @ spark-core_2.12 
> ---
> [INFO] Using auto detected provider 
> org.apache.maven.surefire.junitplatform.JUnitPlatformProvider
> [INFO] 
> [INFO] ---
> [INFO]  T E S T S
> [INFO] ---
> [INFO] 
> [INFO] Results:
> [INFO] 
> [INFO] Tests run: 0, Failures: 0, Errors: 0, Skipped: 0
> [INFO] 
> [INFO] 
> [INFO] --- maven-surefire-plugin:3.1.2:test (test) @ spark-core_2.12 ---
> [INFO] Skipping execution of surefire because it has already been run for 
> this configuration{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44377) exclude junit5 deps from jersey-test-framework-provider-simple

2023-07-11 Thread Yang Jie (Jira)
Yang Jie created SPARK-44377:


 Summary: exclude junit5 deps from 
jersey-test-framework-provider-simple
 Key: SPARK-44377
 URL: https://issues.apache.org/jira/browse/SPARK-44377
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.5.0
Reporter: Yang Jie


SPARK-44316 upgraded Jersey from 2.36 to 2.40. Jersey 2.38 started to use Junit5 instead of Junit4, and the Spark core module uses `org.glassfish.jersey.test-framework.providers:jersey-test-framework-provider-simple:2.40`, which transitively pulls in the Junit5 dependencies. This causes Java tests to no longer be executed when running maven tests on the core module.

run `mvn clean install -pl core -am`

 
{code:java}
[INFO] --- maven-surefire-plugin:3.1.2:test (default-test) @ spark-core_2.12 ---
[INFO] Using auto detected provider 
org.apache.maven.surefire.junitplatform.JUnitPlatformProvider
[INFO] 
[INFO] ---
[INFO]  T E S T S
[INFO] ---
[INFO] 
[INFO] Results:
[INFO] 
[INFO] Tests run: 0, Failures: 0, Errors: 0, Skipped: 0
[INFO] 
[INFO] 
[INFO] --- maven-surefire-plugin:3.1.2:test (test) @ spark-core_2.12 ---
[INFO] Skipping execution of surefire because it has already been run for this 
configuration{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44376) Build using maven is broken using 2.13 and Java 11 and Java 17

2023-07-11 Thread Emil Ejbyfeldt (Jira)
Emil Ejbyfeldt created SPARK-44376:
--

 Summary: Build using maven is broken using 2.13 and Java 11 and 
Java 17
 Key: SPARK-44376
 URL: https://issues.apache.org/jira/browse/SPARK-44376
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 3.5.0
Reporter: Emil Ejbyfeldt


Fails with
```
$ ./build/mvn compile -Pscala-2.13 -Djava.version=11 -X
...
[WARNING] [Warn] : [deprecation @  | origin= | version=] -target is deprecated: 
Use -release instead to compile against the correct platform API.
[ERROR] [Error] : target platform version 8 is older than the release version 11
[WARNING] one warning found
[ERROR] one error found
...
```
if setting the `java.version` property or
```
$ ./build/mvn compile -Pscala-2.13
...
[WARNING] [Warn] : [deprecation @  | origin= | version=] -target is deprecated: 
Use -release instead to compile against the correct platform API.
[ERROR] [Error] 
/home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/serializer/SerializationDebugger.scala:71:
 not found: value sun
[ERROR] [Error] 
/home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:26:
 not found: object sun
[ERROR] [Error] 
/home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:27:
 not found: object sun
[ERROR] [Error] 
/home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:206:
 not found: type DirectBuffer
[ERROR] [Error] 
/home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:210:
 not found: type Unsafe
[ERROR] [Error] 
/home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:212:
 not found: type Unsafe
[ERROR] [Error] 
/home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:213:
 not found: type DirectBuffer
[ERROR] [Error] 
/home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:216:
 not found: type DirectBuffer
[ERROR] [Error] 
/home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:236:
 not found: type DirectBuffer
[ERROR] [Error] 
/home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:26:
 Unused import
[ERROR] [Error] 
/home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:27:
 Unused import
[ERROR] [Error] 
/home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/util/ClosureCleaner.scala:452:
 not found: value sun
[ERROR] [Error] 
/home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:26:
 not found: object sun
[ERROR] [Error] 
/home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:99:
 not found: type SignalHandler
[ERROR] [Error] 
/home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:99:
 not found: type Signal
[ERROR] [Error] 
/home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:83:
 not found: type Signal
[ERROR] [Error] 
/home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:108:
 not found: type SignalHandler
[ERROR] [Error] 
/home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:108:
 not found: value Signal
[ERROR] [Error] 
/home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:114:
 not found: type Signal
[ERROR] [Error] 
/home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:116:
 not found: value Signal
[ERROR] [Error] 
/home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:128:
 not found: value Signal
[ERROR] [Error] 
/home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:26:
 Unused import
[ERROR] [Error] 
/home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:26:
 Unused import
[WARNING] one warning found
[ERROR] 23 errors found
...
```



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-33782) Place spark.files, spark.jars and spark.files under the current working directory on the driver in K8S cluster mode

2023-07-11 Thread Pratik Malani (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17742006#comment-17742006
 ] 

Pratik Malani edited comment on SPARK-33782 at 7/11/23 1:33 PM:


Hi [~pralabhkumar] 

The latest update in SparkSubmit.scala is causing a NoSuchFileException.
The jar mentioned below is present at the stated location /opt/spark/work-dir/, 
yet the Files.copy call in SparkSubmit.scala still fails.
Can you please help check what the possible cause could be?
{code:java}
Files  local:///opt/spark/work-dir/sample.jar from 
/opt/spark/work-dir/sample.jar to /opt/spark/work-dir/./sample.jar
Exception in thread "main" java.nio.file.NoSuchFileException: 
/opt/spark/work-dir/sample.jar
        at 
sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
        at sun.nio.fs.UnixCopyFile.copy(UnixCopyFile.java:526)
        at 
sun.nio.fs.UnixFileSystemProvider.copy(UnixFileSystemProvider.java:253)
        at java.nio.file.Files.copy(Files.java:1274)
        at 
org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$14(SparkSubmit.scala:437)
        at 
scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
        at 
scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
        at 
scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
        at scala.collection.TraversableLike.map(TraversableLike.scala:286)
        at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
        at scala.collection.AbstractTraversable.map(Traversable.scala:108)
        at 
org.apache.spark.deploy.SparkSubmit.downloadResourcesToCurrentDirectory$1(SparkSubmit.scala:424)
        at 
org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$17(SparkSubmit.scala:449)
        at scala.Option.map(Option.scala:230)
        at 
org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:449)
        at 
org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
        at 
org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:192)
 {code}
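
For reference, here is a minimal standalone diagnostic (not Spark's code; the jar path is taken from the log above) that can be run inside the driver container to confirm what the failing copy is being asked to do. Whether the source and target resolving to the same file matters here is only an assumption.
{code:python}
# Standalone check, independent of Spark: inspect the source and target
# paths that appear in the log above.
from pathlib import Path

src = Path("/opt/spark/work-dir/sample.jar")
dst = Path("/opt/spark/work-dir/./sample.jar")

# Does the jar exist at the moment the check runs?
print("source exists:", src.exists())

# Both paths normalize to the same file, so the copy would rewrite the jar
# in place rather than stage it somewhere else.
print("same file after resolving '.':", src.resolve() == dst.resolve())
{code}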


was (Author: JIRAUSER296450):
Hi [~pralabhkumar] 

The latest update in SparkSubmit.scala is causing a NoSuchFileException.
The jar mentioned below is present at the stated location, yet the Files.copy 
call in SparkSubmit.scala still fails.
Can you please help check what the possible cause could be?
{code:java}
Files  local:///opt/spark/work-dir/database-scripts-1.1-SNAPSHOT.jar from 
/opt/spark/work-dir/database-scripts-1.1-SNAPSHOT.jar to 
/opt/spark/work-dir/./database-scripts-1.1-SNAPSHOT.jar
Exception in thread "main" java.nio.file.NoSuchFileException: 
/opt/spark/work-dir/database-scripts-1.1-SNAPSHOT.jar
        at 
sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
        at sun.nio.fs.UnixCopyFile.copy(UnixCopyFile.java:526)
        at 
sun.nio.fs.UnixFileSystemProvider.copy(UnixFileSystemProvider.java:253)
        at java.nio.file.Files.copy(Files.java:1274)
        at 
org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$14(SparkSubmit.scala:437)
        at 
scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
        at 
scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
        at 
scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
        at scala.collection.TraversableLike.map(TraversableLike.scala:286)
        at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
        at scala.collection.AbstractTraversable.map(Traversable.scala:108)
        at 
org.apache.spark.deploy.SparkSubmit.downloadResourcesToCurrentDirectory$1(SparkSubmit.scala:424)
        at 
org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$17(SparkSubmit.scala:449)
        at scala.Option.map(Option.scala:230)
        at 
org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:449)
        at 
org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
        at 
org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:192)
 {code}

> Place spark.files, spark.jars and spark.files under the current working 
> directory on the driver in K8S cluster mode
> 

[jira] [Commented] (SPARK-33782) Place spark.files, spark.jars and spark.files under the current working directory on the driver in K8S cluster mode

2023-07-11 Thread Pratik Malani (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17742006#comment-17742006
 ] 

Pratik Malani commented on SPARK-33782:
---

Hi [~pralabhkumar] 

The latest update in SparkSubmit.scala is causing a NoSuchFileException.
The jar mentioned below is present at the stated location, yet the Files.copy 
call in SparkSubmit.scala still fails.
Can you please help check what the possible cause could be?
{code:java}
Files  local:///opt/spark/work-dir/database-scripts-1.1-SNAPSHOT.jar from 
/opt/spark/work-dir/database-scripts-1.1-SNAPSHOT.jar to 
/opt/spark/work-dir/./database-scripts-1.1-SNAPSHOT.jar
Exception in thread "main" java.nio.file.NoSuchFileException: 
/opt/spark/work-dir/database-scripts-1.1-SNAPSHOT.jar
        at 
sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
        at sun.nio.fs.UnixCopyFile.copy(UnixCopyFile.java:526)
        at 
sun.nio.fs.UnixFileSystemProvider.copy(UnixFileSystemProvider.java:253)
        at java.nio.file.Files.copy(Files.java:1274)
        at 
org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$14(SparkSubmit.scala:437)
        at 
scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
        at 
scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
        at 
scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
        at scala.collection.TraversableLike.map(TraversableLike.scala:286)
        at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
        at scala.collection.AbstractTraversable.map(Traversable.scala:108)
        at 
org.apache.spark.deploy.SparkSubmit.downloadResourcesToCurrentDirectory$1(SparkSubmit.scala:424)
        at 
org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$17(SparkSubmit.scala:449)
        at scala.Option.map(Option.scala:230)
        at 
org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:449)
        at 
org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
        at 
org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:192)
 {code}

> Place spark.files, spark.jars and spark.files under the current working 
> directory on the driver in K8S cluster mode
> ---
>
> Key: SPARK-33782
> URL: https://issues.apache.org/jira/browse/SPARK-33782
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Assignee: Pralabh Kumar
>Priority: Major
> Fix For: 3.4.0
>
>
> In YARN cluster mode, the passed files can be accessed from the current 
> working directory. This does not appear to be the case in Kubernetes cluster 
> mode.
> By doing this, users can, for example, leverage PEX to manage Python 
> dependencies in Apache Spark:
> {code}
> pex pyspark==3.0.1 pyarrow==0.15.1 pandas==0.25.3 -o myarchive.pex
> PYSPARK_PYTHON=./myarchive.pex spark-submit --files myarchive.pex
> {code}
> See also https://github.com/apache/spark/pull/30735/files#r540935585.
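
As a hedged illustration of the behavior described above (assuming {{--files app.conf}} was passed to spark-submit in K8s cluster mode; the file name is hypothetical), the staged file should then be readable straight from the driver's working directory, as it already is on YARN:
{code:python}
import os

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("files-in-cwd").getOrCreate()

# After the fix, a file submitted via --files sits next to the driver process.
print("driver cwd:", os.getcwd())
print("app.conf present:", os.path.exists("app.conf"))  # hypothetical file name

spark.stop()
{code}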



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44375) Use PartitionEvaluator API in DebugExec

2023-07-11 Thread Jia Fan (Jira)
Jia Fan created SPARK-44375:
---

 Summary: Use PartitionEvaluator API in DebugExec
 Key: SPARK-44375
 URL: https://issues.apache.org/jira/browse/SPARK-44375
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.5.0
Reporter: Jia Fan


Use PartitionEvaluator API in DebugExec



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44375) Use PartitionEvaluator API in DebugExec

2023-07-11 Thread Jia Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17742004#comment-17742004
 ] 

Jia Fan commented on SPARK-44375:
-

I'm working on it.

> Use PartitionEvaluator API in DebugExec
> ---
>
> Key: SPARK-44375
> URL: https://issues.apache.org/jira/browse/SPARK-44375
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Jia Fan
>Priority: Major
>
> Use PartitionEvaluator API in DebugExec



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44374) Add example code

2023-07-11 Thread Weichen Xu (Jira)
Weichen Xu created SPARK-44374:
--

 Summary: Add example code
 Key: SPARK-44374
 URL: https://issues.apache.org/jira/browse/SPARK-44374
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, ML, PySpark
Affects Versions: 3.5.0
Reporter: Weichen Xu


Add example code for distributed ML <> Spark Connect.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44374) Add example code

2023-07-11 Thread Weichen Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu reassigned SPARK-44374:
--

Assignee: Weichen Xu

> Add example code
> 
>
> Key: SPARK-44374
> URL: https://issues.apache.org/jira/browse/SPARK-44374
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, ML, PySpark
>Affects Versions: 3.5.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Major
>
> Add example code for distributed ML <> Spark Connect.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42471) Distributed ML <> spark connect

2023-07-11 Thread Weichen Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu reassigned SPARK-42471:
--

Assignee: Weichen Xu

> Distributed ML <> spark connect
> ---
>
> Key: SPARK-42471
> URL: https://issues.apache.org/jira/browse/SPARK-42471
> Project: Spark
>  Issue Type: Umbrella
>  Components: Connect, ML
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Weichen Xu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44341) Define the computing logic through PartitionEvaluator API and use it in WindowExec and WindowInPandasExec

2023-07-11 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-44341:
---
Summary: Define the computing logic through PartitionEvaluator API and use 
it in WindowExec and WindowInPandasExec  (was: Define the computing logic 
through PartitionEvaluator API and use it in WindowExec)

> Define the computing logic through PartitionEvaluator API and use it in 
> WindowExec and WindowInPandasExec
> -
>
> Key: SPARK-44341
> URL: https://issues.apache.org/jira/browse/SPARK-44341
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: jiaan.geng
>Priority: Major
>
> Define the computing logic through PartitionEvaluator API and use it in 
> WindowExec



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44373) Wrap withActive for Dataset API w/ parse logic

2023-07-11 Thread Kent Yao (Jira)
Kent Yao created SPARK-44373:


 Summary: Wrap withActive for Dataset API w/ parse logic
 Key: SPARK-44373
 URL: https://issues.apache.org/jira/browse/SPARK-44373
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.5.0
Reporter: Kent Yao






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38476) Use error classes in org.apache.spark.storage

2023-07-11 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-38476:


Assignee: Bo Zhang

> Use error classes in org.apache.spark.storage
> -
>
> Key: SPARK-38476
> URL: https://issues.apache.org/jira/browse/SPARK-38476
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Bo Zhang
>Assignee: Bo Zhang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38476) Use error classes in org.apache.spark.storage

2023-07-11 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-38476.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41923
[https://github.com/apache/spark/pull/41923]

> Use error classes in org.apache.spark.storage
> -
>
> Key: SPARK-38476
> URL: https://issues.apache.org/jira/browse/SPARK-38476
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Bo Zhang
>Assignee: Bo Zhang
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44354) Cannot create dataframe with CharType/VarcharType column

2023-07-11 Thread Kai-Michael Roesner (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai-Michael Roesner updated SPARK-44354:

Description: 
When trying to create a dataframe with a CharType or VarcharType column like so:
{code}
from datetime import date
from decimal import Decimal
from pyspark.sql import SparkSession
from pyspark.sql.types import *

data = [
  (1, 'abc', Decimal(3.142), date(2023, 1, 1)),
  (2, 'bcd', Decimal(1.414), date(2023, 1, 2)),
  (3, 'cde', Decimal(2.718), date(2023, 1, 3))]

schema = StructType([
  StructField('INT', IntegerType()),
  StructField('STR', CharType(3)),
  StructField('DEC', DecimalType(4, 3)),
  StructField('DAT', DateType())])

spark = SparkSession.builder.appName('data-types').getOrCreate()
df = spark.createDataFrame(data, schema)
df.show()
{code}
a {{java.lang.IllegalStateException}} is thrown 
[here|https://github.com/apache/spark/blob/85e252e8503534009f4fb5ea005d44c9eda31447/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala#L168].

Excerpt from the logs:
{code}
py4j.protocol.Py4JJavaError: An error occurred while calling 
o24.applySchemaToPythonRDD.
: java.lang.IllegalStateException: [BUG] logical plan should not have output of 
char/varchar type: LogicalRDD [INT#0, STR#1, DEC#2, DAT#3], false

at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$1(CheckAnalysis.scala:168)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$1$adapted(CheckAnalysis.scala:163)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:295)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis0(CheckAnalysis.scala:163)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis0$(CheckAnalysis.scala:160)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis0(Analyzer.scala:188)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:156)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:146)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:188)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:211)
at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:330)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:208)
at 
org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:76)
at 
org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
at 
org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$2(QueryExecution.scala:202)
at 
org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:526)
at 
org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:202)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827)
at 
org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:201)
at 
org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:76)
at 
org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:74)
at 
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:66)
at org.apache.spark.sql.Dataset$.$anonfun$ofRows$1(Dataset.scala:90)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:88)
at 
org.apache.spark.sql.SparkSession.internalCreateDataFrame(SparkSession.scala:571)
at 
org.apache.spark.sql.SparkSession.applySchemaToPythonRDD(SparkSession.scala:804)
at 
org.apache.spark.sql.SparkSession.applySchemaToPythonRDD(SparkSession.scala:789)
at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at 
py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnectio

[jira] [Commented] (SPARK-44354) Cannot create dataframe with CharType/VarcharType column

2023-07-11 Thread Kai-Michael Roesner (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17741925#comment-17741925
 ] 

Kai-Michael Roesner commented on SPARK-44354:
-

PS: I tried to work around the exception by using `StringType()` in the schema 
and then doing 
{code}
df.withColumn('STR', col('STR').cast(CharType(3)))
{code}
That got me a
{code}
WARN CharVarcharUtils: The Spark cast operator does not support char/varchar 
type and simply treats them as string type.
{code}
So now I'm wondering whether `CharType()` is supported as a column data type at 
all...
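
One possible workaround, sketched under the assumption that writing through a catalog table is acceptable: keep the DataFrame column as plain {{StringType}} and declare the {{CHAR(3)}} constraint in the table DDL, where char/varchar types are accepted, rather than in the DataFrame schema.
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("char-workaround").getOrCreate()

# The fixed-width constraint lives in the table definition; the DataFrame
# side keeps an ordinary string column.
spark.sql("CREATE TABLE IF NOT EXISTS fixed_width (STR CHAR(3)) USING parquet")
spark.createDataFrame([("abc",), ("bcd",)], ["STR"]).write.insertInto("fixed_width")
spark.table("fixed_width").show()
{code}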

> Cannot create dataframe with CharType/VarcharType column
> 
>
> Key: SPARK-44354
> URL: https://issues.apache.org/jira/browse/SPARK-44354
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.4.0
>Reporter: Kai-Michael Roesner
>Priority: Major
>
> When trying to create a dataframe with a CharType or VarcharType column like 
> so:
> {code}
> from datetime import date
> from decimal import Decimal
> from pyspark.sql import SparkSession
> from pyspark.sql.types import *
> data = [
>   (1, 'abc', Decimal(3.142), date(2023, 1, 1)),
>   (2, 'bcd', Decimal(1.414), date(2023, 1, 2)),
>   (3, 'cde', Decimal(2.718), date(2023, 1, 3))]
> schema = StructType([
>   StructField('INT', IntegerType()),
>   StructField('STR', CharType(3)),
>   StructField('DEC', DecimalType(4, 3)),
>   StructField('DAT', DateType())])
> spark = SparkSession.builder.appName('data-types').getOrCreate()
> df = spark.createDataFrame(data, schema)
> df.show()
> {code}
> a {{java.lang.IllegalStateException}} is thrown 
> [here|https://github.com/apache/spark/blob/85e252e8503534009f4fb5ea005d44c9eda31447/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala#L168].
> I'm expecting this to work...
> PS: Excerpt from the logs:
> {code}
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> o24.applySchemaToPythonRDD.
> : java.lang.IllegalStateException: [BUG] logical plan should not have output 
> of char/varchar type: LogicalRDD [INT#0, STR#1, DEC#2, DAT#3], false
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$1(CheckAnalysis.scala:168)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$1$adapted(CheckAnalysis.scala:163)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:295)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis0(CheckAnalysis.scala:163)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis0$(CheckAnalysis.scala:160)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis0(Analyzer.scala:188)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:156)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:146)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:188)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:211)
> at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:330)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:208)
> at 
> org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:76)
> at 
> org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
> at 
> org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$2(QueryExecution.scala:202)
> at 
> org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:526)
> at 
> org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:202)
> at 
> org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827)
> at 
> org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:201)
> at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:76)
> at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:74)
> at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:66)
> at org.apache.spark.sql.Dataset$.$anonfun$ofRows$1(Dataset.scala:90)
> at 
> org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827)
> at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:88)
> at 
> org.apache.spark.sql.SparkSession.internalCreateDataFrame(SparkSession.scal

[jira] [Commented] (SPARK-44362) Use PartitionEvaluator API in AggregateInPandasExec, WindowInPandasExec,EvalPythonExec,AttachDistributedSequenceExec

2023-07-11 Thread jiaan.geng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17741921#comment-17741921
 ] 

jiaan.geng commented on SPARK-44362:


[~vinodkc] Because WindowInPandasExec is related to WindowExec, could I finish 
them together?

> Use  PartitionEvaluator API in AggregateInPandasExec, 
> WindowInPandasExec,EvalPythonExec,AttachDistributedSequenceExec
> -
>
> Key: SPARK-44362
> URL: https://issues.apache.org/jira/browse/SPARK-44362
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Vinod KC
>Priority: Major
>
> Use  PartitionEvaluator API in
> AggregateInPandasExec
> WindowInPandasExec
> EvalPythonExec
> AttachDistributedSequenceExec



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43665) Enable PandasSQLStringFormatter.vformat to work with Spark Connect

2023-07-11 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17741916#comment-17741916
 ] 

ASF GitHub Bot commented on SPARK-43665:


User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/41931

> Enable PandasSQLStringFormatter.vformat to work with Spark Connect
> --
>
> Key: SPARK-43665
> URL: https://issues.apache.org/jira/browse/SPARK-43665
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Pandas API on Spark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Enable PandasSQLStringFormatter.vformat to work with Spark Connect



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43665) Enable PandasSQLStringFormatter.vformat to work with Spark Connect

2023-07-11 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17741915#comment-17741915
 ] 

ASF GitHub Bot commented on SPARK-43665:


User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/41931

> Enable PandasSQLStringFormatter.vformat to work with Spark Connect
> --
>
> Key: SPARK-43665
> URL: https://issues.apache.org/jira/browse/SPARK-43665
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Pandas API on Spark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Enable PandasSQLStringFormatter.vformat to work with Spark Connect



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44263) Allow ChannelBuilder extensions -- Scala

2023-07-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-44263:


Assignee: Alice Sayutina

> Allow ChannelBuilder extensions -- Scala
> 
>
> Key: SPARK-44263
> URL: https://issues.apache.org/jira/browse/SPARK-44263
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.4.1
>Reporter: Alice Sayutina
>Assignee: Alice Sayutina
>Priority: Major
>
> Follow up to https://issues.apache.org/jira/browse/SPARK-43332
> Provide similar extension capabilities in Scala



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44263) Allow ChannelBuilder extensions -- Scala

2023-07-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-44263.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41880
[https://github.com/apache/spark/pull/41880]

> Allow ChannelBuilder extensions -- Scala
> 
>
> Key: SPARK-44263
> URL: https://issues.apache.org/jira/browse/SPARK-44263
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.4.1
>Reporter: Alice Sayutina
>Assignee: Alice Sayutina
>Priority: Major
> Fix For: 3.5.0
>
>
> Follow up to https://issues.apache.org/jira/browse/SPARK-43332
> Provide similar extension capabilities in Scala



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44320) Assign names to the error class _LEGACY_ERROR_TEMP_[1067,1150,1220,1265,1277]

2023-07-11 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-44320.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41909
[https://github.com/apache/spark/pull/41909]

> Assign names to the error class _LEGACY_ERROR_TEMP_[1067,1150,1220,1265,1277]
> -
>
> Key: SPARK-44320
> URL: https://issues.apache.org/jira/browse/SPARK-44320
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44320) Assign names to the error class _LEGACY_ERROR_TEMP_[1067,1150,1220,1265,1277]

2023-07-11 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-44320:


Assignee: BingKun Pan

> Assign names to the error class _LEGACY_ERROR_TEMP_[1067,1150,1220,1265,1277]
> -
>
> Key: SPARK-44320
> URL: https://issues.apache.org/jira/browse/SPARK-44320
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44372) Enable KernelDensity within Spark Connect

2023-07-11 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-44372:
---

 Summary: Enable KernelDensity within Spark Connect
 Key: SPARK-44372
 URL: https://issues.apache.org/jira/browse/SPARK-44372
 Project: Spark
  Issue Type: Sub-task
  Components: Pandas API on Spark
Affects Versions: 3.5.0
Reporter: Haejoon Lee


{code:python}
import pyspark.pandas as ps

psdf = ps.DataFrame({"a": [1, 2, 3, 4, 5], "b": [1, 3, 5, 7, 9], "c": [2, 4, 6, 8, 10]})
psdf.plot.kde(bw_method=5, ind=3)
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43629) Enable RDD dependent tests with Spark Connect

2023-07-11 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-43629:

Summary: Enable RDD dependent tests with Spark Connect  (was: Enable RDD 
with Spark Connect)

> Enable RDD dependent tests with Spark Connect
> -
>
> Key: SPARK-43629
> URL: https://issues.apache.org/jira/browse/SPARK-43629
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Pandas API on Spark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Enable RDD with Spark Connect



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44371) Define the computing logic through PartitionEvaluator API and use it in CollectLimitExec, CollectTailExec, LocalLimitExec and GlobalLimitExec

2023-07-11 Thread jiaan.geng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17741882#comment-17741882
 ] 

jiaan.geng commented on SPARK-44371:


I'm working on it.

> Define the computing logic through PartitionEvaluator API and use it in 
> CollectLimitExec, CollectTailExec, LocalLimitExec and GlobalLimitExec
> -
>
> Key: SPARK-44371
> URL: https://issues.apache.org/jira/browse/SPARK-44371
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: jiaan.geng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44371) Define the computing logic through PartitionEvaluator API and use it in CollectLimitExec, CollectTailExec, LocalLimitExec and GlobalLimitExec

2023-07-11 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-44371:
--

 Summary: Define the computing logic through PartitionEvaluator API 
and use it in CollectLimitExec, CollectTailExec, LocalLimitExec and 
GlobalLimitExec
 Key: SPARK-44371
 URL: https://issues.apache.org/jira/browse/SPARK-44371
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.5.0
Reporter: jiaan.geng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org