[jira] [Updated] (SPARK-44381) How to specify parameters in spark-submit to make HiveDelegationTokenProvider refresh token regularly
[ https://issues.apache.org/jira/browse/SPARK-44381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] qingbo jiao updated SPARK-44381:
Description:
export KRB5CCNAME=FILE:/tmp/krb5cc_1001
./bin/spark-submit --master yarn --deploy-mode client --proxy-user --conf spark.app.name=spark-hive-test --conf spark.security.credentials.renewalRatio=0.58 --conf spark.kerberos.renewal.credentials=ccache --class org.apache.spark.examples.sql.hive.SparkHiveExample /examples/jars/spark-examples_2.12-3.1.1.jar
Spark version 3.1.1; I configured it to refresh every 5 seconds. --deploy-mode client/cluster with/without --proxy-user have all been tried, but none of them works. Am I missing any configuration parameters?

was:
export KRB5CCNAME=FILE:/tmp/krb5cc_1001
./bin/spark-submit --master yarn --deploy-mode client --proxy-user ocdp --conf spark.app.name=spark-hive-test --conf spark.security.credentials.renewalRatio=0.58 --conf spark.kerberos.renewal.credentials=ccache --class org.apache.spark.examples.sql.hive.SparkHiveExample /examples/jars/spark-examples_2.12-3.1.1.jar
Spark version 3.1.1; I configured it to refresh every 5 seconds. --deploy-mode client/cluster with/without --proxy-user have all been tried, but none of them works. Am I missing any configuration parameters?
> How to specify parameters in spark-submit to make HiveDelegationTokenProvider
> refresh token regularly
>
> Key: SPARK-44381
> URL: https://issues.apache.org/jira/browse/SPARK-44381
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.1.1
> Reporter: qingbo jiao
> Priority: Minor
>
> export KRB5CCNAME=FILE:/tmp/krb5cc_1001
> ./bin/spark-submit --master yarn --deploy-mode client --proxy-user
> --conf spark.app.name=spark-hive-test --conf
> spark.security.credentials.renewalRatio=0.58 --conf
> spark.kerberos.renewal.credentials=ccache --class
> org.apache.spark.examples.sql.hive.SparkHiveExample
> /examples/jars/spark-examples_2.12-3.1.1.jar
> Spark version 3.1.1; I configured it to refresh every 5 seconds.
> --deploy-mode client/cluster with/without --proxy-user have all been tried,
> but none of them works.
> Am I missing any configuration parameters?
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
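For context on why a token would not refresh "every 5 seconds": Spark schedules delegation-token renewal at a fraction of the token's lifetime, controlled by spark.security.credentials.renewalRatio (default 0.75). A rough sketch of the arithmetic, assuming a hypothetical 24-hour token lifetime (the lifetime itself comes from the KDC/Hive configuration, not from Spark):

```python
# Sketch of how the renewal delay relates to renewalRatio (simplified:
# the real scheduler works from the token's actual expiry timestamp).

def renewal_delay_seconds(token_lifetime_s: float, renewal_ratio: float) -> float:
    """Delay before Spark attempts to renew a delegation token."""
    return renewal_ratio * token_lifetime_s

# With the reporter's ratio of 0.58 and an assumed 24-hour token lifetime,
# renewal would be scheduled after roughly 13.9 hours, not every 5 seconds.
delay = renewal_delay_seconds(24 * 3600, 0.58)
print(round(delay))  # 50112
```

This suggests the ratio alone cannot produce a 5-second refresh; only a correspondingly short token lifetime could.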
[jira] [Commented] (SPARK-44381) How to specify parameters in spark-submit to make HiveDelegationTokenProvider refresh token regularly
[ https://issues.apache.org/jira/browse/SPARK-44381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17742293#comment-17742293 ] qingbo jiao commented on SPARK-44381: [~jshao] please cc, thanks

> How to specify parameters in spark-submit to make HiveDelegationTokenProvider
> refresh token regularly
[jira] [Resolved] (SPARK-44353) Remove toAttributes from StructType
[ https://issues.apache.org/jira/browse/SPARK-44353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell resolved SPARK-44353.
Fix Version/s: 3.5.0
Resolution: Fixed

> Remove toAttributes from StructType
>
> Key: SPARK-44353
> URL: https://issues.apache.org/jira/browse/SPARK-44353
> Project: Spark
> Issue Type: New Feature
> Components: Connect, SQL
> Affects Versions: 3.4.1
> Reporter: Herman van Hövell
> Assignee: Herman van Hövell
> Priority: Major
> Fix For: 3.5.0
[jira] [Resolved] (SPARK-44373) Wrap withActive for Dataset API w/ parse logic
[ https://issues.apache.org/jira/browse/SPARK-44373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-44373.
Fix Version/s: 3.5.0
Resolution: Fixed
Issue resolved by pull request 41938 [https://github.com/apache/spark/pull/41938]

> Wrap withActive for Dataset API w/ parse logic
>
> Key: SPARK-44373
> URL: https://issues.apache.org/jira/browse/SPARK-44373
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.5.0
> Reporter: Kent Yao
> Assignee: Kent Yao
> Priority: Major
> Fix For: 3.5.0
[jira] [Assigned] (SPARK-44373) Wrap withActive for Dataset API w/ parse logic
[ https://issues.apache.org/jira/browse/SPARK-44373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao reassigned SPARK-44373: Assignee: Kent Yao

> Wrap withActive for Dataset API w/ parse logic
[jira] [Resolved] (SPARK-44334) Status of execution w/ error and w/o jobs shall be FAILED not COMPLETED
[ https://issues.apache.org/jira/browse/SPARK-44334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-44334.
Fix Version/s: 3.5.0
Resolution: Fixed
Issue resolved by pull request 41891 [https://github.com/apache/spark/pull/41891]

> Status of execution w/ error and w/o jobs shall be FAILED not COMPLETED
>
> Key: SPARK-44334
> URL: https://issues.apache.org/jira/browse/SPARK-44334
> Project: Spark
> Issue Type: Improvement
> Components: SQL, Web UI
> Affects Versions: 3.3.2, 3.4.1, 3.5.0
> Reporter: Kent Yao
> Assignee: Kent Yao
> Priority: Major
> Fix For: 3.5.0
[jira] [Assigned] (SPARK-44334) Status of execution w/ error and w/o jobs shall be FAILED not COMPLETED
[ https://issues.apache.org/jira/browse/SPARK-44334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao reassigned SPARK-44334: Assignee: Kent Yao

> Status of execution w/ error and w/o jobs shall be FAILED not COMPLETED
[jira] [Resolved] (SPARK-44370) Migrate Buf remote generation alpha to remote plugins
[ https://issues.apache.org/jira/browse/SPARK-44370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-44370.
Fix Version/s: 3.5.0
Resolution: Fixed
Issue resolved by pull request 41933 [https://github.com/apache/spark/pull/41933]

> Migrate Buf remote generation alpha to remote plugins
>
> Key: SPARK-44370
> URL: https://issues.apache.org/jira/browse/SPARK-44370
> Project: Spark
> Issue Type: Bug
> Components: Connect
> Affects Versions: 3.4.1
> Reporter: Jia Fan
> Assignee: Jia Fan
> Priority: Major
> Fix For: 3.5.0
>
> Buf no longer supports remote generation alpha; see
> [https://buf.build/docs/migration-guides/migrate-remote-generation-alpha/].
> We should migrate from remote generation alpha to remote plugins by following the guide.
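For readers following along, the migration is a change to buf.gen.yaml: entries using the deprecated `remote:` key move to the `plugin:` key with the remote-plugin name. The plugin names and versions below are illustrative; the exact mapping for Spark's protos is in the linked guide and the PR.

```yaml
# Before: remote generation alpha (deprecated)
version: v1
plugins:
  - remote: buf.build/protocolbuffers/plugins/python
    out: gen

# After: remote plugins
version: v1
plugins:
  - plugin: buf.build/protocolbuffers/python
    out: gen
```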
[jira] [Commented] (SPARK-43755) Spark Connect - decouple query execution from RPC handler
[ https://issues.apache.org/jira/browse/SPARK-43755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17742252#comment-17742252 ] Snoot.io commented on SPARK-43755: User 'juliuszsompolski' has created a pull request for this issue: https://github.com/apache/spark/pull/41315

> Spark Connect - decouple query execution from RPC handler
>
> Key: SPARK-43755
> URL: https://issues.apache.org/jira/browse/SPARK-43755
> Project: Spark
> Issue Type: Story
> Components: Connect
> Affects Versions: 3.5.0
> Reporter: Juliusz Sompolski
> Priority: Major
>
> Move actual query execution out of the RPC handler callback. This allows:
> * (immediately) better control over query cancellation, by interrupting the execution thread.
> * design changes to the RPC interface to allow different execution models than stream-push from the server.
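The decoupling described above can be sketched in a language-neutral way: the RPC handler only submits the query and returns a handle, while a separate execution thread does the work and can be cancelled by id. A minimal sketch in Python (class and method names are illustrative, not Spark Connect's actual server classes):

```python
# Sketch: RPC handler submits work to a pool and keeps a cancellation handle,
# so a later "interrupt" RPC can stop the running query cooperatively.
import threading
from concurrent.futures import ThreadPoolExecutor

class ExecutionManager:
    def __init__(self):
        self._pool = ThreadPoolExecutor(max_workers=4)
        self._cancel_flags = {}

    def submit(self, query_id, run_query):
        """Called from the RPC handler; returns immediately with a future."""
        flag = threading.Event()
        self._cancel_flags[query_id] = flag
        return self._pool.submit(run_query, flag)

    def interrupt(self, query_id):
        """Called from a separate RPC to cancel the execution thread."""
        self._cancel_flags[query_id].set()

def slow_query(cancel_flag):
    # Simulated long-running work that periodically checks for cancellation.
    if cancel_flag.wait(timeout=30):
        return "cancelled"
    return "done"

mgr = ExecutionManager()
fut = mgr.submit("q1", slow_query)
mgr.interrupt("q1")
print(fut.result())  # cancelled
```

Because the handler no longer blocks on execution, the server is also free to change how results are delivered (e.g. models other than stream-push), which is the second point in the ticket.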
[jira] [Created] (SPARK-44381) How to specify parameters in spark-submit to make HiveDelegationTokenProvider refresh token regularly
qingbo jiao created SPARK-44381:
Summary: How to specify parameters in spark-submit to make HiveDelegationTokenProvider refresh token regularly
Key: SPARK-44381
URL: https://issues.apache.org/jira/browse/SPARK-44381
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 3.1.1
Reporter: qingbo jiao

export KRB5CCNAME=FILE:/tmp/krb5cc_1001
./bin/spark-submit --master yarn --deploy-mode client --proxy-user ocdp --conf spark.app.name=spark-hive-test --conf spark.security.credentials.renewalRatio=0.58 --conf spark.kerberos.renewal.credentials=ccache --class org.apache.spark.examples.sql.hive.SparkHiveExample /examples/jars/spark-examples_2.12-3.1.1.jar
Spark version 3.1.1; I configured it to refresh every 5 seconds. --deploy-mode client/cluster with/without --proxy-user have all been tried, but none of them works. Am I missing any configuration parameters?
[jira] [Resolved] (SPARK-44340) Define the computing logic through PartitionEvaluator API and use it in WindowGroupLimitExec
[ https://issues.apache.org/jira/browse/SPARK-44340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-44340.
Fix Version/s: 3.5.0
Resolution: Fixed
Issue resolved by pull request 41899 [https://github.com/apache/spark/pull/41899]

> Define the computing logic through PartitionEvaluator API and use it in WindowGroupLimitExec
>
> Key: SPARK-44340
> URL: https://issues.apache.org/jira/browse/SPARK-44340
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.5.0
> Reporter: jiaan.geng
> Assignee: jiaan.geng
> Priority: Major
> Fix For: 3.5.0
[jira] [Assigned] (SPARK-44340) Define the computing logic through PartitionEvaluator API and use it in WindowGroupLimitExec
[ https://issues.apache.org/jira/browse/SPARK-44340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-44340: Assignee: jiaan.geng

> Define the computing logic through PartitionEvaluator API and use it in WindowGroupLimitExec
[jira] [Commented] (SPARK-44340) Define the computing logic through PartitionEvaluator API and use it in WindowGroupLimitExec
[ https://issues.apache.org/jira/browse/SPARK-44340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17742246#comment-17742246 ] Snoot.io commented on SPARK-44340: User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/41899

> Define the computing logic through PartitionEvaluator API and use it in WindowGroupLimitExec
[jira] [Resolved] (SPARK-43665) Enable PandasSQLStringFormatter.vformat to work with Spark Connect
[ https://issues.apache.org/jira/browse/SPARK-43665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-43665.
Fix Version/s: 3.5.0
Resolution: Fixed
Issue resolved by pull request 41931 [https://github.com/apache/spark/pull/41931]

> Enable PandasSQLStringFormatter.vformat to work with Spark Connect
>
> Key: SPARK-43665
> URL: https://issues.apache.org/jira/browse/SPARK-43665
> Project: Spark
> Issue Type: Sub-task
> Components: Connect, Pandas API on Spark
> Affects Versions: 3.5.0
> Reporter: Haejoon Lee
> Assignee: Haejoon Lee
> Priority: Major
> Fix For: 3.5.0
[jira] [Assigned] (SPARK-43665) Enable PandasSQLStringFormatter.vformat to work with Spark Connect
[ https://issues.apache.org/jira/browse/SPARK-43665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-43665: Assignee: Haejoon Lee

> Enable PandasSQLStringFormatter.vformat to work with Spark Connect
[jira] [Resolved] (SPARK-44325) Define the computing logic through PartitionEvaluator API and use it in SortMergeJoinExec
[ https://issues.apache.org/jira/browse/SPARK-44325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-44325.
Fix Version/s: 3.5.0
Resolution: Fixed
Issue resolved by pull request 41884 [https://github.com/apache/spark/pull/41884]

> Define the computing logic through PartitionEvaluator API and use it in SortMergeJoinExec
>
> Key: SPARK-44325
> URL: https://issues.apache.org/jira/browse/SPARK-44325
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.5.0
> Reporter: Vinod KC
> Assignee: Vinod KC
> Priority: Major
> Fix For: 3.5.0
[jira] [Assigned] (SPARK-44325) Define the computing logic through PartitionEvaluator API and use it in SortMergeJoinExec
[ https://issues.apache.org/jira/browse/SPARK-44325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao reassigned SPARK-44325: Assignee: Vinod KC

> Define the computing logic through PartitionEvaluator API and use it in SortMergeJoinExec
[jira] [Resolved] (SPARK-44377) exclude junit5 deps from jersey-test-framework-provider-simple
[ https://issues.apache.org/jira/browse/SPARK-44377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie resolved SPARK-44377.
Fix Version/s: 3.5.0
Resolution: Fixed
Issue resolved by pull request 41944 [https://github.com/apache/spark/pull/41944]

> exclude junit5 deps from jersey-test-framework-provider-simple
>
> Key: SPARK-44377
> URL: https://issues.apache.org/jira/browse/SPARK-44377
> Project: Spark
> Issue Type: Improvement
> Components: Build
> Affects Versions: 3.5.0
> Reporter: Yang Jie
> Assignee: Yang Jie
> Priority: Major
> Fix For: 3.5.0
>
> SPARK-44316 upgraded Jersey from 2.36 to 2.40. Starting with 2.38, Jersey uses
> [JUnit 5 instead of JUnit 4|https://github.com/eclipse-ee4j/jersey/pull/5123].
> The Spark core module uses
> `org.glassfish.jersey.test-framework.providers:jersey-test-framework-provider-simple:2.40`,
> which transitively introduces JUnit 5 dependencies. As a result, Java tests are no
> longer executed when running Maven tests on the core module.
> Run `mvn clean install -pl core -am`:
>
> {code:java}
> [INFO] --- maven-surefire-plugin:3.1.2:test (default-test) @ spark-core_2.12 ---
> [INFO] Using auto detected provider org.apache.maven.surefire.junitplatform.JUnitPlatformProvider
> [INFO]
> [INFO] ---
> [INFO] T E S T S
> [INFO] ---
> [INFO]
> [INFO] Results:
> [INFO]
> [INFO] Tests run: 0, Failures: 0, Errors: 0, Skipped: 0
> [INFO]
> [INFO]
> [INFO] --- maven-surefire-plugin:3.1.2:test (test) @ spark-core_2.12 ---
> [INFO] Skipping execution of surefire because it has already been run for this configuration{code}
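The usual shape of such a fix is an explicit `<exclusions>` block on the dependency declaration. The sketch below is illustrative, not the exact change from the PR; the precise artifacts to exclude come from inspecting `mvn dependency:tree`:

```xml
<dependency>
  <groupId>org.glassfish.jersey.test-framework.providers</groupId>
  <artifactId>jersey-test-framework-provider-simple</artifactId>
  <version>2.40</version>
  <scope>test</scope>
  <exclusions>
    <!-- Keep JUnit 5 off the test classpath so surefire keeps running JUnit 4 tests -->
    <exclusion>
      <groupId>org.junit.jupiter</groupId>
      <artifactId>*</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.junit.platform</groupId>
      <artifactId>*</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```

The symptom in the log above (surefire auto-detecting `JUnitPlatformProvider` and running 0 tests) is the tell-tale sign that JUnit 5 leaked onto the classpath.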
[jira] [Commented] (SPARK-44377) exclude junit5 deps from jersey-test-framework-provider-simple
[ https://issues.apache.org/jira/browse/SPARK-44377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17742234#comment-17742234 ] Yang Jie commented on SPARK-44377: Fixed

> exclude junit5 deps from jersey-test-framework-provider-simple
[jira] [Assigned] (SPARK-44377) exclude junit5 deps from jersey-test-framework-provider-simple
[ https://issues.apache.org/jira/browse/SPARK-44377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie reassigned SPARK-44377: Assignee: Yang Jie

> exclude junit5 deps from jersey-test-framework-provider-simple
[jira] [Updated] (SPARK-44374) Add example code
[ https://issues.apache.org/jira/browse/SPARK-44374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu updated SPARK-44374:
Fix Version/s: 3.5.0

> Add example code
>
> Key: SPARK-44374
> URL: https://issues.apache.org/jira/browse/SPARK-44374
> Project: Spark
> Issue Type: Sub-task
> Components: Connect, ML, PySpark
> Affects Versions: 3.5.0
> Reporter: Weichen Xu
> Assignee: Weichen Xu
> Priority: Major
> Fix For: 3.5.0
>
> Add example code for distributed ML <> Spark Connect.
[jira] [Resolved] (SPARK-44374) Add example code
[ https://issues.apache.org/jira/browse/SPARK-44374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu resolved SPARK-44374: Resolution: Done

> Add example code
[jira] [Commented] (SPARK-44362) Use PartitionEvaluator API in AggregateInPandasExec,EvalPythonExec,AttachDistributedSequenceExec
[ https://issues.apache.org/jira/browse/SPARK-44362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1774#comment-1774 ] jiaan.geng commented on SPARK-44362: Thank you.

> Use PartitionEvaluator API in AggregateInPandasExec, EvalPythonExec, AttachDistributedSequenceExec
>
> Key: SPARK-44362
> URL: https://issues.apache.org/jira/browse/SPARK-44362
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.5.0
> Reporter: Vinod KC
> Priority: Major
>
> Use PartitionEvaluator API in
> AggregateInPandasExec
> EvalPythonExec
> AttachDistributedSequenceExec
[jira] [Created] (SPARK-44380) Support for UDTF to analyze in Python
Takuya Ueshin created SPARK-44380:
Summary: Support for UDTF to analyze in Python
Key: SPARK-44380
URL: https://issues.apache.org/jira/browse/SPARK-44380
Project: Spark
Issue Type: Sub-task
Components: PySpark
Affects Versions: 3.5.0
Reporter: Takuya Ueshin
[jira] [Updated] (SPARK-44217) Allow custom precision for fp approx equality
[ https://issues.apache.org/jira/browse/SPARK-44217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Amanda Liu updated SPARK-44217:
Summary: Allow custom precision for fp approx equality (was: Add assert_approx_df_equality util function)

> Allow custom precision for fp approx equality
>
> Key: SPARK-44217
> URL: https://issues.apache.org/jira/browse/SPARK-44217
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 3.5.0
> Reporter: Amanda Liu
> Priority: Major
>
> SPIP: https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v
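As background, the utility under discussion compares floating-point DataFrame values within a configurable tolerance. The core check can be sketched with the standard library's math.isclose; the function name and signature here are illustrative, not the final PySpark test-util API:

```python
# Sketch of approximate row equality with a caller-supplied precision,
# in the spirit of the proposed custom-precision comparison (names illustrative).
import math

def rows_approx_equal(row_a, row_b, rel_tol=1e-5):
    """Compare two rows of floats element-wise within a relative tolerance."""
    return len(row_a) == len(row_b) and all(
        math.isclose(a, b, rel_tol=rel_tol) for a, b in zip(row_a, row_b)
    )

print(rows_approx_equal([1.0, 2.000001], [1.0, 2.0]))           # True
print(rows_approx_equal([1.0, 2.1], [1.0, 2.0], rel_tol=1e-5))  # False
```

Allowing the caller to pass the tolerance (rather than hard-coding it) is what the retitled ticket is about.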
[jira] [Resolved] (SPARK-44264) DeepSpeed Distributor
[ https://issues.apache.org/jira/browse/SPARK-44264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-44264.
Fix Version/s: 3.5.0
Resolution: Fixed
Issue resolved by pull request 41770 [https://github.com/apache/spark/pull/41770]

> DeepSpeed Distributor
>
> Key: SPARK-44264
> URL: https://issues.apache.org/jira/browse/SPARK-44264
> Project: Spark
> Issue Type: Improvement
> Components: ML, PySpark
> Affects Versions: 3.4.1
> Reporter: Lu Wang
> Priority: Critical
> Fix For: 3.5.0
>
> Make it easier for PySpark users to run distributed training and inference with DeepSpeed on Spark clusters. This was a project determined by the Databricks ML Training Team.
[jira] [Commented] (SPARK-43513) withColumnRenamed duplicates columns if new column already exists
[ https://issues.apache.org/jira/browse/SPARK-43513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17742175#comment-17742175 ] Frederik Paradis commented on SPARK-43513: Hi [~wenxin]. Thank you for your comment.
1. Didn't know what to put there, so I just put the latest version.
2. I see what you mean. However, it seems counter-intuitive to me to allow columns with the same name with no way to differentiate them other than their positions, especially since joins do move columns around and (mostly?) all operations in Spark do not support referring to columns by their positions. Beyond that, I guess it's more of a question of engineering design and vision for the software.

> withColumnRenamed duplicates columns if new column already exists
>
> Key: SPARK-43513
> URL: https://issues.apache.org/jira/browse/SPARK-43513
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 3.4.0
> Reporter: Frederik Paradis
> Priority: Major
>
> withColumnRenamed should either replace the column when the new column already exists or should document this behavior. See the code below as an example of the current state.
> {code:python}
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.master("local[1]").appName("local-spark-session").getOrCreate()
> df = spark.createDataFrame([(1, 0.5, 0.4), (2, 0.5, 0.8)], ["id", "score", "test_score"])
> r = df.withColumnRenamed("test_score", "score")
> print(r)  # DataFrame[id: bigint, score: double, score: double]
> # pyspark.sql.utils.AnalysisException: Reference 'score' is ambiguous, could be: score, score.
> print(r.select("score"))
> {code}
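The replace-on-collision semantics the reporter argues for can be illustrated outside Spark with plain Python. This is a sketch of the desired behavior on a toy column list, not PySpark's current implementation (which keeps both columns):

```python
# Sketch: renaming a column onto an existing name replaces the old column
# instead of leaving two ambiguous columns with the same name.
def with_column_renamed(columns, existing, new):
    """columns: ordered list of (name, values) pairs. Rename `existing` to
    `new`, dropping any prior column already named `new`."""
    return [(new if name == existing else name, vals)
            for name, vals in columns
            if not (name == new and name != existing)]

df = [("id", [1, 2]), ("score", [0.5, 0.5]), ("test_score", [0.4, 0.8])]
result = with_column_renamed(df, "test_score", "score")
print([name for name, _ in result])  # ['id', 'score']
```

Under these semantics, `select("score")` would be unambiguous after the rename, which is the outcome the ticket asks for (or, failing that, explicit documentation of the duplicate-column behavior).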
[jira] [Comment Edited] (SPARK-43513) withColumnRenamed duplicates columns if new column already exists
[ https://issues.apache.org/jira/browse/SPARK-43513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17742175#comment-17742175 ] Frederik Paradis edited comment on SPARK-43513 at 7/11/23 9:02 PM: --- Hi [~wenxin]. Thank you for your comment. 1. Didn't know what to put there so I just put the latest version. 2. I see what you mean. However, it seems counter-intuitive to me to allow to allow columns with the same name and no other ways to differentiate them other than their positions. Especially with the fact that joins do move columns around and that (mostly?) all operations in Spark do not support referring to columns by their positions. Beyond that, I guess it's more of a question of engineering design and vision of the software. was (Author: JIRAUSER300280): Hi [~wenxin]. Thank you for your comment. 1. Didn't what to put there so I just put the latest version. 2. I see what you mean. However, it seems counter-intuitive to me to allow to allow columns with the same name and no other ways to differentiate them other than their positions. Especially with the fact that joins do move columns around and that (mostly?) all operations in Spark do not support referring to columns by their positions. Beyond that, I guess it's more of a question of engineering design and vision of the software. > withColumnRenamed duplicates columns if new column already exists > - > > Key: SPARK-43513 > URL: https://issues.apache.org/jira/browse/SPARK-43513 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Frederik Paradis >Priority: Major > > withColumnRenamed should either replace the column when new column already > exists or should specify the specificity in the documentation. See the code > below as an example of the current state. 
> {code:python} > from pyspark.sql import SparkSession > spark = > SparkSession.builder.master("local[1]").appName("local-spark-session").getOrCreate() > df = spark.createDataFrame([(1, 0.5, 0.4), (2, 0.5, 0.8)], ["id", "score", > "test_score"]) > r = df.withColumnRenamed("test_score", "score") > print(r) # DataFrame[id: bigint, score: double, score: double] > # pyspark.sql.utils.AnalysisException: Reference 'score' is ambiguous, could > be: score, score. > print(r.select("score")) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
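The duplicate-column behavior quoted above can be reproduced without a Spark session. The following pure-Python sketch (a hypothetical helper, not Spark code) models a flat schema as a list of names and mirrors the observed behavior of withColumnRenamed, which renames without first checking whether the new name already exists:

```python
# Minimal sketch (not Spark source): rename a column by name in a flat schema.
# Like withColumnRenamed, it does not check whether `new` is already taken,
# so the rename can silently produce a duplicate column name.

def with_column_renamed(columns, existing, new):
    """Rename every occurrence of `existing` to `new`."""
    return [new if c == existing else c for c in columns]

schema = ["id", "score", "test_score"]
renamed = with_column_renamed(schema, "test_score", "score")
print(renamed)  # ['id', 'score', 'score'] -- two columns named 'score'

# Any later lookup by name is now ambiguous, matching the AnalysisException
# raised by r.select("score") in the report:
matches = [i for i, c in enumerate(renamed) if c == "score"]
print(len(matches))  # 2
```

Once the duplicate exists, position is the only remaining way to tell the two columns apart, which is exactly the disambiguation mechanism the commenter notes Spark operations do not generally support.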
[jira] [Commented] (SPARK-44279) Upgrade word-wrap
[ https://issues.apache.org/jira/browse/SPARK-44279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17742146#comment-17742146 ] Bjørn Jørgensen commented on SPARK-44279: - have a look at https://github.com/apache/spark/pull/35628 and https://github.com/apache/spark/pull/39143 > Upgrade word-wrap > - > > Key: SPARK-44279 > URL: https://issues.apache.org/jira/browse/SPARK-44279 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 3.4.1, 3.5.0 >Reporter: Bjørn Jørgensen >Priority: Major > > [Regular Expression Denial of Service (ReDoS) - > CVE-2023-26115|https://github.com/jonschlinkert/word-wrap/issues/32] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44279) Upgrade word-wrap
[ https://issues.apache.org/jira/browse/SPARK-44279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17742139#comment-17742139 ] Sean R. Owen commented on SPARK-44279: -- This is a dumb question, but what is that file? packages that what part of Spark uses? I have never seen it > Upgrade word-wrap > - > > Key: SPARK-44279 > URL: https://issues.apache.org/jira/browse/SPARK-44279 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 3.4.1, 3.5.0 >Reporter: Bjørn Jørgensen >Priority: Major > > [Regular Expression Denial of Service (ReDoS) - > CVE-2023-26115|https://github.com/jonschlinkert/word-wrap/issues/32] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44279) Upgrade word-wrap
[ https://issues.apache.org/jira/browse/SPARK-44279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17742137#comment-17742137 ] Bjørn Jørgensen commented on SPARK-44279: - [~srowen] https://github.com/apache/spark/blob/37aa62f629e652ed70505620473530cd9611018e/dev/package-lock.json#L2226 [word-wrap vulnerable to Regular Expression Denial of Service|https://github.com/jonschlinkert/word-wrap/issues/40] > Upgrade word-wrap > - > > Key: SPARK-44279 > URL: https://issues.apache.org/jira/browse/SPARK-44279 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 3.4.1, 3.5.0 >Reporter: Bjørn Jørgensen >Priority: Major > > [Regular Expression Denial of Service (ReDoS) - > CVE-2023-26115|https://github.com/jonschlinkert/word-wrap/issues/32] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44262) JdbcUtils hardcodes some SQL statements
[ https://issues.apache.org/jira/browse/SPARK-44262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-44262: - Issue Type: Improvement (was: Bug) Priority: Minor (was: Major) > JdbcUtils hardcodes some SQL statements > --- > > Key: SPARK-44262 > URL: https://issues.apache.org/jira/browse/SPARK-44262 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Florent BIVILLE >Priority: Minor > > I am currently investigating an integration with the [Neo4j JBDC > driver|https://github.com/neo4j-contrib/neo4j-jdbc] and a Spark-based cloud > vendor SDK. > > This SDK relies on Spark's {{JdbcUtils}} to run queries and insert data. > While {{JdbcUtils}} partly delegates to > \{{org.apache.spark.sql.jdbc.JdbcDialect}} for some queries, some others are > hardcoded to SQL, see: > * {{org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils#dropTable}} > * > {{org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils#getInsertStatement}} > > This works fine for relational databases but breaks for NOSQL stores that do > not support SQL translation (like Neo4j). > Is there a plan to augment the {{JdbcDialect}} surface so that it is also > responsible for these currently-hardcoded queries? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
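The delegation the reporter asks for could look like the following sketch. This is a hypothetical Python illustration of the dialect pattern, not Spark's Scala API: the idea is that the statement currently hardcoded in JdbcUtils#dropTable would instead be owned by the dialect object, so a non-SQL store such as Neo4j could override it.

```python
# Hypothetical sketch of dialect-owned statement generation (not Spark's API).
# The base class emits the SQL that JdbcUtils currently hardcodes; a store
# with its own query language overrides the method.

class JdbcDialect:
    def drop_table(self, table: str) -> str:
        # Default: plain SQL, as JdbcUtils hardcodes today.
        return f"DROP TABLE {table}"

class Neo4jDialect(JdbcDialect):
    def drop_table(self, table: str) -> str:
        # A graph store emits Cypher instead of SQL.
        return f"MATCH (n:{table}) DETACH DELETE n"

print(JdbcDialect().drop_table("users"))   # DROP TABLE users
print(Neo4jDialect().drop_table("users"))  # MATCH (n:users) DETACH DELETE n
```

With this shape, JdbcUtils would ask the active dialect for the statement rather than building it inline, which is the surface-area extension the ticket suggests.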
[jira] [Updated] (SPARK-43439) Drop does not work when passed a string with an alias
[ https://issues.apache.org/jira/browse/SPARK-43439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-43439: - Priority: Minor (was: Major) > Drop does not work when passed a string with an alias > - > > Key: SPARK-43439 > URL: https://issues.apache.org/jira/browse/SPARK-43439 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Frederik Paradis >Priority: Minor > > When passing a string to the drop method, if the string contains an alias, > the column is not dropped. However, passing a column object with the same > name and alias, it works. > {code:python} > from pyspark.sql import SparkSession > import pyspark.sql.functions as F > spark = > SparkSession.builder.master("local[1]").appName("local-spark-session").getOrCreate() > df = spark.createDataFrame([(1, 10)], ["any", "hour"]).alias("a") > j = df.drop("a.hour") > print(j) # DataFrame[any: bigint, hour: bigint] > jj = df.drop(F.col("a.hour")) > print(jj) # DataFrame[any: bigint] > {code} > > Related issues: > https://issues.apache.org/jira/browse/SPARK-31123 > https://issues.apache.org/jira/browse/SPARK-14759 > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
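A plausible reading of the asymmetry above (an assumption about the mechanism, not a quote from the Spark source) is that drop-by-string compares the literal string against unqualified column names, while drop-by-Column resolves the alias qualifier first. A small Python sketch of the two strategies:

```python
# Sketch of two name-resolution strategies (assumed mechanism, not Spark code).

def drop_by_string(columns, name):
    # Literal comparison: "a.hour" != "hour", so nothing is dropped.
    return [c for c in columns if c != name]

def drop_by_resolved(columns, qualified, alias):
    # Resolving "a.hour" against the DataFrame alias "a" yields the
    # bare name "hour", which does match a column.
    prefix = alias + "."
    bare = qualified[len(prefix):] if qualified.startswith(prefix) else qualified
    return [c for c in columns if c != bare]

cols = ["any", "hour"]
print(drop_by_string(cols, "a.hour"))         # ['any', 'hour'] -- silent no-op
print(drop_by_resolved(cols, "a.hour", "a"))  # ['any']
```

Under this reading, the string form silently no-ops (mirroring df.drop("a.hour")) while the resolved form drops the column (mirroring df.drop(F.col("a.hour"))).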
[jira] [Resolved] (SPARK-44058) Remove deprecated API usage in HiveShim.scala
[ https://issues.apache.org/jira/browse/SPARK-44058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-44058. -- Resolution: Not A Problem > Remove deprecated API usage in HiveShim.scala > - > > Key: SPARK-44058 > URL: https://issues.apache.org/jira/browse/SPARK-44058 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 3.4.0 >Reporter: Aman Raj >Priority: Major > > Spark's HiveShim.scala calls this particular method in Hive:
> createPartitionMethod.invoke(
>   hive,
>   table,
>   spec,
>   location,
>   params,        // partParams
>   null,          // inputFormat
>   null,          // outputFormat
>   -1: JInteger,  // numBuckets
>   null,          // cols
>   null,          // serializationLib
>   null,          // serdeParams
>   null,          // bucketCols
>   null)          // sortCols
>
> Hive no longer has any such implementation of createPartition. It only has this definition:
> public Partition createPartition(Table tbl, Map<String, String> partSpec) throws HiveException {
>   try {
>     org.apache.hadoop.hive.metastore.api.Partition part =
>         Partition.createMetaPartitionObject(tbl, partSpec, null);
>     AcidUtils.TableSnapshot tableSnapshot = AcidUtils.getTableSnapshot(conf, tbl);
>     part.setWriteId(tableSnapshot != null ? tableSnapshot.getWriteId() : 0);
>     return new Partition(tbl, getMSC().add_partition(part));
>   } catch (Exception e) {
>     LOG.error(StringUtils.stringifyException(e));
>     throw new HiveException(e);
>   }
> }
>
> *The 12-parameter implementation was removed in HIVE-5951.*
>
> The issue is that this 12-parameter createPartition method was added in Hive 0.12 and removed in Hive 0.13. When Spark still used Hive 0.12, the SPARK-15334 commit added the 12-parameter invocation, but after Hive moved to the newer API this was never updated in Spark OSS, which looks to us like a bug on the Spark side.
> We need to migrate to the current Hive createPartition API, otherwise this flow can break. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44379) Broadcast Joins taking up too much memory
[ https://issues.apache.org/jira/browse/SPARK-44379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shardul Mahadik updated SPARK-44379: Description: Context: After migrating to Spark 3 with AQE, we saw a significant increase in driver and executor memory usage in our jobs which contains star joins. By analyzing heapdump, we saw that majority of the memory was being taken up by {{UnsafeHashedRelation}} used for broadcast joins; in this case there were 92 broadcast joins in the query. !screenshot-1.png|width=851,height=70! This took up over 6GB of total memory, even though every table being broadcasted was around ~1MB and hence should only have been ~100MB total. I found that this is because {{BytesToBytesMap}} used within {{UnsafeHashedRelation}} allocates memory in ["pageSize" increments|https://github.com/apache/spark/blob/37aa62f629e652ed70505620473530cd9611018e/core/src/main/java/org/apache/spark/memory/MemoryConsumer.java#L117] which in our case was 64MB. Based on the [default page size calculation|https://github.com/apache/spark/blob/37aa62f629e652ed70505620473530cd9611018e/core/src/main/scala/org/apache/spark/memory/MemoryManager.scala#L251], this should be the case for any container with > 1 GB of memory (assuming executor.cores = 1) which is far too common. Thus in our case, most of the memory requested by {{BytesToBytesMap}} was un-utilized with just trailing 0s. !screenshot-2.png|width=389,height=101! I think this is a major inefficiency for broadcast joins (especially star joins). I think there are a few ways to tackle the problem. 1) Reduce {{spark.buffer.pageSize}} globally to a lower value. This does reduce the memory consumption of broadcast joins, but I am not sure what it implies for the rest of Spark machinery 2) Add a "finalize" operation to {{BytesToBytesMap}} which is called after all values are added to the map and allocates a new page only for the required bytes. 
3) Enhance the serialization of {{BytesToBytesMap}} to record the number of keys and values, and use those during deserialization to only request the required memory. 4) Use a lower page size for certain {{BytesToBytesMap}} based on the estimated data size of broadcast joins. I believe Option 3 would be simple enough to implement and I have a POC PR which I will post soon, but I am interested in knowing other people's thoughts here. was: Context: After migrating to Spark 3 with AQE, we saw a significant increase in driver and executor memory usage in our jobs which contains star joins. By analyzing heapdump, we saw that majority of the memory was being taken up by {{UnsafeHashedRelation}} used for broadcast joins; in this case there were 92 broadcast joins in the query. !screenshot-1.png|width=851,height=70! This took up over 6GB of total memory, even though every table being broadcasted was around ~1MB and hence should only have been ~100MB total. I found that this is because {{BytesToBytesMap}} used within {{UnsafeHashedRelation}} allocates memory in ["pageSize" increments|https://github.com/apache/spark/blob/37aa62f629e652ed70505620473530cd9611018e/core/src/main/java/org/apache/spark/memory/MemoryConsumer.java#L117] which in our case was 64MB. Based on the [default page size calculation|https://github.com/apache/spark/blob/37aa62f629e652ed70505620473530cd9611018e/core/src/main/scala/org/apache/spark/memory/MemoryManager.scala#L251], this should be the case for any container with > 1 GB of memory (assuming executor.cores = 1) which is far too common. Thus in our case, most of the memory requested by {{BytesToBytesMap}} was un-utilized with just trailing 0s. !screenshot-2.png|width=389,height=101! I think this is a major inefficiency for broadcast joins (especially star joins). I think there are a few ways to tackle the problem. 1) Reduce {{spark.buffer.pageSize}} globally to a lower value. 
This does reduce the memory consumption of broadcast joins, but I am not sure what it implies for the rest of Spark machinery 2) Add a "finalize" operation to {{BytesToBytesMap}} which is called after all values are added to the map and allocates a new page only for the required bytes. 3) Enhance the serialization of {{BytesToBytesMap}} to record the number of keys and values, and use those during deserialization to only request the required memory. 4) Use a lower page size for certain {{BytesToBytesMap}}s based on the estimated data size of broadcast joins. I believe Option 3 would be simple enough to implement and I have a POC PR which I will post soon, but I am interested in knowing other people's thoughts here. > Broadcast Joins taking up too much memory > - > > Key: SPARK-44379 > URL: https://issues.apache.org/jira/browse/SPARK-44379 > Project: Spark > Issue Type: Improvement > Components: SQL >
[jira] [Commented] (SPARK-44379) Broadcast Joins taking up too much memory
[ https://issues.apache.org/jira/browse/SPARK-44379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17742126#comment-17742126 ] Shardul Mahadik commented on SPARK-44379: - cc: [~cloud_fan] [~joshrosen] [~mridul] Would be interested in knowing your thoughts here. > Broadcast Joins taking up too much memory > - > > Key: SPARK-44379 > URL: https://issues.apache.org/jira/browse/SPARK-44379 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.1 >Reporter: Shardul Mahadik >Priority: Major > Attachments: screenshot-1.png, screenshot-2.png > > > Context: After migrating to Spark 3 with AQE, we saw a significant increase > in driver and executor memory usage in our jobs which contains star joins. By > analyzing heapdump, we saw that majority of the memory was being taken up by > {{UnsafeHashedRelation}} used for broadcast joins; in this case there were 92 > broadcast joins in the query. > !screenshot-1.png|width=851,height=70! > This took up over 6GB of total memory, even though every table being > broadcasted was around ~1MB and hence should only have been ~100MB total. I > found that this is because {{BytesToBytesMap}} used within > {{UnsafeHashedRelation}} allocates memory in ["pageSize" > increments|https://github.com/apache/spark/blob/37aa62f629e652ed70505620473530cd9611018e/core/src/main/java/org/apache/spark/memory/MemoryConsumer.java#L117] > which in our case was 64MB. Based on the [default page size > calculation|https://github.com/apache/spark/blob/37aa62f629e652ed70505620473530cd9611018e/core/src/main/scala/org/apache/spark/memory/MemoryManager.scala#L251], > this should be the case for any container with > 1 GB of memory (assuming > executor.cores = 1) which is far too common. Thus in our case, most of the > memory requested by {{BytesToBytesMap}} was un-utilized with just trailing 0s. > !screenshot-2.png|width=389,height=101! 
> I think this is a major inefficiency for broadcast joins (especially star > joins). I think there are a few ways to tackle the problem. > 1) Reduce {{spark.buffer.pageSize}} globally to a lower value. This does > reduce the memory consumption of broadcast joins, but I am not sure what it > implies for the rest of Spark machinery > 2) Add a "finalize" operation to {{BytesToBytesMap}} which is called after > all values are added to the map and allocates a new page only for the > required bytes. > 3) Enhance the serialization of {{BytesToBytesMap}} to record the number of > keys and values, and use those during deserialization to only request the > required memory. > 4) Use a lower page size for certain {{BytesToBytesMap}}s based on the > estimated data size of broadcast joins. > I believe Option 3 would be simple enough to implement and I have a POC PR > which I will post soon, but I am interested in knowing other people's > thoughts here. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44379) Broadcast Joins taking up too much memory
[ https://issues.apache.org/jira/browse/SPARK-44379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shardul Mahadik updated SPARK-44379: Description: Context: After migrating to Spark 3 with AQE, we saw a significant increase in driver and executor memory usage in our jobs which contains star joins. By analyzing heapdump, we saw that majority of the memory was being taken up by {{UnsafeHashedRelation}} used for broadcast joins; in this case there were 92 broadcast joins in the query. !screenshot-1.png|width=851,height=70! This took up over 6GB of total memory, even though every table being broadcasted was around ~1MB and hence should only have been ~100MB total. I found that this is because {{BytesToBytesMap}} used within {{UnsafeHashedRelation}} allocates memory in ["pageSize" increments|https://github.com/apache/spark/blob/37aa62f629e652ed70505620473530cd9611018e/core/src/main/java/org/apache/spark/memory/MemoryConsumer.java#L117] which in our case was 64MB. Based on the [default page size calculation|https://github.com/apache/spark/blob/37aa62f629e652ed70505620473530cd9611018e/core/src/main/scala/org/apache/spark/memory/MemoryManager.scala#L251], this should be the case for any container with > 1 GB of memory (assuming executor.cores = 1) which is far too common. Thus in our case, most of the memory requested by {{BytesToBytesMap}} was un-utilized with just trailing 0s. !screenshot-2.png|width=389,height=101! I think this is a major inefficiency for broadcast joins (especially star joins). I think there are a few ways to tackle the problem. 1) Reduce {{spark.buffer.pageSize}} globally to a lower value. This does reduce the memory consumption of broadcast joins, but I am not sure what it implies for the rest of Spark machinery 2) Add a "finalize" operation to {{BytesToBytesMap}} which is called after all values are added to the map and allocates a new page only for the required bytes. 
3) Enhance the serialization of {{BytesToBytesMap}} to record the number of keys and values, and use those during deserialization to only request the required memory. 4) Use a lower page size for certain {{BytesToBytesMap}}s based on the estimated data size of broadcast joins. I believe Option 3 would be simple enough to implement and I have a POC PR which I will post soon, but I am interested in knowing other people's thoughts here. was: Context: After migrating to Spark 3 with AQE, we saw a significant increase in driver and executor memory usage in our jobs which contains star joins. By analyzing heapdump, we saw that majority of the memory was being taken up by {{UnsafeHashedRelation}} used for broadcast joins; in this case there were 92 broadcast joins in the query. !image-2023-07-11-10-41-02-251.png|width=851,height=70! This took up over 6GB of total memory, even though every table being broadcasted was around ~1MB and hence should only have been ~100MB total. I found that this is because {{BytesToBytesMap}} used within {{UnsafeHashedRelation}} allocates memory in ["pageSize" increments|https://github.com/apache/spark/blob/37aa62f629e652ed70505620473530cd9611018e/core/src/main/java/org/apache/spark/memory/MemoryConsumer.java#L117] which in our case was 64MB. Based on the [default page size calculation|https://github.com/apache/spark/blob/37aa62f629e652ed70505620473530cd9611018e/core/src/main/scala/org/apache/spark/memory/MemoryManager.scala#L251], this should be the case for any container with > 1 GB of memory (assuming executor.cores = 1) which is far too common. Thus in our case, most of the memory requested by {{BytesToBytesMap}} was un-utilized with just trailing 0s. !image-2023-07-11-10-52-59-553.png|width=389,height=101! I think this is a major inefficiency for broadcast joins (especially star joins). I think there are a few ways to tackle the problem. 1) Reduce {{spark.buffer.pageSize}} globally to a lower value. 
This does reduce the memory consumption of broadcast joins, but I am not sure what it implies for the rest of Spark machinery 2) Add a "finalize" operation to {{BytesToBytesMap}} which is called after all values are added to the map and allocates a new page only for the required bytes. 3) Enhance the serialization of {{BytesToBytesMap}} to record the number of keys and values, and use those during deserialization to only request the required memory. 4) Use a lower page size for certain {{BytesToBytesMap}}s based on the estimated data size of broadcast joins. I believe Option 3 would be simple enough to implement and I have a POC PR which I will post soon, but I am interested in knowing other people's thoughts here. > Broadcast Joins taking up too much memory > - > > Key: SPARK-44379 > URL: https://issues.apache.org/jira/browse/SPARK-44379 > Project: Spark > Issue Type: Improve
[jira] [Updated] (SPARK-44379) Broadcast Joins taking up too much memory
[ https://issues.apache.org/jira/browse/SPARK-44379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shardul Mahadik updated SPARK-44379: Attachment: screenshot-1.png > Broadcast Joins taking up too much memory > - > > Key: SPARK-44379 > URL: https://issues.apache.org/jira/browse/SPARK-44379 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.1 >Reporter: Shardul Mahadik >Priority: Major > Attachments: screenshot-1.png, screenshot-2.png > > > Context: After migrating to Spark 3 with AQE, we saw a significant increase > in driver and executor memory usage in our jobs which contains star joins. By > analyzing heapdump, we saw that majority of the memory was being taken up by > {{UnsafeHashedRelation}} used for broadcast joins; in this case there were 92 > broadcast joins in the query. > !image-2023-07-11-10-41-02-251.png|width=851,height=70! > This took up over 6GB of total memory, even though every table being > broadcasted was around ~1MB and hence should only have been ~100MB total. I > found that this is because {{BytesToBytesMap}} used within > {{UnsafeHashedRelation}} allocates memory in ["pageSize" > increments|https://github.com/apache/spark/blob/37aa62f629e652ed70505620473530cd9611018e/core/src/main/java/org/apache/spark/memory/MemoryConsumer.java#L117] > which in our case was 64MB. Based on the [default page size > calculation|https://github.com/apache/spark/blob/37aa62f629e652ed70505620473530cd9611018e/core/src/main/scala/org/apache/spark/memory/MemoryManager.scala#L251], > this should be the case for any container with > 1 GB of memory (assuming > executor.cores = 1) which is far too common. Thus in our case, most of the > memory requested by {{BytesToBytesMap}} was un-utilized with just trailing 0s. > !image-2023-07-11-10-52-59-553.png|width=389,height=101! > I think this is a major inefficiency for broadcast joins (especially star > joins). I think there are a few ways to tackle the problem. 
> 1) Reduce {{spark.buffer.pageSize}} globally to a lower value. This does > reduce the memory consumption of broadcast joins, but I am not sure what it > implies for the rest of Spark machinery > 2) Add a "finalize" operation to {{BytesToBytesMap}} which is called after > all values are added to the map and allocates a new page only for the > required bytes. > 3) Enhance the serialization of {{BytesToBytesMap}} to record the number of > keys and values, and use those during deserialization to only request the > required memory. > 4) Use a lower page size for certain {{BytesToBytesMap}}s based on the > estimated data size of broadcast joins. > I believe Option 3 would be simple enough to implement and I have a POC PR > which I will post soon, but I am interested in knowing other people's > thoughts here. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44379) Broadcast Joins taking up too much memory
[ https://issues.apache.org/jira/browse/SPARK-44379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shardul Mahadik updated SPARK-44379: Attachment: screenshot-2.png > Broadcast Joins taking up too much memory > - > > Key: SPARK-44379 > URL: https://issues.apache.org/jira/browse/SPARK-44379 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.1 >Reporter: Shardul Mahadik >Priority: Major > Attachments: screenshot-1.png, screenshot-2.png > > > Context: After migrating to Spark 3 with AQE, we saw a significant increase > in driver and executor memory usage in our jobs which contains star joins. By > analyzing heapdump, we saw that majority of the memory was being taken up by > {{UnsafeHashedRelation}} used for broadcast joins; in this case there were 92 > broadcast joins in the query. > !image-2023-07-11-10-41-02-251.png|width=851,height=70! > This took up over 6GB of total memory, even though every table being > broadcasted was around ~1MB and hence should only have been ~100MB total. I > found that this is because {{BytesToBytesMap}} used within > {{UnsafeHashedRelation}} allocates memory in ["pageSize" > increments|https://github.com/apache/spark/blob/37aa62f629e652ed70505620473530cd9611018e/core/src/main/java/org/apache/spark/memory/MemoryConsumer.java#L117] > which in our case was 64MB. Based on the [default page size > calculation|https://github.com/apache/spark/blob/37aa62f629e652ed70505620473530cd9611018e/core/src/main/scala/org/apache/spark/memory/MemoryManager.scala#L251], > this should be the case for any container with > 1 GB of memory (assuming > executor.cores = 1) which is far too common. Thus in our case, most of the > memory requested by {{BytesToBytesMap}} was un-utilized with just trailing 0s. > !image-2023-07-11-10-52-59-553.png|width=389,height=101! > I think this is a major inefficiency for broadcast joins (especially star > joins). I think there are a few ways to tackle the problem. 
> 1) Reduce {{spark.buffer.pageSize}} globally to a lower value. This does > reduce the memory consumption of broadcast joins, but I am not sure what it > implies for the rest of Spark machinery > 2) Add a "finalize" operation to {{BytesToBytesMap}} which is called after > all values are added to the map and allocates a new page only for the > required bytes. > 3) Enhance the serialization of {{BytesToBytesMap}} to record the number of > keys and values, and use those during deserialization to only request the > required memory. > 4) Use a lower page size for certain {{BytesToBytesMap}}s based on the > estimated data size of broadcast joins. > I believe Option 3 would be simple enough to implement and I have a POC PR > which I will post soon, but I am interested in knowing other people's > thoughts here. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44379) Broadcast Joins taking up too much memory
Shardul Mahadik created SPARK-44379: --- Summary: Broadcast Joins taking up too much memory Key: SPARK-44379 URL: https://issues.apache.org/jira/browse/SPARK-44379 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.1 Reporter: Shardul Mahadik Context: After migrating to Spark 3 with AQE, we saw a significant increase in driver and executor memory usage in our jobs which contains star joins. By analyzing heapdump, we saw that majority of the memory was being taken up by {{UnsafeHashedRelation}} used for broadcast joins; in this case there were 92 broadcast joins in the query. !image-2023-07-11-10-41-02-251.png|width=851,height=70! This took up over 6GB of total memory, even though every table being broadcasted was around ~1MB and hence should only have been ~100MB total. I found that this is because {{BytesToBytesMap}} used within {{UnsafeHashedRelation}} allocates memory in ["pageSize" increments|https://github.com/apache/spark/blob/37aa62f629e652ed70505620473530cd9611018e/core/src/main/java/org/apache/spark/memory/MemoryConsumer.java#L117] which in our case was 64MB. Based on the [default page size calculation|https://github.com/apache/spark/blob/37aa62f629e652ed70505620473530cd9611018e/core/src/main/scala/org/apache/spark/memory/MemoryManager.scala#L251], this should be the case for any container with > 1 GB of memory (assuming executor.cores = 1) which is far too common. Thus in our case, most of the memory requested by {{BytesToBytesMap}} was un-utilized with just trailing 0s. !image-2023-07-11-10-52-59-553.png|width=389,height=101! I think this is a major inefficiency for broadcast joins (especially star joins). I think there are a few ways to tackle the problem. 1) Reduce {{spark.buffer.pageSize}} globally to a lower value. 
This does reduce the memory consumption of broadcast joins, but I am not sure what it implies for the rest of Spark machinery 2) Add a "finalize" operation to {{BytesToBytesMap}} which is called after all values are added to the map and allocates a new page only for the required bytes. 3) Enhance the serialization of {{BytesToBytesMap}} to record the number of keys and values, and use those during deserialization to only request the required memory. 4) Use a lower page size for certain {{BytesToBytesMap}}s based on the estimated data size of broadcast joins. I believe Option 3 would be simple enough to implement and I have a POC PR which I will post soon, but I am interested in knowing other people's thoughts here. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
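The numbers in the report are easy to sanity-check. The following sketch uses only the figures stated above (92 broadcast relations of roughly 1 MB each, 64 MB pages) and shows how whole-page allocation inflates about 92 MB of useful data to nearly 6 GB:

```python
# Back-of-the-envelope check of the waste described in the report.
# Figures are taken from the ticket: 92 broadcast relations of ~1 MB each,
# and a BytesToBytesMap that allocates whole 64 MB pages.

MB = 1 << 20
num_broadcasts = 92
data_per_broadcast = 1 * MB
page_size = 64 * MB

# Each map needs at least one page, so a ~1 MB relation still pins 64 MB.
pages_per_broadcast = -(-data_per_broadcast // page_size)  # ceiling division
allocated = num_broadcasts * pages_per_broadcast * page_size
useful = num_broadcasts * data_per_broadcast

print(allocated // MB)  # 5888 MB allocated -- the same order as the ~6 GB heap dump
print(useful // MB)     # 92 MB of actual data -- matching the ~100 MB estimate
```

The gap between the two numbers is exactly the trailing zero-filled page space the report observed, which is why Options 2-4 all amount to requesting memory closer to the actual data size.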
[jira] [Commented] (SPARK-44279) Upgrade word-wrap
[ https://issues.apache.org/jira/browse/SPARK-44279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17742120#comment-17742120 ] Sean R. Owen commented on SPARK-44279: -- Is this a library that's used in spark? I couldn't find it > Upgrade word-wrap > - > > Key: SPARK-44279 > URL: https://issues.apache.org/jira/browse/SPARK-44279 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 3.4.1, 3.5.0 >Reporter: Bjørn Jørgensen >Priority: Major > > [Regular Expression Denial of Service (ReDoS) - > CVE-2023-26115|https://github.com/jonschlinkert/word-wrap/issues/32] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44304) Broadcast operation is not required when no parameters are specified
[ https://issues.apache.org/jira/browse/SPARK-44304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-44304. -- Resolution: Duplicate > Broadcast operation is not required when no parameters are specified > > > Key: SPARK-44304 > URL: https://issues.apache.org/jira/browse/SPARK-44304 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: 7mming7 >Priority: Minor > > The ability introduced by SPARK-14912, we can broadcast the parameters of the > data source to the read and write operations, but if the user does not > specify a specific parameter, the propagation operation will also be > performed, which affects the performance has a greater impact, so we need to > avoid broadcasting the full Hadoop parameters when the user does not specify > a specific parameter -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44377) exclude junit5 deps from jersey-test-framework-provider-simple
[ https://issues.apache.org/jira/browse/SPARK-44377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17742105#comment-17742105 ] Sean R. Owen commented on SPARK-44377: -- Sure, can you open a PR? > exclude junit5 deps from jersey-test-framework-provider-simple > -- > > Key: SPARK-44377 > URL: https://issues.apache.org/jira/browse/SPARK-44377 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Yang Jie >Priority: Major > > SPARK-44316 upgraded Jersey from 2.36 to 2.40. Jersey 2.38 started to use > [Junit5 instead of Junit4|https://github.com/eclipse-ee4j/jersey/pull/5123]. > The Spark core module uses > `org.glassfish.jersey.test-framework.providers:jersey-test-framework-provider-simple:2.40`, > which transitively introduces JUnit 5 dependencies; this causes Java tests to > no longer be executed when running Maven tests on the core module. > run `mvn clean install -pl core -am` > > {code:java} > [INFO] --- maven-surefire-plugin:3.1.2:test (default-test) @ spark-core_2.12 > --- > [INFO] Using auto detected provider > org.apache.maven.surefire.junitplatform.JUnitPlatformProvider > [INFO] > [INFO] --- > [INFO] T E S T S > [INFO] --- > [INFO] > [INFO] Results: > [INFO] > [INFO] Tests run: 0, Failures: 0, Errors: 0, Skipped: 0 > [INFO] > [INFO] > [INFO] --- maven-surefire-plugin:3.1.2:test (test) @ spark-core_2.12 --- > [INFO] Skipping execution of surefire because it has already been run for > this configuration{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
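A Maven exclusion along these lines would keep the JUnit 5 artifacts off the core module's test classpath. This is a sketch only: the exact JUnit 5 coordinates that Jersey 2.40 pulls in are an assumption here, and the actual PR may exclude different artifacts.

```xml
<!-- Sketch: exclude the JUnit 5 dependencies that the Jersey test framework
     provider drags in, so surefire keeps auto-detecting JUnit 4 for the
     core module's Java tests. Wildcard exclusions need Maven 3.2.1+. -->
<dependency>
  <groupId>org.glassfish.jersey.test-framework.providers</groupId>
  <artifactId>jersey-test-framework-provider-simple</artifactId>
  <version>2.40</version>
  <scope>test</scope>
  <exclusions>
    <exclusion>
      <groupId>org.junit.jupiter</groupId>
      <artifactId>*</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.junit.platform</groupId>
      <artifactId>*</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```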
[jira] [Commented] (SPARK-44376) Build using maven is broken using 2.13 and Java 11 and Java 17
[ https://issues.apache.org/jira/browse/SPARK-44376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17742104#comment-17742104 ] Sean R. Owen commented on SPARK-44376: -- Did you run dev/change-scala-version.sh 2.13 ? > Build using maven is broken using 2.13 and Java 11 and Java 17 > -- > > Key: SPARK-44376 > URL: https://issues.apache.org/jira/browse/SPARK-44376 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.5.0 >Reporter: Emil Ejbyfeldt >Priority: Major > > Fails with > ``` > $ ./build/mvn compile -Pscala-2.13 -Djava.version=11 -X > ... > [WARNING] [Warn] : [deprecation @ | origin= | version=] -target is > deprecated: Use -release instead to compile against the correct platform API. > [ERROR] [Error] : target platform version 8 is older than the release version > 11 > [WARNING] one warning found > [ERROR] one error found > ... > ``` > if setting the `java.version` property or > ``` > $ ./build/mvn compile -Pscala-2.13 > ... > [WARNING] [Warn] : [deprecation @ | origin= | version=] -target is > deprecated: Use -release instead to compile against the correct platform API. 
> [ERROR] [Error] > /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/serializer/SerializationDebugger.scala:71: > not found: value sun > [ERROR] [Error] > /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:26: > not found: object sun > [ERROR] [Error] > /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:27: > not found: object sun > [ERROR] [Error] > /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:206: > not found: type DirectBuffer > [ERROR] [Error] > /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:210: > not found: type Unsafe > [ERROR] [Error] > /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:212: > not found: type Unsafe > [ERROR] [Error] > /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:213: > not found: type DirectBuffer > [ERROR] [Error] > /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:216: > not found: type DirectBuffer > [ERROR] [Error] > /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:236: > not found: type DirectBuffer > [ERROR] [Error] > /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:26: > Unused import > [ERROR] [Error] > /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:27: > Unused import > [ERROR] [Error] > /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/util/ClosureCleaner.scala:452: > not found: value sun > [ERROR] [Error] > /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:26: > not found: object sun > [ERROR] [Error] > 
/home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:99: > not found: type SignalHandler > [ERROR] [Error] > /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:99: > not found: type Signal > [ERROR] [Error] > /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:83: > not found: type Signal > [ERROR] [Error] > /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:108: > not found: type SignalHandler > [ERROR] [Error] > /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:108: > not found: value Signal > [ERROR] [Error] > /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:114: > not found: type Signal > [ERROR] [Error] > /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:116: > not found: value Signal > [ERROR] [Error] > /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:128: > not found: value Signal > [ERROR] [Error] > /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:26: > Unused import > [ERROR] [Error] > /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:26: > Unused import > [WARNING] one warning found > [ERROR] 23 errors found > ... > ``` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-un
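Sean's question points at the usual cause of the `not found: value sun` errors above: the poms must be switched to Scala 2.13 before a `-Pscala-2.13` build. A sketch of the command sequence, assuming a Spark source checkout (the script rewrites the Scala version/artifact suffixes across the poms):

```shell
# From the Spark source root: switch the build to Scala 2.13 first,
# then compile with the matching profile (flags as used in this thread).
./dev/change-scala-version.sh 2.13
./build/mvn compile -Pscala-2.13 -Djava.version=11
```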
[jira] [Updated] (SPARK-44378) Jobs that have join & have .rdd calls get executed 2x when AQE is enabled.
[ https://issues.apache.org/jira/browse/SPARK-44378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Priyanka Raju updated SPARK-44378: -- Attachment: image2.png > Jobs that have join & have .rdd calls get executed 2x when AQE is enabled. > -- > > Key: SPARK-44378 > URL: https://issues.apache.org/jira/browse/SPARK-44378 > Project: Spark > Issue Type: Question > Components: Spark Submit >Affects Versions: 3.1.2 >Reporter: Priyanka Raju >Priority: Major > Labels: aqe > Attachments: Screenshot 2023-07-11 at 9.36.14 AM.png, Screenshot > 2023-07-11 at 9.36.19 AM.png, image2.png > > > We have a few spark scala jobs that are currently running in production. Most > jobs typically use Dataset, Dataframes. There is a small code in our custom > library code, that makes rdd calls example to check if the dataframe is > empty: df.rdd.getNumPartitions == 0 > When I enable aqe for these jobs, this .rdd is converted into a separate job > of it's own and the entire dag is executed 2x, taking 2x more time. This does > not happen when AQE is disabled. Why does this happen and what is the best > way to fix the issue? > > Sample code to reproduce the issue: > > > {code:java} > import org.apache.spark.sql._ > case class Record( > id: Int, > name: String > ) > > val partCount = 4 > val input1 = (0 until 100).map(part => Record(part, "a")) > > val input2 = (100 until 110).map(part => Record(part, "c")) > > implicit val enc: Encoder[Record] = Encoders.product[Record] > > val ds1 = spark.createDataset( > spark.sparkContext > .parallelize(input1, partCount) > ) > > va > l ds2 = spark.createDataset( > spark.sparkContext > .parallelize(input2, partCount) > ) > > val ds3 = ds1.join(ds2, Seq("id")) > val l = ds3.count() > > val incomingPartitions = ds3.rdd.getNumPartitions > log.info(s"Num partitions ${incomingPartitions}") > {code} > > Spark UI for the same job with AQE, !Screenshot 2023-07-11 at 9.36.14 AM.png! 
> > Spark UI for the same job without AQE: > > !Screenshot 2023-07-11 at 9.36.19 AM.png! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44378) Jobs that have join & have .rdd calls get executed 2x when AQE is enabled.
[ https://issues.apache.org/jira/browse/SPARK-44378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Priyanka Raju updated SPARK-44378: -- Description: We have a few spark scala jobs that are currently running in production. Most jobs typically use Dataset, Dataframes. There is a small code in our custom library code, that makes rdd calls example to check if the dataframe is empty: df.rdd.getNumPartitions == 0 When I enable aqe for these jobs, this .rdd is converted into a separate job of it's own and the entire dag is executed 2x, taking 2x more time. This does not happen when AQE is disabled. Why does this happen and what is the best way to fix the issue? Sample code to reproduce the issue: {code:java} import org.apache.spark.sql._ case class Record( id: Int, name: String ) val partCount = 4 val input1 = (0 until 100).map(part => Record(part, "a")) val input2 = (100 until 110).map(part => Record(part, "c")) implicit val enc: Encoder[Record] = Encoders.product[Record] val ds1 = spark.createDataset( spark.sparkContext .parallelize(input1, partCount) ) va l ds2 = spark.createDataset( spark.sparkContext .parallelize(input2, partCount) ) val ds3 = ds1.join(ds2, Seq("id")) val l = ds3.count() val incomingPartitions = ds3.rdd.getNumPartitions log.info(s"Num partitions ${incomingPartitions}") {code} Spark UI for the same job with AQE, !Screenshot 2023-07-11 at 9.36.14 AM.png! Spark UI for the same job without AQE: !Screenshot 2023-07-11 at 9.36.19 AM.png! This is causing unexpected regression in our jobs when we try to enable AQE for our jobs in production. We use spark 3.1 in production, but I can see the same behavior in spark 3.2 from the console as well !image2.png! was: We have a few spark scala jobs that are currently running in production. Most jobs typically use Dataset, Dataframes. 
There is a small code in our custom library code, that makes rdd calls example to check if the dataframe is empty: df.rdd.getNumPartitions == 0 When I enable aqe for these jobs, this .rdd is converted into a separate job of it's own and the entire dag is executed 2x, taking 2x more time. This does not happen when AQE is disabled. Why does this happen and what is the best way to fix the issue? Sample code to reproduce the issue: {code:java} import org.apache.spark.sql._ case class Record( id: Int, name: String ) val partCount = 4 val input1 = (0 until 100).map(part => Record(part, "a")) val input2 = (100 until 110).map(part => Record(part, "c")) implicit val enc: Encoder[Record] = Encoders.product[Record] val ds1 = spark.createDataset( spark.sparkContext .parallelize(input1, partCount) ) va l ds2 = spark.createDataset( spark.sparkContext .parallelize(input2, partCount) ) val ds3 = ds1.join(ds2, Seq("id")) val l = ds3.count() val incomingPartitions = ds3.rdd.getNumPartitions log.info(s"Num partitions ${incomingPartitions}") {code} Spark UI for the same job with AQE, !Screenshot 2023-07-11 at 9.36.14 AM.png! Spark UI for the same job without AQE: !Screenshot 2023-07-11 at 9.36.19 AM.png! > Jobs that have join & have .rdd calls get executed 2x when AQE is enabled. > -- > > Key: SPARK-44378 > URL: https://issues.apache.org/jira/browse/SPARK-44378 > Project: Spark > Issue Type: Question > Components: Spark Submit >Affects Versions: 3.1.2 >Reporter: Priyanka Raju >Priority: Major > Labels: aqe > Attachments: Screenshot 2023-07-11 at 9.36.14 AM.png, Screenshot > 2023-07-11 at 9.36.19 AM.png, image2.png > > > We have a few spark scala jobs that are currently running in production. Most > jobs typically use Dataset, Dataframes. 
There is a small code in our custom > library code, that makes rdd calls example to check if the dataframe is > empty: df.rdd.getNumPartitions == 0 > When I enable aqe for these jobs, this .rdd is converted into a separate job > of it's own and the entire dag is executed 2x, taking 2x more time. This does > not happen when AQE is disabled. Why does this happen and what is the best > way to fix the issue? > > Sample code to reproduce the issue: > > > {code:java} > import org.apache.spark.sql._ > case class Record( > id: Int, > name: String > ) > > val partCount = 4 > val input1 = (0 until 100).map(part => Record(part, "a")) > > val input2 = (100 until 110).map(part => Record(part, "c")) > > implicit val enc: Encoder[Record] = Encoders.product[Record] > > val ds1 = spark.createDataset( > spark.sparkContext > .parallelize(inp
[jira] [Updated] (SPARK-44362) Use PartitionEvaluator API in AggregateInPandasExec,EvalPythonExec,AttachDistributedSequenceExec
[ https://issues.apache.org/jira/browse/SPARK-44362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod KC updated SPARK-44362: - Summary: Use PartitionEvaluator API in AggregateInPandasExec,EvalPythonExec,AttachDistributedSequenceExec (was: Use PartitionEvaluator API in AggregateInPandasExec, WindowInPandasExec,EvalPythonExec,AttachDistributedSequenceExec) > Use PartitionEvaluator API in > AggregateInPandasExec,EvalPythonExec,AttachDistributedSequenceExec > - > > Key: SPARK-44362 > URL: https://issues.apache.org/jira/browse/SPARK-44362 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: Vinod KC >Priority: Major > > Use PartitionEvaluator API in > AggregateInPandasExec > EvalPythonExec > AttachDistributedSequenceExec -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44362) Use PartitionEvaluator API in AggregateInPandasExec, WindowInPandasExec,EvalPythonExec,AttachDistributedSequenceExec
[ https://issues.apache.org/jira/browse/SPARK-44362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod KC updated SPARK-44362: - Description: Use PartitionEvaluator API in AggregateInPandasExec EvalPythonExec AttachDistributedSequenceExec was: Use PartitionEvaluator API in AggregateInPandasExec WindowInPandasExec EvalPythonExec AttachDistributedSequenceExec > Use PartitionEvaluator API in AggregateInPandasExec, > WindowInPandasExec,EvalPythonExec,AttachDistributedSequenceExec > - > > Key: SPARK-44362 > URL: https://issues.apache.org/jira/browse/SPARK-44362 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: Vinod KC >Priority: Major > > Use PartitionEvaluator API in > AggregateInPandasExec > EvalPythonExec > AttachDistributedSequenceExec -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44362) Use PartitionEvaluator API in AggregateInPandasExec, WindowInPandasExec,EvalPythonExec,AttachDistributedSequenceExec
[ https://issues.apache.org/jira/browse/SPARK-44362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17742099#comment-17742099 ] Vinod KC commented on SPARK-44362: -- yes, please go ahead > Use PartitionEvaluator API in AggregateInPandasExec, > WindowInPandasExec,EvalPythonExec,AttachDistributedSequenceExec > - > > Key: SPARK-44362 > URL: https://issues.apache.org/jira/browse/SPARK-44362 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: Vinod KC >Priority: Major > > Use PartitionEvaluator API in > AggregateInPandasExec > WindowInPandasExec > EvalPythonExec > AttachDistributedSequenceExec -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44378) Jobs that have join & have .rdd calls get executed 2x when AQE is enabled.
[ https://issues.apache.org/jira/browse/SPARK-44378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Priyanka Raju updated SPARK-44378: -- Description: We have a few spark scala jobs that are currently running in production. Most jobs typically use Dataset, Dataframes. There is a small code in our custom library code, that makes rdd calls example to check if the dataframe is empty: df.rdd.getNumPartitions == 0 When I enable aqe for these jobs, this .rdd is converted into a separate job of it's own and the entire dag is executed 2x, taking 2x more time. This does not happen when AQE is disabled. Why does this happen and what is the best way to fix the issue? Sample code to reproduce the issue: {code:java} import org.apache.spark.sql._ case class Record( id: Int, name: String ) val partCount = 4 val input1 = (0 until 100).map(part => Record(part, "a")) val input2 = (100 until 110).map(part => Record(part, "c")) implicit val enc: Encoder[Record] = Encoders.product[Record] val ds1 = spark.createDataset( spark.sparkContext .parallelize(input1, partCount) ) va l ds2 = spark.createDataset( spark.sparkContext .parallelize(input2, partCount) ) val ds3 = ds1.join(ds2, Seq("id")) val l = ds3.count() val incomingPartitions = ds3.rdd.getNumPartitions log.info(s"Num partitions ${incomingPartitions}") {code} Spark UI for the same job with AQE, !Screenshot 2023-07-11 at 9.36.14 AM.png! Spark UI for the same job without AQE: !Screenshot 2023-07-11 at 9.36.19 AM.png! was: We have a few spark scala jobs that are currently running in production. Most jobs typically use Dataset, Dataframes. There is a small code in our custom library code, that makes rdd calls example to check if the dataframe is empty: df.rdd.getNumPartitions == 0 When I enable aqe for these jobs, this .rdd is converted into a separate job of it's own and the entire dag is executed 2x, taking 2x more time. This does not happen when AQE is disabled. 
Why does this happen and what is the best way to fix the issue? Sample code to reproduce the issue: {code:java} import org.apache.spark.sql._ case class Record( id: Int, name: String ) val partCount = 4 val input1 = (0 until 100).map(part => Record(part, "a")) val input2 = (100 until 110).map(part => Record(part, "c")) implicit val enc: Encoder[Record] = Encoders.product[Record] val ds1 = spark.createDataset( spark.sparkContext .parallelize(input1, partCount) ) va l ds2 = spark.createDataset( spark.sparkContext .parallelize(input2, partCount) ) val ds3 = ds1.join(ds2, Seq("id")) val l = ds3.count() val incomingPartitions = ds3.rdd.getNumPartitions log.info(s"Num partitions ${incomingPartitions}") {code} Spark UI for the same job with AQE, !Screenshot 2023-07-11 at 9.36.14 AM.png! > Jobs that have join & have .rdd calls get executed 2x when AQE is enabled. > -- > > Key: SPARK-44378 > URL: https://issues.apache.org/jira/browse/SPARK-44378 > Project: Spark > Issue Type: Question > Components: Spark Submit >Affects Versions: 3.1.2 >Reporter: Priyanka Raju >Priority: Major > Labels: aqe > Attachments: Screenshot 2023-07-11 at 9.36.14 AM.png, Screenshot > 2023-07-11 at 9.36.19 AM.png > > > We have a few spark scala jobs that are currently running in production. Most > jobs typically use Dataset, Dataframes. There is a small code in our custom > library code, that makes rdd calls example to check if the dataframe is > empty: df.rdd.getNumPartitions == 0 > When I enable aqe for these jobs, this .rdd is converted into a separate job > of it's own and the entire dag is executed 2x, taking 2x more time. This does > not happen when AQE is disabled. Why does this happen and what is the best > way to fix the issue? 
> > Sample code to reproduce the issue: > > > {code:java} > import org.apache.spark.sql._ > case class Record( > id: Int, > name: String > ) > > val partCount = 4 > val input1 = (0 until 100).map(part => Record(part, "a")) > > val input2 = (100 until 110).map(part => Record(part, "c")) > > implicit val enc: Encoder[Record] = Encoders.product[Record] > > val ds1 = spark.createDataset( > spark.sparkContext > .parallelize(input1, partCount) > ) > > va > l ds2 = spark.createDataset( > spark.sparkContext > .parallelize(input2, partCount) > ) > > val ds3 = ds1.join(ds2, Seq("id")) > val l = ds3.count() > > val incomingPartitions = ds3.rdd.getNumPartitions > log.info(s"Num partitions ${incomingPartit
[jira] [Updated] (SPARK-44378) Jobs that have join & have .rdd calls get executed 2x when AQE is enabled.
[ https://issues.apache.org/jira/browse/SPARK-44378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Priyanka Raju updated SPARK-44378: -- Attachment: Screenshot 2023-07-11 at 9.36.19 AM.png > Jobs that have join & have .rdd calls get executed 2x when AQE is enabled. > -- > > Key: SPARK-44378 > URL: https://issues.apache.org/jira/browse/SPARK-44378 > Project: Spark > Issue Type: Question > Components: Spark Submit >Affects Versions: 3.1.2 >Reporter: Priyanka Raju >Priority: Major > Labels: aqe > Attachments: Screenshot 2023-07-11 at 9.36.14 AM.png, Screenshot > 2023-07-11 at 9.36.19 AM.png > > > We have a few spark scala jobs that are currently running in production. Most > jobs typically use Dataset, Dataframes. There is a small code in our custom > library code, that makes rdd calls example to check if the dataframe is > empty: df.rdd.getNumPartitions == 0 > When I enable aqe for these jobs, this .rdd is converted into a separate job > of it's own and the entire dag is executed 2x, taking 2x more time. This does > not happen when AQE is disabled. Why does this happen and what is the best > way to fix the issue? > > Sample code to reproduce the issue: > > > {code:java} > import org.apache.spark.sql._ > case class Record( > id: Int, > name: String > ) > > val partCount = 4 > val input1 = (0 until 100).map(part => Record(part, "a")) > > val input2 = (100 until 110).map(part => Record(part, "c")) > > implicit val enc: Encoder[Record] = Encoders.product[Record] > > val ds1 = spark.createDataset( > spark.sparkContext > .parallelize(input1, partCount) > ) > > va > l ds2 = spark.createDataset( > spark.sparkContext > .parallelize(input2, partCount) > ) > > val ds3 = ds1.join(ds2, Seq("id")) > val l = ds3.count() > > val incomingPartitions = ds3.rdd.getNumPartitions > log.info(s"Num partitions ${incomingPartitions}") > {code} > > Spark UI for the same job with AQE, !Screenshot 2023-07-11 at 9.36.14 AM.png! 
> > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44378) Jobs that have join & have .rdd calls get executed 2x when AQE is enabled.
[ https://issues.apache.org/jira/browse/SPARK-44378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Priyanka Raju updated SPARK-44378: -- Attachment: Screenshot 2023-07-11 at 9.36.14 AM.png > Jobs that have join & have .rdd calls get executed 2x when AQE is enabled. > -- > > Key: SPARK-44378 > URL: https://issues.apache.org/jira/browse/SPARK-44378 > Project: Spark > Issue Type: Question > Components: Spark Submit >Affects Versions: 3.1.2 >Reporter: Priyanka Raju >Priority: Major > Labels: aqe > Attachments: Screenshot 2023-07-11 at 9.36.14 AM.png > > > We have a few spark scala jobs that are currently running in production. Most > jobs typically use Dataset, Dataframes. There is a small code in our custom > library code, that makes rdd calls example to check if the dataframe is > empty: df.rdd.getNumPartitions == 0 > When I enable aqe for these jobs, this .rdd is converted into a separate job > of it's own and the entire dag is executed 2x, taking 2x more time. This does > not happen when AQE is disabled. Why does this happen and what is the best > way to fix the issue? 
> > Sample code to reproduce the issue: > > > {code:java} > import org.apache.spark.sql._ > case class Record( > id: Int, > name: String > ) > > val partCount = 4 > val input1 = (0 until 100).map(part => Record(part, "a")) > > val input2 = (100 until 110).map(part => Record(part, "c")) > > implicit val enc: Encoder[Record] = Encoders.product[Record] > > val ds1 = spark.createDataset( > spark.sparkContext > .parallelize(input1, partCount) > ) > > val ds2 = spark.createDataset( > spark.sparkContext > .parallelize(input2, partCount) > ) > > val ds3 = ds1.join(ds2, Seq("id")) > val l = ds3.count() > > val incomingPartitions = ds3.rdd.getNumPartitions > log.info(s"Num partitions ${incomingPartitions}") > {code} > > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44378) Jobs that have join & have .rdd calls get executed 2x when AQE is enabled.
[ https://issues.apache.org/jira/browse/SPARK-44378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Priyanka Raju updated SPARK-44378: -- Description: We have a few spark scala jobs that are currently running in production. Most jobs typically use Dataset, Dataframes. There is a small code in our custom library code, that makes rdd calls example to check if the dataframe is empty: df.rdd.getNumPartitions == 0 When I enable aqe for these jobs, this .rdd is converted into a separate job of it's own and the entire dag is executed 2x, taking 2x more time. This does not happen when AQE is disabled. Why does this happen and what is the best way to fix the issue? Sample code to reproduce the issue: {code:java} import org.apache.spark.sql._ case class Record( id: Int, name: String ) val partCount = 4 val input1 = (0 until 100).map(part => Record(part, "a")) val input2 = (100 until 110).map(part => Record(part, "c")) implicit val enc: Encoder[Record] = Encoders.product[Record] val ds1 = spark.createDataset( spark.sparkContext .parallelize(input1, partCount) ) va l ds2 = spark.createDataset( spark.sparkContext .parallelize(input2, partCount) ) val ds3 = ds1.join(ds2, Seq("id")) val l = ds3.count() val incomingPartitions = ds3.rdd.getNumPartitions log.info(s"Num partitions ${incomingPartitions}") {code} Spark UI for the same job with AQE, !Screenshot 2023-07-11 at 9.36.14 AM.png! was: We have a few spark scala jobs that are currently running in production. Most jobs typically use Dataset, Dataframes. There is a small code in our custom library code, that makes rdd calls example to check if the dataframe is empty: df.rdd.getNumPartitions == 0 When I enable aqe for these jobs, this .rdd is converted into a separate job of it's own and the entire dag is executed 2x, taking 2x more time. This does not happen when AQE is disabled. Why does this happen and what is the best way to fix the issue? 
Sample code to reproduce the issue: {code:java} import org.apache.spark.sql._ case class Record( id: Int, name: String ) val partCount = 4 val input1 = (0 until 100).map(part => Record(part, "a")) val input2 = (100 until 110).map(part => Record(part, "c")) implicit val enc: Encoder[Record] = Encoders.product[Record] val ds1 = spark.createDataset( spark.sparkContext .parallelize(input1, partCount) ) val ds2 = spark.createDataset( spark.sparkContext .parallelize(input2, partCount) ) val ds3 = ds1.join(ds2, Seq("id")) val l = ds3.count() val incomingPartitions = ds3.rdd.getNumPartitions log.info(s"Num partitions ${incomingPartitions}") {code} > Jobs that have join & have .rdd calls get executed 2x when AQE is enabled. > -- > > Key: SPARK-44378 > URL: https://issues.apache.org/jira/browse/SPARK-44378 > Project: Spark > Issue Type: Question > Components: Spark Submit >Affects Versions: 3.1.2 >Reporter: Priyanka Raju >Priority: Major > Labels: aqe > Attachments: Screenshot 2023-07-11 at 9.36.14 AM.png > > > We have a few spark scala jobs that are currently running in production. Most > jobs typically use Dataset, Dataframes. There is a small code in our custom > library code, that makes rdd calls example to check if the dataframe is > empty: df.rdd.getNumPartitions == 0 > When I enable aqe for these jobs, this .rdd is converted into a separate job > of it's own and the entire dag is executed 2x, taking 2x more time. This does > not happen when AQE is disabled. Why does this happen and what is the best > way to fix the issue? 
> > Sample code to reproduce the issue: > > > {code:java} > import org.apache.spark.sql._ > case class Record( > id: Int, > name: String > ) > > val partCount = 4 > val input1 = (0 until 100).map(part => Record(part, "a")) > > val input2 = (100 until 110).map(part => Record(part, "c")) > > implicit val enc: Encoder[Record] = Encoders.product[Record] > > val ds1 = spark.createDataset( > spark.sparkContext > .parallelize(input1, partCount) > ) > > va > l ds2 = spark.createDataset( > spark.sparkContext > .parallelize(input2, partCount) > ) > > val ds3 = ds1.join(ds2, Seq("id")) > val l = ds3.count() > > val incomingPartitions = ds3.rdd.getNumPartitions > log.info(s"Num partitions ${incomingPartitions}") > {code} > > Spark UI for the same job with AQE, !Screenshot 2023-07-11 at 9.36.14 AM.png! > > -- This message was sent by Atlassian Jira (v8.20.10#820010) -
[jira] [Created] (SPARK-44378) Jobs that have join & have .rdd calls get executed 2x when AQE is enabled.
Priyanka Raju created SPARK-44378: - Summary: Jobs that have join & have .rdd calls get executed 2x when AQE is enabled. Key: SPARK-44378 URL: https://issues.apache.org/jira/browse/SPARK-44378 Project: Spark Issue Type: Question Components: Spark Submit Affects Versions: 3.1.2 Reporter: Priyanka Raju We have a few Spark Scala jobs currently running in production. Most jobs typically use Datasets and Dataframes. There is a small piece of code in our custom library that makes RDD calls, for example to check whether a dataframe is empty: df.rdd.getNumPartitions == 0 When I enable AQE for these jobs, this .rdd is converted into a separate job of its own and the entire DAG is executed 2x, taking 2x more time. This does not happen when AQE is disabled. Why does this happen, and what is the best way to fix the issue? Sample code to reproduce the issue: {code:java} import org.apache.spark.sql._ case class Record( id: Int, name: String ) val partCount = 4 val input1 = (0 until 100).map(part => Record(part, "a")) val input2 = (100 until 110).map(part => Record(part, "c")) implicit val enc: Encoder[Record] = Encoders.product[Record] val ds1 = spark.createDataset( spark.sparkContext .parallelize(input1, partCount) ) val ds2 = spark.createDataset( spark.sparkContext .parallelize(input2, partCount) ) val ds3 = ds1.join(ds2, Seq("id")) val l = ds3.count() val incomingPartitions = ds3.rdd.getNumPartitions log.info(s"Num partitions ${incomingPartitions}") {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
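For the emptiness check that triggers the extra job, staying inside the Dataset API avoids materializing the plan through `.rdd`. A minimal sketch (an editorial suggestion, not from the thread; assumes Spark 2.4+ and the `ds3` from the snippet above):

```scala
// df.rdd.getNumPartitions == 0 forces a conversion to an RDD; under AQE the
// finalized plan is not reused, so the whole DAG can run again as a separate
// job. Dataset.isEmpty (Spark 2.4+) only runs a cheap limit-1 job instead:
val empty = ds3.isEmpty
// equivalent hand-rolled form:
val emptyManual = ds3.head(1).isEmpty
// If the RDD partition count really is needed, cache the Dataset first so the
// .rdd conversion can reuse the computed result instead of re-running the join:
val cached = ds3.cache()
val incomingPartitions = cached.rdd.getNumPartitions
```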
[jira] [Resolved] (SPARK-44360) Support schema pruning in delta-based MERGE operations
[ https://issues.apache.org/jira/browse/SPARK-44360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-44360. --- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 41930 [https://github.com/apache/spark/pull/41930] > Support schema pruning in delta-based MERGE operations > -- > > Key: SPARK-44360 > URL: https://issues.apache.org/jira/browse/SPARK-44360 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: Anton Okolnychyi >Assignee: Anton Okolnychyi >Priority: Major > Fix For: 3.5.0 > > > We need to support schema pruning in delta-based MERGE operations. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44360) Support schema pruning in delta-based MERGE operations
[ https://issues.apache.org/jira/browse/SPARK-44360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-44360: - Assignee: Anton Okolnychyi > Support schema pruning in delta-based MERGE operations > -- > > Key: SPARK-44360 > URL: https://issues.apache.org/jira/browse/SPARK-44360 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: Anton Okolnychyi >Assignee: Anton Okolnychyi >Priority: Major > > We need to support schema pruning in delta-based MERGE operations.
[jira] [Updated] (SPARK-44377) exclude junit5 deps from jersey-test-framework-provider-simple
[ https://issues.apache.org/jira/browse/SPARK-44377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-44377: - Description: SPARK-44316 upgraded Jersey from 2.36 to 2.40. Jersey 2.38 started to use [JUnit 5 instead of JUnit 4|https://github.com/eclipse-ee4j/jersey/pull/5123]. The Spark core module uses `org.glassfish.jersey.test-framework.providers:jersey-test-framework-provider-simple:2.40`, which transitively introduces JUnit 5 dependencies; as a result, Java tests are no longer executed when running Maven tests on the core module. Run `mvn clean install -pl core -am`: {code:java} [INFO] --- maven-surefire-plugin:3.1.2:test (default-test) @ spark-core_2.12 --- [INFO] Using auto detected provider org.apache.maven.surefire.junitplatform.JUnitPlatformProvider [INFO] [INFO] --- [INFO] T E S T S [INFO] --- [INFO] [INFO] Results: [INFO] [INFO] Tests run: 0, Failures: 0, Errors: 0, Skipped: 0 [INFO] [INFO] [INFO] --- maven-surefire-plugin:3.1.2:test (test) @ spark-core_2.12 --- [INFO] Skipping execution of surefire because it has already been run for this configuration{code} was: SPARK-44316 upgraded Jersey from 2.36 to 2.40. Jersey 2.38 started to use JUnit 5 instead of JUnit 4. The Spark core module uses `org.glassfish.jersey.test-framework.providers:jersey-test-framework-provider-simple:2.40`, which transitively introduces JUnit 5 dependencies; as a result, Java tests are no longer executed when running Maven tests on the core module. 
run `mvn clean install -pl core -am` {code:java} [INFO] --- maven-surefire-plugin:3.1.2:test (default-test) @ spark-core_2.12 --- [INFO] Using auto detected provider org.apache.maven.surefire.junitplatform.JUnitPlatformProvider [INFO] [INFO] --- [INFO] T E S T S [INFO] --- [INFO] [INFO] Results: [INFO] [INFO] Tests run: 0, Failures: 0, Errors: 0, Skipped: 0 [INFO] [INFO] [INFO] --- maven-surefire-plugin:3.1.2:test (test) @ spark-core_2.12 --- [INFO] Skipping execution of surefire because it has already been run for this configuration{code} > exclude junit5 deps from jersey-test-framework-provider-simple > -- > > Key: SPARK-44377 > URL: https://issues.apache.org/jira/browse/SPARK-44377 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Yang Jie >Priority: Major > > SPARK-44316 upgrade Jersey from 2.36 to 2.40. Jersey 2.38 start to use > [Junit5 instead of Junit4|https://github.com/eclipse-ee4j/jersey/pull/5123], > Spark core module uses > `org.glassfish.jersey.test-framework.providers:jersey-test-framework-provider-simple:2.40`, > which cascades and introduces the dependencies of Junit5, this causes Java > tests no longer be executed when performing maven tests on the core module. 
> run `mvn clean install -pl core -am` > > {code:java} > [INFO] --- maven-surefire-plugin:3.1.2:test (default-test) @ spark-core_2.12 > --- > [INFO] Using auto detected provider > org.apache.maven.surefire.junitplatform.JUnitPlatformProvider > [INFO] > [INFO] --- > [INFO] T E S T S > [INFO] --- > [INFO] > [INFO] Results: > [INFO] > [INFO] Tests run: 0, Failures: 0, Errors: 0, Skipped: 0 > [INFO] > [INFO] > [INFO] --- maven-surefire-plugin:3.1.2:test (test) @ spark-core_2.12 --- > [INFO] Skipping execution of surefire because it has already been run for > this configuration{code}
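A possible shape for the fix the summary proposes, as a sketch only: exclude the JUnit 5 artifacts when declaring the Jersey test-framework dependency in the relevant pom.xml. The coordinates to exclude (org.junit.jupiter:junit-jupiter) are an assumption based on Jersey's JUnit 5 migration; the exact artifacts should be confirmed with `mvn dependency:tree -pl core`.

```xml
<!-- Sketch only: exclude the transitively introduced JUnit 5 artifacts from
     the Jersey test framework provider. The exclusion coordinates below are
     assumptions; verify them against `mvn dependency:tree`. -->
<dependency>
  <groupId>org.glassfish.jersey.test-framework.providers</groupId>
  <artifactId>jersey-test-framework-provider-simple</artifactId>
  <version>2.40</version>
  <scope>test</scope>
  <exclusions>
    <exclusion>
      <groupId>org.junit.jupiter</groupId>
      <artifactId>junit-jupiter</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```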
[jira] [Created] (SPARK-44377) exclude junit5 deps from jersey-test-framework-provider-simple
Yang Jie created SPARK-44377: Summary: exclude junit5 deps from jersey-test-framework-provider-simple Key: SPARK-44377 URL: https://issues.apache.org/jira/browse/SPARK-44377 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.5.0 Reporter: Yang Jie SPARK-44316 upgraded Jersey from 2.36 to 2.40. Jersey 2.38 started to use JUnit 5 instead of JUnit 4. The Spark core module uses `org.glassfish.jersey.test-framework.providers:jersey-test-framework-provider-simple:2.40`, which transitively introduces JUnit 5 dependencies; as a result, Java tests are no longer executed when running Maven tests on the core module. Run `mvn clean install -pl core -am`: {code:java} [INFO] --- maven-surefire-plugin:3.1.2:test (default-test) @ spark-core_2.12 --- [INFO] Using auto detected provider org.apache.maven.surefire.junitplatform.JUnitPlatformProvider [INFO] [INFO] --- [INFO] T E S T S [INFO] --- [INFO] [INFO] Results: [INFO] [INFO] Tests run: 0, Failures: 0, Errors: 0, Skipped: 0 [INFO] [INFO] [INFO] --- maven-surefire-plugin:3.1.2:test (test) @ spark-core_2.12 --- [INFO] Skipping execution of surefire because it has already been run for this configuration{code}
[jira] [Created] (SPARK-44376) Build using maven is broken using 2.13 and Java 11 and Java 17
Emil Ejbyfeldt created SPARK-44376: -- Summary: Build using maven is broken using 2.13 and Java 11 and Java 17 Key: SPARK-44376 URL: https://issues.apache.org/jira/browse/SPARK-44376 Project: Spark Issue Type: Bug Components: Build Affects Versions: 3.5.0 Reporter: Emil Ejbyfeldt The build fails with ``` $ ./build/mvn compile -Pscala-2.13 -Djava.version=11 -X ... [WARNING] [Warn] : [deprecation @ | origin= | version=] -target is deprecated: Use -release instead to compile against the correct platform API. [ERROR] [Error] : target platform version 8 is older than the release version 11 [WARNING] one warning found [ERROR] one error found ... ``` when setting the `java.version` property, or with ``` $ ./build/mvn compile -Pscala-2.13 ... [WARNING] [Warn] : [deprecation @ | origin= | version=] -target is deprecated: Use -release instead to compile against the correct platform API. [ERROR] [Error] /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/serializer/SerializationDebugger.scala:71: not found: value sun [ERROR] [Error] /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:26: not found: object sun [ERROR] [Error] /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:27: not found: object sun [ERROR] [Error] /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:206: not found: type DirectBuffer [ERROR] [Error] /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:210: not found: type Unsafe [ERROR] [Error] /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:212: not found: type Unsafe [ERROR] [Error] /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:213: not found: type DirectBuffer [ERROR] [Error] 
/home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:216: not found: type DirectBuffer [ERROR] [Error] /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:236: not found: type DirectBuffer [ERROR] [Error] /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:26: Unused import [ERROR] [Error] /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/storage/StorageUtils.scala:27: Unused import [ERROR] [Error] /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/util/ClosureCleaner.scala:452: not found: value sun [ERROR] [Error] /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:26: not found: object sun [ERROR] [Error] /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:99: not found: type SignalHandler [ERROR] [Error] /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:99: not found: type Signal [ERROR] [Error] /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:83: not found: type Signal [ERROR] [Error] /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:108: not found: type SignalHandler [ERROR] [Error] /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:108: not found: value Signal [ERROR] [Error] /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:114: not found: type Signal [ERROR] [Error] /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:116: not found: value Signal [ERROR] [Error] /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:128: not found: value Signal [ERROR] [Error] 
/home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:26: Unused import [ERROR] [Error] /home/eejbyfeldt/dev/apache/spark/core/src/main/scala/org/apache/spark/util/SignalUtils.scala:26: Unused import [WARNING] one warning found [ERROR] 23 errors found ... ```
[jira] [Comment Edited] (SPARK-33782) Place spark.files, spark.jars and spark.files under the current working directory on the driver in K8S cluster mode
[ https://issues.apache.org/jira/browse/SPARK-33782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17742006#comment-17742006 ] Pratik Malani edited comment on SPARK-33782 at 7/11/23 1:33 PM: Hi [~pralabhkumar] The latest update in SparkSubmit.scala is causing a FileNotFoundException. The jar mentioned below is present at /opt/spark/work-dir/, but the Files.copy call in SparkSubmit.scala fails. Can you please help check what the possible cause could be? {code:java} Files local:///opt/spark/work-dir/sample.jar from /opt/spark/work-dir/sample.jar to /opt/spark/work-dir/./sample.jar Exception in thread "main" java.nio.file.NoSuchFileException: /opt/spark/work-dir/sample.jar at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86) at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107) at sun.nio.fs.UnixCopyFile.copy(UnixCopyFile.java:526) at sun.nio.fs.UnixFileSystemProvider.copy(UnixFileSystemProvider.java:253) at java.nio.file.Files.copy(Files.java:1274) at org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$14(SparkSubmit.scala:437) at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at scala.collection.TraversableLike.map(TraversableLike.scala:286) at scala.collection.TraversableLike.map$(TraversableLike.scala:279) at scala.collection.AbstractTraversable.map(Traversable.scala:108) at org.apache.spark.deploy.SparkSubmit.downloadResourcesToCurrentDirectory$1(SparkSubmit.scala:424) at org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$17(SparkSubmit.scala:449) at scala.Option.map(Option.scala:230) 
at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:449) at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955) at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:192) {code} was (Author: JIRAUSER296450): Hi [~pralabhkumar] The latest update in the SparkSubmit.scala is causing the FileNotFoundException. The below mentioned jar is present at the said location, but the Files.copy statement in the SparkSubmit.scala is causing the issue. Can you please help to check what could be possible cause? {code:java} Files local:///opt/spark/work-dir/database-scripts-1.1-SNAPSHOT.jar from /opt/spark/work-dir/database-scripts-1.1-SNAPSHOT.jar to /opt/spark/work-dir/./database-scripts-1.1-SNAPSHOT.jar Exception in thread "main" java.nio.file.NoSuchFileException: /opt/spark/work-dir/database-scripts-1.1-SNAPSHOT.jar at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86) at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107) at sun.nio.fs.UnixCopyFile.copy(UnixCopyFile.java:526) at sun.nio.fs.UnixFileSystemProvider.copy(UnixFileSystemProvider.java:253) at java.nio.file.Files.copy(Files.java:1274) at org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$14(SparkSubmit.scala:437) at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at scala.collection.TraversableLike.map(TraversableLike.scala:286) at scala.collection.TraversableLike.map$(TraversableLike.scala:279) at scala.collection.AbstractTraversable.map(Traversable.scala:108) at 
org.apache.spark.deploy.SparkSubmit.downloadResourcesToCurrentDirectory$1(SparkSubmit.scala:424) at org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$17(SparkSubmit.scala:449) at scala.Option.map(Option.scala:230) at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:449) at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955) at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:192) {code} > Place spark.files, spark.jars and spark.files under the current working > directory on the driver in K8S cluster mode >
[jira] [Commented] (SPARK-33782) Place spark.files, spark.jars and spark.files under the current working directory on the driver in K8S cluster mode
[ https://issues.apache.org/jira/browse/SPARK-33782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17742006#comment-17742006 ] Pratik Malani commented on SPARK-33782: --- Hi [~pralabhkumar] The latest update in SparkSubmit.scala is causing a FileNotFoundException. The jar mentioned below is present at the expected location, but the Files.copy call in SparkSubmit.scala fails. Can you please help check what the possible cause could be? {code:java} Files local:///opt/spark/work-dir/database-scripts-1.1-SNAPSHOT.jar from /opt/spark/work-dir/database-scripts-1.1-SNAPSHOT.jar to /opt/spark/work-dir/./database-scripts-1.1-SNAPSHOT.jar Exception in thread "main" java.nio.file.NoSuchFileException: /opt/spark/work-dir/database-scripts-1.1-SNAPSHOT.jar at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86) at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107) at sun.nio.fs.UnixCopyFile.copy(UnixCopyFile.java:526) at sun.nio.fs.UnixFileSystemProvider.copy(UnixFileSystemProvider.java:253) at java.nio.file.Files.copy(Files.java:1274) at org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$14(SparkSubmit.scala:437) at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at scala.collection.TraversableLike.map(TraversableLike.scala:286) at scala.collection.TraversableLike.map$(TraversableLike.scala:279) at scala.collection.AbstractTraversable.map(Traversable.scala:108) at org.apache.spark.deploy.SparkSubmit.downloadResourcesToCurrentDirectory$1(SparkSubmit.scala:424) at 
org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$17(SparkSubmit.scala:449) at scala.Option.map(Option.scala:230) at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:449) at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955) at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:192) {code} > Place spark.files, spark.jars and spark.files under the current working > directory on the driver in K8S cluster mode > --- > > Key: SPARK-33782 > URL: https://issues.apache.org/jira/browse/SPARK-33782 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.2.0 >Reporter: Hyukjin Kwon >Assignee: Pralabh Kumar >Priority: Major > Fix For: 3.4.0 > > > In Yarn cluster modes, the passed files are able to be accessed in the > current working directory. Looks like this is not the case in Kubernates > cluset mode. > By doing this, users can, for example, leverage PEX to manage Python > dependences in Apache Spark: > {code} > pex pyspark==3.0.1 pyarrow==0.15.1 pandas==0.25.3 -o myarchive.pex > PYSPARK_PYTHON=./myarchive.pex spark-submit --files myarchive.pex > {code} > See also https://github.com/apache/spark/pull/30735/files#r540935585. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
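An observation on the paths in the logs above, illustrated with plain Python (not Spark code): the source and destination of the failing Files.copy differ only by a redundant "./" component, so they normalize to the same file and the copy is effectively a self-copy. Whether that is what actually triggers the NoSuchFileException here is an assumption for the Spark developers to confirm.

```python
import os.path

# The failing copy in the log goes from /opt/spark/work-dir/sample.jar to
# /opt/spark/work-dir/./sample.jar. Normalizing both paths shows they name
# the same file, i.e. the copy is a self-copy.
src = "/opt/spark/work-dir/sample.jar"
dst = "/opt/spark/work-dir/./sample.jar"
print(os.path.normpath(dst))                            # /opt/spark/work-dir/sample.jar
print(os.path.normpath(src) == os.path.normpath(dst))   # True
```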
[jira] [Created] (SPARK-44375) Use PartitionEvaluator API in DebugExec
Jia Fan created SPARK-44375: --- Summary: Use PartitionEvaluator API in DebugExec Key: SPARK-44375 URL: https://issues.apache.org/jira/browse/SPARK-44375 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.5.0 Reporter: Jia Fan Use PartitionEvaluator API in DebugExec
[jira] [Commented] (SPARK-44375) Use PartitionEvaluator API in DebugExec
[ https://issues.apache.org/jira/browse/SPARK-44375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17742004#comment-17742004 ] Jia Fan commented on SPARK-44375: - I'm working on it. > Use PartitionEvaluator API in DebugExec > --- > > Key: SPARK-44375 > URL: https://issues.apache.org/jira/browse/SPARK-44375 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: Jia Fan >Priority: Major > > Use PartitionEvaluator API in DebugExec
[jira] [Created] (SPARK-44374) Add example code
Weichen Xu created SPARK-44374: -- Summary: Add example code Key: SPARK-44374 URL: https://issues.apache.org/jira/browse/SPARK-44374 Project: Spark Issue Type: Sub-task Components: Connect, ML, PySpark Affects Versions: 3.5.0 Reporter: Weichen Xu Add example code for distributed ML <> Spark Connect.
[jira] [Assigned] (SPARK-44374) Add example code
[ https://issues.apache.org/jira/browse/SPARK-44374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu reassigned SPARK-44374: -- Assignee: Weichen Xu > Add example code > > > Key: SPARK-44374 > URL: https://issues.apache.org/jira/browse/SPARK-44374 > Project: Spark > Issue Type: Sub-task > Components: Connect, ML, PySpark >Affects Versions: 3.5.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Major > > Add example code for distributed ML <> Spark Connect.
[jira] [Assigned] (SPARK-42471) Distributed ML <> spark connect
[ https://issues.apache.org/jira/browse/SPARK-42471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu reassigned SPARK-42471: -- Assignee: Weichen Xu > Distributed ML <> spark connect > --- > > Key: SPARK-42471 > URL: https://issues.apache.org/jira/browse/SPARK-42471 > Project: Spark > Issue Type: Umbrella > Components: Connect, ML >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Weichen Xu >Priority: Major >
[jira] [Updated] (SPARK-44341) Define the computing logic through PartitionEvaluator API and use it in WindowExec and WindowInPandasExec
[ https://issues.apache.org/jira/browse/SPARK-44341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-44341: --- Summary: Define the computing logic through PartitionEvaluator API and use it in WindowExec and WindowInPandasExec (was: Define the computing logic through PartitionEvaluator API and use it in WindowExec) > Define the computing logic through PartitionEvaluator API and use it in > WindowExec and WindowInPandasExec > - > > Key: SPARK-44341 > URL: https://issues.apache.org/jira/browse/SPARK-44341 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: jiaan.geng >Priority: Major > > Define the computing logic through PartitionEvaluator API and use it in > WindowExec -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44373) Wrap withActive for Dataset API w/ parse logic
Kent Yao created SPARK-44373: Summary: Wrap withActive for Dataset API w/ parse logic Key: SPARK-44373 URL: https://issues.apache.org/jira/browse/SPARK-44373 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.5.0 Reporter: Kent Yao
[jira] [Assigned] (SPARK-38476) Use error classes in org.apache.spark.storage
[ https://issues.apache.org/jira/browse/SPARK-38476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-38476: Assignee: Bo Zhang > Use error classes in org.apache.spark.storage > - > > Key: SPARK-38476 > URL: https://issues.apache.org/jira/browse/SPARK-38476 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Bo Zhang >Assignee: Bo Zhang >Priority: Major >
[jira] [Resolved] (SPARK-38476) Use error classes in org.apache.spark.storage
[ https://issues.apache.org/jira/browse/SPARK-38476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-38476. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 41923 [https://github.com/apache/spark/pull/41923] > Use error classes in org.apache.spark.storage > - > > Key: SPARK-38476 > URL: https://issues.apache.org/jira/browse/SPARK-38476 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Bo Zhang >Assignee: Bo Zhang >Priority: Major > Fix For: 3.5.0 > >
[jira] [Updated] (SPARK-44354) Cannot create dataframe with CharType/VarcharType column
[ https://issues.apache.org/jira/browse/SPARK-44354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kai-Michael Roesner updated SPARK-44354: Description: When trying to create a dataframe with a CharType or VarcharType column like so:
{code}
from datetime import date
from decimal import Decimal
from pyspark.sql import SparkSession
from pyspark.sql.types import *

data = [
  (1, 'abc', Decimal(3.142), date(2023, 1, 1)),
  (2, 'bcd', Decimal(1.414), date(2023, 1, 2)),
  (3, 'cde', Decimal(2.718), date(2023, 1, 3))]

schema = StructType([
  StructField('INT', IntegerType()),
  StructField('STR', CharType(3)),
  StructField('DEC', DecimalType(4, 3)),
  StructField('DAT', DateType())])

spark = SparkSession.builder.appName('data-types').getOrCreate()
df = spark.createDataFrame(data, schema)
df.show()
{code}
a {{java.lang.IllegalStateException}} is thrown [here|https://github.com/apache/spark/blob/85e252e8503534009f4fb5ea005d44c9eda31447/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala#L168]. Excerpt from the logs: {code} py4j.protocol.Py4JJavaError: An error occurred while calling o24.applySchemaToPythonRDD. 
: java.lang.IllegalStateException: [BUG] logical plan should not have output of char/varchar type: LogicalRDD [INT#0, STR#1, DEC#2, DAT#3], false at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$1(CheckAnalysis.scala:168) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$1$adapted(CheckAnalysis.scala:163) at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:295) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis0(CheckAnalysis.scala:163) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis0$(CheckAnalysis.scala:160) at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis0(Analyzer.scala:188) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:156) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:146) at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:188) at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:211) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:330) at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:208) at org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:76) at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111) at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$2(QueryExecution.scala:202) at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:526) at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:202) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827) at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:201) at 
org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:76) at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:74) at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:66) at org.apache.spark.sql.Dataset$.$anonfun$ofRows$1(Dataset.scala:90) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:88) at org.apache.spark.sql.SparkSession.internalCreateDataFrame(SparkSession.scala:571) at org.apache.spark.sql.SparkSession.applySchemaToPythonRDD(SparkSession.scala:804) at org.apache.spark.sql.SparkSession.applySchemaToPythonRDD(SparkSession.scala:789) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.base/java.lang.reflect.Method.invoke(Method.java:566) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374) at py4j.Gateway.invoke(Gateway.java:282) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182) at py4j.ClientServerConnectio
[jira] [Commented] (SPARK-44354) Cannot create dataframe with CharType/VarcharType column
[ https://issues.apache.org/jira/browse/SPARK-44354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17741925#comment-17741925 ] Kai-Michael Roesner commented on SPARK-44354: - PS: I tried to work around the exception by using `StringType()` in the schema and then doing {code} df.withColumn('STR', col('STR').cast(CharType(3))) {code} That got me a {code} WARN CharVarcharUtils: The Spark cast operator does not support char/varchar type and simply treats them as string type. {code} So now I'm wondering whether `CharType()` is supported as column data type at all... > Cannot create dataframe with CharType/VarcharType column > > > Key: SPARK-44354 > URL: https://issues.apache.org/jira/browse/SPARK-44354 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.4.0 >Reporter: Kai-Michael Roesner >Priority: Major > > When trying to create a dataframe with a CharType or VarcharType column like > so: > {code} > from datetime import date > from decimal import Decimal > from pyspark.sql import SparkSession > from pyspark.sql.types import * > data = [ > (1, 'abc', Decimal(3.142), date(2023, 1, 1)), > (2, 'bcd', Decimal(1.414), date(2023, 1, 2)), > (3, 'cde', Decimal(2.718), date(2023, 1, 3))] > schema = StructType([ > StructField('INT', IntegerType()), > StructField('STR', CharType(3)), > StructField('DEC', DecimalType(4, 3)), > StructField('DAT', DateType())]) > spark = SparkSession.builder.appName('data-types').getOrCreate() > df = spark.createDataFrame(data, schema) > df.show() > {code} > a {{java.lang.IllegalStateException}} is thrown > [here|https://github.com/apache/spark/blob/85e252e8503534009f4fb5ea005d44c9eda31447/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala#L168]. > I'm expecting this to work... > PS: Excerpt from the logs: > {code} > py4j.protocol.Py4JJavaError: An error occurred while calling > o24.applySchemaToPythonRDD. 
> : java.lang.IllegalStateException: [BUG] logical plan should not have output > of char/varchar type: LogicalRDD [INT#0, STR#1, DEC#2, DAT#3], false > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$1(CheckAnalysis.scala:168) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$1$adapted(CheckAnalysis.scala:163) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:295) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis0(CheckAnalysis.scala:163) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis0$(CheckAnalysis.scala:160) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis0(Analyzer.scala:188) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:156) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:146) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:188) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:211) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:330) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:208) > at > org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:76) > at > org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111) > at > org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$2(QueryExecution.scala:202) > at > org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:526) > at > org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:202) > at > org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827) > at > 
org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:201) > at > org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:76) > at > org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:74) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:66) > at org.apache.spark.sql.Dataset$.$anonfun$ofRows$1(Dataset.scala:90) > at > org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:88) > at > org.apache.spark.sql.SparkSession.internalCreateDataFrame(SparkSession.scal
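A possible workaround sketch for the report above (this is not Spark's own API): keep the column as `StringType()` in the `createDataFrame` schema and normalize the values in plain Python first. `to_char` below is a hypothetical helper that mimics the fixed-length write-side semantics a CHAR(3) column would enforce (right-pad short values, reject over-long ones).

```python
# Plain-Python sketch of CHAR(n) write-side semantics, as a workaround
# for CharType() not being accepted in a createDataFrame schema.
# `to_char` is a hypothetical helper, not part of PySpark.

def to_char(value, n):
    """Right-pad `value` with spaces to length n; raise if it is longer
    than n (a CHAR(n) column rejects over-long strings on write)."""
    if len(value) > n:
        raise ValueError(
            f"input string of length {len(value)} exceeds char length limit {n}")
    return value.ljust(n)

# Normalize the string column before handing the rows to createDataFrame
# with a plain StringType() field in place of CharType(3).
rows = [(1, 'abc'), (2, 'b'), (3, 'cde')]
normalized = [(i, to_char(s, 3)) for i, s in rows]
```

The normalized rows can then be used with a `StringType()` schema; the padding preserves the fixed-width behaviour that the CHAR type would otherwise supply.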
[jira] [Commented] (SPARK-44362) Use PartitionEvaluator API in AggregateInPandasExec, WindowInPandasExec,EvalPythonExec,AttachDistributedSequenceExec
[ https://issues.apache.org/jira/browse/SPARK-44362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17741921#comment-17741921 ] jiaan.geng commented on SPARK-44362: [~vinodkc] Because WindowInPandasExec is related to WindowExec, could I finish them together? > Use PartitionEvaluator API in AggregateInPandasExec, > WindowInPandasExec,EvalPythonExec,AttachDistributedSequenceExec > - > > Key: SPARK-44362 > URL: https://issues.apache.org/jira/browse/SPARK-44362 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: Vinod KC >Priority: Major > > Use PartitionEvaluator API in > AggregateInPandasExec > WindowInPandasExec > EvalPythonExec > AttachDistributedSequenceExec -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
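For context, the pattern these sub-tasks migrate to can be sketched in plain Python (the real PartitionEvaluator API is Scala-side; the class and method names below are illustrative, not Spark's): planning produces a factory, and each task asks the factory for an evaluator that consumes a single partition's iterator.

```python
# Illustrative sketch of the evaluator pattern behind the
# PartitionEvaluator API: a factory created at planning time, and a
# per-partition evaluator created inside each task. Names are made up
# for illustration; they are not Spark classes.

class SumEvaluator:
    def eval(self, partition_index, rows):
        # Consume exactly one partition's iterator and yield its output.
        yield sum(rows)

class SumEvaluatorFactory:
    def create_evaluator(self):
        return SumEvaluator()

partitions = [[1, 2, 3], [4, 5], [6]]
factory = SumEvaluatorFactory()
results = [
    out
    for idx, part in enumerate(partitions)
    for out in factory.create_evaluator().eval(idx, iter(part))
]
# results holds one partial sum per partition: [6, 9, 6]
```

The point of the factory indirection is that the factory is what gets serialized to executors, while the evaluator itself is constructed task-side.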
[jira] [Commented] (SPARK-43665) Enable PandasSQLStringFormatter.vformat to work with Spark Connect
[ https://issues.apache.org/jira/browse/SPARK-43665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17741916#comment-17741916 ] ASF GitHub Bot commented on SPARK-43665: User 'itholic' has created a pull request for this issue: https://github.com/apache/spark/pull/41931 > Enable PandasSQLStringFormatter.vformat to work with Spark Connect > -- > > Key: SPARK-43665 > URL: https://issues.apache.org/jira/browse/SPARK-43665 > Project: Spark > Issue Type: Sub-task > Components: Connect, Pandas API on Spark >Affects Versions: 3.5.0 >Reporter: Haejoon Lee >Priority: Major > > Enable PandasSQLStringFormatter.vformat to work with Spark Connect
[jira] [Assigned] (SPARK-44263) Allow ChannelBuilder extensions -- Scala
[ https://issues.apache.org/jira/browse/SPARK-44263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-44263: Assignee: Alice Sayutina > Allow ChannelBuilder extensions -- Scala > > > Key: SPARK-44263 > URL: https://issues.apache.org/jira/browse/SPARK-44263 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.4.1 >Reporter: Alice Sayutina >Assignee: Alice Sayutina >Priority: Major > > Follow up to https://issues.apache.org/jira/browse/SPARK-43332 > Provide similar extension capabilities in Scala
[jira] [Resolved] (SPARK-44263) Allow ChannelBuilder extensions -- Scala
[ https://issues.apache.org/jira/browse/SPARK-44263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-44263. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 41880 [https://github.com/apache/spark/pull/41880] > Allow ChannelBuilder extensions -- Scala > > > Key: SPARK-44263 > URL: https://issues.apache.org/jira/browse/SPARK-44263 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.4.1 >Reporter: Alice Sayutina >Assignee: Alice Sayutina >Priority: Major > Fix For: 3.5.0 > > > Follow up to https://issues.apache.org/jira/browse/SPARK-43332 > Provide similar extension capabilities in Scala
[jira] [Resolved] (SPARK-44320) Assign names to the error class _LEGACY_ERROR_TEMP_[1067,1150,1220,1265,1277]
[ https://issues.apache.org/jira/browse/SPARK-44320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-44320. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 41909 [https://github.com/apache/spark/pull/41909] > Assign names to the error class _LEGACY_ERROR_TEMP_[1067,1150,1220,1265,1277] > - > > Key: SPARK-44320 > URL: https://issues.apache.org/jira/browse/SPARK-44320 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Fix For: 3.5.0 > >
[jira] [Assigned] (SPARK-44320) Assign names to the error class _LEGACY_ERROR_TEMP_[1067,1150,1220,1265,1277]
[ https://issues.apache.org/jira/browse/SPARK-44320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-44320: Assignee: BingKun Pan > Assign names to the error class _LEGACY_ERROR_TEMP_[1067,1150,1220,1265,1277] > - > > Key: SPARK-44320 > URL: https://issues.apache.org/jira/browse/SPARK-44320 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor >
[jira] [Created] (SPARK-44372) Enable KernelDensity within Spark Connect
Haejoon Lee created SPARK-44372: --- Summary: Enable KernelDensity within Spark Connect Key: SPARK-44372 URL: https://issues.apache.org/jira/browse/SPARK-44372 Project: Spark Issue Type: Sub-task Components: Pandas API on Spark Affects Versions: 3.5.0 Reporter: Haejoon Lee import pyspark.pandas as ps psdf = ps.DataFrame({"a": [1, 2, 3, 4, 5], "b": [1, 3, 5, 7, 9], "c": [2, 4, 6, 8, 10]}) psdf.plot.kde(bw_method=5, ind=3)
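For reference, the quantity that the MLlib KernelDensity helper behind `psdf.plot.kde` estimates is a Gaussian kernel density, f(x) = (1/(n*h)) * sum_i K((x - x_i)/h) with K the standard normal pdf. A minimal plain-Python sketch of that formula (not Spark's implementation; function name is illustrative):

```python
# Plain-Python sketch of a Gaussian kernel density estimate, the
# computation the kde plot above relies on. Not Spark code.
import math

def gaussian_kde(samples, bandwidth, points):
    """Evaluate the Gaussian KDE of `samples` at each of `points`."""
    n = len(samples)

    def std_normal_pdf(u):
        return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

    return [
        sum(std_normal_pdf((x - xi) / bandwidth) for xi in samples)
        / (n * bandwidth)
        for x in points
    ]

# Same data and bandwidth as the snippet above, evaluated at 3 points.
density = gaussian_kde([1, 2, 3, 4, 5], bandwidth=5.0, points=[1.0, 3.0, 5.0])
```

Since the samples are symmetric around 3, the estimate at 1.0 and 5.0 comes out equal, with the peak at 3.0.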
[jira] [Updated] (SPARK-43629) Enable RDD dependent tests with Spark Connect
[ https://issues.apache.org/jira/browse/SPARK-43629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-43629: Summary: Enable RDD dependent tests with Spark Connect (was: Enable RDD with Spark Connect) > Enable RDD dependent tests with Spark Connect > - > > Key: SPARK-43629 > URL: https://issues.apache.org/jira/browse/SPARK-43629 > Project: Spark > Issue Type: Sub-task > Components: Connect, Pandas API on Spark >Affects Versions: 3.5.0 >Reporter: Haejoon Lee >Priority: Major > > Enable RDD with Spark Connect
[jira] [Commented] (SPARK-44371) Define the computing logic through PartitionEvaluator API and use it in CollectLimitExec, CollectTailExec, LocalLimitExec and GlobalLimitExec
[ https://issues.apache.org/jira/browse/SPARK-44371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17741882#comment-17741882 ] jiaan.geng commented on SPARK-44371: I'm working on it. > Define the computing logic through PartitionEvaluator API and use it in > CollectLimitExec, CollectTailExec, LocalLimitExec and GlobalLimitExec > - > > Key: SPARK-44371 > URL: https://issues.apache.org/jira/browse/SPARK-44371 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: jiaan.geng >Priority: Major > >
[jira] [Created] (SPARK-44371) Define the computing logic through PartitionEvaluator API and use it in CollectLimitExec, CollectTailExec, LocalLimitExec and GlobalLimitExec
jiaan.geng created SPARK-44371: -- Summary: Define the computing logic through PartitionEvaluator API and use it in CollectLimitExec, CollectTailExec, LocalLimitExec and GlobalLimitExec Key: SPARK-44371 URL: https://issues.apache.org/jira/browse/SPARK-44371 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.5.0 Reporter: jiaan.geng
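The per-partition work these limit operators would hand to an evaluator can be sketched in plain Python (a simplified illustration, not Spark's implementation): a local limit truncates each partition independently, and the global limit is then applied once over the concatenated, already-truncated partitions.

```python
# Illustrative sketch of local vs. global limit semantics over
# partitioned data, the logic LocalLimitExec / GlobalLimitExec express.
# Not Spark code; function names are made up for illustration.
from itertools import islice

def local_limit(partitions, k):
    """Keep at most k rows from each partition independently."""
    return [list(islice(iter(p), k)) for p in partitions]

def global_limit(partitions, k):
    """Keep at most k rows overall, after merging the partitions."""
    merged = [row for p in partitions for row in p]
    return merged[:k]

parts = [[1, 2, 3], [4, 5, 6], [7]]
locally = local_limit(parts, 2)    # [[1, 2], [4, 5], [7]]
result = global_limit(locally, 4)  # [1, 2, 4, 5]
```

Running the local limit first is the standard optimization: each task can stop early, so at most k rows per partition ever reach the final global truncation.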