[jira] [Assigned] (SPARK-42491) Upgrade jetty to 9.4.51.v20230217
[ https://issues.apache.org/jira/browse/SPARK-42491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42491: Assignee: (was: Apache Spark) > Upgrade jetty to 9.4.51.v20230217 > -- > > Key: SPARK-42491 > URL: https://issues.apache.org/jira/browse/SPARK-42491 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Yang Jie >Priority: Minor > > https://github.com/eclipse/jetty.project/releases/tag/jetty-9.4.51.v20230217 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42491) Upgrade jetty to 9.4.51.v20230217
[ https://issues.apache.org/jira/browse/SPARK-42491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42491: Assignee: Apache Spark > Upgrade jetty to 9.4.51.v20230217 > -- > > Key: SPARK-42491 > URL: https://issues.apache.org/jira/browse/SPARK-42491 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Apache Spark >Priority: Minor > > https://github.com/eclipse/jetty.project/releases/tag/jetty-9.4.51.v20230217 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42491) Upgrade jetty to 9.4.51.v20230217
[ https://issues.apache.org/jira/browse/SPARK-42491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17694393#comment-17694393 ] Apache Spark commented on SPARK-42491: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/40214 > Upgrade jetty to 9.4.51.v20230217 > -- > > Key: SPARK-42491 > URL: https://issues.apache.org/jira/browse/SPARK-42491 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Yang Jie >Priority: Minor > > https://github.com/eclipse/jetty.project/releases/tag/jetty-9.4.51.v20230217 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42617) Support `isocalendar`
Haejoon Lee created SPARK-42617: --- Summary: Support `isocalendar` Key: SPARK-42617 URL: https://issues.apache.org/jira/browse/SPARK-42617 Project: Spark Issue Type: New Feature Components: Pandas API on Spark Affects Versions: 3.5.0 Reporter: Haejoon Lee We should support `isocalendar` to match pandas behavior (https://pandas.pydata.org/pandas-docs/dev/reference/api/pandas.Series.dt.isocalendar.html) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42548) Add ReferenceAllColumns to skip rewriting attributes
[ https://issues.apache.org/jira/browse/SPARK-42548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-42548: --- Assignee: XiDuo You > Add ReferenceAllColumns to skip rewriting attributes > > > Key: SPARK-42548 > URL: https://issues.apache.org/jira/browse/SPARK-42548 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42548) Add ReferenceAllColumns to skip rewriting attributes
[ https://issues.apache.org/jira/browse/SPARK-42548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-42548. - Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 40154 [https://github.com/apache/spark/pull/40154] > Add ReferenceAllColumns to skip rewriting attributes > > > Key: SPARK-42548 > URL: https://issues.apache.org/jira/browse/SPARK-42548 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: XiDuo You >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42599) Make `CompatibilitySuite` as a tool like `dev/mima`
[ https://issues.apache.org/jira/browse/SPARK-42599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17694378#comment-17694378 ] Apache Spark commented on SPARK-42599: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/40213 > Make `CompatibilitySuite` as a tool like `dev/mima` > --- > > Key: SPARK-42599 > URL: https://issues.apache.org/jira/browse/SPARK-42599 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0, 3.5.0 >Reporter: Yang Jie >Priority: Major > > Testing `CompatibilitySuite` with maven requires some pre-work (the sql and > connect-client-jvm modules need to be built with maven before the test), so when > we run `mvn package test`, the following errors occur: > > {code:java} > CompatibilitySuite: > - compatibility MiMa tests *** FAILED *** > java.lang.AssertionError: assertion failed: Failed to find the jar inside > folder: /home/bjorn/spark-3.4.0/connector/connect/client/jvm/target > at scala.Predef$.assert(Predef.scala:223) > at > org.apache.spark.sql.connect.client.util.IntegrationTestUtils$.findJar(IntegrationTestUtils.scala:67) > at > org.apache.spark.sql.connect.client.CompatibilitySuite.clientJar$lzycompute(CompatibilitySuite.scala:57) > at > org.apache.spark.sql.connect.client.CompatibilitySuite.clientJar(CompatibilitySuite.scala:53) > at > org.apache.spark.sql.connect.client.CompatibilitySuite.$anonfun$new$1(CompatibilitySuite.scala:69) > at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > ... > - compatibility API tests: Dataset *** FAILED *** > java.lang.AssertionError: assertion failed: Failed to find the jar inside > folder: /home/bjorn/spark-3.4.0/connector/connect/client/jvm/target > at scala.Predef$.assert(Predef.scala:223) > at > org.apache.spark.sql.connect.client.util.IntegrationTestUtils$.findJar(IntegrationTestUtils.scala:67) > at > org.apache.spark.sql.connect.client.CompatibilitySuite.clientJar$lzycompute(CompatibilitySuite.scala:57) > at > org.apache.spark.sql.connect.client.CompatibilitySuite.clientJar(CompatibilitySuite.scala:53) > at > org.apache.spark.sql.connect.client.CompatibilitySuite.$anonfun$new$7(CompatibilitySuite.scala:110) > at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41171) Push down filter through window when partitionSpec is empty
[ https://issues.apache.org/jira/browse/SPARK-41171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-41171. - Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40142 [https://github.com/apache/spark/pull/40142] > Push down filter through window when partitionSpec is empty > --- > > Key: SPARK-41171 > URL: https://issues.apache.org/jira/browse/SPARK-41171 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > Fix For: 3.5.0 > > > Sometimes, a filter compares a rank-like window function with a number. > {code:sql} > SELECT *, ROW_NUMBER() OVER(ORDER BY a) AS rn FROM Tab1 WHERE rn <= 5 > {code} > We can create a Limit(5) and push it down as the child of the Window. > {code:sql} > SELECT *, ROW_NUMBER() OVER(ORDER BY a) AS rn FROM (SELECT * FROM Tab1 ORDER > BY a LIMIT 5) t > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
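[Editor's note] To make the rewrite concrete, here is a hedged Scala sketch against a toy table (the session and data setup are assumptions for the example; the actual rewrite happens inside the optimizer, not in user code). The first query expresses the filter through a subquery, since SQL cannot reference the alias `rn` directly in `WHERE`:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
spark.range(100).toDF("a").createOrReplaceTempView("Tab1")

// Original shape: the filter on the rank-like window column keeps only 5 rows,
// but the window function is computed over the whole table.
val original = spark.sql(
  """SELECT * FROM (
    |  SELECT *, ROW_NUMBER() OVER (ORDER BY a) AS rn FROM Tab1
    |) t WHERE rn <= 5""".stripMargin)

// Rewritten shape: a Limit sits below the Window, so ROW_NUMBER only sees the
// 5 smallest rows instead of the full table.
val rewritten = spark.sql(
  """SELECT *, ROW_NUMBER() OVER (ORDER BY a) AS rn
    |FROM (SELECT * FROM Tab1 ORDER BY a LIMIT 5) t""".stripMargin)

// Comparing the optimized plans shows whether the limit was pushed down.
original.explain()
rewritten.explain()
{code}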
[jira] [Assigned] (SPARK-41171) Push down filter through window when partitionSpec is empty
[ https://issues.apache.org/jira/browse/SPARK-41171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-41171: --- Assignee: jiaan.geng > Push down filter through window when partitionSpec is empty > --- > > Key: SPARK-41171 > URL: https://issues.apache.org/jira/browse/SPARK-41171 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > > Sometimes, a filter compares a rank-like window function with a number. > {code:sql} > SELECT *, ROW_NUMBER() OVER(ORDER BY a) AS rn FROM Tab1 WHERE rn <= 5 > {code} > We can create a Limit(5) and push it down as the child of the Window. > {code:sql} > SELECT *, ROW_NUMBER() OVER(ORDER BY a) AS rn FROM (SELECT * FROM Tab1 ORDER > BY a LIMIT 5) t > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42616) SparkSQLCLIDriver shall only close started hive sessionState
[ https://issues.apache.org/jira/browse/SPARK-42616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42616: Assignee: Apache Spark > SparkSQLCLIDriver shall only close started hive sessionState > > > Key: SPARK-42616 > URL: https://issues.apache.org/jira/browse/SPARK-42616 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.2 >Reporter: Kent Yao >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42613) PythonRunner should set OMP_NUM_THREADS to task cpus instead of executor cores by default
[ https://issues.apache.org/jira/browse/SPARK-42613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42613: Assignee: Apache Spark > PythonRunner should set OMP_NUM_THREADS to task cpus instead of executor > cores by default > - > > Key: SPARK-42613 > URL: https://issues.apache.org/jira/browse/SPARK-42613 > Project: Spark > Issue Type: Bug > Components: PySpark, YARN >Affects Versions: 3.3.0 >Reporter: John Zhuge >Assignee: Apache Spark >Priority: Major > > Follow up from > [https://github.com/apache/spark/pull/40199#discussion_r1119453996] > If OMP_NUM_THREADS is not set explicitly, we should set it to > `spark.task.cpus` instead of `spark.executor.cores` as described in [PR > #38699|https://github.com/apache/spark/pull/38699]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
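[Editor's note] The change itself is small; the sketch below is a hypothetical helper (the name `envVars` and the exact location inside PythonRunner are assumptions, not the actual patch) illustrating the proposed default:

{code:scala}
import org.apache.spark.SparkConf
import scala.collection.mutable

// Hedged sketch: only default OMP_NUM_THREADS when the user has not set it,
// and derive it from spark.task.cpus (default 1) rather than
// spark.executor.cores, so a large executor no longer lets OpenMP-backed
// native libraries oversubscribe a single-cpu task.
def defaultOmpNumThreads(conf: SparkConf, envVars: mutable.Map[String, String]): Unit = {
  if (!envVars.contains("OMP_NUM_THREADS")) {
    envVars.put("OMP_NUM_THREADS", conf.get("spark.task.cpus", "1"))
  }
}
{code}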
[jira] [Commented] (SPARK-42613) PythonRunner should set OMP_NUM_THREADS to task cpus instead of executor cores by default
[ https://issues.apache.org/jira/browse/SPARK-42613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17694324#comment-17694324 ] Apache Spark commented on SPARK-42613: -- User 'jzhuge' has created a pull request for this issue: https://github.com/apache/spark/pull/40212 > PythonRunner should set OMP_NUM_THREADS to task cpus instead of executor > cores by default > - > > Key: SPARK-42613 > URL: https://issues.apache.org/jira/browse/SPARK-42613 > Project: Spark > Issue Type: Bug > Components: PySpark, YARN >Affects Versions: 3.3.0 >Reporter: John Zhuge >Priority: Major > > Follow up from > [https://github.com/apache/spark/pull/40199#discussion_r1119453996] > If OMP_NUM_THREADS is not set explicitly, we should set it to > `spark.task.cpus` instead of `spark.executor.cores` as described in [PR > #38699|https://github.com/apache/spark/pull/38699]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42616) SparkSQLCLIDriver shall only close started hive sessionState
[ https://issues.apache.org/jira/browse/SPARK-42616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42616: Assignee: (was: Apache Spark) > SparkSQLCLIDriver shall only close started hive sessionState > > > Key: SPARK-42616 > URL: https://issues.apache.org/jira/browse/SPARK-42616 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.2 >Reporter: Kent Yao >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42613) PythonRunner should set OMP_NUM_THREADS to task cpus instead of executor cores by default
[ https://issues.apache.org/jira/browse/SPARK-42613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42613: Assignee: (was: Apache Spark) > PythonRunner should set OMP_NUM_THREADS to task cpus instead of executor > cores by default > - > > Key: SPARK-42613 > URL: https://issues.apache.org/jira/browse/SPARK-42613 > Project: Spark > Issue Type: Bug > Components: PySpark, YARN >Affects Versions: 3.3.0 >Reporter: John Zhuge >Priority: Major > > Follow up from > [https://github.com/apache/spark/pull/40199#discussion_r1119453996] > If OMP_NUM_THREADS is not set explicitly, we should set it to > `spark.task.cpus` instead of `spark.executor.cores` as described in [PR > #38699|https://github.com/apache/spark/pull/38699]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42616) SparkSQLCLIDriver shall only close started hive sessionState
[ https://issues.apache.org/jira/browse/SPARK-42616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17694323#comment-17694323 ] Apache Spark commented on SPARK-42616: -- User 'yaooqinn' has created a pull request for this issue: https://github.com/apache/spark/pull/40211 > SparkSQLCLIDriver shall only close started hive sessionState > > > Key: SPARK-42616 > URL: https://issues.apache.org/jira/browse/SPARK-42616 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.2 >Reporter: Kent Yao >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42613) PythonRunner should set OMP_NUM_THREADS to task cpus instead of executor cores by default
[ https://issues.apache.org/jira/browse/SPARK-42613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Zhuge updated SPARK-42613: --- Description: Follow up from [https://github.com/apache/spark/pull/40199#discussion_r1119453996] If OMP_NUM_THREADS is not set explicitly, we should set it to `spark.task.cpus` instead of `spark.executor.cores` as described in [PR #38699|https://github.com/apache/spark/pull/38699]. was: Follow up from [https://github.com/apache/spark/pull/40199#discussion_r1119453996] If OMP_NUM_THREADS is not set explicitly, we should set it to `spark.task.cpus` instead of `spark.executor.cores`. Otherwise, we will still have issues when executor cores is set to a very large number but task cpus is 1. > PythonRunner should set OMP_NUM_THREADS to task cpus instead of executor > cores by default > - > > Key: SPARK-42613 > URL: https://issues.apache.org/jira/browse/SPARK-42613 > Project: Spark > Issue Type: Bug > Components: PySpark, YARN >Affects Versions: 3.3.0 >Reporter: John Zhuge >Priority: Major > > Follow up from > [https://github.com/apache/spark/pull/40199#discussion_r1119453996] > If OMP_NUM_THREADS is not set explicitly, we should set it to > `spark.task.cpus` instead of `spark.executor.cores` as described in [PR > #38699|https://github.com/apache/spark/pull/38699]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42583) Remove outer join if all aggregate functions are distinct
[ https://issues.apache.org/jira/browse/SPARK-42583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-42583: -- Affects Version/s: 3.5.0 (was: 3.4.0) > Remove outer join if all aggregate functions are distinct > - > > Key: SPARK-42583 > URL: https://issues.apache.org/jira/browse/SPARK-42583 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.5.0 > > > To support more cases: > https://github.com/pingcap/tidb/blob/master/planner/core/rule_join_elimination.go#L159 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42583) Remove outer join if all aggregate functions are distinct
[ https://issues.apache.org/jira/browse/SPARK-42583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-42583. --- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40177 [https://github.com/apache/spark/pull/40177] > Remove outer join if all aggregate functions are distinct > - > > Key: SPARK-42583 > URL: https://issues.apache.org/jira/browse/SPARK-42583 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.5.0 > > > To support more cases: > https://github.com/pingcap/tidb/blob/master/planner/core/rule_join_elimination.go#L159 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42583) Remove outer join if all aggregate functions are distinct
[ https://issues.apache.org/jira/browse/SPARK-42583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-42583: - Assignee: Yuming Wang > Remove outer join if all aggregate functions are distinct > - > > Key: SPARK-42583 > URL: https://issues.apache.org/jira/browse/SPARK-42583 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > > To support more cases: > https://github.com/pingcap/tidb/blob/master/planner/core/rule_join_elimination.go#L159 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42610) Add implicit encoders to SQLImplicits
[ https://issues.apache.org/jira/browse/SPARK-42610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-42610. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 40205 [https://github.com/apache/spark/pull/40205] > Add implicit encoders to SQLImplicits > - > > Key: SPARK-42610 > URL: https://issues.apache.org/jira/browse/SPARK-42610 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
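[Editor's note] The ticket carries no description, but the feature mirrors the `SQLImplicits` behavior of the classic API: importing `spark.implicits._` brings `Encoder` instances into scope so typed Dataset operations resolve. A hedged sketch of the usage this enables (the session setup is an assumption; the Connect client builds its session differently):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._  // implicit encoders now in scope

case class Person(name: String, age: Long)
val people = Seq(Person("a", 1L), Person("b", 2L)).toDS() // Dataset[Person]
val ids = spark.range(3).as[Long]                         // Dataset[Long]
{code}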
[jira] [Created] (SPARK-42616) SparkSQLCLIDriver shall only close started hive sessionState
Kent Yao created SPARK-42616: Summary: SparkSQLCLIDriver shall only close started hive sessionState Key: SPARK-42616 URL: https://issues.apache.org/jira/browse/SPARK-42616 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.3.2 Reporter: Kent Yao -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42613) PythonRunner should set OMP_NUM_THREADS to task cpus instead of executor cores by default
[ https://issues.apache.org/jira/browse/SPARK-42613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Zhuge updated SPARK-42613: --- Summary: PythonRunner should set OMP_NUM_THREADS to task cpus instead of executor cores by default (was: PythonRunner should set OMP_NUM_THREADS to task cpus times executor cores by default) > PythonRunner should set OMP_NUM_THREADS to task cpus instead of executor > cores by default > - > > Key: SPARK-42613 > URL: https://issues.apache.org/jira/browse/SPARK-42613 > Project: Spark > Issue Type: Bug > Components: PySpark, YARN >Affects Versions: 3.3.0 >Reporter: John Zhuge >Priority: Major > > Follow up from > [https://github.com/apache/spark/pull/40199#discussion_r1119453996] > > If OMP_NUM_THREADS is not set explicitly, we should set it to > `spark.task.cpus x spark.executor.cores`. Otherwise, we will still have > issues when executor cores is set to a very large number but task cpus is 1. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42602) Provide more details in TaskEndReason s for tasks killed by TaskScheduler.cancelTasks
[ https://issues.apache.org/jira/browse/SPARK-42602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-42602: - Assignee: Bo Zhang > Provide more details in TaskEndReason s for tasks killed by > TaskScheduler.cancelTasks > - > > Key: SPARK-42602 > URL: https://issues.apache.org/jira/browse/SPARK-42602 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Bo Zhang >Assignee: Bo Zhang >Priority: Major > > Currently tasks killed by `TaskScheduler.cancelTasks` will have a > `TaskEndReason` "TaskKilled (Stage cancelled)". We should do better at > differentiating reasons for stage cancellations (e.g. user-initiated or > caused by task failures in the stage). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
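[Editor's note] `TaskKilled` already carries a free-form `reason` string, so the improvement amounts to threading the cause of the cancellation through to that field. A hedged illustration (the messages are assumptions, not the wording the PR adopted):

{code:scala}
import org.apache.spark.TaskKilled

// Differentiated reasons instead of the fixed "Stage cancelled" text.
val byUser = TaskKilled(reason = "Stage cancelled because it was cancelled by the user")
val byTaskFailure = TaskKilled(reason = "Stage cancelled because a task in the stage failed")
println(byUser.toErrorString)        // TaskKilled (Stage cancelled because ...)
println(byTaskFailure.toErrorString)
{code}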
[jira] [Updated] (SPARK-42613) PythonRunner should set OMP_NUM_THREADS to task cpus instead of executor cores by default
[ https://issues.apache.org/jira/browse/SPARK-42613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Zhuge updated SPARK-42613: --- Description: Follow up from [https://github.com/apache/spark/pull/40199#discussion_r1119453996] If OMP_NUM_THREADS is not set explicitly, we should set it to `spark.task.cpus` instead of `spark.executor.cores`. Otherwise, we will still have issues when executor cores is set to a very large number but task cpus is 1. was: Follow up from [https://github.com/apache/spark/pull/40199#discussion_r1119453996] If OMP_NUM_THREADS is not set explicitly, we should set it to `spark.task.cpus x spark.executor.cores`. Otherwise, we will still have issues when executor cores is set to a very large number but task cpus is 1. > PythonRunner should set OMP_NUM_THREADS to task cpus instead of executor > cores by default > - > > Key: SPARK-42613 > URL: https://issues.apache.org/jira/browse/SPARK-42613 > Project: Spark > Issue Type: Bug > Components: PySpark, YARN >Affects Versions: 3.3.0 >Reporter: John Zhuge >Priority: Major > > Follow up from > [https://github.com/apache/spark/pull/40199#discussion_r1119453996] > > If OMP_NUM_THREADS is not set explicitly, we should set it to > `spark.task.cpus` instead of `spark.executor.cores`. Otherwise, we will still > have issues when executor cores is set to a very large number but task cpus is > 1. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42602) Provide more details in TaskEndReason s for tasks killed by TaskScheduler.cancelTasks
[ https://issues.apache.org/jira/browse/SPARK-42602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-42602. --- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40194 [https://github.com/apache/spark/pull/40194] > Provide more details in TaskEndReason s for tasks killed by > TaskScheduler.cancelTasks > - > > Key: SPARK-42602 > URL: https://issues.apache.org/jira/browse/SPARK-42602 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Bo Zhang >Assignee: Bo Zhang >Priority: Major > Fix For: 3.5.0 > > > Currently tasks killed by `TaskScheduler.cancelTasks` will have a > `TaskEndReason` "TaskKilled (Stage cancelled)". We should do better at > differentiating reasons for stage cancellations (e.g. user-initiated or > caused by task failures in the stage). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42613) PythonRunner should set OMP_NUM_THREADS to task cpus times executor cores by default
[ https://issues.apache.org/jira/browse/SPARK-42613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Zhuge updated SPARK-42613: --- Description: Follow up from [https://github.com/apache/spark/pull/40199#discussion_r1119453996] If OMP_NUM_THREADS is not set explicitly, we should set it to `spark.task.cpus x spark.executor.cores`. Otherwise, we will still have issues when executor cores is set to a very large number but task cpus is 1. was: Follow up from [https://github.com/apache/spark/pull/40199#discussion_r1119453996] If OMP_NUM_THREADS is not set explicitly, we should set it to `spark.task.cpus` instead of `spark.executor.cores`. Otherwise, we will still have issues when executor cores is set to a very large number but task cpus is 1. > PythonRunner should set OMP_NUM_THREADS to task cpus times executor cores by > default > > > Key: SPARK-42613 > URL: https://issues.apache.org/jira/browse/SPARK-42613 > Project: Spark > Issue Type: Bug > Components: PySpark, YARN >Affects Versions: 3.3.0 >Reporter: John Zhuge >Priority: Major > > Follow up from > [https://github.com/apache/spark/pull/40199#discussion_r1119453996] > > If OMP_NUM_THREADS is not set explicitly, we should set it to > `spark.task.cpus x spark.executor.cores`. Otherwise, we will still have > issues when executor cores is set to a very large number but task cpus is 1. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42613) PythonRunner should set OMP_NUM_THREADS to task cpus times executor cores by default
[ https://issues.apache.org/jira/browse/SPARK-42613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Zhuge updated SPARK-42613: --- Summary: PythonRunner should set OMP_NUM_THREADS to task cpus times executor cores by default (was: PythonRunner should set OMP_NUM_THREADS to task cpus instead of executor cores by default) > PythonRunner should set OMP_NUM_THREADS to task cpus times executor cores by > default > > > Key: SPARK-42613 > URL: https://issues.apache.org/jira/browse/SPARK-42613 > Project: Spark > Issue Type: Bug > Components: PySpark, YARN >Affects Versions: 3.3.0 >Reporter: John Zhuge >Priority: Major > > Follow up from > [https://github.com/apache/spark/pull/40199#discussion_r1119453996] > > If OMP_NUM_THREADS is not set explicitly, we should set it to > `spark.task.cpus` instead of `spark.executor.cores`. Otherwise, we will still > have issues when executor cores is set to a very large number but task cpus is > 1. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42615) Refactor the AnalyzePlan RPC and add `session.version`
[ https://issues.apache.org/jira/browse/SPARK-42615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17694313#comment-17694313 ] Apache Spark commented on SPARK-42615: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/40210 > Refactor the AnalyzePlan RPC and add `session.version` > -- > > Key: SPARK-42615 > URL: https://issues.apache.org/jira/browse/SPARK-42615 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42615) Refactor the AnalyzePlan RPC and add `session.version`
[ https://issues.apache.org/jira/browse/SPARK-42615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42615: Assignee: (was: Apache Spark) > Refactor the AnalyzePlan RPC and add `session.version` > -- > > Key: SPARK-42615 > URL: https://issues.apache.org/jira/browse/SPARK-42615 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42615) Refactor the AnalyzePlan RPC and add `session.version`
[ https://issues.apache.org/jira/browse/SPARK-42615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42615: Assignee: Apache Spark > Refactor the AnalyzePlan RPC and add `session.version` > -- > > Key: SPARK-42615 > URL: https://issues.apache.org/jira/browse/SPARK-42615 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42615) Refactor the AnalyzePlan RPC and add `session.version`
Ruifeng Zheng created SPARK-42615: - Summary: Refactor the AnalyzePlan RPC and add `session.version` Key: SPARK-42615 URL: https://issues.apache.org/jira/browse/SPARK-42615 Project: Spark Issue Type: Sub-task Components: Connect, PySpark Affects Versions: 3.4.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42406) [PROTOBUF] Recursive field handling is incompatible with delta
[ https://issues.apache.org/jira/browse/SPARK-42406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-42406. Resolution: Fixed Issue resolved by pull request 40141 [https://github.com/apache/spark/pull/40141] > [PROTOBUF] Recursive field handling is incompatible with delta > -- > > Key: SPARK-42406 > URL: https://issues.apache.org/jira/browse/SPARK-42406 > Project: Spark > Issue Type: Bug > Components: Protobuf >Affects Versions: 3.4.0 >Reporter: Raghu Angadi >Assignee: Raghu Angadi >Priority: Major > Fix For: 3.4.0 > > > The Protobuf deserializer (the `from_protobuf()` function) optionally supports > recursive fields by limiting the depth to a certain level. See the example below. > It assigns a 'NullType' to such a field when the allowed depth is reached. > This causes a few issues. E.g. a repeated field as in the following example > results in an Array field with 'NullType'. Delta does not support null type in > a complex type. > Actually `Array[NullType]` is not really useful anyway. > How about this fix: drop the recursive field when the limit is reached rather > than using a NullType. > The example below makes it clear: > Consider a recursive Protobuf: > > {code} > message TreeNode { > string value = 1; > repeated TreeNode children = 2; > } > {code} > Allow a depth of 2: > > {code:python} > df.select( > from_protobuf( > 'proto', > messageName = 'TreeNode', > options = { ... "recursive.fields.max.depth" : "2" } > ) > ).printSchema() > {code} > The schema looks like this: > {noformat} > root > |-- from_protobuf(proto): struct (nullable = true) > | |-- value: string (nullable = true) > | |-- children: array (nullable = false) > | | |-- element: struct (containsNull = false) > | | | |-- value: string (nullable = true) > | | | |-- children: array (nullable = false) > | | | | |-- element: struct (containsNull = false) > | | | | | |-- value: string (nullable = true) > | | | | | |-- children: array (nullable = false). [ === Proposed fix: Drop > this field === ] > | | | | | | |-- element: void (containsNull = false) [ === NOTICE 'void' HERE > === ] > {noformat} > When we try to write this to a delta table, we get an error: > {noformat} > AnalysisException: Found nested NullType in column > from_protobuf(proto).children which is of ArrayType. Delta doesn't support > writing NullType in complex types. > {noformat} > > We could just drop the field 'element' when the recursion depth is reached. It is > simpler and does not need to deal with NullType. We are ignoring the value > anyway. There is no use in keeping the field. > Another issue is the setting 'recursive.fields.max.depth': it is not enforced > correctly, and '0' does not make sense. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42427) Conv should return an error if the internal conversion overflows
[ https://issues.apache.org/jira/browse/SPARK-42427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17694299#comment-17694299 ] Apache Spark commented on SPARK-42427: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/40209 > Conv should return an error if the internal conversion overflows > > > Key: SPARK-42427 > URL: https://issues.apache.org/jira/browse/SPARK-42427 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42539) User-provided JARs can override Spark's Hive metadata client JARs when using "builtin"
[ https://issues.apache.org/jira/browse/SPARK-42539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42539: Assignee: Apache Spark > User-provided JARs can override Spark's Hive metadata client JARs when using > "builtin" > -- > > Key: SPARK-42539 > URL: https://issues.apache.org/jira/browse/SPARK-42539 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.3, 3.2.3, 3.3.2 >Reporter: Erik Krogen >Assignee: Apache Spark >Priority: Major > > Recently we observed that on version 3.2.0 and Java 8, it is possible for > user-provided Hive JARs to break the ability for Spark, via the Hive metadata > client / {{IsolatedClientLoader}}, to communicate with Hive Metastore, when > using the default behavior of the "builtin" Hive version. After SPARK-35321, > when Spark is compiled against Hive >= 2.3.9 and the "builtin" Hive client > version is used, we will call the method {{Hive.getWithoutRegisterFns()}} > (from HIVE-21563) instead of {{Hive.get()}}. If the user has included, for > example, {{hive-exec-2.3.8.jar}} on their classpath, the client will break > with a {{NoSuchMethodError}}. This particular failure mode was resolved in > 3.2.1 by SPARK-37446, but while investigating, we found a general issue that > it's possible for user JARs to override Spark's own JARs -- but only inside > of the IsolatedClientLoader when using "builtin". This happens because even > when Spark is configured to use the "builtin" Hive classes, it still creates > a separate URLClassLoader for the HiveClientImpl used for HMS communication. > To get the set of JAR URLs to use for this classloader, Spark [collects all > of the JARs used by the user classloader (and its parent, and that > classloader's parent, and so > on)|https://github.com/apache/spark/blob/87e3d5625e76bb734b8dd753bfb25002822c8585/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala#L412-L438]. > Thus the newly created classloader will have all of the same JARs as the > user classloader, but the ordering has been reversed! User JARs get > prioritized ahead of system JARs, because the classloader hierarchy is > traversed from bottom-to-top. For example let's say we have user JARs > "foo.jar" and "hive-exec-2.3.8.jar". The user classloader will look like this: > {code} > MutableURLClassLoader > -- foo.jar > -- hive-exec-2.3.8.jar > -- parent: URLClassLoader > - spark-core_2.12-3.2.0.jar > - ... > - hive-exec-2.3.9.jar > - ... > {code} > This setup provides the expected behavior within the user classloader; it > will first check the parent, so hive-exec-2.3.9.jar takes precedence, and the > MutableURLClassLoader is only checked if the class doesn't exist in the > parent. But when a JAR list is constructed for the IsolatedClientLoader, it > traverses the URLs from MutableURLClassLoader first, then its parent, so the > final list looks like (in order): > {code} > URLClassLoader [IsolatedClientLoader] > -- foo.jar > -- hive-exec-2.3.8.jar > -- spark-core_2.12-3.2.0.jar > -- ... > -- hive-exec-2.3.9.jar > -- ... > -- parent: boot classloader (JVM classes) > {code} > Now when a lookup happens, all of the JARs are within the same > URLClassLoader, and the user JARs are in front of the Spark ones, so the user > JARs get prioritized. This is the opposite of the expected behavior when > using the default user/application classloader in Spark, which has > parent-first behavior, prioritizing the Spark/system classes over the user > classes.
(Note that this behavior is correct when using the > {{ChildFirstURLClassLoader}}.) > After SPARK-37446, the NoSuchMethodError is no longer an issue, but this > still breaks assumptions about how user JARs should be treated vs. system > JARs, and presents the ability for the client to break in other ways. For > example in SPARK-37446 it describes a scenario whereby Hive 2.3.8 JARs have > been included; the changes in Hive 2.3.9 were needed to improve compatibility > with older HMS, so if a user were to accidentally include these older JARs, > it could break the ability of Spark to communicate with HMS 1.x > I see two solutions to this: > *(A) Remove the separate classloader entirely when using "builtin"* > Starting from 3.0.0, due to SPARK-26839, when using Java 9+, we don't even > create a new classloader when using "builtin". This makes sense, as [called > out in this > comment|https://github.com/apache/spark/pull/24057#discussion_r265142878], > since the point of "builtin" is to use the existing JARs on the classpath > anyway. This proposes simply extending the changes from SPARK-26839 to all > Java versions, instead of restricting to Java 9+ only. > *(B) Reverse the
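[Editor's note] The bottom-up flattening described above is easy to reproduce in isolation. This is an illustrative sketch with placeholder JAR URLs, not the actual HiveUtils code:

{code:scala}
import java.net.{URL, URLClassLoader}

// Collect URLs child-first, the way the issue describes: the flat list ends up
// with user JARs ahead of Spark's JARs, reversing the parent-first precedence
// the classloader hierarchy normally provides.
def flattenUrls(cl: ClassLoader): Seq[URL] = cl match {
  case null => Seq.empty
  case u: URLClassLoader => u.getURLs.toSeq ++ flattenUrls(u.getParent)
  case other => flattenUrls(other.getParent)
}

val sparkLoader = new URLClassLoader(Array(new URL("file:/spark/jars/hive-exec-2.3.9.jar")), null)
val userLoader = new URLClassLoader(Array(new URL("file:/user/jars/hive-exec-2.3.8.jar")), sparkLoader)

// Lookup through userLoader is parent-first, so hive-exec-2.3.9.jar wins; a
// single URLClassLoader built from this flattened list would consult
// hive-exec-2.3.8.jar first.
flattenUrls(userLoader).foreach(println)
{code}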
[jira] [Assigned] (SPARK-42539) User-provided JARs can override Spark's Hive metadata client JARs when using "builtin"
[ https://issues.apache.org/jira/browse/SPARK-42539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42539: Assignee: (was: Apache Spark) > User-provided JARs can override Spark's Hive metadata client JARs when using > "builtin" > -- > > Key: SPARK-42539 > URL: https://issues.apache.org/jira/browse/SPARK-42539 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.3, 3.2.3, 3.3.2 >Reporter: Erik Krogen >Priority: Major > > Recently we observed that on version 3.2.0 and Java 8, it is possible for > user-provided Hive JARs to break the ability for Spark, via the Hive metadata > client / {{IsolatedClientLoader}}, to communicate with Hive Metastore, when > using the default behavior of the "builtin" Hive version. After SPARK-35321, > when Spark is compiled against Hive >= 2.3.9 and the "builtin" Hive client > version is used, we will call the method {{Hive.getWithoutRegisterFns()}} > (from HIVE-21563) instead of {{Hive.get()}}. If the user has included, for > example, {{hive-exec-2.3.8.jar}} on their classpath, the client will break > with a {{NoSuchMethodError}}. This particular failure mode was resolved in > 3.2.1 by SPARK-37446, but while investigating, we found a general issue that > it's possible for user JARs to override Spark's own JARs -- but only inside > of the IsolatedClientLoader when using "builtin". This happens because even > when Spark is configured to use the "builtin" Hive classes, it still creates > a separate URLClassLoader for the HiveClientImpl used for HMS communication. > To get the set of JAR URLs to use for this classloader, Spark [collects all > of the JARs used by the user classloader (and its parent, and that > classloader's parent, and so > on)|https://github.com/apache/spark/blob/87e3d5625e76bb734b8dd753bfb25002822c8585/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala#L412-L438]. > Thus the newly created classloader will have all of the same JARs as the > user classloader, but the ordering has been reversed! User JARs get > prioritized ahead of system JARs, because the classloader hierarchy is > traversed from bottom-to-top. For example let's say we have user JARs > "foo.jar" and "hive-exec-2.3.8.jar". The user classloader will look like this: > {code} > MutableURLClassLoader > -- foo.jar > -- hive-exec-2.3.8.jar > -- parent: URLClassLoader > - spark-core_2.12-3.2.0.jar > - ... > - hive-exec-2.3.9.jar > - ... > {code} > This setup provides the expected behavior within the user classloader; it > will first check the parent, so hive-exec-2.3.9.jar takes precedence, and the > MutableURLClassLoader is only checked if the class doesn't exist in the > parent. But when a JAR list is constructed for the IsolatedClientLoader, it > traverses the URLs from MutableURLClassLoader first, then its parent, so the > final list looks like (in order): > {code} > URLClassLoader [IsolatedClientLoader] > -- foo.jar > -- hive-exec-2.3.8.jar > -- spark-core_2.12-3.2.0.jar > -- ... > -- hive-exec-2.3.9.jar > -- ... > -- parent: boot classloader (JVM classes) > {code} > Now when a lookup happens, all of the JARs are within the same > URLClassLoader, and the user JARs are in front of the Spark ones, so the user > JARs get prioritized. This is the opposite of the expected behavior when > using the default user/application classloader in Spark, which has > parent-first behavior, prioritizing the Spark/system classes over the user > classes.
(Note that this behavior is correct when using the > {{ChildFirstURLClassLoader}}.) > After SPARK-37446, the NoSuchMethodError is no longer an issue, but this > still breaks assumptions about how user JARs should be treated vs. system > JARs, and presents the ability for the client to break in other ways. For > example in SPARK-37446 it describes a scenario whereby Hive 2.3.8 JARs have > been included; the changes in Hive 2.3.9 were needed to improve compatibility > with older HMS, so if a user were to accidentally include these older JARs, > it could break the ability of Spark to communicate with HMS 1.x > I see two solutions to this: > *(A) Remove the separate classloader entirely when using "builtin"* > Starting from 3.0.0, due to SPARK-26839, when using Java 9+, we don't even > create a new classloader when using "builtin". This makes sense, as [called > out in this > comment|https://github.com/apache/spark/pull/24057#discussion_r265142878], > since the point of "builtin" is to use the existing JARs on the classpath > anyway. This proposes simply extending the changes from SPARK-26839 to all > Java versions, instead of restricting to Java 9+ only. > *(B) Reverse the ordering of parent/child
[jira] [Reopened] (SPARK-42539) User-provided JARs can override Spark's Hive metadata client JARs when using "builtin"
[ https://issues.apache.org/jira/browse/SPARK-42539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reopened SPARK-42539: -- Assignee: (was: Erik Krogen) Reverted in https://github.com/apache/spark/commit/5627ceeddb45f2796fb8ad08b9f1c8a163823b2b and https://github.com/apache/spark/commit/26009d47c1f80897d65445fe48d8d5f2edcf848c > User-provided JARs can override Spark's Hive metadata client JARs when using > "builtin" > -- > > Key: SPARK-42539 > URL: https://issues.apache.org/jira/browse/SPARK-42539 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.3, 3.2.3, 3.3.2 >Reporter: Erik Krogen >Priority: Major > > Recently we observed that on version 3.2.0 and Java 8, it is possible for > user-provided Hive JARs to break the ability for Spark, via the Hive metadata > client / {{IsolatedClientLoader}}, to communicate with Hive Metastore, when > using the default behavior of the "builtin" Hive version. After SPARK-35321, > when Spark is compiled against Hive >= 2.3.9 and the "builtin" Hive client > version is used, we will call the method {{Hive.getWithoutRegisterFns()}} > (from HIVE-21563) instead of {{Hive.get()}}. If the user has included, for > example, {{hive-exec-2.3.8.jar}} on their classpath, the client will break > with a {{NoSuchMethodError}}. This particular failure mode was resolved in > 3.2.1 by SPARK-37446, but while investigating, we found a general issue that > it's possible for user JARs to override Spark's own JARs -- but only inside > of the IsolatedClientLoader when using "builtin". This happens because even > when Spark is configured to use the "builtin" Hive classes, it still creates > a separate URLClassLoader for the HiveClientImpl used for HMS communication. > To get the set of JAR URLs to use for this classloader, Spark [collects all > of the JARs used by the user classloader (and its parent, and that > classloader's parent, and so > on)|https://github.com/apache/spark/blob/87e3d5625e76bb734b8dd753bfb25002822c8585/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala#L412-L438]. > Thus the newly created classloader will have all of the same JARs as the > user classloader, but the ordering has been reversed! User JARs get > prioritized ahead of system JARs, because the classloader hierarchy is > traversed from bottom-to-top. For example let's say we have user JARs > "foo.jar" and "hive-exec-2.3.8.jar". The user classloader will look like this: > {code} > MutableURLClassLoader > -- foo.jar > -- hive-exec-2.3.8.jar > -- parent: URLClassLoader > - spark-core_2.12-3.2.0.jar > - ... > - hive-exec-2.3.9.jar > - ... > {code} > This setup provides the expected behavior within the user classloader; it > will first check the parent, so hive-exec-2.3.9.jar takes precedence, and the > MutableURLClassLoader is only checked if the class doesn't exist in the > parent. But when a JAR list is constructed for the IsolatedClientLoader, it > traverses the URLs from MutableURLClassLoader first, then its parent, so the > final list looks like (in order): > {code} > URLClassLoader [IsolatedClientLoader] > -- foo.jar > -- hive-exec-2.3.8.jar > -- spark-core_2.12-3.2.0.jar > -- ... > -- hive-exec-2.3.9.jar > -- ... > -- parent: boot classloader (JVM classes) > {code} > Now when a lookup happens, all of the JARs are within the same > URLClassLoader, and the user JARs are in front of the Spark ones, so the user > JARs get prioritized.
This is the opposite of the expected behavior when > using the default user/application classloader in Spark, which has > parent-first behavior, prioritizing the Spark/system classes over the user > classes. (Note that this behavior is correct when using the > {{ChildFirstURLClassLoader}}.) > After SPARK-37446, the NoSuchMethodError is no longer an issue, but this > still breaks assumptions about how user JARs should be treated vs. system > JARs, and presents the ability for the client to break in other ways. For > example in SPARK-37446 it describes a scenario whereby Hive 2.3.8 JARs have > been included; the changes in Hive 2.3.9 were needed to improve compatibility > with older HMS, so if a user were to accidentally include these older JARs, > it could break the ability of Spark to communicate with HMS 1.x > I see two solutions to this: > *(A) Remove the separate classloader entirely when using "builtin"* > Starting from 3.0.0, due to SPARK-26839, when using Java 9+, we don't even > create a new classloader when using "builtin". This makes sense, as [called > out in this > comment|https://github.com/apache/spark/pull/24057#discussion_r265142878], > since the point of "builtin" is to use the existing JARs on the classpath >
[jira] [Updated] (SPARK-42539) User-provided JARs can override Spark's Hive metadata client JARs when using "builtin"
[ https://issues.apache.org/jira/browse/SPARK-42539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-42539: - Fix Version/s: (was: 3.4.0) (was: 3.5.0) > User-provided JARs can override Spark's Hive metadata client JARs when using > "builtin" > -- > > Key: SPARK-42539 > URL: https://issues.apache.org/jira/browse/SPARK-42539 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.3, 3.2.3, 3.3.2 >Reporter: Erik Krogen >Assignee: Erik Krogen >Priority: Major > > Recently we observed that on version 3.2.0 and Java 8, it is possible for > user-provided Hive JARs to break the ability for Spark, via the Hive metadata > client / {{IsolatedClientLoader}}, to communicate with Hive Metastore, when > using the default behavior of the "builtin" Hive version. After SPARK-35321, > when Spark is compiled against Hive >= 2.3.9 and the "builtin" Hive client > version is used, we will call the method {{Hive.getWithoutRegisterFns()}} > (from HIVE-21563) instead of {{Hive.get()}}. If the user has included, for > example, {{hive-exec-2.3.8.jar}} on their classpath, the client will break > with a {{NoSuchMethodError}}. This particular failure mode was resolved in > 3.2.1 by SPARK-37446, but while investigating, we found a general issue that > it's possible for user JARs to override Spark's own JARs -- but only inside > of the IsolatedClientLoader when using "builtin". This happens because even > when Spark is configured to use the "builtin" Hive classes, it still creates > a separate URLClassLoader for the HiveClientImpl used for HMS communication. > To get the set of JAR URLs to use for this classloader, Spark [collects all > of the JARs used by the user classloader (and its parent, and that > classloader's parent, and so > on)|https://github.com/apache/spark/blob/87e3d5625e76bb734b8dd753bfb25002822c8585/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala#L412-L438]. > Thus the newly created classloader will have all of the same JARs as the > user classloader, but the ordering has been reversed! User JARs get > prioritized ahead of system JARs, because the classloader hierarchy is > traversed from bottom-to-top. For example let's say we have user JARs > "foo.jar" and "hive-exec-2.3.8.jar". The user classloader will look like this: > {code} > MutableURLClassLoader > -- foo.jar > -- hive-exec-2.3.8.jar > -- parent: URLClassLoader > - spark-core_2.12-3.2.0.jar > - ... > - hive-exec-2.3.9.jar > - ... > {code} > This setup provides the expected behavior within the user classloader; it > will first check the parent, so hive-exec-2.3.9.jar takes precedence, and the > MutableURLClassLoader is only checked if the class doesn't exist in the > parent. But when a JAR list is constructed for the IsolatedClientLoader, it > traverses the URLs from MutableURLClassLoader first, then its parent, so the > final list looks like (in order): > {code} > URLClassLoader [IsolatedClientLoader] > -- foo.jar > -- hive-exec-2.3.8.jar > -- spark-core_2.12-3.2.0.jar > -- ... > -- hive-exec-2.3.9.jar > -- ... > -- parent: boot classloader (JVM classes) > {code} > Now when a lookup happens, all of the JARs are within the same > URLClassLoader, and the user JARs are in front of the Spark ones, so the user > JARs get prioritized. This is the opposite of the expected behavior when > using the default user/application classloader in Spark, which has > parent-first behavior, prioritizing the Spark/system classes over the user > classes.
(Note that this behavior is correct when using the > {{ChildFirstURLClassLoader}}.) > After SPARK-37446, the NoSuchMethodError is no longer an issue, but this > still breaks assumptions about how user JARs should be treated vs. system > JARs, and presents the ability for the client to break in other ways. For > example in SPARK-37446 it describes a scenario whereby Hive 2.3.8 JARs have > been included; the changes in Hive 2.3.9 were needed to improve compatibility > with older HMS, so if a user were to accidentally include these older JARs, > it could break the ability of Spark to communicate with HMS 1.x > I see two solutions to this: > *(A) Remove the separate classloader entirely when using "builtin"* > Starting from 3.0.0, due to SPARK-26839, when using Java 9+, we don't even > create a new classloader when using "builtin". This makes sense, as [called > out in this > comment|https://github.com/apache/spark/pull/24057#discussion_r265142878], > since the point of "builtin" is to use the existing JARs on the classpath > anyway. This proposes simply extending the changes from SPARK-26839 to all > Java versions, instead of restricting
[jira] [Resolved] (SPARK-42612) Enable more parity tests related to functions
[ https://issues.apache.org/jira/browse/SPARK-42612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-42612. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 40203 [https://github.com/apache/spark/pull/40203] > Enable more parity tests related to functions > - > > Key: SPARK-42612 > URL: https://issues.apache.org/jira/browse/SPARK-42612 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42612) Enable more parity tests related to functions
[ https://issues.apache.org/jira/browse/SPARK-42612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-42612: Assignee: Takuya Ueshin > Enable more parity tests related to functions > - > > Key: SPARK-42612 > URL: https://issues.apache.org/jira/browse/SPARK-42612 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40776) Add documentation (similar to Avro functions).
[ https://issues.apache.org/jira/browse/SPARK-40776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-40776: Assignee: Sandish Kumar HN (was: Raghu Angadi) > Add documentation (similar to Avro functions). > -- > > Key: SPARK-40776 > URL: https://issues.apache.org/jira/browse/SPARK-40776 > Project: Spark > Issue Type: Improvement > Components: Protobuf, Structured Streaming >Affects Versions: 3.3.0 >Reporter: Raghu Angadi >Assignee: Sandish Kumar HN >Priority: Major > Fix For: 3.4.0 > > > Build documentation for protobufs similar to > [https://spark.apache.org/docs/latest/sql-data-sources-avro.html] > This is a follow up from Protobuf PR > https://github.com/apache/spark/pull/37972 > cc: [~sanysand...@gmail.com] . -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40776) Add documentation (similar to Avro functions).
[ https://issues.apache.org/jira/browse/SPARK-40776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-40776. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39039 [https://github.com/apache/spark/pull/39039] > Add documentation (similar to Avro functions). > -- > > Key: SPARK-40776 > URL: https://issues.apache.org/jira/browse/SPARK-40776 > Project: Spark > Issue Type: Improvement > Components: Protobuf, Structured Streaming >Affects Versions: 3.3.0 >Reporter: Raghu Angadi >Assignee: Raghu Angadi >Priority: Major > Fix For: 3.4.0 > > > Build documentation for protobufs similar to > [https://spark.apache.org/docs/latest/sql-data-sources-avro.html] > This is a follow up from Protobuf PR > https://github.com/apache/spark/pull/37972 > cc: [~sanysand...@gmail.com] . -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40776) Add documentation (similar to Avro functions).
[ https://issues.apache.org/jira/browse/SPARK-40776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-40776: Assignee: Raghu Angadi > Add documentation (similar to Avro functions). > -- > > Key: SPARK-40776 > URL: https://issues.apache.org/jira/browse/SPARK-40776 > Project: Spark > Issue Type: Improvement > Components: Protobuf, Structured Streaming >Affects Versions: 3.3.0 >Reporter: Raghu Angadi >Assignee: Raghu Angadi >Priority: Major > > Build documentation for protobufs similar to > [https://spark.apache.org/docs/latest/sql-data-sources-avro.html] > This is a follow up from Protobuf PR > https://github.com/apache/spark/pull/37972 > cc: [~sanysand...@gmail.com] . -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42515) ClientE2ETestSuite local test failed
[ https://issues.apache.org/jira/browse/SPARK-42515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-42515: Assignee: Yang Jie > ClientE2ETestSuite local test failed > > > Key: SPARK-42515 > URL: https://issues.apache.org/jira/browse/SPARK-42515 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.4.0, 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > > > local run `build/sbt clean "connect-client-jvm/test"`, > `ClientE2ETestSuite#write table` failed, GA not failed. > > {code:java} > [info] - rite table *** FAILED *** (41 milliseconds) > [info] io.grpc.StatusRuntimeException: UNKNOWN: > org/apache/parquet/hadoop/api/ReadSupport > [info] at io.grpc.Status.asRuntimeException(Status.java:535) > [info] at > io.grpc.stub.ClientCalls$BlockingResponseStream.hasNext(ClientCalls.java:660) > [info] at > scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:45) > [info] at scala.collection.Iterator.foreach(Iterator.scala:943) > [info] at scala.collection.Iterator.foreach$(Iterator.scala:943) > [info] at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) > [info] at org.apache.spark.sql.SparkSession.execute(SparkSession.scala:169) > [info] at > org.apache.spark.sql.DataFrameWriter.executeWriteOperation(DataFrameWriter.scala:255) > [info] at > org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:338) > [info] at > org.apache.spark.sql.ClientE2ETestSuite.$anonfun$new$12(ClientE2ETestSuite.scala:145) > [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > [info] at org.scalatest.Transformer.apply(Transformer.scala:22) > [info] at org.scalatest.Transformer.apply(Transformer.scala:20) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226) > [info] at org.scalatest.TestSuite.withFixture(TestSuite.scala:196) > [info] at org.scalatest.TestSuite.withFixture$(TestSuite.scala:195) > [info] at > org.scalatest.funsuite.AnyFunSuite.withFixture(AnyFunSuite.scala:1564) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236) > [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218) > [info] at org.scalatest.funsuite.AnyFunSuite.runTest(AnyFunSuite.scala:1564) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269) > [info] at > org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413) > [info] at scala.collection.immutable.List.foreach(List.scala:431) > [info] at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) > [info] at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396) > [info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268) > [info] at > org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1564) > [info] at org.scalatest.Suite.run(Suite.scala:1114) > [info] at 
org.scalatest.Suite.run$(Suite.scala:1096) > [info] at > org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1564) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:273) > [info] at org.scalatest.SuperEngine.runImpl(Engine.scala:535) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:273) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:272) > [info] at > org.apache.spark.sql.ClientE2ETestSuite.org$scalatest$BeforeAndAfterAll$$super$run(ClientE2ETestSuite.scala:33) > [info] at > org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213) > [info] at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210) > [info] at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208) > [info] at > org.apache.spark.sql.ClientE2ETestSuite.run(ClientE2ETestSuite.scala:33) > [info] at > org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:321) > [info] at > org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:517) > [info] at
[jira] [Resolved] (SPARK-42515) ClientE2ETestSuite local test failed
[ https://issues.apache.org/jira/browse/SPARK-42515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-42515. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 40136 [https://github.com/apache/spark/pull/40136] > ClientE2ETestSuite local test failed > > > Key: SPARK-42515 > URL: https://issues.apache.org/jira/browse/SPARK-42515 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.4.0, 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.4.0 > > > > local run `build/sbt clean "connect-client-jvm/test"`, > `ClientE2ETestSuite#write table` failed, GA not failed. > > {code:java} > [info] - rite table *** FAILED *** (41 milliseconds) > [info] io.grpc.StatusRuntimeException: UNKNOWN: > org/apache/parquet/hadoop/api/ReadSupport > [info] at io.grpc.Status.asRuntimeException(Status.java:535) > [info] at > io.grpc.stub.ClientCalls$BlockingResponseStream.hasNext(ClientCalls.java:660) > [info] at > scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:45) > [info] at scala.collection.Iterator.foreach(Iterator.scala:943) > [info] at scala.collection.Iterator.foreach$(Iterator.scala:943) > [info] at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) > [info] at org.apache.spark.sql.SparkSession.execute(SparkSession.scala:169) > [info] at > org.apache.spark.sql.DataFrameWriter.executeWriteOperation(DataFrameWriter.scala:255) > [info] at > org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:338) > [info] at > org.apache.spark.sql.ClientE2ETestSuite.$anonfun$new$12(ClientE2ETestSuite.scala:145) > [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > [info] at org.scalatest.Transformer.apply(Transformer.scala:22) > [info] at org.scalatest.Transformer.apply(Transformer.scala:20) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226) > [info] at org.scalatest.TestSuite.withFixture(TestSuite.scala:196) > [info] at org.scalatest.TestSuite.withFixture$(TestSuite.scala:195) > [info] at > org.scalatest.funsuite.AnyFunSuite.withFixture(AnyFunSuite.scala:1564) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236) > [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218) > [info] at org.scalatest.funsuite.AnyFunSuite.runTest(AnyFunSuite.scala:1564) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269) > [info] at > org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413) > [info] at scala.collection.immutable.List.foreach(List.scala:431) > [info] at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) > [info] at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396) > [info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268) > [info] at > 
org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1564) > [info] at org.scalatest.Suite.run(Suite.scala:1114) > [info] at org.scalatest.Suite.run$(Suite.scala:1096) > [info] at > org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1564) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:273) > [info] at org.scalatest.SuperEngine.runImpl(Engine.scala:535) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:273) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:272) > [info] at > org.apache.spark.sql.ClientE2ETestSuite.org$scalatest$BeforeAndAfterAll$$super$run(ClientE2ETestSuite.scala:33) > [info] at > org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213) > [info] at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210) > [info] at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208) > [info] at > org.apache.spark.sql.ClientE2ETestSuite.run(ClientE2ETestSuite.scala:33) > [info] at >
[jira] [Resolved] (SPARK-42367) DataFrame.drop should handle duplicated columns properly
[ https://issues.apache.org/jira/browse/SPARK-42367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-42367. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 40013 [https://github.com/apache/spark/pull/40013] > DataFrame.drop should handle duplicated columns properly > > > Key: SPARK-42367 > URL: https://issues.apache.org/jira/browse/SPARK-42367 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Fix For: 3.4.0 > > > {code:java} > >>> df.join(df2, df.name == df2.name, 'inner').show() > +---++--++ > |age|name|height|name| > +---++--++ > | 16| Bob|85| Bob| > | 14| Tom|80| Tom| > +---++--++ > >>> df.join(df2, df.name == df2.name, 'inner').drop('name').show() > +---+--+ > |age|height| > +---+--+ > | 16|85| > | 14|80| > +---+--+ > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
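A minimal Scala analogue of the PySpark repro in SPARK-42367 above may be useful; the data is made up and an active SparkSession named `spark` is assumed. Dropping by column name is expected to remove every column with that name, matching the ticket's expected output.

{code:scala}
import spark.implicits._  // assumes an active SparkSession named `spark`

val df = Seq((14, "Tom"), (16, "Bob")).toDF("age", "name")
val df2 = Seq(("Tom", 80), ("Bob", 85)).toDF("name", "height")

// The join result carries two "name" columns; drop("name") by string name
// should remove both, leaving only (age, height) as in the ticket's output.
df.join(df2, df("name") === df2("name"), "inner").drop("name").show()
{code}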
[jira] [Assigned] (SPARK-42367) DataFrame.drop should handle duplicated columns properly
[ https://issues.apache.org/jira/browse/SPARK-42367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-42367: - Assignee: Ruifeng Zheng > DataFrame.drop should handle duplicated columns properly > > > Key: SPARK-42367 > URL: https://issues.apache.org/jira/browse/SPARK-42367 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > > {code:java} > >>> df.join(df2, df.name == df2.name, 'inner').show() > +---++--++ > |age|name|height|name| > +---++--++ > | 16| Bob|85| Bob| > | 14| Tom|80| Tom| > +---++--++ > >>> df.join(df2, df.name == df2.name, 'inner').drop('name').show() > +---+--+ > |age|height| > +---+--+ > | 16|85| > | 14|80| > +---+--+ > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42572) Logic error for StateStore.validateStateRowFormat
[ https://issues.apache.org/jira/browse/SPARK-42572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-42572. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40187 [https://github.com/apache/spark/pull/40187] > Logic error for StateStore.validateStateRowFormat > - > > Key: SPARK-42572 > URL: https://issues.apache.org/jira/browse/SPARK-42572 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.5.0 >Reporter: Wei Liu >Assignee: Wei Liu >Priority: Major > Fix For: 3.5.0 > > > SPARK-42484 changed the logic of whether to check the state store format in > StateStore.validateStateRowFormat. Revert it and add a unit test to make sure > this won't happen again. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
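As a rough illustration of the kind of guard SPARK-42572 concerns, here is a hypothetical sketch of a config-gated format check. The function body and config plumbing are assumptions, not Spark's actual StateStore code; the config key `spark.sql.streaming.stateStore.formatValidation.enabled` does exist in Spark, but its wiring is simplified here.

{code:scala}
// Hypothetical sketch: validation must run whenever the flag is enabled.
// The bug class being reverted is an inverted or over-broad condition
// around exactly this kind of check.
def validateStateRowFormat(row: Array[Byte], conf: Map[String, String]): Unit = {
  val enabled = conf
    .getOrElse("spark.sql.streaming.stateStore.formatValidation.enabled", "true")
    .toBoolean
  if (enabled) {
    require(row.nonEmpty, "state row failed format validation")
  }
}
{code}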
[jira] [Assigned] (SPARK-42572) Logic error for StateStore.validateStateRowFormat
[ https://issues.apache.org/jira/browse/SPARK-42572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim reassigned SPARK-42572: Assignee: Wei Liu > Logic error for StateStore.validateStateRowFormat > - > > Key: SPARK-42572 > URL: https://issues.apache.org/jira/browse/SPARK-42572 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.5.0 >Reporter: Wei Liu >Assignee: Wei Liu >Priority: Major > > SPARK-42484 changed the logic of whether to check the state store format in > StateStore.validateStateRowFormat. Revert it and add a unit test to make sure > this won't happen again. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42592) Document SS guide doc for supporting multiple stateful operators (especially chained aggregations)
[ https://issues.apache.org/jira/browse/SPARK-42592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17694281#comment-17694281 ] Apache Spark commented on SPARK-42592: -- User 'HeartSaVioR' has created a pull request for this issue: https://github.com/apache/spark/pull/40208 > Document SS guide doc for supporting multiple stateful operators (especially > chained aggregations) > -- > > Key: SPARK-42592 > URL: https://issues.apache.org/jira/browse/SPARK-42592 > Project: Spark > Issue Type: Documentation > Components: Structured Streaming >Affects Versions: 3.5.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > Fix For: 3.4.1, 3.5.0 > > > We made a change on the guide doc for SPARK-40925 via SPARK-42105, but from > SPARK-42105 we only removed the section of "limitation of global watermark". > That said, we haven't provided any example of new functionality, especially > that users need to know about the change of SQL function (window) in chained > time window aggregations. > In this ticket, we will add the example of chained time window aggregations, > with introducing new functionality of SQL function. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
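For readers following the SPARK-42592 guide change, a hedged sketch of the chained time-window pattern being documented: `events` is an assumed streaming DataFrame with an `eventTime` column, and the second aggregation relies on the 3.4+ behavior where `window()` accepts the window column produced by an earlier aggregation.

{code:scala}
import org.apache.spark.sql.functions._

// `events` is an assumed streaming DataFrame with an eventTime column.
// First level: 5-minute counts with a watermark.
val fiveMin = events
  .withWatermark("eventTime", "10 minutes")
  .groupBy(window(col("eventTime"), "5 minutes"))
  .agg(count("*").as("cnt"))

// Second level: re-window the prior window column directly into hourly sums.
val hourly = fiveMin
  .groupBy(window(col("window"), "1 hour"))
  .agg(sum("cnt").as("cnt"))
{code}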
[jira] [Commented] (SPARK-42614) Make all constructors private[sql]
[ https://issues.apache.org/jira/browse/SPARK-42614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17694279#comment-17694279 ] Apache Spark commented on SPARK-42614: -- User 'hvanhovell' has created a pull request for this issue: https://github.com/apache/spark/pull/40207 > Make all constructors private[sql] > -- > > Key: SPARK-42614 > URL: https://issues.apache.org/jira/browse/SPARK-42614 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
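For context on SPARK-42614, `private[sql]` in Scala restricts a constructor to the `org.apache.spark.sql` package; a minimal illustration follows (not the actual Connect class definitions).

{code:scala}
package org.apache.spark.sql

// Only code inside org.apache.spark.sql can call `new Example(...)`;
// external callers must go through whatever factory methods the API exposes.
class Example private[sql] (val value: Int)
{code}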
[jira] [Assigned] (SPARK-42614) Make all constructors private[sql]
[ https://issues.apache.org/jira/browse/SPARK-42614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42614: Assignee: Herman van Hövell (was: Apache Spark) > Make all constructors private[sql] > -- > > Key: SPARK-42614 > URL: https://issues.apache.org/jira/browse/SPARK-42614 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42614) Make all constructors private[sql]
[ https://issues.apache.org/jira/browse/SPARK-42614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42614: Assignee: Apache Spark (was: Herman van Hövell) > Make all constructors private[sql] > -- > > Key: SPARK-42614 > URL: https://issues.apache.org/jira/browse/SPARK-42614 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42614) Make all constructors private[sql]
[ https://issues.apache.org/jira/browse/SPARK-42614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17694278#comment-17694278 ] Apache Spark commented on SPARK-42614: -- User 'hvanhovell' has created a pull request for this issue: https://github.com/apache/spark/pull/40207 > Make all constructors private[sql] > -- > > Key: SPARK-42614 > URL: https://issues.apache.org/jira/browse/SPARK-42614 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42611) Insert char/varchar length checks for inner fields during resolution
[ https://issues.apache.org/jira/browse/SPARK-42611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anton Okolnychyi updated SPARK-42611: - Affects Version/s: 3.3.2 3.3.1 3.3.0 3.3.3 > Insert char/varchar length checks for inner fields during resolution > > > Key: SPARK-42611 > URL: https://issues.apache.org/jira/browse/SPARK-42611 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0, 3.3.1, 3.3.3, 3.3.2, 3.4.0 >Reporter: Anton Okolnychyi >Priority: Major > > In SPARK-36498, we added support for reordering inner fields in structs > during resolution. Unfortunately, we don't add any length validation for > nested char/varchar columns in that path. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
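A hedged repro sketch of the gap SPARK-42611 describes; the table name is illustrative and the exact resolution path triggering the reordering is an assumption, but the point is that reordering inner struct fields by name must not skip nested VARCHAR length checks.

{code:scala}
// Illustrative only: the struct fields arrive in a different order than the
// table schema, so resolution reorders them by name; per the ticket, that
// path missed the length validation on the inner VARCHAR(2) field.
spark.sql("CREATE TABLE t (s STRUCT<a: INT, b: VARCHAR(2)>) USING parquet")
spark.sql("INSERT INTO t SELECT named_struct('b', 'abc', 'a', 1)")
// Expected: a length-check error because 'abc' exceeds VARCHAR(2).
{code}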
[jira] [Created] (SPARK-42614) Make all constructors private[sql]
Herman van Hövell created SPARK-42614: - Summary: Make all constructors private[sql] Key: SPARK-42614 URL: https://issues.apache.org/jira/browse/SPARK-42614 Project: Spark Issue Type: New Feature Components: Connect Affects Versions: 3.4.0 Reporter: Herman van Hövell Assignee: Herman van Hövell -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42611) Insert char/varchar length checks for inner fields during resolution
[ https://issues.apache.org/jira/browse/SPARK-42611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42611: Assignee: (was: Apache Spark) > Insert char/varchar length checks for inner fields during resolution > > > Key: SPARK-42611 > URL: https://issues.apache.org/jira/browse/SPARK-42611 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Anton Okolnychyi >Priority: Major > > In SPARK-36498, we added support for reordering inner fields in structs > during resolution. Unfortunately, we don't add any length validation for > nested char/varchar columns in that path. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-42550) table directory will be lost on hdfs when `INSERT OVERWRITE` fails
[ https://issues.apache.org/jira/browse/SPARK-42550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kevinshin closed SPARK-42550. - > table directory will be lost on hdfs when `INSERT OVERWRITE` fails > --- > > Key: SPARK-42550 > URL: https://issues.apache.org/jira/browse/SPARK-42550 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.3 > Environment: spark 3.2.3 / HDP 3.1.4 >Reporter: kevinshin >Priority: Critical > Attachments: image-2023-02-24-15-21-55-273.png, > image-2023-02-24-15-23-32-977.png, image-2023-02-24-15-25-57-770.png > > > {color:#4c9aff}when an `{*}INSERT{*} OVERWRITE *TABLE`* statement fails during > execution, the table's directory will be deleted. This does not happen in > spark 3.2.1.{color} > {color:#4c9aff}for example: {color} > *CREATE* *TABLE* *IF* *NOT* *EXISTS* test.spark32_overwrite(amt1 {*}int{*}) > STORED *AS* ORC; > *INSERT* OVERWRITE *TABLE* test.spark32_overwrite *select* 128; > *CREATE* *TABLE* *IF* *NOT* *EXISTS* test.spark32_overwrite2(amt1 long) > STORED *AS* ORC; > *INSERT* OVERWRITE *TABLE* test.spark32_overwrite *select* 644164; > *INSERT* OVERWRITE *TABLE* test.spark32_overwrite *select* amt1 *from* > test.spark32_overwrite2; – {color:#de350b}*this will get a Casting overflow > exception*{color} > {color:#4c9aff}and then:{color} > *select* * *from* test.spark32_overwrite; > {color:#4c9aff}will get an error:{color} > {color:#172b4d}java.io.FileNotFoundException{color} > {color:#172b4d}!image-2023-02-24-15-21-55-273.png!{color} > {color:#172b4d}the table's directory is lost. Use the `hdfs dfs -ls` cmd to > check:{color} > !image-2023-02-24-15-23-32-977.png! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42611) Insert char/varchar length checks for inner fields during resolution
[ https://issues.apache.org/jira/browse/SPARK-42611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17694272#comment-17694272 ] Apache Spark commented on SPARK-42611: -- User 'aokolnychyi' has created a pull request for this issue: https://github.com/apache/spark/pull/40206 > Insert char/varchar length checks for inner fields during resolution > > > Key: SPARK-42611 > URL: https://issues.apache.org/jira/browse/SPARK-42611 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Anton Okolnychyi >Priority: Major > > In SPARK-36498, we added support for reordering inner fields in structs > during resolution. Unfortunately, we don't add any length validation for > nested char/varchar columns in that path. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42611) Insert char/varchar length checks for inner fields during resolution
[ https://issues.apache.org/jira/browse/SPARK-42611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42611: Assignee: Apache Spark > Insert char/varchar length checks for inner fields during resolution > > > Key: SPARK-42611 > URL: https://issues.apache.org/jira/browse/SPARK-42611 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Anton Okolnychyi >Assignee: Apache Spark >Priority: Major > > In SPARK-36498, we added support for reordering inner fields in structs > during resolution. Unfortunately, we don't add any length validation for > nested char/varchar columns in that path. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42611) Insert char/varchar length checks for inner fields during resolution
[ https://issues.apache.org/jira/browse/SPARK-42611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17694271#comment-17694271 ] Apache Spark commented on SPARK-42611: -- User 'aokolnychyi' has created a pull request for this issue: https://github.com/apache/spark/pull/40206 > Insert char/varchar length checks for inner fields during resolution > > > Key: SPARK-42611 > URL: https://issues.apache.org/jira/browse/SPARK-42611 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Anton Okolnychyi >Priority: Major > > In SPARK-36498, we added support for reordering inner fields in structs > during resolution. Unfortunately, we don't add any length validation for > nested char/varchar columns in that path. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42550) table directory will be lost on hdfs when `INSERT OVERWRITE` fails
[ https://issues.apache.org/jira/browse/SPARK-42550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17694270#comment-17694270 ] kevinshin commented on SPARK-42550: --- this is not Spark's issue > table directory will be lost on hdfs when `INSERT OVERWRITE` fails > --- > > Key: SPARK-42550 > URL: https://issues.apache.org/jira/browse/SPARK-42550 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.3 > Environment: spark 3.2.3 / HDP 3.1.4 >Reporter: kevinshin >Priority: Critical > Attachments: image-2023-02-24-15-21-55-273.png, > image-2023-02-24-15-23-32-977.png, image-2023-02-24-15-25-57-770.png > > > {color:#4c9aff}when an `{*}INSERT{*} OVERWRITE *TABLE`* statement fails during > execution, the table's directory will be deleted. This does not happen in > spark 3.2.1.{color} > {color:#4c9aff}for example: {color} > *CREATE* *TABLE* *IF* *NOT* *EXISTS* test.spark32_overwrite(amt1 {*}int{*}) > STORED *AS* ORC; > *INSERT* OVERWRITE *TABLE* test.spark32_overwrite *select* 128; > *CREATE* *TABLE* *IF* *NOT* *EXISTS* test.spark32_overwrite2(amt1 long) > STORED *AS* ORC; > *INSERT* OVERWRITE *TABLE* test.spark32_overwrite *select* 644164; > *INSERT* OVERWRITE *TABLE* test.spark32_overwrite *select* amt1 *from* > test.spark32_overwrite2; – {color:#de350b}*this will get a Casting overflow > exception*{color} > {color:#4c9aff}and then:{color} > *select* * *from* test.spark32_overwrite; > {color:#4c9aff}will get an error:{color} > {color:#172b4d}java.io.FileNotFoundException{color} > {color:#172b4d}!image-2023-02-24-15-21-55-273.png!{color} > {color:#172b4d}the table's directory is lost. Use the `hdfs dfs -ls` cmd to > check:{color} > !image-2023-02-24-15-23-32-977.png! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42596) [YARN] OMP_NUM_THREADS not set to number of executor cores by default
[ https://issues.apache.org/jira/browse/SPARK-42596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-42596. -- Fix Version/s: 3.3.3 3.2.4 3.4.0 Resolution: Fixed Issue resolved by pull request 40199 [https://github.com/apache/spark/pull/40199] > [YARN] OMP_NUM_THREADS not set to number of executor cores by default > - > > Key: SPARK-42596 > URL: https://issues.apache.org/jira/browse/SPARK-42596 > Project: Spark > Issue Type: Bug > Components: PySpark, YARN >Affects Versions: 3.3.2 >Reporter: John Zhuge >Assignee: John Zhuge >Priority: Major > Fix For: 3.3.3, 3.2.4, 3.4.0 > > > Run this PySpark script with `spark.executor.cores=1` > {code:python} > import os > from pyspark.sql import SparkSession > from pyspark.sql.functions import udf > spark = SparkSession.builder.getOrCreate() > var_name = 'OMP_NUM_THREADS' > def get_env_var(): > return os.getenv(var_name) > udf_get_env_var = udf(get_env_var) > spark.range(1).toDF("id").withColumn(f"env_{var_name}", > udf_get_env_var()).show(truncate=False) > {code} > Output with release `3.3.2`: > {noformat} > +---+---+ > |id |env_OMP_NUM_THREADS| > +---+---+ > |0 |null | > +---+---+ > {noformat} > Output with release `3.3.0`: > {noformat} > +---+---+ > |id |env_OMP_NUM_THREADS| > +---+---+ > |0 |1 | > +---+---+ > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
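Until the default is restored, one workaround consistent with the SPARK-42596 report is to pin the variable explicitly through executor environment configuration; `spark.executorEnv.<NAME>` is the standard mechanism for this. A sketch:

{code:scala}
import org.apache.spark.sql.SparkSession

// Explicitly set OMP_NUM_THREADS on executors rather than relying on the
// (currently missing) default derived from executor cores.
val spark = SparkSession.builder()
  .config("spark.executorEnv.OMP_NUM_THREADS", "1")
  .getOrCreate()
{code}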
[jira] [Assigned] (SPARK-42596) [YARN] OMP_NUM_THREADS not set to number of executor cores by default
[ https://issues.apache.org/jira/browse/SPARK-42596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-42596: Assignee: John Zhuge > [YARN] OMP_NUM_THREADS not set to number of executor cores by default > - > > Key: SPARK-42596 > URL: https://issues.apache.org/jira/browse/SPARK-42596 > Project: Spark > Issue Type: Bug > Components: PySpark, YARN >Affects Versions: 3.3.2 >Reporter: John Zhuge >Assignee: John Zhuge >Priority: Major > > Run this PySpark script with `spark.executor.cores=1` > {code:python} > import os > from pyspark.sql import SparkSession > from pyspark.sql.functions import udf > spark = SparkSession.builder.getOrCreate() > var_name = 'OMP_NUM_THREADS' > def get_env_var(): > return os.getenv(var_name) > udf_get_env_var = udf(get_env_var) > spark.range(1).toDF("id").withColumn(f"env_{var_name}", > udf_get_env_var()).show(truncate=False) > {code} > Output with release `3.3.2`: > {noformat} > +---+---+ > |id |env_OMP_NUM_THREADS| > +---+---+ > |0 |null | > +---+---+ > {noformat} > Output with release `3.3.0`: > {noformat} > +---+---+ > |id |env_OMP_NUM_THREADS| > +---+---+ > |0 |1 | > +---+---+ > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42610) Add implicit encoders to SQLImplicits
[ https://issues.apache.org/jira/browse/SPARK-42610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17694265#comment-17694265 ] Apache Spark commented on SPARK-42610: -- User 'hvanhovell' has created a pull request for this issue: https://github.com/apache/spark/pull/40205 > Add implicit encoders to SQLImplicits > - > > Key: SPARK-42610 > URL: https://issues.apache.org/jira/browse/SPARK-42610 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42601) New physical type Decimal128 for DecimalType
[ https://issues.apache.org/jira/browse/SPARK-42601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17694259#comment-17694259 ] Apache Spark commented on SPARK-42601: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/40204 > New physical type Decimal128 for DecimalType > > > Key: SPARK-42601 > URL: https://issues.apache.org/jira/browse/SPARK-42601 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: jiaan.geng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42601) New physical type Decimal128 for DecimalType
[ https://issues.apache.org/jira/browse/SPARK-42601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42601: Assignee: (was: Apache Spark) > New physical type Decimal128 for DecimalType > > > Key: SPARK-42601 > URL: https://issues.apache.org/jira/browse/SPARK-42601 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: jiaan.geng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42601) New physical type Decimal128 for DecimalType
[ https://issues.apache.org/jira/browse/SPARK-42601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42601: Assignee: Apache Spark > New physical type Decimal128 for DecimalType > > > Key: SPARK-42601 > URL: https://issues.apache.org/jira/browse/SPARK-42601 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: jiaan.geng >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42601) New physical type Decimal128 for DecimalType
[ https://issues.apache.org/jira/browse/SPARK-42601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17694258#comment-17694258 ] Apache Spark commented on SPARK-42601: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/40204 > New physical type Decimal128 for DecimalType > > > Key: SPARK-42601 > URL: https://issues.apache.org/jira/browse/SPARK-42601 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: jiaan.geng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42613) PythonRunner should set OMP_NUM_THREADS to task cpus instead of executor cores by default
[ https://issues.apache.org/jira/browse/SPARK-42613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Zhuge updated SPARK-42613: --- Description: Follow up from [https://github.com/apache/spark/pull/40199#discussion_r1119453996] If OMP_NUM_THREADS is not set explicitly, we should set it to `spark.task.cpus` instead of `spark.executor.cores`. Otherwise, we will still have issues when executor cores are set to a very large number but task cpus is 1. was: Coming from [https://github.com/apache/spark/pull/40199#discussion_r1119453996] If OMP_NUM_THREADS is not set explicitly, we should set it to `spark.task.cpus` instead of `spark.executor.cores`. Otherwise, we will still have issues when executor cores are set to a very large number but task cpus is 1. > PythonRunner should set OMP_NUM_THREADS to task cpus instead of executor > cores by default > - > > Key: SPARK-42613 > URL: https://issues.apache.org/jira/browse/SPARK-42613 > Project: Spark > Issue Type: Bug > Components: PySpark, YARN >Affects Versions: 3.3.0 >Reporter: John Zhuge >Priority: Major > > Follow up from > [https://github.com/apache/spark/pull/40199#discussion_r1119453996] > > If OMP_NUM_THREADS is not set explicitly, we should set it to > `spark.task.cpus` instead of `spark.executor.cores`. Otherwise, we will still > have issues when executor cores are set to a very large number but task cpus is > 1. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42613) PythonRunner should set OMP_NUM_THREADS to task cpus instead of executor cores by default
John Zhuge created SPARK-42613: -- Summary: PythonRunner should set OMP_NUM_THREADS to task cpus instead of executor cores by default Key: SPARK-42613 URL: https://issues.apache.org/jira/browse/SPARK-42613 Project: Spark Issue Type: Bug Components: PySpark, YARN Affects Versions: 3.3.0 Reporter: John Zhuge Coming from [https://github.com/apache/spark/pull/40199#discussion_r1119453996] If OMP_NUM_THREADS is not set explicitly, we should set it to `spark.task.cpus` instead of `spark.executor.cores`. Otherwise, we will still have issues when executor cores are set to a very large number but task cpus is 1. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
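A hedged sketch of the default SPARK-42613 proposes, written as a hypothetical helper rather than the actual PythonRunner change: prefer an explicit OMP_NUM_THREADS, otherwise fall back to `spark.task.cpus` instead of `spark.executor.cores`.

{code:scala}
import org.apache.spark.SparkConf

// Hypothetical helper: default to spark.task.cpus (1 if unset) so a large
// executor-cores setting no longer inflates the per-task thread count.
def defaultOmpNumThreads(conf: SparkConf): String =
  sys.env.getOrElse("OMP_NUM_THREADS", conf.get("spark.task.cpus", "1"))
{code}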
[jira] [Assigned] (SPARK-42600) currentDatabase Shall use NamespaceHelper instead of MultipartIdentifierHelper
[ https://issues.apache.org/jira/browse/SPARK-42600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-42600: --- Assignee: Kent Yao > currentDatabase Shall use NamespaceHelper instead of > MultipartIdentifierHelper > --- > > Key: SPARK-42600 > URL: https://issues.apache.org/jira/browse/SPARK-42600 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.2 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42600) currentDatabase Shall use NamespaceHelper instead of MultipartIdentifierHelper
[ https://issues.apache.org/jira/browse/SPARK-42600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-42600. - Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 40192 [https://github.com/apache/spark/pull/40192] > currentDatabase Shall use NamespaceHelper instead of > MultipartIdentifierHelper > --- > > Key: SPARK-42600 > URL: https://issues.apache.org/jira/browse/SPARK-42600 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.2 >Reporter: Kent Yao >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41956) Refetch shuffle blocks when executor is decommissioned
[ https://issues.apache.org/jira/browse/SPARK-41956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhongwei Zhu updated SPARK-41956: - Summary: Refetch shuffle blocks when executor is decommissioned (was: Shuffle output location refetch in ShuffleBlockFetcherIterator) > Refetch shuffle blocks when executor is decommissioned > -- > > Key: SPARK-41956 > URL: https://issues.apache.org/jira/browse/SPARK-41956 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.3.1 >Reporter: Zhongwei Zhu >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42612) Enable more parity tests related to functions
[ https://issues.apache.org/jira/browse/SPARK-42612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42612: Assignee: (was: Apache Spark) > Enable more parity tests related to functions > - > > Key: SPARK-42612 > URL: https://issues.apache.org/jira/browse/SPARK-42612 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Takuya Ueshin >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42612) Enable more parity tests related to functions
[ https://issues.apache.org/jira/browse/SPARK-42612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42612: Assignee: Apache Spark > Enable more parity tests related to functions > - > > Key: SPARK-42612 > URL: https://issues.apache.org/jira/browse/SPARK-42612 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Takuya Ueshin >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42612) Enable more parity tests related to functions
[ https://issues.apache.org/jira/browse/SPARK-42612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17694244#comment-17694244 ] Apache Spark commented on SPARK-42612: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/40203 > Enable more parity tests related to functions > - > > Key: SPARK-42612 > URL: https://issues.apache.org/jira/browse/SPARK-42612 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Takuya Ueshin >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42612) Enable more parity tests related to functions
[ https://issues.apache.org/jira/browse/SPARK-42612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17694245#comment-17694245 ] Apache Spark commented on SPARK-42612: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/40203 > Enable more parity tests related to functions > - > > Key: SPARK-42612 > URL: https://issues.apache.org/jira/browse/SPARK-42612 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Takuya Ueshin >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42612) Enable more parity tests related to functions
Takuya Ueshin created SPARK-42612: - Summary: Enable more parity tests related to functions Key: SPARK-42612 URL: https://issues.apache.org/jira/browse/SPARK-42612 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Takuya Ueshin -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42608) Use full column names for inner fields in resolution errors
[ https://issues.apache.org/jira/browse/SPARK-42608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17694242#comment-17694242 ] Apache Spark commented on SPARK-42608: -- User 'aokolnychyi' has created a pull request for this issue: https://github.com/apache/spark/pull/40202 > Use full column names for inner fields in resolution errors > --- > > Key: SPARK-42608 > URL: https://issues.apache.org/jira/browse/SPARK-42608 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Anton Okolnychyi >Priority: Major > > If there are multiple inner columns with the same name, resolution errors may > be confusing as we only use field names, not full column names. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
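To see why SPARK-42608 wants full column names, consider a made-up schema in which two structs each contain a field named `c`; an error that prints only `c` cannot say which struct is at fault, while `x.c` versus `y.c` can. An active SparkSession named `spark` is assumed.

{code:scala}
// Two structs, both with an inner field named "c". A resolution error that
// mentions only the field name ("c") is ambiguous here; printing the full
// column name ("x.c" vs "y.c") pinpoints the offending inner field.
val df = spark.sql("SELECT named_struct('c', 1) AS x, named_struct('c', 'one') AS y")
{code}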
[jira] [Assigned] (SPARK-42608) Use full column names for inner fields in resolution errors
[ https://issues.apache.org/jira/browse/SPARK-42608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42608: Assignee: Apache Spark > Use full column names for inner fields in resolution errors > --- > > Key: SPARK-42608 > URL: https://issues.apache.org/jira/browse/SPARK-42608 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Anton Okolnychyi >Assignee: Apache Spark >Priority: Major > > If there are multiple inner columns with the same name, resolution errors may > be confusing as we only use field names, not full column names. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42608) Use full column names for inner fields in resolution errors
[ https://issues.apache.org/jira/browse/SPARK-42608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42608: Assignee: (was: Apache Spark) > Use full column names for inner fields in resolution errors > --- > > Key: SPARK-42608 > URL: https://issues.apache.org/jira/browse/SPARK-42608 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Anton Okolnychyi >Priority: Major > > If there are multiple inner columns with the same name, resolution errors may > be confusing as we only use field names, not full column names. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42121) Add built-in table-valued functions posexplode and posexplode_outer
[ https://issues.apache.org/jira/browse/SPARK-42121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-42121. - Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 40151 [https://github.com/apache/spark/pull/40151] > Add built-in table-valued functions posexplode and posexplode_outer > --- > > Key: SPARK-42121 > URL: https://issues.apache.org/jira/browse/SPARK-42121 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Allison Wang >Assignee: Allison Wang >Priority: Major > Fix For: 3.4.0 > > > Add `posexplode` and `posexplode_outer` to the built-in table function > registry. > Add new SQL tests in `table-valued-functions.sql` and `join-lateral.sql`. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
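With the SPARK-42121 change, `posexplode` can appear directly in a FROM clause as a table-valued function; an illustrative 3.4+ query (output columns are `pos` and `col`):

{code:scala}
// posexplode as a table-valued function.
spark.sql("SELECT * FROM posexplode(array(10, 20))").show()
// +---+---+
// |pos|col|
// +---+---+
// |  0| 10|
// |  1| 20|
// +---+---+
{code}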
[jira] [Assigned] (SPARK-42121) Add built-in table-valued functions posexplode and posexplode_outer
[ https://issues.apache.org/jira/browse/SPARK-42121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-42121: --- Assignee: Allison Wang > Add built-in table-valued functions posexplode and posexplode_outer > --- > > Key: SPARK-42121 > URL: https://issues.apache.org/jira/browse/SPARK-42121 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Allison Wang >Assignee: Allison Wang >Priority: Major > > Add `posexplode` and `posexplode_outer` to the built-in table function > registry. > Add new SQL tests in `table-valued-functions.sql` and `join-lateral.sql`. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42611) Insert char/varchar length checks for inner fields during resolution
Anton Okolnychyi created SPARK-42611: Summary: Insert char/varchar length checks for inner fields during resolution Key: SPARK-42611 URL: https://issues.apache.org/jira/browse/SPARK-42611 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.4.0 Reporter: Anton Okolnychyi In SPARK-36498, we added support for reordering inner fields in structs during resolution. Unfortunately, we don't add any length validation for nested char/varchar columns in that path. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42592) Document SS guide doc for supporting multiple stateful operators (especially chained aggregations)
[ https://issues.apache.org/jira/browse/SPARK-42592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim updated SPARK-42592: - Fix Version/s: 3.4.1 > Document SS guide doc for supporting multiple stateful operators (especially > chained aggregations) > -- > > Key: SPARK-42592 > URL: https://issues.apache.org/jira/browse/SPARK-42592 > Project: Spark > Issue Type: Documentation > Components: Structured Streaming >Affects Versions: 3.5.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > Fix For: 3.4.1, 3.5.0 > > > We made a change on the guide doc for SPARK-40925 via SPARK-42105, but from > SPARK-42105 we only removed the section of "limitation of global watermark". > That said, we haven't provided any example of new functionality, especially > that users need to know about the change of SQL function (window) in chained > time window aggregations. > In this ticket, we will add the example of chained time window aggregations, > with introducing new functionality of SQL function. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42592) Document SS guide doc for supporting multiple stateful operators (especially chained aggregations)
[ https://issues.apache.org/jira/browse/SPARK-42592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim reassigned SPARK-42592: Assignee: Jungtaek Lim > Document SS guide doc for supporting multiple stateful operators (especially > chained aggregations) > -- > > Key: SPARK-42592 > URL: https://issues.apache.org/jira/browse/SPARK-42592 > Project: Spark > Issue Type: Documentation > Components: Structured Streaming >Affects Versions: 3.5.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > > We made a change on the guide doc for SPARK-40925 via SPARK-42105, but from > SPARK-42105 we only removed the section of "limitation of global watermark". > That said, we haven't provided any example of new functionality, especially > that users need to know about the change of SQL function (window) in chained > time window aggregations. > In this ticket, we will add the example of chained time window aggregations, > with introducing new functionality of SQL function. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42592) Document SS guide doc for supporting multiple stateful operators (especially chained aggregations)
[ https://issues.apache.org/jira/browse/SPARK-42592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-42592. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40188 [https://github.com/apache/spark/pull/40188] > Document SS guide doc for supporting multiple stateful operators (especially > chained aggregations) > -- > > Key: SPARK-42592 > URL: https://issues.apache.org/jira/browse/SPARK-42592 > Project: Spark > Issue Type: Documentation > Components: Structured Streaming >Affects Versions: 3.5.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > Fix For: 3.5.0 > > > We made a change on the guide doc for SPARK-40925 via SPARK-42105, but from > SPARK-42105 we only removed the section of "limitation of global watermark". > That said, we haven't provided any example of new functionality, especially > that users need to know about the change of SQL function (window) in chained > time window aggregations. > In this ticket, we will add the example of chained time window aggregations, > with introducing new functionality of SQL function. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42539) User-provided JARs can override Spark's Hive metadata client JARs when using "builtin"
[ https://issues.apache.org/jira/browse/SPARK-42539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-42539: - Fix Version/s: 3.4.0 > User-provided JARs can override Spark's Hive metadata client JARs when using > "builtin" > -- > > Key: SPARK-42539 > URL: https://issues.apache.org/jira/browse/SPARK-42539 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.3, 3.2.3, 3.3.2 >Reporter: Erik Krogen >Assignee: Erik Krogen >Priority: Major > Fix For: 3.4.0, 3.5.0 > > > Recently we observed that on version 3.2.0 and Java 8, it is possible for > user-provided Hive JARs to break the ability for Spark, via the Hive metadata > client / {{IsolatedClientLoader}}, to communicate with Hive Metastore, when > using the default behavior of the "builtin" Hive version. After SPARK-35321, > when Spark is compiled against Hive >= 2.3.9 and the "builtin" Hive client > version is used, we will call the method {{Hive.getWithoutRegisterFns()}} > (from HIVE-21563) instead of {{Hive.get()}}. If the user has included, for > example, {{hive-exec-2.3.8.jar}} on their classpath, the client will break > with a {{NoSuchMethodError}}. This particular failure mode was resolved in > 3.2.1 by SPARK-37446, but while investigating, we found a general issue that > it's possible for user JARs to override Spark's own JARs -- but only inside > of the IsolatedClientLoader when using "builtin". This happens because even > when Spark is configured to use the "builtin" Hive classes, it still creates > a separate URLClassLoader for the HiveClientImpl used for HMS communication. > To get the set of JAR URLs to use for this classloader, Spark [collects all > of the JARs used by the user classloader (and its parent, and that > classloader's parent, and so > on)|https://github.com/apache/spark/blob/87e3d5625e76bb734b8dd753bfb25002822c8585/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala#L412-L438]. > Thus the newly created classloader will have all of the same JARs as the > user classloader, but the ordering has been reversed! User JARs get > prioritized ahead of system JARs, because the classloader hierarchy is > traversed from bottom-to-top. For example, let's say we have user JARs > "foo.jar" and "hive-exec-2.3.8.jar". The user classloader will look like this: > {code} > MutableURLClassLoader > -- foo.jar > -- hive-exec-2.3.8.jar > -- parent: URLClassLoader > - spark-core_2.12-3.2.0.jar > - ... > - hive-exec-2.3.9.jar > - ... > {code} > This setup provides the expected behavior within the user classloader; it > will first check the parent, so hive-exec-2.3.9.jar takes precedence, and the > MutableURLClassLoader is only checked if the class doesn't exist in the > parent. But when a JAR list is constructed for the IsolatedClientLoader, it > traverses the URLs from MutableURLClassLoader first, then its parent, so the > final list looks like (in order): > {code} > URLClassLoader [IsolatedClientLoader] > -- foo.jar > -- hive-exec-2.3.8.jar > -- spark-core_2.12-3.2.0.jar > -- ... > -- hive-exec-2.3.9.jar > -- ... > -- parent: boot classloader (JVM classes) > {code} > Now when a lookup happens, all of the JARs are within the same > URLClassLoader, and the user JARs are in front of the Spark ones, so the user > JARs get prioritized. This is the opposite of the expected behavior when > using the default user/application classloader in Spark, which has > parent-first behavior, prioritizing the Spark/system classes over the user > classes. 
(Note that this behavior is correct when using the > {{ChildFirstURLClassLoader}}.) > After SPARK-37446, the NoSuchMethodError is no longer an issue, but this > still breaks assumptions about how user JARs should be treated vs. system > JARs, and presents the ability for the client to break in other ways. For > example in SPARK-37446 it describes a scenario whereby Hive 2.3.8 JARs have > been included; the changes in Hive 2.3.9 were needed to improve compatibility > with older HMS, so if a user were to accidentally include these older JARs, > it could break the ability of Spark to communicate with HMS 1.x > I see two solutions to this: > *(A) Remove the separate classloader entirely when using "builtin"* > Starting from 3.0.0, due to SPARK-26839, when using Java 9+, we don't even > create a new classloader when using "builtin". This makes sense, as [called > out in this > comment|https://github.com/apache/spark/pull/24057#discussion_r265142878], > since the point of "builtin" is to use the existing JARs on the classpath > anyway. This proposes simply extending the changes from SPARK-26839 to all > Java versions, instead of restricting to Java 9+ only. >
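To see the inversion in isolation, here is a minimal, self-contained Scala sketch of the bottom-to-top traversal the description refers to. The JAR paths are hypothetical stand-ins and this is not the actual HiveUtils code; it only demonstrates why flattening a parent-first hierarchy child-first reverses precedence.
{code}
import java.net.{URL, URLClassLoader}

object FlattenSketch {
  // Walk a classloader chain child-first, collecting its URLs -- the same
  // bottom-to-top traversal described above.
  def allJars(cl: ClassLoader): Seq[URL] = cl match {
    case null              => Seq.empty
    case u: URLClassLoader => u.getURLs.toSeq ++ allJars(u.getParent)
    case other             => allJars(other.getParent)
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical stand-ins for Spark's system JARs and a user JAR.
    val system = new URLClassLoader(
      Array(new URL("file:/opt/spark/jars/hive-exec-2.3.9.jar")), null)
    val user = new URLClassLoader(
      Array(new URL("file:/tmp/user/hive-exec-2.3.8.jar")), system)

    // Class lookup through `user` is parent-first, so 2.3.9 wins there.
    // The flattened URL list, however, puts the user's JAR first:
    allJars(user).foreach(println)
    // prints:
    //   file:/tmp/user/hive-exec-2.3.8.jar
    //   file:/opt/spark/jars/hive-exec-2.3.9.jar
  }
}
{code}
A new URLClassLoader built from that flattened list resolves classes from hive-exec-2.3.8.jar before hive-exec-2.3.9.jar, which is exactly the user-over-system inversion the issue describes.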
[jira] [Resolved] (SPARK-42539) User-provided JARs can override Spark's Hive metadata client JARs when using "builtin"
[ https://issues.apache.org/jira/browse/SPARK-42539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-42539. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40144 [https://github.com/apache/spark/pull/40144] > User-provided JARs can override Spark's Hive metadata client JARs when using > "builtin" > -- > > Key: SPARK-42539 > URL: https://issues.apache.org/jira/browse/SPARK-42539 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.3, 3.2.3, 3.3.2 >Reporter: Erik Krogen >Assignee: Erik Krogen >Priority: Major > Fix For: 3.5.0 > > > Recently we observed that on version 3.2.0 and Java 8, it is possible for > user-provided Hive JARs to break the ability of Spark, via the Hive metadata > client / {{IsolatedClientLoader}}, to communicate with the Hive Metastore when > using the default behavior of the "builtin" Hive version. After SPARK-35321, > when Spark is compiled against Hive >= 2.3.9 and the "builtin" Hive client > version is used, we call the method {{Hive.getWithoutRegisterFns()}} > (from HIVE-21563) instead of {{Hive.get()}}. If the user has included, for > example, {{hive-exec-2.3.8.jar}} on their classpath, the client will break > with a {{NoSuchMethodError}}. This particular failure mode was resolved in > 3.2.1 by SPARK-37446, but while investigating, we found a general issue: it is > possible for user JARs to override Spark's own JARs -- but only inside > the IsolatedClientLoader when using "builtin". This happens because even > when Spark is configured to use the "builtin" Hive classes, it still creates > a separate URLClassLoader for the HiveClientImpl used for HMS communication. > To get the set of JAR URLs to use for this classloader, Spark [collects all > of the JARs used by the user classloader (and its parent, and that > classloader's parent, and so > on)|https://github.com/apache/spark/blob/87e3d5625e76bb734b8dd753bfb25002822c8585/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala#L412-L438]. > Thus the newly created classloader has all of the same JARs as the > user classloader, but the ordering has been reversed! User JARs get > prioritized ahead of system JARs, because the classloader hierarchy is > traversed from bottom to top. For example, let's say we have user JARs > "foo.jar" and "hive-exec-2.3.8.jar". The user classloader will look like this: > {code} > MutableURLClassLoader > -- foo.jar > -- hive-exec-2.3.8.jar > -- parent: URLClassLoader > - spark-core_2.12-3.2.0.jar > - ... > - hive-exec-2.3.9.jar > - ... > {code} > This setup provides the expected behavior within the user classloader; it > will first check the parent, so hive-exec-2.3.9.jar takes precedence, and the > MutableURLClassLoader is only checked if the class doesn't exist in the > parent. But when the JAR list is constructed for the IsolatedClientLoader, it > traverses the URLs from the MutableURLClassLoader first, then its parent, so the > final list looks like this (in order): > {code} > URLClassLoader [IsolatedClientLoader] > -- foo.jar > -- hive-exec-2.3.8.jar > -- spark-core_2.12-3.2.0.jar > -- ... > -- hive-exec-2.3.9.jar > -- ... > -- parent: boot classloader (JVM classes) > {code} > Now when a lookup happens, all of the JARs are within the same > URLClassLoader, and the user JARs are in front of the Spark ones, so the user > JARs get prioritized.
This is the opposite of the expected behavior of > the default user/application classloader in Spark, which is parent-first, > prioritizing the Spark/system classes over the user > classes. (Note that this behavior is correct when using the > {{ChildFirstURLClassLoader}}.) > After SPARK-37446, the NoSuchMethodError is no longer an issue, but this > still breaks assumptions about how user JARs should be treated vs. system > JARs, and leaves open the possibility of the client breaking in other ways. For > example, SPARK-37446 describes a scenario in which Hive 2.3.8 JARs were > included; the changes in Hive 2.3.9 were needed to improve compatibility > with older HMS versions, so if a user were to accidentally include these older JARs, > it could break the ability of Spark to communicate with HMS 1.x. > I see two solutions to this: > *(A) Remove the separate classloader entirely when using "builtin"* > Starting from 3.0.0, due to SPARK-26839, when using Java 9+, we don't even > create a new classloader when using "builtin". This makes sense, as [called > out in this > comment|https://github.com/apache/spark/pull/24057#discussion_r265142878], > since the point of "builtin" is to use the existing JARs on the classpath > anyway. This proposes simply extending the changes from SPARK-26839 to all > Java versions, instead of restricting to Java 9+ only. >
[jira] [Assigned] (SPARK-42539) User-provided JARs can override Spark's Hive metadata client JARs when using "builtin"
[ https://issues.apache.org/jira/browse/SPARK-42539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-42539: Assignee: Erik Krogen > User-provided JARs can override Spark's Hive metadata client JARs when using > "builtin" > -- > > Key: SPARK-42539 > URL: https://issues.apache.org/jira/browse/SPARK-42539 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.3, 3.2.3, 3.3.2 >Reporter: Erik Krogen >Assignee: Erik Krogen >Priority: Major > > Recently we observed that on version 3.2.0 and Java 8, it is possible for > user-provided Hive JARs to break the ability of Spark, via the Hive metadata > client / {{IsolatedClientLoader}}, to communicate with the Hive Metastore when > using the default behavior of the "builtin" Hive version. After SPARK-35321, > when Spark is compiled against Hive >= 2.3.9 and the "builtin" Hive client > version is used, we call the method {{Hive.getWithoutRegisterFns()}} > (from HIVE-21563) instead of {{Hive.get()}}. If the user has included, for > example, {{hive-exec-2.3.8.jar}} on their classpath, the client will break > with a {{NoSuchMethodError}}. This particular failure mode was resolved in > 3.2.1 by SPARK-37446, but while investigating, we found a general issue: it is > possible for user JARs to override Spark's own JARs -- but only inside > the IsolatedClientLoader when using "builtin". This happens because even > when Spark is configured to use the "builtin" Hive classes, it still creates > a separate URLClassLoader for the HiveClientImpl used for HMS communication. > To get the set of JAR URLs to use for this classloader, Spark [collects all > of the JARs used by the user classloader (and its parent, and that > classloader's parent, and so > on)|https://github.com/apache/spark/blob/87e3d5625e76bb734b8dd753bfb25002822c8585/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala#L412-L438]. > Thus the newly created classloader has all of the same JARs as the > user classloader, but the ordering has been reversed! User JARs get > prioritized ahead of system JARs, because the classloader hierarchy is > traversed from bottom to top. For example, let's say we have user JARs > "foo.jar" and "hive-exec-2.3.8.jar". The user classloader will look like this: > {code} > MutableURLClassLoader > -- foo.jar > -- hive-exec-2.3.8.jar > -- parent: URLClassLoader > - spark-core_2.12-3.2.0.jar > - ... > - hive-exec-2.3.9.jar > - ... > {code} > This setup provides the expected behavior within the user classloader; it > will first check the parent, so hive-exec-2.3.9.jar takes precedence, and the > MutableURLClassLoader is only checked if the class doesn't exist in the > parent. But when the JAR list is constructed for the IsolatedClientLoader, it > traverses the URLs from the MutableURLClassLoader first, then its parent, so the > final list looks like this (in order): > {code} > URLClassLoader [IsolatedClientLoader] > -- foo.jar > -- hive-exec-2.3.8.jar > -- spark-core_2.12-3.2.0.jar > -- ... > -- hive-exec-2.3.9.jar > -- ... > -- parent: boot classloader (JVM classes) > {code} > Now when a lookup happens, all of the JARs are within the same > URLClassLoader, and the user JARs are in front of the Spark ones, so the user > JARs get prioritized. This is the opposite of the expected behavior of > the default user/application classloader in Spark, which is parent-first, > prioritizing the Spark/system classes over the user > classes.
(Note that this behavior is correct when using the > {{ChildFirstURLClassLoader}}.) > After SPARK-37446, the NoSuchMethodError is no longer an issue, but this > still breaks assumptions about how user JARs should be treated vs. system > JARs, and leaves open the possibility of the client breaking in other ways. For > example, SPARK-37446 describes a scenario in which Hive 2.3.8 JARs were > included; the changes in Hive 2.3.9 were needed to improve compatibility > with older HMS versions, so if a user were to accidentally include these older JARs, > it could break the ability of Spark to communicate with HMS 1.x. > I see two solutions to this: > *(A) Remove the separate classloader entirely when using "builtin"* > Starting from 3.0.0, due to SPARK-26839, when using Java 9+, we don't even > create a new classloader when using "builtin". This makes sense, as [called > out in this > comment|https://github.com/apache/spark/pull/24057#discussion_r265142878], > since the point of "builtin" is to use the existing JARs on the classpath > anyway. This proposes simply extending the changes from SPARK-26839 to all > Java versions, instead of restricting to Java 9+ only. > *(B) Reverse the ordering of
[jira] [Assigned] (SPARK-42610) Add implicit encoders to SQLImplicits
[ https://issues.apache.org/jira/browse/SPARK-42610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell reassigned SPARK-42610: - Assignee: Herman van Hövell > Add implicit encoders to SQLImplicits > - > > Key: SPARK-42610 > URL: https://issues.apache.org/jira/browse/SPARK-42610 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42467) Spark Connect Scala Client: GroupBy and Aggregation
[ https://issues.apache.org/jira/browse/SPARK-42467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell updated SPARK-42467: -- Epic Link: SPARK-42554 > Spark Connect Scala Client: GroupBy and Aggregation > --- > > Key: SPARK-42467 > URL: https://issues.apache.org/jira/browse/SPARK-42467 > Project: Spark > Issue Type: Task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
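For scope, this task covers the groupBy/agg surface of the existing Dataset API. A representative call the Connect Scala client has to support is shown below; this is the classic Spark API with a local session, given purely as an illustration, not the Connect client's own session setup.
{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, sum}

object GroupByExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq(("a", 1), ("a", 2), ("b", 3)).toDF("key", "value")

    // GroupBy with multiple aggregations -- the API shape the
    // Connect Scala client needs to reproduce.
    df.groupBy($"key")
      .agg(sum($"value").as("total"), avg($"value").as("mean"))
      .show()
  }
}
{code}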
[jira] [Resolved] (SPARK-42542) Support Pivot without providing pivot column values
[ https://issues.apache.org/jira/browse/SPARK-42542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell resolved SPARK-42542. --- Fix Version/s: 3.4.1 Resolution: Fixed > Support Pivot without providing pivot column values > --- > > Key: SPARK-42542 > URL: https://issues.apache.org/jira/browse/SPARK-42542 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > Fix For: 3.4.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
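The resolved behavior matches the existing DataFrame pivot overload in which Spark infers the pivot values by computing the distinct values of the pivot column. A small illustration against the classic API, with hypothetical data:
{code}
import org.apache.spark.sql.SparkSession

object PivotExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[*]").getOrCreate()
    import spark.implicits._

    val sales = Seq((2022, "Java", 100), (2022, "Scala", 150), (2023, "Scala", 200))
      .toDF("year", "course", "earnings")

    // No value list is passed to pivot(): Spark first computes the
    // distinct values of `course`, then pivots on them.
    sales.groupBy($"year")
      .pivot($"course")
      .sum("earnings")
      .show()
  }
}
{code}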
[jira] [Assigned] (SPARK-41725) Remove the workaround of sql(...).collect back in PySpark tests
[ https://issues.apache.org/jira/browse/SPARK-41725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41725: Assignee: (was: Apache Spark) > Remove the workaround of sql(...).collect back in PySpark tests > --- > > Key: SPARK-41725 > URL: https://issues.apache.org/jira/browse/SPARK-41725 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark, Tests >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > See https://github.com/apache/spark/pull/39224/files#r1057436437 > We don't have to `collect` for every `sql`, but Spark Connect requires it. We > should remove these workarounds. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41725) Remove the workaround of sql(...).collect back in PySpark tests
[ https://issues.apache.org/jira/browse/SPARK-41725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17694208#comment-17694208 ] Apache Spark commented on SPARK-41725: -- User 'grundprinzip' has created a pull request for this issue: https://github.com/apache/spark/pull/40160 > Remove the workaround of sql(...).collect back in PySpark tests > --- > > Key: SPARK-41725 > URL: https://issues.apache.org/jira/browse/SPARK-41725 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark, Tests >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > See https://github.com/apache/spark/pull/39224/files#r1057436437 > We don't have to `collect` for every `sql`, but Spark Connect requires it. We > should remove these workarounds. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41725) Remove the workaround of sql(...).collect back in PySpark tests
[ https://issues.apache.org/jira/browse/SPARK-41725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41725: Assignee: Apache Spark > Remove the workaround of sql(...).collect back in PySpark tests > --- > > Key: SPARK-41725 > URL: https://issues.apache.org/jira/browse/SPARK-41725 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark, Tests >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > > See https://github.com/apache/spark/pull/39224/files#r1057436437 > We don't have to `collect` for every `sql`, but Spark Connect requires it. We > should remove these workarounds. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42510) Implement `DataFrame.mapInPandas`
[ https://issues.apache.org/jira/browse/SPARK-42510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17694162#comment-17694162 ] Apache Spark commented on SPARK-42510: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/40201 > Implement `DataFrame.mapInPandas` > - > > Key: SPARK-42510 > URL: https://issues.apache.org/jira/browse/SPARK-42510 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.4.0 > > > Implement `DataFrame.mapInPandas` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42510) Implement `DataFrame.mapInPandas`
[ https://issues.apache.org/jira/browse/SPARK-42510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-42510. --- Fix Version/s: 3.4.0 Assignee: Xinrong Meng Resolution: Fixed Issue resolved by pull request 40104 https://github.com/apache/spark/pull/40104 > Implement `DataFrame.mapInPandas` > - > > Key: SPARK-42510 > URL: https://issues.apache.org/jira/browse/SPARK-42510 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.4.0 > > > Implement `DataFrame.mapInPandas` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42610) Add implicit encoders to SQLImplicits
Herman van Hövell created SPARK-42610: - Summary: Add implicit encoders to SQLImplicits Key: SPARK-42610 URL: https://issues.apache.org/jira/browse/SPARK-42610 Project: Spark Issue Type: New Feature Components: Connect Affects Versions: 3.4.0 Reporter: Herman van Hövell -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
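SQLImplicits is the trait behind `import spark.implicits._`. A sketch of the kind of implicit encoder definitions the ticket refers to follows; the trait name here is illustrative, but the {{Encoders}} factory methods are the existing public API.
{code}
import org.apache.spark.sql.{Encoder, Encoders}
import scala.reflect.runtime.universe.TypeTag

// Illustrative trait (not the real SQLImplicits) showing the shape of
// implicit encoders that let spark.createDataset(Seq(1, 2, 3)) resolve
// an Encoder[Int] without the caller supplying one explicitly.
trait ExampleImplicits {
  implicit def intEncoder: Encoder[Int] = Encoders.scalaInt
  implicit def longEncoder: Encoder[Long] = Encoders.scalaLong
  implicit def stringEncoder: Encoder[String] = Encoders.STRING
  implicit def productEncoder[T <: Product : TypeTag]: Encoder[T] =
    Encoders.product[T]
}
{code}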
[jira] [Created] (SPARK-42609) Support Grouping/Grouping_Set
Rui Wang created SPARK-42609: Summary: Support Grouping/Grouping_Set Key: SPARK-42609 URL: https://issues.apache.org/jira/browse/SPARK-42609 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Rui Wang Assignee: Rui Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
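For reference, the classic DataFrame API already exposes {{grouping}} and {{grouping_id}} in org.apache.spark.sql.functions; this sub-task presumably tracks exposing the same through Connect. Classic usage looks like the following, with hypothetical data:
{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{grouping, grouping_id, sum}

object GroupingExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq(("eng", "sf", 10), ("eng", "nyc", 20), ("hr", "sf", 5))
      .toDF("dept", "city", "cnt")

    // grouping(col) is 1 on rows where `col` was aggregated away by the
    // cube; grouping_id() packs those indicator bits into one integer.
    df.cube($"dept", $"city")
      .agg(sum($"cnt"), grouping($"dept"), grouping($"city"), grouping_id())
      .show()
  }
}
{code}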