[jira] [Updated] (SPARK-40493) Revert "[SPARK-33861][SQL] Simplify conditional in predicate"
[ https://issues.apache.org/jira/browse/SPARK-40493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-40493: -- Fix Version/s: 3.3.2 (was: 3.3.1) > Revert "[SPARK-33861][SQL] Simplify conditional in predicate" > - > > Key: SPARK-40493 > URL: https://issues.apache.org/jira/browse/SPARK-40493 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.2.3, 3.3.2 > > > Please see https://github.com/apache/spark/pull/30865#issuecomment-755285940 > for more details. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40494) Optimize the performance of `keys.zipWithIndex.toMap` code pattern
[ https://issues.apache.org/jira/browse/SPARK-40494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40494: Assignee: (was: Apache Spark) > Optimize the performance of `keys.zipWithIndex.toMap` code pattern > --- > > Key: SPARK-40494 > URL: https://issues.apache.org/jira/browse/SPARK-40494 > Project: Spark > Issue Type: Improvement > Components: MLlib, Spark Core, SQL >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Minor > > Similar to SPARK-40175, the manual `while` loop style can be used to optimize the performance of the `keys.zipWithIndex.toMap` code pattern in Spark.
[jira] [Assigned] (SPARK-40494) Optimize the performance of `keys.zipWithIndex.toMap` code pattern
[ https://issues.apache.org/jira/browse/SPARK-40494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40494: Assignee: Apache Spark > Optimize the performance of `keys.zipWithIndex.toMap` code pattern > --- > > Key: SPARK-40494 > URL: https://issues.apache.org/jira/browse/SPARK-40494 > Project: Spark > Issue Type: Improvement > Components: MLlib, Spark Core, SQL >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Apache Spark >Priority: Minor > > Similar to SPARK-40175, the manual `while` loop style can be used to optimize the performance of the `keys.zipWithIndex.toMap` code pattern in Spark.
[jira] [Commented] (SPARK-40494) Optimize the performance of `keys.zipWithIndex.toMap` code pattern
[ https://issues.apache.org/jira/browse/SPARK-40494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606915#comment-17606915 ] Apache Spark commented on SPARK-40494: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/37940 > Optimize the performance of `keys.zipWithIndex.toMap` code pattern > --- > > Key: SPARK-40494 > URL: https://issues.apache.org/jira/browse/SPARK-40494 > Project: Spark > Issue Type: Improvement > Components: MLlib, Spark Core, SQL >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Minor > > Similar to SPARK-40175, the manual `while` loop style can be used to optimize the performance of the `keys.zipWithIndex.toMap` code pattern in Spark.
[jira] [Updated] (SPARK-40493) Revert "[SPARK-33861][SQL] Simplify conditional in predicate"
[ https://issues.apache.org/jira/browse/SPARK-40493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-40493: Description: Please see https://github.com/apache/spark/pull/30865#issuecomment-755285940 for more details. > Revert "[SPARK-33861][SQL] Simplify conditional in predicate" > - > > Key: SPARK-40493 > URL: https://issues.apache.org/jira/browse/SPARK-40493 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.3.1, 3.2.3 > > > Please see https://github.com/apache/spark/pull/30865#issuecomment-755285940 > for more details.
[jira] [Assigned] (SPARK-40493) Revert "[SPARK-33861][SQL] Simplify conditional in predicate"
[ https://issues.apache.org/jira/browse/SPARK-40493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-40493: --- Assignee: Yuming Wang > Revert "[SPARK-33861][SQL] Simplify conditional in predicate" > - > > Key: SPARK-40493 > URL: https://issues.apache.org/jira/browse/SPARK-40493 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.3.1, 3.2.3 > >
[jira] [Commented] (SPARK-40493) Revert "[SPARK-33861][SQL] Simplify conditional in predicate"
[ https://issues.apache.org/jira/browse/SPARK-40493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606907#comment-17606907 ] Yuming Wang commented on SPARK-40493: - Issue resolved by pull request 37729 https://github.com/apache/spark/pull/37729 > Revert "[SPARK-33861][SQL] Simplify conditional in predicate" > - > > Key: SPARK-40493 > URL: https://issues.apache.org/jira/browse/SPARK-40493 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2 >Reporter: Yuming Wang >Priority: Major > Fix For: 3.3.1, 3.2.3 > >
[jira] [Created] (SPARK-40494) Optimize the performance of `keys.zipWithIndex.toMap` code pattern
Yang Jie created SPARK-40494: Summary: Optimize the performance of `keys.zipWithIndex.toMap` code pattern Key: SPARK-40494 URL: https://issues.apache.org/jira/browse/SPARK-40494 Project: Spark Issue Type: Improvement Components: MLlib, Spark Core, SQL Affects Versions: 3.4.0 Reporter: Yang Jie Similar to SPARK-40175, the manual `while` loop style can be used to optimize the performance of the `keys.zipWithIndex.toMap` code pattern in Spark.
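For readers unfamiliar with the pattern under discussion: `keys.zipWithIndex.toMap` builds a key-to-index map by first materializing intermediate (key, index) pairs, while the "manual loop" style fills the map directly. The ticket's code is Scala; the following is only a rough Python analogue of the two styles, with illustrative names:

```python
# Rough Python analogue of the Scala pattern discussed in SPARK-40494.
# index_map_zip mirrors keys.zipWithIndex.toMap (intermediate pairs);
# index_map_loop mirrors the manual-loop style (no intermediate pairs).

def index_map_zip(keys):
    # Pair each key with its index, then build the map from the pairs.
    return dict(zip(keys, range(len(keys))))

def index_map_loop(keys):
    # Fill the map directly in a single pass.
    result = {}
    i = 0
    for k in keys:
        result[k] = i
        i += 1
    return result

keys = ["a", "b", "c"]
# Both styles produce the same map; only the allocation pattern differs.
assert index_map_zip(keys) == index_map_loop(keys) == {"a": 0, "b": 1, "c": 2}
```

The optimization trades the concise collection pipeline for fewer intermediate allocations, which matters on hot paths.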
[jira] [Resolved] (SPARK-40493) Revert "[SPARK-33861][SQL] Simplify conditional in predicate"
[ https://issues.apache.org/jira/browse/SPARK-40493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-40493. - Resolution: Fixed > Revert "[SPARK-33861][SQL] Simplify conditional in predicate" > - > > Key: SPARK-40493 > URL: https://issues.apache.org/jira/browse/SPARK-40493 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2 >Reporter: Yuming Wang >Priority: Major > Fix For: 3.3.1, 3.2.3 > >
[jira] [Updated] (SPARK-40493) Revert "[SPARK-33861][SQL] Simplify conditional in predicate"
[ https://issues.apache.org/jira/browse/SPARK-40493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-40493: Fix Version/s: 3.2.3 > Revert "[SPARK-33861][SQL] Simplify conditional in predicate" > - > > Key: SPARK-40493 > URL: https://issues.apache.org/jira/browse/SPARK-40493 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2 >Reporter: Yuming Wang >Priority: Major > Fix For: 3.3.1, 3.2.3 > >
[jira] [Updated] (SPARK-40493) Revert "[SPARK-33861][SQL] Simplify conditional in predicate"
[ https://issues.apache.org/jira/browse/SPARK-40493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-40493: Fix Version/s: 3.3.1 > Revert "[SPARK-33861][SQL] Simplify conditional in predicate" > - > > Key: SPARK-40493 > URL: https://issues.apache.org/jira/browse/SPARK-40493 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2 >Reporter: Yuming Wang >Priority: Major > Fix For: 3.3.1 > >
[jira] [Updated] (SPARK-40493) Revert "[SPARK-33861][SQL] Simplify conditional in predicate"
[ https://issues.apache.org/jira/browse/SPARK-40493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-40493: Affects Version/s: 3.2.2 3.2.1 3.2.0 > Revert "[SPARK-33861][SQL] Simplify conditional in predicate" > - > > Key: SPARK-40493 > URL: https://issues.apache.org/jira/browse/SPARK-40493 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2 >Reporter: Yuming Wang >Priority: Major >
[jira] [Created] (SPARK-40493) Revert "[SPARK-33861][SQL] Simplify conditional in predicate"
Yuming Wang created SPARK-40493: --- Summary: Revert "[SPARK-33861][SQL] Simplify conditional in predicate" Key: SPARK-40493 URL: https://issues.apache.org/jira/browse/SPARK-40493 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.3.0 Reporter: Yuming Wang
[jira] [Updated] (SPARK-38803) Set minio cpu to 250m (0.25) in K8s IT
[ https://issues.apache.org/jira/browse/SPARK-38803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38803: -- Fix Version/s: 3.3.2 > Set minio cpu to 250m (0.25) in K8s IT > -- > > Key: SPARK-38803 > URL: https://issues.apache.org/jira/browse/SPARK-38803 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Tests >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Assignee: Yikun Jiang >Priority: Major > Fix For: 3.4.0, 3.3.2 > >
[jira] [Updated] (SPARK-38802) Support spark.kubernetes.test.(driver|executor)RequestCores
[ https://issues.apache.org/jira/browse/SPARK-38802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38802: -- Fix Version/s: 3.3.2 > Support spark.kubernetes.test.(driver|executor)RequestCores > --- > > Key: SPARK-38802 > URL: https://issues.apache.org/jira/browse/SPARK-38802 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Tests >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Assignee: Yikun Jiang >Priority: Major > Fix For: 3.4.0, 3.3.2 > > > [https://github.com/apache/spark/pull/35830#pullrequestreview-929597027] > > Support spark.kubernetes.test.(driver|executor)RequestCores to allow devs to set a specific cpu for the driver/executor.
[jira] [Commented] (SPARK-40492) Perform maintenance of StateStore instances when they become inactive
[ https://issues.apache.org/jira/browse/SPARK-40492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606885#comment-17606885 ] Apache Spark commented on SPARK-40492: -- User 'chaoqin-li1123' has created a pull request for this issue: https://github.com/apache/spark/pull/37935 > Perform maintenance of StateStore instances when they become inactive > - > > Key: SPARK-40492 > URL: https://issues.apache.org/jira/browse/SPARK-40492 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.3.0 >Reporter: Chaoqin Li >Priority: Major > > Currently, maintenance of a StateStore is performed by a periodic task in the management thread. If a streaming query becomes inactive before the next maintenance task fires, its StateStore will be unloaded before cleanup. There are two cases in which a StateStore is unloaded. # The StateStoreProvider is no longer active in the system, for example when a query ends or the Spark context terminates. # Another StateStoreProvider instance is active in the system, for example when a partition is reassigned. In case 1, we should perform one last maintenance pass before unloading the instance.
[jira] [Commented] (SPARK-40492) Perform maintenance of StateStore instances when they become inactive
[ https://issues.apache.org/jira/browse/SPARK-40492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606884#comment-17606884 ] Apache Spark commented on SPARK-40492: -- User 'chaoqin-li1123' has created a pull request for this issue: https://github.com/apache/spark/pull/37935 > Perform maintenance of StateStore instances when they become inactive > - > > Key: SPARK-40492 > URL: https://issues.apache.org/jira/browse/SPARK-40492 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.3.0 >Reporter: Chaoqin Li >Priority: Major > > Currently, maintenance of a StateStore is performed by a periodic task in the management thread. If a streaming query becomes inactive before the next maintenance task fires, its StateStore will be unloaded before cleanup. There are two cases in which a StateStore is unloaded. # The StateStoreProvider is no longer active in the system, for example when a query ends or the Spark context terminates. # Another StateStoreProvider instance is active in the system, for example when a partition is reassigned. In case 1, we should perform one last maintenance pass before unloading the instance.
[jira] [Assigned] (SPARK-40492) Perform maintenance of StateStore instances when they become inactive
[ https://issues.apache.org/jira/browse/SPARK-40492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40492: Assignee: Apache Spark > Perform maintenance of StateStore instances when they become inactive > - > > Key: SPARK-40492 > URL: https://issues.apache.org/jira/browse/SPARK-40492 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.3.0 >Reporter: Chaoqin Li >Assignee: Apache Spark >Priority: Major > > Currently, maintenance of a StateStore is performed by a periodic task in the management thread. If a streaming query becomes inactive before the next maintenance task fires, its StateStore will be unloaded before cleanup. There are two cases in which a StateStore is unloaded. # The StateStoreProvider is no longer active in the system, for example when a query ends or the Spark context terminates. # Another StateStoreProvider instance is active in the system, for example when a partition is reassigned. In case 1, we should perform one last maintenance pass before unloading the instance.
[jira] [Assigned] (SPARK-40492) Perform maintenance of StateStore instances when they become inactive
[ https://issues.apache.org/jira/browse/SPARK-40492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40492: Assignee: (was: Apache Spark) > Perform maintenance of StateStore instances when they become inactive > - > > Key: SPARK-40492 > URL: https://issues.apache.org/jira/browse/SPARK-40492 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.3.0 >Reporter: Chaoqin Li >Priority: Major > > Currently, maintenance of a StateStore is performed by a periodic task in the management thread. If a streaming query becomes inactive before the next maintenance task fires, its StateStore will be unloaded before cleanup. There are two cases in which a StateStore is unloaded. # The StateStoreProvider is no longer active in the system, for example when a query ends or the Spark context terminates. # Another StateStoreProvider instance is active in the system, for example when a partition is reassigned. In case 1, we should perform one last maintenance pass before unloading the instance.
[jira] [Commented] (SPARK-40472) Improve pyspark.sql.function example experience
[ https://issues.apache.org/jira/browse/SPARK-40472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606879#comment-17606879 ] deshanxiao commented on SPARK-40472: [~hyukjin.kwon] OK, thanks~ > Improve pyspark.sql.function example experience > --- > > Key: SPARK-40472 > URL: https://issues.apache.org/jira/browse/SPARK-40472 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: deshanxiao >Priority: Minor > > There are many examples in pyspark.sql.function: > {code:java} > Examples > > >>> df = spark.range(1) > >>> df.select(lit(5).alias('height'), df.id).show() > +------+---+ > |height| id| > +------+---+ > | 5| 0| > +------+---+ {code} > We can add import statements so that users can run the examples directly.
[jira] [Resolved] (SPARK-40472) Improve pyspark.sql.function example experience
[ https://issues.apache.org/jira/browse/SPARK-40472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] deshanxiao resolved SPARK-40472. Resolution: Fixed > Improve pyspark.sql.function example experience > --- > > Key: SPARK-40472 > URL: https://issues.apache.org/jira/browse/SPARK-40472 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: deshanxiao >Priority: Minor > > There are many examples in pyspark.sql.function: > {code:java} > Examples > > >>> df = spark.range(1) > >>> df.select(lit(5).alias('height'), df.id).show() > +------+---+ > |height| id| > +------+---+ > | 5| 0| > +------+---+ {code} > We can add import statements so that users can run the examples directly.
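The fix being discussed is to make each docstring example self-contained by including its imports, so a reader can paste and run it directly. The principle can be sketched with Python's standard `doctest` module; the function and example below are hypothetical stand-ins, not actual pyspark code:

```python
import doctest

def lit_demo(value):
    """Return the value unchanged; the docstring carries a runnable example.

    The example includes its own import line, so it works when pasted
    into a fresh interpreter session:

    >>> from math import floor
    >>> floor(lit_demo(5.7))
    5
    """
    return value

# doctest.testmod() executes every >>> example in this module's docstrings
# and reports how many failed; a self-contained example passes as written.
results = doctest.testmod()
assert results.failed == 0
```

This is the property the ticket asks for: the doctest runs standalone because its dependencies are imported inside the example itself.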
[jira] [Created] (SPARK-40492) Perform maintenance of StateStore instances when they become inactive
Chaoqin Li created SPARK-40492: -- Summary: Perform maintenance of StateStore instances when they become inactive Key: SPARK-40492 URL: https://issues.apache.org/jira/browse/SPARK-40492 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 3.3.0 Reporter: Chaoqin Li Currently, maintenance of a StateStore is performed by a periodic task in the management thread. If a streaming query becomes inactive before the next maintenance task fires, its StateStore will be unloaded before cleanup. There are two cases in which a StateStore is unloaded. # The StateStoreProvider is no longer active in the system, for example when a query ends or the Spark context terminates. # Another StateStoreProvider instance is active in the system, for example when a partition is reassigned. In case 1, we should perform one last maintenance pass before unloading the instance.
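The proposed behavior for the two unload cases can be sketched as a small model. This is a minimal Python illustration of the decision described above, not Spark's actual StateStore API; the class and method names are illustrative:

```python
class StateStoreProvider:
    """Toy model of a state store that needs periodic maintenance."""

    def __init__(self, name):
        self.name = name
        self.maintenance_runs = 0
        self.unloaded = False

    def do_maintenance(self):
        # Stands in for cleanup work (e.g. removing old state files).
        self.maintenance_runs += 1

    def unload(self, active_elsewhere):
        # Case 1: the provider is gone from the system (query ended, Spark
        # context terminated) -> run one last maintenance before unloading.
        # Case 2: another instance is active (partition reassigned) -> the
        # other instance will take over maintenance; just unload.
        if not active_elsewhere:
            self.do_maintenance()
        self.unloaded = True

# Case 1: query ends, so a final maintenance pass runs before unload.
store = StateStoreProvider("query-1")
store.unload(active_elsewhere=False)
assert store.maintenance_runs == 1 and store.unloaded
```

In case 2, calling `unload(active_elsewhere=True)` would skip the final pass, since the reassigned instance handles cleanup.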
[jira] [Updated] (SPARK-37275) Support ANSI intervals in PySpark
[ https://issues.apache.org/jira/browse/SPARK-37275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-37275: - Fix Version/s: 3.3.0 > Support ANSI intervals in PySpark > - > > Key: SPARK-37275 > URL: https://issues.apache.org/jira/browse/SPARK-37275 > Project: Spark > Issue Type: Umbrella > Components: PySpark, SQL >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Labels: release-notes > Fix For: 3.3.0 > > > This JIRA targets to implement ANSI interval types in PySpark: > - > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DayTimeIntervalType.scala > - > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/YearMonthIntervalType.scala
[jira] [Commented] (SPARK-40489) Spark 3.3.0 breaks with SFL4J 2.
[ https://issues.apache.org/jira/browse/SPARK-40489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606875#comment-17606875 ] Jungtaek Lim commented on SPARK-40489: -- It would be nice if you can help Spark to retain the dependency as SLF4J1 but also work with SLF4J2. If you meant to propose a PR for achieving this (instead of bumping the version), it would be really appreciated! > Spark 3.3.0 breaks with SFL4J 2. > > > Key: SPARK-40489 > URL: https://issues.apache.org/jira/browse/SPARK-40489 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Garret Wilson >Priority: Major > > Spark breaks fundamentally with SLF4J 2.x because it uses > {{StaticLoggerBinder}}. > SLF4J is the logging facade that is meant to shield the application from the > implementation, whether it be Log4J or Logback or whatever. Historically > SLF4J 1.x used a bad approach to configuration: it used a > {{StaticLoggerBinder}} (a global static singleton instance) rather than the > Java {{ServiceLoader}} mechanism. > SLF4J 2.x, which has been in development for years, has been released. It > finally switches to use the {{ServiceLoader}} mechanism. As [described in the > FAQ|https://www.slf4j.org/faq.html#changesInVersion200], the API should be > compatible; an application just needs to use the latest Log4J/Logback > implementation which has the service loader. > *Above all the application must _not_ use the low-level > {{StaticLoggerBinder}} method, because it has been removed!* > Unfortunately > [{{org.apache.spark.internal.Logging}}|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/Logging.scala] > uses {{StaticLoggerBinder}} and completely breaks any environment using > SLF4J 2.x. For example, in my application, I have pulled in the SLF4J 2.x API > and pulled in the Logback 1.4.x libraries (I'm not even using Log4J). 
Spark > breaks completely just trying to get a Spark session: > {noformat} > Caused by: java.lang.NoClassDefFoundError: org/slf4j/impl/StaticLoggerBinder > at > org.apache.spark.internal.Logging$.org$apache$spark$internal$Logging$$isLog4j2(Logging.scala:232) > at > org.apache.spark.internal.Logging.initializeLogging(Logging.scala:129) > at > org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:115) > at > org.apache.spark.internal.Logging.initializeLogIfNecessary$(Logging.scala:109) > at > org.apache.spark.SparkContext.initializeLogIfNecessary(SparkContext.scala:84) > at > org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:106) > at > org.apache.spark.internal.Logging.initializeLogIfNecessary$(Logging.scala:105) > at > org.apache.spark.SparkContext.initializeLogIfNecessary(SparkContext.scala:84) > at org.apache.spark.internal.Logging.log(Logging.scala:53) > at org.apache.spark.internal.Logging.log$(Logging.scala:51) > at org.apache.spark.SparkContext.log(SparkContext.scala:84) > at org.apache.spark.internal.Logging.logInfo(Logging.scala:61) > at org.apache.spark.internal.Logging.logInfo$(Logging.scala:60) > at org.apache.spark.SparkContext.logInfo(SparkContext.scala:84) > at org.apache.spark.SparkContext.<init>(SparkContext.scala:195) > at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2704) > at > org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:953) > at scala.Option.getOrElse(Option.scala:201) > at > org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:947) > {noformat} > This is because Spark is playing low-level tricks to find out if the logging > platform is Log4J, and relying on {{StaticLoggerBinder}} to do it. 
> {code} > private def isLog4j2(): Boolean = { > // This distinguishes the log4j 1.2 binding, currently > // org.slf4j.impl.Log4jLoggerFactory, from the log4j 2.0 binding, > currently > // org.apache.logging.slf4j.Log4jLoggerFactory > val binderClass = StaticLoggerBinder.getSingleton.getLoggerFactoryClassStr > "org.apache.logging.slf4j.Log4jLoggerFactory".equals(binderClass) > } > {code} > Whatever the wisdom of Spark's relying on Log4J-specific functionality, Spark > should not be using {{StaticLoggerBinder}} to do that detection. There are > many other approaches. (The code itself suggest one approach: > {{LogManager.getRootLogger.asInstanceOf[Log4jLogger]}}. You could check to > see if the root logger actually is a {{Log4jLogger}}. There may be even > better approaches.) > The other big problem is relying on the Log4J classes themselves. By relying > on those classes, you force me to bring in Log4J as a dependency, which
[jira] [Updated] (SPARK-40489) Spark 3.3.0 breaks with SFL4J 2.
[ https://issues.apache.org/jira/browse/SPARK-40489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim updated SPARK-40489: - Priority: Major (was: Critical) > Spark 3.3.0 breaks with SFL4J 2. > > > Key: SPARK-40489 > URL: https://issues.apache.org/jira/browse/SPARK-40489 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Garret Wilson >Priority: Major > > Spark breaks fundamentally with SLF4J 2.x because it uses > {{StaticLoggerBinder}}. > SLF4J is the logging facade that is meant to shield the application from the > implementation, whether it be Log4J or Logback or whatever. Historically > SLF4J 1.x used a bad approach to configuration: it used a > {{StaticLoggerBinder}} (a global static singleton instance) rather than the > Java {{ServiceLoader}} mechanism. > SLF4J 2.x, which has been in development for years, has been released. It > finally switches to use the {{ServiceLoader}} mechanism. As [described in the > FAQ|https://www.slf4j.org/faq.html#changesInVersion200], the API should be > compatible; an application just needs to use the latest Log4J/Logback > implementation which has the service loader. > *Above all the application must _not_ use the low-level > {{StaticLoggerBinder}} method, because it has been removed!* > Unfortunately > [{{org.apache.spark.internal.Logging}}|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/Logging.scala] > uses {{StaticLoggerBinder}} and completely breaks any environment using > SLF4J 2.x. For example, in my application, I have pulled in the SLF4J 2.x API > and pulled in the Logback 1.4.x libraries (I'm not even using Log4J). 
Spark > breaks completely just trying to get a Spark session: > {noformat} > Caused by: java.lang.NoClassDefFoundError: org/slf4j/impl/StaticLoggerBinder > at > org.apache.spark.internal.Logging$.org$apache$spark$internal$Logging$$isLog4j2(Logging.scala:232) > at > org.apache.spark.internal.Logging.initializeLogging(Logging.scala:129) > at > org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:115) > at > org.apache.spark.internal.Logging.initializeLogIfNecessary$(Logging.scala:109) > at > org.apache.spark.SparkContext.initializeLogIfNecessary(SparkContext.scala:84) > at > org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:106) > at > org.apache.spark.internal.Logging.initializeLogIfNecessary$(Logging.scala:105) > at > org.apache.spark.SparkContext.initializeLogIfNecessary(SparkContext.scala:84) > at org.apache.spark.internal.Logging.log(Logging.scala:53) > at org.apache.spark.internal.Logging.log$(Logging.scala:51) > at org.apache.spark.SparkContext.log(SparkContext.scala:84) > at org.apache.spark.internal.Logging.logInfo(Logging.scala:61) > at org.apache.spark.internal.Logging.logInfo$(Logging.scala:60) > at org.apache.spark.SparkContext.logInfo(SparkContext.scala:84) > at org.apache.spark.SparkContext.<init>(SparkContext.scala:195) > at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2704) > at > org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:953) > at scala.Option.getOrElse(Option.scala:201) > at > org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:947) > {noformat} > This is because Spark is playing low-level tricks to find out if the logging > platform is Log4J, and relying on {{StaticLoggerBinder}} to do it. 
> {code} > private def isLog4j2(): Boolean = { > // This distinguishes the log4j 1.2 binding, currently > // org.slf4j.impl.Log4jLoggerFactory, from the log4j 2.0 binding, > currently > // org.apache.logging.slf4j.Log4jLoggerFactory > val binderClass = StaticLoggerBinder.getSingleton.getLoggerFactoryClassStr > "org.apache.logging.slf4j.Log4jLoggerFactory".equals(binderClass) > } > {code} > Whatever the wisdom of Spark's relying on Log4J-specific functionality, Spark > should not be using {{StaticLoggerBinder}} to do that detection. There are > many other approaches. (The code itself suggest one approach: > {{LogManager.getRootLogger.asInstanceOf[Log4jLogger]}}. You could check to > see if the root logger actually is a {{Log4jLogger}}. There may be even > better approaches.) > The other big problem is relying on the Log4J classes themselves. By relying > on those classes, you force me to bring in Log4J as a dependency, which in > the latest versions will register themselves with the service loader > mechanism, causing conflicting SLF4J implementations. > It is paramount that you: > * Remove all reliance ton {{StaticLoggerBinder}}. If you
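The alternative the reporter suggests is to inspect the logger object the facade actually returns (e.g. check whether the root logger is a `Log4jLogger`) instead of consulting the `StaticLoggerBinder` singleton that SLF4J 2 removed. The idea of capability-based detection can be illustrated loosely with Python's standard `logging` module; this is only a demonstration of the detection style, not SLF4J code:

```python
import logging

def backend_is(logger, expected_class_name):
    # Instead of asking a global binder which factory produced the logger
    # (the StaticLoggerBinder approach), inspect the class of the object
    # the logging facade actually handed back.
    return type(logger).__name__ == expected_class_name

# logging.getLogger() with no name returns the RootLogger singleton,
# so a class check on it identifies the backing implementation.
root = logging.getLogger()
assert backend_is(root, "RootLogger")
assert not backend_is(root, "Log4jLogger")
```

The detection degrades gracefully: if the expected class is absent, the check simply returns False rather than throwing a `NoClassDefFoundError` at session startup.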
[jira] [Commented] (SPARK-40489) Spark 3.3.0 breaks with SFL4J 2.
[ https://issues.apache.org/jira/browse/SPARK-40489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606872#comment-17606872 ] Jungtaek Lim commented on SPARK-40489: -- https://www.slf4j.org/news.html 2022-08-20 - Release of SLF4J 2.0.0 2022-09-14 - Release of SLF4J 2.0.1 It sounds like this major version was released only a month ago, and we don't yet know much about its stability, so it doesn't seem urgent enough to warrant a critical priority. (I'm going to lower the priority.) Also, we cannot easily move on before making clear that a Spark version migrating from SLF4J 1 to SLF4J 2 introduces NO breakage or behavioral change. We wouldn't be happy with breaking or behavioral changes brought in by a dependency, hence the concern about a major version upgrade of a dependency. The comment about log4j 1 is moot, as recent versions of Spark use log4j 2. > Spark 3.3.0 breaks with SFL4J 2. > > > Key: SPARK-40489 > URL: https://issues.apache.org/jira/browse/SPARK-40489 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Garret Wilson >Priority: Critical > > Spark breaks fundamentally with SLF4J 2.x because it uses > {{StaticLoggerBinder}}. > SLF4J is the logging facade that is meant to shield the application from the > implementation, whether it be Log4J or Logback or whatever. Historically > SLF4J 1.x used a bad approach to configuration: it used a > {{StaticLoggerBinder}} (a global static singleton instance) rather than the > Java {{ServiceLoader}} mechanism. > SLF4J 2.x, which has been in development for years, has been released. It > finally switches to use the {{ServiceLoader}} mechanism. As [described in the > FAQ|https://www.slf4j.org/faq.html#changesInVersion200], the API should be > compatible; an application just needs to use the latest Log4J/Logback > implementation which has the service loader. 
> *Above all the application must _not_ use the low-level > {{StaticLoggerBinder}} method, because it has been removed!* > Unfortunately > [{{org.apache.spark.internal.Logging}}|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/Logging.scala] > uses {{StaticLoggerBinder}} and completely breaks any environment using > SLF4J 2.x. For example, in my application, I have pulled in the SLF4J 2.x API > and pulled in the Logback 1.4.x libraries (I'm not even using Log4J). Spark > breaks completely just trying to get a Spark session: > {noformat} > Caused by: java.lang.NoClassDefFoundError: org/slf4j/impl/StaticLoggerBinder > at > org.apache.spark.internal.Logging$.org$apache$spark$internal$Logging$$isLog4j2(Logging.scala:232) > at > org.apache.spark.internal.Logging.initializeLogging(Logging.scala:129) > at > org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:115) > at > org.apache.spark.internal.Logging.initializeLogIfNecessary$(Logging.scala:109) > at > org.apache.spark.SparkContext.initializeLogIfNecessary(SparkContext.scala:84) > at > org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:106) > at > org.apache.spark.internal.Logging.initializeLogIfNecessary$(Logging.scala:105) > at > org.apache.spark.SparkContext.initializeLogIfNecessary(SparkContext.scala:84) > at org.apache.spark.internal.Logging.log(Logging.scala:53) > at org.apache.spark.internal.Logging.log$(Logging.scala:51) > at org.apache.spark.SparkContext.log(SparkContext.scala:84) > at org.apache.spark.internal.Logging.logInfo(Logging.scala:61) > at org.apache.spark.internal.Logging.logInfo$(Logging.scala:60) > at org.apache.spark.SparkContext.logInfo(SparkContext.scala:84) > at org.apache.spark.SparkContext.(SparkContext.scala:195) > at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2704) > at > org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:953) > at 
scala.Option.getOrElse(Option.scala:201) > at > org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:947) > {noformat}
> This is because Spark is playing low-level tricks to find out if the logging platform is Log4J, and relying on {{StaticLoggerBinder}} to do it.
> {code}
> private def isLog4j2(): Boolean = {
>   // This distinguishes the log4j 1.2 binding, currently
>   // org.slf4j.impl.Log4jLoggerFactory, from the log4j 2.0 binding, currently
>   // org.apache.logging.slf4j.Log4jLoggerFactory
>   val binderClass = StaticLoggerBinder.getSingleton.getLoggerFactoryClassStr
>   "org.apache.logging.slf4j.Log4jLoggerFactory".equals(binderClass)
> }
> {code}
> Whatever the wisdom of Spark's relying on Log4J-specific
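[Editorial note: one binding-agnostic alternative, as a sketch only — this is not necessarily what Spark adopted. SLF4J's public {{LoggerFactory.getILoggerFactory}} exists in both SLF4J 1.x and 2.x, so the same detection could be done without touching the removed {{StaticLoggerBinder}}; the factory class name checked here is an assumption carried over from the quoted code.]

```scala
import org.slf4j.LoggerFactory

object LoggingDetection {
  // Sketch: detect the log4j 2.x SLF4J binding via the public SLF4J API.
  // getILoggerFactory is available in both SLF4J 1.x and 2.x, so this avoids
  // the NoClassDefFoundError on StaticLoggerBinder shown in the stack trace.
  def isLog4j2(): Boolean =
    LoggerFactory.getILoggerFactory.getClass.getName ==
      "org.apache.logging.slf4j.Log4jLoggerFactory"
}
```

This requires an SLF4J implementation on the classpath at runtime, but only the SLF4J API at compile time.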
[jira] [Updated] (SPARK-40460) Streaming metrics is zero when select _metadata
[ https://issues.apache.org/jira/browse/SPARK-40460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-40460: - Fix Version/s: 3.3.2 > Streaming metrics is zero when select _metadata > --- > > Key: SPARK-40460 > URL: https://issues.apache.org/jira/browse/SPARK-40460 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.3.0, 3.4.0, 3.3.1, 3.3.2 >Reporter: Yaohua Zhao >Assignee: Yaohua Zhao >Priority: Major > Fix For: 3.4.0, 3.3.2 > > > Streaming metrics report all 0 (`processedRowsPerSecond`, etc) when selecting > `_metadata` column. Because the logical plan from the batch and the actual > planned logical are mismatched: > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ProgressReporter.scala#L348] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-40489) Spark 3.3.0 breaks with SFL4J 2.
[ https://issues.apache.org/jira/browse/SPARK-40489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606852#comment-17606852 ] Garret Wilson edited comment on SPARK-40489 at 9/20/22 3:15 AM: # Dropping explicit Log4J 1.x support is certainly one of the things that needs to be done immediately. Not only is it full of vulnerabilities, it [reached end of life|https://logging.apache.org/log4j/1.2/] over five years ago! # The Log4J implementation dependencies should be removed from Spark as well. See my question [Correctly fixing multiple `StaticLoggerBinder` bindings in Spark|https://stackoverflow.com/q/73615263] on Stack Overflow (which few people seem to have given any thought or care about, given the zero responses I have received so far). # And of course {{StaticLoggerBinder}} references should be abandoned. All this should have been done years ago. I mention this to give it some sense of urgency, in light of what I will say next. I hesitate to even mention the following, because it might lower the priority of the ticket, but for those who might be in a pickle, I just released {{io.clogr:clogr-slf4j1-adapter:0.8.2}} to Maven Central, which is an [adapter|https://github.com/globalmentor/clogr/tree/master/clogr-slf4j1-adapter] (a shim, really) that will keep Spark from breaking in the face of SLF4J 2.x. Just include it as a dependency and Spark will stop breaking. *But this is a stop-gap measure! Please fix this bug!* :) was (Author: garretwilson): # Dropping explicit Log4J 1.x support is certainly one of the things that needs to be done immediately. Not only is it full of vulnerabilities, it [reached end of life|https://logging.apache.org/log4j/1.2/] over five years ago! # The Log4J implementation dependencies should be removed from Spark as well. 
See my question [Correctly fixing multiple `StaticLoggerBinder` bindings in Spark|https://stackoverflow.com/q/73615263] on Stack Overflow (which few people seem to have given any thought or care about, given the zero responses I have received so far). # And of course {{StaticLoggerBinder}} references should be abandoned. All this should have been done years ago. I mention this to give it some sense of urgency, in light of what I will say next. I hesitate to even mention the following, because it might lower the priority of the ticket, but for those who might be in a pickle, I just released {{io.clogr:clogr-slf4j1-adapter:0.8.2}} to Maven Central, which is an adapter (a shim, really) that will keep Spark from breaking in the face of SLF4J 2.x. Just include it as a dependency and Spark will stop breaking. *But this is a stop-gap measure! Please fix this bug!* :) > Spark 3.3.0 breaks with SFL4J 2. > > > Key: SPARK-40489 > URL: https://issues.apache.org/jira/browse/SPARK-40489 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Garret Wilson >Priority: Critical > > Spark breaks fundamentally with SLF4J 2.x because it uses > {{StaticLoggerBinder}}. > SLF4J is the logging facade that is meant to shield the application from the > implementation, whether it be Log4J or Logback or whatever. Historically > SLF4J 1.x used a bad approach to configuration: it used a > {{StaticLoggerBinder}} (a global static singleton instance) rather than the > Java {{ServiceLoader}} mechanism. > SLF4J 2.x, which has been in development for years, has been released. It > finally switches to use the {{ServiceLoader}} mechanism. As [described in the > FAQ|https://www.slf4j.org/faq.html#changesInVersion200], the API should be > compatible; an application just needs to use the latest Log4J/Logback > implementation which has the service loader. 
> *Above all the application must _not_ use the low-level > {{StaticLoggerBinder}} method, because it has been removed!* > Unfortunately > [{{org.apache.spark.internal.Logging}}|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/Logging.scala] > uses {{StaticLoggerBinder}} and completely breaks any environment using > SLF4J 2.x. For example, in my application, I have pulled in the SLF4J 2.x API > and pulled in the Logback 1.4.x libraries (I'm not even using Log4J). Spark > breaks completely just trying to get a Spark session: > {noformat} > Caused by: java.lang.NoClassDefFoundError: org/slf4j/impl/StaticLoggerBinder > at > org.apache.spark.internal.Logging$.org$apache$spark$internal$Logging$$isLog4j2(Logging.scala:232) > at > org.apache.spark.internal.Logging.initializeLogging(Logging.scala:129) > at > org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:115) > at > org.apache.spark.internal.Logging.initializeLogIfNecessary$(Logging.scala:109) > at
[jira] [Commented] (SPARK-40489) Spark 3.3.0 breaks with SFL4J 2.
[ https://issues.apache.org/jira/browse/SPARK-40489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606852#comment-17606852 ] Garret Wilson commented on SPARK-40489: --- # Dropping explicit Log4J 1.x support is certainly one of the things that needs to be done immediately. Not only is it full of vulnerabilities, it [reached end of life|https://logging.apache.org/log4j/1.2/] over five years ago! # The Log4J implementation dependencies should be removed from Spark as well. See my question [Correctly fixing multiple `StaticLoggerBinder` bindings in Spark|https://stackoverflow.com/q/73615263] on Stack Overflow (which few people seem to have given any thought or care about, given the zero responses I have received so far). # And of course {{StaticLoggerBinder}} references should be abandoned. All this should have been done years ago. I mention this to give it some sense of urgency, in light of what I will say next. I hesitate to even mention the following, because it might lower the priority of the ticket, but for those who might be in a pickle, I just released {{io.clogr:clogr-slf4j1-adapter:0.8.2}} to Maven Central, which is an adapter (a shim, really) that will keep Spark from breaking in the face of SLF4J 2.x. Just include it as a dependency and Spark will stop breaking. *But this is a stop-gap measure! Please fix this bug!* :) > Spark 3.3.0 breaks with SFL4J 2. > > > Key: SPARK-40489 > URL: https://issues.apache.org/jira/browse/SPARK-40489 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Garret Wilson >Priority: Critical > > Spark breaks fundamentally with SLF4J 2.x because it uses > {{StaticLoggerBinder}}. > SLF4J is the logging facade that is meant to shield the application from the > implementation, whether it be Log4J or Logback or whatever. 
Historically > SLF4J 1.x used a bad approach to configuration: it used a > {{StaticLoggerBinder}} (a global static singleton instance) rather than the > Java {{ServiceLoader}} mechanism. > SLF4J 2.x, which has been in development for years, has been released. It > finally switches to use the {{ServiceLoader}} mechanism. As [described in the > FAQ|https://www.slf4j.org/faq.html#changesInVersion200], the API should be > compatible; an application just needs to use the latest Log4J/Logback > implementation which has the service loader. > *Above all the application must _not_ use the low-level > {{StaticLoggerBinder}} method, because it has been removed!* > Unfortunately > [{{org.apache.spark.internal.Logging}}|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/Logging.scala] > uses {{StaticLoggerBinder}} and completely breaks any environment using > SLF4J 2.x. For example, in my application, I have pulled in the SLF4J 2.x API > and pulled in the Logback 1.4.x libraries (I'm not even using Log4J). 
Spark > breaks completely just trying to get a Spark session: > {noformat} > Caused by: java.lang.NoClassDefFoundError: org/slf4j/impl/StaticLoggerBinder > at > org.apache.spark.internal.Logging$.org$apache$spark$internal$Logging$$isLog4j2(Logging.scala:232) > at > org.apache.spark.internal.Logging.initializeLogging(Logging.scala:129) > at > org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:115) > at > org.apache.spark.internal.Logging.initializeLogIfNecessary$(Logging.scala:109) > at > org.apache.spark.SparkContext.initializeLogIfNecessary(SparkContext.scala:84) > at > org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:106) > at > org.apache.spark.internal.Logging.initializeLogIfNecessary$(Logging.scala:105) > at > org.apache.spark.SparkContext.initializeLogIfNecessary(SparkContext.scala:84) > at org.apache.spark.internal.Logging.log(Logging.scala:53) > at org.apache.spark.internal.Logging.log$(Logging.scala:51) > at org.apache.spark.SparkContext.log(SparkContext.scala:84) > at org.apache.spark.internal.Logging.logInfo(Logging.scala:61) > at org.apache.spark.internal.Logging.logInfo$(Logging.scala:60) > at org.apache.spark.SparkContext.logInfo(SparkContext.scala:84) > at org.apache.spark.SparkContext.(SparkContext.scala:195) > at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2704) > at > org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:953) > at scala.Option.getOrElse(Option.scala:201) > at > org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:947) > {noformat} > This is because Spark is playing low-level tricks to find out if the logging > platform is Log4J, and relying on
[jira] [Commented] (SPARK-40491) Expose a jdbcRDD function in SparkContext
[ https://issues.apache.org/jira/browse/SPARK-40491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606850#comment-17606850 ] Apache Spark commented on SPARK-40491: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/37937 > Expose a jdbcRDD function in SparkContext > - > > Key: SPARK-40491 > URL: https://issues.apache.org/jira/browse/SPARK-40491 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Priority: Major > > According to the legacy document of JdbcRDD, we need to expose a jdbcRDD > function in SparkContext. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40490) `YarnShuffleIntegrationSuite` no longer verifies `registeredExecFile` reload after SPARK-17321
[ https://issues.apache.org/jira/browse/SPARK-40490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606851#comment-17606851 ] Apache Spark commented on SPARK-40490: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/37938 > `YarnShuffleIntegrationSuite` no longer verifies `registeredExecFile` reload > after SPARK-17321 > > > Key: SPARK-40490 > URL: https://issues.apache.org/jira/browse/SPARK-40490 > Project: Spark > Issue Type: Improvement > Components: Tests, YARN >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Major > > After SPARK-17321, YarnShuffleService will persist data to local shuffle > state db and reload data from local shuffle state db only when Yarn > NodeManager start with `YarnConfiguration#NM_RECOVERY_ENABLED = true` , but > `YarnShuffleIntegrationSuite` not set this config and the default value of > the configuration is false, so `YarnShuffleIntegrationSuite` will neither > trigger data persistence to the db nor verify the reload of data > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40491) Expose a jdbcRDD function in SparkContext
[ https://issues.apache.org/jira/browse/SPARK-40491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606848#comment-17606848 ] Apache Spark commented on SPARK-40491: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/37937 > Expose a jdbcRDD function in SparkContext > - > > Key: SPARK-40491 > URL: https://issues.apache.org/jira/browse/SPARK-40491 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Priority: Major > > According to the legacy document of JdbcRDD, we need to expose a jdbcRDD > function in SparkContext. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40490) `YarnShuffleIntegrationSuite` no longer verifies `registeredExecFile` reload after SPARK-17321
[ https://issues.apache.org/jira/browse/SPARK-40490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40490: Assignee: Apache Spark > `YarnShuffleIntegrationSuite` no longer verifies `registeredExecFile` reload > after SPARK-17321 > > > Key: SPARK-40490 > URL: https://issues.apache.org/jira/browse/SPARK-40490 > Project: Spark > Issue Type: Improvement > Components: Tests, YARN >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Apache Spark >Priority: Major > > After SPARK-17321, YarnShuffleService will persist data to local shuffle > state db and reload data from local shuffle state db only when Yarn > NodeManager start with `YarnConfiguration#NM_RECOVERY_ENABLED = true` , but > `YarnShuffleIntegrationSuite` not set this config and the default value of > the configuration is false, so `YarnShuffleIntegrationSuite` will neither > trigger data persistence to the db nor verify the reload of data > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40491) Expose a jdbcRDD function in SparkContext
[ https://issues.apache.org/jira/browse/SPARK-40491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40491: Assignee: Apache Spark > Expose a jdbcRDD function in SparkContext > - > > Key: SPARK-40491 > URL: https://issues.apache.org/jira/browse/SPARK-40491 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Assignee: Apache Spark >Priority: Major > > According to the legacy document of JdbcRDD, we need to expose a jdbcRDD > function in SparkContext. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40490) `YarnShuffleIntegrationSuite` no longer verifies `registeredExecFile` reload after SPARK-17321
[ https://issues.apache.org/jira/browse/SPARK-40490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40490: Assignee: (was: Apache Spark) > `YarnShuffleIntegrationSuite` no longer verifies `registeredExecFile` reload > after SPARK-17321 > > > Key: SPARK-40490 > URL: https://issues.apache.org/jira/browse/SPARK-40490 > Project: Spark > Issue Type: Improvement > Components: Tests, YARN >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Major > > After SPARK-17321, YarnShuffleService will persist data to local shuffle > state db and reload data from local shuffle state db only when Yarn > NodeManager start with `YarnConfiguration#NM_RECOVERY_ENABLED = true` , but > `YarnShuffleIntegrationSuite` not set this config and the default value of > the configuration is false, so `YarnShuffleIntegrationSuite` will neither > trigger data persistence to the db nor verify the reload of data > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40490) `YarnShuffleIntegrationSuite` no longer verifies `registeredExecFile` reload after SPARK-17321
[ https://issues.apache.org/jira/browse/SPARK-40490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606849#comment-17606849 ] Apache Spark commented on SPARK-40490: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/37938 > `YarnShuffleIntegrationSuite` no longer verifies `registeredExecFile` reload > after SPARK-17321 > > > Key: SPARK-40490 > URL: https://issues.apache.org/jira/browse/SPARK-40490 > Project: Spark > Issue Type: Improvement > Components: Tests, YARN >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Major > > After SPARK-17321, YarnShuffleService will persist data to local shuffle > state db and reload data from local shuffle state db only when Yarn > NodeManager start with `YarnConfiguration#NM_RECOVERY_ENABLED = true` , but > `YarnShuffleIntegrationSuite` not set this config and the default value of > the configuration is false, so `YarnShuffleIntegrationSuite` will neither > trigger data persistence to the db nor verify the reload of data > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40491) Expose a jdbcRDD function in SparkContext
[ https://issues.apache.org/jira/browse/SPARK-40491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40491: Assignee: (was: Apache Spark) > Expose a jdbcRDD function in SparkContext > - > > Key: SPARK-40491 > URL: https://issues.apache.org/jira/browse/SPARK-40491 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Priority: Major > > According to the legacy document of JdbcRDD, we need to expose a jdbcRDD > function in SparkContext. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40491) Expose a jdbcRDD function in SparkContext
[ https://issues.apache.org/jira/browse/SPARK-40491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-40491: --- Description: According to the legacy document of JdbcRDD, we need to expose a jdbcRDD function in SparkContext. (was: According the legacy document of JdbcRDD, we need to expose a jdbcRDD function in SparkContext.) > Expose a jdbcRDD function in SparkContext > - > > Key: SPARK-40491 > URL: https://issues.apache.org/jira/browse/SPARK-40491 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Priority: Major > > According to the legacy document of JdbcRDD, we need to expose a jdbcRDD > function in SparkContext. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40491) Expose a jdbcRDD function in SparkContext
jiaan.geng created SPARK-40491: -- Summary: Expose a jdbcRDD function in SparkContext Key: SPARK-40491 URL: https://issues.apache.org/jira/browse/SPARK-40491 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.4.0 Reporter: jiaan.geng According the legacy document of JdbcRDD, we need to expose a jdbcRDD function in SparkContext. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40490) `YarnShuffleIntegrationSuite` no longer verifies `registeredExecFile` reload after SPARK-17321
Yang Jie created SPARK-40490: Summary: `YarnShuffleIntegrationSuite` no longer verifies `registeredExecFile` reload after SPARK-17321 Key: SPARK-40490 URL: https://issues.apache.org/jira/browse/SPARK-40490 Project: Spark Issue Type: Improvement Components: Tests, YARN Affects Versions: 3.4.0 Reporter: Yang Jie After SPARK-17321, YarnShuffleService will persist data to the local shuffle state db and reload data from it only when the Yarn NodeManager starts with `YarnConfiguration#NM_RECOVERY_ENABLED = true`, but `YarnShuffleIntegrationSuite` does not set this config, and the default value of the configuration is false, so `YarnShuffleIntegrationSuite` will neither trigger data persistence to the db nor verify the reload of data. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
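[Editorial note: the missing piece would presumably be enabling NodeManager recovery in the suite's Yarn configuration. A sketch under that assumption — the exact wiring inside `YarnShuffleIntegrationSuite` may differ:]

```scala
import org.apache.hadoop.yarn.conf.YarnConfiguration

// Sketch: with NM recovery enabled (the default is false), YarnShuffleService
// persists registered executors to the local shuffle state db and reloads
// them on restart, which is the behavior the suite is meant to verify.
val yarnConf = new YarnConfiguration()
yarnConf.setBoolean(YarnConfiguration.NM_RECOVERY_ENABLED, true)
// NM recovery also needs a recovery directory; the path here is illustrative.
yarnConf.set(YarnConfiguration.NM_RECOVERY_DIR, "/tmp/nm-recovery")
```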
[jira] [Commented] (SPARK-33152) SPIP: Constraint Propagation code causes OOM issues or increasing compilation time to hours
[ https://issues.apache.org/jira/browse/SPARK-33152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606817#comment-17606817 ] Asif commented on SPARK-33152: -- Added a test *CompareNewAndOldConstraintsSuite* in the PR which, when run on master, will highlight functionality issues with master as well as performance issues. > SPIP: Constraint Propagation code causes OOM issues or increasing compilation > time to hours > --- > > Key: SPARK-33152 > URL: https://issues.apache.org/jira/browse/SPARK-33152 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0, 3.0.1, 3.1.2 >Reporter: Asif >Priority: Major > Labels: SPIP > Original Estimate: 168h > Remaining Estimate: 168h > > h2. Q1. What are you trying to do? Articulate your objectives using > absolutely no jargon. > Proposing a new algorithm to create, store and use constraints for removing redundant filters & inferring new filters. > The current algorithm has subpar performance in complex expression scenarios involving aliases (with certain use cases the compilation time can go into hours), has the potential to cause OOM, may miss removing redundant filters in different scenarios, may miss creating IsNotNull constraints in different scenarios, and does not push compound predicates in Join. > # If not fixed, this issue can cause OutOfMemory errors or unacceptable query compilation times. Have added a test "plan equivalence with case statements and performance comparison with benefit of more than 10x conservatively" in org.apache.spark.sql.catalyst.plans.OptimizedConstraintPropagationSuite. *With this PR the compilation time is 247 ms vs 13958 ms without the change* > # It is more effective in filter pruning, as is evident in some of the tests in org.apache.spark.sql.catalyst.plans.OptimizedConstraintPropagationSuite, where the current code is not able to identify the redundant filter in some cases. 
> # It is able to generate a better optimized plan for join queries, as it can push compound predicates. > # The current logic can miss a lot of possible cases of removing redundant predicates, as it fails to take into account whether the same attribute or its aliases are repeated multiple times in a complex expression. > # There are cases where some of the optimizer rules involving removal of redundant predicates fail to remove them on the basis of constraint data. In some cases the rule works just by virtue of previous rules helping it out to cover the inaccuracy. That the ConstraintPropagation rule & its function of removing redundant filters & adding new inferred filters depends on the workings of other, unrelated previous optimizer rules is indicative of issues. > # It does away with all the EqualNullSafe constraints, as this logic does not need those constraints to be created. > # There is at least one test in the existing ConstraintPropagationSuite which is missing an IsNotNull constraint because the code incorrectly generated an EqualNullSafe constraint instead of an EqualTo constraint when using the existing Constraints code. With these changes, the test correctly creates an EqualTo constraint, resulting in an inferred IsNotNull constraint. > # It does away with the current combinatorial logic of evaluating all the constraints, which can cause compilation to run into hours or cause OOM. The number of constraints stored is exactly the same as the number of filters encountered. > h2. Q2. What problem is this proposal NOT designed to solve? > It mainly focuses on compile-time performance, but in some cases can benefit run-time characteristics too, like inferring an IsNotNull filter or pushing down compound predicates on the join, which currently may get missed or does not happen, respectively, with the present code. > h2. Q3. How is it done today, and what are the limits of current practice? 
> The current ConstraintsPropagation code pessimistically tries to generate all the possible combinations of constraints based on the aliases (even then it may miss a lot of combinations if the expression is a complex expression involving the same attribute repeated multiple times and there are many aliases to that column). There are query plans in our production environment in which the intermediate number of constraints can go into the hundreds of thousands, causing OOM or taking time running into hours. > Also, there are cases where it incorrectly generates an EqualNullSafe constraint instead of an EqualTo constraint, thus missing a possible IsNotNull constraint on the column. > Also, it only pushes single-column predicates to the other side of the join. > The constraints generated, in
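[Editorial note: the combinatorial blow-up described above can be illustrated with a back-of-the-envelope calculation; this is illustrative only, not Catalyst code. If a single predicate mentions an attribute k times and that attribute has m aliases in scope, a pessimistic alias expansion tracks m^k variants of that one predicate, whereas the proposed approach stores a single canonical constraint per filter.]

```scala
// Illustration: pessimistic alias expansion grows exponentially with the
// number of occurrences of the aliased attribute in a single predicate.
def expandedVariants(occurrences: Int, aliases: Int): BigInt =
  BigInt(aliases).pow(occurrences)

// A predicate referencing a column 5 times, with 10 aliases in scope,
// already yields 10^5 = 100000 tracked constraint variants for one filter.
```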
[jira] [Reopened] (SPARK-39494) Support `createDataFrame` from a list of scalars when schema is not provided
[ https://issues.apache.org/jira/browse/SPARK-39494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng reopened SPARK-39494: -- > Support `createDataFrame` from a list of scalars when schema is not provided > > > Key: SPARK-39494 > URL: https://issues.apache.org/jira/browse/SPARK-39494 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > Currently, DataFrame creation from a list of native Python scalars is > unsupported in PySpark, for example, > {{>>> spark.createDataFrame([1, 2]).collect()}} > {{Traceback (most recent call last):}} > {{...}} > {{TypeError: Can not infer schema for type: }} > {{However, Spark DataFrame Scala API supports that:}} > {{scala> Seq(1, 2).toDF().collect()}} > {{res6: Array[org.apache.spark.sql.Row] = Array([1], [2])}} > To maintain API consistency, we propose to support DataFrame creation from a > list of scalars. > See more > [here]([https://docs.google.com/document/d/1Rd20PVbVxNrLfOmDtetVRxkgJQhgAAtJp6XAAZfGQgc/edit?usp=sharing]). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-39494) Support `createDataFrame` from a list of scalars when schema is not provided
[ https://issues.apache.org/jira/browse/SPARK-39494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng resolved SPARK-39494. -- Resolution: Won't Do > Support `createDataFrame` from a list of scalars when schema is not provided > > > Key: SPARK-39494 > URL: https://issues.apache.org/jira/browse/SPARK-39494 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > Currently, DataFrame creation from a list of native Python scalars is > unsupported in PySpark, for example, > {{>>> spark.createDataFrame([1, 2]).collect()}} > {{Traceback (most recent call last):}} > {{...}} > {{TypeError: Can not infer schema for type: }} > {{However, Spark DataFrame Scala API supports that:}} > {{scala> Seq(1, 2).toDF().collect()}} > {{res6: Array[org.apache.spark.sql.Row] = Array([1], [2])}} > To maintain API consistency, we propose to support DataFrame creation from a > list of scalars. > See more > [here]([https://docs.google.com/document/d/1Rd20PVbVxNrLfOmDtetVRxkgJQhgAAtJp6XAAZfGQgc/edit?usp=sharing]). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40084) Upgrade Py4J from 0.10.9.5 to 0.10.9.7
[ https://issues.apache.org/jira/browse/SPARK-40084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng resolved SPARK-40084. -- Resolution: Resolved > Upgrade Py4J from 0.10.9.5 to 0.10.9.7 > -- > > Key: SPARK-40084 > URL: https://issues.apache.org/jira/browse/SPARK-40084 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Fix For: 3.4.0 > > > * Java side: Add support for Java 11/17 > Release note: https://www.py4j.org/changelog.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40084) Upgrade Py4J from 0.10.9.5 to 0.10.9.7
[ https://issues.apache.org/jira/browse/SPARK-40084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606813#comment-17606813 ] Xinrong Meng commented on SPARK-40084: -- Resolved by https://github.com/apache/spark/pull/37523. > Upgrade Py4J from 0.10.9.5 to 0.10.9.7 > -- > > Key: SPARK-40084 > URL: https://issues.apache.org/jira/browse/SPARK-40084 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Fix For: 3.4.0 > > > * Java side: Add support for Java 11/17 > Release note: https://www.py4j.org/changelog.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40084) Upgrade Py4J from 0.10.9.5 to 0.10.9.7
[ https://issues.apache.org/jira/browse/SPARK-40084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng reassigned SPARK-40084: Assignee: BingKun Pan > Upgrade Py4J from 0.10.9.5 to 0.10.9.7 > -- > > Key: SPARK-40084 > URL: https://issues.apache.org/jira/browse/SPARK-40084 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Fix For: 3.4.0 > > > * Java side: Add support for Java 11/17 > Release note: https://www.py4j.org/changelog.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-39405) NumPy support in SQL
[ https://issues.apache.org/jira/browse/SPARK-39405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng resolved SPARK-39405. -- Resolution: Resolved > NumPy support in SQL > > > Key: SPARK-39405 > URL: https://issues.apache.org/jira/browse/SPARK-39405 > Project: Spark > Issue Type: Umbrella > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > > NumPy is the fundamental package for scientific computing with Python. It is > very commonly used, especially in the data science world. For example, Pandas > is backed by NumPy, and tensor libraries also support interchangeable conversion > from/to NumPy arrays. > > However, PySpark only supports Python built-in types, with the exception of > “SparkSession.createDataFrame(pandas.DataFrame)” and “DataFrame.toPandas”. > > This issue has been raised multiple times internally and externally; see also > SPARK-2012, SPARK-37697, SPARK-31776, and SPARK-6857. > > With NumPy support in SQL, we expect broader adoption by data > scientists and newcomers, who can leverage their existing background and codebase > with NumPy. > > See more > [https://docs.google.com/document/d/1WsBiHoQB3UWERP47C47n_frffxZ9YIoGRwXSwIeMank/edit#] > . -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39405) NumPy support in SQL
[ https://issues.apache.org/jira/browse/SPARK-39405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng reassigned SPARK-39405: Assignee: Xinrong Meng > NumPy support in SQL > > > Key: SPARK-39405 > URL: https://issues.apache.org/jira/browse/SPARK-39405 > Project: Spark > Issue Type: Umbrella > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > > NumPy is the fundamental package for scientific computing with Python. It is > very commonly used, especially in the data science world. For example, Pandas > is backed by NumPy, and tensor libraries also support interchangeable conversion > from/to NumPy arrays. > > However, PySpark only supports Python built-in types, with the exception of > “SparkSession.createDataFrame(pandas.DataFrame)” and “DataFrame.toPandas”. > > This issue has been raised multiple times internally and externally; see also > SPARK-2012, SPARK-37697, SPARK-31776, and SPARK-6857. > > With NumPy support in SQL, we expect broader adoption by data > scientists and newcomers, who can leverage their existing background and codebase > with NumPy. > > See more > [https://docs.google.com/document/d/1WsBiHoQB3UWERP47C47n_frffxZ9YIoGRwXSwIeMank/edit#] > . -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-39745) Accept a list that contains NumPy scalars in `createDataFrame`
[ https://issues.apache.org/jira/browse/SPARK-39745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng resolved SPARK-39745. -- Resolution: Won't Do > Accept a list that contains NumPy scalars in `createDataFrame` > -- > > Key: SPARK-39745 > URL: https://issues.apache.org/jira/browse/SPARK-39745 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > Currently, only lists of native Python scalars are accepted in > `createDataFrame`. > We should support Numpy scalars as well. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
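With this sub-task also resolved as "Won't Do", callers can normalize NumPy scalars to native Python values themselves before calling `createDataFrame`. NumPy scalars expose `.item()`, which returns the equivalent native Python scalar. A dependency-free sketch (the helper and the stand-in class are illustrative, not any Spark or NumPy API):

```python
# Convert a NumPy-style scalar to a native Python value via duck typing:
# objects with a callable .item() (e.g. numpy.int64) are unwrapped,
# everything else passes through unchanged.
def to_native(value):
    item = getattr(value, "item", None)
    return item() if callable(item) else value

class FakeNumpyInt:
    """Stand-in for numpy.int64, used only to keep this sketch dependency-free."""
    def __init__(self, v):
        self._v = v
    def item(self):
        return self._v

print(to_native(FakeNumpyInt(7)))  # 7
print(to_native("abc"))            # abc
```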
[jira] [Assigned] (SPARK-40466) Improve the error message if the DSv2 source is disabled but DSv1 streaming source is not available
[ https://issues.apache.org/jira/browse/SPARK-40466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim reassigned SPARK-40466: Assignee: Huanli Wang > Improve the error message if the DSv2 source is disabled but DSv1 streaming > source is not available > --- > > Key: SPARK-40466 > URL: https://issues.apache.org/jira/browse/SPARK-40466 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Huanli Wang >Assignee: Huanli Wang >Priority: Minor > > If the V2 data source is disabled, the current behavior is to fall back to the V1 > data source. But it throws an error when the DSv1 source is not available. Update > the error message to indicate which config variable > (spark.sql.streaming.disabledV2MicroBatchReaders) needs to be modified in > order to enable the V2 data source. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40466) Improve the error message if the DSv2 source is disabled but DSv1 streaming source is not available
[ https://issues.apache.org/jira/browse/SPARK-40466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-40466. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37917 [https://github.com/apache/spark/pull/37917] > Improve the error message if the DSv2 source is disabled but DSv1 streaming > source is not available > --- > > Key: SPARK-40466 > URL: https://issues.apache.org/jira/browse/SPARK-40466 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Huanli Wang >Assignee: Huanli Wang >Priority: Minor > Fix For: 3.4.0 > > > If the V2 data source is disabled, the current behavior is to fall back to the V1 > data source. But it throws an error when the DSv1 source is not available. Update > the error message to indicate which config variable > (spark.sql.streaming.disabledV2MicroBatchReaders) needs to be modified in > order to enable the V2 data source. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
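The improvement boils down to surfacing the relevant config key in the error text so users know how to re-enable the DSv2 reader. A language-agnostic sketch of building such a message (Python, with a hypothetical helper name; only the config key `spark.sql.streaming.disabledV2MicroBatchReaders` is taken from the issue):

```python
# Real Spark config key named in the issue; everything else is illustrative.
DISABLED_V2_READERS_CONF = "spark.sql.streaming.disabledV2MicroBatchReaders"

def fallback_error(source_name, disabled_v2_sources):
    # Base message: the DSv1 fallback path found no implementation.
    msg = f"Streaming source '{source_name}' has no DSv1 implementation."
    if source_name in disabled_v2_sources:
        # Point at the config that forced the fallback in the first place.
        msg += (f" Its DSv2 reader was disabled via {DISABLED_V2_READERS_CONF};"
                " remove it from that list to use the DSv2 source.")
    return msg

print(fallback_error("rate", {"rate"}))
```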
[jira] [Reopened] (SPARK-40286) Load Data from S3 deletes data source file
[ https://issues.apache.org/jira/browse/SPARK-40286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Drew reopened SPARK-40286: -- > Load Data from S3 deletes data source file > -- > > Key: SPARK-40286 > URL: https://issues.apache.org/jira/browse/SPARK-40286 > Project: Spark > Issue Type: Question > Components: Documentation >Affects Versions: 3.2.1 >Reporter: Drew >Priority: Major > > Hello, > I'm using Spark to [load > data|https://spark.apache.org/docs/latest/sql-ref-syntax-dml-load.html] into > a Hive table through PySpark, and when I load data from a path in Amazon S3, > the original file is wiped from the directory. The file is found, and > its data populates the table. I also tried to add the `LOCAL` clause, but > that throws an error when looking for the file. The > documentation doesn't explicitly state that this is the intended behavior. > Thanks in advance! > {code:java} > spark.sql("CREATE TABLE src (key INT, value STRING) STORED AS textfile") > spark.sql("LOAD DATA INPATH 's3://bucket/kv1.txt' OVERWRITE INTO TABLE > src"){code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-40287) Load Data using Spark by a single partition moves entire dataset under same location in S3
[ https://issues.apache.org/jira/browse/SPARK-40287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Drew reopened SPARK-40287: -- > Load Data using Spark by a single partition moves entire dataset under same > location in S3 > -- > > Key: SPARK-40287 > URL: https://issues.apache.org/jira/browse/SPARK-40287 > Project: Spark > Issue Type: Question > Components: Spark Core >Affects Versions: 3.2.1 >Reporter: Drew >Priority: Major > > Hello, > I'm experiencing an issue in PySpark when creating a Hive table and loading > data into it. I'm using an Amazon S3 bucket as the data location, > creating a Parquet table, and trying to load data into that table > by a single partition, and I'm seeing some odd behavior. When I select the S3 > location of a Parquet file to load into my table, all of the data > is moved into the location specified in my CREATE TABLE command, including the > partitions I didn't specify in the LOAD DATA command. For example: > {code:java} > # create a data frame in pyspark with partitions > df = spark.createDataFrame([("a", 1, "x"), ("b", 2, "y"), ("c", 3, "y")], > ["c1", "c2", "p"]) > # save it to S3 > df.write.format("parquet").mode("overwrite").partitionBy("p").save("s3://bucket/data/") > {code} > At this point S3 has a new folder `data` containing one folder per partition, > with a Parquet file in each. 
> > - s3://bucket/data/p=x/ > - part-1.snappy.parquet > - s3://bucket/data/p=y/ > - part-2.snappy.parquet > - part-3.snappy.parquet > > {code:java} > # create new table > spark.sql("create table src (c1 string,c2 int) PARTITIONED BY (p string) > STORED AS parquet LOCATION 's3://bucket/new/'") > # load the saved table data from s3 specifying single partition value x > spark.sql("LOAD DATA INPATH 's3://bucket/data/' INTO TABLE src PARTITION > (p='x')") > spark.sql("select * from src").show() > # output: > # +---+---+---+ > # | c1| c2| p| > # +---+---+---+ > # +---+---+---+ > {code} > After running the LOAD DATA command and looking at the table, I'm left with > no data loaded. Checking S3, the source data we saved earlier has been moved > under `s3://bucket/new/`; oddly enough it also brought over the other > partitions, with the directory structure listed below. > - s3://bucket/new/ > - p=x/ > - p=x/ > - part-1.snappy.parquet > - p=y/ > - part-2.snappy.parquet > - part-3.snappy.parquet > Is this the intended behavior when loading data from a partitioned > Parquet file? Are the source files supposed to be moved/deleted from the source > directory? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
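Both reports above are consistent with LOAD DATA's documented move (not copy) semantics: the files at the input path are relocated under the table location rather than duplicated. A small stdlib-only simulation of move semantics, independent of Spark itself, showing why the source directory ends up empty:

```python
import pathlib
import shutil
import tempfile

# Simulate LOAD DATA's move semantics: after the "load", the source
# directory no longer contains the file, mirroring the behavior reported above.
src = pathlib.Path(tempfile.mkdtemp())            # stand-in for s3://bucket/data/
dst = pathlib.Path(tempfile.mkdtemp()) / "table"  # stand-in for the table location
dst.mkdir()
(src / "kv1.txt").write_text("1\tone\n")

shutil.move(str(src / "kv1.txt"), str(dst / "kv1.txt"))  # move, not copy

print(sorted(p.name for p in src.iterdir()))  # []
print(sorted(p.name for p in dst.iterdir()))  # ['kv1.txt']
```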
[jira] [Assigned] (SPARK-39991) AQE should use available column statistics from completed query stages
[ https://issues.apache.org/jira/browse/SPARK-39991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-39991: - Assignee: Andy Grove > AQE should use available column statistics from completed query stages > -- > > Key: SPARK-39991 > URL: https://issues.apache.org/jira/browse/SPARK-39991 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > > In QueryStageExec.computeStats we copy partial statistics from materialized > query stages by calling QueryStageExec#getRuntimeStatistics, which in turn > calls ShuffleExchangeLike#runtimeStatistics or > BroadcastExchangeLike#runtimeStatistics. > Only dataSize and numOutputRows are copied into the new Statistics object: > {code:scala} > def computeStats(): Option[Statistics] = if (isMaterialized) { > val runtimeStats = getRuntimeStatistics > val dataSize = runtimeStats.sizeInBytes.max(0) > val numOutputRows = runtimeStats.rowCount.map(_.max(0)) > Some(Statistics(dataSize, numOutputRows, isRuntime = true)) > } else { > None > } > {code} > I would like to also copy over the column statistics stored in > Statistics.attributeMap so that they can be fed back into the logical plan > optimization phase. This is a small change as shown below: > {code:scala} > def computeStats(): Option[Statistics] = if (isMaterialized) { > val runtimeStats = getRuntimeStatistics > val dataSize = runtimeStats.sizeInBytes.max(0) > val numOutputRows = runtimeStats.rowCount.map(_.max(0)) > val attributeStats = runtimeStats.attributeStats > Some(Statistics(dataSize, numOutputRows, attributeStats, isRuntime = > true)) > } else { > None > } > {code} > The Spark implementations of ShuffleExchangeLike and BroadcastExchangeLike do > not currently provide such column statistics, but other custom > implementations can. 
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-39991) AQE should use available column statistics from completed query stages
[ https://issues.apache.org/jira/browse/SPARK-39991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-39991. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37424 [https://github.com/apache/spark/pull/37424] > AQE should use available column statistics from completed query stages > -- > > Key: SPARK-39991 > URL: https://issues.apache.org/jira/browse/SPARK-39991 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Fix For: 3.4.0 > > > In QueryStageExec.computeStats we copy partial statistics from materialized > query stages by calling QueryStageExec#getRuntimeStatistics, which in turn > calls ShuffleExchangeLike#runtimeStatistics or > BroadcastExchangeLike#runtimeStatistics. > Only dataSize and numOutputRows are copied into the new Statistics object: > {code:scala} > def computeStats(): Option[Statistics] = if (isMaterialized) { > val runtimeStats = getRuntimeStatistics > val dataSize = runtimeStats.sizeInBytes.max(0) > val numOutputRows = runtimeStats.rowCount.map(_.max(0)) > Some(Statistics(dataSize, numOutputRows, isRuntime = true)) > } else { > None > } > {code} > I would like to also copy over the column statistics stored in > Statistics.attributeMap so that they can be fed back into the logical plan > optimization phase. 
This is a small change as shown below: > {code:scala} > def computeStats(): Option[Statistics] = if (isMaterialized) { > val runtimeStats = getRuntimeStatistics > val dataSize = runtimeStats.sizeInBytes.max(0) > val numOutputRows = runtimeStats.rowCount.map(_.max(0)) > val attributeStats = runtimeStats.attributeStats > Some(Statistics(dataSize, numOutputRows, attributeStats, isRuntime = > true)) > } else { > None > } > {code} > The Spark implementations of ShuffleExchangeLike and BroadcastExchangeLike do > not currently provide such column statistics, but other custom > implementations can. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40477) Support `NullType` in `ColumnarBatchRow`
[ https://issues.apache.org/jira/browse/SPARK-40477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40477: Assignee: (was: Apache Spark) > Support `NullType` in `ColumnarBatchRow` > > > Key: SPARK-40477 > URL: https://issues.apache.org/jira/browse/SPARK-40477 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Kazuyuki Tanimura >Priority: Minor > > `ColumnarBatchRow.get()` does not currently support `NullType`. Support > `NullType` in `ColumnarBatchRow` so that `NullType` can be a partition column > type. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40477) Support `NullType` in `ColumnarBatchRow`
[ https://issues.apache.org/jira/browse/SPARK-40477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606744#comment-17606744 ] Apache Spark commented on SPARK-40477: -- User 'kazuyukitanimura' has created a pull request for this issue: https://github.com/apache/spark/pull/37934 > Support `NullType` in `ColumnarBatchRow` > > > Key: SPARK-40477 > URL: https://issues.apache.org/jira/browse/SPARK-40477 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Kazuyuki Tanimura >Priority: Minor > > `ColumnarBatchRow.get()` does not currently support `NullType`. Support > `NullType` in `ColumnarBatchRow` so that `NullType` can be a partition column > type. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40477) Support `NullType` in `ColumnarBatchRow`
[ https://issues.apache.org/jira/browse/SPARK-40477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40477: Assignee: Apache Spark > Support `NullType` in `ColumnarBatchRow` > > > Key: SPARK-40477 > URL: https://issues.apache.org/jira/browse/SPARK-40477 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Kazuyuki Tanimura >Assignee: Apache Spark >Priority: Minor > > `ColumnarBatchRow.get()` does not currently support `NullType`. Support > `NullType` in `ColumnarBatchRow` so that `NullType` can be a partition column > type. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
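The change amounts to a type dispatch in the row accessor: a `NullType` column carries no backing data, so `get()` can return null without consulting any column vector. A rough pure-Python sketch of the idea (names are illustrative; the real code is Java in `ColumnarBatchRow`):

```python
# Sketch of a ColumnarBatchRow.get()-style dispatch: for a NullType column
# the accessor short-circuits to None (null) instead of reading column data.
NULL_TYPE = "null"

def get(row, ordinal, data_type):
    if data_type == NULL_TYPE:
        return None          # NullType columns are always null
    return row[ordinal]      # other types: read the stored value

print(get([10, "x"], 0, "int"))      # 10
print(get([10, "x"], 1, NULL_TYPE))  # None
```

With that in place, a partition column declared as `NullType` no longer trips the unsupported-type path.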
[jira] [Commented] (SPARK-40474) Infer columns with mixed date and timestamp as String in CSV schema inference
[ https://issues.apache.org/jira/browse/SPARK-40474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606684#comment-17606684 ] Apache Spark commented on SPARK-40474: -- User 'xiaonanyang-db' has created a pull request for this issue: https://github.com/apache/spark/pull/37933 > Infer columns with mixed date and timestamp as String in CSV schema inference > - > > Key: SPARK-40474 > URL: https://issues.apache.org/jira/browse/SPARK-40474 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Xiaonan Yang >Priority: Major > > In this ticket https://issues.apache.org/jira/browse/SPARK-39469, we > introduced support for the date type in CSV schema inference. The schema > inference behavior on date-time columns is now: > * For a column only containing dates, we will infer it as Date type > * For a column only containing timestamps, we will infer it as Timestamp type > * For a column containing a mixture of dates and timestamps, we will infer > it as Timestamp type > However, we found the last scenario too ambitious: supporting > it introduced much complexity into the code and raised a lot of > performance concerns. Thus, we want to simplify the behavior of the last > scenario as: > * For a column containing a mixture of dates and timestamps, we will infer > it as String type -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40474) Infer columns with mixed date and timestamp as String in CSV schema inference
[ https://issues.apache.org/jira/browse/SPARK-40474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40474: Assignee: (was: Apache Spark) > Infer columns with mixed date and timestamp as String in CSV schema inference > - > > Key: SPARK-40474 > URL: https://issues.apache.org/jira/browse/SPARK-40474 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Xiaonan Yang >Priority: Major > > In this ticket https://issues.apache.org/jira/browse/SPARK-39469, we > introduced support for the date type in CSV schema inference. The schema > inference behavior on date-time columns is now: > * For a column only containing dates, we will infer it as Date type > * For a column only containing timestamps, we will infer it as Timestamp type > * For a column containing a mixture of dates and timestamps, we will infer > it as Timestamp type > However, we found the last scenario too ambitious: supporting > it introduced much complexity into the code and raised a lot of > performance concerns. Thus, we want to simplify the behavior of the last > scenario as: > * For a column containing a mixture of dates and timestamps, we will infer > it as String type -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40474) Infer columns with mixed date and timestamp as String in CSV schema inference
[ https://issues.apache.org/jira/browse/SPARK-40474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40474: Assignee: Apache Spark > Infer columns with mixed date and timestamp as String in CSV schema inference > - > > Key: SPARK-40474 > URL: https://issues.apache.org/jira/browse/SPARK-40474 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Xiaonan Yang >Assignee: Apache Spark >Priority: Major > > In this ticket https://issues.apache.org/jira/browse/SPARK-39469, we > introduced support for the date type in CSV schema inference. The schema > inference behavior on date-time columns is now: > * For a column only containing dates, we will infer it as Date type > * For a column only containing timestamps, we will infer it as Timestamp type > * For a column containing a mixture of dates and timestamps, we will infer > it as Timestamp type > However, we found the last scenario too ambitious: supporting > it introduced much complexity into the code and raised a lot of > performance concerns. Thus, we want to simplify the behavior of the last > scenario as: > * For a column containing a mixture of dates and timestamps, we will infer > it as String type -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
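The simplified rule described above can be stated in a few lines. This pure-Python sketch operates on already-classified cell types and is only an illustration of the decision table; the real logic lives in Spark's CSV schema inference, not in code like this:

```python
# Simplified CSV column-type inference rule proposed in SPARK-40474:
# all dates -> date, all timestamps -> timestamp, a mixture -> string
# (previously a mixture was widened to timestamp).
def infer_column_type(cell_types):
    kinds = set(cell_types)
    if kinds == {"date"}:
        return "date"
    if kinds == {"timestamp"}:
        return "timestamp"
    # Mixed dates/timestamps (and anything else) fall back to string.
    return "string"

print(infer_column_type(["date", "date"]))            # date
print(infer_column_type(["timestamp", "timestamp"]))  # timestamp
print(infer_column_type(["date", "timestamp"]))       # string
```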
[jira] [Commented] (SPARK-40460) Streaming metrics is zero when select _metadata
[ https://issues.apache.org/jira/browse/SPARK-40460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606679#comment-17606679 ] Apache Spark commented on SPARK-40460: -- User 'Yaohua628' has created a pull request for this issue: https://github.com/apache/spark/pull/37932 > Streaming metrics is zero when select _metadata > --- > > Key: SPARK-40460 > URL: https://issues.apache.org/jira/browse/SPARK-40460 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.3.0, 3.4.0, 3.3.1, 3.3.2 >Reporter: Yaohua Zhao >Assignee: Yaohua Zhao >Priority: Major > Fix For: 3.4.0 > > > Streaming metrics report all 0 (`processedRowsPerSecond`, etc.) when selecting the > `_metadata` column, because the logical plan from the batch and the actual > planned logical plan do not match: > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ProgressReporter.scala#L348] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40460) Streaming metrics is zero when select _metadata
[ https://issues.apache.org/jira/browse/SPARK-40460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606678#comment-17606678 ] Yaohua Zhao commented on SPARK-40460: - [~kabhwan] You are right! Updated > Streaming metrics is zero when select _metadata > --- > > Key: SPARK-40460 > URL: https://issues.apache.org/jira/browse/SPARK-40460 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2 >Reporter: Yaohua Zhao >Assignee: Yaohua Zhao >Priority: Major > Fix For: 3.4.0 > > > Streaming metrics report all 0 (`processedRowsPerSecond`, etc.) when selecting the > `_metadata` column, because the logical plan from the batch and the actual > planned logical plan do not match: > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ProgressReporter.scala#L348] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40460) Streaming metrics is zero when select _metadata
[ https://issues.apache.org/jira/browse/SPARK-40460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yaohua Zhao updated SPARK-40460: Affects Version/s: 3.4.0 > Streaming metrics is zero when select _metadata > --- > > Key: SPARK-40460 > URL: https://issues.apache.org/jira/browse/SPARK-40460 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.3.0, 3.4.0, 3.3.1, 3.3.2 >Reporter: Yaohua Zhao >Assignee: Yaohua Zhao >Priority: Major > Fix For: 3.4.0 > > > Streaming metrics report all 0 (`processedRowsPerSecond`, etc.) when selecting the > `_metadata` column, because the logical plan from the batch and the actual > planned logical plan do not match: > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ProgressReporter.scala#L348] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40460) Streaming metrics is zero when select _metadata
[ https://issues.apache.org/jira/browse/SPARK-40460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yaohua Zhao updated SPARK-40460: Affects Version/s: 3.3.1 3.3.2 (was: 3.2.0) (was: 3.2.1) (was: 3.2.2) > Streaming metrics is zero when select _metadata > --- > > Key: SPARK-40460 > URL: https://issues.apache.org/jira/browse/SPARK-40460 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.3.0, 3.3.1, 3.3.2 >Reporter: Yaohua Zhao >Assignee: Yaohua Zhao >Priority: Major > Fix For: 3.4.0 > > > Streaming metrics report all 0 (`processedRowsPerSecond`, etc.) when selecting the > `_metadata` column, because the logical plan from the batch and the actual > planned logical plan do not match: > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ProgressReporter.scala#L348] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
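The failure mode described in the reports above is a lookup keyed on plan node identity: progress reporting associates row counts with the logical plan recorded for the micro-batch, so if planning substitutes a different node (as happens when `_metadata` is resolved), the lookup misses and the metric defaults to zero. This is an illustrative sketch of that mismatch pattern, not Spark's actual `ProgressReporter` code:

```python
# Metrics are attached to the node produced during planning, but looked up
# by the node recorded when the micro-batch was constructed. If the two
# differ, the lookup misses and 0 is reported.
class Node:
    pass

batch_node = Node()    # node recorded for the micro-batch
planned_node = Node()  # substitute node produced during planning

executed_rows = {id(planned_node): 1000}           # real counts live here
reported = executed_rows.get(id(batch_node), 0)    # lookup by the original node
print(reported)  # 0 -- mismatched plan nodes yield zero metrics
```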
[jira] [Resolved] (SPARK-40484) Upgrade log4j2 to 2.19.0
[ https://issues.apache.org/jira/browse/SPARK-40484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh resolved SPARK-40484. - Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37926 [https://github.com/apache/spark/pull/37926] > Upgrade log4j2 to 2.19.0 > > > Key: SPARK-40484 > URL: https://issues.apache.org/jira/browse/SPARK-40484 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.4.0
[jira] [Assigned] (SPARK-40484) Upgrade log4j2 to 2.19.0
[ https://issues.apache.org/jira/browse/SPARK-40484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh reassigned SPARK-40484: --- Assignee: Yang Jie
[jira] [Commented] (SPARK-40489) Spark 3.3.0 breaks with SFL4J 2.
[ https://issues.apache.org/jira/browse/SPARK-40489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606671#comment-17606671 ] Piotr Karwasz commented on SPARK-40489: --- Since {{StaticLoggerBinder}} is not API, but {{LoggerFactory}} is, replacing the code with: {code:java} val binderClass = LoggerFactory.getILoggerFactory.getClass.getName {code} should work on every version of {{SLF4J}}. Dropping Log4j 1.x support might be another solution. > Spark 3.3.0 breaks with SFL4J 2. > > > Key: SPARK-40489 > URL: https://issues.apache.org/jira/browse/SPARK-40489 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Garret Wilson >Priority: Critical > > Spark breaks fundamentally with SLF4J 2.x because it uses > {{StaticLoggerBinder}}. > SLF4J is the logging facade that is meant to shield the application from the > implementation, whether it be Log4J or Logback or whatever. Historically > SLF4J 1.x used a bad approach to configuration: it used a > {{StaticLoggerBinder}} (a global static singleton instance) rather than the > Java {{ServiceLoader}} mechanism. > SLF4J 2.x, which has been in development for years, has been released. It > finally switches to use the {{ServiceLoader}} mechanism. As [described in the > FAQ|https://www.slf4j.org/faq.html#changesInVersion200], the API should be > compatible; an application just needs to use the latest Log4J/Logback > implementation which has the service loader. > *Above all the application must _not_ use the low-level > {{StaticLoggerBinder}} method, because it has been removed!* > Unfortunately > [{{org.apache.spark.internal.Logging}}|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/Logging.scala] > uses {{StaticLoggerBinder}} and completely breaks any environment using > SLF4J 2.x. For example, in my application, I have pulled in the SLF4J 2.x API > and pulled in the Logback 1.4.x libraries (I'm not even using Log4J). 
Spark > breaks completely just trying to get a Spark session: > {noformat} > Caused by: java.lang.NoClassDefFoundError: org/slf4j/impl/StaticLoggerBinder > at > org.apache.spark.internal.Logging$.org$apache$spark$internal$Logging$$isLog4j2(Logging.scala:232) > at > org.apache.spark.internal.Logging.initializeLogging(Logging.scala:129) > at > org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:115) > at > org.apache.spark.internal.Logging.initializeLogIfNecessary$(Logging.scala:109) > at > org.apache.spark.SparkContext.initializeLogIfNecessary(SparkContext.scala:84) > at > org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:106) > at > org.apache.spark.internal.Logging.initializeLogIfNecessary$(Logging.scala:105) > at > org.apache.spark.SparkContext.initializeLogIfNecessary(SparkContext.scala:84) > at org.apache.spark.internal.Logging.log(Logging.scala:53) > at org.apache.spark.internal.Logging.log$(Logging.scala:51) > at org.apache.spark.SparkContext.log(SparkContext.scala:84) > at org.apache.spark.internal.Logging.logInfo(Logging.scala:61) > at org.apache.spark.internal.Logging.logInfo$(Logging.scala:60) > at org.apache.spark.SparkContext.logInfo(SparkContext.scala:84) > at org.apache.spark.SparkContext.<init>(SparkContext.scala:195) > at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2704) > at > org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:953) > at scala.Option.getOrElse(Option.scala:201) > at > org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:947) > {noformat} > This is because Spark is playing low-level tricks to find out if the logging > platform is Log4J, and relying on {{StaticLoggerBinder}} to do it. 
> {code} > private def isLog4j2(): Boolean = { > // This distinguishes the log4j 1.2 binding, currently > // org.slf4j.impl.Log4jLoggerFactory, from the log4j 2.0 binding, > currently > // org.apache.logging.slf4j.Log4jLoggerFactory > val binderClass = StaticLoggerBinder.getSingleton.getLoggerFactoryClassStr > "org.apache.logging.slf4j.Log4jLoggerFactory".equals(binderClass) > } > {code} > Whatever the wisdom of Spark's relying on Log4J-specific functionality, Spark > should not be using {{StaticLoggerBinder}} to do that detection. There are > many other approaches. (The code itself suggests one approach: > {{LogManager.getRootLogger.asInstanceOf[Log4jLogger]}}. You could check to > see if the root logger actually is a {{Log4jLogger}}. There may be even > better approaches.) > The other big problem is relying on the Log4J classes themselves. By relying > on those classes, you force me to bring in Log4J as a dependency, which in > the latest versions will register themselves with the service loader > mechanism, causing conflicting SLF4J implementations. > It is paramount that you: > * Remove all reliance on {{StaticLoggerBinder}}. If you must must must use > it, please check for it using reflection! > * Remove all static references to the Log4J classes. (In an ideal world you > wouldn't even be doing Log4J-specific things anyway.) If you must must must > do Log4J-specific things, access the classes via reflection; don't statically > link them in the code. > The current situation absolutely (and unnecessarily) 100% breaks the use of > SLF4J 2.x.
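The reflection-based detection the reporter asks for can be sketched in plain Java roughly as follows. This is a minimal sketch, not Spark's actual code; `LoggingProbe` and `detectLog4j2` are hypothetical names, and the only assumption taken from the report is the binding class name `org.apache.logging.slf4j.Log4jLoggerFactory`.

```java
// Minimal sketch: probe for a logging binding by class name via reflection,
// instead of statically linking org.slf4j.impl.StaticLoggerBinder
// (which was removed in SLF4J 2.x and triggers NoClassDefFoundError).
public class LoggingProbe {

    // Returns true if the named class is present on the classpath.
    static boolean classPresent(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    // Hypothetical replacement for isLog4j2(): check for the log4j 2.x
    // SLF4J binding class without ever touching StaticLoggerBinder.
    static boolean detectLog4j2() {
        return classPresent("org.apache.logging.slf4j.Log4jLoggerFactory");
    }

    public static void main(String[] args) {
        // On a bare classpath (no log4j jars) this prints "false".
        System.out.println("log4j2 binding present: " + detectLog4j2());
    }
}
```

The design point is the one the reporter makes: `Class.forName` fails softly with `ClassNotFoundException`, whereas a static reference to a missing class fails hard at link time anywhere in the enclosing method.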
[jira] [Updated] (SPARK-40489) Spark 3.3.0 breaks with SFL4J 2.
[ https://issues.apache.org/jira/browse/SPARK-40489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Garret Wilson updated SPARK-40489: -- Summary: Spark 3.3.0 breaks with SFL4J 2. (was: Spark 3.3.0 breaks SFL4J 2.)
[jira] [Created] (SPARK-40489) Spark 3.3.0 breaks SFL4J 2.
Garret Wilson created SPARK-40489: - Summary: Spark 3.3.0 breaks SFL4J 2. Key: SPARK-40489 URL: https://issues.apache.org/jira/browse/SPARK-40489 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.3.0 Reporter: Garret Wilson Spark breaks fundamentally with SLF4J 2.x because it uses {{StaticLoggerBinder}}. SLF4J is the logging facade that is meant to shield the application from the implementation, whether it be Log4J or Logback or whatever. Historically SLF4J 1.x used a bad approach to configuration: it used a {{StaticLoggerBinder}} (a global static singleton instance) rather than the Java {{ServiceLoader}} mechanism. SLF4J 2.x, which has been in development for years, has been released. It finally switches to use the {{ServiceLoader}} mechanism. As [described in the FAQ|https://www.slf4j.org/faq.html#changesInVersion200], the API should be compatible; an application just needs to use the latest Log4J/Logback implementation which has the service loader. **Above all the application must _not_ use the low-level {{StaticLoggerBinder}} method, because it has been removed!** Unfortunately [{{org.apache.spark.internal.Logging}}|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/Logging.scala] uses {{StaticLoggerBinder}} and completely breaks any environment using SLF4J 2.x. For example, in my application, I have pulled in the SLF4J 2.x API and pulled in the Logback 1.4.x libraries (I'm not even using Log4J). 
Spark breaks completely just trying to get a Spark session: {noformat} Caused by: java.lang.NoClassDefFoundError: org/slf4j/impl/StaticLoggerBinder at org.apache.spark.internal.Logging$.org$apache$spark$internal$Logging$$isLog4j2(Logging.scala:232) at org.apache.spark.internal.Logging.initializeLogging(Logging.scala:129) at org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:115) at org.apache.spark.internal.Logging.initializeLogIfNecessary$(Logging.scala:109) at org.apache.spark.SparkContext.initializeLogIfNecessary(SparkContext.scala:84) at org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:106) at org.apache.spark.internal.Logging.initializeLogIfNecessary$(Logging.scala:105) at org.apache.spark.SparkContext.initializeLogIfNecessary(SparkContext.scala:84) at org.apache.spark.internal.Logging.log(Logging.scala:53) at org.apache.spark.internal.Logging.log$(Logging.scala:51) at org.apache.spark.SparkContext.log(SparkContext.scala:84) at org.apache.spark.internal.Logging.logInfo(Logging.scala:61) at org.apache.spark.internal.Logging.logInfo$(Logging.scala:60) at org.apache.spark.SparkContext.logInfo(SparkContext.scala:84) at org.apache.spark.SparkContext.(SparkContext.scala:195) at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2704) at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:953) at scala.Option.getOrElse(Option.scala:201) at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:947) {noforomat} This is because Spark is playing low-level tricks to find out if the logging platform is Log4J, and relying on {{StaticLoggerBinder}} to do it. Whatever the wisdom of Spark's relying on Log4J-specific functionality, Spark should not be using {{StaticLoggerBinder}} to do that detection. There are many other approaches. (The code itself suggest one approach: {{LogManager.getRootLogger.asInstanceOf[Log4jLogger]}}. 
You could check to see if the root logger actually is a {{Log4jLogger}}. There may be even better approaches.) The other big problem is relying on the Log4J classes themselves. By relying on those classes, you force me to bring in Log4J as a dependency, which in the latest versions will register itself with the service loader mechanism, causing conflicting SLF4J implementations. It is paramount that you: * Remove all reliance on {{StaticLoggerBinder}}. If you must must must use it, please check for it using reflection! * Remove all static references to the Log4J classes. (In an ideal world you wouldn't even be doing Log4J-specific things anyway.) If you must must must do Log4J-specific things, access the classes via reflection; don't statically link them in the code. The current situation absolutely (and unnecessarily) 100% breaks the use of SLF4J 2.x. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
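One SLF4J-2-safe alternative the reporter alludes to is probing the classpath by reflection instead of statically linking against {{StaticLoggerBinder}}. A minimal sketch of that idea (this is illustrative, not Spark's actual code; the bridge class name is an assumption about the log4j-slf4j binding):

```java
// Hypothetical sketch: detect a Log4j 2 backend without ever referencing
// the org.slf4j.impl.StaticLoggerBinder class, which SLF4J 2.x removed.
public class BackendDetector {
    /** Returns true if the named class is loadable on the current classpath. */
    static boolean isClassPresent(String className) {
        try {
            // initialize=false: only test presence, run no static initializers
            Class.forName(className, false, BackendDetector.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    /** True if the SLF4J-to-Log4j 2 bridge appears to be on the classpath. */
    static boolean isLog4j2Backend() {
        // Assumed bridge class name; adjust for the binding actually in use.
        return isClassPresent("org.apache.logging.slf4j.Log4jLoggerFactory");
    }

    public static void main(String[] args) {
        System.out.println(isLog4j2Backend());
    }
}
```

Because the lookup is purely reflective, the detection degrades to `false` on a non-Log4j classpath instead of throwing `NoClassDefFoundError` during session startup.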
[jira] [Resolved] (SPARK-40413) Column.isin produces non-boolean results
[ https://issues.apache.org/jira/browse/SPARK-40413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-40413. -- Resolution: Invalid > Column.isin produces non-boolean results > > > Key: SPARK-40413 > URL: https://issues.apache.org/jira/browse/SPARK-40413 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Andreas Franz >Priority: Major > > I observed an inconsistent behaviour using the Column.isin function. The > [documentation|https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Column.html#isin(list:Any*):org.apache.spark.sql.Column] > states that an "up-cast" takes places when different data types are > involved. When working with _null_ values the results are confusing to me. > I prepared a small example demonstrating the issue > {code:java} > package example > import org.apache.spark.sql.{Row, SparkSession} > import org.apache.spark.sql.types.{StringType, StructField, StructType} > import org.apache.spark.sql.functions._ > object Test { > def main(args: Array[String]): Unit = { > val spark = SparkSession.builder() > .appName("App") > .master("local[*]") > .config("spark.driver.host", "localhost") > .config("spark.ui.enabled", "false") > .getOrCreate() > val schema = StructType( > Array( > StructField("name", StringType, nullable = true) > ) > ) > val data = Seq( > Row("a"), > Row("b"), > Row("c"), > Row(""), > Row(null) > ).toList > val list1 = Array("a", "d", "") > val list2 = Array("a", "d", "", null) > val dataFrame = > spark.createDataFrame(spark.sparkContext.parallelize(data), schema) > dataFrame > .withColumn("name_is_in_list_1", col("name").isin(list1: _*)) > .show(10, truncate = false) > /* > ++-+ > |name|name_is_in_list_1| > ++-+ > |a |true | > |b |false| > |c |false| > ||true | > |null|null | // check value null is not contained in > list1, why is null returned here? 
Expected result: false > ++-+ > */ > dataFrame > .withColumn("name_is_in_list_2", col("name").isin(list2: _*)) > .show(10, truncate = false) > /* > ++-+ > |name|name_is_in_list_2| > ++-+ > |a |true | > |b |null | // check value "b" is not contained in > list2, why is null returned here? Expected result: false > |c |null | // check value "c" is not contained in > list2, why is null returned here? Expected result: false > ||true | > |null|null | // check value null is in list2, why is > null returned here? Expected result: true > ++-+ > */ > val data2 = Seq( > Row("a"), > Row("b"), > Row("c"), > Row(""), > ).toList > val dataFrame2 = > spark.createDataFrame(spark.sparkContext.parallelize(data2), schema) > dataFrame2 > .withColumn("name_is_in_list_2", col("name").isin(list2: _*)) > .show(10, truncate = false) > > /* > ++-+ > |name|name_is_in_list_2| > ++-+ > |a |true | > |b |null | // check value "b" is not contained in > list2, why is null returned here? Expected result: false > |c |null | // check value "c" is not contained in > list2, why is null returned here? Expected result: false > ||true | > ++-+ > */ > } > }{code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
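The Invalid resolution is consistent with standard SQL three-valued logic: `x IN (list)` is UNKNOWN (shown as null) whenever there is no definite match but a NULL is involved on either side, and FALSE only when there is no match and no NULLs at all. A minimal model of that rule (a sketch of the SQL semantics, not Spark's implementation):

```java
import java.util.List;
import java.util.Objects;

// Models SQL's three-valued IN semantics: the reason Column.isin returns
// null rather than false when the value or the list contains NULL.
public class SqlIn {
    /** Returns TRUE, FALSE, or null (UNKNOWN), following the SQL IN rule. */
    static Boolean in(String value, List<String> list) {
        if (value == null) {
            return null;                  // NULL IN (...) is UNKNOWN
        }
        boolean sawNull = false;
        for (String item : list) {
            if (item == null) {
                sawNull = true;           // value = NULL compares to UNKNOWN
            } else if (Objects.equals(value, item)) {
                return true;              // any definite match wins
            }
        }
        return sawNull ? null : false;    // no match + NULL in list => UNKNOWN
    }
}
```

Under this rule `in("b", ["a", "d", "", null])` is UNKNOWN, not false, because "b" might equal the NULL element; that matches every null in the tables above.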
[jira] [Resolved] (SPARK-40456) PartitionIterator.hasNext should be cheap to call repeatedly
[ https://issues.apache.org/jira/browse/SPARK-40456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-40456. - Fix Version/s: 3.4.0 Resolution: Fixed > PartitionIterator.hasNext should be cheap to call repeatedly > > > Key: SPARK-40456 > URL: https://issues.apache.org/jira/browse/SPARK-40456 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Wenchen Fan >Assignee: Richard Chen >Priority: Minor > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40456) PartitionIterator.hasNext should be cheap to call repeatedly
[ https://issues.apache.org/jira/browse/SPARK-40456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-40456: --- Assignee: Richard Chen (was: Wenchen Fan) > PartitionIterator.hasNext should be cheap to call repeatedly > > > Key: SPARK-40456 > URL: https://issues.apache.org/jira/browse/SPARK-40456 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Wenchen Fan >Assignee: Richard Chen >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40456) PartitionIterator.hasNext should be cheap to call repeatedly
[ https://issues.apache.org/jira/browse/SPARK-40456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-40456: - Priority: Minor (was: Major) > PartitionIterator.hasNext should be cheap to call repeatedly > > > Key: SPARK-40456 > URL: https://issues.apache.org/jira/browse/SPARK-40456 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40294) Repeat calls to `PartitionIterator.hasNext` can timeout
[ https://issues.apache.org/jira/browse/SPARK-40294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-40294. -- Resolution: Duplicate > Repeat calls to `PartitionIterator.hasNext` can timeout > --- > > Key: SPARK-40294 > URL: https://issues.apache.org/jira/browse/SPARK-40294 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Richard Chen >Priority: Major > > Repeat calls to {{PartitionIterator.hasNext}} where both calls return > {{false}} can result in timeouts. For example, > {{{}KafkaBatchPartitionReader.next(){}}}, which calls {{consumer.get}} (which > can potentially timeout with repeat calls), is called by > {{{}PartitionIterator.hasNext{}}}. Thus, repeat calls to > {{PartitionIterator.hasNext}} by its parent could timeout. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40488) Do not wrap exceptions thrown in FileFormatWriter.write with SparkException
[ https://issues.apache.org/jira/browse/SPARK-40488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606544#comment-17606544 ] Apache Spark commented on SPARK-40488: -- User 'bozhang2820' has created a pull request for this issue: https://github.com/apache/spark/pull/37931 > Do not wrap exceptions thrown in FileFormatWriter.write with SparkException > --- > > Key: SPARK-40488 > URL: https://issues.apache.org/jira/browse/SPARK-40488 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Bo Zhang >Priority: Major > > Exceptions thrown in FileFormatWriter.write are wrapped with > SparkException("Job aborted."). > This wrapping provides little extra information, but generates a long > stacktrace, which hinders debugging when error happens. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40488) Do not wrap exceptions thrown in FileFormatWriter.write with SparkException
[ https://issues.apache.org/jira/browse/SPARK-40488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40488: Assignee: Apache Spark > Do not wrap exceptions thrown in FileFormatWriter.write with SparkException > --- > > Key: SPARK-40488 > URL: https://issues.apache.org/jira/browse/SPARK-40488 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Bo Zhang >Assignee: Apache Spark >Priority: Major > > Exceptions thrown in FileFormatWriter.write are wrapped with > SparkException("Job aborted."). > This wrapping provides little extra information, but generates a long > stacktrace, which hinders debugging when error happens. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40488) Do not wrap exceptions thrown in FileFormatWriter.write with SparkException
[ https://issues.apache.org/jira/browse/SPARK-40488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40488: Assignee: (was: Apache Spark) > Do not wrap exceptions thrown in FileFormatWriter.write with SparkException > --- > > Key: SPARK-40488 > URL: https://issues.apache.org/jira/browse/SPARK-40488 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Bo Zhang >Priority: Major > > Exceptions thrown in FileFormatWriter.write are wrapped with > SparkException("Job aborted."). > This wrapping provides little extra information, but generates a long > stacktrace, which hinders debugging when error happens. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40419) Integrate Grouped Aggregate Pandas UDFs into *.sql test cases
[ https://issues.apache.org/jira/browse/SPARK-40419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-40419. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37873 [https://github.com/apache/spark/pull/37873] > Integrate Grouped Aggregate Pandas UDFs into *.sql test cases > - > > Key: SPARK-40419 > URL: https://issues.apache.org/jira/browse/SPARK-40419 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > Fix For: 3.4.0 > > > We ported Python UDF, Scala UDF and Scalar Pandas UDF into SQL test cases > from SPARK-27921, but Grouped Aggregate Pandas UDF is not tested from SQL at > all. > We should also leverage this to test pandas aggregate UDFs too. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40419) Integrate Grouped Aggregate Pandas UDFs into *.sql test cases
[ https://issues.apache.org/jira/browse/SPARK-40419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-40419: Assignee: Haejoon Lee > Integrate Grouped Aggregate Pandas UDFs into *.sql test cases > - > > Key: SPARK-40419 > URL: https://issues.apache.org/jira/browse/SPARK-40419 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > > We ported Python UDF, Scala UDF and Scalar Pandas UDF into SQL test cases > from SPARK-27921, but Grouped Aggregate Pandas UDF is not tested from SQL at > all. > We should also leverage this to test pandas aggregate UDFs too. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40488) Do not wrap exceptions thrown in FileFormatWriter.write with SparkException
Bo Zhang created SPARK-40488: Summary: Do not wrap exceptions thrown in FileFormatWriter.write with SparkException Key: SPARK-40488 URL: https://issues.apache.org/jira/browse/SPARK-40488 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: Bo Zhang Exceptions thrown in FileFormatWriter.write are wrapped with SparkException("Job aborted."). This wrapping provides little extra information, but generates a long stacktrace, which hinders debugging when an error happens. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40487) Make defaultJoin in BroadcastNestedLoopJoinExec running in parallel
[ https://issues.apache.org/jira/browse/SPARK-40487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606537#comment-17606537 ] Apache Spark commented on SPARK-40487: -- User 'xingczhao' has created a pull request for this issue: https://github.com/apache/spark/pull/37930 > Make defaultJoin in BroadcastNestedLoopJoinExec running in parallel > --- > > Key: SPARK-40487 > URL: https://issues.apache.org/jira/browse/SPARK-40487 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Xingchao, Zhang >Priority: Major > > The 'Part 1' and 'Part 2' could run in parallel > {code:java} > /** >* The implementation for these joins: >* >* LeftOuter with BuildLeft >* RightOuter with BuildRight >* FullOuter >*/ > private def defaultJoin(relation: Broadcast[Array[InternalRow]]): > RDD[InternalRow] = { > val streamRdd = streamed.execute() > // Part 1 > val matchedBroadcastRows = getMatchedBroadcastRowsBitSet(streamRdd, > relation) > val notMatchedBroadcastRows: Seq[InternalRow] = { > val nulls = new GenericInternalRow(streamed.output.size) > val buf: CompactBuffer[InternalRow] = new CompactBuffer() > val joinedRow = new JoinedRow > joinedRow.withLeft(nulls) > var i = 0 > val buildRows = relation.value > while (i < buildRows.length) { > if (!matchedBroadcastRows.get(i)) { > buf += joinedRow.withRight(buildRows(i)).copy() > } > i += 1 > } > buf > } > // Part 2 > val matchedStreamRows = streamRdd.mapPartitionsInternal { streamedIter => > val buildRows = relation.value > val joinedRow = new JoinedRow > val nulls = new GenericInternalRow(broadcast.output.size) > streamedIter.flatMap { streamedRow => > var i = 0 > var foundMatch = false > val matchedRows = new CompactBuffer[InternalRow] > while (i < buildRows.length) { > if (boundCondition(joinedRow(streamedRow, buildRows(i { > matchedRows += joinedRow.copy() > foundMatch = true > } > i += 1 > } > if (!foundMatch && joinType == FullOuter) { > matchedRows += 
joinedRow(streamedRow, nulls).copy() > } > matchedRows.iterator > } > } > // Union > sparkContext.union( > matchedStreamRows, > sparkContext.makeRDD(notMatchedBroadcastRows) > ) > }{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40487) Make defaultJoin in BroadcastNestedLoopJoinExec running in parallel
[ https://issues.apache.org/jira/browse/SPARK-40487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40487: Assignee: Apache Spark > Make defaultJoin in BroadcastNestedLoopJoinExec running in parallel > --- > > Key: SPARK-40487 > URL: https://issues.apache.org/jira/browse/SPARK-40487 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Xingchao, Zhang >Assignee: Apache Spark >Priority: Major > > The 'Part 1' and 'Part 2' could run in parallel > {code:java} > /** >* The implementation for these joins: >* >* LeftOuter with BuildLeft >* RightOuter with BuildRight >* FullOuter >*/ > private def defaultJoin(relation: Broadcast[Array[InternalRow]]): > RDD[InternalRow] = { > val streamRdd = streamed.execute() > // Part 1 > val matchedBroadcastRows = getMatchedBroadcastRowsBitSet(streamRdd, > relation) > val notMatchedBroadcastRows: Seq[InternalRow] = { > val nulls = new GenericInternalRow(streamed.output.size) > val buf: CompactBuffer[InternalRow] = new CompactBuffer() > val joinedRow = new JoinedRow > joinedRow.withLeft(nulls) > var i = 0 > val buildRows = relation.value > while (i < buildRows.length) { > if (!matchedBroadcastRows.get(i)) { > buf += joinedRow.withRight(buildRows(i)).copy() > } > i += 1 > } > buf > } > // Part 2 > val matchedStreamRows = streamRdd.mapPartitionsInternal { streamedIter => > val buildRows = relation.value > val joinedRow = new JoinedRow > val nulls = new GenericInternalRow(broadcast.output.size) > streamedIter.flatMap { streamedRow => > var i = 0 > var foundMatch = false > val matchedRows = new CompactBuffer[InternalRow] > while (i < buildRows.length) { > if (boundCondition(joinedRow(streamedRow, buildRows(i { > matchedRows += joinedRow.copy() > foundMatch = true > } > i += 1 > } > if (!foundMatch && joinType == FullOuter) { > matchedRows += joinedRow(streamedRow, nulls).copy() > } > matchedRows.iterator > } > } > // Union > sparkContext.union( > 
matchedStreamRows, > sparkContext.makeRDD(notMatchedBroadcastRows) > ) > }{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40487) Make defaultJoin in BroadcastNestedLoopJoinExec running in parallel
[ https://issues.apache.org/jira/browse/SPARK-40487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606536#comment-17606536 ] Apache Spark commented on SPARK-40487: -- User 'xingczhao' has created a pull request for this issue: https://github.com/apache/spark/pull/37930 > Make defaultJoin in BroadcastNestedLoopJoinExec running in parallel > --- > > Key: SPARK-40487 > URL: https://issues.apache.org/jira/browse/SPARK-40487 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Xingchao, Zhang >Priority: Major > > The 'Part 1' and 'Part 2' could run in parallel > {code:java} > /** >* The implementation for these joins: >* >* LeftOuter with BuildLeft >* RightOuter with BuildRight >* FullOuter >*/ > private def defaultJoin(relation: Broadcast[Array[InternalRow]]): > RDD[InternalRow] = { > val streamRdd = streamed.execute() > // Part 1 > val matchedBroadcastRows = getMatchedBroadcastRowsBitSet(streamRdd, > relation) > val notMatchedBroadcastRows: Seq[InternalRow] = { > val nulls = new GenericInternalRow(streamed.output.size) > val buf: CompactBuffer[InternalRow] = new CompactBuffer() > val joinedRow = new JoinedRow > joinedRow.withLeft(nulls) > var i = 0 > val buildRows = relation.value > while (i < buildRows.length) { > if (!matchedBroadcastRows.get(i)) { > buf += joinedRow.withRight(buildRows(i)).copy() > } > i += 1 > } > buf > } > // Part 2 > val matchedStreamRows = streamRdd.mapPartitionsInternal { streamedIter => > val buildRows = relation.value > val joinedRow = new JoinedRow > val nulls = new GenericInternalRow(broadcast.output.size) > streamedIter.flatMap { streamedRow => > var i = 0 > var foundMatch = false > val matchedRows = new CompactBuffer[InternalRow] > while (i < buildRows.length) { > if (boundCondition(joinedRow(streamedRow, buildRows(i { > matchedRows += joinedRow.copy() > foundMatch = true > } > i += 1 > } > if (!foundMatch && joinType == FullOuter) { > matchedRows += 
joinedRow(streamedRow, nulls).copy() > } > matchedRows.iterator > } > } > // Union > sparkContext.union( > matchedStreamRows, > sparkContext.makeRDD(notMatchedBroadcastRows) > ) > }{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40487) Make defaultJoin in BroadcastNestedLoopJoinExec running in parallel
[ https://issues.apache.org/jira/browse/SPARK-40487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40487: Assignee: (was: Apache Spark) > Make defaultJoin in BroadcastNestedLoopJoinExec running in parallel > --- > > Key: SPARK-40487 > URL: https://issues.apache.org/jira/browse/SPARK-40487 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Xingchao, Zhang >Priority: Major > > The 'Part 1' and 'Part 2' could run in parallel > {code:java} > /** >* The implementation for these joins: >* >* LeftOuter with BuildLeft >* RightOuter with BuildRight >* FullOuter >*/ > private def defaultJoin(relation: Broadcast[Array[InternalRow]]): > RDD[InternalRow] = { > val streamRdd = streamed.execute() > // Part 1 > val matchedBroadcastRows = getMatchedBroadcastRowsBitSet(streamRdd, > relation) > val notMatchedBroadcastRows: Seq[InternalRow] = { > val nulls = new GenericInternalRow(streamed.output.size) > val buf: CompactBuffer[InternalRow] = new CompactBuffer() > val joinedRow = new JoinedRow > joinedRow.withLeft(nulls) > var i = 0 > val buildRows = relation.value > while (i < buildRows.length) { > if (!matchedBroadcastRows.get(i)) { > buf += joinedRow.withRight(buildRows(i)).copy() > } > i += 1 > } > buf > } > // Part 2 > val matchedStreamRows = streamRdd.mapPartitionsInternal { streamedIter => > val buildRows = relation.value > val joinedRow = new JoinedRow > val nulls = new GenericInternalRow(broadcast.output.size) > streamedIter.flatMap { streamedRow => > var i = 0 > var foundMatch = false > val matchedRows = new CompactBuffer[InternalRow] > while (i < buildRows.length) { > if (boundCondition(joinedRow(streamedRow, buildRows(i { > matchedRows += joinedRow.copy() > foundMatch = true > } > i += 1 > } > if (!foundMatch && joinType == FullOuter) { > matchedRows += joinedRow(streamedRow, nulls).copy() > } > matchedRows.iterator > } > } > // Union > sparkContext.union( > matchedStreamRows, > 
sparkContext.makeRDD(notMatchedBroadcastRows) > ) > }{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40487) Make defaultJoin in BroadcastNestedLoopJoinExec running in parallel
Xingchao, Zhang created SPARK-40487: --- Summary: Make defaultJoin in BroadcastNestedLoopJoinExec running in parallel Key: SPARK-40487 URL: https://issues.apache.org/jira/browse/SPARK-40487 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: Xingchao, Zhang The 'Part 1' and 'Part 2' could run in parallel {code:java} /** * The implementation for these joins: * * LeftOuter with BuildLeft * RightOuter with BuildRight * FullOuter */ private def defaultJoin(relation: Broadcast[Array[InternalRow]]): RDD[InternalRow] = { val streamRdd = streamed.execute() // Part 1 val matchedBroadcastRows = getMatchedBroadcastRowsBitSet(streamRdd, relation) val notMatchedBroadcastRows: Seq[InternalRow] = { val nulls = new GenericInternalRow(streamed.output.size) val buf: CompactBuffer[InternalRow] = new CompactBuffer() val joinedRow = new JoinedRow joinedRow.withLeft(nulls) var i = 0 val buildRows = relation.value while (i < buildRows.length) { if (!matchedBroadcastRows.get(i)) { buf += joinedRow.withRight(buildRows(i)).copy() } i += 1 } buf } // Part 2 val matchedStreamRows = streamRdd.mapPartitionsInternal { streamedIter => val buildRows = relation.value val joinedRow = new JoinedRow val nulls = new GenericInternalRow(broadcast.output.size) streamedIter.flatMap { streamedRow => var i = 0 var foundMatch = false val matchedRows = new CompactBuffer[InternalRow] while (i < buildRows.length) { if (boundCondition(joinedRow(streamedRow, buildRows(i)))) { matchedRows += joinedRow.copy() foundMatch = true } i += 1 } if (!foundMatch && joinType == FullOuter) { matchedRows += joinedRow(streamedRow, nulls).copy() } matchedRows.iterator } } // Union sparkContext.union( matchedStreamRows, sparkContext.makeRDD(notMatchedBroadcastRows) ) }{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
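The proposed improvement amounts to computing the two independent pieces concurrently and only then unioning them. A generic sketch of that pattern using plain Java futures (this stands in for Spark's scheduler; the part contents are illustrative):

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Generic sketch: run two independent computations ("Part 1" and "Part 2")
// concurrently, then combine the results -- the same shape as computing
// notMatchedBroadcastRows and matchedStreamRows before sparkContext.union.
public class ParallelParts {
    static List<String> computeBothParts() {
        CompletableFuture<List<String>> part1 =
            CompletableFuture.supplyAsync(() -> List.of("unmatched-1", "unmatched-2"));
        CompletableFuture<List<String>> part2 =
            CompletableFuture.supplyAsync(() -> List.of("matched-1"));
        // Block until both parts finish, then union the results in order.
        return Stream.concat(part1.join().stream(), part2.join().stream())
                     .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(computeBothParts());
    }
}
```

The key property is that neither part reads the other's output, so overlapping them changes only wall-clock time, not the unioned result.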
[jira] [Commented] (SPARK-40486) Implement `spearman` and `kendall` in `DataFrame.corrwith`
[ https://issues.apache.org/jira/browse/SPARK-40486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606528#comment-17606528 ] Apache Spark commented on SPARK-40486: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/37929 > Implement `spearman` and `kendall` in `DataFrame.corrwith` > -- > > Key: SPARK-40486 > URL: https://issues.apache.org/jira/browse/SPARK-40486 > Project: Spark > Issue Type: Sub-task > Components: ps >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40486) Implement `spearman` and `kendall` in `DataFrame.corrwith`
[ https://issues.apache.org/jira/browse/SPARK-40486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40486: Assignee: (was: Apache Spark) > Implement `spearman` and `kendall` in `DataFrame.corrwith` > -- > > Key: SPARK-40486 > URL: https://issues.apache.org/jira/browse/SPARK-40486 > Project: Spark > Issue Type: Sub-task > Components: ps >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40486) Implement `spearman` and `kendall` in `DataFrame.corrwith`
[ https://issues.apache.org/jira/browse/SPARK-40486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40486: Assignee: Apache Spark > Implement `spearman` and `kendall` in `DataFrame.corrwith` > -- > > Key: SPARK-40486 > URL: https://issues.apache.org/jira/browse/SPARK-40486 > Project: Spark > Issue Type: Sub-task > Components: ps >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40486) Implement `spearman` and `kendall` in `DataFrame.corrwith`
[ https://issues.apache.org/jira/browse/SPARK-40486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606527#comment-17606527 ] Apache Spark commented on SPARK-40486: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/37929 > Implement `spearman` and `kendall` in `DataFrame.corrwith` > -- > > Key: SPARK-40486 > URL: https://issues.apache.org/jira/browse/SPARK-40486 > Project: Spark > Issue Type: Sub-task > Components: ps >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40486) Implement `spearman` and `kendall` in `DataFrame.corrwith`
[ https://issues.apache.org/jira/browse/SPARK-40486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng updated SPARK-40486: -- Parent: SPARK-40327 Issue Type: Sub-task (was: Improvement) > Implement `spearman` and `kendall` in `DataFrame.corrwith` > -- > > Key: SPARK-40486 > URL: https://issues.apache.org/jira/browse/SPARK-40486 > Project: Spark > Issue Type: Sub-task > Components: ps >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40486) Implement `spearman` and `kendall` in `DataFrame.corrwith`
Ruifeng Zheng created SPARK-40486: - Summary: Implement `spearman` and `kendall` in `DataFrame.corrwith` Key: SPARK-40486 URL: https://issues.apache.org/jira/browse/SPARK-40486 Project: Spark Issue Type: Improvement Components: ps Affects Versions: 3.4.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40483) Add `CONNECT` label
[ https://issues.apache.org/jira/browse/SPARK-40483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-40483. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37925 [https://github.com/apache/spark/pull/37925] > Add `CONNECT` label > --- > > Key: SPARK-40483 > URL: https://issues.apache.org/jira/browse/SPARK-40483 > Project: Spark > Issue Type: Sub-task > Components: Connect, Project Infra >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40483) Add `CONNECT` label
[ https://issues.apache.org/jira/browse/SPARK-40483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-40483: Assignee: Hyukjin Kwon > Add `CONNECT` label > --- > > Key: SPARK-40483 > URL: https://issues.apache.org/jira/browse/SPARK-40483 > Project: Spark > Issue Type: Sub-task > Components: Connect, Project Infra >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org