[jira] [Updated] (SPARK-47126) Re-enable Spark 3.4 test in HiveExternalCatalogVersionsSuite
[ https://issues.apache.org/jira/browse/SPARK-47126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-47126: -- Parent: (was: SPARK-47046) Issue Type: Bug (was: Sub-task) > Re-enable Spark 3.4 test in HiveExternalCatalogVersionsSuite > > > Key: SPARK-47126 > URL: https://issues.apache.org/jira/browse/SPARK-47126 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Minor > > HiveExternalCatalogVersionsSuite requires SPARK-46400 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44319) Migrate jersey 2 to jersey 3
[ https://issues.apache.org/jira/browse/SPARK-44319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17820268#comment-17820268 ] HiuFung Kwok commented on SPARK-44319: -- [~dongjoon] FYI I marked this as resolved also. > Migrate jersey 2 to jersey 3 > > > Key: SPARK-44319 > URL: https://issues.apache.org/jira/browse/SPARK-44319 > Project: Spark > Issue Type: Sub-task > Components: Build, Spark Core >Affects Versions: 4.0.0 >Reporter: Yang Jie >Priority: Minor > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44319) Migrate jersey 2 to jersey 3
[ https://issues.apache.org/jira/browse/SPARK-44319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] HiuFung Kwok resolved SPARK-44319. -- Fix Version/s: 4.0.0 Resolution: Fixed The work is done under the scope of SPARK-47118. > Migrate jersey 2 to jersey 3 > > > Key: SPARK-44319 > URL: https://issues.apache.org/jira/browse/SPARK-44319 > Project: Spark > Issue Type: Sub-task > Components: Build, Spark Core >Affects Versions: 4.0.0 >Reporter: Yang Jie >Priority: Minor > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47153) Guard serialize/deserialize in JavaSerializer with try-with-resource block
[ https://issues.apache.org/jira/browse/SPARK-47153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47153: --- Labels: pull-request-available (was: ) > Guard serialize/deserialize in JavaSerializer with try-with-resource block > -- > > Key: SPARK-47153 > URL: https://issues.apache.org/jira/browse/SPARK-47153 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Yan-Lin (Jared) Wang >Priority: Minor > Labels: pull-request-available > > It's a common practice to close unused resources as soon as we're done using > them. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47153) Guard serialize/deserialize in JavaSerializer with try-with-resource block
Yan-Lin (Jared) Wang created SPARK-47153: Summary: Guard serialize/deserialize in JavaSerializer with try-with-resource block Key: SPARK-47153 URL: https://issues.apache.org/jira/browse/SPARK-47153 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 4.0.0 Reporter: Yan-Lin (Jared) Wang It's a common practice to close unused resources as soon as we're done using them. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
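The practice this issue describes can be illustrated with plain `java.io` object streams. The following is a generic sketch of wrapping serialize/deserialize in try-with-resources so the streams are closed even when an exception is thrown; the class and method names are illustrative, not Spark's actual `JavaSerializer` code:

```java
import java.io.*;

public class SerializeDemo {
    // Serialize an object; try-with-resources closes (and flushes) the
    // ObjectOutputStream even if writeObject throws.
    static byte[] serialize(Serializable obj) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        }
        return bytes.toByteArray();
    }

    // Deserialize; the ObjectInputStream is closed on both the normal
    // and the exceptional path.
    static Object deserialize(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] data = serialize("hello");
        System.out.println(deserialize(data)); // prints: hello
    }
}
```

Compared with a manual `finally { out.close(); }`, try-with-resources also suppresses secondary close-time exceptions correctly, which is the main motivation for the change.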
[jira] [Assigned] (SPARK-47151) Update pandas to 2.2.1
[ https://issues.apache.org/jira/browse/SPARK-47151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-47151: - Assignee: Bjørn Jørgensen > Update pandas to 2.2.1 > -- > > Key: SPARK-47151 > URL: https://issues.apache.org/jira/browse/SPARK-47151 > Project: Spark > Issue Type: Dependency upgrade > Components: Pandas API on Spark >Affects Versions: 4.0.0 >Reporter: Bjørn Jørgensen >Assignee: Bjørn Jørgensen >Priority: Major > Labels: pull-request-available > > [Pandas 2.2.1|https://pandas.pydata.org/docs/whatsnew/v2.2.1.html] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47151) Update pandas to 2.2.1
[ https://issues.apache.org/jira/browse/SPARK-47151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-47151. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45236 [https://github.com/apache/spark/pull/45236] > Update pandas to 2.2.1 > -- > > Key: SPARK-47151 > URL: https://issues.apache.org/jira/browse/SPARK-47151 > Project: Spark > Issue Type: Dependency upgrade > Components: Pandas API on Spark >Affects Versions: 4.0.0 >Reporter: Bjørn Jørgensen >Assignee: Bjørn Jørgensen >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > [Pandas 2.2.1|https://pandas.pydata.org/docs/whatsnew/v2.2.1.html] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47152) Provide `CodeHaus Jackson` dependencies via a new optional directory
[ https://issues.apache.org/jira/browse/SPARK-47152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-47152: - Assignee: Dongjoon Hyun > Provide `CodeHaus Jackson` dependencies via a new optional directory > > > Key: SPARK-47152 > URL: https://issues.apache.org/jira/browse/SPARK-47152 > Project: Spark > Issue Type: Sub-task > Components: Build, SQL >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47152) Provide `CodeHaus Jackson` dependencies via a new optional directory
[ https://issues.apache.org/jira/browse/SPARK-47152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-47152: -- Summary: Provide `CodeHaus Jackson` dependencies via a new optional directory (was: Provide Apache Hive Jackson dependency via a new optional directory) > Provide `CodeHaus Jackson` dependencies via a new optional directory > > > Key: SPARK-47152 > URL: https://issues.apache.org/jira/browse/SPARK-47152 > Project: Spark > Issue Type: Sub-task > Components: Build, SQL >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47152) Provide Apache Hive Jackson dependency via a new optional directory
[ https://issues.apache.org/jira/browse/SPARK-47152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47152: --- Labels: pull-request-available (was: ) > Provide Apache Hive Jackson dependency via a new optional directory > --- > > Key: SPARK-47152 > URL: https://issues.apache.org/jira/browse/SPARK-47152 > Project: Spark > Issue Type: Sub-task > Components: Build, SQL >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47152) Provide Apache Hive Jackson dependency via a new optional directory
[ https://issues.apache.org/jira/browse/SPARK-47152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-47152: -- Component/s: Build > Provide Apache Hive Jackson dependency via a new optional directory > --- > > Key: SPARK-47152 > URL: https://issues.apache.org/jira/browse/SPARK-47152 > Project: Spark > Issue Type: Sub-task > Components: Build, SQL >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47152) Provide Apache Hive Jackson dependency via a new optional directory
Dongjoon Hyun created SPARK-47152: - Summary: Provide Apache Hive Jackson dependency via a new optional directory Key: SPARK-47152 URL: https://issues.apache.org/jira/browse/SPARK-47152 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 4.0.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47151) Update pandas to 2.2.1
[ https://issues.apache.org/jira/browse/SPARK-47151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47151: --- Labels: pull-request-available (was: ) > Update pandas to 2.2.1 > -- > > Key: SPARK-47151 > URL: https://issues.apache.org/jira/browse/SPARK-47151 > Project: Spark > Issue Type: Dependency upgrade > Components: Pandas API on Spark >Affects Versions: 4.0.0 >Reporter: Bjørn Jørgensen >Priority: Major > Labels: pull-request-available > > [Pandas 2.2.1|https://pandas.pydata.org/docs/whatsnew/v2.2.1.html] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47151) Update pandas to 2.2.1
Bjørn Jørgensen created SPARK-47151: --- Summary: Update pandas to 2.2.1 Key: SPARK-47151 URL: https://issues.apache.org/jira/browse/SPARK-47151 Project: Spark Issue Type: Dependency upgrade Components: Pandas API on Spark Affects Versions: 4.0.0 Reporter: Bjørn Jørgensen [Pandas 2.2.1|https://pandas.pydata.org/docs/whatsnew/v2.2.1.html] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47150) String length (...) exceeds the maximum length (20000000)
Sergii Mikhtoniuk created SPARK-47150: - Summary: String length (...) exceeds the maximum length (20000000) Key: SPARK-47150 URL: https://issues.apache.org/jira/browse/SPARK-47150 Project: Spark Issue Type: Bug Components: Input/Output Affects Versions: 3.5.0 Reporter: Sergii Mikhtoniuk Upgrading to Spark 3.5.0 introduced a regression for us where our query gateway (Livy) fails with an error: {code:java} com.fasterxml.jackson.core.exc.StreamConstraintsException: String length (20054016) exceeds the maximum length (20000000) (sorry, unable to provide full stack trace){code} The root of this problem is a breaking change in {{jackson}} that (in the name of "safety") introduced JSON size limits, see: [https://github.com/FasterXML/jackson-core/issues/1014] Looks like {{JSONOptions}} in Spark already [supports configuring this limit|https://github.com/apache/spark/blob/c2dbb6d04bc9c781fb4a7673e5acf2c67b99c203/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala#L55-L58], but there seems to be no way to set it globally or pass it down to [{{DataFrame::toJSON()}}|https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.toJSON.html], which our Apache Livy server uses when transmitting data. Livy is an old project and transferring dataframes via JSON is super inefficient, and we really should move to something like Spark Connect, but I believe this issue can happen to many people working with basic GeoJSON data. Spark can handle very large strings, and this arbitrary limit just gets in the way of output serialization for no good reason. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
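For reference, jackson-core exposes these limits through `StreamReadConstraints` (added in 2.15), so code that controls its own `ObjectMapper` can raise the cap. This is a configuration sketch assuming jackson-core/jackson-databind 2.15+ on the classpath; it is not a fix inside Spark or Livy, where the reporter notes no such knob is reachable:

```java
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.StreamReadConstraints;
import com.fasterxml.jackson.databind.ObjectMapper;

public class RelaxedJsonLimits {
    // Build a mapper whose parser accepts strings longer than the
    // default cap (20000000 chars in recent jackson-core releases).
    public static ObjectMapper newMapper() {
        JsonFactory factory = JsonFactory.builder()
            .streamReadConstraints(StreamReadConstraints.builder()
                .maxStringLength(Integer.MAX_VALUE)
                .build())
            .build();
        return new ObjectMapper(factory);
    }
}
```

This only helps where the application constructs the `ObjectMapper` itself; the gap the issue describes is precisely that there is no way to pass such a setting down through `DataFrame::toJSON()`.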
[jira] [Resolved] (SPARK-47035) Protocol for client side StreamingQueryListener
[ https://issues.apache.org/jira/browse/SPARK-47035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-47035. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45091 [https://github.com/apache/spark/pull/45091] > Protocol for client side StreamingQueryListener > --- > > Key: SPARK-47035 > URL: https://issues.apache.org/jira/browse/SPARK-47035 > Project: Spark > Issue Type: Task > Components: Connect, Structured Streaming >Affects Versions: 4.0.0 >Reporter: Wei Liu >Assignee: Wei Liu >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47149) Add Use Pandas API on Spark section on Pandas Scaling to large datasets page
Bjørn Jørgensen created SPARK-47149: --- Summary: Add Use Pandas API on Spark section on Pandas Scaling to large datasets page Key: SPARK-47149 URL: https://issues.apache.org/jira/browse/SPARK-47149 Project: Spark Issue Type: Documentation Components: Pandas API on Spark Affects Versions: 4.0.0 Reporter: Bjørn Jørgensen We should make a PR like [DOC: Add Use Modin section on Scaling to large datasets page|https://github.com/pandas-dev/pandas/issues/57585] We can wait to see if pandas does accepts it. I hope it will be a good thing and that it can attract more users. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47118) Upgrade Jetty to 11
[ https://issues.apache.org/jira/browse/SPARK-47118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-47118. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45154 [https://github.com/apache/spark/pull/45154] > Upgrade Jetty to 11 > --- > > Key: SPARK-47118 > URL: https://issues.apache.org/jira/browse/SPARK-47118 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: HiuFung Kwok >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-46975) Support dedicated fallback methods
[ https://issues.apache.org/jira/browse/SPARK-46975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng resolved SPARK-46975. -- Resolution: Done Resolved by https://github.com/apache/spark/pull/45026 > Support dedicated fallback methods > -- > > Key: SPARK-46975 > URL: https://issues.apache.org/jira/browse/SPARK-46975 > Project: Spark > Issue Type: Sub-task > Components: PS >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-45101) Spark UI: A stage is still active even when all of its tasks have succeeded
[ https://issues.apache.org/jira/browse/SPARK-45101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17820182#comment-17820182 ] Bjørn Jørgensen commented on SPARK-45101: - Did you use spark.stop()? > Spark UI: A stage is still active even when all of its tasks have succeeded > --- > > Key: SPARK-45101 > URL: https://issues.apache.org/jira/browse/SPARK-45101 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.4.1, 3.5.0, 4.0.0 >Reporter: RickyMa >Priority: Critical > Attachments: 1.png, 2.png, 3.png > > > In the stage UI, we can see all the tasks' statuses are SUCCESS. > But the stage is still marked as active. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-46975) Support dedicated fallback methods
[ https://issues.apache.org/jira/browse/SPARK-46975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng reassigned SPARK-46975: Assignee: Ruifeng Zheng > Support dedicated fallback methods > -- > > Key: SPARK-46975 > URL: https://issues.apache.org/jira/browse/SPARK-46975 > Project: Spark > Issue Type: Sub-task > Components: PS >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47148) Avoid to materialize AQE ShuffleQueryStage on the cancellation
[ https://issues.apache.org/jira/browse/SPARK-47148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eren Avsarogullari updated SPARK-47148: --- Description: AQE can materialize *ShuffleQueryStage* on the cancellation. This causes unnecessary stage materialization by submitting Shuffle Job. Under normal circumstances, if the stage is already non-materialized (a.k.a ShuffleQueryStage.shuffleFuture is not initialized yet), it should just be skipped without materializing it. Please find sample use-case: *1- Stage Materialization Steps:* When stage materialization is failed: {code:java} 1.1- ShuffleQueryStage1 - is materialized successfully, 1.2- ShuffleQueryStage2 - materialization is failed, 1.3- ShuffleQueryStage3 - Not materialized yet so ShuffleQueryStage3.shuffleFuture is not initialized yet{code} *2- Stage Cancellation Steps:* {code:java} 2.1- ShuffleQueryStage1 - is canceled due to already materialized, 2.2- ShuffleQueryStage2 - is earlyFailedStage so currently, it is skipped as default by AQE because it could not be materialized, 2.3- ShuffleQueryStage3 - Problem is here: This stage is not materialized yet but currently, it is also tried to cancel and this stage requires to be materialized first.{code} was: AQE can materialize *ShuffleQueryStage* on the cancellation. This causes unnecessary stage materialization by submitting Shuffle Job. Under normal circumstances, if the stage is already non-materialized (a.k.a ShuffleQueryStage.shuffleFuture is not initialized yet), it should just be skipped without materializing it. 
Please find sample use-case: *1- Stage Materialization Steps:* When stage materialization is failed: {code:java} 1.1- ShuffleQueryStage1 - is materialized successfully, 1.2- ShuffleQueryStage2 - materialization is failed, 1.3- ShuffleQueryStage3 - Not materialized yet so ShuffleQueryStage3.shuffleFuture is not initialized yet{code} *2- Stage Cancellation Steps:* {code:java} 2.1- ShuffleQueryStage1 - is canceled due to already materialized, 2.2- ShuffleQueryStage2 - is earlyFailedStage so currently, it is skipped as default because it could not be materialized, 2.3- ShuffleQueryStage3 - Problem is here: This stage is not materialized yet but currently, it is also tried to cancel and this stage requires to be materialized first.{code} > Avoid to materialize AQE ShuffleQueryStage on the cancellation > -- > > Key: SPARK-47148 > URL: https://issues.apache.org/jira/browse/SPARK-47148 > Project: Spark > Issue Type: Bug > Components: Shuffle, SQL >Affects Versions: 4.0.0 >Reporter: Eren Avsarogullari >Priority: Major > Labels: pull-request-available > > AQE can materialize *ShuffleQueryStage* on the cancellation. This causes > unnecessary stage materialization by submitting Shuffle Job. Under normal > circumstances, if the stage is already non-materialized (a.k.a > ShuffleQueryStage.shuffleFuture is not initialized yet), it should just be > skipped without materializing it. 
> Please find sample use-case: > *1- Stage Materialization Steps:* > When stage materialization is failed: > {code:java} > 1.1- ShuffleQueryStage1 - is materialized successfully, > 1.2- ShuffleQueryStage2 - materialization is failed, > 1.3- ShuffleQueryStage3 - Not materialized yet so > ShuffleQueryStage3.shuffleFuture is not initialized yet{code} > *2- Stage Cancellation Steps:* > {code:java} > 2.1- ShuffleQueryStage1 - is canceled due to already materialized, > 2.2- ShuffleQueryStage2 - is earlyFailedStage so currently, it is skipped as > default by AQE because it could not be materialized, > 2.3- ShuffleQueryStage3 - Problem is here: This stage is not materialized yet > but currently, it is also tried to cancel and this stage requires to be > materialized first.{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
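The guard this issue asks for amounts to: only cancel a stage whose shuffle future already exists, and never trigger materialization as a side effect of cancellation. A minimal self-contained sketch of that idea using `java.util.concurrent` (a hypothetical stand-in class, not Spark's internal `ShuffleQueryStage` API):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for a lazily materialized stage: the future is
// created on first materialize() call, and cancel() must not create it.
class LazyStage {
    private final AtomicReference<CompletableFuture<Void>> shuffleFuture =
        new AtomicReference<>();

    // Start the (simulated) shuffle job only on first request.
    CompletableFuture<Void> materialize() {
        return shuffleFuture.updateAndGet(
            f -> f != null ? f : new CompletableFuture<>());
    }

    boolean isMaterialized() {
        return shuffleFuture.get() != null;
    }

    // The guarded cancellation: stages that never started are skipped.
    void cancel() {
        CompletableFuture<Void> f = shuffleFuture.get();
        if (f != null) {
            f.cancel(true);
        }
        // else: nothing to do -- cancelling here would have forced an
        // unnecessary materialization, which is the bug being described.
    }
}
```

In the use-case above, ShuffleQueryStage3 corresponds to a `LazyStage` whose `shuffleFuture` is still null at cancellation time, so `cancel()` should be a no-op for it.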
[jira] [Updated] (SPARK-46639) Add WindowExec SQLMetrics
[ https://issues.apache.org/jira/browse/SPARK-46639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eren Avsarogullari updated SPARK-46639: --- Description: Currently, WindowExec Physical Operator has only spillSize SQLMetric. This jira aims to add following SQLMetrics to provide more information from WindowExec usage during query execution: {code:java} numOfOutputRows: Number of total output rows. numOfPartitions: Number of processed input partitions. numOfWindowPartitions: Number of generated window partitions. spilledRows: Number of total spilled rows. spillSizeOnDisk: Total spilled data size on disk.{code} As an example use-case, WindowExec spilling behavior depends on multiple factors and it can sometime cause {{SparkOutOfMemoryError}} instead of spilling to disk so it is hard to track without SQL Metrics such as: *1-* WindowExec creates ExternalAppendOnlyUnsafeRowArray (internal ArrayBuffer) per task (a.k.a child RDD partition) *2-* When ExternalAppendOnlyUnsafeRowArray size exceeds spark.sql.windowExec.buffer.in.memory.threshold=4096, ExternalAppendOnlyUnsafeRowArray switches to UnsafeExternalSorter as spillableArray by moving its all buffered rows into UnsafeExternalSorter and ExternalAppendOnlyUnsafeRowArray (internal ArrayBuffer) is cleared. In this case, WindowExec starts to write UnsafeExternalSorter' s buffer (a.k.a UnsafeInMemorySorter). *3-* UnsafeExternalSorter is being created per window partition. When UnsafeExternalSorter' buffer size exceeds spark.sql.windowExec.buffer.spill.threshold=Integer.MAX_VALUE, it starts to write to disk and get cleared all buffer (a.k.a UnsafeInMemorySorter) content. In this case, UnsafeExternalSorter will continue to buffer next records until exceeding spark.sql.windowExec.buffer.spill.threshold. *New WindowExec SQLMetrics Sample Screenshot:* !WindowExec SQLMetrics.png|width=257,height=152! was: Currently, WindowExec Physical Operator has only spillSize SQLMetric. 
This jira aims to add following SQLMetrics to provide more information from WindowExec usage during query execution: {code:java} numOfOutputRows: Number of total output rows. numOfPartitions: Number of processed input partitions. numOfWindowPartitions: Number of generated window partitions. spilledRows: Number of total spilled rows. spillSizeOnDisk: Total spilled data size on disk.{code} As an example use-case, WindowExec spilling behavior depends on multiple factors and it can sometime cause {{SparkOutOfMemoryError}} instead of spilling to disk so it is hard to track without SQL Metrics such as: *1-* WindowExec creates ExternalAppendOnlyUnsafeRowArray (internal ArrayBuffer) per task (a.k.a child RDD partition) *2-* When ExternalAppendOnlyUnsafeRowArray size exceeds spark.sql.windowExec.buffer.in.memory.threshold=4096, ExternalAppendOnlyUnsafeRowArray switches to UnsafeExternalSorter as spillableArray by moving its all buffered rows into UnsafeExternalSorter and ExternalAppendOnlyUnsafeRowArray (internal ArrayBuffer) is cleared. In this case, WindowExec starts to write UnsafeExternalSorter' s buffer (a.k.a UnsafeInMemorySorter). *3-* UnsafeExternalSorter is being created per window partition. When UnsafeExternalSorter' buffer size exceeds spark.sql.windowExec.buffer.spill.threshold=Integer.MAX_VALUE, it starts to write to disk and get cleared all buffer (a.k.a UnsafeInMemorySorter) content. In this case, UnsafeExternalSorter will continue to buffer next records until exceeding spark.sql.windowExec.buffer.spill.threshold. Sample UI Screenshot: !WindowExec SQLMetrics.png|width=257,height=152! 
> Add WindowExec SQLMetrics > - > > Key: SPARK-46639 > URL: https://issues.apache.org/jira/browse/SPARK-46639 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Eren Avsarogullari >Priority: Major > Labels: pull-request-available > Attachments: WindowExec SQLMetrics.png > > > Currently, WindowExec Physical Operator has only spillSize SQLMetric. This > jira aims to add following SQLMetrics to provide more information from > WindowExec usage during query execution: > {code:java} > numOfOutputRows: Number of total output rows. > numOfPartitions: Number of processed input partitions. > numOfWindowPartitions: Number of generated window partitions. > spilledRows: Number of total spilled rows. > spillSizeOnDisk: Total spilled data size on disk.{code} > As an example use-case, WindowExec spilling behavior depends on multiple > factors and it can sometime cause {{SparkOutOfMemoryError}} instead of > spilling to disk so it is hard to track without SQL Metrics such as: > *1-* WindowExec creates ExternalAppendOnlyUnsafeRowArray (internal > ArrayBuffer) per task (a.k.a child RDD partition) > *2-* When ExternalAppendOnlyUnsafeRowArray size
[jira] [Updated] (SPARK-46639) Add WindowExec SQLMetrics
[ https://issues.apache.org/jira/browse/SPARK-46639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eren Avsarogullari updated SPARK-46639: --- Description: Currently, WindowExec Physical Operator has only spillSize SQLMetric. This jira aims to add following SQLMetrics to provide more information from WindowExec usage during query execution: {code:java} numOfOutputRows: Number of total output rows. numOfPartitions: Number of processed input partitions. numOfWindowPartitions: Number of generated window partitions. spilledRows: Number of total spilled rows. spillSizeOnDisk: Total spilled data size on disk.{code} As an example use-case, WindowExec spilling behavior depends on multiple factors and it can sometime cause {{SparkOutOfMemoryError}} instead of spilling to disk so it is hard to track without SQL Metrics such as: *1-* WindowExec creates ExternalAppendOnlyUnsafeRowArray (internal ArrayBuffer) per task (a.k.a child RDD partition) *2-* When ExternalAppendOnlyUnsafeRowArray size exceeds spark.sql.windowExec.buffer.in.memory.threshold=4096, ExternalAppendOnlyUnsafeRowArray switches to UnsafeExternalSorter as spillableArray by moving its all buffered rows into UnsafeExternalSorter and ExternalAppendOnlyUnsafeRowArray (internal ArrayBuffer) is cleared. In this case, WindowExec starts to write UnsafeExternalSorter' s buffer (a.k.a UnsafeInMemorySorter). *3-* UnsafeExternalSorter is being created per window partition. When UnsafeExternalSorter' buffer size exceeds spark.sql.windowExec.buffer.spill.threshold=Integer.MAX_VALUE, it starts to write to disk and get cleared all buffer (a.k.a UnsafeInMemorySorter) content. In this case, UnsafeExternalSorter will continue to buffer next records until exceeding spark.sql.windowExec.buffer.spill.threshold. Sample UI Screenshot: !WindowExec SQLMetrics.png|width=257,height=152! was: Currently, WindowExec Physical Operator has only spillSize SQLMetric. 
This jira aims to add following SQLMetrics to provide more information from WindowExec usage during query execution: {code:java} numOfOutputRows: Number of total output rows. numOfPartitions: Number of processed input partitions. numOfWindowPartitions: Number of generated window partitions. spilledRows: Number of total spilled rows. spillSizeOnDisk: Total spilled data size on disk.{code} As an example use-case, WindowExec spilling behavior depends on multiple factors and it can sometime cause {{SparkOutOfMemoryError}} instead of spilling to disk so it is hard to track without SQL Metrics such as: *1-* WindowExec creates ExternalAppendOnlyUnsafeRowArray (internal ArrayBuffer) per task (a.k.a child RDD partition) *2-* When ExternalAppendOnlyUnsafeRowArray size exceeds spark.sql.windowExec.buffer.in.memory.threshold=4096, ExternalAppendOnlyUnsafeRowArray switches to UnsafeExternalSorter as spillableArray by moving its all buffered rows into UnsafeExternalSorter and ExternalAppendOnlyUnsafeRowArray (internal ArrayBuffer) is cleared. In this case, WindowExec starts to write UnsafeExternalSorter' s buffer (a.k.a UnsafeInMemorySorter). *3-* UnsafeExternalSorter is being created per window partition. When UnsafeExternalSorter' buffer size exceeds spark.sql.windowExec.buffer.spill.threshold=Integer.MAX_VALUE, it starts to write to disk and get cleared all buffer (a.k.a UnsafeInMemorySorter) content. In this case, UnsafeExternalSorter will continue to buffer next records until exceeding spark.sql.windowExec.buffer.spill.threshold. > Add WindowExec SQLMetrics > - > > Key: SPARK-46639 > URL: https://issues.apache.org/jira/browse/SPARK-46639 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Eren Avsarogullari >Priority: Major > Labels: pull-request-available > Attachments: WindowExec SQLMetrics.png > > > Currently, WindowExec Physical Operator has only spillSize SQLMetric. 
This > jira aims to add following SQLMetrics to provide more information from > WindowExec usage during query execution: > {code:java} > numOfOutputRows: Number of total output rows. > numOfPartitions: Number of processed input partitions. > numOfWindowPartitions: Number of generated window partitions. > spilledRows: Number of total spilled rows. > spillSizeOnDisk: Total spilled data size on disk.{code} > As an example use-case, WindowExec spilling behavior depends on multiple > factors and it can sometime cause {{SparkOutOfMemoryError}} instead of > spilling to disk so it is hard to track without SQL Metrics such as: > *1-* WindowExec creates ExternalAppendOnlyUnsafeRowArray (internal > ArrayBuffer) per task (a.k.a child RDD partition) > *2-* When ExternalAppendOnlyUnsafeRowArray size exceeds > spark.sql.windowExec.buffer.in.memory.threshold=4096, >
[jira] [Updated] (SPARK-46639) Add WindowExec SQLMetrics
[ https://issues.apache.org/jira/browse/SPARK-46639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eren Avsarogullari updated SPARK-46639: --- Attachment: WindowExec SQLMetrics.png > Add WindowExec SQLMetrics > - > > Key: SPARK-46639 > URL: https://issues.apache.org/jira/browse/SPARK-46639 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Eren Avsarogullari >Priority: Major > Labels: pull-request-available > Attachments: WindowExec SQLMetrics.png > > > Currently, the WindowExec physical operator has only the spillSize SQLMetric. This > jira aims to add the following SQLMetrics to provide more information about > WindowExec usage during query execution: > {code:java} > numOfOutputRows: Number of total output rows. > numOfPartitions: Number of processed input partitions. > numOfWindowPartitions: Number of generated window partitions. > spilledRows: Number of total spilled rows. > spillSizeOnDisk: Total spilled data size on disk.{code} > As an example use-case, WindowExec spilling behavior depends on multiple > factors and can sometimes cause a {{SparkOutOfMemoryError}} instead of > spilling to disk, so it is hard to track without SQLMetrics: > *1-* WindowExec creates an ExternalAppendOnlyUnsafeRowArray (internal > ArrayBuffer) per task (a.k.a. child RDD partition). > *2-* When the ExternalAppendOnlyUnsafeRowArray size exceeds > spark.sql.windowExec.buffer.in.memory.threshold=4096, > ExternalAppendOnlyUnsafeRowArray switches to UnsafeExternalSorter as its > spillableArray by moving all its buffered rows into UnsafeExternalSorter, and > the ExternalAppendOnlyUnsafeRowArray (internal ArrayBuffer) is cleared. From > this point, WindowExec writes into UnsafeExternalSorter's buffer (a.k.a. > UnsafeInMemorySorter). > *3-* UnsafeExternalSorter is created per window partition. When > UnsafeExternalSorter's buffer size exceeds > spark.sql.windowExec.buffer.spill.threshold=Integer.MAX_VALUE, it starts to > write to disk and clears all buffer (a.k.a. UnsafeInMemorySorter) content. In > this case, UnsafeExternalSorter continues to buffer subsequent records until > spark.sql.windowExec.buffer.spill.threshold is exceeded again. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
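The two-phase buffering in steps 1-3 above, together with the proposed spilledRows metric, can be sketched as a toy buffer. The class and field names below are illustrative stand-ins for ExternalAppendOnlyUnsafeRowArray and UnsafeExternalSorter, not Spark's actual implementation:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the in-memory-then-spill hand-off described in steps 1-3,
// tracking the "spilledRows" counter this ticket proposes as a SQLMetric.
// All names and threshold semantics are illustrative, not Spark's code.
class SpillableRowBuffer {
    final int inMemoryThreshold; // stand-in for spark.sql.windowExec.buffer.in.memory.threshold
    final List<Object> inMemory = new ArrayList<>(); // internal ArrayBuffer analogue
    final List<Object> spilled = new ArrayList<>();  // UnsafeExternalSorter stand-in
    long spilledRows = 0;                            // proposed spilledRows metric

    SpillableRowBuffer(int inMemoryThreshold) {
        this.inMemoryThreshold = inMemoryThreshold;
    }

    void add(Object row) {
        if (spilled.isEmpty() && inMemory.size() < inMemoryThreshold) {
            inMemory.add(row); // cheap path: buffer in memory below the threshold
            return;
        }
        if (!inMemory.isEmpty()) {
            // Switch-over (step 2): move ALL buffered rows into the spillable
            // store and clear the in-memory buffer.
            spilled.addAll(inMemory);
            spilledRows += inMemory.size();
            inMemory.clear();
        }
        spilled.add(row); // every later row goes straight to the spillable store
        spilledRows++;
    }

    long numOfOutputRows() { return inMemory.size() + spilled.size(); }
}
```

With a threshold of 4, adding 10 rows leaves the in-memory buffer empty and all 10 rows counted as spilled, which is exactly the kind of visibility the proposed metrics would give without attaching a debugger.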
[jira] [Updated] (SPARK-47148) Avoid to materialize AQE ShuffleQueryStage on the cancellation
[ https://issues.apache.org/jira/browse/SPARK-47148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eren Avsarogullari updated SPARK-47148: --- Summary: Avoid to materialize AQE ShuffleQueryStage on the cancellation (was: [AQE] Avoid to materialize ShuffleQueryStage on the cancellation) > Avoid to materialize AQE ShuffleQueryStage on the cancellation > -- > > Key: SPARK-47148 > URL: https://issues.apache.org/jira/browse/SPARK-47148 > Project: Spark > Issue Type: Bug > Components: Shuffle, SQL >Affects Versions: 4.0.0 >Reporter: Eren Avsarogullari >Priority: Major > Labels: pull-request-available > > AQE can materialize *ShuffleQueryStage* on cancellation. This causes > unnecessary stage materialization by submitting a shuffle job. Under normal > circumstances, if the stage is not materialized yet (a.k.a. > ShuffleQueryStage.shuffleFuture is not initialized yet), it should just be > skipped without materializing it. > Sample use-case: > *1- Stage Materialization Steps:* > When stage materialization fails: > {code:java} > 1.1- ShuffleQueryStage1 - is materialized successfully, > 1.2- ShuffleQueryStage2 - materialization failed, > 1.3- ShuffleQueryStage3 - not materialized yet, so > ShuffleQueryStage3.shuffleFuture is not initialized yet{code} > *2- Stage Cancellation Steps:* > {code:java} > 2.1- ShuffleQueryStage1 - is canceled because it is already materialized, > 2.2- ShuffleQueryStage2 - is an earlyFailedStage, so it is currently skipped > by default because it could not be materialized, > 2.3- ShuffleQueryStage3 - the problem is here: this stage is not materialized > yet, but cancellation is currently attempted anyway, and cancelling requires > the stage to be materialized first.{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47148) [AQE] Avoid to materialize ShuffleQueryStage on the cancellation
Eren Avsarogullari created SPARK-47148: -- Summary: [AQE] Avoid to materialize ShuffleQueryStage on the cancellation Key: SPARK-47148 URL: https://issues.apache.org/jira/browse/SPARK-47148 Project: Spark Issue Type: Bug Components: Shuffle, SQL Affects Versions: 4.0.0 Reporter: Eren Avsarogullari AQE can materialize *ShuffleQueryStage* on cancellation. This causes unnecessary stage materialization by submitting a shuffle job. Under normal circumstances, if the stage is not materialized yet (a.k.a. ShuffleQueryStage.shuffleFuture is not initialized yet), it should just be skipped without materializing it. Sample use-case: *1- Stage Materialization Steps:* When stage materialization fails: {code:java} 1.1- ShuffleQueryStage1 - is materialized successfully, 1.2- ShuffleQueryStage2 - materialization failed, 1.3- ShuffleQueryStage3 - not materialized yet, so ShuffleQueryStage3.shuffleFuture is not initialized yet{code} *2- Stage Cancellation Steps:* {code:java} 2.1- ShuffleQueryStage1 - is canceled because it is already materialized, 2.2- ShuffleQueryStage2 - is an earlyFailedStage, so it is currently skipped as default because it could not be materialized, 2.3- ShuffleQueryStage3 - the problem is here: this stage is not materialized yet, but cancellation is currently attempted anyway, and cancelling requires the stage to be materialized first.{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
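The intended behavior can be sketched with a toy stage whose future only ever comes into existence through materialization; cancellation checks for an initialized future and otherwise skips. Class and method names here are illustrative, not Spark's actual AQE API:

```java
import java.util.concurrent.CompletableFuture;

// Toy model of the fix this ticket describes: cancellation must not have the
// side effect of materializing a stage. Only a stage whose shuffle future was
// already initialized gets cancelled; others are skipped. Names are
// illustrative stand-ins, not Spark's actual classes.
class ToyShuffleStage {
    private CompletableFuture<String> shuffleFuture; // null until materialize() is called

    boolean isMaterialized() { return shuffleFuture != null; }

    CompletableFuture<String> materialize() {
        if (shuffleFuture == null) {
            // In real AQE this would submit a shuffle job; here it completes
            // immediately so the sketch stays self-contained.
            shuffleFuture = CompletableFuture.completedFuture("mapOutputStats");
        }
        return shuffleFuture;
    }

    /** Cancels only if already materialized; never initializes the future. */
    boolean cancelIfMaterialized() {
        if (!isMaterialized()) {
            return false; // skip: nothing was ever submitted, nothing to cancel
        }
        return shuffleFuture.cancel(true) || shuffleFuture.isDone();
    }
}
```

The key property is the early return: cancelling ShuffleQueryStage3 in the scenario above becomes a no-op instead of triggering a shuffle job as a side effect.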
[jira] [Resolved] (SPARK-47129) Make ResolveRelations cache connect plan properly
[ https://issues.apache.org/jira/browse/SPARK-47129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-47129. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45214 [https://github.com/apache/spark/pull/45214] > Make ResolveRelations cache connect plan properly > - > > Key: SPARK-47129 > URL: https://issues.apache.org/jira/browse/SPARK-47129 > Project: Spark > Issue Type: Improvement > Components: Connect, SQL >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47129) Make ResolveRelations cache connect plan properly
[ https://issues.apache.org/jira/browse/SPARK-47129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-47129: - Assignee: Ruifeng Zheng > Make ResolveRelations cache connect plan properly > - > > Key: SPARK-47129 > URL: https://issues.apache.org/jira/browse/SPARK-47129 > Project: Spark > Issue Type: Improvement > Components: Connect, SQL >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44914) Upgrade Ivy to 2.5.2
[ https://issues.apache.org/jira/browse/SPARK-44914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-44914. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45075 [https://github.com/apache/spark/pull/45075] > Upgrade Ivy to 2.5.2 > > > Key: SPARK-44914 > URL: https://issues.apache.org/jira/browse/SPARK-44914 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.5.0, 4.0.0 >Reporter: Bjørn Jørgensen >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > [CVE-2022-46751|https://www.cve.org/CVERecord?id=CVE-2022-46751] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47146) Possible thread leak when doing sort merge join
JacobZheng created SPARK-47146: -- Summary: Possible thread leak when doing sort merge join Key: SPARK-47146 URL: https://issues.apache.org/jira/browse/SPARK-47146 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.4.0, 3.3.0, 3.2.0 Reporter: JacobZheng I have a long-running spark job. I stumbled upon an executor taking up a lot of threads, leaving no threads available on the server. Querying thread details via jstack shows tons of threads named read-ahead. Checking the code confirms that these threads are created by ReadAheadInputStream. This class creates a single-threaded thread pool on initialization: {code:java} private final ExecutorService executorService = ThreadUtils.newDaemonSingleThreadExecutor("read-ahead"); {code} This thread pool is closed by ReadAheadInputStream#close(). The call stack for the normal-case close() path is {code:java} ts=2024-02-21 17:36:18;thread_name=Executor task launch worker for task 60.0 in stage 71.0 (TID 258);id=330;is_daemon=true;priority=5;TCCL=org.apache.spark.util.MutableURLClassLoader@17233230 @org.apache.spark.io.ReadAheadInputStream.close() at org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.close(UnsafeSorterSpillReader.java:149) at org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.loadNext(UnsafeSorterSpillReader.java:121) at org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillMerger$1.loadNext(UnsafeSorterSpillMerger.java:87) at org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.advanceNext(UnsafeExternalRowSorter.java:187) at org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:67) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage27.processNext(null:-1) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760) at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.smj_findNextJoinRows_0$(null:-1) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.hashAgg_doAggregateWithKeys_1$(null:-1) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.hashAgg_doAggregateWithKeys_0$(null:-1) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.processNext(null:-1) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$2.hasNext(WholeStageCodegenExec.scala:779) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140) at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:101) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) at org.apache.spark.scheduler.Task.run(Task.scala:139) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.lang.Thread.run(Thread.java:829) {code} As shown in UnsafeSorterSpillReader#close, the stream is only closed when the data in the stream is read through. {code:java} @Override public void loadNext() throws IOException { // Kill the task in case it has been marked as killed. 
This logic is from // InterruptibleIterator, but we inline it here instead of wrapping the iterator in order // to avoid performance overhead. This check is added here in `loadNext()` instead of in // `hasNext()` because it's technically possible for the caller to be relying on // `getNumRecords()` instead of `hasNext()` to know when to stop. if (taskContext != null) { taskContext.killTaskIfInterrupted(); } recordLength = din.readInt(); keyPrefix = din.readLong(); if (recordLength > arr.length) { arr = new byte[recordLength]; baseObject = arr; } ByteStreams.readFully(in, arr, 0, recordLength); numRecordsRemaining--; if (numRecordsRemaining == 0) { close(); } } {code}
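The leak mechanism described above can be modeled minimally: a stream owns a single-threaded pool, and the pool only dies when close() runs. If the consumer abandons iteration before the last record (so the loadNext() path never reaches close()), the "read-ahead" thread survives the task. The class below is a hypothetical stand-in, not Spark's ReadAheadInputStream:

```java
import java.io.Closeable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Minimal model of the reported leak: a stream that owns a single-threaded
// pool, as ReadAheadInputStream does. The pool is released only in close(),
// so any code path that skips close() leaks one thread per stream instance.
class ToyReadAheadStream implements Closeable {
    private final ExecutorService executor =
        Executors.newSingleThreadExecutor(); // cf. newDaemonSingleThreadExecutor("read-ahead")

    boolean isThreadPoolAlive() { return !executor.isShutdown(); }

    @Override
    public void close() { // narrowed from Closeable's "throws IOException"
        executor.shutdownNow(); // releases the read-ahead thread
    }
}
```

Guarding the consumer with try-with-resources (or a try/finally around the code that drains the spill file) would guarantee close() runs even when iteration stops early, which is the general shape of a fix for this class of leak.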
[jira] [Updated] (SPARK-47144) Fix Spark Connect collation issue
[ https://issues.apache.org/jira/browse/SPARK-47144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47144: --- Labels: pull-request-available (was: ) > Fix Spark Connect collation issue > - > > Key: SPARK-47144 > URL: https://issues.apache.org/jira/browse/SPARK-47144 > Project: Spark > Issue Type: Bug > Components: Connect, SQL >Affects Versions: 4.0.0 >Reporter: Nikola Mandic >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Collated expression "SELECT 'abc' COLLATE 'UCS_BASIC_LCASE'" is failing when > connecting to the server using Spark Connect: > {code:java} > pyspark.errors.exceptions.connect.SparkConnectGrpcException: > (org.apache.spark.sql.connect.common.InvalidPlanInput) Does not support > convert string(UCS_BASIC_LCASE) to connect proto types.{code} > When using the default collation "UCS_BASIC", the error does not occur. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47023) Upgrade `aircompressor` to 0.26
[ https://issues.apache.org/jira/browse/SPARK-47023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim updated SPARK-47023: - Description: `aircompressor` is a transitive dependency from Apache ORC and Parquet. `aircompressor` v0.26 reported the following bug fixes recently. - [Fix out of bounds read/write in Snappy decompressor]([https://github.com/airlift/aircompressor/commit/b89db180bb97debe025b640dc40ed43816e8c7d2]) - [Fix ZstdOutputStream corruption on double close]([https://github.com/airlift/aircompressor/commit/b89db180bb97debe025b640dc40ed43816e8c7d2]) was: `aircompressor` is a transitive dependency from Apache ORC and Parquet. `aircompressor` v1.26 reported the following bug fixes recently. - [Fix out of bounds read/write in Snappy decompressor](https://github.com/airlift/aircompressor/commit/b89db180bb97debe025b640dc40ed43816e8c7d2) - [Fix ZstdOutputStream corruption on double close](https://github.com/airlift/aircompressor/commit/b89db180bb97debe025b640dc40ed43816e8c7d2) > Upgrade `aircompressor` to 0.26 > --- > > Key: SPARK-47023 > URL: https://issues.apache.org/jira/browse/SPARK-47023 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.5.0, 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0, 3.5.1 > > > `aircompressor` is a transitive dependency from Apache ORC and Parquet. > `aircompressor` v0.26 reported the following bug fixes recently. > - [Fix out of bounds read/write in Snappy > decompressor]([https://github.com/airlift/aircompressor/commit/b89db180bb97debe025b640dc40ed43816e8c7d2]) > - [Fix ZstdOutputStream corruption on double > close]([https://github.com/airlift/aircompressor/commit/b89db180bb97debe025b640dc40ed43816e8c7d2]) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47023) Upgrade `aircompressor` to 0.26
[ https://issues.apache.org/jira/browse/SPARK-47023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim updated SPARK-47023: - Summary: Upgrade `aircompressor` to 0.26 (was: Upgrade `aircompressor` to 1.26) > Upgrade `aircompressor` to 0.26 > --- > > Key: SPARK-47023 > URL: https://issues.apache.org/jira/browse/SPARK-47023 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.5.0, 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0, 3.5.1 > > > `aircompressor` is a transitive dependency from Apache ORC and Parquet. > `aircompressor` v1.26 reported the following bug fixes recently. > > - [Fix out of bounds read/write in Snappy > decompressor](https://github.com/airlift/aircompressor/commit/b89db180bb97debe025b640dc40ed43816e8c7d2) > - [Fix ZstdOutputStream corruption on double > close](https://github.com/airlift/aircompressor/commit/b89db180bb97debe025b640dc40ed43816e8c7d2) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47145) Provide table identifier to scan node when DS v2 strategy is applied
Uros Stankovic created SPARK-47145: -- Summary: Provide table identifier to scan node when DS v2 strategy is applied Key: SPARK-47145 URL: https://issues.apache.org/jira/browse/SPARK-47145 Project: Spark Issue Type: Task Components: Spark Core Affects Versions: 3.5.0 Reporter: Uros Stankovic Currently, DataSourceScanExec node can accept table identifier, and that information can be useful for later logging, debugging, etc, but DataSourceV2Strategy does not provide that information to scan node. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47144) Fix Spark Connect collation issue
[ https://issues.apache.org/jira/browse/SPARK-47144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nikola Mandic updated SPARK-47144: -- Epic Link: SPARK-46830 > Fix Spark Connect collation issue > - > > Key: SPARK-47144 > URL: https://issues.apache.org/jira/browse/SPARK-47144 > Project: Spark > Issue Type: Bug > Components: Connect, SQL >Affects Versions: 4.0.0 >Reporter: Nikola Mandic >Priority: Major > Fix For: 4.0.0 > > > Collated expression "SELECT 'abc' COLLATE 'UCS_BASIC_LCASE'" is failing when > connecting to the server using Spark Connect: > {code:java} > pyspark.errors.exceptions.connect.SparkConnectGrpcException: > (org.apache.spark.sql.connect.common.InvalidPlanInput) Does not support > convert string(UCS_BASIC_LCASE) to connect proto types.{code} > When using the default collation "UCS_BASIC", the error does not occur. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47144) Fix Spark Connect collation issue
[ https://issues.apache.org/jira/browse/SPARK-47144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nikola Mandic updated SPARK-47144: -- Component/s: SQL > Fix Spark Connect collation issue > - > > Key: SPARK-47144 > URL: https://issues.apache.org/jira/browse/SPARK-47144 > Project: Spark > Issue Type: Bug > Components: Connect, SQL >Affects Versions: 4.0.0 >Reporter: Nikola Mandic >Priority: Major > Fix For: 4.0.0 > > > Collated expression "SELECT 'abc' COLLATE 'UCS_BASIC_LCASE'" is failing when > connecting to the server using Spark Connect: > {code:java} > pyspark.errors.exceptions.connect.SparkConnectGrpcException: > (org.apache.spark.sql.connect.common.InvalidPlanInput) Does not support > convert string(UCS_BASIC_LCASE) to connect proto types.{code} > When using the default collation "UCS_BASIC", the error does not occur. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47144) Fix Spark Connect collation issue
Nikola Mandic created SPARK-47144: - Summary: Fix Spark Connect collation issue Key: SPARK-47144 URL: https://issues.apache.org/jira/browse/SPARK-47144 Project: Spark Issue Type: Bug Components: Connect Affects Versions: 4.0.0 Reporter: Nikola Mandic Fix For: 4.0.0 Collated expression "SELECT 'abc' COLLATE 'UCS_BASIC_LCASE'" is failing when connecting to the server using Spark Connect: {code:java} pyspark.errors.exceptions.connect.SparkConnectGrpcException: (org.apache.spark.sql.connect.common.InvalidPlanInput) Does not support convert string(UCS_BASIC_LCASE) to connect proto types.{code} When using the default collation "UCS_BASIC", the error does not occur. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47102) Add COLLATION_ENABLED config flag
[ https://issues.apache.org/jira/browse/SPARK-47102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-47102: -- Assignee: (was: Apache Spark) > Add COLLATION_ENABLED config flag > - > > Key: SPARK-47102 > URL: https://issues.apache.org/jira/browse/SPARK-47102 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Mihailo Milosevic >Priority: Major > Labels: pull-request-available > > *What changes were proposed in this pull request?* > This PR adds COLLATION_ENABLED config to `SQLConf` and introduces new error > class `COLLATION_SUPPORT_NOT_ENABLED` to appropriately report error on usage > of feature under development. > *Why are the changes needed?* > We want to make collations configurable on this flag. These changes disable > usage of `collate` and `collation` functions, along with any `COLLATE` syntax > when the flag is set to false. By default, the flag is set to false. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47102) Add COLLATION_ENABLED config flag
[ https://issues.apache.org/jira/browse/SPARK-47102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-47102: -- Assignee: Apache Spark > Add COLLATION_ENABLED config flag > - > > Key: SPARK-47102 > URL: https://issues.apache.org/jira/browse/SPARK-47102 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Mihailo Milosevic >Assignee: Apache Spark >Priority: Major > Labels: pull-request-available > > *What changes were proposed in this pull request?* > This PR adds COLLATION_ENABLED config to `SQLConf` and introduces new error > class `COLLATION_SUPPORT_NOT_ENABLED` to appropriately report error on usage > of feature under development. > *Why are the changes needed?* > We want to make collations configurable on this flag. These changes disable > usage of `collate` and `collation` functions, along with any `COLLATE` syntax > when the flag is set to false. By default, the flag is set to false. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47102) Add COLLATION_ENABLED config flag
[ https://issues.apache.org/jira/browse/SPARK-47102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-47102: -- Assignee: (was: Apache Spark) > Add COLLATION_ENABLED config flag > - > > Key: SPARK-47102 > URL: https://issues.apache.org/jira/browse/SPARK-47102 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Mihailo Milosevic >Priority: Major > Labels: pull-request-available > > *What changes were proposed in this pull request?* > This PR adds COLLATION_ENABLED config to `SQLConf` and introduces new error > class `COLLATION_SUPPORT_NOT_ENABLED` to appropriately report error on usage > of feature under development. > *Why are the changes needed?* > We want to make collations configurable on this flag. These changes disable > usage of `collate` and `collation` functions, along with any `COLLATE` syntax > when the flag is set to false. By default, the flag is set to false. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47135) Implement error classes for Kafka data loss exceptions
[ https://issues.apache.org/jira/browse/SPARK-47135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-47135: -- Assignee: (was: Apache Spark) > Implement error classes for Kafka data loss exceptions > --- > > Key: SPARK-47135 > URL: https://issues.apache.org/jira/browse/SPARK-47135 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 4.0.0 >Reporter: B. Micheal Okutubo >Priority: Major > Labels: pull-request-available > > In the kafka connector code, we have several code paths that throw the java > *IllegalStateException* to report data loss while reading from Kafka. We want > to properly classify those exceptions using the new error framework. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47135) Implement error classes for Kafka data loss exceptions
[ https://issues.apache.org/jira/browse/SPARK-47135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-47135: -- Assignee: Apache Spark > Implement error classes for Kafka data loss exceptions > --- > > Key: SPARK-47135 > URL: https://issues.apache.org/jira/browse/SPARK-47135 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 4.0.0 >Reporter: B. Micheal Okutubo >Assignee: Apache Spark >Priority: Major > Labels: pull-request-available > > In the kafka connector code, we have several code paths that throw the java > *IllegalStateException* to report data loss while reading from Kafka. We want > to properly classify those exceptions using the new error framework. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47102) Add COLLATION_ENABLED config flag
[ https://issues.apache.org/jira/browse/SPARK-47102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-47102: -- Assignee: Apache Spark > Add COLLATION_ENABLED config flag > - > > Key: SPARK-47102 > URL: https://issues.apache.org/jira/browse/SPARK-47102 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Mihailo Milosevic >Assignee: Apache Spark >Priority: Major > Labels: pull-request-available > > *What changes were proposed in this pull request?* > This PR adds COLLATION_ENABLED config to `SQLConf` and introduces new error > class `COLLATION_SUPPORT_DISABLED` to appropriately report error on usage of > feature under development. > *Why are the changes needed?* > We want to make collations configurable on this some flag. These changes > disable usage of `collate` and `collation` functions, along with any > `COLLATE` syntax when the flag is set to false. By default, the flag is set > to false. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47102) Add COLLATION_ENABLED config flag
[ https://issues.apache.org/jira/browse/SPARK-47102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mihailo Milosevic updated SPARK-47102: -- Description: *What changes were proposed in this pull request?* This PR adds COLLATION_ENABLED config to `SQLConf` and introduces new error class `COLLATION_SUPPORT_NOT_ENABLED` to appropriately report error on usage of feature under development. *Why are the changes needed?* We want to make collations configurable on this flag. These changes disable usage of `collate` and `collation` functions, along with any `COLLATE` syntax when the flag is set to false. By default, the flag is set to false. was: *What changes were proposed in this pull request?* This PR adds COLLATION_ENABLED config to `SQLConf` and introduces new error class `COLLATION_SUPPORT_DISABLED` to appropriately report error on usage of feature under development. *Why are the changes needed?* We want to make collations configurable on this some flag. These changes disable usage of `collate` and `collation` functions, along with any `COLLATE` syntax when the flag is set to false. By default, the flag is set to false. > Add COLLATION_ENABLED config flag > - > > Key: SPARK-47102 > URL: https://issues.apache.org/jira/browse/SPARK-47102 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Mihailo Milosevic >Priority: Major > Labels: pull-request-available > > *What changes were proposed in this pull request?* > This PR adds COLLATION_ENABLED config to `SQLConf` and introduces new error > class `COLLATION_SUPPORT_NOT_ENABLED` to appropriately report error on usage > of feature under development. > *Why are the changes needed?* > We want to make collations configurable on this flag. These changes disable > usage of `collate` and `collation` functions, along with any `COLLATE` syntax > when the flag is set to false. By default, the flag is set to false. 
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
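The gating described in SPARK-47102 can be sketched as a feature flag that fails fast with the new error class while collation support is under development. The field name, message format, and exception type below are illustrative assumptions, not Spark's actual SQLConf entry or error framework:

```java
// Sketch of a feature-flag gate in the spirit of COLLATION_ENABLED: while the
// flag is false (the default, per the ticket), any use of the collation
// feature is rejected with the COLLATION_SUPPORT_NOT_ENABLED error class.
// All identifiers here are illustrative, not Spark's actual API.
class CollationGate {
    static boolean collationEnabled = false; // default: feature disabled

    static String collate(String value, String collationName) {
        if (!collationEnabled) {
            throw new UnsupportedOperationException(
                "[COLLATION_SUPPORT_NOT_ENABLED] enable the collation conf flag first");
        }
        // Placeholder for the real collation semantics once enabled.
        return value + " COLLATE " + collationName;
    }
}
```

The same check would guard the `collate` and `collation` functions and the `COLLATE` syntax, so the feature is invisible until the flag is flipped.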
[jira] [Closed] (SPARK-47104) Spark SQL query fails with NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-47104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chhavi Bansal closed SPARK-47104. - > Spark SQL query fails with NullPointerException > --- > > Key: SPARK-47104 > URL: https://issues.apache.org/jira/browse/SPARK-47104 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.3, 3.1.3, 3.2.1, 3.4.2, 3.5.0 >Reporter: Chhavi Bansal >Assignee: Bruce Robbins >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > I am trying to run a very simple SQL query involving join and orderby clause > and then using UUID() function in the outermost select stmt. The query fails > {code:java} > val df = spark.read.format("csv").option("header", > "true").load("src/main/resources/titanic.csv") > df.createOrReplaceTempView("titanic") > val query = spark.sql(" select name, uuid() as _iid from (select s.name from > titanic s join titanic t on s.name = t.name order by name) ;") > query.show() // FAILS{code} > Dataset is a normal csv file with the following columns > {code:java} > PassengerId,Survived,Pclass,Name,Gender,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked > {code} > Below is the error > {code:java} > Exception in thread "main" java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) > at > org.apache.spark.sql.execution.TakeOrderedAndProjectExec.$anonfun$executeCollect$2(limit.scala:207) > at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:237) > at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) > at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) > at scala.collection.TraversableLike.map(TraversableLike.scala:237) > at scala.collection.TraversableLike.map$(TraversableLike.scala:230) > at 
scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198) > at > org.apache.spark.sql.execution.TakeOrderedAndProjectExec.executeCollect(limit.scala:207) > at > org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$executeCollect$1(AdaptiveSparkPlanExec.scala:338) > at > org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.withFinalPlanUpdate(AdaptiveSparkPlanExec.scala:366) > at > org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:338) > at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3715) > at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2728) > at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3706) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704) > at org.apache.spark.sql.Dataset.head(Dataset.scala:2728) > at org.apache.spark.sql.Dataset.take(Dataset.scala:2935) > at org.apache.spark.sql.Dataset.getRows(Dataset.scala:287) > at org.apache.spark.sql.Dataset.showString(Dataset.scala:326) > at org.apache.spark.sql.Dataset.show(Dataset.scala:808) > at org.apache.spark.sql.Dataset.show(Dataset.scala:785) > at > hyperspace2.sparkPlan$.delayedEndpoint$hyperspace2$sparkPlan$1(sparkPlan.scala:14) > at hyperspace2.sparkPlan$delayedInit$body.apply(sparkPlan.scala:6) > at scala.Function0.apply$mcV$sp(Function0.scala:39) > at scala.Function0.apply$mcV$sp$(Function0.scala:39) > at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17) > at 
scala.App.$anonfun$main$1$adapted(App.scala:80) > at scala.collection.immutable.List.foreach(List.scala:392) > at scala.App.main(App.scala:80) > at scala.App.main$(App.scala:78) > at hyperspace2.sparkPlan$.main(sparkPlan.scala:6) > at hyperspace2.sparkPlan.main(sparkPlan.scala) {code} > Note: > # If I remove the ORDER BY clause, the query produces the correct output. > # This happens when I read the dataset from a CSV file; it works fine if I build > the DataFrame using Seq().toDF. > # The query fails if I use spark.sql("query").show(), but succeeds when I > simply write the result to a CSV file. > [https://stackoverflow.com/questions/78020267/spark-sql-query-fails-with-nullpointerexception] > Could someone please look into why this happens only when using `show()`, since > this is failing queries in production for me. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
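[Editor's note] The stack trace points at SpecificUnsafeProjection.apply inside TakeOrderedAndProjectExec. A plausible mechanism, consistent with uuid() being a non-deterministic expression in Spark, is that such expressions must be initialized with a partition index before evaluation, and the projection on this code path skipped that step. The following is a simplified, self-contained Scala sketch of that contract; the class names are illustrative and are not Spark's actual internals:

```scala
import java.util.UUID

// Simplified model of a "non-deterministic expression" contract:
// initialize(partitionIndex) must run before eval(). The names here
// are illustrative; they are not Spark's real classes.
trait NondeterministicExpr {
  protected var rng: java.util.Random = null // set by initialize()
  def initialize(partitionIndex: Int): Unit = {
    rng = new java.util.Random(partitionIndex)
  }
  def eval(): String
}

class UuidExpr extends NondeterministicExpr {
  // Calling eval() before initialize() dereferences the null `rng`
  // field, which matches the shape of the reported NullPointerException.
  def eval(): String = new UUID(rng.nextLong(), rng.nextLong()).toString
}

object Demo {
  def main(args: Array[String]): Unit = {
    val bad = new UuidExpr
    val sawNpe =
      try { bad.eval(); false }
      catch { case _: NullPointerException => true }
    println(s"eval() before initialize() throws NPE: $sawNpe") // true

    val good = new UuidExpr
    good.initialize(0) // the step the failing projection plausibly skipped
    println(s"eval() after initialize(): ${good.eval()}")
  }
}
```

This also suggests why the reported workarounds behave differently: execution paths that initialize the projection per partition work, while the collect-and-show path that builds the projection without initialization fails.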
[jira] [Commented] (SPARK-47104) Spark SQL query fails with NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-47104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17819970#comment-17819970 ] Chhavi Bansal commented on SPARK-47104: --- Thanks Team for looking into the issue and rolling out a fix. > Spark SQL query fails with NullPointerException > --- > > Key: SPARK-47104 > URL: https://issues.apache.org/jira/browse/SPARK-47104 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.3, 3.1.3, 3.2.1, 3.4.2, 3.5.0 >Reporter: Chhavi Bansal >Assignee: Bruce Robbins >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org