[spark] branch master updated (61c4057c3fe -> 99739ae068d)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

    from 61c4057c3fe [SPARK-41905][CONNECT] Support name as strings in slice
     add 99739ae068d [SPARK-41840][CONNECT][TESTS] Remove the invalid JIRA in the comment

No new revisions were added by this update.

Summary of changes:
 python/pyspark/sql/tests/connect/test_parity_dataframe.py |  9 +++--
 python/pyspark/sql/tests/connect/test_parity_functions.py | 14 ++
 2 files changed, 9 insertions(+), 14 deletions(-)

To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (b7dbfa2c376 -> 61c4057c3fe)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

    from b7dbfa2c376 [SPARK-41921][CONNECT][TESTS] Enable doctests in connect.column and connect.functions
     add 61c4057c3fe [SPARK-41905][CONNECT] Support name as strings in slice

No new revisions were added by this update.

Summary of changes:
 python/pyspark/sql/connect/functions.py                   | 10 +-
 python/pyspark/sql/tests/connect/test_parity_functions.py |  5 -
 2 files changed, 5 insertions(+), 10 deletions(-)
[spark] branch master updated (ec60670818e -> b7dbfa2c376)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

    from ec60670818e [SPARK-41906][CONNECT][TESTS] Reenable rand test in Spark Connect
     add b7dbfa2c376 [SPARK-41921][CONNECT][TESTS] Enable doctests in connect.column and connect.functions

No new revisions were added by this update.

Summary of changes:
 python/pyspark/sql/connect/column.py    | 4
 python/pyspark/sql/connect/functions.py | 3 ---
 2 files changed, 7 deletions(-)
[spark] branch master updated (b5d162bb59c -> ec60670818e)
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

    from b5d162bb59c [SPARK-41869][CONNECT] Reject single string in dropDuplicates
     add ec60670818e [SPARK-41906][CONNECT][TESTS] Reenable rand test in Spark Connect

No new revisions were added by this update.

Summary of changes:
 python/pyspark/sql/tests/connect/test_parity_functions.py | 5 -
 python/pyspark/sql/tests/test_functions.py                | 3 +--
 2 files changed, 1 insertion(+), 7 deletions(-)
[spark] branch master updated (3a3bc77f3de -> b5d162bb59c)
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

    from 3a3bc77f3de Revert "[SPARK-41677][CORE][SQL][SS][UI] Add Protobuf serializer for StreamingQueryProgressWrapper"
     add b5d162bb59c [SPARK-41869][CONNECT] Reject single string in dropDuplicates

No new revisions were added by this update.

Summary of changes:
 python/pyspark/sql/connect/dataframe.py                   | 3 +++
 python/pyspark/sql/tests/connect/test_parity_dataframe.py | 5 -
 2 files changed, 3 insertions(+), 5 deletions(-)
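The commit title above says that `dropDuplicates` in Spark Connect now rejects a single string. The notification does not show the change itself, so the following is only a minimal plain-Python sketch of that kind of argument check. The helper name `validate_subset` and the error message are hypothetical and are not taken from the PySpark source.

```python
from typing import List, Optional, Sequence


def validate_subset(subset: Optional[Sequence[str]]) -> Optional[List[str]]:
    """Hypothetical helper: accept None or a list/tuple of column names only."""
    if subset is None:
        return None
    # A bare string is itself a sequence of characters, so it has to be
    # rejected explicitly rather than being treated as a list of columns.
    if isinstance(subset, str) or not isinstance(subset, (list, tuple)):
        raise TypeError("Parameter 'subset' must be a list of columns")
    return list(subset)
```

With this check, `validate_subset(["a", "b"])` passes through unchanged, while `validate_subset("a")` raises `TypeError` instead of silently being read as a column list.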
[spark] branch master updated: Revert "[SPARK-41677][CORE][SQL][SS][UI] Add Protobuf serializer for StreamingQueryProgressWrapper"
This is an automated email from the ASF dual-hosted git repository.

gengliang pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 3a3bc77f3de Revert "[SPARK-41677][CORE][SQL][SS][UI] Add Protobuf serializer for StreamingQueryProgressWrapper"
3a3bc77f3de is described below

commit 3a3bc77f3dea368ca0b434a3f8a9629b5d69a5ca
Author: Gengliang Wang
AuthorDate: Thu Jan 5 20:28:55 2023 -0800

Revert "[SPARK-41677][CORE][SQL][SS][UI] Add Protobuf serializer for StreamingQueryProgressWrapper"

### What changes were proposed in this pull request?

This reverts commit 915e9c67a9581a1f66e70321879092d854c9fb3b.

### Why are the changes needed?

When running end-to-end tests, there are 5 NPE errors from string fields:
- SourceProgress.latestOffset
- SourceProgress.endOffset
- SourceProgress.startOffset
- StreamingQueryData.name
- StreamingQueryProgress.name

After fixing them, the following error appears:
```
java.lang.UnsupportedOperationException
	at java.base/java.util.Collections$UnmodifiableMap.remove(Collections.java:1460)
	at org.apache.spark.sql.streaming.ui.StreamingQueryStatisticsPage.$anonfun$generateStatTable$27(StreamingQueryStatisticsPage.scala:401)
```
The deserialized map `StreamingQueryProgress.durationMs` needs to be mutable. Given that StreamingQueryProgressWrapper contains nullable fields and a mutable map, I suggest using the default JSON serializer for this class.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

GA tests

Closes #39416 from gengliangwang/revertSS.
Authored-by: Gengliang Wang
Signed-off-by: Gengliang Wang
---
 .../apache/spark/status/protobuf/store_types.proto |  51 ---
 .../org.apache.spark.status.protobuf.ProtobufSerDe |   1 -
 .../org/apache/spark/sql/streaming/progress.scala  |   8 +-
 .../ui/StreamingQueryStatusListener.scala          |   2 +-
 .../protobuf/sql/SinkProgressSerializer.scala      |  42 -
 .../protobuf/sql/SourceProgressSerializer.scala    |  65
 .../sql/StateOperatorProgressSerializer.scala      |  75 -
 .../sql/StreamingQueryProgressSerializer.scala     |  89 ---
 .../StreamingQueryProgressWrapperSerializer.scala  |  40 -
 .../sql/KVStoreProtobufSerializerSuite.scala       | 170 +
 10 files changed, 6 insertions(+), 537 deletions(-)

diff --git a/core/src/main/protobuf/org/apache/spark/status/protobuf/store_types.proto b/core/src/main/protobuf/org/apache/spark/status/protobuf/store_types.proto
index 1c3e5bfc49a..499fda34174 100644
--- a/core/src/main/protobuf/org/apache/spark/status/protobuf/store_types.proto
+++ b/core/src/main/protobuf/org/apache/spark/status/protobuf/store_types.proto
@@ -686,54 +686,3 @@ message ExecutorPeakMetricsDistributions {
   repeated double quantiles = 1;
   repeated ExecutorMetrics executor_metrics = 2;
 }
-
-message StateOperatorProgress {
-  string operator_name = 1;
-  int64 num_rows_total = 2;
-  int64 num_rows_updated = 3;
-  int64 all_updates_time_ms = 4;
-  int64 num_rows_removed = 5;
-  int64 all_removals_time_ms = 6;
-  int64 commit_time_ms = 7;
-  int64 memory_used_bytes = 8;
-  int64 num_rows_dropped_by_watermark = 9;
-  int64 num_shuffle_partitions = 10;
-  int64 num_state_store_instances = 11;
-  map custom_metrics = 12;
-}
-
-message SourceProgress {
-  string description = 1;
-  string start_offset = 2;
-  string end_offset = 3;
-  string latest_offset = 4;
-  int64 num_input_rows = 5;
-  double input_rows_per_second = 6;
-  double processed_rows_per_second = 7;
-  map metrics = 8;
-}
-
-message SinkProgress {
-  string description = 1;
-  int64 num_output_rows = 2;
-  map metrics = 3;
-}
-
-message StreamingQueryProgress {
-  string id = 1;
-  string run_id = 2;
-  string name = 3;
-  string timestamp = 4;
-  int64 batch_id = 5;
-  int64 batch_duration = 6;
-  map duration_ms = 7;
-  map event_time = 8;
-  repeated StateOperatorProgress state_operators = 9;
-  repeated SourceProgress sources = 10;
-  SinkProgress sink = 11;
-  map observed_metrics = 12;
-}
-
-message StreamingQueryProgressWrapper {
-  StreamingQueryProgress progress = 1;
-}
diff --git a/sql/core/src/main/resources/META-INF/services/org.apache.spark.status.protobuf.ProtobufSerDe b/sql/core/src/main/resources/META-INF/services/org.apache.spark.status.protobuf.ProtobufSerDe
index e907d559349..7beff87d7ec 100644
--- a/sql/core/src/main/resources/META-INF/services/org.apache.spark.status.protobuf.ProtobufSerDe
+++ b/sql/core/src/main/resources/META-INF/services/org.apache.spark.status.protobuf.ProtobufSerDe
@@ -18,4 +18,3 @@ org.apache.spark.status.protobuf.sql.SQLExecutionUIDataSerializer
 org.apache.spark.status.protobuf.sql.SparkPlanGraphWrapperSerializer
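The revert above is motivated by a consumer calling `remove` on a map that came back read-only from deserialization. As a rough analogy only, the same failure mode can be sketched in plain Python with `types.MappingProxyType`. The map contents and both function names below are hypothetical, and nothing here reflects Spark's actual serializer code.

```python
from types import MappingProxyType

# Hypothetical stand-in for a deserialized, read-only durationMs map.
duration_ms = MappingProxyType({"triggerExecution": 120, "addBatch": 80})


def remove_in_place(durations, key):
    """Mimics a consumer that mutates the map it was handed."""
    del durations[key]  # raises TypeError on a read-only mapping


def total_without(durations, key):
    """Sums all durations except `key`, working on a mutable copy."""
    mutable = dict(durations)  # copying makes removal legal
    mutable.pop(key, None)
    return sum(mutable.values())
```

The design point is the one the commit message makes: either the deserializer hands back a mutable map, or every consumer has to copy before mutating.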
[spark] branch master updated: [SPARK-41708][SQL] Pull v1write information to `WriteFiles`
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 27e20fe9eb1 [SPARK-41708][SQL] Pull v1write information to `WriteFiles`
27e20fe9eb1 is described below

commit 27e20fe9eb1b1ef1b3d32e180de55931f31fc345
Author: ulysses-you
AuthorDate: Fri Jan 6 12:13:30 2023 +0800

[SPARK-41708][SQL] Pull v1write information to `WriteFiles`

### What changes were proposed in this pull request?

This PR pulls the v1 write information out of `V1WriteCommand` and into `WriteFiles`:
```scala
case class WriteFiles(child: LogicalPlan)

=>

case class WriteFiles(
    child: LogicalPlan,
    fileFormat: FileFormat,
    partitionColumns: Seq[Attribute],
    bucketSpec: Option[BucketSpec],
    options: Map[String, String],
    staticPartitions: TablePartitionSpec)
```

This PR also cleans up `WriteSpec`, which is unnecessary.

### Why are the changes needed?

After this PR, `WriteFiles` will hold write information that can help developers

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

Pass CI

Closes #39277 from ulysses-you/SPARK-41708.
Authored-by: ulysses-you
Signed-off-by: Wenchen Fan
---
 .../org/apache/spark/sql/internal/WriteSpec.java   |  33 ---
 .../org/apache/spark/sql/execution/SparkPlan.scala |   9 +-
 .../spark/sql/execution/SparkStrategies.scala      |   5 +-
 .../spark/sql/execution/datasources/V1Writes.scala |  24 ++-
 .../sql/execution/datasources/WriteFiles.scala     |  26 ++-
 .../org/apache/spark/sql/hive/HiveStrategies.scala |   3 +-
 .../{SaveAsHiveFile.scala => HiveTempPath.scala}   | 204 ++-
 .../hive/execution/InsertIntoHiveDirCommand.scala  |  13 +-
 .../sql/hive/execution/InsertIntoHiveTable.scala   |  88 +---
 .../spark/sql/hive/execution/SaveAsHiveFile.scala  | 221 +
 .../sql/hive/execution/V1WritesHiveUtils.scala     |  33 ++-
 .../org/apache/spark/sql/hive/InsertSuite.scala    |  15 +-
 12 files changed, 224 insertions(+), 450 deletions(-)

diff --git a/sql/core/src/main/java/org/apache/spark/sql/internal/WriteSpec.java b/sql/core/src/main/java/org/apache/spark/sql/internal/WriteSpec.java
deleted file mode 100644
index c51a3ed7dc6..000
--- a/sql/core/src/main/java/org/apache/spark/sql/internal/WriteSpec.java
+++ /dev/null
@@ -1,33 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements. See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License. You may obtain a copy of the License at
- *
- *    http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-package org.apache.spark.sql.internal;
-
-import java.io.Serializable;
-
-/**
- * Write spec is a input parameter of
- * {@link org.apache.spark.sql.execution.SparkPlan#executeWrite}.
- *
- *
- * This is an empty interface, the concrete class which implements
- * {@link org.apache.spark.sql.execution.SparkPlan#doExecuteWrite}
- * should define its own class and use it.
- *
- * @since 3.4.0
- */
-public interface WriteSpec extends Serializable {}
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala
index 401302e5bde..5ca36a8a216 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala
@@ -36,8 +36,9 @@ import org.apache.spark.sql.catalyst.plans.physical._
 import org.apache.spark.sql.catalyst.trees.{BinaryLike, LeafLike, TreeNodeTag, UnaryLike}
 import org.apache.spark.sql.connector.write.WriterCommitMessage
 import org.apache.spark.sql.errors.QueryExecutionErrors
+import org.apache.spark.sql.execution.datasources.WriteFilesSpec
 import org.apache.spark.sql.execution.metric.SQLMetric
-import org.apache.spark.sql.internal.{SQLConf, WriteSpec}
+import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.vectorized.ColumnarBatch
 import org.apache.spark.util.NextIterator
 import org.apache.spark.util.io.{ChunkedByteBuffer, ChunkedByteBufferOutputStream}
@@ -230,11 +231,11 @@ abstract class
[spark] branch branch-3.2 updated: [SPARK-41162][SQL][3.3] Fix anti- and semi-join for self-join with aggregations
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.2 by this push:
     new 7eca60d4f30 [SPARK-41162][SQL][3.3] Fix anti- and semi-join for self-join with aggregations
7eca60d4f30 is described below

commit 7eca60d4f304d4a1a66add9fd04166d8eed1dd4f
Author: Enrico Minack
AuthorDate: Fri Jan 6 11:32:45 2023 +0800

[SPARK-41162][SQL][3.3] Fix anti- and semi-join for self-join with aggregations

### What changes were proposed in this pull request?

Backport #39131 to branch-3.3.

Rule `PushDownLeftSemiAntiJoin` should not push an anti-join below an `Aggregate` when the join condition references an attribute that exists in its right plan and its left plan's child. This usually happens when the anti-join / semi-join is a self-join while `DeduplicateRelations` cannot deduplicate those attributes (in this example due to the projection of `value` to `id`).

This behaviour already exists for `Project` and `Union`, but `Aggregate` lacks this safety guard.

### Why are the changes needed?

Without this change, the optimizer creates an incorrect plan.

This example fails with `distinct()` (an aggregation), and succeeds without `distinct()`, but both queries are identical:
```scala
val ids = Seq(1, 2, 3).toDF("id").distinct()
val result = ids.withColumn("id", $"id" + 1).join(ids, Seq("id"), "left_anti").collect()
assert(result.length == 1)
```
With `distinct()`, rule `PushDownLeftSemiAntiJoin` creates a join condition `(value#907 + 1) = value#907`, which can never be true. This effectively removes the anti-join.

**Before this PR:**
The anti-join is fully removed from the plan.
```
== Physical Plan ==
AdaptiveSparkPlan (16)
+- == Final Plan ==
   LocalTableScan (1)

(16) AdaptiveSparkPlan
Output [1]: [id#900]
Arguments: isFinalPlan=true
```

This is caused by `PushDownLeftSemiAntiJoin` adding join condition `(value#907 + 1) = value#907`, which is wrong because `id#910` in `(id#910 + 1) AS id#912` exists in the right child of the join as well as in the left grandchild:
```
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.PushDownLeftSemiAntiJoin ===
!Join LeftAnti, (id#912 = id#910)                  Aggregate [id#910], [(id#910 + 1) AS id#912]
!:- Aggregate [id#910], [(id#910 + 1) AS id#912]   +- Project [value#907 AS id#910]
!:  +- Project [value#907 AS id#910]                  +- Join LeftAnti, ((value#907 + 1) = value#907)
!:     +- LocalRelation [value#907]                      :- LocalRelation [value#907]
!+- Aggregate [id#910], [id#910]                         +- Aggregate [id#910], [id#910]
!   +- Project [value#914 AS id#910]                        +- Project [value#914 AS id#910]
!      +- LocalRelation [value#914]                            +- LocalRelation [value#914]
```

The right child of the join and the left grandchild would become the children of the pushed-down join, which creates an invalid join condition.

**After this PR:**
Join condition `(id#910 + 1) AS id#912` is understood to become ambiguous, as both sides of the prospective join contain `id#910`. Hence, the join is not pushed down. The rule is then not applied any more.

The final plan contains the anti-join:
```
== Physical Plan ==
AdaptiveSparkPlan (24)
+- == Final Plan ==
   * BroadcastHashJoin LeftSemi BuildRight (14)
   :- * HashAggregate (7)
   :  +- AQEShuffleRead (6)
   :     +- ShuffleQueryStage (5), Statistics(sizeInBytes=48.0 B, rowCount=3)
   :        +- Exchange (4)
   :           +- * HashAggregate (3)
   :              +- * Project (2)
   :                 +- * LocalTableScan (1)
   +- BroadcastQueryStage (13), Statistics(sizeInBytes=1024.0 KiB, rowCount=3)
      +- BroadcastExchange (12)
         +- * HashAggregate (11)
            +- AQEShuffleRead (10)
               +- ShuffleQueryStage (9), Statistics(sizeInBytes=48.0 B, rowCount=3)
                  +- ReusedExchange (8)

(8) ReusedExchange [Reuses operator id: 4]
Output [1]: [id#898]

(24) AdaptiveSparkPlan
Output [1]: [id#900]
Arguments: isFinalPlan=true
```

### Does this PR introduce _any_ user-facing change?

It fixes correctness.

### How was this patch tested?

Unit tests in `DataFrameJoinSuite` and `LeftSemiAntiJoinPushDownSuite`.

Closes #39409 from EnricoMi/branch-antijoin-selfjoin-fix-3.3.

Authored-by: Enrico Minack
Signed-off-by: Wenchen Fan
(cherry picked from commit b97f79da04acc9bde1cb4def7dc33c22cfc11372)
Signed-off-by: Wenchen Fan
---
 .../optimizer/PushDownLeftSemiAntiJoin.scala
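To make the correctness problem in the message above concrete, here is a small plain-Python model of a left anti-join over lists. It is only an illustration under stated assumptions: `left_anti_join` is a hypothetical helper, not a Spark API, and the faulty pushed-down condition `(value + 1) = value` is modeled as reading the same row on both sides, which is an assumption about how the ambiguous attribute resolves.

```python
def left_anti_join(left, right, condition):
    """Keep the rows of `left` for which no row of `right` satisfies `condition`."""
    return [l for l in left if not any(condition(l, r) for r in right)]


ids = [1, 2, 3]
shifted = [i + 1 for i in ids]  # models ids.withColumn("id", $"id" + 1)

# Intended plan: anti-join the shifted ids against the original ids.
correct = left_anti_join(shifted, ids, lambda l, r: l == r)

# Faulty plan: the join is pushed below the "+ 1" projection and its
# condition degenerates to (value + 1) = value, which never matches,
# so the anti-join filters nothing before the projection is applied.
pushed_down = left_anti_join(ids, ids, lambda l, r: l + 1 == l)
broken = [v + 1 for v in pushed_down]
```

In this model `correct` holds one row, matching the `assert(result.length == 1)` in the Scala example, while `broken` keeps all three rows because the anti-join has effectively disappeared.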
[spark] branch branch-3.3 updated: [SPARK-41162][SQL][3.3] Fix anti- and semi-join for self-join with aggregations
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.3 by this push:
     new b97f79da04a [SPARK-41162][SQL][3.3] Fix anti- and semi-join for self-join with aggregations
b97f79da04a is described below

commit b97f79da04acc9bde1cb4def7dc33c22cfc11372
Author: Enrico Minack
AuthorDate: Fri Jan 6 11:32:45 2023 +0800

[SPARK-41162][SQL][3.3] Fix anti- and semi-join for self-join with aggregations

### What changes were proposed in this pull request?

Backport #39131 to branch-3.3.

Rule `PushDownLeftSemiAntiJoin` should not push an anti-join below an `Aggregate` when the join condition references an attribute that exists in its right plan and its left plan's child. This usually happens when the anti-join / semi-join is a self-join while `DeduplicateRelations` cannot deduplicate those attributes (in this example due to the projection of `value` to `id`).

This behaviour already exists for `Project` and `Union`, but `Aggregate` lacks this safety guard.

### Why are the changes needed?

Without this change, the optimizer creates an incorrect plan.

This example fails with `distinct()` (an aggregation), and succeeds without `distinct()`, but both queries are identical:
```scala
val ids = Seq(1, 2, 3).toDF("id").distinct()
val result = ids.withColumn("id", $"id" + 1).join(ids, Seq("id"), "left_anti").collect()
assert(result.length == 1)
```
With `distinct()`, rule `PushDownLeftSemiAntiJoin` creates a join condition `(value#907 + 1) = value#907`, which can never be true. This effectively removes the anti-join.

**Before this PR:**
The anti-join is fully removed from the plan.
```
== Physical Plan ==
AdaptiveSparkPlan (16)
+- == Final Plan ==
   LocalTableScan (1)

(16) AdaptiveSparkPlan
Output [1]: [id#900]
Arguments: isFinalPlan=true
```

This is caused by `PushDownLeftSemiAntiJoin` adding join condition `(value#907 + 1) = value#907`, which is wrong because `id#910` in `(id#910 + 1) AS id#912` exists in the right child of the join as well as in the left grandchild:
```
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.PushDownLeftSemiAntiJoin ===
!Join LeftAnti, (id#912 = id#910)                  Aggregate [id#910], [(id#910 + 1) AS id#912]
!:- Aggregate [id#910], [(id#910 + 1) AS id#912]   +- Project [value#907 AS id#910]
!:  +- Project [value#907 AS id#910]                  +- Join LeftAnti, ((value#907 + 1) = value#907)
!:     +- LocalRelation [value#907]                      :- LocalRelation [value#907]
!+- Aggregate [id#910], [id#910]                         +- Aggregate [id#910], [id#910]
!   +- Project [value#914 AS id#910]                        +- Project [value#914 AS id#910]
!      +- LocalRelation [value#914]                            +- LocalRelation [value#914]
```

The right child of the join and the left grandchild would become the children of the pushed-down join, which creates an invalid join condition.

**After this PR:**
Join condition `(id#910 + 1) AS id#912` is understood to become ambiguous, as both sides of the prospective join contain `id#910`. Hence, the join is not pushed down. The rule is then not applied any more.

The final plan contains the anti-join:
```
== Physical Plan ==
AdaptiveSparkPlan (24)
+- == Final Plan ==
   * BroadcastHashJoin LeftSemi BuildRight (14)
   :- * HashAggregate (7)
   :  +- AQEShuffleRead (6)
   :     +- ShuffleQueryStage (5), Statistics(sizeInBytes=48.0 B, rowCount=3)
   :        +- Exchange (4)
   :           +- * HashAggregate (3)
   :              +- * Project (2)
   :                 +- * LocalTableScan (1)
   +- BroadcastQueryStage (13), Statistics(sizeInBytes=1024.0 KiB, rowCount=3)
      +- BroadcastExchange (12)
         +- * HashAggregate (11)
            +- AQEShuffleRead (10)
               +- ShuffleQueryStage (9), Statistics(sizeInBytes=48.0 B, rowCount=3)
                  +- ReusedExchange (8)

(8) ReusedExchange [Reuses operator id: 4]
Output [1]: [id#898]

(24) AdaptiveSparkPlan
Output [1]: [id#900]
Arguments: isFinalPlan=true
```

### Does this PR introduce _any_ user-facing change?

It fixes correctness.

### How was this patch tested?

Unit tests in `DataFrameJoinSuite` and `LeftSemiAntiJoinPushDownSuite`.

Closes #39409 from EnricoMi/branch-antijoin-selfjoin-fix-3.3.

Authored-by: Enrico Minack
Signed-off-by: Wenchen Fan
---
 .../optimizer/PushDownLeftSemiAntiJoin.scala       | 13 ++---
 .../optimizer/LeftSemiAntiJoinPushDownSuite.scala  | 57 ++
[spark] branch master updated: [SPARK-41912][SQL] Subquery should not validate CTE
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 89666d44a39 [SPARK-41912][SQL] Subquery should not validate CTE
89666d44a39 is described below

commit 89666d44a39c48df841a0102ff6f54eaeb4c6140
Author: Rui Wang
AuthorDate: Fri Jan 6 11:30:48 2023 +0800

[SPARK-41912][SQL] Subquery should not validate CTE

### What changes were proposed in this pull request?

The commit https://github.com/apache/spark/pull/38029 intended to do the right thing: it checks CTEs more aggressively, even when a CTE is not used, which is fine. However, it triggers an existing issue: a subquery validates itself, and if the subquery references a CTE that is defined outside of the subquery, the check fails because the CTE is not found (e.g. key not found).

In other words, after that commit every CTE in the repro example is checked (previously only used CTEs were checked). One of the newly checked CTEs contains a subquery, and that subquery references another CTE that is defined outside of it. The subquery validates itself and therefore fails with CTE not found.

This PR fixes the issue by removing the subquery self-validation in the CTE case.

### Why are the changes needed?

This fixes a regression where the following query no longer passes the analyzer:
```
val df = sql("""
    |WITH
    |cte1 as (SELECT 1 col1),
    |cte2 as (SELECT (SELECT MAX(col1) FROM cte1))
    |SELECT * FROM cte1
    |""".stripMargin
)
checkAnswer(df, Row(1) :: Nil)
```

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

UT

Closes #39414 from amaliujia/fix_subquery_validate.
Authored-by: Rui Wang
Signed-off-by: Wenchen Fan
---
 .../apache/spark/sql/catalyst/analysis/CheckAnalysis.scala    |  2 +-
 .../src/test/scala/org/apache/spark/sql/SubquerySuite.scala   | 11 +++
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
index 8309186d566..4dc0bf98a54 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
@@ -923,7 +923,7 @@ trait CheckAnalysis extends PredicateHelper with LookupCatalog with QueryErrorsB
     }
 
     // Validate the subquery plan.
-    checkAnalysis(expr.plan)
+    checkAnalysis0(expr.plan)
 
     // Check if there is outer attribute that cannot be found from the plan.
     checkOuterReference(plan, expr)
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala
index 3d4a629f7a9..86a0c4d1799 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala
@@ -1019,6 +1019,17 @@ class SubquerySuite extends QueryTest
     }
   }
 
+  test("SPARK-41912: Subquery does not validate CTE") {
+    val df = sql("""
+      |WITH
+      |cte1 as (SELECT 1 col1),
+      |cte2 as (SELECT (SELECT MAX(col1) FROM cte1))
+      |SELECT * FROM cte1
+      |""".stripMargin
+    )
+    checkAnswer(df, Row(1) :: Nil)
+  }
+
   test("SPARK-21835: Join in correlated subquery should be duplicateResolved: case 1") {
     withTable("t1") {
       withTempPath { path =>
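The failure described in this commit comes down to scoping: a subquery that references a CTE defined outside of it cannot be validated in isolation. The toy model below illustrates that idea in plain Python. The dictionary-based plan shape and every name in it (`check_cte_refs`, `cte_refs`, `subqueries`) are hypothetical and do not mirror the `CheckAnalysis` code.

```python
def check_cte_refs(plan, defined):
    """Raise KeyError if `plan` references a CTE name that is not in `defined`."""
    for name in plan.get("cte_refs", []):
        if name not in defined:
            raise KeyError(f"CTE not found: {name}")
    for sub in plan.get("subqueries", []):
        check_cte_refs(sub, defined)


# cte2's body holds a scalar subquery that reads cte1, which is defined outside it.
subquery = {"cte_refs": ["cte1"]}
cte2_body = {"subqueries": [subquery]}
outer_definitions = {"cte1", "cte2"}


def validate_standalone(sub):
    """Validates the subquery with a fresh, empty scope (the failing case)."""
    check_cte_refs(sub, set())


def validate_in_context(plan):
    """Validates with the definitions of the enclosing query visible."""
    check_cte_refs(plan, outer_definitions)
```

Here `validate_standalone(subquery)` raises `KeyError`, mirroring the "CTE not found" failure, while `validate_in_context(cte2_body)` succeeds because the outer definitions are in scope.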
[spark] branch master updated (729c4bff167 -> 80ee385cc38)
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

    from 729c4bff167 [SPARK-41861][SQL] Make v2 ScanBuilders' build() return typed scan
     add 80ee385cc38 [SPARK-41831][CONNECT][PYTHON] Make `DataFrame.select` accept column list

No new revisions were added by this update.

Summary of changes:
 python/pyspark/sql/connect/dataframe.py | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)
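The commit title above says `DataFrame.select` should accept a column list. The diff is not included in the notification, so the snippet below is only a plain-Python sketch of one common way to support both calling styles. `normalize_columns` is a hypothetical helper name, not a function from `pyspark.sql.connect`.

```python
def normalize_columns(*cols):
    """Hypothetical helper: treat f(["a", "b"]) the same as f("a", "b")."""
    # A single list argument is unpacked so that both call styles
    # produce the same flat list of columns.
    if len(cols) == 1 and isinstance(cols[0], list):
        return list(cols[0])
    return list(cols)
```

Both `normalize_columns("a", "b")` and `normalize_columns(["a", "b"])` yield the same flat list, so the rest of a `select`-style method only has to handle one shape.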
[spark] branch master updated (eba31a8de3f -> 729c4bff167)
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

    from eba31a8de3f [SPARK-41806][SQL] Use AppendData.byName for SQL INSERT INTO by name for DSV2
     add 729c4bff167 [SPARK-41861][SQL] Make v2 ScanBuilders' build() return typed scan

No new revisions were added by this update.

Summary of changes:
 .../src/main/scala/org/apache/spark/sql/v2/avro/AvroScanBuilder.scala | 3 +--
 .../spark/sql/execution/datasources/v2/csv/CSVScanBuilder.scala       | 3 +--
 .../spark/sql/execution/datasources/v2/jdbc/JDBCScanBuilder.scala     | 4 ++--
 .../spark/sql/execution/datasources/v2/json/JsonScanBuilder.scala     | 3 +--
 .../spark/sql/execution/datasources/v2/orc/OrcScanBuilder.scala       | 4 ++--
 .../sql/execution/datasources/v2/parquet/ParquetScanBuilder.scala     | 4 ++--
 .../spark/sql/execution/datasources/v2/text/TextScanBuilder.scala     | 3 +--
 7 files changed, 10 insertions(+), 14 deletions(-)
[spark] branch master updated: [SPARK-41806][SQL] Use AppendData.byName for SQL INSERT INTO by name for DSV2
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new eba31a8de3f [SPARK-41806][SQL] Use AppendData.byName for SQL INSERT INTO by name for DSV2
eba31a8de3f is described below

commit eba31a8de3fb79f96255a0feb58db19842c9d16d
Author: Allison Portis
AuthorDate: Fri Jan 6 10:42:16 2023 +0800

[SPARK-41806][SQL] Use AppendData.byName for SQL INSERT INTO by name for DSV2

### What changes were proposed in this pull request?

Use DSv2 AppendData.byName for INSERT INTO by name instead of reordering and converting to AppendData.byOrdinal.

### Why are the changes needed?

Currently, for INSERT INTO by name we reorder the value list and convert it to INSERT INTO by ordinal. Since DSv2 logical nodes have the `isByName` flag, we don't need to do this. The current approach is limiting in that:

- Users must provide the full list of table columns (this limits the functionality for features like generated columns, see [SPARK-41290](https://issues.apache.org/jira/browse/SPARK-41290)).
- It allows ambiguous queries such as `INSERT OVERWRITE t PARTITION (c='1') (c) VALUES ('2')`, where the user provides both the static partition column 'c' and the column 'c' in the column list. We should check that the static partition column is not in the column list. See the added test for a more detailed example.

### Does this PR introduce _any_ user-facing change?
For versions 3.3 and below:
```sql
CREATE TABLE t(i STRING, c string) USING PARQUET PARTITIONED BY (c);
INSERT OVERWRITE t PARTITION (c='1') (c) VALUES ('2');
SELECT * FROM t;
```
```
+---+---+
|  i|  c|
+---+---+
|  2|  1|
+---+---+
```
For versions 3.4 and above:
```sql
CREATE TABLE t(i STRING, c string) USING PARQUET PARTITIONED BY (c);
INSERT OVERWRITE t PARTITION (c='1') (c) VALUES ('2');
```
```
AnalysisException: [STATIC_PARTITION_COLUMN_IN_COLUMN_LIST] Static partition column c is also specified in the column list.
```

### How was this patch tested?

Unit tests are added.

Closes #39334 from allisonport-db/insert-into-by-name.

Authored-by: Allison Portis
Signed-off-by: Wenchen Fan
---
 core/src/main/resources/error/error-classes.json   |  5 ++
 .../spark/sql/catalyst/analysis/Analyzer.scala     | 99 +++---
 .../spark/sql/errors/QueryCompilationErrors.scala  |  6 ++
 .../org/apache/spark/sql/SQLInsertTestSuite.scala  | 16 +++-
 .../spark/sql/connector/DataSourceV2SQLSuite.scala | 96 +
 .../execution/command/PlanResolutionSuite.scala    | 30 ++-
 6 files changed, 239 insertions(+), 13 deletions(-)

diff --git a/core/src/main/resources/error/error-classes.json b/core/src/main/resources/error/error-classes.json
index 29cafdcc1b6..1d1952dce1b 100644
--- a/core/src/main/resources/error/error-classes.json
+++ b/core/src/main/resources/error/error-classes.json
@@ -1145,6 +1145,11 @@
       "Star (*) is not allowed in a select list when GROUP BY an ordinal position is used."
     ]
   },
+  "STATIC_PARTITION_COLUMN_IN_INSERT_COLUMN_LIST" : {
+    "message" : [
+      "Static partition column is also specified in the column list."
+    ]
+  },
   "STREAM_FAILED" : {
     "message" : [
       "Query [id = , runId = ] terminated with exception: "
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
index 1ebbfb9a39a..8fff0d41add 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
@@ -1291,28 +1291,92 @@ class Analyzer(override val catalogManager: CatalogManager)
     }
   }
 
+  /** Handle INSERT INTO for DSv2 */
   object ResolveInsertInto extends Rule[LogicalPlan] {
+
+    /** Add a project to use the table column names for INSERT INTO BY NAME */
+    private def createProjectForByNameQuery(i: InsertIntoStatement): LogicalPlan = {
+      SchemaUtils.checkColumnNameDuplication(i.userSpecifiedCols, resolver)
+
+      if (i.userSpecifiedCols.size != i.query.output.size) {
+        throw QueryCompilationErrors.writeTableWithMismatchedColumnsError(
+          i.userSpecifiedCols.size, i.query.output.size, i.query)
+      }
+      val projectByName = i.userSpecifiedCols.zip(i.query.output)
+        .map { case (userSpecifiedCol, queryOutputCol) =>
+          val resolvedCol = i.table.resolve(Seq(userSpecifiedCol), resolver)
+            .getOrElse(
+              throw QueryCompilationErrors.unresolvedAttributeError(
+                "UNRESOLVED_COLUMN", userSpecifiedCol, i.table.output.map(_.name),
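The by-name checks described in this commit (no duplicate column names, matching column and value counts, every column resolvable, and no static partition column repeated in the column list) can be modeled in a few lines of plain Python. This is a sketch under assumptions: `project_by_name` and its error messages are hypothetical, and name matching here is a simple exact comparison rather than Spark's resolver.

```python
def project_by_name(table_cols, user_cols, values, static_partitions):
    """Hypothetical model: map `values` onto table columns by name."""
    if len(set(user_cols)) != len(user_cols):
        raise ValueError("duplicate column in column list")
    if len(user_cols) != len(values):
        raise ValueError("column list and value list differ in length")
    for col in user_cols:
        if col not in table_cols:
            raise ValueError(f"unresolved column: {col}")
        if col in static_partitions:
            raise ValueError(
                f"static partition column {col} is also specified in the column list"
            )
    # Static partition values and by-name values together form the written row.
    row = dict(static_partitions)
    row.update(zip(user_cols, values))
    return row
```

In this model the ambiguous statement from the description, with `c` both as a static partition and in the column list, is rejected, while naming only `i` alongside the static partition `c='1'` produces a row with both columns set.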
[spark] branch master updated (ca8bd4ec49f -> b5347737767)
dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

    from ca8bd4ec49f [SPARK-41827][CONNECT][PYTHON] Make `GroupBy` accept column list
     add b5347737767 [SPARK-41567][BUILD] Move configuration of `versions-maven-plugin` to parent pom

No new revisions were added by this update.

Summary of changes:
 dev/test-dependencies.sh | 5 ++---
 pom.xml                  | 6 ++
 2 files changed, 8 insertions(+), 3 deletions(-)
[spark] branch master updated (334bc188387 -> ca8bd4ec49f)
gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

    from 334bc188387 [SPARK-41849][CONNECT] Implement DataFrameReader.text
     add ca8bd4ec49f [SPARK-41827][CONNECT][PYTHON] Make `GroupBy` accept column list

No new revisions were added by this update.

Summary of changes:
 python/pyspark/sql/connect/dataframe.py | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)
[spark] branch master updated (542644075b3 -> 334bc188387)
gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

    from 542644075b3 [SPARK-41892][CONNECT][TESTS] pyspark.sql.tests.test_functions - Add JIRAs or messages for skipped messages
     add 334bc188387 [SPARK-41849][CONNECT] Implement DataFrameReader.text

No new revisions were added by this update.

Summary of changes:
 python/pyspark/sql/connect/functions.py            |  3 ---
 python/pyspark/sql/connect/readwriter.py           | 24 ++
 python/pyspark/sql/readwriter.py                   |  3 +++
 .../sql/tests/connect/test_connect_basic.py        | 10 +
 4 files changed, 37 insertions(+), 3 deletions(-)
[spark] branch master updated: [SPARK-41892][CONNECT][TESTS] pyspark.sql.tests.test_functions - Add JIRAs or messages for skipped messages
gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 542644075b3 [SPARK-41892][CONNECT][TESTS] pyspark.sql.tests.test_functions - Add JIRAs or messages for skipped messages
542644075b3 is described below

commit 542644075b3fe92ca2b6675111237fd2fc177ba1
Author: Sandeep Singh
AuthorDate: Fri Jan 6 09:42:44 2023 +0900

[SPARK-41892][CONNECT][TESTS] pyspark.sql.tests.test_functions - Add JIRAs or messages for skipped messages

### What changes were proposed in this pull request?
This PR enables the reused PySpark tests in Spark Connect that now pass, and adds JIRAs or messages to the ones that are still skipped.

### Why are the changes needed?
To ensure test coverage.

### Does this PR introduce any user-facing change?
No, test-only.

### How was this patch tested?
Enabling tests

Closes #39412 from techaddict/SPARK-41892.

Authored-by: Sandeep Singh
Signed-off-by: Hyukjin Kwon
---
 .../sql/tests/connect/test_parity_functions.py | 26 ++
 1 file changed, 26 insertions(+)

diff --git a/python/pyspark/sql/tests/connect/test_parity_functions.py b/python/pyspark/sql/tests/connect/test_parity_functions.py
index 3c616b5c864..a85acf7f6ec 100644
--- a/python/pyspark/sql/tests/connect/test_parity_functions.py
+++ b/python/pyspark/sql/tests/connect/test_parity_functions.py
@@ -44,106 +44,132 @@ class FunctionsParityTests(ReusedSQLTestCase, FunctionsTestsMixin):
         cls.spark = cls._spark.stop()
         del os.environ["SPARK_REMOTE"]
 
+    # TODO(SPARK-41897): Parity in Error types between pyspark and connect functions
     @unittest.skip("Fails in Spark Connect, should enable.")
     def test_assert_true(self):
         super().test_assert_true()
 
+    # Spark Connect does not support Spark Context but the test depends on that.
     @unittest.skip("Fails in Spark Connect, should enable.")
     def test_basic_functions(self):
         super().test_basic_functions()
 
+    # TODO(SPARK-41899): DataFrame.createDataFrame converting int to bigint
     @unittest.skip("Fails in Spark Connect, should enable.")
     def test_date_add_function(self):
         super().test_date_add_function()
 
+    # TODO(SPARK-41899): DataFrame.createDataFrame converting int to bigint
     @unittest.skip("Fails in Spark Connect, should enable.")
     def test_date_sub_function(self):
         super().test_date_sub_function()
 
+    # TODO(SPARK-41847): DataFrame mapfield,structlist invalid type
     @unittest.skip("Fails in Spark Connect, should enable.")
     def test_explode(self):
         super().test_explode()
 
+    # Spark Connect does not support Spark Context but the test depends on that.
     @unittest.skip("Fails in Spark Connect, should enable.")
     def test_function_parity(self):
         super().test_function_parity()
 
+    # Spark Connect does not support Spark Context, _jdf but the test depends on that.
     @unittest.skip("Fails in Spark Connect, should enable.")
     def test_functions_broadcast(self):
         super().test_functions_broadcast()
 
+    # Spark Connect does not support Spark Context but the test depends on that.
     @unittest.skip("Fails in Spark Connect, should enable.")
     def test_input_file_name_reset_for_rdd(self):
         super().test_input_file_name_reset_for_rdd()
 
+    # TODO(SPARK-41849): Implement DataFrameReader.text
     @unittest.skip("Fails in Spark Connect, should enable.")
     def test_input_file_name_udf(self):
         super().test_input_file_name_udf()
 
+    # TODO(SPARK-41901): Parity in String representation of Column
     @unittest.skip("Fails in Spark Connect, should enable.")
     def test_inverse_trig_functions(self):
         super().test_inverse_trig_functions()
 
+    # TODO(SPARK-41834): Implement SparkSession.conf
     @unittest.skip("Fails in Spark Connect, should enable.")
     def test_lit_list(self):
         super().test_lit_list()
 
+    # TODO(SPARK-41900): support Data Type int8
     @unittest.skip("Fails in Spark Connect, should enable.")
     def test_lit_np_scalar(self):
         super().test_lit_np_scalar()
 
+    # TODO(SPARK-41902): Fix String representation of maps created by `map_from_arrays`
     @unittest.skip("Fails in Spark Connect, should enable.")
     def test_map_functions(self):
         super().test_map_functions()
 
+    # TODO(SPARK-41903): Support data type ndarray
     @unittest.skip("Fails in Spark Connect, should enable.")
     def test_ndarray_input(self):
         super().test_ndarray_input()
 
+    # TODO(SPARK-41902): Parity in String representation of higher_order_function's output
     @unittest.skip("Fails in Spark Connect, should enable.")
     def test_nested_higher_order_function(self):
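
The pattern in the diff above (a parity class that inherits shared tests and skips the ones that do not pass yet) can be tried without Spark. The following is a minimal sketch using only the standard `unittest` module; the mixin body and the helper `run_parity_suite` are hypothetical stand-ins, not PySpark code.

```python
import unittest


class FunctionsTestsMixin:
    """Hypothetical stand-in for a shared test mixin."""

    def test_explode(self):
        self.assertEqual(1 + 1, 2)

    def test_lit_list(self):
        self.assertEqual([1, 2], [1, 2])


class FunctionsParityTests(FunctionsTestsMixin, unittest.TestCase):
    # TODO(SPARK-41847): DataFrame mapfield,structlist invalid type
    @unittest.skip("Fails in Spark Connect, should enable.")
    def test_explode(self):
        # Never reached while the decorator is present; kept so that
        # removing the decorator re-enables the inherited test.
        super().test_explode()


def run_parity_suite():
    """Run the parity class and return the unittest.TestResult."""
    loader = unittest.defaultTestLoader
    suite = loader.loadTestsFromTestCase(FunctionsParityTests)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

`run_parity_suite().skipped` holds `(test, reason)` pairs, so the skip reasons can be listed to track which tests still need a fix. The tracking issue lives only in the comment here, as in the diff; it is not part of the skip reason.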
[spark] branch master updated: [SPARK-41893][BUILD] Publish SBOM artifacts
dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new f123179c0fe [SPARK-41893][BUILD] Publish SBOM artifacts
f123179c0fe is described below

commit f123179c0fe5517ebe3ed3f9668c3970fb491064
Author: Dongjoon Hyun
AuthorDate: Thu Jan 5 16:22:48 2023 -0800

[SPARK-41893][BUILD] Publish SBOM artifacts

### What changes were proposed in this pull request?
This PR aims to publish `SBOM` artifacts.

### Why are the changes needed?
Here is an article to give some context.
- https://www.activestate.com/blog/why-the-us-government-is-mandating-software-bill-of-materials-sbom/

Software Bills of Materials (SBOMs) are additional artifacts containing the aggregate of all direct and transitive dependencies of a project. The US Government (based on NIST recommendations) currently accepts only the three most popular SBOM standards as valid, namely: [CycloneDX](https://cyclonedx.org/), [Software Identification (SWID) tag](https://csrc.nist.gov/projects/Software-Identification-SWID), [Software Package Data Exchange® (SPDX)](https://spdx.dev/).

This PR uses the [CycloneDX maven plugin](https://github.com/CycloneDX/cyclonedx-maven-plugin). CycloneDX is a lightweight software bill of materials (SBOM) standard designed for use in application security contexts and supply chain component analysis.

For example, `spark-tags_2.12-3.4.0-SNAPSHOT-cyclonedx.xml` and `spark-tags_2.12-3.4.0-SNAPSHOT-cyclonedx.json` files are attached to `spark-tags_2.12-3.4.0-SNAPSHOT.jar`.
```
$ ls -al ~/.m2/repository/org/apache/spark/spark-tags_2.12/3.4.0-SNAPSHOT
total 2488
drwxr-xr-x  12 dongjoon  staff      384 Jan  4 23:36 .
drwxr-xr-x   4 dongjoon  staff      128 Jan  4 23:36 ..
-rw-r--r--   1 dongjoon  staff      492 Jan  4 23:36 _remote.repositories
-rw-r--r--   1 dongjoon  staff     1955 Jan  4 23:36 maven-metadata-local.xml
-rw-r--r--   1 dongjoon  staff    16310 Jan  4 23:36 spark-tags_2.12-3.4.0-SNAPSHOT-cyclonedx.json
-rw-r--r--   1 dongjoon  staff    14045 Jan  4 23:36 spark-tags_2.12-3.4.0-SNAPSHOT-cyclonedx.xml
-rw-r--r--   1 dongjoon  staff  1162027 Jan  4 23:36 spark-tags_2.12-3.4.0-SNAPSHOT-javadoc.jar
-rw-r--r--   1 dongjoon  staff    16272 Jan  4 23:36 spark-tags_2.12-3.4.0-SNAPSHOT-sources.jar
-rw-r--r--   1 dongjoon  staff    12453 Jan  4 23:36 spark-tags_2.12-3.4.0-SNAPSHOT-test-sources.jar
-rw-r--r--   1 dongjoon  staff    10387 Jan  4 23:36 spark-tags_2.12-3.4.0-SNAPSHOT-tests.jar
-rw-r--r--   1 dongjoon  staff    15181 Jan  4 23:36 spark-tags_2.12-3.4.0-SNAPSHOT.jar
-rw-r--r--   1 dongjoon  staff     5822 Jan  4 23:36 spark-tags_2.12-3.4.0-SNAPSHOT.pom
```

### Does this PR introduce _any_ user-facing change?
Yes, but dev-only changes.

### How was this patch tested?
Manually tested.
```
$ mvn install -DskipTests
...
[INFO]
[INFO] Reactor Summary for Spark Project Parent POM 3.4.0-SNAPSHOT:
[INFO]
[INFO] Spark Project Parent POM .................... SUCCESS [ 10.501 s]
[INFO] Spark Project Tags .......................... SUCCESS [ 12.900 s]
[INFO] Spark Project Sketch ........................ SUCCESS [ 24.315 s]
[INFO] Spark Project Local DB ...................... SUCCESS [ 25.406 s]
[INFO] Spark Project Networking .................... SUCCESS [ 36.217 s]
[INFO] Spark Project Shuffle Streaming Service ..... SUCCESS [ 31.532 s]
[INFO] Spark Project Unsafe ........................ SUCCESS [ 33.338 s]
[INFO] Spark Project Launcher ...................... SUCCESS [ 19.204 s]
[INFO] Spark Project Core .......................... SUCCESS [05:24 min]
[INFO] Spark Project ML Local Library .............. SUCCESS [01:20 min]
[INFO] Spark Project GraphX ........................ SUCCESS [01:41 min]
[INFO] Spark Project Streaming ..................... SUCCESS [02:36 min]
[INFO] Spark Project Catalyst ...................... SUCCESS [06:44 min]
[INFO] Spark Project SQL ........................... SUCCESS [07:10 min]
[INFO] Spark Project ML Library .................... SUCCESS [05:48 min]
[INFO] Spark Project Tools ......................... SUCCESS [ 17.132 s]
[INFO] Spark Project Hive .......................... SUCCESS [02:49 min]
[INFO] Spark Project REPL .......................... SUCCESS [ 50.149 s]
[INFO] Spark Project Assembly ...................... SUCCESS [  6.706 s]
[INFO] Kafka 0.10+ Token Provider for Streaming .... SUCCESS [ 44.131 s]
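
To make the idea of an SBOM artifact concrete, here is a minimal sketch that lists the components recorded in a CycloneDX JSON document. The sample document is hypothetical and heavily shortened (a real `*-cyclonedx.json` file carries many more fields); only the top-level `bomFormat` and `components` keys and the per-component `group`, `name`, and `version` keys are relied on, and the helper name `list_components` is an assumption.

```python
import json

# Hypothetical, shortened CycloneDX document used only for illustration.
SAMPLE_BOM = """
{
  "bomFormat": "CycloneDX",
  "specVersion": "1.4",
  "components": [
    {"type": "library", "group": "org.example", "name": "lib-b", "version": "2.0.0"},
    {"type": "library", "group": "org.example", "name": "lib-a", "version": "1.0.0"}
  ]
}
"""


def list_components(bom_text):
    """Return sorted 'group:name:version' strings from a CycloneDX JSON BOM."""
    bom = json.loads(bom_text)
    if bom.get("bomFormat") != "CycloneDX":
        raise ValueError("not a CycloneDX document")
    coordinates = []
    for component in bom.get("components", []):
        # 'group' is optional in CycloneDX, so fall back to an empty string.
        group = component.get("group", "")
        coordinates.append(f"{group}:{component['name']}:{component['version']}")
    return sorted(coordinates)
```

Running `list_components` over the JSON file attached to a jar gives a flat dependency list that can be diffed between builds.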
[spark] branch master updated: [SPARK-41882][CORE][SQL][UI] Add tests for `SQLAppStatusStore` with RocksDB backend and fix some bugs
gengliang pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 545cf87dc72 [SPARK-41882][CORE][SQL][UI] Add tests for `SQLAppStatusStore` with RocksDB backend and fix some bugs
545cf87dc72 is described below

commit 545cf87dc723342fd0f7f1a222c1a94d4b4c91a0
Author: yangjie01
AuthorDate: Thu Jan 5 10:12:41 2023 -0800

[SPARK-41882][CORE][SQL][UI] Add tests for `SQLAppStatusStore` with RocksDB backend and fix some bugs

### What changes were proposed in this pull request?
This PR adds the following new test suites for `SQLAppStatusStore` with the RocksDB backend:

- `SQLAppStatusListenerWithRocksDBBackendSuite`, based on `SQLAppStatusListenerSuite`
- `AllExecutionsPageWithRocksDBBackendSuite`, based on `AllExecutionsPageSuite`

It also fixes bugs in `SQLExecutionUIDataSerializer` and `SparkPlanGraphWrapperSerializer` so that the new tests pass, and adds protection to `SparkPlanGraphWrapperSerializer#serializeSparkPlanGraphNodeWrapper` to avoid throwing an NPE.

### Why are the changes needed?
Add more tests for `SQLAppStatusStore` with the RocksDB backend and fix bugs.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- Pass GitHub Actions
- Add new tests

Closes #39385 from LuciferYang/SPARK-41432-FOLLOWUP.

Lead-authored-by: yangjie01
Co-authored-by: Gengliang Wang
Signed-off-by: Gengliang Wang
---
 .../apache/spark/status/protobuf/store_types.proto |  3 +-
 .../sql/SQLExecutionUIDataSerializer.scala         | 15 ++--
 .../sql/SparkPlanGraphWrapperSerializer.scala      |  5 ++-
 .../sql/execution/ui/AllExecutionsPageSuite.scala  | 41 +
 .../execution/ui/SQLAppStatusListenerSuite.scala   | 43 +-
 .../sql/KVStoreProtobufSerializerSuite.scala       |  4 +-
 6 files changed, 84 insertions(+), 27 deletions(-)

diff --git a/core/src/main/protobuf/org/apache/spark/status/protobuf/store_types.proto b/core/src/main/protobuf/org/apache/spark/status/protobuf/store_types.proto
index 2a45b5da1d8..1c3e5bfc49a 100644
--- a/core/src/main/protobuf/org/apache/spark/status/protobuf/store_types.proto
+++ b/core/src/main/protobuf/org/apache/spark/status/protobuf/store_types.proto
@@ -395,7 +395,8 @@ message SQLExecutionUIData {
   optional string error_message = 9;
   map jobs = 10;
   repeated int64 stages = 11;
-  map metric_values = 12;
+  bool metric_values_is_null = 12;
+  map metric_values = 13;
 }
 
 message SparkPlanGraphNode {
diff --git a/sql/core/src/main/scala/org/apache/spark/status/protobuf/sql/SQLExecutionUIDataSerializer.scala b/sql/core/src/main/scala/org/apache/spark/status/protobuf/sql/SQLExecutionUIDataSerializer.scala
index 7a4a3e2a55d..09cef9663c0 100644
--- a/sql/core/src/main/scala/org/apache/spark/status/protobuf/sql/SQLExecutionUIDataSerializer.scala
+++ b/sql/core/src/main/scala/org/apache/spark/status/protobuf/sql/SQLExecutionUIDataSerializer.scala
@@ -49,7 +49,10 @@ class SQLExecutionUIDataSerializer extends ProtobufSerDe {
     }
     ui.stages.foreach(stageId => builder.addStages(stageId.toLong))
     val metricValues = ui.metricValues
-    if (metricValues != null) {
+    if (metricValues == null) {
+      builder.setMetricValuesIsNull(true)
+    } else {
+      builder.setMetricValuesIsNull(false)
       metricValues.foreach {
         case (k, v) => builder.putMetricValues(k, v)
       }
@@ -67,9 +70,13 @@ class SQLExecutionUIDataSerializer extends ProtobufSerDe {
     val jobs = ui.getJobsMap.asScala.map {
       case (jobId, status) => jobId.toInt -> JobExecutionStatusSerializer.deserialize(status)
     }.toMap
-    val metricValues = ui.getMetricValuesMap.asScala.map {
-      case (k, v) => k.toLong -> v
-    }.toMap
+    val metricValues = if (ui.getMetricValuesIsNull) {
+      null
+    } else {
+      ui.getMetricValuesMap.asScala.map {
+        case (k, v) => k.toLong -> v
+      }.toMap
+    }
 
     new SQLExecutionUIData(
       executionId = ui.getExecutionId,
diff --git a/sql/core/src/main/scala/org/apache/spark/status/protobuf/sql/SparkPlanGraphWrapperSerializer.scala b/sql/core/src/main/scala/org/apache/spark/status/protobuf/sql/SparkPlanGraphWrapperSerializer.scala
index 49debedbb68..a8f715564fc 100644
--- a/sql/core/src/main/scala/org/apache/spark/status/protobuf/sql/SparkPlanGraphWrapperSerializer.scala
+++ b/sql/core/src/main/scala/org/apache/spark/status/protobuf/sql/SparkPlanGraphWrapperSerializer.scala
@@ -53,8 +53,9 @@ class SparkPlanGraphWrapperSerializer extends ProtobufSerDe {
       StoreTypes.SparkPlanGraphNodeWrapper = {
 
     val builder = StoreTypes.SparkPlanGraphNodeWrapper.newBuilder()
-    builder.setNode(serializeSparkPlanGraphNode(input.node))
-
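
The `metric_values_is_null` flag exists because a protobuf map field has no notion of "absent": an unset map and an empty map look the same after a round trip. A minimal sketch of the same idea in plain Python follows, with a dict standing in for the generated message. The function names are hypothetical, and the key and value conversions done by the real serializer are left out.

```python
def serialize_metric_values(metric_values):
    """Encode an optional dict, recording separately whether it was None."""
    message = {
        "metric_values_is_null": metric_values is None,
        "metric_values": {},
    }
    if metric_values is not None:
        # Copy the entries; an empty dict stays an empty dict.
        message["metric_values"] = dict(metric_values)
    return message


def deserialize_metric_values(message):
    """Invert serialize_metric_values, restoring None when flagged."""
    if message["metric_values_is_null"]:
        return None
    return dict(message["metric_values"])
```

Without the flag, `None` and `{}` would both deserialize to `{}`, so a listener that distinguishes "metrics not yet computed" from "no metrics" would see different state after the store is reloaded.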
[spark] branch master updated: [SPARK-41883][BUILD] Upgrade dropwizard metrics 4.2.15
srowen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new eee9428ea76 [SPARK-41883][BUILD] Upgrade dropwizard metrics 4.2.15
eee9428ea76 is described below

commit eee9428ea76f8f5603117d8be58028b11d75ff24
Author: yangjie01
AuthorDate: Thu Jan 5 07:32:00 2023 -0600

[SPARK-41883][BUILD] Upgrade dropwizard metrics 4.2.15

### What changes were proposed in this pull request?
This PR aims to upgrade dropwizard metrics to 4.2.15.

### Why are the changes needed?
The release notes are as follows:
- https://github.com/dropwizard/metrics/releases/tag/v4.2.14
- https://github.com/dropwizard/metrics/releases/tag/v4.2.15

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

Closes #39391 from LuciferYang/SPARK-41883.

Authored-by: yangjie01
Signed-off-by: Sean Owen
---
 dev/deps/spark-deps-hadoop-2-hive-2.3 | 10 +-
 dev/deps/spark-deps-hadoop-3-hive-2.3 | 10 +-
 pom.xml                               |  2 +-
 3 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/dev/deps/spark-deps-hadoop-2-hive-2.3 b/dev/deps/spark-deps-hadoop-2-hive-2.3
index e1141fbc558..a1fd06003bb 100644
--- a/dev/deps/spark-deps-hadoop-2-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-2-hive-2.3
@@ -194,11 +194,11 @@ log4j-slf4j2-impl/2.19.0//log4j-slf4j2-impl-2.19.0.jar
 logging-interceptor/3.12.12//logging-interceptor-3.12.12.jar
 lz4-java/1.8.0//lz4-java-1.8.0.jar
 mesos/1.4.3/shaded-protobuf/mesos-1.4.3-shaded-protobuf.jar
-metrics-core/4.2.13//metrics-core-4.2.13.jar
-metrics-graphite/4.2.13//metrics-graphite-4.2.13.jar
-metrics-jmx/4.2.13//metrics-jmx-4.2.13.jar
-metrics-json/4.2.13//metrics-json-4.2.13.jar
-metrics-jvm/4.2.13//metrics-jvm-4.2.13.jar
+metrics-core/4.2.15//metrics-core-4.2.15.jar
+metrics-graphite/4.2.15//metrics-graphite-4.2.15.jar
+metrics-jmx/4.2.15//metrics-jmx-4.2.15.jar
+metrics-json/4.2.15//metrics-json-4.2.15.jar
+metrics-jvm/4.2.15//metrics-jvm-4.2.15.jar
 minlog/1.3.0//minlog-1.3.0.jar
 netty-all/4.1.86.Final//netty-all-4.1.86.Final.jar
 netty-buffer/4.1.86.Final//netty-buffer-4.1.86.Final.jar
diff --git a/dev/deps/spark-deps-hadoop-3-hive-2.3 b/dev/deps/spark-deps-hadoop-3-hive-2.3
index d4157917e43..adf9ec9452b 100644
--- a/dev/deps/spark-deps-hadoop-3-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-3-hive-2.3
@@ -178,11 +178,11 @@ log4j-slf4j2-impl/2.19.0//log4j-slf4j2-impl-2.19.0.jar
 logging-interceptor/3.12.12//logging-interceptor-3.12.12.jar
 lz4-java/1.8.0//lz4-java-1.8.0.jar
 mesos/1.4.3/shaded-protobuf/mesos-1.4.3-shaded-protobuf.jar
-metrics-core/4.2.13//metrics-core-4.2.13.jar
-metrics-graphite/4.2.13//metrics-graphite-4.2.13.jar
-metrics-jmx/4.2.13//metrics-jmx-4.2.13.jar
-metrics-json/4.2.13//metrics-json-4.2.13.jar
-metrics-jvm/4.2.13//metrics-jvm-4.2.13.jar
+metrics-core/4.2.15//metrics-core-4.2.15.jar
+metrics-graphite/4.2.15//metrics-graphite-4.2.15.jar
+metrics-jmx/4.2.15//metrics-jmx-4.2.15.jar
+metrics-json/4.2.15//metrics-json-4.2.15.jar
+metrics-jvm/4.2.15//metrics-jvm-4.2.15.jar
 minlog/1.3.0//minlog-1.3.0.jar
 netty-all/4.1.86.Final//netty-all-4.1.86.Final.jar
 netty-buffer/4.1.86.Final//netty-buffer-4.1.86.Final.jar
diff --git a/pom.xml b/pom.xml
index 20a334e8c4b..e2ae0631f80 100644
--- a/pom.xml
+++ b/pom.xml
@@ -152,7 +152,7 @@
     If you changes codahale.metrics.version, you also need to change
     the link to metrics.dropwizard.io in docs/monitoring.md.
     -->
-    4.2.13
+    4.2.15
     1.11.1
     1.12.0
[spark] branch master updated (737eecded4d -> 3dc881afcfc)
wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

    from 737eecded4d [SPARK-41162][SQL] Fix anti- and semi-join for self-join with aggregations
     add 3dc881afcfc [SPARK-41791] Add new file source metadata column types

No new revisions were added by this update.

Summary of changes:
 .../catalyst/expressions/namedExpressions.scala    | 95 +++---
 .../org/apache/spark/sql/types/StructField.scala   |  4 +
 .../org/apache/spark/sql/types/StructType.scala    |  3 +-
 .../spark/sql/execution/DataSourceScanExec.scala   | 34
 .../sql/execution/datasources/FileFormat.scala     |  9 --
 .../execution/datasources/FileSourceStrategy.scala | 68 +---
 .../datasources/PartitioningAwareFileIndex.scala   |  3 +-
 .../datasources/FileMetadataStructSuite.scala      | 10 +--
 8 files changed, 145 insertions(+), 81 deletions(-)
[spark] branch master updated: [SPARK-41162][SQL] Fix anti- and semi-join for self-join with aggregations
wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 737eecded4d [SPARK-41162][SQL] Fix anti- and semi-join for self-join with aggregations
737eecded4d is described below

commit 737eecded4dc2a828c978147a396f8808b09566f
Author: Enrico Minack
AuthorDate: Thu Jan 5 18:55:11 2023 +0800

[SPARK-41162][SQL] Fix anti- and semi-join for self-join with aggregations

### What changes were proposed in this pull request?
Rule `PushDownLeftSemiAntiJoin` should not push an anti-join below an `Aggregate` when the join condition references an attribute that exists in its right plan and its left plan's child. This usually happens when the anti-join / semi-join is a self-join while `DeduplicateRelations` cannot deduplicate those attributes (in this example due to the projection of `value` to `id`).

This behaviour already exists for `Project` and `Union`, but `Aggregate` lacks this safety guard.

### Why are the changes needed?
Without this change, the optimizer creates an incorrect plan.

This example fails with `distinct()` (an aggregation) and succeeds without `distinct()`, although both queries are semantically identical:
```scala
val ids = Seq(1, 2, 3).toDF("id").distinct()
val result = ids.withColumn("id", $"id" + 1).join(ids, Seq("id"), "left_anti").collect()
assert(result.length == 1)
```
With `distinct()`, rule `PushDownLeftSemiAntiJoin` creates a join condition `(value#907 + 1) = value#907`, which can never be true. This effectively removes the anti-join.

**Before this PR:**
The anti-join is fully removed from the plan.
```
== Physical Plan ==
AdaptiveSparkPlan (16)
+- == Final Plan ==
   LocalTableScan (1)

(16) AdaptiveSparkPlan
Output [1]: [id#900]
Arguments: isFinalPlan=true
```

This is caused by `PushDownLeftSemiAntiJoin` adding join condition `(value#907 + 1) = value#907`, which is wrong because `id#910` in `(id#910 + 1) AS id#912` exists in the right child of the join as well as in the left grandchild:
```
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.PushDownLeftSemiAntiJoin ===
!Join LeftAnti, (id#912 = id#910)                   Aggregate [id#910], [(id#910 + 1) AS id#912]
!:- Aggregate [id#910], [(id#910 + 1) AS id#912]    +- Project [value#907 AS id#910]
!:  +- Project [value#907 AS id#910]                   +- Join LeftAnti, ((value#907 + 1) = value#907)
!:     +- LocalRelation [value#907]                       :- LocalRelation [value#907]
!+- Aggregate [id#910], [id#910]                          +- Aggregate [id#910], [id#910]
!   +- Project [value#914 AS id#910]                         +- Project [value#914 AS id#910]
!      +- LocalRelation [value#914]                             +- LocalRelation [value#914]
```

The right child of the join and the left grandchild would become the children of the pushed-down join, which creates an invalid join condition.

**After this PR:**
Join condition `(id#910 + 1) AS id#912` is understood to become ambiguous, as both sides of the prospective join contain `id#910`. Hence, the join is not pushed down and the rule is no longer applied.

The final plan contains the anti-join:
```
== Physical Plan ==
AdaptiveSparkPlan (24)
+- == Final Plan ==
   * BroadcastHashJoin LeftSemi BuildRight (14)
   :- * HashAggregate (7)
   :  +- AQEShuffleRead (6)
   :     +- ShuffleQueryStage (5), Statistics(sizeInBytes=48.0 B, rowCount=3)
   :        +- Exchange (4)
   :           +- * HashAggregate (3)
   :              +- * Project (2)
   :                 +- * LocalTableScan (1)
   +- BroadcastQueryStage (13), Statistics(sizeInBytes=1024.0 KiB, rowCount=3)
      +- BroadcastExchange (12)
         +- * HashAggregate (11)
            +- AQEShuffleRead (10)
               +- ShuffleQueryStage (9), Statistics(sizeInBytes=48.0 B, rowCount=3)
                  +- ReusedExchange (8)

(8) ReusedExchange [Reuses operator id: 4]
Output [1]: [id#898]

(24) AdaptiveSparkPlan
Output [1]: [id#900]
Arguments: isFinalPlan=true
```

### Does this PR introduce _any_ user-facing change?
It fixes correctness.

### How was this patch tested?
Unit tests in `DataFrameJoinSuite` and `LeftSemiAntiJoinPushDownSuite`.

Closes #39131 from EnricoMi/branch-antijoin-selfjoin-fix.

Authored-by: Enrico Minack
Signed-off-by: Wenchen Fan
---
 .../optimizer/PushDownLeftSemiAntiJoin.scala       | 13 ++---
 .../optimizer/LeftSemiAntiJoinPushDownSuite.scala  | 57 ++
 .../org/apache/spark/sql/DataFrameJoinSuite.scala  | 18 +++
 3 files
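
The effect of the wrongly pushed-down condition can be reproduced with plain Python lists, without Spark. The sketch below is an illustration only: `left_anti_join` and the two plan functions are hypothetical helpers that model the logical plans shown above, not optimizer code.

```python
def left_anti_join(left, right, condition):
    """Keep the left rows for which no right row satisfies the condition."""
    return [l for l in left if not any(condition(l, r) for r in right)]


VALUES = [1, 2, 3]  # models LocalRelation [value]


def expected_plan():
    """Aggregate first, then anti-join on id = id (the correct plan)."""
    ids = sorted(set(VALUES))        # distinct()
    shifted = [v + 1 for v in ids]   # withColumn("id", id + 1)
    return left_anti_join(shifted, ids, lambda l, r: l == r)


def wrongly_pushed_down_plan():
    """Anti-join pushed below the aggregate with the ambiguous condition."""
    # Both sides of the condition resolve to the same left attribute,
    # so it reads (value + 1) = value and never matches any row.
    kept = left_anti_join(VALUES, VALUES, lambda l, r: l + 1 == l)
    return sorted({v + 1 for v in kept})  # aggregate with id + 1
```

`expected_plan()` keeps only the row `4`, which matches `assert(result.length == 1)` in the Scala example, while `wrongly_pushed_down_plan()` keeps all three shifted rows because the anti-join filters nothing.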
[spark] branch master updated: [SPARK-41842][CONNECT][PYTHON][TESTS] Enable doctests for time functions
gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 272634b4240 [SPARK-41842][CONNECT][PYTHON][TESTS] Enable doctests for time functions
272634b4240 is described below

commit 272634b42406aa60d3a6fb818427aa3078ec9c00
Author: Ruifeng Zheng
AuthorDate: Thu Jan 5 19:20:35 2023 +0900

[SPARK-41842][CONNECT][PYTHON][TESTS] Enable doctests for time functions

### What changes were proposed in this pull request?
Enable doctests for time functions

### Why are the changes needed?
for test coverage

### Does this PR introduce _any_ user-facing change?
no, test-only

### How was this patch tested?
enable tests

Closes #39407 from zhengruifeng/connect_fix_41842.

Authored-by: Ruifeng Zheng
Signed-off-by: Hyukjin Kwon
---
 python/pyspark/sql/connect/functions.py | 7 ---
 1 file changed, 7 deletions(-)

diff --git a/python/pyspark/sql/connect/functions.py b/python/pyspark/sql/connect/functions.py
index f2603d477cb..d9665cd1a8e 100644
--- a/python/pyspark/sql/connect/functions.py
+++ b/python/pyspark/sql/connect/functions.py
@@ -2368,13 +2368,6 @@ def _test() -> None:
     # TODO(SPARK-41757): Fix String representation for Column class
     del pyspark.sql.connect.functions.col.__doc__
 
-    # TODO(SPARK-41842): support data type: Timestamp(NANOSECOND, null)
-    del pyspark.sql.connect.functions.hour.__doc__
-    del pyspark.sql.connect.functions.minute.__doc__
-    del pyspark.sql.connect.functions.second.__doc__
-    del pyspark.sql.connect.functions.window.__doc__
-    del pyspark.sql.connect.functions.window_time.__doc__
-
     # TODO(SPARK-41838): fix dataset.show
     del pyspark.sql.connect.functions.posexplode_outer.__doc__
     del pyspark.sql.connect.functions.explode_outer.__doc__
[spark] branch master updated: [SPARK-41830][CONNECT][PYTHON] Make `DataFrame.sample` accept the same parameters as PySpark
gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 44d165113dd [SPARK-41830][CONNECT][PYTHON] Make `DataFrame.sample` accept the same parameters as PySpark
44d165113dd is described below

commit 44d165113ddce621f0090d89624bcff554ae49bb
Author: Ruifeng Zheng
AuthorDate: Thu Jan 5 19:19:00 2023 +0900

[SPARK-41830][CONNECT][PYTHON] Make `DataFrame.sample` accept the same parameters as PySpark

### What changes were proposed in this pull request?
Make `DataFrame.sample` accept the same parameters as PySpark.

### Why are the changes needed?
For consistency

### Does this PR introduce _any_ user-facing change?
yes

### How was this patch tested?
enabled doctests

Closes #39403 from zhengruifeng/connect_fix_41830.

Authored-by: Ruifeng Zheng
Signed-off-by: Hyukjin Kwon
---
 python/pyspark/sql/connect/dataframe.py | 55 -
 1 file changed, 41 insertions(+), 14 deletions(-)

diff --git a/python/pyspark/sql/connect/dataframe.py b/python/pyspark/sql/connect/dataframe.py
index 13a421ca72a..639e3faa748 100644
--- a/python/pyspark/sql/connect/dataframe.py
+++ b/python/pyspark/sql/connect/dataframe.py
@@ -405,26 +405,56 @@ class DataFrame:
 
     def sample(
         self,
-        fraction: float,
-        *,
-        withReplacement: bool = False,
+        withReplacement: Optional[Union[float, bool]] = None,
+        fraction: Optional[Union[int, float]] = None,
         seed: Optional[int] = None,
     ) -> "DataFrame":
-        if not isinstance(fraction, float):
-            raise TypeError(f"'fraction' must be float, but got {type(fraction).__name__}")
-        if not isinstance(withReplacement, bool):
+
+        # For the cases below:
+        #   sample(True, 0.5 [, seed])
+        #   sample(True, fraction=0.5 [, seed])
+        #   sample(withReplacement=False, fraction=0.5 [, seed])
+        is_withReplacement_set = type(withReplacement) == bool and isinstance(fraction, float)
+
+        # For the case below:
+        #   sample(faction=0.5 [, seed])
+        is_withReplacement_omitted_kwargs = withReplacement is None and isinstance(fraction, float)
+
+        # For the case below:
+        #   sample(0.5 [, seed])
+        is_withReplacement_omitted_args = isinstance(withReplacement, float)
+
+        if not (
+            is_withReplacement_set
+            or is_withReplacement_omitted_kwargs
+            or is_withReplacement_omitted_args
+        ):
+            argtypes = [
+                str(type(arg)) for arg in [withReplacement, fraction, seed] if arg is not None
+            ]
             raise TypeError(
-                f"'withReplacement' must be bool, but got {type(withReplacement).__name__}"
+                "withReplacement (optional), fraction (required) and seed (optional)"
+                " should be a bool, float and number; however, "
+                "got [%s]." % ", ".join(argtypes)
             )
-        if seed is not None and not isinstance(seed, int):
-            raise TypeError(f"'seed' must be None or int, but got {type(seed).__name__}")
+
+        if is_withReplacement_omitted_args:
+            if fraction is not None:
+                seed = cast(int, fraction)
+            fraction = withReplacement
+            withReplacement = None
+
+        if withReplacement is None:
+            withReplacement = False
+
+        seed = int(seed) if seed is not None else None
 
         return DataFrame.withPlan(
             plan.Sample(
                 child=self._plan,
                 lower_bound=0.0,
-                upper_bound=fraction,
-                with_replacement=withReplacement,
+                upper_bound=fraction,  # type: ignore[arg-type]
+                with_replacement=withReplacement,  # type: ignore[arg-type]
                 seed=seed,
             ),
             session=self._session,
@@ -1485,9 +1515,6 @@ def _test() -> None:
     # TODO(SPARK-41827): groupBy requires all cols be Column or str
     del pyspark.sql.connect.dataframe.DataFrame.groupBy.__doc__
 
-    # TODO(SPARK-41830): fix sample parameters
-    del pyspark.sql.connect.dataframe.DataFrame.sample.__doc__
-
     # TODO(SPARK-41831): fix transform to accept ColumnReference
     del pyspark.sql.connect.dataframe.DataFrame.transform.__doc__
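
The argument handling in the diff above can be exercised on its own. The sketch below lifts that logic into a free function so the accepted call shapes are easy to try; `normalize_sample_args` is a hypothetical name, and the function returns the normalized triple instead of building a `plan.Sample`.

```python
def normalize_sample_args(withReplacement=None, fraction=None, seed=None):
    """Return (withReplacement, fraction, seed) after positional shuffling."""
    # sample(True, 0.5 [, seed]) and keyword variants of the same call.
    is_set = type(withReplacement) is bool and isinstance(fraction, float)
    # sample(fraction=0.5 [, seed])
    is_omitted_kwargs = withReplacement is None and isinstance(fraction, float)
    # sample(0.5 [, seed]): the fraction arrives in the first position.
    is_omitted_args = isinstance(withReplacement, float)

    if not (is_set or is_omitted_kwargs or is_omitted_args):
        argtypes = [
            str(type(arg)) for arg in [withReplacement, fraction, seed] if arg is not None
        ]
        raise TypeError(
            "withReplacement (optional), fraction (required) and seed (optional)"
            " should be a bool, float and number; however, "
            "got [%s]." % ", ".join(argtypes)
        )

    if is_omitted_args:
        # Shift the positional arguments one slot to the right.
        if fraction is not None:
            seed = fraction
        fraction = withReplacement
        withReplacement = None

    if withReplacement is None:
        withReplacement = False

    seed = int(seed) if seed is not None else None
    return withReplacement, fraction, seed
```

For instance, `normalize_sample_args(0.5, 3)` treats `0.5` as the fraction and `3` as the seed, while an integer fraction such as `normalize_sample_args(fraction=1)` is rejected with a `TypeError`.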
[spark] branch master updated: [SPARK-41790][SQL] Set TRANSFORM reader and writer's format correctly
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new ede8c52de88 [SPARK-41790][SQL] Set TRANSFORM reader and writer's format correctly ede8c52de88 is described below commit ede8c52de8878cbcd098284d5c632ea8fa4ebf67 Author: maming AuthorDate: Thu Jan 5 17:52:19 2023 +0900 [SPARK-41790][SQL] Set TRANSFORM reader and writer's format correctly ### What changes were proposed in this pull request? We'll get wrong data when transform only specify reader or writer 's row format delimited, the reason is using the wrong format to feed/fetch data to/from running script now. we should set the format correctly. Currently in Spark: ```sql spark-sql> CREATE TABLE t1 (a string, b string); spark-sql> INSERT OVERWRITE t1 VALUES("1", "2"), ("3", "4"); spark-sql> SELECT TRANSFORM(a, b) > ROW FORMAT DELIMITED > FIELDS TERMINATED BY ',' > USING 'cat' > AS (c) > FROM t1; c spark-sql> SELECT TRANSFORM(a, b) > USING 'cat' > AS (c) > ROW FORMAT DELIMITED > FIELDS TERMINATED BY ',' > FROM t1; c 1234 ``` The same sql in hive: ```sql hive> SELECT TRANSFORM(a, b) > ROW FORMAT DELIMITED > FIELDS TERMINATED BY ',' > USING 'cat' > AS (c) > FROM t1; c 1,2 3,4 hive> SELECT TRANSFORM(a, b) > USING 'cat' > AS (c) > ROW FORMAT DELIMITED > FIELDS TERMINATED BY ',' > FROM t1; c 12 34 ``` ### Why are the changes needed? Fix transform writer format and reader format. ### Does this PR introduce _any_ user-facing change? When we set transform's row format delimited in the sql, we may get the wrong data. ### How was this patch tested? New tests. Closes #39315 from mattshma/SPARK-41790. 
Lead-authored-by: maming
Co-authored-by: mattshma
Signed-off-by: Hyukjin Kwon
---
 .../apache/spark/sql/execution/SparkSqlParser.scala  | 15 +++++++++------
 .../execution/HiveScriptTransformationSuite.scala    | 20 ++++++++++++++++++++
 2 files changed, 29 insertions(+), 6 deletions(-)

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala
index ad0599775de..e67ffa606ef 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala
@@ -751,15 +751,18 @@ class SparkSqlAstBuilder extends AstBuilder {
         (Nil, Option(name), props, recordHandler)
     }
 
-    val (inFormat, inSerdeClass, inSerdeProps, reader) =
+    // The Writer uses inFormat to feed input data into the running script and
+    // the reader uses outFormat to read the output from the running script,
+    // this behavior is same with hive.
+    val (inFormat, inSerdeClass, inSerdeProps, writer) =
       format(
-        inRowFormat, "hive.script.recordreader",
-        "org.apache.hadoop.hive.ql.exec.TextRecordReader")
+        inRowFormat, "hive.script.recordwriter",
+        "org.apache.hadoop.hive.ql.exec.TextRecordWriter")
 
-    val (outFormat, outSerdeClass, outSerdeProps, writer) =
+    val (outFormat, outSerdeClass, outSerdeProps, reader) =
       format(
-        outRowFormat, "hive.script.recordwriter",
-        "org.apache.hadoop.hive.ql.exec.TextRecordWriter")
+        outRowFormat, "hive.script.recordreader",
+        "org.apache.hadoop.hive.ql.exec.TextRecordReader")
 
     ScriptInputOutputSchema(
       inFormat, outFormat,
diff --git a/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveScriptTransformationSuite.scala b/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveScriptTransformationSuite.scala
index ad4a311528a..4e8a62acddd 100644
--- a/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveScriptTransformationSuite.scala
+++ b/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveScriptTransformationSuite.scala
@@ -638,4 +638,24 @@ class HiveScriptTransformationSuite extends BaseScriptTransformationSuite with T
         Row("1") :: Row("2") :: Row("3") :: Nil)
     }
   }
+
+  test("SPARK-41790: Set TRANSFORM reader and writer's format correctly") {
+    withTempView("v") {
+      val df = Seq(
+        (1, 2)
+      ).toDF("a", "b")
+      df.createTempView("v")
+
+      checkAnswer(
+        sql(
+          s"""
+             |SELECT TRANSFORM(a, b)
+             |  ROW FORMAT DELIMITED
+             |