(spark) branch branch-3.5 updated: [SPARK-46953][TEST] Wrap withTable for a test in ResolveDefaultColumnsSuite
This is an automated email from the ASF dual-hosted git repository.

yao pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.5 by this push:
     new 4b33d2874fb3 [SPARK-46953][TEST] Wrap withTable for a test in ResolveDefaultColumnsSuite

4b33d2874fb3 is described below

commit 4b33d2874fb3d73a4a35155b9d8b515518153321
Author: Kent Yao
AuthorDate: Fri Feb 2 15:17:43 2024 +0800

    [SPARK-46953][TEST] Wrap withTable for a test in ResolveDefaultColumnsSuite

    ### What changes were proposed in this pull request?

    The table is not cleaned up after this test, so test retries or upcoming new tests that reuse 't' as the table name will fail with a TableAlreadyExistsException (TAEE).

    ### Why are the changes needed?

    Fix tests, as a follow-up of SPARK-43742.

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    This test itself.

    ### Was this patch authored or co-authored using generative AI tooling?

    No.

    Closes #44993 from yaooqinn/SPARK-43742.
Authored-by: Kent Yao Signed-off-by: Kent Yao (cherry picked from commit a3432428e760fc16610cfe3380d3bdea7654f75d) Signed-off-by: Kent Yao --- .../spark/sql/ResolveDefaultColumnsSuite.scala | 104 +++-- 1 file changed, 53 insertions(+), 51 deletions(-) diff --git a/sql/core/src/test/scala/org/apache/spark/sql/ResolveDefaultColumnsSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/ResolveDefaultColumnsSuite.scala index 29b2796d25aa..00529559a485 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/ResolveDefaultColumnsSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/ResolveDefaultColumnsSuite.scala @@ -76,57 +76,59 @@ class ResolveDefaultColumnsSuite extends QueryTest with SharedSparkSession { } test("INSERT into partitioned tables") { -sql("create table t(c1 int, c2 int, c3 int, c4 int) using parquet partitioned by (c3, c4)") - -// INSERT without static partitions -checkError( - exception = intercept[AnalysisException] { -sql("insert into t values (1, 2, 3)") - }, - errorClass = "INSERT_COLUMN_ARITY_MISMATCH.NOT_ENOUGH_DATA_COLUMNS", - parameters = Map( -"tableName" -> "`spark_catalog`.`default`.`t`", -"tableColumns" -> "`c1`, `c2`, `c3`, `c4`", -"dataColumns" -> "`col1`, `col2`, `col3`")) - -// INSERT without static partitions but with column list -sql("truncate table t") -sql("insert into t (c2, c1, c4) values (1, 2, 3)") -checkAnswer(spark.table("t"), Row(2, 1, null, 3)) - -// INSERT with static partitions -sql("truncate table t") -checkError( - exception = intercept[AnalysisException] { -sql("insert into t partition(c3=3, c4=4) values (1)") - }, - errorClass = "INSERT_PARTITION_COLUMN_ARITY_MISMATCH", - parameters = Map( -"tableName" -> "`spark_catalog`.`default`.`t`", -"tableColumns" -> "`c1`, `c2`, `c3`, `c4`", -"dataColumns" -> "`col1`", -"staticPartCols" -> "`c3`, `c4`")) - -// INSERT with static partitions and with column list -sql("truncate table t") -sql("insert into t partition(c3=3, c4=4) (c2) values (1)") 
-checkAnswer(spark.table("t"), Row(null, 1, 3, 4)) - -// INSERT with partial static partitions -sql("truncate table t") -checkError( - exception = intercept[AnalysisException] { -sql("insert into t partition(c3=3, c4) values (1, 2)") - }, - errorClass = "INSERT_PARTITION_COLUMN_ARITY_MISMATCH", - parameters = Map( -"tableName" -> "`spark_catalog`.`default`.`t`", -"tableColumns" -> "`c1`, `c2`, `c3`, `c4`", -"dataColumns" -> "`col1`, `col2`", -"staticPartCols" -> "`c3`")) - -// INSERT with partial static partitions and with column list is not allowed -intercept[AnalysisException](sql("insert into t partition(c3=3, c4) (c1) values (1, 4)")) +withTable("t") { + sql("create table t(c1 int, c2 int, c3 int, c4 int) using parquet partitioned by (c3, c4)") + + // INSERT without static partitions + checkError( +exception = intercept[AnalysisException] { + sql("insert into t values (1, 2, 3)") +}, +errorClass = "INSERT_COLUMN_ARITY_MISMATCH.NOT_ENOUGH_DATA_COLUMNS", +parameters = Map( + "tableName" -> "`spark_catalog`.`default`.`t`", + "tableColumns" -> "`c1`, `c2`, `c3`, `c4`", + "dataColumns" -> "`col1`, `col2`, `col3`")) + + // INSERT without static partitions but with column list + sql("truncate table t") + sql("insert into t (c2, c1, c4) values (1, 2, 3)") + checkAnswer(spark.table("t"), Row(2, 1, null, 3)) + + // INSERT with static partitions + sql("truncate table t") + checkError( +exception = intercept[AnalysisException] { +
(spark) branch master updated: [SPARK-46953][TEST] Wrap withTable for a test in ResolveDefaultColumnsSuite
This is an automated email from the ASF dual-hosted git repository.

yao pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new a3432428e760 [SPARK-46953][TEST] Wrap withTable for a test in ResolveDefaultColumnsSuite

a3432428e760 is described below

commit a3432428e760fc16610cfe3380d3bdea7654f75d
Author: Kent Yao
AuthorDate: Fri Feb 2 15:17:43 2024 +0800

    [SPARK-46953][TEST] Wrap withTable for a test in ResolveDefaultColumnsSuite

    ### What changes were proposed in this pull request?

    The table is not cleaned up after this test, so test retries or upcoming new tests that reuse 't' as the table name will fail with a TableAlreadyExistsException (TAEE).

    ### Why are the changes needed?

    Fix tests, as a follow-up of SPARK-43742.

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    This test itself.

    ### Was this patch authored or co-authored using generative AI tooling?

    No.

    Closes #44993 from yaooqinn/SPARK-43742.
Authored-by: Kent Yao Signed-off-by: Kent Yao --- .../spark/sql/ResolveDefaultColumnsSuite.scala | 104 +++-- 1 file changed, 53 insertions(+), 51 deletions(-) diff --git a/sql/core/src/test/scala/org/apache/spark/sql/ResolveDefaultColumnsSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/ResolveDefaultColumnsSuite.scala index 29b2796d25aa..00529559a485 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/ResolveDefaultColumnsSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/ResolveDefaultColumnsSuite.scala @@ -76,57 +76,59 @@ class ResolveDefaultColumnsSuite extends QueryTest with SharedSparkSession { } test("INSERT into partitioned tables") { -sql("create table t(c1 int, c2 int, c3 int, c4 int) using parquet partitioned by (c3, c4)") - -// INSERT without static partitions -checkError( - exception = intercept[AnalysisException] { -sql("insert into t values (1, 2, 3)") - }, - errorClass = "INSERT_COLUMN_ARITY_MISMATCH.NOT_ENOUGH_DATA_COLUMNS", - parameters = Map( -"tableName" -> "`spark_catalog`.`default`.`t`", -"tableColumns" -> "`c1`, `c2`, `c3`, `c4`", -"dataColumns" -> "`col1`, `col2`, `col3`")) - -// INSERT without static partitions but with column list -sql("truncate table t") -sql("insert into t (c2, c1, c4) values (1, 2, 3)") -checkAnswer(spark.table("t"), Row(2, 1, null, 3)) - -// INSERT with static partitions -sql("truncate table t") -checkError( - exception = intercept[AnalysisException] { -sql("insert into t partition(c3=3, c4=4) values (1)") - }, - errorClass = "INSERT_PARTITION_COLUMN_ARITY_MISMATCH", - parameters = Map( -"tableName" -> "`spark_catalog`.`default`.`t`", -"tableColumns" -> "`c1`, `c2`, `c3`, `c4`", -"dataColumns" -> "`col1`", -"staticPartCols" -> "`c3`, `c4`")) - -// INSERT with static partitions and with column list -sql("truncate table t") -sql("insert into t partition(c3=3, c4=4) (c2) values (1)") -checkAnswer(spark.table("t"), Row(null, 1, 3, 4)) - -// INSERT with partial static partitions -sql("truncate 
table t") -checkError( - exception = intercept[AnalysisException] { -sql("insert into t partition(c3=3, c4) values (1, 2)") - }, - errorClass = "INSERT_PARTITION_COLUMN_ARITY_MISMATCH", - parameters = Map( -"tableName" -> "`spark_catalog`.`default`.`t`", -"tableColumns" -> "`c1`, `c2`, `c3`, `c4`", -"dataColumns" -> "`col1`, `col2`", -"staticPartCols" -> "`c3`")) - -// INSERT with partial static partitions and with column list is not allowed -intercept[AnalysisException](sql("insert into t partition(c3=3, c4) (c1) values (1, 4)")) +withTable("t") { + sql("create table t(c1 int, c2 int, c3 int, c4 int) using parquet partitioned by (c3, c4)") + + // INSERT without static partitions + checkError( +exception = intercept[AnalysisException] { + sql("insert into t values (1, 2, 3)") +}, +errorClass = "INSERT_COLUMN_ARITY_MISMATCH.NOT_ENOUGH_DATA_COLUMNS", +parameters = Map( + "tableName" -> "`spark_catalog`.`default`.`t`", + "tableColumns" -> "`c1`, `c2`, `c3`, `c4`", + "dataColumns" -> "`col1`, `col2`, `col3`")) + + // INSERT without static partitions but with column list + sql("truncate table t") + sql("insert into t (c2, c1, c4) values (1, 2, 3)") + checkAnswer(spark.table("t"), Row(2, 1, null, 3)) + + // INSERT with static partitions + sql("truncate table t") + checkError( +exception = intercept[AnalysisException] { + sql("insert into t partition(c3=3, c4=4) values (1)") +}, +errorClass =
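For context, `withTable` is the suite's cleanup helper: it runs the test body and drops the listed tables afterwards, so a retry can recreate `t` even when an assertion fails mid-test. A rough pure-Python analogue of that guarantee (toy in-memory catalog, not Spark APIs):

```python
from contextlib import contextmanager

# Toy stand-in for a catalog of table names.
catalog = set()

def create_table(name):
    if name in catalog:
        raise RuntimeError(f"TableAlreadyExistsException: {name}")
    catalog.add(name)

def drop_table(name):
    catalog.discard(name)

@contextmanager
def with_table(*names):
    # Mirrors withTable: run the body, then drop the tables
    # even if the body raised.
    try:
        yield
    finally:
        for name in names:
            drop_table(name)

# First run: the test body fails after creating 't' ...
try:
    with with_table("t"):
        create_table("t")
        raise AssertionError("test failure")
except AssertionError:
    pass

# ... but 't' was still dropped, so a retry can recreate it without a
# table-already-exists error.
with with_table("t"):
    create_table("t")
```

Without the `finally`-style cleanup, the second `create_table("t")` would raise, which is exactly the flakiness the commit fixes.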
(spark) branch master updated: [SPARK-46952][SQL] XML: Limit size of corrupt record
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 2e07662e24e2 [SPARK-46952][SQL] XML: Limit size of corrupt record

2e07662e24e2 is described below

commit 2e07662e24e243e7d1760ea063c9e88417bc873f
Author: Sandip Agarwala <131817656+sandip...@users.noreply.github.com>
AuthorDate: Fri Feb 2 15:45:09 2024 +0900

    [SPARK-46952][SQL] XML: Limit size of corrupt record

    ### What changes were proposed in this pull request?

    Limit the size of the malformed XML string that gets stored in the corrupt column.

    ### Why are the changes needed?

    A corrupt XML record can be arbitrarily large and may cause an OOM.

    ### Does this PR introduce _any_ user-facing change?

    Yes.

    ### How was this patch tested?

    Unit test.

    ### Was this patch authored or co-authored using generative AI tooling?

    No.

    Closes #44994 from sandip-db/xml_limit_corrupt_record_size.
Authored-by: Sandip Agarwala <131817656+sandip...@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon
---
 .../main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala
index 2458d1772dab..674d5f63b039 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala
@@ -145,7 +145,7 @@ class StaxXmlParser(
   def doParseColumn(xml: String, parseMode: ParseMode, xsdSchema: Option[Schema]): Option[InternalRow] = {
-    val xmlRecord = UTF8String.fromString(xml)
+    lazy val xmlRecord = UTF8String.fromString(xml)
     try {
       xsdSchema.foreach { schema =>
         schema.newValidator().validate(new StreamSource(new StringReader(xml)))

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
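The fix is just `val` → `lazy val`, but it matters: with an eager `val`, every record paid for a `UTF8String` copy of the whole XML string even when parsing succeeded, while with `lazy val` the copy happens only if the corrupt-record path actually reads it. A pure-Python sketch of the same deferral (illustrative names, not the Spark code):

```python
conversions = 0

def expensive_copy(xml):
    # Stand-in for UTF8String.fromString: copies the whole record.
    global conversions
    conversions += 1
    return xml.encode("utf-8")

def parse_column(xml, malformed=False):
    # 'lazy val xmlRecord' ~ hide the copy behind a cached thunk.
    cache = []
    def xml_record():
        if not cache:
            cache.append(expensive_copy(xml))
        return cache[0]

    if malformed:
        # Only the corrupt-record path forces the conversion.
        return ("corrupt", xml_record())
    return ("parsed", None)

parse_column("<a>1</a>")              # happy path: no copy made
parse_column("<a>1", malformed=True)  # corrupt path: copy made once
```

After both calls, only one conversion has happened — the well-formed record never materialized the extra copy.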
(spark) branch master updated: [SPARK-46683][SQL][TESTS][FOLLOW-UP] Fix typo, use queries in partition set
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 58c9b5ac8ff4 [SPARK-46683][SQL][TESTS][FOLLOW-UP] Fix typo, use queries in partition set

58c9b5ac8ff4 is described below

commit 58c9b5ac8ff4aec1ee5b2ae7e0d5702df2ad273c
Author: Andy Lam
AuthorDate: Fri Feb 2 12:23:19 2024 +0900

    [SPARK-46683][SQL][TESTS][FOLLOW-UP] Fix typo, use queries in partition set

    ### What changes were proposed in this pull request?

    Fix a typo in GeneratedSubquerySuite, where it was using the set of ALL queries instead of the partitioned set.

    ### Why are the changes needed?

    The partitioned set will correspond to the test name.

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    N/A.

    ### Was this patch authored or co-authored using generative AI tooling?

    No.

    Closes #44956 from andylam-db/generated-subqueries-fix.

Authored-by: Andy Lam
Signed-off-by: Hyukjin Kwon
---
 .../org/apache/spark/sql/jdbc/querytest/GeneratedSubquerySuite.scala | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/querytest/GeneratedSubquerySuite.scala b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/querytest/GeneratedSubquerySuite.scala
index 23d61c532899..ff1afbd16865 100644
--- a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/querytest/GeneratedSubquerySuite.scala
+++ b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/querytest/GeneratedSubquerySuite.scala
@@ -418,7 +418,7 @@ class GeneratedSubquerySuite extends DockerJDBCIntegrationSuite with QueryGenera
     // Enable ANSI so that { NULL IN { } } behavior is correct in Spark.
     localSparkSession.conf.set(SQLConf.ANSI_ENABLED.key, true)
-    val generatedQueries = generatedQuerySpecs.map(_.query).toSeq
+    val generatedQueries = querySpec.map(_.query).toSeq
     // Randomize query order because we are taking a subset of queries.
    val shuffledQueries = scala.util.Random.shuffle(generatedQueries)
(spark) branch master updated: [SPARK-46941][SQL] Can't insert window group limit node for top-k computation if contains SizeBasedWindowFunction
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new c426b7285b58 [SPARK-46941][SQL] Can't insert window group limit node for top-k computation if contains SizeBasedWindowFunction

c426b7285b58 is described below

commit c426b7285b588924eaa8325cb83c868389e94bc3
Author: zml1206
AuthorDate: Fri Feb 2 12:18:49 2024 +0900

    [SPARK-46941][SQL] Can't insert window group limit node for top-k computation if contains SizeBasedWindowFunction

    ### What changes were proposed in this pull request?

    Don't insert a window group limit node for top-k computation if the window contains a `SizeBasedWindowFunction`.

    ### Why are the changes needed?

    Bug fix: inserting a window group limit node for a top-k computation that contains a `SizeBasedWindowFunction` will cause wrong results from the `SizeBasedWindowFunction`.

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    New UT. Before this PR, the UT would not pass.

    ### Was this patch authored or co-authored using generative AI tooling?

    No.

    Closes #44980 from zml1206/SPARK-46941.
Authored-by: zml1206 Signed-off-by: Hyukjin Kwon --- .../catalyst/optimizer/InferWindowGroupLimit.scala | 11 + .../optimizer/InferWindowGroupLimitSuite.scala | 18 ++- .../spark/sql/DataFrameWindowFunctionsSuite.scala | 27 ++ 3 files changed, 50 insertions(+), 6 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InferWindowGroupLimit.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InferWindowGroupLimit.scala index 04204c6a2e10..f2e99721e926 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InferWindowGroupLimit.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InferWindowGroupLimit.scala @@ -17,7 +17,7 @@ package org.apache.spark.sql.catalyst.optimizer -import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, CurrentRow, DenseRank, EqualTo, Expression, GreaterThan, GreaterThanOrEqual, IntegerLiteral, LessThan, LessThanOrEqual, Literal, NamedExpression, PredicateHelper, Rank, RowFrame, RowNumber, SpecifiedWindowFrame, UnboundedPreceding, WindowExpression, WindowSpecDefinition} +import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, CurrentRow, DenseRank, EqualTo, Expression, GreaterThan, GreaterThanOrEqual, IntegerLiteral, LessThan, LessThanOrEqual, Literal, NamedExpression, PredicateHelper, Rank, RowFrame, RowNumber, SizeBasedWindowFunction, SpecifiedWindowFrame, UnboundedPreceding, WindowExpression, WindowSpecDefinition} import org.apache.spark.sql.catalyst.plans.logical.{Filter, Limit, LocalRelation, LogicalPlan, Window, WindowGroupLimit} import org.apache.spark.sql.catalyst.rules.Rule import org.apache.spark.sql.catalyst.trees.TreePattern.{FILTER, WINDOW} @@ -53,13 +53,14 @@ object InferWindowGroupLimit extends Rule[LogicalPlan] with PredicateHelper { } /** - * All window expressions should use the same expanding window, so that - * we can safely do the early stop. 
+ * All window expressions should use the same expanding window and do not contains + * `SizeBasedWindowFunction`, so that we can safely do the early stop. */ private def isExpandingWindow( windowExpression: NamedExpression): Boolean = windowExpression match { -case Alias(WindowExpression(_, WindowSpecDefinition(_, _, -SpecifiedWindowFrame(RowFrame, UnboundedPreceding, CurrentRow))), _) => true +case Alias(WindowExpression(windowFunction, WindowSpecDefinition(_, _, +SpecifiedWindowFrame(RowFrame, UnboundedPreceding, CurrentRow))), _) + if !windowFunction.isInstanceOf[SizeBasedWindowFunction] => true case _ => false } diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/InferWindowGroupLimitSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/InferWindowGroupLimitSuite.scala index 3b185adabc3f..5aa7a27f65fb 100644 --- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/InferWindowGroupLimitSuite.scala +++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/InferWindowGroupLimitSuite.scala @@ -20,7 +20,7 @@ package org.apache.spark.sql.catalyst.optimizer import org.apache.spark.sql.Row import org.apache.spark.sql.catalyst.dsl.expressions._ import org.apache.spark.sql.catalyst.dsl.plans._ -import org.apache.spark.sql.catalyst.expressions.{CurrentRow, DenseRank, Literal, NthValue, NTile, Rank, RowFrame, RowNumber, SpecifiedWindowFrame, UnboundedPreceding} +import org.apache.spark.sql.catalyst.expressions.{CurrentRow, DenseRank, Literal, NthValue, NTile, PercentRank,
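The unsafe case is easy to see with `percent_rank`, whose output depends on the total number of rows in the partition: if a window group limit cuts the partition down to the top k rows before the function runs, n changes and so does the result. A small pure-Python illustration (plain lists standing in for window partitions, not Spark code):

```python
def percent_rank(partition):
    # percent_rank(row) = (rank - 1) / (n - 1), over the FULL partition.
    # Distinct values assumed, to keep ranking trivial.
    n = len(partition)
    ordered = sorted(partition)
    return {v: ordered.index(v) / (n - 1) for v in partition}

partition = [10, 20, 30, 40]

# Correct plan: evaluate the window function over the whole partition,
# then apply the top-k (k = 2) filter on top.
full = percent_rank(partition)
top2_correct = {v: full[v] for v in sorted(partition)[:2]}

# Invalid plan: a window group limit truncates the partition to k rows
# *before* the window function runs -- n changes from 4 to 2.
top2_truncated = percent_rank(sorted(partition)[:2])

# rank/row_number of the surviving rows are unaffected by the truncation;
# percent_rank is not: 20 maps to 1/3 in the correct plan but 1.0 after
# truncation.
```

This is why the rule can keep the early stop for `rank`/`row_number` but must bail out for any `SizeBasedWindowFunction`.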
(spark) branch master updated: [MINOR][DOCS] Fix outgoing links from SELECT SQL reference page
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 47fb5041b014 [MINOR][DOCS] Fix outgoing links from SELECT SQL reference page

47fb5041b014 is described below

commit 47fb5041b0140eab57a9398419872dc8f3b57166
Author: Nicholas Chammas
AuthorDate: Fri Feb 2 12:13:10 2024 +0900

    [MINOR][DOCS] Fix outgoing links from SELECT SQL reference page

    ### What changes were proposed in this pull request?

    Fix a couple of links that were pointing to the raw Markdown files (which doesn't work) rather than the compiled HTML files.

    ### Why are the changes needed?

    Documentation links should work, especially internal links.

    ### Does this PR introduce _any_ user-facing change?

    Yes, it changes the documentation.

    ### How was this patch tested?

    I built the docs and confirmed the links work.

    ### Was this patch authored or co-authored using generative AI tooling?

    No.

    Closes #44984 from nchammas/sql-query-ref.

Authored-by: Nicholas Chammas
Signed-off-by: Hyukjin Kwon
---
 docs/sql-ref-syntax-qry-select.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/sql-ref-syntax-qry-select.md b/docs/sql-ref-syntax-qry-select.md
index 6bcf3ab65b99..1d5532898c65 100644
--- a/docs/sql-ref-syntax-qry-select.md
+++ b/docs/sql-ref-syntax-qry-select.md
@@ -87,8 +87,8 @@ SELECT [ hints , ... ] [ ALL | DISTINCT ] { [ [ named_expression | regex_column_
     Specifies a source of input for the query. It can be one of the following:
     * Table relation
     * [Join relation](sql-ref-syntax-qry-select-join.html)
-    * [Pivot relation](sql-ref-syntax-qry-select-pivot.md)
-    * [Unpivot relation](sql-ref-syntax-qry-select-unpivot.md)
+    * [Pivot relation](sql-ref-syntax-qry-select-pivot.html)
+    * [Unpivot relation](sql-ref-syntax-qry-select-unpivot.html)
     * [Table-value function](sql-ref-syntax-qry-select-tvf.html)
     * [Inline table](sql-ref-syntax-qry-select-inline-table.html)
     * [ [LATERAL](sql-ref-syntax-qry-select-lateral-subquery.html) ] ( Subquery )
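The underlying rule is that Jekyll serves the compiled `.html` pages, so intra-doc links must target those rather than the `.md` sources. A throwaway Python check for such links (a hypothetical helper, not part of the Spark docs build) might look like:

```python
import re

# Rewrite relative markdown links ending in .md to point at the compiled
# .html pages; absolute http(s) URLs are left alone.
LINK = re.compile(r"(\[[^\]]*\]\()(?!https?://)([^)]+?)\.md(\))")

def fix_links(markdown: str) -> str:
    return LINK.sub(r"\1\2.html\3", markdown)

line = "* [Pivot relation](sql-ref-syntax-qry-select-pivot.md)"
fixed = fix_links(line)
# -> "* [Pivot relation](sql-ref-syntax-qry-select-pivot.html)"

external = fix_links("[spec](https://example.com/spec.md)")  # unchanged
```

Running something like this over `docs/*.md` would catch the remaining `.md` link offenders before they ship.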
(spark) branch branch-3.4 updated (64115d9829ec -> 9c36cfaffb1b)
This is an automated email from the ASF dual-hosted git repository.

yao pushed a change to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git

    from 64115d9829ec [SPARK-46747][SQL] Avoid scan in getTableExistsQuery for JDBC Dialects
     add 9c36cfaffb1b [SPARK-46945][K8S][3.4] Add `spark.kubernetes.legacy.useReadWriteOnceAccessMode` for old K8s clusters

No new revisions were added by this update.

Summary of changes:
 .../core/src/main/scala/org/apache/spark/deploy/k8s/Config.scala | 9 +
 .../spark/deploy/k8s/features/MountVolumesFeatureStep.scala      | 8 +++-
 2 files changed, 16 insertions(+), 1 deletion(-)
(spark) branch branch-3.5 updated (d3b4537f8bd5 -> 547edb2b4ac7)
This is an automated email from the ASF dual-hosted git repository.

yao pushed a change to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git

    from d3b4537f8bd5 [SPARK-46747][SQL] Avoid scan in getTableExistsQuery for JDBC Dialects
     add 547edb2b4ac7 [SPARK-46945][K8S][3.5] Add `spark.kubernetes.legacy.useReadWriteOnceAccessMode` for old K8s clusters

No new revisions were added by this update.

Summary of changes:
 .../core/src/main/scala/org/apache/spark/deploy/k8s/Config.scala | 9 +
 .../spark/deploy/k8s/features/MountVolumesFeatureStep.scala      | 8 +++-
 2 files changed, 16 insertions(+), 1 deletion(-)
(spark) branch master updated (cedef63faf14 -> 457f6f59b1b7)
This is an automated email from the ASF dual-hosted git repository.

yao pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

    from cedef63faf14 [SPARK-45807][SQL] Return View after calling replaceView(..)
     add 457f6f59b1b7 [SPARK-46945][K8S] Add `spark.kubernetes.legacy.useReadWriteOnceAccessMode` for old K8s clusters

No new revisions were added by this update.

Summary of changes:
 docs/core-migration-guide.md                                     | 2 ++
 .../core/src/main/scala/org/apache/spark/deploy/k8s/Config.scala | 9 +
 .../spark/deploy/k8s/features/MountVolumesFeatureStep.scala      | 8 +++-
 3 files changed, 18 insertions(+), 1 deletion(-)
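Going by the config name in the summary above (the behavioral details live in `docs/core-migration-guide.md`, not shown here), the new flag is set like any other Spark conf. A hypothetical invocation for a Kubernetes cluster that does not support the newer per-pod access mode might look roughly like:

```shell
# Illustrative only: restore the legacy ReadWriteOnce PVC access-mode
# behavior via the new escape-hatch conf added by SPARK-46945.
spark-submit \
  --master k8s://https://my-k8s-apiserver:6443 \
  --deploy-mode cluster \
  --conf spark.kubernetes.legacy.useReadWriteOnceAccessMode=true \
  --class org.apache.spark.examples.SparkPi \
  local:///opt/spark/examples/jars/spark-examples.jar
```

The master URL, example class, and jar path are placeholders; only the conf key comes from the commit itself.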
(spark) branch master updated: [SPARK-45807][SQL] Return View after calling replaceView(..)
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new cedef63faf14 [SPARK-45807][SQL] Return View after calling replaceView(..)

cedef63faf14 is described below

commit cedef63faf14ce41ea9f4540faa4a1c18cf7cea8
Author: Eduard Tudenhoefner
AuthorDate: Fri Feb 2 10:35:56 2024 +0800

    [SPARK-45807][SQL] Return View after calling replaceView(..)

    ### What changes were proposed in this pull request?

    Follow-up API improvements based on https://github.com/apache/spark/pull/43677.

    ### Why are the changes needed?

    Required for DataSourceV2 view support.

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    N/A.

    ### Was this patch authored or co-authored using generative AI tooling?

    N/A.

    Closes #44970 from nastra/SPARK-45807-return-type.

Authored-by: Eduard Tudenhoefner
Signed-off-by: Wenchen Fan
---
 .../java/org/apache/spark/sql/connector/catalog/ViewCatalog.java | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/ViewCatalog.java b/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/ViewCatalog.java
index 933289cab40b..abe5fb3148d0 100644
--- a/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/ViewCatalog.java
+++ b/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/ViewCatalog.java
@@ -115,7 +115,7 @@ public interface ViewCatalog extends CatalogPlugin {
    * Create a view in the catalog.
    *
    * @param viewInfo the info class holding all view information
-   * @return the view created
+   * @return the created view. This can be null if getting the metadata for the view is expensive
    * @throws ViewAlreadyExistsException If a view or table already exists for the identifier
    * @throws NoSuchNamespaceException If the identifier namespace does not exist (optional)
    */
@@ -129,10 +129,12 @@ public interface ViewCatalog extends CatalogPlugin {
    *
    * @param viewInfo the info class holding all view information
    * @param orCreate create the view if it doesn't exist
+   * @return the created/replaced view. This can be null if getting the metadata
+   *         for the view is expensive
    * @throws NoSuchViewException If the view doesn't exist or is a table
    * @throws NoSuchNamespaceException If the identifier namespace does not exist (optional)
    */
-  default void replaceView(
+  default View replaceView(
       ViewInfo viewInfo,
       boolean orCreate)
       throws NoSuchViewException, NoSuchNamespaceException {
@@ -143,7 +145,7 @@ public interface ViewCatalog extends CatalogPlugin {
     }

     try {
-      createView(viewInfo);
+      return createView(viewInfo);
     } catch (ViewAlreadyExistsException e) {
       throw new RuntimeException("Race condition when creating/replacing view", e);
     }

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
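The default-method flow in the diff — drop the existing view, then create, surfacing a concurrent re-create as a race — now also propagates `createView`'s return value. A toy Python model of that contract (in-memory dict, illustrative names; the real API returns a `View` object that may be null when metadata is expensive to fetch):

```python
class NoSuchViewError(Exception): pass
class ViewAlreadyExistsError(Exception): pass

class ToyViewCatalog:
    """Toy in-memory catalog mirroring the default replaceView flow."""

    def __init__(self):
        self.views = {}

    def create_view(self, name, definition):
        if name in self.views:
            raise ViewAlreadyExistsError(name)
        self.views[name] = definition
        return definition  # analogue of returning the created View

    def drop_view(self, name):
        return self.views.pop(name, None) is not None

    def replace_view(self, name, definition, or_create=False):
        # Drop the old view; if it didn't exist and orCreate is false, fail.
        if not self.drop_view(name) and not or_create:
            raise NoSuchViewError(name)
        try:
            return self.create_view(name, definition)
        except ViewAlreadyExistsError as e:
            # Someone re-created the view between drop and create.
            raise RuntimeError("Race condition when creating/replacing view") from e

cat = ToyViewCatalog()
v1 = cat.replace_view("v", "SELECT 1", or_create=True)  # created
v2 = cat.replace_view("v", "SELECT 2")                  # replaced
```

The point of the PR is only the return value: callers previously had to do a second lookup after `replaceView`, and now the created/replaced view comes straight back.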
(spark) branch master updated (53cbaeb20293 -> 49e3f5e3bad4)
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

    from 53cbaeb20293 [SPARK-46936][PS] Implement `Frame.to_feather`
     add 49e3f5e3bad4 [SPARK-46908][SQL] Support star clause in WHERE clause

No new revisions were added by this update.

Summary of changes:
 docs/sql-migration-guide.md                        |   1 +
 docs/sql-ref-function-invocation.md                | 112 +++
 docs/sql-ref-syntax-qry-select.md                  |   7 +-
 docs/sql-ref-syntax-qry-star.md                    |  97 +
 docs/sql-ref-syntax.md                             |   1 +
 docs/sql-ref.md                                    |   1 +
 .../spark/sql/catalyst/analysis/Analyzer.scala     |  10 +-
 .../spark/sql/catalyst/parser/AstBuilder.scala     |   7 +-
 .../org/apache/spark/sql/internal/SQLConf.scala    |  13 ++
 .../sql/catalyst/analysis/AnalysisErrorSuite.scala |   4 +-
 .../analyzer-results/selectExcept.sql.out          | 218 +
 .../subquery/in-subquery/in-group-by.sql.out       |   2 +-
 .../resources/sql-tests/inputs/selectExcept.sql    |  29 +++
 .../inputs/subquery/in-subquery/in-group-by.sql    |   2 +-
 .../sql-tests/results/selectExcept.sql.out         | 168 
 .../subquery/in-subquery/in-group-by.sql.out       |   2 +-
 16 files changed, 666 insertions(+), 8 deletions(-)
 create mode 100644 docs/sql-ref-function-invocation.md
 create mode 100644 docs/sql-ref-syntax-qry-star.md
(spark) branch master updated: [SPARK-46936][PS] Implement `Frame.to_feather`
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 53cbaeb20293 [SPARK-46936][PS] Implement `Frame.to_feather`

53cbaeb20293 is described below

commit 53cbaeb202931a918864ca4aecd9826144ad8307
Author: Ruifeng Zheng
AuthorDate: Fri Feb 2 09:04:26 2024 +0800

    [SPARK-46936][PS] Implement `Frame.to_feather`

    ### What changes were proposed in this pull request?

    Implement `Frame.to_feather`.

    ### Why are the changes needed?

    For pandas parity.

    ### Does this PR introduce _any_ user-facing change?

    Yes:

    ```
    In [3]: pdf = pd.DataFrame(
       ...:     [[1, 1.0, "a"]],
       ...:     columns=["x", "y", "z"],
       ...: )

    In [4]: pdf
    Out[4]:
       x    y  z
    0  1  1.0  a

    In [5]: psdf = ps.from_pandas(pdf)

    In [6]: psdf
       x    y  z
    0  1  1.0  a

    In [7]: pdf.to_feather("/tmp/file1.feather")

    In [8]: psdf.to_feather("/tmp/file2.feather")

    In [9]: f1 = pd.read_feather("/tmp/file1.feather")

    In [10]: f1
    Out[10]:
       x    y  z
    0  1  1.0  a

    In [11]: f2 = pd.read_feather("/tmp/file2.feather")

    In [12]: f2
    Out[12]:
       x    y  z
    0  1  1.0  a
    ```

    ### How was this patch tested?

    Added UT.

    ### Was this patch authored or co-authored using generative AI tooling?

    No.

    Closes #44972 from zhengruifeng/ps_to_feather.
Authored-by: Ruifeng Zheng Signed-off-by: Ruifeng Zheng --- dev/sparktestsupport/modules.py| 2 + .../docs/source/reference/pyspark.pandas/frame.rst | 1 + python/pyspark/pandas/frame.py | 35 +++ python/pyspark/pandas/missing/frame.py | 1 - .../pandas/tests/connect/io/test_parity_feather.py | 42 + python/pyspark/pandas/tests/io/test_feather.py | 68 ++ 6 files changed, 148 insertions(+), 1 deletion(-) diff --git a/dev/sparktestsupport/modules.py b/dev/sparktestsupport/modules.py index 508cf56b9c87..233dcf4e54b6 100644 --- a/dev/sparktestsupport/modules.py +++ b/dev/sparktestsupport/modules.py @@ -816,6 +816,7 @@ pyspark_pandas = Module( "pyspark.pandas.tests.series.test_stat", "pyspark.pandas.tests.io.test_io", "pyspark.pandas.tests.io.test_csv", +"pyspark.pandas.tests.io.test_feather", "pyspark.pandas.tests.io.test_dataframe_conversion", "pyspark.pandas.tests.io.test_dataframe_spark_io", "pyspark.pandas.tests.io.test_series_conversion", @@ -1297,6 +1298,7 @@ pyspark_pandas_connect_part3 = Module( # pandas-on-Spark unittests "pyspark.pandas.tests.connect.io.test_parity_io", "pyspark.pandas.tests.connect.io.test_parity_csv", +"pyspark.pandas.tests.connect.io.test_parity_feather", "pyspark.pandas.tests.connect.io.test_parity_dataframe_conversion", "pyspark.pandas.tests.connect.io.test_parity_dataframe_spark_io", "pyspark.pandas.tests.connect.io.test_parity_series_conversion", diff --git a/python/docs/source/reference/pyspark.pandas/frame.rst b/python/docs/source/reference/pyspark.pandas/frame.rst index 77b60468b8fb..564ddb607a19 100644 --- a/python/docs/source/reference/pyspark.pandas/frame.rst +++ b/python/docs/source/reference/pyspark.pandas/frame.rst @@ -283,6 +283,7 @@ Serialization / IO / Conversion DataFrame.to_numpy DataFrame.to_spark DataFrame.to_string + DataFrame.to_feather DataFrame.to_json DataFrame.to_dict DataFrame.to_excel diff --git a/python/pyspark/pandas/frame.py b/python/pyspark/pandas/frame.py index 7222b877bba1..3b3565f7ea9f 100644 --- 
a/python/pyspark/pandas/frame.py +++ b/python/pyspark/pandas/frame.py @@ -2648,6 +2648,41 @@ defaultdict(, {'col..., 'col...})] psdf._to_internal_pandas(), self.to_latex, pd.DataFrame.to_latex, args ) +def to_feather( +self, +path: Union[str, IO[str]], +**kwargs: Any, +) -> None: +""" +Write a DataFrame to the binary Feather format. + +.. note:: This method should only be used if the resulting DataFrame is expected + to be small, as all the data is loaded into the driver's memory. + +.. versionadded:: 4.0.0 + +Parameters +-- +path : str, path object, file-like object +String, path object (implementing ``os.PathLike[str]``), or file-like +object implementing a binary ``write()`` function. +**kwargs : +Additional keywords passed to :func:`pyarrow.feather.write_feather`. +This includes the `compression`, `compression_level`, `chunksize` +and `version` keywords. + +Examples
(spark) branch master updated: [SPARK-46864][SS] Onboard Arbitrary StateV2 onto New Error Class Framework
This is an automated email from the ASF dual-hosted git repository. kabhwan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 00e63d63f9af [SPARK-46864][SS] Onboard Arbitrary StateV2 onto New Error Class Framework 00e63d63f9af is described below commit 00e63d63f9af6ef186e14159ddbe8bb8d1c8690b Author: Eric Marnadi AuthorDate: Fri Feb 2 05:38:15 2024 +0900 [SPARK-46864][SS] Onboard Arbitrary StateV2 onto New Error Class Framework ### What changes were proposed in this pull request? This PR proposes to apply the error class framework to the new data source, State API V2. ### Why are the changes needed? The error class framework is a standard way to represent all exceptions in Spark. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Refactored unit tests to check that the right error class is thrown in certain situations ### Was this patch authored or co-authored using generative AI tooling? No Closes #44883 from ericm-db/state-v2-error-class. 
Lead-authored-by: Eric Marnadi Co-authored-by: ericm-db <132308037+ericm...@users.noreply.github.com> Signed-off-by: Jungtaek Lim --- .../src/main/resources/error/error-classes.json| 29 +++ ...r-conditions-unsupported-feature-error-class.md | 4 ++ docs/sql-error-conditions.md | 24 ++ .../sql/execution/streaming/ValueStateImpl.scala | 5 +- .../state/HDFSBackedStateStoreProvider.scala | 3 +- .../streaming/state/StateStoreChangelog.scala | 16 +++ .../streaming/state/StateStoreErrors.scala | 56 ++ .../streaming/state/MemoryStateStore.scala | 2 +- .../execution/streaming/state/RocksDBSuite.scala | 37 +- .../streaming/state/ValueStateSuite.scala | 43 +++-- 10 files changed, 199 insertions(+), 20 deletions(-) diff --git a/common/utils/src/main/resources/error/error-classes.json b/common/utils/src/main/resources/error/error-classes.json index 8e47490f5a61..baefb05a7070 100644 --- a/common/utils/src/main/resources/error/error-classes.json +++ b/common/utils/src/main/resources/error/error-classes.json @@ -1656,6 +1656,12 @@ ], "sqlState" : "XX000" }, + "INTERNAL_ERROR_TWS" : { +"message" : [ + "" +], +"sqlState" : "XX000" + }, "INTERVAL_ARITHMETIC_OVERFLOW" : { "message" : [ "." @@ -3235,6 +3241,18 @@ ], "sqlState" : "0A000" }, + "STATE_STORE_MULTIPLE_VALUES_PER_KEY" : { +"message" : [ + "Store does not support multiple values per key" +], +"sqlState" : "42802" + }, + "STATE_STORE_UNSUPPORTED_OPERATION" : { +"message" : [ + " operation not supported with " +], +"sqlState" : "XXKST" + }, "STATIC_PARTITION_COLUMN_IN_INSERT_COLUMN_LIST" : { "message" : [ "Static partition column is also specified in the column list." 
@@ -3388,6 +3406,12 @@ ], "sqlState" : "428EK" }, + "TWS_VALUE_SHOULD_NOT_BE_NULL" : { +"message" : [ + "New value should be non-null for " +], +"sqlState" : "22004" + }, "UDTF_ALIAS_NUMBER_MISMATCH" : { "message" : [ "The number of aliases supplied in the AS clause does not match the number of columns output by the UDTF.", @@ -3921,6 +3945,11 @@ " is a VARIABLE and cannot be updated using the SET statement. Use SET VARIABLE = ... instead." ] }, + "STATE_STORE_MULTIPLE_COLUMN_FAMILIES" : { +"message" : [ + "Creating multiple column families with is not supported." +] + }, "TABLE_OPERATION" : { "message" : [ "Table does not support . Please check the current catalog and namespace to make sure the qualified table name is expected, and also check the catalog implementation which is configured by \"spark.sql.catalog\"." diff --git a/docs/sql-error-conditions-unsupported-feature-error-class.md b/docs/sql-error-conditions-unsupported-feature-error-class.md index d90d2b2a109f..1b12c4bfc1b3 100644 --- a/docs/sql-error-conditions-unsupported-feature-error-class.md +++ b/docs/sql-error-conditions-unsupported-feature-error-class.md @@ -190,6 +190,10 @@ set PROPERTIES and DBPROPERTIES at the same time. `` is a VARIABLE and cannot be updated using the SET statement. Use SET VARIABLE `` = ... instead. +## STATE_STORE_MULTIPLE_COLUMN_FAMILIES + +Creating multiple column families with `` is not supported. + ## TABLE_OPERATION Table `` does not support ``. Please check the current catalog and namespace to make sure the qualified table name is expected, and also check the catalog implementation which is configured by "spark.sql.catalog". diff --git a/docs/sql-error-conditions.md
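The mechanism behind `error-classes.json` is a registry mapping each error class to a parameterized message template plus a SQLSTATE. Note the archive's HTML rendering stripped the angle-bracket placeholders from the JSON above (e.g. `" operation not supported with "`), so the placeholder names below (`operationType`, `entity`) are plausible reconstructions, not verbatim from this diff. A self-contained Python sketch of how such a registry resolves a message:

```python
import re

# Miniature version of the error-classes.json registry; sqlState values
# are taken from the diff above, placeholder names are assumptions.
ERROR_CLASSES = {
    "STATE_STORE_MULTIPLE_VALUES_PER_KEY": {
        "message": ["Store does not support multiple values per key"],
        "sqlState": "42802",
    },
    "STATE_STORE_UNSUPPORTED_OPERATION": {
        "message": ["<operationType> operation not supported with <entity>"],
        "sqlState": "XXKST",
    },
}

def get_error_message(error_class: str, parameters: dict) -> str:
    """Substitute <param> placeholders in the registered template,
    mirroring how Spark builds SparkException messages from error classes."""
    template = " ".join(ERROR_CLASSES[error_class]["message"])
    return re.sub(r"<(\w+)>", lambda m: parameters[m.group(1)], template)

msg = get_error_message(
    "STATE_STORE_UNSUPPORTED_OPERATION",
    {"operationType": "merge", "entity": "HDFSBackedStateStoreProvider"},
)
print(msg)  # merge operation not supported with HDFSBackedStateStoreProvider
```

Because messages live in one registry, tests can assert on the error class and parameters (as the refactored `ValueStateSuite` does via `checkError`-style assertions) instead of brittle full-message string matching.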
(spark) branch master updated: [MINOR][SQL] Clean up outdated comments from `hash` function in `Metadata`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 704e9a0785f4 [MINOR][SQL] Clean up outdated comments from `hash` function in `Metadata` 704e9a0785f4 is described below commit 704e9a0785f4fc4dd86b950a649114e807a826a1 Author: yangjie01 AuthorDate: Thu Feb 1 09:31:24 2024 -0800 [MINOR][SQL] Clean up outdated comments from `hash` function in `Metadata` ### What changes were proposed in this pull request? This PR just cleans up outdated comments from the `hash` function in `Metadata` ### Why are the changes needed? Clean up outdated comments ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GitHub Actions ### Was this patch authored or co-authored using generative AI tooling? No Closes #44978 from LuciferYang/minior-remove-comments. Authored-by: yangjie01 Signed-off-by: Dongjoon Hyun --- sql/api/src/main/scala/org/apache/spark/sql/types/Metadata.scala | 2 -- 1 file changed, 2 deletions(-) diff --git a/sql/api/src/main/scala/org/apache/spark/sql/types/Metadata.scala b/sql/api/src/main/scala/org/apache/spark/sql/types/Metadata.scala index 17be8cfa12b5..2ffd0f13ca10 100644 --- a/sql/api/src/main/scala/org/apache/spark/sql/types/Metadata.scala +++ b/sql/api/src/main/scala/org/apache/spark/sql/types/Metadata.scala @@ -208,8 +208,6 @@ object Metadata { /** Computes the hash code for the types we support. */ private def hash(obj: Any): Int = { obj match { - // `map.mapValues` return `Map` in Scala 2.12 and return `MapView` in Scala 2.13, call - // `toMap` for Scala version compatibility. case map: Map[_, _] => map.transform((_, v) => hash(v)).## case arr: Array[_] =>
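The surviving `hash` in `Metadata.scala` hashes each map value and array element recursively, then hashes the container. A self-contained Python analogue of that structural hashing (Python's built-in `hash` standing in for Scala's `##`; this is an illustrative sketch, not the Spark implementation):

```python
def structural_hash(obj):
    """Recursive hash over nested maps/arrays, mirroring Metadata.hash:
    hash the leaves first, then the containers. frozenset makes dict
    hashing order-insensitive, like hashing a Scala Map."""
    if isinstance(obj, dict):
        return hash(frozenset((k, structural_hash(v)) for k, v in obj.items()))
    if isinstance(obj, (list, tuple)):
        return hash(tuple(structural_hash(x) for x in obj))
    return hash(obj)

a = {"k": [1, 2, {"nested": "v"}]}
b = {"k": [1, 2, {"nested": "v"}]}
print(structural_hash(a) == structural_hash(b))  # True
```

With Scala 2.12 support gone, `map.transform((_, v) => hash(v))` always yields a strict `Map`, so the compatibility comment about `mapValues`/`MapView` had become dead text — hence its removal.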
(spark) branch master updated: [SPARK-46940][CORE] Remove unused `updateSparkConfigFromProperties` and `isAbsoluteURI` in `o.a.s.u.Utils`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new d5ca61692c34 [SPARK-46940][CORE] Remove unused `updateSparkConfigFromProperties` and `isAbsoluteURI` in `o.a.s.u.Utils` d5ca61692c34 is described below commit d5ca61692c34449bc602db6cf0919010ec5a50a3 Author: panbingkun AuthorDate: Thu Feb 1 09:30:07 2024 -0800 [SPARK-46940][CORE] Remove unused `updateSparkConfigFromProperties` and `isAbsoluteURI` in `o.a.s.u.Utils` ### What changes were proposed in this pull request? The PR aims to remove the unused `updateSparkConfigFromProperties` and `isAbsoluteURI` in `o.a.s.u.Utils`. ### Why are the changes needed? Keep the code clean. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44979 from panbingkun/SPARK-46940. Authored-by: panbingkun Signed-off-by: Dongjoon Hyun --- .../main/scala/org/apache/spark/util/Utils.scala | 25 -- 1 file changed, 25 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/util/Utils.scala b/core/src/main/scala/org/apache/spark/util/Utils.scala index a55539c0a235..b49f97aed05e 100644 --- a/core/src/main/scala/org/apache/spark/util/Utils.scala +++ b/core/src/main/scala/org/apache/spark/util/Utils.scala @@ -1884,17 +1884,6 @@ private[spark] object Utils } } - /** Check whether a path is an absolute URI. */ - def isAbsoluteURI(path: String): Boolean = { -try { - val uri = new URI(path: String) - uri.isAbsolute -} catch { - case _: URISyntaxException => -false -} - } - /** Return all non-local paths from a comma-separated list of paths. 
*/ def nonLocalPaths(paths: String, testWindows: Boolean = false): Array[String] = { val windows = isWindows || testWindows @@ -1931,20 +1920,6 @@ private[spark] object Utils path } - /** - * Updates Spark config with properties from a set of Properties. - * Provided properties have the highest priority. - */ - def updateSparkConfigFromProperties( - conf: SparkConf, - properties: Map[String, String]) : Unit = { -properties.filter { case (k, v) => - k.startsWith("spark.") -}.foreach { case (k, v) => - conf.set(k, v) -} - } - /** * Implements the same logic as JDK `java.lang.String#trim` by removing leading and trailing * non-printable characters less or equal to '\u0020' (SPACE) but preserves natural line
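For anyone who relied on the removed `updateSparkConfigFromProperties` helper, its contract was small: copy only `spark.`-prefixed keys into the config, letting the provided properties win. A self-contained Python sketch of that behavior for reference (a plain dict stands in for `SparkConf`; this is not a Spark API):

```python
def update_conf_from_properties(conf: dict, properties: dict) -> None:
    """Copy only spark.* keys into conf; provided properties take
    priority over existing entries, matching the removed helper."""
    for k, v in properties.items():
        if k.startswith("spark."):
            conf[k] = v

conf = {"spark.app.name": "old"}
update_conf_from_properties(
    conf, {"spark.app.name": "new", "other.key": "ignored"}
)
print(conf)  # {'spark.app.name': 'new'}
```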
(spark) branch master updated: [SPARK-46933][SQL] Add query execution time metric to connectors which use JDBCRDD
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 22586efb3cd4 [SPARK-46933][SQL] Add query execution time metric to connectors which use JDBCRDD 22586efb3cd4 is described below commit 22586efb3cd4a19969d92b249a14326fba3244d2 Author: Uros Stankovic AuthorDate: Thu Feb 1 22:48:50 2024 +0800 [SPARK-46933][SQL] Add query execution time metric to connectors which use JDBCRDD ### What changes were proposed in this pull request? This pull request adds measurement of query execution time for external JDBC data sources. Another change is widening access rights for the JDBCRDD class, which is needed for adding another metric (SQL text) in a follow-up PR. ### Why are the changes needed? Query execution time is a very important metric to have ### Does this PR introduce _any_ user-facing change? User can see the query execution time on the SparkPlan graph under the node metrics tab ### How was this patch tested? Tested using a custom image ### Was this patch authored or co-authored using generative AI tooling? No Closes #44969 from urosstan-db/SPARK-46933-Add-scan-metrics-to-jdbc-connector. 
Lead-authored-by: Uros Stankovic Co-authored-by: Uros Stankovic <155642965+urosstan...@users.noreply.github.com> Signed-off-by: Wenchen Fan --- .../spark/sql/execution/DataSourceScanExec.scala | 17 -- .../datasources/DataSourceMetricsMixin.scala | 24 .../sql/execution/datasources/jdbc/JDBCRDD.scala | 26 -- 3 files changed, 63 insertions(+), 4 deletions(-) diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala index 2622eadaefb3..ec265f4eaea4 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala @@ -99,6 +99,10 @@ trait DataSourceScanExec extends LeafExecNode { def inputRDDs(): Seq[RDD[InternalRow]] } +object DataSourceScanExec { + val numOutputRowsKey = "numOutputRows" +} + /** Physical plan node for scanning data from a relation. */ case class RowDataSourceScanExec( output: Seq[Attribute], @@ -111,8 +115,17 @@ case class RowDataSourceScanExec( tableIdentifier: Option[TableIdentifier]) extends DataSourceScanExec with InputRDDCodegen { - override lazy val metrics = -Map("numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows")) + override lazy val metrics: Map[String, SQLMetric] = { +val metrics = Map( + DataSourceScanExec.numOutputRowsKey -> +SQLMetrics.createMetric(sparkContext, "number of output rows") +) + +rdd match { + case rddWithDSMetrics: DataSourceMetricsMixin => metrics ++ rddWithDSMetrics.getMetrics + case _ => metrics +} + } protected override def doExecute(): RDD[InternalRow] = { val numOutputRows = longMetric("numOutputRows") diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceMetricsMixin.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceMetricsMixin.scala new file mode 100644 index ..6c1e5e876e99 --- /dev/null +++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceMetricsMixin.scala @@ -0,0 +1,24 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources + +import org.apache.spark.sql.execution.metric.SQLMetric + +trait DataSourceMetricsMixin { + def getMetrics: Seq[(String, SQLMetric)] +} diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala index 934ed9ac2a1b..a436627fd117 100644 ---
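The core of the change is a mixin dispatch: `RowDataSourceScanExec` always reports `numOutputRows`, and additionally merges in whatever metrics the input RDD exposes when that RDD implements `DataSourceMetricsMixin`. A self-contained Python sketch of the pattern — class names mirror the Scala, everything else is a simplified hypothetical:

```python
class SQLMetric:
    """Simplified stand-in for Spark's SQLMetric accumulator."""
    def __init__(self, name):
        self.name = name
        self.value = 0

class DataSourceMetricsMixin:
    """RDDs that want extra metrics shown on the scan node implement this."""
    def get_metrics(self):
        raise NotImplementedError

class JdbcRDD(DataSourceMetricsMixin):
    """The JDBC RDD contributes its query-execution-time metric."""
    def __init__(self):
        self.query_time = SQLMetric("JDBC query execution time")

    def get_metrics(self):
        return [("queryExecutionTime", self.query_time)]

class PlainRDD:
    """An RDD without the mixin gets only the default metric."""

def scan_metrics(rdd):
    """Mirror of RowDataSourceScanExec.metrics: default output-rows metric,
    plus the RDD's own metrics when the mixin is implemented."""
    metrics = {"numOutputRows": SQLMetric("number of output rows")}
    if isinstance(rdd, DataSourceMetricsMixin):
        metrics.update(rdd.get_metrics())
    return metrics

print(sorted(scan_metrics(JdbcRDD())))   # ['numOutputRows', 'queryExecutionTime']
print(sorted(scan_metrics(PlainRDD())))  # ['numOutputRows']
```

The type test keeps the scan node decoupled from JDBC specifics: any future connector RDD can surface its own metrics just by implementing the mixin.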
(spark) branch master updated: [SPARK-46852][SS] Remove use of explicit key encoder and pass it implicitly to the operator for transformWithState operator
This is an automated email from the ASF dual-hosted git repository. kabhwan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new e610d1d8f79b [SPARK-46852][SS] Remove use of explicit key encoder and pass it implicitly to the operator for transformWithState operator e610d1d8f79b is described below commit e610d1d8f79b913cb9ee9236a6325202c58d8397 Author: Anish Shrigondekar AuthorDate: Thu Feb 1 22:31:07 2024 +0900 [SPARK-46852][SS] Remove use of explicit key encoder and pass it implicitly to the operator for transformWithState operator ### What changes were proposed in this pull request? Remove use of explicit key encoder and pass it implicitly to the operator for transformWithState operator ### Why are the changes needed? These changes are needed to avoid asking users to provide an explicit key encoder, and we might also need them for subsequent timer-related changes ### Does this PR introduce _any_ user-facing change? Yes ### How was this patch tested? Existing unit tests ### Was this patch authored or co-authored using generative AI tooling? No Closes #44974 from anishshri-db/task/SPARK-46852. 
Authored-by: Anish Shrigondekar Signed-off-by: Jungtaek Lim --- .../sql/streaming/StatefulProcessorHandle.scala| 5 + .../spark/sql/catalyst/plans/logical/object.scala | 3 +++ .../spark/sql/execution/SparkStrategies.scala | 3 ++- .../streaming/StatefulProcessorHandleImpl.scala| 13 + .../streaming/TransformWithStateExec.scala | 6 +- .../sql/execution/streaming/ValueStateImpl.scala | 12 +--- .../streaming/state/ValueStateSuite.scala | 22 +++--- .../sql/streaming/TransformWithStateSuite.scala| 8 +++- 8 files changed, 39 insertions(+), 33 deletions(-) diff --git a/sql/api/src/main/scala/org/apache/spark/sql/streaming/StatefulProcessorHandle.scala b/sql/api/src/main/scala/org/apache/spark/sql/streaming/StatefulProcessorHandle.scala index 302de4a3c947..5eaccceb947c 100644 --- a/sql/api/src/main/scala/org/apache/spark/sql/streaming/StatefulProcessorHandle.scala +++ b/sql/api/src/main/scala/org/apache/spark/sql/streaming/StatefulProcessorHandle.scala @@ -19,7 +19,6 @@ package org.apache.spark.sql.streaming import java.io.Serializable import org.apache.spark.annotation.{Evolving, Experimental} -import org.apache.spark.sql.Encoder /** * Represents the operation handle provided to the stateful processor used in the @@ -34,12 +33,10 @@ private[sql] trait StatefulProcessorHandle extends Serializable { * The user must ensure to call this function only within the `init()` method of the * StatefulProcessor. 
* @param stateName - name of the state variable - * @param keyEncoder - Spark SQL Encoder for key - * @tparam K - type of key * @tparam T - type of state variable * @return - instance of ValueState of type T that can be used to store state persistently */ - def getValueState[K, T](stateName: String, keyEncoder: Encoder[K]): ValueState[T] + def getValueState[T](stateName: String): ValueState[T] /** Function to return queryInfo for currently running task */ def getQueryInfo(): QueryInfo diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/object.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/object.scala index 8f937dd5a777..cb8673d20ed3 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/object.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/object.scala @@ -577,6 +577,7 @@ object TransformWithState { timeoutMode: TimeoutMode, outputMode: OutputMode, child: LogicalPlan): LogicalPlan = { +val keyEncoder = encoderFor[K] val mapped = new TransformWithState( UnresolvedDeserializer(encoderFor[K].deserializer, groupingAttributes), UnresolvedDeserializer(encoderFor[V].deserializer, dataAttributes), @@ -585,6 +586,7 @@ object TransformWithState { statefulProcessor.asInstanceOf[StatefulProcessor[Any, Any, Any]], timeoutMode, outputMode, + keyEncoder.asInstanceOf[ExpressionEncoder[Any]], CatalystSerde.generateObjAttr[U], child ) @@ -600,6 +602,7 @@ case class TransformWithState( statefulProcessor: StatefulProcessor[Any, Any, Any], timeoutMode: TimeoutMode, outputMode: OutputMode, +keyEncoder: ExpressionEncoder[Any], outputObjAttr: Attribute, child: LogicalPlan) extends UnaryNode with ObjectProducer { diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala index 5d4063d125c8..f5c2f17f8826 100644 ---
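The API change is visible in the signature diff above: `getValueState[K, T](stateName, keyEncoder)` becomes `getValueState[T](stateName)`, because the key encoder is now captured once when the `TransformWithState` plan is built and threaded through to the handle. A self-contained Python sketch of that shape (names and internals hypothetical, not Spark's implementation):

```python
class ValueState:
    """Per-key value state; the key encoder is supplied by the handle,
    not by the caller, which is the essence of the SPARK-46852 change."""
    def __init__(self, state_name, key_encoder):
        self._state_name = state_name
        self._key_encoder = key_encoder
        self._store = {}

    def update(self, key, value):
        self._store[self._key_encoder(key)] = value

    def get(self, key):
        return self._store.get(self._key_encoder(key))

class StatefulProcessorHandle:
    def __init__(self, key_encoder):
        # Captured implicitly when the operator is planned, instead of
        # being passed by the user on every getValueState call.
        self._key_encoder = key_encoder

    def get_value_state(self, state_name):
        # Old API sketch: get_value_state(state_name, key_encoder)
        return ValueState(state_name, self._key_encoder)

handle = StatefulProcessorHandle(key_encoder=str)
count_state = handle.get_value_state("countState")
count_state.update(42, 3)
print(count_state.get(42))  # 3
```

Capturing the encoder at plan time also means every state variable created from one handle encodes keys consistently, which matters once timers share the same key encoding.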
(spark) branch master updated: [SPARK-45110][BUILD] Upgrade rocksdbjni to 8.8.1
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 1870de0b329a [SPARK-45110][BUILD] Upgrade rocksdbjni to 8.8.1 1870de0b329a is described below commit 1870de0b329ac5ef35a331a653b4debd85eaa684 Author: panbingkun AuthorDate: Thu Feb 1 06:37:00 2024 -0600 [SPARK-45110][BUILD] Upgrade rocksdbjni to 8.8.1 ### What changes were proposed in this pull request? The PR aims to upgrade rocksdbjni from `8.3.2` to `8.8.1`. Why version `8.8.1`? Because so far, `32` tests have been conducted against version `8.6.7` or `8.8.1`, and no core issues have been found. Later versions have not been rigorously validated. ### Why are the changes needed? 1. The full release notes: - https://github.com/facebook/rocksdb/releases/tag/v8.8.1 - https://github.com/facebook/rocksdb/releases/tag/v8.7.3 - https://github.com/facebook/rocksdb/releases/tag/v8.6.7 - https://github.com/facebook/rocksdb/releases/tag/v8.5.4 - https://github.com/facebook/rocksdb/releases/tag/v8.5.3 - https://github.com/facebook/rocksdb/releases/tag/v8.4.4 - https://github.com/facebook/rocksdb/releases/tag/v8.3.3 2. Bug fixes, e.g.: - Fixed a bug where compaction read under non direct IO still falls back to RocksDB internal prefetching after file system's prefetching returns non-OK status other than Status::NotSupported() - Fix a bug with atomic_flush=true that can cause DB to stuck after a flush fails - Fix a bug where if there is an error reading from offset 0 of a file from L1+ and that the file is not the first file in the sorted run, data can be lost in compaction and read/scan can return incorrect results. - Fix a bug where iterator may return incorrect result for DeleteRange() users if there was an error reading from a file. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? 
- Pass GA. - Manually test: ``` ./build/mvn clean install -pl core -am -Dtest.exclude.tags=org.apache.spark.tags.ExtendedLevelDBTest -fn ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #43924 from panbingkun/upgrade_rocksdbjni. Lead-authored-by: panbingkun Co-authored-by: panbingkun Signed-off-by: Sean Owen --- dev/deps/spark-deps-hadoop-3-hive-2.3 | 2 +- pom.xml| 2 +- ...StoreBasicOperationsBenchmark-jdk21-results.txt | 70 ++--- .../StateStoreBasicOperationsBenchmark-results.txt | 72 +++--- 4 files changed, 73 insertions(+), 73 deletions(-) diff --git a/dev/deps/spark-deps-hadoop-3-hive-2.3 b/dev/deps/spark-deps-hadoop-3-hive-2.3 index fcb3350e5de2..e02733883642 100644 --- a/dev/deps/spark-deps-hadoop-3-hive-2.3 +++ b/dev/deps/spark-deps-hadoop-3-hive-2.3 @@ -239,7 +239,7 @@ parquet-jackson/1.13.1//parquet-jackson-1.13.1.jar pickle/1.3//pickle-1.3.jar py4j/0.10.9.7//py4j-0.10.9.7.jar remotetea-oncrpc/1.1.2//remotetea-oncrpc-1.1.2.jar -rocksdbjni/8.3.2//rocksdbjni-8.3.2.jar +rocksdbjni/8.8.1//rocksdbjni-8.8.1.jar scala-collection-compat_2.13/2.7.0//scala-collection-compat_2.13-2.7.0.jar scala-compiler/2.13.12//scala-compiler-2.13.12.jar scala-library/2.13.12//scala-library-2.13.12.jar diff --git a/pom.xml b/pom.xml index 6e118bb27f5a..2fc14a4cdede 100644 --- a/pom.xml +++ b/pom.xml @@ -677,7 +677,7 @@ org.rocksdb rocksdbjni -8.3.2 +8.8.1 ${leveldbjni.group} diff --git a/sql/core/benchmarks/StateStoreBasicOperationsBenchmark-jdk21-results.txt b/sql/core/benchmarks/StateStoreBasicOperationsBenchmark-jdk21-results.txt index f92ae8668e16..c0d710873aed 100644 --- a/sql/core/benchmarks/StateStoreBasicOperationsBenchmark-jdk21-results.txt +++ b/sql/core/benchmarks/StateStoreBasicOperationsBenchmark-jdk21-results.txt @@ -6,33 +6,33 @@ OpenJDK 64-Bit Server VM 21.0.1+12-LTS on Linux 5.15.0-1053-azure AMD EPYC 7763 64-Core Processor putting 1 rows (1 rows to overwrite - rate 100): Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per 
Row(ns) Relative --- -In-memory 5 6 0 1.8 541.4 1.0X -RocksDB (trackTotalNumberOfRows: true) 40 41 2 0.2 4023.4 0.1X -RocksDB (trackTotalNumberOfRows: false) 15 15 1 0.7 1452.5 0.4X +In-memory