[spark] branch master updated: [SPARK-43914][SQL] Assign names to the error class _LEGACY_ERROR_TEMP_[2433-2437]
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 1c8c47cb55d [SPARK-43914][SQL] Assign names to the error class _LEGACY_ERROR_TEMP_[2433-2437]
1c8c47cb55d is described below

commit 1c8c47cb55da75526fef4dd41ed0734b01e71814
Author: Jiaan Geng
AuthorDate: Wed Jun 28 08:22:01 2023 +0300

    [SPARK-43914][SQL] Assign names to the error class _LEGACY_ERROR_TEMP_[2433-2437]

    ### What changes were proposed in this pull request?
    The PR aims to assign names to the error classes _LEGACY_ERROR_TEMP_[2433-2437].

    ### Why are the changes needed?
    Improve the error framework.

    ### Does this PR introduce _any_ user-facing change?
    No.

    ### How was this patch tested?
    Existing test cases updated.

    Closes #41476 from beliefer/SPARK-43914.

    Authored-by: Jiaan Geng
    Signed-off-by: Max Gekk
---
 core/src/main/resources/error/error-classes.json   | 34 --
 .../sql/catalyst/analysis/CheckAnalysis.scala      | 65 +++
 .../sql/catalyst/analysis/AnalysisErrorSuite.scala | 74 --
 .../org/apache/spark/sql/DataFrameSuite.scala      | 14
 4 files changed, 120 insertions(+), 67 deletions(-)

diff --git a/core/src/main/resources/error/error-classes.json b/core/src/main/resources/error/error-classes.json
index 342af0ffa6c..e441686432a 100644
--- a/core/src/main/resources/error/error-classes.json
+++ b/core/src/main/resources/error/error-classes.json
@@ -5637,40 +5637,6 @@
       "Cannot change nullable column to non-nullable: ."
     ]
   },
-  "_LEGACY_ERROR_TEMP_2433" : {
-    "message" : [
-      "Only a single table generating function is allowed in a SELECT clause, found:",
-      "<sqlExprs>."
-    ]
-  },
-  "_LEGACY_ERROR_TEMP_2434" : {
-    "message" : [
-      "Failure when resolving conflicting references in Join:",
-      "<plan>",
-      "Conflicting attributes: <conflictingAttributes>."
-    ]
-  },
-  "_LEGACY_ERROR_TEMP_2435" : {
-    "message" : [
-      "Failure when resolving conflicting references in Intersect:",
-      "<plan>",
-      "Conflicting attributes: <conflictingAttributes>."
-    ]
-  },
-  "_LEGACY_ERROR_TEMP_2436" : {
-    "message" : [
-      "Failure when resolving conflicting references in Except:",
-      "<plan>",
-      "Conflicting attributes: <conflictingAttributes>."
-    ]
-  },
-  "_LEGACY_ERROR_TEMP_2437" : {
-    "message" : [
-      "Failure when resolving conflicting references in AsOfJoin:",
-      "<plan>",
-      "Conflicting attributes: <conflictingAttributes>."
-    ]
-  },
   "_LEGACY_ERROR_TEMP_2446" : {
     "message" : [
       "Operation not allowed: only works on table with location provided: "

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
index 7c0e8f1490d..a0296d27361 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
@@ -674,9 +674,8 @@ trait CheckAnalysis extends PredicateHelper with LookupCatalog with QueryErrorsB
           }

         case p @ Project(exprs, _) if containsMultipleGenerators(exprs) =>
-          p.failAnalysis(
-            errorClass = "_LEGACY_ERROR_TEMP_2433",
-            messageParameters = Map("sqlExprs" -> exprs.map(_.sql).mkString(",")))
+          val generators = exprs.filter(expr => expr.exists(_.isInstanceOf[Generator]))
+          throw QueryCompilationErrors.moreThanOneGeneratorError(generators, "SELECT")

         case p @ Project(projectList, _) =>
           projectList.foreach(_.transformDownWithPruning(
@@ -686,36 +685,48 @@ trait CheckAnalysis extends PredicateHelper with LookupCatalog with QueryErrorsB
         case j: Join if !j.duplicateResolved =>
-          val conflictingAttributes = j.left.outputSet.intersect(j.right.outputSet)
-          j.failAnalysis(
-            errorClass = "_LEGACY_ERROR_TEMP_2434",
-            messageParameters = Map(
-              "plan" -> plan.toString,
-              "conflictingAttributes" -> conflictingAttributes.mkString(",")))
+          val conflictingAttributes =
+            j.left.outputSet.intersect(j.right.outputSet).map(toSQLExpr(_)).mkString(", ")
+          throw SparkException.internalError(
+            msg = s"""
+              |Failure when resolving conflicting references in ${j.nodeName}:
+              |${planToString(plan)}
+              |Conflicting attributes: $conflictingAttributes.""".stripMargin,
+            context = j.origin.getQueryContext,
+            summary = j.origin.context.summary)

         case i: Intersect if !i.duplicateResolved =>
-          val conflictingAttributes =
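For context, the error-classes.json entries removed above are message templates whose `<name>` placeholders are filled from a `messageParameters` map (e.g. `sqlExprs`, `plan`, `conflictingAttributes` in the Scala code). The substitution mechanism can be sketched in a few lines of illustrative Python; the error-class name and message text below are examples, not necessarily the exact strings Spark registers:

```python
# Illustrative sketch of template-based error messages, modeled on the
# error-classes.json entries above. The class name and message text are
# hypothetical examples, not Spark's actual registry.
import re

ERROR_CLASSES = {
    "UNSUPPORTED_GENERATOR.MULTI_GENERATOR": [
        "Only one generator allowed per SELECT clause but found <num>: <generators>.",
    ],
}

def build_message(error_class: str, params: dict) -> str:
    """Join the template lines and replace each <name> placeholder."""
    template = " ".join(ERROR_CLASSES[error_class])
    return re.sub(r"<(\w+)>", lambda m: str(params[m.group(1)]), template)

msg = build_message(
    "UNSUPPORTED_GENERATOR.MULTI_GENERATOR",
    {"num": 2, "generators": "explode(a), explode(b)"},
)
print(msg)  # Only one generator allowed per SELECT clause but found 2: explode(a), explode(b).
```

Keeping the message text in a central JSON registry is what lets this PR swap an anonymous `_LEGACY_ERROR_TEMP_*` template for a named, documented error class without touching the call sites' message wording.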
[spark] branch master updated (f00de6f77a8 -> 3cd486070bf)
This is an automated email from the ASF dual-hosted git repository.

yao pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

    from f00de6f77a8 [SPARK-44182][DOCS] Use Spark version variables in Python and Spark Connect installation docs
     add 3cd486070bf [SPARK-44221][BUILD] Upgrade RoaringBitmap from 0.9.44 to 0.9.45

No new revisions were added by this update.

Summary of changes:
 core/benchmarks/MapStatusesConvertBenchmark-jdk11-results.txt | 10 +-
 core/benchmarks/MapStatusesConvertBenchmark-jdk17-results.txt | 10 +-
 core/benchmarks/MapStatusesConvertBenchmark-results.txt       | 10 +-
 dev/deps/spark-deps-hadoop-3-hive-2.3                         |  4 ++--
 pom.xml                                                       |  2 +-
 5 files changed, 18 insertions(+), 18 deletions(-)

To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-44182][DOCS] Use Spark version variables in Python and Spark Connect installation docs
This is an automated email from the ASF dual-hosted git repository.

yao pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new f00de6f77a8 [SPARK-44182][DOCS] Use Spark version variables in Python and Spark Connect installation docs
f00de6f77a8 is described below

commit f00de6f77a80182215d0f3c07441849f2654b210
Author: Dongjoon Hyun
AuthorDate: Wed Jun 28 11:16:42 2023 +0800

    [SPARK-44182][DOCS] Use Spark version variables in Python and Spark Connect installation docs

    ### What changes were proposed in this pull request?
    This PR aims to use Spark version placeholders in the Python and Spark Connect installation docs:
    - `site.SPARK_VERSION_SHORT` in `md` files
    - `|release|` in `rst` files

    ### Why are the changes needed?
    To keep the Apache Spark docs up to date automatically.

    ### Does this PR introduce _any_ user-facing change?
    No.

    ### How was this patch tested?
    Manual review.

    ![Screenshot 2023-06-26 at 1 51 42 PM](https://github.com/apache/spark/assets/9700541/d4bc8166-e5cf-4c61-a1ab-0aa65810dc51)
    ![Screenshot 2023-06-27 at 9 21 23 AM](https://github.com/apache/spark/assets/9700541/a5a5ed98-c37e-47c4-ba14-69923c50dfd7)

    Closes #41728 from dongjoon-hyun/SPARK-44182.

    Authored-by: Dongjoon Hyun
    Signed-off-by: Kent Yao
---
 docs/spark-connect-overview.md                 | 10 +-
 python/docs/source/getting_started/install.rst |  8
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/docs/spark-connect-overview.md b/docs/spark-connect-overview.md
index 55cc825a148..1e1464cfba0 100644
--- a/docs/spark-connect-overview.md
+++ b/docs/spark-connect-overview.md
@@ -93,7 +93,7 @@ the release drop down at the top of the page. Then choose your package type, typ
 Now extract the Spark package you just downloaded on your computer, for example:

 {% highlight bash %}
-tar -xvf spark-3.4.0-bin-hadoop3.tgz
+tar -xvf spark-{{site.SPARK_VERSION_SHORT}}-bin-hadoop3.tgz
 {% endhighlight %}

 In a terminal window, go to the `spark` folder in the location where you extracted
@@ -101,13 +101,13 @@ Spark before and run the `start-connect-server.sh` script to start Spark server
 Spark Connect, like in this example:

 {% highlight bash %}
-./sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:3.4.0
+./sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:{{site.SPARK_VERSION_SHORT}}
 {% endhighlight %}

-Note that we include a Spark Connect package (`spark-connect_2.12:3.4.0`), when starting
+Note that we include a Spark Connect package (`spark-connect_2.12:{{site.SPARK_VERSION_SHORT}}`), when starting
 Spark server. This is required to use Spark Connect. Make sure to use the same
 version of the package as the Spark version you downloaded previously. In this example,
-Spark 3.4.0 with Scala 2.12.
+Spark {{site.SPARK_VERSION_SHORT}} with Scala 2.12.

 Now Spark server is running and ready to accept Spark Connect sessions from client
 applications. In the next section we will walk through how to use Spark Connect
@@ -270,4 +270,4 @@ APIs you are using are available before migrating existing code to Spark Connect
 [functions](api/scala/org/apache/spark/sql/functions$.html), and
 [Column](api/scala/org/apache/spark/sql/Column.html).

-Support for more APIs is planned for upcoming Spark releases.
\ No newline at end of file
+Support for more APIs is planned for upcoming Spark releases.

diff --git a/python/docs/source/getting_started/install.rst b/python/docs/source/getting_started/install.rst
index b5256f2f2cb..eb296dc16d6 100644
--- a/python/docs/source/getting_started/install.rst
+++ b/python/docs/source/getting_started/install.rst
@@ -129,17 +129,17 @@ PySpark is included in the distributions available at the `Apache Spark website
 You can download a distribution you want from the site. After that, uncompress the tar file into the directory where you want
 to install Spark, for example, as below:

-.. code-block:: bash
+.. parsed-literal::

-    tar xzvf spark-3.4.0-bin-hadoop3.tgz
+    tar xzvf spark-\ |release|\-bin-hadoop3.tgz

 Ensure the ``SPARK_HOME`` environment variable points to the directory where the tar file has been extracted.
 Update ``PYTHONPATH`` environment variable such that it can find the PySpark and Py4J under ``SPARK_HOME/python/lib``.
 One example of doing this is shown below:

-.. code-block:: bash
+.. parsed-literal::

-    cd spark-3.4.0-bin-hadoop3
+    cd spark-\ |release|\-bin-hadoop3
     export SPARK_HOME=`pwd`
     export PYTHONPATH=$(ZIPS=("$SPARK_HOME"/python/lib/*.zip); IFS=:; echo "${ZIPS[*]}"):$PYTHONPATH
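The `.md` side of this change relies on Jekyll expanding `{{site.SPARK_VERSION_SHORT}}` at site-build time. The substitution itself is easy to picture with a minimal sketch (Jekyll's Liquid engine is of course far more general than this regex, and the version value here is just a placeholder):

```python
# Minimal sketch of Jekyll-style {{site.NAME}} placeholder expansion.
# Illustrative only; real Jekyll uses the Liquid template engine.
import re

def render(text: str, site: dict) -> str:
    """Replace every {{site.NAME}} occurrence with the value from `site`."""
    return re.sub(
        r"\{\{\s*site\.(\w+)\s*\}\}",
        lambda m: str(site[m.group(1)]),
        text,
    )

site = {"SPARK_VERSION_SHORT": "3.5.0"}  # hypothetical version value
line = "tar -xvf spark-{{site.SPARK_VERSION_SHORT}}-bin-hadoop3.tgz"
print(render(line, site))  # tar -xvf spark-3.5.0-bin-hadoop3.tgz
```

The benefit is exactly what the commit message states: the docs always render the version of the release they ship with, instead of a hard-coded `3.4.0` that goes stale.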
[spark] branch master updated: [SPARK-44206][SQL] DataSet.selectExpr scope Session.active
This is an automated email from the ASF dual-hosted git repository.

yao pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 1a1ae1f8cf6 [SPARK-44206][SQL] DataSet.selectExpr scope Session.active
1a1ae1f8cf6 is described below

commit 1a1ae1f8cf6c1ba6f8fc40081db4a0782a54dd66
Author: zml1206
AuthorDate: Wed Jun 28 11:02:09 2023 +0800

    [SPARK-44206][SQL] DataSet.selectExpr scope Session.active

    ### What changes were proposed in this pull request?
    `Dataset.selectExpr` is now covered by `withActive`, to scope `Session.active`.

    ### Why are the changes needed?
    [SPARK-30798](https://issues.apache.org/jira/browse/SPARK-30798) mentioned that all SparkSession dataset methods should be covered by `withActive`, but `selectExpr` was not. For example:

    ```
    val clone = spark.cloneSession()
    clone.conf.set("spark.sql.legacy.interval.enabled", "true")
    // sql1
    clone.sql("select '2023-01-01'+ INTERVAL 1 YEAR as b").show()
    // sql2
    clone.sql("select '2023-01-01' as a").selectExpr("a + INTERVAL 1 YEAR as b").show()
    ```

    sql1 executes successfully; sql2 fails.

    Error message:
    ```
    [DATATYPE_MISMATCH.BINARY_OP_DIFF_TYPES] Cannot resolve "(a + INTERVAL '1' YEAR)" due to data type mismatch: the left and right operands of the binary operator have incompatible types ("DOUBLE" and "INTERVAL YEAR").; line 1 pos 0;
    'Project [(cast(a#2 as double) + INTERVAL '1' YEAR) AS b#4]
    +- Project [2023-01-01 AS a#2]
       +- OneRowRelation

    org.apache.spark.sql.AnalysisException: [DATATYPE_MISMATCH.BINARY_OP_DIFF_TYPES] Cannot resolve "(a + INTERVAL '1' YEAR)" due to data type mismatch: the left and right operands of the binary operator have incompatible types ("DOUBLE" and "INTERVAL YEAR").; line 1 pos 0;
    'Project [(cast(a#2 as double) + INTERVAL '1' YEAR) AS b#4]
    +- Project [2023-01-01 AS a#2]
       +- OneRowRelation
        at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.dataTypeMismatch(package.scala:73)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$6(CheckAnalysis.scala:280)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$6$adapted(CheckAnalysis.scala:267)
        at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:295)
        at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:294)
        at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:294)
        at scala.collection.Iterator.foreach(Iterator.scala:943)
        at scala.collection.Iterator.foreach$(Iterator.scala:943)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
        at scala.collection.IterableLike.foreach(IterableLike.scala:74)
        at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
        at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
        at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:294)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$5(CheckAnalysis.scala:267)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$5$adapted(CheckAnalysis.scala:267)
        at scala.collection.immutable.Stream.foreach(Stream.scala:533)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$2(CheckAnalysis.scala:267)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$2$adapted(CheckAnalysis.scala:182)
        at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:295)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis0(CheckAnalysis.scala:182)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis0$(CheckAnalysis.scala:164)
        at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis0(Analyzer.scala:187)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:160)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:150)
        at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:187)
        at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:210)
        at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:330)
        at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:207)
        at org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:76)
        at
[spark] branch branch-3.4 updated: [SPARK-44206][SQL] DataSet.selectExpr scope Session.active
This is an automated email from the ASF dual-hosted git repository.

yao pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.4 by this push:
     new 03043823c3e [SPARK-44206][SQL] DataSet.selectExpr scope Session.active
03043823c3e is described below

commit 03043823c3ebf824d9423d5a1dd0c57d0ba8147b
Author: zml1206
AuthorDate: Wed Jun 28 11:02:09 2023 +0800

    [SPARK-44206][SQL] DataSet.selectExpr scope Session.active

    ### What changes were proposed in this pull request?
    `Dataset.selectExpr` is now covered by `withActive`, to scope `Session.active`.

    ### Why are the changes needed?
    [SPARK-30798](https://issues.apache.org/jira/browse/SPARK-30798) mentioned that all SparkSession dataset methods should be covered by `withActive`, but `selectExpr` was not. For example:

    ```
    val clone = spark.cloneSession()
    clone.conf.set("spark.sql.legacy.interval.enabled", "true")
    // sql1
    clone.sql("select '2023-01-01'+ INTERVAL 1 YEAR as b").show()
    // sql2
    clone.sql("select '2023-01-01' as a").selectExpr("a + INTERVAL 1 YEAR as b").show()
    ```

    sql1 executes successfully; sql2 fails.

    Error message:
    ```
    [DATATYPE_MISMATCH.BINARY_OP_DIFF_TYPES] Cannot resolve "(a + INTERVAL '1' YEAR)" due to data type mismatch: the left and right operands of the binary operator have incompatible types ("DOUBLE" and "INTERVAL YEAR").; line 1 pos 0;
    'Project [(cast(a#2 as double) + INTERVAL '1' YEAR) AS b#4]
    +- Project [2023-01-01 AS a#2]
       +- OneRowRelation

    org.apache.spark.sql.AnalysisException: [DATATYPE_MISMATCH.BINARY_OP_DIFF_TYPES] Cannot resolve "(a + INTERVAL '1' YEAR)" due to data type mismatch: the left and right operands of the binary operator have incompatible types ("DOUBLE" and "INTERVAL YEAR").; line 1 pos 0;
    'Project [(cast(a#2 as double) + INTERVAL '1' YEAR) AS b#4]
    +- Project [2023-01-01 AS a#2]
       +- OneRowRelation
        at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.dataTypeMismatch(package.scala:73)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$6(CheckAnalysis.scala:280)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$6$adapted(CheckAnalysis.scala:267)
        at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:295)
        at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:294)
        at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:294)
        at scala.collection.Iterator.foreach(Iterator.scala:943)
        at scala.collection.Iterator.foreach$(Iterator.scala:943)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
        at scala.collection.IterableLike.foreach(IterableLike.scala:74)
        at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
        at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
        at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:294)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$5(CheckAnalysis.scala:267)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$5$adapted(CheckAnalysis.scala:267)
        at scala.collection.immutable.Stream.foreach(Stream.scala:533)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$2(CheckAnalysis.scala:267)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$2$adapted(CheckAnalysis.scala:182)
        at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:295)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis0(CheckAnalysis.scala:182)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis0$(CheckAnalysis.scala:164)
        at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis0(Analyzer.scala:187)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:160)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:150)
        at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:187)
        at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:210)
        at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:330)
        at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:207)
        at org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:76)
        at
[spark] branch master updated: [SPARK-44039][CONNECT][TESTS] Improve for PlanGenerationTestSuite & ProtoToParsedPlanTestSuite
This is an automated email from the ASF dual-hosted git repository.

hvanhovell pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new c8c51d04741 [SPARK-44039][CONNECT][TESTS] Improve for PlanGenerationTestSuite & ProtoToParsedPlanTestSuite
c8c51d04741 is described below

commit c8c51d047411959ad6c648246a5bd6ea4ae13ce8
Author: panbingkun
AuthorDate: Tue Jun 27 21:37:33 2023 -0400

    [SPARK-44039][CONNECT][TESTS] Improve for PlanGenerationTestSuite & ProtoToParsedPlanTestSuite

    ### What changes were proposed in this pull request?
    The PR aims to improve PlanGenerationTestSuite & ProtoToParsedPlanTestSuite:
    - When generating `GOLDEN` files, first delete the corresponding directories and generate new ones, to avoid submitting redundant files during the review process. For example: suppose we write a test named `make_timestamp_ltz` for an overloaded method, and during review the reviewer asks for more tests for this method, so the test name changes in the next submission, e.g. to `make_timestamp_ltz without timezone`. At this point, the stale `queries/function_make_timestamp_ltz.json`, `queries/function_make_timestamp_ltz.proto.bin` and `explain-results/function_make_timestamp_ltz.explain` files of `function_make_timestamp_ltz` [...]
    - Clear and update some redundant files submitted incorrectly.

    ### Why are the changes needed?
    Make the code clear.

    ### Does this PR introduce _any_ user-facing change?
    No.

    ### How was this patch tested?
    Pass GA.

    Closes #41572 from panbingkun/SPARK-44039.

    Authored-by: panbingkun
    Signed-off-by: Herman van Hovell
---
 .../apache/spark/sql/PlanGenerationTestSuite.scala |  28 ++
 .../explain-results/function_percentile.explain    |   2 -
 .../function_regexp_extract_all.explain            |   2 -
 .../explain-results/function_regexp_instr.explain  |   2 -
 .../query-tests/explain-results/read_path.explain  |   1 -
 .../query-tests/queries/function_lit_array.json    |  58 ++---
 .../query-tests/queries/function_percentile.json   |  29 ---
 .../queries/function_percentile.proto.bin          | Bin 192 -> 0 bytes
 .../queries/function_regexp_extract_all.json       |  33 ----
 .../queries/function_regexp_extract_all.proto.bin  | Bin 212 -> 0 bytes
 .../query-tests/queries/function_regexp_instr.json |  33 ----
 .../queries/function_regexp_instr.proto.bin        | Bin 203 -> 0 bytes
 .../resources/query-tests/queries/read_path.json   |  11
 .../query-tests/queries/read_path.proto.bin        |   3 --
 .../queries/streaming_table_API_with_options.json  |   3 +-
 .../sql/connect/ProtoToParsedPlanTestSuite.scala   |  23
 16 files changed, 82 insertions(+), 146 deletions(-)

diff --git a/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/PlanGenerationTestSuite.scala b/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/PlanGenerationTestSuite.scala
index e8d04f37d7f..ecb7092b8d9 100644
--- a/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/PlanGenerationTestSuite.scala
+++ b/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/PlanGenerationTestSuite.scala
@@ -44,6 +44,7 @@ import org.apache.spark.sql.functions.lit
 import org.apache.spark.sql.protobuf.{functions => pbFn}
 import org.apache.spark.sql.types._
 import org.apache.spark.unsafe.types.CalendarInterval
+import org.apache.spark.util.Utils

 // scalastyle:off
 /**
@@ -61,6 +62,14 @@ import org.apache.spark.unsafe.types.CalendarInterval
 *   SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "connect-client-jvm/testOnly org.apache.spark.sql.PlanGenerationTestSuite"
 * }}}
 *
+ * If you need to clean the orphaned golden files, you need to set the
+ * SPARK_CLEAN_ORPHANED_GOLDEN_FILES=1 environment variable before running this test, e.g.:
+ * {{{
+ *   SPARK_CLEAN_ORPHANED_GOLDEN_FILES=1 build/sbt "connect-client-jvm/testOnly org.apache.spark.sql.PlanGenerationTestSuite"
+ * }}}
+ * Note: not all orphaned golden files should be cleaned; some are reserved for testing backwards
+ * compatibility.
+ *
 * Note that the plan protos are used as the input for the `ProtoToParsedPlanTestSuite` in the
 * `connector/connect/server` module
 */
@@ -74,6 +83,9 @@ class PlanGenerationTestSuite
   // Borrowed from SparkFunSuite
   private val regenerateGoldenFiles: Boolean = System.getenv("SPARK_GENERATE_GOLDEN_FILES") == "1"
+  private val cleanOrphanedGoldenFiles: Boolean =
+    System.getenv("SPARK_CLEAN_ORPHANED_GOLDEN_FILES") == "1"
+
   protected val queryFilePath: Path = commonResourcePath.resolve("query-tests/queries")

   // A relative path to /connector/connect/server, used by `ProtoToParsedPlanTestSuite` to run
@@ -111,9 +123,25 @@ class
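The "clean orphaned golden files" step described above boils down to: list the generated files, keep the ones whose base name still matches a current test (plus an explicit reserved set kept for backwards-compatibility testing), and delete the rest. A small, illustrative Python sketch of that logic (the real suite operates on `queries/*.json`, `*.proto.bin`, and `explain-results/*.explain`; the function and parameter names here are made up):

```python
# Illustrative sketch of orphaned-golden-file cleanup. Names are hypothetical;
# the real logic lives in PlanGenerationTestSuite (Scala).
import os

def clean_orphans(golden_dir, current_tests, reserved=()):
    """Delete files whose base name matches no current test and is not reserved.

    Returns the sorted list of removed file names.
    """
    removed = []
    for name in os.listdir(golden_dir):
        base = name.split(".")[0]  # e.g. "function_foo.proto.bin" -> "function_foo"
        if base not in current_tests and base not in reserved:
            os.remove(os.path.join(golden_dir, name))
            removed.append(name)
    return sorted(removed)
```

The `reserved` set captures the note in the scaladoc: some orphaned files are intentionally kept for backwards-compatibility tests, so a blind "delete everything not regenerated" would be wrong.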
[spark] branch master updated: [SPARK-44161][CONNECT] Handle Row input for UDFs
This is an automated email from the ASF dual-hosted git repository.

hvanhovell pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 05fc3497f00 [SPARK-44161][CONNECT] Handle Row input for UDFs
05fc3497f00 is described below

commit 05fc3497f00a0aad9240f14637ea21d271b2bbe4
Author: Zhen Li
AuthorDate: Tue Jun 27 21:05:33 2023 -0400

    [SPARK-44161][CONNECT] Handle Row input for UDFs

    ### What changes were proposed in this pull request?
    If the client passes Rows as inputs to UDFs, the Spark Connect planner will fail to create the RowEncoder for the Row input: the Row encoder sent by the client contains no field or schema information, and the real input schema should be obtained from the plan's output. This PR ensures that if the server planner fails to create the encoder for the UDF input using reflection, it falls back to RowEncoders created from the plan.output schema.

    This PR fixed [SPARK-43761](https://issues.apache.org/jira/browse/SPARK-43761) using the same logic.

    This PR resolved [SPARK-43796](https://issues.apache.org/jira/browse/SPARK-43796). That error was just caused by the case class defined in the test.

    ### Why are the changes needed?
    Fix the bug where a Row cannot be used as a UDF input.

    ### Does this PR introduce _any_ user-facing change?
    No.

    ### How was this patch tested?
    E2E tests.

    Closes #41704 from zhenlineo/rowEncoder.

    Authored-by: Zhen Li
    Signed-off-by: Herman van Hovell
---
 .../sql/expressions/UserDefinedFunction.scala      |  5 +-
 .../spark/sql/streaming/DataStreamWriter.scala     | 11 +--
 .../sql/KeyValueGroupedDatasetE2ETestSuite.scala   | 45 +
 .../sql/UserDefinedFunctionE2ETestSuite.scala      | 36 +-
 .../spark/sql/streaming/StreamingQuerySuite.scala  | 55 ---
 .../sql/connect/planner/SparkConnectPlanner.scala  | 78 ++
 .../spark/sql/catalyst/ScalaReflection.scala       | 38 ---
 .../spark/sql/catalyst/ScalaReflectionSuite.scala  | 21 +-
 .../spark/sql/streaming/DataStreamWriter.scala     |  6 +-
 9 files changed, 210 insertions(+), 85 deletions(-)

diff --git a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala
index bfcd4572e03..14dfc0c6a86 100644
--- a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala
+++ b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala
@@ -142,7 +142,10 @@ object ScalarUserDefinedFunction {

     ScalarUserDefinedFunction(
       function = function,
-      inputEncoders = parameterTypes.map(tag => ScalaReflection.encoderFor(tag)),
+      // Input can be a row because the input data schema can be found from the plan.
+      inputEncoders =
+        parameterTypes.map(tag => ScalaReflection.encoderForWithRowEncoderSupport(tag)),
+      // Output cannot be a row as there is no good way to get the return data type.
       outputEncoder = ScalaReflection.encoderFor(returnType))
   }

diff --git a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/streaming/DataStreamWriter.scala b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/streaming/DataStreamWriter.scala
index 263e1e372c8..ed3d2bb8558 100644
--- a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/streaming/DataStreamWriter.scala
+++ b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/streaming/DataStreamWriter.scala
@@ -30,7 +30,6 @@ import org.apache.spark.connect.proto.Command
 import org.apache.spark.connect.proto.WriteStreamOperationStart
 import org.apache.spark.internal.Logging
 import org.apache.spark.sql.{Dataset, ForeachWriter}
-import org.apache.spark.sql.catalyst.encoders.AgnosticEncoders.UnboundRowEncoder
 import org.apache.spark.sql.connect.common.ForeachWriterPacket
 import org.apache.spark.sql.execution.streaming.AvailableNowTrigger
 import org.apache.spark.sql.execution.streaming.ContinuousTrigger
@@ -215,15 +214,7 @@ final class DataStreamWriter[T] private[sql] (ds: Dataset[T]) extends Logging {
   * @since 3.5.0
   */
  def foreach(writer: ForeachWriter[T]): DataStreamWriter[T] = {
-    // TODO [SPARK-43761] Update this once resolved UnboundRowEncoder serialization issue.
-    // ds.encoder equal to UnboundRowEncoder means type parameter T is Row,
-    // which is not able to be serialized. Server will detect this and use default encoder.
-    val rowEncoder = if (ds.encoder != UnboundRowEncoder) {
-      ds.encoder
-    } else {
-      null
-    }
-    val serialized = Utils.serialize(ForeachWriterPacket(writer, rowEncoder))
+    val
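The fallback described above — derive an input encoder by reflection, and if the type is an untyped Row (which carries no field information), build a row encoder from the plan's output schema instead — can be sketched abstractly. All names below are illustrative stand-ins, not Spark's actual API:

```python
# Sketch of the encoder fallback: reflection first, plan-derived row encoder
# second. Illustrative only; Spark's encoders are Scala classes, not strings.
class Row:  # stand-in for an untyped row value with no schema information
    pass

def encoder_by_reflection(tpe):
    if tpe is Row:
        raise TypeError("Row carries no field information")
    return f"encoder[{tpe.__name__}]"

def input_encoder(tpe, plan_output_schema):
    try:
        return encoder_by_reflection(tpe)
    except TypeError:
        # Fall back: the real input schema is recoverable from the plan output.
        return f"row_encoder[{plan_output_schema}]"

print(input_encoder(int, "struct<a:int>"))  # encoder[int]
print(input_encoder(Row, "struct<a:int>"))  # row_encoder[struct<a:int>]
```

Note the asymmetry the diff comments call out: this fallback only works for *inputs*, because the input schema is available from the child plan; there is no comparable source of truth for a Row-typed *output*.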
[spark] branch master updated: [SPARK-43631][CONNECT][PS] Enable Series.interpolate with Spark Connect
This is an automated email from the ASF dual-hosted git repository. ruifengz pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 20a8fc87d67 [SPARK-43631][CONNECT][PS] Enable Series.interpolate with Spark Connect 20a8fc87d67 is described below commit 20a8fc87d67c842ac3386dc6ae0c53a9533900c2 Author: itholic AuthorDate: Tue Jun 27 14:05:42 2023 -0700 [SPARK-43631][CONNECT][PS] Enable Series.interpolate with Spark Connect ### What changes were proposed in this pull request? This PR proposes to add `LastNonNull` and `NullIndex` to SparkConnectPlanner to enable `Series.interpolate`. ### Why are the changes needed? To increase pandas API coverage ### Does this PR introduce _any_ user-facing change? Yes, `Series.interpolate` will be available from this fix. ### How was this patch tested? Reusing the existing UT. Closes #41670 from itholic/interpolate. Authored-by: itholic Signed-off-by: Ruifeng Zheng --- .../sql/connect/planner/SparkConnectPlanner.scala | 8 +++ python/pyspark/pandas/series.py| 9 --- python/pyspark/pandas/spark/functions.py | 28 ++ .../tests/connect/test_parity_generic_functions.py | 4 +++- python/pyspark/sql/utils.py| 14 ++- 5 files changed, 56 insertions(+), 7 deletions(-) diff --git a/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala b/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala index c19fc5fe90e..ff158990560 100644 --- a/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala +++ b/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala @@ -1768,6 +1768,14 @@ class SparkConnectPlanner(val sessionHolder: SessionHolder) extends Logging { val ignoreNA = extractBoolean(children(2), "ignoreNA") Some(EWM(children(0), alpha, ignoreNA)) + case 
"last_non_null" if fun.getArgumentsCount == 1 => +val children = fun.getArgumentsList.asScala.map(transformExpression) +Some(LastNonNull(children(0))) + + case "null_index" if fun.getArgumentsCount == 1 => +val children = fun.getArgumentsList.asScala.map(transformExpression) +Some(NullIndex(children(0))) + // ML-specific functions case "vector_to_array" if fun.getArgumentsCount == 2 => val expr = transformExpression(fun.getArguments(0)) diff --git a/python/pyspark/pandas/series.py b/python/pyspark/pandas/series.py index 0f1e814946a..95ca92e7878 100644 --- a/python/pyspark/pandas/series.py +++ b/python/pyspark/pandas/series.py @@ -53,7 +53,6 @@ from pandas.api.types import ( # type: ignore[attr-defined] CategoricalDtype, ) from pandas.tseries.frequencies import DateOffset -from pyspark import SparkContext from pyspark.sql import functions as F, Column as PySparkColumn, DataFrame as SparkDataFrame from pyspark.sql.types import ( ArrayType, @@ -70,7 +69,7 @@ from pyspark.sql.types import ( TimestampType, ) from pyspark.sql.window import Window -from pyspark.sql.utils import get_column_class +from pyspark.sql.utils import get_column_class, get_window_class from pyspark import pandas as ps # For running doctests and reference resolution in PyCharm. 
from pyspark.pandas._typing import Axis, Dtype, Label, Name, Scalar, T @@ -2257,10 +2256,10 @@ class Series(Frame, IndexOpsMixin, Generic[T]): return self._psdf.copy()._psser_for(self._column_label) scol = self.spark.column -sql_utils = SparkContext._active_spark_context._jvm.PythonSQLUtils -last_non_null = PySparkColumn(sql_utils.lastNonNull(scol._jc)) -null_index = PySparkColumn(sql_utils.nullIndex(scol._jc)) +last_non_null = SF.last_non_null(scol) +null_index = SF.null_index(scol) +Window = get_window_class() window_forward = Window.orderBy(NATURAL_ORDER_COLUMN_NAME).rowsBetween( Window.unboundedPreceding, Window.currentRow ) diff --git a/python/pyspark/pandas/spark/functions.py b/python/pyspark/pandas/spark/functions.py index 06d5692238d..44650fd4d20 100644 --- a/python/pyspark/pandas/spark/functions.py +++ b/python/pyspark/pandas/spark/functions.py @@ -157,3 +157,31 @@ def ewm(col: Column, alpha: float, ignore_na: bool) -> Column: else: sc = SparkContext._active_spark_context return Column(sc._jvm.PythonSQLUtils.ewm(col._jc, alpha, ignore_na)) + + +def last_non_null(col: Column) -> Column: +if is_remote(): +from pyspark.sql.connect.functions import _invoke_function_over_columns + +return _invoke_function_over_columns( # type:
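`Series.interpolate` fills nulls between known values. A plain-Python sketch of the default linear, forward-direction semantics (an illustration of the behavior only — the pandas-on-Spark implementation builds on the `LastNonNull` and `NullIndex` expressions over Spark windows, not on this code):

```python
def interpolate_linear(values):
    """Linear interpolation with pandas' default forward fill direction:
    interior null runs are filled linearly between their neighbors,
    trailing nulls take the last non-null value, leading nulls stay null."""
    result = list(values)
    n = len(result)
    i = 0
    while i < n:
        if result[i] is None:
            j = i
            while j < n and result[j] is None:
                j += 1  # j is the next non-null index (or n)
            if 0 < i and j < n:
                # Interior run: fill linearly between result[i-1] and result[j].
                start, end = result[i - 1], result[j]
                gap = j - (i - 1)
                for k in range(i, j):
                    result[k] = start + (end - start) * (k - (i - 1)) / gap
            elif 0 < i:
                # Trailing run: forward-fill with the last non-null value.
                for k in range(i, j):
                    result[k] = result[i - 1]
            # Leading run (i == 0): left as None, as pandas does.
            i = j
        else:
            i += 1
    return result
```

For example, `interpolate_linear([1.0, None, None, 4.0])` yields `[1.0, 2.0, 3.0, 4.0]`, matching `pd.Series([1, None, None, 4]).interpolate()`.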
[spark] branch master updated: [SPARK-43979][SQL][FOLLOW-UP] CollectedMetrics should be treated as the same one for self-join
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 954987f19dc [SPARK-43979][SQL][FOLLOW-UP] CollectedMetrics should be treated as the same one for self-join 954987f19dc is described below commit 954987f19dca67064268cde023d489eb22d81439 Author: Rui Wang AuthorDate: Tue Jun 27 10:15:02 2023 -0700 [SPARK-43979][SQL][FOLLOW-UP] CollectedMetrics should be treated as the same one for self-join ### What changes were proposed in this pull request? Use `transformUpWithNewOutput` rather than `resolveOperatorsUpWithNewOutput` to simplify the metrics plan. This handles the case where one plan is analyzed and the other is not. ### Why are the changes needed? To fix the case where we compare two `CollectedMetrics` plans and one is analyzed while the other is not. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests Closes #41745 from amaliujia/fix_metrics_path. Authored-by: Rui Wang Signed-off-by: Gengliang Wang --- .../scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala| 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala index 649140e466a..7c0e8f1490d 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala @@ -1080,7 +1080,7 @@ trait CheckAnalysis extends PredicateHelper with LookupCatalog with QueryErrorsB * duplicates metric definition.
*/ private def simplifyPlanForCollectedMetrics(plan: LogicalPlan): LogicalPlan = { -plan.resolveOperatorsUpWithNewOutput { +plan.transformUpWithNewOutput { case p: Project if p.projectList.size == p.child.output.size => val assignExprIdOnly = p.projectList.zip(p.child.output).forall { case (left: Alias, right: Attribute) => - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
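The `simplifyPlanForCollectedMetrics` helper shown in the diff normalizes away projections that merely re-assign expression IDs, so an analyzed and an unanalyzed copy of the same metrics plan compare equal. A toy Python analogue of that normalization (the dict-based plan shape is purely illustrative, not Catalyst's representation):

```python
def simplify(plan):
    # Remove Project nodes that only re-alias every child output
    # one-to-one, so two otherwise-identical plans compare equal.
    # plan: {"op": str, "columns": [str, ...], "child": plan-or-None}
    if plan is None:
        return None
    child = simplify(plan.get("child"))
    if (plan["op"] == "Project" and child is not None
            and plan["columns"] == child["columns"]):
        return child  # alias-only projection: drop it
    return {**plan, "child": child}
```

With this, a `Project` that passes through the same columns as its child collapses into the child, mirroring the intent of the Scala helper.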
[spark-connect-go] branch master updated: [SPARK-43351] Support more data types when reading from spark connect arrow dataset to data frame; Also implement CreateTempView
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark-connect-go.git The following commit(s) were added to refs/heads/master by this push: new e9001d2 [SPARK-43351] Support more data types when reading from spark connect arrow dataset to data frame; Also implement CreateTempView e9001d2 is described below commit e9001d2edbc2dd9ba83b7e721d79103bbc3bc598 Author: hiboyang <14280154+hiboy...@users.noreply.github.com> AuthorDate: Tue Jun 27 09:46:20 2023 -0700 [SPARK-43351] Support more data types when reading from spark connect arrow dataset to data frame; Also implement CreateTempView ### What changes were proposed in this pull request? Support more data types when reading from spark connect arrow dataset to data frame; Also implement CreateTempView ### Why are the changes needed? Support more data types when reading from spark connect arrow dataset to data frame; Also implement CreateTempView ### Does this PR introduce _any_ user-facing change? Yes, able to create temp view, e.g. ``` dataframe.CreateTempView(...) ``` ### How was this patch tested? Unit test, and also manual test by running example code Closes #11 from hiboyang/bo-dev-03. Authored-by: hiboyang <14280154+hiboy...@users.noreply.github.com> Signed-off-by: Hyukjin Kwon --- client/sql/dataframe.go | 184 ++--- client/sql/dataframe_test.go| 249 client/sql/datatype.go | 77 cmd/spark-connect-example-spark-session/main.go | 15 ++ 4 files changed, 497 insertions(+), 28 deletions(-) diff --git a/client/sql/dataframe.go b/client/sql/dataframe.go index eb1718a..f2a0747 100644 --- a/client/sql/dataframe.go +++ b/client/sql/dataframe.go @@ -37,6 +37,8 @@ type DataFrame interface { Collect() ([]Row, error) // Write returns a data frame writer, which could be used to save data frame to supported storage. Write() DataFrameWriter + // CreateTempView creates or replaces a temporary view. 
+ CreateTempView(viewName string, replace bool, global bool) error } // dataFrameImpl is an implementation of DataFrame interface. @@ -157,6 +159,30 @@ func (df *dataFrameImpl) Write() DataFrameWriter { return } +func (df *dataFrameImpl) CreateTempView(viewName string, replace bool, global bool) error { + plan := { + OpType: _Command{ + Command: { + CommandType: _CreateDataframeView{ + CreateDataframeView: { + Input:df.relation, + Name: viewName, + Replace: replace, + IsGlobal: global, + }, + }, + }, + }, + } + + responseClient, err := df.sparkSession.executePlan(plan) + if err != nil { + return fmt.Errorf("failed to create temp view %s: %w", viewName, err) + } + + return consumeExecutePlanClient(responseClient) +} + func (df *dataFrameImpl) createPlan() *proto.Plan { return { OpType: _Root{ @@ -208,38 +234,16 @@ func readArrowBatchData(data []byte, schema *StructType) ([]Row, error) { return nil, fmt.Errorf("failed to read arrow: %w", err) } } - numColumns := len(arrowReader.Schema().Fields()) + + values, err := readArrowRecord(record) + if err != nil { + return nil, err + } + numRows := int(record.NumRows()) if rows == nil { rows = make([]Row, 0, numRows) } - values := make([][]any, numRows) - for i := range values { - values[i] = make([]any, numColumns) - } - for columnIndex := 0; columnIndex < numColumns; columnIndex++ { - columnData := record.Column(columnIndex).Data() - dataTypeId := columnData.DataType().ID() - switch dataTypeId { - case arrow.STRING: - vector := array.NewStringData(columnData) - for rowIndex := 0; rowIndex < numRows; rowIndex++ { - values[rowIndex][columnIndex] = vector.Value(rowIndex) - } - case arrow.INT32: -
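The Go reader in the diff dispatches on each Arrow column's type ID (`arrow.STRING`, `arrow.INT32`, and so on) and then transposes the column-major record batch into rows. Stripped of the per-type vector handling, the core transpose looks like this sketch (plain Python for illustration; the actual implementation is the Go `readArrowRecord`):

```python
def columns_to_rows(columns):
    # Arrow record batches are column-major; a data frame exposes rows.
    # The real reader first extracts a typed vector per column before
    # this transpose; here the columns are already plain lists.
    if not columns:
        return []
    num_rows = len(columns[0])
    return [[col[row] for col in columns] for row in range(num_rows)]
```

Supporting a new data type in the Go client then amounts to adding one more typed-vector case before this transpose step.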
[spark] branch master updated: [SPARK-44171][SQL] Assign names to the error class _LEGACY_ERROR_TEMP_[2279-2282] & delete some unused error classes
This is an automated email from the ASF dual-hosted git repository. maxgekk pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new be8b07a1534 [SPARK-44171][SQL] Assign names to the error class _LEGACY_ERROR_TEMP_[2279-2282] & delete some unused error classes be8b07a1534 is described below commit be8b07a15348d8fea15c33d35a75969ca1693ff6 Author: panbingkun AuthorDate: Tue Jun 27 19:31:30 2023 +0300 [SPARK-44171][SQL] Assign names to the error class _LEGACY_ERROR_TEMP_[2279-2282] & delete some unused error classes ### What changes were proposed in this pull request? The pr aims to assign names to the error class _LEGACY_ERROR_TEMP_[2279-2282] and delete some unused error classes, details as follows: _LEGACY_ERROR_TEMP_0036 -> `Delete` _LEGACY_ERROR_TEMP_1341 -> `Delete` _LEGACY_ERROR_TEMP_1342 -> `Delete` _LEGACY_ERROR_TEMP_1304 -> `Delete` _LEGACY_ERROR_TEMP_2072 -> `Delete` _LEGACY_ERROR_TEMP_2279 -> `Delete` _LEGACY_ERROR_TEMP_2280 -> UNSUPPORTED_FEATURE.COMMENT_NAMESPACE _LEGACY_ERROR_TEMP_2281 -> UNSUPPORTED_FEATURE.REMOVE_NAMESPACE_COMMENT _LEGACY_ERROR_TEMP_2282 -> UNSUPPORTED_FEATURE.DROP_NAMESPACE_RESTRICT ### Why are the changes needed? The changes improve the error framework. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. Closes #41721 from panbingkun/SPARK-44171. 
Lead-authored-by: panbingkun Co-authored-by: panbingkun <84731...@qq.com> Signed-off-by: Max Gekk --- .../spark/sql/jdbc/v2/MySQLNamespaceSuite.scala| 19 +-- core/src/main/resources/error/error-classes.json | 60 ++ .../spark/sql/errors/QueryCompilationErrors.scala | 16 -- .../spark/sql/errors/QueryExecutionErrors.scala| 30 +-- .../org/apache/spark/sql/jdbc/MySQLDialect.scala | 6 +-- 5 files changed, 47 insertions(+), 84 deletions(-) diff --git a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/MySQLNamespaceSuite.scala b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/MySQLNamespaceSuite.scala index a7ef8d4e104..d58146fecdf 100644 --- a/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/MySQLNamespaceSuite.scala +++ b/connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/MySQLNamespaceSuite.scala @@ -73,7 +73,8 @@ class MySQLNamespaceSuite extends DockerJDBCIntegrationSuite with V2JDBCNamespac exception = intercept[SparkSQLFeatureNotSupportedException] { catalog.createNamespace(Array("foo"), Map("comment" -> "test comment").asJava) }, - errorClass = "_LEGACY_ERROR_TEMP_2280" + errorClass = "UNSUPPORTED_FEATURE.COMMENT_NAMESPACE", + parameters = Map("namespace" -> "`foo`") ) assert(catalog.namespaceExists(Array("foo")) === false) catalog.createNamespace(Array("foo"), Map.empty[String, String].asJava) @@ -84,13 +85,25 @@ class MySQLNamespaceSuite extends DockerJDBCIntegrationSuite with V2JDBCNamespac Array("foo"), NamespaceChange.setProperty("comment", "comment for foo")) }, - errorClass = "_LEGACY_ERROR_TEMP_2280") + errorClass = "UNSUPPORTED_FEATURE.COMMENT_NAMESPACE", + parameters = Map("namespace" -> "`foo`") +) checkError( exception = intercept[SparkSQLFeatureNotSupportedException] { catalog.alterNamespace(Array("foo"), NamespaceChange.removeProperty("comment")) }, - errorClass = "_LEGACY_ERROR_TEMP_2281") + errorClass = 
"UNSUPPORTED_FEATURE.REMOVE_NAMESPACE_COMMENT", + parameters = Map("namespace" -> "`foo`") +) + +checkError( + exception = intercept[SparkSQLFeatureNotSupportedException] { +catalog.dropNamespace(Array("foo"), cascade = false) + }, + errorClass = "UNSUPPORTED_FEATURE.DROP_NAMESPACE", + parameters = Map("namespace" -> "`foo`") +) catalog.dropNamespace(Array("foo"), cascade = true) assert(catalog.namespaceExists(Array("foo")) === false) } diff --git a/core/src/main/resources/error/error-classes.json b/core/src/main/resources/error/error-classes.json index 78b54d5230d..342af0ffa6c 100644 --- a/core/src/main/resources/error/error-classes.json +++ b/core/src/main/resources/error/error-classes.json @@ -2383,11 +2383,21 @@ "Combination of ORDER BY/SORT BY/DISTRIBUTE BY/CLUSTER BY." ] }, + "COMMENT_NAMESPACE" : { +"message" : [ + "Attach a comment to the namespace ." +] + }, "DESC_TABLE_COLUMN_PARTITION" : { "message" : [ "DESC TABLE COLUMN for a specific partition." ] }, + "DROP_NAMESPACE" : { +"message" : [ + "Drop the namespace ." +
[spark] branch master updated: [SPARK-44197][BUILD][FOLLOWUP] Update `IsolatedClientLoader` hadoop version
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 24960d84a9d [SPARK-44197][BUILD][FOLLOWUP] Update `IsolatedClientLoader` hadoop version 24960d84a9d is described below commit 24960d84a9dac17728822f3e783335f221c49da3 Author: panbingkun AuthorDate: Tue Jun 27 07:58:44 2023 -0700 [SPARK-44197][BUILD][FOLLOWUP] Update `IsolatedClientLoader` hadoop version ### What changes were proposed in this pull request? The pr aims to follow up SPARK-44197. ### Why are the changes needed? When the Hadoop version that Spark relies on is upgraded from `3.3.5` to `3.3.6`, the corresponding versions in `IsolatedClientLoader` should also be upgraded synchronously. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. Closes #41758 from panbingkun/SPARK-44197_FOLLOWUP. Authored-by: panbingkun Signed-off-by: Dongjoon Hyun --- assembly/README | 2 +- resource-managers/kubernetes/integration-tests/README.md| 2 +- .../scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala | 2 +- 3 files changed, 3 insertions(+), 3 deletions(-) diff --git a/assembly/README b/assembly/README index a380d8cb330..3dde243d3e6 100644 --- a/assembly/README +++ b/assembly/README @@ -9,4 +9,4 @@ This module is off by default. 
To activate it specify the profile in the command If you need to build an assembly for a different version of Hadoop the hadoop-version system property needs to be set as in this example: - -Dhadoop.version=3.3.5 + -Dhadoop.version=3.3.6 diff --git a/resource-managers/kubernetes/integration-tests/README.md b/resource-managers/kubernetes/integration-tests/README.md index 2944c189ed4..909e5b652d4 100644 --- a/resource-managers/kubernetes/integration-tests/README.md +++ b/resource-managers/kubernetes/integration-tests/README.md @@ -129,7 +129,7 @@ properties to Maven. For example: mvn integration-test -am -pl :spark-kubernetes-integration-tests_2.12 \ -Pkubernetes -Pkubernetes-integration-tests \ --Phadoop-3 -Dhadoop.version=3.3.5 \ +-Phadoop-3 -Dhadoop.version=3.3.6 \ -Dspark.kubernetes.test.sparkTgz=spark-3.0.0-SNAPSHOT-bin-example.tgz \ -Dspark.kubernetes.test.imageTag=sometag \ -Dspark.kubernetes.test.imageRepo=docker.io/somerepo \ diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala index 64718a9d35c..2765e6af521 100644 --- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala +++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala @@ -66,7 +66,7 @@ private[hive] object IsolatedClientLoader extends Logging { case e: RuntimeException if e.getMessage.contains("hadoop") => // If the error message contains hadoop, it is probably because the hadoop // version cannot be resolved. -val fallbackVersion = "3.3.5" +val fallbackVersion = "3.3.6" logWarning(s"Failed to resolve Hadoop artifacts for the version $hadoopVersion. We " + s"will change the hadoop version from $hadoopVersion to $fallbackVersion and try " + "again. 
It is recommended to set jars used by Hive metastore client through " +
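The `IsolatedClientLoader` change keeps the fallback constant in sync with the build's Hadoop version. The retry strategy itself — attempt the requested version, then retry with a known-good fallback when resolution fails with a Hadoop-related error — can be sketched as follows (`resolve`/`demo_resolve` are hypothetical stand-ins for the Ivy resolution call; only the `"3.3.6"` constant comes from this commit):

```python
def resolve_with_fallback(resolve, hadoop_version, fallback="3.3.6"):
    # Try to resolve artifacts for the requested Hadoop version; on a
    # hadoop-related resolution failure, retry once with the fallback.
    try:
        return resolve(hadoop_version)
    except RuntimeError as e:
        if "hadoop" in str(e) and hadoop_version != fallback:
            return resolve(fallback)
        raise

# Demo resolver (hypothetical): pretend only 3.3.x artifacts exist.
def demo_resolve(version):
    if not version.startswith("3.3."):
        raise RuntimeError("cannot resolve hadoop artifacts for " + version)
    return "jars-for-" + version
```

A request for an unresolvable version falls back transparently, which is exactly the behavior the warning message in the diff describes.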
[spark] branch master updated: [SPARK-43081][FOLLOW-UP][ML][CONNECT] Make torch dataloader support torch 1.x
This is an automated email from the ASF dual-hosted git repository. ruifengz pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new a758d6a0f9d [SPARK-43081][FOLLOW-UP][ML][CONNECT] Make torch dataloader support torch 1.x a758d6a0f9d is described below commit a758d6a0f9dfa32881cfcec263da0ab0c02f5c1d Author: Weichen Xu AuthorDate: Tue Jun 27 07:57:21 2023 -0700 [SPARK-43081][FOLLOW-UP][ML][CONNECT] Make torch dataloader support torch 1.x ### What changes were proposed in this pull request? Make torch dataloader support torch 1.x. Currently, when running with torch 1.x with num_workers == 0, an error is raised like: ``` ValueError: prefetch_factor option could only be specified in multiprocessing.let num_workers > 0 to enable multiprocessing. ``` ### Why are the changes needed? Compatibility fix. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manually ran unit tests with torch 1.x. Closes #41751 from WeichenXu123/support-torch-1.x. Authored-by: Weichen Xu Signed-off-by: Ruifeng Zheng --- python/pyspark/ml/torch/distributor.py | 9 - 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/python/pyspark/ml/torch/distributor.py b/python/pyspark/ml/torch/distributor.py index 9f9636e6b10..8b34acd959e 100644 --- a/python/pyspark/ml/torch/distributor.py +++ b/python/pyspark/ml/torch/distributor.py @@ -995,4 +995,11 @@ def _get_spark_partition_data_loader( dataset = _SparkPartitionTorchDataset(arrow_file, schema, num_samples) -return DataLoader(dataset, batch_size, num_workers=num_workers, prefetch_factor=prefetch_factor) +if num_workers > 0: +return DataLoader( +dataset, batch_size, num_workers=num_workers, prefetch_factor=prefetch_factor +) +else: +# if num_workers is zero, we cannot set `prefetch_factor` otherwise +# torch will raise error.
+return DataLoader(dataset, batch_size, num_workers=num_workers)
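The pattern in the fix — only pass `prefetch_factor` when workers are actually spawned — can be factored into a small keyword-argument builder. A sketch (the helper name is ours, not part of PySpark):

```python
def dataloader_kwargs(num_workers, prefetch_factor=2):
    # torch 1.x raises ValueError if prefetch_factor is supplied while
    # num_workers == 0 (no multiprocessing), so only include it when
    # workers are spawned; torch 2.x accepts either form.
    kwargs = {"num_workers": num_workers}
    if num_workers > 0:
        kwargs["prefetch_factor"] = prefetch_factor
    return kwargs
```

The call site can then read `DataLoader(dataset, batch_size, **dataloader_kwargs(num_workers, prefetch_factor))` and behave correctly on both torch generations.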
[spark] branch master updated: [SPARK-44204][SQL][HIVE] Add missing recordHiveCall for getPartitionNames
This is an automated email from the ASF dual-hosted git repository. yangjie01 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 43f7a86a05a [SPARK-44204][SQL][HIVE] Add missing recordHiveCall for getPartitionNames 43f7a86a05a is described below commit 43f7a86a05ad8c7ec7060607e43d9ca4d0fe4166 Author: Cheng Pan AuthorDate: Tue Jun 27 16:42:14 2023 +0800 [SPARK-44204][SQL][HIVE] Add missing recordHiveCall for getPartitionNames ### What changes were proposed in this pull request? The code was added by SPARK-35437, which appears to have omitted the `recordHiveCall` call before `hive.getPartitionNames`. ### Why are the changes needed? Correctly call `recordHiveCall` on each HMS invocation. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing CI and review. Closes #41756 from pan3793/SPARK-44204. Authored-by: Cheng Pan Signed-off-by: yangjie01 --- sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala | 1 + 1 file changed, 1 insertion(+) diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala index 08615b90d80..63f672b22ba 100644 --- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala +++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala @@ -1180,6 +1180,7 @@ private[client] class Shim_v0_13 extends Shim_v0_12 { }) } +recordHiveCall() val allPartitionNames = hive.getPartitionNames( table.getDbName, table.getTableName, -1).asScala val partNames = allPartitionNames.filter { p =>
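Spark's `recordHiveCall()` bumps a metric before every Hive metastore RPC; the bug here was simply one call site that skipped it. The general instrument-before-call pattern can be sketched as a Python decorator (illustrative only — the Scala shim calls `recordHiveCall()` inline, and the names below are ours):

```python
import functools

class CallRecorder:
    # Minimal analogue of a call-count metric for metastore RPCs.
    def __init__(self):
        self.count = 0

    def record(self, fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            self.count += 1  # record *before* invoking,
            return fn(*args, **kwargs)  # so failed calls still count
        return wrapper

recorder = CallRecorder()

@recorder.record
def get_partition_names(db, table):
    # Hypothetical stand-in for hive.getPartitionNames.
    return [f"{db}.{table}/part=0"]
```

Wrapping every RPC entry point this way makes it impossible to forget the metric at an individual call site, which is the failure mode this commit fixes.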
[spark] branch master updated: [SPARK-44192][BUILD][R] Support R 4.3.1
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 8229feab979 [SPARK-44192][BUILD][R] Support R 4.3.1 8229feab979 is described below commit 8229feab97959c5e90bd45be5d0979b0ae41d6e2 Author: yangjie01 AuthorDate: Mon Jun 26 23:55:41 2023 -0700 [SPARK-44192][BUILD][R] Support R 4.3.1 ### What changes were proposed in this pull request? This PR aims to support R 4.3.1 officially in Apache Spark 3.5.0 by upgrading AppVeyor to 4.3.1. ### Why are the changes needed? R 4.3.1 was released on Jun 16, 2023. - https://stat.ethz.ch/pipermail/r-announce/2023/000694.html ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Pass GitHub Actions Closes #41754 from LuciferYang/SPARK-44192. Authored-by: yangjie01 Signed-off-by: Dongjoon Hyun --- dev/appveyor-install-dependencies.ps1 | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/dev/appveyor-install-dependencies.ps1 b/dev/appveyor-install-dependencies.ps1 index 6abcc116346..6848d3af43d 100644 --- a/dev/appveyor-install-dependencies.ps1 +++ b/dev/appveyor-install-dependencies.ps1 @@ -129,7 +129,7 @@ $env:PATH = "$env:HADOOP_HOME\bin;" + $env:PATH Pop-Location # == R -$rVer = "4.3.0" +$rVer = "4.3.1" $rToolsVer = "4.0.2" InstallR
[spark-docker] branch master updated: [SPARK-40513][DOCS] Add apache/spark docker image overview
This is an automated email from the ASF dual-hosted git repository. yikun pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark-docker.git The following commit(s) were added to refs/heads/master by this push: new d02ff60 [SPARK-40513][DOCS] Add apache/spark docker image overview d02ff60 is described below commit d02ff6091835311a32c7ccc73d8ebae1d5817ecc Author: Yikun Jiang AuthorDate: Tue Jun 27 14:28:21 2023 +0800 [SPARK-40513][DOCS] Add apache/spark docker image overview ### What changes were proposed in this pull request? This PR add the `OVERVIEW.md`. ### Why are the changes needed? This will be used in the page of https://hub.docker.com/r/apache/spark to introduce the spark docker image and tag info. ### Does this PR introduce _any_ user-facing change? Yes, doc only ### How was this patch tested? Doc only, review. Closes #34 from Yikun/overview. Authored-by: Yikun Jiang Signed-off-by: Yikun Jiang --- OVERVIEW.md | 83 + 1 file changed, 83 insertions(+) diff --git a/OVERVIEW.md b/OVERVIEW.md new file mode 100644 index 000..046 --- /dev/null +++ b/OVERVIEW.md @@ -0,0 +1,83 @@ +# What is Apache Spark™? + +Apache Spark™ is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation graphs for data analysis. It also supports a rich set of higher-level tools including Spark SQL for SQL and DataFrames, pandas API on Spark for pandas workloads, MLlib for machine learning, GraphX for graph processing, and Structu [...] + +https://spark.apache.org/ + +## Online Documentation + +You can find the latest Spark documentation, including a programming guide, on the [project web page](https://spark.apache.org/documentation.html). This README file only contains basic setup instructions. 
+ +## Interactive Scala Shell + +The easiest way to start using Spark is through the Scala shell: + +``` +docker run -it apache/spark /opt/spark/bin/spark-shell +``` + +Try the following command, which should return 1,000,000,000: + +``` +scala> spark.range(1000 * 1000 * 1000).count() +``` + +## Interactive Python Shell + +The easiest way to start using PySpark is through the Python shell: + +``` +docker run -it apache/spark /opt/spark/bin/pyspark +``` + +And run the following command, which should also return 1,000,000,000: + +``` +>>> spark.range(1000 * 1000 * 1000).count() +``` + +## Interactive R Shell + +The easiest way to start using R on Spark is through the R shell: + +``` +docker run -it apache/spark:r /opt/spark/bin/sparkR +``` + +## Running Spark on Kubernetes + +https://spark.apache.org/docs/latest/running-on-kubernetes.html + +## Supported tags and respective Dockerfile links + +Currently, the `apache/spark` docker image supports 4 types for each version: + +Such as for v3.4.0: +- [3.4.0-scala2.12-java11-python3-ubuntu, 3.4.0-python3, 3.4.0, python3, latest](https://github.com/apache/spark-docker/tree/fe05e38f0ffad271edccd6ae40a77d5f14f3eef7/3.4.0/scala2.12-java11-python3-ubuntu) +- [3.4.0-scala2.12-java11-r-ubuntu, 3.4.0-r, r](https://github.com/apache/spark-docker/tree/fe05e38f0ffad271edccd6ae40a77d5f14f3eef7/3.4.0/scala2.12-java11-r-ubuntu) +- [3.4.0-scala2.12-java11-ubuntu, 3.4.0-scala, scala](https://github.com/apache/spark-docker/tree/fe05e38f0ffad271edccd6ae40a77d5f14f3eef7/3.4.0/scala2.12-java11-ubuntu) +- [3.4.0-scala2.12-java11-python3-r-ubuntu](https://github.com/apache/spark-docker/tree/fe05e38f0ffad271edccd6ae40a77d5f14f3eef7/3.4.0/scala2.12-java11-python3-r-ubuntu) + +## Environment Variable + +The environment variables of entrypoint.sh are listed below: + +| Environment Variable | Meaning | +|--|---| +| SPARK_EXTRA_CLASSPATH | The extra path to be added to the classpath, see also in 
https://spark.apache.org/docs/latest/running-on-kubernetes.html#dependency-management | +| PYSPARK_PYTHON | Python binary executable to use for PySpark in both driver and workers (default is python3 if available, otherwise python). Property spark.pyspark.python take precedence if it is set | +| PYSPARK_DRIVER_PYTHON | Python binary executable to use for PySpark in driver only (default is PYSPARK_PYTHON). Property spark.pyspark.driver.python take precedence if it is set | +| SPARK_DIST_CLASSPATH | Distribution-defined classpath to add to processes | +| SPARK_DRIVER_BIND_ADDRESS | Hostname or IP address where to bind listening sockets. See also `spark.driver.bindAddress` | +| SPARK_EXECUTOR_JAVA_OPTS | The Java opts of Spark Executor | +| SPARK_APPLICATION_ID | A unique identifier for the Spark application | +| SPARK_EXECUTOR_POD_IP | The Pod IP address of spark executor | +| SPARK_RESOURCE_PROFILE_ID |
[spark-docker] branch master updated: [SPARK-44175] Remove useless lib64 path link in dockerfile
This is an automated email from the ASF dual-hosted git repository. yikun pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark-docker.git The following commit(s) were added to refs/heads/master by this push: new 5405b49 [SPARK-44175] Remove useless lib64 path link in dockerfile 5405b49 is described below commit 5405b49b52aa1661d31ac80cdb8c9aad530d6847 Author: Yikun Jiang AuthorDate: Tue Jun 27 14:09:34 2023 +0800 [SPARK-44175] Remove useless lib64 path link in dockerfile ### What changes were proposed in this pull request? Remove useless lib64 path ### Why are the changes needed? Address comments: https://github.com/docker-library/official-images/pull/13089#issuecomment-1601813499 It was introduced by https://github.com/apache/spark/commit/f13ea15d79fb4752a0a75a05a4a89bd8625ea3d5 to address the issue about snappy on alpine OS, but we already switch the OS to ubuntu, so `/lib64` hack can be cleanup. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed Closes #48 from Yikun/rm-lib64-hack. 
Authored-by: Yikun Jiang Signed-off-by: Yikun Jiang --- 3.4.0/scala2.12-java11-ubuntu/Dockerfile | 1 - 3.4.1/scala2.12-java11-ubuntu/Dockerfile | 1 - Dockerfile.template | 1 - 3 files changed, 3 deletions(-) diff --git a/3.4.0/scala2.12-java11-ubuntu/Dockerfile b/3.4.0/scala2.12-java11-ubuntu/Dockerfile index 77ace47..854f86c 100644 --- a/3.4.0/scala2.12-java11-ubuntu/Dockerfile +++ b/3.4.0/scala2.12-java11-ubuntu/Dockerfile @@ -23,7 +23,6 @@ RUN groupadd --system --gid=${spark_uid} spark && \ RUN set -ex; \ apt-get update; \ -ln -s /lib /lib64; \ apt-get install -y gnupg2 wget bash tini libc6 libpam-modules krb5-user libnss3 procps net-tools gosu libnss-wrapper; \ mkdir -p /opt/spark; \ mkdir /opt/spark/python; \ diff --git a/3.4.1/scala2.12-java11-ubuntu/Dockerfile b/3.4.1/scala2.12-java11-ubuntu/Dockerfile index e782686..bf106a6 100644 --- a/3.4.1/scala2.12-java11-ubuntu/Dockerfile +++ b/3.4.1/scala2.12-java11-ubuntu/Dockerfile @@ -23,7 +23,6 @@ RUN groupadd --system --gid=${spark_uid} spark && \ RUN set -ex; \ apt-get update; \ -ln -s /lib /lib64; \ apt-get install -y gnupg2 wget bash tini libc6 libpam-modules krb5-user libnss3 procps net-tools gosu libnss-wrapper; \ mkdir -p /opt/spark; \ mkdir /opt/spark/python; \ diff --git a/Dockerfile.template b/Dockerfile.template index 6fedce9..80b57e2 100644 --- a/Dockerfile.template +++ b/Dockerfile.template @@ -23,7 +23,6 @@ RUN groupadd --system --gid=${spark_uid} spark && \ RUN set -ex; \ apt-get update; \ -ln -s /lib /lib64; \ apt-get install -y gnupg2 wget bash tini libc6 libpam-modules krb5-user libnss3 procps net-tools gosu libnss-wrapper; \ mkdir -p /opt/spark; \ mkdir /opt/spark/python; \