[spark] branch master updated (ef89b278f8e -> c119b8aec87)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

    from ef89b278f8e [SPARK-45096][INFRA] Optimize apt-get install in Dockerfile
     add c119b8aec87 [SPARK-45133][CONNECT] Make Spark Connect queries be FINISHED when last result task is finished

No new revisions were added by this update.

Summary of changes:
 .../sql/connect/execution/SparkConnectPlanExecution.scala | 15 ++-
 1 file changed, 10 insertions(+), 5 deletions(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-45096][INFRA] Optimize apt-get install in Dockerfile
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new ef89b278f8e [SPARK-45096][INFRA] Optimize apt-get install in Dockerfile
ef89b278f8e is described below

commit ef89b278f8e04b459cd7539fd16754d6cdc77a2d
Author: Ruifeng Zheng
AuthorDate: Wed Sep 13 10:19:12 2023 +0800

    [SPARK-45096][INFRA] Optimize apt-get install in Dockerfile

    ### What changes were proposed in this pull request?

    Follow the [Best practices for writing Dockerfiles](https://docs.docker.com/develop/develop-images/dockerfile_best-practices/#apt-get):

    > Always combine RUN apt-get update with apt-get install in the same RUN statement.

    ### Why are the changes needed?

    1. To address https://github.com/apache/spark/pull/42253#discussion_r1280479837
    2. When I attempted to change the apt-get install in https://github.com/apache/spark/pull/41918, the behavior was confusing. By following the best practices, further changes should work immediately.

    ### Does this PR introduce _any_ user-facing change?

    No, dev-only.

    ### How was this patch tested?

    CI

    ### Was this patch authored or co-authored using generative AI tooling?

    No

    Closes #42842 from zhengruifeng/infra_docker_file_opt.
Authored-by: Ruifeng Zheng
Signed-off-by: Ruifeng Zheng
---
 dev/infra/Dockerfile | 50 +++---
 1 file changed, 35 insertions(+), 15 deletions(-)

diff --git a/dev/infra/Dockerfile b/dev/infra/Dockerfile
index b69e682f239..60204dcc49e 100644
--- a/dev/infra/Dockerfile
+++ b/dev/infra/Dockerfile
@@ -24,19 +24,44 @@ ENV FULL_REFRESH_DATE 20221118
 ENV DEBIAN_FRONTEND noninteractive
 ENV DEBCONF_NONINTERACTIVE_SEEN true

-ARG APT_INSTALL="apt-get install --no-install-recommends -y"
-
-RUN apt-get clean
-RUN apt-get update
-RUN $APT_INSTALL software-properties-common git libxml2-dev pkg-config curl wget openjdk-8-jdk libpython3-dev python3-pip python3-setuptools python3.8 python3.9
-RUN update-alternatives --set java /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java
+RUN apt-get update && apt-get install -y \
+    software-properties-common \
+    git \
+    pkg-config \
+    curl \
+    wget \
+    openjdk-8-jdk \
+    gfortran \
+    libopenblas-dev \
+    liblapack-dev \
+    build-essential \
+    gnupg \
+    ca-certificates \
+    pandoc \
+    libpython3-dev \
+    python3-pip \
+    python3-setuptools \
+    python3.8 \
+    python3.9 \
+    r-base \
+    libcurl4-openssl-dev \
+    qpdf \
+    zlib1g-dev \
+    libssl-dev \
+    libpng-dev \
+    libharfbuzz-dev \
+    libfribidi-dev \
+    libtiff5-dev \
+    libgit2-dev \
+    libxml2-dev \
+    libjpeg-dev \
+    libfontconfig1-dev \
+    libfreetype6-dev \
+    && rm -rf /var/lib/apt/lists/*

 RUN curl -sS https://bootstrap.pypa.io/get-pip.py | python3.9
 RUN add-apt-repository ppa:pypy/ppa
-RUN apt update
-RUN $APT_INSTALL gfortran libopenblas-dev liblapack-dev
-RUN $APT_INSTALL build-essential

 RUN mkdir -p /usr/local/pypy/pypy3.8 && \
     curl -sqL https://downloads.python.org/pypy/pypy3.8-v7.3.11-linux64.tar.bz2 | tar xjf - -C /usr/local/pypy/pypy3.8 --strip-components=1 && \
@@ -45,19 +70,14 @@ RUN mkdir -p /usr/local/pypy/pypy3.8 && \

 RUN curl -sS https://bootstrap.pypa.io/get-pip.py | pypy3

-RUN $APT_INSTALL gnupg ca-certificates pandoc
 RUN echo 'deb https://cloud.r-project.org/bin/linux/ubuntu focal-cran40/' >> /etc/apt/sources.list
 RUN gpg --keyserver hkps://keyserver.ubuntu.com --recv-key E298A3A825C0D65DFD57CBB651716619E084DAB9
 RUN gpg -a --export E084DAB9 | apt-key add -
 RUN add-apt-repository 'deb https://cloud.r-project.org/bin/linux/ubuntu focal-cran40/'
-RUN apt update
-RUN $APT_INSTALL r-base libcurl4-openssl-dev qpdf libssl-dev zlib1g-dev
+
 RUN Rscript -e "install.packages(c('knitr', 'markdown', 'rmarkdown', 'testthat', 'devtools', 'e1071', 'survival', 'arrow', 'roxygen2', 'xml2'), repos='https://cloud.r-project.org/')"

 # See more in SPARK-39959, roxygen2 < 7.2.1
-RUN apt-get install -y libcurl4-openssl-dev libgit2-dev libssl-dev libxml2-dev \
-    libfontconfig1-dev libharfbuzz-dev libfribidi-dev libfreetype6-dev libpng-dev \
-    libtiff5-dev libjpeg-dev
 RUN Rscript -e "install.packages(c('devtools'), repos='https://cloud.r-project.org/')"
 RUN Rscript -e "devtools::install_version('roxygen2', version='7.2.0', repos='https://cloud.r-project.org')"
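For context, the best practice the commit follows can be shown with a minimal Dockerfile sketch (a hypothetical base image and package list for illustration, not taken from Spark's `dev/infra/Dockerfile`):

```dockerfile
# Sketch only: running `apt-get update` and `apt-get install` in the SAME
# RUN instruction ensures the package index is refreshed in the very layer
# that consumes it, so a stale cached index can never feed a later install.
FROM ubuntu:20.04

RUN apt-get update && apt-get install -y --no-install-recommends \
        curl \
        git \
    && rm -rf /var/lib/apt/lists/*   # drop the index to keep the layer small
```

If `update` and `install` sit in separate `RUN` instructions, Docker may reuse a cached (possibly months-old) `update` layer when only the install line changes, producing exactly the confusing "further changes don't take effect" behavior the commit message describes.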
[spark] branch master updated: [SPARK-45131][PYTHON][DOCS] Refine docstring of `ceil/ceiling/floor/round/bround`
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 74c78970cd9 [SPARK-45131][PYTHON][DOCS] Refine docstring of `ceil/ceiling/floor/round/bround`
74c78970cd9 is described below

commit 74c78970cd9e99aa750713574bf175fd1efac7c3
Author: panbingkun
AuthorDate: Wed Sep 13 10:17:42 2023 +0800

    [SPARK-45131][PYTHON][DOCS] Refine docstring of `ceil/ceiling/floor/round/bround`

    ### What changes were proposed in this pull request?

    This PR aims to refine the docstring of `ceil/ceiling/floor/round/bround`.

    ### Why are the changes needed?

    To improve PySpark documentation.

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    - Pass GA.

    ### Was this patch authored or co-authored using generative AI tooling?

    No.

    Closes #42892 from panbingkun/SPARK-45131.

Authored-by: panbingkun
Signed-off-by: Ruifeng Zheng
---
 python/pyspark/sql/functions.py | 48 +
 1 file changed, 34 insertions(+), 14 deletions(-)

diff --git a/python/pyspark/sql/functions.py b/python/pyspark/sql/functions.py
index d3ad7cfc84e..2d4194c98e9 100644
--- a/python/pyspark/sql/functions.py
+++ b/python/pyspark/sql/functions.py
@@ -1631,19 +1631,21 @@ def ceil(col: "ColumnOrName", scale: Optional[Union[Column, int]] = None) -> Col
     Parameters
     ----------
     col : :class:`~pyspark.sql.Column` or str
-        target column to compute on.
+        The target column or column name to compute the ceiling on.
     scale : :class:`~pyspark.sql.Column` or int
-        an optional parameter to control the rounding behavior.
+        An optional parameter to control the rounding behavior.

         .. versionadded:: 4.0.0

     Returns
     -------
     :class:`~pyspark.sql.Column`
-        the column for computed results.
+        A column for the computed results.

     Examples
     --------
+    Example 1: Compute the ceiling of a column value
+
     >>> from pyspark.sql import functions as sf
     >>> spark.range(1).select(sf.ceil(sf.lit(-0.1))).show()
     +--+
@@ -1652,6 +1654,8 @@ def ceil(col: "ColumnOrName", scale: Optional[Union[Column, int]] = None) -> Col
     | 0|
     +--+

+    Example 2: Compute the ceiling of a column value with a specified scale
+
     >>> from pyspark.sql import functions as sf
     >>> spark.range(1).select(sf.ceil(sf.lit(-0.1), 1)).show()
     +-+
@@ -1680,19 +1684,21 @@ def ceiling(col: "ColumnOrName", scale: Optional[Union[Column, int]] = None) ->
     Parameters
     ----------
     col : :class:`~pyspark.sql.Column` or str
-        target column to compute on.
+        The target column or column name to compute the ceiling on.
     scale : :class:`~pyspark.sql.Column` or int
-        an optional parameter to control the rounding behavior.
+        An optional parameter to control the rounding behavior.

         .. versionadded:: 4.0.0

     Returns
     -------
     :class:`~pyspark.sql.Column`
-        the column for computed results.
+        A column for the computed results.

     Examples
     --------
+    Example 1: Compute the ceiling of a column value
+
     >>> from pyspark.sql import functions as sf
     >>> spark.range(1).select(sf.ceiling(sf.lit(-0.1))).show()
     +-+
@@ -1701,6 +1707,8 @@ def ceiling(col: "ColumnOrName", scale: Optional[Union[Column, int]] = None) ->
     |0|
     +-+

+    Example 2: Compute the ceiling of a column value with a specified scale
+
     >>> from pyspark.sql import functions as sf
     >>> spark.range(1).select(sf.ceiling(sf.lit(-0.1), 1)).show()
     ++
@@ -1928,9 +1936,9 @@ def floor(col: "ColumnOrName", scale: Optional[Union[Column, int]] = None) -> Co
     Parameters
     ----------
     col : :class:`~pyspark.sql.Column` or str
-        column to find floor for.
+        The target column or column name to compute the floor on.
     scale : :class:`~pyspark.sql.Column` or int
-        an optional parameter to control the rounding behavior.
+        An optional parameter to control the rounding behavior.

         .. versionadded:: 4.0.0
@@ -1942,6 +1950,8 @@ def floor(col: "ColumnOrName", scale: Optional[Union[Column, int]] = None) -> Co
     Examples
     --------
+    Example 1: Compute the floor of a column value
+
     >>> import pyspark.sql.functions as sf
     >>> spark.range(1).select(sf.floor(sf.lit(2.5))).show()
     +--+
@@ -1950,6 +1960,8 @@ def floor(col: "ColumnOrName", scale: Optional[Union[Column, int]] = None) -> Co
     | 2|
     +--+

+    Example 2: Compute the floor of a column value with a specified scale
+
     >>> import
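For readers unfamiliar with the `scale` parameter in the docstrings above, its rounding semantics can be sketched in plain Python. This is an illustrative analogy only, not the PySpark implementation (which operates on Spark Columns and Decimal types):

```python
import math

def ceil_scaled(x: float, scale: int = 0) -> float:
    """Round x up at the given decimal position (scale=0 -> integer ceiling)."""
    factor = 10 ** scale
    return math.ceil(x * factor) / factor

def floor_scaled(x: float, scale: int = 0) -> float:
    """Round x down at the given decimal position."""
    factor = 10 ** scale
    return math.floor(x * factor) / factor

# ceil(-0.1) rounds up to 0, but with scale=1 the value -0.1 is already
# a multiple of 0.1, so it is returned unchanged -- matching the two
# docstring examples above.
print(ceil_scaled(-0.1))     # 0.0
print(ceil_scaled(-0.1, 1))  # -0.1
print(floor_scaled(2.5))     # 2.0
```

The same pattern explains `floor` with a scale: the value is shifted left by `scale` decimal digits, truncated toward the chosen direction, then shifted back.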
svn commit: r63958 - /dev/spark/v3.5.0-rc5-bin/ /release/spark/spark-3.5.0/
Author: gengliang
Date: Wed Sep 13 02:07:27 2023
New Revision: 63958

Log:
Apache Spark 3.5.0

Added:
    release/spark/spark-3.5.0/
      - copied from r63957, dev/spark/v3.5.0-rc5-bin/
Removed:
    dev/spark/v3.5.0-rc5-bin/
[spark] branch master updated: [SPARK-45120][UI] Upgrade d3 from v3 to v7(v7.8.5) and apply api changes in UI
This is an automated email from the ASF dual-hosted git repository.

yao pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 4221235a6b2 [SPARK-45120][UI] Upgrade d3 from v3 to v7(v7.8.5) and apply api changes in UI
4221235a6b2 is described below

commit 4221235a6b2742a42e63f143a62191f94ae05ed8
Author: Kent Yao
AuthorDate: Wed Sep 13 09:50:54 2023 +0800

    [SPARK-45120][UI] Upgrade d3 from v3 to v7(v7.8.5) and apply api changes in UI

    ### What changes were proposed in this pull request?

    This PR upgrades d3 from v3 to v7 (v7.8.5) and applies the resulting API changes to our UI drawing code. The related changes are listed below:

    - d3.svg.line -> d3.line
    - ~~d3.transfrom~~ (removed)
    - d3.layout.histogram -> d3.histogram() -> d3.bin()
      - d.x -> d.x0
      - d.dx -> d.x1 - d.x0
      - d.y -> d.length
    - d3.scale.linear() -> d3.scaleLinear
    - d3.svg.axis and axis.orient -> d3.axisTop, d3.axisRight, d3.axisBottom, d3.axisLeft
      - d3.svg.axis().scale(x).orient("bottom") -> d3.axisBottom
    - d3.time.format -> d3.timeFormat
    - d3.layout.stack ↦ d3.stack
    - d3.scale.ordinal ↦ d3.scaleOrdinal
    - d3.mouse -> d3.pointer
    - selection.on("mousemove", function(d) { … do something with d3.event and d … })
      becomes:
      selection.on("mousemove", function(event, d) { … do something with event and d … })
    - svg.selectAll("g rect")[0] -> svg.selectAll("g rect").nodes()

    Most of these changes come from v4 and v6; a full list of changes can be found at https://github.com/d3/d3/blob/main/CHANGES.md

    ### Why are the changes needed?

    d3 v3 is very old (Feb 11, 2015) and has been deprecated by some of the downstream libraries that we depend on.

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?
    This PR was locally tested with `org.apache.spark.examples.streaming.SqlNetworkWordCount`.

    Streaming page loading
    ![image](https://github.com/apache/spark/assets/8326978/d688b9a7-4e14-42de-a28c-9d81a96bc4d5)

    SQL page loading
    ![image](https://github.com/apache/spark/assets/8326978/a1cac4e8-8f1f-480e-a0ac-359cbcfefaee)

    ### Was this patch authored or co-authored using generative AI tooling?

    No.

    Closes #42879 from yaooqinn/SPARK-45120.

Authored-by: Kent Yao
Signed-off-by: Kent Yao
---
 LICENSE                                            |  6 ++-
 LICENSE-binary                                     |  6 ++-
 .../resources/org/apache/spark/ui/static/d3.min.js |  7 +---
 .../org/apache/spark/ui/static/spark-dag-viz.js    |  6 +-
 .../org/apache/spark/ui/static/streaming-page.js   | 48 --
 .../spark/ui/static/structured-streaming-page.js   | 31 ++
 licenses-binary/LICENSE-d3.min.js.txt              | 39 ++
 licenses/LICENSE-d3.min.js.txt                     | 39 ++
 .../spark/sql/execution/ui/static/spark-sql-viz.js |  4 +-
 9 files changed, 81 insertions(+), 105 deletions(-)

diff --git a/LICENSE b/LICENSE
index 1735d3208f2..3fee963db74 100644
--- a/LICENSE
+++ b/LICENSE
@@ -229,7 +229,6 @@ BSD 3-Clause
 python/lib/py4j-*-src.zip
 python/pyspark/cloudpickle/*.py
 python/pyspark/join.py
-core/src/main/resources/org/apache/spark/ui/static/d3.min.js

 The CSS style for the navigation sidebar of the documentation was originally
 submitted by Óscar Nájera for the scikit-learn project. The scikit-learn project
@@ -248,6 +247,11 @@ docs/js/vendor/anchor.min.js
 docs/js/vendor/jquery*
 docs/js/vendor/modernizer*

+ISC License
+---
+
+core/src/main/resources/org/apache/spark/ui/static/d3.min.js
+
 Creative Commons CC0 1.0 Universal Public Domain Dedication
 ---
diff --git a/LICENSE-binary b/LICENSE-binary
index 05645977a0b..900b6461106 100644
--- a/LICENSE-binary
+++ b/LICENSE-binary
@@ -461,7 +461,6 @@ org.jdom:jdom2
 python/lib/py4j-*-src.zip
 python/pyspark/cloudpickle.py
 python/pyspark/join.py
-core/src/main/resources/org/apache/spark/ui/static/d3.min.js

 The CSS style for the navigation sidebar of the documentation was originally
 submitted by Óscar Nájera for the scikit-learn project. The scikit-learn project
@@ -498,6 +497,11 @@ docs/js/vendor/anchor.min.js
 docs/js/vendor/jquery*
 docs/js/vendor/modernizer*

+ISC License
+---
+
+core/src/main/resources/org/apache/spark/ui/static/d3.min.js
+
 Common Development and Distribution License (CDDL) 1.0
 --
diff --git a/core/src/main/resources/org/apache/spark/ui/static/d3.min.js b/core/src/main/resources/org/apache/spark/ui/static/d3.min.js
index 30cd292198b..8d56002d90f 100644
---
svn commit: r63956 - /dev/spark/v3.5.0-rc5-docs/
Author: liyuanjian
Date: Tue Sep 12 20:53:39 2023
New Revision: 63956

Log:
Remove RC artifacts

Removed:
    dev/spark/v3.5.0-rc5-docs/
[spark] branch branch-3.5 updated: [SPARK-44872][CONNECT] Server testing infra and ReattachableExecuteSuite
This is an automated email from the ASF dual-hosted git repository.

hvanhovell pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.5 by this push:
     new af8c0b999be [SPARK-44872][CONNECT] Server testing infra and ReattachableExecuteSuite
af8c0b999be is described below

commit af8c0b999be746b661efe2439ac015a0c7d12c00
Author: Juliusz Sompolski
AuthorDate: Tue Sep 12 16:48:26 2023 +0200

    [SPARK-44872][CONNECT] Server testing infra and ReattachableExecuteSuite

    ### What changes were proposed in this pull request?

    Add `SparkConnectServerTest` with infra to test a real server with a real client in the same process, but communicating over RPC. Add `ReattachableExecuteSuite` with some tests for reattachable execute.

    Two bugs were found by the tests:
    * Fix a bug in `SparkConnectExecutionManager.createExecuteHolder` when attempting to resubmit an operation that was deemed abandoned. This bug is benign in reattachable execute, because reattachable execute would first send a ReattachExecute, which would be handled correctly in SparkConnectReattachExecuteHandler. For non-reattachable execute (disabled or old client), this is also a very unlikely scenario, because the retrying mechanism should be able to resubmit before the query is decl [...]
    * In `ExecuteGrpcResponseSender` there was an assertion that assumed that if `sendResponse` did not send, it was because the deadline was reached. But it can also be because of an interrupt. This would have resulted in an interrupt returning an assertion error instead of CURSOR_DISCONNECTED in testing. Outside of testing, assertions are not enabled, so this was not a problem there.

    ### Why are the changes needed?

    Testing of reattachable execute.

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    Tests added.

    Closes #42560 from juliuszsompolski/sc-reattachable-tests.
Authored-by: Juliusz Sompolski
Signed-off-by: Herman van Hovell
(cherry picked from commit 4b96add471d292ed5c63ccc625489ff78cfb9b25)
Signed-off-by: Herman van Hovell
---
 .../sql/connect/client/CloseableIterator.scala     |  22 +-
 .../client/CustomSparkConnectBlockingStub.scala    |   2 +-
 .../ExecutePlanResponseReattachableIterator.scala  |  18 +-
 .../connect/client/GrpcExceptionConverter.scala    |   5 +-
 .../sql/connect/client/GrpcRetryHandler.scala      |   4 +-
 .../execution/ExecuteGrpcResponseSender.scala      |  17 +-
 .../execution/ExecuteResponseObserver.scala        |   8 +-
 .../spark/sql/connect/service/ExecuteHolder.scala  |  10 +
 .../service/SparkConnectExecutionManager.scala     |  40 ++-
 .../spark/sql/connect/SparkConnectServerTest.scala | 261 +++
 .../execution/ReattachableExecuteSuite.scala       | 352 +
 .../scala/org/apache/spark/SparkFunSuite.scala     |  24 ++
 12 files changed, 735 insertions(+), 28 deletions(-)

diff --git a/connector/connect/common/src/main/scala/org/apache/spark/sql/connect/client/CloseableIterator.scala b/connector/connect/common/src/main/scala/org/apache/spark/sql/connect/client/CloseableIterator.scala
index 891e50ed6e7..d3fc9963edc 100644
--- a/connector/connect/common/src/main/scala/org/apache/spark/sql/connect/client/CloseableIterator.scala
+++ b/connector/connect/common/src/main/scala/org/apache/spark/sql/connect/client/CloseableIterator.scala
@@ -27,6 +27,20 @@ private[sql] trait CloseableIterator[E] extends Iterator[E] with AutoCloseable {
   }
 }

+private[sql] abstract class WrappedCloseableIterator[E] extends CloseableIterator[E] {
+
+  def innerIterator: Iterator[E]
+
+  override def next(): E = innerIterator.next()
+
+  override def hasNext(): Boolean = innerIterator.hasNext
+
+  override def close(): Unit = innerIterator match {
+    case it: CloseableIterator[E] => it.close()
+    case _ => // nothing
+  }
+}
+
 private[sql] object CloseableIterator {

   /**
@@ -35,12 +49,8 @@ private[sql] object CloseableIterator {
   def apply[T](iterator: Iterator[T]): CloseableIterator[T] = iterator match {
     case closeable: CloseableIterator[T] => closeable
     case _ =>
-      new CloseableIterator[T] {
-        override def next(): T = iterator.next()
-
-        override def hasNext(): Boolean = iterator.hasNext
-
-        override def close() = { /* empty */ }
+      new WrappedCloseableIterator[T] {
+        override def innerIterator = iterator
       }
   }
 }
diff --git a/connector/connect/common/src/main/scala/org/apache/spark/sql/connect/client/CustomSparkConnectBlockingStub.scala b/connector/connect/common/src/main/scala/org/apache/spark/sql/connect/client/CustomSparkConnectBlockingStub.scala
index 73ff01e223f..80edcfa8be1 100644
---
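The `WrappedCloseableIterator` delegation pattern in the diff above can be sketched in Python. This is an illustrative analogy only (hypothetical names, not Spark Connect's actual client code): iteration is forwarded to an inner iterator, and `close()` is forwarded only when the inner iterator is itself closeable.

```python
class CloseableIterator:
    """An iterator that also exposes close(), like Scala's CloseableIterator."""

    def __init__(self, inner):
        self._inner = iter(inner)

    def __iter__(self):
        return self

    def __next__(self):
        # Delegate iteration to the wrapped iterator.
        return next(self._inner)

    def close(self):
        # Forward close() only if the inner iterator is itself closeable,
        # mirroring the Scala pattern match in WrappedCloseableIterator.
        if isinstance(self._inner, CloseableIterator):
            self._inner.close()


def as_closeable(iterator):
    """Wrap a plain iterator; return already-closeable iterators unchanged."""
    if isinstance(iterator, CloseableIterator):
        return iterator
    return CloseableIterator(iterator)
```

The factory mirrors `CloseableIterator.apply`: callers always get something they can safely `close()`, whether or not the underlying iterator owns resources.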
[spark] branch branch-3.5 updated: [SPARK-45117][SQL] Implement missing otherCopyArgs for the MultiCommutativeOp expression
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.5 by this push:
     new 4e44d929005 [SPARK-45117][SQL] Implement missing otherCopyArgs for the MultiCommutativeOp expression
4e44d929005 is described below

commit 4e44d929005ac457fc853b256c02fd93f35fcceb
Author: Supun Nakandala
AuthorDate: Tue Sep 12 23:52:22 2023 +0800

    [SPARK-45117][SQL] Implement missing otherCopyArgs for the MultiCommutativeOp expression

    ### What changes were proposed in this pull request?

    - This PR implements the missing otherCopyArgs in the MultiCommutativeOp expression.

    ### Why are the changes needed?

    - Without this method implementation, calling toJSON will throw an exception from the TreeNode::jsonFields method.
    - This is because the jsonFields method has an assertion that the number of fields defined in the constructor is equal to the number of field values (productIterator.toSeq ++ otherCopyArgs).
    - The originalRoot field of the MultiCommutativeOp is not part of the productIterator. Hence, it has to be explicitly set in the otherCopyArgs field.

    ### Does this PR introduce _any_ user-facing change?

    - No

    ### How was this patch tested?

    - Added unit test

    ### Was this patch authored or co-authored using generative AI tooling?

    - No

    Closes #42873 from db-scnakandala/multi-commutative-op.
Authored-by: Supun Nakandala
Signed-off-by: Wenchen Fan
(cherry picked from commit d999f622dc68b4fb2734e2ac7cbe203b062c257f)
Signed-off-by: Wenchen Fan
---
 .../apache/spark/sql/catalyst/expressions/Expression.scala |  2 ++
 .../spark/sql/catalyst/expressions/CanonicalizeSuite.scala | 13 +
 2 files changed, 15 insertions(+)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala
index c2330cdb59d..bd7369e57b0 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala
@@ -1410,4 +1410,6 @@ case class MultiCommutativeOp(
   override protected def withNewChildrenInternal(newChildren: IndexedSeq[Expression]): Expression =
     this.copy(operands = newChildren)(originalRoot)
+
+  override protected final def otherCopyArgs: Seq[AnyRef] = originalRoot :: Nil
 }
diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CanonicalizeSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CanonicalizeSuite.scala
index 0e22b0d2876..89175ea1970 100644
--- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CanonicalizeSuite.scala
+++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CanonicalizeSuite.scala
@@ -338,4 +338,17 @@ class CanonicalizeSuite extends SparkFunSuite {
     SQLConf.get.setConfString(MULTI_COMMUTATIVE_OP_OPT_THRESHOLD.key, default.toString)
   }
+
+  test("toJSON works properly with MultiCommutativeOp") {
+    val default = SQLConf.get.getConf(MULTI_COMMUTATIVE_OP_OPT_THRESHOLD)
+    SQLConf.get.setConfString(MULTI_COMMUTATIVE_OP_OPT_THRESHOLD.key, "1")
+
+    val d = Decimal(1.2)
+    val literal1 = Literal.create(d, DecimalType(2, 1))
+    val literal2 = Literal.create(d, DecimalType(2, 1))
+    val literal3 = Literal.create(d, DecimalType(3, 2))
+    val op = Add(literal1, Add(literal2, literal3))
+    assert(op.canonicalized.toJSON.nonEmpty)
+    SQLConf.get.setConfString(MULTI_COMMUTATIVE_OP_OPT_THRESHOLD.key, default.toString)
+  }
 }
[spark] branch master updated: [SPARK-45117][SQL] Implement missing otherCopyArgs for the MultiCommutativeOp expression
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new d999f622dc6 [SPARK-45117][SQL] Implement missing otherCopyArgs for the MultiCommutativeOp expression
d999f622dc6 is described below

commit d999f622dc68b4fb2734e2ac7cbe203b062c257f
Author: Supun Nakandala
AuthorDate: Tue Sep 12 23:52:22 2023 +0800

    [SPARK-45117][SQL] Implement missing otherCopyArgs for the MultiCommutativeOp expression

    ### What changes were proposed in this pull request?

    - This PR implements the missing otherCopyArgs in the MultiCommutativeOp expression.

    ### Why are the changes needed?

    - Without this method implementation, calling toJSON will throw an exception from the TreeNode::jsonFields method.
    - This is because the jsonFields method has an assertion that the number of fields defined in the constructor is equal to the number of field values (productIterator.toSeq ++ otherCopyArgs).
    - The originalRoot field of the MultiCommutativeOp is not part of the productIterator. Hence, it has to be explicitly set in the otherCopyArgs field.

    ### Does this PR introduce _any_ user-facing change?

    - No

    ### How was this patch tested?

    - Added unit test

    ### Was this patch authored or co-authored using generative AI tooling?

    - No

    Closes #42873 from db-scnakandala/multi-commutative-op.
Authored-by: Supun Nakandala
Signed-off-by: Wenchen Fan
---
 .../apache/spark/sql/catalyst/expressions/Expression.scala |  2 ++
 .../spark/sql/catalyst/expressions/CanonicalizeSuite.scala | 13 +
 2 files changed, 15 insertions(+)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala
index c2330cdb59d..bd7369e57b0 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala
@@ -1410,4 +1410,6 @@ case class MultiCommutativeOp(
   override protected def withNewChildrenInternal(newChildren: IndexedSeq[Expression]): Expression =
     this.copy(operands = newChildren)(originalRoot)
+
+  override protected final def otherCopyArgs: Seq[AnyRef] = originalRoot :: Nil
 }
diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CanonicalizeSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CanonicalizeSuite.scala
index 0e22b0d2876..89175ea1970 100644
--- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CanonicalizeSuite.scala
+++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CanonicalizeSuite.scala
@@ -338,4 +338,17 @@ class CanonicalizeSuite extends SparkFunSuite {
     SQLConf.get.setConfString(MULTI_COMMUTATIVE_OP_OPT_THRESHOLD.key, default.toString)
   }
+
+  test("toJSON works properly with MultiCommutativeOp") {
+    val default = SQLConf.get.getConf(MULTI_COMMUTATIVE_OP_OPT_THRESHOLD)
+    SQLConf.get.setConfString(MULTI_COMMUTATIVE_OP_OPT_THRESHOLD.key, "1")
+
+    val d = Decimal(1.2)
+    val literal1 = Literal.create(d, DecimalType(2, 1))
+    val literal2 = Literal.create(d, DecimalType(2, 1))
+    val literal3 = Literal.create(d, DecimalType(3, 2))
+    val op = Add(literal1, Add(literal2, literal3))
+    assert(op.canonicalized.toJSON.nonEmpty)
+    SQLConf.get.setConfString(MULTI_COMMUTATIVE_OP_OPT_THRESHOLD.key, default.toString)
+  }
 }
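The invariant behind this fix can be illustrated with a small Python sketch. This is a loose analogy to Catalyst's `TreeNode` with hypothetical names — the real code is Scala — but it shows why a curried constructor argument must be surfaced through `otherCopyArgs`:

```python
class MultiCommutativeOpSketch:
    """Scala shape: case class MultiCommutativeOp(operands)(originalRoot).

    The curried parameter originalRoot is NOT part of productIterator,
    so jsonFields' arity assertion fails unless otherCopyArgs exposes it.
    """

    CONSTRUCTOR_ARITY = 2  # operands + originalRoot

    def __init__(self, operands, original_root):
        self.operands = operands
        self.original_root = original_root

    def product_iterator(self):
        # Curried (second-parameter-list) args are excluded, as in Scala.
        return [self.operands]

    def other_copy_args(self):
        # The fix: explicitly expose the curried argument here.
        return [self.original_root]

    def json_fields(self):
        values = self.product_iterator() + self.other_copy_args()
        # Analogue of TreeNode::jsonFields' assertion: the number of
        # constructor parameters must equal the number of field values.
        assert self.CONSTRUCTOR_ARITY == len(values), "toJSON would throw"
        return values


op = MultiCommutativeOpSketch(operands=["a", "b"], original_root="a + b")
print(len(op.json_fields()))  # 2
```

With `other_copy_args` returning an empty list (the pre-fix behavior), the assertion trips — the sketch of the exception the commit message describes for `toJSON`.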
[spark] branch branch-3.5 updated: Revert "[SPARK-45110][BUILD] Upgrade rocksdbjni to 8.5.3"
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.5 by this push:
     new 6a2284feaac Revert "[SPARK-45110][BUILD] Upgrade rocksdbjni to 8.5.3"
6a2284feaac is described below

commit 6a2284feaac4f632d645a93361d29e693eeb9d32
Author: Dongjoon Hyun
AuthorDate: Tue Sep 12 08:49:40 2023 -0700

    Revert "[SPARK-45110][BUILD] Upgrade rocksdbjni to 8.5.3"

    This reverts commit 6a2aa1d48c304095dcdf2816a46ec1f5a8af41a2.
---
 dev/deps/spark-deps-hadoop-3-hive-2.3              |   2 +-
 pom.xml                                            |   2 +-
 ...StoreBasicOperationsBenchmark-jdk11-results.txt | 120 ++---
 ...StoreBasicOperationsBenchmark-jdk17-results.txt | 120 ++---
 .../StateStoreBasicOperationsBenchmark-results.txt | 120 ++---
 5 files changed, 182 insertions(+), 182 deletions(-)

diff --git a/dev/deps/spark-deps-hadoop-3-hive-2.3 b/dev/deps/spark-deps-hadoop-3-hive-2.3
index 3d3f710e74c..1d02f8dba56 100644
--- a/dev/deps/spark-deps-hadoop-3-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-3-hive-2.3
@@ -227,7 +227,7 @@ parquet-jackson/1.13.1//parquet-jackson-1.13.1.jar
 pickle/1.3//pickle-1.3.jar
 py4j/0.10.9.7//py4j-0.10.9.7.jar
 remotetea-oncrpc/1.1.2//remotetea-oncrpc-1.1.2.jar
-rocksdbjni/8.5.3//rocksdbjni-8.5.3.jar
+rocksdbjni/8.3.2//rocksdbjni-8.3.2.jar
 scala-collection-compat_2.12/2.7.0//scala-collection-compat_2.12-2.7.0.jar
 scala-compiler/2.12.18//scala-compiler-2.12.18.jar
 scala-library/2.12.18//scala-library-2.12.18.jar
diff --git a/pom.xml b/pom.xml
index 70e1ee71568..8fc4b89a78c 100644
--- a/pom.xml
+++ b/pom.xml
@@ -679,7 +679,7 @@
       org.rocksdb
       rocksdbjni
-      8.5.3
+      8.3.2
       ${leveldbjni.group}
diff --git a/sql/core/benchmarks/StateStoreBasicOperationsBenchmark-jdk11-results.txt b/sql/core/benchmarks/StateStoreBasicOperationsBenchmark-jdk11-results.txt
index 70e9849572c..d5c175a320d 100644
--- a/sql/core/benchmarks/StateStoreBasicOperationsBenchmark-jdk11-results.txt
+++ b/sql/core/benchmarks/StateStoreBasicOperationsBenchmark-jdk11-results.txt
@@ -2,110 +2,110 @@ put rows
-OpenJDK 64-Bit Server VM 11.0.20.1+1 on Linux 5.15.0-1045-azure
-Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
+OpenJDK 64-Bit Server VM 11.0.19+7 on Linux 5.15.0-1040-azure
+Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
 putting 1 rows (1 rows to overwrite - rate 100):  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
 -------------------------------------------------------------------------------------------------------------------------
-In-memory                                  8   9  1  1.3   770.7  1.0X
-RocksDB (trackTotalNumberOfRows: true)    62  63  1  0.2  6174.3  0.1X
-RocksDB (trackTotalNumberOfRows: false)   22  23  1  0.5  2220.7  0.3X
+In-memory                                  9  11  2  1.1   872.7  1.0X
+RocksDB (trackTotalNumberOfRows: true)    61  63  1  0.2  6148.5  0.1X
+RocksDB (trackTotalNumberOfRows: false)   21  22  0  0.5  2108.9  0.4X

-OpenJDK 64-Bit Server VM 11.0.20.1+1 on Linux 5.15.0-1045-azure
-Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
+OpenJDK 64-Bit Server VM 11.0.19+7 on Linux 5.15.0-1040-azure
+Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
 putting 1 rows (5000 rows to overwrite - rate 50):  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
 ---------------------------------------------------------------------------------------------------------------------------
-In-memory                                  8   9  1  1.3   781.2  1.0X
-RocksDB (trackTotalNumberOfRows: true)    52  53  1  0.2  5196.0  0.2X
-RocksDB (trackTotalNumberOfRows: false)   22  24  1  0.4  2230.3  0.4X
+In-memory                                  9  10  1  1.1   872.0  1.0X
+RocksDB (trackTotalNumberOfRows: true)    51  53  1  0.2  5134.7  0.2X
+RocksDB (trackTotalNumberOfRows: false)   21
[spark] branch master updated: Revert "[SPARK-45110][BUILD] Upgrade rocksdbjni to 8.5.3"
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new f33c2051c01 Revert "[SPARK-45110][BUILD] Upgrade rocksdbjni to 8.5.3"
f33c2051c01 is described below

commit f33c2051c01eb09f6bbb602ed3c7c637e1c6f421
Author: Dongjoon Hyun
AuthorDate: Tue Sep 12 08:48:59 2023 -0700

    Revert "[SPARK-45110][BUILD] Upgrade rocksdbjni to 8.5.3"

    This reverts commit fa2bc21ba1e6cbde31f33faa681f5a1c47219c69.
---
 dev/deps/spark-deps-hadoop-3-hive-2.3              |   2 +-
 pom.xml                                            |   2 +-
 ...StoreBasicOperationsBenchmark-jdk11-results.txt | 120 ++---
 ...StoreBasicOperationsBenchmark-jdk17-results.txt | 120 ++---
 .../StateStoreBasicOperationsBenchmark-results.txt | 120 ++---
 5 files changed, 182 insertions(+), 182 deletions(-)

diff --git a/dev/deps/spark-deps-hadoop-3-hive-2.3 b/dev/deps/spark-deps-hadoop-3-hive-2.3
index 91cad456a21..652127a9bb8 100644
--- a/dev/deps/spark-deps-hadoop-3-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-3-hive-2.3
@@ -225,7 +225,7 @@ parquet-jackson/1.13.1//parquet-jackson-1.13.1.jar
 pickle/1.3//pickle-1.3.jar
 py4j/0.10.9.7//py4j-0.10.9.7.jar
 remotetea-oncrpc/1.1.2//remotetea-oncrpc-1.1.2.jar
-rocksdbjni/8.5.3//rocksdbjni-8.5.3.jar
+rocksdbjni/8.3.2//rocksdbjni-8.3.2.jar
 scala-collection-compat_2.12/2.7.0//scala-collection-compat_2.12-2.7.0.jar
 scala-compiler/2.12.18//scala-compiler-2.12.18.jar
 scala-library/2.12.18//scala-library-2.12.18.jar
diff --git a/pom.xml b/pom.xml
index c632186f6e5..02920c0ae74 100644
--- a/pom.xml
+++ b/pom.xml
@@ -681,7 +681,7 @@
       org.rocksdb
       rocksdbjni
-      8.5.3
+      8.3.2
       ${leveldbjni.group}
diff --git a/sql/core/benchmarks/StateStoreBasicOperationsBenchmark-jdk11-results.txt b/sql/core/benchmarks/StateStoreBasicOperationsBenchmark-jdk11-results.txt
index 70e9849572c..d5c175a320d 100644
--- a/sql/core/benchmarks/StateStoreBasicOperationsBenchmark-jdk11-results.txt
+++ b/sql/core/benchmarks/StateStoreBasicOperationsBenchmark-jdk11-results.txt
@@ -2,110 +2,110 @@ put rows
-OpenJDK 64-Bit Server VM 11.0.20.1+1 on Linux 5.15.0-1045-azure
-Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
+OpenJDK 64-Bit Server VM 11.0.19+7 on Linux 5.15.0-1040-azure
+Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
 putting 1 rows (1 rows to overwrite - rate 100):  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
 -------------------------------------------------------------------------------------------------------------------------
-In-memory                                  8   9  1  1.3   770.7  1.0X
-RocksDB (trackTotalNumberOfRows: true)    62  63  1  0.2  6174.3  0.1X
-RocksDB (trackTotalNumberOfRows: false)   22  23  1  0.5  2220.7  0.3X
+In-memory                                  9  11  2  1.1   872.7  1.0X
+RocksDB (trackTotalNumberOfRows: true)    61  63  1  0.2  6148.5  0.1X
+RocksDB (trackTotalNumberOfRows: false)   21  22  0  0.5  2108.9  0.4X

-OpenJDK 64-Bit Server VM 11.0.20.1+1 on Linux 5.15.0-1045-azure
-Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
+OpenJDK 64-Bit Server VM 11.0.19+7 on Linux 5.15.0-1040-azure
+Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
 putting 1 rows (5000 rows to overwrite - rate 50):  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
 ---------------------------------------------------------------------------------------------------------------------------
-In-memory                                  8   9  1  1.3   781.2  1.0X
-RocksDB (trackTotalNumberOfRows: true)    52  53  1  0.2  5196.0  0.2X
-RocksDB (trackTotalNumberOfRows: false)   22  24  1  0.4  2230.3  0.4X
+In-memory                                  9  10  1  1.1   872.0  1.0X
+RocksDB (trackTotalNumberOfRows: true)    51  53  1  0.2  5134.7  0.2X
+RocksDB (trackTotalNumberOfRows: false)   21
[spark] branch master updated (d8298bffd91 -> 4b96add471d)
This is an automated email from the ASF dual-hosted git repository.

hvanhovell pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

from d8298bffd91 [SPARK-45081][SQL] Encoders.bean does no longer work with read-only properties
add 4b96add471d [SPARK-44872][CONNECT] Server testing infra and ReattachableExecuteSuite

No new revisions were added by this update.

Summary of changes:
 .../sql/connect/client/CloseableIterator.scala | 22 +-
 .../client/CustomSparkConnectBlockingStub.scala| 2 +-
 .../ExecutePlanResponseReattachableIterator.scala | 18 +-
 .../connect/client/GrpcExceptionConverter.scala| 5 +-
 .../sql/connect/client/GrpcRetryHandler.scala | 4 +-
 .../execution/ExecuteGrpcResponseSender.scala | 17 +-
 .../execution/ExecuteResponseObserver.scala| 8 +-
 .../spark/sql/connect/service/ExecuteHolder.scala | 10 +
 .../service/SparkConnectExecutionManager.scala | 40 ++-
 .../spark/sql/connect/SparkConnectServerTest.scala | 261 +++
 .../execution/ReattachableExecuteSuite.scala | 352 +
 .../scala/org/apache/spark/SparkFunSuite.scala | 24 ++
 12 files changed, 735 insertions(+), 28 deletions(-)
 create mode 100644 connector/connect/server/src/test/scala/org/apache/spark/sql/connect/SparkConnectServerTest.scala
 create mode 100644 connector/connect/server/src/test/scala/org/apache/spark/sql/connect/execution/ReattachableExecuteSuite.scala
[spark] branch branch-3.5 updated: [SPARK-45081][SQL] Encoders.bean does no longer work with read-only properties
This is an automated email from the ASF dual-hosted git repository.

hvanhovell pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.5 by this push:
     new ffa4127c774 [SPARK-45081][SQL] Encoders.bean does no longer work with read-only properties
ffa4127c774 is described below

commit ffa4127c774ea13b4d6bbcc82bc5a9bee23d7156
Author: Giambattista Bloisi
AuthorDate: Tue Sep 12 16:16:04 2023 +0200

    [SPARK-45081][SQL] Encoders.bean does no longer work with read-only properties

    ### What changes were proposed in this pull request?
    This PR re-enables Encoders.bean to be called against beans having read-only properties, that is, properties that have only getters and no setter method. Beans with read-only properties are even used in internal tests.

    Setter methods of a Java bean encoder are stored within an Option wrapper because they are missing in the case of read-only properties. When a Java bean has to be initialized, setter methods for the bean properties have to be called: this PR filters read-only properties out of that process.

    ### Why are the changes needed?
    The changes are required to avoid an exception being thrown when getting the value of a None Option object.

    ### Does this PR introduce _any_ user-facing change?
    No

    ### How was this patch tested?
    An additional regression test has been added

    ### Was this patch authored or co-authored using generative AI tooling?
    No

    Closes #42829 from gbloisi-openaire/SPARK-45081.
Authored-by: Giambattista Bloisi Signed-off-by: Herman van Hovell (cherry picked from commit d8298bffd91de01299f9456b37e4454e8b4a6ae8) Signed-off-by: Herman van Hovell --- .../sql/connect/client/arrow/ArrowDeserializer.scala | 20 +++- .../spark/sql/catalyst/DeserializerBuildHelper.scala | 4 +++- .../test/org/apache/spark/sql/JavaDatasetSuite.java | 17 + 3 files changed, 31 insertions(+), 10 deletions(-) diff --git a/connector/connect/common/src/main/scala/org/apache/spark/sql/connect/client/arrow/ArrowDeserializer.scala b/connector/connect/common/src/main/scala/org/apache/spark/sql/connect/client/arrow/ArrowDeserializer.scala index cd54966ccf5..94295785987 100644 --- a/connector/connect/common/src/main/scala/org/apache/spark/sql/connect/client/arrow/ArrowDeserializer.scala +++ b/connector/connect/common/src/main/scala/org/apache/spark/sql/connect/client/arrow/ArrowDeserializer.scala @@ -332,15 +332,17 @@ object ArrowDeserializers { val constructor = methodLookup.findConstructor(tag.runtimeClass, MethodType.methodType(classOf[Unit])) val lookup = createFieldLookup(vectors) -val setters = fields.map { field => - val vector = lookup(field.name) - val deserializer = deserializerFor(field.enc, vector, timeZoneId) - val setter = methodLookup.findVirtual( -tag.runtimeClass, -field.writeMethod.get, -MethodType.methodType(classOf[Unit], field.enc.clsTag.runtimeClass)) - (bean: Any, i: Int) => setter.invoke(bean, deserializer.get(i)) -} +val setters = fields + .filter(_.writeMethod.isDefined) + .map { field => +val vector = lookup(field.name) +val deserializer = deserializerFor(field.enc, vector, timeZoneId) +val setter = methodLookup.findVirtual( + tag.runtimeClass, + field.writeMethod.get, + MethodType.methodType(classOf[Unit], field.enc.clsTag.runtimeClass)) +(bean: Any, i: Int) => setter.invoke(bean, deserializer.get(i)) + } new StructFieldSerializer[Any](struct) { def value(i: Int): Any = { val instance = constructor.invoke() diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/DeserializerBuildHelper.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/DeserializerBuildHelper.scala index 16a7d7ff065..0b88d5a4130 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/DeserializerBuildHelper.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/DeserializerBuildHelper.scala @@ -390,7 +390,9 @@ object DeserializerBuildHelper { CreateExternalRow(convertedFields, enc.schema)) case JavaBeanEncoder(tag, fields) => - val setters = fields.map { f => + val setters = fields +.filter(_.writeMethod.isDefined) +.map { f => val newTypePath = walkedTypePath.recordField( f.enc.clsTag.runtimeClass.getName, f.name) diff --git a/sql/core/src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java b/sql/core/src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java index 4f7cf8da787..f416d411322 100644 ---
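The failure mode fixed above is easy to reproduce outside Spark: `java.beans.Introspector` reports a `null` write method for a getter-only property, which is why the encoder stores setters in an `Option` and must skip read-only properties when initializing a bean. A minimal sketch (the bean class and property names here are illustrative, not from the Spark code base):

```java
import java.beans.BeanInfo;
import java.beans.IntrospectionException;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;

public class ReadOnlyBeanDemo {
    // A bean with one read-write and one read-only property.
    public static class Item {
        private String name;
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        // Read-only: a getter but no setter, so its write method is null.
        public int getVersion() { return 1; }
    }

    public static void main(String[] args) throws IntrospectionException {
        BeanInfo info = Introspector.getBeanInfo(Item.class, Object.class);
        for (PropertyDescriptor pd : info.getPropertyDescriptors()) {
            // Mirrors the fix: only properties with a write method get a setter;
            // calling getWriteMethod() blindly is the pre-fix "None.get" trap.
            boolean writable = pd.getWriteMethod() != null;
            System.out.println(pd.getName() + " writable=" + writable);
        }
    }
}
```

Filtering on a present write method, as the patch does with `.filter(_.writeMethod.isDefined)`, is the natural analogue of the `pd.getWriteMethod() != null` check here.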
[spark] branch master updated: [SPARK-45081][SQL] Encoders.bean does no longer work with read-only properties
This is an automated email from the ASF dual-hosted git repository.

hvanhovell pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new d8298bffd91 [SPARK-45081][SQL] Encoders.bean does no longer work with read-only properties
d8298bffd91 is described below

commit d8298bffd91de01299f9456b37e4454e8b4a6ae8
Author: Giambattista Bloisi
AuthorDate: Tue Sep 12 16:16:04 2023 +0200

    [SPARK-45081][SQL] Encoders.bean does no longer work with read-only properties

    ### What changes were proposed in this pull request?
    This PR re-enables Encoders.bean to be called against beans having read-only properties, that is, properties that have only getters and no setter method. Beans with read-only properties are even used in internal tests.

    Setter methods of a Java bean encoder are stored within an Option wrapper because they are missing in the case of read-only properties. When a Java bean has to be initialized, setter methods for the bean properties have to be called: this PR filters read-only properties out of that process.

    ### Why are the changes needed?
    The changes are required to avoid an exception being thrown when getting the value of a None Option object.

    ### Does this PR introduce _any_ user-facing change?
    No

    ### How was this patch tested?
    An additional regression test has been added

    ### Was this patch authored or co-authored using generative AI tooling?
    No

    Closes #42829 from gbloisi-openaire/SPARK-45081.
Authored-by: Giambattista Bloisi Signed-off-by: Herman van Hovell --- .../sql/connect/client/arrow/ArrowDeserializer.scala | 20 +++- .../spark/sql/catalyst/DeserializerBuildHelper.scala | 4 +++- .../test/org/apache/spark/sql/JavaDatasetSuite.java | 17 + 3 files changed, 31 insertions(+), 10 deletions(-) diff --git a/connector/connect/common/src/main/scala/org/apache/spark/sql/connect/client/arrow/ArrowDeserializer.scala b/connector/connect/common/src/main/scala/org/apache/spark/sql/connect/client/arrow/ArrowDeserializer.scala index cd54966ccf5..94295785987 100644 --- a/connector/connect/common/src/main/scala/org/apache/spark/sql/connect/client/arrow/ArrowDeserializer.scala +++ b/connector/connect/common/src/main/scala/org/apache/spark/sql/connect/client/arrow/ArrowDeserializer.scala @@ -332,15 +332,17 @@ object ArrowDeserializers { val constructor = methodLookup.findConstructor(tag.runtimeClass, MethodType.methodType(classOf[Unit])) val lookup = createFieldLookup(vectors) -val setters = fields.map { field => - val vector = lookup(field.name) - val deserializer = deserializerFor(field.enc, vector, timeZoneId) - val setter = methodLookup.findVirtual( -tag.runtimeClass, -field.writeMethod.get, -MethodType.methodType(classOf[Unit], field.enc.clsTag.runtimeClass)) - (bean: Any, i: Int) => setter.invoke(bean, deserializer.get(i)) -} +val setters = fields + .filter(_.writeMethod.isDefined) + .map { field => +val vector = lookup(field.name) +val deserializer = deserializerFor(field.enc, vector, timeZoneId) +val setter = methodLookup.findVirtual( + tag.runtimeClass, + field.writeMethod.get, + MethodType.methodType(classOf[Unit], field.enc.clsTag.runtimeClass)) +(bean: Any, i: Int) => setter.invoke(bean, deserializer.get(i)) + } new StructFieldSerializer[Any](struct) { def value(i: Int): Any = { val instance = constructor.invoke() diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/DeserializerBuildHelper.scala 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/DeserializerBuildHelper.scala index 16a7d7ff065..0b88d5a4130 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/DeserializerBuildHelper.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/DeserializerBuildHelper.scala @@ -390,7 +390,9 @@ object DeserializerBuildHelper { CreateExternalRow(convertedFields, enc.schema)) case JavaBeanEncoder(tag, fields) => - val setters = fields.map { f => + val setters = fields +.filter(_.writeMethod.isDefined) +.map { f => val newTypePath = walkedTypePath.recordField( f.enc.clsTag.runtimeClass.getName, f.name) diff --git a/sql/core/src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java b/sql/core/src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java index 4f7cf8da787..f416d411322 100644 --- a/sql/core/src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java +++ b/sql/core/src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java @@ -1783,6
[spark] branch master updated: [MINOR][DOCS] Add errors.rst to .gitignore
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 2b731565941 [MINOR][DOCS] Add errors.rst to .gitignore
2b731565941 is described below

commit 2b731565941a9aedd52e51a10aabed4115be41cf
Author: panbingkun
AuthorDate: Tue Sep 12 19:01:34 2023 +0900

    [MINOR][DOCS] Add errors.rst to .gitignore

    ### What changes were proposed in this pull request?
    After PR [[SPARK-44945][DOCS][PYTHON] Automate PySpark error class documentation](https://github.com/apache/spark/pull/42658), the `errors.rst` file is automatically generated during the documentation build (screenshot: https://github.com/apache/spark/assets/15246973/f6048e2e-7fc8-4930-9c11-767ccbfa1c68).

    This PR adds the file `python/docs/source/development/errors.rst` to `.gitignore`.

    ### Why are the changes needed?
    - To prevent developers from accidentally committing generated files when working on docs.
    - Based on what we have seen, the file `python/docs/source/user_guide/pandas_on_spark/supported_pandas_api.rst`, which is also generated during the documentation build, has likewise been added to `.gitignore`:
      https://github.com/apache/spark/blob/994b6976b2a5a53323a83e70e0c6195cd74292a1/python/docs/source/conf.py#L26-L41
      https://github.com/apache/spark/blob/994b6976b2a5a53323a83e70e0c6195cd74292a1/.gitignore#L77

    ### Does this PR introduce _any_ user-facing change?
    No.

    ### How was this patch tested?
    Manually test.

    ### Was this patch authored or co-authored using generative AI tooling?
    No.

    Closes #42885 from panbingkun/minor_errors_rst.
Authored-by: panbingkun
Signed-off-by: Hyukjin Kwon
---
 .gitignore | 1 +
 1 file changed, 1 insertion(+)

diff --git a/.gitignore b/.gitignore
index 064b502175b..174f66c6064 100644
--- a/.gitignore
+++ b/.gitignore
@@ -73,6 +73,7 @@ python/.eggs/
 python/coverage.xml
 python/deps
 python/docs/_site/
+python/docs/source/development/errors.rst
 python/docs/source/reference/**/api/
 python/docs/source/user_guide/pandas_on_spark/supported_pandas_api.rst
 python/test_coverage/coverage_data
[spark] branch master updated: [SPARK-44911][SQL] Create hive table with invalid column should return error class
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 1e03db36a93 [SPARK-44911][SQL] Create hive table with invalid column should return error class
1e03db36a93 is described below

commit 1e03db36a939aea5b4d55059967ccde96cb29564
Author: ming95 <505306...@qq.com>
AuthorDate: Tue Sep 12 11:55:08 2023 +0300

    [SPARK-44911][SQL] Create hive table with invalid column should return error class

    ### What changes were proposed in this pull request?
    Creating a Hive table with an invalid column name should return an error class. Run the SQL:

    ```
    create table test stored as parquet as select id, date'2018-01-01' + make_dt_interval(0, id) from range(0, 10)
    ```

    Before this change, the error was:

    ```
    org.apache.spark.sql.AnalysisException: Cannot create a table having a column whose name contains commas in Hive metastore. Table: `spark_catalog`.`default`.`test`; Column: DATE '2018-01-01' + make_dt_interval(0, id, 0, 0.00)
    at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$verifyDataSchema$4(HiveExternalCatalog.scala:175)
    at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$verifyDataSchema$4$adapted(HiveExternalCatalog.scala:171)
    at scala.collection.Iterator.foreach(Iterator.scala:943)
    ```

    After this change:

    ```
    Exception in thread "main" org.apache.spark.sql.AnalysisException: [INVALID_HIVE_COLUMN_NAME] Cannot create the table `spark_catalog`.`default`.`parquet_ds1` having the column `DATE '2018-01-01' + make_dt_interval(0, id, 0, 0`.`00)` whose name contains invalid characters ',' in Hive metastore.
at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$verifyDataSchema$4(HiveExternalCatalog.scala:180) at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$verifyDataSchema$4$adapted(HiveExternalCatalog.scala:171) at scala.collection.Iterator.foreach(Iterator.scala:943) ``` ### Why are the changes needed? as above ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? add UT ### Was this patch authored or co-authored using generative AI tooling? no Closes #42609 from ming95/SPARK-44911. Authored-by: ming95 <505306...@qq.com> Signed-off-by: Max Gekk --- .../src/main/resources/error/error-classes.json| 2 +- docs/sql-error-conditions.md | 2 +- .../spark/sql/hive/HiveExternalCatalog.scala | 11 --- .../spark/sql/hive/execution/HiveDDLSuite.scala| 21 .../spark/sql/hive/execution/SQLQuerySuite.scala | 23 +++--- 5 files changed, 47 insertions(+), 12 deletions(-) diff --git a/common/utils/src/main/resources/error/error-classes.json b/common/utils/src/main/resources/error/error-classes.json index 415bdbaf42a..4740ed72f89 100644 --- a/common/utils/src/main/resources/error/error-classes.json +++ b/common/utils/src/main/resources/error/error-classes.json @@ -1587,7 +1587,7 @@ }, "INVALID_HIVE_COLUMN_NAME" : { "message" : [ - "Cannot create the table having the nested column whose name contains invalid characters in Hive metastore." + "Cannot create the table having the column whose name contains invalid characters in Hive metastore." ] }, "INVALID_IDENTIFIER" : { diff --git a/docs/sql-error-conditions.md b/docs/sql-error-conditions.md index 0d54938593c..444c2b7c0d1 100644 --- a/docs/sql-error-conditions.md +++ b/docs/sql-error-conditions.md @@ -971,7 +971,7 @@ For more details see [INVALID_HANDLE](sql-error-conditions-invalid-handle-error- SQLSTATE: none assigned -Cannot create the table `` having the nested column `` whose name contains invalid characters `` in Hive metastore. 
+Cannot create the table `` having the column `` whose name contains invalid characters `` in Hive metastore. ### INVALID_IDENTIFIER diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala index e4325989b70..67292460bbc 100644 --- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala +++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala @@ -42,7 +42,7 @@ import org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils._ import org.apache.spark.sql.catalyst.expressions._ import org.apache.spark.sql.catalyst.types.DataTypeUtils import org.apache.spark.sql.catalyst.util.{CaseInsensitiveMap, CharVarcharUtils} -import org.apache.spark.sql.catalyst.util.TypeUtils.toSQLId +import org.apache.spark.sql.catalyst.util.TypeUtils.{toSQLId,
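The root cause in the report above is that an unaliased expression such as `date'2018-01-01' + make_dt_interval(0, id)` becomes the column name, and the generated name contains commas, which the Hive metastore rejects. The kind of check involved can be sketched as follows (the character set, class, and method names here are illustrative and do not reflect Hive's actual validation logic):

```java
public class ColumnNameCheck {
    // Characters rejected in column names (illustrative subset; the report concerns ',').
    private static final String INVALID_CHARS = ",;";

    /** Returns the index of the first invalid character in the name, or -1 if the name is clean. */
    static int firstInvalidChar(String columnName) {
        for (int i = 0; i < columnName.length(); i++) {
            if (INVALID_CHARS.indexOf(columnName.charAt(i)) >= 0) return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        // The auto-generated column name from the report contains commas...
        String generated = "DATE '2018-01-01' + make_dt_interval(0, id, 0, 0.00)";
        System.out.println(firstInvalidChar(generated) >= 0);   // invalid name detected
        // ...whereas an explicit alias gives a clean name.
        System.out.println(firstInvalidChar("event_date") >= 0); // valid
    }
}
```

In practice the user-side fix is simply to alias the expression in the SELECT (e.g. `... + make_dt_interval(0, id) AS d`) so the stored column name contains no commas; the PR only improves the error reported when that is not done.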
[spark] branch master updated: [SPARK-45092][SQL][UI] Avoid analyzing twice for failed queries
This is an automated email from the ASF dual-hosted git repository.

yao pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 4d213ff3dea [SPARK-45092][SQL][UI] Avoid analyzing twice for failed queries
4d213ff3dea is described below

commit 4d213ff3dea4d66e5dec7be3b35c5441d9187c30
Author: Kent Yao
AuthorDate: Tue Sep 12 16:35:39 2023 +0800

    [SPARK-45092][SQL][UI] Avoid analyzing twice for failed queries

    ### What changes were proposed in this pull request?
    Following the discussion starting at https://github.com/apache/spark/pull/42481#discussion_r1316776270: for failed queries, we need to avoid calling `SparkPlanInfo.fromSparkPlan`, which triggers another round of analysis. This patch uses `Either[Throwable, () => T]` to pass the throwable conditionally and bypass the plan explain functions on error.

    ### Why are the changes needed?
    Improvements on top of https://github.com/apache/spark/pull/42481.

    ### Does this PR introduce _any_ user-facing change?
    No.

    ### How was this patch tested?
    Existing tests.

    ### Was this patch authored or co-authored using generative AI tooling?
    No.

    Closes #42838 from yaooqinn/SPARK-45092.
Authored-by: Kent Yao Signed-off-by: Kent Yao --- .../spark/sql/execution/QueryExecution.scala | 2 +- .../apache/spark/sql/execution/SQLExecution.scala | 72 ++ .../spark/sql/execution/ui/UISeleniumSuite.scala | 2 +- 3 files changed, 49 insertions(+), 27 deletions(-) diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala index 8ddfde8acf8..b3c97a83970 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala @@ -71,7 +71,7 @@ class QueryExecution( // Because we do eager analysis for Dataframe, there will be no execution created after // AnalysisException occurs. So we need to explicitly create a new execution to post // start/end events to notify the listener and UI components. -SQLExecution.withNewExecutionId(this, Some("analyze"))(throw e) +SQLExecution.withNewExecutionIdOnError(this, Some("analyze"))(e) } } diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/SQLExecution.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/SQLExecution.scala index 2a44a016d2d..b96b9c25dda 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/SQLExecution.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/SQLExecution.scala @@ -66,9 +66,10 @@ object SQLExecution extends Logging { * Wrap an action that will execute "queryExecution" to track all Spark jobs in the body so that * we can connect them with an execution. 
*/ - def withNewExecutionId[T]( + private def withNewExecutionId0[T]( queryExecution: QueryExecution, - name: Option[String] = None)(body: => T): T = queryExecution.sparkSession.withActive { + name: Option[String] = None)( + body: Either[Throwable, () => T]): T = queryExecution.sparkSession.withActive { val sparkSession = queryExecution.sparkSession val sc = sparkSession.sparkContext val oldExecutionId = sc.getLocalProperty(EXECUTION_ID_KEY) @@ -103,9 +104,6 @@ object SQLExecution extends Logging { redactedStr.substring(0, Math.min(truncateLength, redactedStr.length)) }.getOrElse(callSite.shortForm) - val planDescriptionMode = -ExplainMode.fromString(sparkSession.sessionState.conf.uiExplainMode) - val globalConfigs = sparkSession.sharedState.conf.getAll.toMap val modifiedConfigs = sparkSession.sessionState.conf.getAllConfs .filterNot { case (key, value) => @@ -118,28 +116,39 @@ object SQLExecution extends Logging { withSQLConfPropagated(sparkSession) { var ex: Option[Throwable] = None val startTime = System.nanoTime() +val startEvent = SparkListenerSQLExecutionStart( + executionId = executionId, + rootExecutionId = Some(rootExecutionId), + description = desc, + details = callSite.longForm, + physicalPlanDescription = "", + sparkPlanInfo = SparkPlanInfo.EMPTY, + time = System.currentTimeMillis(), + modifiedConfigs = redactedConfigs, + jobTags = sc.getJobTags() +) try { - val planInfo = try { -SparkPlanInfo.fromSparkPlan(queryExecution.executedPlan) - } catch { -case NonFatal(e) => - logDebug("Failed to generate SparkPlanInfo", e) - // If the queryExecution already failed before this, we are not able
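The shape of the refactoring — carry either a failure or a deferred body, post the start event unconditionally with an empty plan description, and only then either rethrow or run the body — can be sketched in miniature. This is a hedged illustration of the control-flow pattern, not Spark's API; the `ErrorOrBody` type and event strings below are invented for the sketch:

```java
import java.util.function.Supplier;

public class ExecutionSketch {
    /** Minimal Either-like carrier: exactly one of error/body is non-null. */
    static final class ErrorOrBody<T> {
        final RuntimeException error;   // "Left": the failure from analysis
        final Supplier<T> body;         // "Right": the deferred query body
        private ErrorOrBody(RuntimeException e, Supplier<T> b) { error = e; body = b; }
        static <T> ErrorOrBody<T> left(RuntimeException e) { return new ErrorOrBody<>(e, null); }
        static <T> ErrorOrBody<T> right(Supplier<T> b) { return new ErrorOrBody<>(null, b); }
    }

    static <T> T withNewExecutionId(ErrorOrBody<T> action) {
        // Post the start event with an EMPTY plan description up front; on the
        // error path we never touch the (unanalyzable) plan, so the failed
        // query is not analyzed a second time just to describe it.
        System.out.println("SparkListenerSQLExecutionStart(planInfo=EMPTY)");
        try {
            if (action.error != null) throw action.error;  // error path: rethrow, skip explain
            return action.body.get();                      // normal path: run the body
        } finally {
            System.out.println("SparkListenerSQLExecutionEnd");
        }
    }

    public static void main(String[] args) {
        System.out.println(withNewExecutionId(ErrorOrBody.right(() -> 42)));
        try {
            withNewExecutionId(ErrorOrBody.<Object>left(new IllegalStateException("analysis failed")));
        } catch (IllegalStateException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```

The key property mirrored from the patch: start/end listener events fire on both paths (so the UI sees the failed execution), while plan description work happens on neither the error path here nor before the events.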
[spark] branch master updated: [SPARK-44915][CORE] Validate checksum of remounted PVC's shuffle data before recovery
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 994b6976b2a [SPARK-44915][CORE] Validate checksum of remounted PVC's shuffle data before recovery
994b6976b2a is described below

commit 994b6976b2a5a53323a83e70e0c6195cd74292a1
Author: Dongjoon Hyun
AuthorDate: Tue Sep 12 01:15:01 2023 -0700

    [SPARK-44915][CORE] Validate checksum of remounted PVC's shuffle data before recovery

    ### What changes were proposed in this pull request?
    This PR aims to validate the checksum of a remounted PVC's shuffle data before recovery.

    ### Why are the changes needed?
    In general, there are many reasons that cause executor terminations, and some of them corrupt data on disk. Since Apache Spark already has checksum files, we can take advantage of them to improve robustness by catching any potential issues with remounted disks before recovering their shuffle data.

    ### Does this PR introduce _any_ user-facing change?
    No.

    ### How was this patch tested?
    Pass the CIs with the newly added test suite.

    ### Was this patch authored or co-authored using generative AI tooling?
    No.

    Closes #42724 from dongjoon-hyun/SPARK-44915.
Lead-authored-by: Dongjoon Hyun Co-authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun --- .../spark/shuffle/ShuffleChecksumUtils.scala | 13 ++ ...ernetesLocalDiskShuffleExecutorComponents.scala | 69 +-- .../spark/shuffle/ShuffleChecksumUtilsSuite.scala | 134 + 3 files changed, 209 insertions(+), 7 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/shuffle/ShuffleChecksumUtils.scala b/core/src/main/scala/org/apache/spark/shuffle/ShuffleChecksumUtils.scala index 75b0efcf5cd..b2a18d75387 100644 --- a/core/src/main/scala/org/apache/spark/shuffle/ShuffleChecksumUtils.scala +++ b/core/src/main/scala/org/apache/spark/shuffle/ShuffleChecksumUtils.scala @@ -22,9 +22,22 @@ import java.util.zip.CheckedInputStream import org.apache.spark.network.shuffle.checksum.ShuffleChecksumHelper import org.apache.spark.network.util.LimitedInputStream +import org.apache.spark.shuffle.IndexShuffleBlockResolver.NOOP_REDUCE_ID +import org.apache.spark.storage.{BlockId, ShuffleChecksumBlockId, ShuffleDataBlockId} object ShuffleChecksumUtils { + /** + * Return checksumFile for shuffle data block ID. Otherwise, null. + */ + def getChecksumFileName(blockId: BlockId, algorithm: String): String = blockId match { +case ShuffleDataBlockId(shuffleId, mapId, _) => + ShuffleChecksumHelper.getChecksumFileName( +ShuffleChecksumBlockId(shuffleId, mapId, NOOP_REDUCE_ID).name, algorithm) +case _ => + null + } + /** * Ensure that the checksum values are consistent with index file and data file. 
*/ diff --git a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/shuffle/KubernetesLocalDiskShuffleExecutorComponents.scala b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/shuffle/KubernetesLocalDiskShuffleExecutorComponents.scala index e553a56b7e1..a858db374df 100644 --- a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/shuffle/KubernetesLocalDiskShuffleExecutorComponents.scala +++ b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/shuffle/KubernetesLocalDiskShuffleExecutorComponents.scala @@ -20,6 +20,7 @@ package org.apache.spark.shuffle import java.io.File import java.util.Optional +import scala.collection.mutable import scala.reflect.ClassTag import org.apache.commons.io.FileExistsException @@ -27,9 +28,11 @@ import org.apache.commons.io.FileExistsException import org.apache.spark.{SparkConf, SparkEnv} import org.apache.spark.deploy.k8s.Config.KUBERNETES_DRIVER_REUSE_PVC import org.apache.spark.internal.Logging +import org.apache.spark.internal.config.{SHUFFLE_CHECKSUM_ALGORITHM, SHUFFLE_CHECKSUM_ENABLED} +import org.apache.spark.shuffle.ShuffleChecksumUtils.{compareChecksums, getChecksumFileName} import org.apache.spark.shuffle.api.{ShuffleExecutorComponents, ShuffleMapOutputWriter, SingleSpillShuffleMapOutputWriter} import org.apache.spark.shuffle.sort.io.LocalDiskShuffleExecutorComponents -import org.apache.spark.storage.{BlockId, BlockManager, StorageLevel, UnrecognizedBlockId} +import org.apache.spark.storage.{BlockId, BlockManager, ShuffleDataBlockId, StorageLevel, UnrecognizedBlockId} import org.apache.spark.util.Utils class KubernetesLocalDiskShuffleExecutorComponents(sparkConf: SparkConf) @@ -73,7 +76,7 @@ object KubernetesLocalDiskShuffleExecutorComponents extends Logging { */ def recoverDiskStore(conf: SparkConf, bm: BlockManager): Unit = { // Find All files -val files = Utils.getConfiguredLocalDirs(conf) +val
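The underlying idea — recompute a checksum over the on-disk bytes and compare it with the recorded value before trusting the data — is the standard integrity-check pattern. A self-contained sketch using CRC32 (the file layout, names, and use of CRC32 here are illustrative stand-ins, not Spark's actual shuffle checksum format):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.CRC32;

public class ChecksumCheck {
    /** Compute a CRC32 over the data, standing in for the configured shuffle checksum algorithm. */
    static long checksum(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    /** Recover the block only if the recorded checksum matches the bytes currently on disk. */
    static boolean safeToRecover(Path dataFile, long recordedChecksum) throws IOException {
        return checksum(Files.readAllBytes(dataFile)) == recordedChecksum;
    }

    public static void main(String[] args) throws IOException {
        Path f = Files.createTempFile("shuffle", ".data");
        Files.write(f, "payload".getBytes());
        long recorded = checksum(Files.readAllBytes(f)); // recorded at shuffle-write time
        System.out.println(safeToRecover(f, recorded));  // true: data intact, safe to recover
        Files.write(f, "corrupted".getBytes());          // simulate corruption on the remounted disk
        System.out.println(safeToRecover(f, recorded));  // false: skip recovery of this block
    }
}
```

Skipping recovery on a mismatch is cheap insurance: the worst case is recomputing a shuffle block that was actually fine, while the failure it prevents is silently serving corrupted shuffle data.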
[spark] tag v3.5.0 created (now ce5ddad9903)
This is an automated email from the ASF dual-hosted git repository.

liyuanjian pushed a change to tag v3.5.0 in repository https://gitbox.apache.org/repos/asf/spark.git

at ce5ddad9903 (commit)

No new revisions were added by this update.
[spark] branch master updated (6565ae47cae -> d8129f837c4)
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

from 6565ae47cae [SPARK-43252][SQL] Replace the error class `_LEGACY_ERROR_TEMP_2016` with an internal error
add d8129f837c4 [SPARK-45085][SQL] Merge UNSUPPORTED_TEMP_VIEW_OPERATION into UNSUPPORTED_VIEW_OPERATION and refactor some logic

No new revisions were added by this update.

Summary of changes:
 R/pkg/tests/fulltests/test_sparkSQL.R | 2 +-
 .../src/main/resources/error/error-classes.json| 17 --
 docs/sql-error-conditions.md | 8 ---
 .../spark/sql/catalyst/analysis/Analyzer.scala | 19 +++
 .../sql/catalyst/analysis/v2ResolutionPlans.scala | 4 +-
 .../spark/sql/errors/QueryCompilationErrors.scala | 52 +++--
 .../analyzer-results/change-column.sql.out | 8 +--
 .../sql-tests/results/change-column.sql.out| 8 +--
 .../spark/sql/connector/DataSourceV2SQLSuite.scala | 4 +-
 .../apache/spark/sql/execution/SQLViewSuite.scala | 66 +++---
 .../spark/sql/execution/SQLViewTestSuite.scala | 4 +-
 .../spark/sql/execution/command/DDLSuite.scala | 6 +-
 .../execution/command/TruncateTableSuiteBase.scala | 10 ++--
 .../execution/command/v1/ShowPartitionsSuite.scala | 10 ++--
 .../apache/spark/sql/internal/CatalogSuite.scala | 4 +-
 15 files changed, 80 insertions(+), 142 deletions(-)
[spark] branch master updated (fa2bc21ba1e -> 6565ae47cae)
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

from fa2bc21ba1e [SPARK-45110][BUILD] Upgrade rocksdbjni to 8.5.3
add 6565ae47cae [SPARK-43252][SQL] Replace the error class `_LEGACY_ERROR_TEMP_2016` with an internal error

No new revisions were added by this update.

Summary of changes:
 common/utils/src/main/resources/error/error-classes.json| 5 -
 .../org/apache/spark/sql/errors/QueryExecutionErrors.scala | 6 ++
 .../sql/catalyst/expressions/codegen/CodeBlockSuite.scala | 13 -
 3 files changed, 10 insertions(+), 14 deletions(-)
[spark] branch master updated (d7845da6ddf -> fa2bc21ba1e)
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

from d7845da6ddf [SPARK-45125][INFRA] Remove dev/github_jira_sync.py in favor of ASF jira_options
add fa2bc21ba1e [SPARK-45110][BUILD] Upgrade rocksdbjni to 8.5.3

No new revisions were added by this update.

Summary of changes:
 dev/deps/spark-deps-hadoop-3-hive-2.3 | 2 +-
 pom.xml| 2 +-
 ...StoreBasicOperationsBenchmark-jdk11-results.txt | 120 ++---
 ...StoreBasicOperationsBenchmark-jdk17-results.txt | 120 ++---
 .../StateStoreBasicOperationsBenchmark-results.txt | 120 ++---
 5 files changed, 182 insertions(+), 182 deletions(-)
[spark] branch branch-3.5 updated: [SPARK-45110][BUILD] Upgrade rocksdbjni to 8.5.3
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new 6a2aa1d48c3 [SPARK-45110][BUILD] Upgrade rocksdbjni to 8.5.3 6a2aa1d48c3 is described below commit 6a2aa1d48c304095dcdf2816a46ec1f5a8af41a2 Author: panbingkun AuthorDate: Tue Sep 12 00:29:38 2023 -0700 [SPARK-45110][BUILD] Upgrade rocksdbjni to 8.5.3 ### What changes were proposed in this pull request? This PR aims to upgrade rocksdbjni from 8.3.2 to 8.5.3. ### Why are the changes needed? 1. The full release notes: - https://github.com/facebook/rocksdb/releases/tag/v8.5.3 - https://github.com/facebook/rocksdb/releases/tag/v8.4.4 - https://github.com/facebook/rocksdb/releases/tag/v8.3.3 2. Bug fixes (screenshot: https://github.com/apache/spark/assets/15246973/879224c3-6f29-40d7-9c07-0f656fa2ff76): - Fix a bug where, if there is an error reading from offset 0 of a file from L1+ and the file is not the first file in the sorted run, data can be lost in compaction and read/scan can return incorrect results. - Fix a bug where an iterator may return an incorrect result for DeleteRange() users if there was an error reading from a file. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - Pass GA. - Manually test: ``` ./build/mvn clean install -pl core -am -Dtest.exclude.tags=org.apache.spark.tags.ExtendedLevelDBTest -fn ... [INFO] [INFO] Reactor Summary for Spark Project Parent POM 4.0.0-SNAPSHOT: [INFO] [INFO] Spark Project Parent POM ... SUCCESS [ 7.121 s] [INFO] Spark Project Tags . SUCCESS [ 10.181 s] [INFO] Spark Project Local DB . SUCCESS [ 21.153 s] [INFO] Spark Project Common Utils . SUCCESS [ 14.960 s] [INFO] Spark Project Networking ... SUCCESS [01:01 min] [INFO] Spark Project Shuffle Streaming Service SUCCESS [ 16.992 s] [INFO] Spark Project Unsafe ... 
SUCCESS [ 14.967 s] [INFO] Spark Project Launcher . SUCCESS [ 11.737 s] [INFO] Spark Project Core . SUCCESS [38:06 min] [INFO] [INFO] BUILD SUCCESS [INFO] [INFO] Total time: 40:45 min [INFO] Finished at: 2023-09-10T17:25:26+08:00 [INFO] ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #42862 from panbingkun/SPARK-45110. Authored-by: panbingkun Signed-off-by: Dongjoon Hyun (cherry picked from commit fa2bc21ba1e6cbde31f33faa681f5a1c47219c69) Signed-off-by: Dongjoon Hyun --- dev/deps/spark-deps-hadoop-3-hive-2.3 | 2 +- pom.xml| 2 +- ...StoreBasicOperationsBenchmark-jdk11-results.txt | 120 ++--- ...StoreBasicOperationsBenchmark-jdk17-results.txt | 120 ++--- .../StateStoreBasicOperationsBenchmark-results.txt | 120 ++--- 5 files changed, 182 insertions(+), 182 deletions(-) diff --git a/dev/deps/spark-deps-hadoop-3-hive-2.3 b/dev/deps/spark-deps-hadoop-3-hive-2.3 index 1d02f8dba56..3d3f710e74c 100644 --- a/dev/deps/spark-deps-hadoop-3-hive-2.3 +++ b/dev/deps/spark-deps-hadoop-3-hive-2.3 @@ -227,7 +227,7 @@ parquet-jackson/1.13.1//parquet-jackson-1.13.1.jar pickle/1.3//pickle-1.3.jar py4j/0.10.9.7//py4j-0.10.9.7.jar remotetea-oncrpc/1.1.2//remotetea-oncrpc-1.1.2.jar -rocksdbjni/8.3.2//rocksdbjni-8.3.2.jar +rocksdbjni/8.5.3//rocksdbjni-8.5.3.jar scala-collection-compat_2.12/2.7.0//scala-collection-compat_2.12-2.7.0.jar scala-compiler/2.12.18//scala-compiler-2.12.18.jar scala-library/2.12.18//scala-library-2.12.18.jar diff --git a/pom.xml b/pom.xml index 8fc4b89a78c..70e1ee71568 100644 --- a/pom.xml +++ b/pom.xml @@ -679,7 +679,7 @@ org.rocksdb rocksdbjni -8.3.2 +8.5.3 ${leveldbjni.group} diff --git a/sql/core/benchmarks/StateStoreBasicOperationsBenchmark-jdk11-results.txt b/sql/core/benchmarks/StateStoreBasicOperationsBenchmark-jdk11-results.txt index d5c175a320d..70e9849572c 100644 --- a/sql/core/benchmarks/StateStoreBasicOperationsBenchmark-jdk11-results.txt +++ 
b/sql/core/benchmarks/StateStoreBasicOperationsBenchmark-jdk11-results.txt @@ -2,110 +2,110 @@ put rows
[spark] branch master updated: [SPARK-45125][INFRA] Remove dev/github_jira_sync.py in favor of ASF jira_options
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new d7845da6ddf [SPARK-45125][INFRA] Remove dev/github_jira_sync.py in favor of ASF jira_options d7845da6ddf is described below commit d7845da6ddf2f838b1d91606b8730d078fea11b4 Author: Kent Yao AuthorDate: Tue Sep 12 00:27:17 2023 -0700 [SPARK-45125][INFRA] Remove dev/github_jira_sync.py in favor of ASF jira_options ### What changes were proposed in this pull request? Since SPARK-44942 and https://issues.apache.org/jira/browse/INFRA-24962, we've enabled jira_options for GitHub and JIRA syncing, and it's been working properly. Thus, this PR removes dev/github_jira_sync.py in favor of ASF jira_options. ### Why are the changes needed? Code cleanup. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Confirmed with INFRA and watched the JIRA sync for several days. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #42882 from yaooqinn/SPARK-45125. 
Authored-by: Kent Yao Signed-off-by: Dongjoon Hyun --- .github/labeler.yml | 4 +- dev/github_jira_sync.py | 202 2 files changed, 1 insertion(+), 205 deletions(-) diff --git a/.github/labeler.yml b/.github/labeler.yml index 4ae831f2131..b252edd8873 100644 --- a/.github/labeler.yml +++ b/.github/labeler.yml @@ -42,12 +42,11 @@ INFRA: - ".asf.yaml" - ".gitattributes" - ".gitignore" - - "dev/github_jira_sync.py" - "dev/merge_spark_pr.py" - "dev/run-tests-jenkins*" BUILD: # Can be supported when a stable release with correct all/any is released - #- any: ['dev/**/*', '!dev/github_jira_sync.py', '!dev/merge_spark_pr.py', '!dev/.rat-excludes'] + #- any: ['dev/**/*', '!dev/merge_spark_pr.py', '!dev/.rat-excludes'] - "dev/**/*" - "build/**/*" - "project/**/*" @@ -58,7 +57,6 @@ BUILD: - "scalastyle-config.xml" # These can be added in the above `any` clause (and the /dev/**/* glob removed) when # `any`/`all` support is released - # - "!dev/github_jira_sync.py" # - "!dev/merge_spark_pr.py" # - "!dev/run-tests-jenkins*" # - "!dev/.rat-excludes" diff --git a/dev/github_jira_sync.py b/dev/github_jira_sync.py deleted file mode 100755 index 45908518d82..000 --- a/dev/github_jira_sync.py +++ /dev/null @@ -1,202 +0,0 @@ -#!/usr/bin/env python3 - -# -# Licensed to the Apache Software Foundation (ASF) under one or more -# contributor license agreements. See the NOTICE file distributed with -# this work for additional information regarding copyright ownership. -# The ASF licenses this file to You under the Apache License, Version 2.0 -# (the "License"); you may not use this file except in compliance with -# the License. You may obtain a copy of the License at -# -#http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
-# See the License for the specific language governing permissions and -# limitations under the License. -# -# Utility for updating JIRA's with information about GitHub pull requests - -import json -import os -import re -import sys -from urllib.request import urlopen -from urllib.request import Request -from urllib.error import HTTPError - -try: -import jira.client -except ImportError: -print("This tool requires the jira-python library") -print("Install using 'pip3 install jira'") -sys.exit(-1) - -# User facing configs -GITHUB_API_BASE = os.environ.get("GITHUB_API_BASE", "https://api.github.com/repos/apache/spark") -GITHUB_OAUTH_KEY = os.environ.get("GITHUB_OAUTH_KEY") -JIRA_PROJECT_NAME = os.environ.get("JIRA_PROJECT_NAME", "SPARK") -JIRA_API_BASE = os.environ.get("JIRA_API_BASE", "https://issues.apache.org/jira") -JIRA_USERNAME = os.environ.get("JIRA_USERNAME", "apachespark") -JIRA_PASSWORD = os.environ.get("JIRA_PASSWORD", "XXX") -# Maximum number of updates to perform in one run -MAX_UPDATES = int(os.environ.get("MAX_UPDATES", "10")) -# Cut-off for oldest PR on which to comment. Useful for avoiding -# "notification overload" when running for the first time. -MIN_COMMENT_PR = int(os.environ.get("MIN_COMMENT_PR", "1496")) - -# File used as an optimization to store maximum previously seen PR -# Used mostly because accessing ASF JIRA is slow, so we want to avoid checking -# the state of JIRA's that are tied to PR's we've already looked at. -MAX_FILE = ".github-jira-max" - - -def get_url(url): -try: -request = Request(url) -request.add_header("Authorization", "token %s" % GITHUB_OAUTH_KEY) -
[spark] branch branch-3.4 updated: [SPARK-45109][SQL][CONNECT][3.4] Fix log function in Connect
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.4 by this push: new 544017854c7 [SPARK-45109][SQL][CONNECT][3.4] Fix log function in Connect 544017854c7 is described below commit 544017854c77be1c8b7fffc3f23a5fdee2fb798e Author: Peter Toth AuthorDate: Tue Sep 12 00:23:49 2023 -0700 [SPARK-45109][SQL][CONNECT][3.4] Fix log function in Connect ### What changes were proposed in this pull request? This is a backport PR of https://github.com/apache/spark/pull/42869 as the one-argument `log` function should point to `ln`. (Please note that the original https://github.com/apache/spark/pull/42863 doesn't need to be backported as `aes_decrypt` and `ln` are not implemented in Connect in Spark 3.4.) ### Why are the changes needed? Bugfix. ### Does this PR introduce _any_ user-facing change? No, these Spark Connect functions haven't been released. ### How was this patch tested? Existing UTs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #42872 from peter-toth/SPARK-45109-fix-log-3.4. 
Authored-by: Peter Toth Signed-off-by: Dongjoon Hyun --- .../src/main/scala/org/apache/spark/sql/functions.scala | 2 +- .../query-tests/explain-results/function_log.explain| 2 +- .../resources/query-tests/queries/function_log.json | 2 +- .../query-tests/queries/function_log.proto.bin | Bin 172 -> 171 bytes 4 files changed, 3 insertions(+), 3 deletions(-) diff --git a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala index 29c2e89c537..7a59fa5a3a3 100644 --- a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala +++ b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala @@ -2032,7 +2032,7 @@ object functions { * @group math_funcs * @since 3.4.0 */ - def log(e: Column): Column = Column.fn("log", e) + def log(e: Column): Column = Column.fn("ln", e) /** * Computes the natural logarithm of the given column. diff --git a/connector/connect/common/src/test/resources/query-tests/explain-results/function_log.explain b/connector/connect/common/src/test/resources/query-tests/explain-results/function_log.explain index d3c3743b1ef..66b782ac817 100644 --- a/connector/connect/common/src/test/resources/query-tests/explain-results/function_log.explain +++ b/connector/connect/common/src/test/resources/query-tests/explain-results/function_log.explain @@ -1,2 +1,2 @@ -Project [LOG(E(), b#0) AS LOG(E(), b)#0] +Project [ln(b#0) AS ln(b)#0] +- LocalRelation , [id#0L, a#0, b#0, d#0, e#0, f#0, g#0] diff --git a/connector/connect/common/src/test/resources/query-tests/queries/function_log.json b/connector/connect/common/src/test/resources/query-tests/queries/function_log.json index 1b2d0ed0b14..ababbc52d08 100644 --- a/connector/connect/common/src/test/resources/query-tests/queries/function_log.json +++ b/connector/connect/common/src/test/resources/query-tests/queries/function_log.json @@ -13,7 +13,7 @@ }, "expressions": [{ 
"unresolvedFunction": { -"functionName": "log", +"functionName": "ln", "arguments": [{ "unresolvedAttribute": { "unparsedIdentifier": "b" diff --git a/connector/connect/common/src/test/resources/query-tests/queries/function_log.proto.bin b/connector/connect/common/src/test/resources/query-tests/queries/function_log.proto.bin index 548fb480dd2..ecb87a1fc41 100644 Binary files a/connector/connect/common/src/test/resources/query-tests/queries/function_log.proto.bin and b/connector/connect/common/src/test/resources/query-tests/queries/function_log.proto.bin differ - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-45121][CONNECT][PS] Support `Series.empty` for Spark Connect
This is an automated email from the ASF dual-hosted git repository. ruifengz pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 5d2d9155d24 [SPARK-45121][CONNECT][PS] Support `Series.empty` for Spark Connect 5d2d9155d24 is described below commit 5d2d9155d24ea9a466e1868969dccd4ae1ac7278 Author: Haejoon Lee AuthorDate: Tue Sep 12 14:37:52 2023 +0800 [SPARK-45121][CONNECT][PS] Support `Series.empty` for Spark Connect ### What changes were proposed in this pull request? This PR proposes to support Series.empty for Spark Connect by removing JVM dependency. ### Why are the changes needed? Increase API coverage for Spark Connect. ### Does this PR introduce _any_ user-facing change? `Series.empty` is available on Spark Connect. ### How was this patch tested? Added UT. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #42877 from itholic/SPARK-45121. 
Authored-by: Haejoon Lee Signed-off-by: Ruifeng Zheng --- python/pyspark/pandas/base.py | 2 +- python/pyspark/pandas/tests/series/test_series.py | 2 ++ 2 files changed, 3 insertions(+), 1 deletion(-) diff --git a/python/pyspark/pandas/base.py b/python/pyspark/pandas/base.py index 1cb17de89e8..ef0b51f757d 100644 --- a/python/pyspark/pandas/base.py +++ b/python/pyspark/pandas/base.py @@ -505,7 +505,7 @@ class IndexOpsMixin(object, metaclass=ABCMeta): >>> ps.DataFrame({}, index=list('abc')).index.empty False """ -return self._internal.resolved_copy.spark_frame.rdd.isEmpty() +return self._internal.resolved_copy.spark_frame.isEmpty() @property def hasnans(self) -> bool: diff --git a/python/pyspark/pandas/tests/series/test_series.py b/python/pyspark/pandas/tests/series/test_series.py index 136d905eb49..aa147aa75cf 100644 --- a/python/pyspark/pandas/tests/series/test_series.py +++ b/python/pyspark/pandas/tests/series/test_series.py @@ -113,6 +113,8 @@ class SeriesTestsMixin: self.assert_eq(ps.from_pandas(pser_a), pser_a) self.assert_eq(ps.from_pandas(pser_b), pser_b) +self.assertTrue(pser_a.empty) + def test_all_null_series(self): pser_a = pd.Series([None, None, None], dtype="float64") pser_b = pd.Series([None, None, None], dtype="str") - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
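The one-line change above swaps an RDD-based emptiness check (`spark_frame.rdd.isEmpty()`) for the DataFrame-level `isEmpty()`, which avoids the RDD/JVM path that Spark Connect does not expose. A toy stand-in in plain Python (hypothetical class names, not pyspark) showing the shape of the delegation:

```python
class StubFrame:
    """Minimal stand-in for a Spark DataFrame that, like the real one,
    can answer emptiness without materializing an RDD."""

    def __init__(self, rows):
        self._rows = list(rows)

    def isEmpty(self):
        # Spark implements this as a cheap limit-1 query under the hood;
        # here it is just a length check.
        return len(self._rows) == 0


class StubIndexOps:
    """Mimics the patched pyspark.pandas IndexOpsMixin.empty property."""

    def __init__(self, spark_frame):
        self._spark_frame = spark_frame

    @property
    def empty(self):
        # Post-fix: delegate straight to DataFrame.isEmpty() -- no .rdd access.
        return self._spark_frame.isEmpty()


assert StubIndexOps(StubFrame([])).empty is True
assert StubIndexOps(StubFrame([1, 2])).empty is False
```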
[spark] branch master updated: [SPARK-43123][PS] Raise `TypeError` for `DataFrame.interpolate` when all columns are object-dtype
This is an automated email from the ASF dual-hosted git repository. ruifengz pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 3148511a923 [SPARK-43123][PS] Raise `TypeError` for `DataFrame.interpolate` when all columns are object-dtype 3148511a923 is described below commit 3148511a923bf59ea37d8f44e7427cde66f9f167 Author: Haejoon Lee AuthorDate: Tue Sep 12 14:36:42 2023 +0800 [SPARK-43123][PS] Raise `TypeError` for `DataFrame.interpolate` when all columns are object-dtype ### What changes were proposed in this pull request? This PR proposes to raise `TypeError` for `DataFrame.interpolate` when all columns are object-dtype. ### Why are the changes needed? To match the behavior of pandas: ```python >>> pd.DataFrame({"A": ['a', 'b', 'c'], "B": ['a', 'b', 'c']}).interpolate() ... TypeError: Cannot interpolate with all object-dtype columns in the DataFrame. Try setting at least one column to a numeric dtype. ``` We currently return an empty DataFrame instead of raising a TypeError: ```python >>> ps.DataFrame({"A": ['a', 'b', 'c'], "B": ['a', 'b', 'c']}).interpolate() Empty DataFrame Columns: [] Index: [0, 1, 2] ``` ### Does this PR introduce _any_ user-facing change? Computing `DataFrame.interpolate` on a DataFrame that has all object-dtype columns will raise a TypeError instead of returning an empty DataFrame. ### How was this patch tested? Added UT. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #42878 from itholic/SPARK-45123. 
Authored-by: Haejoon Lee Signed-off-by: Ruifeng Zheng --- python/pyspark/pandas/frame.py| 5 + python/pyspark/pandas/tests/test_frame_interpolate.py | 5 + 2 files changed, 10 insertions(+) diff --git a/python/pyspark/pandas/frame.py b/python/pyspark/pandas/frame.py index adbef607256..3aebbd65427 100644 --- a/python/pyspark/pandas/frame.py +++ b/python/pyspark/pandas/frame.py @@ -6097,6 +6097,11 @@ defaultdict(, {'col..., 'col...})] if isinstance(psser.spark.data_type, (NumericType, BooleanType)): numeric_col_names.append(psser.name) +if len(numeric_col_names) == 0: +raise TypeError( +"Cannot interpolate with all object-dtype columns in the DataFrame. " +"Try setting at least one column to a numeric dtype." +) psdf = self[numeric_col_names] return psdf._apply_series_op( lambda psser: psser._interpolate( diff --git a/python/pyspark/pandas/tests/test_frame_interpolate.py b/python/pyspark/pandas/tests/test_frame_interpolate.py index 5b5856f7ab8..17c73781f8e 100644 --- a/python/pyspark/pandas/tests/test_frame_interpolate.py +++ b/python/pyspark/pandas/tests/test_frame_interpolate.py @@ -53,6 +53,11 @@ class FrameInterpolateTestsMixin: with self.assertRaisesRegex(ValueError, "invalid limit_area"): psdf.id.interpolate(limit_area="jump") +with self.assertRaisesRegex( +TypeError, "Cannot interpolate with all object-dtype columns in the DataFrame." +): +ps.DataFrame({"A": ["a", "b", "c"], "B": ["a", "b", "c"]}).interpolate() + def _test_interpolate(self, pobj): psobj = ps.from_pandas(pobj) self.assert_eq(psobj.interpolate(), pobj.interpolate()) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
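The guard added in `frame.py` first collects the numeric/boolean columns and only then proceeds, raising the pandas-compatible error when none remain. A plain-Python mimic of that check (toy dtype strings instead of Spark types):

```python
# Stand-ins for Spark's NumericType and BooleanType checks.
NUMERIC_DTYPES = {"int", "float", "bool"}

def interpolate_columns(dtypes):
    """dtypes: mapping of column name -> dtype string.
    Returns the columns that interpolation would operate on, mirroring
    the numeric-column filter and the new guard in DataFrame.interpolate."""
    numeric = [name for name, dt in dtypes.items() if dt in NUMERIC_DTYPES]
    if not numeric:
        raise TypeError(
            "Cannot interpolate with all object-dtype columns in the DataFrame. "
            "Try setting at least one column to a numeric dtype."
        )
    return numeric

# Mixed dtypes: only the numeric column survives the filter.
assert interpolate_columns({"A": "object", "B": "float"}) == ["B"]

# All object-dtype: the new TypeError fires instead of returning nothing.
try:
    interpolate_columns({"A": "object", "B": "object"})
    raised = False
except TypeError as e:
    raised = "object-dtype" in str(e)
assert raised
```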
[spark] branch branch-3.4 updated: [SPARK-45075][SQL][3.4] Fix alter table with invalid default value will not report error
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.4 by this push: new ed6c1b33a0a [SPARK-45075][SQL][3.4] Fix alter table with invalid default value will not report error ed6c1b33a0a is described below commit ed6c1b33a0a763680182b29baedebb241e0139a4 Author: Jia Fan AuthorDate: Mon Sep 11 23:28:20 2023 -0700 [SPARK-45075][SQL][3.4] Fix alter table with invalid default value will not report error ### What changes were proposed in this pull request? This is a backporting PR to branch-3.4 from https://github.com/apache/spark/pull/42810. Changed the way of asserting the error to adapt to 3.4. ### Why are the changes needed? Fix a bug on 3.4. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added a new test. ### Was this patch authored or co-authored using generative AI tooling? No Closes #42876 from Hisoka-X/SPARK-45075_followup_3.4_alter_column. Authored-by: Jia Fan Signed-off-by: Dongjoon Hyun --- .../spark/sql/connector/catalog/TableChange.java | 3 +-- .../plans/logical/v2AlterTableCommands.scala | 11 +-- .../spark/sql/connector/AlterTableTests.scala | 23 ++ 3 files changed, 33 insertions(+), 4 deletions(-) diff --git a/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableChange.java b/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableChange.java index 609cfab2d56..ebecb6f507e 100644 --- a/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableChange.java +++ b/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableChange.java @@ -696,9 +696,8 @@ public interface TableChange { /** * Returns the column default value SQL string (Spark SQL dialect). The default value literal * is not provided as updating column default values does not need to back-fill existing data. 
- * Null means dropping the column default value. + * Empty string means dropping the column default value. */ -@Nullable public String newDefaultValue() { return newDefaultValue; } @Override diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2AlterTableCommands.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2AlterTableCommands.scala index eb9d45f06ec..b02c4fac12d 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2AlterTableCommands.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2AlterTableCommands.scala @@ -17,9 +17,9 @@ package org.apache.spark.sql.catalyst.plans.logical -import org.apache.spark.sql.catalyst.analysis.{FieldName, FieldPosition} +import org.apache.spark.sql.catalyst.analysis.{FieldName, FieldPosition, ResolvedFieldName} import org.apache.spark.sql.catalyst.catalog.CatalogTypes.TablePartitionSpec -import org.apache.spark.sql.catalyst.util.TypeUtils +import org.apache.spark.sql.catalyst.util.{ResolveDefaultColumns, TypeUtils} import org.apache.spark.sql.connector.catalog.{TableCatalog, TableChange} import org.apache.spark.sql.errors.QueryCompilationErrors import org.apache.spark.sql.types.DataType @@ -228,6 +228,13 @@ case class AlterColumn( TableChange.updateColumnPosition(colName, newPosition.position) } val defaultValueChange = setDefaultExpression.map { newDefaultExpression => + if (newDefaultExpression.nonEmpty) { +// SPARK-45075: We call 'ResolveDefaultColumns.analyze' here to make sure that the default +// value parses successfully, and return an error otherwise +val newDataType = dataType.getOrElse(column.asInstanceOf[ResolvedFieldName].field.dataType) +ResolveDefaultColumns.analyze(column.name.last, newDataType, newDefaultExpression, + "ALTER TABLE ALTER COLUMN") + } TableChange.updateColumnDefaultValue(colName, newDefaultExpression) } typeChange.toSeq ++ nullabilityChange ++ commentChange ++ 
positionChange ++ defaultValueChange diff --git a/sql/core/src/test/scala/org/apache/spark/sql/connector/AlterTableTests.scala b/sql/core/src/test/scala/org/apache/spark/sql/connector/AlterTableTests.scala index 2047212a4ea..8f1faa18933 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/connector/AlterTableTests.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/connector/AlterTableTests.scala @@ -366,6 +366,29 @@ trait AlterTableTests extends SharedSparkSession with QueryErrorsBase { } } + test("SPARK-45075: ALTER COLUMN with invalid default value") { +withSQLConf(SQLConf.DEFAULT_COLUMN_ALLOWED_PROVIDERS.key -> s"$v2Format, ") { + withTable("t") { +
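The fix calls `ResolveDefaultColumns.analyze` eagerly, so a default value that cannot be parsed against the column's type fails at ALTER time instead of being silently accepted. A loose Python analogue of that eager check — `ast.literal_eval` stands in for Spark's default-value analysis, and all names here are hypothetical:

```python
import ast

# Toy mapping from a column's declared type to the Python type a
# parsed default literal must have.
PYTHON_TYPES = {"int": int, "double": float, "string": str}

def validate_default(col_name, data_type, default_sql):
    """Reject a new column default that does not parse as the column's type,
    mirroring the idea of validating before emitting the TableChange."""
    try:
        value = ast.literal_eval(default_sql)
    except (ValueError, SyntaxError):
        raise ValueError(f"invalid default for {col_name}: {default_sql!r}")
    if not isinstance(value, PYTHON_TYPES[data_type]):
        raise ValueError(f"default {default_sql!r} is not a {data_type}")
    return value

# A well-typed default is accepted...
assert validate_default("id", "int", "42") == 42

# ...while a string default on an int column is rejected up front.
try:
    validate_default("id", "int", "'abc'")
    raised = False
except ValueError:
    raised = True
assert raised
```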
[spark] branch master updated: [SPARK-45124][CONNECT] Do not use local user ID for Local Relations
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 47d801e5e9d [SPARK-45124][CONNECT] Do not use local user ID for Local Relations 47d801e5e9d is described below commit 47d801e5e9ded3fb50d274a720ee7874e0b37cc3 Author: Hyukjin Kwon AuthorDate: Tue Sep 12 14:59:44 2023 +0900 [SPARK-45124][CONNECT] Do not use local user ID for Local Relations ### What changes were proposed in this pull request? This PR removes the use of `userId` and `sessionId` in `CachedLocalRelation` messages and subsequently makes `SparkConnectPlanner` use the `userId`/`sessionId` of the active session rather than the user-provided information. ### Why are the changes needed? Allowing a fetch of a local relation using user-provided information is a potential security risk since this allows users to fetch arbitrary local relations. ### Does this PR introduce _any_ user-facing change? Virtually no. It will ignore the session ID or user ID that users set (and instead use internal ones that users cannot manipulate). ### How was this patch tested? Manually. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #42880 from HyukjinKwon/no-local-user. 
Authored-by: Hyukjin Kwon Signed-off-by: Hyukjin Kwon --- .../scala/org/apache/spark/sql/SparkSession.scala | 2 - .../main/protobuf/spark/connect/relations.proto| 10 +- .../sql/connect/planner/SparkConnectPlanner.scala | 2 +- python/pyspark/sql/connect/plan.py | 3 - python/pyspark/sql/connect/proto/relations_pb2.py | 160 ++--- python/pyspark/sql/connect/proto/relations_pb2.pyi | 15 +- 6 files changed, 87 insertions(+), 105 deletions(-) diff --git a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SparkSession.scala b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SparkSession.scala index 7882ea64013..7bd8fa59aea 100644 --- a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SparkSession.scala +++ b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SparkSession.scala @@ -134,8 +134,6 @@ class SparkSession private[sql] ( } else { val hash = client.cacheLocalRelation(arrowData, encoder.schema.json) builder.getCachedLocalRelationBuilder -.setUserId(client.userId) -.setSessionId(client.sessionId) .setHash(hash) } } else { diff --git a/connector/connect/common/src/main/protobuf/spark/connect/relations.proto b/connector/connect/common/src/main/protobuf/spark/connect/relations.proto index 8001b3cbcfa..f7f1315ede0 100644 --- a/connector/connect/common/src/main/protobuf/spark/connect/relations.proto +++ b/connector/connect/common/src/main/protobuf/spark/connect/relations.proto @@ -400,11 +400,11 @@ message LocalRelation { // A local relation that has been cached already. message CachedLocalRelation { - // (Required) An identifier of the user which created the local relation - string userId = 1; - - // (Required) An identifier of the Spark SQL session in which the user created the local relation. - string sessionId = 2; + // `userId` and `sessionId` fields are deleted since the server must always use the active + // session/user rather than arbitrary values provided by the client. 
It is never valid to access + // a local relation from a different session/user. + reserved 1, 2; + reserved "userId", "sessionId"; // (Required) A sha-256 hash of the serialized local relation in proto, see LocalRelation. string hash = 3; diff --git a/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala b/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala index 1a63c9fc27c..b8ab5539b30 100644 --- a/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala +++ b/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala @@ -970,7 +970,7 @@ class SparkConnectPlanner(val sessionHolder: SessionHolder) extends Logging { private def transformCachedLocalRelation(rel: proto.CachedLocalRelation): LogicalPlan = { val blockManager = session.sparkContext.env.blockManager -val blockId = CacheId(rel.getUserId, rel.getSessionId, rel.getHash) +val blockId = CacheId(sessionHolder.userId, sessionHolder.sessionId, rel.getHash) val bytes = blockManager.getLocalBytes(blockId) bytes .map { blockData => diff --git a/python/pyspark/sql/connect/plan.py b/python/pyspark/sql/connect/plan.py index 5e9b4e53dbf..f641cb4b2fe 100644 ---
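The security fix makes the cache key depend only on the server's own notion of who is asking (the session holder) plus the client-supplied content hash; client-supplied user/session IDs are no longer consulted. A toy cache lookup in that shape (hypothetical names, not Spark Connect's API):

```python
from collections import namedtuple

# Mirrors the shape of Spark's CacheId(userId, sessionId, hash).
CacheId = namedtuple("CacheId", ["user_id", "session_id", "hash"])

class SessionHolder:
    """Server-side record of the authenticated user/session."""
    def __init__(self, user_id, session_id):
        self.user_id = user_id
        self.session_id = session_id

def lookup_local_relation(session_holder, store, rel_hash):
    # Post-fix: the key is built from the *server's* session identity,
    # so one session can never address another session's cached relations,
    # no matter what IDs the client puts in its request.
    key = CacheId(session_holder.user_id, session_holder.session_id, rel_hash)
    return store.get(key)

store = {CacheId("alice", "s1", "h"): b"alice-data"}

# Even knowing the hash, a different session cannot reach alice's entry.
assert lookup_local_relation(SessionHolder("mallory", "s2"), store, "h") is None
# The owning session still resolves its own relation.
assert lookup_local_relation(SessionHolder("alice", "s1"), store, "h") == b"alice-data"
```

On the wire, the now-unused proto fields are retired with `reserved 1, 2;` and `reserved "userId", "sessionId";`, which prevents those field numbers and names from ever being reused with a different meaning.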
[spark] branch branch-3.5 updated: [SPARK-45124][CONNECT] Do not use local user ID for Local Relations
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new 09b14f0968c [SPARK-45124][CONNECT] Do not use local user ID for Local Relations 09b14f0968c is described below commit 09b14f0968cebe0f2c5c9a369935f27d4ea228f6 Author: Hyukjin Kwon AuthorDate: Tue Sep 12 14:59:44 2023 +0900 [SPARK-45124][CONNECT] Do not use local user ID for Local Relations ### What changes were proposed in this pull request? This PR removes the use of `userId` and `sessionId` in `CachedLocalRelation` messages and subsequently makes `SparkConnectPlanner` use the `userId`/`sessionId` of the active session rather than the user-provided information. ### Why are the changes needed? Allowing a fetch of a local relation using user-provided information is a potential security risk since this allows users to fetch arbitrary local relations. ### Does this PR introduce _any_ user-facing change? Virtually no. It will ignore the session ID or user ID that users set (and instead use internal ones that users cannot manipulate). ### How was this patch tested? Manually. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #42880 from HyukjinKwon/no-local-user. 
Authored-by: Hyukjin Kwon Signed-off-by: Hyukjin Kwon (cherry picked from commit 47d801e5e9ded3fb50d274a720ee7874e0b37cc3) Signed-off-by: Hyukjin Kwon --- .../scala/org/apache/spark/sql/SparkSession.scala | 2 - .../main/protobuf/spark/connect/relations.proto| 10 +- .../sql/connect/planner/SparkConnectPlanner.scala | 2 +- python/pyspark/sql/connect/plan.py | 3 - python/pyspark/sql/connect/proto/relations_pb2.py | 160 ++--- python/pyspark/sql/connect/proto/relations_pb2.pyi | 15 +- 6 files changed, 87 insertions(+), 105 deletions(-) diff --git a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SparkSession.scala b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SparkSession.scala index 7882ea64013..7bd8fa59aea 100644 --- a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SparkSession.scala +++ b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SparkSession.scala @@ -134,8 +134,6 @@ class SparkSession private[sql] ( } else { val hash = client.cacheLocalRelation(arrowData, encoder.schema.json) builder.getCachedLocalRelationBuilder -.setUserId(client.userId) -.setSessionId(client.sessionId) .setHash(hash) } } else { diff --git a/connector/connect/common/src/main/protobuf/spark/connect/relations.proto b/connector/connect/common/src/main/protobuf/spark/connect/relations.proto index 8001b3cbcfa..f7f1315ede0 100644 --- a/connector/connect/common/src/main/protobuf/spark/connect/relations.proto +++ b/connector/connect/common/src/main/protobuf/spark/connect/relations.proto @@ -400,11 +400,11 @@ message LocalRelation { // A local relation that has been cached already. message CachedLocalRelation { - // (Required) An identifier of the user which created the local relation - string userId = 1; - - // (Required) An identifier of the Spark SQL session in which the user created the local relation. 
- string sessionId = 2; + // `userId` and `sessionId` fields are deleted since the server must always use the active + // session/user rather than arbitrary values provided by the client. It is never valid to access + // a local relation from a different session/user. + reserved 1, 2; + reserved "userId", "sessionId"; // (Required) A sha-256 hash of the serialized local relation in proto, see LocalRelation. string hash = 3; diff --git a/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala b/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala index 2abbacc5a9b..641dfc5dcd3 100644 --- a/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala +++ b/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala @@ -970,7 +970,7 @@ class SparkConnectPlanner(val sessionHolder: SessionHolder) extends Logging { private def transformCachedLocalRelation(rel: proto.CachedLocalRelation): LogicalPlan = { val blockManager = session.sparkContext.env.blockManager -val blockId = CacheId(rel.getUserId, rel.getSessionId, rel.getHash) +val blockId = CacheId(sessionHolder.userId, sessionHolder.sessionId, rel.getHash) val bytes = blockManager.getLocalBytes(blockId) bytes .map { blockData => diff --git