[GitHub] [spark-website] yaooqinn commented on a diff in pull request #476: Add News, Release Note, Download Link for Apache Spark 3.5.0
yaooqinn commented on code in PR #476: URL: https://github.com/apache/spark-website/pull/476#discussion_r1325386610

## releases/_posts/2023-09-13-spark-release-3-5-0.md:
## @@ -0,0 +1,336 @@
+---
+layout: post
+title: Spark Release 3.5.0
+categories: []
+tags: []
+status: publish
+type: post
+published: true
+meta:
+_edit_last: '4'
+_wpas_done_all: '1'
+---
+
+Apache Spark 3.5.0 is the sixth release in the 3.x series. With significant contributions from the open-source community, this release addressed over 1,300 Jira tickets.
+
+This release brings more Spark Connect scenarios to general availability, such as the Scala and Go clients, distributed training and inference support, and improved compatibility for Structured Streaming; introduces new PySpark and SQL functionality such as the SQL IDENTIFIER clause, named argument support for SQL function calls, SQL function support for HyperLogLog approximate aggregations, and Python user-defined table functions; simplifies distributed training with DeepSpeed; and introduces watermark propagation among operators and the dropDuplicatesWithinWatermark operation in Structured Streaming.
+
+To download Apache Spark 3.5.0, please visit the [downloads](https://spark.apache.org/downloads.html) page. For [detailed changes](https://s.apache.org/spark-3.5.0), you can consult JIRA. We have also curated a list of high-level changes here, grouped by major modules.
+
+
+* This will become a table of contents (this text will be scraped).
+{:toc}
+
+
+### Highlights
+
+* Scala and Go client support in Spark Connect [SPARK-42554](https://issues.apache.org/jira/browse/SPARK-42554) [SPARK-43351](https://issues.apache.org/jira/browse/SPARK-43351)
+* PyTorch-based distributed ML support for Spark Connect [SPARK-42471](https://issues.apache.org/jira/browse/SPARK-42471)
+* Structured Streaming support for Spark Connect in Python and Scala [SPARK-42938](https://issues.apache.org/jira/browse/SPARK-42938)
+* Pandas API support for the Python Spark Connect client [SPARK-42497](https://issues.apache.org/jira/browse/SPARK-42497)
+* Introduce Arrow Python UDFs [SPARK-40307](https://issues.apache.org/jira/browse/SPARK-40307)
+* Support Python user-defined table functions [SPARK-43798](https://issues.apache.org/jira/browse/SPARK-43798)
+* Migrate PySpark errors onto error classes [SPARK-42986](https://issues.apache.org/jira/browse/SPARK-42986)
+* PySpark test framework [SPARK-44042](https://issues.apache.org/jira/browse/SPARK-44042)
+* Add support for Datasketches HllSketch [SPARK-16484](https://issues.apache.org/jira/browse/SPARK-16484)
+* Built-in SQL function improvements [SPARK-41231](https://issues.apache.org/jira/browse/SPARK-41231)
+* IDENTIFIER clause [SPARK-43205](https://issues.apache.org/jira/browse/SPARK-43205)
+* Add SQL functions to the Scala, Python and R APIs [SPARK-43907](https://issues.apache.org/jira/browse/SPARK-43907)
+* Add named argument support for SQL functions [SPARK-43922](https://issues.apache.org/jira/browse/SPARK-43922)
+* Avoid unnecessary task reruns on decommissioned executor loss when shuffle data has been migrated [SPARK-41469](https://issues.apache.org/jira/browse/SPARK-41469)
+* Distributed ML <> Spark Connect [SPARK-42471](https://issues.apache.org/jira/browse/SPARK-42471)
+* DeepSpeed distributor [SPARK-44264](https://issues.apache.org/jira/browse/SPARK-44264)
+* Implement changelog checkpointing for the RocksDB state store [SPARK-43421](https://issues.apache.org/jira/browse/SPARK-43421)
+* Introduce watermark propagation among operators [SPARK-42376](https://issues.apache.org/jira/browse/SPARK-42376)
+* Introduce dropDuplicatesWithinWatermark [SPARK-42931](https://issues.apache.org/jira/browse/SPARK-42931)
+* RocksDB state store provider memory management enhancements [SPARK-43311](https://issues.apache.org/jira/browse/SPARK-43311)
+
+
+### Spark Connect
+
+* Refactored the sql module into sql and sql-api to produce a minimum set of dependencies that can be shared between the Scala Spark Connect client and Spark, avoiding pulling in all of Spark's transitive dependencies. [SPARK-44273](https://issues.apache.org/jira/browse/SPARK-44273)
+* Introduced the Scala client for Spark Connect [SPARK-42554](https://issues.apache.org/jira/browse/SPARK-42554)
+* Pandas API support for the Python Spark Connect client [SPARK-42497](https://issues.apache.org/jira/browse/SPARK-42497)
+* PyTorch-based distributed ML support for Spark Connect [SPARK-42471](https://issues.apache.org/jira/browse/SPARK-42471)
+* Structured Streaming support for Spark Connect in Python and Scala [SPARK-42938](https://issues.apache.org/jira/browse/SPARK-42938)
+* Initial version of the Go client [SPARK-43351](https://issues.apache.org/jira/browse/SPARK-43351)
+* Lots of compatibility improvements between Spark native and the Spark Connect clients across Python and Scala
+* Improved debuggability and request handling for
[spark] branch master updated: [SPARK-45056][PYTHON][SS][CONNECT] Termination tests for streamingQueryListener and foreachBatch
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
new b6190a3db97 [SPARK-45056][PYTHON][SS][CONNECT] Termination tests for streamingQueryListener and foreachBatch

b6190a3db97 is described below

commit b6190a3db974c19b6a0c4fe7af75531d67755074
Author: Wei Liu
AuthorDate: Thu Sep 14 11:23:44 2023 +0900

[SPARK-45056][PYTHON][SS][CONNECT] Termination tests for streamingQueryListener and foreachBatch

### What changes were proposed in this pull request?

Add termination tests for StreamingQueryListener and foreachBatch. The behavior is mimicked by creating, on the server side, the same query that would have been created if the same Python query were run on the client side. For example, in foreachBatch, a Python foreachBatch function is serialized using the CloudPickleSerializer and passed to the server side; here we start another Python process on the server, call the same CloudPickleSerializer, pass the bytes to the server, and construct a `SimplePythonFunction` accordingly. The code was also refactored a bit for testing purposes.

### Why are the changes needed?

Necessary tests.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Test-only addition.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #42779 from WweiL/SPARK-44435-followup-termination-tests.
Lead-authored-by: Wei Liu Co-authored-by: Wei Liu Signed-off-by: Hyukjin Kwon --- .github/workflows/build_and_test.yml | 6 +- .../sql/connect/planner/SparkConnectPlanner.scala | 3 +- .../planner/StreamingQueryListenerHelper.scala | 8 +- .../service/SparkConnectSessionHodlerSuite.scala | 205 + .../spark/api/python/PythonWorkerFactory.scala | 5 + .../spark/api/python/StreamingPythonRunner.scala | 26 ++- .../connect/streaming/test_parity_foreach_batch.py | 1 - .../streaming/test_streaming_foreach_batch.py | 2 - .../apache/spark/sql/IntegratedUDFTestUtils.scala | 4 +- 9 files changed, 242 insertions(+), 18 deletions(-) diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml index 21809564497..25c95bd607d 100644 --- a/.github/workflows/build_and_test.yml +++ b/.github/workflows/build_and_test.yml @@ -256,14 +256,14 @@ jobs: # We should install one Python that is higher then 3+ for SQL and Yarn because: # - SQL component also has Python related tests, for example, IntegratedUDFTestUtils. # - Yarn has a Python specific test too, for example, YarnClusterSuite. - if: contains(matrix.modules, 'yarn') || (contains(matrix.modules, 'sql') && !contains(matrix.modules, 'sql-')) + if: contains(matrix.modules, 'yarn') || (contains(matrix.modules, 'sql') && !contains(matrix.modules, 'sql-')) || contains(matrix.modules, 'connect') with: python-version: 3.8 architecture: x64 - name: Install Python packages (Python 3.8) - if: (contains(matrix.modules, 'sql') && !contains(matrix.modules, 'sql-')) + if: (contains(matrix.modules, 'sql') && !contains(matrix.modules, 'sql-')) || contains(matrix.modules, 'connect') run: | -python3.8 -m pip install 'numpy>=1.20.0' 'pyarrow==12.0.1' pandas scipy unittest-xml-reporting 'grpcio==1.56.0' 'protobuf==3.20.3' +python3.8 -m pip install 'numpy>=1.20.0' 'pyarrow==12.0.1' pandas scipy unittest-xml-reporting 'grpcio>=1.48,<1.57' 'grpcio-status>=1.48,<1.57' 'protobuf==3.20.3' python3.8 -m pip list # Run the tests. 
- name: Run tests diff --git a/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala b/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala index b8ab5539b30..24dee006f0b 100644 --- a/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala +++ b/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala @@ -3131,8 +3131,7 @@ class SparkConnectPlanner(val sessionHolder: SessionHolder) extends Logging { val listener = if (command.getAddListener.hasPythonListenerPayload) { new PythonStreamingQueryListener( transformPythonFunction(command.getAddListener.getPythonListenerPayload), -sessionHolder, -pythonExec) +sessionHolder) } else { val listenerPacket = Utils .deserialize[StreamingListenerPacket]( diff --git a/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/StreamingQueryListenerHelper.scala
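The round-trip the commit message describes (serialize the foreachBatch callback on the client, deserialize and invoke it on the server) can be sketched with the standard library. This is only an illustration under stated assumptions: Spark actually uses cloudpickle, which serializes function code itself (including closures), while plain `pickle` serializes module-level functions by reference; `process_batch` is a hypothetical stand-in for a user callback.

```python
import pickle

# Hypothetical stand-in for a user's foreachBatch callback. Spark itself uses
# cloudpickle, which can ship the function body; plain pickle only records a
# reference to a module-level function, so this merely sketches the shape of
# the round-trip.
def process_batch(batch_rows, batch_id):
    return [(batch_id, row.upper()) for row in batch_rows]

# "Client side": serialize the callback to bytes, as the Connect client does
# before shipping the payload to the server.
payload = pickle.dumps(process_batch)

# "Server side": deserialize the bytes and invoke the reconstructed function,
# mirroring how the termination test constructs the same function the server
# would have received from a real client.
restored = pickle.loads(payload)
result = restored(["a", "b"], 7)
print(result)  # [(7, 'A'), (7, 'B')]
```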
[spark] branch master updated: [SPARK-36191][SQL] Handle limit and order by in correlated scalar (lateral) subqueries
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
new 21c27d55b34 [SPARK-36191][SQL] Handle limit and order by in correlated scalar (lateral) subqueries

21c27d55b34 is described below

commit 21c27d55b342eb32bdf2d56a81dd9adb23f5526e
Author: Andrey Gubichev
AuthorDate: Thu Sep 14 09:09:05 2023 +0800

[SPARK-36191][SQL] Handle limit and order by in correlated scalar (lateral) subqueries

### What changes were proposed in this pull request?

Handle LIMIT/ORDER BY in correlated scalar (lateral) subqueries by rewriting them using the ROW_NUMBER() window function.

### Why are the changes needed?

Extends our coverage of subqueries.

### Does this PR introduce _any_ user-facing change?

Users are able to run more subqueries now.

### How was this patch tested?

Unit tests and query tests. Results of the query tests are verified against PostgreSQL.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #42705 from agubichev/SPARK-36191-limit.
Authored-by: Andrey Gubichev Signed-off-by: Wenchen Fan --- .../sql/catalyst/analysis/CheckAnalysis.scala | 5 + .../catalyst/optimizer/DecorrelateInnerQuery.scala | 33 +++ .../sql/catalyst/analysis/AnalysisErrorSuite.scala | 8 - .../optimizer/DecorrelateInnerQuerySuite.scala | 111 +++ .../analyzer-results/join-lateral.sql.out | 155 ++ .../analyzer-results/postgreSQL/join.sql.out | 318 + .../scalar-subquery-predicate.sql.out | 123 .../resources/sql-tests/inputs/join-lateral.sql| 27 ++ .../resources/sql-tests/inputs/postgreSQL/join.sql | 82 +++--- .../scalar-subquery/scalar-subquery-predicate.sql | 32 +++ .../sql-tests/results/join-lateral.sql.out | 81 ++ .../sql-tests/results/postgreSQL/join.sql.out | 82 ++ .../scalar-subquery-predicate.sql.out | 66 + 13 files changed, 1074 insertions(+), 49 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala index 038cd7d944a..4cfeb60eb04 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala @@ -1418,6 +1418,11 @@ trait CheckAnalysis extends PredicateHelper with LookupCatalog with QueryErrorsB failOnInvalidOuterReference(g) checkPlan(g.child, aggregated, canContainOuter) +// Correlated subquery can have a LIMIT clause +case l @ Limit(_, input) => + failOnInvalidOuterReference(l) + checkPlan(input, aggregated, canContainOuter) + // Category 4: Any other operators not in the above 3 categories // cannot be on a correlation path, that is they are allowed only // under a correlation point but they and their descendant operators diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/DecorrelateInnerQuery.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/DecorrelateInnerQuery.scala index 
a07177f6e8a..1cb7c8f3ac6 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/DecorrelateInnerQuery.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/DecorrelateInnerQuery.scala @@ -655,6 +655,39 @@ object DecorrelateInnerQuery extends PredicateHelper { val newProject = Project(newProjectList ++ referencesToAdd, newChild) (newProject, joinCond, outerReferenceMap) + case Limit(limit, input) => +// LIMIT K (with potential ORDER BY) is decorrelated by computing K rows per every +// domain value via a row_number() window function. For example, for a subquery +// (SELECT T2.a FROM T2 WHERE T2.b = OuterReference(x) ORDER BY T2.c LIMIT 3) +// -- we need to get top 3 values of T2.a (ordering by T2.c) for every value of x. +// Following our general decorrelation procedure, 'x' is then replaced by T2.b, so the +// subquery is decorrelated as: +// SELECT * FROM ( +// SELECT T2.a, row_number() OVER (PARTITION BY T2.b ORDER BY T2.c) AS rn FROM T2) +// WHERE rn <= 3 +val (child, ordering) = input match { + case Sort(order, _, child) => (child, order) + case _ => (input, Seq()) +} +val (newChild, joinCond, outerReferenceMap) = + decorrelate(child, parentOuterReferences, aggregated = true, underSetOp)
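The rewrite described in the code comment above can be illustrated outside Spark with plain Python: a correlated `ORDER BY ... LIMIT K` subquery returns the top K rows per outer value, which is exactly what partitioning by the correlation column and keeping row numbers `<= K` produces. The table `t2` and its rows below are made up for illustration.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows of a table T2 as (a, b, c) tuples -- illustration only.
t2 = [
    ("r1", 1, 30), ("r2", 1, 10), ("r3", 1, 20), ("r4", 1, 5),
    ("r5", 2, 1), ("r6", 2, 2),
]

def correlated_limit(outer_x, k=3):
    # Naive semantics of the correlated subquery:
    #   SELECT T2.a FROM T2 WHERE T2.b = outer_x ORDER BY T2.c LIMIT k
    matching = sorted((r for r in t2 if r[1] == outer_x), key=itemgetter(2))
    return [r[0] for r in matching[:k]]

def decorrelated_limit(k=3):
    # Decorrelated form: row_number() OVER (PARTITION BY T2.b ORDER BY T2.c),
    # then keep rows with rn <= k. Computes {b: [a, ...]} for every partition;
    # the outer join condition (T2.b = x) later picks the relevant partition.
    by_partition = {}
    for b, rows in groupby(sorted(t2, key=itemgetter(1, 2)), key=itemgetter(1)):
        by_partition[b] = [r[0] for r in list(rows)[:k]]
    return by_partition

dec = decorrelated_limit()
print(dec)  # {1: ['r4', 'r2', 'r3'], 2: ['r5', 'r6']}
```

Both forms agree for every outer value, which is the correctness argument behind the optimizer rule.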
[spark] branch master updated: [SPARK-45135][PYTHON][TESTS] Make `utils.eventually` a parameterized decorator
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
new 9798244ca64 [SPARK-45135][PYTHON][TESTS] Make `utils.eventually` a parameterized decorator

9798244ca64 is described below

commit 9798244ca647ec68d36f4b9b21356a6de5f73f77
Author: Ruifeng Zheng
AuthorDate: Thu Sep 14 08:44:53 2023 +0800

[SPARK-45135][PYTHON][TESTS] Make `utils.eventually` a parameterized decorator

### What changes were proposed in this pull request?

- Make `utils.eventually` a parameterized decorator
- Retry `test_read_images` if it fails

### Why are the changes needed?

Previously, we used `utils.eventually` to retry flaky tests, e.g. https://github.com/apache/spark/commit/745ed93fe451b3f9e8148b06356c28b889a4db5a; however, it requires modifying the test body, and sometimes the change may be large. To minimize the changes, I'd like to ~~add a decorator for test retry~~ make `utils.eventually` a parameterized decorator.

### Does this PR introduce _any_ user-facing change?

No, test-only.

### How was this patch tested?

CI

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #42891 from zhengruifeng/test_retry_test_read_images.
Authored-by: Ruifeng Zheng Signed-off-by: Ruifeng Zheng --- python/pyspark/ml/tests/test_image.py | 3 +- python/pyspark/ml/tests/test_wrapper.py| 4 +- python/pyspark/mllib/tests/test_algorithms.py | 37 ++-- .../mllib/tests/test_streaming_algorithms.py | 24 .../pandas/test_pandas_grouped_map_with_state.py | 2 +- python/pyspark/testing/utils.py| 70 ++ python/pyspark/tests/test_taskcontext.py | 3 +- python/pyspark/tests/test_util.py | 23 ++- python/pyspark/tests/test_worker.py| 15 ++--- 9 files changed, 112 insertions(+), 69 deletions(-) diff --git a/python/pyspark/ml/tests/test_image.py b/python/pyspark/ml/tests/test_image.py index 86fa46c3248..ee254a41007 100644 --- a/python/pyspark/ml/tests/test_image.py +++ b/python/pyspark/ml/tests/test_image.py @@ -19,10 +19,11 @@ import unittest from pyspark.ml.image import ImageSchema from pyspark.testing.mlutils import SparkSessionTestCase from pyspark.sql import Row -from pyspark.testing.utils import QuietTest +from pyspark.testing.utils import QuietTest, eventually class ImageFileFormatTest(SparkSessionTestCase): +@eventually(timeout=60.0, catch_assertions=True) def test_read_images(self): data_path = "data/mllib/images/origin/kittens" df = ( diff --git a/python/pyspark/ml/tests/test_wrapper.py b/python/pyspark/ml/tests/test_wrapper.py index 33d93c02acd..3efdbabd998 100644 --- a/python/pyspark/ml/tests/test_wrapper.py +++ b/python/pyspark/ml/tests/test_wrapper.py @@ -63,7 +63,7 @@ class JavaWrapperMemoryTests(SparkSessionTestCase): self.assertIn("LinearRegressionTrainingSummary", summary._java_obj.toString()) return True -eventually(condition, timeout=10, catch_assertions=True) +eventually(timeout=10, catch_assertions=True)(condition)() try: summary.__del__() @@ -77,7 +77,7 @@ class JavaWrapperMemoryTests(SparkSessionTestCase): summary._java_obj.toString() return True -eventually(condition, timeout=10, catch_assertions=True) +eventually(timeout=10, catch_assertions=True)(condition)() class WrapperTests(MLlibTestCase): diff 
--git a/python/pyspark/mllib/tests/test_algorithms.py b/python/pyspark/mllib/tests/test_algorithms.py index dc48c2c021d..bcedd65b05b 100644 --- a/python/pyspark/mllib/tests/test_algorithms.py +++ b/python/pyspark/mllib/tests/test_algorithms.py @@ -97,27 +97,28 @@ class ListTests(MLlibTestCase): # TODO: Allow small numeric difference. self.assertTrue(array_equal(c1, c2)) +@eventually(timeout=60, catch_assertions=True) def test_gmm(self): from pyspark.mllib.clustering import GaussianMixture -def condition(): -data = self.sc.parallelize( -[ -[1, 2], -[8, 9], -[-4, -3], -[-6, -7], -] -) -clusters = GaussianMixture.train( -data, 2, convergenceTol=0.001, maxIterations=10, seed=1 -) -labels = clusters.predict(data).collect() -self.assertEqual(labels[0], labels[1]) -self.assertEqual(labels[2], labels[3]) -return True - -eventually(condition, timeout=60, catch_assertions=True) +data = self.sc.parallelize( +[ +
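The parameterized-decorator shape described above (also visible in the diff's `eventually(timeout=10, catch_assertions=True)(condition)()` call sites) can be sketched as follows. This is a simplified stand-in modeled on `pyspark.testing.utils.eventually`, not its actual implementation; the parameter names mirror the ones in the commit.

```python
import functools
import time

def eventually(timeout=30.0, interval=0.1, catch_assertions=False):
    """Retry the wrapped callable until it succeeds or `timeout` seconds elapse.

    Sketch of a parameterized decorator: eventually(...) returns a decorator,
    which returns the retrying wrapper -- hence the (condition)() call shape.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            deadline = time.monotonic() + timeout
            last_exc = None
            while time.monotonic() < deadline:
                try:
                    return func(*args, **kwargs)
                except AssertionError as exc:
                    if not catch_assertions:
                        raise  # only retry assertions when explicitly asked
                    last_exc = exc
                except Exception as exc:
                    last_exc = exc
                time.sleep(interval)
            raise last_exc if last_exc else TimeoutError("condition never met")
        return wrapper
    return decorator

calls = {"n": 0}

@eventually(timeout=5.0, interval=0.01, catch_assertions=True)
def flaky_check():
    # Fails twice before succeeding, simulating a flaky test body.
    calls["n"] += 1
    assert calls["n"] >= 3, "not yet"
    return "ok"

result = flaky_check()
print(result)  # ok
```

Because the decorator is parameterized, a flaky test is retried by adding one line above its `def`, with no change to the test body, which is the point of the commit.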
[spark] branch master updated (c12eb2c93d7 -> d26e77aabd2)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from c12eb2c93d7 [SPARK-45144][BUILD] Downgrade `scala-maven-plugin` to 4.7.1 add d26e77aabd2 Revert "[SPARK-45120][UI] Upgrade d3 from v3 to v7(v7.8.5) and apply api changes in UI" No new revisions were added by this update. Summary of changes: LICENSE| 6 +-- LICENSE-binary | 6 +-- .../resources/org/apache/spark/ui/static/d3.min.js | 7 +++- .../org/apache/spark/ui/static/spark-dag-viz.js| 6 +-- .../org/apache/spark/ui/static/streaming-page.js | 48 ++ .../spark/ui/static/structured-streaming-page.js | 31 -- licenses-binary/LICENSE-d3.min.js.txt | 39 -- licenses/LICENSE-d3.min.js.txt | 39 -- .../spark/sql/execution/ui/static/spark-sql-viz.js | 4 +- 9 files changed, 105 insertions(+), 81 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[GitHub] [spark-website] Yikun commented on pull request #476: Add News, Release Note, Download Link for Apache Spark 3.5.0
Yikun commented on PR #476: URL: https://github.com/apache/spark-website/pull/476#issuecomment-1718441629

https://github.com/apache/spark-docker/pull/55 I have already submitted the draft patch; I need your help to upload the key.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [spark-website] xuanyuanking commented on pull request #476: Add News, Release Note, Download Link for Apache Spark 3.5.0
xuanyuanking commented on PR #476: URL: https://github.com/apache/spark-website/pull/476#issuecomment-1718304305

cc @Yikun, could you also help me with the DockerHub image parts? I don't have access.
[GitHub] [spark-website] xuanyuanking commented on pull request #476: Add News, Release Note, Download Link for Apache Spark 3.5.0
xuanyuanking commented on PR #476: URL: https://github.com/apache/spark-website/pull/476#issuecomment-1718300223

cc @gengliangwang
[GitHub] [spark-website] xuanyuanking opened a new pull request, #476: Add News, Release Note, Download Link for Apache Spark 3.5.0
xuanyuanking opened a new pull request, #476: URL: https://github.com/apache/spark-website/pull/476

- Vote Result: https://lists.apache.org/thread/kqqlsy90ormkccddtx6wgqmy8r1df8tx
- Documentation: https://spark.apache.org/docs/3.5.0/
- Binary: https://downloads.apache.org/spark/spark-3.5.0/
- PySpark: pending on the File Limit Request https://github.com/pypi/support/issues/3175
- Maven Central: https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.13/3.5.0/

Screenshots:
- https://github.com/apache/spark-website/assets/4833765/a74b94d0-f3ba-4388-9941-3afa43728ecb
- https://github.com/apache/spark-website/assets/4833765/ee599a66-559a-48fc-a7df-4f0b785fb544
- https://github.com/apache/spark-website/assets/4833765/a22d8f02-31f7-4173-9624-90a35dd209a3
[spark-website] branch 3.5.0-release deleted (was dc8b61a479)
This is an automated email from the ASF dual-hosted git repository. liyuanjian pushed a change to branch 3.5.0-release in repository https://gitbox.apache.org/repos/asf/spark-website.git was dc8b61a479 Add docs for Apache Spark 3.5.0 The revisions that were on this branch are still contained in other references; therefore, this change does not discard any commits from the repository.
[spark-website] branch 3.5.0-release created (now dc8b61a479)
This is an automated email from the ASF dual-hosted git repository. liyuanjian pushed a change to branch 3.5.0-release in repository https://gitbox.apache.org/repos/asf/spark-website.git at dc8b61a479 Add docs for Apache Spark 3.5.0 No new revisions were added by this update.
[spark] branch master updated: [SPARK-45144][BUILD] Downgrade `scala-maven-plugin` to 4.7.1
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
new c12eb2c93d7 [SPARK-45144][BUILD] Downgrade `scala-maven-plugin` to 4.7.1

c12eb2c93d7 is described below

commit c12eb2c93d7678795726a7f2ccbb30a66a93fb22
Author: yangjie01
AuthorDate: Wed Sep 13 08:43:52 2023 -0700

[SPARK-45144][BUILD] Downgrade `scala-maven-plugin` to 4.7.1

### What changes were proposed in this pull request?

This PR downgrades `scala-maven-plugin` to version 4.7.1 to avoid it automatically adding the `-release` option as a Scala compilation argument.

### Why are the changes needed?

`scala-maven-plugin` versions 4.7.2 and later try to automatically append the `-release` option as a Scala compilation argument when it is not specified by the user:

1. 4.7.2 and 4.8.0: try to add the `-release` option for Scala versions 2.13.9 and higher.
2. 4.8.1: try to append the `-release` option for Scala versions 2.12.x/2.13.x/3.1.1, and append `-java-output-version` for Scala 3.1.2.

The addition of the `-release` option has caused the issues mentioned in SPARK-44376 | https://github.com/apache/spark/pull/41943 and https://github.com/apache/spark/pull/40442#issuecomment-1474161688. This is because the `-release` option imposes stronger compilation restrictions than `-target`, ensuring not only the bytecode format, but also that the APIs used in the code are compatible with the specified version of Java. However, many APIs in the `sun.*` package are not `exports` in Java 11, 17, [...]

For discussions within the Scala community, see https://github.com/scala/bug/issues/12643, https://github.com/scala/bug/issues/12824, https://github.com/scala/bug/issues/12866, but this is not a bug.
I have also submitted an issue to the `scala-maven-plugin` community to discuss the possibility of adding additional settings to control the addition of the `-release` option: https://github.com/davidB/scala-maven-plugin/issues/722.

For Apache Spark 4.0, in the short term, I suggest downgrading `scala-maven-plugin` to version 4.7.1 to avoid it automatically adding the `-release` option as a Scala compilation argument. In the long term, we should reduce the use of APIs that are not `exports` for compatibility with the `-release` compilation option, since `-target` is already deprecated as of Scala 2.13.9.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

- Pass GitHub Actions
- Manual check

Run `git revert 656bf36363c466b60d0045234ccaaa654ed8` to revert to using Scala 2.13.11 and run `dev/change-scala-version.sh 2.13` to change Scala to 2.13.

1. Run `build/mvn clean install -DskipTests -Pscala-2.13 -X` to check the Scala compilation arguments.

Before ``` [DEBUG] [zinc] Running cached compiler 1992eaf4 for Scala compiler version 2.13.11 [DEBUG] [zinc] The Scala compiler is invoked with: -unchecked -deprecation -feature -explaintypes -target:jvm-1.8 -Wconf:cat=deprecation:wv,any:e -Wunused:imports -Wconf:cat=scaladoc:wv -Wconf:cat=lint-multiarg-infix:wv -Wconf:cat=other-nullary-override:wv -Wconf:cat=other-match-analysis=org.apache.spark.sql.catalyst.catalog.SessionCatalog.lookupFunction.catalogFunction:wv -Wconf:cat=other-pure-statement=org.apache.spark.streaming.util.FileBasedWriteAheadLog.readAll.readFile:wv -Wconf:cat=other-pure-statement=org.apache.spark.scheduler.OutputCommitCoordinatorSuite..futureAction:wv -Wconf:msg=^(?=.*?method|value|type|object|trait|inheritance)(?=.*?deprecated)(?=.*?since 2.13).+$:s -Wconf:msg=^(?=.*?Widening conversion from)(?=.*?is deprecated because it loses precision).+$:s -Wconf:msg=Auto-application to \`\(\)\` is deprecated:s -Wconf:msg=method with a single empty parameter list overrides method without any
parameter list:s -Wconf:msg=method without a parameter list overrides a method with a single empty one:s -Wconf:cat=deprecation=procedure syntax is deprecated:e -Wconf:cat=unchecked=outer reference:s -Wconf:cat=unchecked=eliminated by erasure:s -Wconf:msg=^(?=.*?a value of type)(?=.*?cannot also be).+$:s -Wconf:cat=unused-imports=org\/apache\/spark\/graphx\/impl\/VertexPartitionBase.scala:s -Wconf:cat=unused-imports=org\/apache\/spark\/graphx\/impl\/VertexPartitionBaseOps.scala:s -Wconf:msg=Implicit definition should have explicit type:s -release 8 -bootclasspath ... ``` After ``` [DEBUG] [zinc] Running cached compiler 72dd4888 for Scala compiler version 2.13.11 [DEBUG] [zinc] The Scala compiler is invoked with: -unchecked
[spark] branch master updated: [SPARK-45157][SQL] Avoid repeated `if` checks in `[On|Off]HeapColumnVector`
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
new b568ba43f0d [SPARK-45157][SQL] Avoid repeated `if` checks in `[On|Off]HeapColumnVector`

b568ba43f0d is described below

commit b568ba43f0dd80130bca1bf86c48d0d359e57883
Author: Wenchen Fan
AuthorDate: Wed Sep 13 08:36:05 2023 -0700

[SPARK-45157][SQL] Avoid repeated `if` checks in `[On|Off]HeapColumnVector`

### What changes were proposed in this pull request?

This is a small followup of https://github.com/apache/spark/pull/42850. `getBytes` checks whether the `dictionary` is null, then calls `getByte`, which also checks whether the `dictionary` is null. This PR avoids the repeated `if` checks by copying one line of code from `getByte` into `getBytes`. The same applies to the other `getXXX` methods.

### Why are the changes needed?

Make the perf-critical path more efficient.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #42903 from cloud-fan/vector.
Authored-by: Wenchen Fan
Signed-off-by: Dongjoon Hyun
---
 .../spark/sql/execution/vectorized/OffHeapColumnVector.java | 12 ++--
 .../spark/sql/execution/vectorized/OnHeapColumnVector.java  | 12 ++--
 2 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OffHeapColumnVector.java b/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OffHeapColumnVector.java
index 9cb1b1f0b5e..2bb0b02d4c9 100644
--- a/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OffHeapColumnVector.java
+++ b/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OffHeapColumnVector.java
@@ -218,7 +218,7 @@ public final class OffHeapColumnVector extends WritableColumnVector {
       Platform.copyMemory(null, data + rowId, array, Platform.BYTE_ARRAY_OFFSET, count);
     } else {
       for (int i = 0; i < count; i++) {
-        array[i] = getByte(rowId + i);
+        array[i] = (byte) dictionary.decodeToInt(dictionaryIds.getDictId(rowId + i));
       }
     }
     return array;
@@ -279,7 +279,7 @@ public final class OffHeapColumnVector extends WritableColumnVector {
       Platform.copyMemory(null, data + rowId * 2L, array, Platform.SHORT_ARRAY_OFFSET, count * 2L);
     } else {
       for (int i = 0; i < count; i++) {
-        array[i] = getShort(rowId + i);
+        array[i] = (short) dictionary.decodeToInt(dictionaryIds.getDictId(rowId + i));
       }
     }
     return array;
@@ -345,7 +345,7 @@ public final class OffHeapColumnVector extends WritableColumnVector {
       Platform.copyMemory(null, data + rowId * 4L, array, Platform.INT_ARRAY_OFFSET, count * 4L);
     } else {
       for (int i = 0; i < count; i++) {
-        array[i] = getInt(rowId + i);
+        array[i] = dictionary.decodeToInt(dictionaryIds.getDictId(rowId + i));
       }
     }
     return array;
@@ -423,7 +423,7 @@ public final class OffHeapColumnVector extends WritableColumnVector {
       Platform.copyMemory(null, data + rowId * 8L, array, Platform.LONG_ARRAY_OFFSET, count * 8L);
     } else {
       for (int i = 0; i < count; i++) {
-        array[i] = getLong(rowId + i);
+        array[i] = dictionary.decodeToLong(dictionaryIds.getDictId(rowId + i));
       }
     }
     return array;
@@ -487,7 +487,7 @@ public final class OffHeapColumnVector extends WritableColumnVector {
       Platform.copyMemory(null, data + rowId * 4L, array, Platform.FLOAT_ARRAY_OFFSET, count * 4L);
     } else {
       for (int i = 0; i < count; i++) {
-        array[i] = getFloat(rowId + i);
+        array[i] = dictionary.decodeToFloat(dictionaryIds.getDictId(rowId + i));
       }
     }
     return array;
@@ -553,7 +553,7 @@ public final class OffHeapColumnVector extends WritableColumnVector {
       count * 8L);
     } else {
       for (int i = 0; i < count; i++) {
-        array[i] = getDouble(rowId + i);
+        array[i] = dictionary.decodeToDouble(dictionaryIds.getDictId(rowId + i));
       }
     }
     return array;
diff --git a/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java b/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java
index be590bb9ac7..2bf2b8d08fc 100644
--- a/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java
+++ b/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java
@@ -216,7 +216,7 @@ public final class OnHeapColumnVector extends WritableColumnVector {
       System.arraycopy(byteData, rowId, array, 0, count);
     }
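The optimization above is easiest to see outside Spark. The toy class below is an illustration only (the names `DICT`, `getBytesSlow`, and `getBytesFast` are hypothetical, not Spark's API): it contrasts a bulk getter that pays the `dictionary == null` branch once per row inside the scalar getter against one that hoists the branch out of the loop and inlines the decode, exactly the shape of the commit's change.

```java
import java.util.Arrays;

public class DictVectorSketch {
  // Stand-ins for Spark's Dictionary and dictionary-id vector (hypothetical).
  static final int[] DICT = {10, 20, 30};  // toy dictionary values
  static int[] dictionaryIds;              // non-null means dictionary-encoded
  static byte[] data;                      // plain storage when not encoded

  static byte getByte(int rowId) {
    if (dictionaryIds != null) {           // branch evaluated per row
      return (byte) DICT[dictionaryIds[rowId]];
    }
    return data[rowId];
  }

  // Before the fix: the null check is hidden inside getByte, once per row.
  static byte[] getBytesSlow(int rowId, int count) {
    byte[] array = new byte[count];
    for (int i = 0; i < count; i++) array[i] = getByte(rowId + i);
    return array;
  }

  // After the fix: branch hoisted out of the loop, decode inlined.
  static byte[] getBytesFast(int rowId, int count) {
    byte[] array = new byte[count];
    if (dictionaryIds == null) {
      System.arraycopy(data, rowId, array, 0, count);
    } else {
      for (int i = 0; i < count; i++) {
        array[i] = (byte) DICT[dictionaryIds[rowId + i]];
      }
    }
    return array;
  }

  public static void main(String[] args) {
    dictionaryIds = new int[]{2, 0, 1};
    byte[] slow = getBytesSlow(0, 3);
    byte[] fast = getBytesFast(0, 3);
    if (!Arrays.equals(slow, fast)) throw new AssertionError("paths disagree");
    System.out.println(Arrays.toString(fast));  // prints [30, 10, 20]
  }
}
```

Both paths return identical bytes; the fast path simply stops re-testing a loop-invariant condition on every iteration, which is what matters on this perf-critical columnar read path.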
[spark] branch branch-3.3 updated: [SPARK-45146][DOCS] Update the default value of 'spark.submit.deployMode'
srowen pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.3 by this push:
     new 9ee184ad5cf  [SPARK-45146][DOCS] Update the default value of 'spark.submit.deployMode'

commit 9ee184ad5cf1ea808143cffd6fa982ca8ef503fe
Author: chenyu-opensource <119398199+chenyu-opensou...@users.noreply.github.com>
AuthorDate: Wed Sep 13 08:48:14 2023 -0500

    [SPARK-45146][DOCS] Update the default value of 'spark.submit.deployMode'

    **What changes were proposed in this pull request?**
    This PR updates the default value of 'spark.submit.deployMode' in configuration.html on the website.

    **Why are the changes needed?**
    The default value of 'spark.submit.deployMode' is 'client', but the website shows '(none)'.

    **Does this PR introduce any user-facing change?**
    No.

    **How was this patch tested?**
    No tests needed; this is a documentation-only change.

    **Was this patch authored or co-authored using generative AI tooling?**
    No.

    Closes #42902 from chenyu-opensource/branch-SPARK-45146.

    Authored-by: chenyu-opensource <119398199+chenyu-opensou...@users.noreply.github.com>
    Signed-off-by: Sean Owen
    (cherry picked from commit 076cb7aabac2f0ff11ca77ca530b7b8db5310a5e)
    Signed-off-by: Sean Owen
---
 docs/configuration.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/configuration.md b/docs/configuration.md
index cb1f5212439..9e243635baf 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -394,7 +394,7 @@ of the most common options to set are:
   spark.submit.deployMode
-  (none)
+  client
   The deploy mode of Spark driver program, either "client" or "cluster",
   Which means to launch driver program locally ("client")

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-3.4 updated: [SPARK-45146][DOCS] Update the default value of 'spark.submit.deployMode'
srowen pushed the same commit to branch branch-3.4 in repository https://gitbox.apache.org/repos/asf/spark.git:
     new 7544bdb12d1  [SPARK-45146][DOCS] Update the default value of 'spark.submit.deployMode'

commit 7544bdb12d1d0449aaa7e7a5f8124a5cf662712f (cherry picked from commit 076cb7aabac2f0ff11ca77ca530b7b8db5310a5e)
Author: chenyu-opensource <119398199+chenyu-opensou...@users.noreply.github.com>
AuthorDate: Wed Sep 13 08:48:14 2023 -0500
Signed-off-by: Sean Owen

The commit message and the one-line docs/configuration.md change ((none) -> client) are identical to the branch-3.3 push above; only the file index differs (f099cea7eb9..d61f726130b).
[spark] branch master updated: [SPARK-45146][DOCS] Update the default value of 'spark.submit.deployMode'
srowen pushed the original commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git:
     new 076cb7aabac  [SPARK-45146][DOCS] Update the default value of 'spark.submit.deployMode'

commit 076cb7aabac2f0ff11ca77ca530b7b8db5310a5e
Author: chenyu-opensource <119398199+chenyu-opensou...@users.noreply.github.com>
AuthorDate: Wed Sep 13 08:48:14 2023 -0500
Signed-off-by: Sean Owen

The commit message and the one-line docs/configuration.md change ((none) -> client) are identical to the branch-3.3 push above; only the file index differs (6f7e12555e8..3ca9b704eba).
[spark] branch branch-3.5 updated: [SPARK-45146][DOCS] Update the default value of 'spark.submit.deployMode'
srowen pushed the same commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git:
     new e72ae794e69  [SPARK-45146][DOCS] Update the default value of 'spark.submit.deployMode'

commit e72ae794e69d8182291655d023aee903a913571b (cherry picked from commit 076cb7aabac2f0ff11ca77ca530b7b8db5310a5e)
Author: chenyu-opensource <119398199+chenyu-opensou...@users.noreply.github.com>
AuthorDate: Wed Sep 13 08:48:14 2023 -0500
Signed-off-by: Sean Owen

The commit message and the one-line docs/configuration.md change ((none) -> client) are identical to the branch-3.3 push above; only the file index differs (dfded480c99..1139beb6646).
[spark] branch branch-3.5 updated: [SPARK-45142][INFRA] Specify the range for Spark Connect dependencies in pyspark base image
gurwls223 pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.5 by this push:
     new 151f88b53e6  [SPARK-45142][INFRA] Specify the range for Spark Connect dependencies in pyspark base image

commit 151f88b53e67944d6ca5c635466f50958019c8b4
Author: Hyukjin Kwon
AuthorDate: Wed Sep 13 21:20:19 2023 +0900

    [SPARK-45142][INFRA] Specify the range for Spark Connect dependencies in pyspark base image

    ### What changes were proposed in this pull request?

    This PR proposes to pin the dependencies related to Spark Connect in its base image according to the range we support. See also https://github.com/apache/spark/blob/master/python/docs/source/getting_started/install.rst#dependencies

    ### Why are the changes needed?

    To properly test the dependency versions we support.

    ### Does this PR introduce _any_ user-facing change?

    No, dev-only.

    ### How was this patch tested?

    In this PR, it will be tested.

    ### Was this patch authored or co-authored using generative AI tooling?

    No.

    Closes #42898 from HyukjinKwon/SPARK-45142.

    Authored-by: Hyukjin Kwon
    Signed-off-by: Hyukjin Kwon
    (cherry picked from commit 61435b42fdc4071f35aba6af9248ff9ad8fc8514)
    Signed-off-by: Hyukjin Kwon
---
 dev/infra/Dockerfile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/dev/infra/Dockerfile b/dev/infra/Dockerfile
index af8e1a980f9..d3bae836cc6 100644
--- a/dev/infra/Dockerfile
+++ b/dev/infra/Dockerfile
@@ -68,7 +68,7 @@ RUN pypy3 -m pip install numpy 'pandas<=2.0.3' scipy coverage matplotlib
 RUN python3.9 -m pip install numpy pyarrow 'pandas<=2.0.3' scipy unittest-xml-reporting plotly>=4.8 'mlflow>=2.3.1' coverage matplotlib openpyxl 'memory-profiler==0.60.0' 'scikit-learn==1.1.*'

 # Add Python deps for Spark Connect.
-RUN python3.9 -m pip install grpcio protobuf googleapis-common-protos grpcio-status
+RUN python3.9 -m pip install 'grpcio>=1.48,<1.57' 'grpcio-status>=1.48,<1.57' 'protobuf==3.20.3' 'googleapis-common-protos==1.56.4'

 # Add torch as a testing dependency for TorchDistributor
 RUN python3.9 -m pip install torch torchvision torcheval
[spark] branch master updated: [SPARK-45142][INFRA] Specify the range for Spark Connect dependencies in pyspark base image
gurwls223 pushed the original commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git:
     new 61435b42fdc  [SPARK-45142][INFRA] Specify the range for Spark Connect dependencies in pyspark base image

commit 61435b42fdc4071f35aba6af9248ff9ad8fc8514
Author: Hyukjin Kwon
AuthorDate: Wed Sep 13 21:20:19 2023 +0900
Signed-off-by: Hyukjin Kwon

The commit message and the dev/infra/Dockerfile change are the same as in the branch-3.5 push above; on master the file index is feee7415004..99423ce072c, the surrounding context pins 'pyarrow==12.0.1', and the TorchDistributor line reads "RUN python3.9 -m pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu".
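The pinned specifiers use pip's comma-as-AND syntax: `'grpcio>=1.48,<1.57'` accepts any version in the half-open interval [1.48, 1.57). As a sketch only, the toy comparator below (numeric `major.minor` parsing, not pip's real PEP 440 logic; all names are my own) illustrates why a version such as grpcio 1.56 satisfies the range while 1.57 and 1.47 do not:

```java
public class VersionRange {
  // Parse a "major.minor[...]" version string into its first two numeric parts.
  static int[] parse(String v) {
    String[] p = v.split("\\.");
    return new int[]{Integer.parseInt(p[0]), Integer.parseInt(p[1])};
  }

  // Compare two versions by major, then minor.
  static int cmp(String a, String b) {
    int[] x = parse(a), y = parse(b);
    if (x[0] != y[0]) return Integer.compare(x[0], y[0]);
    return Integer.compare(x[1], y[1]);
  }

  // True iff lo <= v < hi, mirroring the '>=lo,<hi' specifier pair.
  static boolean inRange(String v, String lo, String hi) {
    return cmp(v, lo) >= 0 && cmp(v, hi) < 0;
  }

  public static void main(String[] args) {
    if (!inRange("1.56", "1.48", "1.57")) throw new AssertionError();
    if (inRange("1.57", "1.48", "1.57")) throw new AssertionError();  // upper bound excluded
    if (inRange("1.47", "1.48", "1.57")) throw new AssertionError();  // below lower bound
    System.out.println("range checks pass");
  }
}
```

The lower bound keeps the image on versions Spark Connect actually supports; the exclusive upper bound keeps a future grpcio release from silently changing what CI tests.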
[spark] branch master updated: [SPARK-45139][SQL] Add DatabricksDialect to handle SQL type conversion
dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 2710dbe6abf  [SPARK-45139][SQL] Add DatabricksDialect to handle SQL type conversion

commit 2710dbe6abf0b143a18268a99635d0e6033bea78
Author: Ivan Sadikov
AuthorDate: Wed Sep 13 02:28:56 2023 -0700

    [SPARK-45139][SQL] Add DatabricksDialect to handle SQL type conversion

    ### What changes were proposed in this pull request?

    This PR adds `DatabricksDialect` to Spark to allow users to query Databricks clusters and Databricks SQL warehouses with more precise SQL type conversion and quoted identifiers, instead of doing it manually in the code.

    ### Why are the changes needed?

    The PR fixes type conversion and makes it easier to query Databricks clusters.

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    I added unit tests in JDBCSuite to check the conversion.

    ### Was this patch authored or co-authored using generative AI tooling?

    No.

    Closes #42896 from sadikovi/add_databricks_dialect.

    Authored-by: Ivan Sadikov
    Signed-off-by: Dongjoon Hyun
---
 .../apache/spark/sql/jdbc/DatabricksDialect.scala | 93 ++
 .../org/apache/spark/sql/jdbc/JdbcDialects.scala  |  1 +
 .../org/apache/spark/sql/jdbc/JDBCSuite.scala     | 26 ++
 3 files changed, 120 insertions(+)

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/jdbc/DatabricksDialect.scala b/sql/core/src/main/scala/org/apache/spark/sql/jdbc/DatabricksDialect.scala
new file mode 100644
index 000..1b715283dd4
--- /dev/null
+++ b/sql/core/src/main/scala/org/apache/spark/sql/jdbc/DatabricksDialect.scala
@@ -0,0 +1,93 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.jdbc
+
+import java.sql.Connection
+
+import scala.collection.mutable.ArrayBuilder
+
+import org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions
+import org.apache.spark.sql.execution.datasources.v2.TableSampleInfo
+import org.apache.spark.sql.types._
+
+private case object DatabricksDialect extends JdbcDialect {
+
+  override def canHandle(url: String): Boolean = {
+    url.startsWith("jdbc:databricks")
+  }
+
+  override def getCatalystType(
+      sqlType: Int,
+      typeName: String,
+      size: Int,
+      md: MetadataBuilder): Option[DataType] = {
+    sqlType match {
+      case java.sql.Types.TINYINT => Some(ByteType)
+      case java.sql.Types.SMALLINT => Some(ShortType)
+      case java.sql.Types.REAL => Some(FloatType)
+      case _ => None
+    }
+  }
+
+  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
+    case BooleanType => Some(JdbcType("BOOLEAN", java.sql.Types.BOOLEAN))
+    case DoubleType => Some(JdbcType("DOUBLE", java.sql.Types.DOUBLE))
+    case StringType => Some(JdbcType("STRING", java.sql.Types.VARCHAR))
+    case BinaryType => Some(JdbcType("BINARY", java.sql.Types.BINARY))
+    case _ => None
+  }
+
+  override def quoteIdentifier(colName: String): String = {
+    s"`$colName`"
+  }
+
+  override def supportsLimit: Boolean = true
+
+  override def supportsOffset: Boolean = true
+
+  override def supportsTableSample: Boolean = true
+
+  override def getTableSample(sample: TableSampleInfo): String = {
+    s"TABLESAMPLE (${(sample.upperBound - sample.lowerBound) * 100}) REPEATABLE (${sample.seed})"
+  }
+
+  // Override listSchemas to run "show schemas" as a PreparedStatement instead of
+  // invoking getMetaData.getSchemas as it may not work correctly in older versions of the driver.
+  override def schemasExists(conn: Connection, options: JDBCOptions, schema: String): Boolean = {
+    val stmt = conn.prepareStatement("SHOW SCHEMAS")
+    val rs = stmt.executeQuery()
+    while (rs.next()) {
+      if (rs.getString(1) == schema) {
+        return true
+      }
+    }
+    false
+  }
+
+  // Override listSchemas to run "show schemas" as a PreparedStatement instead of
+  // invoking
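Three of the dialect's hooks are plain string functions, so their behavior can be sketched standalone. The class below is a Java illustration of the Scala overrides quoted above, not code from the commit; `getTableSample` takes the bounds and seed directly rather than a `TableSampleInfo`:

```java
public class DialectSketch {
  // Mirrors DatabricksDialect.canHandle: a prefix match on the JDBC URL.
  static boolean canHandle(String url) {
    return url.startsWith("jdbc:databricks");
  }

  // Mirrors quoteIdentifier: Databricks SQL quotes identifiers with backticks.
  static String quoteIdentifier(String colName) {
    return "`" + colName + "`";
  }

  // Mirrors getTableSample: bounds in [0, 1] become a TABLESAMPLE percentage.
  static String getTableSample(double lowerBound, double upperBound, long seed) {
    return "TABLESAMPLE (" + (upperBound - lowerBound) * 100 + ") REPEATABLE (" + seed + ")";
  }

  public static void main(String[] args) {
    if (!canHandle("jdbc:databricks://host")) throw new AssertionError();
    if (canHandle("jdbc:mysql://host")) throw new AssertionError();
    if (!quoteIdentifier("col").equals("`col`")) throw new AssertionError();
    // A 50% sample of the table with a fixed seed:
    System.out.println(getTableSample(0.0, 0.5, 42L));  // prints TABLESAMPLE (50.0) REPEATABLE (42)
  }
}
```

This is how Spark's JDBC dialect registry works in general: `canHandle` selects the dialect from the connection URL, and the remaining hooks customize SQL generation and type mapping for that backend.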
[spark] branch master updated: [SPARK-45147][CORE] Remove `System.setSecurityManager` usage
dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 4f74fc58148  [SPARK-45147][CORE] Remove `System.setSecurityManager` usage

commit 4f74fc581486a1a750b3bb27abc12e1a87215ea6
Author: Dongjoon Hyun
AuthorDate: Wed Sep 13 02:03:28 2023 -0700

    [SPARK-45147][CORE] Remove `System.setSecurityManager` usage

    ### What changes were proposed in this pull request?

    This PR aims to remove the deprecated Java `System.setSecurityManager` usage for Apache Spark 4.0. Note that this is the only usage in the Apache Spark AS-IS code.

        $ git grep setSecurityManager
        core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala:      System.setSecurityManager(sm)
        core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala:      System.setSecurityManager(currentSm)

    This usage was added in Apache Spark 1.5.0:
    - #5841

    Since Apache Spark 2.4.0, we no longer need `setSecurityManager` due to the following improvement:
    - #20925

    ### Why are the changes needed?

    On recent JDKs the call fails outright:

        $ java -version
        openjdk version "21-ea" 2023-09-19
        OpenJDK Runtime Environment (build 21-ea+32-2482)
        OpenJDK 64-Bit Server VM (build 21-ea+32-2482, mixed mode, sharing)

        max spark-3.5.0-bin-hadoop3:$ bin/spark-sql --help
        ...
        CLI options:
        Exception in thread "main" java.lang.UnsupportedOperationException: The Security Manager is deprecated and will be removed in a future release
            at java.base/java.lang.System.setSecurityManager(System.java:429)
            at org.apache.spark.deploy.SparkSubmitArguments.getSqlShellOptions(SparkSubmitArguments.scala:623)

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    Manual test.

        $ build/sbt test:package -Phive -Phive-thriftserver
        $ bin/spark-sql --help
        ...
        CLI options:
         -d,--define       Variable substitution to apply to Hive commands. e.g. -d A=B or --define A=B
            --database     Specify the database to use
         -e                SQL from command line
         -f                SQL from files
         -H,--help         Print help information
            --hiveconf     Use value for given property
            --hivevar      Variable substitution to apply to Hive commands. e.g. --hivevar A=B
         -i                Initialization SQL file
         -S,--silent       Silent mode in interactive shell
         -v,--verbose      Verbose mode (echo executed SQL to the console)

    ### Was this patch authored or co-authored using generative AI tooling?

    No.

    Closes #42901 from dongjoon-hyun/SPARK-45147.

    Authored-by: Dongjoon Hyun
    Signed-off-by: Dongjoon Hyun
---
 .../apache/spark/deploy/SparkSubmitArguments.scala | 26 +---------------------
 1 file changed, 1 insertion(+), 25 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala b/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala
index a3fe5153bee..867fc05cb8a 100644
--- a/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala
+++ b/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala
@@ -18,7 +18,6 @@ package org.apache.spark.deploy
 import java.io.{ByteArrayOutputStream, File, PrintStream}
-import java.lang.reflect.InvocationTargetException
 import java.nio.charset.StandardCharsets
 import java.util.{List => JList}
@@ -599,39 +598,17 @@ private[deploy] class SparkSubmitArguments(args: Seq[String], env: Map[String, S
   /**
    * Run the Spark SQL CLI main class with the "--help" option and catch its output. Then filter
    * the results to remove unwanted lines.
-   *
-   * Since the CLI will call `System.exit()`, we install a security manager to prevent that call
-   * from working, and restore the original one afterwards.
    */
   private def getSqlShellOptions(): String = {
     val currentOut = System.out
     val currentErr = System.err
-    val currentSm = System.getSecurityManager()
     try {
       val out = new ByteArrayOutputStream()
       val stream = new PrintStream(out)
       System.setOut(stream)
       System.setErr(stream)
-      val sm = new SecurityManager() {
-        override def checkExit(status: Int): Unit = {
-          throw new SecurityException()
-        }
-
-        override def checkPermission(perm: java.security.Permission): Unit = {}
-      }
-      System.setSecurityManager(sm)
-
-      try {
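What survives the removal is the plain stream swap visible in the retained lines of the diff: capture `System.out`/`System.err` into a `ByteArrayOutputStream`, run the code whose output you want, and restore the originals. A minimal self-contained sketch of that surviving pattern (the `capture` helper and class name are my own, not the actual `getSqlShellOptions` code):

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;

public class CaptureStdout {
  // Redirect stdout/stderr to an in-memory buffer while `body` runs,
  // then restore the real streams and return everything that was printed.
  static String capture(Runnable body) {
    PrintStream currentOut = System.out;
    PrintStream currentErr = System.err;
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    PrintStream stream = new PrintStream(out);
    try {
      System.setOut(stream);
      System.setErr(stream);
      body.run();                    // anything it prints lands in `out`
    } finally {
      stream.flush();
      System.setOut(currentOut);     // always restore the real streams
      System.setErr(currentErr);
    }
    return new String(out.toByteArray(), StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    String captured = capture(() -> System.out.println("CLI options:"));
    if (!captured.contains("CLI options:")) throw new AssertionError();
    System.out.println("captured: " + captured.trim());
  }
}
```

No `SecurityManager` is needed for the capture itself; the manager existed only to intercept the CLI's `System.exit()` call, which the Spark 2.4.0 change (#20925) already made unnecessary.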
[spark] branch master updated: [SPARK-45141][PYTHON][INFRA][TESTS] Pin `pyarrow==12.0.1` in CI
ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new e3d2dfa8b51  [SPARK-45141][PYTHON][INFRA][TESTS] Pin `pyarrow==12.0.1` in CI

commit e3d2dfa8b514f9358823c3cb1ad6523da8a6646b
Author: Ruifeng Zheng
AuthorDate: Wed Sep 13 15:51:27 2023 +0800

    [SPARK-45141][PYTHON][INFRA][TESTS] Pin `pyarrow==12.0.1` in CI

    ### What changes were proposed in this pull request?

    Pin `pyarrow==12.0.1` in CI.

    ### Why are the changes needed?

    To fix the test failure seen in https://github.com/apache/spark/actions/runs/6167186123/job/16738683632:

        ======================================================================
        FAIL [0.095s]: test_from_to_pandas (pyspark.pandas.tests.data_type_ops.test_datetime_ops.DatetimeOpsTests)
        ----------------------------------------------------------------------
        Traceback (most recent call last):
          File "/__w/spark/spark/python/pyspark/testing/pandasutils.py", line 122, in _assert_pandas_equal
            assert_series_equal(
          File "/usr/local/lib/python3.9/dist-packages/pandas/_testing/asserters.py", line 931, in assert_series_equal
            assert_attr_equal("dtype", left, right, obj=f"Attributes of {obj}")
          File "/usr/local/lib/python3.9/dist-packages/pandas/_testing/asserters.py", line 415, in assert_attr_equal
            raise_assert_detail(obj, msg, left_attr, right_attr)
          File "/usr/local/lib/python3.9/dist-packages/pandas/_testing/asserters.py", line 599, in raise_assert_detail
            raise AssertionError(msg)
        AssertionError: Attributes of Series are different

        Attribute "dtype" are different
        [left]:  datetime64[ns]
        [right]: datetime64[us]

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    CI and manual test.

    ### Was this patch authored or co-authored using generative AI tooling?

    No.

    Closes #42897 from zhengruifeng/pin_pyarrow.
Authored-by: Ruifeng Zheng
Signed-off-by: Ruifeng Zheng
---
 .github/workflows/build_and_test.yml | 4 ++--
 dev/infra/Dockerfile                 | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml
index f0bd65bcf41..21809564497 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -263,7 +263,7 @@ jobs:
     - name: Install Python packages (Python 3.8)
       if: (contains(matrix.modules, 'sql') && !contains(matrix.modules, 'sql-'))
       run: |
-        python3.8 -m pip install 'numpy>=1.20.0' pyarrow pandas scipy unittest-xml-reporting 'grpcio==1.56.0' 'protobuf==3.20.3'
+        python3.8 -m pip install 'numpy>=1.20.0' 'pyarrow==12.0.1' pandas scipy unittest-xml-reporting 'grpcio==1.56.0' 'protobuf==3.20.3'
         python3.8 -m pip list
     # Run the tests.
     - name: Run tests
@@ -728,7 +728,7 @@ jobs:
         # See also https://issues.apache.org/jira/browse/SPARK-38279.
         python3.9 -m pip install 'sphinx<3.1.0' mkdocs pydata_sphinx_theme nbsphinx numpydoc 'jinja2<3.0.0' 'markupsafe==2.0.1' 'pyzmq<24.0.0'
         python3.9 -m pip install ipython_genutils  # See SPARK-38517
-        python3.9 -m pip install sphinx_plotly_directive 'numpy>=1.20.0' pyarrow pandas 'plotly>=4.8'
+        python3.9 -m pip install sphinx_plotly_directive 'numpy>=1.20.0' 'pyarrow==12.0.1' pandas 'plotly>=4.8'
         python3.9 -m pip install 'docutils<0.18.0'  # See SPARK-39421
         apt-get update -y
         apt-get install -y ruby ruby-dev
diff --git a/dev/infra/Dockerfile b/dev/infra/Dockerfile
index 60204dcc49e..feee7415004 100644
--- a/dev/infra/Dockerfile
+++ b/dev/infra/Dockerfile
@@ -85,7 +85,7 @@ RUN Rscript -e "devtools::install_version('roxygen2', version='7.2.0', repos='ht
 ENV R_LIBS_SITE "/usr/local/lib/R/site-library:${R_LIBS_SITE}:/usr/lib/R/library"

 RUN pypy3 -m pip install numpy 'pandas<=2.0.3' scipy coverage matplotlib
-RUN python3.9 -m pip install numpy pyarrow 'pandas<=2.0.3' scipy unittest-xml-reporting plotly>=4.8 'mlflow>=2.3.1' coverage matplotlib openpyxl 'memory-profiler==0.60.0' 'scikit-learn==1.1.*'
+RUN python3.9 -m pip install numpy 'pyarrow==12.0.1' 'pandas<=2.0.3' scipy unittest-xml-reporting plotly>=4.8 'mlflow>=2.3.1' coverage matplotlib openpyxl 'memory-profiler==0.60.0' 'scikit-learn==1.1.*'

 # Add Python deps for Spark Connect.
 RUN python3.9 -m pip install grpcio protobuf googleapis-common-protos grpcio-status
[spark] branch master updated: [SPARK-45145][EXAMPLE] Add JavaSparkSQLCli example
dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 66f42b5b292  [SPARK-45145][EXAMPLE] Add JavaSparkSQLCli example

commit 66f42b5b292cb4afff14b42f0d31835067ebdf5e
Author: Dongjoon Hyun
AuthorDate: Wed Sep 13 00:20:18 2023 -0700

    [SPARK-45145][EXAMPLE] Add JavaSparkSQLCli example

    ### What changes were proposed in this pull request?

    This PR aims to add a simple Java example, `JavaSparkSQLCli.java`. Like the SparkPi example, we can take advantage of this minimal example with `bin/run-example`, and we can run a simple SQL query as a canary test during the RC testing and voting phase.

    - https://spark.apache.org/docs/3.5.0/

        ./bin/run-example SparkPi 10

    After this PR:

        $ bin/run-example sql.JavaSparkSQLCli "SELECT 'Spark SQL' col"
        +---------+
        |col      |
        +---------+
        |Spark SQL|
        +---------+

    ### Why are the changes needed?

    Although `org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver` and the `bin/spark-sql` shell environment exist, they require building with `-Phive -Phive-thriftserver` explicitly:

        $ bin/spark-shell --version
        Welcome to
              ____              __
             / __/__  ___ _____/ /__
            _\ \/ _ \/ _ `/ __/  '_/
           /___/ .__/\_,_/_/ /_/\_\   version 3.5.0
              /_/
        ...
        $ bin/run-example org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver
        ...
        Error: Failed to load class org.apache.spark.examples.org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.
        Failed to load main class org.apache.spark.examples.org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.
        You need to build Spark with -Phive and -Phive-thriftserver.
        23/09/12 23:06:38 INFO ShutdownHookManager: Shutdown hook called
        23/09/12 23:06:38 INFO ShutdownHookManager: Deleting directory /private/var/folders/d4/dr6zxyvd4cl38877bj3fxs_mgn/T/spark-2a9ee4a3-64c3-492c-a268-a71b9aa44a2a

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    Manually.

        $ build/sbt test:package
        $ bin/run-example sql.JavaSparkSQLCli "SELECT 'Spark SQL' col"

    ### Was this patch authored or co-authored using generative AI tooling?

    No.

    Closes #42900 from dongjoon-hyun/SPARK-45145.

    Authored-by: Dongjoon Hyun
    Signed-off-by: Dongjoon Hyun
---
 .../apache/spark/examples/sql/JavaSparkSQLCli.java | 41 ++
 1 file changed, 41 insertions(+)

diff --git a/examples/src/main/java/org/apache/spark/examples/sql/JavaSparkSQLCli.java b/examples/src/main/java/org/apache/spark/examples/sql/JavaSparkSQLCli.java
new file mode 100644
index 000..d5b033c46bf
--- /dev/null
+++ b/examples/src/main/java/org/apache/spark/examples/sql/JavaSparkSQLCli.java
@@ -0,0 +1,41 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.examples.sql;
+
+import org.apache.spark.sql.SparkSession;
+
+/**
+ * Example Usage:
+ *
+ * bin/run-example sql.JavaSparkSQLCli "SELECT 'Spark SQL' col"
+ */
+public class JavaSparkSQLCli {
+
+  public static void main(String[] args) {
+    SparkSession spark = SparkSession
+      .builder()
+      .appName("Java Spark SQL Cli")
+      .getOrCreate();
+
+    for (String a: args) {
+      spark.sql(a).show(false);
+    }
+
+    spark.stop();
+  }
+}
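The `+---------+` table in the commit message is the standard `Dataset.show(false)` layout: with truncation disabled, cells are left-aligned and padded to the widest value in the column. As an illustration only (this is a toy renderer for a single string column, not Spark's implementation), the layout can be reproduced like this:

```java
public class ShowSketch {
  // Render one string column in the show(false) style: a dashed border,
  // the header, another border, the rows, and a closing border.
  static String show(String header, String[] rows) {
    int w = header.length();
    for (String r : rows) w = Math.max(w, r.length());
    String border = "+" + "-".repeat(w) + "+\n";
    StringBuilder sb = new StringBuilder(border);
    sb.append("|").append(pad(header, w)).append("|\n").append(border);
    for (String r : rows) sb.append("|").append(pad(r, w)).append("|\n");
    return sb.append(border).toString();
  }

  // truncate=false output is left-aligned, so pad on the right.
  static String pad(String s, int w) {
    return s + " ".repeat(w - s.length());
  }

  public static void main(String[] args) {
    // Reproduces the table from `SELECT 'Spark SQL' col` in the commit message.
    System.out.print(show("col", new String[]{"Spark SQL"}));
  }
}
```

Running `main` prints the same five-line table shown in the commit message, with the widest cell ("Spark SQL", 9 characters) setting the column width.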