[GitHub] [spark-website] yaooqinn commented on a diff in pull request #476: Add News, Release Note, Download Link for Apache Spark 3.5.0

2023-09-13 Thread via GitHub


yaooqinn commented on code in PR #476:
URL: https://github.com/apache/spark-website/pull/476#discussion_r1325386610


##
releases/_posts/2023-09-13-spark-release-3-5-0.md:
##
@@ -0,0 +1,336 @@
+---
+layout: post
+title: Spark Release 3.5.0
+categories: []
+tags: []
+status: publish
+type: post
+published: true
+meta:
+_edit_last: '4'
+_wpas_done_all: '1'
+---
+
+Apache Spark 3.5.0 is the sixth release in the 3.x series. With significant 
contributions from the open-source community, this release addressed over 1,300 
Jira tickets.
+
+This release brings more scenarios with general availability for Spark Connect, such as the Scala and Go clients, distributed training and inference support, and improved compatibility for Structured Streaming; introduces new PySpark and SQL functionality such as the SQL IDENTIFIER clause, named argument support for SQL function calls, SQL function support for HyperLogLog approximate aggregations, and Python user-defined table functions; simplifies distributed training with DeepSpeed; and introduces watermark propagation among operators and the dropDuplicatesWithinWatermark operation in Structured Streaming.
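
A small, hedged PySpark sketch of a few of these SQL additions (it assumes an active SparkSession named `spark`; the table name is a placeholder and the `mask()` parameter names are assumptions based on its signature):

```
# IDENTIFIER clause: treat a constant string expression as a table identifier.
spark.sql("SELECT * FROM IDENTIFIER('my_' || 'table')").show()

# Named arguments in SQL function calls.
spark.sql("SELECT mask('AbCD123-@$#', lowerChar => 'q', digitChar => 'd')").show()

# HyperLogLog sketch aggregates backed by Apache DataSketches.
spark.sql("SELECT hll_sketch_estimate(hll_sketch_agg(id)) FROM range(1000)").show()
```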
+
+To download Apache Spark 3.5.0, please visit the 
[downloads](https://spark.apache.org/downloads.html) page. For [detailed 
changes](https://s.apache.org/spark-3.5.0), you can consult JIRA. We have also 
curated a list of high-level changes here, grouped by major modules.
+
+
+* This will become a table of contents (this text will be scraped).
+{:toc}
+
+
+### Highlights
+
+
+
+* Scala and Go client support in Spark Connect 
[SPARK-42554](https://issues.apache.org/jira/browse/SPARK-42554) 
[SPARK-43351](https://issues.apache.org/jira/browse/SPARK-43351) 
+* PyTorch-based distributed ML Support for Spark Connect 
[SPARK-42471](https://issues.apache.org/jira/browse/SPARK-42471) 
+* Structured Streaming support for Spark Connect in Python and Scala 
[SPARK-42938](https://issues.apache.org/jira/browse/SPARK-42938) 
+* Pandas API support for the Python Spark Connect Client 
[SPARK-42497](https://issues.apache.org/jira/browse/SPARK-42497)
+* Introduce Arrow Python UDFs 
[SPARK-40307](https://issues.apache.org/jira/browse/SPARK-40307)
+* Support Python user-defined table functions [SPARK-43798](https://issues.apache.org/jira/browse/SPARK-43798) (see the example after this list)
+* Migrate PySpark errors onto error classes 
[SPARK-42986](https://issues.apache.org/jira/browse/SPARK-42986)
+* PySpark Test Framework 
[SPARK-44042](https://issues.apache.org/jira/browse/SPARK-44042)
+* Add support for Datasketches HllSketch 
[SPARK-16484](https://issues.apache.org/jira/browse/SPARK-16484)
+* Built-in SQL Function Improvement 
[SPARK-41231](https://issues.apache.org/jira/browse/SPARK-41231)
+* IDENTIFIER clause 
[SPARK-43205](https://issues.apache.org/jira/browse/SPARK-43205)
+* Add SQL functions into Scala, Python and R API 
[SPARK-43907](https://issues.apache.org/jira/browse/SPARK-43907) 
+* Add named argument support for SQL functions 
[SPARK-43922](https://issues.apache.org/jira/browse/SPARK-43922)
+* Avoid unnecessary task rerun on decommissioned executor lost if shuffle data 
migrated [SPARK-41469](https://issues.apache.org/jira/browse/SPARK-41469)
+* Distributed ML <> spark connect 
[SPARK-42471](https://issues.apache.org/jira/browse/SPARK-42471) 
+* DeepSpeed Distributor 
[SPARK-44264](https://issues.apache.org/jira/browse/SPARK-44264)
+* Implement changelog checkpointing for RocksDB state store 
[SPARK-43421](https://issues.apache.org/jira/browse/SPARK-43421)
+* Introduce watermark propagation among operators 
[SPARK-42376](https://issues.apache.org/jira/browse/SPARK-42376)
+* Introduce dropDuplicatesWithinWatermark 
[SPARK-42931](https://issues.apache.org/jira/browse/SPARK-42931)
+* RocksDB state store provider memory management enhancements  
[SPARK-43311](https://issues.apache.org/jira/browse/SPARK-43311)
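
As a quick illustration of the Python user-defined table function highlight above, here is a minimal, illustrative sketch (the class name, schema, and the `spark` session are assumptions, not part of the release notes):

```
from pyspark.sql.functions import lit, udtf

# A tiny Python UDTF that yields one row per number with its square.
@udtf(returnType="num: int, squared: int")
class SquareNumbers:
    def eval(self, start: int, end: int):
        for num in range(start, end + 1):
            yield (num, num * num)

# Call it directly on literal arguments ...
SquareNumbers(lit(1), lit(3)).show()

# ... or register it and call it from SQL.
spark.udtf.register("square_numbers", SquareNumbers)
spark.sql("SELECT * FROM square_numbers(1, 3)").show()
```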
+
+
+
+### Spark Connect
+
+* Refactoring of the sql module into sql and sql-api to produce a minimal set of dependencies that can be shared between the Scala Spark Connect client and Spark, and to avoid pulling in all of Spark's transitive dependencies. [SPARK-44273](https://issues.apache.org/jira/browse/SPARK-44273)
+* Introducing the Scala client for Spark Connect 
[SPARK-42554](https://issues.apache.org/jira/browse/SPARK-42554)
+* Pandas API support for the Python Spark Connect Client 
[SPARK-42497](https://issues.apache.org/jira/browse/SPARK-42497)
+* PyTorch-based distributed ML Support for Spark Connect 
[SPARK-42471](https://issues.apache.org/jira/browse/SPARK-42471)
+* Structured Streaming support for Spark Connect in Python and Scala 
[SPARK-42938](https://issues.apache.org/jira/browse/SPARK-42938)
+* Initial version of the Go client 
[SPARK-43351](https://issues.apache.org/jira/browse/SPARK-43351)
+* Lots of compatibility improvements between native Spark and the Spark Connect clients across Python and Scala
+* Improved debuggability and request handling for

[spark] branch master updated: [SPARK-45056][PYTHON][SS][CONNECT] Termination tests for streamingQueryListener and foreachBatch

2023-09-13 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new b6190a3db97 [SPARK-45056][PYTHON][SS][CONNECT] Termination tests for 
streamingQueryListener and foreachBatch
b6190a3db97 is described below

commit b6190a3db974c19b6a0c4fe7af75531d67755074
Author: Wei Liu 
AuthorDate: Thu Sep 14 11:23:44 2023 +0900

[SPARK-45056][PYTHON][SS][CONNECT] Termination tests for 
streamingQueryListener and foreachBatch

### What changes were proposed in this pull request?

Add termination tests for StreamingQueryListener and foreachBatch.

The behavior is mimicked by creating, on the server side, the same query that would have been created if the same Python query were run on the client side. For example, in foreachBatch, a Python foreachBatch function is serialized using the CloudPickleSerializer and passed to the server side; here we start another Python process on the server, call the same CloudPickleSerializer, pass the bytes to the server, and construct the `SimplePythonFunction` accordingly.

Refactored the code a bit for testing purposes.
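
A rough, illustrative sketch of that serialization round trip, using the standalone `cloudpickle` package as a stand-in for PySpark's CloudPickleSerializer (the `SimplePythonFunction` construction itself is not shown):

```
import cloudpickle  # assumption: a stand-in for the CloudPickleSerializer mentioned above

# Mimic what the Connect client does with a foreachBatch function before
# shipping it to the server: serialize it to bytes, then rebuild a callable
# from those bytes on the other side.
def my_foreach_batch(batch_df, batch_id):
    print(f"processing batch {batch_id}")

payload = cloudpickle.dumps(my_foreach_batch)   # bytes sent to the server
restored = cloudpickle.loads(payload)           # what the server-side Python worker reconstructs
restored(None, 42)                              # prints "processing batch 42"
```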

### Why are the changes needed?

Necessary tests

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Test only addition

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #42779 from WweiL/SPARK-44435-followup-termination-tests.

Lead-authored-by: Wei Liu 
Co-authored-by: Wei Liu 
Signed-off-by: Hyukjin Kwon 
---
 .github/workflows/build_and_test.yml   |   6 +-
 .../sql/connect/planner/SparkConnectPlanner.scala  |   3 +-
 .../planner/StreamingQueryListenerHelper.scala |   8 +-
 .../service/SparkConnectSessionHodlerSuite.scala   | 205 +
 .../spark/api/python/PythonWorkerFactory.scala |   5 +
 .../spark/api/python/StreamingPythonRunner.scala   |  26 ++-
 .../connect/streaming/test_parity_foreach_batch.py |   1 -
 .../streaming/test_streaming_foreach_batch.py  |   2 -
 .../apache/spark/sql/IntegratedUDFTestUtils.scala  |   4 +-
 9 files changed, 242 insertions(+), 18 deletions(-)

diff --git a/.github/workflows/build_and_test.yml 
b/.github/workflows/build_and_test.yml
index 21809564497..25c95bd607d 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -256,14 +256,14 @@ jobs:
   # We should install one Python that is higher then 3+ for SQL and Yarn 
because:
   # - SQL component also has Python related tests, for example, 
IntegratedUDFTestUtils.
   # - Yarn has a Python specific test too, for example, YarnClusterSuite.
-  if: contains(matrix.modules, 'yarn') || (contains(matrix.modules, 'sql') 
&& !contains(matrix.modules, 'sql-'))
+  if: contains(matrix.modules, 'yarn') || (contains(matrix.modules, 'sql') 
&& !contains(matrix.modules, 'sql-')) || contains(matrix.modules, 'connect')
   with:
 python-version: 3.8
 architecture: x64
 - name: Install Python packages (Python 3.8)
-  if: (contains(matrix.modules, 'sql') && !contains(matrix.modules, 
'sql-'))
+  if: (contains(matrix.modules, 'sql') && !contains(matrix.modules, 
'sql-')) || contains(matrix.modules, 'connect')
   run: |
-python3.8 -m pip install 'numpy>=1.20.0' 'pyarrow==12.0.1' pandas 
scipy unittest-xml-reporting 'grpcio==1.56.0' 'protobuf==3.20.3'
+python3.8 -m pip install 'numpy>=1.20.0' 'pyarrow==12.0.1' pandas 
scipy unittest-xml-reporting 'grpcio>=1.48,<1.57' 'grpcio-status>=1.48,<1.57' 
'protobuf==3.20.3'
 python3.8 -m pip list
 # Run the tests.
 - name: Run tests
diff --git 
a/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala
 
b/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala
index b8ab5539b30..24dee006f0b 100644
--- 
a/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala
+++ 
b/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala
@@ -3131,8 +3131,7 @@ class SparkConnectPlanner(val sessionHolder: 
SessionHolder) extends Logging {
 val listener = if (command.getAddListener.hasPythonListenerPayload) {
   new PythonStreamingQueryListener(
 
transformPythonFunction(command.getAddListener.getPythonListenerPayload),
-sessionHolder,
-pythonExec)
+sessionHolder)
 } else {
   val listenerPacket = Utils
 .deserialize[StreamingListenerPacket](
diff --git 
a/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/StreamingQueryListenerHelper.scala
 

[spark] branch master updated: [SPARK-36191][SQL] Handle limit and order by in correlated scalar (lateral) subqueries

2023-09-13 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 21c27d55b34 [SPARK-36191][SQL] Handle limit and order by in correlated 
scalar (lateral) subqueries
21c27d55b34 is described below

commit 21c27d55b342eb32bdf2d56a81dd9adb23f5526e
Author: Andrey Gubichev 
AuthorDate: Thu Sep 14 09:09:05 2023 +0800

[SPARK-36191][SQL] Handle limit and order by in correlated scalar (lateral) 
subqueries

### What changes were proposed in this pull request?

Handle LIMIT/ORDER BY in correlated scalar (lateral) subqueries by rewriting them using a ROW_NUMBER() window function.
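
A hedged sketch of the kind of query this enables and of the conceptual rewrite, using made-up tables `t1(x)` and `t2(a, b, c)` and an existing SparkSession `spark`:

```
# A correlated scalar subquery with ORDER BY ... LIMIT, previously rejected:
spark.sql("""
    SELECT x,
           (SELECT t2.a FROM t2 WHERE t2.b = t1.x ORDER BY t2.c LIMIT 1) AS top_a
    FROM t1
""").show()

# Conceptually, the optimizer decorrelates it with a row_number() window:
spark.sql("""
    SELECT t1.x, s.a AS top_a
    FROM t1 LEFT JOIN (
        SELECT a, b, row_number() OVER (PARTITION BY b ORDER BY c) AS rn FROM t2
    ) s ON s.b = t1.x AND s.rn <= 1
""").show()
```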

### Why are the changes needed?
Extends our coverage of subqueries

### Does this PR introduce _any_ user-facing change?

Users are able to run more subqueries now

### How was this patch tested?

Unit tests and query tests. Results of query tests are verified against 
PostgreSQL.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #42705 from agubichev/SPARK-36191-limit.

Authored-by: Andrey Gubichev 
Signed-off-by: Wenchen Fan 
---
 .../sql/catalyst/analysis/CheckAnalysis.scala  |   5 +
 .../catalyst/optimizer/DecorrelateInnerQuery.scala |  33 +++
 .../sql/catalyst/analysis/AnalysisErrorSuite.scala |   8 -
 .../optimizer/DecorrelateInnerQuerySuite.scala | 111 +++
 .../analyzer-results/join-lateral.sql.out  | 155 ++
 .../analyzer-results/postgreSQL/join.sql.out   | 318 +
 .../scalar-subquery-predicate.sql.out  | 123 
 .../resources/sql-tests/inputs/join-lateral.sql|  27 ++
 .../resources/sql-tests/inputs/postgreSQL/join.sql |  82 +++---
 .../scalar-subquery/scalar-subquery-predicate.sql  |  32 +++
 .../sql-tests/results/join-lateral.sql.out |  81 ++
 .../sql-tests/results/postgreSQL/join.sql.out  |  82 ++
 .../scalar-subquery-predicate.sql.out  |  66 +
 13 files changed, 1074 insertions(+), 49 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
index 038cd7d944a..4cfeb60eb04 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
@@ -1418,6 +1418,11 @@ trait CheckAnalysis extends PredicateHelper with 
LookupCatalog with QueryErrorsB
   failOnInvalidOuterReference(g)
   checkPlan(g.child, aggregated, canContainOuter)
 
+// Correlated subquery can have a LIMIT clause
+case l @ Limit(_, input) =>
+  failOnInvalidOuterReference(l)
+  checkPlan(input, aggregated, canContainOuter)
+
 // Category 4: Any other operators not in the above 3 categories
 // cannot be on a correlation path, that is they are allowed only
 // under a correlation point but they and their descendant operators
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/DecorrelateInnerQuery.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/DecorrelateInnerQuery.scala
index a07177f6e8a..1cb7c8f3ac6 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/DecorrelateInnerQuery.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/DecorrelateInnerQuery.scala
@@ -655,6 +655,39 @@ object DecorrelateInnerQuery extends PredicateHelper {
 val newProject = Project(newProjectList ++ referencesToAdd, 
newChild)
 (newProject, joinCond, outerReferenceMap)
 
+  case Limit(limit, input) =>
+// LIMIT K (with potential ORDER BY) is decorrelated by computing 
K rows per every
+// domain value via a row_number() window function. For example, 
for a subquery
+// (SELECT T2.a FROM T2 WHERE T2.b = OuterReference(x) ORDER BY 
T2.c LIMIT 3)
+// -- we need to get top 3 values of T2.a (ordering by T2.c) for 
every value of x.
+// Following our general decorrelation procedure, 'x' is then 
replaced by T2.b, so the
+// subquery is decorrelated as:
+// SELECT * FROM (
+//   SELECT T2.a, row_number() OVER (PARTITION BY T2.b ORDER BY 
T2.c) AS rn FROM T2)
+// WHERE rn <= 3
+val (child, ordering) = input match {
+  case Sort(order, _, child) => (child, order)
+  case _ => (input, Seq())
+}
+val (newChild, joinCond, outerReferenceMap) =
+  decorrelate(child, parentOuterReferences, aggregated = true, 
underSetOp)

[spark] branch master updated: [SPARK-45135][PYTHON][TESTS] Make `utils.eventually` a parameterized decorator

2023-09-13 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 9798244ca64 [SPARK-45135][PYTHON][TESTS] Make `utils.eventually` a 
parameterized decorator
9798244ca64 is described below

commit 9798244ca647ec68d36f4b9b21356a6de5f73f77
Author: Ruifeng Zheng 
AuthorDate: Thu Sep 14 08:44:53 2023 +0800

[SPARK-45135][PYTHON][TESTS] Make `utils.eventually` a parameterized 
decorator

### What changes were proposed in this pull request?

- Make utils.eventually a parameterized decorator
- Retry `test_read_images` if it fails

### Why are the changes needed?
Previously, we used `utils.eventually` to retry flaky tests, e.g.
https://github.com/apache/spark/commit/745ed93fe451b3f9e8148b06356c28b889a4db5a

However, it requires modifying the test body, and sometimes the change may be large.

To minimize the changes, I'd like to ~~add a decorator for test retry~~ make `utils.eventually` a parameterized decorator.
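
A self-contained sketch of what a parameterized retry decorator looks like, in the spirit of `pyspark.testing.utils.eventually` (names and defaults here are illustrative, not the actual implementation):

```
import functools
import time

def eventually(timeout=30.0, interval=0.1, catch_assertions=False):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            deadline = time.monotonic() + timeout
            while True:
                try:
                    return func(*args, **kwargs)
                except AssertionError:
                    # Only retry assertion failures when asked to, and only
                    # until the deadline expires.
                    if not catch_assertions or time.monotonic() >= deadline:
                        raise
                    time.sleep(interval)
        return wrapper
    return decorator

# Usage then mirrors the diff below: decorate the whole test ...
#     @eventually(timeout=60, catch_assertions=True)
#     def test_read_images(self): ...
# ... or apply it inline to a local condition function:
#     eventually(timeout=10, catch_assertions=True)(condition)()
```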

### Does this PR introduce _any_ user-facing change?
no, test-only

### How was this patch tested?
CI

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #42891 from zhengruifeng/test_retry_test_read_images.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/ml/tests/test_image.py  |  3 +-
 python/pyspark/ml/tests/test_wrapper.py|  4 +-
 python/pyspark/mllib/tests/test_algorithms.py  | 37 ++--
 .../mllib/tests/test_streaming_algorithms.py   | 24 
 .../pandas/test_pandas_grouped_map_with_state.py   |  2 +-
 python/pyspark/testing/utils.py| 70 ++
 python/pyspark/tests/test_taskcontext.py   |  3 +-
 python/pyspark/tests/test_util.py  | 23 ++-
 python/pyspark/tests/test_worker.py| 15 ++---
 9 files changed, 112 insertions(+), 69 deletions(-)

diff --git a/python/pyspark/ml/tests/test_image.py 
b/python/pyspark/ml/tests/test_image.py
index 86fa46c3248..ee254a41007 100644
--- a/python/pyspark/ml/tests/test_image.py
+++ b/python/pyspark/ml/tests/test_image.py
@@ -19,10 +19,11 @@ import unittest
 from pyspark.ml.image import ImageSchema
 from pyspark.testing.mlutils import SparkSessionTestCase
 from pyspark.sql import Row
-from pyspark.testing.utils import QuietTest
+from pyspark.testing.utils import QuietTest, eventually
 
 
 class ImageFileFormatTest(SparkSessionTestCase):
+@eventually(timeout=60.0, catch_assertions=True)
 def test_read_images(self):
 data_path = "data/mllib/images/origin/kittens"
 df = (
diff --git a/python/pyspark/ml/tests/test_wrapper.py 
b/python/pyspark/ml/tests/test_wrapper.py
index 33d93c02acd..3efdbabd998 100644
--- a/python/pyspark/ml/tests/test_wrapper.py
+++ b/python/pyspark/ml/tests/test_wrapper.py
@@ -63,7 +63,7 @@ class JavaWrapperMemoryTests(SparkSessionTestCase):
 self.assertIn("LinearRegressionTrainingSummary", 
summary._java_obj.toString())
 return True
 
-eventually(condition, timeout=10, catch_assertions=True)
+eventually(timeout=10, catch_assertions=True)(condition)()
 
 try:
 summary.__del__()
@@ -77,7 +77,7 @@ class JavaWrapperMemoryTests(SparkSessionTestCase):
 summary._java_obj.toString()
 return True
 
-eventually(condition, timeout=10, catch_assertions=True)
+eventually(timeout=10, catch_assertions=True)(condition)()
 
 
 class WrapperTests(MLlibTestCase):
diff --git a/python/pyspark/mllib/tests/test_algorithms.py 
b/python/pyspark/mllib/tests/test_algorithms.py
index dc48c2c021d..bcedd65b05b 100644
--- a/python/pyspark/mllib/tests/test_algorithms.py
+++ b/python/pyspark/mllib/tests/test_algorithms.py
@@ -97,27 +97,28 @@ class ListTests(MLlibTestCase):
 # TODO: Allow small numeric difference.
 self.assertTrue(array_equal(c1, c2))
 
+@eventually(timeout=60, catch_assertions=True)
 def test_gmm(self):
 from pyspark.mllib.clustering import GaussianMixture
 
-def condition():
-data = self.sc.parallelize(
-[
-[1, 2],
-[8, 9],
-[-4, -3],
-[-6, -7],
-]
-)
-clusters = GaussianMixture.train(
-data, 2, convergenceTol=0.001, maxIterations=10, seed=1
-)
-labels = clusters.predict(data).collect()
-self.assertEqual(labels[0], labels[1])
-self.assertEqual(labels[2], labels[3])
-return True
-
-eventually(condition, timeout=60, catch_assertions=True)
+data = self.sc.parallelize(
+[
+   

[spark] branch master updated (c12eb2c93d7 -> d26e77aabd2)

2023-09-13 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from c12eb2c93d7 [SPARK-45144][BUILD] Downgrade `scala-maven-plugin` to 
4.7.1
 add d26e77aabd2 Revert "[SPARK-45120][UI] Upgrade d3 from v3 to v7(v7.8.5) 
and apply api changes in UI"

No new revisions were added by this update.

Summary of changes:
 LICENSE|  6 +--
 LICENSE-binary |  6 +--
 .../resources/org/apache/spark/ui/static/d3.min.js |  7 +++-
 .../org/apache/spark/ui/static/spark-dag-viz.js|  6 +--
 .../org/apache/spark/ui/static/streaming-page.js   | 48 ++
 .../spark/ui/static/structured-streaming-page.js   | 31 --
 licenses-binary/LICENSE-d3.min.js.txt  | 39 --
 licenses/LICENSE-d3.min.js.txt | 39 --
 .../spark/sql/execution/ui/static/spark-sql-viz.js |  4 +-
 9 files changed, 105 insertions(+), 81 deletions(-)





[GitHub] [spark-website] Yikun commented on pull request #476: Add News, Release Note, Download Link for Apache Spark 3.5.0

2023-09-13 Thread via GitHub


Yikun commented on PR #476:
URL: https://github.com/apache/spark-website/pull/476#issuecomment-1718441629

   https://github.com/apache/spark-docker/pull/55
   
   I already submitted the draft patch; I need your help to upload the key.





[GitHub] [spark-website] xuanyuanking commented on pull request #476: Add News, Release Note, Download Link for Apache Spark 3.5.0

2023-09-13 Thread via GitHub


xuanyuanking commented on PR #476:
URL: https://github.com/apache/spark-website/pull/476#issuecomment-1718304305

   cc @Yikun, could you also help me with the DockerHub image parts? I don't have access





[GitHub] [spark-website] xuanyuanking commented on pull request #476: Add News, Release Note, Download Link for Apache Spark 3.5.0

2023-09-13 Thread via GitHub


xuanyuanking commented on PR #476:
URL: https://github.com/apache/spark-website/pull/476#issuecomment-1718300223

   cc @gengliangwang 





[GitHub] [spark-website] xuanyuanking opened a new pull request, #476: Add News, Release Note, Download Link for Apache Spark 3.5.0

2023-09-13 Thread via GitHub


xuanyuanking opened a new pull request, #476:
URL: https://github.com/apache/spark-website/pull/476

   - Vote Result: 
https://lists.apache.org/thread/kqqlsy90ormkccddtx6wgqmy8r1df8tx
   - Documentation: https://spark.apache.org/docs/3.5.0/
   - Binary: https://downloads.apache.org/spark/spark-3.5.0/
   - PySpark: pending on the File Limit Request 
https://github.com/pypi/support/issues/3175
   - Maven Central: 
https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.13/3.5.0/
   
   Screenshots:
   - https://github.com/apache/spark-website/assets/4833765/a74b94d0-f3ba-4388-9941-3afa43728ecb
   - https://github.com/apache/spark-website/assets/4833765/ee599a66-559a-48fc-a7df-4f0b785fb544
   - https://github.com/apache/spark-website/assets/4833765/a22d8f02-31f7-4173-9624-90a35dd209a3





[spark-website] branch 3.5.0-release deleted (was dc8b61a479)

2023-09-13 Thread liyuanjian
This is an automated email from the ASF dual-hosted git repository.

liyuanjian pushed a change to branch 3.5.0-release
in repository https://gitbox.apache.org/repos/asf/spark-website.git


 was dc8b61a479 Add docs for Apache Spark 3.5.0

The revisions that were on this branch are still contained in
other references; therefore, this change does not discard any commits
from the repository.





[spark-website] branch 3.5.0-release created (now dc8b61a479)

2023-09-13 Thread liyuanjian
This is an automated email from the ASF dual-hosted git repository.

liyuanjian pushed a change to branch 3.5.0-release
in repository https://gitbox.apache.org/repos/asf/spark-website.git


  at dc8b61a479 Add docs for Apache Spark 3.5.0

No new revisions were added by this update.





[spark] branch master updated: [SPARK-45144][BUILD] Downgrade `scala-maven-plugin` to 4.7.1

2023-09-13 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new c12eb2c93d7 [SPARK-45144][BUILD] Downgrade `scala-maven-plugin` to 
4.7.1
c12eb2c93d7 is described below

commit c12eb2c93d7678795726a7f2ccbb30a66a93fb22
Author: yangjie01 
AuthorDate: Wed Sep 13 08:43:52 2023 -0700

[SPARK-45144][BUILD] Downgrade `scala-maven-plugin` to 4.7.1

### What changes were proposed in this pull request?
This PR downgrades `scala-maven-plugin` to version 4.7.1 to avoid it automatically adding the `-release` option as a Scala compilation argument.

### Why are the changes needed?
The `scala-maven-plugin` versions 4.7.2 and later will try to automatically 
append the `-release` option as a Scala compilation argument when it is not 
specified by the user:

1. 4.7.2 and 4.8.0: try to add the `-release` option for Scala versions 
2.13.9 and higher.
2. 4.8.1: try to append the `-release` option for Scala versions 
2.12.x/2.13.x/3.1.1, and append `-java-output-version` for Scala 3.1.2.

The addition of the `-release` option has caused the issues mentioned in SPARK-44376 | https://github.com/apache/spark/pull/41943 and https://github.com/apache/spark/pull/40442#issuecomment-1474161688. This is because the `-release` option imposes stronger compilation restrictions than `-target`: it checks not only the bytecode format, but also that the APIs used in the code are compatible with the specified version of Java. However, many APIs in the `sun.*` package are not `exports` in Java 11, 17, [...]

For discussions within the Scala community, see 
https://github.com/scala/bug/issues/12643, 
https://github.com/scala/bug/issues/12824, 
https://github.com/scala/bug/issues/12866,  but this is not a bug.

I have also submitted an issue to the `scala-maven-plugin` community to 
discuss the possibility of adding additional settings to control the addition 
of the `-release` option: 
https://github.com/davidB/scala-maven-plugin/issues/722.

For Apache Spark 4.0, in the short term, I suggest downgrading `scala-maven-plugin` to version 4.7.1 to avoid it automatically adding the `-release` option as a Scala compilation argument. In the long term, we should reduce our use of APIs that are not `exports`, for compatibility with the `-release` compilation option, since `-target` is already deprecated as of Scala 2.13.9.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- Pass GitHub Actions
- Manual check

run `git revert 656bf36363c466b60d0045234ccaaa654ed8` to revert to 
using Scala 2.13.11 and run `dev/change-scala-version.sh 2.13` to change Scala 
to 2.13

1. Run `build/mvn clean install -DskipTests -Pscala-2.13 -X` to check the 
Scala compilation arguments.

Before

```
[DEBUG] [zinc] Running cached compiler 1992eaf4 for Scala compiler version 
2.13.11
[DEBUG] [zinc] The Scala compiler is invoked with:
  -unchecked
  -deprecation
  -feature
  -explaintypes
  -target:jvm-1.8
  -Wconf:cat=deprecation:wv,any:e
  -Wunused:imports
  -Wconf:cat=scaladoc:wv
  -Wconf:cat=lint-multiarg-infix:wv
  -Wconf:cat=other-nullary-override:wv
  
-Wconf:cat=other-match-analysis=org.apache.spark.sql.catalyst.catalog.SessionCatalog.lookupFunction.catalogFunction:wv
  
-Wconf:cat=other-pure-statement=org.apache.spark.streaming.util.FileBasedWriteAheadLog.readAll.readFile:wv
  
-Wconf:cat=other-pure-statement=org.apache.spark.scheduler.OutputCommitCoordinatorSuite..futureAction:wv
  
-Wconf:msg=^(?=.*?method|value|type|object|trait|inheritance)(?=.*?deprecated)(?=.*?since
 2.13).+$:s
  -Wconf:msg=^(?=.*?Widening conversion from)(?=.*?is deprecated because it 
loses precision).+$:s
  -Wconf:msg=Auto-application to \`\(\)\` is deprecated:s
  -Wconf:msg=method with a single empty parameter list overrides method 
without any parameter list:s
  -Wconf:msg=method without a parameter list overrides a method with a 
single empty one:s
  -Wconf:cat=deprecation=procedure syntax is deprecated:e
  -Wconf:cat=unchecked=outer reference:s
  -Wconf:cat=unchecked=eliminated by erasure:s
  -Wconf:msg=^(?=.*?a value of type)(?=.*?cannot also be).+$:s
  
-Wconf:cat=unused-imports=org\/apache\/spark\/graphx\/impl\/VertexPartitionBase.scala:s
  
-Wconf:cat=unused-imports=org\/apache\/spark\/graphx\/impl\/VertexPartitionBaseOps.scala:s
  -Wconf:msg=Implicit definition should have explicit type:s
  -release
  8
  -bootclasspath
...
```

After

```
[DEBUG] [zinc] Running cached compiler 72dd4888 for Scala compiler version 
2.13.11
[DEBUG] [zinc] The Scala compiler is invoked with:
  -unchecked
  

[spark] branch master updated: [SPARK-45157][SQL] Avoid repeated `if` checks in `[On|Off]HeapColumnVector`

2023-09-13 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new b568ba43f0d [SPARK-45157][SQL] Avoid repeated `if` checks in `[On|Off]HeapColumnVector`
b568ba43f0d is described below

commit b568ba43f0dd80130bca1bf86c48d0d359e57883
Author: Wenchen Fan 
AuthorDate: Wed Sep 13 08:36:05 2023 -0700

[SPARK-45157][SQL] Avoid repeated `if` checks in `[On|Off]HeapColumnVector`

### What changes were proposed in this pull request?

This is a small followup of https://github.com/apache/spark/pull/42850. `getBytes` checks whether the `dictionary` is null, then calls `getByte`, which also checks whether the `dictionary` is null. This PR avoids the repeated `if` checks by copying one line of code from `getByte` into `getBytes`. The same applies to the other `getXXX` methods.
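
An illustrative Python analogue of the pattern (the real change is in the Java `OnHeapColumnVector`/`OffHeapColumnVector` classes): hoist the invariant check out of the per-element accessor rather than re-evaluating it for every element.

```
def get_byte(data, dictionary, i):
    # Per-element accessor: checks `dictionary` on every call.
    return data[i] if dictionary is None else dictionary[data[i]]

def get_bytes_before(data, dictionary, row_id, count):
    # Repeated check: `dictionary is None` is re-evaluated `count` times.
    return [get_byte(data, dictionary, row_id + i) for i in range(count)]

def get_bytes_after(data, dictionary, row_id, count):
    # Checked once for the whole batch, with the accessor body inlined per branch.
    if dictionary is None:
        return list(data[row_id:row_id + count])
    return [dictionary[data[row_id + i]] for i in range(count)]

assert get_bytes_before([2, 0, 1], {0: 7, 1: 8, 2: 9}, 0, 3) == \
       get_bytes_after([2, 0, 1], {0: 7, 1: 8, 2: 9}, 0, 3) == [9, 7, 8]
```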

### Why are the changes needed?

Make the perf-critical path more efficient.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #42903 from cloud-fan/vector.

Authored-by: Wenchen Fan 
Signed-off-by: Dongjoon Hyun 
---
 .../spark/sql/execution/vectorized/OffHeapColumnVector.java  | 12 ++--
 .../spark/sql/execution/vectorized/OnHeapColumnVector.java   | 12 ++--
 2 files changed, 12 insertions(+), 12 deletions(-)

diff --git 
a/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OffHeapColumnVector.java
 
b/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OffHeapColumnVector.java
index 9cb1b1f0b5e..2bb0b02d4c9 100644
--- 
a/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OffHeapColumnVector.java
+++ 
b/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OffHeapColumnVector.java
@@ -218,7 +218,7 @@ public final class OffHeapColumnVector extends 
WritableColumnVector {
   Platform.copyMemory(null, data + rowId, array, 
Platform.BYTE_ARRAY_OFFSET, count);
 } else {
   for (int i = 0; i < count; i++) {
-array[i] = getByte(rowId + i);
+array[i] = (byte) dictionary.decodeToInt(dictionaryIds.getDictId(rowId 
+ i));
   }
 }
 return array;
@@ -279,7 +279,7 @@ public final class OffHeapColumnVector extends 
WritableColumnVector {
   Platform.copyMemory(null, data + rowId * 2L, array, 
Platform.SHORT_ARRAY_OFFSET, count * 2L);
 } else {
   for (int i = 0; i < count; i++) {
-array[i] = getShort(rowId + i);
+array[i] = (short) 
dictionary.decodeToInt(dictionaryIds.getDictId(rowId + i));
   }
 }
 return array;
@@ -345,7 +345,7 @@ public final class OffHeapColumnVector extends 
WritableColumnVector {
   Platform.copyMemory(null, data + rowId * 4L, array, 
Platform.INT_ARRAY_OFFSET, count * 4L);
 } else {
   for (int i = 0; i < count; i++) {
-array[i] = getInt(rowId + i);
+array[i] = dictionary.decodeToInt(dictionaryIds.getDictId(rowId + i));
   }
 }
 return array;
@@ -423,7 +423,7 @@ public final class OffHeapColumnVector extends 
WritableColumnVector {
   Platform.copyMemory(null, data + rowId * 8L, array, 
Platform.LONG_ARRAY_OFFSET, count * 8L);
 } else {
   for (int i = 0; i < count; i++) {
-array[i] = getLong(rowId + i);
+array[i] = dictionary.decodeToLong(dictionaryIds.getDictId(rowId + i));
   }
 }
 return array;
@@ -487,7 +487,7 @@ public final class OffHeapColumnVector extends 
WritableColumnVector {
   Platform.copyMemory(null, data + rowId * 4L, array, 
Platform.FLOAT_ARRAY_OFFSET, count * 4L);
 } else {
   for (int i = 0; i < count; i++) {
-array[i] = getFloat(rowId + i);
+array[i] = dictionary.decodeToFloat(dictionaryIds.getDictId(rowId + 
i));
   }
 }
 return array;
@@ -553,7 +553,7 @@ public final class OffHeapColumnVector extends 
WritableColumnVector {
 count * 8L);
 } else {
   for (int i = 0; i < count; i++) {
-array[i] = getDouble(rowId + i);
+array[i] = dictionary.decodeToDouble(dictionaryIds.getDictId(rowId + 
i));
   }
 }
 return array;
diff --git 
a/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java
 
b/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java
index be590bb9ac7..2bf2b8d08fc 100644
--- 
a/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java
+++ 
b/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java
@@ -216,7 +216,7 @@ public final class OnHeapColumnVector extends 
WritableColumnVector {
   System.arraycopy(byteData, rowId, array, 0, count);
 } 

[spark] branch branch-3.3 updated: [SPARK-45146][DOCS] Update the default value of 'spark.submit.deployMode'

2023-09-13 Thread srowen
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.3 by this push:
 new 9ee184ad5cf [SPARK-45146][DOCS] Update the default value of 
'spark.submit.deployMode'
9ee184ad5cf is described below

commit 9ee184ad5cf1ea808143cffd6fa982ca8ef503fe
Author: chenyu-opensource <119398199+chenyu-opensou...@users.noreply.github.com>
AuthorDate: Wed Sep 13 08:48:14 2023 -0500

[SPARK-45146][DOCS] Update the default value of 'spark.submit.deployMode'

**What changes were proposed in this pull request?**
The PR updates the default value of 'spark.submit.deployMode' in 
configuration.html on the website

**Why are the changes needed?**
The default value of 'spark.submit.deployMode' is 'client', but the website 
is wrong.

**Does this PR introduce any user-facing change?**
No

**How was this patch tested?**
It doesn't need to.

**Was this patch authored or co-authored using generative AI tooling?**
No

Closes #42902 from chenyu-opensource/branch-SPARK-45146.

Authored-by: chenyu-opensource 
<119398199+chenyu-opensou...@users.noreply.github.com>
Signed-off-by: Sean Owen 
(cherry picked from commit 076cb7aabac2f0ff11ca77ca530b7b8db5310a5e)
Signed-off-by: Sean Owen 
---
 docs/configuration.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/configuration.md b/docs/configuration.md
index cb1f5212439..9e243635baf 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -394,7 +394,7 @@ of the most common options to set are:
 
 
   spark.submit.deployMode
-  (none)
+  client
   
 The deploy mode of Spark driver program, either "client" or "cluster",
 Which means to launch driver program locally ("client")





[spark] branch branch-3.4 updated: [SPARK-45146][DOCS] Update the default value of 'spark.submit.deployMode'

2023-09-13 Thread srowen
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
 new 7544bdb12d1 [SPARK-45146][DOCS] Update the default value of 
'spark.submit.deployMode'
7544bdb12d1 is described below

commit 7544bdb12d1d0449aaa7e7a5f8124a5cf662712f
Author: chenyu-opensource <119398199+chenyu-opensou...@users.noreply.github.com>
AuthorDate: Wed Sep 13 08:48:14 2023 -0500

[SPARK-45146][DOCS] Update the default value of 'spark.submit.deployMode'

**What changes were proposed in this pull request?**
The PR updates the default value of 'spark.submit.deployMode' in 
configuration.html on the website

**Why are the changes needed?**
The default value of 'spark.submit.deployMode' is 'client', but the website 
is wrong.

**Does this PR introduce any user-facing change?**
No

**How was this patch tested?**
It doesn't need to.

**Was this patch authored or co-authored using generative AI tooling?**
No

Closes #42902 from chenyu-opensource/branch-SPARK-45146.

Authored-by: chenyu-opensource 
<119398199+chenyu-opensou...@users.noreply.github.com>
Signed-off-by: Sean Owen 
(cherry picked from commit 076cb7aabac2f0ff11ca77ca530b7b8db5310a5e)
Signed-off-by: Sean Owen 
---
 docs/configuration.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/configuration.md b/docs/configuration.md
index f099cea7eb9..d61f726130b 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -394,7 +394,7 @@ of the most common options to set are:
 
 
   spark.submit.deployMode
-  (none)
+  client
   
 The deploy mode of Spark driver program, either "client" or "cluster",
 Which means to launch driver program locally ("client")





[spark] branch master updated: [SPARK-45146][DOCS] Update the default value of 'spark.submit.deployMode'

2023-09-13 Thread srowen
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 076cb7aabac [SPARK-45146][DOCS] Update the default value of 
'spark.submit.deployMode'
076cb7aabac is described below

commit 076cb7aabac2f0ff11ca77ca530b7b8db5310a5e
Author: chenyu-opensource <119398199+chenyu-opensou...@users.noreply.github.com>
AuthorDate: Wed Sep 13 08:48:14 2023 -0500

[SPARK-45146][DOCS] Update the default value of 'spark.submit.deployMode'

**What changes were proposed in this pull request?**
The PR updates the default value of 'spark.submit.deployMode' in 
configuration.html on the website

**Why are the changes needed?**
The default value of 'spark.submit.deployMode' is 'client', but the website 
is wrong.

**Does this PR introduce any user-facing change?**
No

**How was this patch tested?**
It doesn't need to.

**Was this patch authored or co-authored using generative AI tooling?**
No

Closes #42902 from chenyu-opensource/branch-SPARK-45146.

Authored-by: chenyu-opensource 
<119398199+chenyu-opensou...@users.noreply.github.com>
Signed-off-by: Sean Owen 
---
 docs/configuration.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/configuration.md b/docs/configuration.md
index 6f7e12555e8..3ca9b704eba 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -394,7 +394,7 @@ of the most common options to set are:
 
 
   spark.submit.deployMode
-  (none)
+  client
   
 The deploy mode of Spark driver program, either "client" or "cluster",
 Which means to launch driver program locally ("client")





[spark] branch branch-3.5 updated: [SPARK-45146][DOCS] Update the default value of 'spark.submit.deployMode'

2023-09-13 Thread srowen
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new e72ae794e69 [SPARK-45146][DOCS] Update the default value of 
'spark.submit.deployMode'
e72ae794e69 is described below

commit e72ae794e69d8182291655d023aee903a913571b
Author: chenyu-opensource <119398199+chenyu-opensou...@users.noreply.github.com>
AuthorDate: Wed Sep 13 08:48:14 2023 -0500

[SPARK-45146][DOCS] Update the default value of 'spark.submit.deployMode'

**What changes were proposed in this pull request?**
The PR updates the default value of 'spark.submit.deployMode' in 
configuration.html on the website

**Why are the changes needed?**
The default value of 'spark.submit.deployMode' is 'client', but the website 
is wrong.

**Does this PR introduce any user-facing change?**
No

**How was this patch tested?**
It doesn't need to.

**Was this patch authored or co-authored using generative AI tooling?**
No

Closes #42902 from chenyu-opensource/branch-SPARK-45146.

Authored-by: chenyu-opensource 
<119398199+chenyu-opensou...@users.noreply.github.com>
Signed-off-by: Sean Owen 
(cherry picked from commit 076cb7aabac2f0ff11ca77ca530b7b8db5310a5e)
Signed-off-by: Sean Owen 
---
 docs/configuration.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/configuration.md b/docs/configuration.md
index dfded480c99..1139beb6646 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -394,7 +394,7 @@ of the most common options to set are:
 
 
   spark.submit.deployMode
-  (none)
+  client
   
 The deploy mode of Spark driver program, either "client" or "cluster",
 Which means to launch driver program locally ("client")





[spark] branch branch-3.5 updated: [SPARK-45142][INFRA] Specify the range for Spark Connect dependencies in pyspark base image

2023-09-13 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 151f88b53e6 [SPARK-45142][INFRA] Specify the range for Spark Connect 
dependencies in pyspark base image
151f88b53e6 is described below

commit 151f88b53e67944d6ca5c635466f50958019c8b4
Author: Hyukjin Kwon 
AuthorDate: Wed Sep 13 21:20:19 2023 +0900

[SPARK-45142][INFRA] Specify the range for Spark Connect dependencies in 
pyspark base image

This PR proposes to pin the dependencies related to Spark Connect in its 
base image according to the range we support.
See also 
https://github.com/apache/spark/blob/master/python/docs/source/getting_started/install.rst#dependencies

To properly test the dependency versions we support.

No, dev-only.

In this PR, it will be tested.

No.

Closes #42898 from HyukjinKwon/SPARK-45142.

Authored-by: Hyukjin Kwon 
Signed-off-by: Hyukjin Kwon 
(cherry picked from commit 61435b42fdc4071f35aba6af9248ff9ad8fc8514)
Signed-off-by: Hyukjin Kwon 
---
 dev/infra/Dockerfile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/dev/infra/Dockerfile b/dev/infra/Dockerfile
index af8e1a980f9..d3bae836cc6 100644
--- a/dev/infra/Dockerfile
+++ b/dev/infra/Dockerfile
@@ -68,7 +68,7 @@ RUN pypy3 -m pip install numpy 'pandas<=2.0.3' scipy coverage 
matplotlib
 RUN python3.9 -m pip install numpy pyarrow 'pandas<=2.0.3' scipy 
unittest-xml-reporting plotly>=4.8 'mlflow>=2.3.1' coverage matplotlib openpyxl 
'memory-profiler==0.60.0' 'scikit-learn==1.1.*'
 
 # Add Python deps for Spark Connect.
-RUN python3.9 -m pip install grpcio protobuf googleapis-common-protos 
grpcio-status
+RUN python3.9 -m pip install 'grpcio>=1.48,<1.57' 'grpcio-status>=1.48,<1.57' 
'protobuf==3.20.3' 'googleapis-common-protos==1.56.4'
 
 # Add torch as a testing dependency for TorchDistributor
 RUN python3.9 -m pip install torch torchvision torcheval





[spark] branch master updated: [SPARK-45142][INFRA] Specify the range for Spark Connect dependencies in pyspark base image

2023-09-13 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 61435b42fdc [SPARK-45142][INFRA] Specify the range for Spark Connect 
dependencies in pyspark base image
61435b42fdc is described below

commit 61435b42fdc4071f35aba6af9248ff9ad8fc8514
Author: Hyukjin Kwon 
AuthorDate: Wed Sep 13 21:20:19 2023 +0900

[SPARK-45142][INFRA] Specify the range for Spark Connect dependencies in 
pyspark base image

### What changes were proposed in this pull request?

This PR proposes to pin the dependencies related to Spark Connect in its 
base image according to the range we support.
See also 
https://github.com/apache/spark/blob/master/python/docs/source/getting_started/install.rst#dependencies

### Why are the changes needed?

To properly test the dependency versions we support.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

In this PR, it will be tested.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #42898 from HyukjinKwon/SPARK-45142.

Authored-by: Hyukjin Kwon 
Signed-off-by: Hyukjin Kwon 
---
 dev/infra/Dockerfile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/dev/infra/Dockerfile b/dev/infra/Dockerfile
index feee7415004..99423ce072c 100644
--- a/dev/infra/Dockerfile
+++ b/dev/infra/Dockerfile
@@ -88,7 +88,7 @@ RUN pypy3 -m pip install numpy 'pandas<=2.0.3' scipy coverage 
matplotlib
 RUN python3.9 -m pip install numpy 'pyarrow==12.0.1' 'pandas<=2.0.3' scipy 
unittest-xml-reporting plotly>=4.8 'mlflow>=2.3.1' coverage matplotlib openpyxl 
'memory-profiler==0.60.0' 'scikit-learn==1.1.*'
 
 # Add Python deps for Spark Connect.
-RUN python3.9 -m pip install grpcio protobuf googleapis-common-protos 
grpcio-status
+RUN python3.9 -m pip install 'grpcio>=1.48,<1.57' 'grpcio-status>=1.48,<1.57' 
'protobuf==3.20.3' 'googleapis-common-protos==1.56.4'
 
 # Add torch as a testing dependency for TorchDistributor
 RUN python3.9 -m pip install torch torchvision --index-url 
https://download.pytorch.org/whl/cpu





[spark] branch master updated: [SPARK-45139][SQL] Add DatabricksDialect to handle SQL type conversion

2023-09-13 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 2710dbe6abf [SPARK-45139][SQL] Add DatabricksDialect to handle SQL 
type conversion
2710dbe6abf is described below

commit 2710dbe6abf0b143a18268a99635d0e6033bea78
Author: Ivan Sadikov 
AuthorDate: Wed Sep 13 02:28:56 2023 -0700

[SPARK-45139][SQL] Add DatabricksDialect to handle SQL type conversion

### What changes were proposed in this pull request?

This PR adds `DatabricksDialect` to Spark so that users can query Databricks clusters and Databricks SQL warehouses with more precise SQL type conversion and identifier quoting, instead of handling those manually in their code.
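
A hedged usage sketch: with the dialect registered, a JDBC read against a URL starting with `jdbc:databricks` picks up the Databricks-specific type mapping and backtick identifier quoting automatically (the host, HTTP path, token and table below are placeholders, and `spark` is an existing SparkSession):

```
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:databricks://<server-hostname>:443;httpPath=<http-path>")
    .option("dbtable", "samples.nyctaxi.trips")       # placeholder table
    .option("user", "token")
    .option("password", "<personal-access-token>")
    .load()
)
df.printSchema()  # e.g. TINYINT/SMALLINT/REAL columns map to Byte/Short/Float types
```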

### Why are the changes needed?

The PR fixes type conversion and makes it easier to query Databricks 
clusters.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I added unit tests in JDBCSuite to check conversion.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #42896 from sadikovi/add_databricks_dialect.

Authored-by: Ivan Sadikov 
Signed-off-by: Dongjoon Hyun 
---
 .../apache/spark/sql/jdbc/DatabricksDialect.scala  | 93 ++
 .../org/apache/spark/sql/jdbc/JdbcDialects.scala   |  1 +
 .../org/apache/spark/sql/jdbc/JDBCSuite.scala  | 26 ++
 3 files changed, 120 insertions(+)

diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/jdbc/DatabricksDialect.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/jdbc/DatabricksDialect.scala
new file mode 100644
index 000..1b715283dd4
--- /dev/null
+++ b/sql/core/src/main/scala/org/apache/spark/sql/jdbc/DatabricksDialect.scala
@@ -0,0 +1,93 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.jdbc
+
+import java.sql.Connection
+
+import scala.collection.mutable.ArrayBuilder
+
+import org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions
+import org.apache.spark.sql.execution.datasources.v2.TableSampleInfo
+import org.apache.spark.sql.types._
+
+private case object DatabricksDialect extends JdbcDialect {
+
+  override def canHandle(url: String): Boolean = {
+url.startsWith("jdbc:databricks")
+  }
+
+  override def getCatalystType(
+  sqlType: Int,
+  typeName: String,
+  size: Int,
+  md: MetadataBuilder): Option[DataType] = {
+sqlType match {
+  case java.sql.Types.TINYINT => Some(ByteType)
+  case java.sql.Types.SMALLINT => Some(ShortType)
+  case java.sql.Types.REAL => Some(FloatType)
+  case _ => None
+}
+  }
+
+  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
+case BooleanType => Some(JdbcType("BOOLEAN", java.sql.Types.BOOLEAN))
+case DoubleType => Some(JdbcType("DOUBLE", java.sql.Types.DOUBLE))
+case StringType => Some(JdbcType("STRING", java.sql.Types.VARCHAR))
+case BinaryType => Some(JdbcType("BINARY", java.sql.Types.BINARY))
+case _ => None
+  }
+
+  override def quoteIdentifier(colName: String): String = {
+s"`$colName`"
+  }
+
+  override def supportsLimit: Boolean = true
+
+  override def supportsOffset: Boolean = true
+
+  override def supportsTableSample: Boolean = true
+
+  override def getTableSample(sample: TableSampleInfo): String = {
+s"TABLESAMPLE (${(sample.upperBound - sample.lowerBound) * 100}) 
REPEATABLE (${sample.seed})"
+  }
+
+  // Override listSchemas to run "show schemas" as a PreparedStatement instead 
of
+  // invoking getMetaData.getSchemas as it may not work correctly in older 
versions of the driver.
+  override def schemasExists(conn: Connection, options: JDBCOptions, schema: 
String): Boolean = {
+val stmt = conn.prepareStatement("SHOW SCHEMAS")
+val rs = stmt.executeQuery()
+while (rs.next()) {
+  if (rs.getString(1) == schema) {
+return true
+  }
+}
+false
+  }
+
+  // Override listSchemas to run "show schemas" as a PreparedStatement instead 
of
+  // invoking 

[spark] branch master updated: [SPARK-45147][CORE] Remove `System.setSecurityManager` usage

2023-09-13 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 4f74fc58148 [SPARK-45147][CORE] Remove `System.setSecurityManager` 
usage
4f74fc58148 is described below

commit 4f74fc581486a1a750b3bb27abc12e1a87215ea6
Author: Dongjoon Hyun 
AuthorDate: Wed Sep 13 02:03:28 2023 -0700

[SPARK-45147][CORE] Remove `System.setSecurityManager` usage

### What changes were proposed in this pull request?

This PR aims to remove the deprecated Java `System.setSecurityManager` usage for Apache Spark 4.0.

Note that this is the only usage in Apache Spark AS-IS code.
```
$ git grep setSecurityManager
core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala: 
 System.setSecurityManager(sm)
core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala: 
 System.setSecurityManager(currentSm)
```

This usage was added at `Apache Spark 1.5.0`.

- #5841

Since `Apache Spark 2.4.0`, we don't need `setSecurityManager` due to the 
following improvement.

- #20925

### Why are the changes needed?

```
$ java -version
openjdk version "21-ea" 2023-09-19
OpenJDK Runtime Environment (build 21-ea+32-2482)
OpenJDK 64-Bit Server VM (build 21-ea+32-2482, mixed mode, sharing)
```

```
max spark-3.5.0-bin-hadoop3:$ bin/spark-sql --help
...
CLI options:
Exception in thread "main" java.lang.UnsupportedOperationException:
The Security Manager is deprecated and will be removed in a future release
  at java.base/java.lang.System.setSecurityManager(System.java:429)
  at 
org.apache.spark.deploy.SparkSubmitArguments.getSqlShellOptions(SparkSubmitArguments.scala:623)
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual test.
```
$ build/sbt test:package -Phive -Phive-thriftserver
$ bin/spark-sql --help
...
CLI options:
 -d,--define   Variable substitution to apply to Hive
  commands. e.g. -d A=B or --define A=B
--database  Specify the database to use
 -e  SQL from command line
 -f SQL from files
 -H,--helpPrint help information
--hiveconfUse value for given property
--hivevar  Variable substitution to apply to Hive
  commands. e.g. --hivevar A=B
 -i Initialization SQL file
 -S,--silent  Silent mode in interactive shell
 -v,--verbose Verbose mode (echo executed SQL to the
  console)
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #42901 from dongjoon-hyun/SPARK-45147.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 .../apache/spark/deploy/SparkSubmitArguments.scala | 26 +-
 1 file changed, 1 insertion(+), 25 deletions(-)

diff --git 
a/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala 
b/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala
index a3fe5153bee..867fc05cb8a 100644
--- a/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala
+++ b/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala
@@ -18,7 +18,6 @@
 package org.apache.spark.deploy
 
 import java.io.{ByteArrayOutputStream, File, PrintStream}
-import java.lang.reflect.InvocationTargetException
 import java.nio.charset.StandardCharsets
 import java.util.{List => JList}
 
@@ -599,39 +598,17 @@ private[deploy] class SparkSubmitArguments(args: 
Seq[String], env: Map[String, S
   /**
* Run the Spark SQL CLI main class with the "--help" option and catch its 
output. Then filter
* the results to remove unwanted lines.
-   *
-   * Since the CLI will call `System.exit()`, we install a security manager to 
prevent that call
-   * from working, and restore the original one afterwards.
*/
   private def getSqlShellOptions(): String = {
     val currentOut = System.out
     val currentErr = System.err
-    val currentSm = System.getSecurityManager()
     try {
       val out = new ByteArrayOutputStream()
       val stream = new PrintStream(out)
       System.setOut(stream)
       System.setErr(stream)
 
-      val sm = new SecurityManager() {
-        override def checkExit(status: Int): Unit = {
-          throw new SecurityException()
-        }
-
-        override def checkPermission(perm: java.security.Permission): Unit = {}
-      }
-      System.setSecurityManager(sm)
-
-      try {
-

[spark] branch master updated: [SPARK-45141][PYTHON][INFRA][TESTS] Pin `pyarrow==12.0.1` in CI

2023-09-13 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new e3d2dfa8b51 [SPARK-45141][PYTHON][INFRA][TESTS] Pin `pyarrow==12.0.1` in CI
e3d2dfa8b51 is described below

commit e3d2dfa8b514f9358823c3cb1ad6523da8a6646b
Author: Ruifeng Zheng 
AuthorDate: Wed Sep 13 15:51:27 2023 +0800

[SPARK-45141][PYTHON][INFRA][TESTS] Pin `pyarrow==12.0.1` in CI

### What changes were proposed in this pull request?
Pin `pyarrow==12.0.1` in CI

### Why are the changes needed?
To fix a test failure: https://github.com/apache/spark/actions/runs/6167186123/job/16738683632

```
======================================================================
FAIL [0.095s]: test_from_to_pandas (pyspark.pandas.tests.data_type_ops.test_datetime_ops.DatetimeOpsTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/__w/spark/spark/python/pyspark/testing/pandasutils.py", line 122, in _assert_pandas_equal
    assert_series_equal(
  File "/usr/local/lib/python3.9/dist-packages/pandas/_testing/asserters.py", line 931, in assert_series_equal
    assert_attr_equal("dtype", left, right, obj=f"Attributes of {obj}")
  File "/usr/local/lib/python3.9/dist-packages/pandas/_testing/asserters.py", line 415, in assert_attr_equal
    raise_assert_detail(obj, msg, left_attr, right_attr)
  File "/usr/local/lib/python3.9/dist-packages/pandas/_testing/asserters.py", line 599, in raise_assert_detail
    raise AssertionError(msg)
AssertionError: Attributes of Series are different

Attribute "dtype" are different
[left]:  datetime64[ns]
[right]: datetime64[us]
```

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI and manual tests

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #42897 from zhengruifeng/pin_pyarrow.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 .github/workflows/build_and_test.yml | 4 ++--
 dev/infra/Dockerfile | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml
index f0bd65bcf41..21809564497 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -263,7 +263,7 @@ jobs:
 - name: Install Python packages (Python 3.8)
   if: (contains(matrix.modules, 'sql') && !contains(matrix.modules, 'sql-'))
   run: |
-python3.8 -m pip install 'numpy>=1.20.0' pyarrow pandas scipy unittest-xml-reporting 'grpcio==1.56.0' 'protobuf==3.20.3'
+python3.8 -m pip install 'numpy>=1.20.0' 'pyarrow==12.0.1' pandas scipy unittest-xml-reporting 'grpcio==1.56.0' 'protobuf==3.20.3'
 python3.8 -m pip list
 # Run the tests.
 - name: Run tests
@@ -728,7 +728,7 @@ jobs:
 #   See also https://issues.apache.org/jira/browse/SPARK-38279.
 python3.9 -m pip install 'sphinx<3.1.0' mkdocs pydata_sphinx_theme nbsphinx numpydoc 'jinja2<3.0.0' 'markupsafe==2.0.1' 'pyzmq<24.0.0'
 python3.9 -m pip install ipython_genutils # See SPARK-38517
-python3.9 -m pip install sphinx_plotly_directive 'numpy>=1.20.0' pyarrow pandas 'plotly>=4.8'
+python3.9 -m pip install sphinx_plotly_directive 'numpy>=1.20.0' 'pyarrow==12.0.1' pandas 'plotly>=4.8'
 python3.9 -m pip install 'docutils<0.18.0' # See SPARK-39421
 apt-get update -y
 apt-get install -y ruby ruby-dev
diff --git a/dev/infra/Dockerfile b/dev/infra/Dockerfile
index 60204dcc49e..feee7415004 100644
--- a/dev/infra/Dockerfile
+++ b/dev/infra/Dockerfile
@@ -85,7 +85,7 @@ RUN Rscript -e "devtools::install_version('roxygen2', version='7.2.0', repos='ht
 ENV R_LIBS_SITE "/usr/local/lib/R/site-library:${R_LIBS_SITE}:/usr/lib/R/library"
 
 RUN pypy3 -m pip install numpy 'pandas<=2.0.3' scipy coverage matplotlib
-RUN python3.9 -m pip install numpy pyarrow 'pandas<=2.0.3' scipy unittest-xml-reporting plotly>=4.8 'mlflow>=2.3.1' coverage matplotlib openpyxl 'memory-profiler==0.60.0' 'scikit-learn==1.1.*'
+RUN python3.9 -m pip install numpy 'pyarrow==12.0.1' 'pandas<=2.0.3' scipy unittest-xml-reporting plotly>=4.8 'mlflow>=2.3.1' coverage matplotlib openpyxl 'memory-profiler==0.60.0' 'scikit-learn==1.1.*'
 
 # Add Python deps for Spark Connect.
 RUN python3.9 -m pip install grpcio protobuf googleapis-common-protos grpcio-status





[spark] branch master updated: [SPARK-45145][EXAMPLE] Add JavaSparkSQLCli example

2023-09-13 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 66f42b5b292 [SPARK-45145][EXAMPLE] Add JavaSparkSQLCli example
66f42b5b292 is described below

commit 66f42b5b292cb4afff14b42f0d31835067ebdf5e
Author: Dongjoon Hyun 
AuthorDate: Wed Sep 13 00:20:18 2023 -0700

[SPARK-45145][EXAMPLE] Add JavaSparkSQLCli example

### What changes were proposed in this pull request?

This PR aims to add a simple Java example, `JavaSparkSQLCli.java`. Like the
SparkPi example, this minimal example can be launched with `bin/run-example`,
so we can run a simple SQL query as a canary test during the RC testing and
voting phase.

- https://spark.apache.org/docs/3.5.0/

```
./bin/run-example SparkPi 10
```

 After this PR,
```
$ bin/run-example sql.JavaSparkSQLCli "SELECT 'Spark SQL' col"
+---------+
|col      |
+---------+
|Spark SQL|
+---------+
```
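
For comparison only, a hypothetical Scala counterpart (not part of this PR)
would use the same `SparkSession` API:

```
package org.apache.spark.examples.sql

import org.apache.spark.sql.SparkSession

// Hypothetical Scala equivalent of JavaSparkSQLCli; illustrative only.
object ScalaSparkSQLCliSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("Scala Spark SQL Cli")
      .getOrCreate()

    // Run each argument as a SQL statement and print the full (untruncated) result.
    args.foreach(stmt => spark.sql(stmt).show(false))

    spark.stop()
  }
}
```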

### Why are the changes needed?

Although `org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver` and the
`bin/spark-sql` shell already exist, they require building with
`-Phive -Phive-thriftserver` explicitly.
```
$ bin/spark-shell --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.5.0
      /_/
...

$ bin/run-example org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver
...
Error: Failed to load class org.apache.spark.examples.org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.
Failed to load main class org.apache.spark.examples.org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.
You need to build Spark with -Phive and -Phive-thriftserver.
23/09/12 23:06:38 INFO ShutdownHookManager: Shutdown hook called
23/09/12 23:06:38 INFO ShutdownHookManager: Deleting directory /private/var/folders/d4/dr6zxyvd4cl38877bj3fxs_mgn/T/spark-2a9ee4a3-64c3-492c-a268-a71b9aa44a2a
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually.
```
$ build/sbt test:package
$ bin/run-example sql.JavaSparkSQLCli "SELECT 'Spark SQL' col"
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #42900 from dongjoon-hyun/SPARK-45145.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 .../apache/spark/examples/sql/JavaSparkSQLCli.java | 41 ++
 1 file changed, 41 insertions(+)

diff --git a/examples/src/main/java/org/apache/spark/examples/sql/JavaSparkSQLCli.java b/examples/src/main/java/org/apache/spark/examples/sql/JavaSparkSQLCli.java
new file mode 100644
index 000..d5b033c46bf
--- /dev/null
+++ b/examples/src/main/java/org/apache/spark/examples/sql/JavaSparkSQLCli.java
@@ -0,0 +1,41 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.examples.sql;
+
+import org.apache.spark.sql.SparkSession;
+
+/**
+ * Example Usage:
+ * <pre>
+ * bin/run-example sql.JavaSparkSQLCli "SELECT 'Spark SQL' col"
+ * </pre>
+ */
+public class JavaSparkSQLCli {
+
+  public static void main(String[] args) {
+    SparkSession spark = SparkSession
+      .builder()
+      .appName("Java Spark SQL Cli")
+      .getOrCreate();
+
+    for (String a: args) {
+      spark.sql(a).show(false);
+    }
+
+    spark.stop();
+  }
+}

