[spark] branch master updated: [SPARK-41658][SPARK-41656][DOCS][FOLLOW-UP] Update JIRAs in skipped tests' comments

2023-01-02 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 02f12eeed0c [SPARK-41658][SPARK-41656][DOCS][FOLLOW-UP] Update JIRAs 
in skipped tests' comments
02f12eeed0c is described below

commit 02f12eeed0ce27757edc83e99e05152113ea7f3c
Author: Hyukjin Kwon 
AuthorDate: Tue Jan 3 14:39:45 2023 +0900

[SPARK-41658][SPARK-41656][DOCS][FOLLOW-UP] Update JIRAs in skipped tests' 
comments

### What changes were proposed in this pull request?

This PR is a follow-up of https://github.com/apache/spark/pull/39347 and 
https://github.com/apache/spark/pull/39346, which updates the invalid JIRAs 
linked in the comments.
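
For context, each of these comments sits above a line that skips a doctest by deleting the 
corresponding method's docstring. A minimal, self-contained sketch of that pattern (the 
`add` function is illustrative, not from the Spark code base):

```python
import doctest

def add(a, b):
    """Add two numbers.

    >>> add(1, 2)
    3
    """
    return a + b

# Deleting the docstring removes the embedded example, so doctest collects
# nothing for this function and its example is effectively skipped.
del add.__doc__

print(doctest.testmod())  # TestResults(failed=0, attempted=0)
```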

### Why are the changes needed?

To track the issues properly and re-enable the skipped tests in the future.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Comment-only changes. The linter in CI should verify them, and I also checked them 
manually in my local environment.

Closes #39354 from HyukjinKwon/SPARK-41658-followup.

Authored-by: Hyukjin Kwon 
Signed-off-by: Hyukjin Kwon 
---
 python/pyspark/sql/connect/dataframe.py | 4 ++--
 python/pyspark/sql/connect/functions.py | 5 ++---
 2 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/python/pyspark/sql/connect/dataframe.py 
b/python/pyspark/sql/connect/dataframe.py
index 0a69b6317f8..57c9e801c22 100644
--- a/python/pyspark/sql/connect/dataframe.py
+++ b/python/pyspark/sql/connect/dataframe.py
@@ -1394,7 +1394,7 @@ def _test() -> None:
 sc, options={"spark.app.name": "sql.connect.dataframe tests"}
 )
 
-# TODO(SPARK-41819): Implement RDD.getNumPartitions
+# Spark Connect does not support RDD but the tests depend on them.
 del pyspark.sql.connect.dataframe.DataFrame.coalesce.__doc__
 del pyspark.sql.connect.dataframe.DataFrame.repartition.__doc__
 
@@ -1420,7 +1420,7 @@ def _test() -> None:
 del pyspark.sql.connect.dataframe.DataFrame.replace.__doc__
 del pyspark.sql.connect.dataframe.DataFrame.intersect.__doc__
 
-# TODO(SPARK-41826): Implement Dataframe.readStream
+# TODO(SPARK-41625): Support Structured Streaming
 del pyspark.sql.connect.dataframe.DataFrame.isStreaming.__doc__
 
 # TODO(SPARK-41827): groupBy requires all cols be Column or str
diff --git a/python/pyspark/sql/connect/functions.py 
b/python/pyspark/sql/connect/functions.py
index f2d3aa64728..6e688271a3f 100644
--- a/python/pyspark/sql/connect/functions.py
+++ b/python/pyspark/sql/connect/functions.py
@@ -2344,6 +2344,8 @@ def _test() -> None:
 globs["_spark"] = PySparkSession(
 sc, options={"spark.app.name": "sql.connect.functions tests"}
 )
+# Spark Connect does not support Spark Context but the test depends on that.
+del pyspark.sql.connect.functions.monotonically_increasing_id.__doc__
 
 # TODO(SPARK-41833): fix collect() output
 del pyspark.sql.connect.functions.array.__doc__
@@ -2406,9 +2408,6 @@ def _test() -> None:
 # TODO(SPARK-41836): Implement `transform_values` function
 del pyspark.sql.connect.functions.transform_values.__doc__
 
-# TODO(SPARK-41839): Implement SparkSession.sparkContext
-del pyspark.sql.connect.functions.monotonically_increasing_id.__doc__
-
 # TODO(SPARK-41840): Fix 'Column' object is not callable
 del pyspark.sql.connect.functions.first.__doc__
 del pyspark.sql.connect.functions.last.__doc__





[spark] branch master updated (b7a0fc4b7bd -> 4d6856e913c)

2023-01-02 Thread maxgekk
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from b7a0fc4b7bd [SPARK-41658][CONNECT][TESTS] Enable doctests in 
pyspark.sql.connect.functions
 add 4d6856e913c [SPARK-41311][SQL][TESTS] Rewrite test 
RENAME_SRC_PATH_NOT_FOUND to trigger the error from user space

No new revisions were added by this update.

Summary of changes:
 .../sql/errors/QueryExecutionErrorsSuite.scala | 54 +-
 1 file changed, 31 insertions(+), 23 deletions(-)





[spark] branch master updated (973b8ffc828 -> b7a0fc4b7bd)

2023-01-02 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 973b8ffc828 [SPARK-41807][CORE] Remove non-existent error class: 
UNSUPPORTED_FEATURE.DISTRIBUTE_BY
 add b7a0fc4b7bd [SPARK-41658][CONNECT][TESTS] Enable doctests in 
pyspark.sql.connect.functions

No new revisions were added by this update.

Summary of changes:
 dev/sparktestsupport/modules.py |   1 +
 python/pyspark/sql/connect/functions.py | 156 
 2 files changed, 157 insertions(+)





[spark] branch master updated (2cf11cdb04f -> 973b8ffc828)

2023-01-02 Thread maxgekk
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 2cf11cdb04f [SPARK-41854][PYTHON][BUILD] Automatic reformat/check 
python/setup.py
 add 973b8ffc828 [SPARK-41807][CORE] Remove non-existent error class: 
UNSUPPORTED_FEATURE.DISTRIBUTE_BY

No new revisions were added by this update.

Summary of changes:
 core/src/main/resources/error/error-classes.json | 5 -
 1 file changed, 5 deletions(-)





[spark] branch master updated: [SPARK-41854][PYTHON][BUILD] Automatic reformat/check python/setup.py

2023-01-02 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 2cf11cdb04f [SPARK-41854][PYTHON][BUILD] Automatic reformat/check 
python/setup.py
2cf11cdb04f is described below

commit 2cf11cdb04f4c8628a991e50470331c3a8682bcd
Author: Hyukjin Kwon 
AuthorDate: Tue Jan 3 13:05:29 2023 +0900

[SPARK-41854][PYTHON][BUILD] Automatic reformat/check python/setup.py

### What changes were proposed in this pull request?

This PR proposes to automatically reformat `python/setup.py` too.

### Why are the changes needed?

To make the development cycle easier.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

I manually checked via:

```bash
./dev/reformat-python
./dev/lint-python
```

Closes #39352 from HyukjinKwon/SPARK-41854.

Authored-by: Hyukjin Kwon 
Signed-off-by: Hyukjin Kwon 
---
 dev/lint-python |   2 +-
 dev/reformat-python |   2 +-
 python/setup.py | 240 
 3 files changed, 133 insertions(+), 111 deletions(-)

diff --git a/dev/lint-python b/dev/lint-python
index 59ce71980d9..f1f4e9f1070 100755
--- a/dev/lint-python
+++ b/dev/lint-python
@@ -220,7 +220,7 @@ function black_test {
 fi
 
 echo "starting black test..."
-BLACK_REPORT=$( ($BLACK_BUILD  --config dev/pyproject.toml --check python/pyspark dev) 2>&1)
+BLACK_REPORT=$( ($BLACK_BUILD  --config dev/pyproject.toml --check python/pyspark dev python/setup.py) 2>&1)
 BLACK_STATUS=$?
 
 if [ "$BLACK_STATUS" -ne 0 ]; then
diff --git a/dev/reformat-python b/dev/reformat-python
index ae2118ab631..9543f5713d1 100755
--- a/dev/reformat-python
+++ b/dev/reformat-python
@@ -29,4 +29,4 @@ if [ $? -ne 0 ]; then
 exit 1
 fi
 
-$BLACK_BUILD --config dev/pyproject.toml python/pyspark dev
+$BLACK_BUILD --config dev/pyproject.toml python/pyspark dev python/setup.py
diff --git a/python/setup.py b/python/setup.py
index 54115359a60..faba203a53a 100755
--- a/python/setup.py
+++ b/python/setup.py
@@ -25,19 +25,23 @@ from setuptools.command.install import install
 from shutil import copyfile, copytree, rmtree
 
 try:
-exec(open('pyspark/version.py').read())
+exec(open("pyspark/version.py").read())
 except IOError:
-print("Failed to load PySpark version file for packaging. You must be in 
Spark's python dir.",
-  file=sys.stderr)
+print(
+"Failed to load PySpark version file for packaging. You must be in 
Spark's python dir.",
+file=sys.stderr,
+)
 sys.exit(-1)
 try:
 spec = importlib.util.spec_from_file_location("install", 
"pyspark/install.py")
 install_module = importlib.util.module_from_spec(spec)
 spec.loader.exec_module(install_module)
 except IOError:
-print("Failed to load the installing module (pyspark/install.py) which had 
to be "
-  "packaged together.",
-  file=sys.stderr)
+print(
+"Failed to load the installing module (pyspark/install.py) which had 
to be "
+"packaged together.",
+file=sys.stderr,
+)
 sys.exit(-1)
 VERSION = __version__  # noqa
 # A temporary path so we can access above the Python project root and fetch 
scripts and jars we need
@@ -61,12 +65,16 @@ JARS_PATH = glob.glob(os.path.join(SPARK_HOME, 
"assembly/target/scala-*/jars/"))
 
 if len(JARS_PATH) == 1:
 JARS_PATH = JARS_PATH[0]
-elif (os.path.isfile("../RELEASE") and 
len(glob.glob("../jars/spark*core*.jar")) == 1):
+elif os.path.isfile("../RELEASE") and 
len(glob.glob("../jars/spark*core*.jar")) == 1:
 # Release mode puts the jars in a jars directory
 JARS_PATH = os.path.join(SPARK_HOME, "jars")
 elif len(JARS_PATH) > 1:
-print("Assembly jars exist for multiple scalas ({0}), please cleanup 
assembly/target".format(
-JARS_PATH), file=sys.stderr)
+print(
+"Assembly jars exist for multiple scalas ({0}), please cleanup 
assembly/target".format(
+JARS_PATH
+),
+file=sys.stderr,
+)
 sys.exit(-1)
 elif len(JARS_PATH) == 0 and not os.path.exists(TEMP_PATH):
 print(incorrect_invocation_message, file=sys.stderr)
@@ -89,8 +97,9 @@ LICENSES_TARGET = os.path.join(TEMP_PATH, "licenses")
 # This is important because we only want to build the symlink farm while under 
Spark otherwise we
 # want to use the symlink farm. And if the symlink farm exists under while 
under Spark (e.g. a
 # partially built sdist) we should error and have the user sort it out.
-in_spark = 
(os.path.isfile("../core/src/main/scala/org/apache/spark/SparkContext.scala") or
-(os.path.isfile("../RELEASE") and 
len(glob.glob("../jars/spark*core*.jar")) == 1))
+in_spark = 

[spark] branch master updated: [SPARK-41656][CONNECT][TESTS] Enable doctests in pyspark.sql.connect.dataframe

2023-01-02 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 840620fbda1 [SPARK-41656][CONNECT][TESTS] Enable doctests in 
pyspark.sql.connect.dataframe
840620fbda1 is described below

commit 840620fbda1fa7b01c8bea8d327a8b5d96f9f9ad
Author: Sandeep Singh 
AuthorDate: Tue Jan 3 11:10:11 2023 +0900

[SPARK-41656][CONNECT][TESTS] Enable doctests in 
pyspark.sql.connect.dataframe

### What changes were proposed in this pull request?
This PR proposes to enable doctests in pyspark.sql.connect.dataframe that 
is virtually the same as pyspark.sql.dataframe.

### Why are the changes needed?
To ensure PySpark compatibility and test coverage.

### Does this PR introduce any user-facing change?
No, doctests only.

### How was this patch tested?
New Doctests Added

Closes #39346 from techaddict/SPARK-41656-pyspark.sql.connect.dataframe.

Authored-by: Sandeep Singh 
Signed-off-by: Hyukjin Kwon 
---
 dev/sparktestsupport/modules.py |  1 +
 python/pyspark/sql/connect/dataframe.py | 99 -
 python/pyspark/sql/dataframe.py | 10 ++--
 3 files changed, 104 insertions(+), 6 deletions(-)

diff --git a/dev/sparktestsupport/modules.py b/dev/sparktestsupport/modules.py
index 0eeb3dd9218..2c399174c13 100644
--- a/dev/sparktestsupport/modules.py
+++ b/dev/sparktestsupport/modules.py
@@ -510,6 +510,7 @@ pyspark_connect = Module(
 "pyspark.sql.connect.window",
 "pyspark.sql.connect.column",
 "pyspark.sql.connect.readwriter",
+"pyspark.sql.connect.dataframe",
 # unittests
 "pyspark.sql.tests.connect.test_connect_plan",
 "pyspark.sql.tests.connect.test_connect_basic",
diff --git a/python/pyspark/sql/connect/dataframe.py 
b/python/pyspark/sql/connect/dataframe.py
index 95582e86390..0a69b6317f8 100644
--- a/python/pyspark/sql/connect/dataframe.py
+++ b/python/pyspark/sql/connect/dataframe.py
@@ -35,7 +35,7 @@ import pandas
 import warnings
 from collections.abc import Iterable
 
-from pyspark import _NoValue
+from pyspark import _NoValue, SparkContext, SparkConf
 from pyspark._globals import _NoValueType
 from pyspark.sql.types import DataType, StructType, Row
 
@@ -1373,3 +1373,100 @@ class DataFrameStatFunctions:
 
 
 DataFrameStatFunctions.__doc__ = PySparkDataFrameStatFunctions.__doc__
+
+
+def _test() -> None:
+import os
+import sys
+import doctest
+from pyspark.sql import SparkSession as PySparkSession
+from pyspark.testing.connectutils import should_test_connect, 
connect_requirement_message
+
+os.chdir(os.environ["SPARK_HOME"])
+
+if should_test_connect:
+import pyspark.sql.connect.dataframe
+
+globs = pyspark.sql.connect.dataframe.__dict__.copy()
+# Works around to create a regular Spark session
+sc = SparkContext("local[4]", "sql.connect.dataframe tests", 
conf=SparkConf())
+globs["_spark"] = PySparkSession(
+sc, options={"spark.app.name": "sql.connect.dataframe tests"}
+)
+
+# TODO(SPARK-41819): Implement RDD.getNumPartitions
+del pyspark.sql.connect.dataframe.DataFrame.coalesce.__doc__
+del pyspark.sql.connect.dataframe.DataFrame.repartition.__doc__
+
+# TODO(SPARK-41820): Fix SparkConnectException: requirement failed
+del 
pyspark.sql.connect.dataframe.DataFrame.createOrReplaceGlobalTempView.__doc__
+del 
pyspark.sql.connect.dataframe.DataFrame.createOrReplaceTempView.__doc__
+
+# TODO(SPARK-41821): Fix DataFrame.describe
+del pyspark.sql.connect.dataframe.DataFrame.describe.__doc__
+
+# TODO(SPARK-41823): ambiguous column names
+del pyspark.sql.connect.dataframe.DataFrame.drop.__doc__
+del pyspark.sql.connect.dataframe.DataFrame.join.__doc__
+
+# TODO(SPARK-41824): DataFrame.explain format is different
+del pyspark.sql.connect.dataframe.DataFrame.explain.__doc__
+del pyspark.sql.connect.dataframe.DataFrame.hint.__doc__
+
+# TODO(SPARK-41825): Dataframe.show formatting int as double
+del pyspark.sql.connect.dataframe.DataFrame.fillna.__doc__
+del pyspark.sql.connect.dataframe.DataFrameNaFunctions.replace.__doc__
+del pyspark.sql.connect.dataframe.DataFrameNaFunctions.fill.__doc__
+del pyspark.sql.connect.dataframe.DataFrame.replace.__doc__
+del pyspark.sql.connect.dataframe.DataFrame.intersect.__doc__
+
+# TODO(SPARK-41826): Implement Dataframe.readStream
+del pyspark.sql.connect.dataframe.DataFrame.isStreaming.__doc__
+
+# TODO(SPARK-41827): groupBy requires all cols be Column or str
+del pyspark.sql.connect.dataframe.DataFrame.groupBy.__doc__
+
+# TODO(SPARK-41828): 

[spark] branch master updated: [SPARK-41804][SQL] Choose correct element size in `InterpretedUnsafeProjection` for array of UDTs

2023-01-02 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new cb869328ea7 [SPARK-41804][SQL] Choose correct element size in 
`InterpretedUnsafeProjection` for array of UDTs
cb869328ea7 is described below

commit cb869328ea7fcf95a4178a0db19a6fa821ce3f15
Author: Bruce Robbins 
AuthorDate: Tue Jan 3 10:22:35 2023 +0900

[SPARK-41804][SQL] Choose correct element size in 
`InterpretedUnsafeProjection` for array of UDTs

### What changes were proposed in this pull request?

Change `InterpretedUnsafeProjection#getElementSize` to choose the appropriate 
element size for the underlying SQL type of a UDT, rather than simply using 
the default size of the underlying SQL type.

### Why are the changes needed?

Consider this query:
```
// create a file of vector data
import org.apache.spark.ml.linalg.{DenseVector, Vector}

case class TestRow(varr: Array[Vector])
val values = Array(0.1d, 0.2d, 0.3d)
val dv = new DenseVector(values).asInstanceOf[Vector]

val ds = Seq(TestRow(Array(dv, dv))).toDS
ds.coalesce(1).write.mode("overwrite").format("parquet").save("vector_data")

// this works
spark.read.format("parquet").load("vector_data").collect

sql("set spark.sql.codegen.wholeStage=false")
sql("set spark.sql.codegen.factoryMode=NO_CODEGEN")

// this will get an error
spark.read.format("parquet").load("vector_data").collect
```
The failures vary, including
* `VectorUDT` attempting to deserialize to a `SparseVector` (rather than a 
`DenseVector`)
* negative array size (for one of the nested arrays)
* JVM crash (SIGBUS error).

This is because `InterpretedUnsafeProjection` initializes the outer-most 
array writer with an element size of 17 (the size of the UDT's underlying 
struct), rather than an element size of 8, which would be appropriate for an 
array of structs.

When the outer-most array is later accessed, `UnsafeArrayData` assumes an 
element size of 8, so it picks up a garbage offset/size tuple for the second 
element.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New unit test.

Closes #39349 from bersprockets/udt_issue.

Authored-by: Bruce Robbins 
Signed-off-by: Hyukjin Kwon 
---
 .../expressions/InterpretedUnsafeProjection.scala|  2 ++
 .../catalyst/expressions/UnsafeRowConverterSuite.scala   | 16 
 2 files changed, 18 insertions(+)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/InterpretedUnsafeProjection.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/InterpretedUnsafeProjection.scala
index d87c0c006cf..9108a045c09 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/InterpretedUnsafeProjection.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/InterpretedUnsafeProjection.scala
@@ -294,6 +294,8 @@ object InterpretedUnsafeProjection {
   private def getElementSize(dataType: DataType): Int = dataType match {
 case NullType | StringType | BinaryType | CalendarIntervalType |
  _: DecimalType | _: StructType | _: ArrayType | _: MapType => 8
+case udt: UserDefinedType[_] =>
+  getElementSize(udt.sqlType)
 case _ => dataType.defaultSize
   }
 
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/UnsafeRowConverterSuite.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/UnsafeRowConverterSuite.scala
index 83dc8127828..cbab8894cb5 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/UnsafeRowConverterSuite.scala
+++ 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/UnsafeRowConverterSuite.scala
@@ -687,4 +687,20 @@ class UnsafeRowConverterSuite extends SparkFunSuite with 
Matchers with PlanTestB
 val fields5 = Array[DataType](udt)
 assert(convertBackToInternalRow(udtRow, fields5) === udtRow)
   }
+
+  testBothCodegenAndInterpreted("SPARK-41804: Array of UDTs") {
+val udt = new ExampleBaseTypeUDT
+val objs = Seq(
+  udt.serialize(new ExampleSubClass(1)),
+  udt.serialize(new ExampleSubClass(2)))
+val arr = new GenericArrayData(objs)
+val row = new GenericInternalRow(Array[Any](arr))
+val unsafeProj = UnsafeProjection.create(Array[DataType](ArrayType(udt)))
+val unsafeRow = unsafeProj.apply(row)
+val unsafeOuterArray = unsafeRow.getArray(0)
+// get second element from unsafe array
+val unsafeStruct = unsafeOuterArray.getStruct(1, 1)
+val result = unsafeStruct.getInt(0)
+assert(result == 2)
+  }
 }



[spark] branch master updated: [MINOR] Fix typos

2023-01-02 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new a09e9dc1531 [MINOR] Fix typos
a09e9dc1531 is described below

commit a09e9dc1531bdef905d4609945c7747622928905
Author: smallzhongfeng 
AuthorDate: Tue Jan 3 09:57:51 2023 +0900

[MINOR] Fix typos

### What changes were proposed in this pull request?

Fix typo in ReceiverSupervisorImpl.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No need.

Closes #39340 from smallzhongfeng/fix-typos.

Authored-by: smallzhongfeng 
Signed-off-by: Hyukjin Kwon 
---
 core/src/main/scala/org/apache/spark/SparkContext.scala   | 4 ++--
 .../main/scala/org/apache/spark/ml/evaluation/ClusteringMetrics.scala | 4 ++--
 mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala| 2 +-
 .../src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala   | 2 +-
 .../spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala   | 2 +-
 .../spark/sql/catalyst/expressions/HigherOrderFunctionsSuite.scala| 2 +-
 .../org/apache/spark/streaming/receiver/ReceiverSupervisorImpl.scala  | 2 +-
 7 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/SparkContext.scala 
b/core/src/main/scala/org/apache/spark/SparkContext.scala
index 5cbf2e83371..62e652ff9bb 100644
--- a/core/src/main/scala/org/apache/spark/SparkContext.scala
+++ b/core/src/main/scala/org/apache/spark/SparkContext.scala
@@ -3173,7 +3173,7 @@ object WritableConverter {
 
   implicit val bytesWritableConverterFn: () => WritableConverter[Array[Byte]] 
= {
 () => simpleWritableConverter[Array[Byte], BytesWritable] { bw =>
-  // getBytes method returns array which is longer then data to be returned
+  // getBytes method returns array which is longer than data to be returned
   Arrays.copyOfRange(bw.getBytes, 0, bw.getLength)
 }
   }
@@ -3204,7 +3204,7 @@ object WritableConverter {
 
   implicit def bytesWritableConverter(): WritableConverter[Array[Byte]] = {
 simpleWritableConverter[Array[Byte], BytesWritable] { bw =>
-  // getBytes method returns array which is longer then data to be returned
+  // getBytes method returns array which is longer than data to be returned
   Arrays.copyOfRange(bw.getBytes, 0, bw.getLength)
 }
   }
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringMetrics.scala 
b/mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringMetrics.scala
index 0106c872297..b8563bed601 100644
--- 
a/mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringMetrics.scala
+++ 
b/mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringMetrics.scala
@@ -397,7 +397,7 @@ private[evaluation] object SquaredEuclideanSilhouette 
extends Silhouette {
 val clustersStatsMap = SquaredEuclideanSilhouette
   .computeClusterStats(dfWithSquaredNorm, predictionCol, featuresCol, 
weightCol)
 
-// Silhouette is reasonable only when the number of clusters is greater 
then 1
+// Silhouette is reasonable only when the number of clusters is greater 
than 1
 assert(clustersStatsMap.size > 1, "Number of clusters must be greater than 
one.")
 
 val bClustersStatsMap = 
dataset.sparkSession.sparkContext.broadcast(clustersStatsMap)
@@ -604,7 +604,7 @@ private[evaluation] object CosineSilhouette extends 
Silhouette {
 val clustersStatsMap = computeClusterStats(dfWithNormalizedFeatures, 
featuresCol,
   predictionCol, weightCol)
 
-// Silhouette is reasonable only when the number of clusters is greater 
then 1
+// Silhouette is reasonable only when the number of clusters is greater 
than 1
 assert(clustersStatsMap.size > 1, "Number of clusters must be greater than 
one.")
 
 val bClustersStatsMap = 
dataset.sparkSession.sparkContext.broadcast(clustersStatsMap)
diff --git a/mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala 
b/mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala
index bf9d07338db..8a124ae4f4c 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala
@@ -105,7 +105,7 @@ object Summarizer extends Logging {
* @return a builder.
* @throws IllegalArgumentException if one of the metric names is not 
understood.
*
-   * Note: Currently, the performance of this interface is about 2x~3x slower 
then using the RDD
+   * Note: Currently, the performance of this interface is about 2x~3x slower 
than using the RDD
* interface.
*/
   @Since("2.3.0")
diff --git 
a/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala
 

[spark] branch master updated: [SPARK-41803][CONNECT][PYTHON] Add missing function `log(arg1, arg2)`

2023-01-02 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new c0f07d3c193 [SPARK-41803][CONNECT][PYTHON] Add missing function 
`log(arg1, arg2)`
c0f07d3c193 is described below

commit c0f07d3c193f6df92a736634be65bf42c3c4154c
Author: Ruifeng Zheng 
AuthorDate: Tue Jan 3 09:56:49 2023 +0900

[SPARK-41803][CONNECT][PYTHON] Add missing function `log(arg1, arg2)`

### What changes were proposed in this pull request?
Add missing function `log(arg1, arg2)`
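
A minimal usage sketch of the two call forms, assuming a plain local PySpark session 
(not part of this patch):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[1]").getOrCreate()
df = spark.range(1, 4)  # id column with values 1, 2, 3

# One argument: natural logarithm of the column.
df.select(F.log(df.id)).show()

# Two arguments: logarithm of the column in the given base, here base 2.
df.select(F.log(2.0, df.id)).show()
```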

### Why are the changes needed?
for API coverage

### Does this PR introduce _any_ user-facing change?
yes

### How was this patch tested?
added ut

Closes #39339 from zhengruifeng/connect_function_log.

Authored-by: Ruifeng Zheng 
Signed-off-by: Hyukjin Kwon 
---
 python/pyspark/sql/connect/functions.py   | 10 --
 python/pyspark/sql/tests/connect/test_connect_function.py |  6 ++
 2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/python/pyspark/sql/connect/functions.py 
b/python/pyspark/sql/connect/functions.py
index 436eb9f7c73..23196e888e3 100644
--- a/python/pyspark/sql/connect/functions.py
+++ b/python/pyspark/sql/connect/functions.py
@@ -28,6 +28,7 @@ from typing import (
 Tuple,
 Callable,
 ValuesView,
+cast,
 )
 
 from pyspark.sql.connect.column import Column
@@ -548,8 +549,13 @@ def hypot(col1: Union["ColumnOrName", float], col2: 
Union["ColumnOrName", float]
 hypot.__doc__ = pysparkfuncs.hypot.__doc__
 
 
-def log(col: "ColumnOrName") -> Column:
-return _invoke_function_over_columns("ln", col)
+def log(arg1: Union["ColumnOrName", float], arg2: Optional["ColumnOrName"] = None) -> Column:
+if arg2 is None:
+# in this case, arg1 should be "ColumnOrName"
+return _invoke_function("ln", _to_col(cast("ColumnOrName", arg1)))
+else:
+# in this case, arg1 should be a float
+return _invoke_function("log", lit(cast(float, arg1)), _to_col(arg2))
 
 
 log.__doc__ = pysparkfuncs.log.__doc__
diff --git a/python/pyspark/sql/tests/connect/test_connect_function.py 
b/python/pyspark/sql/tests/connect/test_connect_function.py
index 6fea8da14c7..0dda3e99fcf 100644
--- a/python/pyspark/sql/tests/connect/test_connect_function.py
+++ b/python/pyspark/sql/tests/connect/test_connect_function.py
@@ -448,6 +448,12 @@ class SparkConnectFunctionTests(SparkConnectFuncTestCase):
 sdf.select(sfunc("b"), sfunc(sdf.c)).toPandas(),
 )
 
+# test log(arg1, arg2)
+self.assert_eq(
+cdf.select(CF.log(1.1, "b"), CF.log(1.2, cdf.c)).toPandas(),
+sdf.select(SF.log(1.1, "b"), SF.log(1.2, sdf.c)).toPandas(),
+)
+
 self.assert_eq(
 cdf.select(CF.atan2("b", cdf.c)).toPandas(),
 sdf.select(SF.atan2("b", sdf.c)).toPandas(),





[spark] branch master updated: [SPARK-41659][CONNECT] Enable doctests in pyspark.sql.connect.readwriter

2023-01-02 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new b804b53e4a5 [SPARK-41659][CONNECT] Enable doctests in 
pyspark.sql.connect.readwriter
b804b53e4a5 is described below

commit b804b53e4a5b971bf2db676f4d552fff383ab586
Author: Sandeep Singh 
AuthorDate: Tue Jan 3 09:41:57 2023 +0900

[SPARK-41659][CONNECT] Enable doctests in pyspark.sql.connect.readwriter

### What changes were proposed in this pull request?
This PR proposes to enable doctests in pyspark.sql.connect.readwriter that 
is virtually the same as pyspark.sql.readwriter.

### Why are the changes needed?
To ensure PySpark compatibility and test coverage.

### Does this PR introduce any user-facing change?
No, doctests only.

### How was this patch tested?
New Doctests Added

Closes #39331 from techaddict/SPARK-41659-pyspark.sql.connect.readwriter.

Authored-by: Sandeep Singh 
Signed-off-by: Hyukjin Kwon 
---
 dev/sparktestsupport/modules.py  |  1 +
 python/pyspark/sql/connect/plan.py   |  2 +-
 python/pyspark/sql/connect/readwriter.py | 60 
 python/pyspark/sql/readwriter.py | 18 +-
 4 files changed, 71 insertions(+), 10 deletions(-)

diff --git a/dev/sparktestsupport/modules.py b/dev/sparktestsupport/modules.py
index 99f1cc6894f..0eeb3dd9218 100644
--- a/dev/sparktestsupport/modules.py
+++ b/dev/sparktestsupport/modules.py
@@ -509,6 +509,7 @@ pyspark_connect = Module(
 "pyspark.sql.connect.session",
 "pyspark.sql.connect.window",
 "pyspark.sql.connect.column",
+"pyspark.sql.connect.readwriter",
 # unittests
 "pyspark.sql.tests.connect.test_connect_plan",
 "pyspark.sql.tests.connect.test_connect_basic",
diff --git a/python/pyspark/sql/connect/plan.py 
b/python/pyspark/sql/connect/plan.py
index f10687cc82e..48a8fa598e7 100644
--- a/python/pyspark/sql/connect/plan.py
+++ b/python/pyspark/sql/connect/plan.py
@@ -1264,7 +1264,7 @@ class WriteOperation(LogicalPlan):
 
 for k in self.options:
 if self.options[k] is None:
-del plan.write_operation.options[k]
+plan.write_operation.options.pop(k, None)
 else:
 plan.write_operation.options[k] = cast(str, self.options[k])
 
diff --git a/python/pyspark/sql/connect/readwriter.py 
b/python/pyspark/sql/connect/readwriter.py
index c3af39f864c..207ad8df74f 100644
--- a/python/pyspark/sql/connect/readwriter.py
+++ b/python/pyspark/sql/connect/readwriter.py
@@ -20,6 +20,7 @@ from typing import Dict
 from typing import Optional, Union, List, overload, Tuple, cast, Any
 from typing import TYPE_CHECKING
 
+from pyspark import SparkContext, SparkConf
 from pyspark.sql.connect.plan import Read, DataSource, LogicalPlan, 
WriteOperation
 from pyspark.sql.types import StructType
 from pyspark.sql.utils import to_str
@@ -458,3 +459,62 @@ class DataFrameWriter(OptionUtils):
 
 def jdbc(self, *args: Any, **kwargs: Any) -> None:
 raise NotImplementedError("jdbc() not supported for DataFrameWriter")
+
+
+def _test() -> None:
+import os
+import sys
+import doctest
+from pyspark.sql import SparkSession as PySparkSession
+from pyspark.testing.connectutils import should_test_connect, 
connect_requirement_message
+
+os.chdir(os.environ["SPARK_HOME"])
+
+if should_test_connect:
+import pyspark.sql.connect.readwriter
+
+globs = pyspark.sql.connect.readwriter.__dict__.copy()
+# Works around to create a regular Spark session
+sc = SparkContext("local[4]", "sql.connect.readwriter tests", 
conf=SparkConf())
+globs["_spark"] = PySparkSession(
+sc, options={"spark.app.name": "sql.connect.readwriter tests"}
+)
+
+# TODO(SPARK-41817): Support reading with schema
+del pyspark.sql.connect.readwriter.DataFrameReader.load.__doc__
+del pyspark.sql.connect.readwriter.DataFrameReader.option.__doc__
+del pyspark.sql.connect.readwriter.DataFrameWriter.csv.__doc__
+del pyspark.sql.connect.readwriter.DataFrameWriter.option.__doc__
+del pyspark.sql.connect.readwriter.DataFrameWriter.text.__doc__
+del pyspark.sql.connect.readwriter.DataFrameWriter.bucketBy.__doc__
+del pyspark.sql.connect.readwriter.DataFrameWriter.sortBy.__doc__
+
+# TODO(SPARK-41818): Support saveAsTable
+del pyspark.sql.connect.readwriter.DataFrameWriter.insertInto.__doc__
+del pyspark.sql.connect.readwriter.DataFrameWriter.saveAsTable.__doc__
+
+# Creates a remote Spark session.
+os.environ["SPARK_REMOTE"] = "sc://localhost"
+globs["spark"] = 
PySparkSession.builder.remote("sc://localhost").getOrCreate()
+
+ 

[spark] branch master updated: [SPARK-41810][CONNECT] Infer names from a list of dictionaries in SparkSession.createDataFrame

2023-01-02 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 68c3354267d [SPARK-41810][CONNECT] Infer names from a list of 
dictionaries in SparkSession.createDataFrame
68c3354267d is described below

commit 68c3354267d30a96765a6592243205957d2cddf1
Author: Hyukjin Kwon 
AuthorDate: Mon Jan 2 21:24:45 2023 +0900

[SPARK-41810][CONNECT] Infer names from a list of dictionaries in 
SparkSession.createDataFrame

### What changes were proposed in this pull request?

This PR proposes to support inferring field names when the input data is a 
list of dictionaries in `SparkSession.createDataFrame`.

For example,

```python
spark.createDataFrame([{"course": "dotNET", "earnings": 1, "year": 2012}]).show()
```

**Before**:

```
+--+-++
|_1|   _2|  _3|
+--+-++
|dotNET|1|2012|
+--+-++
```

**After**:

```
+--+++
|course|earnings|year|
+--+++
|dotNET|   1|2012|
+--+++
```
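
The inference step gets a deterministic column order by sorting dictionary keys 
alphabetically; a minimal plain-Python sketch of that normalization, mirroring the 
change in the diff below:

```python
# Keys are sorted alphabetically before schema inference, so both rows
# produce the same column order regardless of insertion order.
data = [{"b": 2, "c": 3, "a": 1}, {"c": 30, "a": 10, "b": 20}]
normalized = [dict(sorted(d.items())) for d in data]
print(normalized)  # [{'a': 1, 'b': 2, 'c': 3}, {'a': 10, 'b': 20, 'c': 30}]
```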

### Why are the changes needed?

To match the behaviour with the regular PySpark.

### Does this PR introduce _any_ user-facing change?

No to end users because Spark Connect has not been released.

### How was this patch tested?

Unittest was added.

Closes #39344 from HyukjinKwon/SPARK-41746.

Authored-by: Hyukjin Kwon 
Signed-off-by: Hyukjin Kwon 
---
 python/pyspark/sql/connect/session.py  | 16 +--
 .../sql/tests/connect/test_connect_basic.py| 24 --
 2 files changed, 23 insertions(+), 17 deletions(-)

diff --git a/python/pyspark/sql/connect/session.py 
b/python/pyspark/sql/connect/session.py
index a461372c08c..0233bde1c17 100644
--- a/python/pyspark/sql/connect/session.py
+++ b/python/pyspark/sql/connect/session.py
@@ -218,15 +218,21 @@ class SparkSession:
 
 else:
 _data = list(data)
-pdf = pd.DataFrame(_data)
 
-if _schema is None and isinstance(_data[0], Row):
+if _schema is None and (isinstance(_data[0], Row) or isinstance(_data[0], dict)):
+if isinstance(_data[0], dict):
+# Sort the data to respect inferred schema.
+# For dictionaries, we sort the schema in alphabetical order.
+_data = [dict(sorted(d.items())) for d in _data]
+
 _schema = self._inferSchemaFromList(_data, _cols)
 if _cols is not None:
 for i, name in enumerate(_cols):
 _schema.fields[i].name = name
 _schema.names[i] = name
 
+pdf = pd.DataFrame(_data)
+
 if _cols is None:
 _cols = ["_%s" % i for i in range(1, pdf.shape[1] + 1)]
 
@@ -342,11 +348,9 @@ def _test() -> None:
 # Spark Connect does not support to set master together.
 pyspark.sql.connect.session.SparkSession.__doc__ = None
 del pyspark.sql.connect.session.SparkSession.Builder.master.__doc__
-
-# TODO(SPARK-41746): SparkSession.createDataFrame does not respect the 
column names in
-#  dictionary
+# RDD API is not supported in Spark Connect.
 del pyspark.sql.connect.session.SparkSession.createDataFrame.__doc__
-del pyspark.sql.connect.session.SparkSession.read.__doc__
+
 # TODO(SPARK-41811): Implement SparkSession.sql's string formatter
 del pyspark.sql.connect.session.SparkSession.sql.__doc__
 
diff --git a/python/pyspark/sql/tests/connect/test_connect_basic.py 
b/python/pyspark/sql/tests/connect/test_connect_basic.py
index 6a65e412dfd..7c17c5f6820 100644
--- a/python/pyspark/sql/tests/connect/test_connect_basic.py
+++ b/python/pyspark/sql/tests/connect/test_connect_basic.py
@@ -389,8 +389,8 @@ class SparkConnectBasicTests(SparkConnectSQLTestCase):
 self.connect.createDataFrame(data, "col1 int, col2 int, col3 
int").show()
 
 def test_with_local_rows(self):
-# SPARK-41789: Test creating a dataframe with list of Rows
-data = [
+# SPARK-41789, SPARK-41810: Test creating a dataframe with list of 
rows and dictionaries
+rows = [
 Row(course="dotNET", year=2012, earnings=1),
 Row(course="Java", year=2012, earnings=2),
 Row(course="dotNET", year=2012, earnings=5000),
@@ -398,19 +398,21 @@ class SparkConnectBasicTests(SparkConnectSQLTestCase):
 Row(course="Java", year=2013, earnings=3),
 Row(course="Scala", year=2022, earnings=None),
 ]
+dicts = [row.asDict() for row in rows]
 
-sdf = 

[spark] branch master updated: [SPARK-41745][CONNECT][TESTS][FOLLOW-UP] Re-enable related test cases

2023-01-02 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new a8e395b79fa [SPARK-41745][CONNECT][TESTS][FOLLOW-UP] Re-enable related test cases
a8e395b79fa is described below

commit a8e395b79fa2f16654da50c31644c4487d5ee804
Author: Hyukjin Kwon 
AuthorDate: Mon Jan 2 19:59:46 2023 +0900

[SPARK-41745][CONNECT][TESTS][FOLLOW-UP] Re-enable related test cases

### What changes were proposed in this pull request?

This PR is a follow-up of https://github.com/apache/spark/pull/39313 that 
re-enables the previously skipped tests.

### Why are the changes needed?

To ensure test coverage.

### Does this PR introduce _any_ user-facing change?

No, test-only.

### How was this patch tested?

Manually checked locally, and CI in this PR should verify them.
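
Examples that still cannot run against Spark Connect are kept in the docstrings but 
annotated with `# doctest: +SKIP` instead of being deleted, as the diff below shows. 
A minimal sketch of how that directive behaves (`flaky_cluster_call` is a hypothetical 
name, not from the Spark code base):

```python
import doctest

def double(x):
    """
    >>> double(2)
    4
    >>> flaky_cluster_call(double)  # doctest: +SKIP
    """
    return x * 2

# Only the first example is attempted; the +SKIP example stays visible in the
# rendered docs but is never executed.
print(doctest.testmod())  # TestResults(failed=0, attempted=1)
```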

Closes #39342 from HyukjinKwon/SPARK-41745-followup.

Authored-by: Hyukjin Kwon 
Signed-off-by: Hyukjin Kwon 
---
 python/pyspark/sql/column.py |  8 +---
 python/pyspark/sql/connect/column.py | 16 
 2 files changed, 9 insertions(+), 15 deletions(-)

diff --git a/python/pyspark/sql/column.py b/python/pyspark/sql/column.py
index cd7b6932c2f..f2264685f48 100644
--- a/python/pyspark/sql/column.py
+++ b/python/pyspark/sql/column.py
@@ -282,6 +282,8 @@ class Column:
 __ge__ = _bin_op("geq")
 __gt__ = _bin_op("gt")
 
+# TODO(SPARK-41812): DataFrame.join: ambiguous column
+# TODO(SPARK-41814): Column.eqNullSafe fails on NaN comparison
 _eqNullSafe_doc = """
 Equality test that is safe for null values.
 
@@ -317,9 +319,9 @@ class Column:
 ... Row(value = 'bar'),
 ... Row(value = None)
 ... ])
->>> df1.join(df2, df1["value"] == df2["value"]).count()
+>>> df1.join(df2, df1["value"] == df2["value"]).count()  # doctest: +SKIP
 0
->>> df1.join(df2, df1["value"].eqNullSafe(df2["value"])).count()
+>>> df1.join(df2, df1["value"].eqNullSafe(df2["value"])).count()  # doctest: +SKIP
 1
 >>> df2 = spark.createDataFrame([
 ... Row(id=1, value=float('NaN')),
@@ -330,7 +332,7 @@ class Column:
 ... df2['value'].eqNullSafe(None),
 ... df2['value'].eqNullSafe(float('NaN')),
 ... df2['value'].eqNullSafe(42.0)
-... ).show()
+... ).show()  # doctest: +SKIP
 ++---++
 |(value <=> NULL)|(value <=> NaN)|(value <=> 42.0)|
 ++---++
diff --git a/python/pyspark/sql/connect/column.py 
b/python/pyspark/sql/connect/column.py
index d9f96325c17..6fda15e084a 100644
--- a/python/pyspark/sql/connect/column.py
+++ b/python/pyspark/sql/connect/column.py
@@ -441,25 +441,17 @@ def _test() -> None:
 # Creates a remote Spark session.
 os.environ["SPARK_REMOTE"] = "sc://localhost"
 globs["spark"] = 
PySparkSession.builder.remote("sc://localhost").getOrCreate()
+# Spark Connect has a different string representation for Column.
+del pyspark.sql.connect.column.Column.getItem.__doc__
 
 # TODO(SPARK-41746): SparkSession.createDataFrame does not support 
nested datatypes
 del pyspark.sql.connect.column.Column.dropFields.__doc__
 # TODO(SPARK-41772): Enable 
pyspark.sql.connect.column.Column.withField doctest
 del pyspark.sql.connect.column.Column.withField.__doc__
-# TODO(SPARK-41745): SparkSession.createDataFrame does not respect the 
column names in
-#  the row
-del pyspark.sql.connect.column.Column.bitwiseAND.__doc__
-del pyspark.sql.connect.column.Column.bitwiseOR.__doc__
-del pyspark.sql.connect.column.Column.bitwiseXOR.__doc__
-# TODO(SPARK-41745): SparkSession.createDataFrame does not respect the 
column names in
-#  the row
-del pyspark.sql.connect.column.Column.eqNullSafe.__doc__
-# TODO(SPARK-41745): SparkSession.createDataFrame does not respect the 
column names in
-#  the row
-del pyspark.sql.connect.column.Column.isNotNull.__doc__
+# TODO(SPARK-41815): Column.isNull returns nan instead of None
 del pyspark.sql.connect.column.Column.isNull.__doc__
+# TODO(SPARK-41746): SparkSession.createDataFrame does not support 
nested datatypes
 del pyspark.sql.connect.column.Column.getField.__doc__
-del pyspark.sql.connect.column.Column.getItem.__doc__
 
 (failure_count, test_count) = doctest.testmod(
 pyspark.sql.connect.column,





[spark] branch master updated: [SPARK-41657][CONNECT][DOCS][TESTS] Enable doctests in pyspark.sql.connect.session

2023-01-02 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 39569d5a8ac [SPARK-41657][CONNECT][DOCS][TESTS] Enable doctests in 
pyspark.sql.connect.session
39569d5a8ac is described below

commit 39569d5a8ac0bf192748220d28f76dfe3fc357d3
Author: Hyukjin Kwon 
AuthorDate: Mon Jan 2 19:58:46 2023 +0900

[SPARK-41657][CONNECT][DOCS][TESTS] Enable doctests in 
pyspark.sql.connect.session

### What changes were proposed in this pull request?

This PR proposes to enable doctests in `pyspark.sql.connect.session` that 
is virtually the same as `pyspark.sql.session`.

### Why are the changes needed?

To ensure PySpark compatibility and test coverage.

### Does this PR introduce _any_ user-facing change?

No, test-only.

### How was this patch tested?

CI in this PR should test this out.
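
The enabling pattern is the same across these connect modules: a `_test()` helper runs 
`doctest.testmod` over the module with lenient option flags, as the diff below shows. 
A condensed sketch of that pattern (`some_module` and `globs` are placeholders):

```python
import doctest
import sys

def run_module_doctests(some_module, globs):
    # ELLIPSIS / NORMALIZE_WHITESPACE / IGNORE_EXCEPTION_DETAIL make the
    # examples tolerant of small output differences between backends.
    results = doctest.testmod(
        some_module,
        globs=globs,
        optionflags=doctest.ELLIPSIS
        | doctest.NORMALIZE_WHITESPACE
        | doctest.IGNORE_EXCEPTION_DETAIL,
    )
    if results.failed:
        sys.exit(-1)
```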

Closes #39341 from HyukjinKwon/SPARK-41657.

Authored-by: Hyukjin Kwon 
Signed-off-by: Hyukjin Kwon 
---
 dev/sparktestsupport/modules.py   |  1 +
 python/pyspark/sql/connect/session.py | 60 +++
 python/pyspark/sql/session.py |  4 +--
 3 files changed, 63 insertions(+), 2 deletions(-)

diff --git a/dev/sparktestsupport/modules.py b/dev/sparktestsupport/modules.py
index dff17792148..99f1cc6894f 100644
--- a/dev/sparktestsupport/modules.py
+++ b/dev/sparktestsupport/modules.py
@@ -506,6 +506,7 @@ pyspark_connect = Module(
 # doctests
 "pyspark.sql.connect.catalog",
 "pyspark.sql.connect.group",
+"pyspark.sql.connect.session",
 "pyspark.sql.connect.window",
 "pyspark.sql.connect.column",
 # unittests
diff --git a/python/pyspark/sql/connect/session.py 
b/python/pyspark/sql/connect/session.py
index ae228317696..a461372c08c 100644
--- a/python/pyspark/sql/connect/session.py
+++ b/python/pyspark/sql/connect/session.py
@@ -22,6 +22,7 @@ import numpy as np
 import pandas as pd
 import pyarrow as pa
 
+from pyspark import SparkContext, SparkConf
 from pyspark.sql.session import classproperty, SparkSession as PySparkSession
 from pyspark.sql.types import (
 _infer_schema,
@@ -311,3 +312,62 @@ class SparkSession:
 
 
 SparkSession.__doc__ = PySparkSession.__doc__
+
+
+def _test() -> None:
+import os
+import sys
+import doctest
+from pyspark.sql import SparkSession as PySparkSession
+from pyspark.testing.connectutils import should_test_connect, 
connect_requirement_message
+
+os.chdir(os.environ["SPARK_HOME"])
+
+if should_test_connect:
+import pyspark.sql.connect.session
+
+globs = pyspark.sql.connect.session.__dict__.copy()
+# Works around to create a regular Spark session
+sc = SparkContext("local[4]", "sql.connect.session tests", 
conf=SparkConf())
+globs["_spark"] = PySparkSession(
+sc, options={"spark.app.name": "sql.connect.session tests"}
+)
+
+# Creates a remote Spark session.
+os.environ["SPARK_REMOTE"] = "sc://localhost"
+globs["spark"] = 
PySparkSession.builder.remote("sc://localhost").getOrCreate()
+
+# Uses PySpark session to test builder.
+globs["SparkSession"] = PySparkSession
+# Spark Connect does not support to set master together.
+pyspark.sql.connect.session.SparkSession.__doc__ = None
+del pyspark.sql.connect.session.SparkSession.Builder.master.__doc__
+
+# TODO(SPARK-41746): SparkSession.createDataFrame does not respect the 
column names in
+#  dictionary
+del pyspark.sql.connect.session.SparkSession.createDataFrame.__doc__
+del pyspark.sql.connect.session.SparkSession.read.__doc__
+# TODO(SPARK-41811): Implement SparkSession.sql's string formatter
+del pyspark.sql.connect.session.SparkSession.sql.__doc__
+
+(failure_count, test_count) = doctest.testmod(
+pyspark.sql.connect.session,
+globs=globs,
+optionflags=doctest.ELLIPSIS
+| doctest.NORMALIZE_WHITESPACE
+| doctest.IGNORE_EXCEPTION_DETAIL,
+)
+
+globs["spark"].stop()
+globs["_spark"].stop()
+if failure_count:
+sys.exit(-1)
+else:
+print(
+f"Skipping pyspark.sql.connect.session doctests: 
{connect_requirement_message}",
+file=sys.stderr,
+)
+
+
+if __name__ == "__main__":
+_test()
diff --git a/python/pyspark/sql/session.py b/python/pyspark/sql/session.py
index 3f99dc7ab84..1e4e6f5e3ad 100644
--- a/python/pyspark/sql/session.py
+++ b/python/pyspark/sql/session.py
@@ -676,7 +676,7 @@ class SparkSession(SparkConversionMixin):
 Examples
 
 >>> spark.catalog
-
+<...Catalog object ...>
 
  

[spark] branch master updated (9bfc8460d63 -> fde3b1a1fa8)

2023-01-02 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 9bfc8460d63 [SPARK-40993][SPARK-41705][CONNECT] Move Spark Connect 
documentation and script to dev/ and Python documentation
 add fde3b1a1fa8 [SPARK-41809][CONNECT][PYTHON] Make function `from_json` 
support DataType Schema

No new revisions were added by this update.
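
For illustration, the regular PySpark `from_json` already accepts either a DDL string or a 
`DataType` instance as its schema argument; a minimal sketch of the `DataType` form 
(assumes an active session named `spark`):

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, IntegerType

df = spark.createDataFrame([('{"a": 1}',)], ["json"])
schema = StructType([StructField("a", IntegerType())])

# Schema passed as a DataType object rather than a DDL string such as "a INT".
df.select(F.from_json("json", schema).alias("parsed")).show()
```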

Summary of changes:
 .../sql/connect/planner/SparkConnectPlanner.scala  | 37 +-
 python/pyspark/sql/connect/functions.py| 10 --
 .../sql/tests/connect/test_connect_function.py |  7 ++--
 3 files changed, 47 insertions(+), 7 deletions(-)





[spark] branch master updated: [SPARK-40993][SPARK-41705][CONNECT] Move Spark Connect documentation and script to dev/ and Python documentation

2023-01-02 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 9bfc8460d63 [SPARK-40993][SPARK-41705][CONNECT] Move Spark Connect 
documentation and script to dev/ and Python documentation
9bfc8460d63 is described below

commit 9bfc8460d6379e38a775969f463f1de81474a0ae
Author: Hyukjin Kwon 
AuthorDate: Mon Jan 2 17:36:44 2023 +0900

[SPARK-40993][SPARK-41705][CONNECT] Move Spark Connect documentation and 
script to dev/ and Python documentation

### What changes were proposed in this pull request?

This PR takes over https://github.com/apache/spark/pull/39211 and 
https://github.com/apache/spark/pull/38477, which propose:

- Move `connector/connect/dev/generate_protos.sh` → 
`dev/generate_protos.sh` to be consistent with other places
- Move Python-specific development guides into 
`python/docs/source/development/testing.rst`

### Why are the changes needed?

To keep the project structure and documentation consistent.

### Does this PR introduce _any_ user-facing change?

Python-specific development guides for Spark Connect will be added in 
https://spark.apache.org/docs/latest/api/python/development/testing.html.

### How was this patch tested?

I manually tested:

```
./dev/generate_protos.sh
./dev/check-codegen-python.py
```

I also manually verified the Python documentation.

Closes #39338 from HyukjinKwon/SPARK-41705.

Lead-authored-by: Hyukjin Kwon 
Co-authored-by: Ted Yu 
Co-authored-by: Rui Wang 
Signed-off-by: Hyukjin Kwon 
---
 .github/workflows/build_and_test.yml   |  2 +-
 connector/connect/README.md| 74 +++---
 ...k-codegen-python.py => connect-check-protos.py} |  4 +-
 .../connect-gen-protos.sh  |  4 +-
 python/docs/source/development/contributing.rst|  2 +
 python/docs/source/development/testing.rst | 52 ++-
 6 files changed, 68 insertions(+), 70 deletions(-)

diff --git a/.github/workflows/build_and_test.yml 
b/.github/workflows/build_and_test.yml
index 443fbf47942..17c4f06dc28 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -611,7 +611,7 @@ jobs:
 - name: Python linter
   run: PYTHON_EXECUTABLE=python3.9 ./dev/lint-python
 - name: Python code generation check
-  run: if test -f ./dev/check-codegen-python.py; then 
PATH=$PATH:$HOME/buf/bin PYTHON_EXECUTABLE=python3.9 
./dev/check-codegen-python.py; fi
+  run: if test -f ./dev/connect-check-protos.py; then 
PATH=$PATH:$HOME/buf/bin PYTHON_EXECUTABLE=python3.9 
./dev/connect-check-protos.py; fi
 - name: R linter
   run: ./dev/lint-r
 - name: JS linter
diff --git a/connector/connect/README.md b/connector/connect/README.md
index d5cc767c744..4f2e06678dd 100644
--- a/connector/connect/README.md
+++ b/connector/connect/README.md
@@ -1,29 +1,28 @@
-# Spark Connect - Developer Documentation
+# Spark Connect
 
 **Spark Connect is a strictly experimental feature and under heavy development.
 All APIs should be considered volatile and should not be used in production.**
 
 This module contains the implementation of Spark Connect which is a logical 
plan
 facade for the implementation in Spark. Spark Connect is directly integrated 
into the build
-of Spark. To enable it, you only need to activate the driver plugin for Spark 
Connect.
+of Spark.
 
 The documentation linked here is specifically for developers of Spark Connect 
and not
 directly intended to be end-user documentation.
 
+## Development Topics
 
-## Getting Started 
+### Guidelines for new clients
 
-### Build
+When contributing a new client please be aware that we strive to have a common
+user experience across all languages. Please follow the below guidelines:
 
-```bash
-./build/mvn -Phive clean package
-```
+* [Connection string configuration](docs/client-connection-string.md)
+* [Adding new messages](docs/adding-proto-messages.md) in the Spark Connect 
protocol.
 
-or
+### Python client developement
 
-```bash
-./build/sbt -Phive clean package
-```
+Python-specific developement guidelines are located in 
[python/docs/source/development/testing.rst](https://github.com/apache/spark/blob/master/python/docs/source/development/testing.rst)
 that is published at [Development 
tab](https://spark.apache.org/docs/latest/api/python/development/index.html) in 
PySpark documentation.
 
 ### Build with user-defined `protoc` and `protoc-gen-grpc-java`
 
@@ -48,56 +47,3 @@ export 
CONNECT_PLUGIN_EXEC_PATH=/path-to-protoc-gen-grpc-java-exe
 The user-defined `protoc` and `protoc-gen-grpc-java` binary files can be 
produced in the user's compilation environment by source code compilation, 
 for compilation 

[spark] branch master updated (470beda2231 -> f704d7e1b51)

2023-01-02 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 470beda2231 [SPARK-41571][SQL] Assign name to _LEGACY_ERROR_TEMP_2310
 add f704d7e1b51 [SPARK-41066][CONNECT][PYTHON][FOLLOWUP] Simplify the 
server code and add comments for `DataFrame.sampleBy`

No new revisions were added by this update.

Summary of changes:
 .../connect/common/src/main/protobuf/spark/connect/relations.proto   | 2 ++
 .../org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala   | 5 ++---
 python/pyspark/sql/connect/proto/relations_pb2.pyi   | 2 ++
 3 files changed, 6 insertions(+), 3 deletions(-)

