[spark] branch master updated: [SPARK-41658][SPARK-41656][DOCS][FOLLOW-UP] Update JIRAs in skipped tests' comments
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 02f12eeed0c [SPARK-41658][SPARK-41656][DOCS][FOLLOW-UP] Update JIRAs in skipped tests' comments 02f12eeed0c is described below commit 02f12eeed0ce27757edc83e99e05152113ea7f3c Author: Hyukjin Kwon AuthorDate: Tue Jan 3 14:39:45 2023 +0900 [SPARK-41658][SPARK-41656][DOCS][FOLLOW-UP] Update JIRAs in skipped tests' comments ### What changes were proposed in this pull request? This PR is a followup of https://github.com/apache/spark/pull/39347 and https://github.com/apache/spark/pull/39347, which updates the invalid JIRAs linked in the comments. ### Why are the changes needed? To track the issues properly, and reenable skipped tests in the future. ### Does this PR introduce _any_ user-facing change? No, dev-only. ### How was this patch tested? Comment-only. Linter in CI should verify them. I also manually checked it in my local. Closes #39354 from HyukjinKwon/SPARK-41658-followup. Authored-by: Hyukjin Kwon Signed-off-by: Hyukjin Kwon --- python/pyspark/sql/connect/dataframe.py | 4 ++-- python/pyspark/sql/connect/functions.py | 5 ++--- 2 files changed, 4 insertions(+), 5 deletions(-) diff --git a/python/pyspark/sql/connect/dataframe.py b/python/pyspark/sql/connect/dataframe.py index 0a69b6317f8..57c9e801c22 100644 --- a/python/pyspark/sql/connect/dataframe.py +++ b/python/pyspark/sql/connect/dataframe.py @@ -1394,7 +1394,7 @@ def _test() -> None: sc, options={"spark.app.name": "sql.connect.dataframe tests"} ) -# TODO(SPARK-41819): Implement RDD.getNumPartitions +# Spark Connect does not support RDD but the tests depend on them. del pyspark.sql.connect.dataframe.DataFrame.coalesce.__doc__ del pyspark.sql.connect.dataframe.DataFrame.repartition.__doc__ @@ -1420,7 +1420,7 @@ def _test() -> None: del pyspark.sql.connect.dataframe.DataFrame.replace.__doc__ del pyspark.sql.connect.dataframe.DataFrame.intersect.__doc__ -# TODO(SPARK-41826): Implement Dataframe.readStream +# TODO(SPARK-41625): Support Structured Streaming del pyspark.sql.connect.dataframe.DataFrame.isStreaming.__doc__ # TODO(SPARK-41827): groupBy requires all cols be Column or str diff --git a/python/pyspark/sql/connect/functions.py b/python/pyspark/sql/connect/functions.py index f2d3aa64728..6e688271a3f 100644 --- a/python/pyspark/sql/connect/functions.py +++ b/python/pyspark/sql/connect/functions.py @@ -2344,6 +2344,8 @@ def _test() -> None: globs["_spark"] = PySparkSession( sc, options={"spark.app.name": "sql.connect.functions tests"} ) +# Spark Connect does not support Spark Context but the test depends on that. +del pyspark.sql.connect.functions.monotonically_increasing_id.__doc__ # TODO(SPARK-41833): fix collect() output del pyspark.sql.connect.functions.array.__doc__ @@ -2406,9 +2408,6 @@ def _test() -> None: # TODO(SPARK-41836): Implement `transform_values` function del pyspark.sql.connect.functions.transform_values.__doc__ -# TODO(SPARK-41839): Implement SparkSession.sparkContext -del pyspark.sql.connect.functions.monotonically_increasing_id.__doc__ - # TODO(SPARK-41840): Fix 'Column' object is not callable del pyspark.sql.connect.functions.first.__doc__ del pyspark.sql.connect.functions.last.__doc__ - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (b7a0fc4b7bd -> 4d6856e913c)
This is an automated email from the ASF dual-hosted git repository. maxgekk pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from b7a0fc4b7bd [SPARK-41658][CONNECT][TESTS] Enable doctests in pyspark.sql.connect.functions add 4d6856e913c [SPARK-41311][SQL][TESTS] Rewrite test RENAME_SRC_PATH_NOT_FOUND to trigger the error from user space No new revisions were added by this update. Summary of changes: .../sql/errors/QueryExecutionErrorsSuite.scala | 54 +- 1 file changed, 31 insertions(+), 23 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (973b8ffc828 -> b7a0fc4b7bd)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 973b8ffc828 [SPARK-41807][CORE] Remove non-existent error class: UNSUPPORTED_FEATURE.DISTRIBUTE_BY add b7a0fc4b7bd [SPARK-41658][CONNECT][TESTS] Enable doctests in pyspark.sql.connect.functions No new revisions were added by this update. Summary of changes: dev/sparktestsupport/modules.py | 1 + python/pyspark/sql/connect/functions.py | 156 2 files changed, 157 insertions(+) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (2cf11cdb04f -> 973b8ffc828)
This is an automated email from the ASF dual-hosted git repository. maxgekk pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 2cf11cdb04f [SPARK-41854][PYTHON][BUILD] Automatic reformat/check python/setup.py add 973b8ffc828 [SPARK-41807][CORE] Remove non-existent error class: UNSUPPORTED_FEATURE.DISTRIBUTE_BY No new revisions were added by this update. Summary of changes: core/src/main/resources/error/error-classes.json | 5 - 1 file changed, 5 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-41854][PYTHON][BUILD] Automatic reformat/check python/setup.py
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 2cf11cdb04f [SPARK-41854][PYTHON][BUILD] Automatic reformat/check python/setup.py 2cf11cdb04f is described below commit 2cf11cdb04f4c8628a991e50470331c3a8682bcd Author: Hyukjin Kwon AuthorDate: Tue Jan 3 13:05:29 2023 +0900 [SPARK-41854][PYTHON][BUILD] Automatic reformat/check python/setup.py ### What changes were proposed in this pull request? This PR proposes to automatically reformat `python/setup.py` too. ### Why are the changes needed? To make the development cycle easier. ### Does this PR introduce _any_ user-facing change? No, dev-only. ### How was this patch tested? I manually checked via: ```bash ./dev/reformat-python ./dev/lint-python ``` Closes #39352 from HyukjinKwon/SPARK-41854. Authored-by: Hyukjin Kwon Signed-off-by: Hyukjin Kwon --- dev/lint-python | 2 +- dev/reformat-python | 2 +- python/setup.py | 240 3 files changed, 133 insertions(+), 111 deletions(-) diff --git a/dev/lint-python b/dev/lint-python index 59ce71980d9..f1f4e9f1070 100755 --- a/dev/lint-python +++ b/dev/lint-python @@ -220,7 +220,7 @@ function black_test { fi echo "starting black test..." -BLACK_REPORT=$( ($BLACK_BUILD --config dev/pyproject.toml --check python/pyspark dev) 2>&1) +BLACK_REPORT=$( ($BLACK_BUILD --config dev/pyproject.toml --check python/pyspark dev python/setup.py) 2>&1) BLACK_STATUS=$? if [ "$BLACK_STATUS" -ne 0 ]; then diff --git a/dev/reformat-python b/dev/reformat-python index ae2118ab631..9543f5713d1 100755 --- a/dev/reformat-python +++ b/dev/reformat-python @@ -29,4 +29,4 @@ if [ $? -ne 0 ]; then exit 1 fi -$BLACK_BUILD --config dev/pyproject.toml python/pyspark dev +$BLACK_BUILD --config dev/pyproject.toml python/pyspark dev python/setup.py diff --git a/python/setup.py b/python/setup.py index 54115359a60..faba203a53a 100755 --- a/python/setup.py +++ b/python/setup.py @@ -25,19 +25,23 @@ from setuptools.command.install import install from shutil import copyfile, copytree, rmtree try: -exec(open('pyspark/version.py').read()) +exec(open("pyspark/version.py").read()) except IOError: -print("Failed to load PySpark version file for packaging. You must be in Spark's python dir.", - file=sys.stderr) +print( +"Failed to load PySpark version file for packaging. You must be in Spark's python dir.", +file=sys.stderr, +) sys.exit(-1) try: spec = importlib.util.spec_from_file_location("install", "pyspark/install.py") install_module = importlib.util.module_from_spec(spec) spec.loader.exec_module(install_module) except IOError: -print("Failed to load the installing module (pyspark/install.py) which had to be " - "packaged together.", - file=sys.stderr) +print( +"Failed to load the installing module (pyspark/install.py) which had to be " +"packaged together.", +file=sys.stderr, +) sys.exit(-1) VERSION = __version__ # noqa # A temporary path so we can access above the Python project root and fetch scripts and jars we need @@ -61,12 +65,16 @@ JARS_PATH = glob.glob(os.path.join(SPARK_HOME, "assembly/target/scala-*/jars/")) if len(JARS_PATH) == 1: JARS_PATH = JARS_PATH[0] -elif (os.path.isfile("../RELEASE") and len(glob.glob("../jars/spark*core*.jar")) == 1): +elif os.path.isfile("../RELEASE") and len(glob.glob("../jars/spark*core*.jar")) == 1: # Release mode puts the jars in a jars directory JARS_PATH = os.path.join(SPARK_HOME, "jars") elif len(JARS_PATH) > 1: -print("Assembly jars exist for multiple scalas ({0}), please cleanup assembly/target".format( -JARS_PATH), file=sys.stderr) +print( +"Assembly jars exist for multiple scalas ({0}), please cleanup assembly/target".format( +JARS_PATH +), +file=sys.stderr, +) sys.exit(-1) elif len(JARS_PATH) == 0 and not os.path.exists(TEMP_PATH): print(incorrect_invocation_message, file=sys.stderr) @@ -89,8 +97,9 @@ LICENSES_TARGET = os.path.join(TEMP_PATH, "licenses") # This is important because we only want to build the symlink farm while under Spark otherwise we # want to use the symlink farm. And if the symlink farm exists under while under Spark (e.g. a # partially built sdist) we should error and have the user sort it out. -in_spark = (os.path.isfile("../core/src/main/scala/org/apache/spark/SparkContext.scala") or -(os.path.isfile("../RELEASE") and len(glob.glob("../jars/spark*core*.jar")) == 1)) +in_spark =
[spark] branch master updated: [SPARK-41656][CONNECT][TESTS] Enable doctests in pyspark.sql.connect.dataframe
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 840620fbda1 [SPARK-41656][CONNECT][TESTS] Enable doctests in pyspark.sql.connect.dataframe 840620fbda1 is described below commit 840620fbda1fa7b01c8bea8d327a8b5d96f9f9ad Author: Sandeep Singh AuthorDate: Tue Jan 3 11:10:11 2023 +0900 [SPARK-41656][CONNECT][TESTS] Enable doctests in pyspark.sql.connect.dataframe ### What changes were proposed in this pull request? This PR proposes to enable doctests in pyspark.sql.connect.dataframe that is virtually the same as pyspark.sql.dataframe. ### Why are the changes needed? To make sure on the PySpark compatibility and test coverage. ### Does this PR introduce any user-facing change? No, doctest's only. ### How was this patch tested? New Doctests Added Closes #39346 from techaddict/SPARK-41656-pyspark.sql.connect.dataframe. Authored-by: Sandeep Singh Signed-off-by: Hyukjin Kwon --- dev/sparktestsupport/modules.py | 1 + python/pyspark/sql/connect/dataframe.py | 99 - python/pyspark/sql/dataframe.py | 10 ++-- 3 files changed, 104 insertions(+), 6 deletions(-) diff --git a/dev/sparktestsupport/modules.py b/dev/sparktestsupport/modules.py index 0eeb3dd9218..2c399174c13 100644 --- a/dev/sparktestsupport/modules.py +++ b/dev/sparktestsupport/modules.py @@ -510,6 +510,7 @@ pyspark_connect = Module( "pyspark.sql.connect.window", "pyspark.sql.connect.column", "pyspark.sql.connect.readwriter", +"pyspark.sql.connect.dataframe", # unittests "pyspark.sql.tests.connect.test_connect_plan", "pyspark.sql.tests.connect.test_connect_basic", diff --git a/python/pyspark/sql/connect/dataframe.py b/python/pyspark/sql/connect/dataframe.py index 95582e86390..0a69b6317f8 100644 --- a/python/pyspark/sql/connect/dataframe.py +++ b/python/pyspark/sql/connect/dataframe.py @@ -35,7 +35,7 @@ import pandas import warnings from collections.abc import Iterable -from pyspark import _NoValue +from pyspark import _NoValue, SparkContext, SparkConf from pyspark._globals import _NoValueType from pyspark.sql.types import DataType, StructType, Row @@ -1373,3 +1373,100 @@ class DataFrameStatFunctions: DataFrameStatFunctions.__doc__ = PySparkDataFrameStatFunctions.__doc__ + + +def _test() -> None: +import os +import sys +import doctest +from pyspark.sql import SparkSession as PySparkSession +from pyspark.testing.connectutils import should_test_connect, connect_requirement_message + +os.chdir(os.environ["SPARK_HOME"]) + +if should_test_connect: +import pyspark.sql.connect.dataframe + +globs = pyspark.sql.connect.dataframe.__dict__.copy() +# Works around to create a regular Spark session +sc = SparkContext("local[4]", "sql.connect.dataframe tests", conf=SparkConf()) +globs["_spark"] = PySparkSession( +sc, options={"spark.app.name": "sql.connect.dataframe tests"} +) + +# TODO(SPARK-41819): Implement RDD.getNumPartitions +del pyspark.sql.connect.dataframe.DataFrame.coalesce.__doc__ +del pyspark.sql.connect.dataframe.DataFrame.repartition.__doc__ + +# TODO(SPARK-41820): Fix SparkConnectException: requirement failed +del pyspark.sql.connect.dataframe.DataFrame.createOrReplaceGlobalTempView.__doc__ +del pyspark.sql.connect.dataframe.DataFrame.createOrReplaceTempView.__doc__ + +# TODO(SPARK-41821): Fix DataFrame.describe +del pyspark.sql.connect.dataframe.DataFrame.describe.__doc__ + +# TODO(SPARK-41823): ambiguous column names +del pyspark.sql.connect.dataframe.DataFrame.drop.__doc__ +del pyspark.sql.connect.dataframe.DataFrame.join.__doc__ + +# TODO(SPARK-41824): DataFrame.explain format is different +del pyspark.sql.connect.dataframe.DataFrame.explain.__doc__ +del pyspark.sql.connect.dataframe.DataFrame.hint.__doc__ + +# TODO(SPARK-41825): Dataframe.show formatting int as double +del pyspark.sql.connect.dataframe.DataFrame.fillna.__doc__ +del pyspark.sql.connect.dataframe.DataFrameNaFunctions.replace.__doc__ +del pyspark.sql.connect.dataframe.DataFrameNaFunctions.fill.__doc__ +del pyspark.sql.connect.dataframe.DataFrame.replace.__doc__ +del pyspark.sql.connect.dataframe.DataFrame.intersect.__doc__ + +# TODO(SPARK-41826): Implement Dataframe.readStream +del pyspark.sql.connect.dataframe.DataFrame.isStreaming.__doc__ + +# TODO(SPARK-41827): groupBy requires all cols be Column or str +del pyspark.sql.connect.dataframe.DataFrame.groupBy.__doc__ + +# TODO(SPARK-41828):
[spark] branch master updated: [SPARK-41804][SQL] Choose correct element size in `InterpretedUnsafeProjection` for array of UDTs
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new cb869328ea7 [SPARK-41804][SQL] Choose correct element size in `InterpretedUnsafeProjection` for array of UDTs cb869328ea7 is described below commit cb869328ea7fcf95a4178a0db19a6fa821ce3f15 Author: Bruce Robbins AuthorDate: Tue Jan 3 10:22:35 2023 +0900 [SPARK-41804][SQL] Choose correct element size in `InterpretedUnsafeProjection` for array of UDTs ### What changes were proposed in this pull request? Change `InterpretedUnsafeProjection#getElementSize` to choose the appropriate element size for the underlying SQL type of a UDT, rather than simply using the the default size of the underlying SQL type. ### Why are the changes needed? Consider this query: ``` // create a file of vector data import org.apache.spark.ml.linalg.{DenseVector, Vector} case class TestRow(varr: Array[Vector]) val values = Array(0.1d, 0.2d, 0.3d) val dv = new DenseVector(values).asInstanceOf[Vector] val ds = Seq(TestRow(Array(dv, dv))).toDS ds.coalesce(1).write.mode("overwrite").format("parquet").save("vector_data") // this works spark.read.format("parquet").load("vector_data").collect sql("set spark.sql.codegen.wholeStage=false") sql("set spark.sql.codegen.factoryMode=NO_CODEGEN") // this will get an error spark.read.format("parquet").load("vector_data").collect ``` The failures vary, incuding * `VectorUDT` attempting to deserialize to a `SparseVector` (rather than a `DenseVector`) * negative array size (for one of the nested arrays) * JVM crash (SIGBUS error). This is because `InterpretedUnsafeProjection` initializes the outer-most array writer with an element size of 17 (the size of the UDT's underlying struct), rather than an element size of 8, which would be appropriate for an array of structs. When the outer-most array is later accessed, `UnsafeArrayData` assumes an element size of 8, so it picks up a garbage offset/size tuple for the second element. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New unit test. Closes #39349 from bersprockets/udt_issue. Authored-by: Bruce Robbins Signed-off-by: Hyukjin Kwon --- .../expressions/InterpretedUnsafeProjection.scala| 2 ++ .../catalyst/expressions/UnsafeRowConverterSuite.scala | 16 2 files changed, 18 insertions(+) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/InterpretedUnsafeProjection.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/InterpretedUnsafeProjection.scala index d87c0c006cf..9108a045c09 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/InterpretedUnsafeProjection.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/InterpretedUnsafeProjection.scala @@ -294,6 +294,8 @@ object InterpretedUnsafeProjection { private def getElementSize(dataType: DataType): Int = dataType match { case NullType | StringType | BinaryType | CalendarIntervalType | _: DecimalType | _: StructType | _: ArrayType | _: MapType => 8 +case udt: UserDefinedType[_] => + getElementSize(udt.sqlType) case _ => dataType.defaultSize } diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/UnsafeRowConverterSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/UnsafeRowConverterSuite.scala index 83dc8127828..cbab8894cb5 100644 --- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/UnsafeRowConverterSuite.scala +++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/UnsafeRowConverterSuite.scala @@ -687,4 +687,20 @@ class UnsafeRowConverterSuite extends SparkFunSuite with Matchers with PlanTestB val fields5 = Array[DataType](udt) assert(convertBackToInternalRow(udtRow, fields5) === udtRow) } + + testBothCodegenAndInterpreted("SPARK-41804: Array of UDTs") { +val udt = new ExampleBaseTypeUDT +val objs = Seq( + udt.serialize(new ExampleSubClass(1)), + udt.serialize(new ExampleSubClass(2))) +val arr = new GenericArrayData(objs) +val row = new GenericInternalRow(Array[Any](arr)) +val unsafeProj = UnsafeProjection.create(Array[DataType](ArrayType(udt))) +val unsafeRow = unsafeProj.apply(row) +val unsafeOuterArray = unsafeRow.getArray(0) +// get second element from unsafe array +val unsafeStruct = unsafeOuterArray.getStruct(1, 1) +val result = unsafeStruct.getInt(0) +assert(result == 2) + } }
[spark] branch master updated: [MINOR] Fix typos
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new a09e9dc1531 [MINOR] Fix typos a09e9dc1531 is described below commit a09e9dc1531bdef905d4609945c7747622928905 Author: smallzhongfeng AuthorDate: Tue Jan 3 09:57:51 2023 +0900 [MINOR] Fix typos ### What changes were proposed in this pull request? Fix typo in ReceiverSupervisorImpl. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? No need. Closes #39340 from smallzhongfeng/fix-typos. Authored-by: smallzhongfeng Signed-off-by: Hyukjin Kwon --- core/src/main/scala/org/apache/spark/SparkContext.scala | 4 ++-- .../main/scala/org/apache/spark/ml/evaluation/ClusteringMetrics.scala | 4 ++-- mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala| 2 +- .../src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala | 2 +- .../spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala | 2 +- .../spark/sql/catalyst/expressions/HigherOrderFunctionsSuite.scala| 2 +- .../org/apache/spark/streaming/receiver/ReceiverSupervisorImpl.scala | 2 +- 7 files changed, 9 insertions(+), 9 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/SparkContext.scala b/core/src/main/scala/org/apache/spark/SparkContext.scala index 5cbf2e83371..62e652ff9bb 100644 --- a/core/src/main/scala/org/apache/spark/SparkContext.scala +++ b/core/src/main/scala/org/apache/spark/SparkContext.scala @@ -3173,7 +3173,7 @@ object WritableConverter { implicit val bytesWritableConverterFn: () => WritableConverter[Array[Byte]] = { () => simpleWritableConverter[Array[Byte], BytesWritable] { bw => - // getBytes method returns array which is longer then data to be returned + // getBytes method returns array which is longer than data to be returned Arrays.copyOfRange(bw.getBytes, 0, bw.getLength) } } @@ -3204,7 +3204,7 @@ object WritableConverter { implicit def bytesWritableConverter(): WritableConverter[Array[Byte]] = { simpleWritableConverter[Array[Byte], BytesWritable] { bw => - // getBytes method returns array which is longer then data to be returned + // getBytes method returns array which is longer than data to be returned Arrays.copyOfRange(bw.getBytes, 0, bw.getLength) } } diff --git a/mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringMetrics.scala b/mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringMetrics.scala index 0106c872297..b8563bed601 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringMetrics.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringMetrics.scala @@ -397,7 +397,7 @@ private[evaluation] object SquaredEuclideanSilhouette extends Silhouette { val clustersStatsMap = SquaredEuclideanSilhouette .computeClusterStats(dfWithSquaredNorm, predictionCol, featuresCol, weightCol) -// Silhouette is reasonable only when the number of clusters is greater then 1 +// Silhouette is reasonable only when the number of clusters is greater than 1 assert(clustersStatsMap.size > 1, "Number of clusters must be greater than one.") val bClustersStatsMap = dataset.sparkSession.sparkContext.broadcast(clustersStatsMap) @@ -604,7 +604,7 @@ private[evaluation] object CosineSilhouette extends Silhouette { val clustersStatsMap = computeClusterStats(dfWithNormalizedFeatures, featuresCol, predictionCol, weightCol) -// Silhouette is reasonable only when the number of clusters is greater then 1 +// Silhouette is reasonable only when the number of clusters is greater than 1 assert(clustersStatsMap.size > 1, "Number of clusters must be greater than one.") val bClustersStatsMap = dataset.sparkSession.sparkContext.broadcast(clustersStatsMap) diff --git a/mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala b/mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala index bf9d07338db..8a124ae4f4c 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala @@ -105,7 +105,7 @@ object Summarizer extends Logging { * @return a builder. * @throws IllegalArgumentException if one of the metric names is not understood. * - * Note: Currently, the performance of this interface is about 2x~3x slower then using the RDD + * Note: Currently, the performance of this interface is about 2x~3x slower than using the RDD * interface. */ @Since("2.3.0") diff --git a/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala
[spark] branch master updated: [SPARK-41803][CONNECT][PYTHON] Add missing function `log(arg1, arg2)`
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new c0f07d3c193 [SPARK-41803][CONNECT][PYTHON] Add missing function `log(arg1, arg2)` c0f07d3c193 is described below commit c0f07d3c193f6df92a736634be65bf42c3c4154c Author: Ruifeng Zheng AuthorDate: Tue Jan 3 09:56:49 2023 +0900 [SPARK-41803][CONNECT][PYTHON] Add missing function `log(arg1, arg2)` ### What changes were proposed in this pull request? Add missing function `log(arg1, arg2)` ### Why are the changes needed? for API coverage ### Does this PR introduce _any_ user-facing change? yes ### How was this patch tested? added ut Closes #39339 from zhengruifeng/connect_function_log. Authored-by: Ruifeng Zheng Signed-off-by: Hyukjin Kwon --- python/pyspark/sql/connect/functions.py | 10 -- python/pyspark/sql/tests/connect/test_connect_function.py | 6 ++ 2 files changed, 14 insertions(+), 2 deletions(-) diff --git a/python/pyspark/sql/connect/functions.py b/python/pyspark/sql/connect/functions.py index 436eb9f7c73..23196e888e3 100644 --- a/python/pyspark/sql/connect/functions.py +++ b/python/pyspark/sql/connect/functions.py @@ -28,6 +28,7 @@ from typing import ( Tuple, Callable, ValuesView, +cast, ) from pyspark.sql.connect.column import Column @@ -548,8 +549,13 @@ def hypot(col1: Union["ColumnOrName", float], col2: Union["ColumnOrName", float] hypot.__doc__ = pysparkfuncs.hypot.__doc__ -def log(col: "ColumnOrName") -> Column: -return _invoke_function_over_columns("ln", col) +def log(arg1: Union["ColumnOrName", float], arg2: Optional["ColumnOrName"] = None) -> Column: +if arg2 is None: +# in this case, arg1 should be "ColumnOrName" +return _invoke_function("ln", _to_col(cast("ColumnOrName", arg1))) +else: +# in this case, arg1 should be a float +return _invoke_function("log", lit(cast(float, arg1)), _to_col(arg2)) log.__doc__ = pysparkfuncs.log.__doc__ diff --git a/python/pyspark/sql/tests/connect/test_connect_function.py b/python/pyspark/sql/tests/connect/test_connect_function.py index 6fea8da14c7..0dda3e99fcf 100644 --- a/python/pyspark/sql/tests/connect/test_connect_function.py +++ b/python/pyspark/sql/tests/connect/test_connect_function.py @@ -448,6 +448,12 @@ class SparkConnectFunctionTests(SparkConnectFuncTestCase): sdf.select(sfunc("b"), sfunc(sdf.c)).toPandas(), ) +# test log(arg1, arg2) +self.assert_eq( +cdf.select(CF.log(1.1, "b"), CF.log(1.2, cdf.c)).toPandas(), +sdf.select(SF.log(1.1, "b"), SF.log(1.2, sdf.c)).toPandas(), +) + self.assert_eq( cdf.select(CF.atan2("b", cdf.c)).toPandas(), sdf.select(SF.atan2("b", sdf.c)).toPandas(), - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-41659][CONNECT] Enable doctests in pyspark.sql.connect.readwriter
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new b804b53e4a5 [SPARK-41659][CONNECT] Enable doctests in pyspark.sql.connect.readwriter b804b53e4a5 is described below commit b804b53e4a5b971bf2db676f4d552fff383ab586 Author: Sandeep Singh AuthorDate: Tue Jan 3 09:41:57 2023 +0900 [SPARK-41659][CONNECT] Enable doctests in pyspark.sql.connect.readwriter ### What changes were proposed in this pull request? This PR proposes to enable doctests in pyspark.sql.connect.readwriter that is virtually the same as pyspark.sql.readwriter. ### Why are the changes needed? To make sure on the PySpark compatibility and test coverage. ### Does this PR introduce any user-facing change? No, doctest's only. ### How was this patch tested? New Doctests Added Closes #39331 from techaddict/SPARK-41659-pyspark.sql.connect.readwriter. Authored-by: Sandeep Singh Signed-off-by: Hyukjin Kwon --- dev/sparktestsupport/modules.py | 1 + python/pyspark/sql/connect/plan.py | 2 +- python/pyspark/sql/connect/readwriter.py | 60 python/pyspark/sql/readwriter.py | 18 +- 4 files changed, 71 insertions(+), 10 deletions(-) diff --git a/dev/sparktestsupport/modules.py b/dev/sparktestsupport/modules.py index 99f1cc6894f..0eeb3dd9218 100644 --- a/dev/sparktestsupport/modules.py +++ b/dev/sparktestsupport/modules.py @@ -509,6 +509,7 @@ pyspark_connect = Module( "pyspark.sql.connect.session", "pyspark.sql.connect.window", "pyspark.sql.connect.column", +"pyspark.sql.connect.readwriter", # unittests "pyspark.sql.tests.connect.test_connect_plan", "pyspark.sql.tests.connect.test_connect_basic", diff --git a/python/pyspark/sql/connect/plan.py b/python/pyspark/sql/connect/plan.py index f10687cc82e..48a8fa598e7 100644 --- a/python/pyspark/sql/connect/plan.py +++ b/python/pyspark/sql/connect/plan.py @@ -1264,7 +1264,7 @@ class WriteOperation(LogicalPlan): for k in self.options: if self.options[k] is None: -del plan.write_operation.options[k] +plan.write_operation.options.pop(k, None) else: plan.write_operation.options[k] = cast(str, self.options[k]) diff --git a/python/pyspark/sql/connect/readwriter.py b/python/pyspark/sql/connect/readwriter.py index c3af39f864c..207ad8df74f 100644 --- a/python/pyspark/sql/connect/readwriter.py +++ b/python/pyspark/sql/connect/readwriter.py @@ -20,6 +20,7 @@ from typing import Dict from typing import Optional, Union, List, overload, Tuple, cast, Any from typing import TYPE_CHECKING +from pyspark import SparkContext, SparkConf from pyspark.sql.connect.plan import Read, DataSource, LogicalPlan, WriteOperation from pyspark.sql.types import StructType from pyspark.sql.utils import to_str @@ -458,3 +459,62 @@ class DataFrameWriter(OptionUtils): def jdbc(self, *args: Any, **kwargs: Any) -> None: raise NotImplementedError("jdbc() not supported for DataFrameWriter") + + +def _test() -> None: +import os +import sys +import doctest +from pyspark.sql import SparkSession as PySparkSession +from pyspark.testing.connectutils import should_test_connect, connect_requirement_message + +os.chdir(os.environ["SPARK_HOME"]) + +if should_test_connect: +import pyspark.sql.connect.readwriter + +globs = pyspark.sql.connect.readwriter.__dict__.copy() +# Works around to create a regular Spark session +sc = SparkContext("local[4]", "sql.connect.readwriter tests", conf=SparkConf()) +globs["_spark"] = PySparkSession( +sc, options={"spark.app.name": "sql.connect.readwriter tests"} +) + +# TODO(SPARK-41817): Support reading with schema +del pyspark.sql.connect.readwriter.DataFrameReader.load.__doc__ +del pyspark.sql.connect.readwriter.DataFrameReader.option.__doc__ +del pyspark.sql.connect.readwriter.DataFrameWriter.csv.__doc__ +del pyspark.sql.connect.readwriter.DataFrameWriter.option.__doc__ +del pyspark.sql.connect.readwriter.DataFrameWriter.text.__doc__ +del pyspark.sql.connect.readwriter.DataFrameWriter.bucketBy.__doc__ +del pyspark.sql.connect.readwriter.DataFrameWriter.sortBy.__doc__ + +# TODO(SPARK-41818): Support saveAsTable +del pyspark.sql.connect.readwriter.DataFrameWriter.insertInto.__doc__ +del pyspark.sql.connect.readwriter.DataFrameWriter.saveAsTable.__doc__ + +# Creates a remote Spark session. +os.environ["SPARK_REMOTE"] = "sc://localhost" +globs["spark"] = PySparkSession.builder.remote("sc://localhost").getOrCreate() + +
[spark] branch master updated: [SPARK-41810][CONNECT] Infer names from a list of dictionaries in SparkSession.createDataFrame
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 68c3354267d [SPARK-41810][CONNECT] Infer names from a list of dictionaries in SparkSession.createDataFrame 68c3354267d is described below commit 68c3354267d30a96765a6592243205957d2cddf1 Author: Hyukjin Kwon AuthorDate: Mon Jan 2 21:24:45 2023 +0900 [SPARK-41810][CONNECT] Infer names from a list of dictionaries in SparkSession.createDataFrame ### What changes were proposed in this pull request? This PR proposes to support to infer field names when the input data is the list of dictionaries in `SparkSession.createDataFrame`. For example, ```python spark.createDataFrame([{"course": "dotNET", "earnings": 1, "year": 2012}]).show() ``` **Before**: ``` +--+-++ |_1| _2| _3| +--+-++ |dotNET|1|2012| +--+-++ ``` **After**: ``` +--+++ |course|earnings|year| +--+++ |dotNET| 1|2012| +--+++ ``` ### Why are the changes needed? To match the behaviour with the regular PySpark. ### Does this PR introduce _any_ user-facing change? No to end users because Spark Connect has not been released. ### How was this patch tested? Unittest was added. Closes #39344 from HyukjinKwon/SPARK-41746. Authored-by: Hyukjin Kwon Signed-off-by: Hyukjin Kwon --- python/pyspark/sql/connect/session.py | 16 +-- .../sql/tests/connect/test_connect_basic.py| 24 -- 2 files changed, 23 insertions(+), 17 deletions(-) diff --git a/python/pyspark/sql/connect/session.py b/python/pyspark/sql/connect/session.py index a461372c08c..0233bde1c17 100644 --- a/python/pyspark/sql/connect/session.py +++ b/python/pyspark/sql/connect/session.py @@ -218,15 +218,21 @@ class SparkSession: else: _data = list(data) -pdf = pd.DataFrame(_data) -if _schema is None and isinstance(_data[0], Row): +if _schema is None and (isinstance(_data[0], Row) or isinstance(_data[0], dict)): +if isinstance(_data[0], dict): +# Sort the data to respect inferred schema. +# For dictionaries, we sort the schema in alphabetical order. +_data = [dict(sorted(d.items())) for d in _data] + _schema = self._inferSchemaFromList(_data, _cols) if _cols is not None: for i, name in enumerate(_cols): _schema.fields[i].name = name _schema.names[i] = name +pdf = pd.DataFrame(_data) + if _cols is None: _cols = ["_%s" % i for i in range(1, pdf.shape[1] + 1)] @@ -342,11 +348,9 @@ def _test() -> None: # Spark Connect does not support to set master together. pyspark.sql.connect.session.SparkSession.__doc__ = None del pyspark.sql.connect.session.SparkSession.Builder.master.__doc__ - -# TODO(SPARK-41746): SparkSession.createDataFrame does not respect the column names in -# dictionary +# RDD API is not supported in Spark Connect. del pyspark.sql.connect.session.SparkSession.createDataFrame.__doc__ -del pyspark.sql.connect.session.SparkSession.read.__doc__ + # TODO(SPARK-41811): Implement SparkSession.sql's string formatter del pyspark.sql.connect.session.SparkSession.sql.__doc__ diff --git a/python/pyspark/sql/tests/connect/test_connect_basic.py b/python/pyspark/sql/tests/connect/test_connect_basic.py index 6a65e412dfd..7c17c5f6820 100644 --- a/python/pyspark/sql/tests/connect/test_connect_basic.py +++ b/python/pyspark/sql/tests/connect/test_connect_basic.py @@ -389,8 +389,8 @@ class SparkConnectBasicTests(SparkConnectSQLTestCase): self.connect.createDataFrame(data, "col1 int, col2 int, col3 int").show() def test_with_local_rows(self): -# SPARK-41789: Test creating a dataframe with list of Rows -data = [ +# SPARK-41789, SPARK-41810: Test creating a dataframe with list of rows and dictionaries +rows = [ Row(course="dotNET", year=2012, earnings=1), Row(course="Java", year=2012, earnings=2), Row(course="dotNET", year=2012, earnings=5000), @@ -398,19 +398,21 @@ class SparkConnectBasicTests(SparkConnectSQLTestCase): Row(course="Java", year=2013, earnings=3), Row(course="Scala", year=2022, earnings=None), ] +dicts = [row.asDict() for row in rows] -sdf =
[spark] branch master updated: [SPARK-41745][CONNECT][TESTS][FOLLOW-UP] Reeanble related test cases
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new a8e395b79fa [SPARK-41745][CONNECT][TESTS][FOLLOW-UP] Reeanble related test cases a8e395b79fa is described below commit a8e395b79fa2f16654da50c31644c4487d5ee804 Author: Hyukjin Kwon AuthorDate: Mon Jan 2 19:59:46 2023 +0900 [SPARK-41745][CONNECT][TESTS][FOLLOW-UP] Reeanble related test cases ### What changes were proposed in this pull request? This PR is a followup of https://github.com/apache/spark/pull/39313 that enables the skipped tests back. ### Why are the changes needed? In order to make sure on the test coverage. ### Does this PR introduce _any_ user-facing change? No, test-only. ### How was this patch tested? Manually checked locally, and CI in this PR should verify them. Closes #39342 from HyukjinKwon/SPARK-41745-followup. Authored-by: Hyukjin Kwon Signed-off-by: Hyukjin Kwon --- python/pyspark/sql/column.py | 8 +--- python/pyspark/sql/connect/column.py | 16 2 files changed, 9 insertions(+), 15 deletions(-) diff --git a/python/pyspark/sql/column.py b/python/pyspark/sql/column.py index cd7b6932c2f..f2264685f48 100644 --- a/python/pyspark/sql/column.py +++ b/python/pyspark/sql/column.py @@ -282,6 +282,8 @@ class Column: __ge__ = _bin_op("geq") __gt__ = _bin_op("gt") +# TODO(SPARK-41812): DataFrame.join: ambiguous column +# TODO(SPARK-41814): Column.eqNullSafe fails on NaN comparison _eqNullSafe_doc = """ Equality test that is safe for null values. @@ -317,9 +319,9 @@ class Column: ... Row(value = 'bar'), ... Row(value = None) ... ]) ->>> df1.join(df2, df1["value"] == df2["value"]).count() +>>> df1.join(df2, df1["value"] == df2["value"]).count() # doctest: +SKIP 0 ->>> df1.join(df2, df1["value"].eqNullSafe(df2["value"])).count() +>>> df1.join(df2, df1["value"].eqNullSafe(df2["value"])).count() # doctest: +SKIP 1 >>> df2 = spark.createDataFrame([ ... Row(id=1, value=float('NaN')), @@ -330,7 +332,7 @@ class Column: ... df2['value'].eqNullSafe(None), ... df2['value'].eqNullSafe(float('NaN')), ... df2['value'].eqNullSafe(42.0) -... ).show() +... ).show() # doctest: +SKIP ++---++ |(value <=> NULL)|(value <=> NaN)|(value <=> 42.0)| ++---++ diff --git a/python/pyspark/sql/connect/column.py b/python/pyspark/sql/connect/column.py index d9f96325c17..6fda15e084a 100644 --- a/python/pyspark/sql/connect/column.py +++ b/python/pyspark/sql/connect/column.py @@ -441,25 +441,17 @@ def _test() -> None: # Creates a remote Spark session. os.environ["SPARK_REMOTE"] = "sc://localhost" globs["spark"] = PySparkSession.builder.remote("sc://localhost").getOrCreate() +# Spark Connect has a different string representation for Column. +del pyspark.sql.connect.column.Column.getItem.__doc__ # TODO(SPARK-41746): SparkSession.createDataFrame does not support nested datatypes del pyspark.sql.connect.column.Column.dropFields.__doc__ # TODO(SPARK-41772): Enable pyspark.sql.connect.column.Column.withField doctest del pyspark.sql.connect.column.Column.withField.__doc__ -# TODO(SPARK-41745): SparkSession.createDataFrame does not respect the column names in -# the row -del pyspark.sql.connect.column.Column.bitwiseAND.__doc__ -del pyspark.sql.connect.column.Column.bitwiseOR.__doc__ -del pyspark.sql.connect.column.Column.bitwiseXOR.__doc__ -# TODO(SPARK-41745): SparkSession.createDataFrame does not respect the column names in -# the row -del pyspark.sql.connect.column.Column.eqNullSafe.__doc__ -# TODO(SPARK-41745): SparkSession.createDataFrame does not respect the column names in -# the row -del pyspark.sql.connect.column.Column.isNotNull.__doc__ +# TODO(SPARK-41815): Column.isNull returns nan instead of None del pyspark.sql.connect.column.Column.isNull.__doc__ +# TODO(SPARK-41746): SparkSession.createDataFrame does not support nested datatypes del pyspark.sql.connect.column.Column.getField.__doc__ -del pyspark.sql.connect.column.Column.getItem.__doc__ (failure_count, test_count) = doctest.testmod( pyspark.sql.connect.column, - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-41657][CONNECT][DOCS][TESTS] Enable doctests in pyspark.sql.connect.session
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 39569d5a8ac [SPARK-41657][CONNECT][DOCS][TESTS] Enable doctests in pyspark.sql.connect.session 39569d5a8ac is described below commit 39569d5a8ac0bf192748220d28f76dfe3fc357d3 Author: Hyukjin Kwon AuthorDate: Mon Jan 2 19:58:46 2023 +0900 [SPARK-41657][CONNECT][DOCS][TESTS] Enable doctests in pyspark.sql.connect.session ### What changes were proposed in this pull request? This PR proposes to enable doctests in `pyspark.sql.connect.session` that is virtually the same as `pyspark.sql.session`. ### Why are the changes needed? To make sure on the PySpark compatibility and test coverage. ### Does this PR introduce _any_ user-facing change? No, test-only. ### How was this patch tested? CI in this PR should test this out. Closes #39341 from HyukjinKwon/SPARK-41657. Authored-by: Hyukjin Kwon Signed-off-by: Hyukjin Kwon --- dev/sparktestsupport/modules.py | 1 + python/pyspark/sql/connect/session.py | 60 +++ python/pyspark/sql/session.py | 4 +-- 3 files changed, 63 insertions(+), 2 deletions(-) diff --git a/dev/sparktestsupport/modules.py b/dev/sparktestsupport/modules.py index dff17792148..99f1cc6894f 100644 --- a/dev/sparktestsupport/modules.py +++ b/dev/sparktestsupport/modules.py @@ -506,6 +506,7 @@ pyspark_connect = Module( # doctests "pyspark.sql.connect.catalog", "pyspark.sql.connect.group", +"pyspark.sql.connect.session", "pyspark.sql.connect.window", "pyspark.sql.connect.column", # unittests diff --git a/python/pyspark/sql/connect/session.py b/python/pyspark/sql/connect/session.py index ae228317696..a461372c08c 100644 --- a/python/pyspark/sql/connect/session.py +++ b/python/pyspark/sql/connect/session.py @@ -22,6 +22,7 @@ import numpy as np import pandas as pd import pyarrow as pa +from pyspark import SparkContext, SparkConf from pyspark.sql.session import classproperty, SparkSession as PySparkSession from pyspark.sql.types import ( _infer_schema, @@ -311,3 +312,62 @@ class SparkSession: SparkSession.__doc__ = PySparkSession.__doc__ + + +def _test() -> None: +import os +import sys +import doctest +from pyspark.sql import SparkSession as PySparkSession +from pyspark.testing.connectutils import should_test_connect, connect_requirement_message + +os.chdir(os.environ["SPARK_HOME"]) + +if should_test_connect: +import pyspark.sql.connect.session + +globs = pyspark.sql.connect.session.__dict__.copy() +# Works around to create a regular Spark session +sc = SparkContext("local[4]", "sql.connect.session tests", conf=SparkConf()) +globs["_spark"] = PySparkSession( +sc, options={"spark.app.name": "sql.connect.session tests"} +) + +# Creates a remote Spark session. +os.environ["SPARK_REMOTE"] = "sc://localhost" +globs["spark"] = PySparkSession.builder.remote("sc://localhost").getOrCreate() + +# Uses PySpark session to test builder. +globs["SparkSession"] = PySparkSession +# Spark Connect does not support to set master together. +pyspark.sql.connect.session.SparkSession.__doc__ = None +del pyspark.sql.connect.session.SparkSession.Builder.master.__doc__ + +# TODO(SPARK-41746): SparkSession.createDataFrame does not respect the column names in +# dictionary +del pyspark.sql.connect.session.SparkSession.createDataFrame.__doc__ +del pyspark.sql.connect.session.SparkSession.read.__doc__ +# TODO(SPARK-41811): Implement SparkSession.sql's string formatter +del pyspark.sql.connect.session.SparkSession.sql.__doc__ + +(failure_count, test_count) = doctest.testmod( +pyspark.sql.connect.session, +globs=globs, +optionflags=doctest.ELLIPSIS +| doctest.NORMALIZE_WHITESPACE +| doctest.IGNORE_EXCEPTION_DETAIL, +) + +globs["spark"].stop() +globs["_spark"].stop() +if failure_count: +sys.exit(-1) +else: +print( +f"Skipping pyspark.sql.connect.session doctests: {connect_requirement_message}", +file=sys.stderr, +) + + +if __name__ == "__main__": +_test() diff --git a/python/pyspark/sql/session.py b/python/pyspark/sql/session.py index 3f99dc7ab84..1e4e6f5e3ad 100644 --- a/python/pyspark/sql/session.py +++ b/python/pyspark/sql/session.py @@ -676,7 +676,7 @@ class SparkSession(SparkConversionMixin): Examples >>> spark.catalog - +<...Catalog object ...>
[spark] branch master updated (9bfc8460d63 -> fde3b1a1fa8)
This is an automated email from the ASF dual-hosted git repository. ruifengz pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 9bfc8460d63 [SPARK-40993][SPARK-41705][CONNECT] Move Spark Connect documentation and script to dev/ and Python documentation add fde3b1a1fa8 [SPARK-41809][CONNECT][PYTHON] Make function `from_json` support DataType Schema No new revisions were added by this update. Summary of changes: .../sql/connect/planner/SparkConnectPlanner.scala | 37 +- python/pyspark/sql/connect/functions.py| 10 -- .../sql/tests/connect/test_connect_function.py | 7 ++-- 3 files changed, 47 insertions(+), 7 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-40993][SPARK-41705][CONNECT] Move Spark Connect documentation and script to dev/ and Python documentation
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 9bfc8460d63 [SPARK-40993][SPARK-41705][CONNECT] Move Spark Connect documentation and script to dev/ and Python documentation 9bfc8460d63 is described below commit 9bfc8460d6379e38a775969f463f1de81474a0ae Author: Hyukjin Kwon AuthorDate: Mon Jan 2 17:36:44 2023 +0900 [SPARK-40993][SPARK-41705][CONNECT] Move Spark Connect documentation and script to dev/ and Python documentation ### What changes were proposed in this pull request? This PR takes over https://github.com/apache/spark/pull/39211 and https://github.com/apache/spark/pull/38477 that proposes: - Move `connector/connect/dev/generate_protos.sh` → `dev/generate_protos.sh` to be consistent with other places - Move Python-specific development guides into `python/docs/source/development/testing.rst` ### Why are the changes needed? To keep the project structure and documentation consistent. ### Does this PR introduce _any_ user-facing change? Python-specific development guides for Spark Connect will be added in https://spark.apache.org/docs/latest/api/python/development/testing.html. ### How was this patch tested? I manually tested: ``` ./dev/generate_protos.sh ./dev/check-codegen-python.py ``` I also manually verified the Python documentation. Closes #39338 from HyukjinKwon/SPARK-41705. Lead-authored-by: Hyukjin Kwon Co-authored-by: Ted Yu Co-authored-by: Rui Wang Signed-off-by: Hyukjin Kwon --- .github/workflows/build_and_test.yml | 2 +- connector/connect/README.md| 74 +++--- ...k-codegen-python.py => connect-check-protos.py} | 4 +- .../connect-gen-protos.sh | 4 +- python/docs/source/development/contributing.rst| 2 + python/docs/source/development/testing.rst | 52 ++- 6 files changed, 68 insertions(+), 70 deletions(-) diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml index 443fbf47942..17c4f06dc28 100644 --- a/.github/workflows/build_and_test.yml +++ b/.github/workflows/build_and_test.yml @@ -611,7 +611,7 @@ jobs: - name: Python linter run: PYTHON_EXECUTABLE=python3.9 ./dev/lint-python - name: Python code generation check - run: if test -f ./dev/check-codegen-python.py; then PATH=$PATH:$HOME/buf/bin PYTHON_EXECUTABLE=python3.9 ./dev/check-codegen-python.py; fi + run: if test -f ./dev/connect-check-protos.py; then PATH=$PATH:$HOME/buf/bin PYTHON_EXECUTABLE=python3.9 ./dev/connect-check-protos.py; fi - name: R linter run: ./dev/lint-r - name: JS linter diff --git a/connector/connect/README.md b/connector/connect/README.md index d5cc767c744..4f2e06678dd 100644 --- a/connector/connect/README.md +++ b/connector/connect/README.md @@ -1,29 +1,28 @@ -# Spark Connect - Developer Documentation +# Spark Connect **Spark Connect is a strictly experimental feature and under heavy development. All APIs should be considered volatile and should not be used in production.** This module contains the implementation of Spark Connect which is a logical plan facade for the implementation in Spark. Spark Connect is directly integrated into the build -of Spark. To enable it, you only need to activate the driver plugin for Spark Connect. +of Spark. The documentation linked here is specifically for developers of Spark Connect and not directly intended to be end-user documentation. +## Development Topics -## Getting Started +### Guidelines for new clients -### Build +When contributing a new client please be aware that we strive to have a common +user experience across all languages. Please follow the below guidelines: -```bash -./build/mvn -Phive clean package -``` +* [Connection string configuration](docs/client-connection-string.md) +* [Adding new messages](docs/adding-proto-messages.md) in the Spark Connect protocol. -or +### Python client developement -```bash -./build/sbt -Phive clean package -``` +Python-specific developement guidelines are located in [python/docs/source/development/testing.rst](https://github.com/apache/spark/blob/master/python/docs/source/development/testing.rst) that is published at [Development tab](https://spark.apache.org/docs/latest/api/python/development/index.html) in PySpark documentation. ### Build with user-defined `protoc` and `protoc-gen-grpc-java` @@ -48,56 +47,3 @@ export CONNECT_PLUGIN_EXEC_PATH=/path-to-protoc-gen-grpc-java-exe The user-defined `protoc` and `protoc-gen-grpc-java` binary files can be produced in the user's compilation environment by source code compilation, for compilation
[spark] branch master updated (470beda2231 -> f704d7e1b51)
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 470beda2231 [SPARK-41571][SQL] Assign name to _LEGACY_ERROR_TEMP_2310 add f704d7e1b51 [SPARK-41066][CONNECT][PYTHON][FOLLOWUP] Simplify the server code and add comments for `DataFrame.sampleBy` No new revisions were added by this update. Summary of changes: .../connect/common/src/main/protobuf/spark/connect/relations.proto | 2 ++ .../org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala | 5 ++--- python/pyspark/sql/connect/proto/relations_pb2.pyi | 2 ++ 3 files changed, 6 insertions(+), 3 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org