[spark] branch master updated: [SPARK-37108][R] Expose make_date expression in R
This is an automated email from the ASF dual-hosted git repository. sarutak pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 5f997c7 [SPARK-37108][R] Expose make_date expression in R 5f997c7 is described below commit 5f997c78c83551942b6c5a8ec6344547b86ae68a Author: Leona Yoda AuthorDate: Thu Nov 4 12:25:12 2021 +0900 [SPARK-37108][R] Expose make_date expression in R ### What changes were proposed in this pull request? Expose `make_date` API on SparkR ### Why are the changes needed? `make_date` APIs on Scala and PySpark were added by [SPARK-34356](https://github.com/apache/spark/pull/34356), this PR aims to cover the API on SparkR. ### Does this PR introduce _any_ user-facing change? Yes, users can call the API by SparkR ### How was this patch tested? unit tests. Closes #34480 from yoda-mon/make-date-r. Authored-by: Leona Yoda Signed-off-by: Kousuke Saruta --- R/pkg/NAMESPACE | 1 + R/pkg/R/functions.R | 26 ++ R/pkg/R/generics.R| 4 R/pkg/tests/fulltests/test_sparkSQL.R | 14 ++ 4 files changed, 45 insertions(+) diff --git a/R/pkg/NAMESPACE b/R/pkg/NAMESPACE index 10bb02a..6e0557c 100644 --- a/R/pkg/NAMESPACE +++ b/R/pkg/NAMESPACE @@ -343,6 +343,7 @@ exportMethods("%<=>%", "lower", "lpad", "ltrim", + "make_date", "map_concat", "map_entries", "map_filter", diff --git a/R/pkg/R/functions.R b/R/pkg/R/functions.R index fdbf48b..48d4fe8 100644 --- a/R/pkg/R/functions.R +++ b/R/pkg/R/functions.R @@ -41,6 +41,8 @@ NULL #' @param x Column to compute on. In \code{window}, it must be a time Column of #' \code{TimestampType}. This is not used with \code{current_date} and #' \code{current_timestamp} +#' @param y Column to compute on. +#' @param z Column to compute on. #' @param format The format for the given dates or timestamps in Column \code{x}. See the #' format used in the following methods: #' \itemize{ @@ -1467,6 +1469,30 @@ setMethod("ltrim", }) #' @details +#' \code{make_date}: Create date from year, month and day fields. +#' +#' @rdname column_datetime_functions +#' @aliases make_date make_date,Column-method +#' @note make_date since 3.3.0 +#' @examples +#' +#' \dontrun{ +#' df <- createDataFrame( +#' list(list(2021, 10, 22), list(2021, 13, 1), +#'list(2021, 2, 29), list(2020, 2, 29)), +#' list("year", "month", "day") +#' ) +#' tmp <- head(select(df, make_date(df$year, df$month, df$day))) +#' head(tmp)} +setMethod("make_date", + signature(x = "Column", y = "Column", z = "Column"), + function(x, y, z) { +jc <- callJStatic("org.apache.spark.sql.functions", "make_date", + x@jc, y@jc, z@jc) +column(jc) + }) + +#' @details #' \code{max}: Returns the maximum value of the expression in a group. #' #' @rdname column_aggregate_functions diff --git a/R/pkg/R/generics.R b/R/pkg/R/generics.R index af19e72..5fe2ec6 100644 --- a/R/pkg/R/generics.R +++ b/R/pkg/R/generics.R @@ -1158,6 +1158,10 @@ setGeneric("lpad", function(x, len, pad) { standardGeneric("lpad") }) #' @name NULL setGeneric("ltrim", function(x, trimString) { standardGeneric("ltrim") }) +#' @rdname column_datetime_functions +#' @name NULL +setGeneric("make_date", function(x, y, z) { standardGeneric("make_date") }) + #' @rdname column_collection_functions #' @name NULL setGeneric("map_concat", function(x, ...) 
{ standardGeneric("map_concat") }) diff --git a/R/pkg/tests/fulltests/test_sparkSQL.R b/R/pkg/tests/fulltests/test_sparkSQL.R index b6e02bb..0e46324e 100644 --- a/R/pkg/tests/fulltests/test_sparkSQL.R +++ b/R/pkg/tests/fulltests/test_sparkSQL.R @@ -2050,6 +2050,20 @@ test_that("date functions on a DataFrame", { Sys.setenv(TZ = .originalTimeZone) }) +test_that("SPARK-37108: expose make_date expression in R", { + df <- createDataFrame( +list(list(2021, 10, 22), list(2021, 13, 1), + list(2021, 2, 29), list(2020, 2, 29)), +list("year", "month", "day") + ) + expect <- createDataFrame( +list(list(as.Date("2021-10-22")), NA, NA, list(as.Date("2020-02-29"))), +list("make_date(year, month, day)") + ) + actual <- select(df, make_date(df$year, df$month, df$day)) + expect_equal(collect(expect), collect(actual)) +}) + test_that("greatest() and least() on a DataFrame", { l <- list(list(a = 1, b = 2), list(a = 3, b = 4)) df <- createDataFrame(l) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail:
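Note: the same expression is already exposed in Scala and PySpark (SPARK-34356), which this SparkR change mirrors. For comparison, an equivalent PySpark call looks roughly like the sketch below; this is an illustrative snippet (Spark >= 3.3.0 assumed), not part of the commit.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import make_date

spark = SparkSession.builder.getOrCreate()

# Same test data as the SparkR test above.
df = spark.createDataFrame(
    [(2021, 10, 22), (2021, 13, 1), (2021, 2, 29), (2020, 2, 29)],
    ["year", "month", "day"],
)

# Invalid combinations (month 13, 2021-02-29 in a non-leap year) come back as NULL
# under the default, non-ANSI behavior.
df.select(make_date(df.year, df.month, df.day)).show()
```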
[spark] branch master updated: [SPARK-36566][K8S] Add Spark appname as a label to pods
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new bb77e7b [SPARK-36566][K8S] Add Spark appname as a label to pods bb77e7b is described below commit bb77e7b30cd6865af45bc861fa20f5d139572fa7 Author: Yikun Jiang AuthorDate: Wed Nov 3 16:08:49 2021 -0700 [SPARK-36566][K8S] Add Spark appname as a label to pods ### What changes were proposed in this pull request? Add Spark appname as a label to pods. Note that: - there were a `SPARK_APP_ID_LABEL`: is the unique spark APP applicationId with spark prefix (like "spark-{applicationId}") - `SPARK_APP_NAME_LABEL` in this patch is the Spark APP name, it's more friendly and readable for k8s cluster maintainer to figure out the spark app name of specific pods. ### Why are the changes needed? Then we can find out all pods (driver/executor) list by using: `k get pods -l spark.app.name=xxx`, also can figure out the spark app name of specific pods. ### Does this PR introduce _any_ user-facing change? Add label to pods. ### How was this patch tested? Add UT Closes #34460 from Yikun/SPARK-36566. Authored-by: Yikun Jiang Signed-off-by: Dongjoon Hyun --- .../scala/org/apache/spark/deploy/k8s/Constants.scala | 1 + .../org/apache/spark/deploy/k8s/KubernetesConf.scala | 18 ++ .../deploy/k8s/features/BasicDriverFeatureStep.scala | 1 + .../deploy/k8s/features/BasicExecutorFeatureStep.scala | 4 .../apache/spark/deploy/k8s/KubernetesConfSuite.scala | 9 + .../k8s/features/BasicDriverFeatureStepSuite.scala | 7 +-- .../k8s/features/BasicExecutorFeatureStepSuite.scala | 10 +++--- 7 files changed, 45 insertions(+), 5 deletions(-) diff --git a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala index f4b362b..baa10a6 100644 --- a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala +++ b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala @@ -20,6 +20,7 @@ private[spark] object Constants { // Labels val SPARK_APP_ID_LABEL = "spark-app-selector" + val SPARK_APP_NAME_LABEL = "spark-app-name" val SPARK_EXECUTOR_ID_LABEL = "spark-exec-id" val SPARK_RESOURCE_PROFILE_ID_LABEL = "spark-exec-resourceprofile-id" val SPARK_ROLE_LABEL = "spark-role" diff --git a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesConf.scala b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesConf.scala index 0eef6e1..b94de88 100644 --- a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesConf.scala +++ b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesConf.scala @@ -19,6 +19,7 @@ package org.apache.spark.deploy.k8s import java.util.{Locale, UUID} import io.fabric8.kubernetes.api.model.{LocalObjectReference, LocalObjectReferenceBuilder, Pod} +import org.apache.commons.lang3.StringUtils import org.apache.spark.SparkConf import org.apache.spark.deploy.k8s.Config._ @@ -94,6 +95,7 @@ private[spark] class KubernetesDriverConf( override def labels: Map[String, String] = { val presetLabels = Map( SPARK_APP_ID_LABEL -> appId, + SPARK_APP_NAME_LABEL -> KubernetesConf.getAppNameLabel(appName), SPARK_ROLE_LABEL -> SPARK_POD_DRIVER_ROLE) val driverCustomLabels = 
KubernetesUtils.parsePrefixedKeyValuePairs( sparkConf, KUBERNETES_DRIVER_LABEL_PREFIX) @@ -155,6 +157,7 @@ private[spark] class KubernetesExecutorConf( val presetLabels = Map( SPARK_EXECUTOR_ID_LABEL -> executorId, SPARK_APP_ID_LABEL -> appId, + SPARK_APP_NAME_LABEL -> KubernetesConf.getAppNameLabel(appName), SPARK_ROLE_LABEL -> SPARK_POD_EXECUTOR_ROLE, SPARK_RESOURCE_PROFILE_ID_LABEL -> resourceProfileId.toString) @@ -248,6 +251,21 @@ private[spark] object KubernetesConf { .replaceAll("-+", "-") } + def getAppNameLabel(appName: String): String = { +// According to https://kubernetes.io/docs/concepts/overview/working-with-objects/labels, +// must be 63 characters or less to follow the DNS label standard, so take the 63 characters +// of the appName name as the label. +StringUtils.abbreviate( + s"$appName" +.trim +.toLowerCase(Locale.ROOT) +.replaceAll("[^a-z0-9\\-]", "-") +.replaceAll("-+", "-"), + "", + KUBERNETES_DNSNAME_MAX_LENGTH +) + } + /** * Build a resources name based on the vendor device plugin naming * convention of:
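Note: the label-sanitization rule added by `getAppNameLabel` can be restated compactly. The snippet below is an illustrative Python re-implementation of the same rule (lowercase, replace characters outside `[a-z0-9-]` with `-`, collapse dash runs, truncate to the 63-character DNS label limit), not the Scala code that ships in Spark.

```python
import re

KUBERNETES_DNSNAME_MAX_LENGTH = 63  # DNS label limit per the Kubernetes docs


def app_name_label(app_name: str) -> str:
    label = app_name.strip().lower()
    label = re.sub(r"[^a-z0-9\-]", "-", label)  # anything outside [a-z0-9-] becomes '-'
    label = re.sub(r"-+", "-", label)           # collapse runs of dashes
    return label[:KUBERNETES_DNSNAME_MAX_LENGTH]  # keep within the 63-char limit


# e.g. "My Spark App (v2)" -> "my-spark-app-v2-"
print(app_name_label("My Spark App (v2)"))
```

With the label key defined in Constants.scala (`spark-app-name`), an application's driver and executor pods can then be listed with `kubectl get pods -l spark-app-name=<sanitized-name>`.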
[spark] branch branch-3.2 updated: [MINOR][SQL][DOCS] Document JDBC aggregate push down is for DSV2 only
This is an automated email from the ASF dual-hosted git repository. huaxingao pushed a commit to branch branch-3.2 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.2 by this push: new 8804ed9 [MINOR][SQL][DOCS] Document JDBC aggregate push down is for DSV2 only 8804ed9 is described below commit 8804ed9536f5e1bb3a24963086e8b093a0f71d2f Author: Huaxin Gao AuthorDate: Wed Nov 3 14:34:33 2021 -0700 [MINOR][SQL][DOCS] Document JDBC aggregate push down is for DSV2 only ### What changes were proposed in this pull request? To specify JDBC aggregate push down is for DS V2 only. This change is for both 3.2 and master. ### Why are the changes needed? To make the doc clear so user won't use aggregate push down in DS v1. ### Does this PR introduce _any_ user-facing change? No. Doc change only ### How was this patch tested? Manually checked. Closes #34465 from huaxingao/doc_minor. Authored-by: Huaxin Gao Signed-off-by: Huaxin Gao (cherry picked from commit 44151be395852a5ef7e5784607f4e69da5e61569) Signed-off-by: Huaxin Gao --- docs/sql-data-sources-jdbc.md | 2 +- .../org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala | 1 + 2 files changed, 2 insertions(+), 1 deletion(-) diff --git a/docs/sql-data-sources-jdbc.md b/docs/sql-data-sources-jdbc.md index 315f476..4d38923 100644 --- a/docs/sql-data-sources-jdbc.md +++ b/docs/sql-data-sources-jdbc.md @@ -241,7 +241,7 @@ logging into the data sources. pushDownAggregate false - The option to enable or disable aggregate push-down into the JDBC data source. The default value is false, in which case Spark will not push down aggregates to the JDBC data source. Otherwise, if sets to true, aggregates will be pushed down to the JDBC data source. Aggregate push-down is usually turned off when the aggregate is performed faster by Spark than by the JDBC data source. Please note that aggregates can be pushed down if and only if all the aggregate functions and the rel [...] + The option to enable or disable aggregate push-down in V2 JDBC data source. The default value is false, in which case Spark will not push down aggregates to the JDBC data source. Otherwise, if sets to true, aggregates will be pushed down to the JDBC data source. Aggregate push-down is usually turned off when the aggregate is performed faster by Spark than by the JDBC data source. Please note that aggregates can be pushed down if and only if all the aggregate functions and the relate [...] read diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala index 8b2ae2b..43d0481 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala @@ -189,6 +189,7 @@ class JDBCOptions( val pushDownPredicate = parameters.getOrElse(JDBC_PUSHDOWN_PREDICATE, "true").toBoolean // An option to allow/disallow pushing down aggregate into JDBC data source + // This only applies to Data Source V2 JDBC val pushDownAggregate = parameters.getOrElse(JDBC_PUSHDOWN_AGGREGATE, "false").toBoolean // The local path of user's keytab file, which is assumed to be pre-uploaded to all nodes either - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
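Note: a rough sketch of how the V2 JDBC path that this option applies to is typically set up. The catalog name, URL, driver, and table below are placeholders, and the exact catalog options can vary by Spark version; the key point of the doc change is that `pushDownAggregate` only takes effect on this V2 path, not on the classic `spark.read.jdbc(...)` / V1 path.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Register a Data Source V2 JDBC catalog; aggregate push-down only applies here.
    .config("spark.sql.catalog.h2",
            "org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog")
    .config("spark.sql.catalog.h2.url", "jdbc:h2:mem:testdb")  # placeholder URL
    .config("spark.sql.catalog.h2.driver", "org.h2.Driver")    # placeholder driver
    .config("spark.sql.catalog.h2.pushDownAggregate", "true")  # option documented above
    .getOrCreate()
)

# MAX/MIN/COUNT/SUM/AVG over a V2 JDBC table may now be pushed to the database;
# "h2.test.people" is a placeholder table name.
spark.sql("SELECT dept, COUNT(*) FROM h2.test.people GROUP BY dept").show()
```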
[spark] branch master updated (b874bf5 -> 44151be)
This is an automated email from the ASF dual-hosted git repository.

huaxingao pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from b874bf5  [SPARK-36894][SPARK-37077][PYTHON] Synchronize RDD.toDF annotations with SparkSession and SQLContext .createDataFrame variants
     add 44151be  [MINOR][SQL][DOCS] Document JDBC aggregate push down is for DSV2 only

No new revisions were added by this update.

Summary of changes:
 docs/sql-data-sources-jdbc.md                                          | 2 +-
 .../org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala  | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

[spark] branch master updated: [SPARK-36894][SPARK-37077][PYTHON] Synchronize RDD.toDF annotations with SparkSession and SQLContext .createDataFrame variants
This is an automated email from the ASF dual-hosted git repository. sarutak pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new b874bf5 [SPARK-36894][SPARK-37077][PYTHON] Synchronize RDD.toDF annotations with SparkSession and SQLContext .createDataFrame variants b874bf5 is described below commit b874bf5dca4f1b7272f458350eb153e7b272f8c8 Author: zero323 AuthorDate: Thu Nov 4 02:06:48 2021 +0900 [SPARK-36894][SPARK-37077][PYTHON] Synchronize RDD.toDF annotations with SparkSession and SQLContext .createDataFrame variants ### What changes were proposed in this pull request? This pull request synchronizes `RDD.toDF` annotations with `SparkSession.createDataFrame` and `SQLContext.createDataFrame` variants. Additionally, it fixes recent regression in `SQLContext.createDataFrame` (SPARK-37077), where `RDD` is no longer consider a valid input. ### Why are the changes needed? - Adds support for providing `str` schema. - Add supports for converting `RDDs` of "atomic" values, if schema is provided. Additionally it introduces a `TypeVar` representing supported "atomic" values. This was done to avoid issue with manual data tests, where the following ```python sc.parallelize([1]).toDF(schema=IntegerType()) ``` results in ``` error: No overload variant of "toDF" of "RDD" matches argument type "IntegerType" [call-overload] note: Possible overload variants: note: def toDF(self, schema: Union[List[str], Tuple[str, ...], None] = ..., sampleRatio: Optional[float] = ...) -> DataFrame note: def toDF(self, schema: Union[StructType, str, None] = ...) -> DataFrame ``` when `Union` type is used (this problem doesn't surface when non-self bound is used). ### Does this PR introduce _any_ user-facing change? Type checker only. Please note, that these annotations serve primarily to support documentation, as checks on `self` types are still very limited. ### How was this patch tested? Existing tests and manual data tests. __Note__: Updated data tests to reflect new expected traceback, after reversal in #34477 Closes #34478 from zero323/SPARK-36894. Authored-by: zero323 Signed-off-by: Kousuke Saruta --- python/pyspark/rdd.pyi | 15 ++--- python/pyspark/sql/_typing.pyi | 11 +++ python/pyspark/sql/context.py| 38 ++ python/pyspark/sql/session.py| 40 +--- python/pyspark/sql/tests/typing/test_session.yml | 8 ++--- 5 files changed, 71 insertions(+), 41 deletions(-) diff --git a/python/pyspark/rdd.pyi b/python/pyspark/rdd.pyi index a810a2c..84481d3 100644 --- a/python/pyspark/rdd.pyi +++ b/python/pyspark/rdd.pyi @@ -55,8 +55,8 @@ from pyspark.resource.requests import ( # noqa: F401 from pyspark.resource.profile import ResourceProfile from pyspark.statcounter import StatCounter from pyspark.sql.dataframe import DataFrame -from pyspark.sql.types import StructType -from pyspark.sql._typing import RowLike +from pyspark.sql.types import AtomicType, StructType +from pyspark.sql._typing import AtomicValue, RowLike from py4j.java_gateway import JavaObject # type: ignore[import] T = TypeVar("T") @@ -445,11 +445,18 @@ class RDD(Generic[T]): @overload def toDF( self: RDD[RowLike], -schema: Optional[List[str]] = ..., +schema: Optional[Union[List[str], Tuple[str, ...]]] = ..., sampleRatio: Optional[float] = ..., ) -> DataFrame: ... @overload -def toDF(self: RDD[RowLike], schema: Optional[StructType] = ...) -> DataFrame: ... +def toDF( +self: RDD[RowLike], schema: Optional[Union[StructType, str]] = ... +) -> DataFrame: ... 
+@overload +def toDF( +self: RDD[AtomicValue], +schema: Union[AtomicType, str], +) -> DataFrame: ... class RDDBarrier(Generic[T]): rdd: RDD[T] diff --git a/python/pyspark/sql/_typing.pyi b/python/pyspark/sql/_typing.pyi index 1a3bd8f..b6b4606 100644 --- a/python/pyspark/sql/_typing.pyi +++ b/python/pyspark/sql/_typing.pyi @@ -42,6 +42,17 @@ AtomicDataTypeOrString = Union[pyspark.sql.types.AtomicType, str] DataTypeOrString = Union[pyspark.sql.types.DataType, str] OptionalPrimitiveType = Optional[PrimitiveType] +AtomicValue = TypeVar( +"AtomicValue", +datetime.datetime, +datetime.date, +decimal.Decimal, +bool, +str, +int, +float, +) + RowLike = TypeVar("RowLike", List[Any], Tuple[Any, ...], pyspark.sql.types.Row) class SupportsOpen(Protocol): diff --git a/python/pyspark/sql/context.py b/python/pyspark/sql/context.py index 7d27c55..eba2087 100644 --- a/python/pyspark/sql/context.py +++
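Note: to make the overloads concrete, the call shapes they are intended to accept look roughly like this minimal sketch (illustrative column names and data; the annotations affect type checking only, runtime behavior is unchanged).

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

rows = sc.parallelize([(1, "a"), (2, "b")])

df1 = rows.toDF(["id", "value"])            # list/tuple of column names (+ optional sampleRatio)
df2 = rows.toDF("id: int, value: string")   # DDL-formatted schema string
df3 = sc.parallelize([1, 2, 3]).toDF(IntegerType())  # RDD of "atomic" values + AtomicType schema
```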
[spark] branch master updated: Revert "[SPARK-36894][SPARK-37077][PYTHON] Synchronize RDD.toDF annotations with SparkSession and SQLContext .createDataFrame variants."
This is an automated email from the ASF dual-hosted git repository. sarutak pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 8687138 Revert "[SPARK-36894][SPARK-37077][PYTHON] Synchronize RDD.toDF annotations with SparkSession and SQLContext .createDataFrame variants." 8687138 is described below commit 86871386b063d8f7a8b5b42eb327a3900525af58 Author: Kousuke Saruta AuthorDate: Wed Nov 3 23:01:38 2021 +0900 Revert "[SPARK-36894][SPARK-37077][PYTHON] Synchronize RDD.toDF annotations with SparkSession and SQLContext .createDataFrame variants." This reverts commit 855da09f02f3007a2c36e7a738d4dc81fd95569a. See [this comment](https://github.com/apache/spark/pull/34146#issuecomment-959136935). Closes #34477 from sarutak/revert-SPARK-37077. Authored-by: Kousuke Saruta Signed-off-by: Kousuke Saruta --- python/pyspark/rdd.pyi | 15 --- python/pyspark/sql/_typing.pyi | 11 --- python/pyspark/sql/context.py | 38 ++ python/pyspark/sql/session.py | 40 +++- 4 files changed, 37 insertions(+), 67 deletions(-) diff --git a/python/pyspark/rdd.pyi b/python/pyspark/rdd.pyi index 84481d3..a810a2c 100644 --- a/python/pyspark/rdd.pyi +++ b/python/pyspark/rdd.pyi @@ -55,8 +55,8 @@ from pyspark.resource.requests import ( # noqa: F401 from pyspark.resource.profile import ResourceProfile from pyspark.statcounter import StatCounter from pyspark.sql.dataframe import DataFrame -from pyspark.sql.types import AtomicType, StructType -from pyspark.sql._typing import AtomicValue, RowLike +from pyspark.sql.types import StructType +from pyspark.sql._typing import RowLike from py4j.java_gateway import JavaObject # type: ignore[import] T = TypeVar("T") @@ -445,18 +445,11 @@ class RDD(Generic[T]): @overload def toDF( self: RDD[RowLike], -schema: Optional[Union[List[str], Tuple[str, ...]]] = ..., +schema: Optional[List[str]] = ..., sampleRatio: Optional[float] = ..., ) -> DataFrame: ... @overload -def toDF( -self: RDD[RowLike], schema: Optional[Union[StructType, str]] = ... -) -> DataFrame: ... -@overload -def toDF( -self: RDD[AtomicValue], -schema: Union[AtomicType, str], -) -> DataFrame: ... +def toDF(self: RDD[RowLike], schema: Optional[StructType] = ...) -> DataFrame: ... 
class RDDBarrier(Generic[T]): rdd: RDD[T] diff --git a/python/pyspark/sql/_typing.pyi b/python/pyspark/sql/_typing.pyi index b6b4606..1a3bd8f 100644 --- a/python/pyspark/sql/_typing.pyi +++ b/python/pyspark/sql/_typing.pyi @@ -42,17 +42,6 @@ AtomicDataTypeOrString = Union[pyspark.sql.types.AtomicType, str] DataTypeOrString = Union[pyspark.sql.types.DataType, str] OptionalPrimitiveType = Optional[PrimitiveType] -AtomicValue = TypeVar( -"AtomicValue", -datetime.datetime, -datetime.date, -decimal.Decimal, -bool, -str, -int, -float, -) - RowLike = TypeVar("RowLike", List[Any], Tuple[Any, ...], pyspark.sql.types.Row) class SupportsOpen(Protocol): diff --git a/python/pyspark/sql/context.py b/python/pyspark/sql/context.py index eba2087..7d27c55 100644 --- a/python/pyspark/sql/context.py +++ b/python/pyspark/sql/context.py @@ -48,11 +48,13 @@ from pyspark.conf import SparkConf if TYPE_CHECKING: from pyspark.sql._typing import ( -AtomicValue, -RowLike, UserDefinedFunctionLike, +RowLike, +DateTimeLiteral, +LiteralType, +DecimalLiteral ) -from pyspark.sql.pandas._typing import DataFrameLike as PandasDataFrameLike +from pyspark.sql.pandas._typing import DataFrameLike __all__ = ["SQLContext", "HiveContext"] @@ -321,8 +323,7 @@ class SQLContext(object): @overload def createDataFrame( self, -data: Union["RDD[RowLike]", Iterable["RowLike"]], -schema: Union[List[str], Tuple[str, ...]] = ..., +data: Iterable["RowLike"], samplingRatio: Optional[float] = ..., ) -> DataFrame: ... @@ -330,9 +331,8 @@ class SQLContext(object): @overload def createDataFrame( self, -data: Union["RDD[RowLike]", Iterable["RowLike"]], -schema: Union[StructType, str], -*, +data: Iterable["RowLike"], +schema: Union[List[str], Tuple[str, ...]] = ..., verifySchema: bool = ..., ) -> DataFrame: ... @@ -340,10 +340,7 @@ class SQLContext(object): @overload def createDataFrame( self, -data: Union[ -"RDD[AtomicValue]", -Iterable["AtomicValue"], -], +data: Iterable[Union["DateTimeLiteral", "LiteralType", "DecimalLiteral"]], schema: Union[AtomicType, str], verifySchema: bool = ...,
[spark] branch master updated: [SPARK-36894][SPARK-37077][PYTHON] Synchronize RDD.toDF annotations with SparkSession and SQLContext .createDataFrame variants
This is an automated email from the ASF dual-hosted git repository. zero323 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 855da09 [SPARK-36894][SPARK-37077][PYTHON] Synchronize RDD.toDF annotations with SparkSession and SQLContext .createDataFrame variants 855da09 is described below commit 855da09f02f3007a2c36e7a738d4dc81fd95569a Author: zero323 AuthorDate: Wed Nov 3 13:10:48 2021 +0100 [SPARK-36894][SPARK-37077][PYTHON] Synchronize RDD.toDF annotations with SparkSession and SQLContext .createDataFrame variants ### What changes were proposed in this pull request? This pull request synchronizes `RDD.toDF` annotations with `SparkSession.createDataFrame` and `SQLContext.createDataFrame` variants. Additionally, it fixes recent regression in `SQLContext.createDataFrame` (SPARK-37077), where `RDD` is no longer consider a valid input. ### Why are the changes needed? - Adds support for providing `str` schema. - Add supports for converting `RDDs` of "atomic" values, if schema is provided. Additionally it introduces a `TypeVar` representing supported "atomic" values. This was done to avoid issue with manual data tests, where the following ```python sc.parallelize([1]).toDF(schema=IntegerType()) ``` results in ``` error: No overload variant of "toDF" of "RDD" matches argument type "IntegerType" [call-overload] note: Possible overload variants: note: def toDF(self, schema: Union[List[str], Tuple[str, ...], None] = ..., sampleRatio: Optional[float] = ...) -> DataFrame note: def toDF(self, schema: Union[StructType, str, None] = ...) -> DataFrame ``` when `Union` type is used (this problem doesn't surface when non-self bound is used). ### Does this PR introduce _any_ user-facing change? Type checker only. Please note, that these annotations serve primarily to support documentation, as checks on `self` types are still very limited. ### How was this patch tested? Existing tests and manual data tests. Closes #34146 from zero323/SPARK-36894. Authored-by: zero323 Signed-off-by: zero323 --- python/pyspark/rdd.pyi | 15 +++ python/pyspark/sql/_typing.pyi | 11 +++ python/pyspark/sql/context.py | 38 -- python/pyspark/sql/session.py | 40 +--- 4 files changed, 67 insertions(+), 37 deletions(-) diff --git a/python/pyspark/rdd.pyi b/python/pyspark/rdd.pyi index a810a2c..84481d3 100644 --- a/python/pyspark/rdd.pyi +++ b/python/pyspark/rdd.pyi @@ -55,8 +55,8 @@ from pyspark.resource.requests import ( # noqa: F401 from pyspark.resource.profile import ResourceProfile from pyspark.statcounter import StatCounter from pyspark.sql.dataframe import DataFrame -from pyspark.sql.types import StructType -from pyspark.sql._typing import RowLike +from pyspark.sql.types import AtomicType, StructType +from pyspark.sql._typing import AtomicValue, RowLike from py4j.java_gateway import JavaObject # type: ignore[import] T = TypeVar("T") @@ -445,11 +445,18 @@ class RDD(Generic[T]): @overload def toDF( self: RDD[RowLike], -schema: Optional[List[str]] = ..., +schema: Optional[Union[List[str], Tuple[str, ...]]] = ..., sampleRatio: Optional[float] = ..., ) -> DataFrame: ... @overload -def toDF(self: RDD[RowLike], schema: Optional[StructType] = ...) -> DataFrame: ... +def toDF( +self: RDD[RowLike], schema: Optional[Union[StructType, str]] = ... +) -> DataFrame: ... +@overload +def toDF( +self: RDD[AtomicValue], +schema: Union[AtomicType, str], +) -> DataFrame: ... 
class RDDBarrier(Generic[T]): rdd: RDD[T] diff --git a/python/pyspark/sql/_typing.pyi b/python/pyspark/sql/_typing.pyi index 1a3bd8f..b6b4606 100644 --- a/python/pyspark/sql/_typing.pyi +++ b/python/pyspark/sql/_typing.pyi @@ -42,6 +42,17 @@ AtomicDataTypeOrString = Union[pyspark.sql.types.AtomicType, str] DataTypeOrString = Union[pyspark.sql.types.DataType, str] OptionalPrimitiveType = Optional[PrimitiveType] +AtomicValue = TypeVar( +"AtomicValue", +datetime.datetime, +datetime.date, +decimal.Decimal, +bool, +str, +int, +float, +) + RowLike = TypeVar("RowLike", List[Any], Tuple[Any, ...], pyspark.sql.types.Row) class SupportsOpen(Protocol): diff --git a/python/pyspark/sql/context.py b/python/pyspark/sql/context.py index 7d27c55..eba2087 100644 --- a/python/pyspark/sql/context.py +++ b/python/pyspark/sql/context.py @@ -48,13 +48,11 @@ from pyspark.conf import SparkConf if TYPE_CHECKING: from pyspark.sql._typing import ( -UserDefinedFunctionLike, +AtomicValue,
[spark] branch branch-3.2 updated: [MINOR][PYTHON][DOCS] Fix broken link in legacy Apache Arrow in PySpark page
This is an automated email from the ASF dual-hosted git repository. sarutak pushed a commit to branch branch-3.2 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.2 by this push: new c2f147e [MINOR][PYTHON][DOCS] Fix broken link in legacy Apache Arrow in PySpark page c2f147e is described below commit c2f147eff8e3e353cfb43f5d45f19f174fb26773 Author: Hyukjin Kwon AuthorDate: Wed Nov 3 20:54:50 2021 +0900 [MINOR][PYTHON][DOCS] Fix broken link in legacy Apache Arrow in PySpark page ### What changes were proposed in this pull request? This PR proposes to fix the broken link in the legacy page. Currently it links to: ![Screen Shot 2021-11-03 at 6 34 32 PM](https://user-images.githubusercontent.com/6477701/140037221-b7963e47-12f5-49f3-8290-8560c99c62c2.png) ![Screen Shot 2021-11-03 at 6 34 30 PM](https://user-images.githubusercontent.com/6477701/140037225-c21070fc-907f-41bb-a421-747810ae5b0d.png) It should link to: ![Screen Shot 2021-11-03 at 6 34 35 PM](https://user-images.githubusercontent.com/6477701/140037246-dd14760f-5487-4b8b-b3f6-e9495f1d4ec9.png) ![Screen Shot 2021-11-03 at 6 34 38 PM](https://user-images.githubusercontent.com/6477701/140037251-3f97e992-6660-4ce9-9c66-77855d3c0a64.png) ### Why are the changes needed? For users to easily navigate from legacy page to newer page. ### Does this PR introduce _any_ user-facing change? Yes, it fixes a bug in documentation. ### How was this patch tested? Manually built the side and checked the link Closes #34475 from HyukjinKwon/minor-doc-fix-py. Authored-by: Hyukjin Kwon Signed-off-by: Kousuke Saruta (cherry picked from commit ab7e5030b23ccb8ef6aa43645e909457f9d68ffa) Signed-off-by: Kousuke Saruta --- python/docs/source/user_guide/arrow_pandas.rst | 2 +- python/docs/source/user_guide/sql/arrow_pandas.rst | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/python/docs/source/user_guide/arrow_pandas.rst b/python/docs/source/user_guide/arrow_pandas.rst index c1b68a6..60c11b7 100644 --- a/python/docs/source/user_guide/arrow_pandas.rst +++ b/python/docs/source/user_guide/arrow_pandas.rst @@ -21,4 +21,4 @@ Apache Arrow in PySpark === -This page has been moved to `Apache Arrow in PySpark <../sql/arrow_pandas.rst>`_. +This page has been moved to `Apache Arrow in PySpark `_. diff --git a/python/docs/source/user_guide/sql/arrow_pandas.rst b/python/docs/source/user_guide/sql/arrow_pandas.rst index 1767624..78d3e7a 100644 --- a/python/docs/source/user_guide/sql/arrow_pandas.rst +++ b/python/docs/source/user_guide/sql/arrow_pandas.rst @@ -343,7 +343,7 @@ Supported SQL Types Currently, all Spark SQL data types are supported by Arrow-based conversion except :class:`ArrayType` of :class:`TimestampType`, and nested :class:`StructType`. -:class: `MapType` is only supported when using PyArrow 2.0.0 and above. +:class:`MapType` is only supported when using PyArrow 2.0.0 and above. Setting Arrow Batch Size - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (9babf9a -> ab7e503)
This is an automated email from the ASF dual-hosted git repository.

sarutak pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from 9babf9a  [SPARK-37200][SQL] Support drop index for Data Source V2
     add ab7e503  [MINOR][PYTHON][DOCS] Fix broken link in legacy Apache Arrow in PySpark page

No new revisions were added by this update.

Summary of changes:
 python/docs/source/user_guide/arrow_pandas.rst     | 2 +-
 python/docs/source/user_guide/sql/arrow_pandas.rst | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)
[spark] branch master updated (2afa52d -> 9babf9a)
This is an automated email from the ASF dual-hosted git repository.

huaxingao pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from 2afa52d  [SPARK-37179][SQL] ANSI mode: Add a config to allow casting between Datetime and Numeric
     add 9babf9a  [SPARK-37200][SQL] Support drop index for Data Source V2

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/jdbc/v2/MySQLIntegrationSuite.scala  | 21 +++--
 .../org/apache/spark/sql/jdbc/v2/V2JDBCTest.scala  |  4 ++--
 .../apache/spark/sql/catalyst/parser/SqlBase.g4    |  1 +
 .../spark/sql/catalyst/parser/AstBuilder.scala     | 15 +
 .../sql/catalyst/plans/logical/v2Commands.scala    | 12 ++
 .../spark/sql/catalyst/parser/DDLParserSuite.scala |  8 +++
 .../datasources/v2/DataSourceV2Strategy.scala      |  8 +++
 .../{AlterTableExec.scala => DropIndexExec.scala}  | 26 ++
 .../execution/datasources/v2/jdbc/JDBCTable.scala  |  4 ++--
 .../spark/sql/hive/execution/HiveQuerySuite.scala  |  5 +
 10 files changed, 80 insertions(+), 24 deletions(-)
 copy sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/{AlterTableExec.scala => DropIndexExec.scala} (70%)
[spark] branch master updated (59c55dd -> 2afa52d)
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from 59c55dd  [SPARK-24774][SQL][FOLLOWUP] Remove unused code in SchemaConverters.scala
     add 2afa52d  [SPARK-37179][SQL] ANSI mode: Add a config to allow casting between Datetime and Numeric

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/catalyst/expressions/Cast.scala      |  51 ++
 .../spark/sql/catalyst/util/DateTimeUtils.scala    |  16 +++-
 .../spark/sql/errors/QueryExecutionErrors.scala    |   4 +-
 .../org/apache/spark/sql/internal/SQLConf.scala    |  12 +++
 .../catalyst/expressions/AnsiCastSuiteBase.scala   | 103 +
 5 files changed, 164 insertions(+), 22 deletions(-)