[spark] branch master updated: [SPARK-37108][R] Expose make_date expression in R
This is an automated email from the ASF dual-hosted git repository. sarutak pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 5f997c7 [SPARK-37108][R] Expose make_date expression in R 5f997c7 is described below commit 5f997c78c83551942b6c5a8ec6344547b86ae68a Author: Leona Yoda AuthorDate: Thu Nov 4 12:25:12 2021 +0900 [SPARK-37108][R] Expose make_date expression in R ### What changes were proposed in this pull request? Expose `make_date` API on SparkR ### Why are the changes needed? `make_date` APIs on Scala and PySpark were added by [SPARK-34356](https://github.com/apache/spark/pull/34356), this PR aims to cover the API on SparkR. ### Does this PR introduce _any_ user-facing change? Yes, users can call the API by SparkR ### How was this patch tested? unit tests. Closes #34480 from yoda-mon/make-date-r. Authored-by: Leona Yoda Signed-off-by: Kousuke Saruta --- R/pkg/NAMESPACE | 1 + R/pkg/R/functions.R | 26 ++ R/pkg/R/generics.R| 4 R/pkg/tests/fulltests/test_sparkSQL.R | 14 ++ 4 files changed, 45 insertions(+) diff --git a/R/pkg/NAMESPACE b/R/pkg/NAMESPACE index 10bb02a..6e0557c 100644 --- a/R/pkg/NAMESPACE +++ b/R/pkg/NAMESPACE @@ -343,6 +343,7 @@ exportMethods("%<=>%", "lower", "lpad", "ltrim", + "make_date", "map_concat", "map_entries", "map_filter", diff --git a/R/pkg/R/functions.R b/R/pkg/R/functions.R index fdbf48b..48d4fe8 100644 --- a/R/pkg/R/functions.R +++ b/R/pkg/R/functions.R @@ -41,6 +41,8 @@ NULL #' @param x Column to compute on. In \code{window}, it must be a time Column of #' \code{TimestampType}. This is not used with \code{current_date} and #' \code{current_timestamp} +#' @param y Column to compute on. +#' @param z Column to compute on. #' @param format The format for the given dates or timestamps in Column \code{x}. See the #' format used in the following methods: #' \itemize{ @@ -1467,6 +1469,30 @@ setMethod("ltrim", }) #' @details +#' \code{make_date}: Create date from year, month and day fields. +#' +#' @rdname column_datetime_functions +#' @aliases make_date make_date,Column-method +#' @note make_date since 3.3.0 +#' @examples +#' +#' \dontrun{ +#' df <- createDataFrame( +#' list(list(2021, 10, 22), list(2021, 13, 1), +#'list(2021, 2, 29), list(2020, 2, 29)), +#' list("year", "month", "day") +#' ) +#' tmp <- head(select(df, make_date(df$year, df$month, df$day))) +#' head(tmp)} +setMethod("make_date", + signature(x = "Column", y = "Column", z = "Column"), + function(x, y, z) { +jc <- callJStatic("org.apache.spark.sql.functions", "make_date", + x@jc, y@jc, z@jc) +column(jc) + }) + +#' @details #' \code{max}: Returns the maximum value of the expression in a group. #' #' @rdname column_aggregate_functions diff --git a/R/pkg/R/generics.R b/R/pkg/R/generics.R index af19e72..5fe2ec6 100644 --- a/R/pkg/R/generics.R +++ b/R/pkg/R/generics.R @@ -1158,6 +1158,10 @@ setGeneric("lpad", function(x, len, pad) { standardGeneric("lpad") }) #' @name NULL setGeneric("ltrim", function(x, trimString) { standardGeneric("ltrim") }) +#' @rdname column_datetime_functions +#' @name NULL +setGeneric("make_date", function(x, y, z) { standardGeneric("make_date") }) + #' @rdname column_collection_functions #' @name NULL setGeneric("map_concat", function(x, ...) 
{ standardGeneric("map_concat") }) diff --git a/R/pkg/tests/fulltests/test_sparkSQL.R b/R/pkg/tests/fulltests/test_sparkSQL.R index b6e02bb..0e46324e 100644 --- a/R/pkg/tests/fulltests/test_sparkSQL.R +++ b/R/pkg/tests/fulltests/test_sparkSQL.R @@ -2050,6 +2050,20 @@ test_that("date functions on a DataFrame", { Sys.setenv(TZ = .originalTimeZone) }) +test_that("SPARK-37108: expose make_date expression in R", { + df <- createDataFrame( +list(list(2021, 10, 22), list(2021, 13, 1), + list(2021, 2, 29), list(2020, 2, 29)), +list("year", "month", "day") + ) + expect <- createDataFrame( +list(list(as.Date("2021-10-22")), NA, NA, list(as.Date("2020-02-29"))), +list("make_date(year, month, day)") + ) + actual <- select(df, make_date(df$year, df$month, df$day)) + expect_equal(collect(expect), collect(actual)) +}) + test_that("greatest() and least() on a DataFrame", { l <- list(list(a = 1, b = 2), list(a = 3, b = 4)) df <- createDataFrame(l) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail:
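Note: the same expression is already exposed in Scala and PySpark (SPARK-34356), which this SparkR change mirrors. For comparison, an equivalent PySpark call looks roughly like the sketch below; this is an illustrative snippet (Spark >= 3.3.0 assumed), not part of the commit.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import make_date

spark = SparkSession.builder.getOrCreate()

# Same test data as the SparkR test above.
df = spark.createDataFrame(
    [(2021, 10, 22), (2021, 13, 1), (2021, 2, 29), (2020, 2, 29)],
    ["year", "month", "day"],
)

# Invalid combinations (month 13, 2021-02-29 in a non-leap year) come back as NULL
# under the default, non-ANSI behavior.
df.select(make_date(df.year, df.month, df.day)).show()
```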
[spark] branch master updated: [SPARK-36566][K8S] Add Spark appname as a label to pods
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new bb77e7b [SPARK-36566][K8S] Add Spark appname as a label to pods bb77e7b is described below commit bb77e7b30cd6865af45bc861fa20f5d139572fa7 Author: Yikun Jiang AuthorDate: Wed Nov 3 16:08:49 2021 -0700 [SPARK-36566][K8S] Add Spark appname as a label to pods ### What changes were proposed in this pull request? Add Spark appname as a label to pods. Note that: - there were a `SPARK_APP_ID_LABEL`: is the unique spark APP applicationId with spark prefix (like "spark-{applicationId}") - `SPARK_APP_NAME_LABEL` in this patch is the Spark APP name, it's more friendly and readable for k8s cluster maintainer to figure out the spark app name of specific pods. ### Why are the changes needed? Then we can find out all pods (driver/executor) list by using: `k get pods -l spark.app.name=xxx`, also can figure out the spark app name of specific pods. ### Does this PR introduce _any_ user-facing change? Add label to pods. ### How was this patch tested? Add UT Closes #34460 from Yikun/SPARK-36566. Authored-by: Yikun Jiang Signed-off-by: Dongjoon Hyun --- .../scala/org/apache/spark/deploy/k8s/Constants.scala | 1 + .../org/apache/spark/deploy/k8s/KubernetesConf.scala | 18 ++ .../deploy/k8s/features/BasicDriverFeatureStep.scala | 1 + .../deploy/k8s/features/BasicExecutorFeatureStep.scala | 4 .../apache/spark/deploy/k8s/KubernetesConfSuite.scala | 9 + .../k8s/features/BasicDriverFeatureStepSuite.scala | 7 +-- .../k8s/features/BasicExecutorFeatureStepSuite.scala | 10 +++--- 7 files changed, 45 insertions(+), 5 deletions(-) diff --git a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala index f4b362b..baa10a6 100644 --- a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala +++ b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Constants.scala @@ -20,6 +20,7 @@ private[spark] object Constants { // Labels val SPARK_APP_ID_LABEL = "spark-app-selector" + val SPARK_APP_NAME_LABEL = "spark-app-name" val SPARK_EXECUTOR_ID_LABEL = "spark-exec-id" val SPARK_RESOURCE_PROFILE_ID_LABEL = "spark-exec-resourceprofile-id" val SPARK_ROLE_LABEL = "spark-role" diff --git a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesConf.scala b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesConf.scala index 0eef6e1..b94de88 100644 --- a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesConf.scala +++ b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesConf.scala @@ -19,6 +19,7 @@ package org.apache.spark.deploy.k8s import java.util.{Locale, UUID} import io.fabric8.kubernetes.api.model.{LocalObjectReference, LocalObjectReferenceBuilder, Pod} +import org.apache.commons.lang3.StringUtils import org.apache.spark.SparkConf import org.apache.spark.deploy.k8s.Config._ @@ -94,6 +95,7 @@ private[spark] class KubernetesDriverConf( override def labels: Map[String, String] = { val presetLabels = Map( SPARK_APP_ID_LABEL -> appId, + SPARK_APP_NAME_LABEL -> KubernetesConf.getAppNameLabel(appName), SPARK_ROLE_LABEL -> SPARK_POD_DRIVER_ROLE) val driverCustomLabels = 
KubernetesUtils.parsePrefixedKeyValuePairs( sparkConf, KUBERNETES_DRIVER_LABEL_PREFIX) @@ -155,6 +157,7 @@ private[spark] class KubernetesExecutorConf( val presetLabels = Map( SPARK_EXECUTOR_ID_LABEL -> executorId, SPARK_APP_ID_LABEL -> appId, + SPARK_APP_NAME_LABEL -> KubernetesConf.getAppNameLabel(appName), SPARK_ROLE_LABEL -> SPARK_POD_EXECUTOR_ROLE, SPARK_RESOURCE_PROFILE_ID_LABEL -> resourceProfileId.toString) @@ -248,6 +251,21 @@ private[spark] object KubernetesConf { .replaceAll("-+", "-") } + def getAppNameLabel(appName: String): String = { +// According to https://kubernetes.io/docs/concepts/overview/working-with-objects/labels, +// must be 63 characters or less to follow the DNS label standard, so take the 63 characters +// of the appName name as the label. +StringUtils.abbreviate( + s"$appName" +.trim +.toLowerCase(Locale.ROOT) +.replaceAll("[^a-z0-9\\-]", "-") +.replaceAll("-+", "-"), + "", + KUBERNETES_DNSNAME_MAX_LENGTH +) + } + /** * Build a resources name based on the vendor device plugin naming * convention of:
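Note: the label-sanitization rule added by `getAppNameLabel` can be restated compactly. The snippet below is an illustrative Python re-implementation of the same rule (lowercase, replace characters outside `[a-z0-9-]` with `-`, collapse dash runs, truncate to the 63-character DNS label limit), not the Scala code that ships in Spark.

```python
import re

KUBERNETES_DNSNAME_MAX_LENGTH = 63  # DNS label limit per the Kubernetes docs


def app_name_label(app_name: str) -> str:
    label = app_name.strip().lower()
    label = re.sub(r"[^a-z0-9\-]", "-", label)  # anything outside [a-z0-9-] becomes '-'
    label = re.sub(r"-+", "-", label)           # collapse runs of dashes
    return label[:KUBERNETES_DNSNAME_MAX_LENGTH]  # keep within the 63-char limit


# e.g. "My Spark App (v2)" -> "my-spark-app-v2-"
print(app_name_label("My Spark App (v2)"))
```

With the label key defined in Constants.scala (`spark-app-name`), an application's driver and executor pods can then be listed with `kubectl get pods -l spark-app-name=<sanitized-name>`.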
[spark] branch branch-3.2 updated: [MINOR][SQL][DOCS] Document JDBC aggregate push down is for DSV2 only
This is an automated email from the ASF dual-hosted git repository. huaxingao pushed a commit to branch branch-3.2 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.2 by this push: new 8804ed9 [MINOR][SQL][DOCS] Document JDBC aggregate push down is for DSV2 only 8804ed9 is described below commit 8804ed9536f5e1bb3a24963086e8b093a0f71d2f Author: Huaxin Gao AuthorDate: Wed Nov 3 14:34:33 2021 -0700 [MINOR][SQL][DOCS] Document JDBC aggregate push down is for DSV2 only ### What changes were proposed in this pull request? To specify JDBC aggregate push down is for DS V2 only. This change is for both 3.2 and master. ### Why are the changes needed? To make the doc clear so user won't use aggregate push down in DS v1. ### Does this PR introduce _any_ user-facing change? No. Doc change only ### How was this patch tested? Manually checked. Closes #34465 from huaxingao/doc_minor. Authored-by: Huaxin Gao Signed-off-by: Huaxin Gao (cherry picked from commit 44151be395852a5ef7e5784607f4e69da5e61569) Signed-off-by: Huaxin Gao --- docs/sql-data-sources-jdbc.md | 2 +- .../org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala | 1 + 2 files changed, 2 insertions(+), 1 deletion(-) diff --git a/docs/sql-data-sources-jdbc.md b/docs/sql-data-sources-jdbc.md index 315f476..4d38923 100644 --- a/docs/sql-data-sources-jdbc.md +++ b/docs/sql-data-sources-jdbc.md @@ -241,7 +241,7 @@ logging into the data sources. pushDownAggregate false - The option to enable or disable aggregate push-down into the JDBC data source. The default value is false, in which case Spark will not push down aggregates to the JDBC data source. Otherwise, if sets to true, aggregates will be pushed down to the JDBC data source. Aggregate push-down is usually turned off when the aggregate is performed faster by Spark than by the JDBC data source. Please note that aggregates can be pushed down if and only if all the aggregate functions and the rel [...] + The option to enable or disable aggregate push-down in V2 JDBC data source. The default value is false, in which case Spark will not push down aggregates to the JDBC data source. Otherwise, if sets to true, aggregates will be pushed down to the JDBC data source. Aggregate push-down is usually turned off when the aggregate is performed faster by Spark than by the JDBC data source. Please note that aggregates can be pushed down if and only if all the aggregate functions and the relate [...] read diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala index 8b2ae2b..43d0481 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala @@ -189,6 +189,7 @@ class JDBCOptions( val pushDownPredicate = parameters.getOrElse(JDBC_PUSHDOWN_PREDICATE, "true").toBoolean // An option to allow/disallow pushing down aggregate into JDBC data source + // This only applies to Data Source V2 JDBC val pushDownAggregate = parameters.getOrElse(JDBC_PUSHDOWN_AGGREGATE, "false").toBoolean // The local path of user's keytab file, which is assumed to be pre-uploaded to all nodes either - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
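Note: a rough sketch of how the V2 JDBC path that this option applies to is typically set up. The catalog name, URL, driver, and table below are placeholders, and the exact catalog options can vary by Spark version; the key point of the doc change is that `pushDownAggregate` only takes effect on this V2 path, not on the classic `spark.read.jdbc(...)` / V1 path.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Register a Data Source V2 JDBC catalog; aggregate push-down only applies here.
    .config("spark.sql.catalog.h2",
            "org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog")
    .config("spark.sql.catalog.h2.url", "jdbc:h2:mem:testdb")  # placeholder URL
    .config("spark.sql.catalog.h2.driver", "org.h2.Driver")    # placeholder driver
    .config("spark.sql.catalog.h2.pushDownAggregate", "true")  # option documented above
    .getOrCreate()
)

# MAX/MIN/COUNT/SUM/AVG over a V2 JDBC table may now be pushed to the database;
# "h2.test.people" is a placeholder table name.
spark.sql("SELECT dept, COUNT(*) FROM h2.test.people GROUP BY dept").show()
```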
[spark] branch master updated (b874bf5 -> 44151be)
This is an automated email from the ASF dual-hosted git repository.

huaxingao pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from b874bf5  [SPARK-36894][SPARK-37077][PYTHON] Synchronize RDD.toDF annotations with SparkSession and SQLContext .createDataFrame variants
     add 44151be  [MINOR][SQL][DOCS] Document JDBC aggregate push down is for DSV2 only

No new revisions were added by this update.

Summary of changes:
 docs/sql-data-sources-jdbc.md                                          | 2 +-
 .../org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala  | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

[spark] branch master updated: [SPARK-36894][SPARK-37077][PYTHON] Synchronize RDD.toDF annotations with SparkSession and SQLContext .createDataFrame variants
This is an automated email from the ASF dual-hosted git repository. sarutak pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new b874bf5 [SPARK-36894][SPARK-37077][PYTHON] Synchronize RDD.toDF annotations with SparkSession and SQLContext .createDataFrame variants b874bf5 is described below commit b874bf5dca4f1b7272f458350eb153e7b272f8c8 Author: zero323 AuthorDate: Thu Nov 4 02:06:48 2021 +0900 [SPARK-36894][SPARK-37077][PYTHON] Synchronize RDD.toDF annotations with SparkSession and SQLContext .createDataFrame variants ### What changes were proposed in this pull request? This pull request synchronizes `RDD.toDF` annotations with `SparkSession.createDataFrame` and `SQLContext.createDataFrame` variants. Additionally, it fixes recent regression in `SQLContext.createDataFrame` (SPARK-37077), where `RDD` is no longer consider a valid input. ### Why are the changes needed? - Adds support for providing `str` schema. - Add supports for converting `RDDs` of "atomic" values, if schema is provided. Additionally it introduces a `TypeVar` representing supported "atomic" values. This was done to avoid issue with manual data tests, where the following ```python sc.parallelize([1]).toDF(schema=IntegerType()) ``` results in ``` error: No overload variant of "toDF" of "RDD" matches argument type "IntegerType" [call-overload] note: Possible overload variants: note: def toDF(self, schema: Union[List[str], Tuple[str, ...], None] = ..., sampleRatio: Optional[float] = ...) -> DataFrame note: def toDF(self, schema: Union[StructType, str, None] = ...) -> DataFrame ``` when `Union` type is used (this problem doesn't surface when non-self bound is used). ### Does this PR introduce _any_ user-facing change? Type checker only. Please note, that these annotations serve primarily to support documentation, as checks on `self` types are still very limited. ### How was this patch tested? Existing tests and manual data tests. __Note__: Updated data tests to reflect new expected traceback, after reversal in #34477 Closes #34478 from zero323/SPARK-36894. Authored-by: zero323 Signed-off-by: Kousuke Saruta --- python/pyspark/rdd.pyi | 15 ++--- python/pyspark/sql/_typing.pyi | 11 +++ python/pyspark/sql/context.py| 38 ++ python/pyspark/sql/session.py| 40 +--- python/pyspark/sql/tests/typing/test_session.yml | 8 ++--- 5 files changed, 71 insertions(+), 41 deletions(-) diff --git a/python/pyspark/rdd.pyi b/python/pyspark/rdd.pyi index a810a2c..84481d3 100644 --- a/python/pyspark/rdd.pyi +++ b/python/pyspark/rdd.pyi @@ -55,8 +55,8 @@ from pyspark.resource.requests import ( # noqa: F401 from pyspark.resource.profile import ResourceProfile from pyspark.statcounter import StatCounter from pyspark.sql.dataframe import DataFrame -from pyspark.sql.types import StructType -from pyspark.sql._typing import RowLike +from pyspark.sql.types import AtomicType, StructType +from pyspark.sql._typing import AtomicValue, RowLike from py4j.java_gateway import JavaObject # type: ignore[import] T = TypeVar("T") @@ -445,11 +445,18 @@ class RDD(Generic[T]): @overload def toDF( self: RDD[RowLike], -schema: Optional[List[str]] = ..., +schema: Optional[Union[List[str], Tuple[str, ...]]] = ..., sampleRatio: Optional[float] = ..., ) -> DataFrame: ... @overload -def toDF(self: RDD[RowLike], schema: Optional[StructType] = ...) -> DataFrame: ... +def toDF( +self: RDD[RowLike], schema: Optional[Union[StructType, str]] = ... +) -> DataFrame: ... 
+@overload +def toDF( +self: RDD[AtomicValue], +schema: Union[AtomicType, str], +) -> DataFrame: ... class RDDBarrier(Generic[T]): rdd: RDD[T] diff --git a/python/pyspark/sql/_typing.pyi b/python/pyspark/sql/_typing.pyi index 1a3bd8f..b6b4606 100644 --- a/python/pyspark/sql/_typing.pyi +++ b/python/pyspark/sql/_typing.pyi @@ -42,6 +42,17 @@ AtomicDataTypeOrString = Union[pyspark.sql.types.AtomicType, str] DataTypeOrString = Union[pyspark.sql.types.DataType, str] OptionalPrimitiveType = Optional[PrimitiveType] +AtomicValue = TypeVar( +"AtomicValue", +datetime.datetime, +datetime.date, +decimal.Decimal, +bool, +str, +int, +float, +) + RowLike = TypeVar("RowLike", List[Any], Tuple[Any, ...], pyspark.sql.types.Row) class SupportsOpen(Protocol): diff --git a/python/pyspark/sql/context.py b/python/pyspark/sql/context.py index 7d27c55..eba2087 100644 --- a/python/pyspark/sql/context.py +++
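Note: to make the overloads concrete, the call shapes they are intended to accept look roughly like this minimal sketch (illustrative column names and data; the annotations affect type checking only, runtime behavior is unchanged).

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

rows = sc.parallelize([(1, "a"), (2, "b")])

df1 = rows.toDF(["id", "value"])            # list/tuple of column names (+ optional sampleRatio)
df2 = rows.toDF("id: int, value: string")   # DDL-formatted schema string
df3 = sc.parallelize([1, 2, 3]).toDF(IntegerType())  # RDD of "atomic" values + AtomicType schema
```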
[spark] branch master updated: Revert "[SPARK-36894][SPARK-37077][PYTHON] Synchronize RDD.toDF annotations with SparkSession and SQLContext .createDataFrame variants."
This is an automated email from the ASF dual-hosted git repository. sarutak pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 8687138 Revert "[SPARK-36894][SPARK-37077][PYTHON] Synchronize RDD.toDF annotations with SparkSession and SQLContext .createDataFrame variants." 8687138 is described below commit 86871386b063d8f7a8b5b42eb327a3900525af58 Author: Kousuke Saruta AuthorDate: Wed Nov 3 23:01:38 2021 +0900 Revert "[SPARK-36894][SPARK-37077][PYTHON] Synchronize RDD.toDF annotations with SparkSession and SQLContext .createDataFrame variants." This reverts commit 855da09f02f3007a2c36e7a738d4dc81fd95569a. See [this comment](https://github.com/apache/spark/pull/34146#issuecomment-959136935). Closes #34477 from sarutak/revert-SPARK-37077. Authored-by: Kousuke Saruta Signed-off-by: Kousuke Saruta --- python/pyspark/rdd.pyi | 15 --- python/pyspark/sql/_typing.pyi | 11 --- python/pyspark/sql/context.py | 38 ++ python/pyspark/sql/session.py | 40 +++- 4 files changed, 37 insertions(+), 67 deletions(-) diff --git a/python/pyspark/rdd.pyi b/python/pyspark/rdd.pyi index 84481d3..a810a2c 100644 --- a/python/pyspark/rdd.pyi +++ b/python/pyspark/rdd.pyi @@ -55,8 +55,8 @@ from pyspark.resource.requests import ( # noqa: F401 from pyspark.resource.profile import ResourceProfile from pyspark.statcounter import StatCounter from pyspark.sql.dataframe import DataFrame -from pyspark.sql.types import AtomicType, StructType -from pyspark.sql._typing import AtomicValue, RowLike +from pyspark.sql.types import StructType +from pyspark.sql._typing import RowLike from py4j.java_gateway import JavaObject # type: ignore[import] T = TypeVar("T") @@ -445,18 +445,11 @@ class RDD(Generic[T]): @overload def toDF( self: RDD[RowLike], -schema: Optional[Union[List[str], Tuple[str, ...]]] = ..., +schema: Optional[List[str]] = ..., sampleRatio: Optional[float] = ..., ) -> DataFrame: ... @overload -def toDF( -self: RDD[RowLike], schema: Optional[Union[StructType, str]] = ... -) -> DataFrame: ... -@overload -def toDF( -self: RDD[AtomicValue], -schema: Union[AtomicType, str], -) -> DataFrame: ... +def toDF(self: RDD[RowLike], schema: Optional[StructType] = ...) -> DataFrame: ... 
class RDDBarrier(Generic[T]): rdd: RDD[T] diff --git a/python/pyspark/sql/_typing.pyi b/python/pyspark/sql/_typing.pyi index b6b4606..1a3bd8f 100644 --- a/python/pyspark/sql/_typing.pyi +++ b/python/pyspark/sql/_typing.pyi @@ -42,17 +42,6 @@ AtomicDataTypeOrString = Union[pyspark.sql.types.AtomicType, str] DataTypeOrString = Union[pyspark.sql.types.DataType, str] OptionalPrimitiveType = Optional[PrimitiveType] -AtomicValue = TypeVar( -"AtomicValue", -datetime.datetime, -datetime.date, -decimal.Decimal, -bool, -str, -int, -float, -) - RowLike = TypeVar("RowLike", List[Any], Tuple[Any, ...], pyspark.sql.types.Row) class SupportsOpen(Protocol): diff --git a/python/pyspark/sql/context.py b/python/pyspark/sql/context.py index eba2087..7d27c55 100644 --- a/python/pyspark/sql/context.py +++ b/python/pyspark/sql/context.py @@ -48,11 +48,13 @@ from pyspark.conf import SparkConf if TYPE_CHECKING: from pyspark.sql._typing import ( -AtomicValue, -RowLike, UserDefinedFunctionLike, +RowLike, +DateTimeLiteral, +LiteralType, +DecimalLiteral ) -from pyspark.sql.pandas._typing import DataFrameLike as PandasDataFrameLike +from pyspark.sql.pandas._typing import DataFrameLike __all__ = ["SQLContext", "HiveContext"] @@ -321,8 +323,7 @@ class SQLContext(object): @overload def createDataFrame( self, -data: Union["RDD[RowLike]", Iterable["RowLike"]], -schema: Union[List[str], Tuple[str, ...]] = ..., +data: Iterable["RowLike"], samplingRatio: Optional[float] = ..., ) -> DataFrame: ... @@ -330,9 +331,8 @@ class SQLContext(object): @overload def createDataFrame( self, -data: Union["RDD[RowLike]", Iterable["RowLike"]], -schema: Union[StructType, str], -*, +data: Iterable["RowLike"], +schema: Union[List[str], Tuple[str, ...]] = ..., verifySchema: bool = ..., ) -> DataFrame: ... @@ -340,10 +340,7 @@ class SQLContext(object): @overload def createDataFrame( self, -data: Union[ -"RDD[AtomicValue]", -Iterable["AtomicValue"], -], +data: Iterable[Union["DateTimeLiteral", "LiteralType", "DecimalLiteral"]], schema: Union[AtomicType, str], verifySchema: bool = ...,
[spark] branch master updated: [SPARK-36894][SPARK-37077][PYTHON] Synchronize RDD.toDF annotations with SparkSession and SQLContext .createDataFrame variants
This is an automated email from the ASF dual-hosted git repository. zero323 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 855da09 [SPARK-36894][SPARK-37077][PYTHON] Synchronize RDD.toDF annotations with SparkSession and SQLContext .createDataFrame variants 855da09 is described below commit 855da09f02f3007a2c36e7a738d4dc81fd95569a Author: zero323 AuthorDate: Wed Nov 3 13:10:48 2021 +0100 [SPARK-36894][SPARK-37077][PYTHON] Synchronize RDD.toDF annotations with SparkSession and SQLContext .createDataFrame variants ### What changes were proposed in this pull request? This pull request synchronizes `RDD.toDF` annotations with `SparkSession.createDataFrame` and `SQLContext.createDataFrame` variants. Additionally, it fixes recent regression in `SQLContext.createDataFrame` (SPARK-37077), where `RDD` is no longer consider a valid input. ### Why are the changes needed? - Adds support for providing `str` schema. - Add supports for converting `RDDs` of "atomic" values, if schema is provided. Additionally it introduces a `TypeVar` representing supported "atomic" values. This was done to avoid issue with manual data tests, where the following ```python sc.parallelize([1]).toDF(schema=IntegerType()) ``` results in ``` error: No overload variant of "toDF" of "RDD" matches argument type "IntegerType" [call-overload] note: Possible overload variants: note: def toDF(self, schema: Union[List[str], Tuple[str, ...], None] = ..., sampleRatio: Optional[float] = ...) -> DataFrame note: def toDF(self, schema: Union[StructType, str, None] = ...) -> DataFrame ``` when `Union` type is used (this problem doesn't surface when non-self bound is used). ### Does this PR introduce _any_ user-facing change? Type checker only. Please note, that these annotations serve primarily to support documentation, as checks on `self` types are still very limited. ### How was this patch tested? Existing tests and manual data tests. Closes #34146 from zero323/SPARK-36894. Authored-by: zero323 Signed-off-by: zero323 --- python/pyspark/rdd.pyi | 15 +++ python/pyspark/sql/_typing.pyi | 11 +++ python/pyspark/sql/context.py | 38 -- python/pyspark/sql/session.py | 40 +--- 4 files changed, 67 insertions(+), 37 deletions(-) diff --git a/python/pyspark/rdd.pyi b/python/pyspark/rdd.pyi index a810a2c..84481d3 100644 --- a/python/pyspark/rdd.pyi +++ b/python/pyspark/rdd.pyi @@ -55,8 +55,8 @@ from pyspark.resource.requests import ( # noqa: F401 from pyspark.resource.profile import ResourceProfile from pyspark.statcounter import StatCounter from pyspark.sql.dataframe import DataFrame -from pyspark.sql.types import StructType -from pyspark.sql._typing import RowLike +from pyspark.sql.types import AtomicType, StructType +from pyspark.sql._typing import AtomicValue, RowLike from py4j.java_gateway import JavaObject # type: ignore[import] T = TypeVar("T") @@ -445,11 +445,18 @@ class RDD(Generic[T]): @overload def toDF( self: RDD[RowLike], -schema: Optional[List[str]] = ..., +schema: Optional[Union[List[str], Tuple[str, ...]]] = ..., sampleRatio: Optional[float] = ..., ) -> DataFrame: ... @overload -def toDF(self: RDD[RowLike], schema: Optional[StructType] = ...) -> DataFrame: ... +def toDF( +self: RDD[RowLike], schema: Optional[Union[StructType, str]] = ... +) -> DataFrame: ... +@overload +def toDF( +self: RDD[AtomicValue], +schema: Union[AtomicType, str], +) -> DataFrame: ... 
class RDDBarrier(Generic[T]): rdd: RDD[T] diff --git a/python/pyspark/sql/_typing.pyi b/python/pyspark/sql/_typing.pyi index 1a3bd8f..b6b4606 100644 --- a/python/pyspark/sql/_typing.pyi +++ b/python/pyspark/sql/_typing.pyi @@ -42,6 +42,17 @@ AtomicDataTypeOrString = Union[pyspark.sql.types.AtomicType, str] DataTypeOrString = Union[pyspark.sql.types.DataType, str] OptionalPrimitiveType = Optional[PrimitiveType] +AtomicValue = TypeVar( +"AtomicValue", +datetime.datetime, +datetime.date, +decimal.Decimal, +bool, +str, +int, +float, +) + RowLike = TypeVar("RowLike", List[Any], Tuple[Any, ...], pyspark.sql.types.Row) class SupportsOpen(Protocol): diff --git a/python/pyspark/sql/context.py b/python/pyspark/sql/context.py index 7d27c55..eba2087 100644 --- a/python/pyspark/sql/context.py +++ b/python/pyspark/sql/context.py @@ -48,13 +48,11 @@ from pyspark.conf import SparkConf if TYPE_CHECKING: from pyspark.sql._typing import ( -UserDefinedFunctionLike, +AtomicValue,
[spark] branch branch-3.2 updated: [MINOR][PYTHON][DOCS] Fix broken link in legacy Apache Arrow in PySpark page
This is an automated email from the ASF dual-hosted git repository. sarutak pushed a commit to branch branch-3.2 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.2 by this push: new c2f147e [MINOR][PYTHON][DOCS] Fix broken link in legacy Apache Arrow in PySpark page c2f147e is described below commit c2f147eff8e3e353cfb43f5d45f19f174fb26773 Author: Hyukjin Kwon AuthorDate: Wed Nov 3 20:54:50 2021 +0900 [MINOR][PYTHON][DOCS] Fix broken link in legacy Apache Arrow in PySpark page ### What changes were proposed in this pull request? This PR proposes to fix the broken link in the legacy page. Currently it links to: ![Screen Shot 2021-11-03 at 6 34 32 PM](https://user-images.githubusercontent.com/6477701/140037221-b7963e47-12f5-49f3-8290-8560c99c62c2.png) ![Screen Shot 2021-11-03 at 6 34 30 PM](https://user-images.githubusercontent.com/6477701/140037225-c21070fc-907f-41bb-a421-747810ae5b0d.png) It should link to: ![Screen Shot 2021-11-03 at 6 34 35 PM](https://user-images.githubusercontent.com/6477701/140037246-dd14760f-5487-4b8b-b3f6-e9495f1d4ec9.png) ![Screen Shot 2021-11-03 at 6 34 38 PM](https://user-images.githubusercontent.com/6477701/140037251-3f97e992-6660-4ce9-9c66-77855d3c0a64.png) ### Why are the changes needed? For users to easily navigate from legacy page to newer page. ### Does this PR introduce _any_ user-facing change? Yes, it fixes a bug in documentation. ### How was this patch tested? Manually built the side and checked the link Closes #34475 from HyukjinKwon/minor-doc-fix-py. Authored-by: Hyukjin Kwon Signed-off-by: Kousuke Saruta (cherry picked from commit ab7e5030b23ccb8ef6aa43645e909457f9d68ffa) Signed-off-by: Kousuke Saruta --- python/docs/source/user_guide/arrow_pandas.rst | 2 +- python/docs/source/user_guide/sql/arrow_pandas.rst | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/python/docs/source/user_guide/arrow_pandas.rst b/python/docs/source/user_guide/arrow_pandas.rst index c1b68a6..60c11b7 100644 --- a/python/docs/source/user_guide/arrow_pandas.rst +++ b/python/docs/source/user_guide/arrow_pandas.rst @@ -21,4 +21,4 @@ Apache Arrow in PySpark === -This page has been moved to `Apache Arrow in PySpark <../sql/arrow_pandas.rst>`_. +This page has been moved to `Apache Arrow in PySpark `_. diff --git a/python/docs/source/user_guide/sql/arrow_pandas.rst b/python/docs/source/user_guide/sql/arrow_pandas.rst index 1767624..78d3e7a 100644 --- a/python/docs/source/user_guide/sql/arrow_pandas.rst +++ b/python/docs/source/user_guide/sql/arrow_pandas.rst @@ -343,7 +343,7 @@ Supported SQL Types Currently, all Spark SQL data types are supported by Arrow-based conversion except :class:`ArrayType` of :class:`TimestampType`, and nested :class:`StructType`. -:class: `MapType` is only supported when using PyArrow 2.0.0 and above. +:class:`MapType` is only supported when using PyArrow 2.0.0 and above. Setting Arrow Batch Size - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (9babf9a -> ab7e503)
This is an automated email from the ASF dual-hosted git repository.

sarutak pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from 9babf9a  [SPARK-37200][SQL] Support drop index for Data Source V2
     add ab7e503  [MINOR][PYTHON][DOCS] Fix broken link in legacy Apache Arrow in PySpark page

No new revisions were added by this update.

Summary of changes:
 python/docs/source/user_guide/arrow_pandas.rst     | 2 +-
 python/docs/source/user_guide/sql/arrow_pandas.rst | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)
[spark] branch master updated (2afa52d -> 9babf9a)
This is an automated email from the ASF dual-hosted git repository.

huaxingao pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from 2afa52d  [SPARK-37179][SQL] ANSI mode: Add a config to allow casting between Datetime and Numeric
     add 9babf9a  [SPARK-37200][SQL] Support drop index for Data Source V2

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/jdbc/v2/MySQLIntegrationSuite.scala  | 21 +++--
 .../org/apache/spark/sql/jdbc/v2/V2JDBCTest.scala  |  4 ++--
 .../apache/spark/sql/catalyst/parser/SqlBase.g4    |  1 +
 .../spark/sql/catalyst/parser/AstBuilder.scala     | 15 +
 .../sql/catalyst/plans/logical/v2Commands.scala    | 12 ++
 .../spark/sql/catalyst/parser/DDLParserSuite.scala |  8 +++
 .../datasources/v2/DataSourceV2Strategy.scala      |  8 +++
 .../{AlterTableExec.scala => DropIndexExec.scala}  | 26 ++
 .../execution/datasources/v2/jdbc/JDBCTable.scala  |  4 ++--
 .../spark/sql/hive/execution/HiveQuerySuite.scala  |  5 +
 10 files changed, 80 insertions(+), 24 deletions(-)
 copy sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/{AlterTableExec.scala => DropIndexExec.scala} (70%)
[spark] branch master updated (59c55dd -> 2afa52d)
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from 59c55dd  [SPARK-24774][SQL][FOLLOWUP] Remove unused code in SchemaConverters.scala
     add 2afa52d  [SPARK-37179][SQL] ANSI mode: Add a config to allow casting between Datetime and Numeric

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/catalyst/expressions/Cast.scala      |  51 ++
 .../spark/sql/catalyst/util/DateTimeUtils.scala    |  16 +++-
 .../spark/sql/errors/QueryExecutionErrors.scala    |   4 +-
 .../org/apache/spark/sql/internal/SQLConf.scala    |  12 +++
 .../catalyst/expressions/AnsiCastSuiteBase.scala   | 103 +
 5 files changed, 164 insertions(+), 22 deletions(-)