[spark] branch master updated (9b262e7 -> 019203c)
This is an automated email from the ASF dual-hosted git repository.

ueshin pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from 9b262e7  [SPARK-36401][PYTHON] Implement Series.cov
     add 019203c  [SPARK-36655][PYTHON] Add `versionadded` for API added in Spark 3.3.0

No new revisions were added by this update.

Summary of changes:
 python/pyspark/pandas/frame.py | 2 ++
 1 file changed, 2 insertions(+)
[spark] branch master updated: [SPARK-36401][PYTHON] Implement Series.cov
This is an automated email from the ASF dual-hosted git repository.

ueshin pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 9b262e7  [SPARK-36401][PYTHON] Implement Series.cov
9b262e7 is described below

commit 9b262e722d717bebf7d2aa19af0658cf3b8e1e85
Author: dgd-contributor
AuthorDate: Fri Sep 3 10:41:27 2021 -0700

    [SPARK-36401][PYTHON] Implement Series.cov

    ### What changes were proposed in this pull request?
    Implement Series.cov

    ### Why are the changes needed?
    That is supported in pandas. We should support that as well.

    ### Does this PR introduce _any_ user-facing change?
    Yes. Series.cov can be used.

    ```python
    >>> from pyspark.pandas.config import set_option, reset_option
    >>> set_option("compute.ops_on_diff_frames", True)
    >>> s1 = ps.Series([0.90010907, 0.13484424, 0.62036035])
    >>> s2 = ps.Series([0.12528585, 0.26962463, 0.5198])
    >>> s1.cov(s2)
    -0.016857626527158744
    >>> reset_option("compute.ops_on_diff_frames")
    ```

    ### How was this patch tested?
    Unit tests

    Closes #33752 from dgd-contributor/SPARK-36401_Implement_Series.cov.

    Authored-by: dgd-contributor
    Signed-off-by: Takuya UESHIN
---
 .../source/reference/pyspark.pandas/series.rst | 1 +
 python/pyspark/pandas/missing/series.py        | 1 -
 python/pyspark/pandas/series.py                | 50 ++
 .../pandas/tests/test_ops_on_diff_frames.py    | 37
 python/pyspark/pandas/tests/test_series.py     | 45 +++
 5 files changed, 133 insertions(+), 1 deletion(-)

diff --git a/python/docs/source/reference/pyspark.pandas/series.rst b/python/docs/source/reference/pyspark.pandas/series.rst
index 1e32773..337f686 100644
--- a/python/docs/source/reference/pyspark.pandas/series.rst
+++ b/python/docs/source/reference/pyspark.pandas/series.rst
@@ -138,6 +138,7 @@ Computations / Descriptive Stats
    Series.clip
    Series.corr
    Series.count
+   Series.cov
    Series.cummax
    Series.cummin
    Series.cumsum
diff --git a/python/pyspark/pandas/missing/series.py b/python/pyspark/pandas/missing/series.py
index ef3f38b..f6ea8d3 100644
--- a/python/pyspark/pandas/missing/series.py
+++ b/python/pyspark/pandas/missing/series.py
@@ -40,7 +40,6 @@ class MissingPandasLikeSeries(object):
     autocorr = _unsupported_function("autocorr")
     combine = _unsupported_function("combine")
     convert_dtypes = _unsupported_function("convert_dtypes")
-    cov = _unsupported_function("cov")
     ewm = _unsupported_function("ewm")
     infer_objects = _unsupported_function("infer_objects")
     interpolate = _unsupported_function("interpolate")
diff --git a/python/pyspark/pandas/series.py b/python/pyspark/pandas/series.py
index 568754c..7e45566 100644
--- a/python/pyspark/pandas/series.py
+++ b/python/pyspark/pandas/series.py
@@ -944,6 +944,56 @@ class Series(Frame, IndexOpsMixin, Generic[T]):

         return lmask & rmask

+    def cov(self, other: "Series", min_periods: Optional[int] = None) -> float:
+        """
+        Compute covariance with Series, excluding missing values.
+
+        .. versionadded:: 3.3.0
+
+        Parameters
+        ----------
+        other : Series
+            Series with which to compute the covariance.
+        min_periods : int, optional
+            Minimum number of observations needed to have a valid result.
+
+        Returns
+        -------
+        float
+            Covariance between Series and other
+
+        Examples
+        --------
+        >>> from pyspark.pandas.config import set_option, reset_option
+        >>> set_option("compute.ops_on_diff_frames", True)
+        >>> s1 = ps.Series([0.90010907, 0.13484424, 0.62036035])
+        >>> s2 = ps.Series([0.12528585, 0.26962463, 0.5198])
+        >>> s1.cov(s2)
+        -0.016857626527158744
+        >>> reset_option("compute.ops_on_diff_frames")
+        """
+        if not isinstance(other, Series):
+            raise TypeError("unsupported type: %s" % type(other))
+        if not np.issubdtype(self.dtype, np.number):
+            raise TypeError("unsupported dtype: %s" % self.dtype)
+        if not np.issubdtype(other.dtype, np.number):
+            raise TypeError("unsupported dtype: %s" % other.dtype)
+
+        min_periods = 1 if min_periods is None else min_periods
+
+        if same_anchor(self, other):
+            sdf = self._internal.spark_frame.select(self.spark.column, other.spark.column)
+        else:
+            combined = combine_frames(self.to_frame(), other.to_frame())
+            sdf = combined._internal.spark_frame.select(*combined._internal.data_spark_columns)
+
+        sdf = sdf.dropna()
+
+        if len(sdf.head(min_periods)) < min_periods:
+            return np.nan
+        else:
+            return
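[Editor's note] For readers of this digest, a minimal usage sketch of the new API — not part of the commit; it assumes a running SparkSession with pyspark.pandas importable, and the min_periods behavior shown follows the docstring above:

```python
# Minimal sketch of Series.cov, added in Spark 3.3.0 by this commit.
import pyspark.pandas as ps
from pyspark.pandas.config import set_option, reset_option

# Comparing two Series that are not backed by the same DataFrame
# requires ops_on_diff_frames, exactly as in the doctest above.
set_option("compute.ops_on_diff_frames", True)

s1 = ps.Series([0.90010907, 0.13484424, 0.62036035])
s2 = ps.Series([0.12528585, 0.26962463, 0.5198])

print(s1.cov(s2))                  # covariance over the 3 paired observations
print(s1.cov(s2, min_periods=4))   # fewer rows than min_periods -> nan

reset_option("compute.ops_on_diff_frames")
```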
[spark] branch branch-3.2 updated: [SPARK-36659][SQL] Promote spark.sql.execution.topKSortFallbackThreshold to a user-facing config
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.2 by this push:
     new aa96a37  [SPARK-36659][SQL] Promote spark.sql.execution.topKSortFallbackThreshold to a user-facing config
aa96a37 is described below

commit aa96a374b2f8b0bf4b004792d6b7729e27e349e5
Author: Kent Yao
AuthorDate: Fri Sep 3 19:11:37 2021 +0800

    [SPARK-36659][SQL] Promote spark.sql.execution.topKSortFallbackThreshold to a user-facing config

    ### What changes were proposed in this pull request?
    Promote spark.sql.execution.topKSortFallbackThreshold to a user-facing config.

    ### Why are the changes needed?
    spark.sql.execution.topKSortFallbackThreshold is currently an internal config hidden from users, with Integer.MAX_VALUE - 15 as its default. In many real-world cases, if the K is very big, there would be performance issues. It's better to leave this choice to users.

    ### Does this PR introduce _any_ user-facing change?
    spark.sql.execution.topKSortFallbackThreshold is now user-facing.

    ### How was this patch tested?
    Passing GA.

    Closes #33904 from yaooqinn/SPARK-36659.

    Authored-by: Kent Yao
    Signed-off-by: Kent Yao
    (cherry picked from commit 7f1ad7be18b07d880ba92c470e64bf7458b8366a)
    Signed-off-by: Dongjoon Hyun
---
 sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala | 1 -
 1 file changed, 1 deletion(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
index b137fd2..11b3097 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
@@ -2637,7 +2637,6 @@ object SQLConf {

   val TOP_K_SORT_FALLBACK_THRESHOLD = buildConf("spark.sql.execution.topKSortFallbackThreshold")
-    .internal()
     .doc("In SQL queries with a SORT followed by a LIMIT like " +
       "'SELECT x FROM t ORDER BY y LIMIT m', if m is under this threshold, do a top-K sort" +
       " in memory, otherwise do a global sort which spills to disk if necessary.")
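[Editor's note] A short sketch of what becomes possible once the config is user-facing — not part of the commit; table `t` and columns `x`, `y` are hypothetical, and `spark` is assumed to be an active SparkSession:

```python
# Sketch: lowering the top-K fallback threshold so that very large LIMITs
# fall back to a global (spillable) sort instead of an in-memory top-K sort.
spark.conf.set("spark.sql.execution.topKSortFallbackThreshold", "10000")

# LIMIT below the threshold: expect a top-K sort (TakeOrderedAndProject) in memory.
spark.sql("SELECT x FROM t ORDER BY y LIMIT 100").explain()

# LIMIT above the threshold: expect a global sort followed by a limit,
# which can spill to disk instead of holding K rows per task in memory.
spark.sql("SELECT x FROM t ORDER BY y LIMIT 1000000").explain()
```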
[spark] branch branch-3.1 updated: [SPARK-36639][SQL] Fix an issue that sequence builtin function causes ArrayIndexOutOfBoundsException if the arguments are under the condition of start == stop && step < 0
This is an automated email from the ASF dual-hosted git repository.

sarutak pushed a commit to branch branch-3.1
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.1 by this push:
     new c1f8d75  [SPARK-36639][SQL] Fix an issue that sequence builtin function causes ArrayIndexOutOfBoundsException if the arguments are under the condition of start == stop && step < 0
c1f8d75 is described below

commit c1f8d759a3d75885e694e8c468ee6beea70131a3
Author: Kousuke Saruta
AuthorDate: Fri Sep 3 23:25:18 2021 +0900

    [SPARK-36639][SQL] Fix an issue that sequence builtin function causes ArrayIndexOutOfBoundsException if the arguments are under the condition of start == stop && step < 0

    ### What changes were proposed in this pull request?
    This PR fixes an issue that the `sequence` builtin function causes `ArrayIndexOutOfBoundsException` if the arguments are under the condition of `start == stop && step < 0`.
    This is an example.
    ```
    SELECT sequence(timestamp'2021-08-31', timestamp'2021-08-31', -INTERVAL 1 month);
    21/09/02 04:14:42 ERROR SparkSQLDriver: Failed in [SELECT sequence(timestamp'2021-08-31', timestamp'2021-08-31', -INTERVAL 1 month)]
    java.lang.ArrayIndexOutOfBoundsException: 1
    ```
    Actually, this example succeeded before SPARK-31980 (#28819) was merged.

    ### Why are the changes needed?
    Bug fix.

    ### Does this PR introduce _any_ user-facing change?
    No.

    ### How was this patch tested?
    New tests.

    Closes #33895 from sarutak/fix-sequence-issue.

    Authored-by: Kousuke Saruta
    Signed-off-by: Kousuke Saruta
    (cherry picked from commit cf3bc65e69dcb0f8ba3dee89642d082265edab31)
    Signed-off-by: Kousuke Saruta
---
 .../catalyst/expressions/collectionOperations.scala |  4 ++--
 .../expressions/CollectionExpressionsSuite.scala    | 18 ++
 2 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
index b341895..bb2163c 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
@@ -2711,7 +2711,7 @@ object Sequence {
       val maxEstimatedArrayLength =
         getSequenceLength(startMicros, stopMicros, intervalStepInMicros)

-      val stepSign = if (stopMicros >= startMicros) +1 else -1
+      val stepSign = if (intervalStepInMicros > 0) +1 else -1
       val exclusiveItem = stopMicros + stepSign
       val arr = new Array[T](maxEstimatedArrayLength)
       var t = startMicros
@@ -2786,7 +2786,7 @@ object Sequence {
          |
          |  $sequenceLengthCode
          |
-         |  final int $stepSign = $stopMicros >= $startMicros ? +1 : -1;
+         |  final int $stepSign = $intervalInMicros > 0 ? +1 : -1;
          |  final long $exclusiveItem = $stopMicros + $stepSign;
          |
          |  $arr = new $elemType[$arrLength];
diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala
index 095894b..d79f06f 100644
--- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala
+++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala
@@ -1888,6 +1888,24 @@ class CollectionExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper
       Seq(Date.valueOf("2018-01-01")))
   }

+  test("SPARK-36639: Start and end equal in month range with a negative step") {
+    checkEvaluation(new Sequence(
+      Literal(Date.valueOf("2018-01-01")),
+      Literal(Date.valueOf("2018-01-01")),
+      Literal(stringToInterval("interval -1 day"))),
+      Seq(Date.valueOf("2018-01-01")))
+    checkEvaluation(new Sequence(
+      Literal(Date.valueOf("2018-01-01")),
+      Literal(Date.valueOf("2018-01-01")),
+      Literal(stringToInterval("interval -1 month"))),
+      Seq(Date.valueOf("2018-01-01")))
+    checkEvaluation(new Sequence(
+      Literal(Date.valueOf("2018-01-01")),
+      Literal(Date.valueOf("2018-01-01")),
+      Literal(stringToInterval("interval -1 year"))),
+      Seq(Date.valueOf("2018-01-01")))
+  }
+
   test("SPARK-33386: element_at ArrayIndexOutOfBoundsException") {
     Seq(true, false).foreach { ansiEnabled =>
       withSQLConf(SQLConf.ANSI_ENABLED.key -> ansiEnabled.toString) {
[spark] branch branch-3.2 updated: [SPARK-36639][SQL] Fix an issue that sequence builtin function causes ArrayIndexOutOfBoundsException if the arguments are under the condition of start == stop && step < 0
This is an automated email from the ASF dual-hosted git repository.

sarutak pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.2 by this push:
     new a3901ed  [SPARK-36639][SQL] Fix an issue that sequence builtin function causes ArrayIndexOutOfBoundsException if the arguments are under the condition of start == stop && step < 0
a3901ed is described below

commit a3901ed3848d21fd36bb5aa265ef8e8d74d8e324
Author: Kousuke Saruta
AuthorDate: Fri Sep 3 23:25:18 2021 +0900

    [SPARK-36639][SQL] Fix an issue that sequence builtin function causes ArrayIndexOutOfBoundsException if the arguments are under the condition of start == stop && step < 0

    ### What changes were proposed in this pull request?
    This PR fixes an issue that the `sequence` builtin function causes `ArrayIndexOutOfBoundsException` if the arguments are under the condition of `start == stop && step < 0`.
    This is an example.
    ```
    SELECT sequence(timestamp'2021-08-31', timestamp'2021-08-31', -INTERVAL 1 month);
    21/09/02 04:14:42 ERROR SparkSQLDriver: Failed in [SELECT sequence(timestamp'2021-08-31', timestamp'2021-08-31', -INTERVAL 1 month)]
    java.lang.ArrayIndexOutOfBoundsException: 1
    ```
    Actually, this example succeeded before SPARK-31980 (#28819) was merged.

    ### Why are the changes needed?
    Bug fix.

    ### Does this PR introduce _any_ user-facing change?
    No.

    ### How was this patch tested?
    New tests.

    Closes #33895 from sarutak/fix-sequence-issue.

    Authored-by: Kousuke Saruta
    Signed-off-by: Kousuke Saruta
    (cherry picked from commit cf3bc65e69dcb0f8ba3dee89642d082265edab31)
    Signed-off-by: Kousuke Saruta
---
 .../catalyst/expressions/collectionOperations.scala |  4 ++--
 .../expressions/CollectionExpressionsSuite.scala    | 18 ++
 2 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
index 6cbab86..ce17231 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
@@ -2903,7 +2903,7 @@ object Sequence {
       val maxEstimatedArrayLength =
         getSequenceLength(startMicros, stopMicros, input3, intervalStepInMicros)

-      val stepSign = if (stopMicros >= startMicros) +1 else -1
+      val stepSign = if (intervalStepInMicros > 0) +1 else -1
       val exclusiveItem = stopMicros + stepSign
       val arr = new Array[T](maxEstimatedArrayLength)
       var t = startMicros
@@ -2989,7 +2989,7 @@ object Sequence {
          |
          |  $sequenceLengthCode
          |
-         |  final int $stepSign = $stopMicros >= $startMicros ? +1 : -1;
+         |  final int $stepSign = $intervalInMicros > 0 ? +1 : -1;
          |  final long $exclusiveItem = $stopMicros + $stepSign;
          |
          |  $arr = new $elemType[$arrLength];
diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala
index caa5e96..e8f5f07 100644
--- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala
+++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala
@@ -2232,6 +2232,24 @@ class CollectionExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper
       Seq(Date.valueOf("2018-01-01")))
   }

+  test("SPARK-36639: Start and end equal in month range with a negative step") {
+    checkEvaluation(new Sequence(
+      Literal(Date.valueOf("2018-01-01")),
+      Literal(Date.valueOf("2018-01-01")),
+      Literal(stringToInterval("interval -1 day"))),
+      Seq(Date.valueOf("2018-01-01")))
+    checkEvaluation(new Sequence(
+      Literal(Date.valueOf("2018-01-01")),
+      Literal(Date.valueOf("2018-01-01")),
+      Literal(stringToInterval("interval -1 month"))),
+      Seq(Date.valueOf("2018-01-01")))
+    checkEvaluation(new Sequence(
+      Literal(Date.valueOf("2018-01-01")),
+      Literal(Date.valueOf("2018-01-01")),
+      Literal(stringToInterval("interval -1 year"))),
+      Seq(Date.valueOf("2018-01-01")))
+  }
+
   test("SPARK-33386: element_at ArrayIndexOutOfBoundsException") {
     Seq(true, false).foreach { ansiEnabled =>
       withSQLConf(SQLConf.ANSI_ENABLED.key -> ansiEnabled.toString) {
[spark] branch master updated: [SPARK-36639][SQL] Fix an issue that sequence builtin function causes ArrayIndexOutOfBoundsException if the arguments are under the condition of start == stop && step < 0
This is an automated email from the ASF dual-hosted git repository.

sarutak pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new cf3bc65  [SPARK-36639][SQL] Fix an issue that sequence builtin function causes ArrayIndexOutOfBoundsException if the arguments are under the condition of start == stop && step < 0
cf3bc65 is described below

commit cf3bc65e69dcb0f8ba3dee89642d082265edab31
Author: Kousuke Saruta
AuthorDate: Fri Sep 3 23:25:18 2021 +0900

    [SPARK-36639][SQL] Fix an issue that sequence builtin function causes ArrayIndexOutOfBoundsException if the arguments are under the condition of start == stop && step < 0

    ### What changes were proposed in this pull request?
    This PR fixes an issue that the `sequence` builtin function causes `ArrayIndexOutOfBoundsException` if the arguments are under the condition of `start == stop && step < 0`.
    This is an example.
    ```
    SELECT sequence(timestamp'2021-08-31', timestamp'2021-08-31', -INTERVAL 1 month);
    21/09/02 04:14:42 ERROR SparkSQLDriver: Failed in [SELECT sequence(timestamp'2021-08-31', timestamp'2021-08-31', -INTERVAL 1 month)]
    java.lang.ArrayIndexOutOfBoundsException: 1
    ```
    Actually, this example succeeded before SPARK-31980 (#28819) was merged.

    ### Why are the changes needed?
    Bug fix.

    ### Does this PR introduce _any_ user-facing change?
    No.

    ### How was this patch tested?
    New tests.

    Closes #33895 from sarutak/fix-sequence-issue.

    Authored-by: Kousuke Saruta
    Signed-off-by: Kousuke Saruta
---
 .../catalyst/expressions/collectionOperations.scala |  4 ++--
 .../expressions/CollectionExpressionsSuite.scala    | 18 ++
 2 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
index 6cbab86..ce17231 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
@@ -2903,7 +2903,7 @@ object Sequence {
       val maxEstimatedArrayLength =
         getSequenceLength(startMicros, stopMicros, input3, intervalStepInMicros)

-      val stepSign = if (stopMicros >= startMicros) +1 else -1
+      val stepSign = if (intervalStepInMicros > 0) +1 else -1
       val exclusiveItem = stopMicros + stepSign
       val arr = new Array[T](maxEstimatedArrayLength)
       var t = startMicros
@@ -2989,7 +2989,7 @@ object Sequence {
          |
          |  $sequenceLengthCode
          |
-         |  final int $stepSign = $stopMicros >= $startMicros ? +1 : -1;
+         |  final int $stepSign = $intervalInMicros > 0 ? +1 : -1;
          |  final long $exclusiveItem = $stopMicros + $stepSign;
          |
          |  $arr = new $elemType[$arrLength];
diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala
index 8f35cf3..688ee61 100644
--- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala
+++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala
@@ -2249,6 +2249,24 @@ class CollectionExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper
       Seq(Date.valueOf("2018-01-01")))
   }

+  test("SPARK-36639: Start and end equal in month range with a negative step") {
+    checkEvaluation(new Sequence(
+      Literal(Date.valueOf("2018-01-01")),
+      Literal(Date.valueOf("2018-01-01")),
+      Literal(stringToInterval("interval -1 day"))),
+      Seq(Date.valueOf("2018-01-01")))
+    checkEvaluation(new Sequence(
+      Literal(Date.valueOf("2018-01-01")),
+      Literal(Date.valueOf("2018-01-01")),
+      Literal(stringToInterval("interval -1 month"))),
+      Seq(Date.valueOf("2018-01-01")))
+    checkEvaluation(new Sequence(
+      Literal(Date.valueOf("2018-01-01")),
+      Literal(Date.valueOf("2018-01-01")),
+      Literal(stringToInterval("interval -1 year"))),
+      Seq(Date.valueOf("2018-01-01")))
+  }
+
   test("SPARK-33386: element_at ArrayIndexOutOfBoundsException") {
     Seq(true, false).foreach { ansiEnabled =>
       withSQLConf(SQLConf.ANSI_ENABLED.key -> ansiEnabled.toString) {
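[Editor's note] For readers who want to verify the fix (this commit and the branch-3.1/branch-3.2 backports above), a small sketch based on the repro in the commit message — it assumes an active SparkSession named `spark` on a build containing this change:

```python
# Sketch: the repro from the commit message, run through the Python SQL API.
# Before this fix, start == stop with a negative interval step raised
# java.lang.ArrayIndexOutOfBoundsException; afterwards it returns the
# single start element.
spark.sql(
    "SELECT sequence(timestamp'2021-08-31', timestamp'2021-08-31', -INTERVAL 1 month)"
).show(truncate=False)
# Expected after the fix: a one-element array containing 2021-08-31 00:00:00
```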
[spark] branch master updated: [SPARK-36609][PYTHON] Add `errors` argument for `ps.to_numeric`
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 61b3223  [SPARK-36609][PYTHON] Add `errors` argument for `ps.to_numeric`
61b3223 is described below

commit 61b3223f475c9cb8265807fbbe51563db699f7ca
Author: itholic
AuthorDate: Fri Sep 3 21:33:30 2021 +0900

    [SPARK-36609][PYTHON] Add `errors` argument for `ps.to_numeric`

    ### What changes were proposed in this pull request?
    This PR proposes to support the `errors` argument for `ps.to_numeric`, as pandas does. Note that we don't support `ignore` when `arg` is a pandas-on-Spark Series for now.

    ### Why are the changes needed?
    We should match the behavior to pandas' as much as possible. Also, in the [recent blog post](https://medium.com/chuck.connell.3/pandas-on-databricks-via-koalas-a-review-9876b0a92541), the author pointed out that we're missing this feature. It seems to be the kind of feature that is commonly used in data science.

    ### Does this PR introduce _any_ user-facing change?
    Now the `errors` argument is available for `ps.to_numeric`.

    ### How was this patch tested?
    Unittests.

    Closes #33882 from itholic/SPARK-36609.

    Lead-authored-by: itholic
    Co-authored-by: Hyukjin Kwon
    Co-authored-by: Haejoon Lee <44108233+itho...@users.noreply.github.com>
    Signed-off-by: Hyukjin Kwon
---
 python/pyspark/pandas/namespace.py            | 28 +++---
 python/pyspark/pandas/tests/test_namespace.py | 48 +++
 2 files changed, 72 insertions(+), 4 deletions(-)

diff --git a/python/pyspark/pandas/namespace.py b/python/pyspark/pandas/namespace.py
index e964263..d0e4aba 100644
--- a/python/pyspark/pandas/namespace.py
+++ b/python/pyspark/pandas/namespace.py
@@ -2747,13 +2747,20 @@ def merge(


 @no_type_check
-def to_numeric(arg):
+def to_numeric(arg, errors="raise"):
     """
     Convert argument to a numeric type.

     Parameters
     ----------
     arg : scalar, list, tuple, 1-d array, or Series
+        Argument to be converted.
+    errors : {'raise', 'coerce'}, default 'raise'
+        * If 'coerce', then invalid parsing will be set as NaN.
+        * If 'raise', then invalid parsing will raise an exception.
+        * If 'ignore', then invalid parsing will return the input.
+
+        .. note:: 'ignore' doesn't work yet when `arg` is pandas-on-Spark Series.

     Returns
     -------
@@ -2783,6 +2790,7 @@ def to_numeric(arg):
     dtype: float32

     If given Series contains invalid value to cast float, just cast it to `np.nan`
+    when `errors` is set to "coerce".

     >>> psser = ps.Series(['apple', '1.0', '2', '-3'])
     >>> psser
@@ -2792,7 +2800,7 @@ def to_numeric(arg):
     3    -3
     dtype: object

-    >>> ps.to_numeric(psser)
+    >>> ps.to_numeric(psser, errors="coerce")
     0    NaN
     1    1.0
     2    2.0
@@ -2814,9 +2822,21 @@ def to_numeric(arg):
     1.0
     """
     if isinstance(arg, Series):
-        return arg._with_new_scol(arg.spark.column.cast("float"))
+        if errors == "coerce":
+            return arg._with_new_scol(arg.spark.column.cast("float"))
+        elif errors == "raise":
+            scol = arg.spark.column
+            scol_casted = scol.cast("float")
+            cond = F.when(
+                F.assert_true(scol.isNull() | scol_casted.isNotNull()).isNull(), scol_casted
+            )
+            return arg._with_new_scol(cond)
+        elif errors == "ignore":
+            raise NotImplementedError("'ignore' is not implemented yet, when the `arg` is Series.")
+        else:
+            raise ValueError("invalid error value specified")
     else:
-        return pd.to_numeric(arg)
+        return pd.to_numeric(arg, errors=errors)

diff --git a/python/pyspark/pandas/tests/test_namespace.py b/python/pyspark/pandas/tests/test_namespace.py
index 76a269d..29578a9 100644
--- a/python/pyspark/pandas/tests/test_namespace.py
+++ b/python/pyspark/pandas/tests/test_namespace.py
@@ -336,6 +336,54 @@ class NamespaceTest(PandasOnSparkTestCase, SQLTestUtils):
             lambda: read_delta("fake_path", version="0", timestamp="2021-06-22"),
         )

+    def test_to_numeric(self):
+        pser = pd.Series(["1", "2", None, "4", "hello"])
+        psser = ps.from_pandas(pser)
+
+        # "coerce" and "raise" with Series that contains un-parsable data.
+        self.assert_eq(
+            pd.to_numeric(pser, errors="coerce"), ps.to_numeric(psser, errors="coerce"), almost=True
+        )
+
+        # "raise" with Series that contains parsable data only.
+        pser = pd.Series(["1", "2", None, "4", "5.0"])
+        psser = ps.from_pandas(pser)
+
+        self.assert_eq(
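[Editor's note] A brief sketch of the new `errors` argument in use, distilled from the docstring and tests above — it assumes pyspark.pandas is importable as `ps`, and the way the `errors="raise"` assertion surfaces at evaluation time is an assumption, hedged in the comments:

```python
# Sketch of ps.to_numeric with the `errors` argument added by this commit.
import pyspark.pandas as ps

psser = ps.Series(["apple", "1.0", "2", "-3"])

# errors="coerce": un-parsable values become NaN (the previous behaviour).
print(ps.to_numeric(psser, errors="coerce"))

# errors="raise" (the new default): the implementation guards the cast with
# an assert_true check, so un-parsable values should fail once the result is
# actually evaluated (e.g. on conversion to pandas).
try:
    ps.to_numeric(psser, errors="raise").to_pandas()
except Exception as exc:  # assumed to surface as a Spark-side exception
    print(type(exc).__name__)

# errors="ignore" is not implemented yet for pandas-on-Spark Series.
```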
[spark] branch master updated (d3e3df1 -> 7f1ad7b)
This is an automated email from the ASF dual-hosted git repository.

yao pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from d3e3df1  [SPARK-36644][SQL] Push down boolean column filter
     add 7f1ad7b  [SPARK-36659][SQL] Promote spark.sql.execution.topKSortFallbackThreshold to a user-facing config

No new revisions were added by this update.

Summary of changes:
 sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala | 1 -
 1 file changed, 1 deletion(-)
[spark] branch master updated: [SPARK-36644][SQL] Push down boolean column filter
This is an automated email from the ASF dual-hosted git repository.

dbtsai pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new d3e3df1  [SPARK-36644][SQL] Push down boolean column filter
d3e3df1 is described below

commit d3e3df17aac577e163cab7e085624f94f07c748e
Author: Kazuyuki Tanimura
AuthorDate: Fri Sep 3 07:39:14 2021 +0000

    [SPARK-36644][SQL] Push down boolean column filter

    ### What changes were proposed in this pull request?
    This PR proposes to improve `DataSourceStrategy` to be able to push down boolean column filters. Currently boolean column filters do not get pushed down and may cause unnecessary IO.

    ### Why are the changes needed?
    The following query does not push down the filter in the current implementation
    ```
    SELECT * FROM t WHERE boolean_field
    ```
    although the following query pushes down the filter as expected.
    ```
    SELECT * FROM t WHERE boolean_field = true
    ```
    This is because the Physical Planner (`DataSourceStrategy`) currently only pushes down limited expression patterns like `EqualTo`. It is fair for Spark SQL users to expect `boolean_field` to perform the same as `boolean_field = true`.

    ### Does this PR introduce _any_ user-facing change?
    No.

    ### How was this patch tested?
    Added unit tests
    ```
    build/sbt "core/testOnly *DataSourceStrategySuite -- -z SPARK-36644"
    ```

    Closes #33898 from kazuyukitanimura/SPARK-36644.

    Authored-by: Kazuyuki Tanimura
    Signed-off-by: DB Tsai
---
 .../apache/spark/sql/execution/datasources/DataSourceStrategy.scala | 3 +++
 .../spark/sql/execution/datasources/DataSourceStrategySuite.scala   | 4
 2 files changed, 7 insertions(+)

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala
index 7a5c343..30818b1 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala
@@ -552,6 +552,9 @@ object DataSourceStrategy
     case expressions.Literal(false, BooleanType) =>
       Some(sources.AlwaysFalse)

+    case e @ pushableColumn(name) if e.dataType.isInstanceOf[BooleanType] =>
+      Some(sources.EqualTo(name, true))
+
     case _ => None
   }

diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategySuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategySuite.scala
index b94918e..37fe3c2 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategySuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategySuite.scala
@@ -311,6 +311,10 @@ class DataSourceStrategySuite extends PlanTest with SharedSparkSession {
     assert(PushableColumnAndNestedColumn.unapply(Abs('col.int)) === None)
   }

+  test("SPARK-36644: Push down boolean column filter") {
+    testTranslateFilter('col.boolean, Some(sources.EqualTo("col", true)))
+  }
+
   /**
    * Translate the given Catalyst [[Expression]] into data source [[sources.Filter]]
    * then verify against the given [[sources.Filter]].
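[Editor's note] To see the effect end to end, a hedged sketch — not part of the commit; the dataset, path, and column names are hypothetical, and `spark` is assumed to be an active SparkSession on a build that includes this change:

```python
# Sketch: checking that a bare boolean predicate is now pushed to the source.
# 'is_active' and '/tmp/events' are hypothetical names chosen for illustration.
df = spark.range(10).selectExpr("id", "id % 2 = 0 AS is_active")
df.write.mode("overwrite").parquet("/tmp/events")

events = spark.read.parquet("/tmp/events")

# With this change, `WHERE is_active` translates to EqualTo(is_active, true),
# so the scan node in the physical plan should list it under PushedFilters,
# just like the explicit `is_active = true` form.
events.filter("is_active").explain()
events.filter("is_active = true").explain()
```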