[spark] branch master updated (9b262e7 -> 019203c)
This is an automated email from the ASF dual-hosted git repository.

ueshin pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from 9b262e7  [SPARK-36401][PYTHON] Implement Series.cov
     add 019203c  [SPARK-36655][PYTHON] Add `versionadded` for API added in Spark 3.3.0

No new revisions were added by this update.

Summary of changes:
 python/pyspark/pandas/frame.py | 2 ++
 1 file changed, 2 insertions(+)
[spark] branch master updated: [SPARK-36401][PYTHON] Implement Series.cov
This is an automated email from the ASF dual-hosted git repository.

ueshin pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 9b262e7  [SPARK-36401][PYTHON] Implement Series.cov
9b262e7 is described below

commit 9b262e722d717bebf7d2aa19af0658cf3b8e1e85
Author: dgd-contributor
AuthorDate: Fri Sep 3 10:41:27 2021 -0700

    [SPARK-36401][PYTHON] Implement Series.cov

    ### What changes were proposed in this pull request?
    Implement Series.cov

    ### Why are the changes needed?
    That is supported in pandas. We should support that as well.

    ### Does this PR introduce _any_ user-facing change?
    Yes. Series.cov can be used.

    ```python
    >>> from pyspark.pandas.config import set_option, reset_option
    >>> set_option("compute.ops_on_diff_frames", True)
    >>> s1 = ps.Series([0.90010907, 0.13484424, 0.62036035])
    >>> s2 = ps.Series([0.12528585, 0.26962463, 0.5198])
    >>> s1.cov(s2)
    -0.016857626527158744
    >>> reset_option("compute.ops_on_diff_frames")
    ```

    ### How was this patch tested?
    Unit tests

    Closes #33752 from dgd-contributor/SPARK-36401_Implement_Series.cov.

    Authored-by: dgd-contributor
    Signed-off-by: Takuya UESHIN
---
 .../source/reference/pyspark.pandas/series.rst | 1 +
 python/pyspark/pandas/missing/series.py        | 1 -
 python/pyspark/pandas/series.py                | 50 ++
 .../pandas/tests/test_ops_on_diff_frames.py    | 37
 python/pyspark/pandas/tests/test_series.py     | 45 +++
 5 files changed, 133 insertions(+), 1 deletion(-)

diff --git a/python/docs/source/reference/pyspark.pandas/series.rst b/python/docs/source/reference/pyspark.pandas/series.rst
index 1e32773..337f686 100644
--- a/python/docs/source/reference/pyspark.pandas/series.rst
+++ b/python/docs/source/reference/pyspark.pandas/series.rst
@@ -138,6 +138,7 @@ Computations / Descriptive Stats
    Series.clip
    Series.corr
    Series.count
+   Series.cov
    Series.cummax
    Series.cummin
    Series.cumsum
diff --git a/python/pyspark/pandas/missing/series.py b/python/pyspark/pandas/missing/series.py
index ef3f38b..f6ea8d3 100644
--- a/python/pyspark/pandas/missing/series.py
+++ b/python/pyspark/pandas/missing/series.py
@@ -40,7 +40,6 @@ class MissingPandasLikeSeries(object):
     autocorr = _unsupported_function("autocorr")
     combine = _unsupported_function("combine")
     convert_dtypes = _unsupported_function("convert_dtypes")
-    cov = _unsupported_function("cov")
     ewm = _unsupported_function("ewm")
     infer_objects = _unsupported_function("infer_objects")
     interpolate = _unsupported_function("interpolate")
diff --git a/python/pyspark/pandas/series.py b/python/pyspark/pandas/series.py
index 568754c..7e45566 100644
--- a/python/pyspark/pandas/series.py
+++ b/python/pyspark/pandas/series.py
@@ -944,6 +944,56 @@ class Series(Frame, IndexOpsMixin, Generic[T]):

         return lmask & rmask

+    def cov(self, other: "Series", min_periods: Optional[int] = None) -> float:
+        """
+        Compute covariance with Series, excluding missing values.
+
+        .. versionadded:: 3.3.0
+
+        Parameters
+        ----------
+        other : Series
+            Series with which to compute the covariance.
+        min_periods : int, optional
+            Minimum number of observations needed to have a valid result.
+
+        Returns
+        -------
+        float
+            Covariance between Series and other
+
+        Examples
+        --------
+        >>> from pyspark.pandas.config import set_option, reset_option
+        >>> set_option("compute.ops_on_diff_frames", True)
+        >>> s1 = ps.Series([0.90010907, 0.13484424, 0.62036035])
+        >>> s2 = ps.Series([0.12528585, 0.26962463, 0.5198])
+        >>> s1.cov(s2)
+        -0.016857626527158744
+        >>> reset_option("compute.ops_on_diff_frames")
+        """
+        if not isinstance(other, Series):
+            raise TypeError("unsupported type: %s" % type(other))
+        if not np.issubdtype(self.dtype, np.number):
+            raise TypeError("unsupported dtype: %s" % self.dtype)
+        if not np.issubdtype(other.dtype, np.number):
+            raise TypeError("unsupported dtype: %s" % other.dtype)
+
+        min_periods = 1 if min_periods is None else min_periods
+
+        if same_anchor(self, other):
+            sdf = self._internal.spark_frame.select(self.spark.column, other.spark.column)
+        else:
+            combined = combine_frames(self.to_frame(), other.to_frame())
+            sdf = combined._internal.spark_frame.select(*combined._internal.data_spark_columns)
+
+        sdf = sdf.dropna()
+
+        if len(sdf.head(min_periods)) < min_periods:
+            return np.nan
+        else:
+            return
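[Editor's note] For readers of this digest, a minimal usage sketch of the new API — not part of the commit; it assumes a running SparkSession with pyspark.pandas importable, and the min_periods behavior shown follows the docstring above:

```python
# Minimal sketch of Series.cov, added in Spark 3.3.0 by this commit.
import pyspark.pandas as ps
from pyspark.pandas.config import set_option, reset_option

# Comparing two Series that are not backed by the same DataFrame
# requires ops_on_diff_frames, exactly as in the doctest above.
set_option("compute.ops_on_diff_frames", True)

s1 = ps.Series([0.90010907, 0.13484424, 0.62036035])
s2 = ps.Series([0.12528585, 0.26962463, 0.5198])

print(s1.cov(s2))                  # covariance over the 3 paired observations
print(s1.cov(s2, min_periods=4))   # fewer rows than min_periods -> nan

reset_option("compute.ops_on_diff_frames")
```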
[spark] branch branch-3.2 updated: [SPARK-36659][SQL] Promote spark.sql.execution.topKSortFallbackThreshold to a user-facing config
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.2 by this push:
     new aa96a37  [SPARK-36659][SQL] Promote spark.sql.execution.topKSortFallbackThreshold to a user-facing config
aa96a37 is described below

commit aa96a374b2f8b0bf4b004792d6b7729e27e349e5
Author: Kent Yao
AuthorDate: Fri Sep 3 19:11:37 2021 +0800

    [SPARK-36659][SQL] Promote spark.sql.execution.topKSortFallbackThreshold to a user-facing config

    ### What changes were proposed in this pull request?
    Promote spark.sql.execution.topKSortFallbackThreshold to a user-facing config.

    ### Why are the changes needed?
    spark.sql.execution.topKSortFallbackThreshold is currently an internal config hidden from users, with Integer.MAX_VALUE - 15 as its default. In many real-world cases, if the K is very big, there would be performance issues. It's better to leave this choice to users.

    ### Does this PR introduce _any_ user-facing change?
    spark.sql.execution.topKSortFallbackThreshold is now user-facing.

    ### How was this patch tested?
    Passing GA.

    Closes #33904 from yaooqinn/SPARK-36659.

    Authored-by: Kent Yao
    Signed-off-by: Kent Yao
    (cherry picked from commit 7f1ad7be18b07d880ba92c470e64bf7458b8366a)
    Signed-off-by: Dongjoon Hyun
---
 sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala | 1 -
 1 file changed, 1 deletion(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
index b137fd2..11b3097 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
@@ -2637,7 +2637,6 @@ object SQLConf {

   val TOP_K_SORT_FALLBACK_THRESHOLD = buildConf("spark.sql.execution.topKSortFallbackThreshold")
-    .internal()
     .doc("In SQL queries with a SORT followed by a LIMIT like " +
       "'SELECT x FROM t ORDER BY y LIMIT m', if m is under this threshold, do a top-K sort" +
       " in memory, otherwise do a global sort which spills to disk if necessary.")
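[Editor's note] A short sketch of what becomes possible once the config is user-facing — not part of the commit; table `t` and columns `x`, `y` are hypothetical, and `spark` is assumed to be an active SparkSession:

```python
# Sketch: lowering the top-K fallback threshold so that very large LIMITs
# fall back to a global (spillable) sort instead of an in-memory top-K sort.
spark.conf.set("spark.sql.execution.topKSortFallbackThreshold", "10000")

# LIMIT below the threshold: expect a top-K sort (TakeOrderedAndProject) in memory.
spark.sql("SELECT x FROM t ORDER BY y LIMIT 100").explain()

# LIMIT above the threshold: expect a global sort followed by a limit,
# which can spill to disk instead of holding K rows per task in memory.
spark.sql("SELECT x FROM t ORDER BY y LIMIT 1000000").explain()
```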
[spark] branch branch-3.1 updated: [SPARK-36639][SQL] Fix an issue that sequence builtin function causes ArrayIndexOutOfBoundsException if the arguments are under the condition of start == stop && step < 0
This is an automated email from the ASF dual-hosted git repository.

sarutak pushed a commit to branch branch-3.1
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.1 by this push:
     new c1f8d75  [SPARK-36639][SQL] Fix an issue that sequence builtin function causes ArrayIndexOutOfBoundsException if the arguments are under the condition of start == stop && step < 0
c1f8d75 is described below

commit c1f8d759a3d75885e694e8c468ee6beea70131a3
Author: Kousuke Saruta
AuthorDate: Fri Sep 3 23:25:18 2021 +0900

    [SPARK-36639][SQL] Fix an issue that sequence builtin function causes ArrayIndexOutOfBoundsException if the arguments are under the condition of start == stop && step < 0

    ### What changes were proposed in this pull request?
    This PR fixes an issue that the `sequence` builtin function causes `ArrayIndexOutOfBoundsException` if the arguments are under the condition of `start == stop && step < 0`.
    This is an example.
    ```
    SELECT sequence(timestamp'2021-08-31', timestamp'2021-08-31', -INTERVAL 1 month);
    21/09/02 04:14:42 ERROR SparkSQLDriver: Failed in [SELECT sequence(timestamp'2021-08-31', timestamp'2021-08-31', -INTERVAL 1 month)]
    java.lang.ArrayIndexOutOfBoundsException: 1
    ```
    Actually, this example succeeded before SPARK-31980 (#28819) was merged.

    ### Why are the changes needed?
    Bug fix.

    ### Does this PR introduce _any_ user-facing change?
    No.

    ### How was this patch tested?
    New tests.

    Closes #33895 from sarutak/fix-sequence-issue.

    Authored-by: Kousuke Saruta
    Signed-off-by: Kousuke Saruta
    (cherry picked from commit cf3bc65e69dcb0f8ba3dee89642d082265edab31)
    Signed-off-by: Kousuke Saruta
---
 .../catalyst/expressions/collectionOperations.scala |  4 ++--
 .../expressions/CollectionExpressionsSuite.scala    | 18 ++
 2 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
index b341895..bb2163c 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
@@ -2711,7 +2711,7 @@ object Sequence {
       val maxEstimatedArrayLength =
         getSequenceLength(startMicros, stopMicros, intervalStepInMicros)

-      val stepSign = if (stopMicros >= startMicros) +1 else -1
+      val stepSign = if (intervalStepInMicros > 0) +1 else -1
       val exclusiveItem = stopMicros + stepSign
       val arr = new Array[T](maxEstimatedArrayLength)
       var t = startMicros
@@ -2786,7 +2786,7 @@ object Sequence {
          |
          |  $sequenceLengthCode
          |
-         |  final int $stepSign = $stopMicros >= $startMicros ? +1 : -1;
+         |  final int $stepSign = $intervalInMicros > 0 ? +1 : -1;
          |  final long $exclusiveItem = $stopMicros + $stepSign;
          |
          |  $arr = new $elemType[$arrLength];
diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala
index 095894b..d79f06f 100644
--- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala
+++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala
@@ -1888,6 +1888,24 @@ class CollectionExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper
       Seq(Date.valueOf("2018-01-01")))
   }

+  test("SPARK-36639: Start and end equal in month range with a negative step") {
+    checkEvaluation(new Sequence(
+      Literal(Date.valueOf("2018-01-01")),
+      Literal(Date.valueOf("2018-01-01")),
+      Literal(stringToInterval("interval -1 day"))),
+      Seq(Date.valueOf("2018-01-01")))
+    checkEvaluation(new Sequence(
+      Literal(Date.valueOf("2018-01-01")),
+      Literal(Date.valueOf("2018-01-01")),
+      Literal(stringToInterval("interval -1 month"))),
+      Seq(Date.valueOf("2018-01-01")))
+    checkEvaluation(new Sequence(
+      Literal(Date.valueOf("2018-01-01")),
+      Literal(Date.valueOf("2018-01-01")),
+      Literal(stringToInterval("interval -1 year"))),
+      Seq(Date.valueOf("2018-01-01")))
+  }
+
   test("SPARK-33386: element_at ArrayIndexOutOfBoundsException") {
     Seq(true, false).foreach { ansiEnabled =>
       withSQLConf(SQLConf.ANSI_ENABLED.key -> ansiEnabled.toString) {
[spark] branch branch-3.2 updated: [SPARK-36639][SQL] Fix an issue that sequence builtin function causes ArrayIndexOutOfBoundsException if the arguments are under the condition of start == stop && step < 0
This is an automated email from the ASF dual-hosted git repository.

sarutak pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.2 by this push:
     new a3901ed  [SPARK-36639][SQL] Fix an issue that sequence builtin function causes ArrayIndexOutOfBoundsException if the arguments are under the condition of start == stop && step < 0
a3901ed is described below

commit a3901ed3848d21fd36bb5aa265ef8e8d74d8e324
Author: Kousuke Saruta
AuthorDate: Fri Sep 3 23:25:18 2021 +0900

    [SPARK-36639][SQL] Fix an issue that sequence builtin function causes ArrayIndexOutOfBoundsException if the arguments are under the condition of start == stop && step < 0

    ### What changes were proposed in this pull request?
    This PR fixes an issue that the `sequence` builtin function causes `ArrayIndexOutOfBoundsException` if the arguments are under the condition of `start == stop && step < 0`.
    This is an example.
    ```
    SELECT sequence(timestamp'2021-08-31', timestamp'2021-08-31', -INTERVAL 1 month);
    21/09/02 04:14:42 ERROR SparkSQLDriver: Failed in [SELECT sequence(timestamp'2021-08-31', timestamp'2021-08-31', -INTERVAL 1 month)]
    java.lang.ArrayIndexOutOfBoundsException: 1
    ```
    Actually, this example succeeded before SPARK-31980 (#28819) was merged.

    ### Why are the changes needed?
    Bug fix.

    ### Does this PR introduce _any_ user-facing change?
    No.

    ### How was this patch tested?
    New tests.

    Closes #33895 from sarutak/fix-sequence-issue.

    Authored-by: Kousuke Saruta
    Signed-off-by: Kousuke Saruta
    (cherry picked from commit cf3bc65e69dcb0f8ba3dee89642d082265edab31)
    Signed-off-by: Kousuke Saruta
---
 .../catalyst/expressions/collectionOperations.scala |  4 ++--
 .../expressions/CollectionExpressionsSuite.scala    | 18 ++
 2 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
index 6cbab86..ce17231 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
@@ -2903,7 +2903,7 @@ object Sequence {
       val maxEstimatedArrayLength =
         getSequenceLength(startMicros, stopMicros, input3, intervalStepInMicros)

-      val stepSign = if (stopMicros >= startMicros) +1 else -1
+      val stepSign = if (intervalStepInMicros > 0) +1 else -1
       val exclusiveItem = stopMicros + stepSign
       val arr = new Array[T](maxEstimatedArrayLength)
       var t = startMicros
@@ -2989,7 +2989,7 @@ object Sequence {
          |
          |  $sequenceLengthCode
          |
-         |  final int $stepSign = $stopMicros >= $startMicros ? +1 : -1;
+         |  final int $stepSign = $intervalInMicros > 0 ? +1 : -1;
          |  final long $exclusiveItem = $stopMicros + $stepSign;
          |
          |  $arr = new $elemType[$arrLength];
diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala
index caa5e96..e8f5f07 100644
--- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala
+++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala
@@ -2232,6 +2232,24 @@ class CollectionExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper
       Seq(Date.valueOf("2018-01-01")))
   }

+  test("SPARK-36639: Start and end equal in month range with a negative step") {
+    checkEvaluation(new Sequence(
+      Literal(Date.valueOf("2018-01-01")),
+      Literal(Date.valueOf("2018-01-01")),
+      Literal(stringToInterval("interval -1 day"))),
+      Seq(Date.valueOf("2018-01-01")))
+    checkEvaluation(new Sequence(
+      Literal(Date.valueOf("2018-01-01")),
+      Literal(Date.valueOf("2018-01-01")),
+      Literal(stringToInterval("interval -1 month"))),
+      Seq(Date.valueOf("2018-01-01")))
+    checkEvaluation(new Sequence(
+      Literal(Date.valueOf("2018-01-01")),
+      Literal(Date.valueOf("2018-01-01")),
+      Literal(stringToInterval("interval -1 year"))),
+      Seq(Date.valueOf("2018-01-01")))
+  }
+
   test("SPARK-33386: element_at ArrayIndexOutOfBoundsException") {
     Seq(true, false).foreach { ansiEnabled =>
       withSQLConf(SQLConf.ANSI_ENABLED.key -> ansiEnabled.toString) {
[spark] branch master updated: [SPARK-36639][SQL] Fix an issue that sequence builtin function causes ArrayIndexOutOfBoundsException if the arguments are under the condition of start == stop && step < 0
This is an automated email from the ASF dual-hosted git repository.

sarutak pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new cf3bc65  [SPARK-36639][SQL] Fix an issue that sequence builtin function causes ArrayIndexOutOfBoundsException if the arguments are under the condition of start == stop && step < 0
cf3bc65 is described below

commit cf3bc65e69dcb0f8ba3dee89642d082265edab31
Author: Kousuke Saruta
AuthorDate: Fri Sep 3 23:25:18 2021 +0900

    [SPARK-36639][SQL] Fix an issue that sequence builtin function causes ArrayIndexOutOfBoundsException if the arguments are under the condition of start == stop && step < 0

    ### What changes were proposed in this pull request?
    This PR fixes an issue that the `sequence` builtin function causes `ArrayIndexOutOfBoundsException` if the arguments are under the condition of `start == stop && step < 0`.
    This is an example.
    ```
    SELECT sequence(timestamp'2021-08-31', timestamp'2021-08-31', -INTERVAL 1 month);
    21/09/02 04:14:42 ERROR SparkSQLDriver: Failed in [SELECT sequence(timestamp'2021-08-31', timestamp'2021-08-31', -INTERVAL 1 month)]
    java.lang.ArrayIndexOutOfBoundsException: 1
    ```
    Actually, this example succeeded before SPARK-31980 (#28819) was merged.

    ### Why are the changes needed?
    Bug fix.

    ### Does this PR introduce _any_ user-facing change?
    No.

    ### How was this patch tested?
    New tests.

    Closes #33895 from sarutak/fix-sequence-issue.

    Authored-by: Kousuke Saruta
    Signed-off-by: Kousuke Saruta
---
 .../catalyst/expressions/collectionOperations.scala |  4 ++--
 .../expressions/CollectionExpressionsSuite.scala    | 18 ++
 2 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
index 6cbab86..ce17231 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
@@ -2903,7 +2903,7 @@ object Sequence {
       val maxEstimatedArrayLength =
         getSequenceLength(startMicros, stopMicros, input3, intervalStepInMicros)

-      val stepSign = if (stopMicros >= startMicros) +1 else -1
+      val stepSign = if (intervalStepInMicros > 0) +1 else -1
       val exclusiveItem = stopMicros + stepSign
       val arr = new Array[T](maxEstimatedArrayLength)
       var t = startMicros
@@ -2989,7 +2989,7 @@ object Sequence {
          |
          |  $sequenceLengthCode
          |
-         |  final int $stepSign = $stopMicros >= $startMicros ? +1 : -1;
+         |  final int $stepSign = $intervalInMicros > 0 ? +1 : -1;
          |  final long $exclusiveItem = $stopMicros + $stepSign;
          |
          |  $arr = new $elemType[$arrLength];
diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala
index 8f35cf3..688ee61 100644
--- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala
+++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala
@@ -2249,6 +2249,24 @@ class CollectionExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper
       Seq(Date.valueOf("2018-01-01")))
   }

+  test("SPARK-36639: Start and end equal in month range with a negative step") {
+    checkEvaluation(new Sequence(
+      Literal(Date.valueOf("2018-01-01")),
+      Literal(Date.valueOf("2018-01-01")),
+      Literal(stringToInterval("interval -1 day"))),
+      Seq(Date.valueOf("2018-01-01")))
+    checkEvaluation(new Sequence(
+      Literal(Date.valueOf("2018-01-01")),
+      Literal(Date.valueOf("2018-01-01")),
+      Literal(stringToInterval("interval -1 month"))),
+      Seq(Date.valueOf("2018-01-01")))
+    checkEvaluation(new Sequence(
+      Literal(Date.valueOf("2018-01-01")),
+      Literal(Date.valueOf("2018-01-01")),
+      Literal(stringToInterval("interval -1 year"))),
+      Seq(Date.valueOf("2018-01-01")))
+  }
+
   test("SPARK-33386: element_at ArrayIndexOutOfBoundsException") {
     Seq(true, false).foreach { ansiEnabled =>
       withSQLConf(SQLConf.ANSI_ENABLED.key -> ansiEnabled.toString) {
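[Editor's note] For readers who want to verify the fix (this commit and the branch-3.1/branch-3.2 backports above), a small sketch based on the repro in the commit message — it assumes an active SparkSession named `spark` on a build containing this change:

```python
# Sketch: the repro from the commit message, run through the Python SQL API.
# Before this fix, start == stop with a negative interval step raised
# java.lang.ArrayIndexOutOfBoundsException; afterwards it returns the
# single start element.
spark.sql(
    "SELECT sequence(timestamp'2021-08-31', timestamp'2021-08-31', -INTERVAL 1 month)"
).show(truncate=False)
# Expected after the fix: a one-element array containing 2021-08-31 00:00:00
```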
[spark] branch master updated: [SPARK-36609][PYTHON] Add `errors` argument for `ps.to_numeric`
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 61b3223  [SPARK-36609][PYTHON] Add `errors` argument for `ps.to_numeric`
61b3223 is described below

commit 61b3223f475c9cb8265807fbbe51563db699f7ca
Author: itholic
AuthorDate: Fri Sep 3 21:33:30 2021 +0900

    [SPARK-36609][PYTHON] Add `errors` argument for `ps.to_numeric`

    ### What changes were proposed in this pull request?
    This PR proposes to support the `errors` argument for `ps.to_numeric`, as pandas does. Note that we don't support `ignore` when `arg` is a pandas-on-Spark Series for now.

    ### Why are the changes needed?
    We should match the behavior to pandas' as much as possible. Also, in the [recent blog post](https://medium.com/chuck.connell.3/pandas-on-databricks-via-koalas-a-review-9876b0a92541), the author pointed out that we're missing this feature. It seems to be the kind of feature that is commonly used in data science.

    ### Does this PR introduce _any_ user-facing change?
    Now the `errors` argument is available for `ps.to_numeric`.

    ### How was this patch tested?
    Unittests.

    Closes #33882 from itholic/SPARK-36609.

    Lead-authored-by: itholic
    Co-authored-by: Hyukjin Kwon
    Co-authored-by: Haejoon Lee <44108233+itho...@users.noreply.github.com>
    Signed-off-by: Hyukjin Kwon
---
 python/pyspark/pandas/namespace.py            | 28 +++---
 python/pyspark/pandas/tests/test_namespace.py | 48 +++
 2 files changed, 72 insertions(+), 4 deletions(-)

diff --git a/python/pyspark/pandas/namespace.py b/python/pyspark/pandas/namespace.py
index e964263..d0e4aba 100644
--- a/python/pyspark/pandas/namespace.py
+++ b/python/pyspark/pandas/namespace.py
@@ -2747,13 +2747,20 @@ def merge(


 @no_type_check
-def to_numeric(arg):
+def to_numeric(arg, errors="raise"):
     """
     Convert argument to a numeric type.

     Parameters
     ----------
     arg : scalar, list, tuple, 1-d array, or Series
+        Argument to be converted.
+    errors : {'raise', 'coerce'}, default 'raise'
+        * If 'coerce', then invalid parsing will be set as NaN.
+        * If 'raise', then invalid parsing will raise an exception.
+        * If 'ignore', then invalid parsing will return the input.
+
+        .. note:: 'ignore' doesn't work yet when `arg` is pandas-on-Spark Series.

     Returns
     -------
@@ -2783,6 +2790,7 @@ def to_numeric(arg):
     dtype: float32

     If given Series contains invalid value to cast float, just cast it to `np.nan`
+    when `errors` is set to "coerce".

     >>> psser = ps.Series(['apple', '1.0', '2', '-3'])
     >>> psser
@@ -2792,7 +2800,7 @@ def to_numeric(arg):
     3    -3
     dtype: object

-    >>> ps.to_numeric(psser)
+    >>> ps.to_numeric(psser, errors="coerce")
     0    NaN
     1    1.0
     2    2.0
@@ -2814,9 +2822,21 @@ def to_numeric(arg):
     1.0
     """
     if isinstance(arg, Series):
-        return arg._with_new_scol(arg.spark.column.cast("float"))
+        if errors == "coerce":
+            return arg._with_new_scol(arg.spark.column.cast("float"))
+        elif errors == "raise":
+            scol = arg.spark.column
+            scol_casted = scol.cast("float")
+            cond = F.when(
+                F.assert_true(scol.isNull() | scol_casted.isNotNull()).isNull(), scol_casted
+            )
+            return arg._with_new_scol(cond)
+        elif errors == "ignore":
+            raise NotImplementedError("'ignore' is not implemented yet, when the `arg` is Series.")
+        else:
+            raise ValueError("invalid error value specified")
     else:
-        return pd.to_numeric(arg)
+        return pd.to_numeric(arg, errors=errors)

diff --git a/python/pyspark/pandas/tests/test_namespace.py b/python/pyspark/pandas/tests/test_namespace.py
index 76a269d..29578a9 100644
--- a/python/pyspark/pandas/tests/test_namespace.py
+++ b/python/pyspark/pandas/tests/test_namespace.py
@@ -336,6 +336,54 @@ class NamespaceTest(PandasOnSparkTestCase, SQLTestUtils):
             lambda: read_delta("fake_path", version="0", timestamp="2021-06-22"),
         )

+    def test_to_numeric(self):
+        pser = pd.Series(["1", "2", None, "4", "hello"])
+        psser = ps.from_pandas(pser)
+
+        # "coerce" and "raise" with Series that contains un-parsable data.
+        self.assert_eq(
+            pd.to_numeric(pser, errors="coerce"), ps.to_numeric(psser, errors="coerce"), almost=True
+        )
+
+        # "raise" with Series that contains parsable data only.
+        pser = pd.Series(["1", "2", None, "4", "5.0"])
+        psser = ps.from_pandas(pser)
+
+        self.assert_eq(
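[Editor's note] A brief sketch of the new `errors` argument in use, distilled from the docstring and tests above — it assumes pyspark.pandas is importable as `ps`, and the way the `errors="raise"` assertion surfaces at evaluation time is an assumption, hedged in the comments:

```python
# Sketch of ps.to_numeric with the `errors` argument added by this commit.
import pyspark.pandas as ps

psser = ps.Series(["apple", "1.0", "2", "-3"])

# errors="coerce": un-parsable values become NaN (the previous behaviour).
print(ps.to_numeric(psser, errors="coerce"))

# errors="raise" (the new default): the implementation guards the cast with
# an assert_true check, so un-parsable values should fail once the result is
# actually evaluated (e.g. on conversion to pandas).
try:
    ps.to_numeric(psser, errors="raise").to_pandas()
except Exception as exc:  # assumed to surface as a Spark-side exception
    print(type(exc).__name__)

# errors="ignore" is not implemented yet for pandas-on-Spark Series.
```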
[spark] branch master updated (d3e3df1 -> 7f1ad7b)
This is an automated email from the ASF dual-hosted git repository.

yao pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from d3e3df1  [SPARK-36644][SQL] Push down boolean column filter
     add 7f1ad7b  [SPARK-36659][SQL] Promote spark.sql.execution.topKSortFallbackThreshold to a user-facing config

No new revisions were added by this update.

Summary of changes:
 sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala | 1 -
 1 file changed, 1 deletion(-)
[spark] branch master updated: [SPARK-36644][SQL] Push down boolean column filter
This is an automated email from the ASF dual-hosted git repository.

dbtsai pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new d3e3df1  [SPARK-36644][SQL] Push down boolean column filter
d3e3df1 is described below

commit d3e3df17aac577e163cab7e085624f94f07c748e
Author: Kazuyuki Tanimura
AuthorDate: Fri Sep 3 07:39:14 2021 +0000

    [SPARK-36644][SQL] Push down boolean column filter

    ### What changes were proposed in this pull request?
    This PR proposes to improve `DataSourceStrategy` to be able to push down boolean column filters. Currently boolean column filters do not get pushed down and may cause unnecessary IO.

    ### Why are the changes needed?
    The following query does not push down the filter in the current implementation
    ```
    SELECT * FROM t WHERE boolean_field
    ```
    although the following query pushes down the filter as expected.
    ```
    SELECT * FROM t WHERE boolean_field = true
    ```
    This is because the Physical Planner (`DataSourceStrategy`) currently only pushes down limited expression patterns like `EqualTo`. It is fair for Spark SQL users to expect `boolean_field` to perform the same as `boolean_field = true`.

    ### Does this PR introduce _any_ user-facing change?
    No.

    ### How was this patch tested?
    Added unit tests
    ```
    build/sbt "core/testOnly *DataSourceStrategySuite -- -z SPARK-36644"
    ```

    Closes #33898 from kazuyukitanimura/SPARK-36644.

    Authored-by: Kazuyuki Tanimura
    Signed-off-by: DB Tsai
---
 .../apache/spark/sql/execution/datasources/DataSourceStrategy.scala | 3 +++
 .../spark/sql/execution/datasources/DataSourceStrategySuite.scala   | 4
 2 files changed, 7 insertions(+)

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala
index 7a5c343..30818b1 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala
@@ -552,6 +552,9 @@ object DataSourceStrategy
     case expressions.Literal(false, BooleanType) =>
       Some(sources.AlwaysFalse)

+    case e @ pushableColumn(name) if e.dataType.isInstanceOf[BooleanType] =>
+      Some(sources.EqualTo(name, true))
+
     case _ => None
   }

diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategySuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategySuite.scala
index b94918e..37fe3c2 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategySuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategySuite.scala
@@ -311,6 +311,10 @@ class DataSourceStrategySuite extends PlanTest with SharedSparkSession {
     assert(PushableColumnAndNestedColumn.unapply(Abs('col.int)) === None)
   }

+  test("SPARK-36644: Push down boolean column filter") {
+    testTranslateFilter('col.boolean, Some(sources.EqualTo("col", true)))
+  }
+
   /**
    * Translate the given Catalyst [[Expression]] into data source [[sources.Filter]]
    * then verify against the given [[sources.Filter]].
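[Editor's note] To see the effect end to end, a hedged sketch — not part of the commit; the dataset, path, and column names are hypothetical, and `spark` is assumed to be an active SparkSession on a build that includes this change:

```python
# Sketch: checking that a bare boolean predicate is now pushed to the source.
# 'is_active' and '/tmp/events' are hypothetical names chosen for illustration.
df = spark.range(10).selectExpr("id", "id % 2 = 0 AS is_active")
df.write.mode("overwrite").parquet("/tmp/events")

events = spark.read.parquet("/tmp/events")

# With this change, `WHERE is_active` translates to EqualTo(is_active, true),
# so the scan node in the physical plan should list it under PushedFilters,
# just like the explicit `is_active = true` form.
events.filter("is_active").explain()
events.filter("is_active = true").explain()
```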