This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 0915a666e7f [SPARK-39077][PYTHON] Implement `skipna` of common statistical functions of DataFrame and Series
0915a666e7f is described below

commit 0915a666e7f33b99bd607db354bdb395189b4e12
Author: Xinrong Meng <xinrong.m...@databricks.com>
AuthorDate: Tue May 10 11:31:38 2022 +0900

    [SPARK-39077][PYTHON] Implement `skipna` of common statistical functions of DataFrame and Series

    ### What changes were proposed in this pull request?
    Implement `skipna` of common statistical functions of DataFrame and Series, which include `sum / mean / product / min / max / std / sem / median / skew / kurtosis`.

    See decision details at https://docs.google.com/document/d/1IHUQkSVMPWiK8Jhe0GUtMHnDS6LB4_z9K2ktWmORSSg/edit#heading=h.iom65pc8gqiv.

    ### Why are the changes needed?
    With statistical functions standardized, pandas API coverage will be increased since missing parameters `skipna`s are implemented. That would further improve user adoption.

    ### Does this PR introduce _any_ user-facing change?
    Yes. `skipna` is supported in common statistical functions of DataFrame and Series.

    Take `sum` for example,
    ```py
    >>> psdf = ps.DataFrame({"a": [np.nan, np.nan, np.nan], "b": [1, np.nan, 2]})
    >>> psdf
        a    b
    0 NaN  1.0
    1 NaN  NaN
    2 NaN  2.0

    >>> psdf.sum(skipna=False)
    a   NaN
    b   NaN
    dtype: float64

    >>> psdf.sum(skipna=True)
    a    0.0
    b    3.0
    dtype: float64

    >>> psdf.b.sum(skipna=False)
    nan

    >>> psdf.b.sum(skipna=True)
    3.0
    ```

    ### How was this patch tested?
    Unit tests.

    Closes #36414 from xinrong-databricks/generic.skipna.

    Authored-by: Xinrong Meng <xinrong.m...@databricks.com>
    Signed-off-by: Hyukjin Kwon <gurwls...@apache.org>
---
 .../pandas_on_spark/supported_pandas_api.rst       |  46 +++---
 python/pyspark/pandas/config.py                    |   2 +-
 python/pyspark/pandas/frame.py                     |   8 +-
 python/pyspark/pandas/generic.py                   | 145 ++++++++++++++++++---
 python/pyspark/pandas/series.py                    |   9 +-
 .../pyspark/pandas/tests/test_generic_functions.py |  42 ++++++
 6 files changed, 207 insertions(+), 45 deletions(-)

diff --git a/python/docs/source/user_guide/pandas_on_spark/supported_pandas_api.rst b/python/docs/source/user_guide/pandas_on_spark/supported_pandas_api.rst
index d2ac0b78861..2373fa95d19 100644
--- a/python/docs/source/user_guide/pandas_on_spark/supported_pandas_api.rst
+++ b/python/docs/source/user_guide/pandas_on_spark/supported_pandas_api.rst
@@ -241,9 +241,9 @@ Supported DataFrame APIs
 +--------------------------------------------+-------------+--------------------------------------+
 | :func:`keys`                               | Y           |                                      |
 +--------------------------------------------+-------------+--------------------------------------+
-| :func:`kurt`                               | P           | ``skipna``, ``level``                |
+| :func:`kurt`                               | P           | ``level``                            |
 +--------------------------------------------+-------------+--------------------------------------+
-| :func:`kurtosis`                           | P           | ``skipna``, ``level``                |
+| :func:`kurtosis`                           | P           | ``level``                            |
 +--------------------------------------------+-------------+--------------------------------------+
 | :func:`last`                               | Y           |                                      |
 +--------------------------------------------+-------------+--------------------------------------+
@@ -262,11 +262,11 @@ Supported DataFrame APIs
 | :func:`mask`                               | P           | ``inplace``, ``axis``, ``level``,    |
 |                                            |             | ``errors``                           |
 +--------------------------------------------+-------------+--------------------------------------+
-| :func:`max`                                | P           | ``skipna``, ``level``                |
+| :func:`max`                                | P           | ``level``                            |
 +--------------------------------------------+-------------+--------------------------------------+
-| :func:`mean`                               | P           | ``skipna``, ``level``                |
+| :func:`mean`                               | P           | ``level``                            |
 +--------------------------------------------+-------------+--------------------------------------+
-| :func:`median`                             | P           | ``skipna``, ``level``                |
+| :func:`median`                             | P           | ``level``                            |
 +--------------------------------------------+-------------+--------------------------------------+
 | :func:`melt`                               | P           | ``col_level``, ``ignore_index``      |
 +--------------------------------------------+-------------+--------------------------------------+
@@ -275,7 +275,7 @@ Supported DataFrame APIs
 | :func:`merge`                              | P           | ``sort``, ``copy``, ``indicator``,   |
 |                                            |             | ``validate``                         |
 +--------------------------------------------+-------------+--------------------------------------+
-| :func:`min`                                | P           | ``skipna``, ``level``                |
+| :func:`min`                                | P           | ``level``                            |
 +--------------------------------------------+-------------+--------------------------------------+
 | :func:`mod`                                | P           | ``axis``, ``level``, ``fill_value``  |
 +--------------------------------------------+-------------+--------------------------------------+
@@ -335,9 +335,9 @@ Supported DataFrame APIs
 +--------------------------------------------+-------------+--------------------------------------+
 | :func:`pow`                                | P           | ``axis``, ``level``, ``fill_value``  |
 +--------------------------------------------+-------------+--------------------------------------+
-| :func:`prod`                               | P           | ``skipna``, ``level``                |
+| :func:`prod`                               | P           | ``level``                            |
 +--------------------------------------------+-------------+--------------------------------------+
-| :func:`product`                            | P           | ``skipna``, ``level``                |
+| :func:`product`                            | P           | ``level``                            |
 +--------------------------------------------+-------------+--------------------------------------+
 | :func:`quantile`                           | P           | ``interpolation``                    |
 +--------------------------------------------+-------------+--------------------------------------+
@@ -386,7 +386,7 @@ Supported DataFrame APIs
 +--------------------------------------------+-------------+--------------------------------------+
 | :func:`select_dtypes`                      | Y           |                                      |
 +--------------------------------------------+-------------+--------------------------------------+
-| :func:`sem`                                | P           | ``skipna``                           |
+| :func:`sem`                                | Y           |                                      |
 +--------------------------------------------+-------------+--------------------------------------+
 | set_axis                                   | N           |                                      |
 +--------------------------------------------+-------------+--------------------------------------+
@@ -400,7 +400,7 @@ Supported DataFrame APIs
 +--------------------------------------------+-------------+--------------------------------------+
 | :func:`size`                               | Y           |                                      |
 +--------------------------------------------+-------------+--------------------------------------+
-| :func:`skew`                               | P           | ``skipna``, ``level``                |
+| :func:`skew`                               | P           | ``level``                            |
 +--------------------------------------------+-------------+--------------------------------------+
 | slice_shift                                | N           |                                      |
 +--------------------------------------------+-------------+--------------------------------------+
@@ -415,7 +415,7 @@ Supported DataFrame APIs
 +--------------------------------------------+-------------+--------------------------------------+
 | :func:`stack`                              | P           | ``level``, ``dropna``                |
 +--------------------------------------------+-------------+--------------------------------------+
-| :func:`std`                                | P           | ``skipna``, ``level``                |
+| :func:`std`                                | P           | ``level``                            |
 +--------------------------------------------+-------------+--------------------------------------+
 | :func:`style`                              | Y           |                                      |
 +--------------------------------------------+-------------+--------------------------------------+
@@ -423,7 +423,7 @@ Supported DataFrame APIs
 +--------------------------------------------+-------------+--------------------------------------+
 | :func:`subtract`                           | P           | ``axis``, ``level``, ``fill_value``  |
 +--------------------------------------------+-------------+--------------------------------------+
-| :func:`sum`                                | P           | ``skipna``, ``level``                |
+| :func:`sum`                                | P           | ``level``                            |
 +--------------------------------------------+-------------+--------------------------------------+
 | :func:`swapaxes`                           | Y           |                                      |
 +--------------------------------------------+-------------+--------------------------------------+
@@ -898,9 +898,9 @@ Supported Series APIs
 +---------------------------------+-------------------+-------------------------------------------+
 | :func:`keys`                    | Y                 |                                           |
 +---------------------------------+-------------------+-------------------------------------------+
-| :func:`kurt`                    | P                 | ``skipna``, ``level``                     |
+| :func:`kurt`                    | P                 | ``level``                                 |
 +---------------------------------+-------------------+-------------------------------------------+
-| :func:`kurtosis`                | P                 | ``skipna``, ``level``                     |
+| :func:`kurtosis`                | P                 | ``level``                                 |
 +---------------------------------+-------------------+-------------------------------------------+
 | :func:`last`                    | Y                 |                                           |
 +---------------------------------+-------------------+-------------------------------------------+
@@ -919,15 +919,15 @@ Supported Series APIs
 | :func:`mask`                    | P                 | ``inplace``, ``axis``, ``level``,         |
 |                                 |                   | ``errors``                                |
 +---------------------------------+-------------------+-------------------------------------------+
-| :func:`max`                     | P                 | ``skipna``, ``level``                     |
+| :func:`max`                     | P                 | ``level``                                 |
 +---------------------------------+-------------------+-------------------------------------------+
-| :func:`mean`                    | P                 | ``skipna``, ``level``                     |
+| :func:`mean`                    | P                 | ``level``                                 |
 +---------------------------------+-------------------+-------------------------------------------+
-| :func:`median`                  | P                 | ``skipna``, ``level``                     |
+| :func:`median`                  | P                 | ``level``                                 |
 +---------------------------------+-------------------+-------------------------------------------+
 | memory_usage                    | N                 |                                           |
 +---------------------------------+-------------------+-------------------------------------------+
-| :func:`min`                     | P                 | ``skipna``, ``level``                     |
+| :func:`min`                     | P                 | ``level``                                 |
 +---------------------------------+-------------------+-------------------------------------------+
 | :func:`mod`                     | P                 | ``fill_value``, ``level``                 |
 +---------------------------------+-------------------+-------------------------------------------+
@@ -983,9 +983,9 @@ Supported Series APIs
 +---------------------------------+-------------------+-------------------------------------------+
 | :func:`pow`                     | P                 | ``fill_value``, ``level``                 |
 +---------------------------------+-------------------+-------------------------------------------+
-| :func:`prod`                    | P                 | ``skipna``, ``level``                     |
+| :func:`prod`                    | P                 | ``level``                                 |
 +---------------------------------+-------------------+-------------------------------------------+
-| :func:`product`                 | P                 | ``skipna``, ``level``                     |
+| :func:`product`                 | P                 | ``level``                                 |
 +---------------------------------+-------------------+-------------------------------------------+
 | :func:`quantile`                | P                 | ``interpolation``                         |
 +---------------------------------+-------------------+-------------------------------------------+
@@ -1040,7 +1040,7 @@ Supported Series APIs
 +---------------------------------+-------------------+-------------------------------------------+
 | searchsorted                    | N                 |                                           |
 +---------------------------------+-------------------+-------------------------------------------+
-| :func:`sem`                     | P                 | ``skipna``, ``level``                     |
+| :func:`sem`                     | P                 | ``level``                                 |
 +---------------------------------+-------------------+-------------------------------------------+
 | set_axis                        | N                 |                                           |
 +---------------------------------+-------------------+-------------------------------------------+
@@ -1052,7 +1052,7 @@ Supported Series APIs
 +---------------------------------+-------------------+-------------------------------------------+
 | :func:`size`                    | Y                 |                                           |
 +---------------------------------+-------------------+-------------------------------------------+
-| :func:`skew`                    | P                 | ``skipna``, ``level``                     |
+| :func:`skew`                    | P                 | ``level``                                 |
 +---------------------------------+-------------------+-------------------------------------------+
 | slice_shift                     | N                 |                                           |
 +---------------------------------+-------------------+-------------------------------------------+
@@ -1065,7 +1065,7 @@ Supported Series APIs
 +---------------------------------+-------------------+-------------------------------------------+
 | :func:`squeeze`                 | Y                 |                                           |
 +---------------------------------+-------------------+-------------------------------------------+
-| :func:`std`                     | P                 | ``skipna``, ``level``                     |
+| :func:`std`                     | P                 | ``level``                                 |
 +---------------------------------+-------------------+-------------------------------------------+
 | :func:`str`                     | Y                 |                                           |
 +---------------------------------+-------------------+-------------------------------------------+
diff --git a/python/pyspark/pandas/config.py b/python/pyspark/pandas/config.py
index a0b8db67758..dc42a7c813b 100644
--- a/python/pyspark/pandas/config.py
+++ b/python/pyspark/pandas/config.py
@@ -204,7 +204,7 @@ _options: List[Option] = [
             "pandas-on-Spark skip the validation and will be slightly different from pandas. "
             "Affected APIs: `Series.dot`, `Series.asof`, `Series.compare`, "
             "`FractionalExtensionOps.astype`, `IntegralExtensionOps.astype`, "
-            "`FractionalOps.astype`, `DecimalOps.astype`."
+            "`FractionalOps.astype`, `DecimalOps.astype`, `skipna of statistical functions`."
         ),
         default=True,
         types=bool,
diff --git a/python/pyspark/pandas/frame.py b/python/pyspark/pandas/frame.py
index 4ec0c9e0605..8527477b7a2 100644
--- a/python/pyspark/pandas/frame.py
+++ b/python/pyspark/pandas/frame.py
@@ -583,6 +583,7 @@ class DataFrame(Frame, Generic[T]):
         name: str,
         axis: Optional[Axis] = None,
         numeric_only: bool = True,
+        skipna: bool = True,
         **kwargs: Any,
     ) -> "Series":
         """
@@ -600,6 +601,8 @@ class DataFrame(Frame, Generic[T]):
             Include only float, int, boolean columns. False is not supported. This parameter
             is mainly for pandas compatibility. Only 'DataFrame.count' uses this parameter
             currently.
+        skipna : bool, default True
+            Exclude NA/null values when computing the result.
         """
 
         from pyspark.pandas.series import Series, first_series
@@ -618,7 +621,10 @@ class DataFrame(Frame, Generic[T]):
                 keep_column = not numeric_only or is_numeric_or_boolean
 
                 if keep_column:
-                    scol = sfun(psser)
+                    if not skipna and get_option("compute.eager_check") and psser.hasnans:
+                        scol = F.first(F.lit(np.nan))
+                    else:
+                        scol = sfun(psser)
 
                     if min_count > 0:
                         scol = F.when(Frame._count_expr(psser) >= min_count, scol)
diff --git a/python/pyspark/pandas/generic.py b/python/pyspark/pandas/generic.py
index 1ce4671d696..bb5d6a4edc9 100644
--- a/python/pyspark/pandas/generic.py
+++ b/python/pyspark/pandas/generic.py
@@ -117,6 +117,7 @@ class Frame(object, metaclass=ABCMeta):
         name: str,
         axis: Optional[Axis] = None,
         numeric_only: bool = True,
+        skipna: bool = True,
         **kwargs: Any,
     ) -> Union["Series", Scalar]:
         pass
@@ -1164,7 +1165,7 @@ class Frame(object, metaclass=ABCMeta):
         )
 
     def mean(
-        self, axis: Optional[Axis] = None, numeric_only: bool = None
+        self, axis: Optional[Axis] = None, skipna: bool = True, numeric_only: bool = None
     ) -> Union[Scalar, "Series"]:
         """
         Return the mean of the values.
@@ -1173,6 +1174,11 @@ class Frame(object, metaclass=ABCMeta):
         ----------
         axis : {index (0), columns (1)}
             Axis for the function to be applied on.
+        skipna : bool, default True
+            Exclude NA/null values when computing the result.
+
+            .. versionchanged:: 3.4.0
+               Supported including NA/null values.
         numeric_only : bool, default None
             Include only float, int, boolean columns. False is not supported. This parameter
             is mainly for pandas compatibility.
@@ -1225,11 +1231,19 @@ class Frame(object, metaclass=ABCMeta):
             return F.mean(spark_column)
 
         return self._reduce_for_stat_function(
-            mean, name="mean", axis=axis, numeric_only=numeric_only
+            mean,
+            name="mean",
+            axis=axis,
+            numeric_only=numeric_only,
+            skipna=skipna,
         )
 
     def sum(
-        self, axis: Optional[Axis] = None, numeric_only: bool = None, min_count: int = 0
+        self,
+        axis: Optional[Axis] = None,
+        skipna: bool = True,
+        numeric_only: bool = None,
+        min_count: int = 0,
     ) -> Union[Scalar, "Series"]:
         """
         Return the sum of the values.
@@ -1238,6 +1252,11 @@ class Frame(object, metaclass=ABCMeta):
         ----------
         axis : {index (0), columns (1)}
             Axis for the function to be applied on.
+        skipna : bool, default True
+            Exclude NA/null values when computing the result.
+
+            .. versionchanged:: 3.4.0
+               Added *skipna* to exclude .
         numeric_only : bool, default None
             Include only float, int, boolean columns. False is not supported. This parameter
             is mainly for pandas compatibility.
@@ -1301,6 +1320,7 @@ class Frame(object, metaclass=ABCMeta):
         def sum(psser: "Series") -> Column:
             spark_type = psser.spark.data_type
             spark_column = psser.spark.column
+
             if isinstance(spark_type, BooleanType):
                 spark_column = spark_column.cast(LongType())
             elif not isinstance(spark_type, NumericType):
@@ -1312,11 +1332,20 @@ class Frame(object, metaclass=ABCMeta):
             return F.coalesce(F.sum(spark_column), SF.lit(0))
 
         return self._reduce_for_stat_function(
-            sum, name="sum", axis=axis, numeric_only=numeric_only, min_count=min_count
+            sum,
+            name="sum",
+            axis=axis,
+            numeric_only=numeric_only,
+            min_count=min_count,
+            skipna=skipna,
         )
 
     def product(
-        self, axis: Optional[Axis] = None, numeric_only: bool = None, min_count: int = 0
+        self,
+        axis: Optional[Axis] = None,
+        skipna: bool = True,
+        numeric_only: bool = None,
+        min_count: int = 0,
     ) -> Union[Scalar, "Series"]:
         """
         Return the product of the values.
@@ -1328,6 +1357,11 @@ class Frame(object, metaclass=ABCMeta):
         ----------
         axis : {index (0), columns (1)}
             Axis for the function to be applied on.
+        skipna : bool, default True
+            Exclude NA/null values when computing the result.
+
+            .. versionchanged:: 3.4.0
+               Supported including NA/null values.
         numeric_only : bool, default None
             Include only float, int, boolean columns. False is not supported. This parameter
             is mainly for pandas compatibility.
@@ -1387,6 +1421,10 @@ class Frame(object, metaclass=ABCMeta):
         def prod(psser: "Series") -> Column:
             spark_type = psser.spark.data_type
             spark_column = psser.spark.column
+
+            if not skipna:
+                spark_column = F.when(spark_column.isNull(), np.nan).otherwise(spark_column)
+
             if isinstance(spark_type, BooleanType):
                 scol = F.min(F.coalesce(spark_column, SF.lit(True))).cast(LongType())
             elif isinstance(spark_type, NumericType):
@@ -1411,13 +1449,18 @@ class Frame(object, metaclass=ABCMeta):
             return F.coalesce(scol, SF.lit(1))
 
         return self._reduce_for_stat_function(
-            prod, name="prod", axis=axis, numeric_only=numeric_only, min_count=min_count
+            prod,
+            name="prod",
+            axis=axis,
+            numeric_only=numeric_only,
+            min_count=min_count,
+            skipna=skipna,
         )
 
     prod = product
 
     def skew(
-        self, axis: Optional[Axis] = None, numeric_only: bool = None
+        self, axis: Optional[Axis] = None, skipna: bool = True, numeric_only: bool = None
     ) -> Union[Scalar, "Series"]:
         """
         Return unbiased skew normalized by N-1.
@@ -1426,6 +1469,11 @@ class Frame(object, metaclass=ABCMeta):
         ----------
         axis : {index (0), columns (1)}
             Axis for the function to be applied on.
+        skipna : bool, default True
+            Exclude NA/null values when computing the result.
+
+            .. versionchanged:: 3.4.0
+               Supported including NA/null values.
         numeric_only : bool, default None
             Include only float, int, boolean columns. False is not supported. This parameter
             is mainly for pandas compatibility.
@@ -1471,11 +1519,15 @@ class Frame(object, metaclass=ABCMeta):
             return F.skewness(spark_column)
 
         return self._reduce_for_stat_function(
-            skew, name="skew", axis=axis, numeric_only=numeric_only
+            skew,
+            name="skew",
+            axis=axis,
+            numeric_only=numeric_only,
+            skipna=skipna,
         )
 
     def kurtosis(
-        self, axis: Optional[Axis] = None, numeric_only: bool = None
+        self, axis: Optional[Axis] = None, skipna: bool = True, numeric_only: bool = None
     ) -> Union[Scalar, "Series"]:
         """
         Return unbiased kurtosis using Fisher’s definition of kurtosis (kurtosis of normal == 0.0).
@@ -1485,6 +1537,11 @@ class Frame(object, metaclass=ABCMeta):
         ----------
         axis : {index (0), columns (1)}
             Axis for the function to be applied on.
+        skipna : bool, default True
+            Exclude NA/null values when computing the result.
+
+            .. versionchanged:: 3.4.0
+               Supported including NA/null values.
         numeric_only : bool, default None
             Include only float, int, boolean columns. False is not supported. This parameter
             is mainly for pandas compatibility.
@@ -1530,13 +1587,17 @@ class Frame(object, metaclass=ABCMeta):
             return F.kurtosis(spark_column)
 
         return self._reduce_for_stat_function(
-            kurtosis, name="kurtosis", axis=axis, numeric_only=numeric_only
+            kurtosis,
+            name="kurtosis",
+            axis=axis,
+            numeric_only=numeric_only,
+            skipna=skipna,
         )
 
     kurt = kurtosis
 
     def min(
-        self, axis: Optional[Axis] = None, numeric_only: bool = None
+        self, axis: Optional[Axis] = None, skipna: bool = True, numeric_only: bool = None
     ) -> Union[Scalar, "Series"]:
         """
         Return the minimum of the values.
@@ -1545,6 +1606,11 @@ class Frame(object, metaclass=ABCMeta):
         ----------
         axis : {index (0), columns (1)}
             Axis for the function to be applied on.
+        skipna : bool, default True
+            Exclude NA/null values when computing the result.
+
+            .. versionchanged:: 3.4.0
+               Supported including NA/null values.
         numeric_only : bool, default None
             If True, include only float, int, boolean columns. This parameter is mainly for
             pandas compatibility. False is supported; however, the columns should
@@ -1591,10 +1657,11 @@ class Frame(object, metaclass=ABCMeta):
             name="min",
             axis=axis,
             numeric_only=numeric_only,
+            skipna=skipna,
         )
 
     def max(
-        self, axis: Optional[Axis] = None, numeric_only: bool = None
+        self, axis: Optional[Axis] = None, skipna: bool = True, numeric_only: bool = None
     ) -> Union[Scalar, "Series"]:
         """
         Return the maximum of the values.
@@ -1603,6 +1670,11 @@ class Frame(object, metaclass=ABCMeta):
         ----------
         axis : {index (0), columns (1)}
             Axis for the function to be applied on.
+        skipna : bool, default True
+            Exclude NA/null values when computing the result.
+
+            .. versionchanged:: 3.4.0
+               Supported including NA/null values.
         numeric_only : bool, default None
             If True, include only float, int, boolean columns. This parameter is mainly for
             pandas compatibility. False is supported; however, the columns should
@@ -1649,6 +1721,7 @@ class Frame(object, metaclass=ABCMeta):
             name="max",
             axis=axis,
             numeric_only=numeric_only,
+            skipna=skipna,
         )
 
     def count(
@@ -1726,7 +1799,11 @@ class Frame(object, metaclass=ABCMeta):
         )
 
     def std(
-        self, axis: Optional[Axis] = None, ddof: int = 1, numeric_only: bool = None
+        self,
+        axis: Optional[Axis] = None,
+        skipna: bool = True,
+        ddof: int = 1,
+        numeric_only: bool = None,
     ) -> Union[Scalar, "Series"]:
         """
         Return sample standard deviation.
@@ -1735,6 +1812,11 @@ class Frame(object, metaclass=ABCMeta):
         ----------
         axis : {index (0), columns (1)}
             Axis for the function to be applied on.
+        skipna : bool, default True
+            Exclude NA/null values when computing the result.
+
+            .. versionchanged:: 3.4.0
+               Supported including NA/null values.
         ddof : int, default 1
             Delta Degrees of Freedom. The divisor used in calculations is N - ddof,
             where N represents the number of elements.
@@ -1803,7 +1885,7 @@ class Frame(object, metaclass=ABCMeta):
                 return F.stddev_samp(spark_column)
 
         return self._reduce_for_stat_function(
-            std, name="std", axis=axis, numeric_only=numeric_only, ddof=ddof
+            std, name="std", axis=axis, numeric_only=numeric_only, ddof=ddof, skipna=skipna
         )
 
     def var(
@@ -1888,7 +1970,11 @@ class Frame(object, metaclass=ABCMeta):
         )
 
     def median(
-        self, axis: Optional[Axis] = None, numeric_only: bool = None, accuracy: int = 10000
+        self,
+        axis: Optional[Axis] = None,
+        skipna: bool = True,
+        numeric_only: bool = None,
+        accuracy: int = 10000,
     ) -> Union[Scalar, "Series"]:
         """
         Return the median of the values for the requested axis.
@@ -1901,6 +1987,11 @@ class Frame(object, metaclass=ABCMeta):
         ----------
         axis : {index (0), columns (1)}
             Axis for the function to be applied on.
+        skipna : bool, default True
+            Exclude NA/null values when computing the result.
+
+            .. versionchanged:: 3.4.0
+               Supported including NA/null values.
         numeric_only : bool, default None
             Include only float, int, boolean columns. False is not supported. This parameter
             is mainly for pandas compatibility.
@@ -1995,11 +2086,19 @@ class Frame(object, metaclass=ABCMeta):
                 )
 
         return self._reduce_for_stat_function(
-            median, name="median", numeric_only=numeric_only, axis=axis
+            median,
+            name="median",
+            numeric_only=numeric_only,
+            axis=axis,
+            skipna=skipna,
         )
 
     def sem(
-        self, axis: Optional[Axis] = None, ddof: int = 1, numeric_only: bool = None
+        self,
+        axis: Optional[Axis] = None,
+        skipna: bool = True,
+        ddof: int = 1,
+        numeric_only: bool = None,
     ) -> Union[Scalar, "Series"]:
         """
         Return unbiased standard error of the mean over requested axis.
@@ -2008,6 +2107,11 @@ class Frame(object, metaclass=ABCMeta):
         ----------
         axis : {index (0), columns (1)}
             Axis for the function to be applied on.
+        skipna : bool, default True
+            Exclude NA/null values when computing the result.
+
+            .. versionchanged:: 3.4.0
+               Supported including NA/null values.
         ddof : int, default 1
             Delta Degrees of Freedom. The divisor used in calculations is N - ddof,
             where N represents the number of elements.
@@ -2086,7 +2190,12 @@ class Frame(object, metaclass=ABCMeta):
             return std(psser) / pow(Frame._count_expr(psser), 0.5)
 
         return self._reduce_for_stat_function(
-            sem, name="sem", numeric_only=numeric_only, axis=axis, ddof=ddof
+            sem,
+            name="sem",
+            numeric_only=numeric_only,
+            axis=axis,
+            ddof=ddof,
+            skipna=skipna,
         )
 
     @property
diff --git a/python/pyspark/pandas/series.py b/python/pyspark/pandas/series.py
index f15ba4854f3..b8915e160c1 100644
--- a/python/pyspark/pandas/series.py
+++ b/python/pyspark/pandas/series.py
@@ -6849,6 +6849,7 @@ class Series(Frame, IndexOpsMixin, Generic[T]):
         name: str_type,
         axis: Optional[Axis] = None,
         numeric_only: bool = True,
+        skipna: bool = True,
         **kwargs: Any,
     ) -> Scalar:
         """
@@ -6859,13 +6860,17 @@ class Series(Frame, IndexOpsMixin, Generic[T]):
         sfun : the stats function to be used for aggregation
         name : original pandas API name.
         axis : used only for sanity check because series only support index axis.
-        numeric_only : not used by this implementation, but passed down by stats functions
+        numeric_only : not used by this implementation, but passed down by stats functions.
+        skipna: exclude NA/null values when computing the result.
         """
         axis = validate_axis(axis)
         if axis == 1:
             raise NotImplementedError("Series does not support columns axis.")
 
-        scol = sfun(self)
+        if not skipna and get_option("compute.eager_check") and self.hasnans:
+            scol = F.first(F.lit(np.nan))
+        else:
+            scol = sfun(self)
 
         min_count = kwargs.get("min_count", 0)
         if min_count > 0:
diff --git a/python/pyspark/pandas/tests/test_generic_functions.py b/python/pyspark/pandas/tests/test_generic_functions.py
index 3e4db6c86bc..5062daa77e2 100644
--- a/python/pyspark/pandas/tests/test_generic_functions.py
+++ b/python/pyspark/pandas/tests/test_generic_functions.py
@@ -111,6 +111,48 @@ class GenericFunctionsTest(PandasOnSparkTestCase, TestUtils):
         )
         self._test_interpolate(pdf)
 
+    def _test_stat_functions(self, stat_func):
+        pdf = pd.DataFrame({"a": [np.nan, np.nan, np.nan], "b": [1, np.nan, 2], "c": [1, 2, 3]})
+        psdf = ps.from_pandas(pdf)
+        self.assert_eq(stat_func(pdf.a), stat_func(psdf.a))
+        self.assert_eq(stat_func(pdf.b), stat_func(psdf.b))
+        self.assert_eq(stat_func(pdf), stat_func(psdf))
+
+    # Fix skew and kurtosis and re-enable tests below
+    def test_stat_functions(self):
+        self._test_stat_functions(lambda x: x.sum())
+        self._test_stat_functions(lambda x: x.sum(skipna=False))
+        self._test_stat_functions(lambda x: x.mean())
+        self._test_stat_functions(lambda x: x.mean(skipna=False))
+        self._test_stat_functions(lambda x: x.product())
+        self._test_stat_functions(lambda x: x.product(skipna=False))
+        self._test_stat_functions(lambda x: x.min())
+        self._test_stat_functions(lambda x: x.min(skipna=False))
+        self._test_stat_functions(lambda x: x.max())
+        self._test_stat_functions(lambda x: x.max(skipna=False))
+        self._test_stat_functions(lambda x: x.std())
+        self._test_stat_functions(lambda x: x.std(skipna=False))
+        self._test_stat_functions(lambda x: x.sem())
+        self._test_stat_functions(lambda x: x.sem(skipna=False))
+        # self._test_stat_functions(lambda x: x.skew())
+        self._test_stat_functions(lambda x: x.skew(skipna=False))
+
+        # Test cases below return differently from pandas (either by design or to be fixed)
+        pdf = pd.DataFrame({"a": [np.nan, np.nan, np.nan], "b": [1, np.nan, 2], "c": [1, 2, 3]})
+        psdf = ps.from_pandas(pdf)
+
+        self.assert_eq(pdf.a.median(), psdf.a.median())
+        self.assert_eq(pdf.a.median(skipna=False), psdf.a.median(skipna=False))
+        self.assert_eq(1.0, psdf.b.median())
+        self.assert_eq(pdf.b.median(skipna=False), psdf.b.median(skipna=False))
+        self.assert_eq(pdf.c.median(), psdf.c.median())
+
+        self.assert_eq(pdf.a.kurtosis(skipna=False), psdf.a.kurtosis(skipna=False))
+        self.assert_eq(pdf.a.kurtosis(), psdf.a.kurtosis())
+        self.assert_eq(pdf.b.kurtosis(skipna=False), psdf.b.kurtosis(skipna=False))
+        # self.assert_eq(pdf.b.kurtosis(), psdf.b.kurtosis())  AssertionError: nan != -2.0
+        self.assert_eq(-1.5, psdf.c.kurtosis())
+
 
 if __name__ == "__main__":
     import unittest
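
Note on the `skipna=False` branch: the check added to `_reduce_for_stat_function` in frame.py and series.py returns NaN without running the aggregate when `skipna` is false, `compute.eager_check` is on, and the column has missing values. The sketch below models that control flow in plain Python, with no Spark session. `reduce_with_skipna` and its parameters are hypothetical names made up for illustration and are not part of pyspark. The fall-through when `eager_check` is off is an assumption based on reading the diff, not a statement of the library's documented behaviour.

```python
import math
from typing import Callable, Iterable, List, Optional


def reduce_with_skipna(
    values: Iterable[Optional[float]],
    stat: Callable[[List[float]], float],
    skipna: bool = True,
    eager_check: bool = True,
) -> float:
    """Hypothetical model of the short-circuit in the diff, not pyspark code."""

    def is_missing(v: Optional[float]) -> bool:
        return v is None or math.isnan(v)

    data = list(values)
    has_nans = any(is_missing(v) for v in data)

    # Mirrors: `if not skipna and get_option("compute.eager_check") and hasnans`
    if not skipna and eager_check and has_nans:
        return math.nan

    # Otherwise aggregate over the non-missing values only (assumed to match
    # how the Spark aggregates treat nulls).
    return stat([v for v in data if not is_missing(v)])


b = [1.0, math.nan, 2.0]
print(reduce_with_skipna(b, math.fsum))                # 3.0
print(reduce_with_skipna(b, math.fsum, skipna=False))  # nan
```

With `eager_check=False` this model ignores `skipna=False` and returns 3.0 for column `b`. That would explain why the commit lists "skipna of statistical functions" among the APIs affected by `compute.eager_check`.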