[spark] branch master updated: [SPARK-39077][PYTHON] Implement `skipna` of common statistical functions of DataFrame and Series

gurwls223 Mon, 09 May 2022 19:32:08 -0700

This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git



The following commit(s) were added to refs/heads/master by this push:
     new 0915a666e7f [SPARK-39077][PYTHON] Implement `skipna` of common 
statistical functions of DataFrame and Series
0915a666e7f is described below

commit 0915a666e7f33b99bd607db354bdb395189b4e12
Author: Xinrong Meng <xinrong.m...@databricks.com>
AuthorDate: Tue May 10 11:31:38 2022 +0900

    [SPARK-39077][PYTHON] Implement `skipna` of common statistical functions of 
DataFrame and Series
    
    ### What changes were proposed in this pull request?
    Implement `skipna` of common statistical functions of DataFrame and Series,
    which include `sum / mean / product / min / max / std / sem / median / skew / kurtosis`.
    
    See decision details at 
https://docs.google.com/document/d/1IHUQkSVMPWiK8Jhe0GUtMHnDS6LB4_z9K2ktWmORSSg/edit#heading=h.iom65pc8gqiv.
    
    ### Why are the changes needed?
    
    Standardizing these statistical functions increases pandas API coverage,
    since the previously missing `skipna` parameter is now implemented. That
    should further improve user adoption.
    
    ### Does this PR introduce _any_ user-facing change?
    Yes. `skipna` is supported in common statistical functions of DataFrame and 
Series.
    
    Take `sum` for example,
    ```py
    >>> import numpy as np
    >>> import pyspark.pandas as ps
    >>> psdf = ps.DataFrame({"a": [np.nan, np.nan, np.nan], "b": [1, np.nan, 2]})
    >>> psdf
        a    b
    0 NaN  1.0
    1 NaN  NaN
    2 NaN  2.0
    
    >>> psdf.sum(skipna=False)
    a   NaN
    b   NaN
    dtype: float64
    
    >>> psdf.sum(skipna=True)
    a    0.0
    b    3.0
    dtype: float64
    
    >>> psdf.b.sum(skipna=False)
    nan
    
    >>> psdf.b.sum(skipna=True)
    3.0
    ```
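
The semantics in the example above can be sketched in plain Python. This is only an illustration, not the pandas-on-Spark implementation (which, per the diff below, works on Spark column expressions and consults the `compute.eager_check` option); the helper name `nan_aware_sum` is hypothetical.

```py
import math


def nan_aware_sum(values, skipna=True):
    """Hypothetical helper mirroring the `skipna` behaviour of `sum` shown above."""
    has_nan = any(math.isnan(v) for v in values)
    if not skipna and has_nan:
        # With skipna=False, a single missing value makes the result NaN.
        return math.nan
    # With skipna=True, NaNs are ignored; an all-NaN input sums to 0.0,
    # matching column "a" in the example.
    return float(sum(v for v in values if not math.isnan(v)))


a = [math.nan, math.nan, math.nan]
b = [1.0, math.nan, 2.0]

print(nan_aware_sum(a, skipna=True))   # 0.0
print(nan_aware_sum(b, skipna=True))   # 3.0
print(nan_aware_sum(b, skipna=False))  # nan
```

The all-NaN case returning 0.0 under `skipna=True` corresponds to the default `min_count=0` of `sum`; a larger `min_count` is not modelled in this sketch.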
    
    ### How was this patch tested?
    Unit tests.
    
    Closes #36414 from xinrong-databricks/generic.skipna.
    
    Authored-by: Xinrong Meng <xinrong.m...@databricks.com>
    Signed-off-by: Hyukjin Kwon <gurwls...@apache.org>
---
 .../pandas_on_spark/supported_pandas_api.rst       |  46 +++----
 python/pyspark/pandas/config.py                    |   2 +-
 python/pyspark/pandas/frame.py                     |   8 +-
 python/pyspark/pandas/generic.py                   | 145 ++++++++++++++++++---
 python/pyspark/pandas/series.py                    |   9 +-
 .../pyspark/pandas/tests/test_generic_functions.py |  42 ++++++
 6 files changed, 207 insertions(+), 45 deletions(-)

diff --git 
a/python/docs/source/user_guide/pandas_on_spark/supported_pandas_api.rst 
b/python/docs/source/user_guide/pandas_on_spark/supported_pandas_api.rst
index d2ac0b78861..2373fa95d19 100644
--- a/python/docs/source/user_guide/pandas_on_spark/supported_pandas_api.rst
+++ b/python/docs/source/user_guide/pandas_on_spark/supported_pandas_api.rst
@@ -241,9 +241,9 @@ Supported DataFrame APIs
 
+--------------------------------------------+-------------+--------------------------------------+
 | :func:`keys`                               | Y           |                   
                   |
 
+--------------------------------------------+-------------+--------------------------------------+
-| :func:`kurt`                               | P           | ``skipna``, 
``level``                |
+| :func:`kurt`                               | P           | ``level``         
                   |
 
+--------------------------------------------+-------------+--------------------------------------+
-| :func:`kurtosis`                           | P           | ``skipna``, 
``level``                |
+| :func:`kurtosis`                           | P           | ``level``         
                   |
 
+--------------------------------------------+-------------+--------------------------------------+
 | :func:`last`                               | Y           |                   
                   |
 
+--------------------------------------------+-------------+--------------------------------------+
@@ -262,11 +262,11 @@ Supported DataFrame APIs
 | :func:`mask`                               | P           | ``inplace``, 
``axis``, ``level``,    |
 |                                            |             | ``errors``        
                   |
 
+--------------------------------------------+-------------+--------------------------------------+
-| :func:`max`                                | P           | ``skipna``, 
``level``                |
+| :func:`max`                                | P           | ``level``         
                   |
 
+--------------------------------------------+-------------+--------------------------------------+
-| :func:`mean`                               | P           | ``skipna``, 
``level``                |
+| :func:`mean`                               | P           | ``level``         
                   |
 
+--------------------------------------------+-------------+--------------------------------------+
-| :func:`median`                             | P           | ``skipna``, 
``level``                |
+| :func:`median`                             | P           | ``level``         
                   |
 
+--------------------------------------------+-------------+--------------------------------------+
 | :func:`melt`                               | P           | ``col_level``, 
``ignore_index``      |
 
+--------------------------------------------+-------------+--------------------------------------+
@@ -275,7 +275,7 @@ Supported DataFrame APIs
 | :func:`merge`                              | P           | ``sort``, 
``copy``, ``indicator``,   |
 |                                            |             | ``validate``      
                   |
 
+--------------------------------------------+-------------+--------------------------------------+
-| :func:`min`                                | P           | ``skipna``, 
``level``                |
+| :func:`min`                                | P           | ``level``         
                   |
 
+--------------------------------------------+-------------+--------------------------------------+
 | :func:`mod`                                | P           | ``axis``, 
``level``, ``fill_value``  |
 
+--------------------------------------------+-------------+--------------------------------------+
@@ -335,9 +335,9 @@ Supported DataFrame APIs
 
+--------------------------------------------+-------------+--------------------------------------+
 | :func:`pow`                                | P           | ``axis``, 
``level``, ``fill_value``  |
 
+--------------------------------------------+-------------+--------------------------------------+
-| :func:`prod`                               | P           | ``skipna``, 
``level``                |
+| :func:`prod`                               | P           | ``level``         
                   |
 
+--------------------------------------------+-------------+--------------------------------------+
-| :func:`product`                            | P           | ``skipna``, 
``level``                |
+| :func:`product`                            | P           | ``level``         
                   |
 
+--------------------------------------------+-------------+--------------------------------------+
 | :func:`quantile`                           | P           | ``interpolation`` 
                   |
 
+--------------------------------------------+-------------+--------------------------------------+
@@ -386,7 +386,7 @@ Supported DataFrame APIs
 
+--------------------------------------------+-------------+--------------------------------------+
 | :func:`select_dtypes`                      | Y           |                   
                   |
 
+--------------------------------------------+-------------+--------------------------------------+
-| :func:`sem`                                | P           | ``skipna``        
                   |
+| :func:`sem`                                | Y           |                   
                   |
 
+--------------------------------------------+-------------+--------------------------------------+
 | set_axis                                   | N           |                   
                   |
 
+--------------------------------------------+-------------+--------------------------------------+
@@ -400,7 +400,7 @@ Supported DataFrame APIs
 
+--------------------------------------------+-------------+--------------------------------------+
 | :func:`size`                               | Y           |                   
                   |
 
+--------------------------------------------+-------------+--------------------------------------+
-| :func:`skew`                               | P           | ``skipna``, 
``level``                |
+| :func:`skew`                               | P           | ``level``         
                   |
 
+--------------------------------------------+-------------+--------------------------------------+
 | slice_shift                                | N           |                   
                   |
 
+--------------------------------------------+-------------+--------------------------------------+
@@ -415,7 +415,7 @@ Supported DataFrame APIs
 
+--------------------------------------------+-------------+--------------------------------------+
 | :func:`stack`                              | P           | ``level``, 
``dropna``                |
 
+--------------------------------------------+-------------+--------------------------------------+
-| :func:`std`                                | P           | ``skipna``, 
``level``                |
+| :func:`std`                                | P           | ``level``         
                   |
 
+--------------------------------------------+-------------+--------------------------------------+
 | :func:`style`                              | Y           |                   
                   |
 
+--------------------------------------------+-------------+--------------------------------------+
@@ -423,7 +423,7 @@ Supported DataFrame APIs
 
+--------------------------------------------+-------------+--------------------------------------+
 | :func:`subtract`                           | P           | ``axis``, 
``level``, ``fill_value``  |
 
+--------------------------------------------+-------------+--------------------------------------+
-| :func:`sum`                                | P           | ``skipna``, 
``level``                |
+| :func:`sum`                                | P           | ``level``         
                   |
 
+--------------------------------------------+-------------+--------------------------------------+
 | :func:`swapaxes`                           | Y           |                   
                   |
 
+--------------------------------------------+-------------+--------------------------------------+
@@ -898,9 +898,9 @@ Supported Series APIs
 
+---------------------------------+-------------------+-------------------------------------------+
 | :func:`keys`                    | Y                 |                        
                   |
 
+---------------------------------+-------------------+-------------------------------------------+
-| :func:`kurt`                    | P                 | ``skipna``, ``level``  
                   |
+| :func:`kurt`                    | P                 | ``level``              
                   |
 
+---------------------------------+-------------------+-------------------------------------------+
-| :func:`kurtosis`                | P                 | ``skipna``, ``level``  
                   |
+| :func:`kurtosis`                | P                 | ``level``              
                   |
 
+---------------------------------+-------------------+-------------------------------------------+
 | :func:`last`                    | Y                 |                        
                   |
 
+---------------------------------+-------------------+-------------------------------------------+
@@ -919,15 +919,15 @@ Supported Series APIs
 | :func:`mask`                    | P                 | ``inplace``, ``axis``, 
``level``,         |
 |                                 |                   | ``errors``             
                   |
 
+---------------------------------+-------------------+-------------------------------------------+
-| :func:`max`                     | P                 | ``skipna``, ``level``  
                   |
+| :func:`max`                     | P                 | ``level``              
                   |
 
+---------------------------------+-------------------+-------------------------------------------+
-| :func:`mean`                    | P                 | ``skipna``, ``level``  
                   |
+| :func:`mean`                    | P                 | ``level``              
                   |
 
+---------------------------------+-------------------+-------------------------------------------+
-| :func:`median`                  | P                 | ``skipna``, ``level``  
                   |
+| :func:`median`                  | P                 | ``level``              
                   |
 
+---------------------------------+-------------------+-------------------------------------------+
 | memory_usage                    | N                 |                        
                   |
 
+---------------------------------+-------------------+-------------------------------------------+
-| :func:`min`                     | P                 | ``skipna``, ``level``  
                   |
+| :func:`min`                     | P                 | ``level``              
                   |
 
+---------------------------------+-------------------+-------------------------------------------+
 | :func:`mod`                     | P                 | ``fill_value``, 
``level``                 |
 
+---------------------------------+-------------------+-------------------------------------------+
@@ -983,9 +983,9 @@ Supported Series APIs
 
+---------------------------------+-------------------+-------------------------------------------+
 | :func:`pow`                     | P                 | ``fill_value``, 
``level``                 |
 
+---------------------------------+-------------------+-------------------------------------------+
-| :func:`prod`                    | P                 | ``skipna``, ``level``  
                   |
+| :func:`prod`                    | P                 | ``level``              
                   |
 
+---------------------------------+-------------------+-------------------------------------------+
-| :func:`product`                 | P                 | ``skipna``, ``level``  
                   |
+| :func:`product`                 | P                 | ``level``              
                   |
 
+---------------------------------+-------------------+-------------------------------------------+
 | :func:`quantile`                | P                 | ``interpolation``      
                   |
 
+---------------------------------+-------------------+-------------------------------------------+
@@ -1040,7 +1040,7 @@ Supported Series APIs
 
+---------------------------------+-------------------+-------------------------------------------+
 | searchsorted                    | N                 |                        
                   |
 
+---------------------------------+-------------------+-------------------------------------------+
-| :func:`sem`                     | P                 | ``skipna``, ``level``  
                   |
+| :func:`sem`                     | P                 | ``level``              
                   |
 
+---------------------------------+-------------------+-------------------------------------------+
 | set_axis                        | N                 |                        
                   |
 
+---------------------------------+-------------------+-------------------------------------------+
@@ -1052,7 +1052,7 @@ Supported Series APIs
 
+---------------------------------+-------------------+-------------------------------------------+
 | :func:`size`                    | Y                 |                        
                   |
 
+---------------------------------+-------------------+-------------------------------------------+
-| :func:`skew`                    | P                 | ``skipna``, ``level``  
                   |
+| :func:`skew`                    | P                 | ``level``              
                   |
 
+---------------------------------+-------------------+-------------------------------------------+
 | slice_shift                     | N                 |                        
                   |
 
+---------------------------------+-------------------+-------------------------------------------+
@@ -1065,7 +1065,7 @@ Supported Series APIs
 
+---------------------------------+-------------------+-------------------------------------------+
 | :func:`squeeze`                 | Y                 |                        
                   |
 
+---------------------------------+-------------------+-------------------------------------------+
-| :func:`std`                     | P                 | ``skipna``, ``level``  
                   |
+| :func:`std`                     | P                 | ``level``              
                   |
 
+---------------------------------+-------------------+-------------------------------------------+
 | :func:`str`                     | Y                 |                        
                   |
 
+---------------------------------+-------------------+-------------------------------------------+
diff --git a/python/pyspark/pandas/config.py b/python/pyspark/pandas/config.py
index a0b8db67758..dc42a7c813b 100644
--- a/python/pyspark/pandas/config.py
+++ b/python/pyspark/pandas/config.py
@@ -204,7 +204,7 @@ _options: List[Option] = [
             "pandas-on-Spark skip the validation and will be slightly 
different from pandas. "
             "Affected APIs: `Series.dot`, `Series.asof`, `Series.compare`, "
             "`FractionalExtensionOps.astype`, `IntegralExtensionOps.astype`, "
-            "`FractionalOps.astype`, `DecimalOps.astype`."
+            "`FractionalOps.astype`, `DecimalOps.astype`, `skipna of 
statistical functions`."
         ),
         default=True,
         types=bool,
diff --git a/python/pyspark/pandas/frame.py b/python/pyspark/pandas/frame.py
index 4ec0c9e0605..8527477b7a2 100644
--- a/python/pyspark/pandas/frame.py
+++ b/python/pyspark/pandas/frame.py
@@ -583,6 +583,7 @@ class DataFrame(Frame, Generic[T]):
         name: str,
         axis: Optional[Axis] = None,
         numeric_only: bool = True,
+        skipna: bool = True,
         **kwargs: Any,
     ) -> "Series":
         """
@@ -600,6 +601,8 @@ class DataFrame(Frame, Generic[T]):
             Include only float, int, boolean columns. False is not supported. 
This parameter
             is mainly for pandas compatibility. Only 'DataFrame.count' uses 
this parameter
             currently.
+        skipna : bool, default True
+            Exclude NA/null values when computing the result.
         """
         from pyspark.pandas.series import Series, first_series
 
@@ -618,7 +621,10 @@ class DataFrame(Frame, Generic[T]):
                 keep_column = not numeric_only or is_numeric_or_boolean
 
                 if keep_column:
-                    scol = sfun(psser)
+                    if not skipna and get_option("compute.eager_check") and 
psser.hasnans:
+                        scol = F.first(F.lit(np.nan))
+                    else:
+                        scol = sfun(psser)
 
                     if min_count > 0:
                         scol = F.when(Frame._count_expr(psser) >= min_count, 
scol)
diff --git a/python/pyspark/pandas/generic.py b/python/pyspark/pandas/generic.py
index 1ce4671d696..bb5d6a4edc9 100644
--- a/python/pyspark/pandas/generic.py
+++ b/python/pyspark/pandas/generic.py
@@ -117,6 +117,7 @@ class Frame(object, metaclass=ABCMeta):
         name: str,
         axis: Optional[Axis] = None,
         numeric_only: bool = True,
+        skipna: bool = True,
         **kwargs: Any,
     ) -> Union["Series", Scalar]:
         pass
@@ -1164,7 +1165,7 @@ class Frame(object, metaclass=ABCMeta):
         )
 
     def mean(
-        self, axis: Optional[Axis] = None, numeric_only: bool = None
+        self, axis: Optional[Axis] = None, skipna: bool = True, numeric_only: 
bool = None
     ) -> Union[Scalar, "Series"]:
         """
         Return the mean of the values.
@@ -1173,6 +1174,11 @@ class Frame(object, metaclass=ABCMeta):
         ----------
         axis : {index (0), columns (1)}
             Axis for the function to be applied on.
+        skipna : bool, default True
+            Exclude NA/null values when computing the result.
+
+            .. versionchanged:: 3.4.0
+               Supported including NA/null values.
         numeric_only : bool, default None
             Include only float, int, boolean columns. False is not supported. 
This parameter
             is mainly for pandas compatibility.
@@ -1225,11 +1231,19 @@ class Frame(object, metaclass=ABCMeta):
             return F.mean(spark_column)
 
         return self._reduce_for_stat_function(
-            mean, name="mean", axis=axis, numeric_only=numeric_only
+            mean,
+            name="mean",
+            axis=axis,
+            numeric_only=numeric_only,
+            skipna=skipna,
         )
 
     def sum(
-        self, axis: Optional[Axis] = None, numeric_only: bool = None, 
min_count: int = 0
+        self,
+        axis: Optional[Axis] = None,
+        skipna: bool = True,
+        numeric_only: bool = None,
+        min_count: int = 0,
     ) -> Union[Scalar, "Series"]:
         """
         Return the sum of the values.
@@ -1238,6 +1252,11 @@ class Frame(object, metaclass=ABCMeta):
         ----------
         axis : {index (0), columns (1)}
             Axis for the function to be applied on.
+        skipna : bool, default True
+            Exclude NA/null values when computing the result.
+
+            .. versionchanged:: 3.4.0
+               Added *skipna* to exclude .
         numeric_only : bool, default None
             Include only float, int, boolean columns. False is not supported. 
This parameter
             is mainly for pandas compatibility.
@@ -1301,6 +1320,7 @@ class Frame(object, metaclass=ABCMeta):
         def sum(psser: "Series") -> Column:
             spark_type = psser.spark.data_type
             spark_column = psser.spark.column
+
             if isinstance(spark_type, BooleanType):
                 spark_column = spark_column.cast(LongType())
             elif not isinstance(spark_type, NumericType):
@@ -1312,11 +1332,20 @@ class Frame(object, metaclass=ABCMeta):
             return F.coalesce(F.sum(spark_column), SF.lit(0))
 
         return self._reduce_for_stat_function(
-            sum, name="sum", axis=axis, numeric_only=numeric_only, 
min_count=min_count
+            sum,
+            name="sum",
+            axis=axis,
+            numeric_only=numeric_only,
+            min_count=min_count,
+            skipna=skipna,
         )
 
     def product(
-        self, axis: Optional[Axis] = None, numeric_only: bool = None, 
min_count: int = 0
+        self,
+        axis: Optional[Axis] = None,
+        skipna: bool = True,
+        numeric_only: bool = None,
+        min_count: int = 0,
     ) -> Union[Scalar, "Series"]:
         """
         Return the product of the values.
@@ -1328,6 +1357,11 @@ class Frame(object, metaclass=ABCMeta):
         ----------
         axis : {index (0), columns (1)}
             Axis for the function to be applied on.
+        skipna : bool, default True
+            Exclude NA/null values when computing the result.
+
+            .. versionchanged:: 3.4.0
+               Supported including NA/null values.
         numeric_only : bool, default None
             Include only float, int, boolean columns. False is not supported. 
This parameter
             is mainly for pandas compatibility.
@@ -1387,6 +1421,10 @@ class Frame(object, metaclass=ABCMeta):
         def prod(psser: "Series") -> Column:
             spark_type = psser.spark.data_type
             spark_column = psser.spark.column
+
+            if not skipna:
+                spark_column = F.when(spark_column.isNull(), 
np.nan).otherwise(spark_column)
+
             if isinstance(spark_type, BooleanType):
                 scol = F.min(F.coalesce(spark_column, 
SF.lit(True))).cast(LongType())
             elif isinstance(spark_type, NumericType):
@@ -1411,13 +1449,18 @@ class Frame(object, metaclass=ABCMeta):
             return F.coalesce(scol, SF.lit(1))
 
         return self._reduce_for_stat_function(
-            prod, name="prod", axis=axis, numeric_only=numeric_only, 
min_count=min_count
+            prod,
+            name="prod",
+            axis=axis,
+            numeric_only=numeric_only,
+            min_count=min_count,
+            skipna=skipna,
         )
 
     prod = product
 
     def skew(
-        self, axis: Optional[Axis] = None, numeric_only: bool = None
+        self, axis: Optional[Axis] = None, skipna: bool = True, numeric_only: 
bool = None
     ) -> Union[Scalar, "Series"]:
         """
         Return unbiased skew normalized by N-1.
@@ -1426,6 +1469,11 @@ class Frame(object, metaclass=ABCMeta):
         ----------
         axis : {index (0), columns (1)}
             Axis for the function to be applied on.
+        skipna : bool, default True
+            Exclude NA/null values when computing the result.
+
+            .. versionchanged:: 3.4.0
+               Supported including NA/null values.
         numeric_only : bool, default None
             Include only float, int, boolean columns. False is not supported. 
This parameter
             is mainly for pandas compatibility.
@@ -1471,11 +1519,15 @@ class Frame(object, metaclass=ABCMeta):
             return F.skewness(spark_column)
 
         return self._reduce_for_stat_function(
-            skew, name="skew", axis=axis, numeric_only=numeric_only
+            skew,
+            name="skew",
+            axis=axis,
+            numeric_only=numeric_only,
+            skipna=skipna,
         )
 
     def kurtosis(
-        self, axis: Optional[Axis] = None, numeric_only: bool = None
+        self, axis: Optional[Axis] = None, skipna: bool = True, numeric_only: 
bool = None
     ) -> Union[Scalar, "Series"]:
         """
         Return unbiased kurtosis using Fisher’s definition of kurtosis 
(kurtosis of normal == 0.0).
@@ -1485,6 +1537,11 @@ class Frame(object, metaclass=ABCMeta):
         ----------
         axis : {index (0), columns (1)}
             Axis for the function to be applied on.
+        skipna : bool, default True
+            Exclude NA/null values when computing the result.
+
+            .. versionchanged:: 3.4.0
+               Supported including NA/null values.
         numeric_only : bool, default None
             Include only float, int, boolean columns. False is not supported. 
This parameter
             is mainly for pandas compatibility.
@@ -1530,13 +1587,17 @@ class Frame(object, metaclass=ABCMeta):
             return F.kurtosis(spark_column)
 
         return self._reduce_for_stat_function(
-            kurtosis, name="kurtosis", axis=axis, numeric_only=numeric_only
+            kurtosis,
+            name="kurtosis",
+            axis=axis,
+            numeric_only=numeric_only,
+            skipna=skipna,
         )
 
     kurt = kurtosis
 
     def min(
-        self, axis: Optional[Axis] = None, numeric_only: bool = None
+        self, axis: Optional[Axis] = None, skipna: bool = True, numeric_only: 
bool = None
     ) -> Union[Scalar, "Series"]:
         """
         Return the minimum of the values.
@@ -1545,6 +1606,11 @@ class Frame(object, metaclass=ABCMeta):
         ----------
         axis : {index (0), columns (1)}
             Axis for the function to be applied on.
+        skipna : bool, default True
+            Exclude NA/null values when computing the result.
+
+            .. versionchanged:: 3.4.0
+               Supported including NA/null values.
         numeric_only : bool, default None
             If True, include only float, int, boolean columns. This parameter 
is mainly for
             pandas compatibility. False is supported; however, the columns 
should
@@ -1591,10 +1657,11 @@ class Frame(object, metaclass=ABCMeta):
             name="min",
             axis=axis,
             numeric_only=numeric_only,
+            skipna=skipna,
         )
 
     def max(
-        self, axis: Optional[Axis] = None, numeric_only: bool = None
+        self, axis: Optional[Axis] = None, skipna: bool = True, numeric_only: 
bool = None
     ) -> Union[Scalar, "Series"]:
         """
         Return the maximum of the values.
@@ -1603,6 +1670,11 @@ class Frame(object, metaclass=ABCMeta):
         ----------
         axis : {index (0), columns (1)}
             Axis for the function to be applied on.
+        skipna : bool, default True
+            Exclude NA/null values when computing the result.
+
+            .. versionchanged:: 3.4.0
+               Supported including NA/null values.
         numeric_only : bool, default None
             If True, include only float, int, boolean columns. This parameter 
is mainly for
             pandas compatibility. False is supported; however, the columns 
should
@@ -1649,6 +1721,7 @@ class Frame(object, metaclass=ABCMeta):
             name="max",
             axis=axis,
             numeric_only=numeric_only,
+            skipna=skipna,
         )
 
     def count(
@@ -1726,7 +1799,11 @@ class Frame(object, metaclass=ABCMeta):
         )
 
     def std(
-        self, axis: Optional[Axis] = None, ddof: int = 1, numeric_only: bool = 
None
+        self,
+        axis: Optional[Axis] = None,
+        skipna: bool = True,
+        ddof: int = 1,
+        numeric_only: bool = None,
     ) -> Union[Scalar, "Series"]:
         """
         Return sample standard deviation.
@@ -1735,6 +1812,11 @@ class Frame(object, metaclass=ABCMeta):
         ----------
         axis : {index (0), columns (1)}
             Axis for the function to be applied on.
+        skipna : bool, default True
+            Exclude NA/null values when computing the result.
+
+            .. versionchanged:: 3.4.0
+               Supported including NA/null values.
         ddof : int, default 1
             Delta Degrees of Freedom. The divisor used in calculations is N - 
ddof,
             where N represents the number of elements.
@@ -1803,7 +1885,7 @@ class Frame(object, metaclass=ABCMeta):
                 return F.stddev_samp(spark_column)
 
         return self._reduce_for_stat_function(
-            std, name="std", axis=axis, numeric_only=numeric_only, ddof=ddof
+            std, name="std", axis=axis, numeric_only=numeric_only, ddof=ddof, 
skipna=skipna
         )
 
     def var(
@@ -1888,7 +1970,11 @@ class Frame(object, metaclass=ABCMeta):
         )
 
     def median(
-        self, axis: Optional[Axis] = None, numeric_only: bool = None, 
accuracy: int = 10000
+        self,
+        axis: Optional[Axis] = None,
+        skipna: bool = True,
+        numeric_only: bool = None,
+        accuracy: int = 10000,
     ) -> Union[Scalar, "Series"]:
         """
         Return the median of the values for the requested axis.
@@ -1901,6 +1987,11 @@ class Frame(object, metaclass=ABCMeta):
         ----------
         axis : {index (0), columns (1)}
             Axis for the function to be applied on.
+        skipna : bool, default True
+            Exclude NA/null values when computing the result.
+
+            .. versionchanged:: 3.4.0
+               Supported including NA/null values.
         numeric_only : bool, default None
             Include only float, int, boolean columns. False is not supported. This parameter
             is mainly for pandas compatibility.
@@ -1995,11 +2086,19 @@ class Frame(object, metaclass=ABCMeta):
                 )
 
         return self._reduce_for_stat_function(
-            median, name="median", numeric_only=numeric_only, axis=axis
+            median,
+            name="median",
+            numeric_only=numeric_only,
+            axis=axis,
+            skipna=skipna,
         )
 
     def sem(
-        self, axis: Optional[Axis] = None, ddof: int = 1, numeric_only: bool = None
+        self,
+        axis: Optional[Axis] = None,
+        skipna: bool = True,
+        ddof: int = 1,
+        numeric_only: bool = None,
     ) -> Union[Scalar, "Series"]:
         """
         Return unbiased standard error of the mean over requested axis.
@@ -2008,6 +2107,11 @@ class Frame(object, metaclass=ABCMeta):
         ----------
         axis : {index (0), columns (1)}
             Axis for the function to be applied on.
+        skipna : bool, default True
+            Exclude NA/null values when computing the result.
+
+            .. versionchanged:: 3.4.0
+               Supported including NA/null values.
         ddof : int, default 1
             Delta Degrees of Freedom. The divisor used in calculations is N - ddof,
             where N represents the number of elements.
@@ -2086,7 +2190,12 @@ class Frame(object, metaclass=ABCMeta):
             return std(psser) / pow(Frame._count_expr(psser), 0.5)
 
         return self._reduce_for_stat_function(
-            sem, name="sem", numeric_only=numeric_only, axis=axis, ddof=ddof
+            sem,
+            name="sem",
+            numeric_only=numeric_only,
+            axis=axis,
+            ddof=ddof,
+            skipna=skipna,
         )
 
     @property
diff --git a/python/pyspark/pandas/series.py b/python/pyspark/pandas/series.py
index f15ba4854f3..b8915e160c1 100644
--- a/python/pyspark/pandas/series.py
+++ b/python/pyspark/pandas/series.py
@@ -6849,6 +6849,7 @@ class Series(Frame, IndexOpsMixin, Generic[T]):
         name: str_type,
         axis: Optional[Axis] = None,
         numeric_only: bool = True,
+        skipna: bool = True,
         **kwargs: Any,
     ) -> Scalar:
         """
@@ -6859,13 +6860,17 @@ class Series(Frame, IndexOpsMixin, Generic[T]):
         sfun : the stats function to be used for aggregation
         name : original pandas API name.
         axis : used only for sanity check because series only support index axis.
-        numeric_only : not used by this implementation, but passed down by stats functions
+        numeric_only : not used by this implementation, but passed down by stats functions.
+        skipna: exclude NA/null values when computing the result.
         """
         axis = validate_axis(axis)
         if axis == 1:
             raise NotImplementedError("Series does not support columns axis.")
 
-        scol = sfun(self)
+        if not skipna and get_option("compute.eager_check") and self.hasnans:
+            scol = F.first(F.lit(np.nan))
+        else:
+            scol = sfun(self)
 
         min_count = kwargs.get("min_count", 0)
         if min_count > 0:
diff --git a/python/pyspark/pandas/tests/test_generic_functions.py b/python/pyspark/pandas/tests/test_generic_functions.py
index 3e4db6c86bc..5062daa77e2 100644
--- a/python/pyspark/pandas/tests/test_generic_functions.py
+++ b/python/pyspark/pandas/tests/test_generic_functions.py
@@ -111,6 +111,48 @@ class GenericFunctionsTest(PandasOnSparkTestCase, TestUtils):
         )
         self._test_interpolate(pdf)
 
+    def _test_stat_functions(self, stat_func):
+        pdf = pd.DataFrame({"a": [np.nan, np.nan, np.nan], "b": [1, np.nan, 2], "c": [1, 2, 3]})
+        psdf = ps.from_pandas(pdf)
+        self.assert_eq(stat_func(pdf.a), stat_func(psdf.a))
+        self.assert_eq(stat_func(pdf.b), stat_func(psdf.b))
+        self.assert_eq(stat_func(pdf), stat_func(psdf))
+
+    # Fix skew and kurtosis and re-enable tests below
+    def test_stat_functions(self):
+        self._test_stat_functions(lambda x: x.sum())
+        self._test_stat_functions(lambda x: x.sum(skipna=False))
+        self._test_stat_functions(lambda x: x.mean())
+        self._test_stat_functions(lambda x: x.mean(skipna=False))
+        self._test_stat_functions(lambda x: x.product())
+        self._test_stat_functions(lambda x: x.product(skipna=False))
+        self._test_stat_functions(lambda x: x.min())
+        self._test_stat_functions(lambda x: x.min(skipna=False))
+        self._test_stat_functions(lambda x: x.max())
+        self._test_stat_functions(lambda x: x.max(skipna=False))
+        self._test_stat_functions(lambda x: x.std())
+        self._test_stat_functions(lambda x: x.std(skipna=False))
+        self._test_stat_functions(lambda x: x.sem())
+        self._test_stat_functions(lambda x: x.sem(skipna=False))
+        # self._test_stat_functions(lambda x: x.skew())
+        self._test_stat_functions(lambda x: x.skew(skipna=False))
+
+        # Test cases below return differently from pandas (either by design or to be fixed)
+        pdf = pd.DataFrame({"a": [np.nan, np.nan, np.nan], "b": [1, np.nan, 2], "c": [1, 2, 3]})
+        psdf = ps.from_pandas(pdf)
+
+        self.assert_eq(pdf.a.median(), psdf.a.median())
+        self.assert_eq(pdf.a.median(skipna=False), psdf.a.median(skipna=False))
+        self.assert_eq(1.0, psdf.b.median())
+        self.assert_eq(pdf.b.median(skipna=False), psdf.b.median(skipna=False))
+        self.assert_eq(pdf.c.median(), psdf.c.median())
+
+        self.assert_eq(pdf.a.kurtosis(skipna=False), psdf.a.kurtosis(skipna=False))
+        self.assert_eq(pdf.a.kurtosis(), psdf.a.kurtosis())
+        self.assert_eq(pdf.b.kurtosis(skipna=False), psdf.b.kurtosis(skipna=False))
+        # self.assert_eq(pdf.b.kurtosis(), psdf.b.kurtosis())  AssertionError: nan != -2.0
+        self.assert_eq(-1.5, psdf.c.kurtosis())
+
 
 if __name__ == "__main__":
     import unittest

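For readers skimming the series.py hunk above, the `skipna` branch can be
modelled in plain Python. This is only a sketch under stated assumptions, not
the pandas-on-Spark implementation: `reduce_stat` and `sem` are hypothetical
helper names, a list of floats stands in for a Spark column, and the real
branch is additionally gated by the `compute.eager_check` option, which the
sketch leaves out.

```py
import math
from typing import Callable, Sequence


def reduce_stat(
    values: Sequence[float],
    stat: Callable[[Sequence[float]], float],
    skipna: bool = True,
) -> float:
    """Hypothetical model of the skipna branch in the diff above."""
    has_nans = any(math.isnan(v) for v in values)
    if not skipna and has_nans:
        # Mirrors `scol = F.first(F.lit(np.nan))`: the result is NaN
        # without evaluating the statistic at all.
        return math.nan
    # Otherwise NaN values are excluded before the statistic is applied.
    kept = [v for v in values if not math.isnan(v)]
    return stat(kept)


def sem(values: Sequence[float], ddof: int = 1) -> float:
    """Standard error: sample std divided by sqrt(count), as in the diff."""
    n = len(values)
    mean = math.fsum(values) / n
    variance = math.fsum((v - mean) ** 2 for v in values) / (n - ddof)
    return math.sqrt(variance) / math.sqrt(n)


# Columns "a" and "b" from the example in the commit message.
a = [math.nan, math.nan, math.nan]
b = [1.0, math.nan, 2.0]

sum_b_skip = reduce_stat(b, math.fsum)                # NaN dropped, 1.0 + 2.0
sum_b_keep = reduce_stat(b, math.fsum, skipna=False)  # NaN propagates
sum_a_skip = reduce_stat(a, math.fsum)                # empty sum after dropping
sem_b_skip = reduce_stat(b, sem)                      # sem of [1.0, 2.0]
```

With `skipna=True` the all-NaN column sums to 0.0 and column "b" sums to 3.0,
matching the `psdf.sum(skipna=True)` output quoted in the commit message; with
`skipna=False` any NaN in the column makes the result NaN.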

