This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
     new e932e0ad289 [SPARK-40579][PS] `GroupBy.first` should skip NULLs
e932e0ad289 is described below

commit e932e0ad289ad46f2cc21c225955fdacaeaf9d24
Author: Ruifeng Zheng <ruife...@apache.org>
AuthorDate: Wed Sep 28 10:24:08 2022 +0900

    [SPARK-40579][PS] `GroupBy.first` should skip NULLs
    
    ### What changes were proposed in this pull request?
    Make `GroupBy.first` skip nulls, so that it returns the first non-null value in each group.
    
    ### Why are the changes needed?
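    To fix the behavior difference between pandas and pandas-on-Spark: pandas returns the first non-null value per group, while pandas-on-Spark returned the first value even when it was null.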
    
    ```
    In [1]:
       ...: import pandas as pd
       ...: import numpy as np
       ...: import pyspark.pandas as ps
       ...:
       ...: pdf = pd.DataFrame({"A": [1, 2, 1, 2],"B": [-1.5, np.nan, -3.2, 0.1],})
       ...: psdf = ps.from_pandas(pdf)
       ...:
    
    In [2]: pdf.groupby("A").first()
    Out[2]:
         B
    A
    1 -1.5
    2  0.1
    
    In [3]: psdf.groupby("A").first()
    
         B
    A
    1 -1.5
    2  NaN
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    Yes, `GroupBy.first` now skips NULLs.
    
    ### How was this patch tested?
    Added a unit test.
    
    Closes #38017 from zhengruifeng/ps_first_skip_na.
    
    Authored-by: Ruifeng Zheng <ruife...@apache.org>
    Signed-off-by: Hyukjin Kwon <gurwls...@apache.org>
---
 python/pyspark/pandas/groupby.py            |  3 ++-
 python/pyspark/pandas/tests/test_groupby.py | 11 +++++++++++
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/python/pyspark/pandas/groupby.py b/python/pyspark/pandas/groupby.py
index 9378e83af90..95a41398619 100644
--- a/python/pyspark/pandas/groupby.py
+++ b/python/pyspark/pandas/groupby.py
@@ -449,7 +449,8 @@ class GroupBy(Generic[FrameLike], metaclass=ABCMeta):
         2  False  3
         """
         return self._reduce_for_stat_function(
-            F.first, accepted_spark_types=(NumericType, BooleanType) if numeric_only else None
+            lambda col: F.first(col, ignorenulls=True),
+            accepted_spark_types=(NumericType, BooleanType) if numeric_only else None,
         )
 
     def last(self, numeric_only: Optional[bool] = False) -> FrameLike:
diff --git a/python/pyspark/pandas/tests/test_groupby.py b/python/pyspark/pandas/tests/test_groupby.py
index 481a0f8cfac..1f79f9b2939 100644
--- a/python/pyspark/pandas/tests/test_groupby.py
+++ b/python/pyspark/pandas/tests/test_groupby.py
@@ -1419,6 +1419,17 @@ class GroupByTest(PandasOnSparkTestCase, TestUtils):
         self._test_stat_func(lambda groupby_obj: groupby_obj.first(numeric_only=None))
         self._test_stat_func(lambda groupby_obj: groupby_obj.first(numeric_only=True))
 
+        pdf = pd.DataFrame(
+            {
+                "A": [1, 2, 1, 2],
+                "B": [-1.5, np.nan, -3.2, 0.1],
+            }
+        )
+        psdf = ps.from_pandas(pdf)
+        self.assert_eq(
+            pdf.groupby("A").first().sort_index(), psdf.groupby("A").first().sort_index()
+        )
+
     def test_last(self):
         self._test_stat_func(lambda groupby_obj: groupby_obj.last())
         self._test_stat_func(lambda groupby_obj: groupby_obj.last(numeric_only=None))
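
For readers without a Spark session at hand, the semantics this commit changes can be sketched in plain Python. This is only an illustration, not the pandas-on-Spark implementation: `group_first` and `is_null` are hypothetical helpers, and treating both `None` and float NaN as missing is an assumption made here to mirror the example in the commit message.

```python
import math
from typing import Dict, Hashable, Iterable, Optional, Tuple


def is_null(value: Optional[float]) -> bool:
    """Treat None and float NaN as missing (an assumption for this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def group_first(
    rows: Iterable[Tuple[Hashable, Optional[float]]], skip_nulls: bool
) -> Dict[Hashable, Optional[float]]:
    """Return the first value seen per key, optionally skipping missing values."""
    result: Dict[Hashable, Optional[float]] = {}
    for key, value in rows:
        if key not in result:
            # First row for this key: record it even if it is missing.
            result[key] = value
        elif skip_nulls and is_null(result[key]) and not is_null(value):
            # Replace a missing placeholder with the first non-missing value.
            result[key] = value
    return result


# Same data as the example in the commit message: columns A (key) and B (value).
rows = [(1, -1.5), (2, float("nan")), (1, -3.2), (2, 0.1)]

old = group_first(rows, skip_nulls=False)  # behavior before this commit
new = group_first(rows, skip_nulls=True)   # behavior after this commit
print(old)
print(new)
```

With `skip_nulls=False`, group 2 keeps the NaN from its first row, which corresponds to the old pandas-on-Spark output. With `skip_nulls=True`, group 2 yields 0.1, matching pandas. A group whose values are all missing still yields a missing value in both modes.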


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
