(spark) branch master updated: [SPARK-48045][PYTHON] Pandas API groupby with multi-agg-relabel ignores as_index=False

gurwls223 Tue, 07 May 2024 17:44:40 -0700

This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git



The following commit(s) were added to refs/heads/master by this push:
     new 67ae23934b56 [SPARK-48045][PYTHON] Pandas API groupby with multi-agg-relabel ignores as_index=False
67ae23934b56 is described below

commit 67ae23934b56761617c2fb217ae6cf6f2d8f619b
Author: sai <said...@saidatts-mbp.attlocal.net>
AuthorDate: Wed May 8 09:44:16 2024 +0900

    [SPARK-48045][PYTHON] Pandas API groupby with multi-agg-relabel ignores as_index=False
    
    ### What changes were proposed in this pull request?
    When GroupBy in the pandas API on Spark is used with relabeled aggregate columns and as_index=False,
    the group-by columns are not returned in the DataFrame. This change fixes that bug.
    
    Example:
    ps.DataFrame({"a": [0, 0], "b": [0, 1]}).groupby("a", as_index=False).agg(b_max=("b", "max"))
    
    Result:
    _  b_max
    0      1
    
    Required Result:
    _  a  b_max
    0  0      1
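
    For reference, plain pandas already returns the grouping column in this case.
    The following is a minimal check, assuming only that pandas is installed; it uses
    pandas directly rather than pyspark.pandas, and the variable names are illustrative.

```python
import pandas as pd

# Same data as the example above, but in plain pandas (not pyspark.pandas).
pdf = pd.DataFrame({"a": [0, 0], "b": [0, 1]})

# Named aggregation ("relabeling") combined with as_index=False.
result = pdf.groupby("a", as_index=False).agg(b_max=("b", "max"))

# The grouping column "a" is kept as a regular column next to "b_max".
print(list(result.columns))
print(result)
```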
    
    ### Why are the changes needed?
    The relabeling code uses only the aggregate columns. When as_index=True, this is not an issue because the group-by columns are included in the index. When as_index=False, the group-by columns must also be included in the columns that the relabeling code selects.
    
    See the commits in the PR for the code changes.
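
    The column selection described above can be sketched on plain label lists, without
    Spark. This is an illustration only, not the code in groupby.py: the function name
    relabeled_columns and its parameters are hypothetical, and it assumes column labels
    are tuples whose first element is the group-by column name.

```python
from typing import List, Sequence, Tuple

Label = Tuple[str, ...]


def relabeled_columns(
    labels_before_reset: Sequence[Label],
    labels_after_reset: Sequence[Label],
    order: Sequence[Label],
    new_names: Sequence[str],
) -> Tuple[List[Label], List[str]]:
    """Return (labels to select, final column names) for as_index=False."""
    before = set(labels_before_reset)
    # Labels that only appear after the index reset are the group-by columns.
    group_key_labels = [c for c in labels_after_reset if c not in before]
    # Select the group-by columns first, then the aggregates in the requested order.
    selection = group_key_labels + list(order)
    # Group-by columns keep their own name; aggregates take the relabeled names.
    final_names = [c[0] for c in group_key_labels] + list(new_names)
    return selection, final_names


# Hypothetical labels for groupby("a", as_index=False).agg(b_max=("b", "max")).
selection, final_names = relabeled_columns(
    labels_before_reset=[("b", "max")],
    labels_after_reset=[("a", ""), ("b", "max")],
    order=[("b", "max")],
    new_names=["b_max"],
)
print(selection, final_names)
```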
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    - Passed GA
    - Passed Build tests
    - Unit tested, including scenarios in addition to the one provided in the Jira ticket
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No
    
    Closes #46391 from sinaiamonkar-sai/SPARK-48045-2.
    
    Authored-by: sai <said...@saidatts-mbp.attlocal.net>
    Signed-off-by: Hyukjin Kwon <gurwls...@apache.org>
---
 python/pyspark/pandas/groupby.py                    |  7 ++++++-
 python/pyspark/pandas/tests/groupby/test_groupby.py | 21 +++++++++++++++++++++
 2 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/python/pyspark/pandas/groupby.py b/python/pyspark/pandas/groupby.py
index ec47ab75c43c..55627a4c740c 100644
--- a/python/pyspark/pandas/groupby.py
+++ b/python/pyspark/pandas/groupby.py
@@ -308,6 +308,7 @@ class GroupBy(Generic[FrameLike], metaclass=ABCMeta):
             )
 
         if not self._as_index:
+            index_cols = psdf._internal.column_labels
             should_drop_index = set(
                 i for i, gkey in enumerate(self._groupkeys) if gkey._psdf is not self._psdf
             )
@@ -322,8 +323,12 @@ class GroupBy(Generic[FrameLike], metaclass=ABCMeta):
                 psdf = psdf.reset_index(level=should_drop_index, drop=drop)
             if len(should_drop_index) < len(self._groupkeys):
                 psdf = psdf.reset_index()
+            index_cols = [c for c in psdf._internal.column_labels if c not in index_cols]
+            if relabeling:
+                psdf = psdf[pd.Index(index_cols + list(order))]
+                psdf.columns = pd.Index([c[0] for c in index_cols] + list(columns))
 
-        if relabeling:
+        if relabeling and self._as_index:
             psdf = psdf[order]
             psdf.columns = columns  # type: ignore[assignment]
         return psdf
diff --git a/python/pyspark/pandas/tests/groupby/test_groupby.py b/python/pyspark/pandas/tests/groupby/test_groupby.py
index 5867f7b62fa5..b58bfddb4b99 100644
--- a/python/pyspark/pandas/tests/groupby/test_groupby.py
+++ b/python/pyspark/pandas/tests/groupby/test_groupby.py
@@ -451,6 +451,27 @@ class GroupByTestsMixin:
             pdf.groupby([("x", "a"), ("x", "b")]).diff().sort_index(),
         )
 
+    def test_aggregate_relabel_index_false(self):
+        pdf = pd.DataFrame(
+            {
+                "A": [0, 0, 1, 1, 1],
+                "B": ["a", "a", "b", "a", "b"],
+                "C": [10, 15, 10, 20, 30],
+            }
+        )
+        psdf = ps.from_pandas(pdf)
+
+        self.assert_eq(
+            pdf.groupby(["B", "A"], as_index=False)
+            .agg(C_MAX=("C", "max"))
+            .sort_values(["B", "A"])
+            .reset_index(drop=True),
+            psdf.groupby(["B", "A"], as_index=False)
+            .agg(C_MAX=("C", "max"))
+            .sort_values(["B", "A"])
+            .reset_index(drop=True),
+        )
+
 
 class GroupByTests(
     GroupByTestsMixin,


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
