This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
     new 3cf0c83d29aa [SPARK-47771][PYTHON][DOCS][TESTS][FOLLOWUP] Make `max_by, min_by` doctests deterministic
3cf0c83d29aa is described below

commit 3cf0c83d29aa9a266f6f4802bfcf67607cc21555
Author: Ruifeng Zheng <ruife...@apache.org>
AuthorDate: Wed Apr 24 11:23:43 2024 +0800

    [SPARK-47771][PYTHON][DOCS][TESTS][FOLLOWUP] Make `max_by, min_by` doctests deterministic
    
    ### What changes were proposed in this pull request?
    Make `max_by, min_by` doctests deterministic
    
    ### Why are the changes needed?
    https://github.com/apache/spark/pull/45939 fixed this issue by sorting the rows; unfortunately, that is not enough:
    
    In the group `department=Finance`, the two rows `("Finance", "Frank", 5)` and `("Finance", "George", 5)` share the same value `years_in_dept=5`, so `min_by("name", "years_in_dept")` and `max_by("name", "years_in_dept")` are still non-deterministic.
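    
    For illustration, a minimal sketch (assuming an active SparkSession `spark`, as in the doctests) of how the tie makes the result depend on row order:
    ```
    import pyspark.sql.functions as sf

    # Both Finance rows share years_in_dept=5, so max_by/min_by may return
    # either "Frank" or "George" depending on evaluation order.
    df = spark.createDataFrame(
        [("Finance", "Frank", 5), ("Finance", "George", 5)],
        schema=("department", "name", "years_in_dept"))
    df.groupby("department").agg(sf.max_by("name", "years_in_dept")).show()
    ```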
    
    These doctests failed in some environments:
    ```
    **********************************************************************
    File "/home/jenkins/python/pyspark/sql/connect/functions/builtin.py", line 
1177, in pyspark.sql.connect.functions.builtin.max_by
    Failed example:
        df.groupby("department").agg(
            sf.max_by("name", "years_in_dept")
        ).sort("department").show()
    Expected:
        +----------+---------------------------+
        |department|max_by(name, years_in_dept)|
        +----------+---------------------------+
        |   Consult|                      Henry|
        |   Finance|                     George|
        +----------+---------------------------+
    Got:
        +----------+---------------------------+
        |department|max_by(name, years_in_dept)|
        +----------+---------------------------+
        |   Consult|                      Henry|
        |   Finance|                      Frank|
        +----------+---------------------------+
        <BLANKLINE>
    **********************************************************************
    File "/home/jenkins/python/pyspark/sql/connect/functions/builtin.py", line 
1205, in pyspark.sql.connect.functions.builtin.min_by
    Failed example:
        df.groupby("department").agg(
            sf.min_by("name", "years_in_dept")
        ).sort("department").show()
    Expected:
        +----------+---------------------------+
        |department|min_by(name, years_in_dept)|
        +----------+---------------------------+
        |   Consult|                        Eva|
        |   Finance|                     George|
        +----------+---------------------------+
    Got:
        +----------+---------------------------+
        |department|min_by(name, years_in_dept)|
        +----------+---------------------------+
        |   Consult|                        Eva|
        |   Finance|                      Frank|
        +----------+---------------------------+
        <BLANKLINE>
    **********************************************************************
    ```
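    
    The fix below breaks the tie by bumping George's `years_in_dept` from 5 to 9, so each department has a unique extreme value; a quick sketch of the now-deterministic result (again assuming an active SparkSession `spark`):
    ```
    import pyspark.sql.functions as sf

    # With the tie broken (George: 9, Frank: 5), both aggregates are
    # deterministic: max_by -> George, min_by -> Frank.
    df = spark.createDataFrame(
        [("Finance", "Frank", 5), ("Finance", "George", 9)],
        schema=("department", "name", "years_in_dept"))
    df.groupby("department").agg(
        sf.max_by("name", "years_in_dept"),
        sf.min_by("name", "years_in_dept")).show()
    ```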
    
    ### Does this PR introduce _any_ user-facing change?
    no
    
    ### How was this patch tested?
    ci
    
    ### Was this patch authored or co-authored using generative AI tooling?
    no
    
    Closes #46196 from zhengruifeng/doc_max_min_by.
    
    Authored-by: Ruifeng Zheng <ruife...@apache.org>
    Signed-off-by: Ruifeng Zheng <ruife...@apache.org>
---
 python/pyspark/sql/functions/builtin.py | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/python/pyspark/sql/functions/builtin.py b/python/pyspark/sql/functions/builtin.py
index 96be5de0180b..b54b377aaebc 100644
--- a/python/pyspark/sql/functions/builtin.py
+++ b/python/pyspark/sql/functions/builtin.py
@@ -1275,7 +1275,7 @@ def max_by(col: "ColumnOrName", ord: "ColumnOrName") -> Column:
     >>> import pyspark.sql.functions as sf
     >>> df = spark.createDataFrame([
     ...     ("Consult", "Eva", 6), ("Finance", "Frank", 5),
-    ...     ("Finance", "George", 5), ("Consult", "Henry", 7)],
+    ...     ("Finance", "George", 9), ("Consult", "Henry", 7)],
     ...     schema=("department", "name", "years_in_dept"))
     >>> df.groupby("department").agg(
     ...     sf.max_by("name", "years_in_dept")
@@ -1356,7 +1356,7 @@ def min_by(col: "ColumnOrName", ord: "ColumnOrName") -> Column:
     >>> import pyspark.sql.functions as sf
     >>> df = spark.createDataFrame([
     ...     ("Consult", "Eva", 6), ("Finance", "Frank", 5),
-    ...     ("Finance", "George", 5), ("Consult", "Henry", 7)],
+    ...     ("Finance", "George", 9), ("Consult", "Henry", 7)],
     ...     schema=("department", "name", "years_in_dept"))
     >>> df.groupby("department").agg(
     ...     sf.min_by("name", "years_in_dept")
@@ -1365,7 +1365,7 @@ def min_by(col: "ColumnOrName", ord: "ColumnOrName") -> Column:
     |department|min_by(name, years_in_dept)|
     +----------+---------------------------+
     |   Consult|                        Eva|
-    |   Finance|                     George|
+    |   Finance|                      Frank|
     +----------+---------------------------+
     """
     return _invoke_function_over_columns("min_by", col, ord)


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
