This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
     new 3cf0c83d29aa [SPARK-47771][PYTHON][DOCS][TESTS][FOLLOWUP] Make `max_by, min_by` doctests deterministic
3cf0c83d29aa is described below

commit 3cf0c83d29aa9a266f6f4802bfcf67607cc21555
Author: Ruifeng Zheng <ruife...@apache.org>
AuthorDate: Wed Apr 24 11:23:43 2024 +0800

[SPARK-47771][PYTHON][DOCS][TESTS][FOLLOWUP] Make `max_by, min_by` doctests deterministic

### What changes were proposed in this pull request?

Make the `max_by` and `min_by` doctests deterministic.

### Why are the changes needed?

https://github.com/apache/spark/pull/45939 addressed this issue by sorting the result rows, but unfortunately that is not enough: in the group `department=Finance`, the two rows `("Finance", "Frank", 5)` and `("Finance", "George", 5)` share the same value `years_in_dept=5`, so `min_by("name", "years_in_dept")` and `max_by("name", "years_in_dept")` are still non-deterministic. The tests failed in some environments:

```
**********************************************************************
File "/home/jenkins/python/pyspark/sql/connect/functions/builtin.py", line 1177, in pyspark.sql.connect.functions.builtin.max_by
Failed example:
    df.groupby("department").agg(
        sf.max_by("name", "years_in_dept")
    ).sort("department").show()
Expected:
    +----------+---------------------------+
    |department|max_by(name, years_in_dept)|
    +----------+---------------------------+
    |   Consult|                      Henry|
    |   Finance|                     George|
    +----------+---------------------------+
Got:
    +----------+---------------------------+
    |department|max_by(name, years_in_dept)|
    +----------+---------------------------+
    |   Consult|                      Henry|
    |   Finance|                      Frank|
    +----------+---------------------------+
    <BLANKLINE>
**********************************************************************
File "/home/jenkins/python/pyspark/sql/connect/functions/builtin.py", line 1205, in pyspark.sql.connect.functions.builtin.min_by
Failed example:
    df.groupby("department").agg(
        sf.min_by("name", "years_in_dept")
    ).sort("department").show()
Expected:
    +----------+---------------------------+
    |department|min_by(name, years_in_dept)|
    +----------+---------------------------+
    |   Consult|                        Eva|
    |   Finance|                     George|
    +----------+---------------------------+
Got:
    +----------+---------------------------+
    |department|min_by(name, years_in_dept)|
    +----------+---------------------------+
    |   Consult|                        Eva|
    |   Finance|                      Frank|
    +----------+---------------------------+
    <BLANKLINE>
**********************************************************************
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46196 from zhengruifeng/doc_max_min_by.

Authored-by: Ruifeng Zheng <ruife...@apache.org>
Signed-off-by: Ruifeng Zheng <ruife...@apache.org>
---
 python/pyspark/sql/functions/builtin.py | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/python/pyspark/sql/functions/builtin.py b/python/pyspark/sql/functions/builtin.py
index 96be5de0180b..b54b377aaebc 100644
--- a/python/pyspark/sql/functions/builtin.py
+++ b/python/pyspark/sql/functions/builtin.py
@@ -1275,7 +1275,7 @@ def max_by(col: "ColumnOrName", ord: "ColumnOrName") -> Column:
     >>> import pyspark.sql.functions as sf
     >>> df = spark.createDataFrame([
     ...     ("Consult", "Eva", 6), ("Finance", "Frank", 5),
-    ...     ("Finance", "George", 5), ("Consult", "Henry", 7)],
+    ...     ("Finance", "George", 9), ("Consult", "Henry", 7)],
     ...     schema=("department", "name", "years_in_dept"))
     >>> df.groupby("department").agg(
     ...     sf.max_by("name", "years_in_dept")
@@ -1356,7 +1356,7 @@ def min_by(col: "ColumnOrName", ord: "ColumnOrName") -> Column:
     >>> import pyspark.sql.functions as sf
     >>> df = spark.createDataFrame([
     ...     ("Consult", "Eva", 6), ("Finance", "Frank", 5),
-    ...     ("Finance", "George", 5), ("Consult", "Henry", 7)],
+    ...     ("Finance", "George", 9), ("Consult", "Henry", 7)],
     ...     schema=("department", "name", "years_in_dept"))
     >>> df.groupby("department").agg(
     ...     sf.min_by("name", "years_in_dept")
@@ -1365,7 +1365,7 @@ def min_by(col: "ColumnOrName", ord: "ColumnOrName") -> Column:
     |department|min_by(name, years_in_dept)|
     +----------+---------------------------+
     |   Consult|                        Eva|
-    |   Finance|                     George|
+    |   Finance|                      Frank|
     +----------+---------------------------+
     """
     return _invoke_function_over_columns("min_by", col, ord)
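For readers who want to reproduce the issue locally, below is a minimal standalone sketch (not part of the patch) showing why the tie makes the aggregation non-deterministic and how the changed data avoids it. The local SparkSession setup is an assumption for illustration; the data, column names, and functions come from the doctests above.

```python
# Minimal repro sketch, not part of the patch. Assumes a local PySpark
# installation; the SparkSession setup below is illustrative only.
from pyspark.sql import SparkSession
import pyspark.sql.functions as sf

spark = SparkSession.builder.master("local[1]").getOrCreate()

# Original doctest data: Frank and George tie on years_in_dept=5, so
# min_by/max_by over the Finance group may return either name -- the
# tie-breaking order is unspecified and can differ between runs.
tied = spark.createDataFrame(
    [("Consult", "Eva", 6), ("Finance", "Frank", 5),
     ("Finance", "George", 5), ("Consult", "Henry", 7)],
    schema=("department", "name", "years_in_dept"))
tied.groupby("department").agg(
    sf.max_by("name", "years_in_dept")).sort("department").show()
# The Finance row may show Frank or George, depending on the run.

# Patched data: George's years_in_dept becomes 9, so every group has a
# unique minimum and maximum, and the doctest output is deterministic.
fixed = spark.createDataFrame(
    [("Consult", "Eva", 6), ("Finance", "Frank", 5),
     ("Finance", "George", 9), ("Consult", "Henry", 7)],
    schema=("department", "name", "years_in_dept"))
fixed.groupby("department").agg(
    sf.min_by("name", "years_in_dept"),
    sf.max_by("name", "years_in_dept")).sort("department").show()
# Finance now always yields Frank for min_by and George for max_by.

spark.stop()
```

The patch keeps the `sort("department")` added in #45939 and only changes the data, so that no two rows in a group tie on the ordering column; together, the two measures make the expected doctest output stable across environments.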