This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.5 by this push:
     new d18659de626c [SPARK-47824][PS] Fix nondeterminism in pyspark.pandas.series.asof
d18659de626c is described below

commit d18659de626cc3743e7f6a5dceca0f2a25b006de
Author: Mark Jarvin <mark.jar...@databricks.com>
AuthorDate: Fri Apr 12 09:37:19 2024 +0900

    [SPARK-47824][PS] Fix nondeterminism in pyspark.pandas.series.asof

    ### What changes were proposed in this pull request?

    Use the monotonically increasing ID column as the sorting condition for `max_by` instead of a literal string.

    ### Why are the changes needed?

    https://github.com/apache/spark/pull/35191 had an error where the literal string `"__monotonically_increasing_id__"` was used as the tie-breaker in `max_by` instead of the actual ID.

    ### Does this PR introduce _any_ user-facing change?

    Fixes nondeterminism in `asof`.

    ### How was this patch tested?

    In some circumstances, running `//python:pyspark.pandas.tests.connect.series.test_parity_as_of` is sufficient to reproduce the nondeterminism.

    ### Was this patch authored or co-authored using generative AI tooling?

    No

    Closes #46018 from markj-db/SPARK-47824.

    Authored-by: Mark Jarvin <mark.jar...@databricks.com>
    Signed-off-by: Hyukjin Kwon <gurwls...@apache.org>
    (cherry picked from commit a0ccdf27e5ff30817b8f058f08f98d5b44bad2db)
    Signed-off-by: Hyukjin Kwon <gurwls...@apache.org>
---
 python/pyspark/pandas/series.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/python/pyspark/pandas/series.py b/python/pyspark/pandas/series.py
index 95ca92e78787..b54ae88616fa 100644
--- a/python/pyspark/pandas/series.py
+++ b/python/pyspark/pandas/series.py
@@ -5910,7 +5910,7 @@ class Series(Frame, IndexOpsMixin, Generic[T]):
                     # then return monotonically_increasing_id. This will let max by
                     # to return last index value, which is the behaviour of pandas
                     else spark_column.isNotNull(),
-                    monotonically_increasing_id_column,
+                    F.col(monotonically_increasing_id_column),
                 ),
             )
             for index in where
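
To illustrate why a literal ordering key makes the result nondeterministic, here is a minimal pure-Python sketch. It is an analogy, not Spark's implementation: the `max_by` helper, the `(index, value, row_id)` row layout, and the first-seen tie handling are all assumptions made for the example. With a constant key every matching row ties, so the answer depends on the order in which rows arrive; with the per-row ID as the key, the last matching row wins regardless of arrival order.

```python
from typing import Any, Callable, Iterable, Tuple

# Hypothetical row layout for this sketch: (index, value, row_id).
Row = Tuple[int, float, int]


def max_by(
    rows: Iterable[Row],
    value: Callable[[Row], Any],
    order: Callable[[Row], Any],
) -> Any:
    """Return value(row) for the row with the largest non-None order(row).

    Ties keep the first row seen (an assumption of this sketch), so with
    tied keys the answer depends on arrival order.
    """
    best_value, best_order = None, None
    for row in rows:
        key = order(row)
        if key is None:
            continue  # row does not satisfy the "index <= where" condition
        if best_order is None or key > best_order:
            best_value, best_order = value(row), key
    return best_value


rows = [(10, 1.0, 0), (20, 2.0, 1), (30, 3.0, 2)]
where = 25  # look up the last value at or before index 25


def buggy_order(row: Row) -> Any:
    # Before the fix: every matching row gets the same literal string key.
    return "__monotonically_increasing_id__" if row[0] <= where else None


def fixed_order(row: Row) -> Any:
    # After the fix: every matching row is keyed by its own increasing ID.
    return row[2] if row[0] <= where else None


forward = rows
backward = list(reversed(rows))  # simulate a different arrival order

buggy_forward = max_by(forward, lambda r: r[1], buggy_order)
buggy_backward = max_by(backward, lambda r: r[1], buggy_order)
fixed_forward = max_by(forward, lambda r: r[1], fixed_order)
fixed_backward = max_by(backward, lambda r: r[1], fixed_order)

print("literal key:", buggy_forward, buggy_backward)
print("id key:     ", fixed_forward, fixed_backward)
```

In the patched line, wrapping the name in `F.col(...)` has the same effect as switching from `buggy_order` to `fixed_order` here: the ordering expression refers to the ID column's values rather than to a constant.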