(spark) branch branch-3.5 updated: [SPARK-47824][PS] Fix nondeterminism in pyspark.pandas.series.asof

gurwls223 Thu, 11 Apr 2024 17:39:27 -0700

This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git



The following commit(s) were added to refs/heads/branch-3.5 by this push:
     new d18659de626c [SPARK-47824][PS] Fix nondeterminism in pyspark.pandas.series.asof
d18659de626c is described below

commit d18659de626cc3743e7f6a5dceca0f2a25b006de
Author: Mark Jarvin <mark.jar...@databricks.com>
AuthorDate: Fri Apr 12 09:37:19 2024 +0900

    [SPARK-47824][PS] Fix nondeterminism in pyspark.pandas.series.asof
    
    ### What changes were proposed in this pull request?
    
    Use the monotonically increasing ID column as the sorting condition for 
`max_by` instead of a literal string.
    
    ### Why are the changes needed?
    https://github.com/apache/spark/pull/35191 introduced an error: the literal 
string `"__monotonically_increasing_id__"` was used as the tie-breaker in 
`max_by` instead of the actual ID column, so every matching row received the 
same ordering value.
    
    ### Does this PR introduce _any_ user-facing change?
    Yes, it fixes nondeterminism in `asof`.
    
    ### How was this patch tested?
    In some circumstances, running 
`//python:pyspark.pandas.tests.connect.series.test_parity_as_of` is sufficient 
to reproduce the nondeterminism.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No
    
    Closes #46018 from markj-db/SPARK-47824.
    
    Authored-by: Mark Jarvin <mark.jar...@databricks.com>
    Signed-off-by: Hyukjin Kwon <gurwls...@apache.org>
    (cherry picked from commit a0ccdf27e5ff30817b8f058f08f98d5b44bad2db)
    Signed-off-by: Hyukjin Kwon <gurwls...@apache.org>
---
 python/pyspark/pandas/series.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/python/pyspark/pandas/series.py b/python/pyspark/pandas/series.py
index 95ca92e78787..b54ae88616fa 100644
--- a/python/pyspark/pandas/series.py
+++ b/python/pyspark/pandas/series.py
@@ -5910,7 +5910,7 @@ class Series(Frame, IndexOpsMixin, Generic[T]):
                     # then return monotonically_increasing_id. This will let max by
                     # to return last index value, which is the behaviour of pandas
                     else spark_column.isNotNull(),
-                    monotonically_increasing_id_column,
+                    F.col(monotonically_increasing_id_column),
                 ),
             )
             for index in where
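
The effect of the one-line change above can be illustrated without Spark. The following is a minimal pure-Python sketch, not the pandas-on-Spark implementation: `max_by`, `pick_with_literal`, `pick_with_id`, and the sample rows are hypothetical stand-ins, and iterating over every row ordering is only an assumed model of rows arriving in a different order from run to run. It shows that a constant ordering key makes all rows tie, so the winner depends on arrival order, while ordering by the real per-row ID always selects the last row.

```python
from itertools import permutations

# Hypothetical rows of (row_id, value); the ids stand in for the
# monotonically increasing ID column. All names here are illustrative.
ROWS = [(0, "a"), (1, "b"), (2, "c"), (3, "d")]
LITERAL = "__monotonically_increasing_id__"


def max_by(rows, value_of, order_of):
    """Return value_of(row) for a row whose order_of(row) is largest."""
    return value_of(max(rows, key=order_of))


def pick_with_literal(rows):
    # Buggy shape: every row gets the same constant ordering key,
    # so all rows tie and the winner depends on arrival order.
    return max_by(rows, lambda r: r[1], lambda r: LITERAL)


def pick_with_id(rows):
    # Fixed shape: order by the actual per-row ID, so the last row wins.
    return max_by(rows, lambda r: r[1], lambda r: r[0])


def outcomes(pick, rows):
    """Collect the distinct results of pick over every row ordering."""
    return {pick(list(order)) for order in permutations(rows)}


print(sorted(outcomes(pick_with_literal, ROWS)))  # more than one distinct result
print(sorted(outcomes(pick_with_id, ROWS)))       # a single result
```

In this sketch the tie is resolved by Python's `max`, which keeps the first maximal element it sees; how Spark resolves such ties is not specified here, which is the reason the result was not reproducible.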


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
