[ 
https://issues.apache.org/jira/browse/SPARK-55792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18084012#comment-18084012
 ] 

Le Xuan Tril commented on SPARK-55792:
--------------------------------------

[~devin-petersohn] Could you please take a look at this PR when you have a 
chance?
 
This PR optimizes `DataFrame.diff(axis=0)` and `Series.diff()` so that they no 
longer use an unpartitioned Spark Window. Instead, the data is range-partitioned 
on the natural order column, and only the boundary rows are exchanged between 
neighboring partitions. The grouped `diff()` path is left unchanged.
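
To illustrate the idea, here is a rough pure-Python sketch. It is not the code 
in the PR: `partitioned_diff` is a hypothetical helper, and the list of lists 
stands in for Spark partitions that are already range-partitioned and sorted on 
the order column. Each partition computes its differences locally and needs only 
the last `periods` rows of the partitions before it.

```python
from typing import List, Optional


def partitioned_diff(
    partitions: List[List[float]], periods: int = 1
) -> List[List[Optional[float]]]:
    """Compute diff(periods) over ordered partitions.

    Hypothetical stand-in for the range-partitioned approach: only the
    trailing `periods` rows of each partition cross partition borders.
    """
    if periods < 1:
        raise ValueError("this sketch only handles periods >= 1")

    # Step 1: every partition publishes its boundary rows (its last
    # `periods` values). This is the only data that is exchanged.
    tails = [part[-periods:] for part in partitions]

    result = []
    for i, part in enumerate(partitions):
        # Step 2: gather boundary rows from preceding partitions, walking
        # backwards until `periods` values are available. More than one
        # predecessor is needed only when a partition is shorter than
        # `periods` (or empty).
        carry: List[float] = []
        j = i - 1
        while j >= 0 and len(carry) < periods:
            carry = tails[j] + carry
            j -= 1
        carry = carry[-periods:]

        # Step 3: left-pad with None where no earlier row exists (the start
        # of the whole dataset), then diff locally within the partition.
        padded = [None] * (periods - len(carry)) + carry + part
        out = []
        for k, value in enumerate(part):
            previous = padded[k]  # the row `periods` positions earlier
            out.append(None if previous is None else value - previous)
        result.append(out)
    return result


print(partitioned_diff([[1, 3, 6], [10, 15], [21]]))
# [[None, 2, 3], [4, 5], [6]]
```

Flattening the result gives the same values as a diff over the concatenated 
data, but no step requires all rows to sit in a single partition.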
 
Happy to address any feedback. Thank you!

> Optimize DataFrame.diff axis=0 to avoid unpartitioned Window
> ------------------------------------------------------------
>
>                 Key: SPARK-55792
>                 URL: https://issues.apache.org/jira/browse/SPARK-55792
>             Project: Spark
>          Issue Type: Bug
>          Components: Pandas API on Spark
>    Affects Versions: 4.1.1
>            Reporter: Devin Petersohn
>            Priority: Major
>              Labels: pull-request-available
>
> DataFrame.diff(axis=0) currently uses a Spark Window without a partition 
> specification, which moves all rows into a single partition and will not 
> scale to large datasets. We should try to optimize away the unpartitioned 
> window (e.g., by using a partitioned window similar to other projects in 
> the space).


