TheNeuralBit commented on a change in pull request #14517:
URL: https://github.com/apache/beam/pull/14517#discussion_r613621517
##########
File path: sdks/python/apache_beam/dataframe/frames.py
##########
@@ -1393,9 +1430,16 @@ def fill_dataframe(*args):
- cummax = cummin = cumsum = cumprod = frame_base.wont_implement_method(
- 'order-sensitive')
- diff = frame_base.wont_implement_method('order-sensitive')
+ cummax = frame_base.order_sensitive_method(pd.DataFrame, 'cummax')
+ cummin = frame_base.order_sensitive_method(pd.DataFrame, 'cummin')
+ cumprod = frame_base.order_sensitive_method(pd.DataFrame, 'cumprod')
+ cumsum = frame_base.order_sensitive_method(pd.DataFrame, 'cumsum')
+ diff = frame_base.order_sensitive_method(pd.DataFrame, 'diff')
Review comment:
The problem isn't that the individual operations are difficult to make
efficient; it's that the underlying PCollections have no concept of ordering.
So at execution time we don't know which row is supposed to come before a
given row. In all likelihood it's on another worker due to our hash-based
partitioning.
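To make the ordering problem concrete, here is a minimal pandas-only sketch. It does not use Beam; the modulo split on the index is a hypothetical stand-in for hash-based partitioning, and each list element plays the role of one worker's partition.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6], index=range(6))

# Hypothetical stand-in for hash-based partitioning: rows are assigned
# to "workers" by index modulo, so neighbouring rows end up apart.
num_partitions = 2
partitions = [s[s.index % num_partitions == p] for p in range(num_partitions)]

# Each worker computes cumsum while seeing only its own rows.
per_partition = pd.concat([part.cumsum() for part in partitions]).sort_index()

# The result a user would expect from plain pandas.
expected = s.cumsum()

# False: no worker knows the running total of the rows that precede its own.
matches = per_partition.equals(expected)
```

Stitching the per-partition results back together does not recover `expected`, because the rows each partition needed to see first live in the other partition.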
@robertwb and I have talked about implementing order-sensitive methods in
the future (I recently filed BEAM-12129 to track this). I think the way it
would work is that we would allow users to perform an operation like
`sort_values` that _imposes_ an ordering, and then it would be possible to
perform order-sensitive operations on the output.
Under the hood we would partition "sorted" dataframes based on their
ordering, rather than scattering rows across workers with hash-based
partitioning.
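As a rough illustration of why order-based partitioning would help, the pandas-only sketch below splits a sorted frame into contiguous chunks and carries a running total from one chunk to the next. This is an assumption about how such a scheme could look, not the design tracked in BEAM-12129; the column names, `chunk_size`, and the `carry` variable are all made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    'key': [5, 1, 4, 2, 6, 3],
    'value': [50, 10, 40, 20, 60, 30],
})

# Impose an ordering first, as `sort_values` would.
ordered = df.sort_values('key')

# Hypothetical order-based partitioning: each "worker" receives a
# contiguous range of the sorted rows.
chunk_size = 2
chunks = [
    ordered.iloc[i:i + chunk_size]
    for i in range(0, len(ordered), chunk_size)
]

# Each chunk computes a local cumsum and adds the total carried over
# from all earlier chunks.
results = []
carry = 0
for chunk in chunks:
    local = chunk['value'].cumsum() + carry
    carry = local.iloc[-1]
    results.append(local)

combined = pd.concat(results)
expected = ordered['value'].cumsum()
```

Because every chunk covers a contiguous range of the ordering, the only cross-partition information needed is one running total per boundary, and `combined` matches `expected`.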