TheNeuralBit commented on a change in pull request #13401:
URL: https://github.com/apache/beam/pull/13401#discussion_r528923757
##########
File path: sdks/python/apache_beam/dataframe/expressions.py
##########
@@ -48,10 +48,15 @@ def lookup(self, expr): # type: (Expression) -> Any
class PartitioningSession(Session):
"""An extension of Session that enforces actual partitioning of inputs.
- When evaluating an expression, inputs are partitioned according to its
- `requires_partition_by` specifications, the expression is evaluated on each
- partition separately, and the final result concatinated, as if this were
- actually executed in a parallel manner.
+ Each expression is evaluated multiple times for various supported
Review comment:
Yeah, I think the performance impact is worth it. We could probably
reduce it by being more selective about what gets re-executed. For
example, I don't think there's any value in doing this for expressions that have
preserves=Nothing().
Really good point about the random seed. I could bracket the runs with calls
to random.getstate() and random.setstate().
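A rough sketch of what that bracketing could look like, using only the standard library `random` module. The helper name `evaluate_repeatedly` and its `func`/`times` parameters are made up for illustration and are not part of this PR or of Beam; the idea is just to snapshot the global state once and restore it before every re-execution, so each run sees the same random sequence.

```python
import random


def evaluate_repeatedly(func, times):
    """Call func() `times` times, each starting from the same `random` state.

    Hypothetical helper for illustration only, not Beam API.
    """
    state = random.getstate()  # snapshot the global generator once
    results = []
    for _ in range(times):
        random.setstate(state)  # rewind before every run
        results.append(func())
    # The generator is left as if func() had run exactly once.
    return results


random.seed(0)
draws = evaluate_repeatedly(random.random, 3)
next_draw = random.random()
```

Here all three entries in `draws` are identical, and `next_draw` is whatever a single uninterrupted run would have produced next. This only covers Python's global `random` generator; NumPy's global generator keeps separate state (`np.random.get_state()` / `np.random.set_state()`), so expressions that draw from it would need the same treatment.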
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]