Hi beamers!
I'm enjoying using Apache Beam with Python.
I often find myself creating patterns like this:
input -> (key, calc1 result)
input -> (key, calc2 result)
input -> (key, calc3 result)
...
input -> (key, calcN result)
CoGroupByKey -> merge results of calc1, calc2, ..., calcN into the original input dict
(where input is a dict representing a row)
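Roughly, the current pipeline has the shape of the sketch below (calc1/calc2, the "id" field used as the key, and the merge step are just simplified placeholders for my real calcs):

import apache_beam as beam

def calc1(row):
    # placeholder calculation; emits (key, result) keyed by the row id
    return (row["id"], row["a"] + row["b"])

def calc2(row):
    # another placeholder calculation on the same row
    return (row["id"], row["a"] * row["b"])

def merge(kv):
    # kv is (key, {"input": [row], "calc1": [r1], "calc2": [r2]})
    # since there is exactly one result per calc per row, take element [0]
    key, grouped = kv
    row = dict(grouped["input"][0])
    row["calc1"] = grouped["calc1"][0]
    row["calc2"] = grouped["calc2"][0]
    return row

with beam.Pipeline() as p:
    rows = p | "Read" >> beam.Create([
        {"id": 1, "a": 2, "b": 3},
        {"id": 2, "a": 5, "b": 7},
    ])

    keyed_input = rows | "KeyInput" >> beam.Map(lambda row: (row["id"], row))
    c1 = rows | "Calc1" >> beam.Map(calc1)
    c2 = rows | "Calc2" >> beam.Map(calc2)

    # fan-in: wait for every calc's result for a key, then merge into the row
    merged = (
        {"input": keyed_input, "calc1": c1, "calc2": c2}
        | "Join" >> beam.CoGroupByKey()
        | "Merge" >> beam.Map(merge)
    )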
Up until now I've been using CoGroupByKey, but I wonder if there's a
better way, so that as soon as the full set of calcs for a row is finished,
that row can continue to be processed down-pipeline.
Given that:
- There's only one result for each calc for each row
- The calculations can run in parallel
... waiting for the whole set to be realized at the merging CoGroupByKey
seems like an unnecessary bottleneck.
Is there a better pattern for splitting, processing in parallel, and merging
results per row?
Thanks,
Adam