Hi beamers!
I'm enjoying using Apache Beam with Python.
I often find myself creating patterns like this:
input -> (key, calc1 result)
input -> (key, calc2 result)
input -> (key, calc3 result)
...
input -> (key, calcN result)
CoGroupByKey -> merge results of calc1, calc2, ..., calcN into the original input dict
(where input is a dict representing a row)
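Roughly, the current pipeline has the shape of the sketch below (calc1/calc2, the "id" field used as the key, and the merge step are just simplified placeholders for my real calcs):

import apache_beam as beam

def calc1(row):
    # placeholder calculation; emits (key, result) keyed by the row id
    return (row["id"], row["a"] + row["b"])

def calc2(row):
    # another placeholder calculation on the same row
    return (row["id"], row["a"] * row["b"])

def merge(kv):
    # kv is (key, {"input": [row], "calc1": [r1], "calc2": [r2]})
    # since there is exactly one result per calc per row, take element [0]
    key, grouped = kv
    row = dict(grouped["input"][0])
    row["calc1"] = grouped["calc1"][0]
    row["calc2"] = grouped["calc2"][0]
    return row

with beam.Pipeline() as p:
    rows = p | "Read" >> beam.Create([
        {"id": 1, "a": 2, "b": 3},
        {"id": 2, "a": 5, "b": 7},
    ])

    keyed_input = rows | "KeyInput" >> beam.Map(lambda row: (row["id"], row))
    c1 = rows | "Calc1" >> beam.Map(calc1)
    c2 = rows | "Calc2" >> beam.Map(calc2)

    # fan-in: wait for every calc's result for a key, then merge into the row
    merged = (
        {"input": keyed_input, "calc1": c1, "calc2": c2}
        | "Join" >> beam.CoGroupByKey()
        | "Merge" >> beam.Map(merge)
    )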
Up until now I've been using CoGroupByKey, but I wonder if there's a
better way, so that as soon as the full set of calcs for a row is finished,
that row can continue to be processed down-pipeline.
Given that:
- There's only one result for each calc for each row
- The calculations can run in parallel
... waiting for the whole set to be realized at the merging CoGroupByKey
seems like an unnecessary bottleneck.
Is there a better pattern for splitting, processing in parallel, and merging
results per row?
Thanks,
Adam