Hi Mohamed,

I believe this is related to fusion, which is a feature of some of the runners. Without the GroupByKey, a runner that performs fusion will typically fuse the splitting and reading ParDos into a single stage, so both run on the same set of workers. The GroupByKey forces the split results to be materialized at a stage boundary, which leaves the runner free to redistribute the reads across a different number of workers. You will be able to find more information on fusion at:
https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#fusion-optimization

A rough sketch of the pattern, for concreteness, follows your quoted message below.

Cheers

Reza

On Thu, 3 Jan 2019 at 04:09, Mohamed Haseeb <[email protected]> wrote:

> Hi,
>
> As per the Authoring I/O Transforms guide
> <https://beam.apache.org/documentation/io/authoring-overview/>, the
> recommended way to implement a Read transform (from a source that can be
> read in parallel) has these steps:
>
> - Splitting the data into parts to be read in parallel (ParDo)
> - Reading from each of those parts (ParDo)
> - With a GroupByKey in between the ParDos
>
> The stated motivation for the GroupByKey is that "it allows the runner to
> use different numbers of workers" for the splitting and reading parts. Can
> someone elaborate (or point to some relevant docs) on how the GroupByKey
> will enable using a different number of workers for the two ParDo steps?
>
> Thanks,
> Mohamed
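
P.S. Here is a minimal sketch of the split/GroupByKey/read pattern in the Beam
Java SDK. All the names here (ParallelReadSketch, SplitIntoShards, ReadShard,
the shard-string format) are illustrative rather than taken from the guide; a
real connector would emit and read actual file or offset ranges. The comments
mark where fusion would otherwise keep everything on one set of workers.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.WithKeys;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

public class ParallelReadSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Step 1: split the source into independently readable parts. The
    // "shards" here are just strings naming a part; a real source would
    // emit file names or offset ranges.
    PCollection<String> shards =
        p.apply(Create.of("source"))
            .apply("SplitIntoShards", ParDo.of(new DoFn<String, String>() {
              @ProcessElement
              public void process(@Element String source, OutputReceiver<String> out) {
                for (int i = 0; i < 100; i++) {
                  out.output(source + "/shard-" + i);
                }
              }
            }));

    // Step 2: the GroupByKey materializes the shard list at a stage
    // boundary. Without it, a fusing runner would collapse the split and
    // read ParDos into one stage, so all 100 reads would run on the same
    // worker(s) that executed the split. With it, the runner can hand the
    // grouped output to as many (or as few) workers as it likes.
    PCollection<String> redistributed =
        shards
            .apply(WithKeys.of((String shard) -> Math.floorMod(shard.hashCode(), 10))
                .withKeyType(TypeDescriptors.integers()))
            .apply(GroupByKey.<Integer, String>create())
            .apply("Ungroup", ParDo.of(new DoFn<KV<Integer, Iterable<String>>, String>() {
              @ProcessElement
              public void process(@Element KV<Integer, Iterable<String>> kv,
                                  OutputReceiver<String> out) {
                for (String shard : kv.getValue()) {
                  out.output(shard);
                }
              }
            }));

    // Step 3: read each shard. Because of the fusion break above, this
    // ParDo can scale independently of the splitting step.
    redistributed.apply("ReadShard", ParDo.of(new DoFn<String, String>() {
      @ProcessElement
      public void process(@Element String shard, OutputReceiver<String> out) {
        out.output("records read from " + shard);  // placeholder for a real read
      }
    }));

    p.run().waitUntilFinish();
  }
}

Note that Beam also packages this idiom as the Reshuffle transform, which is
implemented with a GroupByKey under the hood and is the usual way to insert a
fusion break without writing the keying/grouping/ungrouping steps yourself.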
