Hi Mohamed,

I believe this is related to fusion, which is an optimization performed by
some of the runners. You can find more information on fusion here:

https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#fusion-optimization
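
To make it concrete, here is a minimal sketch of the pattern from the
guide in the Beam Python SDK. The helpers list_parts and read_part are
made-up placeholders for source-specific logic, not Beam APIs; Beam's
built-in Reshuffle transform is implemented with a GroupByKey and acts
as a fusion barrier between the two ParDo steps:

import apache_beam as beam
from apache_beam.transforms.util import Reshuffle

def list_parts(source_desc):
    # Hypothetical: split the source into independently readable parts.
    return [f'{source_desc}/part-{i}' for i in range(100)]

def read_part(part):
    # Hypothetical: yield the records contained in one part.
    yield from range(10)

with beam.Pipeline() as p:
    (p
     | 'SourceDescriptor' >> beam.Create(['my-source'])
     | 'Split' >> beam.FlatMap(list_parts)
     # The shuffle inside Reshuffle (a GroupByKey) prevents the runner
     # from fusing Split and Read, so Read can run with a different
     # number of workers than Split.
     | 'PreventFusion' >> Reshuffle()
     | 'Read' >> beam.FlatMap(read_part))

Without that shuffle, a runner that fuses the two ParDos would execute
them together on the same workers, so the parallelism of the Read step
would be capped by the parallelism of the Split step.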

Cheers

Reza

On Thu, 3 Jan 2019 at 04:09, Mohamed Haseeb <[email protected]> wrote:

> Hi,
>
> As per the Authoring I/O Transforms guide
> <https://beam.apache.org/documentation/io/authoring-overview/>, the
> recommended way to implement a Read transform (from a source that can be
> read in parallel) has these steps:
> - Splitting the data into parts to be read in parallel (ParDo)
> - Reading from each of those parts (ParDo)
> - With a GroupByKey between the two ParDos
> The stated motivation for the GroupByKey is "it allows the runner to use
> different numbers of workers" for the splitting and reading parts. Can
> someone elaborate (or point to some relevant docs) on how the GroupByKey
> enables using a different number of workers for the two ParDo steps?
>
> Thanks,
> Mohamed
>
