This explains it. Thanks Reza!

On Thu, Jan 3, 2019 at 1:19 AM Reza Ardeshir Rokni <[email protected]>
wrote:

> Hi Mohamed,
>
> I believe this is related to fusion, an optimization performed by some
> runners; you can find more information on fusion at:
>
>
> https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#fusion-optimization
>
> Cheers
>
> Reza
>
> On Thu, 3 Jan 2019 at 04:09, Mohamed Haseeb <[email protected]> wrote:
>
>> Hi,
>>
>> As per the Authoring I/O Transforms guide
>> <https://beam.apache.org/documentation/io/authoring-overview/>, the
>> recommended way to implement a Read transform (from a source that can be
>> read in parallel) has these steps:
>> - Splitting the data into parts to be read in parallel (a ParDo)
>> - Reading from each of those parts (a second ParDo)
>> - With a GroupByKey between the two ParDos
>> The stated motivation for the GroupByKey is that "it allows the runner to
>> use different numbers of workers" for the splitting and reading steps. Can
>> someone elaborate (or point to relevant docs) on how the GroupByKey
>> enables using a different number of workers for the two ParDo steps?
>>
>> Thanks,
>> Mohamed
>>
>
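The split / GroupByKey / read shape discussed above can be sketched in plain
Python. This is an illustrative simulation, not the Beam SDK: the function
names (`split_source`, `group_by_key`, `read_part`) are hypothetical, and the
comments note where a real runner's behavior (fusion, redistribution) comes in.

```python
# Hypothetical sketch (plain Python, NOT the Beam SDK) of the recommended
# Read-transform shape: a "split" ParDo, a GroupByKey, then a "read" ParDo.

def split_source(source, num_parts):
    """ParDo 1: split the source into independently readable parts."""
    size = len(source)
    step = max(1, size // num_parts)
    for i, start in enumerate(range(0, size, step)):
        # Emit (key, part) pairs; the keys let a GroupByKey redistribute work.
        yield (i, (start, min(start + step, size)))

def group_by_key(pairs):
    """Materialize and regroup. In a real runner, fusion cannot cross a
    GroupByKey, so the downstream ParDo may run on a different (and
    differently sized) set of workers than the upstream one."""
    groups = {}
    for key, value in pairs:
        groups.setdefault(key, []).append(value)
    return groups.items()

def read_part(source, part):
    """ParDo 2: read the records in one part of the source."""
    start, end = part
    return source[start:end]

# Drive the three stages end to end over a toy "source".
source = list(range(10))
parts = split_source(source, 3)
records = [record
           for _, part_list in group_by_key(parts)
           for part in part_list
           for record in read_part(source, part)]
print(sorted(records))  # → [0, 1, 2, ..., 9]: all records, read part by part
```

Without the GroupByKey, a runner that fuses adjacent ParDos would run split
and read as one fused stage, so the read work could not be rebalanced across
more workers than produced the splits; the shuffle boundary is what makes the
two stages independently schedulable.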
