Re: Bundling in ParDos

Kenneth Knowles Mon, 21 May 2018 21:31:29 -0700

Hi Abdul,

The bundle is chosen by the runner in order to best balance low latency
with amortized cost of FinishBundle and committing transactions, so you
cannot generally control it, by design. If you have a very large amount of
data coming through, or are running a batch job, then the runner will
likely choose to put many elements in a bundle to get better amortization.
If you have less data, then the runner will likely send smaller bundles
(sometimes just one element) to avoid having data sit around waiting for
too long.

Your computation should produce the same result regardless of bundling,
within tolerances. For example, you may write one file per bundle, but
whoever consumes the files should not be sensitive to how the data is split
across them.
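
To make this concrete, here is a minimal sketch in plain Python. It is not
the Beam SDK: BatchingFn, run_with_bundles and flatten are hypothetical names
that only imitate the StartBundle / ProcessElement / FinishBundle lifecycle,
and the "runner" is a loop that splits the input into bundles of whatever
sizes it is given. The sketch caps the batch size itself and flushes whatever
is left when the bundle finishes, so the set of emitted elements does not
depend on how the bundles were cut.

```python
from typing import Callable, Iterable, List


class BatchingFn:
    """Imitates a DoFn's bundle lifecycle (hypothetical, not a Beam class)."""

    def __init__(self, max_batch_size: int, sink: Callable[[List[int]], None]):
        self.max_batch_size = max_batch_size
        self.sink = sink
        self.buffer: List[int] = []

    def start_bundle(self) -> None:
        # A fresh buffer per bundle: nothing may leak across bundles.
        self.buffer = []

    def process(self, element: int) -> None:
        self.buffer.append(element)
        # Cap the batch ourselves, since the bundle size is not ours to pick.
        if len(self.buffer) >= self.max_batch_size:
            self._flush()

    def finish_bundle(self) -> None:
        # Flush the remainder, however small the bundle turned out to be.
        self._flush()

    def _flush(self) -> None:
        if self.buffer:
            self.sink(list(self.buffer))
            self.buffer = []


def run_with_bundles(
    elements: Iterable[int], bundle_sizes: List[int], max_batch_size: int
) -> List[List[int]]:
    """Stand-in for a runner: cuts the input into bundles of the given sizes."""
    batches: List[List[int]] = []
    fn = BatchingFn(max_batch_size, batches.append)
    it = iter(elements)
    for size in bundle_sizes:
        fn.start_bundle()
        for _ in range(size):
            fn.process(next(it))
        fn.finish_bundle()
    return batches


def flatten(batches: List[List[int]]) -> List[int]:
    """Collect every emitted element, ignoring how it was batched."""
    return sorted(x for batch in batches for x in batch)


data = list(range(10))
big = run_with_bundles(data, [10], max_batch_size=4)            # one large bundle
small = run_with_bundles(data, [1, 2, 3, 4], max_batch_size=4)  # several small ones
```

The two runs produce different batches, but flatten(big) and flatten(small)
are the same list, which is the property a downstream consumer should rely on.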

Kenn

On Mon, May 21, 2018 at 7:04 PM Abdul Qadeer <[email protected]> wrote:

> Hi!
>
> I was trying to understand the behavior of StartBundle and FinishBundle
> w.r.t. DoFns.
> I have an unbounded data source and I am trying to leverage bundling to
> achieve batching.
> From the docs of ParDo:
>
> "when a ParDo transform is executed, the elements of the input PCollection
> are first divided up into some number of 'bundles'"
>
> I would like to know if bundling is possible for unbounded data in the
> first place. If it is then how do I control the bundle size i.e. number of
> elements of a given PCollection in that bundle?
>
