Hi Abdul,

The bundle size is chosen by the runner to balance low latency against the amortized cost of FinishBundle and of committing transactions, so by design you generally cannot control it. If you have a very large amount of data coming through, or are running a batch job, the runner will likely put many elements in a bundle to get better amortization. If you have less data, the runner will likely send smaller bundles (sometimes just one element) so that data does not sit around waiting for too long.
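
To make this concrete, here is a minimal plain-Python sketch, not actual Beam code. The class `BatchingFn`, the helper `run`, and the `max_batch_size` parameter are all hypothetical names invented for illustration; the method names only loosely mirror the StartBundle / FinishBundle hooks. It assumes a "runner" that splits the same data into bundles however it likes, and shows a function that caps its own batch size and flushes the remainder when the bundle finishes, so the set of elements written is the same whatever the bundling.

```python
from typing import Iterable, List


class BatchingFn:
    """Hypothetical stand-in for a DoFn that batches elements within a bundle."""

    def __init__(self, max_batch_size: int) -> None:
        self.max_batch_size = max_batch_size
        self.buffer: List[int] = []
        self.flushed: List[List[int]] = []  # stands in for an external sink

    def start_bundle(self) -> None:
        # Every bundle starts with an empty buffer.
        self.buffer = []

    def process(self, element: int) -> None:
        self.buffer.append(element)
        # We can cap our own batch size; the bundle size is not ours to set.
        if len(self.buffer) >= self.max_batch_size:
            self._flush()

    def finish_bundle(self) -> None:
        # Flush the remainder so no element outlives its bundle.
        self._flush()

    def _flush(self) -> None:
        if self.buffer:
            self.flushed.append(self.buffer)
            self.buffer = []


def run(bundles: Iterable[List[int]], max_batch_size: int) -> List[List[int]]:
    """Simulates a runner feeding bundles of its own choosing to the function."""
    fn = BatchingFn(max_batch_size)
    for bundle in bundles:
        fn.start_bundle()
        for element in bundle:
            fn.process(element)
        fn.finish_bundle()
    return fn.flushed


def flatten(batches: List[List[int]]) -> List[int]:
    return sorted(x for batch in batches for x in batch)


data = list(range(10))
one_big_bundle = run([data], max_batch_size=4)
many_small_bundles = run([[x] for x in data], max_batch_size=4)
uneven_bundles = run([data[:3], data[3:9], data[9:]], max_batch_size=4)
```

The number and size of the flushed batches differ between the three runs, but flattening them gives the same elements each time. In this sketch a batch can be smaller than `max_batch_size` whenever a bundle ends early, which is exactly what happens when bundles are small.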

Your computation should produce the same result regardless of bundling, within tolerances. For example, you may write one file per bundle, but whoever consumes the files should not be sensitive to how the elements are split across them.

Kenn

On Mon, May 21, 2018 at 7:04 PM Abdul Qadeer <[email protected]> wrote:

> Hi!
>
> I was trying to understand the behavior of StartBundle and FinishBundle
> w.r.t. DoFns.
> I have an unbounded data source and I am trying to leverage bundling to
> achieve batching.
> From the docs of ParDo:
>
> "when a ParDo transform is executed, the elements of the input PCollection
> are first divided up into some number of "bundles"
>
> I would like to know if bundling is possible for unbounded data in the
> first place. If it is then how do I control the bundle size i.e. number of
> elements of a given PCollection in that bundle?
>
