Thanks for the insight, Kenneth. It would surprise me if the decision the runner makes about latency vs. amortized cost were non-deterministic. Are there any benchmarking results that show when bundling kicks in?

> On May 21, 2018, at 8:52 PM, Kenneth Knowles <[email protected]> wrote:
>
> Hi Abdul,
>
> The bundle is chosen by the runner in order to best balance low latency with
> amortized cost of FinishBundle and committing transactions, so you cannot
> generally control it, by design. If you have a very large amount of data
> coming through, or are running a batch job, then the runner will likely
> choose to put many elements in a bundle to get better amortization. If you
> have less data, then the runner will likely send smaller bundles (sometimes
> just one element) to avoid having data sit around waiting for too long.
>
> Your computation should have the same result regardless of bundling, within
> tolerances - for example you may write one file per bundle but whoever
> consumes the files should not be sensitive to this.
>
> Kenn
>
> On Mon, May 21, 2018 at 7:04 PM Abdul Qadeer <[email protected]> wrote:
>
> Hi!
>
> I was trying to understand the behavior of StartBundle and FinishBundle
> w.r.t. DoFns.
> I have an unbounded data source and I am trying to leverage bundling to
> achieve batching.
> From the docs of ParDo:
>
> "when a ParDo transform is executed, the elements of the input PCollection
> are first divided up into some number of "bundles""
>
> I would like to know if bundling is possible for unbounded data in the first
> place. If it is then how do I control the bundle size i.e. number of elements
> of a given PCollection in that bundle?
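
To check my understanding of "same result regardless of bundling", here is a minimal plain-Java sketch. It is not Beam code: BundleBatchingSketch, BatchingFn, run, flatten and maxBatchSize are names I made up, and the startBundle/processElement/finishBundle methods only mimic the DoFn lifecycle. The idea is to buffer elements, flush when the buffer reaches a cap, and flush whatever is left when the bundle ends, so the batch boundaries depend on the runner's bundles but the set of elements written does not.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java simulation of batching inside a bundle. Not the Beam API;
// all names here are hypothetical and only mirror the DoFn lifecycle.
final class BundleBatchingSketch {

    /** Buffers elements per bundle and flushes them as batches to a sink. */
    static final class BatchingFn {
        private final int maxBatchSize;
        private final List<List<String>> sink; // stand-in for an external system
        private List<String> buffer = new ArrayList<>();

        BatchingFn(int maxBatchSize, List<List<String>> sink) {
            this.maxBatchSize = maxBatchSize;
            this.sink = sink;
        }

        void startBundle() {
            buffer = new ArrayList<>();
        }

        void processElement(String element) {
            buffer.add(element);
            if (buffer.size() >= maxBatchSize) {
                flush(); // cap reached: do not wait for the bundle to end
            }
        }

        void finishBundle() {
            if (!buffer.isEmpty()) {
                flush(); // never carry elements across a bundle boundary
            }
        }

        private void flush() {
            sink.add(new ArrayList<>(buffer));
            buffer.clear();
        }
    }

    /** Pushes the elements through the fn in bundles of bundleSize (the "runner's" choice). */
    static List<List<String>> run(List<String> elements, int bundleSize, int maxBatchSize) {
        List<List<String>> sink = new ArrayList<>();
        BatchingFn fn = new BatchingFn(maxBatchSize, sink);
        for (int i = 0; i < elements.size(); i += bundleSize) {
            int end = Math.min(i + bundleSize, elements.size());
            fn.startBundle();
            for (String e : elements.subList(i, end)) {
                fn.processElement(e);
            }
            fn.finishBundle();
        }
        return sink;
    }

    /** Concatenates the batches back into one list, in write order. */
    static List<String> flatten(List<List<String>> batches) {
        List<String> out = new ArrayList<>();
        for (List<String> batch : batches) {
            out.addAll(batch);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> input = List.of("a", "b", "c", "d", "e");
        // Same input, two different bundlings, batch cap of 3.
        System.out.println(run(input, 1, 3)); // one-element bundles
        System.out.println(run(input, 5, 3)); // a single large bundle
    }
}
```

With one-element bundles every batch has a single element; with one large bundle the batches fill up to the cap. Flattened, both runs contain the same elements, which is what a downstream consumer should rely on.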
