Romain,

Are you saying that you need the set of elements in the bundle to be deterministic, so that on retry you get the same bundle back?
Reuven

On Wed, Nov 15, 2017 at 4:44 PM, Romain Manni-Bucau <rmannibu...@gmail.com> wrote:

> Hi Reuven,
>
> How does it help, since you will still "send to the transform" the data
> before Beam's commit point and therefore not be able to reprocess the
> same data in case of a failure between your eager flush and the next
> commit point?
>
> Romain Manni-Bucau
> @rmannibucau | Blog | Old Blog | Github | LinkedIn
>
>
> 2017-11-15 9:43 GMT+01:00 Reuven Lax <re...@google.com.invalid>:
> > Bundles are currently completely up to the runner, and different runners
> > do them differently. In addition to Flink, the Dataflow runner creates
> > smallish bundles when run in streaming mode, as the streaming-mode runner
> > is optimizing for latency (so a bundle might be small simply because not
> > enough time has passed to make it large). Bundles being large is usually
> > less of an issue - you simply need to commit occasionally in
> > processElement if the bundle being built up is too large.
> >
> > However, there is a simple workaround. You can GBK on a fixed number of
> > shard keys and set an elementCountAtLeast trigger to fire when enough
> > data has arrived.
> >
> > Reuven
> >
> > On Wed, Nov 15, 2017 at 4:38 PM, Romain Manni-Bucau <rmannibu...@gmail.com>
> > wrote:
> >
> >> Hi guys,
> >>
> >> The subject is a bit provocative, but the topic is real and comes up
> >> again and again with Beam usage: how can a DoFn handle some "chunking"?
> >>
> >> The need is to be able to commit every N records, with N not too big.
> >>
> >> The natural API for that in Beam is the bundle, but bundles are not
> >> reliable since they can be very small (Flink) - we can say that is "ok"
> >> even if it has some perf impact - or too big (Spark does full size /
> >> #workers).
> >>
> >> The workaround is what we see in the ES I/O: a maxSize which does an
> >> eager flush. The issue is that then the checkpoint is not respected
> >> and you can process the same records multiple times.
> >>
> >> Any plan to make this API reliable and controllable from a Beam point
> >> of view (at least with a max)?
> >>
> >> Thanks,
> >> Romain Manni-Bucau
> >> @rmannibucau | Blog | Old Blog | Github | LinkedIn
> >>
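[Editor's note: for readers of the archive, here is a minimal sketch of the workaround Reuven describes above - spread elements over a fixed set of shard keys, then window with an elementCountAtLeast trigger before a GBK. The class name, the NUM_SHARDS / BATCH_SIZE constants, the step names and the String element type are illustrative assumptions, not code from this thread.]

    import java.util.concurrent.ThreadLocalRandom;

    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.GroupByKey;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.transforms.windowing.AfterPane;
    import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
    import org.apache.beam.sdk.transforms.windowing.Repeatedly;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;

    public class BatchingExample {

      // Illustrative values: a fixed set of shard keys and a minimum batch size per firing.
      private static final int NUM_SHARDS = 10;
      private static final int BATCH_SIZE = 500;

      /** Turns a stream of records into per-shard batches of at least BATCH_SIZE elements. */
      public static PCollection<KV<Integer, Iterable<String>>> batch(PCollection<String> records) {
        return records
            // Spread elements over a fixed number of shard keys.
            .apply("AssignShard", ParDo.of(new DoFn<String, KV<Integer, String>>() {
              @ProcessElement
              public void processElement(ProcessContext c) {
                c.output(KV.of(ThreadLocalRandom.current().nextInt(NUM_SHARDS), c.element()));
              }
            }))
            // In the global window, re-trigger whenever a shard has accumulated enough elements.
            .apply("BatchTrigger",
                Window.<KV<Integer, String>>into(new GlobalWindows())
                    .triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(BATCH_SIZE)))
                    .discardingFiredPanes()
                    .withAllowedLateness(Duration.ZERO))
            // Each fired pane reaches the downstream DoFn as one Iterable per shard key.
            .apply("GroupPerShard", GroupByKey.<Integer, String>create());
      }
    }

A downstream DoFn then receives each fired pane as one Iterable per shard key and can commit it as a single chunk; note that the trigger gives a lower bound on the batch size per firing, not an exact count.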