Bundles are simply the unit of commitment (retry) in the Beam SDK. They're not really a model concept, but do leak from the implementation into the API as it's not feasible to checkpoint every individual process call, and this allows some state/compute/... to be safely amortized across elements (either the results of all processed elements in a bundle are sent downstream, or none are and the entire bundle is retried).
On Wed, Jan 25, 2017 at 9:36 AM, Matthew Jadczak <[email protected]> wrote: > Hi, > > I’m a finalist CompSci student at the University of Cambridge, and for my > final project/dissertation I am writing an implementation of the Beam SDK in > Elixir [1]. Given that the Beam project is obviously still very much WIP, > it’s still somewhat difficult to find good conceptual overviews of parts of > the system, which is crucial when translating the OOP architecture to > something completely different. However I have found many of the design docs > scattered around the JIRA and here very helpful. (Incidentally, perhaps it > would be helpful to maintain a list of them, to help any contributors > acquaint themselves with the conceptual vision of the implementation?) > > One thing which I have not yet been able to work out is the significance of > “bundles” in the SDK. On the one hand, it seems that they are simply an > implementation detail, effectively a way to do micro-batch processing > efficiently, and indeed they are not mentioned at all in the original > Dataflow paper or anywhere in the Beam docs (except in passing). On the other > hand, it seems most of the key transforms in the SDK core have a concept of > bundles and operate in their terms in practice, while all conceptually being > described as just operating on elements. > > Do bundles have semantic meaning in the Beam Model? Are there any guidelines > as to how a given transform should split its output up into bundles? Should > any runner/SDK implementing the Model have that concept, even when other > primitives for streaming data processing including things like efficiently > transmitting individual elements between stages with backpressure are > available in the language/standard libraries? Are there any insights here > that I am missing, i.e. were problems present in early versions of the > runners solved by adding the concept of bundles? > > Thanks so much, > Matt > > [1] http://elixir-lang.org/
