Bundles are simply the unit of commitment (and therefore of retry) in
the Beam SDK. They're not really a model concept, but they do leak from
the implementation into the API: it's not feasible to checkpoint every
individual process call, and committing per bundle allows some
state/compute/... to be safely amortized across elements. Either the
results of all processed elements in a bundle are sent downstream, or
none are and the entire bundle is retried.
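
To make the all-or-nothing behaviour concrete, here is a rough sketch in
plain Python. It is not Beam's API or how any runner is actually
implemented; run_bundle, flaky_double and max_attempts are hypothetical
names made up for illustration. Outputs are staged per attempt and only
handed downstream once every element in the bundle has been processed.

```python
from typing import Callable, Iterable, List


def run_bundle(
    bundle: List[int],
    process: Callable[[int], Iterable[int]],
    downstream: List[int],
    max_attempts: int = 3,
) -> int:
    """Process one bundle atomically (hypothetical helper, not a Beam API).

    Returns the number of attempts that were needed.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        staged: List[int] = []  # outputs held back until the bundle succeeds
        try:
            for element in bundle:
                staged.extend(process(element))
        except Exception as exc:
            # Drop everything staged so far and retry the entire bundle.
            last_error = exc
            continue
        downstream.extend(staged)  # commit: all outputs go out together
        return attempt
    raise RuntimeError("bundle failed after %d attempts" % max_attempts) from last_error


seen_failure: List[bool] = []


def flaky_double(x: int) -> List[int]:
    # Simulated transient failure: fails the first time it sees element 3.
    if x == 3 and not seen_failure:
        seen_failure.append(True)
        raise ValueError("transient failure")
    return [x * 2]


downstream: List[int] = []
attempts = run_bundle([1, 2, 3], flaky_double, downstream)
print(attempts, downstream)
```

Note that the outputs for elements 1 and 2 from the failed first attempt
never reach downstream; they are recomputed when the whole bundle is
retried, so nothing is emitted twice.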

On Wed, Jan 25, 2017 at 9:36 AM, Matthew Jadczak <mn...@cam.ac.uk> wrote:
> Hi,
>
> I’m a finalist CompSci student at the University of Cambridge, and for my 
> final project/dissertation I am writing an implementation of the Beam SDK in 
> Elixir [1]. Given that the Beam project is obviously still very much WIP, 
> it’s still somewhat difficult to find good conceptual overviews of parts of 
> the system, which is crucial when translating the OOP architecture to 
> something completely different. However I have found many of the design docs 
> scattered around the JIRA and here very helpful. (Incidentally, perhaps it 
> would be helpful to maintain a list of them, to help any contributors 
> acquaint themselves with the conceptual vision of the implementation?)
>
> One thing which I have not yet been able to work out is the significance of 
> “bundles” in the SDK. On the one hand, it seems that they are simply an 
> implementation detail, effectively a way to do micro-batch processing 
> efficiently, and indeed they are not mentioned at all in the original 
> Dataflow paper or anywhere in the Beam docs (except in passing). On the other 
> hand, it seems most of the key transforms in the SDK core have a concept of 
> bundles and operate in their terms in practice, while all conceptually being 
> described as just operating on elements.
>
> Do bundles have semantic meaning in the Beam Model? Are there any guidelines 
> as to how a given transform should split its output up into bundles? Should 
> any runner/SDK implementing the Model have that concept, even when other 
> primitives for streaming data processing including things like efficiently 
> transmitting individual elements between stages with backpressure are 
> available in the language/standard libraries? Are there any insights here 
> that I am missing, i.e. were problems present in early versions of the 
> runners solved by adding the concept of bundles?
>
> Thanks so much,
> Matt
>
> [1] http://elixir-lang.org/
