Hi,

I’m a final-year Computer Science student at the University of Cambridge, and for my final project/dissertation I am writing an implementation of the Beam SDK in Elixir [1]. Because the Beam project is still very much a work in progress, it is somewhat difficult to find good conceptual overviews of parts of the system, and these are crucial when translating the OOP architecture into something completely different. However, I have found many of the design docs scattered around the JIRA and here very helpful. (Incidentally, perhaps it would be helpful to maintain a list of them, to help contributors acquaint themselves with the conceptual vision of the implementation?)

One thing I have not yet been able to work out is the significance of “bundles” in the SDK. On the one hand, they seem to be simply an implementation detail, effectively a way to do micro-batch processing efficiently; indeed, they are not mentioned at all in the original Dataflow paper, nor anywhere in the Beam docs except in passing. On the other hand, most of the key transforms in the SDK core have a concept of bundles and operate in terms of them in practice, even though they are all described conceptually as operating on individual elements.

Do bundles have semantic meaning in the Beam Model? Are there any guidelines as to how a given transform should split its output up into bundles? Should every runner/SDK implementing the Model have that concept, even when the language or its standard libraries already provide other primitives for streaming data processing, such as efficient transmission of individual elements between stages with backpressure? Are there any insights here that I am missing? For example, were there problems in early versions of the runners that were solved by adding the concept of bundles?

Thanks so much,
Matt

[1] http://elixir-lang.org/
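
P.S. To make the question concrete, here is a minimal sketch of the two views I have in mind. It is written in Python purely for illustration, and every name in it (`split_into_bundles`, `run_elementwise`, `run_in_bundles`, `words`) is hypothetical rather than anything taken from the Beam SDK. It assumes a pure element-wise function, in which case the bundle boundaries do not affect the output and only show up as extra start/finish events.

```python
from typing import Callable, Iterable, Iterator, List, TypeVar

A = TypeVar("A")
B = TypeVar("B")


def split_into_bundles(elements: List[A], bundle_size: int) -> Iterator[List[A]]:
    """Chop a list into consecutive chunks of at most bundle_size elements."""
    for start in range(0, len(elements), bundle_size):
        yield elements[start:start + bundle_size]


def run_elementwise(fn: Callable[[A], Iterable[B]], elements: List[A]) -> List[B]:
    """Conceptual view: fn is applied to each element independently."""
    out: List[B] = []
    for element in elements:
        out.extend(fn(element))
    return out


def run_in_bundles(
    fn: Callable[[A], Iterable[B]],
    elements: List[A],
    bundle_size: int,
    log: List[str],
) -> List[B]:
    """Bundled view: the same fn, with a start/finish event around each chunk.

    The log stands in for whatever per-bundle setup and teardown a real
    runner might perform (an assumption made for this sketch).
    """
    out: List[B] = []
    for bundle in split_into_bundles(elements, bundle_size):
        log.append("start")
        for element in bundle:
            out.extend(fn(element))
        log.append("finish")
    return out


def words(line: str) -> List[str]:
    """A pure element-wise function: one line in, zero or more words out."""
    return line.split()


lines = ["a b", "c", "d e f", "g"]
log: List[str] = []

flat = run_elementwise(words, lines)
bundled_2 = run_in_bundles(words, lines, 2, log)
bundled_3 = run_in_bundles(words, lines, 3, [])
```

Here `flat`, `bundled_2` and `bundled_3` all hold the same words in the same order, so the choice of bundle size is invisible in the result. If that is the whole story, bundles look like an implementation detail; what I am unsure about is whether the Model ever makes the boundaries observable.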
