Hi,

I’m a finalist CompSci student at the University of Cambridge, and for my final 
project/dissertation I am writing an implementation of the Beam SDK in Elixir 
[1]. Because the Beam project is still very much a work in progress, it is 
somewhat difficult to find good conceptual overviews of parts of the system, 
and such overviews are crucial when translating the OOP architecture to 
something completely different. However, I have found many of the design docs 
scattered around the JIRA and this list very helpful. (Incidentally, perhaps it 
would be helpful to maintain a list of them, to help contributors acquaint 
themselves with the conceptual vision of the implementation?)

One thing which I have not yet been able to work out is the significance of 
“bundles” in the SDK. On the one hand, it seems that they are simply an 
implementation detail, effectively a way to do micro-batch processing 
efficiently, and indeed they are not mentioned at all in the original Dataflow 
paper or anywhere in the Beam docs (except in passing). On the other hand, 
most of the key transforms in the SDK core seem to have a concept of bundles 
and operate in terms of them in practice, even though conceptually they are all 
described as operating on individual elements.
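
To make the question concrete, here is a minimal sketch of the "implementation 
detail" reading, written in Python rather than Elixir purely for illustration. 
It does not use the Beam SDK: the names DoubleFn, split_into_bundles and run 
are hypothetical, and the start_bundle/process/finish_bundle hooks only imitate 
the shape of the lifecycle I see in the SDK core. Under this reading, the 
bundle size is an arbitrary choice made by the runner, and the per-element 
output is the same however the input is split.

```python
from typing import Iterable, Iterator, List


def split_into_bundles(elements: List[int], bundle_size: int) -> Iterator[List[int]]:
    """Chop the input into consecutive bundles; the size is an arbitrary runner choice."""
    for start in range(0, len(elements), bundle_size):
        yield elements[start:start + bundle_size]


class DoubleFn:
    """Hypothetical per-element transform with bundle lifecycle hooks."""

    def __init__(self) -> None:
        self.bundles_seen = 0

    def start_bundle(self) -> None:
        # A real transform might open a connection or allocate a buffer here.
        self.bundles_seen += 1

    def process(self, element: int) -> Iterable[int]:
        # The element-wise logic never looks at bundle boundaries.
        yield element * 2

    def finish_bundle(self) -> None:
        # A real transform might flush buffered output here.
        pass


def run(fn: DoubleFn, elements: List[int], bundle_size: int) -> List[int]:
    """Hypothetical toy runner: drive fn over the input one bundle at a time."""
    output: List[int] = []
    for bundle in split_into_bundles(elements, bundle_size):
        fn.start_bundle()
        for element in bundle:
            output.extend(fn.process(element))
        fn.finish_bundle()
    return output


data = [1, 2, 3, 4, 5]
per_element = run(DoubleFn(), data, bundle_size=1)
micro_batched = run(DoubleFn(), data, bundle_size=3)
whole_input = run(DoubleFn(), data, bundle_size=len(data))
```

In this sketch the three results are identical, so bundles affect only how 
often the setup and flush hooks run, not what the transform computes. Whether 
the real Model guarantees anything stronger than that is exactly what I am 
unsure about.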

Do bundles have semantic meaning in the Beam Model? Are there any guidelines as 
to how a given transform should split its output into bundles? Should every 
runner/SDK implementing the Model have that concept, even when the language or 
its standard libraries already provide other primitives for streaming data 
processing, such as efficient transmission of individual elements between 
stages with backpressure? Is there any insight here that I am missing, for 
example problems in early versions of the runners that were solved by adding 
the concept of bundles?

Thanks so much,
Matt

[1] http://elixir-lang.org/
