Heads up, as I will be submitting Blocks Framework to Giraph, so in the next few days/weeks you will see large number of JIRAs, diffs and new files related to it.
Blocks Framework is a new API for writing applications, that tries to address many deficiencies of current Giraph API, while keeping all of its functionality and performance. Execution is still the same, and Blocks Framework API translates internally to Giraph API itself. As we were developing more complex applications, we were finding that it was very hard to encapsulate parts of the application, and to compose multiple parts, within current API. And that is important throughout, from extending and managing complex application, to being able to build libraries. For example, let's say we want to write a simple utility that will compute total edge weight of the graph, and log it on master. First there is no way to encapsulate that logic - we would have to have some code on the master (to register aggregator, and to log the result), and we would have to have a computation that will aggregate values for each vertex. Since MasterCompute cannot be changed, at best, we could have two functions: masterComputeStep1 to register aggregator and change computation, and masterComputeStep2 to log the result. At minimum, we are leaking how many supersteps this computation takes, computation would need to match message types before and after it. Now let's say we have two such utilities, and we want to compose them - there is no way to do so. If we want to modify existing application to do this at the beginning, before it's own logic - there is no easy way to do so. Blocks Framework is primarily trying to solve that problem - by having all code being written in "Blocks" (as block of code in programming). You can compose multiple blocks together, to get a more complex logic (for example new SequenceBlock(block1, block2), or new RepeatBlock(10, iterBlock)). Each application is represented by a single complex Block. Each block is an iterator over individual steps - called Pieces (each Piece is itself a Block). Each Piece encapsulates one message sending, and master compute in between (so "send" half of one superstep, master compute and "receive" half of the next superstep). There is of course much more to it, and a lot of details, which you will be able to see from diffs and JIRAs. If any of these bullet points resonate to being painful with you, you are going to enjoy new framework: - checking getSuperstep to decide which logic to execute throughout the code, or keep track of where you are - changing computation, trying to match computation message types - having multiple consecutive Giraph jobs, because it is too complex to have it all in one - having multiple concrete computation classes, just to support different vertex/edges types - wanting to combine multiple algorithms inside single Giraph job We have developed this framework within Facebook within the last year or so, and it is mature enough by now to be shared with the community. Thanks, Igor