I am trying to understand what kind of data Giraph holds in memory per worker. My questions, in descending order of importance:

1. Does Giraph hold exactly one vertex's data in memory at a time, or does it need to hold all the vertices assigned to that worker?
2. Can Giraph handle vertices with a million edges each? If not, at what order of magnitude does it break down: 1,000 edges, 10K edges, 100K edges? (Of course, I understand that this depends on the -Xmx value, so let's say we fix it at -Xmx8g.)
3. Are there any limitations on the kind of objects that can be used as vertices? Specifically, does Giraph assume that vertices are lightweight (e.g., integer vertex ID + simple Java primitive vertex values + collection of out-edges), or can it support heavyweight vertices (complex nested Java objects held in a vertex)?
4. More generally, what data is stored in memory, and what, if any, is offloaded/spilled to disk?

I would appreciate any light the experts can shed on this.

On this note, I would like to mention that the presentations posted on the Wiki explain what Giraph can do and how to use it from a coding perspective, but they do not explain the design approach, the rationale behind the choices, or the software architecture. I feel that new users would really benefit from a design and architecture document along the lines of those for Hadoop and Lucene. For folks who are considering whether or not to use Giraph, this could be a big help. The only alternative today is to read the source code, and that burden might in itself be reason enough for some not to consider Giraph.

My 2c :-)

Thanks a lot,
Jeyendran