I am trying to understand what kind of data Giraph holds in memory per
worker.
My questions, in descending order of importance:
1. Does Giraph hold the data for exactly one vertex in memory at a time, or
does it need to hold all the vertices assigned to that worker?
2. Can Giraph handle vertices with a million edges per vertex?
   If not, at what order of magnitude does it break down: 1,000 edges, 10K
edges, 100K edges?
  (Of course, I understand that this depends upon the -Xmx value, so let's
say we fix a value of -Xmx8g).
3. Are there any limitations on the kind of objects that can be used as
vertices?
   Specifically, does Giraph assume that vertices are lightweight (e.g.,
integer vertex ID + simple Java primitive vertex values + collection of
out-edges),
   or can Giraph support heavyweight vertices (hold complex nested Java
objects in a vertex)?
4. More generally, what data is stored in memory, and what, if any, is
offloaded/spilled to disk?
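
To make question 2 concrete, here is the kind of back-of-envelope arithmetic
I have in mind. It is only a sketch: the 16 bytes per edge is a number I
picked for illustration (I do not know Giraph's actual per-edge cost), the
class and method names are hypothetical, and it ignores vertex values,
messages, and JVM object overhead.

```java
public class EdgeMemoryEstimate {
    // ASSUMPTION: illustrative per-edge cost in bytes, not a measured Giraph figure.
    public static final long ASSUMED_BYTES_PER_EDGE = 16L;

    /** Bytes needed for the out-edges of one vertex at the given per-edge cost. */
    public static long estimateBytes(long edgeCount, long bytesPerEdge) {
        return Math.multiplyExact(edgeCount, bytesPerEdge);
    }

    /** How many vertices of this degree fit in the heap, ignoring every other cost. */
    public static long verticesThatFit(long heapBytes, long edgeCount, long bytesPerEdge) {
        return heapBytes / estimateBytes(edgeCount, bytesPerEdge);
    }

    public static void main(String[] args) {
        long heapBytes = 8L * 1024 * 1024 * 1024; // corresponds to -Xmx8g
        long[] degrees = {1_000L, 10_000L, 100_000L, 1_000_000L};
        for (long edges : degrees) {
            System.out.printf(
                "%,d edges/vertex: ~%,d bytes per vertex, ~%,d such vertices fit%n",
                edges,
                estimateBytes(edges, ASSUMED_BYTES_PER_EDGE),
                verticesThatFit(heapBytes, edges, ASSUMED_BYTES_PER_EDGE));
        }
    }
}
```

If the real per-edge cost is anywhere near this, a single million-edge vertex
would fit comfortably, so I suspect the real limit lies elsewhere (messages,
or the total number of vertices per worker), which is what I would like to
have confirmed.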

Would appreciate any light the experts can throw on this.

On this note, I would like to mention that the presentations posted on the
Wiki explain what Giraph can do and how to use it from a coding
perspective, but they do not explain the design approach, the rationale
behind the choices, or the software architecture. I feel that new users
would really benefit from a design and architecture document, along the
lines of those for Hadoop and Lucene. For folks who are considering whether
or not to use Giraph, this would be a big help. The only alternative today
is to read the source code, the burden of which might in itself be reason
for folks not to consider using Giraph.
My 2c  :-)

Thanks a lot,
Jeyendran

