Great discussion. I'm very curious about the Apache Jena spill solution you
mentioned and will check it out. What was your impression? Was it successful
for your uses? It sounds like it was not too hard to adapt for this use?

The good news is that Giraph recently gained the ability to optionally spill
both messages and vertex data to disk to avoid overloads, and when configured
correctly it should provide the functionality you're looking for. Even though
Giraph rides atop the Hadoop framework, it performs its calculations in a
fundamentally different paradigm than MapReduce, so I doubt we will ever
fully replicate Hadoop's ability to trade cluster size for calculation time
so transparently. Regarding the use of these new features, there are threads
on the Giraph JIRA list by Maja, Claudio, and Alessandro covering these
issues that I'd recommend reading. Try it out, and please let us know how it
goes for you. It's exciting for us to have these features available now.
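
If it helps, here is a minimal sketch of how those options might be enabled
from code. The property names below are my best recollection of the ones
introduced by the out-of-core work discussed in those JIRA threads, so treat
them as assumptions and verify the exact spellings there:

    // Hedged sketch: turning on Giraph's out-of-core features through a
    // plain Hadoop Configuration. Property names and values are assumptions
    // drawn from the out-of-core JIRA discussions; double-check them there.
    import org.apache.hadoop.conf.Configuration;

    public class OutOfCoreSetup {
        public static Configuration configure() {
            Configuration conf = new Configuration();
            // Spill vertex partitions to local disk when memory runs low.
            conf.setBoolean("giraph.useOutOfCoreGraph", true);
            // Cap how many partitions each worker keeps in memory at once.
            conf.setInt("giraph.maxPartitionsInMemory", 10);
            // Spill incoming messages to local disk as well.
            conf.setBoolean("giraph.useOutOfCoreMessages", true);
            // Cap the number of messages buffered in memory per worker.
            conf.setInt("giraph.maxMessagesInMemory", 1000000);
            return conf;
        }
    }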

> Many thanks for the point-by-point replies. They clarify a lot of the
> questions I had.
>
> The Pregel papers also threw more light on the approach and architecture.
>
> Hi Eli,
>
> Your feedback about very large scale applications on Giraph sounds very
> encouraging. Thanks very much.
>
> After reading both of your replies, I have some (final!) questions
> regarding memory usage:
>
> - For applications with a large number of edges per vertex: are there any
> built-in vertex or helper classes, or at least sample code, that spill
> edges to disk or provide some kind of disk-backed map of edges to support
> such vertices? Or do we have to roll our own?
>
> - For graphs with a large number of vertices relative to available
> workers: at least in the development phase, one may not have access to
> many workers, yet may still wish to process a very large graph. In that
> case the workers may not be able to hold all their assigned vertices in
> memory. So again, are there any built-in classes that allow spilling of
> vertices to disk, or a similar kind of disk-backed map?
>
> - Assuming some kind of disk backing is implemented to handle a large
> number of vertices/edges (given an insufficient number of workers or
> insufficient memory per worker), is it likely that the sheer volume of I/O
> (message/IPC) could cause OOMEs, or merely slowdowns?
>
> In general, I feel that one of the reasons for the wide and rapid adoption
> of Hadoop is its "download, install, and run" quality: even for large data
> sets, the stock code will still run to completion on a single laptop (or a
> single Linux server, etc.), just taking more time. That may be perfectly
> acceptable for people who are evaluating and experimenting, since there is
> no incurred hardware cost. A lot of developers might be OK with giving the
> thing a run overnight on their laptops, or firing up just one spot
> instance on EC2 and letting it chug along for a couple of days.
>
> I know this was the case for me when I was starting out with Hadoop. So
> more nodes are needed only to speed things up, not for functionality.
>
> It might be great to include such features in Giraph too, which would
> require that disk-backed workers be supported in the code as a standard
> feature.
>
> Would love to hear your thoughts on these.
>
> Thanks,
>
> Jeyendran
>
> *From:* Eli Reisman [mailto:apache.mail...@gmail.com]
> *Sent:* Tuesday, September 11, 2012 12:11 PM
> *To:* user@giraph.apache.org
> *Subject:* Re: Can Giraph handle graphs with very large number of edges
> per vertex?
>
> Hi Jeyendran, I was just saying the same thing about the documentation on
> another thread; I couldn't agree more. There will be progress on this
> soon, I promise. I'd like us to reach a model of "if you add a new feature
> or change a core feature, the patch gets committed contingent on a new
> wiki page of docs going up on the website." There's still nothing about
> our new Vertex API, master compute, etc. on the wiki.
>
> I would say 8 GB to play with is a great amount; you will most definitely
> be able to get very large, interesting graphs to run in memory, depending
> on how many workers (with 8 GB each) you have available. Having 3-4
> workers per machine is not a bad thing if you are provisioned for it. And
> lots of machines. This is a distributed batch processing framework, so
> more is better ;)
>
> As far as vertices with a million edges: sure, but it depends on how many
> of them you have and on your compute resources. Again, I can't go into
> much detail, but Giraph has been extensively tested using real-world,
> large, interesting, useful graph data, including large social graphs with
> supernodes. So if that's what you're supplying, and you have the gear to
> run your data, you've picked the right tool. You can spill to disk, run in
> memory, or spread the load and scale to many, many workers (Mapper tasks)
> hosted on many nodes, and Giraph will behave well if you have the compute
> resources to scale to fit your volume of data.
>
> On Tue, Sep 11, 2012 at 12:27 AM, Avery Ching <ach...@apache.org> wrote:
>
> Hi Jeyendran, nice to meet you.
>
> Answers inline.
>
> On 9/10/12 11:23 PM, Jeyendran Balakrishnan wrote:
>
> I am trying to understand what kind of data Giraph holds in memory per
> worker.
> My questions, in descending order of importance:
> 1. Does Giraph hold in memory exactly one vertex of data at a time, or does
> it need to hold all the vertices assigned to that worker?
>
> All vertices assigned to that worker.
>
> 2. Can Giraph handle vertices with a million edges per vertex?
>
> Depends on how much memory you have. I would recommend making a custom
> vertex implementation with a very efficient store for better scalability
> (e.g., see IntIntNullIntVertex).
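>
> To illustrate the idea, here is a rough sketch of the kind of packed,
> primitive edge store such a vertex can use. This is not the actual
> IntIntNullIntVertex code, just the array trick that avoids per-edge
> object overhead:
>
>     // Illustrative only: keep target vertex ids in a primitive int[]
>     // rather than a collection of boxed Edge objects.
>     public class PackedEdgeStore {
>         private int[] targetIds = new int[4];
>         private int size = 0;
>
>         public void addEdge(int targetId) {
>             if (size == targetIds.length) {
>                 // Grow geometrically so insertion stays amortized O(1).
>                 targetIds = java.util.Arrays.copyOf(targetIds, size * 2);
>             }
>             targetIds[size++] = targetId;
>         }
>
>         public int numEdges() { return size; }
>
>         public int targetAt(int i) { return targetIds[i]; }
>     }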
>
>    If not, at what order of magnitude does it break down: 1000 edges,
> 10K edges, 100K edges? (Of course, I understand that this depends on the
> -Xmx value, so let's say we fix a value of -Xmx8g.)
> 3. Are there any limitations on the kind of objects that can be used as
> vertices? Specifically, does Giraph assume that vertices are lightweight
> (e.g., integer vertex ID + simple Java primitive vertex values + a
> collection of out-edges), or can Giraph support heavyweight vertices
> (holding complex nested Java objects)?
>
> The limitations are that the vertex implementation must be Writable, the
> vertex index must be WritableComparable, the edge type Writable, and the
> message type Writable.
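>
> As a rough sketch of what a "heavyweight" value can look like, any complex
> type works as long as it implements Writable. The class and field names
> below are made up for illustration:
>
>     import java.io.DataInput;
>     import java.io.DataOutput;
>     import java.io.IOException;
>     import org.apache.hadoop.io.Writable;
>
>     // Hypothetical vertex value holding more than a single primitive.
>     public class ProfileValue implements Writable {
>         private double score = 0.0;
>         private String label = "";
>
>         @Override
>         public void write(DataOutput out) throws IOException {
>             out.writeDouble(score);
>             out.writeUTF(label);
>         }
>
>         @Override
>         public void readFields(DataInput in) throws IOException {
>             score = in.readDouble();
>             label = in.readUTF();
>         }
>     }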
>
>
> 4. More generally, what data is stored in memory, and what, if any, is
> offloaded/spilled to disk?
>
> Messages and vertices can be spilled to disk, but you must enable this.
>
> Would appreciate any light the experts can throw on this.
>
> On this note, I would like to mention that the presentations posted on the
> wiki explain what Giraph can do and how to use it from a coding
> perspective, but there are no explanations of the design approach, the
> rationale behind the choices, or the software architecture. I feel that
> new users could really benefit from a design and architecture document,
> along the lines of Hadoop's and Lucene's. For folks who are considering
> whether or not to use Giraph, this could be a big help. The only
> alternative today is to read the source code, the burden of which might in
> itself be a reason for folks not to consider Giraph.
> My 2c :-)
>
> Agreed that documentation is lacking =). That being said, the
> presentations explain most of the design approach and the reasons behind
> it. I would refer you to the Pregel paper for a more detailed look, or ask
> if you have any specific questions.
>
>
> Thanks a lot,
>
> No problem!
>
> Jeyendran
