Yes, I think after I emailed you I remembered we did talk about this briefly a few months back on the mailing list or a thread. Giraph has been coming along nicely, I agree! If you do put your solution up, let me know, I'd love to see it.
On Sun, Sep 16, 2012 at 2:24 AM, Paolo Castagna <castagna.li...@gmail.com> wrote:

> Hi Eli,
> mine was just an attempt pre-GIRAPH-249 (and my attempt was not successful because I had problems sub-classing MutableVertex at the time, if I remember correctly. I think I shared some of my issues at the time...). Anyway, now that GIRAPH-249 is closed, the need for that is less or gone. I just need some time to look at the Giraph source code now and test/use what you did for GIRAPH-249. Giraph is making such good progress and I still need to catch up. :-)
>
> If/when I do that and I see any valuable (i.e. faster) alternative, I'll share it here or implement an example and share the code on GitHub.
>
> Cheers,
> Paolo
>
> On 14 September 2012 11:03, Eli Reisman <apache.mail...@gmail.com> wrote:
> > Great discussion. I am very curious about the Apache Jena spill solution you were speaking of; I will check it out. What was your impression, was it successful for your uses? Sounds like it was not so hard to adapt for this use?
> >
> > The good news is that Giraph recently acquired the ability to optionally spill both messages and vertex data to disk to avoid overloads, and when configured right it should provide the functionality you're looking for. Even though Giraph rides atop the Hadoop framework, it performs its calculations in a fundamentally different paradigm than MapReduce, so I doubt we will ever fully replicate Hadoop's ability to trade cluster size for calculation time so transparently. Regarding the use of these new features, there are threads on the Giraph JIRA list by Maja, Claudio, and Alessandro covering these issues that I'd recommend reading. Try it out, and please let us know how it goes for you. It's exciting for us to have these features available now.
> >
> >> Many thanks for the point-by-point replies. It clarifies a lot of questions I had.
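[Editor's note: the spill-to-disk features Eli mentions are switched on through job configuration. A minimal sketch of a job submission follows; the option names (`giraph.useOutOfCoreGraph`, `giraph.useOutOfCoreMessages`, and the in-memory thresholds) and the example class names are assumptions based on the out-of-core work of that era (GIRAPH-45/GIRAPH-249) and may differ between Giraph versions; check GiraphConfiguration in your build.]

```
# Hypothetical job submission enabling out-of-core graph and messages
# via -ca (custom argument) flags; option names may vary by version,
# and org.example.* classes are placeholders for your own job classes.
hadoop jar giraph-with-dependencies.jar org.apache.giraph.GiraphRunner \
  org.example.MyVertex \
  -vif org.example.MyVertexInputFormat -vip /input/graph \
  -w 4 \
  -ca giraph.useOutOfCoreGraph=true \
  -ca giraph.maxPartitionsInMemory=10 \
  -ca giraph.useOutOfCoreMessages=true \
  -ca giraph.maxMessagesInMemory=1000000
```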
> >>
> >> The Pregel papers did throw more light on the approach and architecture.
> >>
> >> Hi Eli,
> >>
> >> Your feedback about very large scale applications on Giraph sounds very encouraging. Thanks very much.
> >>
> >> After reading both of your replies, I have some (final!) questions regarding memory usage:
> >>
> >> · For applications with a large number of edges per vertex: Are there any built-in vertex or helper classes, or at least sample code, which feature spilling of edges to disk, or some kind of disk-backed map of edges, to support such vertices? Or do we have to sort of roll our own?
> >>
> >> · For graphs with a large number of vertices relative to available workers: at least in the development phase, one may not always have access to a large number of workers, yet one might wish to process a very large graph. In these cases, it may happen that the workers are not able to hold all their assigned vertices in memory. So again, are there any built-in classes to allow spilling of vertices to disk, or a similar kind of disk-backed map?
> >>
> >> · Assuming some kind of disk backing is implemented to handle a large number of vertices/edges (under a situation of an insufficient number of workers or memory per worker), is it likely that just the volume of I/O (message/IPC) could cause OOMEs? Or merely slowdowns?
> >>
> >> In general, I feel that one of the reasons for the wide and rapid adoption of Hadoop is the "download, install and run" feature, where even for large data sets the stock code will still run to completion on a single laptop (or a single Linux server, etc.), except that it will take more time. But this may be perfectly acceptable for people who are evaluating and experimenting, since there is no incurred cost for hardware.
> >> A lot of developers might be OK with giving the thing a run overnight on their laptops, or firing up just one spot instance on EC2 etc. and letting it chug along for a couple of days.
> >>
> >> I know this was the case for me when I was starting out with Hadoop. So more nodes are needed only to speed things up, but not for functionality.
> >>
> >> It might be great to include such features in Giraph also... which will require that disk-backed workers be supported in the code as a standard feature...
> >>
> >> Would love to hear your thoughts on these...
> >>
> >> Thanks,
> >>
> >> Jeyendran
> >>
> >> From: Eli Reisman [mailto:apache.mail...@gmail.com]
> >> Sent: Tuesday, September 11, 2012 12:11 PM
> >> To: user@giraph.apache.org
> >> Subject: Re: Can Giraph handle graphs with very large number of edges per vertex?
> >>
> >> Hi Jeyendran, I was just saying the same thing about the documentation on another thread; couldn't agree more. There will be progress on this soon, I promise. I'd like us to reach a model of "if you add a new feature or change a core feature, the patch gets committed contingent on a new wiki page of docs going up on the website." There's still nothing about our new Vertex API, master compute, etc. on the wiki.
> >>
> >> I would say 8 gigs to play with is a great amount, where you will most definitely be able to get very large, interesting graphs to run in-memory, depending on how many workers (with 8G each) you have to work with. Having 3-4 workers per machine is not a bad thing if you are provisioned to do this. And lots of machines. This is a distributed batch processing framework, so more is better ;)
> >>
> >> As far as vertices with a million edges: sure, but it depends on how many of them there are and your compute resources.
> >> Again, I can't go into much detail, but Giraph has been extensively tested using real-world, large, interesting, useful graph data. This includes large social graphs that have supernodes. So if you're supplying that, and you have the gear to run your data, you've picked the right tool. You can spill to disk, run in memory, or spread the load and scale to many, many workers (Mapper tasks) hosted on many nodes, and Giraph will behave well if you have the compute resources to scale to fit your volume of data.
> >>
> >> On Tue, Sep 11, 2012 at 12:27 AM, Avery Ching <ach...@apache.org> wrote:
> >>
> >> Hi Jeyendran, nice to meet you.
> >>
> >> Answers inline.
> >>
> >> On 9/10/12 11:23 PM, Jeyendran Balakrishnan wrote:
> >>
> >> I am trying to understand what kind of data Giraph holds in memory per worker. My questions, in descending order of importance:
> >>
> >> 1. Does Giraph hold in memory exactly one vertex of data at a time, or does it need to hold all the vertices assigned to that worker?
> >>
> >> All vertices assigned to that worker.
> >>
> >> 2. Can Giraph handle vertices with a million edges per vertex?
> >>
> >> Depends on how much memory you have. I would recommend making a custom vertex implementation that has a very efficient store for better scalability (e.g. see IntIntNullIntVertex).
> >>
> >> If not, at what order of magnitude does it break down? 1,000 edges, 10K edges, 100K edges? (Of course, I understand that this depends upon the -Xmx value, so let's say we fix a value of -Xmx8g.)
> >>
> >> 3. Are there any limitations on the kind of objects that can be used as vertices?
> >> Specifically, does Giraph assume that vertices are lightweight (e.g. integer vertex ID + simple Java primitive vertex values + collection of out-edges), or can Giraph support heavyweight vertices (holding complex nested Java objects in a vertex)?
> >>
> >> The limitations are that the vertex implementation must be Writable, the vertex index must be WritableComparable, the edge type must be Writable, and the message type must be Writable.
> >>
> >> 4. More generally, what data is stored in memory, and what, if any, is offloaded/spilled to disk?
> >>
> >> Messages and vertices can be spilled to disk, but you must enable this.
> >>
> >> Would appreciate any light the experts can throw on this.
> >>
> >> On this note, I would like to mention that the presentations posted on the wiki explain what Giraph can do and how to use it from a coding perspective, but there are no explanations of the design approach used, the rationale behind the choices, and the software architecture. I feel that new users can really benefit from a design and architecture document, along the lines of Hadoop and Lucene. For folks who are considering whether or not to use Giraph, this can be a big help. The only alternative today is to read the source code, the burden of which might in itself be a reason for folks not to consider using Giraph.
> >>
> >> My 2c :-)
> >>
> >> Agreed that documentation is lacking =). That being said, the presentations explain most of the design approach and reasons. I would refer to the Pregel paper for a more detailed look, or ask if you have any specific questions.
> >>
> >> Thanks a lot,
> >>
> >> No problem!
> >>
> >> Jeyendran
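[Editor's note: Avery's advice above about a "very efficient store" (like IntIntNullIntVertex) boils down to keeping edges in primitive arrays rather than collections of boxed edge objects, so per-edge memory overhead stays near 4 bytes. This is a hypothetical, standalone sketch of that idea, not Giraph's actual class or API:]

```java
import java.util.Arrays;

// Hypothetical sketch (NOT Giraph's IntIntNullIntVertex): out-edges held
// as a growable primitive int array, avoiding per-edge object headers and
// boxed Integers. A million edges costs roughly 4 MB here.
public class CompactIntVertex {
    private final int id;
    private int value;
    private int[] targetIds;  // destination vertex ids; no edge values stored
    private int numEdges;

    public CompactIntVertex(int id, int value, int initialCapacity) {
        this.id = id;
        this.value = value;
        this.targetIds = new int[Math.max(1, initialCapacity)];
    }

    public void addEdge(int targetId) {
        if (numEdges == targetIds.length) {
            // Amortized O(1) growth, same strategy as ArrayList
            targetIds = Arrays.copyOf(targetIds, targetIds.length * 2);
        }
        targetIds[numEdges++] = targetId;
    }

    public int getId() { return id; }
    public int getValue() { return value; }
    public void setValue(int value) { this.value = value; }
    public int getNumEdges() { return numEdges; }

    public static void main(String[] args) {
        CompactIntVertex v = new CompactIntVertex(1, 0, 4);
        for (int t = 0; t < 1000000; t++) {
            v.addEdge(t);  // a "supernode" with a million out-edges
        }
        System.out.println(v.getNumEdges());  // prints 1000000
    }
}
```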
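[Editor's note: Avery's answer to question 3 says every vertex, edge, and message type must be Writable. The contract is two methods, write(DataOutput) and readFields(DataInput). The sketch below demonstrates that round-trip with plain java.io streams so it runs without the Hadoop jar; a real Giraph type would implement org.apache.hadoop.io.Writable with the same method shapes. The class and fields are made up for illustration:]

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical vertex value demonstrating the Writable contract without
// the Hadoop dependency: the type serializes itself field-by-field and
// reads the fields back in the same order.
public class PageInfoValue {
    private double rank;
    private String url;

    public PageInfoValue() {}  // Writables need a no-arg constructor for deserialization
    public PageInfoValue(double rank, String url) { this.rank = rank; this.url = url; }

    public void write(DataOutput out) throws IOException {
        out.writeDouble(rank);
        out.writeUTF(url);
    }

    public void readFields(DataInput in) throws IOException {
        rank = in.readDouble();  // must mirror write() order exactly
        url = in.readUTF();
    }

    public double getRank() { return rank; }
    public String getUrl() { return url; }

    public static void main(String[] args) throws IOException {
        // Round-trip: serialize, then deserialize into a fresh instance,
        // which is what happens when a vertex is shipped between workers.
        PageInfoValue original = new PageInfoValue(0.85, "http://example.org");
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        original.write(new DataOutputStream(bytes));

        PageInfoValue copy = new PageInfoValue();
        copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
        System.out.println(copy.getRank() + " " + copy.getUrl());  // prints 0.85 http://example.org
    }
}
```

This same pattern answers the "heavyweight vertices" question: nested objects work as long as every field is reachable by write()/readFields(), but each extra field adds serialization and memory cost per vertex.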