Hi Jeyendran,

As Paolo mentioned, there were two patches to deal with out-of-core:
GIRAPH-249 for out-of-core graph
GIRAPH-45 for out-of-core messages

For the graph part, the current assumption is that you have enough memory to
keep at least one whole partition in memory at a time. The options you need
to set here are:
giraph.useOutOfCoreGraph=true
giraph.maxPartitionsInMemory= as many as you can keep

For the messages, it is not necessary that the messages for a whole partition
fit in memory, since they are streamed on a per-vertex basis. There is,
however, the constraint that all vertex ids (from all partitions) need to fit
in memory, but for your application I understand that's not an issue. Options:
giraph.useOutOfCoreMessages=true
giraph.maxMessagesInMemory= as many as you can keep

Also for messages: if you have a really heavy load and still run out of
memory, you can try the options from GIRAPH-287. In practice it happens that
messages are created much faster than they are actually transferred and
processed on the destination, and these options prevent that from happening.
But try it without them first, since they can really slow down your
application. You set:
giraph.waitForRequestsConfirmation=true
giraph.maxNumberOfOpenRequests= as many as you want
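
Putting these together, here is a minimal sketch of how you could set the
options programmatically on the Hadoop Configuration your job uses (just an
illustration; the class name and the numeric values are placeholders you need
to tune for your own heap and workload):

    import org.apache.hadoop.conf.Configuration;

    public class OutOfCoreSettings {
      public static void configure(Configuration conf) {
        // Out-of-core graph (GIRAPH-249): partitions beyond this count
        // are spilled to disk.
        conf.setBoolean("giraph.useOutOfCoreGraph", true);
        conf.setInt("giraph.maxPartitionsInMemory", 10);     // placeholder

        // Out-of-core messages (GIRAPH-45): streamed on a per-vertex basis.
        conf.setBoolean("giraph.useOutOfCoreMessages", true);
        conf.setInt("giraph.maxMessagesInMemory", 1000000);  // placeholder

        // GIRAPH-287 throttling: uncomment only if you still hit OOMEs,
        // since it can really slow down your application.
        // conf.setBoolean("giraph.waitForRequestsConfirmation", true);
        // conf.setInt("giraph.maxNumberOfOpenRequests", 100);
      }
    }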


Hope this helps, let us know if you have any other questions.

Maja

On 9/13/12 8:41 AM, "Paolo Castagna" <castagna.li...@gmail.com> wrote:

>Hi Jeyendran,
>interesting questions and IMHO it is not always easy to understand how
>many Giraph workers are necessary in order to process a specific
>(large) graph.
>A few more comments inline, but I am interested in the answers to your
>questions as well.
>
>On 13 September 2012 07:03, Jeyendran Balakrishnan <j...@personaltube.com>
>wrote:
>> After reading both of your replies, I have some (final!) questions
>>regarding
>> memory usage:
>>
>> ·         For applications with a large number of edges per vertex: Are
>> there any built-in vertex or helper classes or at least sample code
>>which
>> feature spilling of edges to disk, or some kind of disk-backed map of
>>edges,
>> to support such vertices? Or do we have to sort of roll our own?
>
>You'll probably need to roll your own (let's see what others suggest).
>However, if you do that, you should do it in the open so others can have a
>look, possibly help you, and perhaps ensure that what you do can in the
>future be contributed back to Giraph for others to benefit from and use.
>
>A few months ago I had a look at this and I tried to use TDB (i.e. the
>storage layer available in Apache Jena) to store (and spill to disk)
>vertices with Giraph.
>TDB uses B+Trees and memory-mapped files. It's designed and tuned to store
>RDF; however, it is not limited to RDF, and someone might reuse its
>low-level indexing capabilities to store different graphs.
>
>Even if you do not use TDB, having a look at its sources might inspire you
>or give you some ideas about what you could do:
>https://svn.apache.org/repos/asf/jena/trunk/jena-tdb/src/main/java/com/hp/hpl/jena/tdb/index/
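>
>To give a flavour of what "rolling your own" could look like, here is a toy
>sketch I am making up (it is not Giraph or TDB code): a fixed-width,
>disk-backed edge list. A real implementation would need buffering and proper
>indexing, along the lines of TDB's B+Trees:
>
>    import java.io.IOException;
>    import java.io.RandomAccessFile;
>
>    // Toy disk-backed edge list: each record is a fixed-width
>    // (int source, int target) pair stored directly in a file.
>    public class DiskEdgeList {
>      private final RandomAccessFile file;
>      private long count = 0;
>
>      public DiskEdgeList(String path) throws IOException {
>        this.file = new RandomAccessFile(path, "rw");
>      }
>
>      public void append(int source, int target) throws IOException {
>        file.seek(count * 8);  // two 4-byte ints per record
>        file.writeInt(source);
>        file.writeInt(target);
>        count++;
>      }
>
>      public int[] get(long i) throws IOException {
>        file.seek(i * 8);
>        return new int[] { file.readInt(), file.readInt() };
>      }
>    }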
>
>
>> ·         For graphs with a large number of vertices relative to
>>available
>> workers, at least in development phase,  one may not always have access
>>to a
>> large number of workers, yet one might wish to process a very large
>>graph.
>> In these cases, it may happen that the workers may not be able to hold
>>all
>> their assigned vertices in memory. So again in this case, are there any
>> built-in classes to allow spilling of vertices to disk, or a similar
>>kind of
>> disk-backed map?
>
>Here, I am not sure I understand where your need comes from.
>
>I usually develop and test everything locally, but while I do that I use a
>small dataset which can be loaded in memory and allows me to iterate faster.
>
>Why do you need to use a large/real dataset in the development phase?
>
>How "large" is your "large" number of vertices?
>
>Even if you use indexes and data structures on disk, as your dataset grows,
>the indexing and processing might take a long time. So, perhaps, in
>development you are better off with small datasets anyway.
>
>> ·         Assuming some kind of disk backing is implemented to handle
>>large
>> number of vertices/edges (under a situation of insufficient # of
>>workers or
>> memory per worker), is it likely that just the volume of IO
>>(message/IPC)
>> could cause OOMEs? Or merely slowdowns?
>
>There was work on spilling data to disk and I found GIRAPH-249
>(marked as resolved):
>https://issues.apache.org/jira/browse/GIRAPH-249
>
>> In general, I feel that one of the reasons for wide and rapid adoption
>>of
>> Hadoop is the "download, install and run" feature, where even for large
>>data
>> sets, the stock code will still run to completion on a single laptop
>>(or a
>> single Linux server, etc), except that it will take more time. But this
>>may
>> be perfectly acceptable for people who are evaluating and experimenting,
>> since there is no incurred cost for hardware. A lot of developers might
>>be
>> OK with giving the thing a run overnight on their laptops or fire up
>>just
>> one spot instance on EC2 etc and let it chug along for a couple of days.
>
>Making it as easy as possible to get up and running with Giraph is very
>important.
>
>However, even with Hadoop, I think testing on your laptop with a small
>(but
>representative) dataset is a good thing and it allows you to iterate much
>faster
>when you are in development (as well as use those small datasets in your
>unit tests).
>
>> I know this was the case for me when I was starting out with Hadoop. So
>>more
>> nodes are needed only to speed things up, but not for functionality.
>>
>> It might be great to include such features in Giraph also… which will
>> require that disk-backed workers be supported in the code as a standard
>> feature…
>
>With MapReduce, developers are not involved in any "capacity" planning
>related to how much RAM they would need when they run their MapReduce jobs
>(sort of...)
>
>With Giraph this might not always be the case. Only if messages and
>vertices are spilled to disk are users/developers freed from thinking about
>the minimum number of workers needed to avoid an OOME. And I found it not
>that easy, given a graph and an algorithm, to estimate the number of
>necessary workers.
>
>> Would love to hear your thoughts on these…
>
>Me too. I have not had time to read the Giraph sources recently, but I
>would like to know if the spilling of messages as well as vertices is now
>done and the problems described above are no longer problems. That would
>be awesome. :-)
>
>Paolo
>
>>
>>
>>
>> Thanks,
>>
>> Jeyendran
>>
>>
>>
>>
>>
>> From: Eli Reisman [mailto:apache.mail...@gmail.com]
>> Sent: Tuesday, September 11, 2012 12:11 PM
>> To: user@giraph.apache.org
>> Subject: Re: Can Giraph handle graphs with very large number of edges
>>per
>> vertex?
>>
>>
>>
>> Hi Jeyendran, I was just saying the same thing about the documentation
>>on
>> another thread, couldn't agree more. There will be progress on this
>>soon, I
>> promise. I'd like us to reach a model of "if you add a new feature or
>>change
>> a core feature, the patch gets committed contingent on a new wiki page
>>of
>> docs going up on the website." There's still nothing about our new
>>Vertex
>> API, master compute, etc. on the wiki.
>>
>> I would say 8 gigs to play with is a great amount where you will most
>> definitely be able to get very large interesting graphs to run
>>in-memory,
>> depending on how many workers (with 8G each) you have to work with.
>>Having
>> 3-4 workers per machine is not a bad thing if you are provisioned to do
>> this. And lots of machines. This is a distributed batch processing
>> framework, so more is better ;)
>>
>> As far as vertices with a million edges go: sure, but it depends on how
>> many of them you have and on your compute resources. Again, I can't go
>> into much detail, but
>>Giraph
>> has been extensively tested using real-world, large, interesting, useful
>> graph data. This includes large social graphs that have supernodes. So
>>if
>> you're supplying that, and you have the gear to run your data, you've
>>picked
>> the right tool. You can spill to disk, run in memory, or spread the
>>load and
>> scale to many, many workers (Mapper tasks) hosted on many nodes and
>>Giraph
>> will behave well if you have the compute resource to scale to fit your
>> volume of data.
>>
>> On Tue, Sep 11, 2012 at 12:27 AM, Avery Ching <ach...@apache.org> wrote:
>>
>> Hi Jeyendran, nice to meet you.
>>
>> Answers inline.
>>
>>
>>
>> On 9/10/12 11:23 PM, Jeyendran Balakrishnan wrote:
>>
>> I am trying to understand what kind of data Giraph holds in memory per
>> worker.
>> My questions in descending order of importance:
>> 1. Does Giraph hold in memory exactly one vertex of data at a time, or
>>does
>> it need to hold all the vertices assigned to that worker?
>>
>> All vertices assigned to that worker.
>>
>>
>>
>> 2. Can Giraph handle vertices with a million edges per vertex?
>>
>> Depends on how much memory you have.  I would recommend making a custom
>> vertex implementation that has a very efficient store for better
>> scalability (e.g. see IntIntNullIntVertex).
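>>
>> As a rough illustration of the idea (a sketch, not the actual
>> IntIntNullIntVertex code): keeping edges in a primitive int array avoids
>> allocating one object per edge.
>>
>>    // Sketch: primitive-array edge storage, one int per out-edge.
>>    public class CompactEdges {
>>      private int[] targets = new int[4];  // destination vertex ids
>>      private int size = 0;
>>
>>      public void add(int target) {
>>        if (size == targets.length) {      // grow geometrically
>>          int[] bigger = new int[targets.length * 2];
>>          System.arraycopy(targets, 0, bigger, 0, size);
>>          targets = bigger;
>>        }
>>        targets[size++] = target;
>>      }
>>
>>      public int numEdges() { return size; }
>>      public int target(int i) { return targets[i]; }
>>    }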
>>
>>
>>
>>     If not, at what order of magnitude does it break down? - 1000
>>edges, 10K
>> edges, 100K edges?...
>>    (Of course, I understand that this depends upon the -Xmx value, so
>>let's
>> say we fix a value of -Xmx8g).
>> 3. Are there any limitations on the kind of objects that can be used as
>> vertices?
>>     Specifically, does Giraph assume that vertices are lightweight (eg,
>> integer vertex ID + simple Java primitive vertex values + collection of
>> out-edges),
>>     or can Giraph support heavyweight vertices (hold complex nested Java
>> objects in a vertex)?
>>
>> The limitations are that the vertex implementation must be Writable, the
>> vertex index must be WritableComparable, the edge type Writable, and the
>> message type Writable.
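>>
>> For example, a minimal custom vertex value type only needs the two
>> Writable methods (a made-up sketch; the fields are just an example):
>>
>>    import java.io.DataInput;
>>    import java.io.DataOutput;
>>    import java.io.IOException;
>>    import org.apache.hadoop.io.Writable;
>>
>>    // Sketch: any value type works as long as it serializes via Writable.
>>    public class MyVertexValue implements Writable {
>>      private double rank;
>>      private long lastUpdated;
>>
>>      public void write(DataOutput out) throws IOException {
>>        out.writeDouble(rank);
>>        out.writeLong(lastUpdated);
>>      }
>>
>>      public void readFields(DataInput in) throws IOException {
>>        rank = in.readDouble();
>>        lastUpdated = in.readLong();
>>      }
>>    }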
>>
>>
>>
>> 4. More generally, what data is stored in memory, and what, if any, is
>> offloaded/spilled to disk?
>>
>> Messages and vertices can be spilled to disk, but you must enable this.
>>
>>
>>
>> Would appreciate any light the experts can throw on this.
>>
>> On this note, I would like to mention that the presentations posted on
>>the
>> Wiki explain what Giraph can do, and how to use it from a coding
>> perspective, but there are no explanations of the design approach used,
>>the
>> rationale behind the choices, and the software architecture. I feel
>>that new
>> users can really benefit from a design and architecture document,
>>along the
>> lines of Hadoop and Lucene. For folks who are considering whether or
>>not to
>> use Giraph, this can be a big help. The only alternative today is to
>>read
>> the source code, the burden of which might in itself be reason for
>>folks not
>> to consider using Giraph.
>> My 2c  :-)
>>
>>
>>
>> Agreed that documentation is lacking =).  That being said, the
>>presentations
>> explain most of the design approach and reasons.  I would refer to the
>> Pregel paper for a more detailed look or ask if you have any specific
>> questions.
>>
>>
>> Thanks a lot,
>>
>> No problem!
>>
>> Jeyendran
>>
>>
>>
>>
