Re: Giraph : newbie questions

Nicolas DUGUE Tue, 17 Jul 2012 01:22:40 -0700

Thanks for your answer, David!

Okay, but is there a way to force Giraph to partition the graph in our own way, and how would we do that? It may be useful for minimizing communication between Giraph nodes.

You mentioned starting the job with a minimum of vertices and then adding new vertices. That seems really interesting. How is it done, and how does it work? For example, say I run my Giraph job with half of the vertices and, during my first superstep, I add (I don't know how) some vertices to my file. Will these vertices be taken into account in my first superstep, or only in the next superstep?

And once the vertices are loaded, is it possible to remove them from memory? In other words, I can add new vertices; can I remove vertices too? So, is it possible to change the topology of my graph dynamically?

Moreover, I'm still wondering which is best: launching one VM with Giraph on each server with 20GB of RAM, or launching two of them with 10GB of RAM each?

And finally, when I launch a Giraph job, ZooKeeper is loaded in one virtual machine alone... Is there a way to run some Giraph jobs in this virtual machine too? Or to specify explicitly which VM runs the ZooKeeper job?


Best regards,
Nicolas

On 16/07/2012 21:51, David Garcia wrote:

Giraph partitions the vertices using a hashing function that's basically
the equivalent of (hash(vertexID) mod #ofComputeNodes).
You can mitigate memory issues by starting the job with a minimum of
vertices in your file and then add them dynamically as your job progresses
(assuming that your job doesn't require all of the vertices).
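
To make the hashing scheme above concrete, here is a minimal sketch in Python. It is an illustration only, not Giraph's actual code: the function names, the 6-worker count, and the toy degree table are all assumptions made up for the example. It shows that an assignment based only on the vertex ID balances the number of vertices but ignores how many edges each vertex has.

```python
from collections import Counter

def partition_for(vertex_id, num_workers):
    """Assign a vertex to a worker using only its ID (hash mod worker count)."""
    # Hypothetical helper; for small non-negative ints, Python's hash() is the int itself.
    return hash(vertex_id) % num_workers

def load_per_worker(degrees, num_workers):
    """Count vertices and edges that land on each worker."""
    vertices = Counter()
    edges = Counter()
    for vertex_id, degree in degrees.items():
        worker = partition_for(vertex_id, num_workers)
        vertices[worker] += 1
        edges[worker] += degree
    return vertices, edges

# Made-up skewed graph: vertex 0 is a hub with 1000 edges,
# vertices 1..11 have 2 edges each.
degrees = {0: 1000}
degrees.update({v: 2 for v in range(1, 12)})

vertices, edges = load_per_worker(degrees, num_workers=6)
for worker in range(6):
    print(worker, vertices[worker], edges[worker])
```

In this toy case every worker receives two vertices, but the worker that holds the hub also holds almost all of the edges, which is the kind of imbalance a power-law graph can produce under ID-based hashing.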

-David


On 7/16/12 4:36 AM, "Nicolas DUGUE" <nicolas.du...@univ-orleans.fr> wrote:

Hi everybody,

     I'm new to Giraph, so I have a few questions about how it works and
how to configure it to make it work as well as possible.
     We have set up a cluster of 6 servers with 24 CPUs and 24GB of RAM,
and we want to use it to experiment with Giraph.
     So far we've made a few runs and have had some problems with
memory; it seems that we don't give enough of it to the JVM (GC
overhead, OutOfMemory, ...).
     Our experiments were benchmarks using PageRank. We only succeeded
in running it on a graph of 100 million edges, by running two virtual
machines with 8GB of RAM on each of our servers.

     Here are our questions:
     - Which is best: launching one VM with Giraph on each server with
20GB of RAM, or launching two of them with 10GB of RAM each?
     - Is there a way to minimize the memory used by Hadoop in order to
give more memory to the Giraph jobs?
     - How is the graph distributed across the cluster? Our graph may
be a power-law graph, with a few nodes that have a very large number of
edges and a lot of nodes that have only a few edges. How will Giraph
distribute this kind of graph? Does it take into account the number of
edges of each vertex?

Thanks in advance,
Nicolas Dugué
PhD student at the University of Orléans
