RE: Resources or advice on minimising memory usage in Giraph/Hadoop code ?

David Garcia Thu, 07 Jun 2012 07:33:38 -0700

It will if you have a fully connected graph and/or your computation requires 
the instantiation of all the vertices.  It's obviously not fully connected, 
since there are 2 million vertices and 4 million edges, so unless all the 
vertices need to execute a computation, for whatever reason, it may not be 
necessary to instantiate them all.
________________________________________
From: Claudio Martella [[email protected]]
Sent: Thursday, June 07, 2012 1:35 AM
To: [email protected]
Subject: Re: Resources or advice on minimising memory usage in Giraph/Hadoop 
code ?


Won't this just postpone the pain?

On Thursday, June 7, 2012, David Garcia wrote:
Based upon what you have mentioned, I think you are getting heap errors because 
every vertex in your graph will be loaded into memory prior to superstep one.  
So if you have a large graph, with lots of state, you probably have memory 
issues from the very beginning.  A simple way to mitigate the problem is to 
load only the vertices that you need and then add vertices as your 
computation progresses.  This will prevent the entire graph from occupying 
memory.


----- Reply message -----
From: "Avery Ching" <[email protected]>
To: "[email protected]" <[email protected]>
Subject: Resources or advice on minimising memory usage in Giraph/Hadoop code ?
Date: Wed, Jun 6, 2012 10:33 pm



No article or book, but here's a few tips.

1) Use aggregators!  This can drastically reduce the amount of
memory used by combining messages on the server side.
2) -Dmapred.child.java.opts="-Xss128k" or some other value (should
affect the RPC threads or netty threads)
3) You'll want to minimize the state of every vertex as best as
possible, perhaps creating a custom vertex.
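
To illustrate the message-combining idea in tip 1, here is a minimal
plain-Java sketch. It is not the Giraph API: the class CombiningInbox and
its methods are hypothetical names made up for this example, and it assumes
the messages can be merged with a commutative, associative operation (a sum
here). The point is that the receiving side keeps one combined value per
destination vertex instead of a growing list of individual messages.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-alone sketch of message combining; not Giraph code.
class CombiningInbox {
    // One combined value per destination vertex instead of a message list.
    private final Map<Long, Double> combined = new HashMap<Long, Double>();

    // Fold an incoming message into the running sum for its destination.
    void deliver(long targetVertexId, double message) {
        Double current = combined.get(targetVertexId);
        combined.put(targetVertexId,
                current == null ? message : current + message);
    }

    // Combined message for a vertex, or 0.0 if nothing was sent to it.
    double messageFor(long targetVertexId) {
        Double value = combined.get(targetVertexId);
        return value == null ? 0.0 : value;
    }

    // Number of buffered entries: at most one per destination vertex.
    int bufferedEntries() {
        return combined.size();
    }
}
```

With this layout, memory for pending messages grows with the number of
destination vertices rather than with the number of messages sent. It only
helps when the algorithm does not need each message individually.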

Avery

On 6/5/12 7:38 PM, Benjamin Heitmann wrote:
> Hello,
>
> can somebody recommend a web page, article or book on minimising the memory 
> usage of Giraph/Hadoop code ?
> I am looking for non-obvious advice on what *not* to do, and for best 
> practices on what to do inside of Hadoop...
>
> E.g. is it preferable to use Java Strings or Hadoop Text Writables ? Should 
> all strings be externalised ?
>
> Currently, I am running a Giraph job with 10 workers. Each worker has a 
> maximum heap of Xmx7G.
> The concurrent garbage collection is enabled. The machine has 24 cores, and 
> 96 GB of memory.
> The job currently uses a max of around 50 GB, so there is free memory 
> available outside of java.
>
> The graph itself has ~2 million vertices and ~4 million edges, which is not 
> really "big data".
>
> However, before starting superstep 1, I get heap space errors. Previous 
> versions of my algorithm were simpler,
> but they also ran into heap space errors when the data was around one order 
> of magnitude bigger.
>
> My suspicion is that the amount of state which my vertices have, and the 
> amount of messages which I am generating
> exceeds the standard use case of a PageRank algorithm by far.
>
> To list a few of the reasons why I need a lot of state:
>
> * I need to execute multiple runs of the same algorithm in parallel. Loading 
> this specific graph takes about 3 minutes,
> running the algorithm once takes about 10 seconds or so, but I have around 
> 600 users in that graph. And this is just a small graph,
> the whole algorithm is intended to be run for thousands of users. (... "big 
> data"...)
>
> * The identities of the edges and vertices are not based on numbers but on 
> strings.
> All edges and all vertices have a URI associated with them.
> The graph represents RDF data from different sources, such as DBpedia.
> In addition, most of the vertices have one or multiple types associated with 
> them, and
> each type is again represented by a URI.
> These types are essential to the logic of the algorithm.
> I guess it would be possible to externalise all of those strings, but it adds 
> a layer of complexity which I had previously hoped to avoid.
>
> * As Giraph does not currently provide a central coordination point for the 
> processing of the graph,
> I need to send a lot of messages between vertices in order to coordinate the 
> algorithm.
>
> * Giraph does not allow multiple Java classes to be used for different 
> vertices in the same graph.
> However, different vertices have different roles in my algorithm, and each 
> role has a different set of states in which it can be,
> due to the missing global coordination point.
>
> * Taken together, the lack of a central coordination point and the inability 
> to have different Java classes as part of the same graph,
> make the whole algorithm more similar to a network protocol and not to a 
> graph algorithm. Thus I need a lot of messages
> and a lot of state.
>
>
> If anybody has good suggestions on how I should proceed, I would be very 
> interested in hearing them.
>
> If somebody wants to take a look at my code, then I can currently provide you 
> with that code in a non-public way.
>
> sincerely, Benjamin Heitmann.
>
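
On the question above of externalising the URI strings: one common way to
do that is dictionary encoding, where each distinct URI is assigned a small
integer once and vertices, edges and types store only the integer. The
sketch below is plain Java under that assumption. UriDictionary and its
methods are hypothetical names, not part of Giraph or Hadoop, and the
DBpedia URIs in the usage note are only illustrative.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: maps each distinct URI to a compact int id and back.
class UriDictionary {
    private final Map<String, Integer> idByUri = new HashMap<String, Integer>();
    private final List<String> uriById = new ArrayList<String>();

    // Returns the existing id for a URI, or assigns the next free id.
    int idFor(String uri) {
        Integer id = idByUri.get(uri);
        if (id == null) {
            id = uriById.size();
            idByUri.put(uri, id);
            uriById.add(uri);
        }
        return id;
    }

    // Reverse lookup, e.g. when writing results back out as URIs.
    String uriFor(int id) {
        return uriById.get(id);
    }

    // Number of distinct URIs seen so far.
    int size() {
        return uriById.size();
    }
}
```

For example, calling idFor("http://dbpedia.org/resource/Berlin") twice
returns the same id, so a type shared by many vertices costs one int per
vertex instead of one string per vertex. In a distributed job the
dictionary would presumably have to be built in a preprocessing step so
that all workers agree on the ids; that step is not shown here.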



--
   Claudio Martella
   [email protected]
