Hello, 

Can somebody recommend a web page, article or book on minimising the memory 
usage of Giraph/Hadoop code? 
I am looking for non-obvious advice on what *not* to do, and for best practices 
on what to do inside of Hadoop...

E.g. is it preferable to use Java Strings or Hadoop Text Writables? Should all 
strings be externalised? 

Currently, I am running a Giraph job with 10 workers. Each worker has a maximum 
heap of 7 GB (-Xmx7G). 
Concurrent garbage collection is enabled. The machine has 24 cores and 96 
GB of memory. 
The job currently uses a maximum of around 50 GB, so there is free memory 
available outside of Java.

The graph itself has ~2 million vertices and ~4 million edges, which is not 
really "big data".

However, before starting superstep 1, I get heap space errors. Previous 
versions of my algorithm were simpler, 
but they also ran into heap space errors when the data was around one order of 
magnitude bigger. 

My suspicion is that the amount of state which my vertices have, and the number 
of messages which I am generating, 
far exceed the standard use case of a PageRank algorithm. 

To list a few of the reasons why I need a lot of state: 

* I need to execute multiple runs of the same algorithm in parallel. Loading 
this specific graph takes about 3 minutes, 
running the algorithm once takes about 10 seconds or so, but I have around 600 
users in that graph. And this is just a small graph, 
the whole algorithm is intended to be run for thousands of users. (... "big 
data"...) 

* The identities of the edges and vertices are not based on numbers but on 
strings. 
All edges and all vertices have a URI associated with them. 
The graph represents RDF data from different sources, such as DBpedia. 
In addition, most of the vertices have one or multiple types associated with 
them, and 
each type is again represented by a URI. 
These types are essential to the logic of the algorithm. 
I guess it would be possible to externalise all of those strings, but it adds a 
layer of complexity which I had previously hoped to avoid. 

* As Giraph does not currently provide a central coordination point for the 
processing of the graph, 
I need to send a lot of messages between vertices in order to coordinate the 
algorithm.

* Giraph does not allow multiple Java classes to be used for different vertices 
in the same graph. 
However, different vertices have different roles in my algorithm, and each role 
has a different set of states in which it can be, 
due to the missing global coordination point. 

* Taken together, the lack of a central coordination point and the inability 
to have different Java classes as part of the same graph 
make the whole algorithm more similar to a network protocol than to a graph 
algorithm. Thus I need a lot of messages 
and a lot of state. 
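
To make the idea of externalising the strings more concrete, here is a minimal 
sketch of what I have in mind, in plain Java and independent of Giraph. 
The class name UriDictionary and its methods are hypothetical, not an existing 
Giraph or Hadoop API, and the URIs are only example values. 
Each distinct URI is stored once and replaced by a dense int ID, so vertices, 
edges, types and messages could carry ints instead of full strings. 
The sketch does not address how such a dictionary would be built up front and 
shared consistently between workers. 

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (an assumption, not a Giraph/Hadoop class). */
final class UriDictionary {
    // URI -> ID lookup, used while loading the graph.
    private final Map<String, Integer> idByUri = new HashMap<>();
    // ID -> URI lookup, used when writing results back out.
    private final List<String> uriById = new ArrayList<>();

    /** Returns the ID of the URI, assigning the next free ID on first sight. */
    int encode(String uri) {
        Integer existing = idByUri.get(uri);
        if (existing != null) {
            return existing;
        }
        int id = uriById.size();
        idByUri.put(uri, id);
        uriById.add(uri);
        return id;
    }

    /** Returns the URI that was registered under the given ID. */
    String decode(int id) {
        return uriById.get(id);
    }

    /** Number of distinct URIs seen so far. */
    int size() {
        return uriById.size();
    }

    public static void main(String[] args) {
        UriDictionary dict = new UriDictionary();
        int berlin = dict.encode("http://dbpedia.org/resource/Berlin");
        int city = dict.encode("http://dbpedia.org/ontology/City");
        int berlinAgain = dict.encode("http://dbpedia.org/resource/Berlin");
        // The same URI always maps to the same ID.
        System.out.println(berlin + " " + city + " " + berlinAgain);
        System.out.println(dict.decode(city));
    }
}
```

With something like this, the per-vertex state would hold one int per URI or 
type instead of one String, at the cost of the extra layer of complexity 
mentioned above. 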


If anybody has good suggestions on how I should proceed, I would be very 
interested in hearing them. 

If somebody wants to take a look at my code, then I can currently provide you 
with that code in a non-public way.

sincerely, Benjamin Heitmann. 
