Re: Resources or advice on minimising memory usage in Giraph/Hadoop code ?

André Kelpe Thu, 07 Jun 2012 07:22:17 -0700

Hi!

One interesting JVM option I learned about lately is
-XX:+UseCompressedStrings, which stores strings as a byte[] when they
consist entirely of ASCII characters. Given that you are working with
URIs, I assume that this is true for most of your strings, so I would
give it a shot.
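
To get a rough feel for the saving, here is a minimal sketch that compares the payload size of a string held as a char[] (two bytes per character) with the same string held as an ASCII byte[]. The class AsciiFootprint and the example URI are made up for illustration, and the numbers ignore object headers and array overhead, so treat them as an estimate only.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of the JVM, Hadoop or Giraph.
final class AsciiFootprint {

    // True if every character of s is in the 7-bit ASCII range.
    static boolean isAscii(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0x7F) {
                return false;
            }
        }
        return true;
    }

    // Payload bytes when the characters are held in a char[] (2 bytes each).
    static int charArrayBytes(String s) {
        return s.length() * 2;
    }

    // Payload bytes when the characters are held in an ASCII byte[].
    // Only meaningful if isAscii(s) is true.
    static int byteArrayBytes(String s) {
        return s.getBytes(StandardCharsets.US_ASCII).length;
    }

    public static void main(String[] args) {
        // Assumed example URI, in the style of the DBpedia data mentioned below.
        String uri = "http://dbpedia.org/resource/Berlin";
        System.out.println("ascii only: " + isAscii(uri));
        System.out.println("char[] payload: " + charArrayBytes(uri) + " bytes");
        System.out.println("byte[] payload: " + byteArrayBytes(uri) + " bytes");
    }
}
```

For strings that are pure ASCII, the byte[] payload is half the char[] payload, which is the kind of saving the option aims at.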


For more info on JVM options, please take a look here:
http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html

HTH

-André

2012/6/6 Benjamin Heitmann <[email protected]>:
> Hello,
>
> can somebody recommend a web page, article or book on minimising the memory 
> usage of Giraph/Hadoop code ?
> I am looking for non-obvious advice on what *not* to do, and for best 
> practices on what to do inside of Hadoop...
>
> E.g. is it preferable to use Java Strings or Hadoop Text Writables ? Should 
> all strings be externalised ?
>
> Currently, I am running a Giraph job with 10 workers. Each worker has a 
> maximum heap of Xmx7G.
> The concurrent garbage collection is enabled. The machine has 24 cores, and 
> 96 GB of memory.
> The job currently uses a max of around 50 GB, so there is free memory 
> available outside of java.
>
> The graph itself has ~2 million vertices and ~4 million edges, which is not 
> really "big data".
>
> However, before starting superstep 1, I get heap space errors. Previous 
> versions of my algorithm were simpler,
> but they also ran into heap space errors when the data was around one order 
> of magnitude bigger.
>
> My suspicion is that the amount of state which my vertices have, and the 
> number of messages which I am generating,
> exceed the standard use case of a PageRank algorithm by far.
>
> To list a few of the reasons why I need a lot of state:
>
> * I need to execute multiple runs of the same algorithm in parallel. Loading 
> this specific graph takes about 3 minutes,
> running the algorithm once takes about 10 seconds or so, but I have around 
> 600 users in that graph. And this is just a small graph,
> the whole algorithm is intended to be run for thousands of users. (... "big 
> data"...)
>
> * The identities of the edges and vertices are not based on numbers but on 
> strings.
> All edges and all vertices have a URI associated with them.
> The graph represents RDF data from different sources, such as DBpedia.
> In addition, most of the vertices have one or multiple types associated with 
> them, and
> each type is again represented by a URI.
> These types are essential to the logic of the algorithm.
> I guess it would be possible to externalise all of those strings, but it adds 
> a layer of complexity which I had previously hoped to avoid.
>
> * As Giraph does not currently provide a central coordination point for the 
> processing of the graph,
> I need to send a lot of messages between vertices in order to coordinate the 
> algorithm.
>
> * Giraph does not allow multiple Java classes to be used for different 
> vertices in the same graph.
> However, different vertices have different roles in my algorithm, and each 
> role has a different set of states in which it can be,
> due to the missing global coordination point.
>
> * Taken together, the lack of a central coordination point and the inability 
> to have different Java classes as part of the same graph
> make the whole algorithm more similar to a network protocol than to a 
> graph algorithm. Thus I need a lot of messages
> and a lot of state.
>
>
> If anybody has some good suggestions on how I should proceed, I would be very 
> interested in hearing them.
>
> If somebody wants to take a look at my code, then I can currently provide you 
> with that code in a non-public way.
>
> sincerely, Benjamin Heitmann.
>
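
The idea of externalising the URI strings, raised in the quoted message, could look roughly like the sketch below: each distinct URI is assigned a small integer id once, and vertices, edges and messages then carry the id instead of the string. UriDictionary is a hypothetical class, not a Giraph or Hadoop API, and the sketch assumes the mapping is built in a single-threaded preprocessing step. How the mapping would be shared with the workers is left open.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical class for illustration; not a Giraph or Hadoop API.
final class UriDictionary {

    // URI -> id, used while encoding the input graph.
    private final Map<String, Integer> idByUri = new HashMap<String, Integer>();

    // id -> URI, used to decode results back into readable form.
    private final List<String> uriById = new ArrayList<String>();

    // Returns the existing id for uri, or assigns the next free id.
    int idFor(String uri) {
        Integer id = idByUri.get(uri);
        if (id == null) {
            id = uriById.size();
            idByUri.put(uri, id);
            uriById.add(uri);
        }
        return id;
    }

    // Returns the URI that was registered under id.
    String uriFor(int id) {
        return uriById.get(id);
    }

    // Number of distinct URIs seen so far.
    int size() {
        return uriById.size();
    }
}
```

With such a mapping, a URI that occurs on many edges or as a type of many vertices is stored once as a string and everywhere else as a 4-byte int, at the cost of the extra encode and decode steps the quoted message mentions.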
