Hi! One interesting JVM option I learned about lately is -XX:+UseCompressedStrings, which uses a byte[] instead of a char[] for all strings that consist entirely of ASCII characters. Given that you are working with URIs, I assume that this is true for most of your strings, so I would give it a shot.
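To get a rough feeling for what this could save, here is a minimal sketch that checks whether a URI is pure ASCII and compares the character payload of a char[] (2 bytes per character) with that of a byte[] (1 byte per character). The class and method names are made up for illustration and are not part of Giraph or Hadoop, and the numbers ignore object headers and array overhead.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Giraph or Hadoop.
final class UriFootprint {

    // True if every character is in the 7-bit ASCII range (0..127).
    static boolean isAscii(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 127) {
                return false;
            }
        }
        return true;
    }

    // Payload size when the characters are held in a char[] (2 bytes each).
    static long charPayloadBytes(String s) {
        return 2L * s.length();
    }

    // Payload size when the characters are held in a byte[] (1 byte each).
    // Only meaningful if isAscii(s) is true.
    static long bytePayloadBytes(String s) {
        return s.getBytes(StandardCharsets.US_ASCII).length;
    }

    public static void main(String[] args) {
        // Example URI chosen for illustration.
        String uri = "http://dbpedia.org/resource/Berlin";
        System.out.println("ASCII only:      " + isAscii(uri));
        System.out.println("char[] payload:  " + charPayloadBytes(uri) + " bytes");
        System.out.println("byte[] payload:  " + bytePayloadBytes(uri) + " bytes");
    }
}
```

The flag itself is JVM-version specific, so check that your JVM accepts it before relying on it. With Hadoop 1.x, flags for the task JVMs are usually passed through the mapred.child.java.opts property, next to the -Xmx setting you already use.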
For more info on JVM options, please take a look here:
http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html

HTH
-André

2012/6/6 Benjamin Heitmann <benjamin.heitm...@deri.org>:
> Hello,
>
> can somebody recommend a web page, article or book on minimising the
> memory usage of Giraph/Hadoop code?
> I am looking for non-obvious advice on what *not* to do, and for best
> practices on what to do inside of Hadoop...
>
> E.g. is it preferable to use Java Strings or Hadoop Text Writables?
> Should all strings be externalised?
>
> Currently, I am running a Giraph job with 10 workers. Each worker has a
> maximum heap of 7 GB (-Xmx7G).
> Concurrent garbage collection is enabled. The machine has 24 cores and
> 96 GB of memory.
> The job currently uses a maximum of around 50 GB, so there is free memory
> available outside of Java.
>
> The graph itself has ~2 million vertices and ~4 million edges, which is
> not really "big data".
>
> However, before starting superstep 1, I get heap space errors. Previous
> versions of my algorithm were simpler, but they also ran into heap space
> errors when the data was around one order of magnitude bigger.
>
> My suspicion is that the amount of state which my vertices have, and the
> amount of messages which I am generating, exceed the standard use case of
> a PageRank algorithm by far.
>
> To list a few of the reasons why I need a lot of state:
>
> * I need to execute multiple runs of the same algorithm in parallel.
> Loading this specific graph takes about 3 minutes, running the algorithm
> once takes about 10 seconds or so, but I have around 600 users in that
> graph. And this is just a small graph; the whole algorithm is intended to
> be run for thousands of users. (... "big data"...)
>
> * The identities of the edges and vertices are not based on numbers but
> on strings.
> All edges and all vertices have a URI associated with them.
> The graph represents RDF data from different sources, such as DBpedia.
> In addition, most of the vertices have one or multiple types associated
> with them, and each type is again represented by a URI.
> These types are essential to the logic of the algorithm.
> I guess it would be possible to externalise all of those strings, but it
> adds a layer of complexity which I had previously hoped to avoid.
>
> * As Giraph does not currently provide a central coordination point for
> the processing of the graph, I need to send a lot of messages between
> vertices in order to coordinate the algorithm.
>
> * Giraph does not allow multiple Java classes to be used for different
> vertices in the same graph.
> However, different vertices have different roles in my algorithm, and
> each role has a different set of states in which it can be, due to the
> missing global coordination point.
>
> * Taken together, the lack of a central coordination point and the
> inability to have different Java classes as part of the same graph make
> the whole algorithm more similar to a network protocol than to a graph
> algorithm. Thus I need a lot of messages and a lot of state.
>
>
> If anybody has good suggestions on how I should proceed, I would be very
> interested in hearing them.
>
> If somebody wants to take a look at my code, then I can currently provide
> you with that code in a non-public way.
>
> sincerely, Benjamin Heitmann.