Hello Giraph mailing list,

I'm a student at TU Berlin. For a project led by Sebastian Schelter (Giraph committer), another student and I are implementing algorithms to efficiently calculate the closeness of nodes in a graph. We implemented a Flajolet-Martin sketch as described in "HADI: Fast Diameter Estimation and Mining in Massive Graphs with Hadoop" (Kang et al.) and the HyperLogLog sketch for space-efficient closeness computations in graphs.
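For readers unfamiliar with these sketches, here is a minimal, illustrative Flajolet-Martin sketch in Java. It is not our actual Giraph code: the class name `FMSketch`, the number of bitmaps, and the hash mixer are assumptions chosen for the example. It only shows the two properties that matter for closeness computations: a sketch has a fixed size, and two sketches can be merged with a bitwise OR.

```java
/** Illustrative Flajolet-Martin sketch; all names and parameters are assumptions. */
final class FMSketch {
    // Correction factor from the Flajolet-Martin analysis.
    private static final double PHI = 0.77351;
    // One 32-bit bitmap per hash function; averaging over several reduces variance.
    private final int[] bitmaps;

    FMSketch(int numBitmaps) {
        bitmaps = new int[numBitmaps];
    }

    /** Records one element, e.g. a vertex id. Adding it again changes nothing. */
    void add(long element) {
        for (int i = 0; i < bitmaps.length; i++) {
            // Position of the lowest set bit of the hash (32 if the hash is 0).
            int r = Integer.numberOfTrailingZeros(hash(element, i));
            if (r < 32) {
                bitmaps[i] |= 1 << r;
            }
        }
    }

    /** Merges another sketch into this one; the result represents the set union. */
    void merge(FMSketch other) {
        if (other.bitmaps.length != bitmaps.length) {
            throw new IllegalArgumentException("sketch sizes differ");
        }
        for (int i = 0; i < bitmaps.length; i++) {
            bitmaps[i] |= other.bitmaps[i];
        }
    }

    /** Estimates the number of distinct elements that were added. */
    double estimate() {
        double sum = 0;
        for (int b : bitmaps) {
            // Position of the lowest zero bit in the bitmap.
            sum += Integer.numberOfTrailingZeros(~b);
        }
        return Math.pow(2, sum / bitmaps.length) / PHI;
    }

    /** 64-bit mixer (splitmix64 finalizer) used as a stand-in hash family. */
    private static int hash(long x, int seed) {
        long z = x + (seed + 1) * 0x9E3779B97F4A7C15L;
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return (int) (z ^ (z >>> 31));
    }
}
```

In a vertex-centric setting, each vertex would hold one such sketch of the nodes it can reach and merge the sketches received from its neighbours in every superstep; the memory per vertex is then fixed by the number of bitmaps rather than by the size of the neighbourhood.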
We were able to run our implementations on small and mid-sized graphs. The largest graph we tested with has 177,147 nodes and 1,977,149,596 edges (it's a Kronecker graph, generated using http://www.cs.cmu.edu/~ukang/dataset/).

We also wanted to run our implementations against this graph: http://law.di.unimi.it/webdata/twitter-2010/, which has a size of 12.5 GB when converted into ASCII. But I'm getting OutOfMemoryError exceptions when using this graph. The exception is thrown from the input format, which suggests that the system is not able to load the full graph into memory.

I'm running it on a 26-node cluster with 208 map tasks; each TaskTracker has a heap of 2 GB, hence we have a total heap space of 416 GB.

I tried to use the out-of-core execution feature of Giraph, because it seems to enable spilling to disk if the system runs out of memory. I enabled it by passing the argument "-ca giraph.useOutOfCoreGraph=true" to the GiraphRunner. (Is this the correct way to enable the feature?)

What can I do to get Giraph running with the Twitter graph?

Regards,
Robert