Hi everyone, I'm working on a community detection algorithm for Giraph, and I'm trying to run it on the Friendster graph, which has about 65M vertices and about 1.8 billion edges. Running on 16 machines, the job takes about 50 GB of RAM per machine before doing ANY processing. That's 800 GB total for this graph, which seems excessive. I'm using the 1.1.0 stable release and Hadoop 0.20.203 (should I use a newer version of Hadoop?). This is the command I'm running:
$HADOOP_HOME/bin/hadoop --config $CONF jar $GIR_JAR org.apache.giraph.GiraphRunner \
  -D 'mapred.child.java.opts=-Xms80G -Xmx80G' \
  -D 'mapred.tasktracker.map.tasks.maximum=1' \
  -D-Xmx60000m \
  -libjars $LIBJARS \
  computation.StartComputation $GIRAPH_OPTIONS \
  -eif org.apache.giraph.io.formats.IntNullReverseTextEdgeInputFormat \
  -eip $INPUT \
  -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
  -op $OUTPUT \
  -w $N_WORKERS \
  -mc computation.WccMasterCompute

where:

  N_THREADS=12 (number of available threads on each machine)
  N_PARTITIONS=3*N_THREADS*N_WORKERS
  GIRAPH_OPTIONS="-ca giraph.useSuperstepCounters=false \
    -ca giraph.numComputeThreads=$N_THREADS \
    -ca giraph.numInputThreads=$N_THREADS \
    -ca giraph.numOutputThreads=$N_THREADS \
    -ca giraph.oneToAllMsgSending=true \
    -ca giraph.metrics.enable=true \
    -ca giraph.maxPartitionsInMemory=$N_THREADS \
    -ca giraph.userPartitionCount=$N_PARTITIONS \
    -ca giraph.outEdgesClass=utils.IntNullHashSetEdges"

I know IntNullHashSetEdges takes a bit more memory than IntNullArrayEdges, but it doesn't make that big a difference. Each vertex value holds 7 ints, a double, three arrays, and a map, but all of these are empty when the graph is loaded, and it still takes that much memory. I feel like I must be doing something wrong, or missing a configuration option, or something. Thanks in advance for any help you might be able to offer.

Best,
Matthew
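One thing I've been wondering about: on a 64-bit JVM even an "empty" collection has a real cost (object header plus fields, roughly 48 bytes for a new HashMap), so with 65M vertices the empty arrays and map in each vertex value could add up to several GB on their own. A minimal sketch of the kind of lazy allocation I've been considering (plain Java, class and field names are made up for illustration; a real Giraph vertex value would also implement org.apache.hadoop.io.Writable and serialize its fields):

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical vertex value that defers allocating its collections
 * until they are actually needed, so vertices that never use them
 * pay only for their primitive fields.
 */
public class WccVertexValue {
    // Primitive fields cost only their own size, no extra objects.
    private int f1, f2, f3, f4, f5, f6, f7;  // the 7 ints
    private double score;

    // Kept null until first use: a freshly constructed HashMap is
    // roughly 48 bytes on a 64-bit JVM, which across 65M vertices
    // is on the order of 3 GB even while logically "empty".
    private int[] neighbors;                 // one of the three arrays
    private Map<Integer, Integer> labels;    // the map

    /** Allocates the map on first access. */
    public Map<Integer, Integer> labels() {
        if (labels == null) {
            labels = new HashMap<>();
        }
        return labels;
    }

    /** True only if the map was allocated and holds entries. */
    public boolean hasLabels() {
        return labels != null && !labels.isEmpty();
    }
}
```

I don't know yet whether this is the dominant cost versus the edge store, but it seemed worth ruling out.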