Hi folks, I've spent the last two months learning, installing, coding, and analyzing the performance of our Giraph app, and I'm able to run it on small inputs on our tiny cluster (yay!). I'm now stuck trying to figure out why larger inputs fail, why only some compute nodes are being used, and generally how to make sure I've configured Hadoop and Giraph to use all available CPUs and RAM. I feel that I'm "this close," and I could really use some pointers.
Below I share our app, configuration, results and log messages, some questions, and counter output from the successful run. This post is long (I've broken it into sections delimited with '===='), but I hope I've provided enough information to get help. I'm happy to add more. Thanks!

==== application ====

Our application is a kind of path search in which every node has a type and a source database ID (e.g., "movie 99"), and searches are expressed as type paths, such as "movie, acted_in, actor", which starts with movies and then finds all actors in each movie, for every movie in the database. The program does a kind of filtering by keeping track of previously-processed initial IDs. Our database is a small movie one with 2K movies, 6K users (people who rate movies), and 80K ratings of movies by users. Though small, we've found this kind of search can produce a massive explosion of messages, as was well put by Rob Vesse ( http://mail-archives.apache.org/mod_mbox/giraph-user/201312.mbox/%3ccec4a409.2d7ad%25rve...@dotnetrdf.org%3E ):

> even with this relatively small graph you get a massive explosion of messages
> by the later super steps which exhausts memory (in my graph the theoretical
> maximum messages by the last super step was ~3 billion)
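To make the fan-out concrete, here is a minimal sketch of the traversal pattern. It is NOT our actual relpath.RelPathVertex: the class name is hypothetical, the initial-ID filtering is left out, I've pretended every path element names a vertex type (our real paths alternate node and relationship types like "rates"), and the Vertex package path may differ by Giraph build. But it shows why each superstep multiplies the message count by the average degree of the matching vertices:

import java.io.IOException;
import org.apache.giraph.graph.Vertex;  // package path per our Giraph 1.0 build; may differ
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;

// Hypothetical, simplified stand-in for relpath.RelPathVertex.
// Vertex value holds the node's "type id" string, e.g. "movie 99";
// messages carry the initial vertex's ID so results can be grouped.
public class TypePathVertex extends Vertex<Text, Text, NullWritable, Text> {
  @Override
  public void compute(Iterable<Text> messages) throws IOException {
    // set on the command line via -ca RelPathVertex.path=...
    String[] path = getConf().get("RelPathVertex.path").split(",\\s*");
    int step = (int) getSuperstep();
    String myType = getValue().toString().split(" ")[0];
    if (step == 0 && myType.equals(path[0])) {
      sendMessageToAllEdges(getId());              // first fan-out
    } else if (step > 0 && step < path.length && myType.equals(path[step])) {
      for (Text initialId : messages) {
        // copy defensively in case Giraph reuses message objects,
        // then fan out again: this is where the explosion happens
        sendMessageToAllEdges(new Text(initialId));
      }
    }
    voteToHalt();  // re-activated by incoming messages next superstep
  }
}

Every matching vertex re-sends each incoming ID to all of its neighbors, so the per-superstep message count grows roughly as (average degree)^step, which seems consistent with the ~181M sent messages I report in the failure section below.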
==== job failure and error messages ====

Currently I have a four-step path ("rates, movie, rates, user" - counter output shown at bottom) that completes in ~20 seconds, but a five-step one ("rates, movie, rates, user, rates") fails after a few minutes. I've looked carefully at the task logs, but I find it difficult to discern what the actual failure was. However, watching system information (e.g., top and Ganglia) during the run suggests the hosts are running out of memory. There are no OutOfMemoryErrors in the logs, and only this one message stands out:

> ERROR org.apache.giraph.master.BspServiceMaster: superstepChosenWorkerAlive:
> Missing chosen worker Worker(hostname=compute-0-3.wright, MRtaskID=1,
> port=30001) on superstep 4

NB: So far I've been ignoring these other types of messages:

> FATAL org.apache.giraph.master.BspServiceMaster: getLastGoodCheckpoint: No
> last good checkpoints can be found, killing the job.
> java.io.FileNotFoundException: File _bsp/_checkpoints/job_201409191450_0003
> does not exist.

> WARN org.apache.giraph.bsp.BspService: process: Unknown and unprocessed event
> (path=/_hadoopBsp/job_201409191450_0003/_applicationAttemptsDir/0/_superstepDir/2/_superstepFinished,
> type=NodeDeleted, state=SyncConnected)

> ERROR org.apache.giraph.worker.BspServiceWorker: unregisterHealth: Got
> failure, unregistering health on
> /_hadoopBsp/job_201409191450_0003/_applicationAttemptsDir/0/_superstepDir/4/_workerHealthyDir/compute-0-3.wright_1
> on superstep 4

The counter statistics are minimal after the run fails, but during the run I see something like this when refreshing the JobTracker web UI:

> Counters > Map-Reduce Framework > Physical memory (bytes) snapshot: ~28GB
> Counters > Map-Reduce Framework > Virtual memory (bytes) snapshot: ~27GB
> Counters > Giraph Stats > Sent messages: ~181M

==== hadoop/giraph command ====

hadoop jar $GIRAPH_HOME/giraph-ex.jar org.apache.giraph.GiraphRunner \
 -Dgiraph.zkList=xx.xx.xx.edu:2181 \
 -libjars ${LIBJARS} \
 relpath.RelPathVertex \
 -wc relpath.RelPathWorkerContext \
 -mc relpath.RelPathMasterCompute \
 -vif relpath.CausalityJsonAdjacencyListVertexInputFormat \
 -vip $REL_PATH_INPUT \
 -of relpath.RelPathVertexValueTextOutputFormat \
 -op $REL_PATH_OUTPUT \
 -ca RelPathVertex.path=$REL_PATH_PATH \
 -w 8

==== cluster, versions, and configuration ====

We have a five-node cluster with a head node and four compute nodes. The head has 2 CPUs with 16 cores each and 64 GB RAM; each compute node has 1 CPU with 4 cores and 16 GB RAM. That makes a cluster total of 128 GB of RAM and 48 cores.

Hadoop: Cloudera CDH4, with a MapReduce service running the JobTracker on the head node and TaskTrackers on all five nodes.

Hadoop configuration (a mix from mapred-site.xml and the CDH interface - sorry), and I'm not sure I'm listing all the ones of interest:

> mapreduce.job.counters.max: 120
> mapred.output.compress: false
> mapred.reduce.tasks: 12
> mapred.child.java.opts: -Xmx2147483648
> mapred.job.reuse.jvm.num.tasks: 1
> MapReduce Child Java Maximum Heap Size: 2 GB
> I/O Sort Memory Buffer (io.sort.mb): 512 MB
> Client Java Heap Size in Bytes: 256 MB
> Java Heap Size of Jobtracker in Bytes: 1 GB
> Java Heap Size of TaskTracker in Bytes: 1 GB

Cluster summary from the JobTracker web UI:

> Heap Size is 972.69 MB/989.88 MB
> Map Task Capacity: 48
> Reduce Task Capacity: 24
> Avg. Tasks/Node: 14.40

Giraph: compiled as "giraph-1.0.0-for-hadoop-2.0.0-alpha"; the CHANGELOG says "Release 1.0.0 - 2013-04-15".

==== questions ====

o How can I verify that the failure is actually one of memory? I've looked fairly carefully at the logs.

o I noticed that not all hosts are being used. I did three runs, two with 8 workers and one with 12, and I pulled the following from the task logs ('h' = head node, 0-3 = compute nodes):

> run #1: 0, 2, 3, h, h, h, h, h, h
> run #2: 2, 1, 3, h, h, h, h, h, h
> run #3: 3, 3, h, h, h, h, h, h, h, h, h, 1, 1

Note that in each run at least one compute node isn't listed at all.

o What's a good number of workers to use?

o What Hadoop parameters should I tweak? Candidates I've found:

> mapred.job.map.memory.mb=xx
> mapred.map.child.java.opts=xx
> mapred.{map|reduce}.child.ulimit
> mapred.task.profile
> # map slots for each TaskTracker
> number of partitions you keep in memory

o What Giraph parameters should I tweak? I'm currently using defaults for all, but I found these possibilities (a sketch of how I'd combine them with the command above follows this list):

> giraph.useOutOfCoreGraph=true
> giraph.maxPartitionsInMemory=N (default: 10)
> giraph.isStaticGraph=true
> giraph.useOutOfCoreMessages=true (default: disabled)
> giraph.maxMessagesInMemory=N (default: 1000000)
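For concreteness, here's the kind of invocation I'm imagining trying (untested; the giraph.* names are the ones listed just above, and the heap and in-memory limits are guesses for our 16 GB, 4-core compute nodes - please correct them):

hadoop jar $GIRAPH_HOME/giraph-ex.jar org.apache.giraph.GiraphRunner \
 -Dgiraph.zkList=xx.xx.xx.edu:2181 \
 -Dmapred.child.java.opts=-Xmx3g \
 -Dgiraph.useOutOfCoreGraph=true \
 -Dgiraph.maxPartitionsInMemory=4 \
 -Dgiraph.useOutOfCoreMessages=true \
 -Dgiraph.maxMessagesInMemory=250000 \
 -libjars ${LIBJARS} \
 relpath.RelPathVertex \
 ... (remaining arguments as in the command above)

My understanding is that the -D flags must precede the positional arguments (the existing -Dgiraph.zkList suggests GiraphRunner goes through ToolRunner), and that mapred.child.java.opts may be overridden by the CDH-managed "MapReduce Child Java Maximum Heap Size" setting above, so that flag in particular is a guess. If the out-of-core settings work, they would also help answer my first question: with spilling enabled, the five-step path should at least make progress if memory is the real problem.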
o How can I get a feel for how much more processing and memory might be needed to finish the job, beyond knowing that it's on the last superstep? For example, of the ~181M sent messages I see during the run, how many more might be left?

o Why is the Heap Size from the cluster summary above (972.69 MB/989.88 MB) so low?

Thanks again!

==== counters from successful four-step run ====

INFO mapred.JobClient: Job complete: job_201409191450_0001
INFO mapred.JobClient: Counters: 39
INFO mapred.JobClient:   File System Counters
INFO mapred.JobClient:     FILE: Number of bytes read=0
INFO mapred.JobClient:     FILE: Number of bytes written=1694975
INFO mapred.JobClient:     FILE: Number of read operations=0
INFO mapred.JobClient:     FILE: Number of large read operations=0
INFO mapred.JobClient:     FILE: Number of write operations=0
INFO mapred.JobClient:     HDFS: Number of bytes read=10016293
INFO mapred.JobClient:     HDFS: Number of bytes written=113612773
INFO mapred.JobClient:     HDFS: Number of read operations=12
INFO mapred.JobClient:     HDFS: Number of large read operations=0
INFO mapred.JobClient:     HDFS: Number of write operations=9
INFO mapred.JobClient:   Job Counters
INFO mapred.JobClient:     Launched map tasks=9
INFO mapred.JobClient:     Total time spent by all maps in occupied slots (ms)=206659
INFO mapred.JobClient:     Total time spent by all reduces in occupied slots (ms)=0
INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
INFO mapred.JobClient:   Map-Reduce Framework
INFO mapred.JobClient:     Map input records=9
INFO mapred.JobClient:     Map output records=0
INFO mapred.JobClient:     Input split bytes=396
INFO mapred.JobClient:     Spilled Records=0
INFO mapred.JobClient:     CPU time spent (ms)=243280
INFO mapred.JobClient:     Physical memory (bytes) snapshot=9947144192
INFO mapred.JobClient:     Virtual memory (bytes) snapshot=25884065792
INFO mapred.JobClient:     Total committed heap usage (bytes)=10392305664
INFO mapred.JobClient:   Giraph Stats
INFO mapred.JobClient:     Aggregate edges=402428
INFO mapred.JobClient:     Aggregate finished vertices=119141
INFO mapred.JobClient:     Aggregate vertices=119141
INFO mapred.JobClient:     Current master task partition=0
INFO mapred.JobClient:     Current workers=8
INFO mapred.JobClient:     Last checkpointed superstep=0
INFO mapred.JobClient:     Sent messages=0
INFO mapred.JobClient:     Superstep=4
INFO mapred.JobClient:   Giraph Timers
INFO mapred.JobClient:     Input superstep (milliseconds)=1689
INFO mapred.JobClient:     Setup (milliseconds)=3977
INFO mapred.JobClient:     Shutdown (milliseconds)=1177
INFO mapred.JobClient:     Superstep 0 (milliseconds)=834
INFO mapred.JobClient:     Superstep 1 (milliseconds)=1836
INFO mapred.JobClient:     Superstep 2 (milliseconds)=2524
INFO mapred.JobClient:     Superstep 3 (milliseconds)=8284
INFO mapred.JobClient:     Total (milliseconds)=20322

==== EOF ====

--
Matthew Cornell | m...@matthewcornell.org | 413-626-3621 | 34 Dickinson Street, Amherst MA 01002 | matthewcornell.org