Hi Matthew, I answered a few of your questions in-line (unfortunately they might not help the larger problem, but hopefully they'll help a bit).
Best,
Matthew

On Mon, Sep 22, 2014 at 5:50 PM, Matthew Cornell <m...@matthewcornell.org> wrote:
> Hi Folks,
>
> I've spent the last two months learning, installing, coding, and
> analyzing the performance of our Giraph app, and I'm able to run on
> small inputs on our tiny cluster (yay!). I am now stuck trying to
> figure out why larger inputs fail, why only some compute nodes are
> being used, and generally how to make sure I've configured hadoop and
> giraph to use all available CPUs and RAM. I feel that I'm "this
> close," and I could really use some pointers.
>
> Below I share our app, configuration, results and log messages, some
> questions, and counter output for the successful run. My post here is
> long (I've broken it into sections delimited with '===='), but I hope
> I've provided good enough information to get help on. I'm happy to add
> to it.
>
> Thanks!
>
>
> ==== application ====
>
> Our application is a kind of path search where all nodes have a type
> and source database ID (e.g., "movie 99"), and searches are expressed
> as type paths, such as "movie, acted_in, actor", which would start
> with movies and then find all actors in each movie, for all movies in
> the database. The program does a kind of filtering by keeping track of
> previously-processed initial IDs.
>
> Our database is a small movie one with 2K movies, 6K users (people who
> rate movies), and 80K ratings of movies by users. Though small, we've
> found this kind of search can result in a massive explosion of
> messages, as was well put by Rob Vesse (
> http://mail-archives.apache.org/mod_mbox/giraph-user/201312.mbox/%3ccec4a409.2d7ad%25rve...@dotnetrdf.org%3E
> ):
>
> > even with this relatively small graph you get a massive explosion of
> > messages by the later super steps which exhausts memory (in my graph the
> > theoretical maximum messages by the last super step was ~3 billion)
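*A back-of-envelope note on that explosion, in case it's useful: each step in the path multiplies message volume by roughly the average branching factor at that step. With your 80K ratings spread over 2K movies (~40 ratings per movie) and over 6K users (~13 ratings per user), each step you add can multiply messages by one to two orders of magnitude, which fits with four steps completing while five steps exhaust memory. Rough numbers only, of course.*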
>
> ==== job failure and error messages ====
>
> Currently I have a four-step path that completes in ~20 seconds
> ("rates, movie, rates, user" - counter output shown at bottom) but a
> five-step one ("rates, movie, rates, user, rates") fails after a few
> minutes. I've looked carefully at the task logs, but I find it a
> little difficult to discern what the actual failure was. However,
> looking at system information (e.g., top and ganglia) during the run
> indicates hosts are running out of memory. There are no
> OutOfMemoryErrors in the logs, and only this one stands out:
>
> ERROR org.apache.giraph.master.BspServiceMaster:
> superstepChosenWorkerAlive: Missing chosen worker
> Worker(hostname=compute-0-3.wright, MRtaskID=1, port=30001) on superstep 4
>
> NB: So far I've been ignoring these other types of messages:
>
> FATAL org.apache.giraph.master.BspServiceMaster: getLastGoodCheckpoint:
> No last good checkpoints can be found, killing the job.
>
> java.io.FileNotFoundException: File
> _bsp/_checkpoints/job_201409191450_0003 does not exist.
>
> WARN org.apache.giraph.bsp.BspService: process: Unknown and unprocessed
> event
> (path=/_hadoopBsp/job_201409191450_0003/_applicationAttemptsDir/0/_superstepDir/2/_superstepFinished,
> type=NodeDeleted, state=SyncConnected)
>
> ERROR org.apache.giraph.worker.BspServiceWorker: unregisterHealth: Got
> failure, unregistering health on
> /_hadoopBsp/job_201409191450_0003/_applicationAttemptsDir/0/_superstepDir/4/_workerHealthyDir/compute-0-3.wright_1
> on superstep 4
>
> The counter statistics are minimal after the run fails, but during it
> I see something like this when refreshing the Job Tracker Web UI:
>
> Counters > Map-Reduce Framework > Physical memory (bytes) snapshot: ~28GB
> Counters > Map-Reduce Framework > Virtual memory (bytes) snapshot: ~27GB
> Counters > Giraph Stats > Sent messages: ~181M
>
>
> ==== hadoop/giraph command ====
>
> hadoop jar $GIRAPH_HOME/giraph-ex.jar org.apache.giraph.GiraphRunner \
>   -Dgiraph.zkList=xx.xx.xx.edu:2181 \
>   -libjars ${LIBJARS} \
>   relpath.RelPathVertex \
>   -wc relpath.RelPathWorkerContext \
>   -mc relpath.RelPathMasterCompute \
>   -vif relpath.CausalityJsonAdjacencyListVertexInputFormat \
>   -vip $REL_PATH_INPUT \
>   -of relpath.RelPathVertexValueTextOutputFormat \
>   -op $REL_PATH_OUTPUT \
>   -ca RelPathVertex.path=$REL_PATH_PATH \
>   -w 8
>
>
> ==== cluster, versions, and configuration ====
>
> We have a five-node cluster with a head and four compute nodes. The
> head has 2 CPUs with 16 cores each and 64 GB RAM. Each compute node
> has 1 CPU with 4 cores and 16 GB RAM, making a total cluster of
> 128 GB of RAM and 48 cores.
>
> Hadoop: Cloudera CDH4 with a mapreduce service running the job tracker
> on the head node, and task trackers on all five nodes.
>
> Hadoop configuration (mapred-site.xml and the CDH interface - sorry for
> the mix) - not sure I'm listing all of the ones of interest:
>
> mapreduce.job.counters.max: 120
> mapred.output.compress: false
> mapred.reduce.tasks: 12
> mapred.child.java.opts: -Xmx2147483648
> mapred.job.reuse.jvm.num.tasks: 1
> MapReduce Child Java Maximum Heap Size: 2 GB
> I/O Sort Memory Buffer (io.sort.mb): 512 MB
> Client Java Heap Size in Bytes: 256 MB
> Java Heap Size of Jobtracker in Bytes: 1 GB
> Java Heap Size of TaskTracker in Bytes: 1 GB
>
> Cluster summary from the Job Tracker Web UI:
>
> Heap Size is 972.69 MB/989.88 MB
> Map Task Capacity: 48
> Reduce Task Capacity: 24
> Avg. Tasks/Node: 14.40
>
> Giraph: Compiled as "giraph-1.0.0-for-hadoop-2.0.0-alpha", CHANGELOG:
> Release 1.0.0 - 2013-04-15
>
>
> ==== questions ====
>
> o How can I verify that the failure is actually one of memory? I've
> looked fairly carefully at the logs.

*In the logs for the workers, do you have a line that looks like:*

*2014-09-21 18:12:13,021 INFO org.apache.giraph.worker.BspServiceWorker: finishSuperstep: Waiting on all requests, superstep 93 Memory (free/total/max) = 21951.08M / 36456.50M / 43691.00M*

*Looking at the memory usage in the worker that fails, at the end of the superstep before the failure, could give you a clue. If you don't have that line, you can call org.apache.giraph.utils.MemoryUtils.getRuntimeMemoryStats() directly in the preSuperstep() method of your worker context class (or anywhere else that runs only once per worker per superstep) to print out that info.*
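*A minimal sketch of that, assuming the Giraph 1.0 WorkerContext API; the class below is just a stand-in for your actual relpath.RelPathWorkerContext, so merge the preSuperstep() body into whatever you already have there:*

import org.apache.giraph.utils.MemoryUtils;
import org.apache.giraph.worker.WorkerContext;

// Sketch assuming Giraph 1.0.0 APIs; stand-in for relpath.RelPathWorkerContext.
public class RelPathWorkerContext extends WorkerContext {

  @Override
  public void preApplication() throws InstantiationException,
      IllegalAccessException {
  }

  @Override
  public void preSuperstep() {
    // Runs once per worker per superstep: log heap usage so you can see
    // which superstep (and which worker) is running low on memory.
    System.out.println("superstep " + getSuperstep() + ": " +
        MemoryUtils.getRuntimeMemoryStats());
  }

  @Override
  public void postSuperstep() {
  }

  @Override
  public void postApplication() {
  }
}

*Each worker's output ends up in its map task's stdout log, so you can compare workers superstep by superstep.*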
> o I noticed that not all hosts are being used. I did three runs, two
> with 8 workers and one with 12, and I pulled the following from the
> task logs ('h' = head node, 0-3 = compute nodes):
>
> run #1: 0, 2, 3, h, h, h, h, h, h
> run #2: 2, 1, 3, h, h, h, h, h, h
> run #3: 3, 3, h, h, h, h, h, h, h, h, h, 1, 1
>
> Note that there's at least one compute node that isn't listed for each run.
>
> o What's a good # of workers to use?

*You want the # of workers to be equal to the # of worker machines, so with 5 machines you'd want -w 4 (the remaining map task runs the master). See this thread <https://www.mail-archive.com/user@giraph.apache.org/msg01570.html>. Quote: "Basically, better usage of resources: one single JVM, no duplication of core data structures, less netty threads and communication points, more locality (less messages over the network), less actors accessing zookeeper etc."*

*Also from that thread, you should set mapred.tasktracker.map.tasks.maximum=1, since each worker is a map task and you don't want to run multiple workers on the same machine.*

> o What Hadoop parameters should I tweak?
>
> mapred.job.map.memory.mb=xx
> mapred.map.child.java.opts=xx
> mapred.{map|reduce}.child.ulimit
> mapred.task.profile
> # map slots for each TaskTracker
> number of partitions you keep in memory
>
> o What Giraph parameters should I tweak? I'm currently using defaults
> for all, but I found these possibilities:
>
> giraph.useOutOfCoreGraph=true
> giraph.maxPartitionsInMemory=N (default: 10)
> giraph.isStaticGraph=true
> giraph.useOutOfCoreMessages=true (default: disabled)
> giraph.maxMessagesInMemory=N (default: 1000000)

*giraph.numComputeThreads, giraph.numInputThreads, and giraph.numOutputThreads should be set to the number of cores you have available per worker, or potentially n - 1, since as Claudio mentions in the email chain I linked to, Giraph is also doing some additional work in the background. By default these are all set to 1, so you should change them to increase parallelism.*

*Depending on what you're doing with your edges (adding and removing a lot of them or leaving them alone), you can check out giraph.outEdgesClass and set it to one of the subclasses here <https://giraph.apache.org/apidocs/org/apache/giraph/edge/OutEdges.html>. For example, if you're adding and removing a lot of edges, you could try one of the HashMap/Set-backed classes, though these take a bit more memory (I'm basing that on the class descriptions).*

*giraph.metrics.enable=true prints metrics output in the stderr of each worker and the master, like time spent in communication vs. computation, message bytes sent, etc.*

*giraph.oneToAllMessageSending=true: if you're doing sendMessageToAllEdges, this (supposedly) activates some sort of optimization for it. I read that somewhere on the listserv and I feel like I looked at it in the code, but take this advice with a grain of salt.*

*giraph.userPartitionCount: chooses the number of partitions. The default is the # of workers squared. In your case, if you use 4 workers that gives you 16 partitions, 4 per worker, which means that with numComputeThreads=4 you'll have one partition to process per thread. You might want to consider a higher number of partitions (maybe 3 per thread, or more), since partitions are processed in parallel during vertex computation. When partitions are finer-grained, if one partition eats a bunch of time the others can keep getting processed by the remaining threads. An example command with these settings is sketched below.*
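*Putting those together, the command might look something like this. This is just a sketch: the thread and partition counts are guesses to tune for your cluster (4 workers x 4 compute threads x 3 partitions per thread = 48 partitions), and I've kept your classes and shell variables as-is:*

hadoop jar $GIRAPH_HOME/giraph-ex.jar org.apache.giraph.GiraphRunner \
  -Dgiraph.zkList=xx.xx.xx.edu:2181 \
  -Dgiraph.numComputeThreads=4 \
  -Dgiraph.numInputThreads=4 \
  -Dgiraph.numOutputThreads=4 \
  -Dgiraph.userPartitionCount=48 \
  -Dgiraph.metrics.enable=true \
  -libjars ${LIBJARS} \
  relpath.RelPathVertex \
  -wc relpath.RelPathWorkerContext \
  -mc relpath.RelPathMasterCompute \
  -vif relpath.CausalityJsonAdjacencyListVertexInputFormat \
  -vip $REL_PATH_INPUT \
  -of relpath.RelPathVertexValueTextOutputFormat \
  -op $REL_PATH_OUTPUT \
  -ca RelPathVertex.path=$REL_PATH_PATH \
  -w 4

*Note that mapred.tasktracker.map.tasks.maximum=1 is a task tracker setting, so it goes in each node's mapred-site.xml (or the CDH interface) rather than on the command line.*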
> o How can I get a feel for how much more processing and memory might
> be needed to finish the job, beyond the fact that it's on the last
> superstep? For example, of the ~181M sent messages I see during the
> run, how many more might be left?
>
> o Why is the Heap Size from the Cluster summary above (972.69
> MB/989.88 MB) so low?
>
> Thanks again!
>
>
> ==== counters from successful four-step run ====
>
> INFO mapred.JobClient: Job complete: job_201409191450_0001
> INFO mapred.JobClient: Counters: 39
> INFO mapred.JobClient:   File System Counters
> INFO mapred.JobClient:     FILE: Number of bytes read=0
> INFO mapred.JobClient:     FILE: Number of bytes written=1694975
> INFO mapred.JobClient:     FILE: Number of read operations=0
> INFO mapred.JobClient:     FILE: Number of large read operations=0
> INFO mapred.JobClient:     FILE: Number of write operations=0
> INFO mapred.JobClient:     HDFS: Number of bytes read=10016293
> INFO mapred.JobClient:     HDFS: Number of bytes written=113612773
> INFO mapred.JobClient:     HDFS: Number of read operations=12
> INFO mapred.JobClient:     HDFS: Number of large read operations=0
> INFO mapred.JobClient:     HDFS: Number of write operations=9
> INFO mapred.JobClient:   Job Counters
> INFO mapred.JobClient:     Launched map tasks=9
> INFO mapred.JobClient:     Total time spent by all maps in occupied slots (ms)=206659
> INFO mapred.JobClient:     Total time spent by all reduces in occupied slots (ms)=0
> INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
> INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
> INFO mapred.JobClient:   Map-Reduce Framework
> INFO mapred.JobClient:     Map input records=9
> INFO mapred.JobClient:     Map output records=0
> INFO mapred.JobClient:     Input split bytes=396
> INFO mapred.JobClient:     Spilled Records=0
> INFO mapred.JobClient:     CPU time spent (ms)=243280
> INFO mapred.JobClient:     Physical memory (bytes) snapshot=9947144192
> INFO mapred.JobClient:     Virtual memory (bytes) snapshot=25884065792
> INFO mapred.JobClient:     Total committed heap usage (bytes)=10392305664
> INFO mapred.JobClient:   Giraph Stats
> INFO mapred.JobClient:     Aggregate edges=402428
> INFO mapred.JobClient:     Aggregate finished vertices=119141
> INFO mapred.JobClient:     Aggregate vertices=119141
> INFO mapred.JobClient:     Current master task partition=0
> INFO mapred.JobClient:     Current workers=8
> INFO mapred.JobClient:     Last checkpointed superstep=0
> INFO mapred.JobClient:     Sent messages=0
> INFO mapred.JobClient:     Superstep=4
> INFO mapred.JobClient:   Giraph Timers
> INFO mapred.JobClient:     Input superstep (milliseconds)=1689
> INFO mapred.JobClient:     Setup (milliseconds)=3977
> INFO mapred.JobClient:     Shutdown (milliseconds)=1177
> INFO mapred.JobClient:     Superstep 0 (milliseconds)=834
> INFO mapred.JobClient:     Superstep 1 (milliseconds)=1836
> INFO mapred.JobClient:     Superstep 2 (milliseconds)=2524
> INFO mapred.JobClient:     Superstep 3 (milliseconds)=8284
> INFO mapred.JobClient:     Total (milliseconds)=20322
>
> ==== EOF ====
>
>
> --
> Matthew Cornell | m...@matthewcornell.org | 413-626-3621 | 34
> Dickinson Street, Amherst MA 01002 | matthewcornell.org