Sorry, that should be "org.apache.giraph.utils.MemoryUtils.getRuntimeMemoryStats()"; I left out the "giraph".
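For anyone without the Giraph source handy: that utility just formats heap figures read from java.lang.Runtime. A minimal self-contained approximation is below (the class name MemoryStats is mine, not Giraph's, and the exact output format is a guess based on the log line quoted later in this thread), so you can print the same numbers from preSuperstep() or anywhere else without tracking down the Giraph class:

```java
// Approximates what Giraph's MemoryUtils.getRuntimeMemoryStats() reports:
// free/total/max heap in megabytes, read from java.lang.Runtime.
public class MemoryStats {
    static String runtimeMemoryStats() {
        Runtime rt = Runtime.getRuntime();
        double mb = 1024.0 * 1024.0;
        return String.format("Memory (free/total/max) = %.2fM / %.2fM / %.2fM",
                rt.freeMemory() / mb, rt.totalMemory() / mb, rt.maxMemory() / mb);
    }

    public static void main(String[] args) {
        // e.g. log this once per superstep from your computation's preSuperstep()
        System.out.println(runtimeMemoryStats());
    }
}
```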
On Mon, Sep 22, 2014 at 8:10 PM, Matthew Saltz <sal...@gmail.com> wrote:

> Hi Matthew,
>
> I answered a few of your questions in-line (unfortunately they might not help the larger problem, but hopefully it'll help a bit).
>
> Best,
> Matthew
>
> On Mon, Sep 22, 2014 at 5:50 PM, Matthew Cornell <m...@matthewcornell.org> wrote:
>
>> Hi Folks,
>>
>> I've spent the last two months learning, installing, coding, and analyzing the performance of our Giraph app, and I'm able to run on small inputs on our tiny cluster (yay!). I am now stuck trying to figure out why larger inputs fail, why only some compute nodes are being used, and generally how to make sure I've configured Hadoop and Giraph to use all available CPUs and RAM. I feel that I'm "this close," and I could really use some pointers.
>>
>> Below I share our app, configuration, results and log messages, some questions, and counter output for the successful run. My post here is long (I've broken it into sections delimited with '===='), but I hope I've provided enough information to get help. I'm happy to add to it.
>>
>> Thanks!
>>
>>
>> ==== application ====
>>
>> Our application is a kind of path search where all nodes have a type and source database ID (e.g., "movie 99"), and searches are expressed as type paths, such as "movie, acted_in, actor", which would start with movies and then find all actors in each movie, for all movies in the database. The program does a kind of filtering by keeping track of previously-processed initial IDs.
>>
>> Our database is a small movie one with 2K movies, 6K users (people who rate movies), and 80K ratings of movies by users.
>> Though small, we've found this kind of search can result in a massive explosion of messages, as was well put by Rob Vesse (http://mail-archives.apache.org/mod_mbox/giraph-user/201312.mbox/%3ccec4a409.2d7ad%25rve...@dotnetrdf.org%3E):
>>
>> > even with this relatively small graph you get a massive explosion of messages by the later super steps which exhausts memory (in my graph the theoretical maximum messages by the last super step was ~3 billion)
>>
>>
>> ==== job failure and error messages ====
>>
>> Currently I have a four-step path that completes in ~20 seconds ("rates, movie, rates, user" - counter output shown at bottom) but a five-step one ("rates, movie, rates, user, rates") fails after a few minutes. I've looked carefully at the task logs, but I find it a little difficult to discern what the actual failure was. However, looking at system information (e.g., top and ganglia) during the run indicates hosts are running out of memory. There are no OutOfMemoryErrors in the logs, and only this one message stands out:
>>
>> > ERROR org.apache.giraph.master.BspServiceMaster: superstepChosenWorkerAlive: Missing chosen worker Worker(hostname=compute-0-3.wright, MRtaskID=1, port=30001) on superstep 4
>>
>> NB: So far I've been ignoring these other types of messages:
>>
>> > FATAL org.apache.giraph.master.BspServiceMaster: getLastGoodCheckpoint: No last good checkpoints can be found, killing the job.
>>
>> > java.io.FileNotFoundException: File _bsp/_checkpoints/job_201409191450_0003 does not exist.
>> > WARN org.apache.giraph.bsp.BspService: process: Unknown and unprocessed event (path=/_hadoopBsp/job_201409191450_0003/_applicationAttemptsDir/0/_superstepDir/2/_superstepFinished, type=NodeDeleted, state=SyncConnected)
>>
>> > ERROR org.apache.giraph.worker.BspServiceWorker: unregisterHealth: Got failure, unregistering health on /_hadoopBsp/job_201409191450_0003/_applicationAttemptsDir/0/_superstepDir/4/_workerHealthyDir/compute-0-3.wright_1 on superstep 4
>>
>> The counter statistics are minimal after the run fails, but during it I see something like this when refreshing the Job Tracker Web UI:
>>
>> > Counters > Map-Reduce Framework > Physical memory (bytes) snapshot: ~28GB
>> > Counters > Map-Reduce Framework > Virtual memory (bytes) snapshot: ~27GB
>> > Counters > Giraph Stats > Sent messages: ~181M
>>
>>
>> ==== hadoop/giraph command ====
>>
>> hadoop jar $GIRAPH_HOME/giraph-ex.jar org.apache.giraph.GiraphRunner \
>>   -Dgiraph.zkList=xx.xx.xx.edu:2181 \
>>   -libjars ${LIBJARS} \
>>   relpath.RelPathVertex \
>>   -wc relpath.RelPathWorkerContext \
>>   -mc relpath.RelPathMasterCompute \
>>   -vif relpath.CausalityJsonAdjacencyListVertexInputFormat \
>>   -vip $REL_PATH_INPUT \
>>   -of relpath.RelPathVertexValueTextOutputFormat \
>>   -op $REL_PATH_OUTPUT \
>>   -ca RelPathVertex.path=$REL_PATH_PATH \
>>   -w 8
>>
>>
>> ==== cluster, versions, and configuration ====
>>
>> We have a five-node cluster with a head node and four compute nodes. The head has 2 CPUs with 16 cores each and 64 GB RAM. Each compute node has 1 CPU with 4 cores and 16 GB RAM, for a cluster total of 128 GB of RAM and 48 cores.
>>
>> Hadoop: Cloudera CDH4 with a MapReduce service running the job tracker on the head node, and task trackers on all five nodes.
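A back-of-envelope check relates this hardware to the job as submitted: 8 workers at the 2 GB child heap listed in the configuration below demand 16 GB of heap in total, which equals the entire RAM of a single compute node, so any node that gets several workers scheduled onto it will be squeezed. This is editorial arithmetic using only figures from the post, not a measurement:

```java
// Worst-case heap demand of the job vs. the RAM of one compute node.
// Figures from this thread: -w 8 workers, -Xmx2147483648 (2 GB) child heap,
// 16 GB RAM per compute node.
public class HeapBudget {
    public static void main(String[] args) {
        final int workers = 8;              // -w 8
        final double childHeapGb = 2.0;     // mapred.child.java.opts
        final double computeNodeRamGb = 16.0;

        double totalWorkerHeapGb = workers * childHeapGb;
        long workersPerNodeBudget = (long) (computeNodeRamGb / childHeapGb);

        System.out.println("Total worker heap demanded: " + totalWorkerHeapGb + " GB");
        // Heap alone; the 1 GB TaskTracker heap, OS, and message buffers
        // shrink this further in practice.
        System.out.println("2 GB workers fitting in one compute node: " + workersPerNodeBudget);
    }
}
```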
>> Hadoop configuration (mapred-site.xml and the CDH interface - sorry for the mix) - I'm not sure I've listed all of the ones of interest:
>>
>> > mapreduce.job.counters.max: 120
>> > mapred.output.compress: false
>> > mapred.reduce.tasks: 12
>> > mapred.child.java.opts: -Xmx2147483648
>> > mapred.job.reuse.jvm.num.tasks: 1
>> > MapReduce Child Java Maximum Heap Size: 2 GB
>> > I/O Sort Memory Buffer (io.sort.mb): 512 MB
>> > Client Java Heap Size in Bytes: 256 MB
>> > Java Heap Size of Jobtracker in Bytes: 1 GB
>> > Java Heap Size of TaskTracker in Bytes: 1 GB
>>
>> Cluster summary from the Job Tracker Web UI:
>>
>> > Heap Size is 972.69 MB/989.88 MB
>> > Map Task Capacity: 48
>> > Reduce Task Capacity: 24
>> > Avg. Tasks/Node: 14.40
>>
>> Giraph: Compiled as "giraph-1.0.0-for-hadoop-2.0.0-alpha", CHANGELOG: Release 1.0.0 - 2013-04-15
>>
>>
>> ==== questions ====
>>
>> o How can I verify that the failure is actually one of memory? I've looked fairly carefully at the logs.
>
> In the logs for the workers, do you have a line that looks like:
>
> 2014-09-21 18:12:13,021 INFO org.apache.giraph.worker.BspServiceWorker: finishSuperstep: Waiting on all requests, superstep 93 Memory (free/total/max) = 21951.08M / 36456.50M / 43691.00M
>
> Looking at the memory usage in the worker that fails, at the end of the superstep before the failure, could give you a clue. If you don't have that line then you can directly use org.apache.utils.MemoryUtils.getRuntimeMemoryStats() in the preSuperstep() method of your computation class (or anywhere that will only run once on the worker) to print out that info.
>
>> o I noticed that not all hosts are being used.
>> I did three runs, two with 8 workers and one with 12, and I pulled the following from the task logs ('h' = head node, 0-3 = compute nodes):
>>
>> > run #1: 0, 2, 3, h, h, h, h, h, h
>> > run #2: 2, 1, 3, h, h, h, h, h, h
>> > run #3: 3, 3, h, h, h, h, h, h, h, h, h, 1, 1
>>
>> Note that there's at least one compute node that isn't listed for each run.
>>
>> o What's a good # of workers to use?
>
> You want the # of workers to be equal to the # of worker machines. So with 5 machines you'd want -w 4. See this thread <https://www.mail-archive.com/user@giraph.apache.org/msg01570.html>. Quote: "Basically, better usage of resources: one single JVM, no duplication of core data structures, less netty threads and communication points, more locality (less messages over the network), less actors accessing zookeeper etc." Also from that thread, you should set mapred.tasktracker.map.tasks.maximum=1, since each worker is a map task and you don't want to run multiple workers on the same machine.
>
>> o What Hadoop parameters should I tweak?
>>
>> > mapred.job.map.memory.mb=xx
>> > mapred.map.child.java.opts=xx
>> > mapred.{map|reduce}.child.ulimit
>> > mapred.task.profile
>> > # map slots for each TaskTracker
>> > number of partitions you keep in memory
>>
>> o What Giraph parameters should I tweak? I'm currently using defaults for all, but I found these possibilities:
>>
>> > giraph.useOutOfCoreGraph=true
>> > giraph.maxPartitionsInMemory=N (default: 10)
>> > giraph.isStaticGraph=true
>> > giraph.useOutOfCoreMessages=true (default: disabled)
>> > giraph.maxMessagesInMemory=N (default: 1000000)
>
> giraph.numComputeThreads, giraph.numInputThreads, and giraph.numOutputThreads should be set to the number of threads you have available, or potentially n - 1 or so, since, as Claudio mentions in the email chain I linked to, Giraph is also doing some additional work in the background.
> By default these are set to 1, so you should change them to increase parallelism.
>
> Depending on what you're doing with your edges (adding and removing a lot of edges, or leaving them alone) you can check out giraph.outEdgesClass and set it to one of the subclasses here <https://giraph.apache.org/apidocs/org/apache/giraph/edge/OutEdges.html>. For example, if you're adding and removing a lot of edges, you could try one of the HashMap/Set-backed classes, though these take a bit more memory (I'm basing that on the class descriptions).
>
> giraph.metrics.enable=true prints metrics output to the stderr of each worker and the master, like time spent in communication vs. computation, message bytes sent, etc.
>
> giraph.oneToAllMessageSending=true: If you're doing sendMessageToAllEdges, this (supposedly) activates some sort of optimization for it. I read that somewhere on the listserv and I feel like I looked at it in the code, but take this advice with a grain of salt.
>
> giraph.userPartitionCount: Chooses the number of partitions. The default is the # of workers squared. In your case, if you use 4 workers, that gives you 16 partitions, 4 per worker, which means that if you set numComputeThreads=4 you'll have one partition to process per thread. You might want to consider using a higher number of partitions (maybe 3 per thread, or higher) since partitions are processed in parallel in vertex computation. That way, when partitions are finer-grained, if one partition eats a bunch of time the others can keep getting processed by the remaining threads.
>
>> o How can I get a feel for how much more processing and memory might be needed to finish the job, beyond knowing that it's on the last superstep? For example, of the ~181M sent messages I see during the run, how many more might be left?
>>
>> o Why is the Heap Size from the Cluster summary above (972.69 MB/989.88 MB) so low?
>>
>> Thanks again!
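Pulling the suggestions in this thread together, here is how the command from the "==== hadoop/giraph command ====" section might look with the discussed options passed as -D properties. The values are illustrative guesses, not validated settings: giraph.userPartitionCount=48 assumes 4 workers x 4 threads x 3 partitions per thread, and the out-of-core switch is only worth trying if memory still runs out.

```shell
# Sketch of a re-run with the tuning options discussed above (values illustrative).
hadoop jar $GIRAPH_HOME/giraph-ex.jar org.apache.giraph.GiraphRunner \
  -Dgiraph.zkList=xx.xx.xx.edu:2181 \
  -Dgiraph.numComputeThreads=4 \
  -Dgiraph.numInputThreads=4 \
  -Dgiraph.numOutputThreads=4 \
  -Dgiraph.userPartitionCount=48 \
  -Dgiraph.metrics.enable=true \
  -Dgiraph.useOutOfCoreMessages=true \
  -libjars ${LIBJARS} \
  relpath.RelPathVertex \
  -wc relpath.RelPathWorkerContext \
  -mc relpath.RelPathMasterCompute \
  -vif relpath.CausalityJsonAdjacencyListVertexInputFormat \
  -vip $REL_PATH_INPUT \
  -of relpath.RelPathVertexValueTextOutputFormat \
  -op $REL_PATH_OUTPUT \
  -ca RelPathVertex.path=$REL_PATH_PATH \
  -w 4
```

Note that mapred.tasktracker.map.tasks.maximum=1 (suggested above) is a TaskTracker-side setting, so it belongs in mapred-site.xml on each node rather than on the job command line.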
>> ==== counters from successful four-step run ====
>>
>> INFO mapred.JobClient: Job complete: job_201409191450_0001
>> INFO mapred.JobClient: Counters: 39
>> INFO mapred.JobClient: File System Counters
>> INFO mapred.JobClient: FILE: Number of bytes read=0
>> INFO mapred.JobClient: FILE: Number of bytes written=1694975
>> INFO mapred.JobClient: FILE: Number of read operations=0
>> INFO mapred.JobClient: FILE: Number of large read operations=0
>> INFO mapred.JobClient: FILE: Number of write operations=0
>> INFO mapred.JobClient: HDFS: Number of bytes read=10016293
>> INFO mapred.JobClient: HDFS: Number of bytes written=113612773
>> INFO mapred.JobClient: HDFS: Number of read operations=12
>> INFO mapred.JobClient: HDFS: Number of large read operations=0
>> INFO mapred.JobClient: HDFS: Number of write operations=9
>> INFO mapred.JobClient: Job Counters
>> INFO mapred.JobClient: Launched map tasks=9
>> INFO mapred.JobClient: Total time spent by all maps in occupied slots (ms)=206659
>> INFO mapred.JobClient: Total time spent by all reduces in occupied slots (ms)=0
>> INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
>> INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
>> INFO mapred.JobClient: Map-Reduce Framework
>> INFO mapred.JobClient: Map input records=9
>> INFO mapred.JobClient: Map output records=0
>> INFO mapred.JobClient: Input split bytes=396
>> INFO mapred.JobClient: Spilled Records=0
>> INFO mapred.JobClient: CPU time spent (ms)=243280
>> INFO mapred.JobClient: Physical memory (bytes) snapshot=9947144192
>> INFO mapred.JobClient: Virtual memory (bytes) snapshot=25884065792
>> INFO mapred.JobClient: Total committed heap usage (bytes)=10392305664
>> INFO mapred.JobClient: Giraph Stats
>> INFO mapred.JobClient: Aggregate edges=402428
>> INFO mapred.JobClient: Aggregate finished vertices=119141
>> INFO mapred.JobClient: Aggregate vertices=119141
>> INFO mapred.JobClient: Current master task partition=0
>> INFO mapred.JobClient: Current workers=8
>> INFO mapred.JobClient: Last checkpointed superstep=0
>> INFO mapred.JobClient: Sent messages=0
>> INFO mapred.JobClient: Superstep=4
>> INFO mapred.JobClient: Giraph Timers
>> INFO mapred.JobClient: Input superstep (milliseconds)=1689
>> INFO mapred.JobClient: Setup (milliseconds)=3977
>> INFO mapred.JobClient: Shutdown (milliseconds)=1177
>> INFO mapred.JobClient: Superstep 0 (milliseconds)=834
>> INFO mapred.JobClient: Superstep 1 (milliseconds)=1836
>> INFO mapred.JobClient: Superstep 2 (milliseconds)=2524
>> INFO mapred.JobClient: Superstep 3 (milliseconds)=8284
>> INFO mapred.JobClient: Total (milliseconds)=20322
>>
>> ==== EOF ====
>>
>>
>> --
>> Matthew Cornell | m...@matthewcornell.org | 413-626-3621 | 34 Dickinson Street, Amherst MA 01002 | matthewcornell.org