Hi Matthew, I answered a few of your questions in-line (unfortunately they might not help the larger problem, but hopefully they'll help a bit).
Best,
Matthew

On Mon, Sep 22, 2014 at 5:50 PM, Matthew Cornell <m...@matthewcornell.org> wrote:
> Hi Folks,
>
> I've spent the last two months learning, installing, coding, and
> analyzing the performance of our Giraph app, and I'm able to run on
> small inputs on our tiny cluster (yay!). I am now stuck trying to
> figure out why larger inputs fail, why only some compute nodes are
> being used, and generally how to make sure I've configured hadoop and
> giraph to use all available CPUs and RAM. I feel that I'm "this
> close," and I could really use some pointers.
>
> Below I share our app, configuration, results and log messages, some
> questions, and counter output for the successful run. My post here is
> long (I've broken it into sections delimited with '===='), but I hope
> I've provided good enough information to get help on. I'm happy to add
> to it.
>
> Thanks!
>
>
> ==== application ====
>
> Our application is a kind of path search where all nodes have a type
> and source database ID (e.g., "movie 99"), and searches are expressed
> as type paths, such as "movie, acted_in, actor", which would start
> with movies and then find all actors in each movie, for all movies in
> the database. The program does a kind of filtering by keeping track of
> previously-processed initial IDs.
>
> Our database is a small movie one with 2K movies, 6K users (people who
> rate movies), and 80K ratings of movies by users. Though small, we've
> found this kind of search can result in a massive explosion of
> messages, as was well put by Rob Vesse (
> http://mail-archives.apache.org/mod_mbox/giraph-user/201312.mbox/%3ccec4a409.2d7ad%25rve...@dotnetrdf.org%3E
> ):
>
> > even with this relatively small graph you get a massive explosion of
> > messages by the later super steps which exhausts memory (in my graph the
> > theoretical maximum messages by the last super step was ~3 billion)
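*A back-of-envelope note on that explosion, in case it's useful: each step in the path multiplies message volume by roughly the average branching factor at that step. With your 80K ratings spread over 2K movies (~40 ratings per movie) and over 6K users (~13 ratings per user), each step you add can multiply messages by one to two orders of magnitude, which fits with four steps completing while five steps exhaust memory. Rough numbers only, of course.*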
>
> ==== job failure and error messages ====
>
> Currently I have a four-step path that completes in ~20 seconds
> ("rates, movie, rates, user" - counter output shown at bottom) but a
> five-step one ("rates, movie, rates, user, rates") fails after a few
> minutes. I've looked carefully at the task logs, but I find it a
> little difficult to discern what the actual failure was. However,
> looking at system information (e.g., top and ganglia) during the run
> indicates hosts are running out of memory. There are no
> OutOfMemoryErrors in the logs, and only this one stands out:
>
> ERROR org.apache.giraph.master.BspServiceMaster:
> superstepChosenWorkerAlive: Missing chosen worker
> Worker(hostname=compute-0-3.wright, MRtaskID=1, port=30001) on superstep 4
>
> NB: So far I've been ignoring these other types of messages:
>
> FATAL org.apache.giraph.master.BspServiceMaster: getLastGoodCheckpoint:
> No last good checkpoints can be found, killing the job.
>
> java.io.FileNotFoundException: File
> _bsp/_checkpoints/job_201409191450_0003 does not exist.
>
> WARN org.apache.giraph.bsp.BspService: process: Unknown and unprocessed
> event
> (path=/_hadoopBsp/job_201409191450_0003/_applicationAttemptsDir/0/_superstepDir/2/_superstepFinished,
> type=NodeDeleted, state=SyncConnected)
>
> ERROR org.apache.giraph.worker.BspServiceWorker: unregisterHealth: Got
> failure, unregistering health on
> /_hadoopBsp/job_201409191450_0003/_applicationAttemptsDir/0/_superstepDir/4/_workerHealthyDir/compute-0-3.wright_1
> on superstep 4
>
> The counter statistics are minimal after the run fails, but during it
> I see something like this when refreshing the Job Tracker Web UI:
>
> Counters > Map-Reduce Framework > Physical memory (bytes) snapshot: ~28GB
> Counters > Map-Reduce Framework > Virtual memory (bytes) snapshot: ~27GB
> Counters > Giraph Stats > Sent messages: ~181M
>
>
> ==== hadoop/giraph command ====
>
> hadoop jar $GIRAPH_HOME/giraph-ex.jar org.apache.giraph.GiraphRunner \
>   -Dgiraph.zkList=xx.xx.xx.edu:2181 \
>   -libjars ${LIBJARS} \
>   relpath.RelPathVertex \
>   -wc relpath.RelPathWorkerContext \
>   -mc relpath.RelPathMasterCompute \
>   -vif relpath.CausalityJsonAdjacencyListVertexInputFormat \
>   -vip $REL_PATH_INPUT \
>   -of relpath.RelPathVertexValueTextOutputFormat \
>   -op $REL_PATH_OUTPUT \
>   -ca RelPathVertex.path=$REL_PATH_PATH \
>   -w 8
>
>
> ==== cluster, versions, and configuration ====
>
> We have a five-node cluster with a head and four compute nodes. The
> head has 2 CPUs with 16 cores each and 64 GB RAM. Each compute node
> has 1 CPU with 4 cores and 16 GB RAM, making a total cluster of
> 128 GB of RAM and 48 cores.
>
> Hadoop: Cloudera CDH4 with a mapreduce service running the job tracker
> on the head node, and task trackers on all five nodes.
>
> Hadoop configuration (mapred-site.xml and the CDH interface - sorry for
> the mix) - not sure I'm listing all of the ones of interest:
>
> mapreduce.job.counters.max: 120
> mapred.output.compress: false
> mapred.reduce.tasks: 12
> mapred.child.java.opts: -Xmx2147483648
> mapred.job.reuse.jvm.num.tasks: 1
> MapReduce Child Java Maximum Heap Size: 2 GB
> I/O Sort Memory Buffer (io.sort.mb): 512 MB
> Client Java Heap Size in Bytes: 256 MB
> Java Heap Size of Jobtracker in Bytes: 1 GB
> Java Heap Size of TaskTracker in Bytes: 1 GB
>
> Cluster summary from the Job Tracker Web UI:
>
> Heap Size is 972.69 MB/989.88 MB
> Map Task Capacity: 48
> Reduce Task Capacity: 24
> Avg. Tasks/Node: 14.40
>
> Giraph: Compiled as "giraph-1.0.0-for-hadoop-2.0.0-alpha", CHANGELOG:
> Release 1.0.0 - 2013-04-15
>
>
> ==== questions ====
>
> o How can I verify that the failure is actually one of memory? I've
> looked fairly carefully at the logs.

*In the logs for the workers, do you have a line that looks like:*

*2014-09-21 18:12:13,021 INFO org.apache.giraph.worker.BspServiceWorker: finishSuperstep: Waiting on all requests, superstep 93 Memory (free/total/max) = 21951.08M / 36456.50M / 43691.00M*

*Looking at the memory usage in the worker that fails, at the end of the superstep before the failure, could give you a clue. If you don't have that line, you can call org.apache.giraph.utils.MemoryUtils.getRuntimeMemoryStats() directly in the preSuperstep() method of your worker context class (or anywhere else that runs only once per worker per superstep) to print out that info.*
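*A minimal sketch of that, assuming the Giraph 1.0 WorkerContext API; the class below is just a stand-in for your actual relpath.RelPathWorkerContext, so merge the preSuperstep() body into whatever you already have there:*

import org.apache.giraph.utils.MemoryUtils;
import org.apache.giraph.worker.WorkerContext;

// Sketch assuming Giraph 1.0.0 APIs; stand-in for relpath.RelPathWorkerContext.
public class RelPathWorkerContext extends WorkerContext {

  @Override
  public void preApplication() throws InstantiationException,
      IllegalAccessException {
  }

  @Override
  public void preSuperstep() {
    // Runs once per worker per superstep: log heap usage so you can see
    // which superstep (and which worker) is running low on memory.
    System.out.println("superstep " + getSuperstep() + ": " +
        MemoryUtils.getRuntimeMemoryStats());
  }

  @Override
  public void postSuperstep() {
  }

  @Override
  public void postApplication() {
  }
}

*Each worker's output ends up in its map task's stdout log, so you can compare workers superstep by superstep.*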
> o I noticed that not all hosts are being used. I did three runs, two
> with 8 workers and one with 12, and I pulled the following from the
> task logs ('h' = head node, 0-3 = compute nodes):
>
> run #1: 0, 2, 3, h, h, h, h, h, h
> run #2: 2, 1, 3, h, h, h, h, h, h
> run #3: 3, 3, h, h, h, h, h, h, h, h, h, 1, 1
>
> Note that there's at least one compute node that isn't listed for each run.
>
> o What's a good # of workers to use?

*You want the # of workers to be equal to the # of worker machines, so with 5 machines you'd want -w 4 (the remaining map task runs the master). See this thread <https://www.mail-archive.com/user@giraph.apache.org/msg01570.html>. Quote: "Basically, better usage of resources: one single JVM, no duplication of core data structures, less netty threads and communication points, more locality (less messages over the network), less actors accessing zookeeper etc."*

*Also from that thread, you should set mapred.tasktracker.map.tasks.maximum=1, since each worker is a map task and you don't want to run multiple workers on the same machine.*

> o What Hadoop parameters should I tweak?
>
> mapred.job.map.memory.mb=xx
> mapred.map.child.java.opts=xx
> mapred.{map|reduce}.child.ulimit
> mapred.task.profile
> # map slots for each TaskTracker
> number of partitions you keep in memory
>
> o What Giraph parameters should I tweak? I'm currently using defaults
> for all, but I found these possibilities:
>
> giraph.useOutOfCoreGraph=true
> giraph.maxPartitionsInMemory=N (default: 10)
> giraph.isStaticGraph=true
> giraph.useOutOfCoreMessages=true (default: disabled)
> giraph.maxMessagesInMemory=N (default: 1000000)

*giraph.numComputeThreads, giraph.numInputThreads, and giraph.numOutputThreads should be set to the number of cores you have available per worker, or potentially n - 1, since as Claudio mentions in the email chain I linked to, Giraph is also doing some additional work in the background. By default these are all set to 1, so you should change them to increase parallelism.*

*Depending on what you're doing with your edges (adding and removing a lot of them or leaving them alone), you can check out giraph.outEdgesClass and set it to one of the subclasses here <https://giraph.apache.org/apidocs/org/apache/giraph/edge/OutEdges.html>. For example, if you're adding and removing a lot of edges, you could try one of the HashMap/Set-backed classes, though these take a bit more memory (I'm basing that on the class descriptions).*

*giraph.metrics.enable=true prints metrics output in the stderr of each worker and the master, like time spent in communication vs. computation, message bytes sent, etc.*

*giraph.oneToAllMessageSending=true: if you're doing sendMessageToAllEdges, this (supposedly) activates some sort of optimization for it. I read that somewhere on the listserv and I feel like I looked at it in the code, but take this advice with a grain of salt.*

*giraph.userPartitionCount: chooses the number of partitions. The default is the # of workers squared. In your case, if you use 4 workers that gives you 16 partitions, 4 per worker, which means that with numComputeThreads=4 you'll have one partition to process per thread. You might want to consider a higher number of partitions (maybe 3 per thread, or more), since partitions are processed in parallel during vertex computation. When partitions are finer-grained, if one partition eats a bunch of time the others can keep getting processed by the remaining threads. An example command with these settings is sketched below.*
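*Putting those together, the command might look something like this. This is just a sketch: the thread and partition counts are guesses to tune for your cluster (4 workers x 4 compute threads x 3 partitions per thread = 48 partitions), and I've kept your classes and shell variables as-is:*

hadoop jar $GIRAPH_HOME/giraph-ex.jar org.apache.giraph.GiraphRunner \
  -Dgiraph.zkList=xx.xx.xx.edu:2181 \
  -Dgiraph.numComputeThreads=4 \
  -Dgiraph.numInputThreads=4 \
  -Dgiraph.numOutputThreads=4 \
  -Dgiraph.userPartitionCount=48 \
  -Dgiraph.metrics.enable=true \
  -libjars ${LIBJARS} \
  relpath.RelPathVertex \
  -wc relpath.RelPathWorkerContext \
  -mc relpath.RelPathMasterCompute \
  -vif relpath.CausalityJsonAdjacencyListVertexInputFormat \
  -vip $REL_PATH_INPUT \
  -of relpath.RelPathVertexValueTextOutputFormat \
  -op $REL_PATH_OUTPUT \
  -ca RelPathVertex.path=$REL_PATH_PATH \
  -w 4

*Note that mapred.tasktracker.map.tasks.maximum=1 is a task tracker setting, so it goes in each node's mapred-site.xml (or the CDH interface) rather than on the command line.*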
> o How can I get a feel for how much more processing and memory might
> be needed to finish the job, beyond the fact that it's on the last
> superstep? For example, of the ~181M sent messages I see during the
> run, how many more might be left?
>
> o Why is the Heap Size from the Cluster summary above (972.69
> MB/989.88 MB) so low?
>
> Thanks again!
>
>
> ==== counters from successful four-step run ====
>
> INFO mapred.JobClient: Job complete: job_201409191450_0001
> INFO mapred.JobClient: Counters: 39
> INFO mapred.JobClient:   File System Counters
> INFO mapred.JobClient:     FILE: Number of bytes read=0
> INFO mapred.JobClient:     FILE: Number of bytes written=1694975
> INFO mapred.JobClient:     FILE: Number of read operations=0
> INFO mapred.JobClient:     FILE: Number of large read operations=0
> INFO mapred.JobClient:     FILE: Number of write operations=0
> INFO mapred.JobClient:     HDFS: Number of bytes read=10016293
> INFO mapred.JobClient:     HDFS: Number of bytes written=113612773
> INFO mapred.JobClient:     HDFS: Number of read operations=12
> INFO mapred.JobClient:     HDFS: Number of large read operations=0
> INFO mapred.JobClient:     HDFS: Number of write operations=9
> INFO mapred.JobClient:   Job Counters
> INFO mapred.JobClient:     Launched map tasks=9
> INFO mapred.JobClient:     Total time spent by all maps in occupied slots (ms)=206659
> INFO mapred.JobClient:     Total time spent by all reduces in occupied slots (ms)=0
> INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
> INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
> INFO mapred.JobClient:   Map-Reduce Framework
> INFO mapred.JobClient:     Map input records=9
> INFO mapred.JobClient:     Map output records=0
> INFO mapred.JobClient:     Input split bytes=396
> INFO mapred.JobClient:     Spilled Records=0
> INFO mapred.JobClient:     CPU time spent (ms)=243280
> INFO mapred.JobClient:     Physical memory (bytes) snapshot=9947144192
> INFO mapred.JobClient:     Virtual memory (bytes) snapshot=25884065792
> INFO mapred.JobClient:     Total committed heap usage (bytes)=10392305664
> INFO mapred.JobClient:   Giraph Stats
> INFO mapred.JobClient:     Aggregate edges=402428
> INFO mapred.JobClient:     Aggregate finished vertices=119141
> INFO mapred.JobClient:     Aggregate vertices=119141
> INFO mapred.JobClient:     Current master task partition=0
> INFO mapred.JobClient:     Current workers=8
> INFO mapred.JobClient:     Last checkpointed superstep=0
> INFO mapred.JobClient:     Sent messages=0
> INFO mapred.JobClient:     Superstep=4
> INFO mapred.JobClient:   Giraph Timers
> INFO mapred.JobClient:     Input superstep (milliseconds)=1689
> INFO mapred.JobClient:     Setup (milliseconds)=3977
> INFO mapred.JobClient:     Shutdown (milliseconds)=1177
> INFO mapred.JobClient:     Superstep 0 (milliseconds)=834
> INFO mapred.JobClient:     Superstep 1 (milliseconds)=1836
> INFO mapred.JobClient:     Superstep 2 (milliseconds)=2524
> INFO mapred.JobClient:     Superstep 3 (milliseconds)=8284
> INFO mapred.JobClient:     Total (milliseconds)=20322
>
> ==== EOF ====
>
>
> --
> Matthew Cornell | m...@matthewcornell.org | 413-626-3621 | 34
> Dickinson Street, Amherst MA 01002 | matthewcornell.org