I cannot thank you enough, Matthew. You've given me a lot to
experiment with. -- matt

On Mon, Sep 22, 2014 at 2:10 PM, Matthew Saltz <sal...@gmail.com> wrote:
> Hi Matthew,
>
> I answered a few of your questions in-line (unfortunately they might not
> help the larger problem, but hopefully it'll help a bit).
>
> Best,
> Matthew
>
>
> On Mon, Sep 22, 2014 at 5:50 PM, Matthew Cornell <m...@matthewcornell.org>
> wrote:
>>
>> Hi Folks,
>>
>> I've spent the last two months learning, installing, coding, and
>> analyzing the performance of our Giraph app, and I'm able to run on
>> small inputs on our tiny cluster (yay!) I am now stuck trying to
>> figure out why larger inputs fail, why only some compute nodes are
>> being used, and generally how to make sure I've configured hadoop and
>> giraph to use all available CPUs and RAM. I feel that I'm "this
>> close," and I could really use some pointers.
>>
>> Below I share our app, configuration, results and log messages, some
>> questions, and counter output for the successful run. My post here is
>> long (I've broken it into sections delimited with '===='), but I hope
>> I've provided enough information to get help. I'm happy to add more.
>>
>> Thanks!
>>
>>
>> ==== application ====
>>
>> Our application is a kind of path search where all nodes have a type
>> and source database ID (e.g., "movie 99"), and searches are expressed
>> as type paths, such as "movie, acted_in, actor", which would start
>> with movies and then find all actors in each movie, for all movies in
>> the database. The program does a kind of filtering by keeping track of
>> previously-processed initial IDs.
>>
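>> To make the shape of the computation concrete, here is a simplified
>> sketch of the idea (not our actual RelPathVertex: the path here is
>> vertex types only and hard-coded, and I'm assuming the Giraph 1.0-style
>> Vertex API):
>>
>> import org.apache.giraph.graph.Vertex;
>> import org.apache.hadoop.io.IntWritable;
>> import org.apache.hadoop.io.NullWritable;
>> import org.apache.hadoop.io.Text;
>>
>> public class TypePathVertex
>>     extends Vertex<Text, Text, NullWritable, IntWritable> {
>>   // Hypothetical hard-coded path; the real one arrives via the
>>   // RelPathVertex.path custom argument.
>>   private static final String[] PATH = {"movie", "rates", "user"};
>>
>>   @Override
>>   public void compute(Iterable<IntWritable> messages) {
>>     // Vertex values look like "movie 99"; the first token is the type.
>>     String myType = getValue().toString().split(" ")[0];
>>     if (getSuperstep() == 0) {
>>       if (myType.equals(PATH[0])) {
>>         sendMessageToAllEdges(new IntWritable(1));  // seed the search
>>       }
>>     } else {
>>       for (IntWritable msg : messages) {
>>         int step = msg.get();
>>         // Each matching vertex fans out to all its neighbors, which is
>>         // exactly where the message explosion comes from.
>>         if (step < PATH.length && myType.equals(PATH[step])) {
>>           sendMessageToAllEdges(new IntWritable(step + 1));
>>         }
>>       }
>>     }
>>     voteToHalt();
>>   }
>> }
>>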
>> Our database is a small movie one with 2K movies, 6K users (people who
>> rate movies), and 80K ratings of movies by users. Though small, we've
>> found this kind of search can result in a massive explosion of
>> messages, as was well put by Rob Vesse (
>>
>> http://mail-archives.apache.org/mod_mbox/giraph-user/201312.mbox/%3ccec4a409.2d7ad%25rve...@dotnetrdf.org%3E
>> ):
>>
>> > even with this relatively small graph you get a massive explosion of
>> > messages by the later super steps which exhausts memory (in my graph the
>> > theoretical maximum messages by the last super step was ~3 billion)
>>
>>
>> ==== job failure and error messages ====
>>
>> Currently I have a four-step path that completes in ~20 seconds
>> ("rates, movie, rates, user" - counter output shown at bottom) but a
>> five-step one ("rates, movie, rates, user, rates") fails after a few
>> minutes. I've looked carefully at the task logs, but I find it a
>> little difficult to discern what the actual failure was. However,
>> looking at system information (e.g., top and ganglia) during the run
>> indicates hosts are running out of memory. There are no
>> OutOfMemoryErrors in the logs; only this one stands out:
>>
>> > ERROR org.apache.giraph.master.BspServiceMaster:
>> > superstepChosenWorkerAlive: Missing chosen worker
>> > Worker(hostname=compute-0-3.wright, MRtaskID=1, port=30001) on superstep 4
>>
>> NB: So far I've been ignoring these other types of messages:
>>
>> > FATAL org.apache.giraph.master.BspServiceMaster: getLastGoodCheckpoint:
>> > No last good checkpoints can be found, killing the job.
>>
>> > java.io.FileNotFoundException: File
>> > _bsp/_checkpoints/job_201409191450_0003 does not exist.
>>
>> > WARN org.apache.giraph.bsp.BspService: process: Unknown and unprocessed
>> > event
>> > (path=/_hadoopBsp/job_201409191450_0003/_applicationAttemptsDir/0/_superstepDir/2/_superstepFinished,
>> > type=NodeDeleted, state=SyncConnected)
>>
>> > ERROR org.apache.giraph.worker.BspServiceWorker: unregisterHealth: Got
>> > failure, unregistering health on
>> > /_hadoopBsp/job_201409191450_0003/_applicationAttemptsDir/0/_superstepDir/4/_workerHealthyDir/compute-0-3.wright_1
>> > on superstep 4
>>
>> The counter statistics are minimal after the run fails, but during it
>> I see something like this when refreshing the Job Tracker Web UI:
>>
>> > Counters > Map-Reduce Framework > Physical memory (bytes) snapshot:
>> > ~28GB
>> > Counters > Map-Reduce Framework > Virtual memory (bytes) snapshot: ~27GB
>> > Counters > Giraph Stats > Sent messages: ~181M
>>
>>
>> ==== hadoop/giraph command ====
>>
>> hadoop jar $GIRAPH_HOME/giraph-ex.jar org.apache.giraph.GiraphRunner \
>> -Dgiraph.zkList=xx.xx.xx.edu:2181 \
>> -libjars ${LIBJARS} \
>> relpath.RelPathVertex \
>> -wc relpath.RelPathWorkerContext \
>> -mc relpath.RelPathMasterCompute \
>> -vif relpath.CausalityJsonAdjacencyListVertexInputFormat \
>> -vip $REL_PATH_INPUT \
>> -of relpath.RelPathVertexValueTextOutputFormat \
>> -op $REL_PATH_OUTPUT \
>> -ca RelPathVertex.path=$REL_PATH_PATH \
>> -w 8
>>
>>
>> ==== cluster, versions, and configuration ====
>>
>> We have a five-node cluster with a head and four compute nodes. The
>> head has 2 CPUs with 16 cores each and 64 GB RAM. Each compute node
>> has 1 CPU with 4 cores and 16 GB RAM, for a cluster total of 128 GB
>> of RAM and 48 cores.
>>
>> Hadoop: Cloudera CDH4 with a mapreduce service running the job tracker
>> on the head node, and task trackers on all five nodes.
>>
>> Hadoop configuration (a mix of mapred-site.xml and the CDH interface,
>> sorry); I'm not sure I've listed everything of interest:
>>
>> > mapreduce.job.counters.max: 120
>> > mapred.output.compress: false
>> > mapred.reduce.tasks: 12
>> > mapred.child.java.opts: -Xmx2147483648
>> > mapred.job.reuse.jvm.num.tasks: 1
>> > MapReduce Child Java Maximum Heap Size: 2 GB
>> > I/O Sort Memory Buffer (io.sort.mb): 512 MB
>> > Client Java Heap Size in Bytes: 256 MB
>> > Java Heap Size of Jobtracker in Bytes 1 GB
>> > Java Heap Size of TaskTracker in Bytes: 1 GB
>>
>> Cluster summary from the Job Tracker Web UI:
>>
>> > Heap Size is 972.69 MB/989.88 MB
>> > Map Task Capacity: 48
>> > Reduce Task Capacity: 24
>> > Avg. Tasks/Node: 14.40
>>
>> Giraph: Compiled as "giraph-1.0.0-for-hadoop-2.0.0-alpha", CHANGELOG:
>> Release 1.0.0 - 2013-04-15
>>
>>
>> ==== questions ====
>>
>> o How can I verify that the failure is actually one of memory? I've
>> looked fairly carefully at the logs.
>
>
> In the logs for the workers, do you have a line that looks like:
> 2014-09-21 18:12:13,021 INFO org.apache.giraph.worker.BspServiceWorker:
> finishSuperstep: Waiting on all requests, superstep 93 Memory
> (free/total/max) = 21951.08M / 36456.50M / 43691.00M
>
> Looking at the memory usage in the failing worker at the end of the
> superstep before the failure could give you a clue. If you don't have that
> line, you can directly use
> org.apache.giraph.utils.MemoryUtils.getRuntimeMemoryStats() in the
> preSuperstep() method of your computation class (or anywhere that will only
> run once per worker) to print out that info.
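>
> A minimal sketch of that (Giraph 1.0-style API; I'm putting it in a
> WorkerContext, whose preSuperstep() runs once per worker per superstep, so
> you could drop this straight into your RelPathWorkerContext):
>
> import org.apache.giraph.utils.MemoryUtils;
> import org.apache.giraph.worker.WorkerContext;
>
> public class MemoryLoggingWorkerContext extends WorkerContext {
>   @Override
>   public void preSuperstep() {
>     // Goes to the task's stdout, so it shows up in the task logs.
>     System.out.println("Superstep " + getSuperstep() + ": " +
>         MemoryUtils.getRuntimeMemoryStats());
>   }
>   @Override public void preApplication() { }
>   @Override public void postApplication() { }
>   @Override public void postSuperstep() { }
> }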
>>
>>
>> o I noticed that not all hosts are being used. I did three runs, two
>> with 8 workers and one with 12, and I pulled the following from the
>> task logs ('h' = head node, 0-3 = compute nodes):
>>
>> > run #1: 0, 2, 3, h, h, h, h, h, h
>> > run #2: 2, 1, 3, h, h, h, h, h, h
>> > run #3: 3, 3, h, h, h, h, h, h, h, h, h, 1, 1
>>
>> Note that there's at least one compute node that isn't listed for each
>> run.
>>
>> o What's a good # of workers to use?
>
>
> You want the # of workers to be equal to the # of worker machines, so with
> 5 machines you'd want -w 4 (Giraph's master runs as one extra map task, so
> it takes the fifth slot). See this thread. Quote: "Basically, better usage
> of resources: one single JVM, no duplication of core data structures, less
> netty threads and communication points, more locality (less messages over
> the network), less actors accessing zookeeper etc." Also from that thread,
> you should set mapred.tasktracker.map.tasks.maximum=1, since each worker is
> a map task and you don't want to run multiple workers on the same machine.
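>
> mapred.tasktracker.map.tasks.maximum is a TaskTracker-side setting, so it
> goes in mapred-site.xml on each node (sketch below; the TaskTrackers need
> a restart to pick it up):
>
> <property>
>   <name>mapred.tasktracker.map.tasks.maximum</name>
>   <value>1</value>
> </property>
>
> With one map slot on each of your 5 nodes, -w 4 plus the master map task
> fills the cluster exactly.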
>
>
>
>>
>> o What Hadoop parameters should I tweak?
>> > mapred.job.map.memory.mb=xx
>> > mapred.map.child.java.opts=xx
>> > mapred.{map|reduce}.child.ulimit
>> > mapred.task.profile
>> > # map slots for each TaskTracker
>> > number of partitions you keep in memory
>>
>>
>> o What Giraph parameters should I tweak? I'm currently using defaults
>> for all, but I found these possibilities:
>> > giraph.useOutOfCoreGraph=true
>> > giraph.maxPartitionsInMemory=N (default: 10)
>> > giraph.isStaticGraph=true
>> > giraph.useOutOfCoreMessages=true (default: disabled)
>> > giraph.maxMessagesInMemory=N (default: 1000000)
>>
> giraph.numComputeThreads, giraph.numInputThreads, and
> giraph.numOutputThreads should be set to the number of threads you have
> available (or potentially n - 1, since, as Claudio mentions in the email
> chain I linked to, Giraph is also doing some additional work in the
> background). By default these are set to 1, so you should change them to
> increase parallelism.
>
> Depending on what you're doing with your edges (adding and removing a lot
> of them, or leaving them alone), you can check out giraph.outEdgesClass and
> set it to one of the subclasses here. For example, if you're adding and
> removing a lot of edges, you could try one of the HashMap/Set-backed
> classes, though these take a bit more memory (I'm basing that on the class
> descriptions).
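>
> Concretely that would be something like (I'm assuming the class name from
> the Giraph 1.0 javadoc; double-check it exists in your build):
>
> -Dgiraph.outEdgesClass=org.apache.giraph.edge.HashMapEdges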
>
> giraph.metrics.enable=true prints metrics (time spent in communication vs.
> computation, message bytes sent, etc.) to the stderr of each worker and
> the master.
>
> giraph.oneToAllMessageSending=true : If you're doing sendMessageToAllEdges,
> this (supposedly) activates an optimization for that pattern. I read that
> somewhere on the listserv and I feel like I looked at it in the code, but
> take this advice with a grain of salt.
>
> giraph.userPartitionCount : Chooses the number of partitions. The default
> is the # of workers squared. In your case, 4 workers gives you 16
> partitions, 4 per worker, which means that with numComputeThreads=4 you'll
> have one partition to process per thread. You might want to consider a
> higher number of partitions (maybe 3 per thread, or more), since partitions
> are processed in parallel during vertex computation. That way, when
> partitions are finer-grained, if one partition eats a bunch of time the
> others can keep getting processed by the remaining threads.
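>
> Putting several of these together on top of your existing command, a
> sketch (the numbers are illustrative starting points for your 4-core/16GB
> compute nodes, not tuned values; 48 partitions = 4 workers x 4 threads x 3
> partitions per thread):
>
> hadoop jar $GIRAPH_HOME/giraph-ex.jar org.apache.giraph.GiraphRunner \
> -Dgiraph.zkList=xx.xx.xx.edu:2181 \
> -Dgiraph.numComputeThreads=4 \
> -Dgiraph.numInputThreads=4 \
> -Dgiraph.userPartitionCount=48 \
> -Dgiraph.useOutOfCoreMessages=true \
> -Dgiraph.metrics.enable=true \
> -libjars ${LIBJARS} \
> relpath.RelPathVertex \
> -wc relpath.RelPathWorkerContext \
> -mc relpath.RelPathMasterCompute \
> -vif relpath.CausalityJsonAdjacencyListVertexInputFormat \
> -vip $REL_PATH_INPUT \
> -of relpath.RelPathVertexValueTextOutputFormat \
> -op $REL_PATH_OUTPUT \
> -ca RelPathVertex.path=$REL_PATH_PATH \
> -w 4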
>
>>
>> o How can I get a feel for how much more processing and memory might
>> be needed to finish the job, beyond knowing that it's on the last
>> superstep? For example, of the ~181M sent messages I see during the
>> run, how many more might be left?
>>
>> o Why is the Heap Size from the Cluster summary above (972.69
>> MB/989.88 MB) so low?
>>
>> Thanks again!
>>
>>
>> ==== counters from successful four-step run ====
>>
>> INFO mapred.JobClient: Job complete: job_201409191450_0001
>> INFO mapred.JobClient: Counters: 39
>> INFO mapred.JobClient:   File System Counters
>> INFO mapred.JobClient:     FILE: Number of bytes read=0
>> INFO mapred.JobClient:     FILE: Number of bytes written=1694975
>> INFO mapred.JobClient:     FILE: Number of read operations=0
>> INFO mapred.JobClient:     FILE: Number of large read operations=0
>> INFO mapred.JobClient:     FILE: Number of write operations=0
>> INFO mapred.JobClient:     HDFS: Number of bytes read=10016293
>> INFO mapred.JobClient:     HDFS: Number of bytes written=113612773
>> INFO mapred.JobClient:     HDFS: Number of read operations=12
>> INFO mapred.JobClient:     HDFS: Number of large read operations=0
>> INFO mapred.JobClient:     HDFS: Number of write operations=9
>> INFO mapred.JobClient:   Job Counters
>> INFO mapred.JobClient:     Launched map tasks=9
>> INFO mapred.JobClient:     Total time spent by all maps in occupied
>> slots (ms)=206659
>> INFO mapred.JobClient:     Total time spent by all reduces in occupied
>> slots (ms)=0
>> INFO mapred.JobClient:     Total time spent by all maps waiting after
>> reserving slots (ms)=0
>> INFO mapred.JobClient:     Total time spent by all reduces waiting
>> after reserving slots (ms)=0
>> INFO mapred.JobClient:   Map-Reduce Framework
>> INFO mapred.JobClient:     Map input records=9
>> INFO mapred.JobClient:     Map output records=0
>> INFO mapred.JobClient:     Input split bytes=396
>> INFO mapred.JobClient:     Spilled Records=0
>> INFO mapred.JobClient:     CPU time spent (ms)=243280
>> INFO mapred.JobClient:     Physical memory (bytes) snapshot=9947144192
>> INFO mapred.JobClient:     Virtual memory (bytes) snapshot=25884065792
>> INFO mapred.JobClient:     Total committed heap usage (bytes)=10392305664
>> INFO mapred.JobClient:   Giraph Stats
>> INFO mapred.JobClient:     Aggregate edges=402428
>> INFO mapred.JobClient:     Aggregate finished vertices=119141
>> INFO mapred.JobClient:     Aggregate vertices=119141
>> INFO mapred.JobClient:     Current master task partition=0
>> INFO mapred.JobClient:     Current workers=8
>> INFO mapred.JobClient:     Last checkpointed superstep=0
>> INFO mapred.JobClient:     Sent messages=0
>> INFO mapred.JobClient:     Superstep=4
>> INFO mapred.JobClient:   Giraph Timers
>> INFO mapred.JobClient:     Input superstep (milliseconds)=1689
>> INFO mapred.JobClient:     Setup (milliseconds)=3977
>> INFO mapred.JobClient:     Shutdown (milliseconds)=1177
>> INFO mapred.JobClient:     Superstep 0 (milliseconds)=834
>> INFO mapred.JobClient:     Superstep 1 (milliseconds)=1836
>> INFO mapred.JobClient:     Superstep 2 (milliseconds)=2524
>> INFO mapred.JobClient:     Superstep 3 (milliseconds)=8284
>> INFO mapred.JobClient:     Total (milliseconds)=20322
>>
>> ==== EOF ====
>>
>>
>> --
>> Matthew Cornell | m...@matthewcornell.org | 413-626-3621 | 34
>> Dickinson Street, Amherst MA 01002 | matthewcornell.org
>
>



-- 
Matthew Cornell | m...@matthewcornell.org | 413-626-3621 | 34
Dickinson Street, Amherst MA 01002 | matthewcornell.org
