Re: understanding failing my job, Giraph/Hadoop memory usage, under-utilized nodes, and moving forward

2014-09-22 Thread Matthew Saltz
Sorry, it should be
org.apache.giraph.utils.MemoryUtils.getRuntimeMemoryStats();
I left out the "giraph".
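
For reference, a minimal sketch of how that method could be used to log heap
usage on each worker around every superstep. This is only an illustration
(the class name is made up, and it assumes the Giraph 1.x WorkerContext API
plus log4j), not a drop-in for an existing WorkerContext:

import org.apache.giraph.utils.MemoryUtils;
import org.apache.giraph.worker.WorkerContext;
import org.apache.log4j.Logger;

/** Hypothetical example: log JVM heap stats on each worker per superstep. */
public class MemoryLoggingWorkerContext extends WorkerContext {
  private static final Logger LOG =
      Logger.getLogger(MemoryLoggingWorkerContext.class);

  @Override
  public void preApplication() {
    // nothing to set up
  }

  @Override
  public void postApplication() { }

  @Override
  public void preSuperstep() {
    // MemoryUtils reports the free/total/max heap of the JVM it is called
    // in, so this line lands in each worker's (map task's) log.
    LOG.info("Before superstep " + getSuperstep() + ": "
        + MemoryUtils.getRuntimeMemoryStats());
  }

  @Override
  public void postSuperstep() {
    LOG.info("After superstep " + getSuperstep() + ": "
        + MemoryUtils.getRuntimeMemoryStats());
  }
}

A class like this would be passed to GiraphRunner with -wc, the same way the
poster's RelPathWorkerContext is in the command further down.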

Re: understanding failing my job, Giraph/Hadoop memory usage, under-utilized nodes, and moving forward

2014-09-22 Thread Matthew Saltz
Hi Matthew,

I answered a few of your questions in-line (unfortunately they might not
solve the larger problem, but hopefully they'll help a bit).

Best,
Matthew


On Mon, Sep 22, 2014 at 5:50 PM, Matthew Cornell m...@matthewcornell.org
wrote:

 Hi Folks,

 I've spent the last two months learning, installing, coding, and
 analyzing the performance of our Giraph app, and I'm able to run on
 small inputs on our tiny cluster (yay!) I am now stuck trying to
 figure out why larger inputs fail, why only some compute nodes are
 being used, and generally how to make sure I've configured Hadoop and
 Giraph to use all available CPUs and RAM. I feel that I'm this
 close, and I could really use some pointers.

 Below I share our app, configuration, results and log messages, some
 questions, and counter output for the successful run. My post here is
 long (I've broken it into sections), but I hope
 I've provided good enough information to get help on. I'm happy to add
 to it.

 Thanks!


  application 

 Our application is a kind of path search where all nodes have a type
 and source database ID (e.g., "movie 99"), and searches are expressed
 as type paths, such as "movie, acted_in, actor", which would start
 with movies and then find all actors in each movie, for all movies in
 the database. The program does a kind of filtering by keeping track of
 previously-processed initial IDs.
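
 To make the pattern concrete, here is a stripped-down sketch of how such a
 type-path fan-out might look in Giraph's Computation-style API (assuming
 Giraph 1.1's BasicComputation). This is not the actual relpath.RelPathVertex
 code; the class name, the "type.path" custom argument, and the
 "movie:99"-style ids are only illustrative:

 import java.io.IOException;
 import org.apache.giraph.graph.BasicComputation;
 import org.apache.giraph.graph.Vertex;
 import org.apache.hadoop.io.NullWritable;
 import org.apache.hadoop.io.Text;

 /** Illustrative only: fan a search out along a path of node types. */
 public class TypePathComputation
     extends BasicComputation<Text, Text, NullWritable, Text> {

   /** e.g. "rates,movie,rates,user,rates", passed via -ca type.path=... */
   private String[] path;

   @Override
   public void preSuperstep() {
     path = getConf().get("type.path", "").split(",");
   }

   @Override
   public void compute(Vertex<Text, Text, NullWritable> vertex,
                       Iterable<Text> messages) throws IOException {
     int step = (int) getSuperstep();
     // A vertex takes part in step s only if it was reached by a message
     // (or s == 0) and its own type matches the s-th element of the path.
     boolean reached = step == 0 || messages.iterator().hasNext();
     if (reached && step < path.length
         && vertexType(vertex).equals(path[step].trim())) {
       if (step + 1 < path.length) {
         // Fan out to every neighbour; neighbours of the wrong type simply
         // drop the message at the next superstep. The real app would also
         // carry along and filter the previously-processed initial IDs here.
         sendMessageToAllEdges(vertex, vertex.getId());
       }
     }
     vertex.voteToHalt();
   }

   /** Assumes ids look like "movie:99", "user:12", "rates:345". */
   private String vertexType(Vertex<Text, Text, NullWritable> vertex) {
     return vertex.getId().toString().split(":")[0];
   }
 }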

 Our database is a small movie one with 2K movies, 6K users (people who
 rate movies), and 80K ratings of movies by users. Though small, we've
 found this kind of search can result in a massive explosion of
 messages, as was well put by Rob Vesse (

 http://mail-archives.apache.org/mod_mbox/giraph-user/201312.mbox/%3ccec4a409.2d7ad%25rve...@dotnetrdf.org%3E
 ):

  even with this relatively small graph you get a massive explosion of
 messages by the later super steps which exhausts memory (in my graph the
 theoretical maximum messages by the last super step was ~3 billion)
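
 As a rough back-of-envelope for our data (assuming each "rates" node links
 exactly one movie and one user): 80K ratings / 2K movies is ~40 ratings per
 movie, and 80K ratings / 6K users is ~13 ratings per user, so the five-step
 path described below can fan out to roughly

   80K starting ratings x 1 movie x ~40 ratings x 1 user x ~13 ratings
     = ~42M distinct paths

 by the last superstep, with message counts of the same order if each path
 prefix is forwarded separately.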


  job failure and error messages 

 Currently I have a four-step path that completes in ~20 seconds
 ("rates, movie, rates, user" - counter output shown at bottom) but a
 five-step one ("rates, movie, rates, user, rates") fails after a few
 minutes. I've looked carefully at the task logs, but I find it a
 little difficult to discern what the actual failure was. However,
 looking at system information (e.g., top and ganglia) during the run
 indicates hosts are running out of memory. There are no
 OutOfMemoryErrors in the logs, and only this one stands out:

  ERROR org.apache.giraph.master.BspServiceMaster:
 superstepChosenWorkerAlive: Missing chosen worker
 Worker(hostname=compute-0-3.wright, MRtaskID=1, port=30001) on superstep 4

 NB: So far I've been ignoring these other types of messages:

  FATAL org.apache.giraph.master.BspServiceMaster: getLastGoodCheckpoint:
 No last good checkpoints can be found, killing the job.

  java.io.FileNotFoundException: File
 _bsp/_checkpoints/job_201409191450_0003 does not exist.

  WARN org.apache.giraph.bsp.BspService: process: Unknown and unprocessed
 event
 (path=/_hadoopBsp/job_201409191450_0003/_applicationAttemptsDir/0/_superstepDir/2/_superstepFinished,
 type=NodeDeleted, state=SyncConnected)

  ERROR org.apache.giraph.worker.BspServiceWorker: unregisterHealth: Got
 failure, unregistering health on
 /_hadoopBsp/job_201409191450_0003/_applicationAttemptsDir/0/_superstepDir/4/_workerHealthyDir/compute-0-3.wright_1
 on superstep 4

 The counter statistics are minimal after the run fails, but during it
 I see something like this when refreshing the Job Tracker Web UI:

  Counters > Map-Reduce Framework > Physical memory (bytes) snapshot: ~28 GB
  Counters > Map-Reduce Framework > Virtual memory (bytes) snapshot: ~27 GB
  Counters > Giraph Stats > Sent messages: ~181M


  hadoop/giraph command 

 hadoop jar $GIRAPH_HOME/giraph-ex.jar org.apache.giraph.GiraphRunner \
 -Dgiraph.zkList=xx.xx.xx.edu:2181 \
 -libjars ${LIBJARS} \
 relpath.RelPathVertex \
 -wc relpath.RelPathWorkerContext \
 -mc relpath.RelPathMasterCompute \
 -vif relpath.CausalityJsonAdjacencyListVertexInputFormat \
 -vip $REL_PATH_INPUT \
 -of relpath.RelPathVertexValueTextOutputFormat \
 -op $REL_PATH_OUTPUT \
 -ca RelPathVertex.path=$REL_PATH_PATH \
 -w 8


  cluster, versions, and configuration 

 We have a five-node cluster with a head node and four compute nodes. The
 head has 2 CPUs with 16 cores each, and 64 GB RAM. Each compute node has
 1 CPU with 4 cores, and 16 GB RAM, for a cluster total of 128 GB of RAM
 and 48 cores.

 Hadoop: Cloudera CDH4 with a mapreduce service running the job tracker
 on the head node, and task trackers on all five nodes.

 Hadoop configuration (from mapred-site.xml and the CDH interface - sorry
 for the mix); I'm not sure I'm listing all of the relevant ones:

  mapreduce.job.counters.max: 120
  mapred.output.compress: false
  mapred.reduce.tasks: 12
  mapred.child.java.opts: -Xmx2147483648