Re: The relation between the number of partitions, number of workers, number of mappers

2014-09-22 Thread Lukas Nalezenec

Hi,

Number of mappers = number of workers
Number of partitions = multiplier * (number of workers)^2 by default
(multiplier = 1 by default)
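
In code, that default could be sketched like this (a sketch of the stated formula; the function name is mine, not Giraph's):

```python
def default_partition_count(num_workers, multiplier=1):
    """Default Giraph partition count: multiplier * (number of workers)^2."""
    return multiplier * num_workers ** 2

# e.g. the 8-worker job discussed later in this thread:
default_partition_count(8)  # 64 partitions
```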

Lukas

On 22.9.2014 23:18, xuhong zhang wrote:


I know that the number of mappers equals the number of workers *
 mapred.tasktracker.map.tasks.maximum


How about the number of partitions?

Thanks

--
Xuhong Zhang




The relation between the number of partitions, number of workers, number of mappers

2014-09-22 Thread xuhong zhang
I know that the number of mappers equals the number of workers *
 mapred.tasktracker.map.tasks.maximum

How about the number of partitions?

Thanks

-- 
Xuhong Zhang


Re: understanding failing my job, Giraph/Hadoop memory usage, under-utilized nodes, and moving forward

2014-09-22 Thread Matthew Saltz
Hi Matthew,

I answered a few of your questions in-line (unfortunately they may not
solve the larger problem, but hopefully they'll help a bit).

Best,
Matthew



Re: understanding failing my job, Giraph/Hadoop memory usage, under-utilized nodes, and moving forward

2014-09-22 Thread Matthew Saltz
Sorry, that should be
"org.apache.giraph.utils.MemoryUtils.getRuntimeMemoryStats()";
I left out the "giraph".


understanding failing my job, Giraph/Hadoop memory usage, under-utilized nodes, and moving forward

2014-09-22 Thread Matthew Cornell
Hi Folks,

I've spent the last two months learning, installing, coding, and
analyzing the performance of our Giraph app, and I'm able to run on
small inputs on our tiny cluster (yay!). I am now stuck trying to
figure out why larger inputs fail, why only some compute nodes are
being used, and generally how to make sure I've configured hadoop and
giraph to use all available CPUs and RAM. I feel that I'm "this
close," and I could really use some pointers.

Below I share our app, configuration, results and log messages, some
questions, and counter output for the successful run. My post here is
long (I've broken it into sections delimited with ''), but I hope
I've provided good enough information to get help on. I'm happy to add
to it.

Thanks!


 application 

Our application is a kind of path search where all nodes have a type
and source database ID (e.g., "movie 99"), and searches are expressed
as type paths, such as "movie, acted_in, actor", which would start
with movies and then find all actors in each movie, for all movies in
the database. The program does a kind of filtering by keeping track of
previously-processed initial IDs.

Our database is a small movie one with 2K movies, 6K users (people who
rate movies), and 80K ratings of movies by users. Though small, we've
found this kind of search can result in a massive explosion of
messages, as was well put by Rob Vesse (
http://mail-archives.apache.org/mod_mbox/giraph-user/201312.mbox/%3ccec4a409.2d7ad%25rve...@dotnetrdf.org%3E
):

> even with this relatively small graph you get a massive explosion of messages 
> by the later super steps which exhausts memory (in my graph the theoretical 
> maximum messages by the last super step was ~3 billion)
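
That explosion is easy to see with back-of-the-envelope math: if every message fans out to each recipient vertex's neighbors, message volume multiplies by the average branching factor at each superstep. A hypothetical sketch (the seed count and branching factors below are made-up illustrations, not measured from this dataset):

```python
def messages_per_superstep(start_messages, branching_factors):
    """Estimate per-superstep message counts when each message
    fans out to `b` recipients at every step."""
    counts = [start_messages]
    for b in branching_factors:
        counts.append(counts[-1] * b)
    return counts

# e.g. 2K seed messages fanning out by factors 40, 3, 40, 3:
messages_per_superstep(2_000, [40, 3, 40, 3])
# -> [2000, 80000, 240000, 9600000, 28800000]
```

Even modest alternating fan-outs reach tens of millions of messages within five steps, which is consistent with the ~181M sent messages reported below.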


 job failure and error messages 

Currently I have a four-step path that completes in ~20 seconds
("rates, movie, rates, user" - counter output shown at bottom) but a
five-step one ("rates, movie, rates, user, rates") fails after a few
minutes. I've looked carefully at the task logs, but I find it a
little difficult to discern what the actual failure was. However,
looking at system information (e.g., top and ganglia) during the run
indicates hosts are running out of memory. There are no
OutOfMemoryErrors in the logs, and only this one stands out:

> ERROR org.apache.giraph.master.BspServiceMaster: superstepChosenWorkerAlive: 
> Missing chosen worker Worker(hostname=compute-0-3.wright, MRtaskID=1, 
> port=30001) on superstep 4

NB: So far I've been ignoring these other types of messages:

> FATAL org.apache.giraph.master.BspServiceMaster: getLastGoodCheckpoint: No 
> last good checkpoints can be found, killing the job.

> java.io.FileNotFoundException: File _bsp/_checkpoints/job_201409191450_0003 
> does not exist.

> WARN org.apache.giraph.bsp.BspService: process: Unknown and unprocessed event 
> (path=/_hadoopBsp/job_201409191450_0003/_applicationAttemptsDir/0/_superstepDir/2/_superstepFinished,
>  type=NodeDeleted, state=SyncConnected)

> ERROR org.apache.giraph.worker.BspServiceWorker: unregisterHealth: Got 
> failure, unregistering health on 
> /_hadoopBsp/job_201409191450_0003/_applicationAttemptsDir/0/_superstepDir/4/_workerHealthyDir/compute-0-3.wright_1
>  on superstep 4

The counter statistics are minimal after the run fails, but during it
I see something like this when refreshing the Job Tracker Web UI:

> Counters > Map-Reduce Framework > Physical memory (bytes) snapshot: ~28GB
> Counters > Map-Reduce Framework > Virtual memory (bytes) snapshot: ~27GB
> Counters > Giraph Stats > Sent messages: ~181M


 hadoop/giraph command 

hadoop jar $GIRAPH_HOME/giraph-ex.jar org.apache.giraph.GiraphRunner \
-Dgiraph.zkList=xx.xx.xx.edu:2181 \
-libjars ${LIBJARS} \
relpath.RelPathVertex \
-wc relpath.RelPathWorkerContext \
-mc relpath.RelPathMasterCompute \
-vif relpath.CausalityJsonAdjacencyListVertexInputFormat \
-vip $REL_PATH_INPUT \
-of relpath.RelPathVertexValueTextOutputFormat \
-op $REL_PATH_OUTPUT \
-ca RelPathVertex.path=$REL_PATH_PATH \
-w 8


 cluster, versions, and configuration 

We have a five-node cluster with a head and four compute nodes. The
head has 2 CPUs, 16 cores each, and 64 GB RAM. Each compute has 1 CPU,
4 cores each, and 16 GB RAM, making a total cluster of 128 GB of RAM
and 48 cores.

Hadoop: Cloudera CDH4 with a mapreduce service running the job tracker
on the head node, and task trackers on all five nodes.

Hadoop configuration (mapred-site.xml and CDH interface - sorry for
the mix) - I'm not sure I've listed all of the relevant ones:

> mapreduce.job.counters.max: 120
> mapred.output.compress: false
> mapred.reduce.tasks: 12
> mapred.child.java.opts: -Xmx2147483648
> mapred.job.reuse.jvm.num.tasks: 1
> MapReduce Child Java Maximum Heap Size: 2 GB
> I/O Sort Memory Buffer (io.sort.mb): 512 MB
> Client Java Heap Size in Bytes: 256 MB
> Java Heap Size of Jobtracker in Bytes 1 GB
> Java Heap Size of TaskTracker in Bytes: 1 GB
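
One quick sanity check on these settings: with a 2 GB child heap, the worst-case JVM heap committed per compute node depends on the map-slot count (mapred.tasktracker.map.tasks.maximum, which isn't shown above, so the slot counts here are assumptions for illustration):

```python
def node_heap_budget(map_slots, child_heap_gb, tasktracker_heap_gb=1):
    """Worst-case JVM heap (GB) committed on one TaskTracker node:
    one child JVM per occupied map slot, plus the TaskTracker itself."""
    return map_slots * child_heap_gb + tasktracker_heap_gb

# e.g. 4 slots * 2 GB + 1 GB TaskTracker = 9 GB of a 16 GB node,
# before OS, DataNode, and off-heap/network buffer overhead:
node_heap_budget(4, 2)  # 9
```

Note this counts only -Xmx heap; Netty buffers and other off-heap usage in Giraph workers can push actual resident memory well above it.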

Cluster summary from the Job Tr