Hi,

I have a computer cluster consisting of 15 slave machines and 1 master
machine.

Each slave machine has two Xeon E5-2620 CPUs. With Hyper-Threading (HT)
enabled, that gives 24 hardware threads per machine.

I am wondering how to set the parameters so that a Giraph job runs in
parallel across my cluster.

I am using the following command to run the PageRank algorithm:

hadoop jar ~/giraph-examples.jar org.apache.giraph.GiraphRunner
SimplePageRank -vif PageRankInputFormat -vip /input -vof
PageRankOutputFormat -op /pagerank -w 1 -mc
SimplePageRank\$SimplePageRankMasterCompute -wc
SimplePageRank\$SimplePageRankWorkerContext

In particular,

1) I know I can use "-w" to specify the number of workers. As I understand
it, the number of workers equals the number of Hadoop mappers, not counting
the one used for ZooKeeper. In my case (15 slave machines), which number
should I choose? Is 15 a good choice? I ask because when I specify a large
number, e.g. 100, the mappers hang.

2) I know I can use "-Dgiraph.numComputeThreads=1" to specify the number of
vertex compute threads. However, if I set it to 10, the total runtime is
much longer than with the default. I believe the default is 1, based on the
source code. If I want to use this parameter, which value should I choose?

3) While the Giraph job is running, I use the "top" command to monitor CPU
usage on the slave machines. The java process uses 200%-300% CPU. However,
if I change the number of vertex compute threads to 10, the java process
uses 800% CPU. The relationship does not look linear, and I would like to
know why.
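
For reference on question 2, here is a sketch of how the thread count can be
passed together with the other parameters. It assumes that -D options go
right after the GiraphRunner class name, before the computation class. The
THREADS, WORKERS and CMD variables are only illustrative, and the snippet
prints the command instead of running it:

```shell
# Sketch only: build the command as a string and print it for inspection.
THREADS=10   # illustrative: the value tried in question 2
WORKERS=1    # illustrative: same as "-w 1" in the command above

# Assumption: the -D option is placed directly after GiraphRunner.
CMD="hadoop jar $HOME/giraph-examples.jar org.apache.giraph.GiraphRunner \
-Dgiraph.numComputeThreads=$THREADS \
SimplePageRank -vif PageRankInputFormat -vip /input \
-vof PageRankOutputFormat -op /pagerank -w $WORKERS \
-mc SimplePageRank\$SimplePageRankMasterCompute \
-wc SimplePageRank\$SimplePageRankWorkerContext"

echo "$CMD"
```

Changing THREADS and WORKERS in one place makes it easier to compare runs
with different settings.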


Thanks for your help.

Best,

-Yi
