Re: how to use out of core options

2013-10-18 Thread Jianqiang Ou
Thanks, I just tried another dataset, which can be handled successfully
by my cluster in memory. However, exceptions still occurred with the
-Dgiraph.useOutOfCoreGraph=true option, while it works fine with only the
-Dgiraph.useOutOfCoreMessages=true
option. So do you still think it is a directory permission issue?

By the way, the directory path you mentioned should be the directory that
stores the out-of-core partitions and messages in the local file system,
right? But how do I know where it is? It should be determined by Giraph
rather than by the application, right?
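(Not an authoritative answer, but for anyone reading the archives: Giraph exposes configuration properties for the local spill directories. The property names below are a recollection of GiraphConstants and are an assumption to verify against your Giraph version; the paths are placeholders. Config fragment only, requires the Giraph jar.)

```java
import org.apache.giraph.conf.GiraphConfiguration;

// Assumed property names -- verify against GiraphConstants in your version.
// These control where out-of-core partitions and messages are spilled on
// each worker's local file system.
GiraphConfiguration conf = new GiraphConfiguration();
conf.set("giraph.partitionsDirectory", "/tmp/giraph/partitions"); // placeholder path
conf.set("giraph.messagesDirectory", "/tmp/giraph/messages");     // placeholder path
```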

Thanks for your time and patience again,
Jian


On Thu, Oct 17, 2013 at 5:32 PM, Jyotirmoy Sundi sundi...@gmail.com wrote:

 Apart from these, you might also want to check the permissions of the
 directory path where the offloading of vertices and messages happens.
 Ideally, Giraph's out-of-core mode is not meant for graphs much bigger
 than the cluster can handle in memory; using Giraph defeats the purpose in
 that case.



 On Thu, Oct 17, 2013 at 8:13 AM, Jianqiang Ou oujianqiang...@gmail.com wrote:

 Thanks very much, so are you saying that if I use -Dgiraph.maxPartitionsInMemory
 and -Dgiraph.maxMessagesInMemory to set them both to smaller numbers, then
 it might work?

 Thanks again,
 Jian


 On Thu, Oct 17, 2013 at 12:56 AM, Jyotirmoy Sundi sundi...@gmail.com wrote:

 You need to tune it for your cluster. This is what is mentioned in the docs:
 *It is difficult to decide a general policy to use out-of-core
 capabilities*, as it depends on the behavior of the algorithm and the
 input graph. The exact number of partitions and messages to keep in memory
 depends on the cluster capabilities, the number of messages produced per
 superstep, and the number of active vertices per superstep. Moreover, it
 depends on the type and size of vertex values and messages. For example,
 algorithms such as Belief Propagation tend to keep large vertex values,
 while algorithms such as clique computations tend to send large messages
 along. Hence, which feature to rely on more depends on your algorithm.
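As a sketch of the knobs discussed in this thread, the same options can also be set programmatically on a GiraphConfiguration (config fragment only, requires the Giraph jar; the numeric values are placeholders to tune per cluster, not recommendations):

```java
import org.apache.giraph.conf.GiraphConfiguration;

// Config fragment: enable out-of-core and bound what stays in memory.
// Values are placeholders -- tune per the guidance quoted above.
GiraphConfiguration conf = new GiraphConfiguration();
conf.setBoolean("giraph.useOutOfCoreGraph", true);
conf.setInt("giraph.maxPartitionsInMemory", 10);    // partitions kept in RAM
conf.setBoolean("giraph.useOutOfCoreMessages", true);
conf.setInt("giraph.maxMessagesInMemory", 1000000); // messages kept in RAM
```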

 Thanks
  Sundi


 On Wed, Oct 16, 2013 at 9:41 PM, Jianqiang Ou 
 oujianqiang...@gmail.com wrote:

 Hi Sundi,

 I just tried your method, but somehow the job failed; attached is the
 job history. It ran fine without the out-of-core options. Do you have
 any clue why that is?

 The command I used to run the program is below:

 $HADOOP_HOME/bin/hadoop jar
 $GIRAPH_HOME/giraph-examples/target/giraph-examples-1.1.0-SNAPSHOT-for-hadoop-0.20.203.0-jar-with-dependencies.jar
 org.apache.giraph.GiraphRunner -Dgiraph.useOutOfCoreMessages=true
 -Dgiraph.useOutOfCoreGraph=true
 org.apache.giraph.examples.SimplePageRankComputation -vif
 org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat
 -vip /user/andy/input/tiny_graph.txt -vof
 org.apache.giraph.io.formats.IdWithValueTextOutputFormat -op
 /user/andy/output/page3 -w 3 -mc
 org.apache.giraph.examples.SimplePageRankComputation\$SimplePageRankMasterCompute

 Many thanks,

 Jianqiang

 On Wed, Oct 16, 2013 at 12:11 PM, Jianqiang Ou 
 oujianqiang...@gmail.com wrote:

 got it, thank you very much!


 On Wed, Oct 16, 2013 at 10:43 AM, Jyotirmoy Sundi 
 sundi...@gmail.com wrote:

 Put it as -Dgiraph.useOutOfCoreMessages=true
 -Dgiraph.useOutOfCoreGraph=true after GiraphRunner,
 like
 hadoop jar giraph.jar org.apache.giraph.GiraphRunner
 -Dgiraph.useOutOfCoreMessages=true
 -Dgiraph.useOutOfCoreGraph=true ...




 On Wed, Oct 16, 2013 at 7:29 AM, Jianqiang Ou 
 oujianqiang...@gmail.com wrote:

 Hi, I have a question about out-of-core Giraph. It is said that,
 in order to use the disk to store partitions, we need to set
 giraph.useOutOfCoreGraph=true, but where should I put this
 option?

 BTW, I am just trying to use the pagerank or shortestpath example to
 test the out of core performance of my cluster.

 Thanks very much,
 Jian




 --
 Best Regards,
 Jyotirmoy Sundi
 Data Engineer,
 Admobius

 San Francisco, CA 94158









 --
 Best Regards,
 Jyotirmoy Sundi
 Data Engineer,
 Admobius

 San Francisco, CA 

Re: Master always fails on dataset

2013-10-18 Thread Simon McGloin
Thanks Claudio. Yes, the machines are homogeneous. Unfortunately I don't have
Ganglia installed. You were right, it is a memory issue. I've reduced the
number of partitions in memory down to 1 with -Dgiraph.maxPartitionsInMemory=1,
and now my jobs are failing because they run out of disk space on HDFS. Each
HDFS mount has 100 GB of space. I will increase the size of HDFS and order more
memory next week. Is there any way to calculate the memory requirements of a
Giraph job? I presume it depends on the algorithm being run.
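(A back-of-envelope sketch only: the per-element byte counts below are pure assumptions and depend heavily on the algorithm and on your Text ids/values, so substitute measured sizes; per-superstep message buffers come on top of this.)

```java
// Rough memory estimate for the graph structure alone; all byte counts
// are assumed placeholders, not measurements.
public class MemoryEstimate {
    public static void main(String[] args) {
        long vertices = 2_000_000L;
        long edges = 20_000_000L;
        long bytesPerVertex = 200; // assumption: Text id + value + overhead
        long bytesPerEdge = 50;    // assumption: Text target id + edge value
        int workers = 60;          // the graph is partitioned over 60 mappers

        long totalBytes = vertices * bytesPerVertex + edges * bytesPerEdge;
        long perWorkerMb = totalBytes / workers / (1024 * 1024);
        System.out.println(totalBytes + " " + perWorkerMb);
    }
}
```

With these assumed sizes each of the 60 mappers needs only a few tens of MB for the graph itself, which suggests the heap pressure here comes from message buffers rather than the structure.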


On Thu, Oct 17, 2013 at 6:42 PM, Claudio Martella 
claudio.marte...@gmail.com wrote:

 Try decreasing the number of partitions you keep in memory. You're running
 out of memory. Also, are your nodes homogeneous? It could be one particular
 machine swapping or something. If you have Ganglia, try investigating the
 memory usage.


 On Thu, Oct 17, 2013 at 7:39 PM, Simon McGloin simonmcgl...@gmail.com wrote:

 Hey Guys.

 I have a problem running my Giraph job on a dataset with 20,000,000 edges
 and 2,000,000 vertices. All the vertices are Text-based. The Giraph job
 works perfectly on smaller datasets but always fails on larger ones. The
 setup I have is a 3-node cluster, each node with 24 cores and 24 GB of RAM.
 The cluster has a total of 60 mappers, each with mapred.child.java.opts set
 to -Xmx1000m.
 If I don't use the out-of-core option, the job fails due to running
 out of Java heap space. When I use -Dgiraph.useOutOfCoreGraph=true, the
 master eventually fails due to a worker disconnecting from ZooKeeper. The
 worker just throws a warning and doesn't actually fail. I've been using the
 -Dgiraph.checkpointFrequency=1 option, but this doesn't seem to restart the
 mapper. I'm new to ZooKeeper too, so if this is a ZooKeeper problem, let
 me know and I can investigate it as such.

 Below are the options I'm using and the errors I'm currently getting.
 Any help or tips are appreciated,
 Simon

 Options:
 -Dgiraph.zkList=10.10.5.103:2181,10.10.5.104:2181,10.10.5.105:2181
 -Dgiraph.checkpointFrequency=1
 -Dgiraph.useOutOfCoreGraph=true
 -Dgiraph.zkSessionMsecTimeout=60
 -Dgiraph.numComputeThreads=2

 Master Log:
 2013-10-17 18:19:34,638 INFO org.apache.giraph.master.BspServiceMaster:
 barrierOnWorkerList: 0 out of 50 workers finished on superstep 1 on path
 /_hadoopBsp/job_201310161506_0064/_applicationAttemptsDir/0/_superstepDir/1/_workerWroteCheckpointDir
 2013-10-17 18:20:52,105 ERROR org.apache.giraph.master.BspServiceMaster:
 superstepChosenWorkerAlive: Missing chosen worker Worker(hostname=
 node1.mycompany.com, MRtaskID=30, port=30030) on superstep 1
 2013-10-17 18:20:52,106 INFO org.apache.giraph.master.MasterThread:
 masterThread: Coordination of superstep 1 took 78.851 seconds ended with
 state WORKER_FAILURE and is now on superstep 1
 2013-10-17 18:20:52,112 ERROR org.apache.giraph.master.MasterThread:
 masterThread: Master algorithm failed with RuntimeException
 java.lang.RuntimeException: restartFromCheckpoint: KeeperException
 at
 org.apache.giraph.master.BspServiceMaster.restartFromCheckpoint(BspServiceMaster.java:1185)
  at org.apache.giraph.master.MasterThread.run(MasterThread.java:135)
 Caused by: org.apache.zookeeper.KeeperException$NoNodeException:
 KeeperErrorCode = NoNode for
 /_hadoopBsp/job_201310161506_0064/_vertexInputSplitDir
  at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
 at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
  at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:728)
 at org.apache.giraph.zk.ZooKeeperExt.deleteExt(ZooKeeperExt.java:307)
  at
 org.apache.giraph.master.BspServiceMaster.restartFromCheckpoint(BspServiceMaster.java:1177)
 ... 1 more
 2013-10-17 18:20:52,115 FATAL org.apache.giraph.graph.GraphMapper:
 uncaughtException: OverrideExceptionHandler on thread
 org.apache.giraph.master.MasterThread, msg = java.lang.RuntimeException:
 restartFromCheckpoint: KeeperException, exiting...
 java.lang.IllegalStateException: java.lang.RuntimeException:
 restartFromCheckpoint: KeeperException
 at org.apache.giraph.master.MasterThread.run(MasterThread.java:181)
 Caused by: java.lang.RuntimeException: restartFromCheckpoint:
 KeeperException
 at
 org.apache.giraph.master.BspServiceMaster.restartFromCheckpoint(BspServiceMaster.java:1185)
  at org.apache.giraph.master.MasterThread.run(MasterThread.java:135)
 Caused by: org.apache.zookeeper.KeeperException$NoNodeException:
 KeeperErrorCode = NoNode for
 /_hadoopBsp/job_201310161506_0064/_vertexInputSplitDir
  at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
 at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
  at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:728)
 at org.apache.giraph.zk.ZooKeeperExt.deleteExt(ZooKeeperExt.java:307)
  at
 org.apache.giraph.master.BspServiceMaster.restartFromCheckpoint(BspServiceMaster.java:1177)
 ... 1 more


 Worker 30 log:
 2013-10-17 18:19:07,309 INFO
 

RE: Master always fails on dataset

2013-10-18 Thread Tunvall, Fredrik
Please disregard - Outlook sent it to the wrong address. Sorry. - F

From: Tunvall, Fredrik [mailto:fredrik.tunv...@ovum.com]
Sent: Friday, October 18, 2013 12:25 PM
To: user@giraph.apache.org
Subject: RE: Master always fails on dataset

I will reach out right now


Re: How to specify parameters in order to run giraph job in parallel

2013-10-18 Thread YAN Da
Dear Claudio Martella,

According to https://reviews.apache.org/r/7990/diff/?page=2, Giraph
currently organizes vertices as byte streams, probably in pages.

The URL says: "This also significantly reduces GC time, as there are less
objects to GC."

Why is the "also" there? I mean, is reducing GC time the only reason for
doing serialization?

Regards,
Da

 Dear Claudio Martella,

 I don't quite get what you mean. Our cluster has 15 servers, each with 24
 cores, so ideally there can be 15*24 threads/partitions working in parallel,
 right? (Perhaps minus one for ZooKeeper.)

 However, when we set the -Dgiraph.numComputeThreads option, we find that
 we cannot have even 20 threads, and when it is set to 10, the CPU usage is
 just about double that of the default setting, not anything close to
 100*numComputeThreads%.

 How can we set it up on our servers to utilize all the processors?

 Regards,
 Da Yan

 It actually depends on the setup of your cluster.

 Ideally, with 15 nodes (tasktrackers) you'd want 1 mapper slot per node
 (ideally to run Giraph), so that you would have 14 workers, one per
 computing node, plus one for master+ZooKeeper. Once that is reached, you
 would have a number of compute threads equal to the number of threads
 that you can run on each node (24 in your case).

 Does this make sense to you?
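The arithmetic above, spelled out (sizing logic only, not Giraph API; the numbers match the example in this thread):

```java
// One mapper slot per tasktracker; one node's slot goes to master+zookeeper.
public class ClusterSizing {
    public static void main(String[] args) {
        int nodes = 15;          // tasktrackers
        int coresPerNode = 24;   // hardware threads per node
        int workers = nodes - 1; // one slot reserved for master+zookeeper
        int computeThreadsPerWorker = coresPerNode;
        System.out.println(workers + " " + workers * computeThreadsPerWorker);
    }
}
```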


 On Thu, Oct 17, 2013 at 5:04 PM, Yi Lu luyi0...@gmail.com wrote:

 Hi,

 I have a computer cluster consisting of 15 slave machines and 1 master
 machine.

 On each slave machine, there are two Xeon E5-2620 CPUs. With the help
 of
 HT, there are 24 threads.

 I am wondering how to specify parameters in order to run giraph job in
 parallel on my cluster.

 I am using the following parameters to run a pagerank algorithm.

 hadoop jar ~/giraph-examples.jar org.apache.giraph.GiraphRunner
 SimplePageRank -vif PageRankInputFormat -vip /input -vof
 PageRankOutputFormat -op /pagerank -w 1 -mc
 SimplePageRank\$SimplePageRankMasterCompute -wc
 SimplePageRank\$SimplePageRankWorkerContext

 In particular,

 1) I know I can use “-w” to specify the number of workers. In my opinion,
 the number of workers equals the number of mappers in Hadoop, excluding
 ZooKeeper. Therefore, in my case (15 slave machines), which number should
 be chosen? Is 15 a good choice? I find that if I input a large number,
 e.g. 100, the mappers will hang.

 2) I know I can use “-Dgiraph.numComputeThreads=1” to specify the number
 of vertex computing threads. However, if I set it to 10, the total
 runtime is much longer than with the default. I think the default is 1,
 which is found in the source code. I wonder, if I want to use this
 parameter, which number should be chosen.

 3) When the Giraph job is running, I use the “top” command to monitor CPU
 usage on the slave machines. I find that the java process can use 200%-300%
 CPU. However, if I change the number of vertex computing threads to 10,
 the java process can use 800% CPU. I think it is not a linear relation,
 and I want to know why.


 Thanks for your help.

 Best,

 -Yi




 --
Claudio Martella
claudio.marte...@gmail.com










Re: How to specify parameters in order to run giraph job in parallel

2013-10-18 Thread Sebastian Schelter
Da,

Holding objects in serialized form, as bytes in byte arrays, consumes much
less memory than holding them as Java objects (which have a huge per-object
overhead). I think that is the other main reason for serialization.
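To illustrate the point (plain Java, not Giraph internals): a million message values held as individual objects are a million things for the GC to trace, while packed into a single byte buffer they are one object.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Pack one million double "messages" into a single growing byte buffer.
public class SerializedStore {
    public static void main(String[] args) throws IOException {
        int n = 1_000_000;
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buffer);
        for (int i = 0; i < n; i++) {
            out.writeDouble(i * 0.5); // 8 bytes per value, no object header
        }
        out.flush();
        // One heap object instead of a million, and no per-object header
        // (roughly 16 bytes each on a typical JVM).
        System.out.println(buffer.size());
    }
}
```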

--sebastian

On 18.10.2013 19:28, YAN Da wrote:
 Dear Claudio Martella,
 
 According to https://reviews.apache.org/r/7990/diff/?page=2, Giraph
 currently organizes vertices as byte streams, probably in pages.
 
 The URL says: "This also significantly reduces GC time, as there are less
 objects to GC."
 
 Why is the "also" there? I mean, is reducing GC time the only reason for
 doing serialization?
 
 Regards,
 Da
 

Re: knowing about the vertex id of the sender of the message.

2013-10-18 Thread Haowei Liu
Sorry to go off topic a bit.

I made my own message structure, but I was wondering: are the received
messages sorted by sender id?

I am not able to verify this, since calls to Log.debug() don't seem to
print anything.

Thanks,

Haowei


On Thu, Oct 17, 2013 at 9:31 AM, Sebastian Schelter s...@apache.org wrote:

 Hi Jyoti,

 You can simply make the sender id a part of the message.

 Best,
 Sebastian

 On 17.10.2013 18:10, Jyoti Yadav wrote:
  Hi,
  In the vertex computation code, at the start of a superstep every vertex
  processes its received messages. Is there any way for a vertex to know
  who the sender of the message it is currently processing is?
 
  Thanks
  Jyoti
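Following up on Sebastian's suggestion in the quoted thread: a message type that carries the sender's vertex id. This is a sketch with illustrative names; in a real Giraph job the class would implement org.apache.hadoop.io.Writable, whose contract the write/readFields pair mirrors here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Message that carries the sending vertex's id alongside the payload.
public class SenderIdMessage {
    private long senderId;   // id of the vertex that sent the message
    private double payload;  // the actual message value

    public SenderIdMessage() {}                      // needed for deserialization
    public SenderIdMessage(long senderId, double payload) {
        this.senderId = senderId;
        this.payload = payload;
    }

    public void write(DataOutput out) throws IOException {
        out.writeLong(senderId);
        out.writeDouble(payload);
    }

    public void readFields(DataInput in) throws IOException {
        senderId = in.readLong();
        payload = in.readDouble();
    }

    public long getSenderId() { return senderId; }
    public double getPayload() { return payload; }

    // Round-trip demo: serialize one message and read it back.
    public static void main(String[] args) throws IOException {
        SenderIdMessage msg = new SenderIdMessage(42L, 0.15);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        msg.write(new DataOutputStream(bytes));

        SenderIdMessage copy = new SenderIdMessage();
        copy.readFields(new DataInputStream(
                new ByteArrayInputStream(bytes.toByteArray())));
        System.out.println(copy.getSenderId() + " " + copy.getPayload());
    }
}
```

In compute() each vertex would then construct the message with its own id before sending, and the receiver reads getSenderId() from each incoming message.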