Re: how to use out of core options
Thanks, I just tried another dataset, which could be successfully handled by my cluster within memory. However, exceptions still occurred with the -Dgiraph.useOutOfCoreGraph=true option, but it works fine with only the -Dgiraph.useOutOfCoreMessages=true option, so do you still think it is the dir permission issue? By the way, the dir path you mentioned should be the dir used to store the out-of-core partitions and messages in the local file system, right? But how do I know where it is? It should be determined by Giraph instead of the applications, right? Thanks for your time and patience again, Jian

On Thu, Oct 17, 2013 at 5:32 PM, Jyotirmoy Sundi sundi...@gmail.com wrote: Apart from these you might also want to check the permissions of the dir path where offloading of vertices and messages happens. Ideally Giraph is not meant for out-of-core use; if your graph is much bigger than the cluster can handle in memory, using Giraph defeats the purpose in this case.

On Thu, Oct 17, 2013 at 8:13 AM, Jianqiang Ou oujianqiang...@gmail.com wrote: Thanks very much, so are you saying that if I use -Dgiraph.maxPartitionsInMemory and -Dgiraph.maxMessagesInMemory to make them both a smaller number, then it might work? Thanks again, Jian

On Thu, Oct 17, 2013 at 12:56 AM, Jyotirmoy Sundi sundi...@gmail.com wrote: You need to tune it for your cluster. This is what is mentioned in the docs: *It is difficult to decide a general policy to use out-of-core capabilities*, as it depends on the behavior of the algorithm and the input graph. The exact number of partitions and messages to keep in memory depends on the cluster capabilities, the number of messages produced per superstep, and the number of active vertices per superstep. Moreover, it depends on the type and size of vertex values and messages. For example, algorithms such as Belief Propagation tend to keep large vertex values, while algorithms such as clique computations tend to send large messages along. Hence, it depends on your algorithm which feature to rely on more.
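To make the tuning advice above concrete, here is a back-of-envelope sketch of the arithmetic behind choosing giraph.maxPartitionsInMemory. Every figure below (per-vertex and per-edge byte counts, heap size, partition counts) is an invented placeholder, not a measured value; you would plug in numbers for your own graph and workers.

```java
// Back-of-envelope sizing for giraph.maxPartitionsInMemory.
// All figures below are hypothetical placeholders; measure your own
// graph and heap before relying on numbers like these.
public class PartitionSizing {
    public static void main(String[] args) {
        long heapBytesPerWorker = 1000L * 1024 * 1024; // e.g. -Xmx1000m
        double usableFraction = 0.5;    // leave room for messages, buffers, GC headroom
        long vertices = 2_000_000L;
        long edges = 20_000_000L;
        long bytesPerVertex = 200;      // id + value + object overhead (guess)
        long bytesPerEdge = 50;         // target id + edge value (guess)
        int workers = 50;
        int partitionsPerWorker = 10;   // total partitions / workers (guess)

        long graphBytes = vertices * bytesPerVertex + edges * bytesPerEdge;
        long bytesPerPartition = graphBytes / ((long) workers * partitionsPerWorker);
        long usableHeap = (long) (heapBytesPerWorker * usableFraction);
        long maxPartitionsInMemory = Math.max(1, usableHeap / bytesPerPartition);

        System.out.println("bytes per partition ~ " + bytesPerPartition);
        System.out.println("maxPartitionsInMemory ~ " + maxPartitionsInMemory);
    }
}
```

With these made-up inputs the answer comes out far larger than 1, which is the point: out-of-core only pays off once the per-partition footprint times the in-memory partition count actually exceeds the usable heap.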
Thanks Sundi

On Wed, Oct 16, 2013 at 9:41 PM, Jianqiang Ou oujianqiang...@gmail.com wrote: Hi Sundi, I just tried your method, but somehow the job failed; attached is the history of the job, and it was fine without the out-of-core options. Do you have any clue why that is? The command I used to run the program is below:

$HADOOP_HOME/bin/hadoop jar $GIRAPH_HOME/giraph-examples/target/giraph-examples-1.1.0-SNAPSHOT-for-hadoop-0.20.203.0-jar-with-dependencies.jar org.apache.giraph.GiraphRunner -Dgiraph.useOutOfCoreMessages=true -Dgiraph.useOutOfCoreGraph=true org.apache.giraph.examples.SimplePageRankComputation -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat -vip /user/andy/input/tiny_graph.txt -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat -op /user/andy/output/page3 -w 3 -mc org.apache.giraph.examples.SimplePageRankComputation\$SimplePageRankMasterCompute

Many thanks, Jianqiang

On Wed, Oct 16, 2013 at 12:11 PM, Jianqiang Ou oujianqiang...@gmail.com wrote: Got it, thank you very much!

On Wed, Oct 16, 2013 at 10:43 AM, Jyotirmoy Sundi sundi...@gmail.com wrote: Put it as -Dgiraph.useOutOfCoreMessages=true -Dgiraph.useOutOfCoreGraph=true after GiraphRunner, like: hadoop jar giraph.jar org.apache.giraph.GiraphRunner -Dgiraph.useOutOfCoreMessages=true -Dgiraph.useOutOfCoreGraph=true ...

On Wed, Oct 16, 2013 at 7:29 AM, Jianqiang Ou oujianqiang...@gmail.com wrote: Hi, I have a question about out-of-core Giraph. It is said that, in order to use disk to store the partitions, we need to use giraph.useOutOfCoreGraph=true, but where should I put this statement? BTW, I am just trying to use the PageRank or shortest-paths example to test the out-of-core performance of my cluster. Thanks very much, Jian

-- Best Regards, Jyotirmoy Sundi Data Engineer, Admobius San Francisco, CA 94158
Re: Master always fails on dataset
Thanks Claudio. Yes, the machines are homogeneous. Unfortunately I don't have Ganglia installed. You were right, it is a memory issue. I've reduced the number of partitions down to 1 with -Dgiraph.maxPartitionsInMemory=1, and now my jobs are failing due to running out of disk space on HDFS. Each HDFS mount has 100 GB of space. I will increase the size of HDFS and order more memory next week. Is there any way to calculate the memory requirements of a Giraph job? I presume it depends on the algorithm being run.

On Thu, Oct 17, 2013 at 6:42 PM, Claudio Martella claudio.marte...@gmail.com wrote: Try decreasing the number of partitions you keep in memory. You're running out of memory. Also, are your nodes homogeneous? It could be one particular machine swapping or something. If you have Ganglia, try investigating the usage of memory.

On Thu, Oct 17, 2013 at 7:39 PM, Simon McGloin simonmcgl...@gmail.com wrote: Hey guys, I have a problem running my Giraph job on a dataset with 20,000,000 edges and 2,000,000 vertices. All the vertices are Text based. The Giraph job works perfectly on smaller datasets but always fails on larger ones. The setup I have is a 3-node cluster, each node with 24 cores and 24 GB of RAM. The cluster has a total of 60 mappers, each with mapred.child.java.opts set to -Xmx1000m. If I don't use the out-of-core option then the job fails due to running out of Java heap space. When I use -Dgiraph.useOutOfCoreGraph=true then the master eventually fails due to a worker disconnecting from ZooKeeper. The worker just throws a warning and doesn't actually fail. I've been using the -Dgiraph.checkpointFrequency=1 option but this doesn't seem to restart the mapper. I'm new to ZooKeeper too, so if this is a ZooKeeper problem then let me know and I can investigate it as such.
Below are the options I'm using and the errors I'm currently getting. Any help or tips are appreciated, Simon

Options: -Dgiraph.zkList=10.10.5.103:2181,10.10.5.104:2181,10.10.5.105:2181 -Dgiraph.checkpointFrequency=1 -Dgiraph.useOutOfCoreGraph=true -Dgiraph.zkSessionMsecTimeout=60 -Dgiraph.numComputeThreads=2

Master log:
2013-10-17 18:19:34,638 INFO org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of 50 workers finished on superstep 1 on path /_hadoopBsp/job_201310161506_0064/_applicationAttemptsDir/0/_superstepDir/1/_workerWroteCheckpointDir
2013-10-17 18:20:52,105 ERROR org.apache.giraph.master.BspServiceMaster: superstepChosenWorkerAlive: Missing chosen worker Worker(hostname=node1.mycompany.com, MRtaskID=30, port=30030) on superstep 1
2013-10-17 18:20:52,106 INFO org.apache.giraph.master.MasterThread: masterThread: Coordination of superstep 1 took 78.851 seconds ended with state WORKER_FAILURE and is now on superstep 1
2013-10-17 18:20:52,112 ERROR org.apache.giraph.master.MasterThread: masterThread: Master algorithm failed with RuntimeException
java.lang.RuntimeException: restartFromCheckpoint: KeeperException
at org.apache.giraph.master.BspServiceMaster.restartFromCheckpoint(BspServiceMaster.java:1185)
at org.apache.giraph.master.MasterThread.run(MasterThread.java:135)
Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /_hadoopBsp/job_201310161506_0064/_vertexInputSplitDir
at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:728)
at org.apache.giraph.zk.ZooKeeperExt.deleteExt(ZooKeeperExt.java:307)
at org.apache.giraph.master.BspServiceMaster.restartFromCheckpoint(BspServiceMaster.java:1177)
... 1 more
2013-10-17 18:20:52,115 FATAL org.apache.giraph.graph.GraphMapper: uncaughtException: OverrideExceptionHandler on thread org.apache.giraph.master.MasterThread, msg = java.lang.RuntimeException: restartFromCheckpoint: KeeperException, exiting...
java.lang.IllegalStateException: java.lang.RuntimeException: restartFromCheckpoint: KeeperException
at org.apache.giraph.master.MasterThread.run(MasterThread.java:181)
Caused by: java.lang.RuntimeException: restartFromCheckpoint: KeeperException
at org.apache.giraph.master.BspServiceMaster.restartFromCheckpoint(BspServiceMaster.java:1185)
at org.apache.giraph.master.MasterThread.run(MasterThread.java:135)
Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /_hadoopBsp/job_201310161506_0064/_vertexInputSplitDir
at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:728)
at org.apache.giraph.zk.ZooKeeperExt.deleteExt(ZooKeeperExt.java:307)
at org.apache.giraph.master.BspServiceMaster.restartFromCheckpoint(BspServiceMaster.java:1177)
... 1 more

Worker 30 log:
2013-10-17 18:19:07,309 INFO
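One likely contributor to the HDFS space problem above: with -Dgiraph.checkpointFrequency=1, a checkpoint is written every single superstep. A rough sketch of the arithmetic, assuming a checkpoint persists graph state plus pending messages (the exact checkpoint contents vary by Giraph version) and using invented sizes:

```java
// Rough estimate of HDFS space consumed by checkpointing.
// With giraph.checkpointFrequency=1 a checkpoint is written every
// superstep, which adds up fast. All sizes here are invented for
// illustration; the real checkpoint contents depend on the Giraph version.
public class CheckpointFootprint {
    public static void main(String[] args) {
        long graphStateBytes = 10L * 1024 * 1024 * 1024; // serialized vertices+edges (guess)
        long messageBytes = 5L * 1024 * 1024 * 1024;     // messages in flight (guess)
        int supersteps = 30;
        int checkpointFrequency = 1;                     // -Dgiraph.checkpointFrequency=1

        int checkpoints = supersteps / checkpointFrequency;
        long totalBytes = (graphStateBytes + messageBytes) * checkpoints;
        System.out.println("checkpoints written: " + checkpoints);
        System.out.println("total checkpoint data: " + (totalBytes >> 30) + " GiB");
    }
}
```

Even with these modest made-up figures, the total comfortably exceeds a 100 GB HDFS mount, which is consistent with what Simon reports; a larger checkpoint frequency (checkpoint every N supersteps) trades recovery granularity for disk space.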
RE: Master always fails on dataset
Please disregard - Outlook sent it to the wrong address. Sorry. - F

From: Tunvall, Fredrik [mailto:fredrik.tunv...@ovum.com] Sent: Friday, October 18, 2013 12:25 PM To: user@giraph.apache.org Subject: RE: Master always fails on dataset

I will reach out right now

From: Simon McGloin [mailto:simonmcgl...@gmail.com] Sent: Friday, October 18, 2013 12:24 PM To: user@giraph.apache.org Subject: Re: Master always fails on dataset
Re: How to specify parameters in order to run giraph job in parallel
Dear Claudio Martella, According to https://reviews.apache.org/r/7990/diff/?page=2, Giraph currently organizes vertices as byte streams, probably in pages. The page says "This also significantly reduces GC time, as there are less objects to GC." Why is "also" there? I mean, is reducing GC time the only reason for doing serialization? Regards, Da

Dear Claudio Martella, I don't quite get what you mean. Our cluster has 15 servers, each with 24 cores, so ideally there can be 15*24 threads/partitions working in parallel, right? (Perhaps deduct one for ZooKeeper.) However, when we set the -Dgiraph.numComputeThreads option, we find that we cannot have even 20 threads, and when it is set to 10, the CPU usage only roughly doubles that of the default setting, nothing close to 100*numComputeThreads%. How can we set it up to utilize all the processors on our servers? Regards, Da Yan

It actually depends on the setup of your cluster. Ideally, with 15 nodes (tasktrackers) you'd want 1 mapper slot per node (ideal for running Giraph), so that you would have 14 workers, one per computing node, plus one for master+ZooKeeper. Once that is reached, you would have a number of compute threads equal to the number of threads that you can run on each node (24 in your case). Does this make sense to you?

On Thu, Oct 17, 2013 at 5:04 PM, Yi Lu luyi0...@gmail.com wrote: Hi, I have a computer cluster consisting of 15 slave machines and 1 master machine. On each slave machine, there are two Xeon E5-2620 CPUs. With the help of HT, there are 24 threads. I am wondering how to specify parameters in order to run a Giraph job in parallel on my cluster. I am using the following parameters to run a PageRank algorithm.
hadoop jar ~/giraph-examples.jar org.apache.giraph.GiraphRunner SimplePageRank -vif PageRankInputFormat -vip /input -vof PageRankOutputFormat -op /pagerank -w 1 -mc SimplePageRank\$SimplePageRankMasterCompute -wc SimplePageRank\$SimplePageRankWorkerContext

In particular:
1) I know I can use “-w” to specify the number of workers. In my opinion, the number of workers equals the number of mappers in Hadoop, except ZooKeeper. Therefore, in my case (15 slave machines), which number should be chosen? Is 15 a good choice? I find that if I input a large number, e.g. 100, the mappers will hang.
2) I know I can use “-Dgiraph.numComputeThreads=1” to specify the number of vertex computing threads. However, if I set it to 10, the total runtime is much longer than with the default. I think the default is 1, which I found in the source code. If I want to use this parameter, which number should be chosen?
3) When the Giraph job is running, I use the “top” command to monitor CPU usage on the slave machines. I find that the Java process can use 200%-300% CPU. However, if I change the number of vertex computing threads to 10, the Java process can use 800% CPU. This is not a linear relation and I want to know why.

Thanks for your help. Best, -Yi

-- Claudio Martella claudio.marte...@gmail.com
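The interplay between compute threads and partitions can be sketched with a plain thread pool; this is a stand-in for Giraph's internal scheduling, not its actual code, and the partition count and "compute" body are invented. The point it illustrates: each worker farms its local partitions out to a fixed-size pool, so raising giraph.numComputeThreads beyond the number of partitions (or cores) available on that worker cannot add speedup, which is one reason CPU usage does not scale linearly with the thread count.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Sketch of how a per-worker compute-thread pool processes partitions.
// Numbers and the "compute" body are illustrative stand-ins.
public class ComputeThreadsSketch {
    public static void main(String[] args) throws InterruptedException {
        int numComputeThreads = 4;   // cf. -Dgiraph.numComputeThreads=4
        int partitionsOnWorker = 8;  // partitions assigned to this worker (guess)
        ExecutorService pool = Executors.newFixedThreadPool(numComputeThreads);
        AtomicLong processed = new AtomicLong();

        for (int p = 0; p < partitionsOnWorker; p++) {
            final int partition = p;
            pool.submit(() -> {
                // stand-in for computing all vertices in one partition
                processed.addAndGet(1000L * (partition + 1));
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println("vertices processed: " + processed.get());
    }
}
```

With 8 partitions, 4 threads keep the pool busy, while 24 threads would leave most of them idle; concurrency is capped by min(threads, partitions, cores).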
Re: How to specify parameters in order to run giraph job in parallel
Da, holding objects in serialized form as bytes in byte arrays consumes much less memory than holding them as Java objects (which have a huge overhead); I think that is the other main reason for serialization. --sebastian
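Sebastian's point can be illustrated with plain JDK streams: N small records held as Java objects mean N object headers and references for the GC to trace, while serializing them into a single byte[] leaves essentially one heap object. The (long id, double value) record layout below is an arbitrary example, not Giraph's actual on-heap format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Pack many small records into one byte[]: far fewer objects for the GC
// and no per-object header overhead. Layout (long, double) is illustrative.
public class SerializedStore {
    public static void main(String[] args) throws IOException {
        int n = 100_000;
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        for (int i = 0; i < n; i++) {
            out.writeLong(i);         // vertex id
            out.writeDouble(i * 0.5); // vertex value
        }
        out.flush();
        byte[] packed = buf.toByteArray();
        // 16 bytes per record, held in a single array object
        System.out.println("records: " + n + ", bytes: " + packed.length);

        // reading back is a sequential scan, which is how byte-array-backed
        // stores typically iterate their vertices
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(packed));
        System.out.println("first record: " + in.readLong() + " -> " + in.readDouble());
    }
}
```

So the GC benefit quoted from the review ("less objects to GC") and the raw memory saving Sebastian describes are two sides of the same design choice.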
Re: knowing about the vertex id of the sender of the message.
Sorry to be a bit off topic. I made my own message structure, but I was wondering: are the received messages sorted based on sender id? I am not able to verify this since the call to Log.debug() doesn't seem to print out anything. Thanks, Haowei

On Thu, Oct 17, 2013 at 9:31 AM, Sebastian Schelter s...@apache.org wrote: Hi Jyoti, You can simply make the sender id a part of the message. Best, Sebastian

On 17.10.2013 18:10, Jyoti Yadav wrote: Hi, In the vertex computation code, at the start of the superstep every vertex processes its received messages. Is there any way for the vertex to know the sender of the message it is currently processing? Thanks, Jyoti
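Sebastian's suggestion (carrying the sender's vertex id inside the message payload) could look roughly like this. In real Giraph code the message class would implement org.apache.hadoop.io.Writable; the sketch below mirrors that pattern using only the JDK, and the field names are illustrative. Note also that, in the Pregel model Giraph follows, received messages generally carry no ordering guarantee, so if sender order matters it should be imposed inside compute().

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// A message that carries its sender's vertex id. Mirrors Hadoop's
// Writable pattern (write/readFields) with plain JDK streams.
public class SenderAwareMessage {
    long senderId;
    double value;

    SenderAwareMessage() {}
    SenderAwareMessage(long senderId, double value) {
        this.senderId = senderId;
        this.value = value;
    }

    void write(DataOutput out) throws IOException {
        out.writeLong(senderId);
        out.writeDouble(value);
    }

    void readFields(DataInput in) throws IOException {
        senderId = in.readLong();
        value = in.readDouble();
    }

    public static void main(String[] args) throws IOException {
        // round-trip one message, as the framework would between workers
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        new SenderAwareMessage(42L, 0.85).write(new DataOutputStream(buf));

        SenderAwareMessage received = new SenderAwareMessage();
        received.readFields(new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
        System.out.println("message from vertex " + received.senderId + ": " + received.value);
    }
}
```

The receiving vertex then reads getSenderId() equivalents straight off each message in compute(), at the cost of 8 extra bytes per message.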