Re: Error when using ArrayListWritable<Text> as message
Thanks for the advice, Manuel.

I created a TextArrayListMessage object to use as the message between supersteps. I had to add a line that clears the ArrayList in the readFields method; otherwise, during the next superstep, the list contained unwanted Text objects. It seems that the TextArrayListMessage instance gets reused and readFields just keeps appending to textArrayList, e.g.:

    @Override
    public void readFields(DataInput in) throws IOException {
        int numFields = in.readInt();
        textArrayList.clear(); // Have to clear the list or get unexpected results
        for (int i = 0; i < numFields; i++) {
            Text t = new Text(WritableUtils.readCompressedByteArray(in));
            textArrayList.add(t);
        }
    }

On Wed, Oct 16, 2013 at 7:21 PM, Manuel Lagang manuellag...@gmail.com wrote:

I think you need to have your message value class as TextArrayListMessage instead of ArrayListWritable<Text>. That might require you to move TextArrayListMessage outside of ArrayListTextBug.

On Wed, Oct 16, 2013 at 10:01 AM, Simon McGloin simonmcgl...@gmail.com wrote:

Hey Guys,

I've only been using Giraph for a few days, so I am very new to it. I'm currently using Giraph 1.0.0. I'm getting the error below when I try to send an ArrayListWritable<Text> message. The error happens between supersteps. If you run the sample code I've included, "Superstep 1" never gets printed because the job fails after Superstep 0.

Is this a bug, or am I doing something wrong? In my full code I need to be able to send a list of Text-based vertex ids between supersteps. Should I stop using org.apache.hadoop.io.Text and implement my own writable object instead?

Any help is appreciated.
Regards,
Simon

Caused by: java.util.concurrent.ExecutionException: java.lang.IllegalArgumentException: createMessageValue: Failed to instantiate
    at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:232)
    at java.util.concurrent.FutureTask.get(FutureTask.java:91)
    at org.apache.giraph.utils.ProgressableUtils$FutureWaitable.waitFor(ProgressableUtils.java:271)
    at org.apache.giraph.utils.ProgressableUtils.waitFor(ProgressableUtils.java:143)
    ... 13 more
Caused by: java.lang.IllegalArgumentException: createMessageValue: Failed to instantiate
    at org.apache.giraph.conf.ImmutableClassesGiraphConfiguration.createMessageValue(ImmutableClassesGiraphConfiguration.java:581)
    at org.apache.giraph.utils.ByteArrayVertexIdMessages.createData(ByteArrayVertexIdMessages.java:66)
    at org.apache.giraph.utils.ByteArrayVertexIdMessages.createData(ByteArrayVertexIdMessages.java:34)
    at org.apache.giraph.utils.ByteArrayVertexIdData$VertexIdDataIterator.next(ByteArrayVertexIdData.java:205)
    at org.apache.giraph.comm.messages.ByteArrayMessagesPerVertexStore.addPartitionMessages(ByteArrayMessagesPerVertexStore.java:116)
    at org.apache.giraph.comm.requests.SendWorkerMessagesRequest.doRequest(SendWorkerMessagesRequest.java:72)
    at org.apache.giraph.comm.netty.NettyWorkerClientRequestProcessor.doRequest(NettyWorkerClientRequestProcessor.java:470)
    at org.apache.giraph.comm.netty.NettyWorkerClientRequestProcessor.flush(NettyWorkerClientRequestProcessor.java:419)
    at org.apache.giraph.graph.ComputeCallable.call(ComputeCallable.java:193)
    at org.apache.giraph.graph.ComputeCallable.call(ComputeCallable.java:70)
    at org.apache.giraph.utils.LogStacktraceCallable.call(LogStacktraceCallable.java:51)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)

package com.adaptivemobile.tarantula.batchlayer.giraph.run;

import java.io.IOException;

import org.apache.giraph.graph.Vertex;
import org.apache.giraph.utils.ArrayListWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;

public class ArrayListTextBug extends Vertex<Text, NullWritable, NullWritable, ArrayListWritable<Text>> {
    @Override
    public void compute(Iterable<ArrayListWritable<Text>> messages) throws IOException {
        if (getSuperstep() == 0) {
            System.out.println("\nSUPERSTEP 0 - " + getId() + "\n--");
            TextArrayListMessage initialMessage = new TextArrayListMessage();
            initialMessage.add(getId());
            this.sendMessageToAllEdges(initialMessage);
            System.out.println("Vertex " + getId() + " sends TextArrayListMessage to " + getNumEdges() + " edges");
        }
        if (getSuperstep() == 1) {
            System.out.println("\nSUPERSTEP 1 - " + getId() + "\n--");
        }
    }
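
For readers who want to see the reuse problem in isolation, here is a minimal sketch of it. It is not Simon's TextArrayListMessage: the class name StringListMessage is hypothetical, plain String and java.io stand in for Hadoop's Text and Writable so that it runs without Hadoop or Giraph on the classpath, and the toBytes/fromBytes helpers exist only for the demonstration. The point is only that readFields has to reset the list when the same instance is deserialized into more than once.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a list-of-Text message; String replaces Hadoop's Text.
class StringListMessage {
    private final List<String> items = new ArrayList<>();

    void add(String item) {
        items.add(item);
    }

    List<String> items() {
        return items;
    }

    // Same shape as Writable.write: element count first, then each element.
    void write(DataOutput out) throws IOException {
        out.writeInt(items.size());
        for (String item : items) {
            out.writeUTF(item);
        }
    }

    // Same shape as Writable.readFields: reset state, then read what write() produced.
    void readFields(DataInput in) throws IOException {
        int numFields = in.readInt();
        items.clear(); // without this, a reused instance keeps entries from the previous read
        for (int i = 0; i < numFields; i++) {
            items.add(in.readUTF());
        }
    }

    // Demo helper: serialize to a byte array without exposing IOException.
    byte[] toBytes() {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Demo helper: deserialize into this (possibly already used) instance.
    void fromBytes(byte[] bytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            readFields(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        StringListMessage first = new StringListMessage();
        first.add("v1");
        first.add("v2");
        StringListMessage second = new StringListMessage();
        second.add("v3");

        // One instance receives both messages, as a framework that reuses objects would do.
        StringListMessage reused = new StringListMessage();
        reused.fromBytes(first.toBytes());
        reused.fromBytes(second.toBytes());
        System.out.println(reused.items()); // prints [v3]
    }
}
```

If the items.clear() line is removed, the second read appends to the first and the reused instance ends up holding all three entries, which matches the "unwanted Text objects" described above.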
Re: how to use out of core options
Thanks very much. So are you saying that if I set -Dgiraph.maxPartitionsInMemory and -Dgiraph.maxMessagesInMemory to smaller values, it might work?

Thanks again,
Jian

On Thu, Oct 17, 2013 at 12:56 AM, Jyotirmoy Sundi sundi...@gmail.com wrote:

You need to tune it for your cluster. This is what is mentioned in the docs:

*It is difficult to decide a general policy to use out-of-core capabilities*, as it depends on the behavior of the algorithm and the input graph. The exact number of partitions and messages to keep in memory depends on the cluster capabilities, the number of messages produced per superstep, and number of active vertices per superstep. Moreover, it depends on the type and size of vertex values and messages. For example, algorithms such as Belief Propagation tend to keep large vertex values, while algorithms such as clique computations tend to send large messages along. Hence, it depends on your algorithm what feature to rely on more.

Thanks
Sundi

On Wed, Oct 16, 2013 at 9:41 PM, Jianqiang Ou oujianqiang...@gmail.com wrote:

Hi Sundi,

I just tried your method, but somehow the job failed; attached is the history of the job. It ran fine without the out-of-core options. Do you have any clue why that is?
The command I used to run the program is below:

    $HADOOP_HOME/bin/hadoop jar $GIRAPH_HOME/giraph-examples/target/giraph-examples-1.1.0-SNAPSHOT-for-hadoop-0.20.203.0-jar-with-dependencies.jar \
        org.apache.giraph.GiraphRunner \
        -Dgiraph.useOutOfCoreMessages=true -Dgiraph.useOutOfCoreGraph=true \
        org.apache.giraph.examples.SimplePageRankComputation \
        -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat \
        -vip /user/andy/input/tiny_graph.txt \
        -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
        -op /user/andy/output/page3 \
        -w 3 \
        -mc org.apache.giraph.examples.SimplePageRankComputation\$SimplePageRankMasterCompute

Many thanks,
Jianqiang

On Wed, Oct 16, 2013 at 12:11 PM, Jianqiang Ou oujianqiang...@gmail.com wrote:

Got it, thank you very much!

On Wed, Oct 16, 2013 at 10:43 AM, Jyotirmoy Sundi sundi...@gmail.com wrote:

Put it as -Dgiraph.useOutOfCoreMessages=true -Dgiraph.useOutOfCoreGraph=true after GiraphRunner, like:

    hadoop jar giraph.jar org.apache.giraph.GiraphRunner -Dgiraph.useOutOfCoreMessages=true -Dgiraph.useOutOfCoreGraph=true ...

On Wed, Oct 16, 2013 at 7:29 AM, Jianqiang Ou oujianqiang...@gmail.com wrote:

Hi,

I have a question about out-of-core Giraph. It is said that, in order to use disk to store the partitions, we need to use giraph.useOutOfCoreGraph=true, but where should I put this setting? BTW, I am just trying to use the PageRank or shortest-path example to test the out-of-core performance of my cluster.

Thanks very much,
Jian

--
Best Regards,
Jyotirmoy Sundi
Data Engineer, Admobius
San Francisco, CA 94158
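
The placement rule from this thread (the -D options go right after the GiraphRunner class name and before the computation class and the remaining options) can be sketched as a small argument builder. This is only an illustration of the ordering: the class GiraphRunnerArgs, its build method, and the jar name giraph-examples.jar are placeholders made up for the example, not part of Giraph or Hadoop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper that assembles a "hadoop jar ... GiraphRunner" command line.
class GiraphRunnerArgs {

    // Emits: hadoop jar <jar> GiraphRunner, then every -Dkey=value, then the runner arguments.
    static List<String> build(String jar, Map<String, String> properties, List<String> runnerArgs) {
        List<String> command = new ArrayList<>();
        command.add("hadoop");
        command.add("jar");
        command.add(jar);
        command.add("org.apache.giraph.GiraphRunner");
        // The -D properties come directly after the runner class name.
        for (Map.Entry<String, String> property : properties.entrySet()) {
            command.add("-D" + property.getKey() + "=" + property.getValue());
        }
        // Computation class and options such as -vif, -vip, -w follow.
        command.addAll(runnerArgs);
        return command;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the properties in insertion order.
        Map<String, String> properties = new LinkedHashMap<>();
        properties.put("giraph.useOutOfCoreMessages", "true");
        properties.put("giraph.useOutOfCoreGraph", "true");

        List<String> command = build(
                "giraph-examples.jar", // placeholder jar name
                properties,
                Arrays.asList("org.apache.giraph.examples.SimplePageRankComputation", "-w", "3"));
        System.out.println(String.join(" ", command));
    }
}
```

The same builder could carry giraph.maxPartitionsInMemory or giraph.maxMessagesInMemory once suitable values for the cluster are known; the thread does not give any, so none are shown here.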
Re: How to specify parameters in order to run giraph job in parallel
It actually depends on the setup of your cluster. Ideally, with 15 nodes (tasktrackers) you'd want 1 mapper slot per node (ideally used to run Giraph), so that you would have 14 workers, one per computing node, plus one for master+zookeeper. Once that is reached, you would set the number of compute threads equal to the number of threads that you can run on each node (24 in your case).

Does this make sense to you?

On Thu, Oct 17, 2013 at 5:04 PM, Yi Lu luyi0...@gmail.com wrote:

Hi,

I have a computer cluster consisting of 15 slave machines and 1 master machine. On each slave machine, there are two Xeon E5-2620 CPUs. With the help of HT, there are 24 threads.

I am wondering how to specify parameters in order to run a Giraph job in parallel on my cluster. I am using the following parameters to run a PageRank algorithm:

    hadoop jar ~/giraph-examples.jar org.apache.giraph.GiraphRunner SimplePageRank -vif PageRankInputFormat -vip /input -vof PageRankOutputFormat -op /pagerank -w 1 -mc SimplePageRank\$SimplePageRankMasterCompute -wc SimplePageRank\$SimplePageRankWorkerContext

In particular:

1) I know I can use “-w” to specify the number of workers. As I understand it, the number of workers equals the number of Hadoop mappers, excluding the one used for ZooKeeper. So in my case (15 slave machines), which number should I choose? Is 15 a good choice? I ask because if I specify a large number, e.g. 100, the mappers hang.

2) I know I can use “-Dgiraph.numComputeThreads=1” to specify the number of vertex computing threads. However, if I set it to 10, the total runtime is much longer than with the default. I think the default is 1, which is what I found in the source code. If I want to use this parameter, which number should I choose?

3) While the Giraph job is running, I use the “top” command to monitor CPU usage on the slave machines. I find that the java process uses 200%-300% CPU. However, if I change the number of vertex computing threads to 10, the java process uses 800% CPU. This does not look like a linear relation, and I want to know why.

Thanks for your help.

Best,
-Yi

--
Claudio Martella
claudio.marte...@gmail.com
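
The sizing rule in Claudio's reply is simple arithmetic, sketched below. The class and method names are made up for the illustration, and the result is only the starting point suggested in the reply (one worker per node, minus one node for master+zookeeper, and as many compute threads as hardware threads per node). It assumes one mapper slot per node and is not a tuned recommendation; Yi's own measurement with 10 threads shows that more threads can run slower.

```java
// Hypothetical helper: derives the -w and giraph.numComputeThreads values from cluster size.
class GiraphSizingSketch {

    // One node is set aside for master+zookeeper; every other node runs one worker.
    static int workers(int nodes) {
        if (nodes < 2) {
            throw new IllegalArgumentException("need at least 2 nodes, got " + nodes);
        }
        return nodes - 1;
    }

    // Suggested compute threads per worker: the hardware threads available on one node.
    static int computeThreads(int hardwareThreadsPerNode) {
        if (hardwareThreadsPerNode < 1) {
            throw new IllegalArgumentException("need at least 1 thread, got " + hardwareThreadsPerNode);
        }
        return hardwareThreadsPerNode;
    }

    public static void main(String[] args) {
        int nodes = 15;          // tasktracker nodes, from the thread
        int threadsPerNode = 24; // two Xeon E5-2620 CPUs with HT, from the thread
        System.out.println("-w " + workers(nodes));
        System.out.println("-Dgiraph.numComputeThreads=" + computeThreads(threadsPerNode));
    }
}
```

For the cluster described above this yields 14 workers and 24 compute threads per worker.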
Re: knowing about the vertex id of the sender of the message.
No, you'll have to add it to the message data.

On Thu, Oct 17, 2013 at 6:10 PM, Jyoti Yadav rao.jyoti26ya...@gmail.com wrote:

Hi,

In the vertex computation code, at the start of a superstep every vertex processes its received messages. Is there any way for the vertex to know who the sender of the message it is currently processing is?

Thanks
Jyoti

--
Claudio Martella
claudio.marte...@gmail.com
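
A minimal sketch of "adding it to the message data": the message carries the sender's id next to the payload, and both are written and read back together. The class SenderTaggedMessage, its String id and its double payload are assumptions chosen for the example; plain java.io is used so it runs standalone. In a real job the class would implement Hadoop's Writable with the id type of the graph, and the sending vertex would fill in its own id (for example from getId()) before sending.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical message type that carries the sender's vertex id along with the payload.
class SenderTaggedMessage {
    private String senderId = "";
    private double value;

    SenderTaggedMessage() {
        // No-arg constructor, so an empty instance can be created and then filled by readFields.
    }

    SenderTaggedMessage(String senderId, double value) {
        this.senderId = senderId;
        this.value = value;
    }

    String senderId() {
        return senderId;
    }

    double value() {
        return value;
    }

    // Same shape as Writable.write: sender id first, then the payload.
    void write(DataOutput out) throws IOException {
        out.writeUTF(senderId);
        out.writeDouble(value);
    }

    // Same shape as Writable.readFields: overwrites both fields, in the order they were written.
    void readFields(DataInput in) throws IOException {
        senderId = in.readUTF();
        value = in.readDouble();
    }

    // Demo helper: serialize to a byte array without exposing IOException.
    byte[] toBytes() {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Demo helper: deserialize into this instance.
    void fromBytes(byte[] bytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            readFields(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        SenderTaggedMessage sent = new SenderTaggedMessage("vertex-7", 0.25);
        SenderTaggedMessage received = new SenderTaggedMessage();
        received.fromBytes(sent.toBytes());
        System.out.println(received.senderId() + " sent " + received.value());
    }
}
```

The cost is a few extra bytes per message, which matters for algorithms that send many messages per superstep.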
Re: how to use out of core options
Apart from these, you might also want to check the permissions of the directory path where the offloading of vertices and messages happens. Ideally, Giraph is not meant for out-of-core use: if your graph is much bigger than what the cluster can handle in memory, using Giraph defeats the purpose.

On Thu, Oct 17, 2013 at 8:13 AM, Jianqiang Ou oujianqiang...@gmail.com wrote:

Thanks very much. So are you saying that if I set -Dgiraph.maxPartitionsInMemory and -Dgiraph.maxMessagesInMemory to smaller values, it might work?

Thanks again,
Jian

--
Best Regards,
Jyotirmoy Sundi
Data Engineer, Admobius
San Francisco, CA 94158
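
The permission check mentioned in this reply can be sketched with the standard java.nio.file API. Which directory Giraph spills to depends on the job configuration and is not stated in this thread, so the path is taken from the command line here; the class name SpillDirCheck and the fallback to java.io.tmpdir are choices made only for this example. The check is meaningful only when run on a worker node as the same user that runs the tasks.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical check: can the current user read and write the given local directory?
class SpillDirCheck {

    // True only if the path exists, is a directory, and is both readable and writable.
    static boolean isUsable(Path dir) {
        return Files.isDirectory(dir) && Files.isReadable(dir) && Files.isWritable(dir);
    }

    public static void main(String[] args) {
        // Pass the out-of-core directory as the first argument; falls back to the JVM temp dir.
        String location = args.length > 0 ? args[0] : System.getProperty("java.io.tmpdir");
        Path dir = Paths.get(location);
        System.out.println(dir + " usable: " + isUsable(dir));
    }
}
```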