I don't know, maybe I'm missing something, or there's a bug there as well. I do agree that this is spooky. Armando has also tested it with the WattsStrogatzInputformat, which creates another type of graph. From what I understand, this should not be caused by the topology. I think we should just try to replicate this behavior, hopefully without a very large graph that makes debugging difficult.
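To make the suspicion about the input format concrete, here is a minimal, Giraph-independent sketch of the id aliasing I have in mind (the class and variable names are made up for illustration; whether the out-of-core path actually holds on to the returned Writable instead of copying it is exactly what we'd need to verify):

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;

/** Hypothetical, stand-alone illustration of the suspected id aliasing. */
public class IdReuseSketch {

  public static void main(String[] args) {
    // Suspect pattern: the reader keeps one mutable Writable and reuses it
    // for every parsed line.
    LongWritable reusedId = new LongWritable();
    List<LongWritable> storedIds = new ArrayList<LongWritable>();

    for (long parsed : new long[] { 1L, 2L, 3L }) {
      reusedId.set(parsed);     // same object, mutated in place
      storedIds.add(reusedId);  // consumer keeps the reference, not a copy
    }
    // All three entries alias one object: prints [3, 3, 3]
    System.out.println(storedIds);

    // Safe pattern: allocate a fresh id per line, e.g. in preprocessLine().
    List<LongWritable> freshIds = new ArrayList<LongWritable>();
    for (long parsed : new long[] { 1L, 2L, 3L }) {
      freshIds.add(new LongWritable(parsed));
    }
    // Independent objects: prints [1, 2, 3]
    System.out.println(freshIds);
  }
}

If whatever consumes the reader copies the id right away, the reuse is harmless, which would explain why the in-memory runs look fine; if some path stores the reference, every stored id ends up pointing at the last parsed value.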
On Sat, Feb 15, 2014 at 8:42 PM, Sebastian Schelter <s...@apache.org> wrote:

> I copied the caching from o.a.g.io.formats.IntIntNullTextInputFormat and
> it worked well during my tests (it did not happen that all vertices had
> the same id).
>
> I'm happy to remove this and rerun the tests. It's strange that
> out-of-core works with PageRank on a generated graph, but not with
> Hyperball on the twitter graph. The generated graph has a uniform degree
> distribution, while the twitter graph's degree distribution is heavily
> skewed; can that have an influence on the behavior of ooc?
>
> Best,
> Sebastian
>
> On 02/15/2014 08:32 PM, Claudio Martella wrote:
>
>> Sebastian, I had a look at your vertexinputformat. I think there might
>> be a bug. Why are you caching/reusing the id? This way every vertex
>> parsed by the vertexreader will share the same ID object, and hence
>> have the same ID. I think this is broken. You should instantiate a new
>> ID object in the preprocessLine.
>> Can you try it like that?
>>
>> On Thu, Feb 13, 2014 at 9:50 PM, Sebastian Schelter <s...@apache.org> wrote:
>>
>>> Hi Armando,
>>>
>>> I uploaded my test code to github at:
>>>
>>> https://github.com/sscdotopen/giraph/tree/hyperball64-ooc
>>>
>>> I'm working on an algorithm to estimate the neighborhood function of
>>> the graph (similar to [1]). I'm running this on the transposed
>>> adjacency matrix of a snapshot of the twitter follower graph [2]. For
>>> this graph out-of-core is not necessary, but I would like to run my
>>> algorithm on another, larger graph that doesn't fit into the
>>> aggregated main memory of the cluster anymore.
>>>
>>> I think for testing purposes, you can run it on any large graph in
>>> adjacency form.
>>>
>>> Our cluster consists of 25 machines with 32GB RAM, 8 cores and 4 disks
>>> per machine. I use the following options to run the algorithm:
>>>
>>> hadoop jar giraph-examples-1.1.0-SNAPSHOT-for-hadoop-1.2.1-jar-with-dependencies.jar
>>> org.apache.giraph.GiraphRunner
>>> org.apache.giraph.examples.hyperball.HyperBall
>>> --vertexInputFormat org.apache.giraph.examples.hyperball.HyperBallTextInputFormat
>>> --vertexInputPath hdfs:///ssc/twitter-negative/
>>> --vertexOutputFormat org.apache.giraph.io.formats.IdWithValueTextOutputFormat
>>> --outputPath hdfs:///ssc/tmp-123/
>>> --combiner org.apache.giraph.comm.messages.HyperLogLogCombiner
>>> --outEdges org.apache.giraph.edge.LongNullArrayEdges
>>> --workers 24
>>> --customArguments
>>> giraph.oneToAllMsgSending=true,
>>> giraph.isStaticGraph=true,
>>> giraph.numComputeThreads=15,
>>> giraph.numInputThreads=15,
>>> giraph.numOutputThreads=15,
>>> giraph.maxNumberOfSupersteps=30,
>>> giraph.useOutOfCoreGraph=true,
>>> giraph.maxPartitionsInMemory=20
>>>
>>> Best,
>>> Sebastian
>>>
>>> [1] http://arxiv.org/abs/1308.2144
>>> [2] http://konect.uni-koblenz.de/networks/twitter_mpi
>>>
>>> On 02/12/2014 04:21 PM, Armando Miraglia wrote:
>>>
>>>> Hi Sebastian,
>>>>
>>>> On Wed, Feb 12, 2014 at 02:59:20PM +0100, Sebastian Schelter wrote:
>>>>
>>>>> No. Should I have done that?
>>>>
>>>> Could you please provide me with the test you have done, together
>>>> with the variables that you have set for the computation? This would
>>>> help me a lot.
>>>>
>>>> Cheers,
>>>> Armando

--
Claudio Martella
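P.S. A small aside on the --combiner org.apache.giraph.comm.messages.HyperLogLogCombiner in Sebastian's command above: a message combiner is applicable there because a HyperLogLog union is just an element-wise max over the registers, so the messages for a vertex can be folded together in any order. A stand-alone sketch of that union step (the fixed register layout below is a simplification for illustration, not the actual message class from the hyperball64-ooc branch):

import java.util.Arrays;

/** Simplified HyperLogLog-style registers; the union is an element-wise max. */
public class HllUnionSketch {

  /** Folds 'other' into 'target', the way a message combiner would. */
  static void unionInto(byte[] target, byte[] other) {
    for (int i = 0; i < target.length; i++) {
      if (other[i] > target[i]) {
        target[i] = other[i];
      }
    }
  }

  public static void main(String[] args) {
    byte[] a = { 3, 0, 5, 1 };
    byte[] b = { 1, 4, 2, 1 };
    unionInto(a, b);
    // Element-wise max: [3, 4, 5, 1]. The result is the same no matter how
    // the incoming messages are grouped, which is what makes combining valid.
    System.out.println(Arrays.toString(a));
  }
}

That property is also why the combiner should help most on the heavily skewed twitter graph, where the high-degree vertices would otherwise receive a very large number of messages.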