Hi Sebastian, Thanks a lot for the help! Sorry for the late response!
At this point, I would only need a random graph that complies with JsonLongDoubleFloatDoubleVertexInputFormat of Giraph to measure the pagerank example (of giraph) performance. I am wondering how to convert the data from Koblenz to the such a graph ? Is there any pointer of doing this ? (This is the same kind of question that I raised to Alok on SNAP) Thanks! Wei P.S.: I forgot to mention in all my previous emails that I just get started with distributed graph engine, so please forgive if my questions are too naive. From: Sebastian Schelter <s...@apache.org> To: user@giraph.apache.org, Date: 10/02/2013 12:41 PM Subject: Re: workload used to measure Giraph performance number Another option is to use the Koblenz network collection [1], which offers even more (and larger) datasets than Snap. Best, Sebastian [1] http://konect.uni-koblenz.de/ On 02.10.2013 17:41, Alok Kumbhare wrote: > There are a number real (medium sized) graphs at > http://snap.stanford.edu/data/index.html which we use for similar > benchmarks. It has a good mix of graph types, sparse/dense, ground truth > graphs (e.g. social networks that follow power law distribution etc.). So > far we have observed that the type of graph has a high impact on the > performance of algorithms that Claudio mentioned. > > > On Wed, Oct 2, 2013 at 8:22 AM, Claudio Martella <claudio.marte...@gmail.com >> wrote: > >> Hi Wei, >> >> it depends on what you mean by workload for a batch processing system. I >> believe we can split the problem in two: generating a realistic graph, and >> using "representative" algorithms. >> >> To generate graphs we have two options in giraph: >> >> 1) random graph: you specify the number of vertices and the number of >> edges for each vertex, and the edges will connect two random vertices. This >> creates a graph with (i) low clustering coefficient, (ii) low average path >> length, (ii) a uniform degree distribution >> >> 2) watts strogatz: you specify the number of vertices, the number of >> edges, and a rewire probability beta. giraph will generate a ring lattice >> (each vertex is connected to k preceeding vertices and k following >> vertices) and rewire some of the edges randomly. This will create a graph >> with (i) high clustering coefficient, (ii) low average path length, (iii) >> poisson-like degree distribution (depends on beta). This graph will >> resemble a small world graph such as a social network, except for the >> degree distribution which will not a power law. >> >> To use representative algorithms you can choose: >> >> 1) PageRank: it's a ranking algorithm where all the vertices are active >> and send messages along the edges at each superstep (hence you'll have O (V) >> active vertices and O(E) messages) >> >> 2) Shortest Paths: starting from a random vertex you'll visit al the >> vertices in the graph (some multiple times). This will have an aggregate >> O(V) active vertices and O(E) messages, but this is only a lower bound. In >> general you'l have different areas of the graph explored at each superstep, >> and hence potentially a varying workload across different supersteps. >> >> 3) Connected Components: this will have something opposite to (2) as it >> will have many active vertices at the beginning, where the detection is >> refined towards the end. >> >> >> Hope this helps, >> Claudio >> >> >> On Wed, Oct 2, 2013 at 4:59 PM, Wei Zhang <w...@us.ibm.com> wrote: >> >>> Hi, >>> >>> I am interested in measuring some performance numbers of Giraph on my >>> machine. >>> >>> I am wondering are there some pointers where I can get some >>> (configurable) reasonably large workload to work on ? >>> >>> Thanks! >>> >>> Wei >>> >> >> >> >> -- >> Claudio Martella >> claudio.marte...@gmail.com >> >
<<inline: graycol.gif>>