Re: workload used to measure Giraph performance number

Sebastian Schelter Wed, 02 Oct 2013 09:41:57 -0700

Another option is to use the Koblenz network collection [1], which
offers even more (and larger) datasets than Snap.


Best,
Sebastian

[1] http://konect.uni-koblenz.de/


On 02.10.2013 17:41, Alok Kumbhare wrote:
> There are a number real (medium sized) graphs at
> http://snap.stanford.edu/data/index.html which we use for similar
> benchmarks. It has a good mix of graph types, sparse/dense, ground truth
> graphs (e.g. social networks that follow power law distribution etc.). So
> far we have observed that the type of graph has a high impact on the
> performance of algorithms that Claudio mentioned.
> 
> 
> On Wed, Oct 2, 2013 at 8:22 AM, Claudio Martella <claudio.marte...@gmail.com
>> wrote:
> 
>> Hi Wei,
>>
>> it depends on what you mean by workload for a batch processing system. I
>> believe we can split the problem in two: generating a realistic graph, and
>> using "representative" algorithms.
>>
>> To generate graphs we have two options in giraph:
>>
>> 1) random graph: you specify the number of vertices and the number of
>> edges for each vertex, and the edges will connect two random vertices. This
>> creates a graph with (i) low clustering coefficient, (ii) low average path
>> length, (ii) a uniform degree distribution
>>
>> 2) watts strogatz: you specify the number of vertices, the number of
>> edges, and a rewire probability beta. giraph will generate a ring lattice
>> (each vertex is connected to k preceeding vertices and k following
>> vertices) and rewire some of the edges randomly. This will create a graph
>> with (i) high clustering coefficient, (ii) low average path length, (iii)
>> poisson-like degree distribution (depends on beta). This graph will
>> resemble a small world graph such as a social network, except for the
>> degree distribution which will not a power law.
>>
>> To use representative algorithms you can choose:
>>
>> 1) PageRank: it's a ranking algorithm where all the vertices are active
>> and send messages along the edges at each superstep (hence you'll have O(V)
>> active vertices and O(E) messages)
>>
>> 2) Shortest Paths: starting from a random vertex you'll visit al the
>> vertices in the graph (some multiple times). This will have an aggregate
>> O(V) active vertices and O(E) messages, but this is only a lower bound. In
>> general you'l have different areas of the graph explored at each superstep,
>> and hence potentially a varying workload across different supersteps.
>>
>> 3) Connected Components: this will have something opposite to (2) as it
>> will have many active vertices at the beginning, where the detection is
>> refined towards the end.
>>
>>
>> Hope this helps,
>> Claudio
>>
>>
>> On Wed, Oct 2, 2013 at 4:59 PM, Wei Zhang <w...@us.ibm.com> wrote:
>>
>>> Hi,
>>>
>>> I am interested in measuring some performance numbers of Giraph on my
>>> machine.
>>>
>>> I am wondering are there some pointers where I can get some
>>> (configurable) reasonably large workload to work on ?
>>>
>>> Thanks!
>>>
>>> Wei
>>>
>>
>>
>>
>> --
>>    Claudio Martella
>>    claudio.marte...@gmail.com
>>
>

Re: workload used to measure Giraph performance number

Reply via email to