Hi Sebastian,
Thanks a lot for the help!
Sorry for the late response!
At this point, I would only
need a random graph that complies with
JsonLongDoubleFloatDoubleVertexInputFormat of Giraph to
measure the pagerank example (of giraph) performance.
I am wondering how to convert
the data from Koblenz to the such a graph ? Is there any
pointer of doing this ? (This is the same kind of question
that I raised to Alok on SNAP)
Thanks!
Wei
P.S.: I forgot to mention in
all my previous emails that I just get started with
distributed graph engine, so please forgive if my questions
are too naive.
Sebastian Schelter ---10/02/2013
12:41:27 PM---Another option is to use the Koblenz network
collection [1], which offers even more (and larger) dat
From: Sebastian Schelter
<s...@apache.org>
To: user@giraph.apache.org,
Date: 10/02/2013 12:41 PM
Subject: Re: workload used to measure Giraph
performance number
Another option is to use the Koblenz network
collection [1], which
offers even more (and larger) datasets than Snap.
Best,
Sebastian
[1] http://konect.uni-koblenz.de/
On 02.10.2013 17:41, Alok Kumbhare wrote:
> There are a number real (medium sized) graphs at
> http://snap.stanford.edu/data/index.html which we use for similar
> benchmarks. It has a good mix of graph types,
sparse/dense, ground truth
> graphs (e.g. social networks that follow power law
distribution etc.). So
> far we have observed that the type of graph has a high
impact on the
> performance of algorithms that Claudio mentioned.
>
>
> On Wed, Oct 2, 2013 at 8:22 AM, Claudio Martella
<claudio.marte...@gmail.com
>> wrote:
>
>> Hi Wei,
>>
>> it depends on what you mean by workload for a batch
processing system. I
>> believe we can split the problem in two: generating a
realistic graph, and
>> using "representative" algorithms.
>>
>> To generate graphs we have two options in giraph:
>>
>> 1) random graph: you specify the number of vertices
and the number of
>> edges for each vertex, and the edges will connect two
random vertices. This
>> creates a graph with (i) low clustering coefficient,
(ii) low average path
>> length, (ii) a uniform degree distribution
>>
>> 2) watts strogatz: you specify the number of
vertices, the number of
>> edges, and a rewire probability beta. giraph will
generate a ring lattice
>> (each vertex is connected to k preceeding vertices
and k following
>> vertices) and rewire some of the edges randomly. This
will create a graph
>> with (i) high clustering coefficient, (ii) low
average path length, (iii)
>> poisson-like degree distribution (depends on beta).
This graph will
>> resemble a small world graph such as a social
network, except for the
>> degree distribution which will not a power law.
>>
>> To use representative algorithms you can choose:
>>
>> 1) PageRank: it's a ranking algorithm where all the
vertices are active
>> and send messages along the edges at each superstep
(hence you'll have O(V)
>> active vertices and O(E) messages)
>>
>> 2) Shortest Paths: starting from a random vertex
you'll visit al the
>> vertices in the graph (some multiple times). This
will have an aggregate
>> O(V) active vertices and O(E) messages, but this is
only a lower bound. In
>> general you'l have different areas of the graph
explored at each superstep,
>> and hence potentially a varying workload across
different supersteps.
>>
>> 3) Connected Components: this will have something
opposite to (2) as it
>> will have many active vertices at the beginning,
where the detection is
>> refined towards the end.
>>
>>
>> Hope this helps,
>> Claudio
>>
>>
>> On Wed, Oct 2, 2013 at 4:59 PM, Wei Zhang
<w...@us.ibm.com> wrote:
>>
>>> Hi,
>>>
>>> I am interested in measuring some performance
numbers of Giraph on my
>>> machine.
>>>
>>> I am wondering are there some pointers where I
can get some
>>> (configurable) reasonably large workload to work
on ?
>>>
>>> Thanks!
>>>
>>> Wei
>>>
>>
>>
>>
>> --
>> Claudio Martella
>> claudio.marte...@gmail.com
>>
>
|