We don't have a data generator that
produces HDFS files, but PageRankBenchmark does have a data
generator (see PseudoRandomIntNullVertexInputFormat)
Hope that helps,
Avery
On 10/9/13 2:49 PM, Wei Zhang wrote:
Hi Avery,
Thanks a lot for the pointer! I
haven't quite got to the point where I can tune the Giraph
performance yet (hopefully I can get there soon)
I am still at the early stage
of finding/generating the right workload to measure the
performance. I am wondering is there some pointer (in Giraph
or elsewhere) that I can use to generate a random or
representative workload (for pagerank, for now), which
complies with JsonLongDoubleFloatDoubleVertexInputFormat t?
Any pointer is greatly appreciated. -- Sorry for being a slow
starter.
I have written a small program
myself to generate a random graph that has an average
out-degree of 6 connected graph that complies with the JSON
format. For a 2000-vertice graph, it takes about 12 seconds to
run the pagerank example provided by Giraph (30 supersteps)
I am only using 1 core of my
machine (as I run in the Hadoop's "local" mode so that I could
jdb into the Giraph job to observe how code runs if needed).
Each core (out of 8 in total) of my machine is Intel(R)
Core(TM) i7-3740QM CPU @ 2.70GHz, there is 16GB memory in
total. Assuming the linear speedup when turning on all the 8
cores, it is about 1.5 seconds to run the work -- not sure if
it is a reasonable number or not .
Thanks!
Wei
Avery Ching ---10/09/2013 05:11:39
PM---Hi Wei, For best performance, please be sure to tune the
GC settings, use Java
From: Avery Ching
<ach...@apache.org>
To: user@giraph.apache.org,
Date: 10/09/2013 05:11 PM
Subject: Re: workload used to measure Giraph
performance number
Hi Wei,
For best performance, please be sure to tune the GC settings,
use Java 7, tune the number of cores used for computation,
communication, etc. and the combiner.
We also have some numbers on our recent Facebook blog post.
https://www.facebook.com/notes/facebook-engineering/scaling-apache-giraph-to-a-trillion-edges/10151617006153920
Avery
On 10/8/13 7:43 PM, Wei Zhang wrote:
Hi Sebastian,
Thanks a lot for the help! Sorry for the late response!
At this point, I would only need a random graph that complies
with JsonLongDoubleFloatDoubleVertexInputFormat of Giraph to
measure the pagerank example (of giraph) performance.
I am wondering how to convert the data from Koblenz to the
such a graph ? Is there any pointer of doing this ? (This is
the same kind of question that I raised to Alok on SNAP)
Thanks!
Wei
P.S.: I forgot to mention in all my previous emails that I
just get started with distributed graph engine, so please
forgive if my questions are too naive.
Sebastian Schelter
---10/02/2013 12:41:27 PM---Another option is to use the
Koblenz network collection [1], which offers even more (and
larger) dat
From: Sebastian
Schelter <s...@apache.org>
To: user@giraph.apache.org,
Date: 10/02/2013
12:41 PM
Subject: Re: workload
used to measure Giraph performance number
Another option is to use the Koblenz network collection [1],
which
offers even more (and larger) datasets than Snap.
Best,
Sebastian
[1] http://konect.uni-koblenz.de/
On 02.10.2013 17:41, Alok Kumbhare wrote:
> There are a number real (medium sized) graphs at
> http://snap.stanford.edu/data/index.html which we use for similar
> benchmarks. It has a good mix of graph types,
sparse/dense, ground truth
> graphs (e.g. social networks that follow power law
distribution etc.). So
> far we have observed that the type of graph has a high
impact on the
> performance of algorithms that Claudio mentioned.
>
>
> On Wed, Oct 2, 2013 at 8:22 AM, Claudio Martella <claudio.marte...@gmail.com
>> wrote:
>
>> Hi Wei,
>>
>> it depends on what you mean by workload for a batch
processing system. I
>> believe we can split the problem in two: generating
a realistic graph, and
>> using "representative" algorithms.
>>
>> To generate graphs we have two options in giraph:
>>
>> 1) random graph: you specify the number of vertices
and the number of
>> edges for each vertex, and the edges will connect
two random vertices. This
>> creates a graph with (i) low clustering
coefficient, (ii) low average path
>> length, (ii) a uniform degree distribution
>>
>> 2) watts strogatz: you specify the number of
vertices, the number of
>> edges, and a rewire probability beta. giraph will
generate a ring lattice
>> (each vertex is connected to k preceeding vertices
and k following
>> vertices) and rewire some of the edges randomly.
This will create a graph
>> with (i) high clustering coefficient, (ii) low
average path length, (iii)
>> poisson-like degree distribution (depends on beta).
This graph will
>> resemble a small world graph such as a social
network, except for the
>> degree distribution which will not a power law.
>>
>> To use representative algorithms you can choose:
>>
>> 1) PageRank: it's a ranking algorithm where all the
vertices are active
>> and send messages along the edges at each superstep
(hence you'll have O(V)
>> active vertices and O(E) messages)
>>
>> 2) Shortest Paths: starting from a random vertex
you'll visit al the
>> vertices in the graph (some multiple times). This
will have an aggregate
>> O(V) active vertices and O(E) messages, but this is
only a lower bound. In
>> general you'l have different areas of the graph
explored at each superstep,
>> and hence potentially a varying workload across
different supersteps.
>>
>> 3) Connected Components: this will have something
opposite to (2) as it
>> will have many active vertices at the beginning,
where the detection is
>> refined towards the end.
>>
>>
>> Hope this helps,
>> Claudio
>>
>>
>> On Wed, Oct 2, 2013 at 4:59 PM, Wei Zhang <w...@us.ibm.com> wrote:
>>
>>> Hi,
>>>
>>> I am interested in measuring some performance
numbers of Giraph on my
>>> machine.
>>>
>>> I am wondering are there some pointers where I
can get some
>>> (configurable) reasonably large workload to
work on ?
>>>
>>> Thanks!
>>>
>>> Wei
>>>
>>
>>
>>
>> --
>> Claudio Martella
>> claudio.marte...@gmail.com
>>
>
|