Re: workload used to measure Giraph performance number

Wei Zhang Tue, 08 Oct 2013 19:41:08 -0700

Hi Alok,

Thanks a lot for the help! Sorry for the late response!


At this point, I would only need a random graph that complies with
JsonLongDoubleFloatDoubleVertexInputFormat of Giraph to measure the
pagerank example (of giraph) performance.
I am wondering how to convert the data from SNAP to the  such a graph ? Is
there any pointer of doing this ?

Thanks!

Wei



From:   Alok Kumbhare <kumbh...@usc.edu>
To:     user@giraph.apache.org,
Date:   10/02/2013 11:41 AM
Subject:        Re: workload used to measure Giraph performance number



There are a number real (medium sized) graphs at
http://snap.stanford.edu/data/index.html which we use for similar
benchmarks. It has a good mix of graph types, sparse/dense, ground truth
graphs (e.g. social networks that follow power law distribution etc.). So
far we have observed that the type of graph has a high impact on the
performance of algorithms that Claudio mentioned.


On Wed, Oct 2, 2013 at 8:22 AM, Claudio Martella <
claudio.marte...@gmail.com> wrote:
  Hi Wei,

  it depends on what you mean by workload for a batch processing system. I
  believe we can split the problem in two: generating a realistic graph,
  and using "representative" algorithms.

  To generate graphs we have two options in giraph:

  1) random graph: you specify the number of vertices and the number of
  edges for each vertex, and the edges will connect two random vertices.
  This creates a graph with (i) low clustering coefficient, (ii) low
  average path length, (ii) a uniform degree distribution

  2) watts strogatz: you specify the number of vertices, the number of
  edges, and a rewire probability beta. giraph will generate a ring lattice
  (each vertex is connected to k preceeding vertices and k following
  vertices) and rewire some of the edges randomly. This will create a graph
  with (i) high clustering coefficient, (ii) low average path length, (iii)
  poisson-like degree distribution (depends on beta). This graph will
  resemble a small world graph such as a social network, except for the
  degree distribution which will not a power law.

  To use representative algorithms you can choose:

  1) PageRank: it's a ranking algorithm where all the vertices are active
  and send messages along the edges at each superstep (hence you'll have O
  (V) active vertices and O(E) messages)

  2) Shortest Paths: starting from a random vertex you'll visit al the
  vertices in the graph (some multiple times). This will have an aggregate
  O(V) active vertices and O(E) messages, but this is only a lower bound.
  In general you'l have different areas of the graph explored at each
  superstep, and hence potentially a varying workload across different
  supersteps.

  3) Connected Components: this will have something opposite to (2) as it
  will have many active vertices at the beginning, where the detection is
  refined towards the end.


  Hope this helps,
  Claudio


  On Wed, Oct 2, 2013 at 4:59 PM, Wei Zhang <w...@us.ibm.com> wrote:
   Hi,

   I am interested in measuring some performance numbers of Giraph on my
   machine.

   I am wondering are there some pointers where I can get some
   (configurable) reasonably large workload to work on ?

   Thanks!

   Wei





  --
     Claudio Martella
     claudio.marte...@gmail.com

<<inline: graycol.gif>>

Re: workload used to measure Giraph performance number

Reply via email to