Re: workload used to measure Giraph performance number

2013-10-10 Thread Wei Zhang

Thanks a lot Avery, I will take look into PageRankBenchmark  and
PseudoRandomIntNullVertexInputFormat class.

Wei





From:   Avery Ching 
To: user@giraph.apache.org,
Date:   10/10/2013 01:59 PM
Subject:Re: workload used to measure Giraph performance number



We don't have a data generator that produces HDFS files, but
PageRankBenchmark does have a data generator (see
PseudoRandomIntNullVertexInputFormat)

Hope that helps,

Avery

On 10/9/13 2:49 PM, Wei Zhang wrote:


  Hi Avery,

  Thanks a lot for the pointer! I haven't quite got to the point where
  I can tune the Giraph performance yet (hopefully I can get there
  soon)

  I am still at the early stage of finding/generating the right
  workload to measure the performance. I am wondering is there some
  pointer (in Giraph or elsewhere) that I can use to generate a random
  or representative workload (for pagerank, for now), which complies
  with JsonLongDoubleFloatDoubleVertexInputFormat t? Any pointer is
  greatly appreciated. -- Sorry for being a slow starter.

  I have written a small program myself to generate a random graph that
  has an average out-degree of 6 connected graph that complies with the
  JSON format. For a 2000-vertice graph, it takes about 12 seconds to
  run the pagerank example provided by Giraph (30 supersteps)
  I am only using 1 core of my machine (as I run in the Hadoop's
  "local" mode so that I could jdb into the Giraph job to observe how
  code runs if needed). Each core (out of 8 in total) of my machine is
  Intel(R) Core(TM) i7-3740QM CPU @ 2.70GHz, there is 16GB memory in
  total. Assuming the linear speedup when turning on all the 8 cores,
  it is about 1.5 seconds to run the work -- not sure if it is a
  reasonable number or not .

  Thanks!

  Wei
  Inactive
  hide details for Avery Ching ---10/09/2013 05:11:39
  PM---Hi
  Wei, For best performance, please be sure to tune
  the GC sAvery Ching ---10/09/2013 05:11:39 PM---Hi Wei, For best
  performance, please be sure to tune the GC settings, use Java

  From: Avery Ching 
  To: user@giraph.apache.org,
  Date: 10/09/2013 05:11 PM
      Subject: Re: workload used to measure Giraph performance number





  Hi Wei,

  For best performance, please be sure to tune the GC settings, use
  Java 7, tune the number of cores used for computation, communication,
  etc. and the combiner.

  We also have some numbers on our recent Facebook blog post.

  
https://www.facebook.com/notes/facebook-engineering/scaling-apache-giraph-to-a-trillion-edges/10151617006153920


  Avery

  On 10/8/13 7:43 PM, Wei Zhang wrote:

Hi Sebastian,

Thanks a lot for the help! Sorry for the late response!

At this point, I would only need a random graph that complies
with JsonLongDoubleFloatDoubleVertexInputFormat of Giraph to
measure the pagerank example (of giraph) performance.
I am wondering how to convert the data from Koblenz to the
such a graph ? Is there any pointer of doing this ? (This is
the same kind of question that I raised to Alok on SNAP)

Thanks!

Wei

P.S.: I forgot to mention in all my previous emails that I just
get started with distributed graph engine, so please forgive if
my questions are too naive.

Inactive hide details for Sebastian Schelter
---10/02/2013 12:41:27 PM---Another option is to use the
Koblenz network collectioSebastian Schelter ---10/02/2013
12:41:27 PM---Another option is to use the Koblenz network
collection [1], which offers even more (and larger) dat

From: Sebastian Schelter 
To: user@giraph.apache.org,
Date: 10/02/2013 12:41 PM
    Subject: Re: workload used to measure Giraph performance number



Another option is to use the Koblenz network collection [1],
which
offers even more (and larger) datasets than Snap.

Best,
Sebastian

[1] http://konect.uni-koblenz.de/


On 02.10.2013 17:41, Alok Kumbhare wrote:
> There are a number real (medium sized) graphs at
> http://snap.stanford.edu/data/index.html which we use for
similar
> benchmarks. It has a good mix of graph types, sparse/dense,
ground truth
> graphs (e.g. social networks that follow power law
distribution etc.). So
> far we have observed that the type of graph has a high impact
on the
> performance of algorithms that Claudio mentioned.
>
>
> On Wed, Oct 2, 2013 at 8:22 AM, Claud

Re: workload used to measure Giraph performance number

2013-10-10 Thread Avery Ching

  
  
We don't have a data generator that
  produces HDFS files, but PageRankBenchmark does have a data
  generator (see PseudoRandomIntNullVertexInputFormat)
  
  Hope that helps,
  
  Avery
  
  On 10/9/13 2:49 PM, Wei Zhang wrote:


  Hi Avery,

Thanks a lot for the pointer! I
  haven't quite got to the point where I can tune the Giraph
  performance yet (hopefully I can get there soon)

I am still at the early stage
  of finding/generating the right workload to measure the
  performance. I am wondering is there some pointer (in Giraph
  or elsewhere) that I can use to generate a random or
  representative workload (for pagerank, for now), which
  complies with JsonLongDoubleFloatDoubleVertexInputFormat t?
  Any pointer is greatly appreciated. -- Sorry for being a slow
  starter.

I have written a small program
  myself to generate a random graph that has an average
  out-degree of 6 connected graph that complies with the JSON
  format. For a 2000-vertice graph, it takes about 12 seconds to
  run the pagerank example provided by Giraph (30 supersteps)
I am only using 1 core of my
  machine (as I run in the Hadoop's "local" mode so that I could
  jdb into the Giraph job to observe how code runs if needed).
  Each core (out of 8 in total) of my machine is Intel(R)
  Core(TM) i7-3740QM CPU @ 2.70GHz, there is 16GB memory in
  total. Assuming the linear speedup when turning on all the 8
  cores, it is about 1.5 seconds to run the work -- not sure if
  it is a reasonable number or not .

Thanks!

Wei
Avery Ching ---10/09/2013 05:11:39
  PM---Hi Wei, For best performance, please be sure to tune the
  GC settings, use Java

From: Avery Ching
  
To: user@giraph.apache.org, 
Date: 10/09/2013 05:11 PM
    Subject: Re: workload used to measure Giraph
      performance number
  
  
  
  
  Hi Wei,

For best performance, please be sure to tune the GC settings,
use Java 7, tune the number of cores used for computation,
communication, etc. and the combiner.

We also have some numbers on our recent Facebook blog post.  
  
https://www.facebook.com/notes/facebook-engineering/scaling-apache-giraph-to-a-trillion-edges/10151617006153920

Avery

On 10/8/13 7:43 PM, Wei Zhang wrote:
  

Hi Sebastian,

  Thanks a lot for the help! Sorry for the late response!

  At this point, I would only need a random graph that complies
  with JsonLongDoubleFloatDoubleVertexInputFormat of Giraph to
  measure the pagerank example (of giraph) performance.
  I am wondering how to convert the data from Koblenz to the
   such a graph ? Is there any pointer of doing this ? (This is
  the same kind of question that I raised to Alok on SNAP)

  Thanks!

  Wei

  P.S.: I forgot to mention in all my previous emails that I
  just get started with distributed graph engine, so please
  forgive if my questions are too naive.
  
Sebastian Schelter
  ---10/02/2013 12:41:27 PM---Another option is to use the
  Koblenz network collection [1], which offers even more (and
  larger) dat

  From: Sebastian
  Schelter 
  To: user@giraph.apache.org, 
  Date: 10/02/2013
  12:41 PM
  Subject: Re: workload
      used to measure Giraph performance number

  

Another option is to use the Koblenz network collection [1],
which
offers even more (and larger) datasets than Snap.

Best,
Sebastian

[1] http://konect.uni-koblenz.de/


On 02.10.2013 17:41, Alok Kumbhare wrote:
> There are a number real (medium sized) graphs at
> http://snap.stanford.edu/data/index.html which we use for similar
> benchmarks. It has a good mix of graph types,
sparse/dense, ground truth
> graphs (e.g. social networks that follow power law
distribution etc.). So
> far we have observed that the type of graph has a high
impact on the
> performance of algorithms that Claudio mentioned.
> 
> 
> On Wed, Oct 2, 2013 at 8:22 AM, Claudio Martella <claudio.marte...@gmail.com
>> wrote:
> 
   

Re: workload used to measure Giraph performance number

2013-10-09 Thread Wei Zhang

Hi Avery,

Thanks a lot for the pointer! I haven't quite got to the point where I can
tune the Giraph performance yet (hopefully I can get there soon)

I am still at the early stage of finding/generating the right workload to
measure the performance. I am wondering is there some pointer (in Giraph or
elsewhere) that I can use to generate a random or representative workload
(for pagerank, for now), which complies with
JsonLongDoubleFloatDoubleVertexInputFormat t? Any pointer is greatly
appreciated. -- Sorry for being a slow starter.

I have written a small program myself to generate a random graph that has
an average out-degree of 6 connected graph that complies with the JSON
format. For a 2000-vertice graph, it takes about 12 seconds to run the
pagerank example provided by Giraph (30 supersteps)
I am only using 1 core of my machine (as I run in the Hadoop's "local" mode
so that I could jdb into the Giraph job to observe how code runs if
needed). Each core (out of 8 in total) of my machine is Intel(R) Core(TM)
i7-3740QM CPU @ 2.70GHz, there is 16GB memory in total. Assuming the linear
speedup when turning on all the 8 cores, it is about 1.5 seconds to run the
work -- not sure if it is a reasonable number or not .

Thanks!

Wei


From:   Avery Ching 
To: user@giraph.apache.org,
Date:   10/09/2013 05:11 PM
Subject:    Re: workload used to measure Giraph performance number



Hi Wei,

For best performance, please be sure to tune the GC settings, use Java 7,
tune the number of cores used for computation, communication, etc. and the
combiner.

We also have some numbers on our recent Facebook blog post.

https://www.facebook.com/notes/facebook-engineering/scaling-apache-giraph-to-a-trillion-edges/10151617006153920


Avery

On 10/8/13 7:43 PM, Wei Zhang wrote:


  Hi Sebastian,

  Thanks a lot for the help! Sorry for the late response!

  At this point, I would only need a random graph that complies with
  JsonLongDoubleFloatDoubleVertexInputFormat of Giraph to measure the
  pagerank example (of giraph) performance.
  I am wondering how to convert the data from Koblenz to the  such a
  graph ? Is there any pointer of doing this ? (This is the same kind
  of question that I raised to Alok on SNAP)

  Thanks!

  Wei

  P.S.: I forgot to mention in all my previous emails that I just get
  started with distributed graph engine, so please forgive if my
  questions are too naive.

  Inactive
  hide details for Sebastian Schelter ---10/02/2013
  12:41:27
  PM---Another option is to use the Koblenz network
  collectioSebastian Schelter ---10/02/2013 12:41:27 PM---Another
  option is to use the Koblenz network collection [1], which offers
  even more (and larger) dat

  From: Sebastian Schelter 
  To: user@giraph.apache.org,
  Date: 10/02/2013 12:41 PM
  Subject: Re: workload used to measure Giraph performance number





  Another option is to use the Koblenz network collection [1], which
  offers even more (and larger) datasets than Snap.

  Best,
  Sebastian

  [1] http://konect.uni-koblenz.de/


  On 02.10.2013 17:41, Alok Kumbhare wrote:
  > There are a number real (medium sized) graphs at
  > http://snap.stanford.edu/data/index.html which we use for similar
  > benchmarks. It has a good mix of graph types, sparse/dense, ground
  truth
  > graphs (e.g. social networks that follow power law distribution
  etc.). So
  > far we have observed that the type of graph has a high impact on
  the
  > performance of algorithms that Claudio mentioned.
  >
  >
  > On Wed, Oct 2, 2013 at 8:22 AM, Claudio Martella <
  claudio.marte...@gmail.com
  >> wrote:
  >
  >> Hi Wei,
  >>
  >> it depends on what you mean by workload for a batch processing
  system. I
  >> believe we can split the problem in two: generating a realistic
  graph, and
  >> using "representative" algorithms.
  >>
  >> To generate graphs we have two options in giraph:
  >>
  >> 1) random graph: you specify the number of vertices and the number
  of
  >> edges for each vertex, and the edges will connect two random
  vertices. This
  >> creates a graph with (i) low clustering coefficient, (ii) low
  average path
  >> length, (ii) a uniform degree distribution
  >>
  >> 2) watts strogatz: you specify the number of vertices, the number
  of
  >> edges, and a rewire probability beta. giraph will generate a ring
  lattice
  >> (each vertex is connected to k preceeding vertices and k following
  >> vertices) and rewire some of the edges randomly. This will create
  a graph
  >> with (i) high clusterin

Re: workload used to measure Giraph performance number

2013-10-09 Thread Avery Ching

  
  
Hi Wei,
  
  For best performance, please be sure to tune the GC settings, use
  Java 7, tune the number of cores used for computation,
  communication, etc. and the combiner.
  
  We also have some numbers on our recent Facebook blog post.  
  
https://www.facebook.com/notes/facebook-engineering/scaling-apache-giraph-to-a-trillion-edges/10151617006153920
  
  Avery
  
  On 10/8/13 7:43 PM, Wei Zhang wrote:


  Hi Sebastian,

Thanks a lot for the help!
  Sorry for the late response!

At this point, I would only
  need a random graph that complies with
  JsonLongDoubleFloatDoubleVertexInputFormat of Giraph to
  measure the pagerank example (of giraph) performance.
I am wondering how to convert
  the data from Koblenz to the  such a graph ? Is there any
  pointer of doing this ? (This is the same kind of question
  that I raised to Alok on SNAP)

Thanks!

Wei

P.S.: I forgot to mention in
  all my previous emails that I just get started with
  distributed graph engine, so please forgive if my questions
  are too naive.

Sebastian Schelter ---10/02/2013
  12:41:27 PM---Another option is to use the Koblenz network
  collection [1], which offers even more (and larger) dat

From: Sebastian Schelter
  
To: user@giraph.apache.org, 
Date: 10/02/2013 12:41 PM
Subject: Re: workload used to measure Giraph
  performance number
  
  
  
  
  Another option is to use the Koblenz network
  collection [1], which
  offers even more (and larger) datasets than Snap.
  
  Best,
  Sebastian
  
  [1] http://konect.uni-koblenz.de/
  
  
  On 02.10.2013 17:41, Alok Kumbhare wrote:
  > There are a number real (medium sized) graphs at
  > http://snap.stanford.edu/data/index.html which we use for similar
  > benchmarks. It has a good mix of graph types,
  sparse/dense, ground truth
  > graphs (e.g. social networks that follow power law
  distribution etc.). So
  > far we have observed that the type of graph has a high
  impact on the
  > performance of algorithms that Claudio mentioned.
  > 
  > 
  > On Wed, Oct 2, 2013 at 8:22 AM, Claudio Martella
  <claudio.marte...@gmail.com
  >> wrote:
  > 
  >> Hi Wei,
  >>
  >> it depends on what you mean by workload for a batch
  processing system. I
  >> believe we can split the problem in two: generating a
  realistic graph, and
  >> using "representative" algorithms.
  >>
  >> To generate graphs we have two options in giraph:
  >>
  >> 1) random graph: you specify the number of vertices
  and the number of
  >> edges for each vertex, and the edges will connect two
  random vertices. This
  >> creates a graph with (i) low clustering coefficient,
  (ii) low average path
  >> length, (ii) a uniform degree distribution
  >>
  >> 2) watts strogatz: you specify the number of
  vertices, the number of
  >> edges, and a rewire probability beta. giraph will
  generate a ring lattice
  >> (each vertex is connected to k preceeding vertices
  and k following
  >> vertices) and rewire some of the edges randomly. This
  will create a graph
  >> with (i) high clustering coefficient, (ii) low
  average path length, (iii)
  >> poisson-like degree distribution (depends on beta).
  This graph will
  >> resemble a small world graph such as a social
  network, except for the
  >> degree distribution which will not a power law.
  >>
  >> To use representative algorithms you can choose:
  >>
  >> 1) PageRank: it's a ranking algorithm where all the
  vertices are active
  >> and send messages along the edges at each superstep
  (hence you'll have O(V)
  >> active vertices and O(E) messages)
  >>
  >> 2) Shortest Paths: starting from a random vertex
  you'll visit al the
  >> vertices in the graph (some multiple times). This
  will have an aggregate
  >> O(V) active vertices and O(E) messages, but this is
  only a lower bound. In
  >>

Re: workload used to measure Giraph performance number

2013-10-08 Thread Wei Zhang

Hi Sebastian,

Thanks a lot for the help! Sorry for the late response!

At this point, I would only need a random graph that complies with
JsonLongDoubleFloatDoubleVertexInputFormat of Giraph to measure the
pagerank example (of giraph) performance.
I am wondering how to convert the data from Koblenz to the  such a graph ?
Is there any pointer of doing this ? (This is the same kind of question
that I raised to Alok on SNAP)

Thanks!

Wei

P.S.: I forgot to mention in all my previous emails that I just get started
with distributed graph engine, so please forgive if my questions are too
naive.



From:   Sebastian Schelter 
To: user@giraph.apache.org,
Date:   10/02/2013 12:41 PM
Subject:Re: workload used to measure Giraph performance number



Another option is to use the Koblenz network collection [1], which
offers even more (and larger) datasets than Snap.

Best,
Sebastian

[1] http://konect.uni-koblenz.de/


On 02.10.2013 17:41, Alok Kumbhare wrote:
> There are a number real (medium sized) graphs at
> http://snap.stanford.edu/data/index.html which we use for similar
> benchmarks. It has a good mix of graph types, sparse/dense, ground truth
> graphs (e.g. social networks that follow power law distribution etc.). So
> far we have observed that the type of graph has a high impact on the
> performance of algorithms that Claudio mentioned.
>
>
> On Wed, Oct 2, 2013 at 8:22 AM, Claudio Martella
> wrote:
>
>> Hi Wei,
>>
>> it depends on what you mean by workload for a batch processing system. I
>> believe we can split the problem in two: generating a realistic graph,
and
>> using "representative" algorithms.
>>
>> To generate graphs we have two options in giraph:
>>
>> 1) random graph: you specify the number of vertices and the number of
>> edges for each vertex, and the edges will connect two random vertices.
This
>> creates a graph with (i) low clustering coefficient, (ii) low average
path
>> length, (ii) a uniform degree distribution
>>
>> 2) watts strogatz: you specify the number of vertices, the number of
>> edges, and a rewire probability beta. giraph will generate a ring
lattice
>> (each vertex is connected to k preceeding vertices and k following
>> vertices) and rewire some of the edges randomly. This will create a
graph
>> with (i) high clustering coefficient, (ii) low average path length,
(iii)
>> poisson-like degree distribution (depends on beta). This graph will
>> resemble a small world graph such as a social network, except for the
>> degree distribution which will not a power law.
>>
>> To use representative algorithms you can choose:
>>
>> 1) PageRank: it's a ranking algorithm where all the vertices are active
>> and send messages along the edges at each superstep (hence you'll have O
(V)
>> active vertices and O(E) messages)
>>
>> 2) Shortest Paths: starting from a random vertex you'll visit al the
>> vertices in the graph (some multiple times). This will have an aggregate
>> O(V) active vertices and O(E) messages, but this is only a lower bound.
In
>> general you'l have different areas of the graph explored at each
superstep,
>> and hence potentially a varying workload across different supersteps.
>>
>> 3) Connected Components: this will have something opposite to (2) as it
>> will have many active vertices at the beginning, where the detection is
>> refined towards the end.
>>
>>
>> Hope this helps,
>> Claudio
>>
>>
>> On Wed, Oct 2, 2013 at 4:59 PM, Wei Zhang  wrote:
>>
>>> Hi,
>>>
>>> I am interested in measuring some performance numbers of Giraph on my
>>> machine.
>>>
>>> I am wondering are there some pointers where I can get some
>>> (configurable) reasonably large workload to work on ?
>>>
>>> Thanks!
>>>
>>> Wei
>>>
>>
>>
>>
>> --
>>Claudio Martella
>>claudio.marte...@gmail.com
>>
>

<>

Re: workload used to measure Giraph performance number

2013-10-08 Thread Wei Zhang

Hi Alok,

Thanks a lot for the help! Sorry for the late response!

At this point, I would only need a random graph that complies with
JsonLongDoubleFloatDoubleVertexInputFormat of Giraph to measure the
pagerank example (of giraph) performance.
I am wondering how to convert the data from SNAP to the  such a graph ? Is
there any pointer of doing this ?

Thanks!

Wei



From:   Alok Kumbhare 
To: user@giraph.apache.org,
Date:   10/02/2013 11:41 AM
Subject:Re: workload used to measure Giraph performance number



There are a number real (medium sized) graphs at
http://snap.stanford.edu/data/index.html which we use for similar
benchmarks. It has a good mix of graph types, sparse/dense, ground truth
graphs (e.g. social networks that follow power law distribution etc.). So
far we have observed that the type of graph has a high impact on the
performance of algorithms that Claudio mentioned.


On Wed, Oct 2, 2013 at 8:22 AM, Claudio Martella <
claudio.marte...@gmail.com> wrote:
  Hi Wei,

  it depends on what you mean by workload for a batch processing system. I
  believe we can split the problem in two: generating a realistic graph,
  and using "representative" algorithms.

  To generate graphs we have two options in giraph:

  1) random graph: you specify the number of vertices and the number of
  edges for each vertex, and the edges will connect two random vertices.
  This creates a graph with (i) low clustering coefficient, (ii) low
  average path length, (ii) a uniform degree distribution

  2) watts strogatz: you specify the number of vertices, the number of
  edges, and a rewire probability beta. giraph will generate a ring lattice
  (each vertex is connected to k preceeding vertices and k following
  vertices) and rewire some of the edges randomly. This will create a graph
  with (i) high clustering coefficient, (ii) low average path length, (iii)
  poisson-like degree distribution (depends on beta). This graph will
  resemble a small world graph such as a social network, except for the
  degree distribution which will not a power law.

  To use representative algorithms you can choose:

  1) PageRank: it's a ranking algorithm where all the vertices are active
  and send messages along the edges at each superstep (hence you'll have O
  (V) active vertices and O(E) messages)

  2) Shortest Paths: starting from a random vertex you'll visit al the
  vertices in the graph (some multiple times). This will have an aggregate
  O(V) active vertices and O(E) messages, but this is only a lower bound.
  In general you'l have different areas of the graph explored at each
  superstep, and hence potentially a varying workload across different
  supersteps.

  3) Connected Components: this will have something opposite to (2) as it
  will have many active vertices at the beginning, where the detection is
  refined towards the end.


  Hope this helps,
  Claudio


  On Wed, Oct 2, 2013 at 4:59 PM, Wei Zhang  wrote:
   Hi,

   I am interested in measuring some performance numbers of Giraph on my
   machine.

   I am wondering are there some pointers where I can get some
   (configurable) reasonably large workload to work on ?

   Thanks!

   Wei





  --
     Claudio Martella
     claudio.marte...@gmail.com
<>

Re: workload used to measure Giraph performance number

2013-10-08 Thread Wei Zhang

Hi Claudio,

Thanks a lot for the help! Sorry for the late response.

My question was more aligned with your first aspect (to generate graphs) of
problem domain.

I think I can start by running the SimplePageRank example. I have been able
to run some very small data set ( couple of vertices and tens of edges)
over Giraph. The data set is created manually by myself. And the vertex
input format complies with JsonLongDoubleFloatDoubleVertexInputFormat
(following the instructions on the giraph website, unfortunately this is
the only vertex input format that I am aware of. If you think I should use
other vertex input format to ease the performance measuring of Giraph or
other Graph engine, please let me know. I do intend to potentially try with
different Graph engine(s) and compare the numbers. )

To keep everything simple at the beginning, I can start with experimenting
with random graph. So my concrete question now is how to generate a random
graph that complies with JsonLongDoubleFloatDoubleVertexInputFormat, so I
can use this random graph as an input to measure some number of Giraph's
pagerank example.

Thank you very much!

Wei



From:   Claudio Martella 
To: "user@giraph.apache.org" ,
Date:   10/02/2013 11:32 AM
Subject:    Re: workload used to measure Giraph performance number



Hi Wei,

it depends on what you mean by workload for a batch processing system. I
believe we can split the problem in two: generating a realistic graph, and
using "representative" algorithms.

To generate graphs we have two options in giraph:

1) random graph: you specify the number of vertices and the number of edges
for each vertex, and the edges will connect two random vertices. This
creates a graph with (i) low clustering coefficient, (ii) low average path
length, (ii) a uniform degree distribution

2) watts strogatz: you specify the number of vertices, the number of edges,
and a rewire probability beta. giraph will generate a ring lattice (each
vertex is connected to k preceeding vertices and k following vertices) and
rewire some of the edges randomly. This will create a graph with (i) high
clustering coefficient, (ii) low average path length, (iii) poisson-like
degree distribution (depends on beta). This graph will resemble a small
world graph such as a social network, except for the degree distribution
which will not a power law.

To use representative algorithms you can choose:

1) PageRank: it's a ranking algorithm where all the vertices are active and
send messages along the edges at each superstep (hence you'll have O(V)
active vertices and O(E) messages)

2) Shortest Paths: starting from a random vertex you'll visit al the
vertices in the graph (some multiple times). This will have an aggregate O
(V) active vertices and O(E) messages, but this is only a lower bound. In
general you'l have different areas of the graph explored at each superstep,
and hence potentially a varying workload across different supersteps.

3) Connected Components: this will have something opposite to (2) as it
will have many active vertices at the beginning, where the detection is
refined towards the end.


Hope this helps,
Claudio


On Wed, Oct 2, 2013 at 4:59 PM, Wei Zhang  wrote:
  Hi,

  I am interested in measuring some performance numbers of Giraph on my
  machine.

  I am wondering are there some pointers where I can get some
  (configurable) reasonably large workload to work on ?

  Thanks!

  Wei





--
   Claudio Martella
   claudio.marte...@gmail.com<>

Re: workload used to measure Giraph performance number

2013-10-02 Thread Sebastian Schelter
Another option is to use the Koblenz network collection [1], which
offers even more (and larger) datasets than Snap.

Best,
Sebastian

[1] http://konect.uni-koblenz.de/


On 02.10.2013 17:41, Alok Kumbhare wrote:
> There are a number real (medium sized) graphs at
> http://snap.stanford.edu/data/index.html which we use for similar
> benchmarks. It has a good mix of graph types, sparse/dense, ground truth
> graphs (e.g. social networks that follow power law distribution etc.). So
> far we have observed that the type of graph has a high impact on the
> performance of algorithms that Claudio mentioned.
> 
> 
> On Wed, Oct 2, 2013 at 8:22 AM, Claudio Martella > wrote:
> 
>> Hi Wei,
>>
>> it depends on what you mean by workload for a batch processing system. I
>> believe we can split the problem in two: generating a realistic graph, and
>> using "representative" algorithms.
>>
>> To generate graphs we have two options in giraph:
>>
>> 1) random graph: you specify the number of vertices and the number of
>> edges for each vertex, and the edges will connect two random vertices. This
>> creates a graph with (i) low clustering coefficient, (ii) low average path
>> length, (ii) a uniform degree distribution
>>
>> 2) watts strogatz: you specify the number of vertices, the number of
>> edges, and a rewire probability beta. giraph will generate a ring lattice
>> (each vertex is connected to k preceeding vertices and k following
>> vertices) and rewire some of the edges randomly. This will create a graph
>> with (i) high clustering coefficient, (ii) low average path length, (iii)
>> poisson-like degree distribution (depends on beta). This graph will
>> resemble a small world graph such as a social network, except for the
>> degree distribution which will not a power law.
>>
>> To use representative algorithms you can choose:
>>
>> 1) PageRank: it's a ranking algorithm where all the vertices are active
>> and send messages along the edges at each superstep (hence you'll have O(V)
>> active vertices and O(E) messages)
>>
>> 2) Shortest Paths: starting from a random vertex you'll visit al the
>> vertices in the graph (some multiple times). This will have an aggregate
>> O(V) active vertices and O(E) messages, but this is only a lower bound. In
>> general you'l have different areas of the graph explored at each superstep,
>> and hence potentially a varying workload across different supersteps.
>>
>> 3) Connected Components: this will have something opposite to (2) as it
>> will have many active vertices at the beginning, where the detection is
>> refined towards the end.
>>
>>
>> Hope this helps,
>> Claudio
>>
>>
>> On Wed, Oct 2, 2013 at 4:59 PM, Wei Zhang  wrote:
>>
>>> Hi,
>>>
>>> I am interested in measuring some performance numbers of Giraph on my
>>> machine.
>>>
>>> I am wondering are there some pointers where I can get some
>>> (configurable) reasonably large workload to work on ?
>>>
>>> Thanks!
>>>
>>> Wei
>>>
>>
>>
>>
>> --
>>Claudio Martella
>>claudio.marte...@gmail.com
>>
> 



Re: workload used to measure Giraph performance number

2013-10-02 Thread Alok Kumbhare
There are a number real (medium sized) graphs at
http://snap.stanford.edu/data/index.html which we use for similar
benchmarks. It has a good mix of graph types, sparse/dense, ground truth
graphs (e.g. social networks that follow power law distribution etc.). So
far we have observed that the type of graph has a high impact on the
performance of algorithms that Claudio mentioned.


On Wed, Oct 2, 2013 at 8:22 AM, Claudio Martella  wrote:

> Hi Wei,
>
> it depends on what you mean by workload for a batch processing system. I
> believe we can split the problem in two: generating a realistic graph, and
> using "representative" algorithms.
>
> To generate graphs we have two options in giraph:
>
> 1) random graph: you specify the number of vertices and the number of
> edges for each vertex, and the edges will connect two random vertices. This
> creates a graph with (i) low clustering coefficient, (ii) low average path
> length, (ii) a uniform degree distribution
>
> 2) watts strogatz: you specify the number of vertices, the number of
> edges, and a rewire probability beta. giraph will generate a ring lattice
> (each vertex is connected to k preceeding vertices and k following
> vertices) and rewire some of the edges randomly. This will create a graph
> with (i) high clustering coefficient, (ii) low average path length, (iii)
> poisson-like degree distribution (depends on beta). This graph will
> resemble a small world graph such as a social network, except for the
> degree distribution which will not a power law.
>
> To use representative algorithms you can choose:
>
> 1) PageRank: it's a ranking algorithm where all the vertices are active
> and send messages along the edges at each superstep (hence you'll have O(V)
> active vertices and O(E) messages)
>
> 2) Shortest Paths: starting from a random vertex you'll visit al the
> vertices in the graph (some multiple times). This will have an aggregate
> O(V) active vertices and O(E) messages, but this is only a lower bound. In
> general you'l have different areas of the graph explored at each superstep,
> and hence potentially a varying workload across different supersteps.
>
> 3) Connected Components: this will have something opposite to (2) as it
> will have many active vertices at the beginning, where the detection is
> refined towards the end.
>
>
> Hope this helps,
> Claudio
>
>
> On Wed, Oct 2, 2013 at 4:59 PM, Wei Zhang  wrote:
>
>> Hi,
>>
>> I am interested in measuring some performance numbers of Giraph on my
>> machine.
>>
>> I am wondering are there some pointers where I can get some
>> (configurable) reasonably large workload to work on ?
>>
>> Thanks!
>>
>> Wei
>>
>
>
>
> --
>Claudio Martella
>claudio.marte...@gmail.com
>


Re: workload used to measure Giraph performance number

2013-10-02 Thread Claudio Martella
Hi Wei,

it depends on what you mean by workload for a batch processing system. I
believe we can split the problem in two: generating a realistic graph, and
using "representative" algorithms.

To generate graphs we have two options in giraph:

1) random graph: you specify the number of vertices and the number of edges
for each vertex, and the edges will connect two random vertices. This
creates a graph with (i) low clustering coefficient, (ii) low average path
length, (ii) a uniform degree distribution

2) watts strogatz: you specify the number of vertices, the number of edges,
and a rewire probability beta. giraph will generate a ring lattice (each
vertex is connected to k preceeding vertices and k following vertices) and
rewire some of the edges randomly. This will create a graph with (i) high
clustering coefficient, (ii) low average path length, (iii) poisson-like
degree distribution (depends on beta). This graph will resemble a small
world graph such as a social network, except for the degree distribution
which will not a power law.

To use representative algorithms you can choose:

1) PageRank: it's a ranking algorithm where all the vertices are active and
send messages along the edges at each superstep (hence you'll have O(V)
active vertices and O(E) messages)

2) Shortest Paths: starting from a random vertex you'll visit al the
vertices in the graph (some multiple times). This will have an aggregate
O(V) active vertices and O(E) messages, but this is only a lower bound. In
general you'l have different areas of the graph explored at each superstep,
and hence potentially a varying workload across different supersteps.

3) Connected Components: this will have something opposite to (2) as it
will have many active vertices at the beginning, where the detection is
refined towards the end.


Hope this helps,
Claudio


On Wed, Oct 2, 2013 at 4:59 PM, Wei Zhang  wrote:

> Hi,
>
> I am interested in measuring some performance numbers of Giraph on my
> machine.
>
> I am wondering are there some pointers where I can get some (configurable)
> reasonably large workload to work on ?
>
> Thanks!
>
> Wei
>



-- 
   Claudio Martella
   claudio.marte...@gmail.com


Re: workload used to measure Giraph performance number

2013-10-02 Thread David Boyd

I have used the Random Message Benchmark as a starting workload.

On 10/2/2013 10:59 AM, Wei Zhang wrote:


Hi,

I am interested in measuring some performance numbers of Giraph on my 
machine.


I am wondering are there some pointers where I can get some 
(configurable) reasonably large workload to work on ?


Thanks!

Wei




--
= mailto:db...@data-tactics.com 
David W. Boyd
Director, Engineering
7901 Jones Branch, Suite 700
Mclean, VA 22102
office:   +1-571-279-2122
fax: +1-703-506-6703
cell: +1-703-402-7908
== http://www.data-tactics.com.com/ 
First Robotic Mentor - FRC, FTC - www.iliterobotics.org
President - USSTEM Foundation - www.usstem.org

The information contained in this message may be privileged
and/or confidential and protected from disclosure.
If the reader of this message is not the intended recipient
or an employee or agent responsible for delivering this message
to the intended recipient, you are hereby notified that any
dissemination, distribution or copying of this communication
is strictly prohibited.  If you have received this communication
in error, please notify the sender immediately by replying to
this message and deleting the material from any computer.

 



workload used to measure Giraph performance number

2013-10-02 Thread Wei Zhang

Hi,

I am interested in measuring some performance numbers of Giraph on my
machine.

I am wondering are there some pointers where I can get some (configurable)
reasonably large workload to work on ?

Thanks!

Wei