Re: workload used to measure Giraph performance number
Thanks a lot Avery, I will take look into PageRankBenchmark and PseudoRandomIntNullVertexInputFormat class. Wei From: Avery Ching To: user@giraph.apache.org, Date: 10/10/2013 01:59 PM Subject:Re: workload used to measure Giraph performance number We don't have a data generator that produces HDFS files, but PageRankBenchmark does have a data generator (see PseudoRandomIntNullVertexInputFormat) Hope that helps, Avery On 10/9/13 2:49 PM, Wei Zhang wrote: Hi Avery, Thanks a lot for the pointer! I haven't quite got to the point where I can tune the Giraph performance yet (hopefully I can get there soon) I am still at the early stage of finding/generating the right workload to measure the performance. I am wondering is there some pointer (in Giraph or elsewhere) that I can use to generate a random or representative workload (for pagerank, for now), which complies with JsonLongDoubleFloatDoubleVertexInputFormat t? Any pointer is greatly appreciated. -- Sorry for being a slow starter. I have written a small program myself to generate a random graph that has an average out-degree of 6 connected graph that complies with the JSON format. For a 2000-vertice graph, it takes about 12 seconds to run the pagerank example provided by Giraph (30 supersteps) I am only using 1 core of my machine (as I run in the Hadoop's "local" mode so that I could jdb into the Giraph job to observe how code runs if needed). Each core (out of 8 in total) of my machine is Intel(R) Core(TM) i7-3740QM CPU @ 2.70GHz, there is 16GB memory in total. Assuming the linear speedup when turning on all the 8 cores, it is about 1.5 seconds to run the work -- not sure if it is a reasonable number or not . Thanks! Wei Inactive hide details for Avery Ching ---10/09/2013 05:11:39 PM---Hi Wei, For best performance, please be sure to tune the GC sAvery Ching ---10/09/2013 05:11:39 PM---Hi Wei, For best performance, please be sure to tune the GC settings, use Java From: Avery Ching To: user@giraph.apache.org, Date: 10/09/2013 05:11 PM Subject: Re: workload used to measure Giraph performance number Hi Wei, For best performance, please be sure to tune the GC settings, use Java 7, tune the number of cores used for computation, communication, etc. and the combiner. We also have some numbers on our recent Facebook blog post. https://www.facebook.com/notes/facebook-engineering/scaling-apache-giraph-to-a-trillion-edges/10151617006153920 Avery On 10/8/13 7:43 PM, Wei Zhang wrote: Hi Sebastian, Thanks a lot for the help! Sorry for the late response! At this point, I would only need a random graph that complies with JsonLongDoubleFloatDoubleVertexInputFormat of Giraph to measure the pagerank example (of giraph) performance. I am wondering how to convert the data from Koblenz to the such a graph ? Is there any pointer of doing this ? (This is the same kind of question that I raised to Alok on SNAP) Thanks! Wei P.S.: I forgot to mention in all my previous emails that I just get started with distributed graph engine, so please forgive if my questions are too naive. Inactive hide details for Sebastian Schelter ---10/02/2013 12:41:27 PM---Another option is to use the Koblenz network collectioSebastian Schelter ---10/02/2013 12:41:27 PM---Another option is to use the Koblenz network collection [1], which offers even more (and larger) dat From: Sebastian Schelter To: user@giraph.apache.org, Date: 10/02/2013 12:41 PM Subject: Re: workload used to measure Giraph performance number Another option is to use the Koblenz network collection [1], which offers even more (and larger) datasets than Snap. Best, Sebastian [1] http://konect.uni-koblenz.de/ On 02.10.2013 17:41, Alok Kumbhare wrote: > There are a number real (medium sized) graphs at > http://snap.stanford.edu/data/index.html which we use for similar > benchmarks. It has a good mix of graph types, sparse/dense, ground truth > graphs (e.g. social networks that follow power law distribution etc.). So > far we have observed that the type of graph has a high impact on the > performance of algorithms that Claudio mentioned. > > > On Wed, Oct 2, 2013 at 8:22 AM, Claud
Re: workload used to measure Giraph performance number
We don't have a data generator that produces HDFS files, but PageRankBenchmark does have a data generator (see PseudoRandomIntNullVertexInputFormat) Hope that helps, Avery On 10/9/13 2:49 PM, Wei Zhang wrote: Hi Avery, Thanks a lot for the pointer! I haven't quite got to the point where I can tune the Giraph performance yet (hopefully I can get there soon) I am still at the early stage of finding/generating the right workload to measure the performance. I am wondering is there some pointer (in Giraph or elsewhere) that I can use to generate a random or representative workload (for pagerank, for now), which complies with JsonLongDoubleFloatDoubleVertexInputFormat t? Any pointer is greatly appreciated. -- Sorry for being a slow starter. I have written a small program myself to generate a random graph that has an average out-degree of 6 connected graph that complies with the JSON format. For a 2000-vertice graph, it takes about 12 seconds to run the pagerank example provided by Giraph (30 supersteps) I am only using 1 core of my machine (as I run in the Hadoop's "local" mode so that I could jdb into the Giraph job to observe how code runs if needed). Each core (out of 8 in total) of my machine is Intel(R) Core(TM) i7-3740QM CPU @ 2.70GHz, there is 16GB memory in total. Assuming the linear speedup when turning on all the 8 cores, it is about 1.5 seconds to run the work -- not sure if it is a reasonable number or not . Thanks! Wei Avery Ching ---10/09/2013 05:11:39 PM---Hi Wei, For best performance, please be sure to tune the GC settings, use Java From: Avery Ching To: user@giraph.apache.org, Date: 10/09/2013 05:11 PM Subject: Re: workload used to measure Giraph performance number Hi Wei, For best performance, please be sure to tune the GC settings, use Java 7, tune the number of cores used for computation, communication, etc. and the combiner. We also have some numbers on our recent Facebook blog post. https://www.facebook.com/notes/facebook-engineering/scaling-apache-giraph-to-a-trillion-edges/10151617006153920 Avery On 10/8/13 7:43 PM, Wei Zhang wrote: Hi Sebastian, Thanks a lot for the help! Sorry for the late response! At this point, I would only need a random graph that complies with JsonLongDoubleFloatDoubleVertexInputFormat of Giraph to measure the pagerank example (of giraph) performance. I am wondering how to convert the data from Koblenz to the such a graph ? Is there any pointer of doing this ? (This is the same kind of question that I raised to Alok on SNAP) Thanks! Wei P.S.: I forgot to mention in all my previous emails that I just get started with distributed graph engine, so please forgive if my questions are too naive. Sebastian Schelter ---10/02/2013 12:41:27 PM---Another option is to use the Koblenz network collection [1], which offers even more (and larger) dat From: Sebastian Schelter To: user@giraph.apache.org, Date: 10/02/2013 12:41 PM Subject: Re: workload used to measure Giraph performance number Another option is to use the Koblenz network collection [1], which offers even more (and larger) datasets than Snap. Best, Sebastian [1] http://konect.uni-koblenz.de/ On 02.10.2013 17:41, Alok Kumbhare wrote: > There are a number real (medium sized) graphs at > http://snap.stanford.edu/data/index.html which we use for similar > benchmarks. It has a good mix of graph types, sparse/dense, ground truth > graphs (e.g. social networks that follow power law distribution etc.). So > far we have observed that the type of graph has a high impact on the > performance of algorithms that Claudio mentioned. > > > On Wed, Oct 2, 2013 at 8:22 AM, Claudio Martella <claudio.marte...@gmail.com >> wrote: >
Re: workload used to measure Giraph performance number
Hi Avery, Thanks a lot for the pointer! I haven't quite got to the point where I can tune the Giraph performance yet (hopefully I can get there soon) I am still at the early stage of finding/generating the right workload to measure the performance. I am wondering is there some pointer (in Giraph or elsewhere) that I can use to generate a random or representative workload (for pagerank, for now), which complies with JsonLongDoubleFloatDoubleVertexInputFormat t? Any pointer is greatly appreciated. -- Sorry for being a slow starter. I have written a small program myself to generate a random graph that has an average out-degree of 6 connected graph that complies with the JSON format. For a 2000-vertice graph, it takes about 12 seconds to run the pagerank example provided by Giraph (30 supersteps) I am only using 1 core of my machine (as I run in the Hadoop's "local" mode so that I could jdb into the Giraph job to observe how code runs if needed). Each core (out of 8 in total) of my machine is Intel(R) Core(TM) i7-3740QM CPU @ 2.70GHz, there is 16GB memory in total. Assuming the linear speedup when turning on all the 8 cores, it is about 1.5 seconds to run the work -- not sure if it is a reasonable number or not . Thanks! Wei From: Avery Ching To: user@giraph.apache.org, Date: 10/09/2013 05:11 PM Subject: Re: workload used to measure Giraph performance number Hi Wei, For best performance, please be sure to tune the GC settings, use Java 7, tune the number of cores used for computation, communication, etc. and the combiner. We also have some numbers on our recent Facebook blog post. https://www.facebook.com/notes/facebook-engineering/scaling-apache-giraph-to-a-trillion-edges/10151617006153920 Avery On 10/8/13 7:43 PM, Wei Zhang wrote: Hi Sebastian, Thanks a lot for the help! Sorry for the late response! At this point, I would only need a random graph that complies with JsonLongDoubleFloatDoubleVertexInputFormat of Giraph to measure the pagerank example (of giraph) performance. I am wondering how to convert the data from Koblenz to the such a graph ? Is there any pointer of doing this ? (This is the same kind of question that I raised to Alok on SNAP) Thanks! Wei P.S.: I forgot to mention in all my previous emails that I just get started with distributed graph engine, so please forgive if my questions are too naive. Inactive hide details for Sebastian Schelter ---10/02/2013 12:41:27 PM---Another option is to use the Koblenz network collectioSebastian Schelter ---10/02/2013 12:41:27 PM---Another option is to use the Koblenz network collection [1], which offers even more (and larger) dat From: Sebastian Schelter To: user@giraph.apache.org, Date: 10/02/2013 12:41 PM Subject: Re: workload used to measure Giraph performance number Another option is to use the Koblenz network collection [1], which offers even more (and larger) datasets than Snap. Best, Sebastian [1] http://konect.uni-koblenz.de/ On 02.10.2013 17:41, Alok Kumbhare wrote: > There are a number real (medium sized) graphs at > http://snap.stanford.edu/data/index.html which we use for similar > benchmarks. It has a good mix of graph types, sparse/dense, ground truth > graphs (e.g. social networks that follow power law distribution etc.). So > far we have observed that the type of graph has a high impact on the > performance of algorithms that Claudio mentioned. > > > On Wed, Oct 2, 2013 at 8:22 AM, Claudio Martella < claudio.marte...@gmail.com >> wrote: > >> Hi Wei, >> >> it depends on what you mean by workload for a batch processing system. I >> believe we can split the problem in two: generating a realistic graph, and >> using "representative" algorithms. >> >> To generate graphs we have two options in giraph: >> >> 1) random graph: you specify the number of vertices and the number of >> edges for each vertex, and the edges will connect two random vertices. This >> creates a graph with (i) low clustering coefficient, (ii) low average path >> length, (ii) a uniform degree distribution >> >> 2) watts strogatz: you specify the number of vertices, the number of >> edges, and a rewire probability beta. giraph will generate a ring lattice >> (each vertex is connected to k preceeding vertices and k following >> vertices) and rewire some of the edges randomly. This will create a graph >> with (i) high clusterin
Re: workload used to measure Giraph performance number
Hi Wei, For best performance, please be sure to tune the GC settings, use Java 7, tune the number of cores used for computation, communication, etc. and the combiner. We also have some numbers on our recent Facebook blog post. https://www.facebook.com/notes/facebook-engineering/scaling-apache-giraph-to-a-trillion-edges/10151617006153920 Avery On 10/8/13 7:43 PM, Wei Zhang wrote: Hi Sebastian, Thanks a lot for the help! Sorry for the late response! At this point, I would only need a random graph that complies with JsonLongDoubleFloatDoubleVertexInputFormat of Giraph to measure the pagerank example (of giraph) performance. I am wondering how to convert the data from Koblenz to the such a graph ? Is there any pointer of doing this ? (This is the same kind of question that I raised to Alok on SNAP) Thanks! Wei P.S.: I forgot to mention in all my previous emails that I just get started with distributed graph engine, so please forgive if my questions are too naive. Sebastian Schelter ---10/02/2013 12:41:27 PM---Another option is to use the Koblenz network collection [1], which offers even more (and larger) dat From: Sebastian Schelter To: user@giraph.apache.org, Date: 10/02/2013 12:41 PM Subject: Re: workload used to measure Giraph performance number Another option is to use the Koblenz network collection [1], which offers even more (and larger) datasets than Snap. Best, Sebastian [1] http://konect.uni-koblenz.de/ On 02.10.2013 17:41, Alok Kumbhare wrote: > There are a number real (medium sized) graphs at > http://snap.stanford.edu/data/index.html which we use for similar > benchmarks. It has a good mix of graph types, sparse/dense, ground truth > graphs (e.g. social networks that follow power law distribution etc.). So > far we have observed that the type of graph has a high impact on the > performance of algorithms that Claudio mentioned. > > > On Wed, Oct 2, 2013 at 8:22 AM, Claudio Martella <claudio.marte...@gmail.com >> wrote: > >> Hi Wei, >> >> it depends on what you mean by workload for a batch processing system. I >> believe we can split the problem in two: generating a realistic graph, and >> using "representative" algorithms. >> >> To generate graphs we have two options in giraph: >> >> 1) random graph: you specify the number of vertices and the number of >> edges for each vertex, and the edges will connect two random vertices. This >> creates a graph with (i) low clustering coefficient, (ii) low average path >> length, (ii) a uniform degree distribution >> >> 2) watts strogatz: you specify the number of vertices, the number of >> edges, and a rewire probability beta. giraph will generate a ring lattice >> (each vertex is connected to k preceeding vertices and k following >> vertices) and rewire some of the edges randomly. This will create a graph >> with (i) high clustering coefficient, (ii) low average path length, (iii) >> poisson-like degree distribution (depends on beta). This graph will >> resemble a small world graph such as a social network, except for the >> degree distribution which will not a power law. >> >> To use representative algorithms you can choose: >> >> 1) PageRank: it's a ranking algorithm where all the vertices are active >> and send messages along the edges at each superstep (hence you'll have O(V) >> active vertices and O(E) messages) >> >> 2) Shortest Paths: starting from a random vertex you'll visit al the >> vertices in the graph (some multiple times). This will have an aggregate >> O(V) active vertices and O(E) messages, but this is only a lower bound. In >>
Re: workload used to measure Giraph performance number
Hi Sebastian, Thanks a lot for the help! Sorry for the late response! At this point, I would only need a random graph that complies with JsonLongDoubleFloatDoubleVertexInputFormat of Giraph to measure the pagerank example (of giraph) performance. I am wondering how to convert the data from Koblenz to the such a graph ? Is there any pointer of doing this ? (This is the same kind of question that I raised to Alok on SNAP) Thanks! Wei P.S.: I forgot to mention in all my previous emails that I just get started with distributed graph engine, so please forgive if my questions are too naive. From: Sebastian Schelter To: user@giraph.apache.org, Date: 10/02/2013 12:41 PM Subject:Re: workload used to measure Giraph performance number Another option is to use the Koblenz network collection [1], which offers even more (and larger) datasets than Snap. Best, Sebastian [1] http://konect.uni-koblenz.de/ On 02.10.2013 17:41, Alok Kumbhare wrote: > There are a number real (medium sized) graphs at > http://snap.stanford.edu/data/index.html which we use for similar > benchmarks. It has a good mix of graph types, sparse/dense, ground truth > graphs (e.g. social networks that follow power law distribution etc.). So > far we have observed that the type of graph has a high impact on the > performance of algorithms that Claudio mentioned. > > > On Wed, Oct 2, 2013 at 8:22 AM, Claudio Martella > wrote: > >> Hi Wei, >> >> it depends on what you mean by workload for a batch processing system. I >> believe we can split the problem in two: generating a realistic graph, and >> using "representative" algorithms. >> >> To generate graphs we have two options in giraph: >> >> 1) random graph: you specify the number of vertices and the number of >> edges for each vertex, and the edges will connect two random vertices. This >> creates a graph with (i) low clustering coefficient, (ii) low average path >> length, (ii) a uniform degree distribution >> >> 2) watts strogatz: you specify the number of vertices, the number of >> edges, and a rewire probability beta. giraph will generate a ring lattice >> (each vertex is connected to k preceeding vertices and k following >> vertices) and rewire some of the edges randomly. This will create a graph >> with (i) high clustering coefficient, (ii) low average path length, (iii) >> poisson-like degree distribution (depends on beta). This graph will >> resemble a small world graph such as a social network, except for the >> degree distribution which will not a power law. >> >> To use representative algorithms you can choose: >> >> 1) PageRank: it's a ranking algorithm where all the vertices are active >> and send messages along the edges at each superstep (hence you'll have O (V) >> active vertices and O(E) messages) >> >> 2) Shortest Paths: starting from a random vertex you'll visit al the >> vertices in the graph (some multiple times). This will have an aggregate >> O(V) active vertices and O(E) messages, but this is only a lower bound. In >> general you'l have different areas of the graph explored at each superstep, >> and hence potentially a varying workload across different supersteps. >> >> 3) Connected Components: this will have something opposite to (2) as it >> will have many active vertices at the beginning, where the detection is >> refined towards the end. >> >> >> Hope this helps, >> Claudio >> >> >> On Wed, Oct 2, 2013 at 4:59 PM, Wei Zhang wrote: >> >>> Hi, >>> >>> I am interested in measuring some performance numbers of Giraph on my >>> machine. >>> >>> I am wondering are there some pointers where I can get some >>> (configurable) reasonably large workload to work on ? >>> >>> Thanks! >>> >>> Wei >>> >> >> >> >> -- >>Claudio Martella >>claudio.marte...@gmail.com >> > <>
Re: workload used to measure Giraph performance number
Hi Alok, Thanks a lot for the help! Sorry for the late response! At this point, I would only need a random graph that complies with JsonLongDoubleFloatDoubleVertexInputFormat of Giraph to measure the pagerank example (of giraph) performance. I am wondering how to convert the data from SNAP to the such a graph ? Is there any pointer of doing this ? Thanks! Wei From: Alok Kumbhare To: user@giraph.apache.org, Date: 10/02/2013 11:41 AM Subject:Re: workload used to measure Giraph performance number There are a number real (medium sized) graphs at http://snap.stanford.edu/data/index.html which we use for similar benchmarks. It has a good mix of graph types, sparse/dense, ground truth graphs (e.g. social networks that follow power law distribution etc.). So far we have observed that the type of graph has a high impact on the performance of algorithms that Claudio mentioned. On Wed, Oct 2, 2013 at 8:22 AM, Claudio Martella < claudio.marte...@gmail.com> wrote: Hi Wei, it depends on what you mean by workload for a batch processing system. I believe we can split the problem in two: generating a realistic graph, and using "representative" algorithms. To generate graphs we have two options in giraph: 1) random graph: you specify the number of vertices and the number of edges for each vertex, and the edges will connect two random vertices. This creates a graph with (i) low clustering coefficient, (ii) low average path length, (ii) a uniform degree distribution 2) watts strogatz: you specify the number of vertices, the number of edges, and a rewire probability beta. giraph will generate a ring lattice (each vertex is connected to k preceeding vertices and k following vertices) and rewire some of the edges randomly. This will create a graph with (i) high clustering coefficient, (ii) low average path length, (iii) poisson-like degree distribution (depends on beta). This graph will resemble a small world graph such as a social network, except for the degree distribution which will not a power law. To use representative algorithms you can choose: 1) PageRank: it's a ranking algorithm where all the vertices are active and send messages along the edges at each superstep (hence you'll have O (V) active vertices and O(E) messages) 2) Shortest Paths: starting from a random vertex you'll visit al the vertices in the graph (some multiple times). This will have an aggregate O(V) active vertices and O(E) messages, but this is only a lower bound. In general you'l have different areas of the graph explored at each superstep, and hence potentially a varying workload across different supersteps. 3) Connected Components: this will have something opposite to (2) as it will have many active vertices at the beginning, where the detection is refined towards the end. Hope this helps, Claudio On Wed, Oct 2, 2013 at 4:59 PM, Wei Zhang wrote: Hi, I am interested in measuring some performance numbers of Giraph on my machine. I am wondering are there some pointers where I can get some (configurable) reasonably large workload to work on ? Thanks! Wei -- Claudio Martella claudio.marte...@gmail.com <>
Re: workload used to measure Giraph performance number
Hi Claudio, Thanks a lot for the help! Sorry for the late response. My question was more aligned with your first aspect (to generate graphs) of problem domain. I think I can start by running the SimplePageRank example. I have been able to run some very small data set ( couple of vertices and tens of edges) over Giraph. The data set is created manually by myself. And the vertex input format complies with JsonLongDoubleFloatDoubleVertexInputFormat (following the instructions on the giraph website, unfortunately this is the only vertex input format that I am aware of. If you think I should use other vertex input format to ease the performance measuring of Giraph or other Graph engine, please let me know. I do intend to potentially try with different Graph engine(s) and compare the numbers. ) To keep everything simple at the beginning, I can start with experimenting with random graph. So my concrete question now is how to generate a random graph that complies with JsonLongDoubleFloatDoubleVertexInputFormat, so I can use this random graph as an input to measure some number of Giraph's pagerank example. Thank you very much! Wei From: Claudio Martella To: "user@giraph.apache.org" , Date: 10/02/2013 11:32 AM Subject: Re: workload used to measure Giraph performance number Hi Wei, it depends on what you mean by workload for a batch processing system. I believe we can split the problem in two: generating a realistic graph, and using "representative" algorithms. To generate graphs we have two options in giraph: 1) random graph: you specify the number of vertices and the number of edges for each vertex, and the edges will connect two random vertices. This creates a graph with (i) low clustering coefficient, (ii) low average path length, (ii) a uniform degree distribution 2) watts strogatz: you specify the number of vertices, the number of edges, and a rewire probability beta. giraph will generate a ring lattice (each vertex is connected to k preceeding vertices and k following vertices) and rewire some of the edges randomly. This will create a graph with (i) high clustering coefficient, (ii) low average path length, (iii) poisson-like degree distribution (depends on beta). This graph will resemble a small world graph such as a social network, except for the degree distribution which will not a power law. To use representative algorithms you can choose: 1) PageRank: it's a ranking algorithm where all the vertices are active and send messages along the edges at each superstep (hence you'll have O(V) active vertices and O(E) messages) 2) Shortest Paths: starting from a random vertex you'll visit al the vertices in the graph (some multiple times). This will have an aggregate O (V) active vertices and O(E) messages, but this is only a lower bound. In general you'l have different areas of the graph explored at each superstep, and hence potentially a varying workload across different supersteps. 3) Connected Components: this will have something opposite to (2) as it will have many active vertices at the beginning, where the detection is refined towards the end. Hope this helps, Claudio On Wed, Oct 2, 2013 at 4:59 PM, Wei Zhang wrote: Hi, I am interested in measuring some performance numbers of Giraph on my machine. I am wondering are there some pointers where I can get some (configurable) reasonably large workload to work on ? Thanks! Wei -- Claudio Martella claudio.marte...@gmail.com<>
Re: workload used to measure Giraph performance number
Another option is to use the Koblenz network collection [1], which offers even more (and larger) datasets than Snap. Best, Sebastian [1] http://konect.uni-koblenz.de/ On 02.10.2013 17:41, Alok Kumbhare wrote: > There are a number real (medium sized) graphs at > http://snap.stanford.edu/data/index.html which we use for similar > benchmarks. It has a good mix of graph types, sparse/dense, ground truth > graphs (e.g. social networks that follow power law distribution etc.). So > far we have observed that the type of graph has a high impact on the > performance of algorithms that Claudio mentioned. > > > On Wed, Oct 2, 2013 at 8:22 AM, Claudio Martella > wrote: > >> Hi Wei, >> >> it depends on what you mean by workload for a batch processing system. I >> believe we can split the problem in two: generating a realistic graph, and >> using "representative" algorithms. >> >> To generate graphs we have two options in giraph: >> >> 1) random graph: you specify the number of vertices and the number of >> edges for each vertex, and the edges will connect two random vertices. This >> creates a graph with (i) low clustering coefficient, (ii) low average path >> length, (ii) a uniform degree distribution >> >> 2) watts strogatz: you specify the number of vertices, the number of >> edges, and a rewire probability beta. giraph will generate a ring lattice >> (each vertex is connected to k preceeding vertices and k following >> vertices) and rewire some of the edges randomly. This will create a graph >> with (i) high clustering coefficient, (ii) low average path length, (iii) >> poisson-like degree distribution (depends on beta). This graph will >> resemble a small world graph such as a social network, except for the >> degree distribution which will not a power law. >> >> To use representative algorithms you can choose: >> >> 1) PageRank: it's a ranking algorithm where all the vertices are active >> and send messages along the edges at each superstep (hence you'll have O(V) >> active vertices and O(E) messages) >> >> 2) Shortest Paths: starting from a random vertex you'll visit al the >> vertices in the graph (some multiple times). This will have an aggregate >> O(V) active vertices and O(E) messages, but this is only a lower bound. In >> general you'l have different areas of the graph explored at each superstep, >> and hence potentially a varying workload across different supersteps. >> >> 3) Connected Components: this will have something opposite to (2) as it >> will have many active vertices at the beginning, where the detection is >> refined towards the end. >> >> >> Hope this helps, >> Claudio >> >> >> On Wed, Oct 2, 2013 at 4:59 PM, Wei Zhang wrote: >> >>> Hi, >>> >>> I am interested in measuring some performance numbers of Giraph on my >>> machine. >>> >>> I am wondering are there some pointers where I can get some >>> (configurable) reasonably large workload to work on ? >>> >>> Thanks! >>> >>> Wei >>> >> >> >> >> -- >>Claudio Martella >>claudio.marte...@gmail.com >> >
Re: workload used to measure Giraph performance number
There are a number real (medium sized) graphs at http://snap.stanford.edu/data/index.html which we use for similar benchmarks. It has a good mix of graph types, sparse/dense, ground truth graphs (e.g. social networks that follow power law distribution etc.). So far we have observed that the type of graph has a high impact on the performance of algorithms that Claudio mentioned. On Wed, Oct 2, 2013 at 8:22 AM, Claudio Martella wrote: > Hi Wei, > > it depends on what you mean by workload for a batch processing system. I > believe we can split the problem in two: generating a realistic graph, and > using "representative" algorithms. > > To generate graphs we have two options in giraph: > > 1) random graph: you specify the number of vertices and the number of > edges for each vertex, and the edges will connect two random vertices. This > creates a graph with (i) low clustering coefficient, (ii) low average path > length, (ii) a uniform degree distribution > > 2) watts strogatz: you specify the number of vertices, the number of > edges, and a rewire probability beta. giraph will generate a ring lattice > (each vertex is connected to k preceeding vertices and k following > vertices) and rewire some of the edges randomly. This will create a graph > with (i) high clustering coefficient, (ii) low average path length, (iii) > poisson-like degree distribution (depends on beta). This graph will > resemble a small world graph such as a social network, except for the > degree distribution which will not a power law. > > To use representative algorithms you can choose: > > 1) PageRank: it's a ranking algorithm where all the vertices are active > and send messages along the edges at each superstep (hence you'll have O(V) > active vertices and O(E) messages) > > 2) Shortest Paths: starting from a random vertex you'll visit al the > vertices in the graph (some multiple times). This will have an aggregate > O(V) active vertices and O(E) messages, but this is only a lower bound. In > general you'l have different areas of the graph explored at each superstep, > and hence potentially a varying workload across different supersteps. > > 3) Connected Components: this will have something opposite to (2) as it > will have many active vertices at the beginning, where the detection is > refined towards the end. > > > Hope this helps, > Claudio > > > On Wed, Oct 2, 2013 at 4:59 PM, Wei Zhang wrote: > >> Hi, >> >> I am interested in measuring some performance numbers of Giraph on my >> machine. >> >> I am wondering are there some pointers where I can get some >> (configurable) reasonably large workload to work on ? >> >> Thanks! >> >> Wei >> > > > > -- >Claudio Martella >claudio.marte...@gmail.com >
Re: workload used to measure Giraph performance number
Hi Wei, it depends on what you mean by workload for a batch processing system. I believe we can split the problem in two: generating a realistic graph, and using "representative" algorithms. To generate graphs we have two options in giraph: 1) random graph: you specify the number of vertices and the number of edges for each vertex, and the edges will connect two random vertices. This creates a graph with (i) low clustering coefficient, (ii) low average path length, (ii) a uniform degree distribution 2) watts strogatz: you specify the number of vertices, the number of edges, and a rewire probability beta. giraph will generate a ring lattice (each vertex is connected to k preceeding vertices and k following vertices) and rewire some of the edges randomly. This will create a graph with (i) high clustering coefficient, (ii) low average path length, (iii) poisson-like degree distribution (depends on beta). This graph will resemble a small world graph such as a social network, except for the degree distribution which will not a power law. To use representative algorithms you can choose: 1) PageRank: it's a ranking algorithm where all the vertices are active and send messages along the edges at each superstep (hence you'll have O(V) active vertices and O(E) messages) 2) Shortest Paths: starting from a random vertex you'll visit al the vertices in the graph (some multiple times). This will have an aggregate O(V) active vertices and O(E) messages, but this is only a lower bound. In general you'l have different areas of the graph explored at each superstep, and hence potentially a varying workload across different supersteps. 3) Connected Components: this will have something opposite to (2) as it will have many active vertices at the beginning, where the detection is refined towards the end. Hope this helps, Claudio On Wed, Oct 2, 2013 at 4:59 PM, Wei Zhang wrote: > Hi, > > I am interested in measuring some performance numbers of Giraph on my > machine. > > I am wondering are there some pointers where I can get some (configurable) > reasonably large workload to work on ? > > Thanks! > > Wei > -- Claudio Martella claudio.marte...@gmail.com
Re: workload used to measure Giraph performance number
I have used the Random Message Benchmark as a starting workload. On 10/2/2013 10:59 AM, Wei Zhang wrote: Hi, I am interested in measuring some performance numbers of Giraph on my machine. I am wondering are there some pointers where I can get some (configurable) reasonably large workload to work on ? Thanks! Wei -- = mailto:db...@data-tactics.com David W. Boyd Director, Engineering 7901 Jones Branch, Suite 700 Mclean, VA 22102 office: +1-571-279-2122 fax: +1-703-506-6703 cell: +1-703-402-7908 == http://www.data-tactics.com.com/ First Robotic Mentor - FRC, FTC - www.iliterobotics.org President - USSTEM Foundation - www.usstem.org The information contained in this message may be privileged and/or confidential and protected from disclosure. If the reader of this message is not the intended recipient or an employee or agent responsible for delivering this message to the intended recipient, you are hereby notified that any dissemination, distribution or copying of this communication is strictly prohibited. If you have received this communication in error, please notify the sender immediately by replying to this message and deleting the material from any computer.
workload used to measure Giraph performance number
Hi, I am interested in measuring some performance numbers of Giraph on my machine. I am wondering are there some pointers where I can get some (configurable) reasonably large workload to work on ? Thanks! Wei