Re: [GraphX] Graph 500 graph generator

2015-06-24 Thread Burak Yavuz
Hi Ryan,
If you can get past the paperwork, I'm sure this can make a great Spark
Package (http://spark-packages.org). People then can use it for
benchmarking purposes, and I'm sure people will be looking for graph
generators!

Best,
Burak

On Wed, Jun 24, 2015 at 7:55 AM, Carr, J. Ryan  wrote:

>  Hi Spark Devs,
>
>As part of a project at work, I have written a graph generator for
> RMAT graphs consistent with the specifications in the Graph 500 benchmark (
> http://www.graph500.org/specifications). We had originally planned to use
> the rmatGenerator function in GraphGenerators, but found that it wasn’t
> suitable for generating graphs with billions of edges; the edges are
> generated in a single thread and stored in a Set, meaning it can’t generate
> a graph larger than memory on a single JVM (and I think Sets are limited to 
> Int.MaxValue
> elements anyway).
>
>The generator I have is essentially a more scalable version of
> rmatGenerator. We have used it to generate a graph with 2^32 vertices and
> 2^36 edges on our modestly-specced cluster of 16 machines. It seems like
> other people interested in Spark might want to play with some large RMAT
> graphs (or run the Graph 500 benchmark), so I would like to contribute my
> generator. It does have some minor differences from the current
> generator, though:
>
>1. Vertex IDs are shuffled after the graph structure is generated, so
>the degree of a vertex cannot be predicted from its ID (without this step
>vertex 0 would always have the largest degree, followed by vertices
>1,2,4,8, etc.). This is per the Graph 500 spec. It could be easily made
>optional.
>2. Duplicate edges are not removed from the resulting graph. This
>could easily be done with a call to distinct() on the resulting edge list,
>but then there would be slightly fewer edges than one generated by the
>current rmatGenerator. Also this process would be very slow on large graphs
>due to skew.
>3. Doesn’t set the out degree as the vertex attribute. Again this
>would be simple to add, but it could be slow on the super vertices.
>
>   My question for the Spark Devs is: Is this something you would want as
> part of GraphX (either as a replacement for the current rmatGenerator or a
> separate function in GraphGenerators)? Since it was developed at work I
> need to go through our legal department and QA processes to open-source it,
> and to fill out the paperwork I need to know whether I’ll be submitting a
> pull request or standing it up as a separate project on GitHub.
>
>  Thanks!
>
>  -Ryan
>
>   --
> J. Ryan Carr, Ph. D.
>
>  The Johns Hopkins University, Applied Physics Laboratory
> 11100 Johns Hopkins Rd., Laurel, MD 20723
> Office: 240-228-9157
> Cell: 443-744-1004
> Email: *ryan.c...@jhuapl.edu * or *james.c...@jhuapl.edu
> *
>
>


[GraphX] Graph 500 graph generator

2015-06-24 Thread Carr, J. Ryan
Hi Spark Devs,

  As part of a project at work, I have written a graph generator for RMAT 
graphs consistent with the specifications in the Graph 500 benchmark 
(http://www.graph500.org/specifications). We had originally planned to use the 
rmatGenerator function in GraphGenerators, but found that it wasn't suitable 
for generating graphs with billions of edges; the edges are generated in a 
single thread and stored in a Set, meaning it can't generate a graph larger 
than memory on a single JVM (and I think Sets are limited to Int.MaxValue 
elements anyway).

  The generator I have is essentially a more scalable version of rmatGenerator. 
We have used it to generate a graph with 2^32 vertices and 2^36 edges on our 
modestly-specced cluster of 16 machines. It seems like other people interested 
in Spark might want to play with some large RMAT graphs (or run the Graph 500 
benchmark), so I would like to contribute my generator. It does have some minor 
differences from the current generator, though:

  1.  Vertex IDs are shuffled after the graph structure is generated, so the 
degree of a vertex cannot be predicted from its ID (without this step vertex 0 
would always have the largest degree, followed by vertices 1,2,4,8, etc.). This 
is per the Graph 500 spec. It could be easily made optional.
  2.  Duplicate edges are not removed from the resulting graph. This could 
easily be done with a call to distinct() on the resulting edge list, but then 
there would be slightly fewer edges than one generated by the current 
rmatGenerator. Also this process would be very slow on large graphs due to skew.
  3.  Doesn't set the out degree as the vertex attribute. Again this would be 
simple to add, but it could be slow on the super vertices.

  My question for the Spark Devs is: Is this something you would want as part 
of GraphX (either as a replacement for the current rmatGenerator or a separate 
function in GraphGenerators)? Since it was developed at work I need to go 
through our legal department and QA processes to open-source it, and to fill 
out the paperwork I need to know whether I'll be submitting a pull request or 
standing it up as a separate project on GitHub.

Thanks!

-Ryan

--
J. Ryan Carr, Ph. D.

The Johns Hopkins University, Applied Physics Laboratory
11100 Johns Hopkins Rd., Laurel, MD 20723
Office: 240-228-9157
Cell: 443-744-1004
Email: ryan.c...@jhuapl.edu or 
james.c...@jhuapl.edu