Hi Ryan,
If you can get past the paperwork, I'm sure this can make a great Spark
Package (http://spark-packages.org). People then can use it for
benchmarking purposes, and I'm sure people will be looking for graph
generators!

Best,
Burak

On Wed, Jun 24, 2015 at 7:55 AM, Carr, J. Ryan <ryan.c...@jhuapl.edu> wrote:

>  Hi Spark Devs,
>
>    As part of a project at work, I have written a graph generator for
> RMAT graphs consistent with the specifications in the Graph 500 benchmark (
> http://www.graph500.org/specifications). We had originally planned to use
> the rmatGenerator function in GraphGenerators, but found that it wasn’t
> suitable for generating graphs with billions of edges; the edges are
> generated in a single thread and stored in a Set, meaning it can’t generate
> a graph larger than memory on a single JVM (and I think Sets are limited to 
> Int.MaxValue
> elements anyway).
>
>    The generator I have is essentially a more scalable version of
> rmatGenerator. We have used it to generate a graph with 2^32 vertices and
> 2^36 edges on our modestly-specced cluster of 16 machines. It seems like
> other people interested in Spark might want to play with some large RMAT
> graphs (or run the Graph 500 benchmark), so I would like to contribute my
> generator. It does have some minor differences from the current
> generator, though:
>
>    1. Vertex IDs are shuffled after the graph structure is generated, so
>    the degree of a vertex cannot be predicted from its ID (without this step
>    vertex 0 would always have the largest degree, followed by vertices
>    1,2,4,8, etc.). This is per the Graph 500 spec. It could be easily made
>    optional.
>    2. Duplicate edges are not removed from the resulting graph. This
>    could easily be done with a call to distinct() on the resulting edge list,
>    but then there would be slightly fewer edges than one generated by the
>    current rmatGenerator. Also this process would be very slow on large graphs
>    due to skew.
>    3. Doesn’t set the out degree as the vertex attribute. Again this
>    would be simple to add, but it could be slow on the super vertices.
>
>   My question for the Spark Devs is: Is this something you would want as
> part of GraphX (either as a replacement for the current rmatGenerator or a
> separate function in GraphGenerators)? Since it was developed at work I
> need to go through our legal department and QA processes to open-source it,
> and to fill out the paperwork I need to know whether I’ll be submitting a
> pull request or standing it up as a separate project on GitHub.
>
>  Thanks!
>
>  -Ryan
>
>   --
> J. Ryan Carr, Ph. D.
>
>  The Johns Hopkins University, Applied Physics Laboratory
> 11100 Johns Hopkins Rd., Laurel, MD 20723
> Office: 240-228-9157
> Cell: 443-744-1004
> Email: *ryan.c...@jhuapl.edu <ryan.c...@jhuapl.edu>* or *james.c...@jhuapl.edu
> <james.c...@jhuapl.edu>*
>
>

Reply via email to