Hi Spark Devs, As part of a project at work, I have written a graph generator for RMAT graphs consistent with the specifications in the Graph 500 benchmark (http://www.graph500.org/specifications). We had originally planned to use the rmatGenerator function in GraphGenerators, but found that it wasn't suitable for generating graphs with billions of edges; the edges are generated in a single thread and stored in a Set, meaning it can't generate a graph larger than memory on a single JVM (and I think Sets are limited to Int.MaxValue elements anyway).
The generator I have is essentially a more scalable version of rmatGenerator. We have used it to generate a graph with 2^32 vertices and 2^36 edges on our modestly-specced cluster of 16 machines. It seems like other people interested in Spark might want to play with some large RMAT graphs (or run the Graph 500 benchmark), so I would like to contribute my generator. It does have some minor differences from the current generator, though: 1. Vertex IDs are shuffled after the graph structure is generated, so the degree of a vertex cannot be predicted from its ID (without this step vertex 0 would always have the largest degree, followed by vertices 1,2,4,8, etc.). This is per the Graph 500 spec. It could be easily made optional. 2. Duplicate edges are not removed from the resulting graph. This could easily be done with a call to distinct() on the resulting edge list, but then there would be slightly fewer edges than one generated by the current rmatGenerator. Also this process would be very slow on large graphs due to skew. 3. Doesn't set the out degree as the vertex attribute. Again this would be simple to add, but it could be slow on the super vertices. My question for the Spark Devs is: Is this something you would want as part of GraphX (either as a replacement for the current rmatGenerator or a separate function in GraphGenerators)? Since it was developed at work I need to go through our legal department and QA processes to open-source it, and to fill out the paperwork I need to know whether I'll be submitting a pull request or standing it up as a separate project on GitHub. Thanks! -Ryan -- J. Ryan Carr, Ph. D. The Johns Hopkins University, Applied Physics Laboratory 11100 Johns Hopkins Rd., Laurel, MD 20723 Office: 240-228-9157 Cell: 443-744-1004 Email: ryan.c...@jhuapl.edu<mailto:ryan.c...@jhuapl.edu> or james.c...@jhuapl.edu<mailto:james.c...@jhuapl.edu>