Hi Greg, thank you for this proposal! I think graph generators will be a very useful addition to Gelly.
I'm not very familiar with the state-of-the-art algorithms for distributed graph generation. I suppose we could easily provide an efficient random graph generator, and I've also seen some work on parallel/distributed algorithms for R-MAT [1, 2]. Are you aware of similar work for Erdos-Renyi, Kronecker, or other types of graphs? Another place we might want to look is Giraph's Watts-Strogatz generator [3].

Cheers,
Vasia

[1]: https://github.com/farkhor/PaRMAT/
[2]: http://arxiv.org/pdf/1210.0187.pdf
[3]: https://giraph.apache.org/apidocs/org/apache/giraph/io/formats/WattsStrogatzVertexInputFormat.html

On 23 September 2015 at 19:49, Greg Hogan <[email protected]> wrote:

> I would like to propose that Flink include a selection of graph generators
> in Gelly. Generated graphs will be useful for performing scalability,
> stress, and regression testing as well as benchmarking and comparing
> algorithms, both for Flink users and developers. Generated data is
> infinitely scalable yet described by a few simple parameters and can often
> substitute for user data or sharing large files when reporting issues.
>
> Spark's GraphX includes a modest GraphGenerators class [1].
>
> The initial implementation would focus on Erdos-Renyi, R-Mat [2], and
> Kronecker [3] generators.
>
> A key consideration is that the graphs should be seedable and generate the
> same Graph regardless of parallelism.
>
> Generated data is a complement to my proposed "Checksum method for DataSet
> and Graph" [4].
>
> [1]
> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.graphx.util.GraphGenerators$
> [2] R-MAT: A Recursive Model for Graph Mining;
> http://snap.stanford.edu/class/cs224w-readings/chakrabarti04rmat.pdf
> [3] Kronecker graphs: An Approach to Modeling Networks;
> http://arxiv.org/pdf/0812.4905v2.pdf
> [4] https://issues.apache.org/jira/browse/FLINK-2716
>
> Greg Hogan
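
To make the "seedable and same Graph regardless of parallelism" requirement from the quoted proposal concrete, here is a minimal sketch in plain Java. It is not Gelly or Flink API; the class and method names (SeededErdosRenyi, hasEdge, generatePartition, generate) are hypothetical, and the hashing scheme (a SplitMix64-style mixer) is just one possible choice. The idea is that each candidate edge is decided by a hash of (seed, u, v) rather than by a per-worker random stream, so the result cannot depend on how the vertices are split across workers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: an Erdos-Renyi G(n, p) generator whose output depends
// only on (seed, n, p), never on the number of parallel partitions.
class SeededErdosRenyi {

    // SplitMix64-style finalizer: a stateless 64-bit bit mixer.
    static long mix(long z) {
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }

    // Decide edge (u, v) with u < v from the seed and the pair's index alone.
    // Assumes n is small enough that u * n + v does not overflow a long.
    static boolean hasEdge(long seed, long n, long u, long v, double p) {
        long h = mix(seed + 0x9E3779B97F4A7C15L * (u * n + v + 1));
        double r = (h >>> 11) * 0x1.0p-53; // uniform value in [0, 1)
        return r < p;
    }

    // Edges owned by one partition: source vertices u with u % parallelism == part.
    static List<long[]> generatePartition(long seed, long n, double p,
                                          int part, int parallelism) {
        List<long[]> edges = new ArrayList<>();
        for (long u = part; u < n; u += parallelism) {
            for (long v = u + 1; v < n; v++) {
                if (hasEdge(seed, n, u, v, p)) {
                    edges.add(new long[] {u, v});
                }
            }
        }
        return edges;
    }

    // Union of all partitions, as a sorted set of "u-v" strings for comparison.
    static Set<String> generate(long seed, long n, double p, int parallelism) {
        Set<String> all = new TreeSet<>();
        for (int part = 0; part < parallelism; part++) {
            for (long[] e : generatePartition(seed, n, p, part, parallelism)) {
                all.add(e[0] + "-" + e[1]);
            }
        }
        return all;
    }
}
```

With this scheme, generate(42L, 50, 0.1, 1) and generate(42L, 50, 0.1, 4) yield the same edge set. The sketch tests every vertex pair, so it is quadratic in n and only meant to show the determinism property; a practical generator for sparse graphs would need to skip over non-edges instead.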
