I would like to propose that Flink include a selection of graph generators in Gelly. Generated graphs will be useful for scalability, stress, and regression testing as well as for benchmarking and comparing algorithms, both for Flink users and developers. Generated data can be scaled to any size yet is described by a few simple parameters, and it can often substitute for user data or remove the need to share large files when reporting issues.
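
As a rough illustration of how few parameters are needed, and of one way a generator could produce the same output however the work is split, here is a minimal Java sketch of an Erdos-Renyi style generator. It is not Gelly code and uses no Flink API; the class name SeededGraphSketch, its methods, and the hash-based edge test are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not part of Flink or Gelly.
class SeededGraphSketch {

    // Stateless 64-bit mixing function (the SplitMix64 finalizer).
    static long mix(long z) {
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }

    // The decision for edge (u, v) depends only on the seed and the pair,
    // never on which worker evaluates it or in what order.
    static boolean hasEdge(long seed, long u, long v, long n, double p) {
        long h = mix(seed ^ mix(u * n + v));
        double r = (h >>> 11) * 0x1.0p-53; // top 53 bits mapped to [0, 1)
        return r < p;
    }

    // Work done by one simulated worker: source vertices in [from, to).
    static List<String> edgesForRange(long seed, long n, double p, long from, long to) {
        List<String> edges = new ArrayList<>();
        for (long u = from; u < to; u++) {
            for (long v = u + 1; v < n; v++) {
                if (hasEdge(seed, u, v, n, p)) {
                    edges.add(u + "-" + v);
                }
            }
        }
        return edges;
    }

    // Split the vertex range into `parallelism` contiguous chunks and
    // concatenate the per-chunk results in chunk order.
    static List<String> generate(long seed, long n, double p, int parallelism) {
        List<String> all = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            long from = n * i / parallelism;
            long to = n * (i + 1) / parallelism;
            all.addAll(edgesForRange(seed, n, p, from, to));
        }
        return all;
    }

    public static void main(String[] args) {
        List<String> serial = generate(42L, 50, 0.1, 1);
        List<String> split = generate(42L, 50, 0.1, 4);
        System.out.println("edges: " + serial.size());
        System.out.println("same graph at parallelism 1 and 4: " + serial.equals(split));
    }
}
```

The whole graph is described by three values (seed, vertex count, edge probability). The sketch tests every vertex pair, so it costs O(n^2) and is only meant to show the reproducibility property; a real generator would need a cheaper sampling scheme.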
Spark's GraphX includes a modest GraphGenerators class [1].

The initial implementation would focus on Erdos-Renyi, R-MAT [2], and Kronecker [3] generators. A key consideration is that the graphs should be seedable and generate the same Graph regardless of parallelism.

Generated data is a complement to my proposed "Checksum method for DataSet and Graph" [4].

[1] http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.graphx.util.GraphGenerators$
[2] R-MAT: A Recursive Model for Graph Mining; http://snap.stanford.edu/class/cs224w-readings/chakrabarti04rmat.pdf
[3] Kronecker graphs: An Approach to Modeling Networks; http://arxiv.org/pdf/0812.4905v2.pdf
[4] https://issues.apache.org/jira/browse/FLINK-2716

Greg Hogan
