I would like to propose that Flink include a selection of graph generators
in Gelly. Generated graphs would be useful for scalability, stress, and
regression testing, as well as for benchmarking and comparing algorithms,
for both Flink users and developers. Generated data scales to any size yet
is described by a few simple parameters, so it can often stand in for user
data and avoids the need to share large files when reporting issues.

Spark's GraphX includes a modest GraphGenerators class [1].

The initial implementation would focus on Erdős-Rényi, R-MAT [2], and
Kronecker [3] generators.

A key consideration is that the generators should be seedable and, for a
given seed, produce the same Graph regardless of parallelism.
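
One way to get that property is to derive each random stream from the seed
and a logical unit of the graph (here, the source vertex) rather than from
the worker that happens to process it. The following is only a minimal
sketch in plain Java under that assumption, not a proposed Gelly API: the
class SeededErdosRenyi, its method names, and the modulo assignment of
vertices to workers are all hypothetical, and the workers are simulated
with a loop instead of a Flink job.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.SplittableRandom;

// Hypothetical illustration only; not part of Flink or Gelly.
class SeededErdosRenyi {

    // The edges of one source vertex depend only on (seed, source),
    // never on which worker generates them.
    static List<long[]> edgesForSource(long seed, long source, long vertexCount, double p) {
        SplittableRandom rng = new SplittableRandom(seed + source * 0x9E3779B97F4A7C15L);
        List<long[]> edges = new ArrayList<>();
        for (long target = source + 1; target < vertexCount; target++) {
            if (rng.nextDouble() < p) {
                edges.add(new long[] {source, target});
            }
        }
        return edges;
    }

    // Simulates `parallelism` workers; worker w owns every source vertex
    // with source % parallelism == w.
    static List<String> generate(long seed, long vertexCount, double p, int parallelism) {
        List<String> result = new ArrayList<>();
        for (int worker = 0; worker < parallelism; worker++) {
            for (long source = worker; source < vertexCount; source += parallelism) {
                for (long[] e : edgesForSource(seed, source, vertexCount, p)) {
                    result.add(e[0] + "," + e[1]);
                }
            }
        }
        Collections.sort(result); // canonical order so that runs can be compared
        return result;
    }

    public static void main(String[] args) {
        List<String> serial = generate(42L, 100, 0.05, 1);
        List<String> parallel = generate(42L, 100, 0.05, 8);
        System.out.println("same edge set: " + serial.equals(parallel));
    }
}
```

Because every per-vertex stream is a function of the seed and the vertex ID
alone, the sorted edge lists for parallelism 1 and 8 are identical, while a
different seed yields a different graph. A generator that seeded one stream
per worker instead would change its output whenever the parallelism changed.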

Generated data would complement my proposed "Checksum method for DataSet
and Graph" [4].

[1]
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.graphx.util.GraphGenerators$
[2] R-MAT: A Recursive Model for Graph Mining;
http://snap.stanford.edu/class/cs224w-readings/chakrabarti04rmat.pdf
[3] Kronecker graphs: An Approach to Modeling Networks;
http://arxiv.org/pdf/0812.4905v2.pdf
[4] https://issues.apache.org/jira/browse/FLINK-2716

Greg Hogan
