By the way… g.V().repeat(out()).times(3).count() just finished.

gremlin> g.V().repeat(out()).times(3).count()
==>208214159917830
That's 208,214,159,917,830 --- or 208 trillion length-3 paths in the Friendster dataset. It took 12.1 hours --- so, the last out() took 2.7 hours.

Marko.

http://markorodriguez.com

On Apr 3, 2015, at 8:48 AM, Marko Rodriguez <[email protected]> wrote:

> Hi,
>
> Over the last couple of weeks Daniel Kuppitz and I have been running repeated
> Gremlin traversals over the Friendster graph using SparkGraphComputer.
>
> http://tinkerpop.incubator.apache.org/docs/3.0.0-SNAPSHOT/#sparkgraphcomputer
>
> The Friendster graph has 124 million vertices and 2.5 billion edges. We are
> using 3 blades, each with 39 gigs of memory and 24 cores.
>
> https://ia600500.us.archive.org/27/items/friendster-dataset-201107/README-friends.txt
>
> We use ScriptInputFormat to read the text-based representation of the
> Friendster data and generate VertexWritables, which are (when on the wire)
> Gryo adjacency-list serializations.
>
> http://tinkerpop.incubator.apache.org/docs/3.0.0-SNAPSHOT/#gremlin-kryo
>
> Here is the amount of data that is generated:
>
>   raw text-based input: 22.6 gigs
>   Gryo graph generated from the parsed text data: 94.6 gigs
>
> Next, here are the runtimes for the various stages:
>
>   g.V()                  -- 2.5 hours
>   g.V.out()              -- 6 hours (+3.5 hours)
>   g.V.out().out()        -- 9.4 hours (+3.4 hours)
>   g.V.out().out().out()  -- 12.8 hours (+3.4 hours) --- SPECULATED
>                             (running right now -- but on target for that amount of time)
>
> The first computation has no incoming messages, so there is no join() involved
> and thus it's just the raw processing of VertexProgram.execute(). The others
> each incur a join for each out(). The cost of each join is the same, as this
> is the nature of the Gremlin OLAP algorithm.
>
> http://tinkerpop.incubator.apache.org/docs/3.0.0-SNAPSHOT/#traversalvertexprogram
> http://thinkaurelius.com/2012/11/11/faunus-provides-big-graph-data-analytics/
> (read the section called "The Traversal Mechanics of Faunus")
>
> Here is the amount of data being shuffled on each join:
>
>   393.5 gigs
>
> That is how much data is created by a single message pass. That is a lot --
> well, an out() will send a message over each edge in the graph, and there are
> 2.5 billion edges. What is the message that is being sent? It's a
> Traverser<Vertex>, which has a respective Gryo representation (a
> DetachedVertex).
>
> I believe that if we can get our Gryo serialization size down, we can expect
> much better runtimes. An "off the cuff" calculation estimates that we can
> shrink the Gryo file by ~8x.
>
> https://issues.apache.org/jira/browse/TINKERPOP3-609
>
> That will not yield an 8x increase in speed, as VertexProgram.execute() still
> takes a prescribed amount of time that is irrespective of the serialization
> size -- but perhaps 5x+? Or perhaps more than 8x?!, as more data can be held
> in memory by Spark and partitions will not be constantly written to and read
> from disk.
>
> Anywho -- I was IMing about this with Stephen and Daniel and decided to make
> it an email to the groups in case people have thoughts or wish to help on
> this problem.
>
> Enjoy!,
> Marko.
>
> http://markorodriguez.com
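As a quick back-of-the-envelope check on the shuffle figure in the quoted email: dividing the data moved per join by the number of messages gives the implied size of one serialized traverser. This is only a sketch -- the 393.5 gigs and 2.5 billion edges come from the email, and "gigs" is assumed here to mean 10^9 bytes.

```python
# Rough per-message size implied by the shuffle numbers in the email.
# Assumption: "393.5 gigs" means 393.5 * 10^9 bytes.
edges = 2_500_000_000          # one Traverser<Vertex> message per edge
shuffle_bytes = 393.5 * 10**9  # data shuffled on each join

bytes_per_message = shuffle_bytes / edges
print(f"~{bytes_per_message:.0f} bytes per serialized Traverser<Vertex>")

# The email's "off the cuff" ~8x Gryo reduction (TINKERPOP3-609) would imply:
print(f"~{bytes_per_message / 8:.0f} bytes per message after the reduction")
```

So each message weighs in at roughly 157 bytes on the wire under these assumptions, which makes the ~8x serialization-size estimate plausible as a lever on total shuffle volume.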

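The roughly constant +3.4 hours per additional out() reported in the email also implies a cluster-wide message throughput. A sketch under the stated hardware (3 blades x 24 cores), treating each hop as exactly one message per edge:

```python
# Implied message throughput of one out() hop (one join) from the email.
edges = 2_500_000_000   # messages sent per hop (one per edge)
hours_per_hop = 3.4     # incremental runtime of each additional out()
cores = 3 * 24          # 3 blades with 24 cores each

msgs_per_sec = edges / (hours_per_hop * 3600)
print(f"~{msgs_per_sec:,.0f} messages/sec across the cluster")
print(f"~{msgs_per_sec / cores:,.0f} messages/sec per core")
```

That works out to roughly 204,000 messages per second cluster-wide, or under 3,000 per core -- a figure dominated by serialization and disk I/O rather than raw CPU, which is consistent with the email's expectation that a smaller Gryo payload would speed things up.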