By the way… g.V().repeat(out()).times(3).count() just finished.

gremlin> g.V().repeat(out()).times(3).count()
==>208214159917830
That's 208,214,159,917,830 --- or 208 trillion length-3 paths in the Friendster dataset. It took 12.1 hours --- so, the last out() took 2.7 hours.

Marko.

http://markorodriguez.com

On Apr 3, 2015, at 8:48 AM, Marko Rodriguez <[email protected]> wrote:

> Hi,
>
> Over the last couple of weeks Daniel Kuppitz and I have been running repeated
> Gremlin traversals over the Friendster graph using SparkGraphComputer.
>
> http://tinkerpop.incubator.apache.org/docs/3.0.0-SNAPSHOT/#sparkgraphcomputer
>
> The Friendster graph has 124 million vertices and 2.5 billion edges. We are
> using 3 blades, each with 39 gigs of memory and 24 cores.
>
> https://ia600500.us.archive.org/27/items/friendster-dataset-201107/README-friends.txt
>
> We use ScriptInputFormat to read the text-based representation of the
> Friendster data and generate VertexWritables, which are (when on the wire)
> Gryo adjacency-list serializations.
>
> http://tinkerpop.incubator.apache.org/docs/3.0.0-SNAPSHOT/#gremlin-kryo
>
> Here is the amount of data that is generated:
>
>   raw text-based input: 22.6 gigs
>   Gryo graph generated from the parsed text data: 94.6 gigs
>
> Next, here are the runtimes for the various stages:
>
>   g.V()                  -- 2.5 hours
>   g.V.out()              -- 6 hours (+3.5 hours)
>   g.V.out().out()        -- 9.4 hours (+3.4 hours)
>   g.V.out().out().out()  -- 12.8 hours (+3.4 hours) --- SPECULATED
>                             (running right now -- but on target for that amount of time)
>
> The first computation has no incoming messages, so there is no join() involved
> and thus it's just the raw processing of VertexProgram.execute(). The others
> each incur a join for each out(). The cost of each join is the same, as this
> is the nature of the Gremlin OLAP algorithm.
>
> http://tinkerpop.incubator.apache.org/docs/3.0.0-SNAPSHOT/#traversalvertexprogram
> http://thinkaurelius.com/2012/11/11/faunus-provides-big-graph-data-analytics/
> (read the section called "The Traversal Mechanics of Faunus")
>
> Here is the amount of data being shuffled on each join:
>
>   393.5 gigs
>
> That is how much data is created by a single message pass. That is a lot --
> well, an out() will send a message over each edge in the graph, and there are
> 2.5 billion edges. What is the message that is being sent? It's a
> Traverser<Vertex>, which has a respective Gryo representation (a
> DetachedVertex).
>
> I believe that if we can get our Gryo serialization size down, we can expect
> much better runtimes. An "off the cuff" calculation estimates that we can
> shrink the Gryo file by ~8x.
>
> https://issues.apache.org/jira/browse/TINKERPOP3-609
>
> That will not yield an 8x increase in speed, as VertexProgram.execute() still
> takes a prescribed amount of time that is irrespective of the serialization
> size -- but perhaps 5x+? Or perhaps more than 8x?!, as more data can be held
> in memory by Spark and partitions will not be constantly written to and read
> from disk.
>
> Anywho -- I was IMing about this with Stephen and Daniel and decided to make
> it an email to the groups in case people have thoughts or wish to help on
> this problem.
>
> Enjoy!,
> Marko.
>
> http://markorodriguez.com
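As a quick back-of-the-envelope check on the shuffle figure in the quoted email: dividing the data moved per join by the number of messages gives the implied size of one serialized traverser. This is only a sketch -- the 393.5 gigs and 2.5 billion edges come from the email, and "gigs" is assumed here to mean 10^9 bytes.

```python
# Rough per-message size implied by the shuffle numbers in the email.
# Assumption: "393.5 gigs" means 393.5 * 10^9 bytes.
edges = 2_500_000_000          # one Traverser<Vertex> message per edge
shuffle_bytes = 393.5 * 10**9  # data shuffled on each join

bytes_per_message = shuffle_bytes / edges
print(f"~{bytes_per_message:.0f} bytes per serialized Traverser<Vertex>")

# The email's "off the cuff" ~8x Gryo reduction (TINKERPOP3-609) would imply:
print(f"~{bytes_per_message / 8:.0f} bytes per message after the reduction")
```

So each message weighs in at roughly 157 bytes on the wire under these assumptions, which makes the ~8x serialization-size estimate plausible as a lever on total shuffle volume.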

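The roughly constant +3.4 hours per additional out() reported in the email also implies a cluster-wide message throughput. A sketch under the stated hardware (3 blades x 24 cores), treating each hop as exactly one message per edge:

```python
# Implied message throughput of one out() hop (one join) from the email.
edges = 2_500_000_000   # messages sent per hop (one per edge)
hours_per_hop = 3.4     # incremental runtime of each additional out()
cores = 3 * 24          # 3 blades with 24 cores each

msgs_per_sec = edges / (hours_per_hop * 3600)
print(f"~{msgs_per_sec:,.0f} messages/sec across the cluster")
print(f"~{msgs_per_sec / cores:,.0f} messages/sec per core")
```

That works out to roughly 204,000 messages per second cluster-wide, or under 3,000 per core -- a figure dominated by serialization and disk I/O rather than raw CPU, which is consistent with the email's expectation that a smaller Gryo payload would speed things up.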