Ruminations on SparkGraphComputer -- Part Deux

Marko Rodriguez Tue, 09 Feb 2016 14:41:06 -0800

Hi,

Two tickets were recently completed:
        https://issues.apache.org/jira/browse/TINKERPOP-1131 (TinkerPop 3.1.2-SNAPSHOT & TinkerPop 3.2.0-SNAPSHOT)
        https://issues.apache.org/jira/browse/TINKERPOP-962 (TinkerPop 3.2.0-SNAPSHOT)
                - with updates to serialization as well in this push.


With these merged, I benchmarked SparkGraphComputer against Friendster (2.5 
billion edges) for the following queries:

g.V().count() -- answer 125000000 (125 million vertices)
        - TinkerPop 3.0.0.MX: 2.5 hours
        - TinkerPop 3.0.0:      1.5 hours
        - TinkerPop 3.1.1:      23 minutes
        - TinkerPop 3.2.0:      6.8 minutes

g.V().out().count() -- answer 2586147869 (2.5 billion length-1 paths, i.e. edges)
        - TinkerPop 3.0.0.MX: unknown
        - TinkerPop 3.0.0:      2.5 hours
        - TinkerPop 3.1.1:      1.1 hours
        - TinkerPop 3.2.0:      13 minutes (*** TinkerPop 3.1.2 will be this fast too)
        
g.V().out().out().count() -- answer 640528666156 (640 billion length-2 paths)
        - TinkerPop 3.0.0.MX: unknown
        - TinkerPop 3.0.0:      unknown
        - TinkerPop 3.1.1:      unknown
        - TinkerPop 3.2.0:      55 minutes (*** TinkerPop 3.1.2 will be this fast too)

g.V().out().out().out().count() -- answer 215664338057221 (215 trillion length-3 paths)
        - TinkerPop 3.0.0.MX: 12.8 hours
        - TinkerPop 3.0.0:      8.6 hours
        - TinkerPop 3.1.1:      2.4 hours
        - TinkerPop 3.2.0:      1.6 hours (*** TinkerPop 3.1.2 will be this fast too)
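
To put the numbers above in relative terms, here is a minimal Python sketch that computes how much faster 3.2.0 is than each earlier version for every query. The times are transcribed from the list above and converted to minutes; the names BENCHMARK_MINUTES and speedups_vs are made up for this illustration and are not part of TinkerPop.

```python
# Wall-clock times in minutes, transcribed from the benchmark list above
# (e.g. 2.5 hours -> 150 minutes). None marks a run reported as "unknown".
BENCHMARK_MINUTES = {
    "g.V().count()": {
        "3.0.0.MX": 150.0, "3.0.0": 90.0, "3.1.1": 23.0, "3.2.0": 6.8},
    "g.V().out().count()": {
        "3.0.0.MX": None, "3.0.0": 150.0, "3.1.1": 66.0, "3.2.0": 13.0},
    "g.V().out().out().count()": {
        "3.0.0.MX": None, "3.0.0": None, "3.1.1": None, "3.2.0": 55.0},
    "g.V().out().out().out().count()": {
        "3.0.0.MX": 768.0, "3.0.0": 516.0, "3.1.1": 144.0, "3.2.0": 96.0},
}


def speedups_vs(times, newest="3.2.0"):
    """Return {older_version: older_time / newest_time}, skipping unknown runs."""
    newest_time = times[newest]
    return {
        version: minutes / newest_time
        for version, minutes in times.items()
        if version != newest and minutes is not None
    }


if __name__ == "__main__":
    for query, times in BENCHMARK_MINUTES.items():
        for version, factor in speedups_vs(times).items():
            print(f"{query}: 3.2.0 is {factor:.1f}x faster than {version}")
```

By this arithmetic, g.V().count() is roughly 22x faster in 3.2.0 than in 3.0.0.MX, and the three-hop count is 8x faster. The two-hop query has no earlier timings, so no factor can be derived for it.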

For SparkGraphComputer, I no longer have to use DISK_ONLY: the memory 
optimizations have greatly reduced heap usage, so I can use 
MEMORY_AND_DISK_SER w/o causing the GC to go crazy. Moreover, because of 
TINKERPOP-1131, ReducingBarrierSteps (e.g. groupCount(), count(), sum(), max()) 
are significantly faster and use a minuscule amount of memory. Together, 
these updates have greatly improved GraphComputer, as the SparkGraphComputer 
benchmarks above show.

Finally, check this out. I decided to test the speed of g.V().count() when the 
input graph is already partitioned across the Spark cluster. This is what you 
will see when you use PersistedOutputRDD/InputRDD or when you use a graph system 
that provides a Partitioner with its InputRDD and thus avoids an initial 
partition by SparkGraphComputer.

g.V().count() -- answer 125000000 (125 million vertices)
        - TinkerPop 3.2.0:      5.2 minutes
… hmm, not as good as I was hoping. I thought this would be around 1-2 minutes. 
:| I bet there is something I'm doing wrong.
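
For a sense of the gap between that result and the hoped-for 1-2 minutes, a small Python sketch of the arithmetic follows. The constants come from the two g.V().count() timings for 3.2.0 in this message; the variable names are illustrative only, and treating the difference as the cost of the initial partition is an assumption, since other factors may contribute.

```python
# Timings for g.V().count() on TinkerPop 3.2.0, in minutes, from this message.
FULL_RUN_MIN = 6.8        # includes the initial partition by SparkGraphComputer
PREPARTITIONED_MIN = 5.2  # input graph already partitioned
HOPED_MIN, HOPED_MAX = 1.0, 2.0  # the "around 1-2 minutes" expectation


def speedup(before, after):
    """Ratio of the slower time to the faster time."""
    return before / after


observed = speedup(FULL_RUN_MIN, PREPARTITIONED_MIN)
hoped_low = speedup(FULL_RUN_MIN, HOPED_MAX)
hoped_high = speedup(FULL_RUN_MIN, HOPED_MIN)

# Assumption: the whole difference is time saved by skipping the partition.
saved_min = FULL_RUN_MIN - PREPARTITIONED_MIN

if __name__ == "__main__":
    print(f"observed speedup: {observed:.2f}x")
    print(f"hoped-for speedup: {hoped_low:.1f}x to {hoped_high:.1f}x")
    print(f"time saved: {saved_min:.1f} minutes")
```

The observed gain is about 1.3x, against the 3.4x to 6.8x that a 1-2 minute run would have meant.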

Enjoy!,
Marko.

http://markorodriguez.com
