Hi,
Two tickets were recently completed.
https://issues.apache.org/jira/browse/TINKERPOP-1131 (TinkerPop
3.1.2-SNAPSHOT & TinkerPop 3.2.0-SNAPSHOT)
https://issues.apache.org/jira/browse/TINKERPOP-962 (TinkerPop
3.2.0-SNAPSHOT)
- this push also includes updates to serialization.
With these merged, I benchmarked SparkGraphComputer against Friendster (2.5
billion edges) for the following queries:
g.V().count() -- answer 125000000 (125 million vertices)
- TinkerPop 3.0.0.MX: 2.5 hours
- TinkerPop 3.0.0: 1.5 hours
- TinkerPop 3.1.1: 23 minutes
- TinkerPop 3.2.0: 6.8 minutes
g.V().out().count() -- answer 2586147869 (2.5 billion length-1 paths (i.e.
edges))
- TinkerPop 3.0.0.MX: unknown
- TinkerPop 3.0.0: 2.5 hours
- TinkerPop 3.1.1: 1.1 hours
- TinkerPop 3.2.0: 13 minutes (*** TinkerPop 3.1.2 will be this
fast too)
g.V().out().out().count() -- answer 640528666156 (640 billion length-2 paths)
- TinkerPop 3.0.0.MX: unknown
- TinkerPop 3.0.0: unknown
- TinkerPop 3.1.1: unknown
- TinkerPop 3.2.0: 55 minutes (*** TinkerPop 3.1.2 will be this
fast too)
g.V().out().out().out().count() -- answer 215664338057221 (215 trillion
length-3 paths)
- TinkerPop 3.0.0.MX: 12.8 hours
- TinkerPop 3.0.0: 8.6 hours
- TinkerPop 3.1.1: 2.4 hours
- TinkerPop 3.2.0: 1.6 hours (*** TinkerPop 3.1.2 will be this
fast too)
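For a quick sense of the relative gains, here is a small Python sketch that turns the runtimes above into speedup factors relative to TinkerPop 3.2.0. The dictionary layout and the speedups() helper are purely illustrative (they are not part of TinkerPop), hours have been converted to minutes by hand, and versions reported as "unknown" are simply left out.

```python
# Runtimes in minutes, transcribed from the benchmark figures above.
# Entries reported as "unknown" are omitted.
RUNTIMES_MIN = {
    "g.V().count()": {
        "3.0.0.MX": 150.0, "3.0.0": 90.0, "3.1.1": 23.0, "3.2.0": 6.8,
    },
    "g.V().out().count()": {
        "3.0.0": 150.0, "3.1.1": 66.0, "3.2.0": 13.0,
    },
    "g.V().out().out().out().count()": {
        "3.0.0.MX": 768.0, "3.0.0": 516.0, "3.1.1": 144.0, "3.2.0": 96.0,
    },
}


def speedups(times, baseline="3.2.0"):
    """Return how many times slower each version is than the baseline."""
    base = times[baseline]
    return {
        version: round(minutes / base, 1)
        for version, minutes in times.items()
        if version != baseline
    }


for query, times in RUNTIMES_MIN.items():
    print(query, speedups(times))
```

Read this way, g.V().count() on 3.2.0 is roughly 22x faster than on 3.0.0.MX and about 3.4x faster than on 3.1.1.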
For SparkGraphComputer, I no longer have to use DISK_ONLY: the memory
optimizations have greatly reduced heap usage, so I can use
MEMORY_AND_DISK_SER w/o causing the GC to go crazy. Moreover, because of
TINKERPOP-1131, ReducingBarrierSteps (e.g. groupCount(), count(), sum(), max())
are significantly faster and use a minuscule amount of memory. Together,
these updates have greatly improved GraphComputer, as you can see in the
SparkGraphComputer benchmark above.
Finally, check this out. I decided to test the speed of g.V().count() when the
input graph is already partitioned across the Spark cluster. This is what you
will see when you use PersistedOutputRDD/InputRDD or a graph system whose
InputRDD provides a Partitioner and thus avoids an initial partition by
SparkGraphComputer.
g.V().count() -- answer 125000000 (125 million vertices)
- TinkerPop 3.2.0: 5.2 minutes
… hmm, not as good as I was hoping. I thought this would be around 1-2 minutes.
:| I bet there is something I'm doing wrong.
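To put the pre-partitioned number in context, here is a tiny Python sketch comparing it with the 6.8-minute g.V().count() run on 3.2.0 and with the hoped-for 1-2 minutes. The variable names are illustrative only; the figures are the ones quoted in this email.

```python
# g.V().count() runtimes on TinkerPop 3.2.0, in minutes (from this email).
unpartitioned_min = 6.8      # SparkGraphComputer does the initial partition
partitioned_min = 5.2        # input graph already partitioned
hoped_range_min = (1.0, 2.0) # what I expected to see

# Share of the runtime saved by skipping the initial partition.
saved_pct = round(
    100 * (unpartitioned_min - partitioned_min) / unpartitioned_min, 1
)

# How far the observed time is from the upper end of the hoped-for range.
gap_factor = round(partitioned_min / hoped_range_min[1], 1)

print(f"time saved by pre-partitioning: {saved_pct}%")
print(f"still {gap_factor}x slower than the 2-minute target")
```

So pre-partitioning saves a bit under a quarter of the runtime, which is still well short of the 1-2 minute range.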
Enjoy!
Marko.
http://markorodriguez.com