Hola amigos y amigas (y las hormigas), Yet again I come bearing gifts, and yet again these gifts are shinier than the previous ones. Are you ready to accept these treasures into your heart of hearts?
1. gremlin-core now has the concept of a VertexProgramInterceptor.
https://github.com/apache/incubator-tinkerpop/blob/master/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/computer/traversal/strategy/VertexProgramInterceptor.java

2. spark-gremlin now has the extension called SparkVertexProgramInterceptor.
https://github.com/apache/incubator-tinkerpop/blob/master/spark-gremlin/src/main/java/org/apache/tinkerpop/gremlin/spark/process/computer/traversal/strategy/SparkVertexProgramInterceptor.java

3. A VertexProgramInterceptor can be used by a GraphComputer implementation to introspect on the submitted VertexProgram and do something different if it is so inclined.

4. SparkStarBarrierInterceptor is able to process any traversal that is local to the star graph and ends with a ReducingBarrierStep using native SparkDSL.
https://github.com/apache/incubator-tinkerpop/blob/master/spark-gremlin/src/main/java/org/apache/tinkerpop/gremlin/spark/process/computer/traversal/strategy/optimization/interceptor/SparkStarBarrierInterceptor.java

5. Moreover, I added SparkSingleIterationStrategy, which will NOT cache nor partition the loaded graphRDD if the traversal only requires a single iteration.
https://github.com/apache/incubator-tinkerpop/blob/master/spark-gremlin/src/main/java/org/apache/tinkerpop/gremlin/spark/process/computer/traversal/strategy/optimization/SparkSingleIterationStrategy.java
*** This was Russell Spitzer's idea, and it was a good one.

g.V().count() -- answer 125000000 (125 million vertices)
- TinkerPop 3.0.0.MX: 2.5 hours
- TinkerPop 3.0.0: 1.5 hours
- TinkerPop 3.1.1: 23 minutes
- TinkerPop 3.2.0: 6.8 minutes (Spark 1.5.2)
- TinkerPop 3.2.0: 5.5 minutes (Spark 1.6.1)
- TinkerPop 3.2.1: 4.5 minutes (Spark 1.6.1)
- TinkerPop 3.2.1: 2.2 minutes (Spark 1.6.1) // latest master

g.V().out().count() -- answer 2586147869 (2.5 billion length-1 paths, i.e. edges)
- TinkerPop 3.0.0.MX: unknown
- TinkerPop 3.0.0: 2.5 hours
- TinkerPop 3.1.1: 1.1 hours
- TinkerPop 3.2.0: 13 minutes (Spark 1.5.2)
- TinkerPop 3.2.0: 12 minutes (Spark 1.6.1)
- TinkerPop 3.2.1: 10 minutes (Spark 1.6.1)
- TinkerPop 3.2.1: 2.4 minutes (Spark 1.6.1) // latest master

g.V().out().out().count() -- answer 640528666156 (640 billion length-2 paths)
- TinkerPop 3.0.0.MX: unknown
- TinkerPop 3.0.0: unknown
- TinkerPop 3.1.1: unknown
- TinkerPop 3.2.0: 55 minutes (Spark 1.5.2)
- TinkerPop 3.2.0: 50 minutes (Spark 1.6.1)
- TinkerPop 3.2.1: 45 minutes (Spark 1.6.1)
- TinkerPop 3.2.1: 37 minutes (Spark 1.6.1) // latest master

g.V().out().out().out().count() -- answer 215664338057221 (215 trillion length-3 paths)
- TinkerPop 3.0.0.MX: 12.8 hours
- TinkerPop 3.0.0: 8.6 hours
- TinkerPop 3.1.1: 2.4 hours
- TinkerPop 3.2.0: 1.6 hours (Spark 1.5.2)
- TinkerPop 3.2.0: 1.5 hours (Spark 1.6.1)
- TinkerPop 3.2.1: 1.3 hours (Spark 1.6.1)
- TinkerPop 3.2.1: 1.1 hours (Spark 1.6.1) // latest master

g.V().out().out().out().out().count() -- answer 83841426570464575 (83 quadrillion length-4 paths)
- TinkerPop 3.0.0.MX: unknown
- TinkerPop 3.0.0: unknown
- TinkerPop 3.1.1: unknown
- TinkerPop 3.2.0: unknown (Spark 1.5.2)
- TinkerPop 3.2.0: 2.1 hours (Spark 1.6.1)
- TinkerPop 3.2.1: 1.9 hours (Spark 1.6.1)
- TinkerPop 3.2.1: 1.7 hours (Spark 1.6.1) // latest master

g.V().out().out().out().out().out().count() -- answer -2280190503167902456 !! I blew the long space -- 64-bit overflow.
- TinkerPop 3.0.0.MX: unknown
- TinkerPop 3.0.0: unknown
- TinkerPop 3.1.1: unknown
- TinkerPop 3.2.0: unknown (Spark 1.5.2)
- TinkerPop 3.2.0: 2.8 hours (Spark 1.6.1)
- TinkerPop 3.2.1: 2.4 hours (Spark 1.6.1)
- TinkerPop 3.2.1: 2.2 hours (Spark 1.6.1) // latest master

g.V().group().by(outE().count()).by(count())
- TinkerPop 3.2.0: 12 minutes (Spark 1.6.1)
- TinkerPop 3.2.1: 12 minutes (Spark 1.6.1)
- TinkerPop 3.2.1: 2.4 minutes (Spark 1.6.1) // latest master

So, while SparkStarBarrierInterceptor is good, note that SparkSingleIterationStrategy is better. Without SparkStarBarrierInterceptor and with only SparkSingleIterationStrategy, we are looking at the following times:

// previous SparkGraphComputer -- native SparkDSL -- with only SparkSingleIterationStrategy (no SparkStarBarrierInterceptor)
g.V().count(): 4.5m -- 2.5m -- 2.9m
g.V().out().count(): 10m -- 2.6m -- 3.1m
g.V().groupCount().by(outE().count()): 12m -- 2.7m -- 3.2m

The key is really the intelligent use of caching and partitioning. Don't do it if you don't need to! And you don't need to on a single iteration. What this also tells us is that SparkGraphComputer and native Spark are very similar in time, which is good: we want Gremlin and SparkDSL to be as equal as possible so there is no incentive for users to create Frankencode that intermingles Gremlin and SparkDSL. Next up -- doing multi-iteration traversals in native SparkDSL, seeing their performance relative to multi-iteration Gremlin traversals, and ensuring that they are "the same."

------------

What we have also done here is create the first provider-specific GraphComputer TraversalStrategy. The release of TinkerPop 3.2.0 created the distinction between Graph and GraphComputer TraversalStrategy registrations. All I had to do was register the strategies with SparkGraphComputer, and when a user goes to use SparkGraphComputer, the new strategies are loaded.
https://github.com/apache/incubator-tinkerpop/blob/master/spark-gremlin/src/main/java/org/apache/tinkerpop/gremlin/spark/process/computer/SparkGraphComputer.java#L98-L103
We will start to see lots more GraphComputer strategies moving forward, especially since the cost of their evaluation relative to the cost of an OLAP job is negligible.
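To make the interceptor idea above concrete, here is a minimal sketch of the pattern in Python. All class and method names below are hypothetical illustrations, not the actual TinkerPop API: the graph computer asks each registered interceptor whether it recognizes the submitted vertex program, and the first one that does evaluates it natively, bypassing the general iterative engine.

```python
# Hypothetical sketch of the VertexProgramInterceptor pattern
# (illustrative names only -- not the real TinkerPop API).

class CountProgram:
    """Stands in for a compiled g.V().count() traversal."""
    name = "count"

class CountInterceptor:
    def intercepts(self, program):
        # Only claim programs this interceptor can evaluate natively.
        return getattr(program, "name", None) == "count"

    def apply(self, program, graph):
        # "Native" evaluation: a single pass, no iterative message passing.
        return len(graph["vertices"])

class SketchGraphComputer:
    def __init__(self, interceptors):
        self.interceptors = interceptors

    def submit(self, program, graph):
        for interceptor in self.interceptors:
            if interceptor.intercepts(program):
                # Short-circuit: skip the general engine entirely.
                return interceptor.apply(program, graph)
        return self.iterate(program, graph)

    def iterate(self, program, graph):
        # The general (slower) iterative engine, elided in this sketch.
        raise NotImplementedError

graph = {"vertices": [0, 1, 2, 3]}
computer = SketchGraphComputer([CountInterceptor()])
print(computer.submit(CountProgram(), graph))  # 4
```

The design point is that the interceptor check is cheap relative to an OLAP job, so a provider can register many of them and only pay when one actually matches.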
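A side note on the overflowed 5-hop count in the benchmarks above: Java's long wraps silently in two's complement. Assuming the true count exceeded 2^63 exactly once (my assumption; the post does not say how many times it wrapped), the negative result can be inverted with a quick check:

```python
# Two's-complement wraparound of a signed 64-bit long (Java semantics).
def to_signed_64(x):
    x &= (1 << 64) - 1
    return x - (1 << 64) if x >= (1 << 63) else x

LONG_MAX = (1 << 63) - 1  # 9223372036854775807

# The 4-hop count (83 quadrillion) still fits in a long...
assert 83841426570464575 <= LONG_MAX

# ...but the 5-hop count overflowed. If it wrapped exactly once
# (an assumption), the true count would be:
reported = -2280190503167902456
recovered = reported + (1 << 64)
assert to_signed_64(recovered) == reported
print(recovered)  # 16166553570541649160
```

Under that single-wrap assumption, the true 5-hop count would be roughly 16 quintillion length-5 paths.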
In OLTP, too many introspections/strategies start to eat into the milliseconds of execution time. However, in OLAP, where just setting up a job on Spark/Hadoop/etc. takes at least 5 seconds, who cares about an extra millisecond for a traversal strategy to execute? Thus, we can start to "go crazy" with OLAP optimizations and not feel guilty.

Enjoy,
Marko.

http://markorodriguez.com