Hola amigos y amigas (y las hormigas), Yet again I come bearing gifts, and yet again these gifts are shinier than the previous ones. Are you ready to accept these treasures into your heart of hearts?
1. gremlin-core now has the concept of a VertexProgramInterceptor.
https://github.com/apache/incubator-tinkerpop/blob/master/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/computer/traversal/strategy/VertexProgramInterceptor.java

2. spark-gremlin now has the extension called SparkVertexProgramInterceptor.
https://github.com/apache/incubator-tinkerpop/blob/master/spark-gremlin/src/main/java/org/apache/tinkerpop/gremlin/spark/process/computer/traversal/strategy/SparkVertexProgramInterceptor.java

3. A VertexProgramInterceptor can be used by a GraphComputer implementation to introspect on the submitted VertexProgram and do something different if it is so inclined.

4. SparkStarBarrierInterceptor is able to process any traversal that is local to the star graph and ends with a ReducingBarrierStep using native SparkDSL.
https://github.com/apache/incubator-tinkerpop/blob/master/spark-gremlin/src/main/java/org/apache/tinkerpop/gremlin/spark/process/computer/traversal/strategy/optimization/interceptor/SparkStarBarrierInterceptor.java

5. Moreover, I added SparkSingleIterationStrategy, which will NOT cache nor partition the loaded graphRDD if the traversal only requires a single iteration.
https://github.com/apache/incubator-tinkerpop/blob/master/spark-gremlin/src/main/java/org/apache/tinkerpop/gremlin/spark/process/computer/traversal/strategy/optimization/SparkSingleIterationStrategy.java
*** This was Russell Spitzer's idea, and it was a good one.

g.V().count() -- answer 125000000 (125 million vertices)
- TinkerPop 3.0.0.MX: 2.5 hours
- TinkerPop 3.0.0: 1.5 hours
- TinkerPop 3.1.1: 23 minutes
- TinkerPop 3.2.0: 6.8 minutes (Spark 1.5.2)
- TinkerPop 3.2.0: 5.5 minutes (Spark 1.6.1)
- TinkerPop 3.2.1: 4.5 minutes (Spark 1.6.1)
- TinkerPop 3.2.1: 2.2 minutes (Spark 1.6.1) // latest master

g.V().out().count() -- answer 2586147869 (2.5 billion length-1 paths, i.e. edges)
- TinkerPop 3.0.0.MX: unknown
- TinkerPop 3.0.0: 2.5 hours
- TinkerPop 3.1.1: 1.1 hours
- TinkerPop 3.2.0: 13 minutes (Spark 1.5.2)
- TinkerPop 3.2.0: 12 minutes (Spark 1.6.1)
- TinkerPop 3.2.1: 10 minutes (Spark 1.6.1)
- TinkerPop 3.2.1: 2.4 minutes (Spark 1.6.1) // latest master

g.V().out().out().count() -- answer 640528666156 (640 billion length-2 paths)
- TinkerPop 3.0.0.MX: unknown
- TinkerPop 3.0.0: unknown
- TinkerPop 3.1.1: unknown
- TinkerPop 3.2.0: 55 minutes (Spark 1.5.2)
- TinkerPop 3.2.0: 50 minutes (Spark 1.6.1)
- TinkerPop 3.2.1: 45 minutes (Spark 1.6.1)
- TinkerPop 3.2.1: 37 minutes (Spark 1.6.1) // latest master

g.V().out().out().out().count() -- answer 215664338057221 (215 trillion length-3 paths)
- TinkerPop 3.0.0.MX: 12.8 hours
- TinkerPop 3.0.0: 8.6 hours
- TinkerPop 3.1.1: 2.4 hours
- TinkerPop 3.2.0: 1.6 hours (Spark 1.5.2)
- TinkerPop 3.2.0: 1.5 hours (Spark 1.6.1)
- TinkerPop 3.2.1: 1.3 hours (Spark 1.6.1)
- TinkerPop 3.2.1: 1.1 hours (Spark 1.6.1) // latest master

g.V().out().out().out().out().count() -- answer 83841426570464575 (83 quadrillion length-4 paths)
- TinkerPop 3.0.0.MX: unknown
- TinkerPop 3.0.0: unknown
- TinkerPop 3.1.1: unknown
- TinkerPop 3.2.0: unknown (Spark 1.5.2)
- TinkerPop 3.2.0: 2.1 hours (Spark 1.6.1)
- TinkerPop 3.2.1: 1.9 hours (Spark 1.6.1)
- TinkerPop 3.2.1: 1.7 hours (Spark 1.6.1) // latest master

g.V().out().out().out().out().out().count() -- answer -2280190503167902456 !! I blew the long space -- 64-bit overflow.
- TinkerPop 3.0.0.MX: unknown
- TinkerPop 3.0.0: unknown
- TinkerPop 3.1.1: unknown
- TinkerPop 3.2.0: unknown (Spark 1.5.2)
- TinkerPop 3.2.0: 2.8 hours (Spark 1.6.1)
- TinkerPop 3.2.1: 2.4 hours (Spark 1.6.1)
- TinkerPop 3.2.1: 2.2 hours (Spark 1.6.1) // latest master

g.V().group().by(outE().count()).by(count())
- TinkerPop 3.2.0: 12 minutes (Spark 1.6.1)
- TinkerPop 3.2.1: 12 minutes (Spark 1.6.1)
- TinkerPop 3.2.1: 2.4 minutes (Spark 1.6.1) // latest master

So, while SparkStarBarrierInterceptor is good, note that SparkSingleIterationStrategy is better. Without SparkStarBarrierInterceptor and with only SparkSingleIterationStrategy, we are looking at the following times:

// previous SparkGraphComputer -- native SparkDSL -- with only SparkSingleIterationStrategy (no SparkStarBarrierInterceptor)
g.V().count(): 4.5m -- 2.5m -- 2.9m
g.V().out().count(): 10m -- 2.6m -- 3.1m
g.V().groupCount().by(outE().count()): 12m -- 2.7m -- 3.2m

The key is really the intelligent use of caching and partitioning. Don't do it if you don't need to! And you don't need to on a single iteration. What this also tells us is that SparkGraphComputer and native Spark are very similar in time, which is good: we want Gremlin and SparkDSL to be as equal as possible so there is no incentive for users to create Frankencode that intermingles Gremlin and SparkDSL. Next up -- doing multi-iteration traversals in native SparkDSL, seeing their performance relative to multi-iteration Gremlin traversals, and ensuring that they are "the same."

------------

What we have also done here is create the first provider-specific GraphComputer TraversalStrategy. The release of TinkerPop 3.2.0 created the distinction between Graph and GraphComputer TraversalStrategy registrations. All I had to do was register the strategies with SparkGraphComputer, and when a user goes to use SparkGraphComputer, the new strategies are loaded.
https://github.com/apache/incubator-tinkerpop/blob/master/spark-gremlin/src/main/java/org/apache/tinkerpop/gremlin/spark/process/computer/SparkGraphComputer.java#L98-L103
We will start to see lots more GraphComputer strategies moving forward, especially since the cost of their evaluation relative to the cost of an OLAP job is negligible.
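To make the interceptor idea above concrete, here is a minimal sketch of the pattern in Python. All class and method names below are hypothetical illustrations, not the actual TinkerPop API: the graph computer asks each registered interceptor whether it recognizes the submitted vertex program, and the first one that does evaluates it natively, bypassing the general iterative engine.

```python
# Hypothetical sketch of the VertexProgramInterceptor pattern
# (illustrative names only -- not the real TinkerPop API).

class CountProgram:
    """Stands in for a compiled g.V().count() traversal."""
    name = "count"

class CountInterceptor:
    def intercepts(self, program):
        # Only claim programs this interceptor can evaluate natively.
        return getattr(program, "name", None) == "count"

    def apply(self, program, graph):
        # "Native" evaluation: a single pass, no iterative message passing.
        return len(graph["vertices"])

class SketchGraphComputer:
    def __init__(self, interceptors):
        self.interceptors = interceptors

    def submit(self, program, graph):
        for interceptor in self.interceptors:
            if interceptor.intercepts(program):
                # Short-circuit: skip the general engine entirely.
                return interceptor.apply(program, graph)
        return self.iterate(program, graph)

    def iterate(self, program, graph):
        # The general (slower) iterative engine, elided in this sketch.
        raise NotImplementedError

graph = {"vertices": [0, 1, 2, 3]}
computer = SketchGraphComputer([CountInterceptor()])
print(computer.submit(CountProgram(), graph))  # 4
```

The design point is that the interceptor check is cheap relative to an OLAP job, so a provider can register many of them and only pay when one actually matches.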
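A side note on the overflowed 5-hop count in the benchmarks above: Java's long wraps silently in two's complement. Assuming the true count exceeded 2^63 exactly once (my assumption; the post does not say how many times it wrapped), the negative result can be inverted with a quick check:

```python
# Two's-complement wraparound of a signed 64-bit long (Java semantics).
def to_signed_64(x):
    x &= (1 << 64) - 1
    return x - (1 << 64) if x >= (1 << 63) else x

LONG_MAX = (1 << 63) - 1  # 9223372036854775807

# The 4-hop count (83 quadrillion) still fits in a long...
assert 83841426570464575 <= LONG_MAX

# ...but the 5-hop count overflowed. If it wrapped exactly once
# (an assumption), the true count would be:
reported = -2280190503167902456
recovered = reported + (1 << 64)
assert to_signed_64(recovered) == reported
print(recovered)  # 16166553570541649160
```

Under that single-wrap assumption, the true 5-hop count would be roughly 16 quintillion length-5 paths.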
In OLTP, too many introspections/strategies start to eat into the milliseconds of execution time. However, in OLAP, where just setting up a job on Spark/Hadoop/etc. takes at least 5 seconds, who cares about an extra millisecond for a traversal strategy to execute? Thus, we can start to "go crazy" with OLAP optimizations and not feel guilty.

Enjoy,
Marko.

http://markorodriguez.com