[ https://issues.apache.org/jira/browse/TINKERPOP-1288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15272581#comment-15272581 ]
ASF GitHub Bot commented on TINKERPOP-1288: ------------------------------------------- GitHub user okram opened a pull request: https://github.com/apache/incubator-tinkerpop/pull/305 TINKERPOP-1288 & TINKERPOP-1290: VertexProgramInterceptor and GraphComputer TraversalStrategies https://issues.apache.org/jira/browse/TINKERPOP-1288 https://issues.apache.org/jira/browse/TINKERPOP-1290 We now have the concept of `TraversalStrategies` for `GraphComputers`. A clear pattern has been established for their use with `SparkGraphComputer`. Extending these concepts to `TinkerGraphComputer` will be easy. Moreover, we have `SparkInterceptorStrategy` which we can continue to extend with traversal patterns that we can map to use Spark DSL under the hood instead of `GraphComputer` semantics for performance boosting. However, after some intelligent refactoring around partitioning and caching (thanks @RussellSpitzer), `SparkGraphComputer` is very close in speed to Spark DSL computations. This is good as if that was sufficiently large, that would destroy the purpose of writing Gremlin traversals for analytics. Either way, we are both fast with non-intercepted traversals and intercepted traversal. In short, this is an epic PR. CHANGELOG (master) ``` * Added `VertexProgramInterceptor` interface as a general pattern for `GraphComputer` providers to use for bypassing `GraphComputer` semantics where appropriate. * Added `SparkStarBarrierInterceptor` that uses Spark DSL for local star graph traversals that end with a `ReducingBarrierStep`. * Added `SparkInterceptorStrategy` which identifies which interceptor to use (if any) given the submitted `VertexProgram`. * Added `SparkSingleIterationStrategy` that does not partition nor cache the graph RDD if the traversal does not message pass. * Added more helper methods to `TraversalHelper` for handling scoped traversal children. ``` I'm going to cherry pick the bug fix to `TraversalHelper.isLocalStarGraph()` to tp31/. CHANGELOG (tp31) ``` * Fixed a severe bug in `TraversalHelper.isLocalStarGraph()`. ``` --- BENCHMARKS (Friendster on the Blades): ``` g.V().count() before: 4.5 minutes no-partition: 2.9 minutes interceptor: 2.5 minutes g.V().out().count() before: 10 minutes no-partitioner: 3.1 minutes interceptor: 2.6 minutes g.V().groupCount().by(outE().count()) before: 12 minutes no-partitioner: 3.2 minutes interceptor: 2.6 minutes ``` --- VOTE +1 (`mvn clean install` + integration tests -- Giraph still running) You can merge this pull request into a Git repository by running: $ git pull https://github.com/apache/incubator-tinkerpop TINKERPOP-1288 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-tinkerpop/pull/305.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #305 ---- commit 77480a2c7875e284b063ecca434bf2165116689d Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2016-05-03T20:14:29Z Created the first GraphComputer Provider TraversalStrategy -- SparkPartitionAwareStratgegy. This strategy analyzes the traversal to determine if partitioning should be skipped -- e.g. if no message pass is happening. I think this strategy will start to set the stage for how future GraphComputer provider-specific strategies will work. commit e8bba359be7a773a138b023929c0a2c02a9d6de8 Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2016-05-03T20:48:27Z syncing with 1120. commit 40b3931f843dcd3fe22da4ec91b1b252e41ffe7e Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2016-05-03T20:48:31Z Merge branch 'TINKERPOP-1120' into TINKERPOP-1288 commit 6e92c4e115599a77381a34fdd94e19ce48866f37 Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2016-05-03T20:53:14Z merged in TINKERPOP-1120 as that has all the new optimizations. going to test on the blades. commit 7c103c8b0bf218c5eb6ec83ccfe5d416fd671e3d Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2016-05-03T23:10:29Z DefaultTraversal.hashCode() was bad. Fixed. Added NativeInterceptor which right now is tied to SparkGraphComputer, but should be generalizable to all GraphComputer providers. In short, instead of executing the VertexProgram, an interceptor can take the Graph/Memory and pop out a Graph/Memory. Thus, skipping the VertexProgram execution. Right now I have VertexCountInterceptor which does inputRDD.count(). BAM. Works. Added more test cases and cleanup. Going to probably just roost on this branch for a bit until I get the concept clear and simple. Pushing so I can benchmark on the Blades. commit d5e7c78628d89c464418f1329a797fc192880e8b Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2016-05-04T02:55:03Z lots of cleanup. NativeInterceptor is now called VertexProgramInterceptor. Its generalized to be used by any GraphComputer implementation. There exists SparkVertexProgramInterceptor which extends VertexProgramInterceptor. Cleaned up naming conventions in the test cases. Found a bug in TraversalSource that was causing RemoteGraph computer tests to fail. SparkGraphComputer is cleaner with how interception happens -- really clear delineation which will be used for all GraphComptuers that use the Interceptor-model. Running integration tests over night. commit f0acfd7512330f82d14e0c17d90f46487008b6e1 Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2016-05-04T14:40:34Z Generalized SparkVertexCountInterceptor to SparkCountInterceptor and it can recognize and natively process various g.V.name.count(), g.V.out.count, etc. type of traversals. Added a few more test cases to AdjacentToIncidentStrategy. Came up with a nice test infrastructure for SparkInterceptorStrategyTest. commit 2717b281ed085a365d59edc718db9a33a9266343 Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2016-05-04T15:44:03Z added test cases around persisted of rdds. if the interceptor does not 'touch' the loadedGraphRDD then its still the loadedGraphRDD and thus, persistence rules for loadedGraphRDD should apply there. commit 6f9e5fa0c7ffa01b7fbd3c3fed9539f6ad12bd4c Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2016-05-04T17:26:00Z just generalized SparkCountInterceptor to work with ANY star graph computation that ends with count(). The concept is simple, just do flatMapToPair(traversal).count(). tada. This is so clean and pretty.... next is going to be generalizing to work with any ReducingBarrierStep, not just count(). Also, need to add some test cases that use sideEffects and ensure that Memory is updated accordingly. However, besides that, this is such a dead simple concept that I actually think this is how we should redesign SparkGraphComputer. Break up the traverasl into starGraphTraversals and then concatenate the jobs via Spark join()s. commit 71f66c5114ae5e90aad8dc5f5cf77b200f6e1f30 Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2016-05-04T18:51:35Z Got sideEffects working with SparkCountInterceptor. g.V.groupCount(m).count() works as expected. Classy. Found a bug in TraversalHelper.isLocalStarGraph(), fixed. commit 01fda96707335f2e33b073cef980c7844ca6d80c Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2016-05-04T19:48:39Z merged master into this branch. had to fix up SparkGraphComputer conflicts cause of ThreadInterrupt work from @spmallette. Added some Scope-based recurssive methods to TraversalHelper. SparkCountItercepter is now smart about sideEffects. Added more tests to ensure proper interception. commit 6a256f8fcbe6a87a8d794466a7b559ff9ba8d2d6 Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2016-05-04T21:45:27Z SparkCountInterceptor is now called SparkStarBarrierInterceptor and works with count(), max(), sum(), min(), and mean(). groupCount(), group(), and tree() are still left to do. commit 6e122e42b9c49648ce42a295b94d780c3eb11aae Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2016-05-04T22:56:04Z got GroupCount working with SparkStarBarrierInterceptor. Going to test on the Blades. commit 854b259cb0645d1d50e5365392b49a373e2131a5 Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2016-05-05T15:54:41Z found severe issues with TraversalHelper.isLocalStarGraph(). Given that this is used extensively by GraphComputer implementations, this was a good find. Crazy enough, no test cases for that method. Added a host of them for traverals of varying complexity. Added spark.skipPartitioner and spark.skipGraphCache which are both current modulated by SparkSingleIterationStrategy. Opened up some IgnoreEngines from @pietermartin that we due to bad TraversalHelper.isLocalStarGraph(). More tests cases for the new SparkXXXStrategy work. Lots of logic cleanup in SparkGraphComputer given the distinction between skipPartiioner and skipGraphCache. commit 1e6aa16b9dfbd8a07c0fab8bb0ad7ee88a3cd630 Author: Marko A. Rodriguez <okramma...@gmail.com> Date: 2016-05-05T16:15:02Z forgot to get rid of this class which was created in this branch -- not needed. ---- > Support gremlin.spark.skipPartitioning configuration. > ----------------------------------------------------- > > Key: TINKERPOP-1288 > URL: https://issues.apache.org/jira/browse/TINKERPOP-1288 > Project: TinkerPop > Issue Type: Improvement > Components: hadoop, process > Affects Versions: 3.2.0-incubating > Reporter: Marko A. Rodriguez > Assignee: Marko A. Rodriguez > Fix For: 3.2.1 > > > If a {{VertexProgram}} does not use message passing, then its best to NOT > partition after load as its pointless to do so. > In particular, for {{TraversalVertexProgram}}, if the submitted traversal > does not contain a {{VertexStep}} then partitioning can be avoided. This can > be reasoned via a {{SparkPartitionStrategy}}, but for now, simply making the > configuration and having it do its job is sufficient. -- This message was sent by Atlassian JIRA (v6.3.4#6332)