[jira] [Commented] (TINKERPOP-1288) Support gremlin.spark.skipPartitioning configuration.

ASF GitHub Bot (JIRA) Thu, 05 May 2016 09:29:39 -0700

    [ 
https://issues.apache.org/jira/browse/TINKERPOP-1288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15272581#comment-15272581
 ]


ASF GitHub Bot commented on TINKERPOP-1288:
-------------------------------------------

GitHub user okram opened a pull request:

    https://github.com/apache/incubator-tinkerpop/pull/305

    TINKERPOP-1288 & TINKERPOP-1290: VertexProgramInterceptor and GraphComputer 
TraversalStrategies

    https://issues.apache.org/jira/browse/TINKERPOP-1288
    https://issues.apache.org/jira/browse/TINKERPOP-1290
    
    We now have the concept of `TraversalStrategies` for `GraphComputers`. A 
clear pattern has been established for their use with `SparkGraphComputer`. 
Extending these concepts to `TinkerGraphComputer` will be easy. Moreover, we 
have `SparkInterceptorStrategy` which we can continue to extend with traversal 
patterns that we can map to use Spark DSL under the hood instead of 
`GraphComputer` semantics for performance boosting. However, after some 
intelligent refactoring around partitioning and caching (thanks 
@RussellSpitzer), `SparkGraphComputer` is very close in speed to Spark DSL 
computations. This is good as if that was sufficiently large, that would 
destroy the purpose of writing Gremlin traversals for analytics. Either way, we 
are both fast with non-intercepted traversals and intercepted traversal. In 
short, this is an epic PR.
    
    CHANGELOG (master)
    
    ```
    * Added `VertexProgramInterceptor` interface as a general pattern for 
`GraphComputer` providers to use for bypassing `GraphComputer` semantics where 
appropriate.
    * Added `SparkStarBarrierInterceptor` that uses Spark DSL for local star 
graph traversals that end with a `ReducingBarrierStep`.
    * Added `SparkInterceptorStrategy` which identifies which interceptor to 
use (if any) given the submitted `VertexProgram`.
    * Added `SparkSingleIterationStrategy` that does not partition nor cache 
the graph RDD if the traversal does not message pass.
    * Added more helper methods to `TraversalHelper` for handling scoped 
traversal children.
    ```
    
    I'm going to cherry pick the bug fix to 
`TraversalHelper.isLocalStarGraph()` to tp31/.
    
    CHANGELOG (tp31)
    
    ```
    * Fixed a severe bug in `TraversalHelper.isLocalStarGraph()`.
    ```
    
    ---
    
    BENCHMARKS (Friendster on the Blades):
    
    ```
    g.V().count()
      before: 4.5 minutes
      no-partition: 2.9 minutes 
      interceptor: 2.5 minutes
    
    g.V().out().count()
      before: 10 minutes
      no-partitioner: 3.1 minutes 
      interceptor: 2.6 minutes
    
    g.V().groupCount().by(outE().count())
      before: 12 minutes
      no-partitioner: 3.2 minutes
      interceptor: 2.6 minutes 
    ```
    
    ---
    
    VOTE +1 (`mvn clean install` + integration tests -- Giraph still running)

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/incubator-tinkerpop TINKERPOP-1288

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-tinkerpop/pull/305.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #305
    
----
commit 77480a2c7875e284b063ecca434bf2165116689d
Author: Marko A. Rodriguez <okramma...@gmail.com>
Date:   2016-05-03T20:14:29Z

    Created the first GraphComputer Provider TraversalStrategy -- 
SparkPartitionAwareStratgegy. This strategy analyzes the traversal to determine 
if partitioning should be skipped -- e.g. if no message pass is happening. I 
think this strategy will start to set the stage for how future GraphComputer 
provider-specific strategies will work.

commit e8bba359be7a773a138b023929c0a2c02a9d6de8
Author: Marko A. Rodriguez <okramma...@gmail.com>
Date:   2016-05-03T20:48:27Z

    syncing with 1120.

commit 40b3931f843dcd3fe22da4ec91b1b252e41ffe7e
Author: Marko A. Rodriguez <okramma...@gmail.com>
Date:   2016-05-03T20:48:31Z

    Merge branch 'TINKERPOP-1120' into TINKERPOP-1288

commit 6e92c4e115599a77381a34fdd94e19ce48866f37
Author: Marko A. Rodriguez <okramma...@gmail.com>
Date:   2016-05-03T20:53:14Z

    merged in TINKERPOP-1120 as that has all the new optimizations. going to 
test on the blades.

commit 7c103c8b0bf218c5eb6ec83ccfe5d416fd671e3d
Author: Marko A. Rodriguez <okramma...@gmail.com>
Date:   2016-05-03T23:10:29Z

    DefaultTraversal.hashCode() was bad. Fixed. Added NativeInterceptor which 
right now is tied to SparkGraphComputer, but should be generalizable to all 
GraphComputer providers. In short, instead of executing the VertexProgram, an 
interceptor can take the Graph/Memory and pop out a Graph/Memory. Thus, 
skipping the VertexProgram execution. Right now I have VertexCountInterceptor 
which does inputRDD.count(). BAM. Works. Added more test cases and cleanup. 
Going to probably just roost on this branch for a bit until I get the concept 
clear and simple. Pushing so I can benchmark on the Blades.

commit d5e7c78628d89c464418f1329a797fc192880e8b
Author: Marko A. Rodriguez <okramma...@gmail.com>
Date:   2016-05-04T02:55:03Z

    lots of cleanup. NativeInterceptor is now called VertexProgramInterceptor. 
Its generalized to be used by any GraphComputer implementation. There exists 
SparkVertexProgramInterceptor which extends VertexProgramInterceptor. Cleaned 
up naming conventions in the test cases. Found a bug in TraversalSource that 
was causing RemoteGraph computer tests to fail. SparkGraphComputer is cleaner 
with how interception happens -- really clear delineation which will be used 
for all GraphComptuers that use the Interceptor-model. Running integration 
tests over night.

commit f0acfd7512330f82d14e0c17d90f46487008b6e1
Author: Marko A. Rodriguez <okramma...@gmail.com>
Date:   2016-05-04T14:40:34Z

    Generalized SparkVertexCountInterceptor to SparkCountInterceptor and it can 
recognize and natively process various g.V.name.count(), g.V.out.count, etc. 
type of traversals. Added a few more test cases to AdjacentToIncidentStrategy. 
Came up with a nice test infrastructure for SparkInterceptorStrategyTest.

commit 2717b281ed085a365d59edc718db9a33a9266343
Author: Marko A. Rodriguez <okramma...@gmail.com>
Date:   2016-05-04T15:44:03Z

    added test cases around persisted of rdds. if the interceptor does not 
'touch' the loadedGraphRDD then its still the loadedGraphRDD and thus, 
persistence rules for loadedGraphRDD should apply there.

commit 6f9e5fa0c7ffa01b7fbd3c3fed9539f6ad12bd4c
Author: Marko A. Rodriguez <okramma...@gmail.com>
Date:   2016-05-04T17:26:00Z

    just generalized SparkCountInterceptor to work with ANY star graph 
computation that ends with count(). The concept is simple, just do 
flatMapToPair(traversal).count(). tada. This is so clean and pretty.... next is 
going to be generalizing to work with any ReducingBarrierStep, not just 
count(). Also, need to add some test cases that use sideEffects and ensure that 
Memory is updated accordingly. However, besides that, this is such a dead 
simple concept that I actually think this is how we should redesign 
SparkGraphComputer. Break up the traverasl into starGraphTraversals and then 
concatenate the jobs via Spark join()s.

commit 71f66c5114ae5e90aad8dc5f5cf77b200f6e1f30
Author: Marko A. Rodriguez <okramma...@gmail.com>
Date:   2016-05-04T18:51:35Z

    Got sideEffects working with SparkCountInterceptor. 
g.V.groupCount(m).count() works as expected. Classy. Found a bug in 
TraversalHelper.isLocalStarGraph(), fixed.

commit 01fda96707335f2e33b073cef980c7844ca6d80c
Author: Marko A. Rodriguez <okramma...@gmail.com>
Date:   2016-05-04T19:48:39Z

    merged master into this branch. had to fix up SparkGraphComputer conflicts 
cause of ThreadInterrupt work from @spmallette. Added some Scope-based 
recurssive methods to TraversalHelper. SparkCountItercepter is now smart about 
sideEffects. Added more tests to ensure proper interception.

commit 6a256f8fcbe6a87a8d794466a7b559ff9ba8d2d6
Author: Marko A. Rodriguez <okramma...@gmail.com>
Date:   2016-05-04T21:45:27Z

    SparkCountInterceptor is now called SparkStarBarrierInterceptor and works 
with count(), max(), sum(), min(), and mean(). groupCount(), group(), and 
tree() are still left to do.

commit 6e122e42b9c49648ce42a295b94d780c3eb11aae
Author: Marko A. Rodriguez <okramma...@gmail.com>
Date:   2016-05-04T22:56:04Z

    got GroupCount working with SparkStarBarrierInterceptor. Going to test on 
the Blades.

commit 854b259cb0645d1d50e5365392b49a373e2131a5
Author: Marko A. Rodriguez <okramma...@gmail.com>
Date:   2016-05-05T15:54:41Z

    found severe issues with TraversalHelper.isLocalStarGraph(). Given that 
this is used extensively by GraphComputer implementations, this was a good 
find. Crazy enough, no test cases for that method. Added a host of them for 
traverals of varying complexity. Added spark.skipPartitioner and 
spark.skipGraphCache which are both current modulated by 
SparkSingleIterationStrategy. Opened up some IgnoreEngines from @pietermartin 
that we due to bad TraversalHelper.isLocalStarGraph(). More tests cases for the 
new SparkXXXStrategy work. Lots of logic cleanup in SparkGraphComputer given 
the distinction between skipPartiioner and skipGraphCache.

commit 1e6aa16b9dfbd8a07c0fab8bb0ad7ee88a3cd630
Author: Marko A. Rodriguez <okramma...@gmail.com>
Date:   2016-05-05T16:15:02Z

    forgot to get rid of this class which was created in this branch -- not 
needed.

----


> Support gremlin.spark.skipPartitioning configuration.
> -----------------------------------------------------
>
>                 Key: TINKERPOP-1288
>                 URL: https://issues.apache.org/jira/browse/TINKERPOP-1288
>             Project: TinkerPop
>          Issue Type: Improvement
>          Components: hadoop, process
>    Affects Versions: 3.2.0-incubating
>            Reporter: Marko A. Rodriguez
>            Assignee: Marko A. Rodriguez
>             Fix For: 3.2.1
>
>
> If a {{VertexProgram}} does not use message passing, then its best to NOT 
> partition after load as its pointless to do so.
> In particular, for {{TraversalVertexProgram}}, if the submitted traversal 
> does not contain a {{VertexStep}} then partitioning can be avoided. This can 
> be reasoned via a {{SparkPartitionStrategy}}, but for now, simply making the 
> configuration and having it do its job is sufficient.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (TINKERPOP-1288) Support gremlin.spark.skipPartitioning configuration.

Reply via email to