Re: [TinkerPop] GraphActors as a new distributed computing framework in TinkerPop

Marko Rodriguez Thu, 15 Dec 2016 08:40:37 -0800

Hello,

> I have only one idea: do traversal API users still really have to know 
> whether they use a GraphComputer or GraphActors? In other words, can the 
> withEngine options not just be some illuminating token constants for users 
> that just want to have the traversal() returned (LOCAL, LOCAL_DISTRIBUTED, 
> DISTRIBUTED)?  Of course, the more extended API will be useful for a minority 
> of power users that want to optimize an ActorProgram for a specific use case.


So, Matthias Bröcheler has, for a few years now, wanted something like a 
TraversalEngineReasoningStrategy. This would be a DecorationStrategy that would 
look at the traversal and make a best guess as to whether to execute it 
iterator-style, compute-style, or actors-style. For instance:

        g.V().count() // computer
        g.V(1).out().count() // iterator
        g.V(1).repeat(out()).times(3).count() // actors

I think for now withEngine() is the bare-bones necessity and we can get clever 
with reasoning and your enum-model down the line.
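
To make the reasoning idea above a bit more concrete, here is a rough sketch in 
plain Java. Everything in it is hypothetical: EngineReasoning, TraversalShape, 
and the two rules are illustrative assumptions, not TinkerPop API and not how a 
real TraversalEngineReasoningStrategy would have to work. It only encodes the 
three example traversals from above.

```java
// Hypothetical sketch: none of these names are TinkerPop API.
final class EngineReasoning {

    enum Engine { ITERATOR, COMPUTER, ACTORS }

    // A deliberately tiny description of a traversal: whether it starts from
    // explicit vertex ids, and the names of the steps it contains.
    static final class TraversalShape {
        final boolean startsFromIds;
        final String[] steps;

        TraversalShape(boolean startsFromIds, String... steps) {
            this.startsFromIds = startsFromIds;
            this.steps = steps;
        }

        boolean hasStep(String name) {
            for (String step : steps) {
                if (step.equals(name)) return true;
            }
            return false;
        }
    }

    // Assumed rules, chosen only to reproduce the three examples:
    //   no start ids          -> full scan, hand it to a computer
    //   start ids plus repeat -> deep walk from a seed, use actors
    //   otherwise             -> short local walk, just iterate
    static Engine choose(TraversalShape t) {
        if (!t.startsFromIds) return Engine.COMPUTER;
        if (t.hasStep("repeat")) return Engine.ACTORS;
        return Engine.ITERATOR;
    }

    public static void main(String[] args) {
        System.out.println(choose(new TraversalShape(false, "V", "count")));
        System.out.println(choose(new TraversalShape(true, "V", "out", "count")));
        System.out.println(choose(new TraversalShape(true, "V", "repeat", "count")));
    }
}
```

A real strategy would presumably inspect the compiled steps and provider 
statistics rather than step names, which is part of why the guess is only a 
best guess.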

Finally, note that I went with withProcessor() last night as the name :). 
GraphActors and GraphComputer both implement a new Processor interface (which 
is primarily a marker interface).
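
As a rough illustration of what a marker interface buys here, the sketch below 
uses stand-in types declared locally. They share names with the concepts in 
this thread but are not the interfaces on the TINKERPOP-1564 branch; 
SparkLikeComputer, AkkaLikeActors, and ProcessorDispatch are hypothetical.

```java
// Hypothetical stand-ins, declared here only for illustration.
interface Processor { }                          // marker: no methods in this sketch
interface GraphComputer extends Processor { }
interface GraphActors extends Processor { }

final class SparkLikeComputer implements GraphComputer { }
final class AkkaLikeActors implements GraphActors { }

final class ProcessorDispatch {

    // Given only a class, decide which execution style it implies.
    // This mirrors "it knows it's a GraphComputer, so that's the engine".
    static String engineFor(Class<? extends Processor> processorClass) {
        if (GraphComputer.class.isAssignableFrom(processorClass)) return "computer";
        if (GraphActors.class.isAssignableFrom(processorClass)) return "actors";
        return "unknown";
    }

    public static void main(String[] args) {
        System.out.println(engineFor(SparkLikeComputer.class));
        System.out.println(engineFor(AkkaLikeActors.class));
    }
}
```

The common supertype lets a single withProcessor(Class) style method accept 
either kind of engine and branch on the type afterwards.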

Marko.


> 
> Cheers,    Marc
> 
> On Wednesday, 14 December 2016 18:46:44 UTC+1, Marko A. Rodriguez wrote:
> Hello,
> 
> For the last week I’ve been working on “distributed OLTP.” Gremlin has a 
> really nice architecture in that a traverser can be shipped around a cluster 
> and reattached to its respective element (vertex/edge/etc.) and step 
> (traversal) at the remote location and continue to compute. Thus, we can have 
> step-by-step query routing.
> 
>       https://issues.apache.org/jira/browse/TINKERPOP-1564
> 
> With that, I’ve created GraphActors which is similar to GraphComputer. 
> However, there are some fundamental distinctions:
> 
>       1. GraphActors assumes the boundary of computation is a Partition.
>               - GraphComputer assumes the boundary of computation is a
>                 vertex and its incident edges and properties.
>       2. GraphActors assumes asynchronous computation with barriers at
>          Barrier steps.
>               - GraphComputer assumes (sorta) synchronous computation with
>                 a barrier when all traversers have left their local vertex.
>       3. GraphActors is traverser-centric and partition-bound.
>               - GraphComputer is vertex-centric and vertex-bound.
> 
> In gremlin-core/ I’ve created a new set of interfaces off of process/.
> 
>       https://github.com/apache/tinkerpop/tree/TINKERPOP-1564/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/actor
>               GraphActors <=> GraphComputer
>               MasterActor <=> setup()/terminate().
>               WorkerActor <=> execute()
>               ActorProgram <=> VertexProgram
> 
> The parallels between GraphComputer and GraphActors are strong. In short, a 
> (hardcore) user can create an ActorProgram and submit it to a GraphActors. 
> The ActorProgram will effect a distributed, asynchronous, partition-bound 
> message passing algorithm and return a Future<Result>. There is one 
> ActorProgram in particular that executes a Gremlin traversal.
> 
>       https://github.com/apache/tinkerpop/blob/TINKERPOP-1564/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/actor/traversal/TraversalActorProgram.java
>         master actor program: https://github.com/apache/tinkerpop/blob/TINKERPOP-1564/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/actor/traversal/TraversalMasterProgram.java
>         worker actor program: https://github.com/apache/tinkerpop/blob/TINKERPOP-1564/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/actor/traversal/TraversalWorkerProgram.java
> 
> Pretty simple. Besides some problems I’m having with serialization stuff in 
> GroupStep, the ProcessSuite passes.
> 
> Now, it's up to a provider to implement the GraphActors interfaces. Welcome 
> akka-gremlin/.
> 
>       https://github.com/apache/tinkerpop/blob/TINKERPOP-1564/akka-gremlin/src/main/java/org/apache/tinkerpop/gremlin/akka/process/actor/AkkaGraphActors.java
>         master actor: https://github.com/apache/tinkerpop/blob/TINKERPOP-1564/akka-gremlin/src/main/java/org/apache/tinkerpop/gremlin/akka/process/actor/MasterActor.java
>         worker actor: https://github.com/apache/tinkerpop/blob/TINKERPOP-1564/akka-gremlin/src/main/java/org/apache/tinkerpop/gremlin/akka/process/actor/WorkerActor.java
>         mailbox system: https://github.com/apache/tinkerpop/blob/TINKERPOP-1564/akka-gremlin/src/main/java/org/apache/tinkerpop/gremlin/akka/process/actor/ActorMailbox.java
> 
> Dead simple!
> 
> What we should have done with TinkerPop from the start is include the notion 
> of a Partition. For this branch, I’ve added two concepts: Partition and 
> Partitioner.
> 
>       https://github.com/apache/tinkerpop/blob/TINKERPOP-1564/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/structure/Partitioner.java
>       https://github.com/apache/tinkerpop/blob/TINKERPOP-1564/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/structure/Partition.java
> 
> Why is this cool? This is cool because if a GraphActors system knows the 
> partitions of the underlying Graph data, then it can immediately process the 
> Graph data in a distributed manner. No need to write a custom “InputFormat.” 
> We should have done this from the start because then GraphComputer could do 
> the same. For instance, spark-gremlin/ can run over TinkerGraph as it doesn’t 
> care about TinkerGraph, it cares about Partition “input splits.” By adding 
> this layer of information, ANY Graph can work with ANY GraphComputer or 
> GraphActors. I have yet to create PartitionInputRDD and PartitionInputFormat, 
> but that will be next and at that point, GraphComputers are agnostic to the 
> underlying implementation.
> 
> So there you have it. Your thoughts on the matter would be most appreciated.
> 
> Things still left to do:
> 
>       * We need a concept of a traversal engine. That is, something like:
>               - g.withEngine(SparkGraphComputer.class)  // it knows it's a GraphComputer, so that's the engine.
>               - g.withEngine(AkkaGraphActors.class)     // it knows it's a GraphActors, so that's the engine.
>               - g.withEngine(Iterator.class)            // this means, just iterate the traversal locally :).
>       * GraphComputer semantics are a restricted version of GraphActors
>         semantics.
>               - GraphActors becomes GraphComputer when the Partitions are
>                 defined by vertices.
>               - I think I can unify the two and thus, we could have
>                 SparkGraphActors.
> 
> 
> 
> Here is some fun playing:
> 
> 
>          \,,,/
>          (o o)
> -----oOOo-(3)-oOOo-----
> plugin activated: tinkerpop.server
> plugin activated: tinkerpop.utilities
> plugin activated: tinkerpop.tinkergraph
> gremlin> :install org.apache.tinkerpop akka-gremlin 3.3.0-SNAPSHOT
> ==>Loaded: [org.apache.tinkerpop, akka-gremlin, 3.3.0-SNAPSHOT]
> gremlin> graph = TinkerFactory.createModern()
> ==>tinkergraph[vertices:6 edges:6]
> gremlin> graph.partitioner()
> ==>partitioner[globalpartitioner:1]
> gremlin> partitioner = new HashPartitioner(graph.partitioner(),3) // let's create 3 logical partitions over TinkerGraph
> ==>partitioner[hashpartitioner:3]
> gremlin> g = graph.traversal().withStrategies(new ActorProgramStrategy(AkkaGraphActors,partitioner)) // in the future withEngine() will be used
> ==>graphtraversalsource[tinkergraph[vertices:6 edges:6], actors]
> gremlin> g.V().repeat(out()).times(2).values('name')
> ==>lop
> ==>ripple
> gremlin> g.V().repeat(both()).times(2).groupCount().by(out().in().count()) // beyond the star graph!
> ==>[0:13,3:3,4:7,5:7]
> gremlin> g.V().match( // distributed pattern matching
> ......1>  __.as('a').out('created').as('b'),
> ......2>  __.as('b').has('name', 'lop'),
> ......3>  __.as('b').in('created').as('c'),
> ......4>  __.as('c').has('age', 29)).
> ......5>   select('a','c').by('name')
> ==>[a:marko,c:marko]
> ==>[a:josh,c:marko]
> ==>[a:peter,c:marko]
> gremlin>
> 
> Now imagine this executing over various providers:
> 
>       1. A sharded graph database (Titan/DSEGraph/OrientDB): the traversers
>          move between machines so they are always processing data locally.
>       2. A replicated graph database (Neo4j): logical partitions are created
>          so that each machine is responsible for a subgraph of the full graph
>          (load balancing and parallelization).
>       3. A single-machine graph database (TinkerGraph): logical partitions
>          are created so that each core of the machine is responsible for a
>          subgraph (parallelization).
> 
> Pretty neat, eh?
> 
> Ideas are more than welcome,
> Marko.
> 
> http://markorodriguez.com
