Hi,
The first ping that came through on IM was Bob Briody saying: "So you have to
know when its appropriate to add a barrier()?"
As of yesterday, yes. As of today, no.
I did two things:
1. CollectingBarrierStep (which NoOpBarrierStep extends) now supports a
maxBarrierSize so as to protect against OutOfMemoryError (i.e. a "lazy barrier").
https://github.com/apache/incubator-tinkerpop/blob/master/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/traversal/step/util/CollectingBarrierStep.java
2. LazyBarrierStrategy looks for traversals with a particular pattern
(e.g. not OLAP, deep traversal, lots of starts, little head filtering, etc.)
and inserts NoOpBarrierStep's accordingly.
https://github.com/apache/incubator-tinkerpop/blob/master/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/traversal/strategy/finalization/LazyBarrierStrategy.java
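For intuition, here is a rough Python sketch of the lazy-barrier idea -- the names and semantics are my own illustration, not the actual CollectingBarrierStep code: drain the incoming stream in batches of at most maxBarrierSize traversers, bulking duplicates within each batch, so memory stays bounded no matter how long the stream is.

```python
from collections import Counter
from itertools import islice

def lazy_barrier(traversers, max_barrier_size=10000):
    # Illustrative sketch only -- not the TinkerPop implementation.
    # Collect at most max_barrier_size (element, bulk) traversers,
    # merge duplicates into a single bulked traverser, emit the batch,
    # then go back for more. Memory is bounded by the batch size
    # instead of by the size of the whole incoming stream.
    it = iter(traversers)
    while True:
        batch = list(islice(it, max_barrier_size))
        if not batch:
            return
        merged = Counter()
        for element, bulk in batch:
            merged[element] += bulk
        for element, bulk in merged.items():
            yield (element, bulk)

# e.g. eight traversers at two vertices collapse to two bulked traversers:
# list(lazy_barrier([('v[1]', 1)] * 5 + [('v[2]', 1)] * 3))
# -> [('v[1]', 5), ('v[2]', 3)]
```

With a small max_barrier_size the same stream is emitted in several partially-bulked batches rather than one fully-bulked one -- less bulking, but bounded memory.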
LazyBarrierStrategy is not a default strategy so you will have to turn it on to
use it. However, look at the same MovieLens recommendation traversal from
yesterday when LazyBarrierStrategy is turned on -- no more need for manual
barrier() additions!
gremlin> graph = TinkerGraph.open()
==>tinkergraph[vertices:0 edges:0]
gremlin> g = graph.traversal()
==>graphtraversalsource[tinkergraph[vertices:0 edges:0], standard]
gremlin> g =
graph.traversal(GraphTraversalSource.build().with(LazyBarrierStrategy.instance()).engine(StandardTraversalEngine.build()))
// adding a non-default strategy
==>graphtraversalsource[tinkergraph[vertices:0 edges:0], standard]
gremlin>
////// some script to load the MovieLens data set into TinkerGraph //////
gremlin>
gremlin> g.V('u2590').outE('rated').has('stars',gte(4)).inV().
         aggregate('userLibrary').
         inE('rated').has('stars',gte(4)).outV().
         outE('rated').has('stars',gte(4)).inV().
         where(not(within('userLibrary'))).groupCount().by('name').iterate().toString()
==>[TinkerGraphStep([u2590],vertex), VertexStep(OUT,[rated],edge),
HasStep([stars.gte(4)]), EdgeVertexStep(IN), NoOpBarrierStep,
AggregateStep(userLibrary), VertexStep(IN,[rated],edge),
HasStep([stars.gte(4)]), EdgeVertexStep(OUT), NoOpBarrierStep,
VertexStep(OUT,[rated],edge), HasStep([stars.gte(4)]), EdgeVertexStep(IN),
NoOpBarrierStep, WhereStep(local,without([userLibrary])),
GroupCountStep(value(name))] // barriers inserted automagically
gremlin>
clock(20){g.V('u2590').outE('rated').has('stars',gte(4)).inV().aggregate('userLibrary').inE('rated').has('stars',gte(4)).outV().outE('rated').has('stars',gte(4)).inV().where(not(within('userLibrary'))).groupCount().by('name').next()}
==>520.97150945
The logic behind LazyBarrierStrategy is still up in the air… however, hopefully
this first push gets the discussion going.
Take care,
Marko.
http://markorodriguez.com
On Jun 4, 2015, at 11:24 AM, Marko Rodriguez <[email protected]> wrote:
> Hello,
>
> TinkerPop3 introduces the concept of a Traverser, which represents not only
> the current element it is "at," but also metadata like where it has been
> before (path), a local side-effect data structure (sack), and how many
> traversers it represents (bulk). Bulk is a key concept that we required
> when we needed to do OLAP graph algorithms. With traversers emanating from
> every vertex in the graph, the number of traversers grows fast. However, if
> your traversal doesn't require the traverser's history (path) or its local
> data structure (sack), then its current location ("at") is the only uniquely
> identifying characteristic, and thus 100 traversers at v[1] are the same as 1
> traverser at v[1] with "long bulk = 100," where counting requires much less
> space (a 64-bit long) than enumeration.
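The identity argument above can be made concrete with a tiny Python sketch (the traversers here are hypothetical, made up for illustration): when traversers are keyed by element alone, duplicates collapse into a bulk count; the moment the path is part of the identity, nothing merges.

```python
from collections import Counter

# Hypothetical traversers as (element, path) pairs -- illustration only.
traversers = [('v[1]', ('v[4]', 'v[1]')),
              ('v[1]', ('v[6]', 'v[1]'))]

# No path/sack needed: identity is the element, so the two merge (bulk = 2).
by_element = Counter(element for element, path in traversers)

# Path needed: each history is unique, so nothing merges and both survive.
by_element_and_path = Counter(traversers)

# by_element          -> Counter({'v[1]': 2})
# by_element_and_path -> two entries, each with count 1
```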
>
> While this concept was great for OLAP, it is also great for OLTP.
>
> Check this out:
>
> gremlin> graph = TinkerGraph.open()
> ==>tinkergraph[vertices:0 edges:0]
> gremlin> graph.io(gryo()).readGraph('data/grateful-dead.kryo')
> ==>null
> gremlin> g = graph.traversal()
> ==>graphtraversalsource[tinkergraph[vertices:808 edges:8049], standard]
> gremlin>
> clockWithResult(20){g.V().out().out().out().out().out().out().out().out().out().out().count().next()}
> // CTRL-C as this will take the life of the universe to compute
> gremlin> clockWithResult(20){g.V().repeat(out()).times(10).count().next()}
> ==>2880.6283869
> ==>4947356572210703772
> gremlin>
> clockWithResult(20){g.V().out().barrier().out().barrier().out().barrier().out().barrier().out().barrier().out().barrier().out().barrier().out().barrier().out().barrier().out().barrier().count().next()}
> ==>27.921249
> ==>4947356572210703772
>
> There are three traversals above. All of them yield the same result -- that
> is, there are 4,947,356,572,210,703,772 length-10 paths in the Grateful Dead
> graph. Or, there are 4.9 sextillion length-10 paths.
>
> What makes the traversals above different is their respective use of bulking
> -- and ultimately their runtime in milliseconds.
>
> 1. This is a sequential traversal where each of the 4.9 sextillion
> objects ultimately touched must be enumerated.
> 2. This is a repeating traversal where the emanations from any one
> vertex are iterated before the next vertex is processed. However, for that
> one vertex's branch, results are bulked.
> 3. This is a barrier traversal where all the traversers are bulked at
> each outgoing step (i.e. a BSP computation, though in sequential form).
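To see why strategy 3 wins, here is a small Python sketch (the toy graph is my own construction, not the Grateful Dead data) comparing path counting by enumeration against path counting with bulked frontiers: both give the same answer, but the bulked version never holds more than one traverser per vertex.

```python
from collections import Counter

# Toy complete digraph on n vertices (self-loops included), so every hop
# multiplies the number of paths by n and the answer is n ** (k + 1).
n, k = 4, 6
adj = {v: list(range(n)) for v in range(n)}

def count_enumerated(adj, k):
    # Strategy 1: one object per traverser -- exponential blow-up.
    frontier = list(adj)
    for _ in range(k):
        frontier = [w for v in frontier for w in adj[v]]
    return len(frontier)

def count_bulked(adj, k):
    # Strategy 3: barrier after every hop -- at most one (vertex, bulk)
    # pair per vertex is ever alive, a BSP round in sequential form.
    frontier = Counter({v: 1 for v in adj})
    for _ in range(k):
        nxt = Counter()
        for v, bulk in frontier.items():
            for w in adj[v]:
                nxt[w] += bulk
        frontier = nxt
    return sum(frontier.values())

# Both agree: 4 ** 7 = 16384 length-6 paths, but the bulked version only
# ever tracks 4 traversers at a time.
```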
>
> All of these traversals are single threaded and over the same data. However,
> look at the run times --- sometimes computing power is not the key, but the
> way in which data is represented.
>
> To brag -- I can determine that there are 4,947,356,572,210,703,772 length-10
> paths in the Grateful Dead graph in 27 milliseconds.
>
> A PRACTICAL APPLICATION:
>
> Bulking is crucial when you have recurrence in your traversal --- in other
> words, when your traversal touches the same elements over and over again. One
> place this happens a lot is in recommendation algorithms. Let's take the
> MovieLens dataset. Let's recommend movies to user "u2590."
>
> g.V('u2590').outE('rated').has('stars',gte(4)).inV().
> aggregate('userLibrary').
> inE('rated').has('stars',gte(4)).outV().
> outE('rated').has('stars',gte(4)).inV().
> where(not(within('userLibrary'))).groupCount().by('name')
>
> Here is the runtime of that traversal:
>
> gremlin>
> clock(5){g.V('u2590').outE('rated').has('stars',gte(4)).inV().aggregate('userLibrary').inE('rated').has('stars',gte(4)).outV().outE('rated').has('stars',gte(4)).inV().where(not(within('userLibrary'))).groupCount().by('name').next()}
> ==>13624.124757
>
> However, I bet another user probably rated the same movies as user u2590.
> Thus, let's add a barrier() on those people. Finally, I bet those users also
> rated many of the same movies. Thus, let's barrier those movies.
>
> g.V('u2590').outE('rated').has('stars',gte(4)).inV().
> aggregate('userLibrary').
> inE('rated').has('stars',gte(4)).outV().barrier().
> outE('rated').has('stars',gte(4)).inV().barrier().
> where(not(within('userLibrary'))).groupCount().by('name')
>
> Here is the runtime of this new traversal which yields the exact same result:
>
> gremlin>
> clock(5){g.V('u2590').outE('rated').has('stars',gte(4)).inV().aggregate('userLibrary').inE('rated').has('stars',gte(4)).outV().barrier().outE('rated').has('stars',gte(4)).inV().barrier().where(not(within('userLibrary'))).groupCount().by('name').next()}
> ==>531.6441133999999
>
> From 13.6 seconds down to 0.5 seconds. Now that is the beauty of bulking.
>
> Enjoy!,
> Marko.
>
> http://markorodriguez.com
>
> References:
>
> [1]
> http://tinkerpop.incubator.apache.org/docs/3.0.0.M8-incubating/#a-note-on-barrier-steps