Hi,
The first ping that came through on IM was Bob Briody saying: "So you have to
know when its appropriate to add a barrier()?"
As of yesterday, yes. As of today, no.
I did two things:
1. CollectingBarrierStep (which NoOpBarrierStep extends) now supports a
maxBarrierSize so as to protect against OutOfMemoryError (i.e. a "lazy barrier").
https://github.com/apache/incubator-tinkerpop/blob/master/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/traversal/step/util/CollectingBarrierStep.java
2. LazyBarrierStrategy looks for traversals with a particular pattern
(e.g. not OLAP, deep traversal, lots of starts, little head filtering, etc.)
and inserts NoOpBarrierStep's accordingly.
https://github.com/apache/incubator-tinkerpop/blob/master/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/traversal/strategy/finalization/LazyBarrierStrategy.java
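For intuition, here is a rough Python sketch of the lazy-barrier idea -- the names and semantics are my own illustration, not the actual CollectingBarrierStep code: drain the incoming stream in batches of at most maxBarrierSize traversers, bulking duplicates within each batch, so memory stays bounded no matter how long the stream is.

```python
from collections import Counter
from itertools import islice

def lazy_barrier(traversers, max_barrier_size=10000):
    # Illustrative sketch only -- not the TinkerPop implementation.
    # Collect at most max_barrier_size (element, bulk) traversers,
    # merge duplicates into a single bulked traverser, emit the batch,
    # then go back for more. Memory is bounded by the batch size
    # instead of by the size of the whole incoming stream.
    it = iter(traversers)
    while True:
        batch = list(islice(it, max_barrier_size))
        if not batch:
            return
        merged = Counter()
        for element, bulk in batch:
            merged[element] += bulk
        for element, bulk in merged.items():
            yield (element, bulk)

# e.g. eight traversers at two vertices collapse to two bulked traversers:
# list(lazy_barrier([('v[1]', 1)] * 5 + [('v[2]', 1)] * 3))
# -> [('v[1]', 5), ('v[2]', 3)]
```

With a small max_barrier_size the same stream is emitted in several partially-bulked batches rather than one fully-bulked one -- less bulking, but bounded memory.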
LazyBarrierStrategy is not a default strategy so you will have to turn it on to
use it. However, look at the same MovieLens recommendation traversal from
yesterday when LazyBarrierStrategy is turned on -- no more need for manual
barrier() additions!
gremlin> graph = TinkerGraph.open()
==>tinkergraph[vertices:0 edges:0]
gremlin> g = graph.traversal()
==>graphtraversalsource[tinkergraph[vertices:0 edges:0], standard]
gremlin> g =
graph.traversal(GraphTraversalSource.build().with(LazyBarrierStrategy.instance()).engine(StandardTraversalEngine.build()))
// adding a non-default strategy
==>graphtraversalsource[tinkergraph[vertices:0 edges:0], standard]
gremlin>
////// some script to load the MovieLens data set into TinkerGraph //////
gremlin>
gremlin> g.V('u2590').outE('rated').has('stars',gte(4)).inV().
         aggregate('userLibrary').
         inE('rated').has('stars',gte(4)).outV().
         outE('rated').has('stars',gte(4)).inV().
         where(not(within('userLibrary'))).groupCount().by('name').iterate().toString()
==>[TinkerGraphStep([u2590],vertex), VertexStep(OUT,[rated],edge),
HasStep([stars.gte(4)]), EdgeVertexStep(IN), NoOpBarrierStep,
AggregateStep(userLibrary), VertexStep(IN,[rated],edge),
HasStep([stars.gte(4)]), EdgeVertexStep(OUT), NoOpBarrierStep,
VertexStep(OUT,[rated],edge), HasStep([stars.gte(4)]), EdgeVertexStep(IN),
NoOpBarrierStep, WhereStep(local,without([userLibrary])),
GroupCountStep(value(name))] // barriers inserted automagically
gremlin>
clock(20){g.V('u2590').outE('rated').has('stars',gte(4)).inV().aggregate('userLibrary').inE('rated').has('stars',gte(4)).outV().outE('rated').has('stars',gte(4)).inV().where(not(within('userLibrary'))).groupCount().by('name').next()}
==>520.97150945
The logic behind LazyBarrierStrategy is still up in the air… however, hopefully
this first push gets the discussion going.
Take care,
Marko.
http://markorodriguez.com
On Jun 4, 2015, at 11:24 AM, Marko Rodriguez <[email protected]> wrote:
> Hello,
>
> TinkerPop3 introduces the concept of a Traverser, which represents not only
> the current element it is "at," but also metadata like where it has been
> before (path), a local side-effect data structure (sack), and how many
> traversers it represents (bulk). Bulk is a key concept that we required
> when we needed to do OLAP graph algorithms. With traversers emanating from
> every vertex in the graph, the number of traversers grows fast. However, if
> your traversal doesn't require the traverser's history (path) or its local
> data structure (sack), then its current location ("at") is the only uniquely
> identifying characteristic, and thus 100 traversers at v[1] are the same as 1
> traverser at v[1] with "long bulk = 100," where counting requires much less
> space (a 64-bit long) than enumeration.
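The identity argument above can be made concrete with a tiny Python sketch (the traversers here are hypothetical, made up for illustration): when traversers are keyed by element alone, duplicates collapse into a bulk count; the moment the path is part of the identity, nothing merges.

```python
from collections import Counter

# Hypothetical traversers as (element, path) pairs -- illustration only.
traversers = [('v[1]', ('v[4]', 'v[1]')),
              ('v[1]', ('v[6]', 'v[1]'))]

# No path/sack needed: identity is the element, so the two merge (bulk = 2).
by_element = Counter(element for element, path in traversers)

# Path needed: each history is unique, so nothing merges and both survive.
by_element_and_path = Counter(traversers)

# by_element          -> Counter({'v[1]': 2})
# by_element_and_path -> two entries, each with count 1
```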
>
> While this concept was great for OLAP, it is also great for OLTP.
>
> Check this out:
>
> gremlin> graph = TinkerGraph.open()
> ==>tinkergraph[vertices:0 edges:0]
> gremlin> graph.io(gryo()).readGraph('data/grateful-dead.kryo')
> ==>null
> gremlin> g = graph.traversal()
> ==>graphtraversalsource[tinkergraph[vertices:808 edges:8049], standard]
> gremlin>
> clockWithResult(20){g.V().out().out().out().out().out().out().out().out().out().out().count().next()}
> // CTRL-C as this will take the life of the universe to compute
> gremlin> clockWithResult(20){g.V().repeat(out()).times(10).count().next()}
> ==>2880.6283869
> ==>4947356572210703772
> gremlin>
> clockWithResult(20){g.V().out().barrier().out().barrier().out().barrier().out().barrier().out().barrier().out().barrier().out().barrier().out().barrier().out().barrier().out().barrier().count().next()}
> ==>27.921249
> ==>4947356572210703772
>
> There are three traversals above. All of them yield the same result -- that
> is, there are 4,947,356,572,210,703,772 length-10 paths in the Grateful Dead
> graph. Or, there are 4.9 sextillion length-10 paths.
>
> What makes the traversals above different is their respective use of bulking
> -- and ultimately their runtime in milliseconds.
>
> 1. This is a sequential traversal where each of the 4.9 sextillion
> objects ultimately touched must be enumerated.
> 2. This is a repeating traversal where the emanations from any one
> vertex are iterated before the next vertex is processed. However, for that
> one vertex's branch, results are bulked.
> 3. This is a barrier traversal where all the traversers are bulked at
> each outgoing step (i.e. a BSP computation, though in sequential form).
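To see why strategy 3 wins, here is a small Python sketch (the toy graph is my own construction, not the Grateful Dead data) comparing path counting by enumeration against path counting with bulked frontiers: both give the same answer, but the bulked version never holds more than one traverser per vertex.

```python
from collections import Counter

# Toy complete digraph on n vertices (self-loops included), so every hop
# multiplies the number of paths by n and the answer is n ** (k + 1).
n, k = 4, 6
adj = {v: list(range(n)) for v in range(n)}

def count_enumerated(adj, k):
    # Strategy 1: one object per traverser -- exponential blow-up.
    frontier = list(adj)
    for _ in range(k):
        frontier = [w for v in frontier for w in adj[v]]
    return len(frontier)

def count_bulked(adj, k):
    # Strategy 3: barrier after every hop -- at most one (vertex, bulk)
    # pair per vertex is ever alive, a BSP round in sequential form.
    frontier = Counter({v: 1 for v in adj})
    for _ in range(k):
        nxt = Counter()
        for v, bulk in frontier.items():
            for w in adj[v]:
                nxt[w] += bulk
        frontier = nxt
    return sum(frontier.values())

# Both agree: 4 ** 7 = 16384 length-6 paths, but the bulked version only
# ever tracks 4 traversers at a time.
```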
>
> All of these traversals are single threaded and over the same data. However,
> look at the run times --- sometimes computing power is not the key, but the
> way in which data is represented.
>
> To brag -- I can determine that there are 4,947,356,572,210,703,772 length-10
> paths in the Grateful Dead graph in 27 milliseconds.
>
> A PRACTICAL APPLICATION:
>
> Bulking is crucial when you have recurrence in your traversal --- in other
> words, when your traversal touches the same elements over and over again. One
> place this happens a lot is in recommendation algorithms. Let's take the
> MovieLens dataset. Let's recommend movies to user "u2590."
>
> g.V('u2590').outE('rated').has('stars',gte(4)).inV().
> aggregate('userLibrary').
> inE('rated').has('stars',gte(4)).outV().
> outE('rated').has('stars',gte(4)).inV().
> where(not(within('userLibrary'))).groupCount().by('name')
>
> Here is the runtime of that traversal:
>
> gremlin>
> clock(5){g.V('u2590').outE('rated').has('stars',gte(4)).inV().aggregate('userLibrary').inE('rated').has('stars',gte(4)).outV().outE('rated').has('stars',gte(4)).inV().where(not(within('userLibrary'))).groupCount().by('name').next()}
> ==>13624.124757
>
> However, I bet another user probably rated the same movies as user u2590.
> Thus, let's add a barrier() on those people. Finally, I bet those users also
> rated many of the same movies. Thus, let's barrier those movies.
>
> g.V('u2590').outE('rated').has('stars',gte(4)).inV().
> aggregate('userLibrary').
> inE('rated').has('stars',gte(4)).outV().barrier().
> outE('rated').has('stars',gte(4)).inV().barrier().
> where(not(within('userLibrary'))).groupCount().by('name')
>
> Here is the runtime of this new traversal which yields the exact same result:
>
> gremlin>
> clock(5){g.V('u2590').outE('rated').has('stars',gte(4)).inV().aggregate('userLibrary').inE('rated').has('stars',gte(4)).outV().barrier().outE('rated').has('stars',gte(4)).inV().barrier().where(not(within('userLibrary'))).groupCount().by('name').next()}
> ==>531.6441133999999
>
> From 13.6 seconds down to 0.5 seconds. Now that is the beauty of bulking.
>
> Enjoy!,
> Marko.
>
> http://markorodriguez.com
>
> References:
>
> [1]
> http://tinkerpop.incubator.apache.org/docs/3.0.0.M8-incubating/#a-note-on-barrier-steps