On Jul 12, 2016 4:22 PM, "Marko Rodriguez" <[email protected]> wrote:
>
> Hello,
>
> I believe that the most important aspect of Gremlin is the concept of
bulking. With graph traversals leading to combinatoric explosions in both
time and space, the best way to battle such explosions is to realize that
even if the number of unique paths is increasing rapidly, there is an upper
limit on the number of unique vertices touched. In order to project an
exponential space to a constant space, Gremlin traverses have the concept
of a bulk. If two traversers “are equals,” then when they meet each other
at the same step in the traversal and location in the graph, they are
merged and their respective bulks are sum’d. Thus, two traverser become
one. Excellent. This is why we can do massive branch traversals like below
in blazing fast time:
>
> gremlin> clockWithResult(10){g.V().repeat(out()).times(10).count().next()}
> ==>32.8619489 // 33 milliseconds
> ==>4947356572210703772 // 4 quintillion paths explored
>
> However, this model breaks down when traversers are no longer “equal.”
While two traversers may have the same step and graph location, they may
have different path histories. Thus, for traversals that leverage path
information, it is very difficult to get traverser equality and thus,
because the number of unique paths grows exponentially and traverser
equality is determined by path equality (when paths are required for a
traversal), you get exponential traverser growth.
>
> gremlin>
clockWithResult(10){g.V().repeat(out()).times(10).path().count().next()}
> // this will not complete within the life of universe
>
> Thus, in order to make Gremlin fast, its all about doing our best to
reduce the amount of information each traverser maintains which, in
principle, increases the likelihood that any two traversers can be
merged/bulked (increases the likelihood of “equality”).
>
> Ted Wilmes led the development of
https://issues.apache.org/jira/browse/TINKERPOP-1254 <
https://issues.apache.org/jira/browse/TINKERPOP-1254>. With this ticket,
traversers are able to shed path information they will no longer need down
the line. Thus, not only reducing their path footprint (space), but also
increasing the likelihood of bulking (space+time). How does it work? Take
the following example traversal:
>
> g.V().as(‘a’).out().as(‘b’).where(‘a’,eq(‘b’)).select(‘a’).out() // get
the adjacent neighbors of those traversers that have themselves as a
neighbor
>
> Prior to PathRetractionStrategy, a traverser’s path history would look
like this:
>
> [v[1]] // V
> [v[1], v[2]] // out
> [v[1], v[2]] // where
> [v[1], v[2], v[1]] // select
> [v[1], v[2], v[1], v[3]] // out
>
> Now with PathRetractionStrategy.
>
> [v[1]] // V
> [v[1], v[2]] // out
> [v[1]] // where
> [] // select
> [] // out
>
> Once where() is done with “b”, it wipes the “b”-path information for the
traverser’s path history. Next, once select() is done with “a”, it wipes
“a”-path information from the traverser’s path history. Thus, after
select(), there is no longer path information required for the traversal to
faithfully execute. Moreover, after select(), bulking becomes extremely
likely and thus, instead of calculating out() on an exponential set of
traversers, we will only calculate it on the upper limit of traverses which
is |V| (the total number of vertices in the graph). Next, what is really
cool is that MatchStep is smart about pruning path information dynamically
as needed based upon which patterns still required for the traverser to
complete its computation. Here is a match()-traversal that finds all
vertices that are in a triangle relationship. Note that we don’t care about
triangles, just the vertices involved in one as we only select(“a”).
However, by only selecting(“a”) this means that MatchStep is able to
dynamically prune path information as it is no longer needed — “b” and “c".
>
> g.V().has(“performances”,gt(500)).match(
> as("a").out().as("b"),
> as(“b”).out().as("c”),
> as(“c”).out().as(“a”)).select("a").profile()
>
>
> This leads to the following runtimes below. Notice how
PathRetractionStrategy is able to increase the likelihood of bulking in
MatchStep — as path information is shed (no longer needed for the remaining
patterns to check), traversers get merged and thus, less computation is
required. However, if you need to select(“a”,”b”,”c”) from the match()
patterns, well, then PathRetractionStrategy provides no benefit. Though, if
you just need “a” and “b”, then PathRetractionStrategy helps. In short, as
long as intermediate results can be shed along the way, then
PathRetractionStrategy will capitalize on this fact and increase the
likelihood of bulking.
>
> OLTP w/o PathRetractionStrategy
>
> Traversal Metrics
> Step Count
Traversers Time (ms) % Dur
>
=============================================================================================================
> TinkerGraphStep(vertex,[performances.gt(500)]) 9
9 20.937 1.42
> MatchStep(AND,[[MatchStartStep(a), ProfileStep,... 9546
9546 1435.432 97.36
> MatchStartStep(a) 9
9 35.414
> VertexStep(OUT,vertex) 603
603 23.232
> MatchEndStep(b) 603
603 44.727
> MatchStartStep(b) 603
602 8.148
> VertexStep(OUT,vertex) 24279
24279 31.825
> MatchEndStep(c) 24279
24279 41.708
> MatchStartStep(c) 24279
24248 31.367
> VertexStep(OUT,vertex) 1073459
1073459 749.078
> MatchEndStep(a) 9546
9546 350.763
> SelectOneStep(a) 9546
9546 18.016 1.22
> >TOTAL -
- 1474.386 -
>
> OLTP w/ PathRetractionStrategy
>
> Traversal Metrics
> Step Count
Traversers Time (ms) % Dur
>
=============================================================================================================
> TinkerGraphStep(vertex,[performances.gt(500)]) 9
9 1.441 0.48
> MatchStep(AND,[[MatchStartStep(a), ProfileStep,... 9546
9 295.555 99.20
> MatchStartStep(a) 9
9 2.385
> VertexStep(OUT,vertex) 603
603 2.006
> MatchEndStep(b) 603
603 5.554
> MatchStartStep(b) 603
602 3.245
> VertexStep(OUT,vertex) 24279
24279 17.903
> MatchEndStep(c) 24279
24279 112.304
> MatchStartStep(c) 24279
2574 4.445
> VertexStep(OUT,vertex) 1073459
64061 39.700
> MatchEndStep(a) 9546
619 26.218
> SelectOneStep(a) 9546
9 0.603 0.20
> NoOpBarrierStep(2500) 9546
9 0.349 0.12
> >TOTAL -
- 297.950 -
>
> OLAP w/o PathRetractionStrategy
>
> Traversal Metrics
> Step Count
Traversers Time (ms) % Dur
>
=============================================================================================================
> GraphStep(vertex,[]) 808
808 15.306 32.30
> HasStep([performances.gt(500)]) 9
9 0.048 0.10
> MatchStep(AND,[[MatchStartStep(a), ProfileStep,... 9546
9546 21.210 44.76
> MatchStartStep(a) 9
9 0.037
> VertexStep(OUT,vertex) 603
602 4.915
> MatchEndStep(b) (profiling ignored)
0.000
> MatchStartStep(b) 603
602 5.604
> VertexStep(OUT,vertex) 24279
24248 28.272
> MatchEndStep(c) (profiling ignored)
0.000
> MatchStartStep(c) 24279
24248 26.132
> VertexStep(OUT,vertex) 1073459
1072266 1478.613
> MatchEndStep(a) (profiling ignored)
0.000
> SelectOneStep(a) 9546
9546 10.826 22.84
> >TOTAL -
- 47.392 -
>
> OLTP w/ PathRetractionStrategy
>
> Traversal Metrics
> Step Count
Traversers Time (ms) % Dur
>
=============================================================================================================
> GraphStep(vertex,[]) 808
808 5.633 94.99
> HasStep([performances.gt(500)]) 9
9 0.059 1.00
> MatchStep(AND,[[MatchStartStep(a), ProfileStep,... 9546
9 0.145 2.45
> MatchStartStep(a) 9
9 0.041
> VertexStep(OUT,vertex) 603
602 3.854
> MatchEndStep(b) (profiling ignored)
0.000
> MatchStartStep(b) 603
602 1.891
> VertexStep(OUT,vertex) 24279
24248 78.803
> MatchEndStep(c) (profiling ignored)
0.000
> MatchStartStep(c) 24279
2574 5.581
> VertexStep(OUT,vertex) 1073459
63887 101.209
> MatchEndStep(a) (profiling ignored)
0.000
> SelectOneStep(a) 9546
9 0.092 1.56
> >TOTAL -
- 5.930 -
>
>
> Enjoy,
> Marko.
>
> http://markorodriguez.com
>
>
>