Hello,
I believe that the most important aspect of Gremlin is the concept of bulking.
With graph traversals leading to combinatoric explosions in both time and
space, the best way to battle such explosions is to realize that even if the
number of unique paths is increasing rapidly, there is an upper limit on the
number of unique vertices touched. In order to project an exponential space to
a constant space, Gremlin traverses have the concept of a bulk. If two
traversers “are equals,” then when they meet each other at the same step in the
traversal and location in the graph, they are merged and their respective bulks
are sum’d. Thus, two traverser become one. Excellent. This is why we can do
massive branch traversals like below in blazing fast time:
gremlin> clockWithResult(10){g.V().repeat(out()).times(10).count().next()}
==>32.8619489 // 33 milliseconds
==>4947356572210703772 // 4 quintillion paths explored
However, this model breaks down when traversers are no longer “equal.” While
two traversers may have the same step and graph location, they may have
different path histories. Thus, for traversals that leverage path information,
it is very difficult to get traverser equality and thus, because the number of
unique paths grows exponentially and traverser equality is determined by path
equality (when paths are required for a traversal), you get exponential
traverser growth.
gremlin>
clockWithResult(10){g.V().repeat(out()).times(10).path().count().next()}
// this will not complete within the life of universe
Thus, in order to make Gremlin fast, its all about doing our best to reduce the
amount of information each traverser maintains which, in principle, increases
the likelihood that any two traversers can be merged/bulked (increases the
likelihood of “equality”).
Ted Wilmes led the development of
https://issues.apache.org/jira/browse/TINKERPOP-1254
<https://issues.apache.org/jira/browse/TINKERPOP-1254>. With this ticket,
traversers are able to shed path information they will no longer need down the
line. Thus, not only reducing their path footprint (space), but also increasing
the likelihood of bulking (space+time). How does it work? Take the following
example traversal:
g.V().as(‘a’).out().as(‘b’).where(‘a’,eq(‘b’)).select(‘a’).out() // get the
adjacent neighbors of those traversers that have themselves as a neighbor
Prior to PathRetractionStrategy, a traverser’s path history would look like
this:
[v[1]] // V
[v[1], v[2]] // out
[v[1], v[2]] // where
[v[1], v[2], v[1]] // select
[v[1], v[2], v[1], v[3]] // out
Now with PathRetractionStrategy.
[v[1]] // V
[v[1], v[2]] // out
[v[1]] // where
[] // select
[] // out
Once where() is done with “b”, it wipes the “b”-path information for the
traverser’s path history. Next, once select() is done with “a”, it wipes
“a”-path information from the traverser’s path history. Thus, after select(),
there is no longer path information required for the traversal to faithfully
execute. Moreover, after select(), bulking becomes extremely likely and thus,
instead of calculating out() on an exponential set of traversers, we will only
calculate it on the upper limit of traverses which is |V| (the total number of
vertices in the graph). Next, what is really cool is that MatchStep is smart
about pruning path information dynamically as needed based upon which patterns
still required for the traverser to complete its computation. Here is a
match()-traversal that finds all vertices that are in a triangle relationship.
Note that we don’t care about triangles, just the vertices involved in one as
we only select(“a”). However, by only selecting(“a”) this means that MatchStep
is able to dynamically prune path information as it is no longer needed — “b”
and “c".
g.V().has(“performances”,gt(500)).match(
as("a").out().as("b"),
as(“b”).out().as("c”),
as(“c”).out().as(“a”)).select("a").profile()
This leads to the following runtimes below. Notice how PathRetractionStrategy
is able to increase the likelihood of bulking in MatchStep — as path
information is shed (no longer needed for the remaining patterns to check),
traversers get merged and thus, less computation is required. However, if you
need to select(“a”,”b”,”c”) from the match() patterns, well, then
PathRetractionStrategy provides no benefit. Though, if you just need “a” and
“b”, then PathRetractionStrategy helps. In short, as long as intermediate
results can be shed along the way, then PathRetractionStrategy will capitalize
on this fact and increase the likelihood of bulking.
OLTP w/o PathRetractionStrategy
Traversal Metrics
Step Count
Traversers Time (ms) % Dur
=============================================================================================================
TinkerGraphStep(vertex,[performances.gt(500)]) 9
9 20.937 1.42
MatchStep(AND,[[MatchStartStep(a), ProfileStep,... 9546
9546 1435.432 97.36
MatchStartStep(a) 9
9 35.414
VertexStep(OUT,vertex) 603
603 23.232
MatchEndStep(b) 603
603 44.727
MatchStartStep(b) 603
602 8.148
VertexStep(OUT,vertex) 24279
24279 31.825
MatchEndStep(c) 24279
24279 41.708
MatchStartStep(c) 24279
24248 31.367
VertexStep(OUT,vertex) 1073459
1073459 749.078
MatchEndStep(a) 9546
9546 350.763
SelectOneStep(a) 9546
9546 18.016 1.22
>TOTAL -
- 1474.386 -
OLTP w/ PathRetractionStrategy
Traversal Metrics
Step Count
Traversers Time (ms) % Dur
=============================================================================================================
TinkerGraphStep(vertex,[performances.gt(500)]) 9
9 1.441 0.48
MatchStep(AND,[[MatchStartStep(a), ProfileStep,... 9546
9 295.555 99.20
MatchStartStep(a) 9
9 2.385
VertexStep(OUT,vertex) 603
603 2.006
MatchEndStep(b) 603
603 5.554
MatchStartStep(b) 603
602 3.245
VertexStep(OUT,vertex) 24279
24279 17.903
MatchEndStep(c) 24279
24279 112.304
MatchStartStep(c) 24279
2574 4.445
VertexStep(OUT,vertex) 1073459
64061 39.700
MatchEndStep(a) 9546
619 26.218
SelectOneStep(a) 9546
9 0.603 0.20
NoOpBarrierStep(2500) 9546
9 0.349 0.12
>TOTAL -
- 297.950 -
OLAP w/o PathRetractionStrategy
Traversal Metrics
Step Count
Traversers Time (ms) % Dur
=============================================================================================================
GraphStep(vertex,[]) 808
808 15.306 32.30
HasStep([performances.gt(500)]) 9
9 0.048 0.10
MatchStep(AND,[[MatchStartStep(a), ProfileStep,... 9546
9546 21.210 44.76
MatchStartStep(a) 9
9 0.037
VertexStep(OUT,vertex) 603
602 4.915
MatchEndStep(b) (profiling ignored)
0.000
MatchStartStep(b) 603
602 5.604
VertexStep(OUT,vertex) 24279
24248 28.272
MatchEndStep(c) (profiling ignored)
0.000
MatchStartStep(c) 24279
24248 26.132
VertexStep(OUT,vertex) 1073459
1072266 1478.613
MatchEndStep(a) (profiling ignored)
0.000
SelectOneStep(a) 9546
9546 10.826 22.84
>TOTAL -
- 47.392 -
OLTP w/ PathRetractionStrategy
Traversal Metrics
Step Count
Traversers Time (ms) % Dur
=============================================================================================================
GraphStep(vertex,[]) 808
808 5.633 94.99
HasStep([performances.gt(500)]) 9
9 0.059 1.00
MatchStep(AND,[[MatchStartStep(a), ProfileStep,... 9546
9 0.145 2.45
MatchStartStep(a) 9
9 0.041
VertexStep(OUT,vertex) 603
602 3.854
MatchEndStep(b) (profiling ignored)
0.000
MatchStartStep(b) 603
602 1.891
VertexStep(OUT,vertex) 24279
24248 78.803
MatchEndStep(c) (profiling ignored)
0.000
MatchStartStep(c) 24279
2574 5.581
VertexStep(OUT,vertex) 1073459
63887 101.209
MatchEndStep(a) (profiling ignored)
0.000
SelectOneStep(a) 9546
9 0.092 1.56
>TOTAL -
- 5.930 -
Enjoy,
Marko.
http://markorodriguez.com