Hello, I believe that the most important aspect of Gremlin is the concept of bulking. With graph traversals leading to combinatoric explosions in both time and space, the best way to battle such explosions is to realize that even if the number of unique paths is increasing rapidly, there is an upper limit on the number of unique vertices touched. In order to project an exponential space to a constant space, Gremlin traverses have the concept of a bulk. If two traversers “are equals,” then when they meet each other at the same step in the traversal and location in the graph, they are merged and their respective bulks are sum’d. Thus, two traverser become one. Excellent. This is why we can do massive branch traversals like below in blazing fast time:
gremlin> clockWithResult(10){g.V().repeat(out()).times(10).count().next()} ==>32.8619489 // 33 milliseconds ==>4947356572210703772 // 4 quintillion paths explored However, this model breaks down when traversers are no longer “equal.” While two traversers may have the same step and graph location, they may have different path histories. Thus, for traversals that leverage path information, it is very difficult to get traverser equality and thus, because the number of unique paths grows exponentially and traverser equality is determined by path equality (when paths are required for a traversal), you get exponential traverser growth. gremlin> clockWithResult(10){g.V().repeat(out()).times(10).path().count().next()} // this will not complete within the life of universe Thus, in order to make Gremlin fast, its all about doing our best to reduce the amount of information each traverser maintains which, in principle, increases the likelihood that any two traversers can be merged/bulked (increases the likelihood of “equality”). Ted Wilmes led the development of https://issues.apache.org/jira/browse/TINKERPOP-1254 <https://issues.apache.org/jira/browse/TINKERPOP-1254>. With this ticket, traversers are able to shed path information they will no longer need down the line. Thus, not only reducing their path footprint (space), but also increasing the likelihood of bulking (space+time). How does it work? Take the following example traversal: g.V().as(‘a’).out().as(‘b’).where(‘a’,eq(‘b’)).select(‘a’).out() // get the adjacent neighbors of those traversers that have themselves as a neighbor Prior to PathRetractionStrategy, a traverser’s path history would look like this: [v[1]] // V [v[1], v[2]] // out [v[1], v[2]] // where [v[1], v[2], v[1]] // select [v[1], v[2], v[1], v[3]] // out Now with PathRetractionStrategy. [v[1]] // V [v[1], v[2]] // out [v[1]] // where [] // select [] // out Once where() is done with “b”, it wipes the “b”-path information for the traverser’s path history. Next, once select() is done with “a”, it wipes “a”-path information from the traverser’s path history. Thus, after select(), there is no longer path information required for the traversal to faithfully execute. Moreover, after select(), bulking becomes extremely likely and thus, instead of calculating out() on an exponential set of traversers, we will only calculate it on the upper limit of traverses which is |V| (the total number of vertices in the graph). Next, what is really cool is that MatchStep is smart about pruning path information dynamically as needed based upon which patterns still required for the traverser to complete its computation. Here is a match()-traversal that finds all vertices that are in a triangle relationship. Note that we don’t care about triangles, just the vertices involved in one as we only select(“a”). However, by only selecting(“a”) this means that MatchStep is able to dynamically prune path information as it is no longer needed — “b” and “c". g.V().has(“performances”,gt(500)).match( as("a").out().as("b"), as(“b”).out().as("c”), as(“c”).out().as(“a”)).select("a").profile() This leads to the following runtimes below. Notice how PathRetractionStrategy is able to increase the likelihood of bulking in MatchStep — as path information is shed (no longer needed for the remaining patterns to check), traversers get merged and thus, less computation is required. However, if you need to select(“a”,”b”,”c”) from the match() patterns, well, then PathRetractionStrategy provides no benefit. Though, if you just need “a” and “b”, then PathRetractionStrategy helps. In short, as long as intermediate results can be shed along the way, then PathRetractionStrategy will capitalize on this fact and increase the likelihood of bulking. OLTP w/o PathRetractionStrategy Traversal Metrics Step Count Traversers Time (ms) % Dur ============================================================================================================= TinkerGraphStep(vertex,[performances.gt(500)]) 9 9 20.937 1.42 MatchStep(AND,[[MatchStartStep(a), ProfileStep,... 9546 9546 1435.432 97.36 MatchStartStep(a) 9 9 35.414 VertexStep(OUT,vertex) 603 603 23.232 MatchEndStep(b) 603 603 44.727 MatchStartStep(b) 603 602 8.148 VertexStep(OUT,vertex) 24279 24279 31.825 MatchEndStep(c) 24279 24279 41.708 MatchStartStep(c) 24279 24248 31.367 VertexStep(OUT,vertex) 1073459 1073459 749.078 MatchEndStep(a) 9546 9546 350.763 SelectOneStep(a) 9546 9546 18.016 1.22 >TOTAL - - 1474.386 - OLTP w/ PathRetractionStrategy Traversal Metrics Step Count Traversers Time (ms) % Dur ============================================================================================================= TinkerGraphStep(vertex,[performances.gt(500)]) 9 9 1.441 0.48 MatchStep(AND,[[MatchStartStep(a), ProfileStep,... 9546 9 295.555 99.20 MatchStartStep(a) 9 9 2.385 VertexStep(OUT,vertex) 603 603 2.006 MatchEndStep(b) 603 603 5.554 MatchStartStep(b) 603 602 3.245 VertexStep(OUT,vertex) 24279 24279 17.903 MatchEndStep(c) 24279 24279 112.304 MatchStartStep(c) 24279 2574 4.445 VertexStep(OUT,vertex) 1073459 64061 39.700 MatchEndStep(a) 9546 619 26.218 SelectOneStep(a) 9546 9 0.603 0.20 NoOpBarrierStep(2500) 9546 9 0.349 0.12 >TOTAL - - 297.950 - OLAP w/o PathRetractionStrategy Traversal Metrics Step Count Traversers Time (ms) % Dur ============================================================================================================= GraphStep(vertex,[]) 808 808 15.306 32.30 HasStep([performances.gt(500)]) 9 9 0.048 0.10 MatchStep(AND,[[MatchStartStep(a), ProfileStep,... 9546 9546 21.210 44.76 MatchStartStep(a) 9 9 0.037 VertexStep(OUT,vertex) 603 602 4.915 MatchEndStep(b) (profiling ignored) 0.000 MatchStartStep(b) 603 602 5.604 VertexStep(OUT,vertex) 24279 24248 28.272 MatchEndStep(c) (profiling ignored) 0.000 MatchStartStep(c) 24279 24248 26.132 VertexStep(OUT,vertex) 1073459 1072266 1478.613 MatchEndStep(a) (profiling ignored) 0.000 SelectOneStep(a) 9546 9546 10.826 22.84 >TOTAL - - 47.392 - OLTP w/ PathRetractionStrategy Traversal Metrics Step Count Traversers Time (ms) % Dur ============================================================================================================= GraphStep(vertex,[]) 808 808 5.633 94.99 HasStep([performances.gt(500)]) 9 9 0.059 1.00 MatchStep(AND,[[MatchStartStep(a), ProfileStep,... 9546 9 0.145 2.45 MatchStartStep(a) 9 9 0.041 VertexStep(OUT,vertex) 603 602 3.854 MatchEndStep(b) (profiling ignored) 0.000 MatchStartStep(b) 603 602 1.891 VertexStep(OUT,vertex) 24279 24248 78.803 MatchEndStep(c) (profiling ignored) 0.000 MatchStartStep(c) 24279 2574 5.581 VertexStep(OUT,vertex) 1073459 63887 101.209 MatchEndStep(a) (profiling ignored) 0.000 SelectOneStep(a) 9546 9 0.092 1.56 >TOTAL - - 5.930 - Enjoy, Marko. http://markorodriguez.com