Hello,

TinkerPop3 introduces the concept of a Traverser which represents not only 
the current element it is "at," but also metadata like where it has been 
before (path), a local, side-effect data structure (sack), and how many 
traversers it represents (bulk). Bulk is a key concept that we needed when we 
started doing OLAP graph algorithms. With traversers emanating from every 
vertex in the graph, the number of traversers grows fast. However, if your 
traversal doesn't require the traverser's history (path) or its local data 
structure (sack), then its current location is its only uniquely identifying 
characteristic and thus, 100 traversers at v[1] are the same as 1 traverser at 
v[1] with "long bulk = 100," where counting requires much less space (a 64-bit 
long) than enumeration.
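To make the counting-versus-enumeration trade concrete, here is a minimal 
Python sketch of the idea (the names are hypothetical; this is not 
TinkerPop's actual Traverser class):

```python
from collections import Counter

# A traverser without path/sack state is identified solely by its location,
# so a population of traversers can be stored as location -> bulk counts.
enumerated = ['v1'] * 100        # 100 individual traversers at v[1]
bulked = Counter(enumerated)     # collapses to a single entry: {'v1': 100}

print(bulked)                    # Counter({'v1': 100})
print(len(enumerated), len(bulked))  # 100 objects vs. 1 (location, bulk) pair
```

One hundred objects become a single entry whose bulk is a 64-bit count.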

While this concept was great for OLAP, it is also great for OLTP.

Check this out:

gremlin> graph = TinkerGraph.open()
==>tinkergraph[vertices:0 edges:0]
gremlin> graph.io(gryo()).readGraph('data/grateful-dead.kryo')
==>null
gremlin> g = graph.traversal()
==>graphtraversalsource[tinkergraph[vertices:808 edges:8049], standard]
gremlin> 
clockWithResult(20){g.V().out().out().out().out().out().out().out().out().out().out().count().next()}
        // CTRL-C as this will take the life of the universe to compute
gremlin> clockWithResult(20){g.V().repeat(out()).times(10).count().next()}
==>2880.6283869
==>4947356572210703772
gremlin> 
clockWithResult(20){g.V().out().barrier().out().barrier().out().barrier().out().barrier().out().barrier().out().barrier().out().barrier().out().barrier().out().barrier().out().barrier().count().next()}
==>27.921249
==>4947356572210703772

There are three traversals above. All of them yield the same result -- that is, 
there are 4,947,356,572,210,703,772 length-10 paths in the Grateful Dead graph. 
Or, there are 4.9 sextillion length-10 paths.

What makes the traversals above different is their respective use of bulking -- 
and ultimately their runtime in milliseconds.

        1. This is a sequential traversal where each of the 4.9 sextillion 
objects ultimately touched must be enumerated.
        2. This is a repeating traversal where the emanations from any one 
vertex are iterated before the next vertex is processed. However, for that one 
vertex's branch, results are bulked.
        3. This is a barrier traversal where all the traversers are bulked at 
each outgoing step (i.e. a BSP computation, though in sequential form) [1].

All of these traversals are single threaded and over the same data. However, 
look at the run times -- sometimes computing power is not the key, but the way 
in which data is represented.

To brag -- I can determine that there are 4,947,356,572,210,703,772 length-10 
paths in the Grateful Dead graph in 27 milliseconds. 
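The barrier strategy of traversal 3 can be sketched in a few lines of Python. 
This is an assumed toy graph and a hand-rolled frontier, not TinkerPop 
internals, but it shows why bulked counting beats enumeration:

```python
from collections import Counter

# Toy adjacency list standing in for the graph.
graph = {'a': ['b', 'c'], 'b': ['c'], 'c': ['a', 'b']}

def count_paths(graph, hops):
    # Start one traverser at every vertex, each with bulk 1.
    frontier = Counter({v: 1 for v in graph})
    for _ in range(hops):
        nxt = Counter()
        # Barrier: traversers at the same vertex are already merged,
        # so each vertex is expanded once and its bulk is propagated.
        for v, bulk in frontier.items():
            for w in graph.get(v, []):
                nxt[w] += bulk
        frontier = nxt
    return sum(frontier.values())

print(count_paths(graph, 2))  # number of length-2 paths in the toy graph
```

Each hop costs work proportional to the number of distinct vertices in the 
frontier, not the (potentially sextillion-sized) number of paths.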

A PRACTICAL APPLICATION:

Bulking is crucial when you have recurrence in your traversal -- in other 
words, when your traversal touches the same elements over and over again. One 
place this happens a lot is in recommendation algorithms. Let's take the 
MovieLens dataset. Let's recommend movies to user "u2590."

g.V('u2590').outE('rated').has('stars',gte(4)).inV().
             aggregate('userLibrary').
          inE('rated').has('stars',gte(4)).outV().
          outE('rated').has('stars',gte(4)).inV().
             where(not(within('userLibrary'))).groupCount().by('name')

Here is the runtime of that traversal:

gremlin> 
clock(5){g.V('u2590').outE('rated').has('stars',gte(4)).inV().aggregate('userLibrary').inE('rated').has('stars',gte(4)).outV().outE('rated').has('stars',gte(4)).inV().where(not(within('userLibrary'))).groupCount().by('name').next()}
==>13624.124757

However, I bet other users rated the same movies as user u2590. Thus, 
let's add a barrier() on those people. Finally, I bet those users also rated 
many of the same movies. Thus, let's barrier those movies.

g.V('u2590').outE('rated').has('stars',gte(4)).inV().
             aggregate('userLibrary').
          inE('rated').has('stars',gte(4)).outV().barrier().
          outE('rated').has('stars',gte(4)).inV().barrier().
             where(not(within('userLibrary'))).groupCount().by('name')

Here is the runtime of this new traversal which yields the exact same result:

gremlin> 
clock(5){g.V('u2590').outE('rated').has('stars',gte(4)).inV().aggregate('userLibrary').inE('rated').has('stars',gte(4)).outV().barrier().outE('rated').has('stars',gte(4)).inV().barrier().where(not(within('userLibrary'))).groupCount().by('name').next()}
==>531.6441133999999

From 13.6 seconds down to 0.5 seconds. Now that is the beauty of bulking.
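The same bulking idea, stripped of Gremlin, looks like this in Python. The 
ratings data and user IDs below are made up for illustration; the point is 
that co-raters are merged into (user, bulk) pairs before the second hop, just 
as barrier() merges traversers:

```python
from collections import Counter

# Hypothetical toy ratings: user -> set of liked movies.
ratings = {
    'u1': {'m1', 'm2'},
    'u2': {'m1', 'm3'},
    'u3': {'m2', 'm3', 'm4'},
}

def recommend(user):
    library = ratings[user]
    # Hop 1 with a barrier: bulk each co-rater by how many movies they share,
    # rather than keeping one traverser per shared movie.
    similar = Counter()
    for other, movies in ratings.items():
        if other != user:
            similar[other] += len(movies & library)
    # Hop 2: propagate the bulks to movies outside the user's library
    # (the groupCount step of the Gremlin traversal).
    scores = Counter()
    for other, bulk in similar.items():
        for movie in ratings[other] - library:
            scores[movie] += bulk
    return scores

print(recommend('u1'))
```

Each co-rater is expanded exactly once, with its bulk carried along -- the 
recurrence in the data is counted, not re-walked.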

Enjoy!,
Marko.

http://markorodriguez.com

References:

        [1] 
http://tinkerpop.incubator.apache.org/docs/3.0.0.M8-incubating/#a-note-on-barrier-steps

