Hi,
Thinking last night, I came up with another way of doing bytecode optimization
in TP4 that has some interesting properties.
1. Providers don't write custom strategies.
2. Providers don’t write custom instructions.
3. TP4 can be ignorant of large swaths of optimization techniques.
==> Instead, providers define custom Sequences.
- In Java, Sequence<T>.iterator() => Iterator<T>
- In mm-ADT, a sequence tuple.
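As a rough sketch of the idea (illustrative Python, not TP4/mm-ADT API; all names here are my own), a sequence pairs a set of instruction-keys with a provider-supplied dereference function, and iteration is where the dereference finally happens:

```python
# Hypothetical sketch of a sequence tuple: a mapping of instruction-keys plus a
# provider-controlled dereference. Nothing touches the data until iteration.
class Sequence:
    def __init__(self, keys, deref):
        self.keys = keys    # instruction-keys this sequence answers to
        self.deref = deref  # provider-supplied: () -> iterator of tuples

    def __iter__(self):
        # Dereferencing is delayed until someone actually iterates.
        return self.deref()

# A trivial in-memory "provider" for illustration.
vs = Sequence({}, lambda: iter([{"id": 1}, {"id": 2}]))
print([v["id"] for v in vs])  # [1, 2]
```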
———
Assume the following {graph} tuple.
{ type:graph V:<V> }
g.V()
=compiles to=>
[V]
=evaluates to=>
<V>
When an instruction is applied to a tuple, it first checks whether that tuple
has a respective "instruction-key." If so, the value of that key is returned.
Thus, [V] => <V>.
<V> is a “sequence” (an iterator) of vertex tuples. It is a reference/pointer
to a bunch of vertices.
< type:V, parent:{graph}, bytecode:[[V]], hasId:<V.hasId> >
If all you did was g.V(), then <V> would be dereferenced (iterated) yielding
all the vertex tuples of the graph. Note that the {graph} tuple was the one
that said there was a V-key returning <V>. In other words, the graph
provider knows what a <V> sequence is, as it created it! Thus, the provider
knows how to generate an iterator of vertex tuples when <V>.iterator() is
called.
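The instruction-application rule can be sketched like this (illustrative Python; `apply_instruction` and the fallback are my own names, not TP4 API):

```python
# Hypothetical sketch of the dispatch rule: an instruction applied to a tuple
# first looks for its instruction-key; only on a miss does it fall back to
# standard stream evaluation.
def apply_instruction(tup, name, standard_eval=None):
    if name in tup:
        return tup[name]           # delayed: hand back the provider's value
    return standard_eval(tup)      # miss: evaluate the standard way

graph = {"type": "graph", "V": "<V>"}
print(apply_instruction(graph, "V"))  # <V>
```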
Moving on….
// g.V().hasId(1)
[V][hasId,1]
==> <V.hasId>
Note above that the provider-created <V> sequence has a hasId-key that
returns a <V.hasId> sequence. Again, like <V>, <V.hasId> is a reference/pointer
to a bunch of vertex tuples.
< type:V.hasId, parent:<V>, bytecode:[[V][hasId,1]] >
If all you did was g.V().hasId(1), then <V.hasId> would be dereferenced
(iterated) yielding v[1]. Note that <V.hasId> was created by <V> which was
created by {graph}. Thus, the graph provider indirectly created <V.hasId> and
thus, <V.hasId> knows how to dereference/iterate itself with respect to the
{graph} (follow the parent-key chain back to {graph}). Assume for this graph
provider, a <V.hasId> dereference/iteration performs an index lookup by id.
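A sketch of that assumption (illustrative Python; the index and names are hypothetical):

```python
# Hypothetical: the provider's <V.hasId> sequence dereferences through an id
# index rather than scanning every vertex of the graph.
ID_INDEX = {1: {"type": "vertex", "id": 1, "name": "marko"}}

def v_has_id_deref(the_id):
    # Direct index lookup -- no full vertex scan occurs here.
    hit = ID_INDEX.get(the_id)
    return iter([hit] if hit is not None else [])

print(next(v_has_id_deref(1))["name"])  # marko
```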
Notice how we haven’t done anything with bytecode strategies. g.V().hasId(1) was
able to trigger an index lookup. No custom strategy. No custom instructions.
Why? Because the graph provider delays dereferencing these sequences and thus,
delays manifesting vertex objects! When it finally has to manifest vertex
objects, it has an instruction-provenance chain that allows it to be smart
about how to get the data — i.e. an index lookup is possible.
A dereference doesn’t just happen when the end of the bytecode is reached. No,
it also happens when a sequence doesn’t have a respective instruction-key.
Watch...
// g.V().hasId(1).has('name','marko')
[V][hasId,1][has,name,marko]
==> { type:vertex, id:1, name:marko, outE:<outE> }
The <V.hasId> sequence from the previous step does not have a has-key. Thus, the
sequence chain can no longer delay evaluation. <V.hasId> is dereferenced, the
index lookup occurs, and v[1] is flatmapped into the processor stream. The
has(name,marko) instruction is evaluated on v[1]. The v[1] tuple doesn’t have a
has-key so the HasFunction does its standard evaluation on a vertex (no delayed
evaluation as we are back into standard TP-stream processing).
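That key-miss behavior can be sketched as (illustrative Python; my own names, not the HasFunction implementation):

```python
# Hypothetical sketch: when the sequence lacks the instruction-key, it is
# dereferenced and the instruction runs over the flatmapped tuples instead.
def evaluate(seq_keys, deref, instr_key, predicate):
    if instr_key in seq_keys:
        return seq_keys[instr_key]            # keep delaying evaluation
    # key miss: dereference, then standard stream evaluation on each tuple
    return (t for t in deref() if predicate(t))

v_has_id_keys = {}  # <V.hasId> has no has-key
result = evaluate(v_has_id_keys,
                  lambda: iter([{"id": 1, "name": "marko"}]),
                  "has",
                  lambda t: t.get("name") == "marko")
print(list(result))  # [{'id': 1, 'name': 'marko'}]
```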
Moving on...
// g.V().hasId(1).has('name','marko').outE()
[V][hasId,1][has,name,marko][outE]
==> <outE>
When the v[1] vertex tuple is flatmapped into the processor stream from
<V.hasId>, HasFunction lets it live, and then the [outE] instruction is called.
The v[1] vertex tuple has an outE-key. Thus, instead of OutEdgesFunction
evaluating on v[1], v[1] provides an <outE> sequence object to the processor
stream.
< type:outE, parent:{vertex id:1}, bytecode:[[outE]], hasLabel:<outE.hasLabel> >
If there are no more instructions, <outE> is dereferenced. Since v[1] created
the <outE>, it must have the logic to create an iterator() of outgoing incident
edge tuples.
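Sketched (illustrative Python; the adjacency structure is hypothetical):

```python
# Hypothetical: since v[1] created <outE>, v[1]'s provider supplies the logic
# that turns a dereference into an iterator of incident out-edge tuples.
ADJACENCY = {1: [{"type": "edge", "label": "knows", "inV": 2},
                 {"type": "edge", "label": "knows", "inV": 4}]}

def out_e_deref(vertex_id):
    return iter(ADJACENCY.get(vertex_id, []))

print([e["inV"] for e in out_e_deref(1)])  # [2, 4]
```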
Moving on...
// g.V().hasId(1).has('name','marko').outE('knows')
[V][hasId,1][has,name,marko][outE][hasLabel,knows]
==> <outE.hasLabel>
< type:outE.hasLabel, parent:<outE>, bytecode:[[outE][hasLabel,knows]],
inV:<outE.hasLabel.inV> >
Do you see where this is going?
// g.V().hasId(1).has('name','marko').out('knows')
[V][hasId,1][has,name,marko][outE][hasLabel,knows][inV]
==> <outE.hasLabel.inV>
< type:outE.hasLabel.inV, parent:<outE.hasLabel>,
bytecode:[[outE][hasLabel,knows][inV]] >
When the <outE.hasLabel.inV> sequence is dereferenced, v[1] will know how to
get all its knows-adjacent vertices. Guess what cool things just happened?
1. We didn’t materialize any incident edges.
2. We used a vertex-centric index to directly grab the v[1] knows-edges
off disk.
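The parent-key chain that makes this possible can be sketched as (illustrative Python):

```python
# Hypothetical sketch: walking the parent-key chain from a sequence back to
# the {graph} tuple recovers the full provenance at dereference time.
def provenance(seq):
    chain = []
    node = seq
    while node is not None:
        chain.append(node["type"])
        node = node.get("parent")
    return list(reversed(chain))

graph = {"type": "graph", "parent": None}
v = {"type": "V", "parent": graph}
v_has_id = {"type": "V.hasId", "parent": v}
print(provenance(v_has_id))  # ['graph', 'V', 'V.hasId']
```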
——
Here is why this direction may prove fruitful for TP4:
1. Optimizations don’t have to be determined at compile time via strategies.
* Instead, they can be determined at runtime via this “delayed
evaluation” mechanism.
2. Optimizations don’t have to be global to a type, they can also be
local to an instance.
* v[1] could have an out('knows')-index, but v[4] might not!
* This realization happens at runtime, not at compile time.
* Now think about when we support multiple databases in the
same query: how do you compile so that you know a TinkerVertex will
never touch a custom JanusGraph vertex-centric-index-backed instruction?
3. TP4 doesn’t have to be aware of all the various forms of
optimization as we are currently doing in the mm-ADT spec.
* For some provider, g.V().has('name','marko').id() may never
manifest a vertex. Purely from a name-index, it could return the vertex id.
* Thankfully with this sequencing-method, TP4 doesn't have to
think of this!
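For example (illustrative Python; the name-index is hypothetical):

```python
# Hypothetical: a name-index that maps property values straight to vertex ids
# lets g.V().has('name','marko').id() answer without building a single vertex.
NAME_INDEX = {"marko": [1]}

def ids_by_name(name):
    return iter(NAME_INDEX.get(name, []))

print(list(ids_by_name("marko")))  # [1]
```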
As always, my emails are out of control long. Thus, I will stop here. If
people think this is a cool idea, I can show you some other entailments
(especially how it affects the mm-ADT specification, how it makes it easier for
providers to embed themselves in other model-ADTs, and how RDBMS query
evaluation would happen in TP4……I can’t help myself — here is a sneak peek:
Tables are Sequences! table.where().where().select().. :))
Thanks for reading,
Marko.
http://rredux.com