Hi,
Thinking last night, I came up with another way of doing bytecode optimization
in TP4 that has some interesting properties.
1. Providers don't write custom strategies.
2. Providers don’t write custom instructions.
3. TP4 can be ignorant of large swaths of optimization techniques.
==> Instead, providers define custom Sequences.
- In Java, Sequence<T>.iterator() => Iterator<T>
- In mm-ADT, a sequence tuple.
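As a rough sketch of the idea (illustrative Python, not TP4/mm-ADT API; all names here are my own), a sequence pairs a set of instruction-keys with a provider-supplied dereference function, and iteration is where the dereference finally happens:

```python
# Hypothetical sketch of a sequence tuple: a mapping of instruction-keys plus a
# provider-controlled dereference. Nothing touches the data until iteration.
class Sequence:
    def __init__(self, keys, deref):
        self.keys = keys    # instruction-keys this sequence answers to
        self.deref = deref  # provider-supplied: () -> iterator of tuples

    def __iter__(self):
        # Dereferencing is delayed until someone actually iterates.
        return self.deref()

# A trivial in-memory "provider" for illustration.
vs = Sequence({}, lambda: iter([{"id": 1}, {"id": 2}]))
print([v["id"] for v in vs])  # [1, 2]
```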
———
Assume the following {graph} tuple.
{ type:graph V:<V> }
g.V()
=compiles to=>
[V]
=evaluates to=>
<V>
When an instruction is applied to a tuple, it first checks whether that tuple
has a respective "instruction-key." If so, the value of that key is returned.
Thus, [V] => <V>.
<V> is a “sequence” (an iterator) of vertex tuples. It is a reference/pointer
to a bunch of vertices.
< type:V, parent:{graph}, bytecode:[[V]], hasId:<V.hasId> >
If all you did was g.V(), then <V> would be dereferenced (iterated) yielding
all the vertex tuples of the graph. Note that the {graph} tuple was the one
that said there was a V-key returning <V>. In other words, the graph
provider knows what a <V> sequence is, as it created it! Thus, the provider
knows how to generate an iterator of vertex tuples when <V>.iterator() is
called.
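The instruction-application rule can be sketched like this (illustrative Python; `apply_instruction` and the fallback are my own names, not TP4 API):

```python
# Hypothetical sketch of the dispatch rule: an instruction applied to a tuple
# first looks for its instruction-key; only on a miss does it fall back to
# standard stream evaluation.
def apply_instruction(tup, name, standard_eval=None):
    if name in tup:
        return tup[name]           # delayed: hand back the provider's value
    return standard_eval(tup)      # miss: evaluate the standard way

graph = {"type": "graph", "V": "<V>"}
print(apply_instruction(graph, "V"))  # <V>
```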
Moving on….
// g.V().hasId(1)
[V][hasId,1]
==> <V.hasId>
Note above that the provider-created <V> sequence has a hasId-key that
returns a <V.hasId> sequence. Again, like <V>, <V.hasId> is a reference/pointer
to a bunch of vertex tuples.
< type:V.hasId, parent:<V>, bytecode:[[V][hasId,1]] >
If all you did was g.V().hasId(1), then <V.hasId> would be dereferenced
(iterated) yielding v[1]. Note that <V.hasId> was created by <V> which was
created by {graph}. Thus, the graph provider indirectly created <V.hasId> and
thus, <V.hasId> knows how to dereference/iterate itself with respect to the
{graph} (follow the parent-key chain back to {graph}). Assume for this graph
provider, a <V.hasId> dereference/iteration performs an index lookup by id.
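A sketch of that assumption (illustrative Python; the index and names are hypothetical):

```python
# Hypothetical: the provider's <V.hasId> sequence dereferences through an id
# index rather than scanning every vertex of the graph.
ID_INDEX = {1: {"type": "vertex", "id": 1, "name": "marko"}}

def v_has_id_deref(the_id):
    # Direct index lookup -- no full vertex scan occurs here.
    hit = ID_INDEX.get(the_id)
    return iter([hit] if hit is not None else [])

print(next(v_has_id_deref(1))["name"])  # marko
```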
Notice how we haven’t done anything with bytecode strategies. g.V().hasId(1) was
able to trigger an index lookup. No custom strategy. No custom instructions.
Why? Because the graph provider delays dereferencing these sequences and thus,
delays manifesting vertex objects! When it finally has to manifest vertex
objects, it has an instruction-provenance chain that allows it to be smart
about how to get the data — i.e. an index lookup is possible.
A dereference doesn’t just happen when the end of the bytecode is reached. No,
it also happens when a sequence doesn’t have a respective instruction-key.
Watch...
// g.V().hasId(1).has('name','marko')
[V][hasId,1][has,name,marko]
==> { type:vertex, id:1, name:marko, outE:<outE> }
The <V.hasId> sequence from the previous step does not have a has-key. Thus, the
sequence chain can no longer delay evaluation. <V.hasId> is dereferenced, the
index lookup occurs, and v[1] is flatmapped into the processor stream. The
has(name,marko) instruction is evaluated on v[1]. The v[1] tuple doesn’t have a
has-key so the HasFunction does its standard evaluation on a vertex (no delayed
evaluation as we are back into standard TP-stream processing).
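That key-miss behavior can be sketched as (illustrative Python; my own names, not the HasFunction implementation):

```python
# Hypothetical sketch: when the sequence lacks the instruction-key, it is
# dereferenced and the instruction runs over the flatmapped tuples instead.
def evaluate(seq_keys, deref, instr_key, predicate):
    if instr_key in seq_keys:
        return seq_keys[instr_key]            # keep delaying evaluation
    # key miss: dereference, then standard stream evaluation on each tuple
    return (t for t in deref() if predicate(t))

v_has_id_keys = {}  # <V.hasId> has no has-key
result = evaluate(v_has_id_keys,
                  lambda: iter([{"id": 1, "name": "marko"}]),
                  "has",
                  lambda t: t.get("name") == "marko")
print(list(result))  # [{'id': 1, 'name': 'marko'}]
```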
Moving on...
// g.V().hasId(1).has('name','marko').outE()
[V][hasId,1][has,name,marko][outE]
==> <outE>
When the v[1] vertex tuple is flatmapped into the processor stream from
<V.hasId>, HasFunction lets it live, and then the [outE] instruction is called.
The v[1] vertex tuple has an outE-key. Thus, instead of OutEdgesFunction
evaluating on v[1], v[1] provides an <outE> sequence object to the processor
stream.
< type:outE, parent:{vertex id:1}, bytecode:[[outE]], hasLabel:<outE.hasLabel> >
If there are no more instructions, <outE> is dereferenced. Since v[1] created
the <outE>, it must have the logic to create an iterator() of outgoing incident
edge tuples.
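Sketched (illustrative Python; the adjacency structure is hypothetical):

```python
# Hypothetical: since v[1] created <outE>, v[1]'s provider supplies the logic
# that turns a dereference into an iterator of incident out-edge tuples.
ADJACENCY = {1: [{"type": "edge", "label": "knows", "inV": 2},
                 {"type": "edge", "label": "knows", "inV": 4}]}

def out_e_deref(vertex_id):
    return iter(ADJACENCY.get(vertex_id, []))

print([e["inV"] for e in out_e_deref(1)])  # [2, 4]
```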
Moving on...
// g.V().hasId(1).has('name','marko').outE('knows')
[V][hasId,1][has,name,marko][outE][hasLabel,knows]
==> <outE.hasLabel>
< type:outE.hasLabel, parent:<outE>, bytecode:[[outE][hasLabel,knows]],
inV:<outE.hasLabel.inV> >
Do you see where this is going?
// g.V().hasId(1).has('name','marko').out('knows')
[V][hasId,1][has,name,marko][outE][hasLabel,knows][inV]
==> <outE.hasLabel.inV>
< type:outE.hasLabel.inV, parent:<outE.hasLabel>,
bytecode:[[outE][hasLabel,knows][inV]] >
When the <outE.hasLabel.inV> sequence is dereferenced, v[1] will know how to
get all its knows-adjacent vertices. Guess what cool things just happened?
1. We didn’t materialize any incident edges.
2. We used a vertex-centric index to directly grab the v[1] knows-edges
off disk.
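The parent-key chain that makes this possible can be sketched as (illustrative Python):

```python
# Hypothetical sketch: walking the parent-key chain from a sequence back to
# the {graph} tuple recovers the full provenance at dereference time.
def provenance(seq):
    chain = []
    node = seq
    while node is not None:
        chain.append(node["type"])
        node = node.get("parent")
    return list(reversed(chain))

graph = {"type": "graph", "parent": None}
v = {"type": "V", "parent": graph}
v_has_id = {"type": "V.hasId", "parent": v}
print(provenance(v_has_id))  # ['graph', 'V', 'V.hasId']
```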
——
Here is why this direction may prove fruitful for TP4:
1. Optimizations don’t have to be determined at compile time via strategies.
* Instead, they can be determined at runtime via this “delayed
evaluation” mechanism.
2. Optimizations don’t have to be global to a type, they can also be
local to an instance.
* v[1] could have an out('knows')-index, but v[4] might not!
* This realization happens at runtime, not at compile time.
* Now think about when we support multiple databases in the
same query: how do you compile so that you know a TinkerVertex will
never touch a custom JanusGraph vertex-centric-index-backed instruction?
3. TP4 doesn’t have to be aware of all the various forms of
optimization as we are currently doing in the mm-ADT spec.
* For some provider, g.V().has('name','marko').id() may never
manifest a vertex. Purely from a name-index, it could return the vertex id.
* Thankfully with this sequencing-method, TP4 doesn't have to
think of this!
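For example (illustrative Python; the name-index is hypothetical):

```python
# Hypothetical: a name-index that maps property values straight to vertex ids
# lets g.V().has('name','marko').id() answer without building a single vertex.
NAME_INDEX = {"marko": [1]}

def ids_by_name(name):
    return iter(NAME_INDEX.get(name, []))

print(list(ids_by_name("marko")))  # [1]
```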
As always, my emails are out of control long. Thus, I will stop here. If
people think this is a cool idea, I can show you some other entailments
(especially how it affects the mm-ADT specification, how it makes it easier for
providers to embed themselves in other model-ADTs, and how RDBMS query
evaluation would happen in TP4……I can’t help myself — here is a sneak peek:
Tables are Sequences! table.where().where().select().. :))
Thanks for reading,
Marko.
http://rredux.com