Re: N-Tuples, Pointers, Data Model Interfaces, and Bytecode Instructions

Marko Rodriguez Tue, 07 May 2019 10:15:07 -0700

When your hot, your hot. Going to keep pushin’.

What is the difference between a relational database's and a graph database’s 
encoding of a property graph? The short answer, NOTHING. The n-tuple model for 
a subset of the TinkerPop toy graph is:


{#type:vertex, #id:1, #label:person, name:marko, age:29, 
#outE:*{#outV=*{#id=1}}}
{#type:vertex, #id:4, #label:person, name:josh, age:32}
{#type:vertex, #id:2, #label:person, name:vadas, age:27}
{#type:edge, #id:7, #label:knows, #outV:*{#id=1}, #inV:*{#id=2}}
{#type:edge, #id:8, #label:knows, #outV:*{#id=1}, #inV:*{#id=4}}

MySQL and Neo4j would talk with TP4 via the same above tuples. What makes MySQL 
different from Neo4j is how the pointers are resolved.

First, lets talk about the MySQL schema that ultimately holds the graphdb 
instance data.

CREATE TABLE #V {
  #id int NOT NULL PRIMARY KEY,
  #label varchar,
  name varchar,
  age int,
  #outE varchar
}

CREATE TABLE #E {
  #id int NOT NULL PRIMARY KEY,
  #label varchar,
  #outV int,
  #inV int,
  FOREIGN KEY (#outV) REFERENCES V(#id),
  FOREIGN KEY (#inV) REFERENCES V(#id)
}

We know how graph databases will execute bytecode so lets focus on how MySQL 
will do it.

// g.V()
db().values(‘#V’)
  == strategizes to ==>
db().sql(‘SELECT * FROM #V’)

// g.V().outE(‘knows’).id()
db().values(‘#V’).values(‘#outE’).has(‘#label’,’knows’).values(‘#id’)
  == strategizes to ==>
db().sql(‘SELECT * FROM #V’).sql(‘SELECT * FROM #E WHERE 
#outV=$id’).by(‘#id’).has(‘#label’,’knows’).values(‘#id’)
  == strategizes to ==>
db().sql(‘SELECT * FROM #V’).sql(‘SELECT * FROM #E WHERE #label=knows AND 
#outV=$id’).by(‘#id’).values(‘#id’)
  == strategizes to ==>
db().sql(‘SELECT #id FROM #V’).sql(‘SELECT #id FROM #E WHERE #label=knows AND 
#outV=$id’).by(‘#id’)
  == strategizes to ==>
db().sql(‘SELECT #E.#id FROM #V, #E WHERE #E.#label=knows AND #E.#outV=#V.#id’)

What just happened? The RDBMS TP4 compiler strategy knows that rows can not 
have direct reference to one another. Thus, in order to get the outgoing edges 
of a vertex, a join is required. What is interesting is that the n-tuple 
model’s #outE provides the all the necessary information to perform the join. 
Since we defined the #outV foreign key as a reference to the primary key #id, 
we know that the *{#outV=...} pointer is simply referencing the #id of the #V 
table. Thus, constructing our SQL is easy. However, on our initial pass, we are 
doing a new SQL call to grab the edges for each vertex emitted from the 
sql(SELECT * FROM #V) instruction. Fortunately, we are able to recursively 
merge SQL calls until we can no longer merge. Note the algorithmic nature of 
this process (easy to code). Each strategy-pass folds as much information into 
the preceding SQL instruction as possible so we don’t have to shuffle tuples 
back and forth from MySQL. In the end, we reach a single sql() query “fix 
point” and tada — TP4 bytecode strategized to use MySQL’s SQL execution engine. 
You have successfully encoded (in one particular way) a property graph into a 
relational database.

Drop mic,
Marko.

http://rredux.com <http://rredux.com/>




> On May 7, 2019, at 10:20 AM, Marko Rodriguez <[email protected]> wrote:
> 
> I’m in the pocket so I’m just going to riff…
> 
> The concept of a pointer as I have defined it is a "0-argument function.” 
> That is, we currently have Pointer as:
> 
> Pointer<E> implements Supplier<Iterator<E extends Tuple>>
> 
> However, pointers can be n-arg functions!
> 
> Pointer<S,E> implements Function<List<S>, Iterator<E extends Tuple>>
> 
> If so, providers can support dynamic tuple fields.
> 
> {#id:1, #label:person, name:marko, age:29, #outE_by_label:*{#outV=*{#id:1}, 
> #label=@[0]}}
> 
> I’m using @[0] to denote the first argument passed into the pointer.
> 
> What this tuple is saying is that the provider is able to do a direct lookup 
> of all incident edges based on edge label. Thus, as TP4, we can optimize:
> 
> values(‘#outE’).has(‘#label’,‘knows’)
>       == to ==>
> values(‘#outE_by_label’).by(‘knows’)
> 
> Lets say a user is doing outE(‘knows’,’likes’). With #outE_by_label we can 
> only reference one edge sequence as we only use one argument. Thus, the 
> compiler would have to write out the following bytecode:
> 
> union(values(‘#outE_by_label’).by(‘knows’), 
> values(‘#outE_by_label’).by(‘likes’))
> 
> However, lets say another provider has this field on their vertex-tuples:
> 
> {#id:1, #label:person, name:marko, #outE_by_labels:*{#outV=*{#id:1}, #label 
> within @}}
> 
> This tuple is saying that it can get a single sequence from a pointer that 
> references an arbitrary number of incident edges by label, where “within” is 
> a predicate like =. Thus, TP4 can compile outE(‘knows’,’likes’) to the 
> following bytecode:
> 
> values(‘#outE_by_labels’).by(‘knows’).by(‘likes’)
> 
> *** I’m starting to see #fields as bytecode! …………….. foggy vision right now.
> 
> ——————
> 
> I’m starting to think that there are no graph interfaces, relational 
> interfaces, document interfaces, etc. No — instead, everything is always in 
> terms of tuples!
> 
>       1. Database providers communicate with TP4 by giving it tuples.
>               - these tuples “abstractly” represent objects like vertices, 
> edges, rows, documents, etc.
>       2. TP4 communicates with database providers by either 
>               (1) joining tuple sequences to create a new tuple sequence (for 
> relational-style data) or 
>               (2) jumping to a tuple sequence via #fields (for pointer-style 
> data).
> 
> Thats it. Thats all there is to the game.
> 
> // Gremlin
> g.V().out(‘knows’).values(’name’)
> 
> // TP4 Bytecode
>       == minimum tuple-field requirement to call yourself a graphdb ==>
> db().values(‘#V’).values(‘#outE’).has(‘#label’,’knows’).values(‘#inV’).values(‘name’)
>       == edge-label vertex-centric index optimization ==>
> db().values(‘#V’).values(‘#out_label’).by(‘knows’).values(‘name’)
>       == denormalized adjacent vertex property data for read-heavy systems ==>
> db().values('#V’).values(‘#out_label_key’).by(‘knows’).by(‘name’)
> 
> So what makes a graphdb? The tuple types and their respective fields.
>       - you have to have #type=vertex and #type=edge (you can have other 
> types in there too!, but the graphdb instruction set only cares about 
> vertices and edges).
>       - you have to have #id and #label
>       - you have to have #outE and #inE
>       - you have to have #outV and #inV (that reference a single #type=vertex)
> If your n-tuple representation has all that, then you are a “graphdb” and 
> Gremlin will be able to process you.
> 
> Now check this mind-numbing entailment. Your n-tuple representation could 
> have other tuple #types. For instance, #type=row.
> 
> {#type:row, #table:yearly_expenses, person:*{#type:vertex,#id:1}, 
> january:$10, february:$12, march:$54, april:…. }
> 
> I have a graphdb mixed with a relationaldb all within the same db! You could 
> effectively encode this in Cassandra (same keyspace) or an RDBMS (same 
> database). Watch and learn:
> 
>       // what are the names of the friends of the person that spent the least 
> last year?
>       sql("SELECT person FROM yearly_expenses WHERE 
> MIN(january+february+…)”).out(‘knows’).values(‘name’)
> 
> We aren’t talking “multi-model” …. no my friend, we are talking 
> “hybrid-model.”
> 
> Its all just tuples and pointers.
> 
> Outz,
> Marko.
> 
> http://rredux.com <http://rredux.com/>
> 
> 
> 
> 
>> On May 7, 2019, at 7:26 AM, Marko Rodriguez <[email protected] 
>> <mailto:[email protected]>> wrote:
>> 
>> Whoa.
>> 
>> Check out this trippy trick.
>> 
>> First, here is how you define a pointer to a map-tuple.
>> 
>>      *{k1?v1, k2?v2, …, kn?vn}
>>              * says “this is a pointer to a map" { }
>>              ? is some comparator like =, >, <, !=, contains(), etc.
>> 
>> Assume the vertex map tuple v[1]:
>> 
>> {#id:1, #label:person, name:marko, age:29} 
>> 
>> Now, we can add the following fields:
>> 
>> 1. #outE:*{#outV=*{#id=1}}  // references all tuples that have an outV field 
>> that is a pointer to the the v[1] vertex tuple.
>> 2. #outE.knows:*{#outV=*{#id=1},#label=knows} // references all outgoing 
>> knows-edges.
>> 3. #outE.knows.weight_gt_85:*{#outV=*{#id=1},#label=knows,weight>0.85} // 
>> references all strong outgoing knows-edges
>> 
>> By using different types of pointers, a graph database provider can make 
>> explicit their internal structure. Assume all three fields above are in the 
>> v[1] vertex tuple. This means that:
>> 
>>      1. all of v[1]’s outgoing edges are group together. <— linear scan
>>      2. all of v[1]’s outgoing knows-edges are group together. <— indexed by 
>> label
>>      3. all of v[1]’s strong outgoing knows-edges are group together <— 
>> indexed by label and weight
>> 
>> Thus, a graph database provider can describe the way in which it internally 
>> organizes adjacent edges — i.e. vertex-centric indices! This means then that 
>> TP4 can do vertex-centric index optimizations automatically for providers!
>> 
>>      1. values(“#outE”).hasLabel(‘knows’).has(‘weight’,gt(0.85)) // grab all 
>> edges, then filter on label, then filter on weight.
>>      2. values(“#outE.knows”).has(‘weight’,gt(0.85)) // grab all 
>> knows-edges, then filter on weight.
>>      3. values(“#outE.knows.weight_gt_85”) // grab all strong knows-edges.
>> 
>> *** Realize that Gremlin outE() will just compile to bytecode 
>> values(“#outE”).
>> 
>> Freakin’ crazy! … Josh was interested in using the n-tuple structure to 
>> describe indices. I was against it. I believe I still am. However, this is 
>> pretty neat. As Josh was saying though, without a rich enough n-tuple 
>> description of the underlying database, there should be no reason for 
>> providers to have to write custom strategies and instructions ?!?!?!?!? 
>> crazy!?
>> 
>> Marko.
>> 
>> http://rredux.com <http://rredux.com/>
>> 
>> 
>> 
>> 
>>> On May 7, 2019, at 4:44 AM, Marko Rodriguez <[email protected] 
>>> <mailto:[email protected]>> wrote:
>>> 
>>> Hey Josh,
>>> 
>>>> I think of your Pointer<T> as a reference to an entity. It does not contain
>>>> the entity it refers to, but it contains the primary key of that entity.
>>> 
>>> Exactly! I was just thinking that last night. Tuples don’t need a separate 
>>> ID system. No -- pointers reference the primary key of a tuple! Better yet 
>>> perhaps, they can reference one-to-many. For instance:
>>> 
>>> { id:1, label:person, name:marko, age:29, outE:*(outV=id) }
>>> 
>>> Thus, a pointer is defined by a pattern match. Haven’t thought through the 
>>> consequences, but … :)
>>> 
>>>> Here, I have invented an Entity class to indicate that the pointer resolves
>>>> to a vertex (an entity without a tuple, or rather with a 0-tuple -- the
>>>> unit element).
>>> 
>>> Ah — the 0-tuple. Neat thought.
>>> 
>>> I look forward to your slides from the Knowledge Graph Conference. If I 
>>> wasn’t such a reclusive hermit, I would have loved to have joined you there.
>>> 
>>> Take care,
>>> Marko.
>>> 
>>> http://rredux.com <http://rredux.com/>
>>> 
>>> 
>>>> On Mon, May 6, 2019 at 9:38 PM Marko Rodriguez <[email protected] 
>>>> <mailto:[email protected]>> wrote:
>>>> 
>>>>> Hey Josh,
>>>>> 
>>>>>> I am feeling the tuples... as long as they can be typed, e.g.
>>>>>> 
>>>>>>    <V> myTuple.get(Integer) -- int-indexed tuples
>>>>>>    <V> myTuple.get(String) -- string-indexed tuples
>>>>>> In most programming languages, "tuples" are not lists, though they are
>>>>> typed by a list of element types. E.g. in Haskell you might have a tuple
>>>>> with the type
>>>>>>    (Double, Double, Bool)
>>>>> 
>>>>> 
>>>>> Yes, we have Pair<A,B>, Triple<A,B,C>, Quadruple<A,B,C,D>, etc. However
>>>>> for base Tuple<A> of unknown length, the best I can do in Java is <A>. :|
>>>>> You can see my stubs in the gist:
>>>>>        https://gist.github.com/okram/25d50724da89452853a3f4fa894bcbe8 
>>>>> <https://gist.github.com/okram/25d50724da89452853a3f4fa894bcbe8> <
>>>>> https://gist.github.com/okram/25d50724da89452853a3f4fa894bcbe8 
>>>>> <https://gist.github.com/okram/25d50724da89452853a3f4fa894bcbe8>> (LINES
>>>>> #21-42)
>>>>> 
>>>>>> If this is in line with your proposal, then we agree that tuples should
>>>>> be the atomic unit of data in TP4.
>>>>> 
>>>>> Yep. Vertices, Edges, Rows, Documents, etc. are all just tuples. However,
>>>>> I suspect that we will disagree on some of my tweaks. Thus, I’d really 
>>>>> like
>>>>> to get your feedback on:
>>>>> 
>>>>>        1. pointers (tuple entries referencing tuples).
>>>>>        2. sequences (multi-value tuple entries).
>>>>>        3. # hidden map keys :|
>>>>>                - sorta ghetto.
>>>>> 
>>>>> Also, I’m still not happy with db().has().has().as(‘x’).db().where()… its
>>>>> an intense syntax and its hard to strategize.
>>>>> 
>>>>> I really want to nail down this “universal model” (tuple structure and
>>>>> tuple-oriented instructions) as then I can get back on the codebase and
>>>>> start to flush this stuff out with confidence.
>>>>> 
>>>>> See ya,
>>>>> Marko.
>>>>> 
>>>>> http://rredux.com <http://rredux.com/> <http://rredux.com/ 
>>>>> <http://rredux.com/>>
>>>>> 
>>>>> 
>>>>>> 
>>>>>> Josh
>>>>>> 
>>>>>> 
>>>>>> On Mon, May 6, 2019 at 5:34 PM Marko Rodriguez <[email protected] 
>>>>>> <mailto:[email protected]>
>>>>> <mailto:[email protected] <mailto:[email protected]>>> wrote:
>>>>>> Hi,
>>>>>> 
>>>>>> I spent this afternoon playing with n-tuples, pointers, data model
>>>>> interfaces, and bytecode instructions.
>>>>>> 
>>>>>>        https://gist.github.com/okram/25d50724da89452853a3f4fa894bcbe8 
>>>>>> <https://gist.github.com/okram/25d50724da89452853a3f4fa894bcbe8> <
>>>>> https://gist.github.com/okram/25d50724da89452853a3f4fa894bcbe8 
>>>>> <https://gist.github.com/okram/25d50724da89452853a3f4fa894bcbe8>> <
>>>>> https://gist.github.com/okram/25d50724da89452853a3f4fa894bcbe8 
>>>>> <https://gist.github.com/okram/25d50724da89452853a3f4fa894bcbe8> <
>>>>> https://gist.github.com/okram/25d50724da89452853a3f4fa894bcbe8 
>>>>> <https://gist.github.com/okram/25d50724da89452853a3f4fa894bcbe8>>>
>>>>>> 
>>>>>> *** Kuppitz: They are tuples :). A Map<K,V> extends Tuple<Pair<K,V>>.
>>>>> Tada!
>>>>>> 
>>>>>> What I like about this is that it combines the best of both worlds
>>>>> (Josh+Marko).
>>>>>>        * just flat tuples of arbitrary length.
>>>>>>                * pattern matching for arbitrary joins. (k1=k2 AND k3=k4
>>>>> …)
>>>>>>                * pointers chasing for direct links. (edges, foreign
>>>>> keys, document _id references, URI resolutions, …)
>>>>>>        * sequences are a special type of tuple used for multi-valued
>>>>> entries.
>>>>>>        * has()/values()/etc. work on all tuple types! (maps, lists,
>>>>> tuples, vertices, edges, rows, statements, documents, etc.)
>>>>>> 
>>>>>> Thoughts?,
>>>>>> Marko.
>>>>>> 
>>>>>> http://rredux.com <http://rredux.com/> <http://rredux.com/ 
>>>>>> <http://rredux.com/>> <http://rredux.com/ <http://rredux.com/> <
>>>>> http://rredux.com/ <http://rredux.com/>>>
>>> 
>> 
>

Re: N-Tuples, Pointers, Data Model Interfaces, and Bytecode Instructions

Reply via email to