Re: N-Tuples, Pointers, Data Model Interfaces, and Bytecode Instructions

Marko Rodriguez Tue, 07 May 2019 11:05:18 -0700

Here is a thought…

Think about what this means for Apache Cassandra. Users will be able to define 
a graphdb-based n-tuple schema that has #fields denoting global indices, 
vertex-centric indices, full-text search indices, denormalized data, etc. … It 
will be possible for Apache TinkerPop to take the user's n-tuple schema and 
then CREATE TABLEs accordingly in Cassandra and thus, optimized for the user’s 
graph data and queries. In other words, theoretically, we should be able to 
create an optimially index’d Cassandra keyspace for a user’s specific n-tuple 
data. In other words, TP4 creates Titans! And best of all, there is no “Titan” 
middle layer. We just talk directly to Cassandra. Cassandra gives us tuples and 
resolves pointers.


Now hold on to your undergarments.

A user can state that they plan to have numerous concurrent users interacting 
with the database.
        - Connect RxJavaSerialProcessor.
A user can state that they plan to do batch analytics weekly.
        - Connect SparkProcessor.
A user can state that their queries tend to jump around the graph.
        - Connect AkkaProcessor.
…

Given a questionnaire and a user’s graphdb-based n-tuple schema, we could 
provide a service that does:

        Processing your questionnaire answers and n-tuple schema.

        Using Apache Cassandra as the underlying structure.
        Generating CQL script:
                ...for creating appropriate indices.
                ...for denormalizing row data in order to limit pointer chasing.
        Using PipesProcessor as the primary real-time processor.
        Using SparkProcessor as the primary batch processor.
        Exposing Gremlin as a graph query language.
        Exposing Cypher as a graph query language.
        Configuring TP4VM MachineServer for 127.0.0.1:8080.
        
        Your database is ready for download:
                
http://tinkerpop.apache.org/db-generator/GraphDB+Cassandra+Pipes+Spark+Gremlin+Cypher.zip
 
<http://tinkerpop.apache.org/db-generator/GraphDB+Cassandra+Pipes+Spark+Gremlin+Cypher.zip>

Its wild, but its not crazy. From what I can see, it is theoretically possible 
to provide a service of this nature with TP4.

Marko.
        
http://rredux.com <http://rredux.com/>




> On May 7, 2019, at 11:14 AM, Marko Rodriguez <[email protected]> wrote:
> 
> When your hot, your hot. Going to keep pushin’.
> 
> What is the difference between a relational database's and a graph database’s 
> encoding of a property graph? The short answer, NOTHING. The n-tuple model 
> for a subset of the TinkerPop toy graph is:
> 
> {#type:vertex, #id:1, #label:person, name:marko, age:29, 
> #outE:*{#outV=*{#id=1}}}
> {#type:vertex, #id:4, #label:person, name:josh, age:32}
> {#type:vertex, #id:2, #label:person, name:vadas, age:27}
> {#type:edge, #id:7, #label:knows, #outV:*{#id=1}, #inV:*{#id=2}}
> {#type:edge, #id:8, #label:knows, #outV:*{#id=1}, #inV:*{#id=4}}
> 
> MySQL and Neo4j would talk with TP4 via the same above tuples. What makes 
> MySQL different from Neo4j is how the pointers are resolved.
> 
> First, lets talk about the MySQL schema that ultimately holds the graphdb 
> instance data.
> 
> CREATE TABLE #V {
>   #id int NOT NULL PRIMARY KEY,
>   #label varchar,
>   name varchar,
>   age int,
>   #outE varchar
> }
> 
> CREATE TABLE #E {
>   #id int NOT NULL PRIMARY KEY,
>   #label varchar,
>   #outV int,
>   #inV int,
>   FOREIGN KEY (#outV) REFERENCES V(#id),
>   FOREIGN KEY (#inV) REFERENCES V(#id)
> }
> 
> We know how graph databases will execute bytecode so lets focus on how MySQL 
> will do it.
> 
> // g.V()
> db().values(‘#V’)
>   == strategizes to ==>
> db().sql(‘SELECT * FROM #V’)
> 
> // g.V().outE(‘knows’).id()
> db().values(‘#V’).values(‘#outE’).has(‘#label’,’knows’).values(‘#id’)
>   == strategizes to ==>
> db().sql(‘SELECT * FROM #V’).sql(‘SELECT * FROM #E WHERE 
> #outV=$id’).by(‘#id’).has(‘#label’,’knows’).values(‘#id’)
>   == strategizes to ==>
> db().sql(‘SELECT * FROM #V’).sql(‘SELECT * FROM #E WHERE #label=knows AND 
> #outV=$id’).by(‘#id’).values(‘#id’)
>   == strategizes to ==>
> db().sql(‘SELECT #id FROM #V’).sql(‘SELECT #id FROM #E WHERE #label=knows AND 
> #outV=$id’).by(‘#id’)
>   == strategizes to ==>
> db().sql(‘SELECT #E.#id FROM #V, #E WHERE #E.#label=knows AND 
> #E.#outV=#V.#id’)
> 
> What just happened? The RDBMS TP4 compiler strategy knows that rows can not 
> have direct reference to one another. Thus, in order to get the outgoing 
> edges of a vertex, a join is required. What is interesting is that the 
> n-tuple model’s #outE provides the all the necessary information to perform 
> the join. Since we defined the #outV foreign key as a reference to the 
> primary key #id, we know that the *{#outV=...} pointer is simply referencing 
> the #id of the #V table. Thus, constructing our SQL is easy. However, on our 
> initial pass, we are doing a new SQL call to grab the edges for each vertex 
> emitted from the sql(SELECT * FROM #V) instruction. Fortunately, we are able 
> to recursively merge SQL calls until we can no longer merge. Note the 
> algorithmic nature of this process (easy to code). Each strategy-pass folds 
> as much information into the preceding SQL instruction as possible so we 
> don’t have to shuffle tuples back and forth from MySQL. In the end, we reach 
> a single sql() query “fix point” and tada — TP4 bytecode strategized to use 
> MySQL’s SQL execution engine. You have successfully encoded (in one 
> particular way) a property graph into a relational database.
> 
> Drop mic,
> Marko.
> 
> http://rredux.com <http://rredux.com/>
> 
> 
> 
> 
>> On May 7, 2019, at 10:20 AM, Marko Rodriguez <[email protected] 
>> <mailto:[email protected]>> wrote:
>> 
>> I’m in the pocket so I’m just going to riff…
>> 
>> The concept of a pointer as I have defined it is a "0-argument function.” 
>> That is, we currently have Pointer as:
>> 
>> Pointer<E> implements Supplier<Iterator<E extends Tuple>>
>> 
>> However, pointers can be n-arg functions!
>> 
>> Pointer<S,E> implements Function<List<S>, Iterator<E extends Tuple>>
>> 
>> If so, providers can support dynamic tuple fields.
>> 
>> {#id:1, #label:person, name:marko, age:29, #outE_by_label:*{#outV=*{#id:1}, 
>> #label=@[0]}}
>> 
>> I’m using @[0] to denote the first argument passed into the pointer.
>> 
>> What this tuple is saying is that the provider is able to do a direct lookup 
>> of all incident edges based on edge label. Thus, as TP4, we can optimize:
>> 
>> values(‘#outE’).has(‘#label’,‘knows’)
>>      == to ==>
>> values(‘#outE_by_label’).by(‘knows’)
>> 
>> Lets say a user is doing outE(‘knows’,’likes’). With #outE_by_label we can 
>> only reference one edge sequence as we only use one argument. Thus, the 
>> compiler would have to write out the following bytecode:
>> 
>> union(values(‘#outE_by_label’).by(‘knows’), 
>> values(‘#outE_by_label’).by(‘likes’))
>> 
>> However, lets say another provider has this field on their vertex-tuples:
>> 
>> {#id:1, #label:person, name:marko, #outE_by_labels:*{#outV=*{#id:1}, #label 
>> within @}}
>> 
>> This tuple is saying that it can get a single sequence from a pointer that 
>> references an arbitrary number of incident edges by label, where “within” is 
>> a predicate like =. Thus, TP4 can compile outE(‘knows’,’likes’) to the 
>> following bytecode:
>> 
>> values(‘#outE_by_labels’).by(‘knows’).by(‘likes’)
>> 
>> *** I’m starting to see #fields as bytecode! …………….. foggy vision right now.
>> 
>> ——————
>> 
>> I’m starting to think that there are no graph interfaces, relational 
>> interfaces, document interfaces, etc. No — instead, everything is always in 
>> terms of tuples!
>> 
>>      1. Database providers communicate with TP4 by giving it tuples.
>>              - these tuples “abstractly” represent objects like vertices, 
>> edges, rows, documents, etc.
>>      2. TP4 communicates with database providers by either 
>>              (1) joining tuple sequences to create a new tuple sequence (for 
>> relational-style data) or 
>>              (2) jumping to a tuple sequence via #fields (for pointer-style 
>> data).
>> 
>> Thats it. Thats all there is to the game.
>> 
>> // Gremlin
>> g.V().out(‘knows’).values(’name’)
>> 
>> // TP4 Bytecode
>>      == minimum tuple-field requirement to call yourself a graphdb ==>
>> db().values(‘#V’).values(‘#outE’).has(‘#label’,’knows’).values(‘#inV’).values(‘name’)
>>      == edge-label vertex-centric index optimization ==>
>> db().values(‘#V’).values(‘#out_label’).by(‘knows’).values(‘name’)
>>      == denormalized adjacent vertex property data for read-heavy systems ==>
>> db().values('#V’).values(‘#out_label_key’).by(‘knows’).by(‘name’)
>> 
>> So what makes a graphdb? The tuple types and their respective fields.
>>      - you have to have #type=vertex and #type=edge (you can have other 
>> types in there too!, but the graphdb instruction set only cares about 
>> vertices and edges).
>>      - you have to have #id and #label
>>      - you have to have #outE and #inE
>>      - you have to have #outV and #inV (that reference a single #type=vertex)
>> If your n-tuple representation has all that, then you are a “graphdb” and 
>> Gremlin will be able to process you.
>> 
>> Now check this mind-numbing entailment. Your n-tuple representation could 
>> have other tuple #types. For instance, #type=row.
>> 
>> {#type:row, #table:yearly_expenses, person:*{#type:vertex,#id:1}, 
>> january:$10, february:$12, march:$54, april:…. }
>> 
>> I have a graphdb mixed with a relationaldb all within the same db! You could 
>> effectively encode this in Cassandra (same keyspace) or an RDBMS (same 
>> database). Watch and learn:
>> 
>>      // what are the names of the friends of the person that spent the least 
>> last year?
>>      sql("SELECT person FROM yearly_expenses WHERE 
>> MIN(january+february+…)”).out(‘knows’).values(‘name’)
>> 
>> We aren’t talking “multi-model” …. no my friend, we are talking 
>> “hybrid-model.”
>> 
>> Its all just tuples and pointers.
>> 
>> Outz,
>> Marko.
>> 
>> http://rredux.com <http://rredux.com/>
>> 
>> 
>> 
>> 
>>> On May 7, 2019, at 7:26 AM, Marko Rodriguez <[email protected] 
>>> <mailto:[email protected]>> wrote:
>>> 
>>> Whoa.
>>> 
>>> Check out this trippy trick.
>>> 
>>> First, here is how you define a pointer to a map-tuple.
>>> 
>>>     *{k1?v1, k2?v2, …, kn?vn}
>>>             * says “this is a pointer to a map" { }
>>>             ? is some comparator like =, >, <, !=, contains(), etc.
>>> 
>>> Assume the vertex map tuple v[1]:
>>> 
>>> {#id:1, #label:person, name:marko, age:29} 
>>> 
>>> Now, we can add the following fields:
>>> 
>>> 1. #outE:*{#outV=*{#id=1}}  // references all tuples that have an outV 
>>> field that is a pointer to the the v[1] vertex tuple.
>>> 2. #outE.knows:*{#outV=*{#id=1},#label=knows} // references all outgoing 
>>> knows-edges.
>>> 3. #outE.knows.weight_gt_85:*{#outV=*{#id=1},#label=knows,weight>0.85} // 
>>> references all strong outgoing knows-edges
>>> 
>>> By using different types of pointers, a graph database provider can make 
>>> explicit their internal structure. Assume all three fields above are in the 
>>> v[1] vertex tuple. This means that:
>>> 
>>>     1. all of v[1]’s outgoing edges are group together. <— linear scan
>>>     2. all of v[1]’s outgoing knows-edges are group together. <— indexed by 
>>> label
>>>     3. all of v[1]’s strong outgoing knows-edges are group together <— 
>>> indexed by label and weight
>>> 
>>> Thus, a graph database provider can describe the way in which it internally 
>>> organizes adjacent edges — i.e. vertex-centric indices! This means then 
>>> that TP4 can do vertex-centric index optimizations automatically for 
>>> providers!
>>> 
>>>     1. values(“#outE”).hasLabel(‘knows’).has(‘weight’,gt(0.85)) // grab all 
>>> edges, then filter on label, then filter on weight.
>>>     2. values(“#outE.knows”).has(‘weight’,gt(0.85)) // grab all 
>>> knows-edges, then filter on weight.
>>>     3. values(“#outE.knows.weight_gt_85”) // grab all strong knows-edges.
>>> 
>>> *** Realize that Gremlin outE() will just compile to bytecode 
>>> values(“#outE”).
>>> 
>>> Freakin’ crazy! … Josh was interested in using the n-tuple structure to 
>>> describe indices. I was against it. I believe I still am. However, this is 
>>> pretty neat. As Josh was saying though, without a rich enough n-tuple 
>>> description of the underlying database, there should be no reason for 
>>> providers to have to write custom strategies and instructions ?!?!?!?!? 
>>> crazy!?
>>> 
>>> Marko.
>>> 
>>> http://rredux.com <http://rredux.com/>
>>> 
>>> 
>>> 
>>> 
>>>> On May 7, 2019, at 4:44 AM, Marko Rodriguez <[email protected] 
>>>> <mailto:[email protected]>> wrote:
>>>> 
>>>> Hey Josh,
>>>> 
>>>>> I think of your Pointer<T> as a reference to an entity. It does not 
>>>>> contain
>>>>> the entity it refers to, but it contains the primary key of that entity.
>>>> 
>>>> Exactly! I was just thinking that last night. Tuples don’t need a separate 
>>>> ID system. No -- pointers reference the primary key of a tuple! Better yet 
>>>> perhaps, they can reference one-to-many. For instance:
>>>> 
>>>> { id:1, label:person, name:marko, age:29, outE:*(outV=id) }
>>>> 
>>>> Thus, a pointer is defined by a pattern match. Haven’t thought through the 
>>>> consequences, but … :)
>>>> 
>>>>> Here, I have invented an Entity class to indicate that the pointer 
>>>>> resolves
>>>>> to a vertex (an entity without a tuple, or rather with a 0-tuple -- the
>>>>> unit element).
>>>> 
>>>> Ah — the 0-tuple. Neat thought.
>>>> 
>>>> I look forward to your slides from the Knowledge Graph Conference. If I 
>>>> wasn’t such a reclusive hermit, I would have loved to have joined you 
>>>> there.
>>>> 
>>>> Take care,
>>>> Marko.
>>>> 
>>>> http://rredux.com <http://rredux.com/>
>>>> 
>>>> 
>>>>> On Mon, May 6, 2019 at 9:38 PM Marko Rodriguez <[email protected] 
>>>>> <mailto:[email protected]>> wrote:
>>>>> 
>>>>>> Hey Josh,
>>>>>> 
>>>>>>> I am feeling the tuples... as long as they can be typed, e.g.
>>>>>>> 
>>>>>>>    <V> myTuple.get(Integer) -- int-indexed tuples
>>>>>>>    <V> myTuple.get(String) -- string-indexed tuples
>>>>>>> In most programming languages, "tuples" are not lists, though they are
>>>>>> typed by a list of element types. E.g. in Haskell you might have a tuple
>>>>>> with the type
>>>>>>>    (Double, Double, Bool)
>>>>>> 
>>>>>> 
>>>>>> Yes, we have Pair<A,B>, Triple<A,B,C>, Quadruple<A,B,C,D>, etc. However
>>>>>> for base Tuple<A> of unknown length, the best I can do in Java is <A>. :|
>>>>>> You can see my stubs in the gist:
>>>>>>        https://gist.github.com/okram/25d50724da89452853a3f4fa894bcbe8 
>>>>>> <https://gist.github.com/okram/25d50724da89452853a3f4fa894bcbe8> <
>>>>>> https://gist.github.com/okram/25d50724da89452853a3f4fa894bcbe8 
>>>>>> <https://gist.github.com/okram/25d50724da89452853a3f4fa894bcbe8>> (LINES
>>>>>> #21-42)
>>>>>> 
>>>>>>> If this is in line with your proposal, then we agree that tuples should
>>>>>> be the atomic unit of data in TP4.
>>>>>> 
>>>>>> Yep. Vertices, Edges, Rows, Documents, etc. are all just tuples. However,
>>>>>> I suspect that we will disagree on some of my tweaks. Thus, I’d really 
>>>>>> like
>>>>>> to get your feedback on:
>>>>>> 
>>>>>>        1. pointers (tuple entries referencing tuples).
>>>>>>        2. sequences (multi-value tuple entries).
>>>>>>        3. # hidden map keys :|
>>>>>>                - sorta ghetto.
>>>>>> 
>>>>>> Also, I’m still not happy with db().has().has().as(‘x’).db().where()… its
>>>>>> an intense syntax and its hard to strategize.
>>>>>> 
>>>>>> I really want to nail down this “universal model” (tuple structure and
>>>>>> tuple-oriented instructions) as then I can get back on the codebase and
>>>>>> start to flush this stuff out with confidence.
>>>>>> 
>>>>>> See ya,
>>>>>> Marko.
>>>>>> 
>>>>>> http://rredux.com <http://rredux.com/> <http://rredux.com/ 
>>>>>> <http://rredux.com/>>
>>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>> Josh
>>>>>>> 
>>>>>>> 
>>>>>>> On Mon, May 6, 2019 at 5:34 PM Marko Rodriguez <[email protected] 
>>>>>>> <mailto:[email protected]>
>>>>>> <mailto:[email protected] <mailto:[email protected]>>> wrote:
>>>>>>> Hi,
>>>>>>> 
>>>>>>> I spent this afternoon playing with n-tuples, pointers, data model
>>>>>> interfaces, and bytecode instructions.
>>>>>>> 
>>>>>>>        https://gist.github.com/okram/25d50724da89452853a3f4fa894bcbe8 
>>>>>>> <https://gist.github.com/okram/25d50724da89452853a3f4fa894bcbe8> <
>>>>>> https://gist.github.com/okram/25d50724da89452853a3f4fa894bcbe8 
>>>>>> <https://gist.github.com/okram/25d50724da89452853a3f4fa894bcbe8>> <
>>>>>> https://gist.github.com/okram/25d50724da89452853a3f4fa894bcbe8 
>>>>>> <https://gist.github.com/okram/25d50724da89452853a3f4fa894bcbe8> <
>>>>>> https://gist.github.com/okram/25d50724da89452853a3f4fa894bcbe8 
>>>>>> <https://gist.github.com/okram/25d50724da89452853a3f4fa894bcbe8>>>
>>>>>>> 
>>>>>>> *** Kuppitz: They are tuples :). A Map<K,V> extends Tuple<Pair<K,V>>.
>>>>>> Tada!
>>>>>>> 
>>>>>>> What I like about this is that it combines the best of both worlds
>>>>>> (Josh+Marko).
>>>>>>>        * just flat tuples of arbitrary length.
>>>>>>>                * pattern matching for arbitrary joins. (k1=k2 AND k3=k4
>>>>>> …)
>>>>>>>                * pointers chasing for direct links. (edges, foreign
>>>>>> keys, document _id references, URI resolutions, …)
>>>>>>>        * sequences are a special type of tuple used for multi-valued
>>>>>> entries.
>>>>>>>        * has()/values()/etc. work on all tuple types! (maps, lists,
>>>>>> tuples, vertices, edges, rows, statements, documents, etc.)
>>>>>>> 
>>>>>>> Thoughts?,
>>>>>>> Marko.
>>>>>>> 
>>>>>>> http://rredux.com <http://rredux.com/> <http://rredux.com/ 
>>>>>>> <http://rredux.com/>> <http://rredux.com/ <http://rredux.com/> <
>>>>>> http://rredux.com/ <http://rredux.com/>>>
>>>> 
>>> 
>> 
>

Re: N-Tuples, Pointers, Data Model Interfaces, and Bytecode Instructions

Reply via email to