Hello,

> First, the "root". While we do need context for traversals, I don't think
> there should be a distinct kind of root for each kind of structure. Once
> again, select(), or operations derived from select() will work just fine.

So given your example below, “root” would be db in this case. 
db is the reference to the structure as a whole.
Within db, substructures exist. 
Logically, this makes sense.
For instance, a relational database’s references don’t leak outside the RDBMS 
into other areas of your computer’s memory.
And there is always one entry point into every structure — the connection. And 
what does that connection point to?
        Vertices, keyspaces, databases, document collections, etc. 
In other words, “roots.” (Even the JVM has a “root” — it’s called the heap.)

> Want the "person" table? db.select("person"). Want a sequence of vertices
> with the label "person"? db.select("person"). What we are saying in either
> case is "give me the 'person' relation. Don't project any specific fields;
> just give me all the data". A relational DB and a property graph DB will
> have different ways of supplying the relation, but in either case, it can
> hide behind the same interface (TRelation?).

In your lexicon, for both RDBMS and graph:
        db.select(‘person’) is saying: select the people table (which is 
composed of a sequence of “person" rows).
        db.select(‘person’) is saying: select the person vertices (which is 
composed of a sequence of “person" vertices).
…right off the bat you have the syntax problem of people vs. person. Tables are 
typically named with the plural of their rows. That convention
doesn’t exist in graph databases, as there is just one vertex set (i.e. one 
“table”).

In my lexicon (TP instructions):
        db().values(‘people’) is saying: flatten out the person rows of the 
people table.
        V().has(label,’person’) is saying: flatten out the vertex objects of 
the graph’s vertices and filter out non-person vertices.

Well, that is stupid, why not have the same syntax for both structures?
Because they are different. There are no “person” relations in the classic 
property graph (Neo4j 1.0). There are only vertex relations with a label=person 
entry.
In a relational database there are “person” relations and these are bundled 
into disjoint, schema-constrained tables (i.e. relation sets).

The point I’m making is that instead of trying to fit all these data structures 
into a strict type system that ultimately looks like
a bunch of disjoint relational sets, let’s mimic the vendor-specified semantics. 
Let’s take these systems at face value
and not try to “mathematize” them. If they are inconsistent and ugly, fine. If 
we map them into another system that is mathematical
and beautiful, great. However, every data structure, from Neo4j’s 
representation for OLTP traversals
to that “same" data being OLAP-processed as Spark RDDs or Hadoop
SequenceFiles, will have its ‘oh shits’ (impedance mismatches), and that is 
okay, as this is the reality we are trying to model!

Graph and RDBMS have two different data models (their unique worldviews):

RDBMS:   Databases->Tables->Rows->Primitives
GraphDB: Vertices->Edges->Vertices->Edges->Vertices-> ...

Here is a person->knows->person “traversal” in TP4 bytecode over an RDBMS 
(#-prefixed keys are ’symbols’ (constants)):

db().values(“people”).as(“x”).
db().values(“knows”).as(“y”).
  where(“x”,eq(“y”)).by(#id).by(#outV).
db().values(“people”).as(“z”).
  where(“y”,eq(“z”)).by(#inV).by(#id)
   
Pretty freakin’ disgusting, eh? Here is a person->knows->person “traversal” in 
TP4 bytecode over a property graph:

V().has(#label,”person”).values(#outE).has(#label,”knows”).values(#inV)

So we have two completely different bytecode representations for the same 
computational result. Why?
Because we have two completely different data models!

        One is a set of disjoint typed-relations (i.e. RDBMS).
        One is a set of nested loosely-typed-relations (i.e. property graphs).

Why not make them the same? Because they are not the same and that is exactly 
what I believe we should be capturing.

Just looking at the two computations above, you see that a relational database 
is doing “joins” while a graph database is doing “traversals”.
We have to use path-data to compute a join. We have to use memory! (and we do). 
We don’t have to use path-data to compute a traversal.
We don’t have to use memory! (and we don’t!). That is the fundamental nature of 
the respective computations that are taking place.
That is what gives each system its particular style of computing.
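
For concreteness, the RDBMS bytecode above corresponds, roughly, to a SQL join 
along these lines (a sketch only; the id/outV/inV column names are my 
assumptions about how the "knows" table is encoded):

        -- sketch: person->knows->person as a two-way join (column names assumed)
        SELECT z.*
        FROM   people x, knows y, people z
        WHERE  x.id  = y.outV
        AND    y.inV = z.id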

NEXT: There is nothing that says you can’t map between the two. Let’s go 
property graph to RDBMS.
        - we could make a person table, a software table, a knows table, a 
created table.
                - that only works if the property graph is schema-based.
        - we could make a single vertex table with another 3-column properties 
table (vertexId,key,value); see the sketch below.
        - we could…
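
For example, the single-vertex-table option might look something like the 
following DDL (a minimal sketch; the table/column names, and the extra edges 
table needed for #outV/#inV linkage, are assumptions, not a spec):

        -- one possible (hypothetical) relational encoding of a property graph
        CREATE TABLE vertices   (id BIGINT PRIMARY KEY, label VARCHAR(255));
        CREATE TABLE properties (vertexId BIGINT, "key" VARCHAR(255), "value" VARCHAR(255));
        CREATE TABLE edges      (id BIGINT PRIMARY KEY, label VARCHAR(255),
                                 outV BIGINT, inV BIGINT);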
Whichever encoding you choose, a different bytecode will be required. 
Fortunately, the space of (reasonable) possibilities is constrained.
Thus, instead of saying: 
        “I want to map from property graph to RDBMS” 
I say: 
        “I want to map from a recursive, bi-relational structure to a disjoint 
multi-relational structure where linkage is based on #id/#outV/#inV equalities.”
Now you have constrained the space of possible RDBMS encodings! Moreover, we 
now have an algorithmic solution that not only disconnects “vertices,” 
but also rewrites the bytecode according to the new logical steps required to 
execute the computation, as we have a new data structure and a new
way of moving through that data structure. The pointers are completely 
different! However, as long as the mapping is sound, the rewrite should be 
algorithmic.
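
To make that concrete: over the hypothetical vertices/edges/properties 
encoding sketched above, the person->knows->person computation would have to 
be rewritten into join-style bytecode, i.e. something that ultimately 
strategizes into SQL roughly like this (again, a sketch under the assumed 
encoding):

        -- sketch only; assumes the hypothetical encoding above
        SELECT v2.*
        FROM   vertices v1, edges e, vertices v2
        WHERE  v1.label = 'person'
        AND    e.label  = 'knows'
        AND    e.outV   = v1.id
        AND    v2.id    = e.inV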

I’m getting tired. I see your stuff below about indices and I have thoughts on 
that… but I will address those tomorrow.

Thanks for reading,
Marko.

http://rredux.com







> 
> But wait, you say, what if the under the hood, you have a TTable in one
> case, and TSequence in the other? They are so different! That's why
> the Dataflow
> Model
> <https://storage.googleapis.com/pub-tools-public-publication-data/pdf/43864.pdf>
> is so great; to an extent, you can think of the two as interchangeable. I
> think we would get a lot of mileage out of treating them as interchangeable
> within TP4.
> 
> So instead of a data model -specific "root", I argue for a universal root
> together with a set of relations and what we might call an "indexes". An
> index is an arrow from a type to a relation which says "give me a
> column/value pair, and I will give you all matching tuples from this
> relation". The result is another relation. Where data sources differentiate
> themselves is by having different relations and indexes.
> 
> For example, if the underlying data structure is nothing but a stream of
> Trip tuples, you will have a single relation "Trip", and no indexes. Sorry;
> you just have to wait for tuples to go by, and filter on them. So if you
> say d.select("Trip", "driver") -- where d is a traversal that gets you to a
> User -- the machine knows that it can't use "driver" to look up a specific
> set of trips; it has to use a filter over all future "Trip" tuples. If, on
> the other hand, we have a relational database, we have the option of
> indexing on "driver". In this case, d.select("Trip", "driver") may take you
> to a specific table like "Trip_by_driver" which has "driver" as a primary
> key. The machine recognizes that this index exists, and uses it to answer
> the query more efficiently. The alternative is to do a full scan over any
> table which contains the "Trip" relation. Since TinkerPop3, we have been
> without a vendor-neutral API for indexes, but this is where such an API
> would really start to shine. Consider Neo4j's single property indexes,
> JanusGraph's composite indexes, and even RDF triple indices (spo, ops,
> etc.) as in AllegroGraph in addition to primary keys in relational
> databases.
> 
> TTuple -- cool. +1
> 
> "Enums" -- I agree that enums are necessary, but we need even more: tagged
> unions <https://en.wikipedia.org/wiki/Tagged_union>. They are part of the
> system of algebraic data types which I described on Friday. An enum is a
> special case of a tagged union in which there is no value, just a type tag.
> May I suggest something like TValue, which contains a value (possibly
> trivial) together with a type tag. This enables ORs and pattern matching.
> For example, suppose "created" edges are allowed to point to either
> "Project" or "Document" vertices. The in-type of "created" is
> union{project:Project, document:Document). Now the in value of a specific
> edge can be TValue("project", [some project vertex]) or TValue("document",
> [some document vertex]) and you have the freedom to switch on the type tag
> if you want to, e.g. the next step in the traversal can give you the "name"
> of the project or the "title" of the document as appropriate.
> 
> Multi-properties -- agreed; has() is good enough.
> 
> Meta-properties -- again, this is where I think we should have a
> lower-level select() operation. Then has() builds on that operation.
> Whereas select() matches on fields of a relation, has() matches on property
> values and other higher-order things. If you want properties of properties,
> don't use has(); use select()/from(). Most of the time, you will just want
> to use has().
> 
> Agreed that every *entity* should have an id(), and also a label() (though
> it should always be possible to infer label() from the context). I would
> suggest TEntity (or TElement), which has id(), label(), and value(), where
> value() provides the raw value (usually a TTuple) of the entity.
> 
> Josh
> 
> 
> 
> On Mon, Apr 29, 2019 at 10:35 AM Marko Rodriguez <okramma...@gmail.com>
> wrote:
> 
>> Hello Josh,
>> 
>>> A has("age",29), for example, operates at a different level of
>> abstraction than a
>>> has("city","Santa Fe") if "city" is a column in an "addresses" table.
>> 
>> So hasXXX() operators work on TTuples. Thus:
>> 
>> g.V().hasLabel(‘person’).has(‘age’,29)
>> g.V().hasLabel(‘address’).has(‘city’,’Santa Fe’)
>> 
>> ..both work as a person-vertex and an address-vertex are TTuples. If these
>> were tables, then:
>> 
>> jdbc.db().values(‘people’).has(‘age’,29)
>> jdbc.db().values(‘addresses’).has(‘city’,’Santa Fe’)
>> 
>> …also works as both people and addresses are TTables which extend
>> TTuple<String,?>.
>> 
>> In summary, if it's a TTuple, then hasXXX() is good to go.
>> 
>> ////////// IGNORE UNTIL AFTER READING NEXT SECTION //////////
>> *** SIDENOTE: A TTable (which is a TSequence) could have Symbol-based
>> metadata. Thus TTable.value(#label) -> “people.” If so, then
>> jdbc.db().hasLabel(“people”).has(“age”,29)
>> 
>>> At least, they
>>> are different if the data model allows for multi-properties,
>>> meta-properties, and hyper-edges. A property is something that can either
>>> be there, attached to an element, or not be there. There may also be more
>>> than one such property, and it may have other properties attached to it.
>> A
>>> column of a table, on the other hand, is always there (even if its value
>> is
>>> allowed to be null), always has a single value, and cannot have further
>>> properties attached.
>> 
>> 1. Multi-properties.
>> 
>> Multi-properties works because if name references a TSequence, then its
>> the sequence that you analyze with has(). This is another reason why
>> TSequence is important. Its a reference to a “stream” so there isn’t
>> another layer of tuple-nesting.
>> 
>> // assume v[1] has name={marko,mrodriguez,markor}
>> g.V(1).value(‘name’) => TSequence<String>
>> g.V(1).values(‘name’) => marko, mrodriguez, markor
>> g.V(1).has(‘name’,’marko’) => v[1]
>> 
>> 2. Meta-properties
>> 
>> // assume v[1] has name=[value:marko,creator:josh,timestamp:12303] // i.e.
>> a tuple value
>> g.V(1).value(‘name’) => TTuple<?,String> // doh!
>> g.V(1).value(‘name’).value(‘value’) => marko
>> g.V(1).value(‘name’).value(‘creator’) => josh
>> 
>> So things get screwy. — however, it only gets screwy when you mix your
>> “metadata” key/values with your “data” key/values. This is why I think
>> TSymbols are important. Imagine the following meta-property tuple for v[1]:
>> 
>> [#value:marko,creator:josh,timestamp:12303]
>> 
>> If you do g.V(1).value(‘name’), we could look to the value indexed by the
>> symbol #value, thus => “marko”.
>> If you do g.V(1).values(‘name’), you would get back a TSequence with a
>> single TTuple being the meta property.
>> If you do g.V(1).values(‘name’).value(), we could get the value indexed by
>> the symbol #value.
>> If you do g.V(1).values(‘name’).value(‘creator’), it will return the
>> primitive string “josh”.
>> 
>> I believe that the following symbols should be recommended for use across
>> all data structures.
>>        #id, #label, #key, #value
>> …where id(), label(), key(), value() are tuple.get(Symbol). Other symbols
>> for use with propertygraph/ include:
>>        #outE, #inV, #inE, #outV, #bothE, #bothV
>> 
>>> In order to simplify user queries, you can let has() and values() do
>> double
>>> duty, but I still feel that there are lower-level operations at play, at
>> a
>>> logical level even if not at a bytecode level. However, expressing the a
>>> traversal in terms of its lowest-level relational operations may also be
>>> useful for query optimization.
>> 
>> One thing that I’m doing, that perhaps you haven’t caught onto yet, is
>> that I’m not modeling everything in terms of “tables.” Each data structure
>> is trying to stay as pure to its conceptual model as possible. Thus, there
>> are no “joins” in property graphs as outE() references a TSequence<TEdge>,
>> where TEdge is an interface that extends TTuple. You can just walk without
>> doing any type of INNER JOIN. Now, if you model a property graph in a
>> relational database, you will have to strategize the bytecode accordingly!
>> Just a heads up in case you haven’t noticed that.
>> 
>> Thanks for your input,
>> Marko.
>> 
>> http://rredux.com
>> 
>> 
>> 
>>> 
>>> Josh
>>> 
>>> 
>>> 
>>> On Mon, Apr 29, 2019 at 7:34 AM Marko Rodriguez <okramma...@gmail.com>
>>> wrote:
>>> 
>>>> Hi,
>>>> 
>>>> *** This email is primarily for Josh (and Kuppitz). However, if others
>> are
>>>> interested… ***
>>>> 
>>>> So I did a lot of thinking this weekend about structure/ and this
>> morning,
>>>> I prototyped both graph/ and rdbms/.
>>>> 
>>>> This is the way I’m currently thinking of things:
>>>> 
>>>>       1. There are 4 base types in structure/.
>>>>               - Primitive: string, long, float, int, … (will constrain
>>>> these at some point).
>>>>               - TTuple<K,V>: key/value map.
>>>>               - TSequence<V>: an iterable of v objects.
>>>>               - TSymbol: like Ruby, I think we need “enum-like” symbols
>>>> (e.g., #id, #label).
>>>> 
>>>>       2. Every structure has a “root.”
>>>>               - for graph its TGraph implements TSequence<TVertex>
>>>>               - for rdbms its a TDatabase implements
>>>> TTuple<String,TTable>
>>>> 
>>>>       3. Roots implement Structure and thus, are what is generated by
>>>> StructureFactory.mint().
>>>>               - defined using withStructure().
>>>>               - For graph, its accessible via V().
>>>>               - For rdbms, its accessible via db().
>>>> 
>>>>       4. There is a list of core instructions for dealing with these
>>>> base objects.
>>>>               - value(K key): gets the TTuple value for the provided
>> key.
>>>>               - values(K key): gets an iterator of the value for the
>>>> provided key.
>>>>               - entries(): gets an iterator of T2Tuple objects for the
>>>> incoming TTuple.
>>>>               - hasXXX(A,B): various has()-based filters for looking
>>>> into a TTuple and a TSequence
>>>>               - db()/V()/etc.: jump to the “root” of the
>> withStructure()
>>>> structure.
>>>>               - drop()/add(): behave as one would expect and thus.
>>>> 
>>>> ————
>>>> 
>>>> For RDBMS, we have three interfaces in rdbms/.
>>>> (machine/machine-core/structure/rdbms)
>>>> 
>>>>       1. TDatabase implements TTuple<String,TTable> // the root
>>>> structure that indexes the tables.
>>>>       2. TTable implements TSequence<TRow<?>> // a table is a sequence
>>>> of rows
>>>>       3. TRow<V> implements TTuple<String,V>> // a row has string
>> column
>>>> names
>>>> 
>>>> I then created a new project at machine/structure/jdbc). The classes in
>>>> here implement the above rdbms/ interfaces/
>>>> 
>>>> Here is an RDBMS session:
>>>> 
>>>> final Machine machine = LocalMachine.open();
>>>> final TraversalSource jdbc =
>>>>       Gremlin.traversal(machine).
>>>>                       withProcessor(PipesProcessor.class).
>>>>                       withStructure(JDBCStructure.class,
>>>> Map.of(JDBCStructure.JDBC_CONNECTION, "jdbc:h2:/tmp/test"));
>>>> 
>>>> System.out.println(jdbc.db().toList());
>>>> System.out.println(jdbc.db().entries().toList());
>>>> System.out.println(jdbc.db().value("people").toList());
>>>> System.out.println(jdbc.db().values("people").toList());
>>>> System.out.println(jdbc.db().values("people").value("name").toList());
>>>> System.out.println(jdbc.db().values("people").entries().toList());
>>>> 
>>>> This yields:
>>>> 
>>>> [<database#conn1: url=jdbc:h2:/tmp/test user=>]
>>>> [PEOPLE:<table#PEOPLE>]
>>>> [<table#people>]
>>>> [<row#PEOPLE:1>, <row#PEOPLE:2>]
>>>> [marko, josh]
>>>> [NAME:marko, AGE:29, NAME:josh, AGE:32]
>>>> 
>>>> The bytecode of the last query is:
>>>> 
>>>> [db(<database#conn1: url=jdbc:h2:/tmp/test user=>), values(people),
>>>> entries]
>>>> 
>>>> JDBCDatabase implements TDatabase, Structure.
>>>>       *** JDBCDatabase is the root structure and is referenced by db()
>>>> *** (CRUCIAL POINT)
>>>> 
>>>> Assume another table called ADDRESSES with two columns: name and city.
>>>> 
>>>> 
>>>> 
>> jdbc.db().values(“people”).as(“x”).db().values(“addresses”).has(“name”,eq(path(“x”).by(“name”))).value(“city”)
>>>> 
>>>> The above is equivalent to:
>>>> 
>>>> SELECT city FROM people,addresses WHERE people.name=addresses.name
>>>> 
>>>> If you want to do an inner join (a product), you do this:
>>>> 
>>>> 
>>>> 
>> jdbc.db().values(“people”).as(“x”).db().values(“addresses”).has(“name”,eq(path(“x”).by(“name”))).as(“y”).path(“x”,”y")
>>>> 
>>>> The above is equivalent to:
>>>> 
>>>> SELECT * FROM addresses INNER JOIN people ON people.name=addresses.name
>>>> 
>>>> NOTES:
>>>>       1. Instead of select(), we simply jump to the root via db() (or
>>>> V() for graph).
>>>>       2. Instead of project(), we simply use value() or values().
>>>>       3. Instead of select() being overloaded with by() join syntax, we
>>>> use has() and path().
>>>>               - like TP3 we will be smart about dropping path() data
>>>> once its no longer referenced.
>>>>       4. We can also do LEFT and RIGHT JOINs (haven’t thought through
>>>> FULL OUTER JOIN yet).
>>>>               - however, we don’t support ‘null' in TP so I don’t know
>>>> if we want to support these null-producing joins. ?
>>>> 
>>>> LEFT JOIN:
>>>>       * If an address doesn’t exist for the person, emit a
>> “null”-filled
>>>> path.
>>>> 
>>>> jdbc.db().values(“people”).as(“x”).
>>>> db().values(“addresses”).as(“y”).
>>>>   choose(has(“name”,eq(path(“x”).by(“name”))),
>>>>     identity(),
>>>>     path(“y”).by(null).as(“y”)).
>>>> path(“x”,”y")
>>>> 
>>>> SELECT * FROM addresses LEFT JOIN people ON people.name=addresses.name
>>>> 
>>>> RIGHT JOIN:
>>>> 
>>>> jdbc.db().values(“people”).as(“x”).
>>>> db().values(“addresses”).as(“y”).
>>>>   choose(has(“name”,eq(path(“x”).by(“name”))),
>>>>     identity(),
>>>>     path(“x”).by(null).as(“x”)).
>>>> path(“x”,”y")
>>>> 
>>>> 
>>>> SUMMARY:
>>>> 
>>>> There are no “low level” instructions. Everything is based on the
>> standard
>>>> instructions that we know and love. Finally, if not apparent, the above
>>>> bytecode chunks would ultimately get strategized into a single SQL query
>>>> (breadth-first) instead of one-off queries (depth-first) to improve
>>>> performance.
>>>> 
>>>> Neat?,
>>>> Marko.
>>>> 
>>>> http://rredux.com
