Re: The Fundamental Structure Instructions Already Exist! (w/ RDBMS Example)

Joshua Shinavier Tue, 30 Apr 2019 07:52:00 -0700

Hi Marko,

I like it. But I still have some constructive criticism. I think a little
more simplicity in the right places will make things like index support,
query optimization, and integration with SEDMs (someone else's data model)
that much easier in the future.


First, the "root". While we do need context for traversals, I don't think
there should be a distinct kind of root for each kind of structure. Once
again, select(), or operations derived from select(), will work just fine.
Want the "person" table? db.select("person"). Want a sequence of vertices
with the label "person"? db.select("person"). What we are saying in either
case is "give me the 'person' relation. Don't project any specific fields;
just give me all the data". A relational DB and a property graph DB will
have different ways of supplying the relation, but in either case, it can
hide behind the same interface (TRelation?).
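
A minimal sketch of that idea, assuming a hypothetical TRelation interface
and a hypothetical DataSource with a single select(name) method (neither
exists in TP4 today; the names are only for illustration). A table-backed
source and a label-filtered vertex list both answer select("person") through
the same interface:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Stream;

public class SelectSketch {

    /** Hypothetical: a relation is a named stream of tuples. */
    public interface TRelation {
        String name();
        Stream<Map<String, Object>> tuples();
    }

    /** Hypothetical: every data source hands out relations by name. */
    public interface DataSource {
        TRelation select(String relationName);
    }

    /** Relational flavor: each relation is stored as its own table. */
    public static DataSource tableSource(Map<String, List<Map<String, Object>>> tables) {
        return name -> new TRelation() {
            @Override public String name() { return name; }
            @Override public Stream<Map<String, Object>> tuples() {
                return tables.getOrDefault(name, List.of()).stream();
            }
        };
    }

    /** Graph flavor: one pool of vertices; the relation is "all vertices with this label". */
    public static DataSource graphSource(List<Map<String, Object>> vertices) {
        return label -> new TRelation() {
            @Override public String name() { return label; }
            @Override public Stream<Map<String, Object>> tuples() {
                return vertices.stream().filter(v -> label.equals(v.get("label")));
            }
        };
    }

    /** Small helper so that tuple values are typed as Object. */
    static Map<String, Object> row(String k1, Object v1, String k2, Object v2) {
        return Map.of(k1, v1, k2, v2);
    }

    /** Made-up sample data: a "person" table with two rows. */
    public static DataSource demoTables() {
        return tableSource(Map.of("person",
                List.of(row("name", "marko", "age", 29), row("name", "josh", "age", 32))));
    }

    /** Made-up sample data: two "person" vertices and one "project" vertex. */
    public static DataSource demoGraph() {
        return graphSource(List.of(
                row("label", "person", "name", "marko"),
                row("label", "person", "name", "josh"),
                row("label", "project", "name", "tinkerpop")));
    }

    /** The caller does not know, or care, which kind of source it is talking to. */
    public static long count(DataSource db, String relation) {
        return db.select(relation).tuples().count();
    }

    public static void main(String[] args) {
        System.out.println(count(demoTables(), "person"));
        System.out.println(count(demoGraph(), "person"));
    }
}
```

The only point here is that count() is written once against the interface;
how each source supplies the relation stays hidden behind select().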

But wait, you say, what if, under the hood, you have a TTable in one
case, and a TSequence in the other? They are so different! That's why
the Dataflow Model
<https://storage.googleapis.com/pub-tools-public-publication-data/pdf/43864.pdf>
is so great; to an extent, you can think of the two as interchangeable. I
think we would get a lot of mileage out of treating them as interchangeable
within TP4.

So instead of a data-model-specific "root", I argue for a universal root
together with a set of relations and what we might call "indexes". An
index is an arrow from a type to a relation which says "give me a
column/value pair, and I will give you all matching tuples from this
relation". The result is another relation. Where data sources differentiate
themselves is by having different relations and indexes.

For example, if the underlying data structure is nothing but a stream of
Trip tuples, you will have a single relation "Trip", and no indexes. Sorry;
you just have to wait for tuples to go by, and filter on them. So if you
say d.select("Trip", "driver") -- where d is a traversal that gets you to a
User -- the machine knows that it can't use "driver" to look up a specific
set of trips; it has to use a filter over all future "Trip" tuples. If, on
the other hand, we have a relational database, we have the option of
indexing on "driver". In this case, d.select("Trip", "driver") may take you
to a specific table like "Trip_by_driver" which has "driver" as a primary
key. The machine recognizes that this index exists, and uses it to answer
the query more efficiently. The alternative is to do a full scan over any
table which contains the "Trip" relation. Since TinkerPop3, we have been
without a vendor-neutral API for indexes, but this is where such an API
would really start to shine. Consider Neo4j's single property indexes,
JanusGraph's composite indexes, and even RDF triple indices (spo, ops,
etc.) as in AllegroGraph in addition to primary keys in relational
databases.
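
To make the index-versus-scan distinction concrete, here is a minimal
sketch. The Relation class, addIndex(), and the two-argument select() are
all hypothetical names for illustration, not a proposed API, and a finite
in-memory list stands in for both the table and the tuple stream. When an
index exists on the requested column the lookup uses it; otherwise the same
call falls back to filtering every tuple:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class IndexSketch {

    /** Hypothetical relation that may or may not have indexes on its columns. */
    public static final class Relation {
        private final List<Map<String, Object>> tuples;
        // column name -> (column value -> matching tuples)
        private final Map<String, Map<Object, List<Map<String, Object>>>> indexes = new HashMap<>();
        private boolean lastSelectUsedIndex;

        public Relation(List<Map<String, Object>> tuples) {
            this.tuples = tuples;
        }

        /** Build an index: an arrow from a column value to the matching tuples. */
        public void addIndex(String column) {
            Map<Object, List<Map<String, Object>>> byValue = new HashMap<>();
            for (Map<String, Object> t : tuples) {
                byValue.computeIfAbsent(t.get(column), k -> new ArrayList<>()).add(t);
            }
            indexes.put(column, byValue);
        }

        /** "Give me a column/value pair, and I will give you all matching tuples." */
        public List<Map<String, Object>> select(String column, Object value) {
            Map<Object, List<Map<String, Object>>> index = indexes.get(column);
            lastSelectUsedIndex = index != null;
            if (index != null) {
                return index.getOrDefault(value, List.of());
            }
            // No index on this column: filter over every tuple (the "full scan").
            List<Map<String, Object>> matches = new ArrayList<>();
            for (Map<String, Object> t : tuples) {
                if (value.equals(t.get(column))) {
                    matches.add(t);
                }
            }
            return matches;
        }

        public boolean lastSelectUsedIndex() {
            return lastSelectUsedIndex;
        }
    }

    static Map<String, Object> trip(String driver, String rider) {
        return Map.of("driver", driver, "rider", rider);
    }

    /** Made-up sample data. */
    public static Relation demoTrips() {
        return new Relation(List.of(
                trip("alice", "carol"), trip("bob", "dave"), trip("alice", "erin")));
    }

    public static void main(String[] args) {
        Relation trips = demoTrips();
        System.out.println(trips.select("driver", "alice").size() + " via scan");
        trips.addIndex("driver");
        System.out.println(trips.select("driver", "alice").size() + " via index");
    }
}
```

The query is the same in both cases; only the access path differs, which is
what a vendor-neutral index API would let the machine reason about.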

TTuple -- cool. +1

"Enums" -- I agree that enums are necessary, but we need even more: tagged
unions <https://en.wikipedia.org/wiki/Tagged_union>. They are part of the
system of algebraic data types which I described on Friday. An enum is a
special case of a tagged union in which there is no value, just a type tag.
May I suggest something like TValue, which contains a value (possibly
trivial) together with a type tag? This enables ORs and pattern matching.
For example, suppose "created" edges are allowed to point to either
"Project" or "Document" vertices. The in-type of "created" is
union{project:Project, document:Document}. Now the in value of a specific
edge can be TValue("project", [some project vertex]) or TValue("document",
[some document vertex]) and you have the freedom to switch on the type tag
if you want to, e.g. the next step in the traversal can give you the "name"
of the project or the "title" of the document as appropriate.
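
Roughly, and only as a sketch: the TValue class below is hypothetical, with
a string tag and a map standing in for the tagged vertex. Switching on the
tag picks "name" for a project and "title" for a document, and an enum-like
value is just a tag with an empty payload:

```java
import java.util.Map;

public class TaggedUnionSketch {

    /** Hypothetical tagged value: a type tag plus a (possibly trivial) value. */
    public static final class TValue {
        private final String tag;
        private final Map<String, Object> value;

        public TValue(String tag, Map<String, Object> value) {
            this.tag = tag;
            this.value = value;
        }

        public String tag() { return tag; }
        public Map<String, Object> value() { return value; }
    }

    /** in-type of "created": union{project:Project, document:Document} */
    public static TValue project(String name) {
        return new TValue("project", Map.<String, Object>of("name", name));
    }

    public static TValue document(String title) {
        return new TValue("document", Map.<String, Object>of("title", title));
    }

    /** An enum is the special case with no value, just a tag. */
    public static TValue enumValue(String tag) {
        return new TValue(tag, Map.<String, Object>of());
    }

    /** Switch on the type tag to decide which field to read next. */
    public static String display(TValue in) {
        switch (in.tag()) {
            case "project":
                return (String) in.value().get("name");
            case "document":
                return (String) in.value().get("title");
            default:
                throw new IllegalArgumentException("unexpected tag: " + in.tag());
        }
    }

    public static void main(String[] args) {
        System.out.println(display(project("tinkerpop")));
        System.out.println(display(document("design notes")));
    }
}
```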

Multi-properties -- agreed; has() is good enough.

Meta-properties -- again, this is where I think we should have a
lower-level select() operation. Then has() builds on that operation.
Whereas select() matches on fields of a relation, has() matches on property
values and other higher-order things. If you want properties of properties,
don't use has(); use select()/from(). Most of the time, you will just want
to use has().

Agreed that every *entity* should have an id(), and also a label() (though
it should always be possible to infer label() from the context). I would
suggest TEntity (or TElement), which has id(), label(), and value(), where
value() provides the raw value (usually a TTuple) of the entity.
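
As a sketch of what I mean (the TEntity interface below is only a
suggestion, and a plain Map stands in for TTuple), an entity carries id(),
label(), and value(), and a has()-style check can be written against the
raw value:

```java
import java.util.Map;

public class EntitySketch {

    /** Suggested shape: every entity has an id, a label, and a raw value. */
    public interface TEntity {
        Object id();
        String label();
        Map<String, Object> value(); // stand-in for a TTuple
    }

    public static TEntity entity(Object id, String label, Map<String, Object> value) {
        return new TEntity() {
            @Override public Object id() { return id; }
            @Override public String label() { return label; }
            @Override public Map<String, Object> value() { return value; }
        };
    }

    /** A has()-like filter that simply looks into the entity's raw value. */
    public static boolean has(TEntity e, String key, Object expected) {
        return expected.equals(e.value().get(key));
    }

    /** Made-up sample entity. */
    public static TEntity demoPerson() {
        return entity(1, "person", Map.<String, Object>of("name", "marko", "age", 29));
    }

    public static void main(String[] args) {
        TEntity v = demoPerson();
        System.out.println(v.label() + " " + v.id() + " " + has(v, "age", 29));
    }
}
```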

Josh



On Mon, Apr 29, 2019 at 10:35 AM Marko Rodriguez <okramma...@gmail.com>
wrote:

> Hello Josh,
>
> > A has("age",29), for example, operates at a different level of abstraction than a
> > has("city","Santa Fe") if "city" is a column in an "addresses" table.
>
> So hasXXX() operators work on TTuples. Thus:
>
> g.V().hasLabel('person').has('age',29)
> g.V().hasLabel('address').has('city','Santa Fe')
>
> ..both work as a person-vertex and an address-vertex are TTuples. If these
> were tables, then:
>
> jdbc.db().values('people').has('age',29)
> jdbc.db().values('addresses').has('city','Santa Fe')
>
> …also works as both people and addresses are TTables which extend
> TTuple<String,?>.
>
> In summary, if it's a TTuple, then hasXXX() is good to go.
>
> ////////// IGNORE UNTIL AFTER READING NEXT SECTION //////////
> *** SIDENOTE: A TTable (which is a TSequence) could have Symbol-based
> metadata. Thus TTable.value(#label) -> “people.” If so, then
> jdbc.db().hasLabel("people").has("age",29)
>
> > At least, they
> > are different if the data model allows for multi-properties,
> > meta-properties, and hyper-edges. A property is something that can either
> > be there, attached to an element, or not be there. There may also be more
> > than one such property, and it may have other properties attached to it. A
> > column of a table, on the other hand, is always there (even if its value is
> > allowed to be null), always has a single value, and cannot have further
> > properties attached.
>
> 1. Multi-properties.
>
> Multi-properties work because if name references a TSequence, then it's
> the sequence that you analyze with has(). This is another reason why
> TSequence is important. It's a reference to a “stream” so there isn’t
> another layer of tuple-nesting.
>
> // assume v[1] has name={marko,mrodriguez,markor}
> g.V(1).value('name') => TSequence<String>
> g.V(1).values('name') => marko, mrodriguez, markor
> g.V(1).has('name','marko') => v[1]
>
> 2. Meta-properties
>
> // assume v[1] has name=[value:marko,creator:josh,timestamp:12303], i.e. a tuple value
> g.V(1).value('name') => TTuple<?,String> // doh!
> g.V(1).value('name').value('value') => marko
> g.V(1).value('name').value('creator') => josh
>
> So things get screwy. However, it only gets screwy when you mix your
> “metadata” key/values with your “data” key/values. This is why I think
> TSymbols are important. Imagine the following meta-property tuple for v[1]:
>
> [#value:marko,creator:josh,timestamp:12303]
>
> If you do g.V(1).value(‘name’), we could look to the value indexed by the
> symbol #value, thus => “marko”.
> If you do g.V(1).values(‘name’), you would get back a TSequence with a
> single TTuple being the meta property.
> If you do g.V(1).values(‘name’).value(), we could get the value indexed by
> the symbol #value.
> If you do g.V(1).values(‘name’).value(‘creator’), it will return the
> primitive string “josh”.
>
> I believe that the following symbols should be recommended for use across
> all data structures.
>         #id, #label, #key, #value
> …where id(), label(), key(), value() are tuple.get(Symbol). Other symbols
> for use with propertygraph/ include:
>         #outE, #inV, #inE, #outV, #bothE, #bothV
>
> > In order to simplify user queries, you can let has() and values() do double
> > duty, but I still feel that there are lower-level operations at play, at a
> > logical level even if not at a bytecode level. However, expressing a
> > traversal in terms of its lowest-level relational operations may also be
> > useful for query optimization.
>
> One thing that I’m doing, that perhaps you haven’t caught onto yet, is
> that I’m not modeling everything in terms of “tables.” Each data structure
> is trying to stay as pure to its conceptual model as possible. Thus, there
> are no “joins” in property graphs as outE() references a TSequence<TEdge>,
> where TEdge is an interface that extends TTuple. You can just walk without
> doing any type of INNER JOIN. Now, if you model a property graph in a
> relational database, you will have to strategize the bytecode accordingly!
> Just a heads up in case you haven’t noticed that.
>
> Thanks for your input,
> Marko.
>
> http://rredux.com <http://rredux.com/>
>
>
>
> >
> > Josh
> >
> >
> >
> > On Mon, Apr 29, 2019 at 7:34 AM Marko Rodriguez <okramma...@gmail.com>
> > wrote:
> >
> >> Hi,
> >>
> >> *** This email is primarily for Josh (and Kuppitz). However, if others are
> >> interested… ***
> >>
> >> So I did a lot of thinking this weekend about structure/ and this morning,
> >> I prototyped both graph/ and rdbms/.
> >>
> >> This is the way I’m currently thinking of things:
> >>
> >>        1. There are 4 base types in structure/.
> >>                - Primitive: string, long, float, int, … (will constrain
> >> these at some point).
> >>                - TTuple<K,V>: key/value map.
> >>                - TSequence<V>: an iterable of v objects.
> >>                - TSymbol: like Ruby, I think we need “enum-like” symbols
> >> (e.g., #id, #label).
> >>
> >>        2. Every structure has a “root.”
> >>                - for graph it's a TGraph implements TSequence<TVertex>
> >>                - for rdbms it's a TDatabase implements
> >> TTuple<String,TTable>
> >>
> >>        3. Roots implement Structure and thus, are what is generated by
> >> StructureFactory.mint().
> >>                - defined using withStructure().
> >>                - For graph, it's accessible via V().
> >>                - For rdbms, it's accessible via db().
> >>
> >>        4. There is a list of core instructions for dealing with these
> >> base objects.
> >>                - value(K key): gets the TTuple value for the provided
> >> key.
> >>                - values(K key): gets an iterator of the value for the
> >> provided key.
> >>                - entries(): gets an iterator of T2Tuple objects for the
> >> incoming TTuple.
> >>                - hasXXX(A,B): various has()-based filters for looking
> >> into a TTuple and a TSequence
> >>                - db()/V()/etc.: jump to the “root” of the withStructure()
> >> structure.
> >>                - drop()/add(): behave as one would expect and thus.
> >>
> >> ————
> >>
> >> For RDBMS, we have three interfaces in rdbms/.
> >> (machine/machine-core/structure/rdbms)
> >>
> >>        1. TDatabase implements TTuple<String,TTable> // the root
> >> structure that indexes the tables.
> >>        2. TTable implements TSequence<TRow<?>> // a table is a sequence
> >> of rows
> >>        3. TRow<V> implements TTuple<String,V> // a row has string column
> >> names
> >>
> >> I then created a new project at machine/structure/jdbc. The classes in
> >> here implement the above rdbms/ interfaces.
> >>
> >> Here is an RDBMS session:
> >>
> >> final Machine machine = LocalMachine.open();
> >> final TraversalSource jdbc =
> >>        Gremlin.traversal(machine).
> >>                        withProcessor(PipesProcessor.class).
> >>                        withStructure(JDBCStructure.class,
> >> Map.of(JDBCStructure.JDBC_CONNECTION, "jdbc:h2:/tmp/test"));
> >>
> >> System.out.println(jdbc.db().toList());
> >> System.out.println(jdbc.db().entries().toList());
> >> System.out.println(jdbc.db().value("people").toList());
> >> System.out.println(jdbc.db().values("people").toList());
> >> System.out.println(jdbc.db().values("people").value("name").toList());
> >> System.out.println(jdbc.db().values("people").entries().toList());
> >>
> >> This yields:
> >>
> >> [<database#conn1: url=jdbc:h2:/tmp/test user=>]
> >> [PEOPLE:<table#PEOPLE>]
> >> [<table#people>]
> >> [<row#PEOPLE:1>, <row#PEOPLE:2>]
> >> [marko, josh]
> >> [NAME:marko, AGE:29, NAME:josh, AGE:32]
> >>
> >> The bytecode of the last query is:
> >>
> >> [db(<database#conn1: url=jdbc:h2:/tmp/test user=>), values(people),
> >> entries]
> >>
> >> JDBCDatabase implements TDatabase, Structure.
> >>        *** JDBCDatabase is the root structure and is referenced by db()
> >> *** (CRUCIAL POINT)
> >>
> >> Assume another table called ADDRESSES with two columns: name and city.
> >>
> >>
> >> jdbc.db().values("people").as("x").db().values("addresses").has("name",eq(path("x").by("name"))).value("city")
> >>
> >> The above is equivalent to:
> >>
> >> SELECT city FROM people,addresses WHERE people.name=addresses.name
> >>
> >> If you want to do an inner join (a product), you do this:
> >>
> >>
> >> jdbc.db().values("people").as("x").db().values("addresses").has("name",eq(path("x").by("name"))).as("y").path("x","y")
> >>
> >> The above is equivalent to:
> >>
> >> SELECT * FROM addresses INNER JOIN people ON people.name=addresses.name
> >>
> >> NOTES:
> >>        1. Instead of select(), we simply jump to the root via db() (or
> >> V() for graph).
> >>        2. Instead of project(), we simply use value() or values().
> >>        3. Instead of select() being overloaded with by() join syntax, we
> >> use has() and path().
> >>                - like TP3 we will be smart about dropping path() data
> >> once it's no longer referenced.
> >>        4. We can also do LEFT and RIGHT JOINs (haven’t thought through
> >> FULL OUTER JOIN yet).
> >>                - however, we don’t support 'null' in TP so I don’t know
> >> if we want to support these null-producing joins.
> >>
> >> LEFT JOIN:
> >>        * If an address doesn’t exist for the person, emit a “null”-filled
> >> path.
> >>
> >> jdbc.db().values("people").as("x").
> >>  db().values("addresses").as("y").
> >>    choose(has("name",eq(path("x").by("name"))),
> >>      identity(),
> >>      path("y").by(null).as("y")).
> >>  path("x","y")
> >>
> >> SELECT * FROM addresses LEFT JOIN people ON people.name=addresses.name
> >>
> >> RIGHT JOIN:
> >>
> >> jdbc.db().values("people").as("x").
> >>  db().values("addresses").as("y").
> >>    choose(has("name",eq(path("x").by("name"))),
> >>      identity(),
> >>      path("x").by(null).as("x")).
> >>  path("x","y")
> >>
> >>
> >> SUMMARY:
> >>
> >> There are no “low level” instructions. Everything is based on the standard
> >> instructions that we know and love. Finally, if not apparent, the above
> >> bytecode chunks would ultimately get strategized into a single SQL query
> >> (breadth-first) instead of one-off queries (depth-first) to improve
> >> performance.
> >>
> >> Neat?,
> >> Marko.
> >>
> >> http://rredux.com
>
>
