Re: What makes 'graph traversals' and 'relational joins' the same?

Marko Rodriguez Tue, 23 Apr 2019 10:29:30 -0700

Hi,

I think we are very close to something usable for TP4 structure/. Solving this 
problem elegantly will open the floodgates on tp4/ development.


——

I still don’t grok your comeFrom().goto() stuff. I don’t get the benefit of 
having two instructions for “pointer chasing” instead of one.

Let’s put that aside for now and turn to modeling a Vertex. Go back to my 
original representation:

vertex.goto(‘label’)
vertex.goto(‘id’)
vertex.goto(‘outE’)
vertex.goto(‘inE’)
vertex.goto(‘properties’)

Any object can be converted into a Map. In TinkerPop3 we convert vertices into 
maps via:

        g.V().has(‘name’,’marko’).valueMap() => {name:marko,age:29}
        g.V().has(‘name’,’marko’).valueMap(true) => {id:1,label:person,name:marko,age:29}

In the spirit of instruction reuse, we should have an asMap() instruction that 
works for ANY object. (As an aside: this gets back to ONLY sending primitives 
over the wire, no Vertex/Edge/Document/Table/Row/XML/ColumnFamily/etc.). Thus, 
the above is:

        g.V().has(‘name’,’marko’).properties().asMap() => {name:marko,age:29}
        g.V().has(‘name’,’marko’).asMap() => {id:1,label:person,properties:{name:marko,age:29}}

You might ask, why didn’t it go to outE and inE and map-ify that data? Because 
those are “sibling” references, not “children” references. 

        goto(‘outE’) is a “sibling” reference. (a vertex does not contain an edge)
        goto(‘id’) is a “child” reference. (a vertex contains the id)

Where do we find sibling references?
        Graphs: vertices don’t contain each other.
        OO heaps: many objects don’t contain each other.
        RDBMS: rows are linked by joins, but don’t contain each other.

So, the way in which we structure our references (pointers) determines the 
shape of the data and ultimately how different instructions will behave. We 
can’t assume that asMap() knows anything about 
vertices/edges/documents/rows/tables/etc. It will simply walk all 
child-references and create a map.
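
To make the child/sibling distinction concrete, here is a minimal Python 
sketch. It is not TP4 code: the Node class and the as_map() function are 
hypothetical names, and the sketch assumes every object keeps its child 
references and its sibling references in two separate dictionaries. Under 
that assumption, as_map() only needs to recurse through the children.

```python
class Node:
    """A hypothetical 'complex' object with labeled references."""

    def __init__(self, children=None, siblings=None):
        self.children = children or {}   # contained data: id, label, properties
        self.siblings = siblings or {}   # linked data: outE, inE


def as_map(obj):
    """Walk child references only, producing plain maps and primitives."""
    if isinstance(obj, Node):
        return {label: as_map(child) for label, child in obj.children.items()}
    return obj  # primitives are returned unchanged


properties = Node(children={"name": "marko", "age": 29})
edge = Node(children={"id": 7, "label": "knows"})
vertex = Node(
    children={"id": 1, "label": "person", "properties": properties},
    siblings={"outE": [edge]},
)

print(as_map(properties))  # {'name': 'marko', 'age': 29}
print(as_map(vertex))      # the "outE" sibling reference is not followed
```

Note that as_map() never inspects what kind of thing it was handed. Whether 
"outE" shows up in the result depends only on how the references were 
declared.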

We don’t want TP to get involved in “complex data types.” We don’t care. You 
can propagate MyDatabaseObject through the TP4 VM pipeline and load your object 
up with methods for optimizations with your DB and all that, but for TP4, your 
object just needs to implement:

        ComplexType
                - Iterator<T> children(String label)
                - Iterator<T> siblings(String label)
                - default Iterator<T> references(String label) {
                      return IteratorUtils.concat(children(label), siblings(label));
                  }
                - String toString()
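
For anyone who wants to play with the idea, the interface above could be 
rendered in Python roughly as follows. This is only a sketch: DictBacked is 
a hypothetical provider class invented for the example, and 
itertools.chain stands in for the concat() call.

```python
from abc import ABC, abstractmethod
from itertools import chain
from typing import Any, Dict, Iterator, List, Optional


class ComplexType(ABC):
    """Python rendering of the interface above (illustrative only)."""

    @abstractmethod
    def children(self, label: str) -> Iterator[Any]:
        """Objects this one contains under `label`."""

    @abstractmethod
    def siblings(self, label: str) -> Iterator[Any]:
        """Objects this one merely links to under `label`."""

    def references(self, label: str) -> Iterator[Any]:
        # Default: children first, then siblings.
        return chain(self.children(label), self.siblings(label))


class DictBacked(ComplexType):
    """Hypothetical provider object that stores its references in dicts."""

    def __init__(self, name: str,
                 children: Optional[Dict[str, List[Any]]] = None,
                 siblings: Optional[Dict[str, List[Any]]] = None):
        self._name = name
        self._children = children or {}
        self._siblings = siblings or {}

    def children(self, label: str) -> Iterator[Any]:
        return iter(self._children.get(label, []))

    def siblings(self, label: str) -> Iterator[Any]:
        return iter(self._siblings.get(label, []))

    def __str__(self) -> str:
        return self._name


e9 = DictBacked("e[9]", children={"label": ["created"]})
v1 = DictBacked("v[1]", children={"id": [1]}, siblings={"outE": [e9]})

print(list(v1.references("id")))                # [1]
print([str(e) for e in v1.references("outE")])  # ['e[9]']
print(str(v1))                                  # v[1]
```

A label that is not present simply yields an empty iterator, so callers do 
not need a separate "does this reference exist" check.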

When a ComplexType goes over the wire to the user, it is just represented as a 
ComplexTypeProxy with a toString() like v[1], 
tinkergraph[vertices:10,edges:34], etc. All references are disconnected. Yes, 
even children references. We do not want language drivers to have to know about 
random object types and deal with implementing serializers and all that 
nonsense. The TP4 serialization protocol is primitives, maps, lists, bytecode, 
and traversers. That’s it!

*** Only Maps and Lists (that don’t contain complex data types) maintain their 
child references “over the wire.”
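
One way to picture the serialization rule is the Python sketch below. It 
makes several assumptions that go beyond what is stated above: the to_wire() 
function and the Vertex stand-in are hypothetical, bytecode and traversers 
are left out, and a complex value nested inside a list or map is assumed to 
be replaced by its proxy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComplexTypeProxy:
    """What the client sees instead of a provider object: just its string form."""
    text: str


class Vertex:
    """Stand-in for any provider-specific object (hypothetical)."""

    def __init__(self, vertex_id):
        self.vertex_id = vertex_id

    def __str__(self):
        return f"v[{self.vertex_id}]"


PRIMITIVES = (str, int, float, bool, type(None))


def to_wire(value):
    """Convert a result into primitives, lists, maps, and proxies."""
    if isinstance(value, PRIMITIVES):
        return value
    if isinstance(value, list):
        return [to_wire(item) for item in value]
    if isinstance(value, dict):
        return {key: to_wire(item) for key, item in value.items()}
    # Anything else is a complex type: keep only its toString() form.
    return ComplexTypeProxy(str(value))


print(to_wire({"name": "marko", "age": 29}))  # a map of primitives passes through intact
print(to_wire(Vertex(1)))                     # reduced to a proxy holding "v[1]"
print(to_wire([Vertex(1), "marko"]))
```

The point of the sketch is that a language driver would only ever need to 
decode a small, closed set of shapes, no matter what the provider's objects 
look like.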

——

I don’t get your hypergraph example, so let me try another example:

        tp ==member==> marko, josh

TP is a vertex and there is a directed hyperedge with label “member” connecting 
to marko and josh vertices.

tp.goto(“outE”).filter(goto(“label”).is(“member”)).goto(“inV”)

Looks exactly like a property graph query? However, it’s not, because goto(“inV”) 
returns 2 vertices, not 1. EdgeVertexFlatmapFunction works for property graphs 
and hypergraphs. It doesn’t care — it just follows goto() pointers! That is, it 
follows the ComplexType.references(“inV”). Multi-properties are the same as 
well. Likewise for meta-properties. These data model variations are not 
“special” to the TP4 VM. It just walks references whether there are 0,1,2, or N 
of them.
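
Here is a small Python sketch of that point, with objects modeled as plain 
dictionaries from a label to a list of targets. The goto() and 
out_members() helpers are hypothetical names for this example, not TP4 
instructions. The same traversal code runs against a property-graph 
encoding (two binary edges) and a hypergraph encoding (one hyperedge with 
two in-vertices).

```python
def goto(objects, label):
    """Flat-map each object to the 0..N targets it references under `label`."""
    return [target for obj in objects for target in obj.get(label, [])]


def out_members(vertex):
    """Mirrors tp.goto("outE").filter(goto("label").is("member")).goto("inV")."""
    edges = [e for e in goto([vertex], "outE") if goto([e], "label") == ["member"]]
    return goto(edges, "inV")


marko, josh = {"id": [1]}, {"id": [4]}

# Property graph: one edge per member, each with exactly one in-vertex.
tp_pg = {"outE": [{"label": ["member"], "inV": [marko]},
                  {"label": ["member"], "inV": [josh]}]}

# Hypergraph: a single hyperedge whose "inV" reference has two targets.
tp_hg = {"outE": [{"label": ["member"], "inV": [marko, josh]}]}

print(goto(out_members(tp_pg), "id"))  # [1, 4]
print(goto(out_members(tp_hg), "id"))  # [1, 4]
```

Because goto() is a flat-map, a reference with zero targets just 
contributes nothing to the result, which is the same behavior one would 
want for a missing property.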

Thus, what is crucial to all this is the “shape of the data.” The key is to use 
your pointers wisely so that instructions produce useful results.

Does any of what I wrote update your comeFrom().goto() stuff? If not, can you 
please explain to me why comeFrom() is cool? Sorry for being dense (aka “being 
Kuppitz”; that’s right, I said it. boom!).

Thanks,
Marko.

http://rredux.com




> On Apr 23, 2019, at 10:25 AM, Joshua Shinavier <[email protected]> wrote:
> 
> On Tue, Apr 23, 2019 at 5:14 AM Marko Rodriguez <[email protected]>
> wrote:
> 
>> Hey Josh,
>> 
>> This gets to the notion I presented in “The Fabled GMachine.”
>>        http://rredux.com/the-fabled-gmachine.html (first paragraph of
>> “Structures, Processes, and Languages” section)
>> 
>> All that exists are memory addresses that contain either:
>> 
>>        1. A primitive
>>        2. A set of labeled references to other references or primitives.
>> 
>> Using your work and the above, here is a super low-level ‘bytecode' for
>> property graphs.
>> 
>> v.goto("id") => 1
>> 
> 
> LGTM. An id is special because it is uniquely identifying / is a primary
> key for the element. However, it is also just a field of the element, like
> "in"/"inV" and "out"/"outV" are fields of an edge. As an aside, an id would
> only really need to be unique among other elements of the same type. To the
> above, I would add:
> 
> v.type() => Person
> 
> ...a special operation which takes you from an element to its type. This is
> important if unions are supported; e.g. "name" in my example can apply
> either to a Person or a Project.
> 
> 
>> v.goto("label") => person
>> 
> 
> Or that. Like "id", "type"/"label" is special. You can think of it as a
> field; it's just a different sort of field which will have the same value
> for all elements of any given type.
> 
> 
> 
>> v.goto("properties").goto("name") => "marko"
>> 
> 
> OK, properties. Are properties built-in as a separate kind of thing from
> edges, or can we treat them the same as vertices and edges here? I think we
> can treat them the same. A property, in the algebraic model I described
> above, is just an element with two fields, the second of which is a
> primitive value. As I said, I think we need two distinct traversal
> operations -- projection and selection -- and here is where we can use the
> latter. Here, I will call it "comeFrom".
> 
> v.comeFrom("name", "out").goto("in") => {"marko"}
> 
> You can think of this comeFrom as a special case of a select() function
> which takes a type -- "name" -- and a set of key/value pairs {("out", v)}.
> It returns all matching elements of the given type. You then project to the
> "in" value using your goto. I wrote {"marko"} as a set, because comeFrom
> can give you multiple properties, depending on whether multi-properties are
> supported.
> 
> Note how similar this is to an edge traversal:
> 
> v.comeFrom("knows", "out").goto("in") => {v[2], v[4]}
> 
> Of course, you could define "properties" in such a way that a
> goto("properties") does exactly this under the hood, but in terms of low
> level instructions, you need something like comeFrom.
> 
> 
>> v.goto("properties").goto("name").goto(0) => "m"
>> 
> 
> This is where the notion of optionals becomes handy. You can make
> array/list indices into fields like this, but IMO you should also make them
> safe. E.g. borrowing Haskell syntax for a moment:
> 
> v.goto("properties").goto("name").goto(0) => Just 'm'
> 
> v.goto("properties").goto("name").goto(5) => Nothing
> 
> 
>> v.goto("outE").goto("inV") => v[2], v[4]
>> 
> 
> I am not a big fan of untyped "outE", but you can think of this as a union
> of all v.comeFrom(x, "out").goto("in"), where x is any edge type. Only
> "knows" and "created" are edge types which are applicable to "Person", so
> you will only get {v[2], v[4]}. If you want to get really crazy, you can
> allow x to be any type. Then you get {v[2], v[4], 29, "marko"}.
> 
> 
> 
>> g.goto("V").goto(1) => v[1]
>> 
> 
> That, or you give every element a virtual field called "graph". So:
> 
> v.goto("graph") => g
> 
> g.comeFrom("Person", "graph") => {v[1], v[2], v[4], v[6]}
> 
> g.comeFrom("Person", "graph").restrict("id", 1)
> 
> ...where restrict() is the relational "sigma" operation as above, not to be
> confused with TinkerPop's select(), filter(), or has() steps. Again, I
> prefer to specify a type in comeFrom (i.e. we're looking specifically for a
> Person with id of 1), but you could also do a comprehension g.comeFrom(x,
> "graph"), letting x range over all types.
> 
> 
> 
>> The goto() instruction moves the “memory reference” (traverser) from the
>> current “memory address” to the “memory address” referenced by the goto()
>> argument.
>> 
> 
> Agreed, if we also think of primitive values as memory references.
> 
> 
> 
>> The Gremlin expression:
>> 
>>        g.V().has(‘name’,’marko’).out(‘knows’).drop()
>> 
>> ..would compile to:
>> 
>> 
>> g.goto(“V”).filter(goto(“properties”).goto(“name”).is(“marko”)).goto(“outE”).filter(goto(“label”).is(“knows”)).goto(“inV”).free()
>> 
> 
> 
> In the alternate universe:
> 
> g.comeFrom("Person", "graph").comeFrom("name", "out").restrict("in",
> "marko").goto("out").comeFrom("knows", "out").goto("in").free()
> 
> I have wimped out on free() and just left it as you had it, but I think it
> would be worthwhile to explore a monadic syntax for traversals with
> side-effects. Different topic.
> 
> Now, all of this "out", "in" business is getting pretty repetitive, right?
> Well, the field names become more diverse if we allow hyper-edges and
> generalized ADTs. E.g. in my Trip example, say I want to know all drop-off
> locations for a given rider:
> 
> u.comeFrom("Trip", "rider").goto("dropoff").goto("place")
> 
> Done.
> 
> 
> 
>> If we can get things that “low-level” and still efficient to compile, then
>> we can model every data structure. All you are doing is pointer chasing
>> through a withStructure() data structure.
>> 
> 
> Agreed.
> 
> 
>> No one would ever want to write strategies for goto()-based Bytecode.
> 
> 
> Also agreed.
> 
> 
> 
>> Thus, perhaps there could be a PropertyGraphDecorationStrategy that does:
>> 
>> [...]
> 
> 
> No argument here, though the alternate-universe "bytecode" would look
> slightly different. And the high-level syntax should also be able to deal
> with generalized relations / data types gracefully. As a thought
> experiment, suppose we were to define the steps to() as your goto(), and
> from() as my comeFrom(). Then traversals like:
> 
> u.from("Trip", "rider").to("dropoff").to("time")
> 
> ...look pretty good as-is, and are not too low-level. However, ordinary
> edge traversals like:
> 
> v.from("knows", "out").to("in")
> 
> ...do look a little Assembly-like. So in/out/both etc. remain as they are,
> but are shorthand for from() and to() steps using "out" or "in":
> 
> v.out("knows") === v.outE("knows").inV() === v.from("knows", "out").to("in")
> 
> 
>> [I AM NOW GOING OFF THE RAILS]
>> [sniiiiip]
>> 
> 
> Sure. Again, I like the idea of wrapping side-effects in monads. What would
> that look like in a Gremlinesque fluent syntax? I don't quite know, but if
> we think of the dot as a monadic bind operation like Haskell's >>=, then
> perhaps the monadic expressions look pretty similar to what you have just
> sketched out. Might have to be careful about what it means to nest
> operations as in your addEdge examples.
> 
> 
> 
>> [I AM NOW BACK ON THE RAILS]
>> 
>> It’s as if “properties”, “outE”, “label”, “inV”, etc. references mean
>> something to property graph providers and they can do more intelligent
>> stuff than what MongoDB would do with such information. However, someone,
>> of course, can create a MongoDBPropertyGraphStrategy that would make
>> documents look like vertices and edges and then use O(log(n)) lookups on
>> ids to walk the graph. However, if that didn’t exist, it would still do
>> something that works, even if it’s horribly inefficient, as every database can
>> make primitives with references between them!
>> 
> 
> I'm on the same pair of rails.
> 
> 
> 
>> Anywho @Josh, I believe goto() is what you are doing with multi-references
>> off an object. How do we make it all clean, easy, and universal?
>> 
> 
> Let me know what you think of the above.
> 
> Josh
> 
> 
> 
>> 
>> Marko.
>> 
>> http://rredux.com
>> 
>> 
>> 
>> 
>>> On Apr 22, 2019, at 6:42 PM, Joshua Shinavier <[email protected]> wrote:
>>> 
>>> Ah, glad you asked. It's all in the pictures. I have nowhere to put them
>> online at the moment... maybe this attachment will go through to the list?
>>> 
>>> Btw. David Spivak gave his talk today at Uber; it was great. Juan
>> Sequeda (relational <--> RDF mapping guy) was also here, and Ryan joined
>> remotely. Really interesting discussion about databases vs. graphs, and
>> what category theory brings to the table.
>>> 
>>> 
>>> On Mon, Apr 22, 2019 at 1:45 PM Marko Rodriguez <[email protected]> wrote:
>>> Hey Josh,
>>> 
>>> I’m digging what you are saying, but the pictures didn’t come through
>> for me ? … Can you provide them again (or if dev@ is filtering them, can
>> you give me URLs to them)?
>>> 
>>> Thanks,
>>> Marko.
>>> 
>>> 
>>>> On Apr 21, 2019, at 12:58 PM, Joshua Shinavier <[email protected]> wrote:
>>>> 
>>>> On the subject of "reified joins", maybe a picture will be worth a
>>>> few words. As I said in the thread
>>>> <https://groups.google.com/d/msg/gremlin-users/_s_DuKW90gc/Xhp5HMfjAQAJ>
>>>> on property graph standardization, if you think of vertex labels, edge
>>>> labels, and property keys as types, each with projections to two other
>>>> types, there is a nice analogy with relations of two columns, and this
>>>> analogy can be easily extended to hyper-edges. Here is what the schema of
>>>> the TinkerPop classic graph looks like if you make each type (e.g. Person,
>>>> Project, knows, name) into a relation:
>>>> 
>>>> 
>>>> 
>>>> I have made the vertex types salmon-colored, the edge types yellow,
>> the property types green, and the data types blue. The "o" and "I" columns
>> represent the out-type (e.g. out-vertex type of Person) and in-type (e.g.
>> property value type of String) of each relation. More than two arrows from
>> a column represent a coproduct, e.g. the out-type of "name" is Person OR
>> Project. Now you can think of out() and in() as joins of two tables on a
>> primary and foreign key.
>>>> 
>>>> We are not limited to "out" and "in", however. Here is the ternary
>>>> relationship (hyper-edge) from the hyper-edge slide
>>>> <https://www.slideshare.net/joshsh/a-graph-is-a-graph-is-a-graph-equivalence-transformation-and-composition-of-graph-data-models-129403012/49>
>>>> of my Graph Day preso, which has three columns/roles/projections:
>>>> 
>>>> 
>>>> 
>>>> I have drawn Says in light blue to indicate that it is a generalized
>> element; it has projections other than "out" and "in". Now the line between
>> relations and edges begins to blur. E.g. in the following, is PlaceEvent a
>> vertex or a property?
>>>> 
>>>> 
>>>> 
>>>> With the right type system, we can just speak of graph elements, and
>> use "vertex", "edge", "property" when it is convenient. In the relational
>> model, they are relations. If you materialize them in a relational
>> database, they are rows. In any case, you need two basic graph traversal
>> operations:
>>>> project() -- forward traversal of the arrows in the above diagrams.
>> Takes you from an element to a component like in-vertex.
>>>> select() -- reverse traversal of the arrows. Allows you to answer
>> questions like "in which Trips is John Doe the rider?"
>>>> 
>>>> Josh
>>>> 
>>>> 
>>>> On Fri, Apr 19, 2019 at 10:03 AM Marko Rodriguez <[email protected]> wrote:
>>>> Hello,
>>>> 
>>>> I agree with everything you say. Here is my question:
>>>> 
>>>>        Relational database — join: Table x Table x equality function
>> -> Table
>>>>        Graph database — traverser: Vertex x edge label -> Vertex
>>>> 
>>>> I want a single function that does both. The only thing I could think of was to
>>>> represent traverser() in terms of join():
>>>> 
>>>>        Graph database — traverser: Vertices x Vertex x equality
>> function -> Vertices
>>>> 
>>>> For example,
>>>> 
>>>> V().out(‘address’)
>>>> 
>>>>        ==>
>>>> 
>>>> g.join(V().hasLabel(‘person’).as(‘a’),
>>>>       V().hasLabel(‘addresses’).as(‘b’)).
>>>>         by(‘name’).select(?address vertex?)
>>>> 
>>>> That is, join the vertices with themselves based on some predicate to
>> go from vertices to vertices.
>>>> 
>>>> However, I would like instead to transform the relational database
>> join() concept into a traverser() concept. Kuppitz and I were talking the
>> other day about a link() type operator that says: “try and link to this
>> thing in some specified way.” .. ?? The problem we ran into is again, “link
>> it to what?”
>>>> 
>>>>        - in graph, the ‘to what’ is hardcoded so you don’t need to
>> specify anything.
>>>>        - in rdbms, the ’to what’ is some other specified table.
>>>> 
>>>> So what does the link() operator look like?
>>>> 
>>>> ——
>>>> 
>>>> Some other random thoughts….
>>>> 
>>>> Relational databases join on the table (the whole collection)
>>>> Graph databases traverse on the vertex (an element of the whole
>>>> collection)
>>>> 
>>>> We can make a relational database join on a single row (by providing a
>>>> filter to a particular primary key). This is the same as a table with one
>>>> row. Likewise, for graph in the join() context above:
>>>> 
>>>> V(1).out(‘address’)
>>>> 
>>>>        ==>
>>>> 
>>>> g.join(V(1).as(‘a’),
>>>>       V().hasLabel(‘addresses’).as(‘b’)).
>>>>         by(‘name’).select(?address vertex?)
>>>> 
>>>> More thoughts please….
>>>> 
>>>> Marko.
>>>> 
>>>> http://rredux.com
>>>> 
>>>> 
>>>> 
>>>> 
>>>>> On Apr 19, 2019, at 4:20 AM, pieter martin <[email protected]> wrote:
>>>>> 
>>>>> Hi,
>>>>> The way I saw it is that the big difference is that graphs have
>>>>> reified joins. This is both a blessing and a curse.
>>>>> A blessing because it's much easier (less text to type, fewer mistakes,
>>>>> clearer semantics...) to traverse an edge than to construct a manual
>>>>> join. A curse because there are almost always far more ways to traverse
>>>>> a data set than just by the edges some architect might have considered
>>>>> when creating the data set. Often the architect is not the domain
>>>>> expert and the edges are a hardcoded layout of the dataset, which
>>>>> almost certainly won't survive the real world's demands. In graphs, if
>>>>> there are no edges then the data is not reachable, except via indexed
>>>>> lookups. This is the standard engineering problem of database design,
>>>>> but it is important and useful that data can be traversed, joined,
>>>>> without having reified edges.
>>>>> In Sqlg at least, but I suspect it generalizes, I want to create the
>>>>> notion of a "virtual edge", which in metadata describes the join, and
>>>>> then the standard to(direction, "virtualEdgeName") will work.
>>>>> In a way this is precisely to keep the graphy nature of gremlin, i.e.
>>>>> traversing edges, and avoid using the manual join syntax you described.
>>>>> Cheers
>>>>> Pieter
>>>>> 
>>>>> On Thu, 2019-04-18 at 14:15 -0600, Marko Rodriguez wrote:
>>>>>> Hi,
>>>>>> *** This is mainly for Kuppitz, but if others care.
>>>>>> Was thinking last night about relational data and Gremlin. The T()
>>>>>> step returns all the tables in the withStructure() RDBMS database.
>>>>>> Tables are ‘complex values’ so they can't leave the VM (only a simple
>>>>>> ‘toString’).
>>>>>> Below is a fake Gremlin session. (and these are just ideas…)
>>>>>>         tables -> a ListLike of rows
>>>>>>         rows -> a MapLike of primitives
>>>>>> gremlin> g.T()
>>>>>> ==>t[people]
>>>>>> ==>t[addresses]
>>>>>> gremlin> g.T(‘people’)
>>>>>> ==>t[people]
>>>>>> gremlin> g.T(‘people’).values()
>>>>>> ==>r[people:1]
>>>>>> ==>r[people:2]
>>>>>> ==>r[people:3]
>>>>>> gremlin> g.T(‘people’).values().asMap()
>>>>>> ==>{name:marko,age:29}
>>>>>> ==>{name:kuppitz,age:10}
>>>>>> ==>{name:josh,age:35}
>>>>>> gremlin> g.T(‘people’).values().has(‘age’,gt(20))
>>>>>> ==>r[people:1]
>>>>>> ==>r[people:3]
>>>>>> gremlin> g.T(‘people’).values().has(‘age’,gt(20)).values(‘name’)
>>>>>> ==>marko
>>>>>> ==>josh
>>>>>> Makes sense. Nice that values() and has() generally apply to all
>>>>>> ListLike and MapLike structures. Also, note how asMap() is the
>>>>>> valueMap() of TP4, but generalizes to anything that is MapLike so it
>>>>>> can be turned into a primitive form as a data-rich result from the
>>>>>> VM.
>>>>>> gremlin> g.T()
>>>>>> ==>t[people]
>>>>>> ==>t[addresses]
>>>>>> gremlin> g.T(‘addresses’).values().asMap()
>>>>>> ==>{name:marko,city:santafe}
>>>>>> ==>{name:kuppitz,city:tucson}
>>>>>> ==>{name:josh,city:desertisland}
>>>>>> gremlin> g.join(T(‘people’).as(‘a’),T(‘addresses’).as(‘b’)).
>>>>>>            by(select(‘a’).value(’name’).is(eq(select(‘b’).value(’name’)))).
>>>>>>            values().asMap()
>>>>>> ==>{a.name:marko,a.age:29,b.name:marko,b.city:santafe}
>>>>>> ==>{a.name:kuppitz,a.age:10,b.name:kuppitz,b.city:tucson}
>>>>>> ==>{a.name:josh,a.age:35,b.name:josh,b.city:desertisland}
>>>>>> gremlin> g.join(T(‘people’).as(‘a’),T(‘addresses’).as(‘b’)).
>>>>>>            by(’name’). // shorthand for equijoin on name column/key
>>>>>>            values().asMap()
>>>>>> ==>{a.name:marko,a.age:29,b.name:marko,b.city:santafe}
>>>>>> ==>{a.name:kuppitz,a.age:10,b.name:kuppitz,b.city:tucson}
>>>>>> ==>{a.name:josh,a.age:35,b.name:josh,b.city:desertisland}
>>>>>> gremlin> g.join(T(‘people’).as(‘a’),T(‘addresses’).as(‘b’)).
>>>>>>            by(’name’)
>>>>>> ==>t[people<-name->addresses]  // without asMap(), just the complex value ‘toString'
>>>>>> gremlin>
>>>>>> And of course, all of this is strategized into a SQL call so its
>>>>>> joins aren’t necessarily computed using TP4-VM resources.
>>>>>> Anywho — what I hope to realize is the relationship between “links”
>>>>>> (graph) and “joins” (tables). How can we make (bytecode-wise at
>>>>>> least) RDBMS join operations and graph traversal operations ‘the
>>>>>> same.’?
>>>>>>     Singleton: Integer, String, Float, Double, etc.
>>>>>>     Collection: List, Map (Vertex, Table, Document)
>>>>>>     Linkable: Vertex, Table
>>>>>> Vertices and Tables can be “linked.” Unlike Collections, they don’t
>>>>>> maintain a “parent/child” relationship with the objects they
>>>>>> reference. What does this mean……….?
>>>>>> Take care,
>>>>>> Marko.
>>>>>> http://rredux.com
>>> 
>>> <diagrams.zip>
>> 
>>
