Re: A meta model for gremlin's property graph

pieter gmail Sun, 16 Jan 2022 11:36:38 -0800

Hi,

This continues the "first decide what we are trying to achieve in the
first place" part.
We seem to agree on most of the items listed.
Do you have more items to add to the list?


Here is the bit I did not quite follow.

> > 3: Extend the gremlin grammar to specify schema create/edit/delete
> > functionality.
> Why is that necessary, if you're embedding schemas in the graph? Just
> embed them in the graph. We don't have extra grammar for updating
> other types of graphs.

I am not entirely sure what you mean here. In my previous mail I
created the "modern" schema using pure gremlin, based on the property
graph meta model.
However, it is far from user friendly.
Here it is again. Is this what you mean by "Just embed them in the
graph."?

    // g.meta() is the proposed entry point; it is assumed to return a Graph.
    Graph modernSchema = g.meta();

    // Vertex label "person" with properties name:STRING and age:INTEGER.
    Vertex person = modernSchema.addVertex(T.label, "VertexLabel", "label", "person");
    Vertex personNameVertexProperty = modernSchema.addVertex(T.label, "VertexProperty",
            "name", "name", "type", GremlinDataType.STRING.name());
    Vertex personAgeVertexProperty = modernSchema.addVertex(T.label, "VertexProperty",
            "name", "age", "type", GremlinDataType.INTEGER.name());
    person.addEdge("properties", personNameVertexProperty);
    person.addEdge("properties", personAgeVertexProperty);

    // Vertex label "software" with properties name:STRING and lang:STRING.
    Vertex software = modernSchema.addVertex(T.label, "VertexLabel", "label", "software");
    Vertex softwareNameVertexProperty = modernSchema.addVertex(T.label, "VertexProperty",
            "name", "name", "type", GremlinDataType.STRING.name());
    Vertex softwareLangVertexProperty = modernSchema.addVertex(T.label, "VertexProperty",
            "name", "lang", "type", GremlinDataType.STRING.name());
    software.addEdge("properties", softwareNameVertexProperty);
    software.addEdge("properties", softwareLangVertexProperty);

    // Edge label "knows" with property weight:DOUBLE.
    Vertex knows = modernSchema.addVertex(T.label, "EdgeLabel", "label", "knows");
    Vertex knowsWeightVertexProperty = modernSchema.addVertex(T.label, "EdgeProperty",
            "name", "weight", "type", GremlinDataType.DOUBLE.name());
    knows.addEdge("properties", knowsWeightVertexProperty);

    // Edge label "created" with property weight:DOUBLE.
    Vertex created = modernSchema.addVertex(T.label, "EdgeLabel", "label", "created");
    Vertex createdWeightVertexProperty = modernSchema.addVertex(T.label, "EdgeProperty",
            "name", "weight", "type", GremlinDataType.DOUBLE.name());
    created.addEdge("properties", createdWeightVertexProperty);

    // "knows" runs person -> person, "created" runs person -> software.
    person.addEdge("outEdge", knows);
    person.addEdge("outEdge", created);
    person.addEdge("inEdge", knows);
    software.addEdge("inEdge", created);

It is far simpler to define a dedicated grammar, something like this,

    VertexLabel person = g.getTopology().ensureVertexLabelExist("person",
            new HashMap<>() {{
                put("name", GremlinDataType.STRING);
                put("age", GremlinDataType.INTEGER);
            }});
    VertexLabel software = g.getTopology().ensureVertexLabelExist("software",
            new HashMap<>() {{
                put("name", GremlinDataType.STRING);
                put("lang", GremlinDataType.STRING);
            }});
    // "knows" runs person -> person.
    EdgeLabel knows = person.ensureEdgeLabelExist("knows", person,
            new HashMap<>() {{
                put("weight", GremlinDataType.DOUBLE);
            }});
    // "created" runs person -> software.
    EdgeLabel created = person.ensureEdgeLabelExist("created", software,
            new HashMap<>() {{
                put("weight", GremlinDataType.DOUBLE);
            }});

This is from embedded Java, so it will need some adjustment and thought,
but I suspect it is easier for the user if we extend the grammar, in
the same way that an RDBMS does not ask users to insert rows into the
information schema but instead gives them a DDL grammar that speaks
directly to the task at hand.
It also guarantees that the model is valid at all times, as the grammar
won't permit an incorrect schema.
In the "embedded" way it is possible to corrupt the schema, which is
why the property graph meta model defined gremlin constraints to
validate the schema.
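
To make the difference concrete, here is a minimal sketch in plain Java
with no TinkerPop dependency. All names in it (SchemaSketch,
SchemaVertex, validate, property) are hypothetical and exist only for
this illustration; they are not part of the meta model or of any
TinkerPop API. It assumes a single constraint: every property
definition must carry a name and a known data type.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for illustration; none of these names are TinkerPop API.
final class SchemaSketch {

    enum DataType { STRING, INTEGER, DOUBLE }

    // "Embedded" style: a schema element is a labelled bag of untyped key/values.
    static final class SchemaVertex {
        final String label;
        final Map<String, Object> properties = new HashMap<>();

        SchemaVertex(String label, Object... keyValues) {
            this.label = label;
            for (int i = 0; i + 1 < keyValues.length; i += 2) {
                properties.put((String) keyValues[i], keyValues[i + 1]);
            }
        }
    }

    // After-the-fact constraint: every property vertex needs a name and a known type.
    static List<String> validate(List<SchemaVertex> schema) {
        List<String> violations = new ArrayList<>();
        for (SchemaVertex v : schema) {
            if (!v.label.endsWith("Property")) {
                continue; // only property definitions are checked in this sketch
            }
            if (!(v.properties.get("name") instanceof String)) {
                violations.add(v.label + ": missing 'name'");
            }
            Object type = v.properties.get("type");
            if (!(type instanceof String) || !isKnownType((String) type)) {
                violations.add(v.label + ": unknown or missing 'type': " + type);
            }
        }
        return violations;
    }

    static boolean isKnownType(String name) {
        for (DataType t : DataType.values()) {
            if (t.name().equals(name)) {
                return true;
            }
        }
        return false;
    }

    // "Dedicated API" style: the signature only accepts DataType, so a
    // misspelled type name cannot be written in the first place.
    static SchemaVertex property(String label, String name, DataType type) {
        return new SchemaVertex(label, "name", name, "type", type.name());
    }

    public static void main(String[] args) {
        List<SchemaVertex> embedded = Arrays.asList(
                new SchemaVertex("VertexProperty", "name", "age", "type", "INTEGER"),
                new SchemaVertex("EdgeProperty", "name", "weight", "type", "DUBLE")); // typo slips through
        System.out.println(validate(embedded));

        List<SchemaVertex> typed = Arrays.asList(
                property("VertexProperty", "age", DataType.INTEGER),
                property("EdgeProperty", "weight", DataType.DOUBLE));
        System.out.println(validate(typed));
    }
}
```

In the embedded list the misspelled type is only caught when validate
runs, whereas the typed helper rejects it at compile time. A schema
grammar would play the role of the typed helper here.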

Preferably, all implementations should provide a way to query the schema
based on the property graph meta model.

i.e.
    List<Vertex> persons = g.schema().V().hasLabel("VertexLabel")
            .has("label", P.eq("person")).toList();
    Assert.assertEquals(1, persons.size());
    // The out-edge labels of "person": "knows" and "created".
    List<Vertex> knowsAndCreated = g.schema().V().hasLabel("VertexLabel")
            .has("label", P.eq("person")).out("outEdge").toList();
    Assert.assertEquals(2, knowsAndCreated.size());


Thanks
Pieter


On Sun, 2022-01-16 at 08:40 -0800, Joshua Shinavier wrote:
> Hi Pieter,
> 
> Responses inline.
> 
> On Sat, Jan 15, 2022 at 9:49 AM pieter gmail
> <pieter.mar...@gmail.com> wrote:
> > [...]
> > The primary inspiration from UML is the insight that a language can
> > be self describing.  It is of course inevitable in the real world
> > as we can not tolerate infinite regression with regards to every
> > level needing yet another meta level to describe it.
> > 
> 
> 
> Yes, and it has that in common with many other languages, starting
> with BNF. For a scripting language like Gremlin, you can even speak
> of self-interpretation.
> 
>  
> > [...]
> > To be clear I am not using any OMG standard as such. If we were to
> > do that we would define the property graph model using MOF (meta
> > object facility) or its counterpart EMF. While this is entirely
> > possible it is not the approach taken here. Here the attempt is to
> > bootstrap the property graph model entirely and only with gremlin.
> > 
> 
> 
> Cool.
> 
>  
> > 
> > > The problem right now is that Gremlin's declarative semantics
> > > aren't very clear, and it is a relatively complex language.
> > 
> > 
> > This is not an attempt at a specification of the gremlin language.
> > It is only an attempt at formally specifying the implicit property
> > graph model assumed by the gremlin language. My understanding is
> > that the gremlin language will be formally defined by the antlr
> > grammar accompanied with documentation in English.
> > 
> 
> 
> Understood, though the ANTLR grammar in gremlin-language is only a
> specification of the surface syntax, not the semantics of the
> language. A specification of the semantics would define how the
> various steps map inputs to outputs, and how intermediate results are
> combined. Not what either of us are talking about here.
> 
>  
> > > I agree, and I think there is value in going one step further to
> > > create a general purpose data model for defining data models,
> > > with property graphs as a special case.
> > 
> > 
> > Here I do not agree. While there certainly is value in meta meta
> > models I do not think actually designing a new one belongs in
> > TinkerPop. TinkerPop is about the gremlin language and the property
> > graph model, not about meta meta models. The job of creating deeper
> > more abstract models with all that it entails is in my opinion a
> > huge task that has little to do with TinkerPop, gremlin and its property
> > graph model.
> > 
> 
> 
> To each his own. One of the main advantages of a general-purpose
> model is that it allows you to define mappings between property
> graphs and other, unrelated data models. That can be useful for
> shipping data into and out of the graph. Lack of robust solutions
> around mappings to/from external data models has always been one of
> the major pain points of the property graph ecosystem. Everyone who
> undertakes to build larger, more complex property graph applications
> has to deal with this problem.
> 
> 
>  
> > Here it is the same critique. There is no need to say that a vertex
> > together with its label is in fact a type with a name. Type is not
> > a notion in gremlin nor a notion in our meta model so it's not part
> > of our language.
> > 
> 
> 
> But defining types, and checking instances against types, is exactly
> what you are doing in your property graph model example. VertexLabel
> is a type, any instance of which has a string-valued label and zero
> or more VertexProperty-valued properties. Graph is a type, any
> instance of which can have vertices and/or edges. I'm just shifting
> your idea down a level to say that Person is a type, any instance of
> which has a string-valued name and zero or more Person-valued "knows"
> etc. You don't have to call your constructs "types", but it's useful
> to do so. Using the terminology "type", "type inference" etc. just
> puts you in a better position to re-use applicable concepts from
> programming language theory. Runtime performance becomes easier to
> reason about, etc.
> 
>  
> > 
> > > Cool, except that I would banish types like Date and Time
> > 
> > 
> > I have no strong intuitions about this art/science. Perhaps the
> > meta model should be extended to provide some support for non
> > primitive data types.
> > 
> 
> 
> IMO that's what you're already doing by assigning names to what I
> would call complex types like the ones in your example. As an
> intermediate example, imagine a type like LatLon, which you could
> model as a vertex with two properties.
> 
>  
> > 
> > I was actually hoping to avoid some arbitrary attempt at defining a
> > long list of possible primitives. I looked on the internet but it
> > seems there is no standards body out there for this, with every
> > language and database defining its own types. Perhaps the long list
> > is the only solution?
> > 
> 
> 
> No, a big enumeration of numeric types by precision is not the only
> way to go, but I currently prefer that approach over e.g.
> parameterized types (e.g. an integer type is constructed using two
> parameters: signedness and precision. This allows an unlimited number
> of integer types) because it's just simpler, and simplifies the
> supporting code you have to write.
> 
>  
> > Same critique as above. Letting in another language means gremlin
> > does not bootstrap itself.
> > 
> 
> 
> Similar response as above. You're defining a language whether you
> like it or not. The terms in your language are "Graph",
> "EdgeProperty", etc. You're using Gremlin as the medium for
> expressing the language, but you're still creating something new. The
> "something new" is the language I am talking about, not the Gremlin
> syntax you're using to define it.
> 
> 
> > > I don't see your approach of embedding model definitions and
> > > constraints natively in Gremlin as being at odds with having a
> > > formal data model.
> > 
> > 
> > Afraid I do see them as being at odds with one another. Describing
> > gremlin using another language, be it MOF/EMF/category theory is a
> > very big difference to it being self describing. If we decide
> > against gremlin self describing then we abort this attempt, no
> > point in hacking it.
> > 
> 
> 
> Not sure we fully understood each other, but it's your idea; I'm just
> giving you the requested feedback.
> 
>  
> > For what it's worth this is a bit of a proof of concept. To see if
> > gremlin can meaningfully self describe. It has done so for the last
> > 10 years.
> > 
> 
> 
> I think it's a worthwhile thing to do, though when you say it like
> that, I have to comment that making *Gremlin* self-describe is a
> much, much (much) bigger problem than defining a schema language
> within Gremlin. I think both problems are solvable, but the former is
> definitely a TinkerPop 4 proposition.
> 
>  
> > Perhaps we should, however, before discussing the merits of this
> > approach or another, first decide what we are trying to achieve in
> > the first place.
> > 
> 
> 
> +1
> 
>  
> > Here goes my understanding of what we are trying to achieve.
> > 
> > 1: A property graph meta model. To describe exactly what kind of
> > data structure the gremlin language operates on.
> > 
> 
> 
> +1
> 
>  
> > 2: Gremlin grammar together with the documentation specifies
> > gremlin the language fully.
> > 
> 
> 
> The surface syntax of the language (enough for expressing your schema
> constraints), yes.
> 
>  
> > 3: Extend the gremlin grammar to specify schema create/edit/delete
> > functionality.
> > 
> 
> 
> Why is that necessary, if you're embedding schemas in the graph? Just
> embed them in the graph. We don't have extra grammar for updating
> other types of graphs.
> 
>  
> > 4: Extend the grammar to query the schema. (This can be plain
> > gremlin, just operating at the schema level)
> > 
> 
> 
> Yeah, just plain Gremlin.
> 
>  
> > 5: A language agnostic specification of how to interact with a
> > remote gremlin enabled system. i.e. similar to the jdbc
> > specification only without reference to any particular language.
> > 
> 
> 
> Seems orthogonal to the language, and generation of constraints into
> Gremlin syntax.
>  
> 
> > As an aside, breaking user space should not even be considered.
> > i.e. 99% backward compatibility should be guaranteed at all times.
> > 
> 
> 
> I think you can do what you are proposing with no changes at all to
> the Gremlin language.
> 
> 
> Josh
>  
> 
>  
> > 
> > 
> > 
> > On Tue, 2022-01-11 at 10:47 -0800, Joshua Shinavier wrote:
> > > Hey Pieter,
> > > 
> > > Good to see some more motion on this front. Responses inline.
> > > 
> > > 
> > > On Sun, Jan 9, 2022 at 4:28 AM pieter gmail
> > > <pieter.mar...@gmail.com> wrote:
> > > > Hi,
> > > > 
> > > > I have done some work on defining a meta model for Gremlin's
> > > > property graph. I am using the approach used in the modelling
> > > > world, in particular as done by the OMG group when defining
> > > > their various meta models and specifications.
> > > > 
> > > 
> > > 
> > > +1 to using or drawing upon standards where we can. For those of
> > > us (including me) who have not worked with OMG standards other
> > > than occasionally bumping into UML, which parts of the approach
> > > you describe below were influenced by OMG?
> > > 
> > >  
> > > > However where OMG uses a subset of the UML to define their meta
> > > > models I suggest we use Gremlin. After all Gremlin is the
> > > > language we use to describe the world and the property graph
> > > > meta model can also be described in Gremlin.
> > > > 
> > > 
> > > 
> > > I agree, as long as these descriptions do not admit "arbitrary
> > > Gremlin". The problem right now is that Gremlin's declarative
> > > semantics aren't very clear, and it is a relatively complex
> > > language. I totally agree that you could define a DSL for
> > > defining models which could be embedded in Gremlin; you could
> > > even define the DSL in terms of itself.
> > > 
> > >  
> > > > I propose that we have 3 levels of modelling. Each of which can
> > > > itself be specified in gremlin.
> > > > 
> > > > 1: The property graph meta model.
> > > > 
> > > 
> > > 
> > > +1
> > > 
> > >  
> > > > 2: The model.
> > > > 
> > > 
> > > 
> > > I like the term "schema".
> > > 
> > >  
> > > > 3: The graph representing the actual data.
> > > > 
> > > 
> > > 
> > > +1. Not only is the graph a "model", but depending on how you
> > > define the modeling DSL, you can also see the other two models as
> > > "graphs", with types as elements.
> > > 
> > >  
> > > > 1) The property graph meta model describes the nature of the
> > > > property graph itself. i.e. that property graphs have vertices,
> > > > edges and properties.
> > > > 
> > > 
> > > 
> > > I agree, and I think there is value in going one step further to
> > > create a general purpose data model for defining data models,
> > > with property graphs as a special case.
> > > 
> > >  
> > > > 2) The model is an instance of the meta model. It describes the
> > > > schema of a particular graph. i.e. for TinkerPop's modern graph
> > > > this would be 'person', 'software', 'created' and 'knows' and
> > > > the various properties 'weight', 'age', 'name' and 'lang'
> > > > properties.
> > > > 
> > > 
> > > 
> > > +1
> > >  
> > > 
> > > > 3) The final level is an instance of the model. It is the
> > > > actual graph itself. i.e. for TinkerPop's modern graph it is
> > > > 'Marko', 'Josh', 'java' ...
> > > > 
> > > 
> > > 
> > > Yes. So to elaborate on what I said above about models and
> > > graphs, let's say we add a schema to the TinkerPop classic graph.
> > > The classic graph is an instance of the schema, and the schema is
> > > an instance of a property graph schema. Your three models are
> > > three graphs:
> > > 1) the classic graph ("data graph") has elements "Marko", "Josh",
> > > "ripple" etc. each of which is a value together with a type and a
> > > name (id). The type of Marko is "Person" (a named type) and the
> > > type of ripple is "Project" etc. The value of Marko is the record
> > > {"name": "marko", "age": 29} while the value of ripple is
> > > {"name": "ripple", "lang": "java"}.
> > > 2) the schema of the classic graph ("schema graph") has elements
> > > "Person", "Project", "knows", and "created". These again are
> > > values together with types and ids. E.g. the type of "Person" is
> > > something like {"name": string, "age": int32}, i.e. a record
> > > type.
> > > 3) the schema of the schema of the classic graph -- i.e. the core
> > > model or what you called the meta model -- is again a graph with
> > > elements like "Type", "Element", etc. Type expressions in the
> > > schema of the classic graph are values in the core model. The
> > > core model is its own schema.
> > > 
> > > Decide for yourself if the above makes sense to you, but this is
> > > how I think of the TinkerPop modeling layer cake these days -- as
> > > chained models in which the schema of one graph is the data of
> > > the next, usually arriving at a fixpoint -- the core -- within
> > > two steps.
> > > 
> > > 
> > > 
> > > > 1: Property Graph Meta Model
> > > > 
> > > >     public static Graph gremlinMetaModel() {
> > > >         enum GremlinDataType {
> > > >             STRING,
> > > >             INTEGER,
> > > >             DOUBLE,
> > > >             DATE,
> > > >             TIME
> > > >             //...
> > > >         }
> > > > 
> > > 
> > > 
> > > Cool, except that I would banish types like Date and Time from
> > > the core model. Drawing the line between primitive types and
> > > derived types is more art than science, but there is enough
> > > variation in what developers want out of dates/times that I put
> > > them on the other side of the fence. It also makes
> > > implementations easier if you have as few baked-in types as
> > > possible. On the other hand, I suggest adding many more numeric
> > > types, e.g. for integers:
> > > > - bigint
> > > > - int8
> > > > - int16
> > > > - int32
> > > > - int64
> > > > - uint8
> > > > - uint16
> > > > - uint32
> > > > - uint64
> > > 
> > > and for floating-point numbers:
> > > > - name: bigfloat
> > > > - name: float32
> > > > - name: float64
> > > 
> > > 
> > > > [snip metamodel definition]
> > > > 
> > > > 
> > > > 
> > > > This can be visualized as,
> > > > ...
> > > > 
> > > 
> > >  
> > > 
> > > I'm not sure if I'm reading this correctly, and I can't see the
> > > figure yet, but I understand that you are defining the metamodel
> > > as a graph. Cool.
> > > 
> > > 
> > >  
> > > > 
> > > > Notes: 
> > > > 1) GremlinDataType is an enumeration of named data types that
> > > > Gremlin supports. All gremlin data types are assumed to be
> > > > atomic and its life cycle fully owned by its containing parent.
> > > > How it is persisted on disc or transported over the wire is not
> > > > a concern for the meta model.
> > > > 
> > > 
> > > 
> > > Agree with most. Primitive/literal types are atomic, but you
> > > should be also able to define complex data types and bind them to
> > > names, and that is essentially what you are doing in the above.
> > > 
> > >  
> > > > 2) Gremlin's semantics is too weak to fully specify a valid meta
> > > > model. Accompanying the meta model we need a list of
> > > > constraints specified as gremlin queries to augment the
> > > > semantics of the meta model. These constraints/queries will be
> > > > able to validate any gremlin specified model for correctness.
> > > > 
> > > 
> > > 
> > > Or the other way around: we define a core model as its own thing
> > > using a well-defined, controlled vocabulary, then map it into
> > > Gremlin.
> > > 
> > > 
> > > > 3) It is trivial to extend the meta model. e.g. To specify
> > > > something like index support just add an 'Index' vertex and an
> > > > edge from 'VertexLabel' to it.
> > > > 
> > > 
> > > 
> > > I would say that you're extending a second-order model in that
> > > case. The core model / metamodel should be constant, but you can
> > > define additional models on top of it.
> > > 
> > >  
> > > > Property graph meta model constraints,
> > > > 
> > > > [...]
> > > > 
> > > 
> > > 
> > > Cool (though here, too, I would define constraints in a limited
> > > DSL, and map them into Gremlin).
> > > 
> > >  
> > > > 2: The model
> > > > 
> > > > What follows is an example of TinkerPop's 'modern' graph
> > > > specified as an instance of the above property graph meta
> > > > model.
> > > > [...]
> > > > 
> > > 
> > > 
> > > Cool.
> > > 
> > >  
> > > > 
> > > > There are lots of details to complete, but first we need to see
> > > > if there is any appetite for a modelling approach as I realize
> > > > there is some academic abstract algebra work happening
> > > > elsewhere.
> > > > 
> > > 
> > > 
> > > There is, but I don't see your approach of embedding model
> > > definitions and constraints natively in Gremlin as being at odds
> > > with having a formal data model. Have cake, eat it too.
> > > 
> > >  
> > > > It seems to me to have a lower barrier to entry for the
> > > > community to partake in the discussion of what constitutes a
> > > > property graph model.
> > > > 
> > > 
> > > 
> > > That's important. In my opinion, having the formal model defined
> > > up front gives you more power and flexibility for graph
> > > validation, transformations, and inference, but having the
> > > abstract model, you can also build developer-friendly DSLs on top
> > > of it.
> > > 
> > > 
> > > > Let me know if there are questions or criticisms.
> > > > 
> > > 
> > > 
> > > One of the nice things about your proposal is that it doesn't
> > > increase Java tech debt; you're suggesting defining models using
> > > Gremlin syntax, which is language variant -neutral. +1 to that.
> > > 
> > >  Josh
> > > 
> > > 
> > > 
> > 
> >
