Re: A meta model for gremlin's property graph

Joshua Shinavier Sun, 16 Jan 2022 08:41:24 -0800

Hi Pieter,

Responses inline.


On Sat, Jan 15, 2022 at 9:49 AM pieter gmail <pieter.mar...@gmail.com>
wrote:

> [...]
> The primary inspiration from UML is the insight that a language can be
> self describing.  It is of course inevitable in the real world as we can
> not tolerate infinite regression with regards to every level needing yet
> another meta level to describe it.
>

Yes, and it has that in common with many other languages, starting with
BNF. For a scripting language like Gremlin, you can even speak of
self-interpretation.



> [...]
> To be clear I am not using any OMG standard as such. If we were to do that
> we would define the property graph model using MOF
> <https://www.omg.org/spec/MOF/2.4.2/> (meta object facility) or its
> counterpart EMF <https://www.eclipse.org/modeling/emf/>. While this is
> entirely possible it is not the approach taken here. Here the attempt is to
> bootstrap the property graph model entirely and only with gremlin.
>

Cool.



>
> The problem right now is that Gremlin's declarative semantics aren't very
> clear, and it is a relatively complex language.
>
>
> This is not an attempt at a specification of the gremlin language. It is
> only an attempt at formally specifying the implicit property graph model
> assumed by the gremlin language. My understanding is that the gremlin
> language will be formally defined by the antlr grammar accompanied with
> documentation in English.
>

Understood, though the ANTLR grammar in gremlin-language is only a
specification of the surface syntax, not the semantics of the language. A
specification of the semantics would define how the various steps map
inputs to outputs, and how intermediate results are combined. That is not
what either of us is talking about here.



> I agree, and I think there is value in going one step further to create a
> general purpose data model for defining data models, with property graphs
> as a special case.
>
>
> Here I do not agree. While there certainly is value in meta meta models I
> do not think actually designing a new one belongs in TinkerPop. TinkerPop
> is about the gremlin language and the property graph model, not about meta
> meta models. The job of creating deeper more abstract models with all that
> it entails is in my opinion a huge task that has little to do with TinkerPop,
> gremlin and its property graph model.
>

To each his own. One of the main advantages of a general-purpose model is
that it allows you to define mappings between property graphs and other,
unrelated data models. That can be useful for shipping data into and out of
the graph. Lack of robust solutions around mappings to/from external data
models has always been one of the major pain points of the property graph
ecosystem. Everyone who undertakes to build larger, more complex property
graph applications has to deal with this problem.




> Here it is the same critique. There is no need to say that a vertex
> together with its label is in fact a type with a name. Type is not a notion
> in gremlin nor a notion in our meta model so it's not part of our language.
>

But defining types, and checking instances against types, is exactly what
you are doing in your property graph model example. VertexLabel is a type,
any instance of which has a string-valued label and zero or more
VertexProperty-valued properties. Graph is a type, any instance of which
can have vertices and/or edges. I'm just shifting your idea down a level to
say that Person is a type, any instance of which has a string-valued name
and zero or more Person-valued "knows" etc. You don't have to call your
constructs "types", but it's useful to do so. Using the terminology "type",
"type inference" etc. just puts you in a better position to
re-use applicable concepts from programming language theory. Runtime
performance becomes easier to reason about, etc.
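
To make that concrete, here is a minimal sketch in plain Java (no TinkerPop
dependency) of what I mean by checking an instance against a type. All of
the names (VertexType, VertexInstance, TypeChecker) are made up for
illustration; they are not TinkerPop API and not part of your proposal. The
check is deliberately shallow: it looks at the labels of adjacent vertices
but does not recurse into them.

```java
import java.util.List;
import java.util.Map;

// Hypothetical names throughout; none of this is TinkerPop API.

/** A vertex type: a label, required properties with value classes, and allowed out-edges. */
final class VertexType {
    final String label;
    final Map<String, Class<?>> properties; // property key -> required value class
    final Map<String, String> outEdges;     // edge label -> label of the target type

    VertexType(String label, Map<String, Class<?>> properties, Map<String, String> outEdges) {
        this.label = label;
        this.properties = properties;
        this.outEdges = outEdges;
    }
}

/** A vertex in the data graph, reduced to what the checker needs. */
final class VertexInstance {
    final String label;
    final Map<String, Object> properties;
    final Map<String, List<VertexInstance>> out; // edge label -> adjacent vertices

    VertexInstance(String label, Map<String, Object> properties,
                   Map<String, List<VertexInstance>> out) {
        this.label = label;
        this.properties = properties;
        this.out = out;
    }
}

final class TypeChecker {
    /** Shallow conformance check of one vertex against one vertex type. */
    static boolean conforms(VertexInstance v, VertexType t) {
        if (!v.label.equals(t.label)) return false;
        // Every declared property must be present with a value of the declared class.
        for (Map.Entry<String, Class<?>> e : t.properties.entrySet()) {
            if (!e.getValue().isInstance(v.properties.get(e.getKey()))) return false;
        }
        // Undeclared properties are rejected.
        if (!t.properties.keySet().containsAll(v.properties.keySet())) return false;
        // Every out-edge label must be declared, and each target must carry the declared label.
        for (Map.Entry<String, List<VertexInstance>> e : v.out.entrySet()) {
            String targetLabel = t.outEdges.get(e.getKey());
            if (targetLabel == null) return false;
            for (VertexInstance target : e.getValue()) {
                if (!target.label.equals(targetLabel)) return false;
            }
        }
        return true;
    }
}
```

With Person declared as new VertexType("Person", Map.of("name",
String.class), Map.of("knows", "Person")), a Person vertex with a string
name and any number of "knows" edges to other Person vertices conforms, and
one whose name is an integer does not.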



> Cool, except that I would banish types like Date and Time
>
>
> I have no strong intuitions about this art/science. Perhaps the meta model
> should be extended to provide some support for non primitive data types.
>

IMO that's what you're already doing by assigning names to what I would
call complex types like the ones in your example. As an intermediate
example, imagine a type like LatLon, which you could model as a vertex with
two properties.
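
As a rough illustration of binding a name to a complex type, here is a
small plain-Java sketch. The names (Primitive, NamedRecordType,
LatLonExample) and the choice of float64 for both fields are my
assumptions for the example, not a proposal for the actual core model.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical names; a sketch, not a proposal for the actual core model.

/** A few primitive (atomic) types, each tied to the Java class that carries its values. */
enum Primitive {
    STRING(String.class), INT32(Integer.class), FLOAT64(Double.class);

    final Class<?> javaClass;

    Primitive(Class<?> javaClass) {
        this.javaClass = javaClass;
    }
}

/** A complex type: a name bound to an ordered set of primitive-typed fields. */
final class NamedRecordType {
    final String name;
    final Map<String, Primitive> fields = new LinkedHashMap<>();

    NamedRecordType(String name) {
        this.name = name;
    }

    /** Declares one field and returns this type so declarations can be chained. */
    NamedRecordType field(String key, Primitive type) {
        fields.put(key, type);
        return this;
    }

    /** True if the value has exactly the declared keys, each holding the declared type. */
    boolean admits(Map<String, Object> value) {
        if (!value.keySet().equals(fields.keySet())) return false;
        for (Map.Entry<String, Primitive> e : fields.entrySet()) {
            if (!e.getValue().javaClass.isInstance(value.get(e.getKey()))) return false;
        }
        return true;
    }
}

final class LatLonExample {
    /** LatLon modeled like a vertex label with two float64 properties. */
    static NamedRecordType latLon() {
        return new NamedRecordType("LatLon")
                .field("lat", Primitive.FLOAT64)
                .field("lon", Primitive.FLOAT64);
    }
}
```

Nothing about LatLon is baked into the primitives here; it is just a name
plus two typed fields, which is the same shape as a vertex label with two
property keys.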



> I was actually hoping to avoid some arbitrary attempt at defining a long
> list of possible primitives. I looked on the internet but it seems there is
> no standards body out there for this, with every language and database defining
> its own types. Perhaps the long list is the only solution?
>

No, a big enumeration of numeric types by precision is not the only way to
go, but I currently prefer that approach over e.g. parameterized types,
where an integer type is constructed from two parameters, signedness and
precision, which allows an unlimited number of integer types. The
enumeration is just simpler, and it simplifies the supporting code you have
to write.
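
Roughly, the two options look like this. This is a sketch in plain Java
with made-up names (IntType, ParamIntType, Bounds); bigint is left out
because it has no fixed bounds.

```java
import java.math.BigInteger;

// Hypothetical names; bigint is omitted because it has no fixed bounds.

/** Option A: a closed enumeration of integer types by signedness and precision. */
enum IntType {
    INT8(true, 8), INT16(true, 16), INT32(true, 32), INT64(true, 64),
    UINT8(false, 8), UINT16(false, 16), UINT32(false, 32), UINT64(false, 64);

    final boolean signed;
    final int bits;

    IntType(boolean signed, int bits) {
        this.signed = signed;
        this.bits = bits;
    }

    BigInteger min() { return Bounds.min(signed, bits); }
    BigInteger max() { return Bounds.max(signed, bits); }

    /** True if v lies within the inclusive range of this type. */
    boolean admits(BigInteger v) {
        return v.compareTo(min()) >= 0 && v.compareTo(max()) <= 0;
    }
}

/** Option B: a parameterized type; every valid (signedness, precision) pair is a type. */
final class ParamIntType {
    final boolean signed;
    final int bits;

    ParamIntType(boolean signed, int bits) {
        if (bits < 1) throw new IllegalArgumentException("precision must be at least 1 bit");
        this.signed = signed;
        this.bits = bits;
    }

    BigInteger min() { return Bounds.min(signed, bits); }
    BigInteger max() { return Bounds.max(signed, bits); }

    /** True if v lies within the inclusive range of this type. */
    boolean admits(BigInteger v) {
        return v.compareTo(min()) >= 0 && v.compareTo(max()) <= 0;
    }
}

/** Shared range arithmetic for two's-complement signed and unsigned integers. */
final class Bounds {
    static BigInteger min(boolean signed, int bits) {
        return signed ? BigInteger.ONE.shiftLeft(bits - 1).negate() : BigInteger.ZERO;
    }

    static BigInteger max(boolean signed, int bits) {
        return BigInteger.ONE.shiftLeft(signed ? bits - 1 : bits).subtract(BigInteger.ONE);
    }
}
```

With option A, supporting code (serializers, language-variant mappings and
so on) can switch over a fixed set of cases. With option B, it has to be
prepared for something like a 24-bit unsigned integer, which is where the
extra work comes from.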



> Same critique as above. Letting in another language means gremlin does not
> bootstrap itself.
>

Similar response as above. You're defining a language whether you like it
or not. The terms in your language are "Graph", "EdgeProperty", etc. You're
using Gremlin as the medium for expressing the language, but you're still
creating something new. The "something new" is the language I am talking
about, not the Gremlin syntax you're using to define it.


> I don't see your approach of embedding model definitions and constraints
> natively in Gremlin as being at odds with having a formal data model.
>
>
> Afraid I do see them as being at odds with one another. Describing gremlin
> using another language, be it MOF/EMF/category theory is a very big
> difference to it being self describing. If we decide against gremlin self
> describing then we abort this attempt, no point in hacking it.
>

Not sure we fully understood each other, but it's your idea; I'm just
giving you the requested feedback.



> For what it's worth this is a bit of a proof of concept. To see if gremlin
> can meaningfully self describe. It has done so for the last 10 years.
>

I think it's a worthwhile thing to do, though when you say it like that, I
have to comment that making *Gremlin* self-describe is a much, much (much)
bigger problem than defining a schema language within Gremlin. I think both
problems are solvable, but the former is definitely a TinkerPop 4
proposition.



> Perhaps we should, however, before discussing the merits of this approach
> or another, first decide what we are trying to achieve in the first place.
>

+1



> Here goes my understanding of what we are trying to achieve.
>
> 1: A property graph meta model. To describe exactly what kind of data
> structure the gremlin language operates on.
>

+1



> 2: Gremlin grammar together with the documentation specifies gremlin the
> language fully.
>

The surface syntax of the language (enough for expressing your schema
constraints), yes.



> 3: Extend the gremlin grammar to specify schema create/edit/delete
> functionality.
>

Why is that necessary, if you're embedding schemas in the graph? Just embed
them in the graph. We don't have extra grammar for updating other types of
graphs.



> 4: Extend the grammar to query the schema. (This can be plain gremlin,
> just operating at the schema level)
>

Yeah, just plain Gremlin.



> 5: A language agnostic specification of how to interact with a remote
> gremlin enabled system. i.e. similar to the jdbc specification only without
> reference to any particular language.
>

Seems orthogonal to the language, and to the generation of constraints into
Gremlin syntax.


> As an aside, breaking user space should not even be considered. i.e. 99%
> backward compatibility should be guaranteed at all times.
>

I think you can do what you are proposing with no changes at all to the
Gremlin language.


Josh




>
>
> On Tue, 2022-01-11 at 10:47 -0800, Joshua Shinavier wrote:
>
> Hey Pieter,
>
> Good to see some more motion on this front. Responses inline.
>
>
> On Sun, Jan 9, 2022 at 4:28 AM pieter gmail <pieter.mar...@gmail.com>
> wrote:
>
> Hi,
>
> I have done some work on defining a meta model for Gremlin's property
> graph. I am using the approach used in the modelling world, in particular
> as done by the OMG <https://www.omg.org/> group when defining their
> various meta models and specifications.
>
>
> +1 to using or drawing upon standards where we can. For those of us
> (including me) who have not worked with OMG standards other than
> occasionally bumping into UML, which parts of the approach you describe
> below were influenced by OMG?
>
>
>
> However where OMG uses a subset of the UML to define their meta models I
> suggest we use Gremlin. After all Gremlin is the language we use to
> describe the world and the property graph meta model can also be described
> in Gremlin.
>
>
> I agree, as long as these descriptions do not admit "arbitrary Gremlin".
> The problem right now is that Gremlin's declarative semantics aren't very
> clear, and it is a relatively complex language. I totally agree that you
> could define a DSL for defining models which could be embedded in Gremlin;
> you could even define the DSL in terms of itself.
>
>
>
> I propose that we have 3 levels of modelling. Each of which can itself be
> specified in gremlin.
>
> 1: The property graph meta model.
>
>
> +1
>
>
>
> 2: The model.
>
>
> I like the term "schema".
>
>
>
> 3: The graph representing the actual data.
>
>
> +1. Not only is the graph a "model", but depending on how you define the
> modeling DSL, you can also see the other two models as "graphs", with types
> as elements.
>
>
>
> 1) The property graph meta model describes the nature of the property
> graph itself. i.e. that property graphs have vertices, edges and properties.
>
>
> I agree, and I think there is value in going one step further to create a
> general purpose data model for defining data models, with property graphs
> as a special case.
>
>
>
> 2) The model is an instance of the meta model. It describes the schema of
> a particular graph. i.e. for TinkerPop's modern graph this would be
> 'person', 'software', 'created' and 'knows' and the various properties
> 'weight', 'age', 'name' and 'lang' properties.
>
>
> +1
>
>
> 3) The final level is an instance of the model. It is the actual graph
> itself. i.e. for TinkerPop's modern graph it is 'Marko', 'Josh', 'java' ...
>
>
> Yes. So to elaborate on what I said above about models and graphs, let's
> say we add a schema to the TinkerPop classic graph. The classic graph is an
> instance of the schema, and the schema is an instance of a property graph
> schema. Your three models are three graphs:
> 1) the classic graph ("data graph") has elements "Marko", "Josh", "ripple"
> etc. each of which is a value together with a type and a name (id). The
> type of Marko is "Person" (a named type) and the type of ripple is
> "Project" etc. The value of Marko is the record {"name": "marko", "age":
> 29} while the value of ripple is {"name": "ripple", "lang": "java"}.
> 2) the schema of the classic graph ("schema graph") has elements "Person",
> "Project", "knows", and "created". These again are values together with
> types and ids. E.g. the type of "Person" is something like {"name": string,
> "age": int32}, i.e. a record type.
> 3) the schema of the schema of the classic graph -- i.e. the core model or
> what you called the meta model -- is again a graph with elements like
> "Type", "Element", etc. Type expressions in the schema of the classic graph
> are values in the core model. The core model is its own schema.
>
> Decide for yourself if the above makes sense to you, but this is how I
> think of the TinkerPop modeling layer cake these days -- as chained models
> in which the schema of one graph is the data of the next, usually arriving
> at a fixpoint -- the core -- within two steps.
>
>
>
> 1: Property Graph Meta Model
>
>     public static Graph gremlinMetaModel() {
>
>         enum GremlinDataType {
>
>             STRING,
>
>             INTEGER,
>
>             DOUBLE,
>
>             DATE,
>
>             TIME
>
>             //...
>
>         }
>
>
>
> Cool, except that I would banish types like Date and Time from the core
> model. Drawing the line between primitive types and derived types is more
> art than science, but there is enough variation in what developers want out
> of dates/times that I put them on the other side of the fence. It also
> makes implementations easier if you have as few baked-in types as possible.
> On the other hand, I suggest adding many more numeric types, e.g. for
> integers:
>
> - bigint
> - int8
> - int16
> - int32
> - int64
> - uint8
> - uint16
> - uint32
> - uint64
>
> and for floating-point numbers:
>
> - bigfloat
> - float32
> - float64
>
>
> [snip metamodel definition]
>
>
>
> This can be visualized as,
> ...
>
>
>
> I'm not sure if I'm reading this correctly, and I can't see the figure
> yet, but I understand that you are defining the metamodel as a graph. Cool.
>
>
>
>
>
> Notes:
> 1) GremlinDataType is an enumeration of named data types that Gremlin
> supports. All gremlin data types are assumed to be atomic, with their life
> cycles fully owned by their containing parents. How they are persisted on disc or
> transported over the wire is not a concern for the meta model.
>
>
> Agree with most. Primitive/literal types are atomic, but you should be
> also able to define complex data types and bind them to names, and that is
> essentially what you are doing in the above.
>
>
>
> 2) Gremlin's semantics is too weak to fully specify a valid meta model.
> Accompanying the meta model we need a list of constraints specified as
> gremlin queries to augment the semantics of the meta model. These
> constraints/queries will be able to validate any gremlin specified model
> for correctness.
>
>
> Or the other way around: we define a core model as its own thing using a
> well-defined, controlled vocabulary, then map it into Gremlin.
>
>
> 3) It is trivial to extend the meta model. e.g. To specify something like
> index support just add an 'Index' vertex and an edge from 'VertexLabel' to
> it.
>
>
> I would say that you're extending a second-order model in that case. The
> core model / metamodel should be constant, but you can define additional
> models on top of it.
>
>
>
> Property graph meta model constraints,
>
> [...]
>
>
> Cool (though here, too, I would define constraints in a limited DSL, and
> map them into Gremlin).
>
>
>
> 2: The model
>
> What follows is an example of TinkerPop's 'modern' graph specified as an
> instance of the above property graph meta model.
> [...]
>
>
> Cool.
>
>
>
>
> There are lots of details to complete, but first we need to see if there
> is any appetite for a modelling approach as I realize there is some
> academic abstract algebra work happening elsewhere.
>
>
> There is, but I don't see your approach of embedding model definitions and
> constraints natively in Gremlin as being at odds with having a formal data
> model. Have cake, eat it too.
>
>
>
> It seems to me to have a lower barrier to entry for the community to
> partake in the discussion of what constitutes a property graph model.
>
>
> That's important. In my opinion, having the formal model defined up front
> gives you more power and flexibility for graph validation, transformations,
> and inference, but having the abstract model, you can also build
> developer-friendly DSLs on top of it.
>
>
> Let me know if there are questions or criticisms.
>
>
> One of the nice things about your proposal is that it doesn't increase
> Java tech debt; you're suggesting defining models using Gremlin syntax,
> which is language-variant-neutral. +1 to that.
>
>  Josh
>
>
>
>
