Hey Pieter,

Good to see some more motion on this front. Responses inline.

On Sun, Jan 9, 2022 at 4:28 AM pieter gmail <pieter.mar...@gmail.com> wrote:

> Hi,
>
> I have done some work on defining a meta model for Gremlin's property
> graph. I am using the approach used in the modelling world, in particular
> as done by the OMG <https://www.omg.org/> group when defining their
> various meta models and specifications.

+1 to using or drawing upon standards where we can. For those of us (including me) who have not worked with OMG standards other than occasionally bumping into UML, which parts of the approach you describe below were influenced by OMG?

> However where OMG uses a subset of the UML to define their meta models I
> suggest we use Gremlin. After all Gremlin is the language we use to
> describe the world and the property graph meta model can also be described
> in Gremlin.

I agree, as long as these descriptions do not admit "arbitrary Gremlin". The problem right now is that Gremlin's declarative semantics aren't very clear, and it is a relatively complex language. I totally agree that you could define a DSL for defining models which could be embedded in Gremlin; you could even define the DSL in terms of itself.

> I propose that we have 3 levels of modelling. Each of which can itself be
> specified in gremlin.
>
> 1: The property graph meta model.

+1

> 2: The model.

I like the term "schema".

> 3: The graph representing the actual data.

+1. Not only is the graph a "model", but depending on how you define the modeling DSL, you can also see the other two models as "graphs", with types as elements.

> 1) The property graph meta model describes the nature of the property
> graph itself. i.e. that property graphs have vertices, edges and properties.

I agree, and I think there is value in going one step further to create a general purpose data model for defining data models, with property graphs as a special case.

> 2) The model is an instance of the meta model. It describes the schema of
> a particular graph. i.e.
> for TinkerPop's modern graph this would be
> 'person', 'software', 'created' and 'knows' and the various properties
> 'weight', 'age', 'name' and 'lang' properties.

+1

> 3) The final level is an instance of the model. It is the actual graph
> itself. i.e. for TinkerPop's modern graph it is 'Marko', 'Josh', 'java' ...

Yes. So to elaborate on what I said above about models and graphs, let's say we add a schema to the TinkerPop classic graph. The classic graph is an instance of the schema, and the schema is an instance of a property graph schema. Your three models are three graphs:

1) the classic graph ("data graph") has elements "Marko", "Josh", "ripple" etc. each of which is a value together with a type and a name (id). The type of Marko is "Person" (a named type) and the type of ripple is "Project" etc. The value of Marko is the record {"name": "marko", "age": 29} while the value of ripple is {"name": "ripple", "lang": "java"}.

2) the schema of the classic graph ("schema graph") has elements "Person", "Project", "knows", and "created". These again are values together with types and ids. E.g. the type of "Person" is something like {"name": string, "age": int32}, i.e. a record type.

3) the schema of the schema of the classic graph -- i.e. the core model or what you called the meta model -- is again a graph with elements like "Type", "Element", etc. Type expressions in the schema of the classic graph are values in the core model. The core model is its own schema.

Decide for yourself if the above makes sense to you, but this is how I think of the TinkerPop modeling layer cake these days -- as chained models in which the schema of one graph is the data of the next, usually arriving at a fixpoint -- the core -- within two steps.

> 1: Property Graph Meta Model
>
> public static Graph gremlinMetaModel() {
>     enum GremlinDataType {
>         STRING,
>         INTEGER,
>         DOUBLE,
>         DATE,
>         TIME
>         //...
>     }

Cool, except that I would banish types like Date and Time from the core model. Drawing the line between primitive types and derived types is more art than science, but there is enough variation in what developers want out of dates/times that I put them on the other side of the fence. It also makes implementations easier if you have as few baked-in types as possible. On the other hand, I suggest adding many more numeric types, e.g. for integers:

- bigint
- int8
- int16
- int32
- int64
- uint8
- uint16
- uint32
- uint64

and for floating-point numbers:

- bigfloat
- float32
- float64

[snip metamodel definition]

> This can be visualized as,
> ...

I'm not sure if I'm reading this correctly, and I can't see the figure yet, but I understand that you are defining the metamodel as a graph. Cool.

> Notes:
> 1) GremlinDataType is an enumeration of named data types that Gremlin
> supports. All gremlin data types are assumed to be atomic and its life
> cycle fully owned by its containing parent. How it is persisted on disc or
> transported over the wire is not a concern for the meta model.

Agree with most. Primitive/literal types are atomic, but you should also be able to define complex data types and bind them to names, and that is essentially what you are doing in the above.

> 2) Gremlin's semantics is too weak to fully specify a valid meta model.
> Accompanying the meta model we need a list of constraints specified as
> gremlin queries to augment the semantics of the meta model. These
> constraints/queries will be able to validate any gremlin specified model
> for correctness.

Or the other way around: we define a core model as its own thing using a well-defined, controlled vocabulary, then map it into Gremlin.

> 3) It is trivial to extend the meta model. e.g. To specify something like
> index support just add an 'Index' vertex and an edge from 'VertexLabel' to
> it.

I would say that you're extending a second-order model in that case.
The core model / metamodel should be constant, but you can define additional models on top of it.

> Property graph meta model constraints,
>
> [...]

Cool (though here, too, I would define constraints in a limited DSL, and map them into Gremlin).

> 2: The model
>
> What follows is an example of TinkerPop's 'modern' graph specified as an
> instance of the above property graph meta model.
> [...]

Cool.

> There are lots of details to complete, but first we need to see if there
> is any appetite for a modelling approach as I realize there is some
> academic abstract algebra work happening elsewhere.

There is, but I don't see your approach of embedding model definitions and constraints natively in Gremlin as being at odds with having a formal data model. Have cake, eat it too.

> It seems to me to have a lower barrier to entry for the community to
> partake in the discussion of what constitutes a property graph model.

That's important. In my opinion, having the formal model defined up front gives you more power and flexibility for graph validation, transformations, and inference, and once you have the abstract model, you can also build developer-friendly DSLs on top of it.

> Let me know if there are questions or criticisms.

One of the nice things about your proposal is that it doesn't increase Java tech debt; you're suggesting defining models using Gremlin syntax, which is language-variant-neutral. +1 to that.

Josh
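
To make the "data graph is an instance of the schema graph" idea above a little more concrete, here is a minimal sketch in plain Java. It is an illustration only, not TinkerPop API: the class `LayerCakeSketch`, the `Primitive` enum, and the helpers `classicSchema`, `record`, and `conformsTo` are all hypothetical names. The "Person" record type is the one given above; the "Project" record type is an assumption inferred from the value of ripple. The sketch covers only the bottom two layers, with the enum standing in for a small piece of the core model's vocabulary.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LayerCakeSketch {

    /** A few primitive types, standing in for the core model's vocabulary (hypothetical). */
    enum Primitive { STRING, INT32, FLOAT64 }

    /** Schema graph: named types, each a record type mapping field names to primitives. */
    static Map<String, Map<String, Primitive>> classicSchema() {
        Map<String, Primitive> person = new LinkedHashMap<>();
        person.put("name", Primitive.STRING);
        person.put("age", Primitive.INT32);

        // Assumption: inferred from the value of "ripple", not stated as a type in the email.
        Map<String, Primitive> project = new LinkedHashMap<>();
        project.put("name", Primitive.STRING);
        project.put("lang", Primitive.STRING);

        Map<String, Map<String, Primitive>> schema = new LinkedHashMap<>();
        schema.put("Person", person);
        schema.put("Project", project);
        return schema;
    }

    /** Builds a record value from alternating field names and field values. */
    static Map<String, Object> record(Object... keyValues) {
        Map<String, Object> fields = new LinkedHashMap<>();
        for (int i = 0; i < keyValues.length; i += 2) {
            fields.put((String) keyValues[i], keyValues[i + 1]);
        }
        return fields;
    }

    /** True if the value has exactly the fields of the record type, each of the declared primitive type. */
    static boolean conformsTo(Map<String, Object> value, Map<String, Primitive> recordType) {
        if (!value.keySet().equals(recordType.keySet())) {
            return false;
        }
        for (Map.Entry<String, Primitive> field : recordType.entrySet()) {
            Object v = value.get(field.getKey());
            switch (field.getValue()) {
                case STRING:
                    if (!(v instanceof String)) return false;
                    break;
                case INT32:
                    if (!(v instanceof Integer)) return false;
                    break;
                case FLOAT64:
                    if (!(v instanceof Double)) return false;
                    break;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Primitive>> schema = classicSchema();
        Map<String, Object> marko = record("name", "marko", "age", 29);
        Map<String, Object> ripple = record("name", "ripple", "lang", "java");

        System.out.println(conformsTo(marko, schema.get("Person")));   // true
        System.out.println(conformsTo(ripple, schema.get("Project"))); // true
        System.out.println(conformsTo(ripple, schema.get("Person")));  // false
    }
}
```

The same check could in principle be applied one level up, treating each record type in the schema as a value and checking it against a type in the core model; that step is left out here to keep the sketch short.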