Hi, *** I’ve started a GoogleDoc called “A Multi-Model Data Type Specification.” ***
An abstract data type (ADT) is a data structure plus the operations that manipulate it. Classic examples include:

1. stacks — arrays with push() and pop() operations.
2. lists — arrays with add(), remove(), get(), etc. operations.
3. graphs — networks with out(), in(), etc. operations.
4. …

Databases can be defined by their ADT. Database ADTs typically involve a data structure plus indices and a set of data manipulation operations:

1. key/value — pairs + key-index with get(), remove(), put(), etc. operations.
2. relational — relations + indices with select(), project(), join(), etc. operations.
3. RDF — statements + spog-indices with subject(), predicate(), object(), match(), etc. operations.
4. graph — vertices + edges + indices with has(), values(), out(), in(), etc. operations.
5. …

In the spec thus far, I argue that the database industry has become overly fixated on classifying databases into discrete categories, each with its own unique terminology (vertices/edges, tables/rows, documents, statements) and overlapping operations. I believe the primary reason for this is that databases are monolithic systems composed of a query language, a processing engine, and a data storage system. When all these pieces are assembled by the database engineering team, a “data perspective” (ADT) is set in stone. I believe this has unnecessarily created database technology silos.

———

What we are trying to do at Apache TinkerPop is create a multi-model ADT that spans the various database categories by using a generic lexicon and a set of operations capable of performing all database operations. People may argue: “Why not just use the relational ADT and table/row lexicon, since it can emulate every other known ADT relatively naturally?” I believe we are basically doing that with n-tuples (i.e. "schemaless rows"). However, what makes our approach unique is that our ADT doesn’t assume that it will be used solely by a monolithic database system.
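To make the n-tuple idea concrete, here is a minimal sketch of a "schemaless row": a bag of key/value pairs that the same data can be read through via a key/value, relational, or property graph lexicon. All names here (NTuple, column, value) are illustrative stand-ins, not the TP4 API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A sketch of an n-tuple ("schemaless row"): no fixed schema, just
// key/value entries. Each method below is the same lookup wearing a
// different ADT's lexicon.
public class NTuple {
    private final Map<String, Object> entries = new LinkedHashMap<>();

    public NTuple put(String key, Object value) {
        entries.put(key, value);
        return this;
    }

    public Object get(String key) {     // key/value lexicon
        return entries.get(key);
    }

    public Object column(String name) { // relational lexicon: a row's column
        return entries.get(name);
    }

    public Object value(String name) {  // property graph lexicon: a vertex property
        return entries.get(name);
    }

    public static void main(String[] args) {
        NTuple tuple = new NTuple().put("name", "marko").put("age", 29);
        // One tuple, three lexicons, identical data underneath.
        System.out.println(tuple.get("name"));
        System.out.println(tuple.column("name"));
        System.out.println(tuple.value("name"));
    }
}
```

The point is that the category-specific vocabularies are views over one generic structure, not distinct structures.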
Instead, our ADT is designed on the assumption that the storage system, the processing engine, and the query language are independent components that are ultimately integrated into a “synthetic database” (a database that is custom-assembled to meet the data modeling and performance requirements of an end user’s applications). Synthetic databases are possible with our multi-model ADT.

A multi-model-ADT-compliant property graph query language assumes a basic property graph ADT embedding:

{graph}  V()
{vertex} id() label() outE() inE()
{edge}   id() label() outV() inV()

The query language says that it understands map-tuples ({}) of type graph, vertex, and edge. Moreover, along with core bytecode (has, values, …), it states that these tuples should be able to be processed by the provided property graph-specific instructions. Thus,

g.V(1).out('knows').values('name')
  === Gremlin compiles to Basic Property Graph Bytecode ==>
V().filter(id().is(1)).outE().filter(label().is('knows')).inV().values('name')

Without strategies, the above bytecode would execute as a series of inefficient linear “scan and filter” operations — the basic functional requirements of a “property graph.” However, the data storage system says that it has various indices (and accessing instructions) for these types of tuples:

{graph}  V() V(object)
{vertex} id() label() outE() inE() out(string...) in(string...)
{edge}   id() label() outV() inV()

Thus, its property graph ADT extends the basic property graph ADT used by the query language. This enables TP4 strategies to rewrite the submitted bytecode to use the data storage system’s supported instructions (i.e. optimizations).
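A strategy of this kind can be sketched as a rewrite over a list of instructions: when the storage system advertises V(object), the scan-and-filter pattern V().filter(id().is(x)) collapses into the indexed lookup V(x). The Instruction class and the rewrite rule below are illustrative stand-ins, not the TP4 strategy API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// A minimal sketch of a compile-time strategy: bytecode is a list of
// instructions; the strategy pattern-matches a "scan and filter" pair
// and replaces it with a provider-supported indexed instruction.
public class StrategySketch {

    public static class Instruction {
        final String op;
        final Object[] args;
        public Instruction(String op, Object... args) {
            this.op = op;
            this.args = args;
        }
        public String toString() { return op + Arrays.toString(args); }
    }

    // Rewrite V().filter(id().is(x)) -> V(x). Here the filter step is
    // modeled as a single "filterIdIs" instruction for brevity.
    public static List<Instruction> applyIdStrategy(List<Instruction> bytecode) {
        List<Instruction> rewritten = new ArrayList<>();
        for (int i = 0; i < bytecode.size(); i++) {
            Instruction current = bytecode.get(i);
            if (current.op.equals("V") && current.args.length == 0
                    && i + 1 < bytecode.size()
                    && bytecode.get(i + 1).op.equals("filterIdIs")) {
                // Fold the filter's argument into an indexed V(x) lookup.
                rewritten.add(new Instruction("V", bytecode.get(i + 1).args));
                i++; // consume the filter instruction
            } else {
                rewritten.add(current);
            }
        }
        return rewritten;
    }

    public static void main(String[] args) {
        List<Instruction> bytecode = new ArrayList<>(Arrays.asList(
            new Instruction("V"),
            new Instruction("filterIdIs", 1),
            new Instruction("out", "knows"),
            new Instruction("values", "name")));
        System.out.println(applyIdStrategy(bytecode));
        // prints [V[1], out[knows], values[name]]
    }
}
```

Because the query language and the storage system publish their embeddings independently, this rewrite can happen at query time without either side knowing about the other in advance.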
V().filter(id().is(1)).outE().filter(label().is('knows')).inV().values('name')
  === Property Graph Bytecode compiles to Data Storage System Optimized Bytecode ==>
V(1).out('knows').values('name')

This bytecode is then passed to the processing engine, which seamlessly operates on the data storage system’s tuples as defined by the instructions.

Question: What is the out('knows') instruction? Simple: it’s a FlatMapFunction<Vertex,Edge> that calls the following method on the TP4 Vertex interface.

Iterator<Edge> Vertex.out(String... labels)

The data storage system says that its vertex tuple objects support the out(string...) instruction, and thus the data storage system is organizing incident edges by label in its respective substrate (i.e. disk or memory). Great!

IMPORTANT: Notice that our multi-model ADT’s operations are bytecode instructions. There is no longer a concept of “pointers.” A pointer is simply a map instruction! This means that our multi-model specification is not just a data structure specification, but also a bytecode specification. I believe we will ultimately have one spec driving TP4 VM development!

————————

!@#$@#^$%@&@%^@#$# Now let’s get crazy… @#$%@#$%@#$%@#$%^

Assume the data storage system actually produces a tuple with the following information:

{ #pg.type:vertex, #rdbms.type:row, #pg.label:person, #id:1, name:marko, age:29 }

Ah ha! So this tuple is both a vertex and a row! What does that mean? It means that we can ask the data storage system how it encodes the RDBMS ADT. Suppose it tells us:

{database} table(string)
{table}    database() rows() select(string,predicate) project(string...) join(table,predicate...)
{row}      asTable() table()

This specification says that it produces database, table, and row tuples that support the respective bytecode instructions. We can now interpret the tuple as a Row and operate on it as such.
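The flat-map reading of out('knows') can be sketched in a few lines: a function from a vertex to an iterator of edges that delegates to the storage system's vertex, which keeps its incident edges pre-grouped by label. The classes below are simplified illustrations, not the actual TP4 interfaces.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// A sketch of out('knows') as a flat-map instruction over the storage
// system's vertex objects. StorageVertex and Edge are illustrative
// stand-ins for the provider's tuple implementations.
public class OutInstructionSketch {

    public static class Edge {
        final String label;
        final int inVertexId;
        public Edge(String label, int inVertexId) {
            this.label = label;
            this.inVertexId = inVertexId;
        }
        public int inVertexId() { return inVertexId; }
        public String toString() { return "Edge[" + label + "->" + inVertexId + "]"; }
    }

    // The storage system organizes incident edges by label, so out(...)
    // is an index lookup rather than a scan-and-filter.
    public static class StorageVertex {
        private final Map<String, List<Edge>> edgesByLabel = new HashMap<>();

        public StorageVertex addEdge(String label, int inVertexId) {
            edgesByLabel.computeIfAbsent(label, k -> new ArrayList<>())
                        .add(new Edge(label, inVertexId));
            return this;
        }

        public Iterator<Edge> out(String... labels) {
            List<Edge> result = new ArrayList<>();
            for (String label : labels)
                result.addAll(edgesByLabel.getOrDefault(label, List.of()));
            return result.iterator();
        }
    }

    public static void main(String[] args) {
        StorageVertex marko = new StorageVertex()
            .addEdge("knows", 2).addEdge("knows", 4).addEdge("created", 3);

        // The out('knows') instruction: Vertex in, Iterator<Edge> out.
        Function<StorageVertex, Iterator<Edge>> outKnows = v -> v.out("knows");

        Iterator<Edge> edges = outKnows.apply(marko);
        while (edges.hasNext())
            System.out.println(edges.next());
    }
}
```

The processing engine only sees the instruction; how the edges are physically laid out on disk or in memory is entirely the storage system's business.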
{row}.asTable().join(database().table('companies')).by('name','employee_name').values('industry')

In essence, if a “multi-model language” existed, the following 'graph'/'rdbms'-hybrid query would have been:

g.V(1).out('knows').
  sql('SELECT industry FROM people, companies WHERE people.id=$id AND people.name=employee_name').by(id())

What is this Frankenstein data structure? It’s an RDBMS with foreign key relations that can be traversed! Or it’s a graph database that supports dynamic edge creation (i.e. vertex joins). Or it’s just the multi-model ADT being sweeeeeeet.

A few points to realize (*** Kuppitz will like this ***):

1. When we have a {table} tuple, we simply have a proxy/reference to (e.g.) the respective MySQL table.
2. The processing engine isn’t pulling in the table’s data.
3. Because the data storage system supports join(), we let it do the work instead of having the processing engine do it.

In conclusion: just like any other database ADT, our multi-model ADT can be used to define data structures and operations. However, our multi-model ADT can also simulate other database ADTs using their native lexicon and operations. Moreover, tuples in our multi-model ADT can be used across multiple ADT embeddings! Finally, the query language, processing engine, and data storage system stakeholders can all be different entities whose technologies interact with each other via multi-model ADT tuples and bytecode, where the TP4 VM speedily glues it all together at query time using each component’s published ADT embeddings.

Thoughts?,
Marko.

http://rredux.com <http://rredux.com/>