Hello,
*** I believe the following email provides the most elegant TP4 structure/
proposal to date. ***
Every database (withStructure()) is understood by TP4 as an unordered sequence
of #type'd tuples.
{#type:?, ...}
{#type:?, ...}
{#type:?, ...}
...
This is the TP4 “universal model.”
Every graph database is understood by TP4 as an unordered sequence of tuples
with the following #types.
{#type:index, ...}
{#type:vertex, ...}
{#type:edge, ...}
...
This is the TP4 “property graph model."
It is up to the underlying database to organize tuples in memory and on disk as
it sees fit. For example, an RDBMS-based graph database may want an
indices-table, a vertices-table, and an edges-table. Or, for schema-oriented
graph data, it may want a person-table, a knows-table, a project-table, a
created-table, etc. The TP4 VM doesn't care how the tuples are organized by the
database. The TP4 VM interprets the entire database (data + indices) as an
unordered heterogenous sequence of #type’d tuples.
————————
All structure providers (i.e. database vendors) must implement a
db(string...)-instruction which will emit all tuples of a particular #type.
db() -> a sequence of all tuples
db('vertex') -> a sequence of all #type=vertex tuples
db('index','vertex') -> a sequence of all #type=index and #type=vertex tuples.
The difference between a graph database, a relational database, a document
database, etc. is wholly dependent on the tuple #types and their respective
keys.
###################################################
### A TP4 Property Graph Database Specification ###
###################################################
#type=index OR vertex OR edge
...
#type=index.#vertex-ids=(pointer(object...) to sequence<vertex>)
[OPTIONAL]
#type=index.#vertex-property=(pointer(string,object) to sequence<vertex>)
[OPTIONAL]
...
#type=vertex.#id=(object)
#type=vertex.#label=(string)
#type=vertex.#outE=(pointer to sequence<edge>)
#type=vertex.#inE=(pointer to sequence<edge>)
#type=vertex.#outE-labels=(pointer(string...) to sequence<edge>) [OPTIONAL]
#type=vertex.#inE-labels=(pointer(string...) to sequence<edge>) [OPTIONAL]
#type=vertex.#out-labels=(pointer(string...) to sequence<vertex>) [OPTIONAL]
#type=vertex.#in-labels=(pointer(string...) to sequence<vertex>) [OPTIONAL]
...
#type=edge.#id=(object)
#type=edge.#label=(string)
#type=edge.#outV=(pointer to vertex)
#type=edge.#inV=(pointer to vertex)
The above specifies how the TP4 “universal model” is constrained to encode a
TP4 "property graph model."
=== EXAMPLE #1 ===
Suppose a graph database provider supports #vertex-ids and #out-labels. The
graph database's Structure implementation would express this fact via its
getStrategies() implementation returning:
GraphCentricIndexStrategy.create(‘vertex-ids')
VertexCentricIndexStrategy.create(‘out-labels')
These TP4 provided strategies would then perform the following bytecode
rewrites:
g.V(1).out('knows').values('name’)
== Gremlin compiles to TP4 universal bytecode ==>
db('vertex').has('#id',1).values('#outE').has('#label','knows').values('#inV').values('name')
== GraphCentricIndexStrategy rewrites the bytecode to ==>
db('index').values('#vertex.ids').apply(1).values('#outE').has('#label','knows').values('#inV').values('name')
== VertexCentricIndexStrategy rewrites the bytecode to ==>
db('index').values('#vertex.ids').apply(1).values('#out.labels').apply('knows').values('name')
If the original TP4 universal bytecode were to execute, it would do a linear
scan of all vertex-tuples and filter out those that don’t have an #id=1. It
would then do a linear scan of all outgoing incident edge-tuples to v[1] and
filter out those edges that don’t have #label=knows. The result would be
semantically correct, but the evaluation would be painfully slow for large
graph datasets. Indices can be used to speed up data access. In the TP4
universal model, an index is simply a pointer to a sub-sequence of tuples in
db().
GraphCentricIndexStrategy re-writes the original compiled TP4 universal
bytecode to use a graph-centric index (pointer) to directly access the vertex
with #id=1.
VertexCentricIndexStrategy re-writes the previous strategized TP4 bytecode to
use a vertex-centric index (pointer) to directly access all the incident
knows-edges of v[1].
NOTES:
values(‘#vertex-ids') references an n-arg pointer. apply(object...) resolves
the pointer to its tuple sequence using the provided arguments.
values('#outE') references a 0-arg point. apply() is not required as the
pointer is immediately resolved to its tuple sequence.
[TO BE CONTINUED…]
Tomorrow I will write this same email but from the perspective of a relational
data structure encoded in the TP4 universal model.
Feedback is much appreciated.
Take care,
Marko.
http://rredux.com <http://rredux.com/>