design

David A. Holland Mon, 20 Oct 2014 21:50:54 -0700

Module Name:    othersrc
Committed By:   dholland
Date:           Tue Oct 21 04:50:11 UTC 2014


Added Files:
        othersrc/external/bsd/bikeshed/dist/design: schema.txt

Log Message:
Some notes on storing version control info in a graph datastore.


To generate a diff of this commit:
cvs rdiff -u -r0 -r1.1 othersrc/external/bsd/bikeshed/dist/design/schema.txt

Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.

Added files:

Index: othersrc/external/bsd/bikeshed/dist/design/schema.txt
diff -u /dev/null othersrc/external/bsd/bikeshed/dist/design/schema.txt:1.1
--- /dev/null	Tue Oct 21 04:50:11 2014
+++ othersrc/external/bsd/bikeshed/dist/design/schema.txt	Tue Oct 21 04:50:11 2014
@@ -0,0 +1,166 @@
+Data schema for bikeshed
+------------------------
+
+Version control history forms a graph (a directed acyclic graph) so
+the best way to think of the version database is explicitly as a graph
+database. The property graph model is the least toxic extant way to
+handle graph-structured data, so we'll describe things in those terms.
+
+In the property graph model, a database is composed of one or more
+graphs; a graph is composed of nodes and edges; nodes and edges have
+IDs; edges also have labels; and nodes and edges furthermore have an
+arbitrary collection of attributes. A description of a schema gives
+the various types of nodes and edges that appear, and the attributes
+they're expected to have.
+
+
+1. Metadata vs. data
+
+The version control system has data (that is, files whose versions it
+controls) and it also has metadata: commit messages, version numbers,
+tags, etc.
+
+It is clear from the long-simmering argument about editing commit
+messages that most or all metadata needs to be mutable, even in the
+long term. Granting this one then immediately recognizes that metadata
+needs to be versioned: obviously if you e.g. change a commit message
+the old version should be retained for archival purposes, and to give
+an audit trail if needed. Meanwhile, the negative example of
+Mercurial's odd tagging semantics shows that metadata and data
+versions need to be handled separately. In particular, metadata
+changes should not create new data versions... and checking out an old
+data version should not cause the metadata to also be rolled back.
+(Maybe, however, rolling back the metadata should also force rolling
+back to a matching data version; however, it isn't clear if rolling
+back metadata, as opposed to just inspecting its history, is even an
+interesting thing to do.)
+
+
+2. Versions, changesets, branches, etc.
+
+Since we're versioning a whole filesystem tree, we need a
+representation of (a single state of) the tree, the files within the
+tree, and possibly regions within files. This will be some collection
+of graph nodes and edges.
+
+Each such version will chain back to the previous and forward to the
+next. (Or possibly back to more than one previous version for merges,
+and forward to more than one when branching occurs.) There should be
+one master node for the whole thing, which links forward and backward
+to the master nodes for other versions, and which also links to the
+nodes that describe the version.
+
+Individual objects within the version will be represented by nodes and
+these should also chain backwards and forwards to corresponding nodes
+in other versions. This allows traversing the history of an individual
+file cheaply.
+
+Aggregation of sub-versions into a single full version can be done by
+having the toplevel version point to the sub-versions, the same way a
+branch node points to the versions on the branch.
+
+Now, over on the metadata side, each metadata version has a system of
+nodes and edges that's a projection of the graph of data versions. All
+of this links to a master node for the metadata version, and the
+metadata version links forward and backward to other metadata
+versions.
+
+There's a common pattern here: given a graph (whether a filesystem
+tree or version graph), we manage a bigger graph that's a collection
+of versions of that graph. Generalizing this recursion would allow
+tracking meta-metadata, e.g. commit messages for metadata changes, and
+maintaining its history; however, there's no limit and I don't think
+it ends up sane. Hopefully we can track commit messages for metadata
+changes as part of the metadata history; although that becomes
+self-referential in a way that might also cause problems.
+
+We also need nodes for branches that link to the versions that are on
+that branch. The flow relationships between branches should be edges
+between branches.
+
+For tracking which changesets have been propagated to parallel
+branches and which haven't... we could make edges that join equivalent
+or merge-equivalent changesets directly, but this by itself isn't
+sufficient. Because changes aren't necessarily propagated in
+one-to-one fashion (e.g. one might merge several changes into another
+branch in one go), there needs to be a node that represents the
+propagation, and it must point to the changesets on each side. Then
+since we'll have a series of these we need another node to represent
+the relationship between the branches. For changes that have been
+considered and rejected, I guess we add a propagation node that points
+to rejected changes on one side and nothing on the other side. For
+changes that haven't been considered yet, I suppose we have nothing:
+we can find such changesets by the absence of such linkages.
+
+A similar structure is needed for files that have been cloned.
+
+In a more transient part of the database, we need a designated node
+that points to all the current heads, and if we have bookmarks, we
+need nodes for those that link to the version they name.
+
+Hyper-branches (as mentioned in requirements.txt) are nodes that link
+to groups of branch nodes.
+
+
+3. Filesystem tree state
+
+We need a sound operational model for the state of a filesystem tree
+and the actions on it. This is or will be discussed elsewhere. Many
+existing tools don't get it quite right, so it needs to be sorted out
+before anyone starts coding.
+
+In the database we probably want a node for each directory that links
+to nodes for the files in the directory. (And if files are composed of
+regions, then each file node links to the regions it contains.)
+
+
+4. Storing files and changes to files
+
+I think the graph database part should treat files (or file regions)
+and diffs on them as blobs, and these blobs should be kept in an
+external blob store. This will help keep the graph stuff compact; we
+will however want to give some attention to keeping the blob store
+sorted in a useful way.
+
+Each diff or materialized full file (or full file region) should have
+a node in the graph database, though, because that's how we keep track
+of what we have.
+
+I think every N versions of a file we should store a materialized full
+version, where N isn't necessarily fixed but maybe depends on the
+change rate; e.g. each time the cumulative size of the diffs exceeds
+1.5x the size of the full version we should store a full version, or
+something like that. The idea is to be able to extract any particular
+version reasonably rapidly without spending an enormous amount of time
+applying diffs.
+
+I have no idea what format diffs should be stored in, although as
+context isn't needed for applying changes to a known version of
+something the common patch formats aren't the right answer. We want
+something that can be applied either forwards or backwards.
+
+Both diff and full-version blobs should be stored compressed on disk.
+
+Need to think about how to generate annotate results rapidly. Also,
+want to be able to cache these; and also cache intermediate portions
+or parts of them, as it's common to run annotate on the same file
+using successively older versions.
+
+
+5. The graph database itself
+
+The graph database itself is append-only, so most of it can and should
+be stored using some kind of immutable graph representation like a
+CSR. I think we want a scheme where the database is composed of CSR
+bundles and simple (mutable, non-cooked) bundles, and pick the
+boundaries of CSR bundles by things like branch points. This should
+make it possible to clone history in interesting chunks by copying CSR
+bundles verbatim from server to client.
+
+Given some of the stuff above we clearly need a way for nodes to hold
+huge bundles of edges (e.g. the node for the default branch will point
+to every commit on the default branch, which will be a lot) and
+clearly these lists need to be ordered. They may span multiple CSR
+bundles too.
+
+

CVS commit: othersrc/external/bsd/bikeshed/dist/design

Reply via email to