Module Name: othersrc Committed By: dholland Date: Tue Oct 21 04:50:11 UTC 2014
Added Files: othersrc/external/bsd/bikeshed/dist/design: schema.txt Log Message: Some notes on storing version control info in a graph datastore. To generate a diff of this commit: cvs rdiff -u -r0 -r1.1 othersrc/external/bsd/bikeshed/dist/design/schema.txt Please note that diffs are not public domain; they are subject to the copyright notices on the relevant files.
Added files: Index: othersrc/external/bsd/bikeshed/dist/design/schema.txt diff -u /dev/null othersrc/external/bsd/bikeshed/dist/design/schema.txt:1.1 --- /dev/null Tue Oct 21 04:50:11 2014 +++ othersrc/external/bsd/bikeshed/dist/design/schema.txt Tue Oct 21 04:50:11 2014 @@ -0,0 +1,166 @@ +Data schema for bikeshed +------------------------ + +Version control history forms a graph (a directed acyclic graph) so +the best way to think of the version database is explicitly as a graph +database. The property graph model is the least toxic extant way to +handle graph-structured data, so we'll describe things in those terms. + +In the property graph model, a database is composed of one or more +graphs; a graph is composed of nodes and edges; nodes and edges have +IDs; edges also have labels; and nodes and edges furthermore have an +arbitrary collection of attributes. A description of a schema gives +the various types of nodes and edges that appear, and the attributes +they're expected to have. + + +1. Metadata vs. data + +The version control system has data (that is, files whose versions it +controls) and it also has metadata: commit messages, version numbers, +tags, etc. + +It is clear from the long-simmering argument about editing commit +messages that most or all metadata needs to be mutable, even in the +long term. Granting this one then immediately recognizes that metadata +needs to be versioned: obviously if you e.g. change a commit message +the old version should be retained for archival purposes, and to give +an audit trail if needed. Meanwhile, the negative example of +Mercurial's odd tagging semantics shows that metadata and data +versions need to be handled separately. In particular, metadata +changes should not create new data versions... and checking out an old +data version should not cause the metadata to also be rolled back. +(Maybe, however, rolling back the metadata should also force rolling +back to a matching data version; however, it isn't clear if rolling +back metadata, as opposed to just inspecting its history, is even an +interesting thing to do.) + + +2. Versions, changesets, branches, etc. + +Since we're versioning a whole filesystem tree, we need a +representation of (a single state of) the tree, the files within the +tree, and possibly regions within files. This will be some collection +of graph nodes and edges. + +Each such version will chain back to the previous and forward to the +next. (Or possibly back to more than one previous version for merges, +and forward to more than one when branching occurs.) There should be +one master node for the whole thing, which links forward and backward +to the master nodes for other versions, and which also links to the +nodes that describe the version. + +Individual objects within the version will be represented by nodes and +these should also chain backwards and forwards to corresponding nodes +in other versions. This allows traversing the history of an individual +file cheaply. + +Aggregation of sub-versions into a single full version can be done by +having the toplevel version point to the sub-versions, the same way a +branch node points to the versions on the branch. + +Now, over on the metadata side, each metadata version has a system of +nodes and edges that's a projection of the graph of data versions. All +of this links to a master node for the metadata version, and the +metadata version links forward and backward to other metadata +versions. + +There's a common pattern here: given a graph (whether a filesystem +tree or version graph), we manage a bigger graph that's a collection +of versions of that graph. Generalizing this recursion would allow +tracking meta-metadata, e.g. commit messages for metadata changes, and +maintaining its history; however, there's no limit and I don't think +it ends up sane. Hopefully we can track commit messages for metadata +changes as part of the metadata history; although that becomes +self-referential in a way that might also cause problems. + +We also need nodes for branches that link to the versions that are on +that branch. The flow relationships between branches should be edges +between branches. + +For tracking which changesets have been propagated to parallel +branches and which haven't... we could make edges that join equivalent +or merge-equivalent changesets directly, but this by itself isn't +sufficient. Because changes aren't necessarily propagated in +one-to-one fashion (e.g. one might merge several changes into another +branch in one go), there needs to be a node that represents the +propagation, and it must point to the changesets on each side. Then +since we'll have a series of these we need another node to represent +the relationship between the branches. For changes that have been +considered and rejected, I guess we add a propagation node that points +to rejected changes on one side and nothing on the other side. For +changes that haven't been considered yet, I suppose we have nothing: +we can find such changesets by the absence of such linkages. + +A similar structure is needed for files that have been cloned. + +In a more transient part of the database, we need a designated node +that points to all the current heads, and if we have bookmarks, we +need nodes for those that link to the version they name. + +Hyper-branches (as mentioned in requirements.txt) are nodes that link +to groups of branch nodes. + + +3. Filesystem tree state + +We need a sound operational model for the state of a filesystem tree +and the actions on it. This is or will be discussed elsewhere. Many +existing tools don't get it quite right, so it needs to be sorted out +before anyone starts coding. + +In the database we probably want a node for each directory that links +to nodes for the files in the directory. (And if files are composed of +regions, then each file node links to the regions it contains.) + + +4. Storing files and changes to files + +I think the graph database part should treat files (or file regions) +and diffs on them as blobs, and these blobs should be kept in an +external blob store. This will help keep the graph stuff compact; we +will however want to give some attention to keeping the blob store +sorted in a useful way. + +Each diff or materialized full file (or full file region) should have +a node in the graph database, though, because that's how we keep track +of what we have. + +I think every N versions of a file we should store a materialized full +version, where N isn't necessarily fixed but maybe depends on the +change rate; e.g. each time the cumulative size of the diffs exceeds +1.5x the size of the full version we should store a full version, or +something like that. The idea is to be able to extract any particular +version reasonably rapidly without spending an enormous amount of time +applying diffs. + +I have no idea what format diffs should be stored in, although as +context isn't needed for applying changes to a known version of +something the common patch formats aren't the right answer. We want +something that can be applied either forwards or backwards. + +Both diff and full-version blobs should be stored compressed on disk. + +Need to think about how to generate annotate results rapidly. Also, +want to be able to cache these; and also cache intermediate portions +or parts of them, as it's common to run annotate on the same file +using successively older versions. + + +5. The graph database itself + +The graph database itself is append-only, so most of it can and should +be stored using some kind of immutable graph representation like a +CSR. I think we want a scheme where the database is composed of CSR +bundles and simple (mutable, non-cooked) bundles, and pick the +boundaries of CSR bundles by things like branch points. This should +make it possible to clone history in interesting chunks by copying CSR +bundles verbatim from server to client. + +Given some of the stuff above we clearly need a way for nodes to hold +huge bundles of edges (e.g. the node for the default branch will point +to every commit on the default branch, which will be a lot) and +clearly these lists need to be ordered. They may span multiple CSR +bundles too. + +