[I sent this from my gmail account last night, but it did not appear to make it to the list, so I'm re-sending. Sorry if you get this twice.]
Linus, Background: ----------- The SCM problem is something I've been toying with for a while now on my own because I wanted a distributed SCM tool, but didn't like one thing or another about the existing projects. I thought monotone came very close to what I wanted, but I didn't like the repository as a binary blob. So I've been playing with my own prototype SCM project. I'd like to know what you think of this approach, and maybe observations on this prototype implementation. The Repository: --------------- The on-disk format looks like this: linux-repository linux_kernel_repo/ |-- cache |-- nodes ("revisions" would be a better name) |-- patches `-- tags Files are created in the repository, and never modified once they are there. Files are never deleted from the repository (with the possible exception of stuff in the cache directory). The patches directory contains <hashofself>.patch files, which are basically [comments] [diffstat] [diff -aurN] The nodes directory contains <hashofself>.node files, which have [author] [date] [<hash>.node <hash>.patch] ... (There can be 0 or more [node,patch] tuples.) The tags directory contains files named like this: [EMAIL PROTECTED] (i.e: <epoch time>-<author>-<tag-name>.tag) that contain the name of a node: <hash>.node Then finally, there is the cache directory. With my current implementation, this area is used to cache unpacked revisions of the source tree for diffing and the like. The contents of these directories are hard-linked to conserve some space. Repository Implications: ------------------------ This repository layout allows for a number of neat features. Like git, you can use rsync to push and pull everything from one repository to another. (You'll probably want to exclude (most of) the cache directory.) But even better than that, you can pull a cached node from a repository into a new repository, (such as linux-2.6.11) then pull your current head revision and patches tracking back to that node, giving you a minimal repository with just the history you want. Then when you're happy with what you have, push it out. That would let you have every revision of Linux known to man in a public repository somewhere (kernel.org?), but you could keep a much reduced copy of the repository on your working machine. And we can do all of that without writing a dedicated SCM server: ftp, sftp, rsync, etc. should all work. Since the patches don't reference a node in themselves, they can be applied to nodes other than the one they were created from. I believe this will allow merging to work better than git would allow because we can track that the same patch was applied to both of two lines of development that we're trying to merge. Current performance isn't very good, but I haven't really optimized it yet, either. The numbers below are from commits to my repository on a 2.6GHz Celeron w/768MB RAM. Commit 2.6.0 95.31user 354.33system 9:07.91elapsed 82%CPU Commit 2.6.1 61.53user 139.10system 6:01.20elapsed 55%CPU Commit 2.6.2 21.48user 32.11system 1:02.36elapsed 85%CPU Commit 2.6.3 22.40user 31.51system 1:00.45elapsed 89%CPU Commit 2.6.4 26.41user 42.79system 1:17.18elapsed 89%CPU Commit 2.6.5 24.16user 38.97system 1:13.21elapsed 86%CPU Commit 2.6.6 25.37user 38.99system 1:15.28elapsed 85%CPU Commit 2.6.7 35.83user 55.06system 1:46.76elapsed 85%CPU Commit 2.6.8 43.94user 70.02system 2:27.96elapsed 77%CPU Commit 2.6.9 43.90user 75.02system 2:42.44elapsed 73%CPU Commit 2.6.10 52.16user 89.46system 3:03.61elapsed 77%CPU Commit 2.6.11 54.47user 93.21system 3:25.57elapsed 71%CPU Commit 2.6.11.1 14.06user 29.82system 1:04.60elapsed 67%CPU Commit 2.6.11.2 5.42user 12.66system 0:32.30elapsed 56%CPU Commit 2.6.11.3 1.91user 1.35system 0:06.49elapsed 50%CPU Commit 2.6.11.4 1.82user 1.41system 0:06.84elapsed 47%CPU Commit 2.6.11.5 1.91user 1.31system 0:04.26elapsed 75%CPU Commit 2.6.11.6 1.89user 1.35system 0:06.94elapsed 46%CPU Commit 2.6.11.7 2.07user 1.50system 0:07.44elapsed 48%CPU Doing some profiling leads me to believe that there is a lot of room for improvement for the nothing -> 2.6.0 and the 2.6.0 -> 2.6.1 commits. I believe the 2.6.x commits can also be improved. The 2.6.11.x commits tend to be <10 seconds. Which should be your common case. (Yes, I know git is blazingly fast, but are these numbers reasonable? How far would they need to go to be reasonable? I think these are comparable to what monotone would do, but they've done some optimization recently, so I'm probably slower.) As for disk usage (du -sck) of the above repository: 152 tags 168 nodes 354572 patches Those three directories contain all the history. 1379704 cache And that can be re-created from the nodes+patches. There is no compression going on here, which would be an easy way to improve that. I haven't implemented that yet though. The Sandbox: ------------ At the top directory of the sandbox, I have a .scm directory containing state information such as current revision and repository location. The rest is just source code. Pretty straight-forward. Efficiency/Optimizations: ------------------------- I used some of the ideas from the git discussion in this implementation. The sandbox keeps some cached information in the .scm; the file stat information, and a current hash of each file. That way it can look at only the files it needs to for committing, diffing, etc. (Sadly, this may not have helped as much as I expected. I have a version of my tool that just does a straight recursive diff on the whole thing, and the times are comparable. That may just point to a really dumb bug on my part though.) I do a lot of os.system()'s and os.popen()'s that could probably be redone. Doing a quick grep, it looks like I use: diff, sed, cp, filterdiff, diffstat, grep, tail, patch, find, xargs, stat, sort, sha1sum, rsync and $EDITOR. (In my defense, I was aiming for speed of implementation--after all, git is progressing quite quickly.) It might make sense to glue this together with git -- use your "content addressable file system" for the cache directory to get the speed, but use the the nodes and patches to record the history. Implementation: --------------- Attached is scm.py (pronounced "skimpy"), which as the name suggests, is a python implementation of the SCM system I've described. I'm using Python 2.3.4 on Fedora Core 3, and I haven't tested this on other distributions/platforms/etc. (This is all original code I've done on my own time and equipment, and I'm releasing it under the GPL-v2-or-later, with absolutely no warranty, and the warning that it may kill your dog, wreck your car, etc.) There are a number of known issues: The program is a little user-hostile--when it runs into something unexpected, it's liable to raise an exception, print a back-trace, and exit. Hopefully it'll give a useful hint in the exception. It may even leave something in /tmp, though if it does, that's a bug I'd like to hear about. Sometimes you'll get prompted by patch when scm calls out to it when it really should be set up to run non-interactively. I have not dealt with file renames yet, but I expect that to become part of the patch "comment" area, much like the diffstat is. I have not decided how to best represent that. I have not implemented merging yet, but the groundwork is there. I expect to do a breadth-first search for a common ancestor, find the patches in the line of development you are merging into your own that you do not already have in your own, apply them to your line of development creating nodes along the way, then let you deal with the broken parts. As a first pass anyway. I haven't directly addressed file permissions or directories yet. I'm expecting those would wind up being handled similarly to file renames in the patch files. If we want to add arbitrary attributes for patches and/or nodes, we could create a "attr" directory containing a directory named after the node or patch, and in that, a file like <epoch>-<author>-<attribute-name> containing the value of that attribute. For that matter, we could re-do tags that way. Quick Start: ------------ ./scm.py help ./scm.py init <path-to-new-repository> ./scm.py --repo=<repository> checkout mysandbox cd mysandbox <copy your source in> ./scm.py diff ./scm.py diffstat ./scm.py commit And this one is fun: ./scm.py --repo=<repository> graph | dot -Tps repo-history.ps Main Points: ------------ I believe that this representation of the kernel history is better than git's because I believe it will improve our ability to do merges and more closely represents what is occurring. I believe that the repository design will give us many of the same benefits as git, and some that git does not give us. I believe that the implementation can be significantly improved. The performance scales primarily by the size of the change, rather than the size of the code base or the size of the repository. Ok, so that's scm.py. Now, let me lock my ego in this nicely padded safe, don the asbestos coveralls, and cower behind this nice sturdy boulder.... Alright, pull out the flame throwers and have at it! :) Eli -----------------------. "If it ain't broke now, Eli Carter \ it will be soon." -- crypto-gram [EMAIL PROTECTED] `--------------------------------------------
scm.py
Description: application/python