Some of you will recall that I did a lot of work early this year on the dump stream documentation as part of an effort to enable reposurgeon to read Subversion dump files directly, translating Subversion histories into a DVCS-style commit DAG that can then be exported into git, hg, or bzr.
I am pleased to be able to announce that this effort has been (somewhat belatedly) successful. The project had stalled for six months, but production-quality Subversion support in reposurgeon has now been verified by a successful lift to DVCS of a large, old Subversion repository with lots of ugly metadata corner cases in it (the repo of the Network UPS Tools project).

Not only does it work, it even works *fast*. The dumpfile analyzer cranks through more than a thousand commits per minute on vanilla desktop hardware. Subversion contributor Greg Hudson deserves credit for this, as he contributed a performance patch implementing copy-on-write filemaps that tremendously sped up processing. More than that, a slight extension of Greg's idea enabled me to abolish some code that I had suspected (correctly, it turns out) of harboring the subtle bug that had stalled the project for half a year - eliminating, in the process, another O(n**2) lookup that Greg hadn't originally been targeting.

It may be of interest that the bug involved incorrect translation of two successive copies in opposite directions across a pair of branches. Another case that gave me trouble was a branch delete followed by a copy to the same branch name. A third was a directory copy followed by a file change in one of the copied files *before* commit.

There are also various mixtures of file system copies with Subversion copy and commit operations that a tool like this needs to detect and patch so the history looks as though proper Subversion operations were used throughout; otherwise the commit DAG will be missing some ancestry links that semantically ought to be there. For example, if you (a) create a branch directory, (b) use file system copy to populate it from another branch, and (c) commit, the DAG builder needs to detect this and treat step (b) as though it had been done with Subversion directory and/or file copies. Fortunately this is a relatively simple exercise in hash matching.
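To make the hash-matching idea concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not reposurgeon's actual code: the function names (`content_hash`, `infer_copy_sources`, `infer_branch_parent`) and the dict-of-bytes representation of a tree are hypothetical, and a real tool would presumably work from the content hashes and node records in the dump stream rather than from whole file bodies held in memory.

```python
import hashlib
from collections import Counter, defaultdict


def content_hash(data: bytes) -> str:
    """Return a hex digest identifying a file's content."""
    return hashlib.sha1(data).hexdigest()


def infer_copy_sources(existing, added):
    """Guess a copy source for each file added without copy metadata.

    existing: dict mapping path -> bytes for files already in the tree
    added:    dict mapping path -> bytes for files newly added in a commit
    Returns a dict mapping each added path to a matching existing path;
    added files whose content matches nothing are left out.
    """
    by_hash = defaultdict(list)
    for path in sorted(existing):
        by_hash[content_hash(existing[path])].append(path)

    sources = {}
    for path, data in added.items():
        candidates = by_hash.get(content_hash(data), [])
        if not candidates:
            continue
        # Prefer a candidate with the same basename; else take the first.
        basename = path.rsplit("/", 1)[-1]
        same_name = [c for c in candidates if c.rsplit("/", 1)[-1] == basename]
        sources[path] = (same_name or candidates)[0]
    return sources


def infer_branch_parent(sources, branch_prefix):
    """Return the directory most files under branch_prefix appear to have
    been copied from, or None if no file votes for any directory."""
    votes = Counter()
    for new_path, old_path in sources.items():
        if not new_path.startswith(branch_prefix):
            continue
        relative = new_path[len(branch_prefix):]
        if not old_path.endswith(relative):
            continue
        root = old_path[:len(old_path) - len(relative)]
        # Only count roots that end on a directory boundary.
        if root == "" or root.endswith("/"):
            votes[root] += 1
    return votes.most_common(1)[0][0] if votes else None


if __name__ == "__main__":
    trunk = {
        "trunk/README": b"hello\n",
        "trunk/src/main.c": b"int main(void){return 0;}\n",
    }
    new_branch = {
        "branches/stable/README": b"hello\n",
        "branches/stable/src/main.c": b"int main(void){return 0;}\n",
        "branches/stable/NEWS": b"new file\n",
    }
    matches = infer_copy_sources(trunk, new_branch)
    print(matches)
    print(infer_branch_parent(matches, "branches/stable/"))
```

Once a probable parent directory is found, the DAG builder can record an ancestry link from the new branch to it, just as it would for an explicit Subversion copy. Voting across all matched files, rather than trusting a single match, is a design choice in this sketch to keep one coincidentally identical file from selecting the wrong parent.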
I'm still polishing; one thing that needs more work is interpretation of mergeinfo properties. The cherry-picking model Subversion uses doesn't match the way git/hg/bzr want to do things. Simple mergeinfos translate well, but there are complex cases that yield perverse-looking merge links.

Still, all that wrestling with strange corner cases paid off - reposurgeon is now better at translating Subversion repos to DVCS histories than anything else out there. It even handles cross-branch mixed commits without breaking stride.

But it doesn't try to do everything. One of the philosophical premises behind reposurgeon was that repository translation is more like literary translation than people who write repository-conversion tools normally understand. That is, low-level mechanical translations don't work very well - they need to be cleaned up by a human who understands the ontological mismatches between VCSes and the idioms of both source and target VCS.

A very simple example of this requirement is: what should be done with Subversion revision references like r456 in commit comments? Not just "what should they be translated into?" but "how can we even *recognize* them reliably?" Humans come up with lots of variant ways to write these even within the same repo, and mechanical translators have trouble spotting them all.

reposurgeon was built with the goal of amplifying human judgment (making it as easy as possible for a human to improve on reposurgeon's basic mechanical translation) rather than trying to eliminate human judgment. This choice now seems well vindicated.

-- 
<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>

The only purpose for which power can be rightfully exercised over any member of a civilized community, against his will, is to prevent harm to others. His own good, either physical or moral, is not a sufficient warrant.
-- John Stuart Mill, "On Liberty", 1859