8 hours ago, Matthew Flatt wrote:
> At Thu, 23 May 2013 07:09:17 -0400, Eli Barzilay wrote:
> > "Relevant history" is vague.
>
> The history I want corresponds to `git log --follow' on each of the
> files that end up in a repository.

(In this context this is clear; the problem in Carl's post is that it
seemed like he was suggesting keeping the whole repository and doing
the split by removing material from clones -- which is an even fuller
history, but one that has large parts that are irrelevant.)
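To make the per-file notion concrete, here is roughly the check I
have in mind -- only a sketch, and the example path is made up:

  #lang racket/base
  (require racket/port racket/string racket/system)
  ;; The commits that `git log --follow' reports for one file: this
  ;; is the history that should survive for that file after the split.
  (define git (find-executable-path "git"))
  (define (file-history path)
    (string-split
     (with-output-to-string
       (lambda () (system* git "log" "--follow" "--format=%H" "--" path)))
     "\n"))
  ;; e.g., (file-history "collects/racket/list.rkt")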
> That's true if you use `git filter-branch' in a particular way. I'll
> suggest an alternative way, which involves filtering the set of
> files in a commit-specific way. That is, the right set of files to
> keep for each commit are not the ones in the final place, but the
> ones whose history we need at each commit.

If that can be done reliably, then of course it makes it possible to
do the split reliably after the first restructure.  It does come with
a set of issues though...

> [... scripts description ...]

Here are a bunch of things that I thought about as I went over this.
In no particular order, probably not exhaustive, and possibly
repetitive:

* Minor: better to use `find-executable-path', since it's common to
  find systems (like mine) with an antique git in /usr/bin and a
  modern one elsewhere.  (In my case, both scripts failed since
  /usr/bin has the antique version.)

* There is an important point of fragility here: you're relying on
  git to find all of the relevant file movements (renames and
  copies), which might not always be correct.  On one hand you don't
  want to miss these operations, and on the other you don't want a
  similarity threshold low enough to identify bogus copies and
  renames.

* Because of this, I think that it's really best to inspect the
  results manually.  The danger of bogus copies, for example, is
  real, especially with small and very boilerplate-ish files like
  "info.rkt" files.  A mistaken identification of such a copy can
  leave a bogus directory kept in the trimmed repo.  In addition,
  consider this information that the script detects via git for a
  specific commit:

    A/f1.ss  renamed to B/f1.rkt
    A/f2.ss  renamed to B/f2.rkt
    ...
    A/f47.ss renamed to B/f47.rkt
    A/f48.ss renamed to B/f48.rkt
    A/f49.ss deleted
    A/f50.ss deleted
    B/f49.rkt created
    B/f50.rkt created

  For a human reviewer it's pretty clear that this is just a
  misidentification of two more moves (likely to happen with the kind
  of restructuring that we did in the past, where a single commit
  both moves a file and changes its contents).  This is why, on one
  hand, I *really* like to use such scripts (to make sure that I
  don't miss such things), but OTOH I want to review the analysis
  results to see potential problems and either fix them manually or
  figure out a way to improve the analysis and run it again.

* Also, I'd worry about file movements on top of paths that existed
  under a different final path at some point, and exactly the
  situations you described, where a file was left behind but is
  completely new and should be considered separate (as in the case of
  a file move with a stub created in its place).

* The script should also take care to deal with files that got
  removed in the past.  For example, the drscheme collection had some
  file that got removed, and later (completely unrelated to that)
  most of its contents migrated to drracket.  If the result of the
  analysis is that most of the material moved this way, and because
  of that you decide to keep the old drscheme collection -- you'd
  also want to keep that file that disappeared before the move, since
  it's part of the relevant history.  So I'd modify this script to
  run on the *complete* repository -- the whole tree and all commits
  -- and generate information about movements.
  Possibly do what your script does for the whole tree, then add a
  second step that looks for files that are unaccounted for in the
  results, and decide what to do with them.  I think that this also
  means that it makes sense to create a global database of all file
  movements in a single scan, instead of running it for each package.
  (See the first sketch after this list.)

* Technical: I thought that it might make sense to use a racket
  server (with netcat for the actual command), or to have it
  "compile" a /bin/sh script to do the actual work, instead of using
  `racket/kernel' for speed.  However, when I tried it on the plt
  tree, it started with spitting out new commits rapidly, but
  eventually slowed down to more than a second between commits, so
  probably even the kernel trick is not helping much...

* Actually, given the huge amount of time it's running (see the next
  bullet), it's probably best to make it do the movements from all
  paths at the same time.  In this specific context, this means that
  it scans the package-restructured repo (from the first step) into a
  package-restructured repo (possibly with the same toplevel names)
  with all the files moved to their correct places, and the resulting
  repo can then be conveniently split into the sub-repos with a
  simple subdirectory filter.

* And speaking about the time: what I saw was about 19k commits (I
  think -- I killed it and am speaking from memory now), where it
  started out working very fast, then slowed down considerably.
  After about 5 hours it was about half-way through the 19k commits,
  and the rate had reached about 1.5 seconds per commit.  Assuming a
  linear growth in time per commit, the whole operation would take
  about 20 hours.  (I didn't leave it running, since I don't want it
  to disturb the nightly build that will start soon.)  This is not
  making it impossible -- just very hard to do reliably, so I really
  wouldn't want to see this going without a human supervising eye as
  I described above.

* Much better would be to run this to generate human-readable and
  editable output: then not only go over this output manually and
  make sure that it all makes sense, but also identify points of
  optimization.  For example, knowing that all of drscheme/* moved to
  drracket/* is going to work out much better than re-doing the
  analysis for each file separately on each commit.

* Re the commit messages being edited with "--msg-filter": one thing
  to note is that there are already such lines in the portion of the
  history that was ported from subversion.  Even for those commits
  you probably want to add the sha1s still, since there might be
  references to the sha1s of an svn-imported commit -- such a commit
  would end with the svn revision, then the original sha1 that was
  the first translation of the svn commit.  (See the second sketch
  after this list.)

* It's not clear to me what you want to do at this point, but you
  originally described two filter-branch steps: one to restructure
  the repository soon, and another to split it into packages.  If
  this is still the plan, then each of these steps would need to
  leave a historical sha1 behind.  Alternatively, do the first
  restructure with in-repo moves instead, but then it would be a good
  idea to run the slice script and make sure that it succeeds in
  finding all of *these* renames correctly, as a good first-level
  sanity test.

* Still, I consider all of this a huge amount of work, and I still
  don't see any benefit in doing it.  Just the time spent on these
  explanations is more than what I'd spend on what I suggested
  yesterday wrt starting a separate setup step for the core and for
  the rest.
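Here is the first sketch mentioned above: a single pass over all
commits that collects every rename/copy git detects into one table.
The table layout, the `@'-prefixed commit format, and the default
-M/-C thresholds are all just starting points, not a worked-out tool:

  #lang racket
  ;; Scan the whole repository once, recording every rename/copy that
  ;; git detects, keyed by the old path.  -M and -C take thresholds
  ;; (e.g., -M90%) that need the kind of tuning discussed above;
  ;; merge commits would need extra handling (e.g., -m).
  (define git (find-executable-path "git"))
  (define movements (make-hash)) ; old path -> list of (commit new-path status)
  (define (scan-all-movements!)
    (define out
      (with-output-to-string
        (lambda ()
          (system* git "log" "--all" "--diff-filter=RC" "-M" "-C"
                   "--format=@%H" "--name-status"))))
    (for/fold ([commit #f]) ([line (in-list (string-split out "\n"))])
      (cond
        ;; a commit header line, as requested via --format=@%H
        [(regexp-match #rx"^@(.+)$" line) => cadr]
        ;; a movement line: "R100<tab>old<tab>new" or "C90<tab>old<tab>new"
        [(regexp-match #rx"^([RC][0-9]*)\t([^\t]+)\t([^\t]+)$" line)
         => (lambda (m)
              (hash-update! movements (caddr m)
                            (lambda (l) (cons (list commit (cadddr m) (cadr m)) l))
                            '())
              commit)]
        [else commit]))
    (void))

A second step can then walk `movements' to flag files that are
unaccounted for, and the table itself can be dumped as the
human-editable output from the later bullet.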
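And the second sketch: a minimal `--msg-filter' body in Racket.  The
marker wording is only a placeholder; filter-branch hands the message
on stdin and the original commit id in $GIT_COMMIT:

  #lang racket/base
  (require racket/port)
  ;; Echo the original message, then append the original sha1.
  (display (port->string (current-input-port)))
  (printf "\n(original commit: ~a)\n" (getenv "GIT_COMMIT"))

This would be used as something like

  git filter-branch --msg-filter 'racket /path/to/append-sha1.rkt' -- --all

so an svn-imported commit ends up carrying the svn revision, the sha1
of its first git translation, and now this one, in order.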
BTW, the script is still useful, of course -- I'd probably do
something similar, except that I'd use some shell scripts to inspect
the history of all files, and refine it as I described above.  The
thing is that this can be done without any effect on current
progress, since the split during the build is made on the current
directories.

-- 
          ((lambda (x) (x x)) (lambda (x) (x x)))          Eli Barzilay:
                            http://barzilay.org/                   Maze is Life!

_________________________
  Racket Developers list:
  http://lists.racket-lang.org/dev