At Thu, 23 May 2013 07:09:17 -0400, Eli Barzilay wrote:
> "Relevant history" is vague.
The history I want corresponds to `git log --follow' on each of the files that end up in a repository.

> The thing that you can't do with filter-branch is keep the complete
> history if you remove files from the history -- the files that are
> gone go with their history.

That's true if you use `git filter-branch' in a particular way. I'll suggest an alternative, which involves filtering the set of files in a commit-specific way. That is, the right set of files to keep for each commit is not the set that ends up in the final place, but the set whose history we need at that commit.

To make sure I'm not confused, I've implemented this idea. My implementation is unlikely to be exactly right yet, but I think it works as a proof of concept.

The enclosed "slice.rkt" script takes a subdirectory and a destination directory. Run it in the top directory of a git repository, and it finds all the files in the given subdirectory, and then it closes over the history of each file via `git log --follow'.

From that point, we could use the computed set of paths as the ones to keep during a `git filter-branch' on every commit, but that's not ideal. For example, a file in collection "a" that is destined for package "a" may have originated in "b" (think "mzlib"), where the same-named file sticks around in "b" after the copy. It's nicer and cleaner to have irrelevant files disappear after the relevant copy/move is made.

So, I took one more step: "slice.rkt" constructs a range of commits during which each file should exist, based on when it was moved or copied. (Forks and merges are a minor obstacle, which the script works around by enlarging ranges to hit commits in the `--first-parent' traversal.) Conceptually, the result is a mapping from commit ids to paths, but that would be a big table to read on every `filter-branch' step, so it's reported as a table of commits with enter/leave transitions.
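As a rough illustration of that last step, here is a minimal Python sketch (not the enclosed Racket script) of how a per-commit mapping could be compressed into an initial state plus enter/leave transitions. The function name `to_transitions`, the commit ids, and the paths are all made up for the example, and the real "actions.rktd" layout may well differ.

```python
def to_transitions(commits, keep):
    """Compress a per-commit keep-set into an initial state plus transitions.

    commits: commit ids in oldest-first order (hypothetical ids).
    keep:    dict mapping each commit id to the set of paths to keep there.
    Returns (initial, actions), where actions only has entries for commits
    at which the keep-set changes.
    """
    initial = set(keep[commits[0]])
    actions = {}
    prev = initial
    for commit in commits[1:]:
        cur = keep[commit]
        enter = sorted(cur - prev)   # paths that start being kept here
        leave = sorted(prev - cur)   # paths that stop being kept here
        if enter or leave:
            actions[commit] = {"enter": enter, "leave": leave}
        prev = cur
    return initial, actions


# A file that originates in "b" and is copied to "a" at commit c2;
# the copy in "b" is irrelevant from c2 onward.
commits = ["c1", "c2", "c3"]
keep = {
    "c1": {"b/x.rkt"},
    "c2": {"a/x.rkt"},
    "c3": {"a/x.rkt"},
}
initial, actions = to_transitions(commits, keep)
```

Only "c2" gets an entry in `actions`, so a filter step that starts from `initial` has to look up at most one small record per commit instead of reading the full commit-to-paths table.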
The output of "slice.rkt" is two files: "state.rktd" for the set of files to be kept in the initial commit, and "actions.rktd" to specify the transitions. The enclosed "prune.rkt" script works with `git filter-branch --index-filter'. It uses "actions.rktd" (read-only) and "state.rktd" (which it updates as it applies transitions).

The Racket git repo is large, so I've only tried the `git filter-branch' step so far on smaller repos, such as the "iplt" repository. In my clone of "iplt", I `git mv'ed "web/internal" to "ex/internal". Then, with the scripts in "/tmp",

  racket /tmp/slice.rkt ex /tmp
  git filter-branch --index-filter "racket /tmp/prune.rkt /tmp" --prune-empty

leaves the repo with only the files of "ex", and `git log --follow' on various files looks right. I'll try on a clone of the Racket repo and report back.

FWIW, before doing this for real, I'd want to add a `--msg-filter' that extends each commit message to add the original commit id, since we have references to the old ids in various places (and so it would be handy to have them in the new repos).
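A message filter of that kind could look roughly like the following Python sketch. It relies on `git filter-branch' setting the GIT_COMMIT environment variable to the id of the commit being rewritten, and on `--msg-filter' passing the old message on stdin and taking the new one from stdout. The function names and the "original commit:" trailer wording are assumptions for illustration, not something the scripts above do.

```python
import os
import sys


def annotate(message, original_id):
    """Append the pre-rewrite commit id to a commit message."""
    body = message.rstrip("\n")
    return f"{body}\n\noriginal commit: {original_id}\n"


def filter_message(text, env=os.environ):
    """Annotate text with the id found in GIT_COMMIT (or "unknown")."""
    return annotate(text, env.get("GIT_COMMIT", "unknown"))


def main():
    # Read the old message from stdin and write the new one to stdout,
    # which is the contract of a --msg-filter command.
    sys.stdout.write(filter_message(sys.stdin.read()))
```

Saved as, say, "/tmp/add-id.py" (a hypothetical name) with a final call to `main()`, it could be passed as `--msg-filter "python3 /tmp/add-id.py"' alongside the `--index-filter' above.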
slice.rkt
Description: Binary data
prune.rkt
Description: Binary data
_________________________
  Racket Developers list:
  http://lists.racket-lang.org/dev