Joseph Myers <jos...@codesourcery.com>:
> This is still hypothetical, since I haven't seen any scripts posted that
> would actual implement this, or any resulting mappings of commits, and one
> wouldn't normally expect a repository conversion to attempt to distinguish
> committer from author when the source version control system has no such
> distinction.

Wow. Even attempting this would be a huge, ugly job.

I strongly recommend that if you want to try this, you separate it from
the initial repo conversion. That is, get the project to git first. Then
see if you can data-mine author information out of the history. If, and
only if, you get results that look reasonable, then you patch the repo and
force-push it, warning everyone there'll be a flag day.

The reason I recommend this is that I think you're going to have serious
trouble getting clean authorship data with good coverage. The data mining
will be messy and take longer than you expect.

Here's how I'd do it:

1. Write an analyzer for commit logs. Its goal should be to parse logs and
   produce a list of records, each consisting of an author, a commit date,
   and a list of modified-file paths - one record per commit-log entry.

2. Run this once on each terminal commit log - that is, at each branch
   head, on both the main commit log and all its archival versions.
   Aggregate all the records, dropping duplicates.

3. Write a custom Python extension to reposurgeon that generates the same
   report, only this time per-commit and thus yielding a committer ID.

4. Set a recognition time window. It must be more than 24 hours or you're
   going to have spurious negatives due to time-zone skew.

5. Write a program that fuzzy-matches the commit-log file-modification
   cliques to the per-commit cliques. One aspect of "fuzzy" is the time
   window; you need to include as potential matches any commits back from
   the date of the commit-log entry *and those up to 24 hours forward*
   (time-zone skew again). Also, you can't look only at the most recent
   matching commit if it's within the 24-hour window - time-zone skew
   might mean that another one that looks older also matches, and might
   actually be more recent.

6. Try the naive implementation using a 24-hour time window. Now look at
   the percentages of unmatched commits and commit-log entries. If they're
   too high, how do they vary as the time window widens?
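
To make the matching step concrete, here is a minimal sketch of the naive
matcher, not a tested tool. Everything in it is an assumption for
illustration: the Record layout, the function names, and above all the
choice to require exact fileset equality, so that the time window is the
only "fuzzy" part. An entry with more than one candidate commit is set
aside rather than guessed at.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Record:
    """Hypothetical record shape shared by log entries and commits."""
    ident: str          # author (log entry) or committer ID (commit)
    when: datetime      # date as recorded; zone information is unreliable
    paths: frozenset    # modified-file paths (the "clique")


def candidates(entry, commits, back, forward=timedelta(hours=24)):
    """Commits with the same fileset, dated between `back` before the
    entry and `forward` after it (the forward slack covers zone skew)."""
    lo, hi = entry.when - back, entry.when + forward
    return [c for c in commits
            if c.paths == entry.paths and lo <= c.when <= hi]


def match(entries, commits, back=timedelta(hours=24)):
    """Split entries into uniquely matched, ambiguous, and unmatched."""
    matched, ambiguous, unmatched = {}, [], []
    for e in entries:
        found = candidates(e, commits, back)
        if len(found) == 1:
            matched[e] = found[0]
        elif found:
            ambiguous.append(e)    # several candidates: don't guess
        else:
            unmatched.append(e)
    return matched, ambiguous, unmatched


def unmatched_percent(entries, commits, back):
    """Share of entries with no candidate at all, for a given window."""
    _, _, unmatched = match(entries, commits, back)
    return 100.0 * len(unmatched) / len(entries) if entries else 0.0
```

Calling unmatched_percent() with progressively larger values of `back`
gives the curve asked about in the last step. Note that this sketch only
counts unmatched commit-log entries; unmatched commits would need the
same bookkeeping from the other side.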

Alas, there are other dimensions of "fuzzy". Here are a couple:

1. Typos or omissions in the commit-log file cliques and/or author names.
   To get good coverage you might find you need to do something like a
   Ratcliff-Obershelp fuzzy match. Set a high similarity percentage, then
   back off it if you have lots of unmatched commits.

2. What if someone did two or more commits on different filesets, but
   described them in one commit-log entry? Ideally you'd like to propagate
   the commit-log author info correctly to both, but testing for this case
   mechanically would be combinatorially explosive. Your only hope is that
   you end up with few enough unmatched commits and commit-log entries
   that the problem can be solved manually.

Maybe you'll get lucky and the residuals (the sets of commits and
commit-log entries that don't have a match in the other set) will be tiny.
I wouldn't count on it - I'd expect that you will trip over other noise
sources and have to figure out ways to fuzzy-match around them.

Once you have the residuals down to an acceptably low number, make your
matcher grind out a set of reposurgeon commands that patches the
attributions appropriately. Apply.

Be careful to add a predicate check that prevents each transformation from
applying if the date matches more than one commit; those will have to be
treated as residuals and hand-patched.
-- 
Eric S. Raymond <http://www.catb.org/~esr/>