Re: Repository for the conversion machinery

Eric S. Raymond Mon, 10 Oct 2016 13:09:40 -0700

Joseph Myers <jos...@codesourcery.com>:
> This is still hypothetical, since I haven't seen any scripts posted that 
> would actually implement this, or any resulting mappings of commits, and one 
> wouldn't normally expect a repository conversion to attempt to distinguish 
> committer from author when the source version control system has no such 
> distinction.


Wow.  Even attempting this would be a huge, ugly job.

I strongly recommend that if you want to try this, you separate it from the
initial repo conversion.  That is, get the project to git first.  Then
see if you can data-mine author information out of the history. If,
and only if, you get results that look reasonable, then you patch the repo
and force-push it, warning everyone there'll be a flag day.

The reason I recommend this is that I think you're going to have serious
trouble getting clean authorship data with good coverage.  The data
mining will be messy and take longer than you expect.

Here's how I'd do it:

1. Write an analyzer for commit logs.  Its goal should be to parse
   logs and produce a list of records each consisting of an author, a
   commit date, and a list of modified-file paths - one record
   per commit-log entry.

2. Run this once on each terminal commit log - that is, at each branch
   head on both the main commit log and all its archival versions.
   Aggregate all the records, dropping duplicates.

3. Write a custom Python extension to reposurgeon that generates the
   same report, only this time per-commit and thus yielding a committer ID.

4. Set a recognition time window.  It must be more than 24 hours or you're
   going to have spurious negatives due to time-zone skew.

5. Write a program that fuzzy-matches the commit-log file-modification
   cliques to the per-commit cliques.  One aspect of "fuzzy" is the
   time window; you need to include as potential matches any commits back
   from the date of the commit-log entry *and those up to 24 hours forward*
   (time-zone skew again).  Also, you can't only look at the most recent
   matching commit if it's within the 24-hour window - time zone skew might
   mean that another one that looks older also matches, and might actually
   be more recent.

6. Try the naive implementation using a 24-hour time window.  Now look
   at the percentages of unmatched commits and commit-log entries.  If
   they're too high, how do they vary as the time window rises?
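
The record aggregation and time-window matching described above could be
sketched roughly as follows.  This is only an illustration under
assumptions: the LogEntry and Commit record types, the field names, and
the exact-set comparison of file cliques are all hypothetical, and the
parsing of commit logs and the reposurgeon extension are not shown.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class LogEntry:
    """One commit-log entry (hypothetical record layout)."""
    author: str
    date: datetime
    paths: frozenset


@dataclass(frozen=True)
class Commit:
    """One repository commit (hypothetical record layout)."""
    commit_id: str
    committer: str
    date: datetime
    paths: frozenset


# Forward allowance for time-zone skew, per the text above.
SKEW = timedelta(hours=24)


def dedupe(entries):
    """Drop duplicate records while keeping first-seen order."""
    return list(dict.fromkeys(entries))


def candidates(entry, commits, window):
    """Return every commit whose file clique equals the entry's and whose
    date lies between `window` before the entry and SKEW after it.
    All candidates are returned, not just the most recent one."""
    earliest = entry.date - window
    latest = entry.date + SKEW
    return [c for c in commits
            if earliest <= c.date <= latest and c.paths == entry.paths]


def unmatched_fraction(entries, commits, window):
    """Fraction of log entries with no candidate commit at all."""
    if not entries:
        return 0.0
    missed = sum(1 for e in entries if not candidates(e, commits, window))
    return missed / len(entries)
```

Calling unmatched_fraction() repeatedly with a growing window is one way
to see how the unmatched percentage varies as the time window rises.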

Alas, there are other dimensions of 'fuzzy'. Here are a couple:

1. Typos or omissions in the commit-log file cliques and/or author
   names.  To get good coverage you might find you need to do
   something like a Ratcliff-Obershelp fuzzy match.  Set a high
   similarity percentage, then back off it if you have lots of
   unmatched commits.

2. What if someone did two or more commits on different filesets, but
   described them in one commit-log entry?  Ideally you'd like to propagate
   the commit-log author info correctly to both, but testing for this case
   mechanically would be combinatorially explosive.  Your only hope is that
   you end up with few enough unmatched commits and commit-log entries
   that the problem can be solved manually.
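
For the typo-tolerant comparison, Python's standard difflib module is one
readily available option; its documentation describes SequenceMatcher as
based on an algorithm closely related to Ratcliff-Obershelp "gestalt
pattern matching".  A minimal sketch, assuming hypothetical helper names
(similarity, best_match, cliques_match) and a threshold you would tune by
hand:

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity ratio between two strings, from 0.0 to 1.0."""
    return SequenceMatcher(None, a, b).ratio()


def best_match(name, known, threshold):
    """Return the string in `known` most similar to `name`, or None if
    nothing reaches `threshold`.  Comparison ignores case."""
    scored = [(similarity(name.lower(), k.lower()), k) for k in known]
    if not scored:
        return None
    score, match = max(scored)
    return match if score >= threshold else None


def cliques_match(log_paths, commit_paths, threshold):
    """True when both cliques have the same size and every path from the
    commit log has a sufficiently similar path in the commit."""
    if len(log_paths) != len(commit_paths):
        return False
    return all(best_match(p, commit_paths, threshold) is not None
               for p in log_paths)
```

Start with a threshold near 1.0 and lower it gradually, checking by eye
that the newly admitted matches are typos rather than different files.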

Maybe you'll get lucky and the residuals (the sets of commits and commit-log
entries that don't have a match in the other set) will be tiny.  I wouldn't
count on it - I'd expect that you will trip over other noise sources and
have to figure out ways to fuzzy-match around them.

Once you have the residuals down to an acceptably low number, make your
matcher grind out a set of reposurgeon commands that patches the attributions
appropriately.  Apply.  Be careful to add a predicate check that prevents
each transformation from applying if the date matches more than one commit;
those will have to be treated as residuals and hand-patched.
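
The uniqueness check can be sketched independently of reposurgeon's
command syntax, which is deliberately not reproduced here.  This assumes
a hypothetical `matches` mapping from each commit-log entry (any hashable
key) to the list of commit IDs it matched:

```python
def split_matches(matches):
    """Separate unambiguous matches from residuals.

    Returns (patches, residuals): `patches` maps each entry to its single
    matching commit ID; `residuals` lists entries that matched zero or
    more than one commit and so need hand-patching."""
    patches = {}
    residuals = []
    for entry, commit_ids in matches.items():
        if len(commit_ids) == 1:
            patches[entry] = commit_ids[0]
        else:
            residuals.append(entry)
    return patches, residuals
```

Only the entries in `patches` would be turned into attribution commands.
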
-- 
                <a href="http://www.catb.org/~esr/">Eric S. Raymond</a>
