On Mon, 9 Dec 2019, Bernd Schmidt wrote:

> On 12/9/19 7:19 PM, Joseph Myers wrote:
> > 
> > For any conversion we're clearly going to need to run various validation
> > (comparing properties of the converted repository, such as contents at
> > branch tips, with expected values of those properties based on the SVN
> > repository) and fix issues shown up by that validation.  reposurgeon has
> > its own tools for such validation; I also intend to write some validation
> > scripts myself.
> 
> Would it be feasible to require that both conversions produce the same output
> repository to some degree? Can we just look at release tags and require that
> they have the same hash in both conversions, or are there good reasons why the
> two would produce different outputs?

Requiring the same hashes is not practical.  There are several areas where two 
perfectly correct conversions are still expected to have different 
contents because of subjective decisions and heuristics involved in the 
conversion.

If some alternative heuristic is found to be clearly better than an 
existing one in reposurgeon, so that it would be better for any project 
converting with reposurgeon, or if some preference in the GCC case can 
readily be represented as a configuration option to choose between 
different approaches, it makes sense to implement the improvements in 
reposurgeon so that any project with similar issues can benefit.  For 
example, see Richard's suggestions in reposurgeon issue 174 of two 
possible improvements to ChangeLog handling: disregarding ChangeLog data 
if a commit adds multiple ChangeLog entries by different authors, and 
specifying a wildcard to allow ChangeLog processing on ChangeLog* files to 
cover ChangeLog.<branch>.  GCC is hardly the last project converting from 
SVN to git, so we can benefit from the experiences of past conversions, 
and help contribute to having useful features available for future 
conversions.

Here are some ways in which two correct conversions may differ:

* Tree contents should mostly be identical at any given commit, but 
reposurgeon deliberately produces a .gitignore with contents based on 
svn:ignore if the SVN tree contents don't have a .gitignore (we use 
--user-ignores to prefer the .gitignore file in SVN if it exists), and 
removes any .cvsignore file.

* The first parent of a commit should typically be the same between 
conversions, but (a) it might be corrected in some way for cvs2svn issues, 
and (b) it might skip SVN commits that would translate into empty git 
commits, depending on the choices made for handling of such commits.

* Cases that give rise to no tree changes in a commit (which thus might 
not become a git commit at all depending on the choices made and whether 
they also don't change any merge information properties) include (a) 
branch or tag creation as an exact copy of some revision of some branch, 
(b) branch recreation as a copy, e.g. when trunk was deleted accidentally, 
(c) commits that in SVN only add or remove empty directories, as git does 
not store empty directories, (d) commits that in SVN just remove some file 
or directory and replace it with a copy from some revision of some branch 
that happens to have identical contents to the file or directory removed 
(yes, we do have commits like that in GCC SVN).

* Subsequent parents of a commit based on merge info handling may well 
have subjective differences between correct conversions.

* Commit messages might differ, both because of heuristics to improve 
them, like Richard's work on that, and because of different choices for 
how to represent the SVN revision number information in commit messages.

* Author and committer identifications, and commit timestamps (especially 
timezones, something git has, SVN doesn't and reposurgeon has a per-author 
map for) may vary because of different heuristics or author maps used, 
especially when there is no ChangeLog entry for a commit or the ChangeLog 
entry is in some way malformed or the commit adds ChangeLog entries for 
multiple changes with different authors.
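
Given the above, a cross-conversion check would have to compare tree 
contents rather than commit hashes, and allow for the .gitignore and 
.cvsignore differences.  The following is only a rough sketch in Python 
of what such a check might look like; the function names and the set of 
ignored basenames are assumptions made for illustration and are not part 
of reposurgeon or of any existing validation script.  It relies only on 
"git ls-tree -r -z".

```python
import subprocess

# Basenames expected to differ legitimately between conversions
# (assumption for this sketch: only these two are excluded).
IGNORED_BASENAMES = {".gitignore", ".cvsignore"}


def parse_ls_tree(output):
    """Parse NUL-separated `git ls-tree -r -z` output into
    a dict mapping path -> (mode, object id)."""
    entries = {}
    for record in output.split("\0"):
        if not record:
            continue
        meta, path = record.split("\t", 1)
        mode, _objtype, objid = meta.split()
        entries[path] = (mode, objid)
    return entries


def tree_listing(repo, ref):
    """Return the recursive tree listing of `ref` in the repository at `repo`."""
    result = subprocess.run(
        ["git", "-C", repo, "ls-tree", "-r", "-z", ref],
        check=True, capture_output=True, text=True,
    )
    return parse_ls_tree(result.stdout)


def differing_paths(listing_a, listing_b):
    """Return the sorted paths whose mode or blob differs between two
    listings, disregarding .gitignore and .cvsignore files."""
    def keep(listing):
        return {path: value for path, value in listing.items()
                if path.rsplit("/", 1)[-1] not in IGNORED_BASENAMES}

    a, b = keep(listing_a), keep(listing_b)
    return sorted(path for path in a.keys() | b.keys()
                  if a.get(path) != b.get(path))
```

With two converted repositories in hypothetical directories conv-a and 
conv-b, differing_paths(tree_listing("conv-a", tag), tree_listing("conv-b", 
tag)) would be expected to return an empty list for a release tag present 
in both.  This says nothing about parents, commit messages or author 
information, which need separate checks.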

-- 
Joseph S. Myers
jos...@codesourcery.com
