This is v2 of a patch series to improve the heuristics that `git diff` and related commands use to position ambiguous blocks of added/deleted lines. Thanks to Stefan, Jacob, Peff, Junio, and Jakub for their comments about v1 [1]. This patch series is also available from my GitHub account [2] as branch `diff-indent-heuristic`.
This version started out as a light touch-up of v1, but as I was working I realized that it needed more changes. Among other things, the final heuristic is now improved relative to the one proposed in v1. Summary of changes: * I changed the `--compaction-heuristic` code such that if a group of added/deleted lines can be aligned with a group of deleted/added lines in the other file, then that should be done rather than try to slide to a blank line. In the existing code, the compaction heuristic was attempted, incorrectly, *after* trying to align groups, which doesn't make sense. * I changed `recs_match()` similarly to `is_blank_line()`: now it takes two `xrecord_t *` as arguments rather than an array of `xrecord_t` and two indexes. * I refactored the old `xdl_change_compact()` more thoroughly. All of the index manipulation was pretty confusing, and I was having trouble convincing myself that even the old code was working correctly. So I introduced a `struct group`, which is like a cursor that keeps track of a (possibly empty) group of changed lines in the old or new version of a file. I added functions to initialize a group to the first change in a file, to move to the next or previous groups, and to slide a group forward or backward if possible (i.e., if the group is a "slider"). * In the course of that refactoring, I found (and fixed) another bug in the `--compaction-heuristic` code: it didn't notice if the top possible position of a slider had a blank line as its last line. See the commit message of patch [5/5] for more details. * I realized that the behavior of the indent heuristic from v1, because it multiplied weighting factors times indent widths, would behave differently if a project changed its convention for how many spaces to use per level of indentation. That doesn't make sense, so I changed the handling of indentation: Now, the sum of the indentation width at the top and bottom of the slider are added, *but those sums are only compared* between slider positions, and the weight is multipled by -1, 0, or +1 depending on whether the first indent sum is less than, equal, or greater than the other. This should make the behavior relatively independent of the project's indentation convention, and thus make the heuristic work more consistently across projects. The resulting heuristic works significantly better than the one proposed in v1: | repository | count | Git 2.9.0 | v1 | v2 | | --------------------- | ----- | -------------- | -------------- | -------------- | | afnetworking | 109 | 89 (81.7%) | 3 (2.8%) | 2 (1.8%) | | alamofire | 30 | 18 (60.0%) | 2 (6.7%) | 0 (0.0%) | | angular | 184 | 127 (69.0%) | 15 (8.2%) | 5 (2.7%) | | animate | 313 | 2 (0.6%) | 2 (0.6%) | 2 (0.6%) | | ant | 380 | 356 (93.7%) | 22 (5.8%) | 15 (3.9%) | | bugzilla | 306 | 263 (85.9%) | 24 (7.8%) | 15 (4.9%) | | corefx | 126 | 91 (72.2%) | 15 (11.9%) | 6 (4.8%) | | couchdb | 78 | 44 (56.4%) | 17 (21.8%) | 6 (7.7%) | | cpython | 937 | 158 (16.9%) | 26 (2.8%) | 5 (0.5%) | | discourse | 160 | 95 (59.4%) | 24 (15.0%) | 13 (8.1%) | | docker | 307 | 194 (63.2%) | 16 (5.2%) | 8 (2.6%) | | electron | 163 | 132 (81.0%) | 15 (9.2%) | 6 (3.7%) | | git | 536 | 470 (87.7%) | 18 (3.4%) | 16 (3.0%) | | gitflow | 127 | 0 (0.0%) | 0 (0.0%) | 0 (0.0%) | | ionic | 133 | 89 (66.9%) | 6 (4.5%) | 1 (0.8%) | | ipython | 482 | 362 (75.1%) | 45 (9.3%) | 11 (2.3%) | | junit | 161 | 147 (91.3%) | 4 (2.5%) | 1 (0.6%) | | lighttable | 15 | 5 (33.3%) | 2 (13.3%) | 0 (0.0%) | | magit | 88 | 75 (85.2%) | 5 (5.7%) | 0 (0.0%) | | neural-style | 28 | 0 (0.0%) | 0 (0.0%) | 0 (0.0%) | | nodejs | 781 | 649 (83.1%) | 98 (12.5%) | 5 (0.6%) | | phpmyadmin | 491 | 481 (98.0%) | 7 (1.4%) | 2 (0.4%) | | react-native | 168 | 130 (77.4%) | 5 (3.0%) | 0 (0.0%) | | rust | 171 | 128 (74.9%) | 17 (9.9%) | 14 (8.2%) | | spark | 186 | 149 (80.1%) | 17 (9.1%) | 2 (1.1%) | | tensorflow | 115 | 66 (57.4%) | 7 (6.1%) | 5 (4.3%) | | test-more | 19 | 15 (78.9%) | 1 (5.3%) | 1 (5.3%) | | test-unit | 51 | 34 (66.7%) | 8 (15.7%) | 2 (3.9%) | | xmonad | 23 | 22 (95.7%) | 1 (4.3%) | 1 (4.3%) | | --------------------- | ----- | -------------- | -------------- | -------------- | | totals | 6668 | 4391 (65.9%) | 422 (6.3%) | 144 (2.2%) | * I noticed that most of the "bonus" weights were actually negative, so calling them "bonuses" was misleading. Therefore, I negated the coefficients and now call them "penalties" * I noticed that `git blame` was parsing diff options like `--indent-heuristic` from the command line, but wasn't using the values. That is fixed in a new patch [07/07]. * I redid all of the analysis and training with a bigger corpus. To check whether I was overtraining the heuristic, I also did the following experiment: I optimized the weights against a training set consisting of only some of the repositories, then tested it against the rest of the corpus. See the full results in the commit message for patch [06/06]. * I added a couple of smoke tests. The companion project [3] now provides an easy way to replicate and extend these results, if anybody is interested. The new heuristic still has to be enabled via the `--indent-heuristic` command-line parameter or the `diff.indentHeuristic` configuration setting. Before we release this, we should decide what the final UI should look like and make it so. If we decide to replace the existing compaction heuristic with this one and/or turn this heuristic on for all users by default, those steps might look like branch `indent-heuristic-default` in my GitHub repository [2]. Michael [1] http://public-inbox.org/git/cover.1470259583.git.mhag...@alum.mit.edu/t/#u [2] https://github.com/mhagger/git [3] https://github.com/mhagger/diff-slider-tools Michael Haggerty (7): xdl_change_compact(): fix compaction heuristic to adjust ixo xdl_change_compact(): only use heuristic if group can't be matched is_blank_line(): take a single xrecord_t as argument recs_match(): take two xrecord_t pointers as arguments xdl_change_compact(): introduce the concept of a change group diff: improve positioning of add/delete blocks in diffs blame: actually use the diff opts parsed from the command line Documentation/diff-config.txt | 7 +- Documentation/diff-options.txt | 9 +- builtin/blame.c | 11 +- diff.c | 23 +- git-add--interactive.perl | 5 +- t/t4059-diff-indent.sh | 160 +++++++++++ xdiff/xdiff.h | 1 + xdiff/xdiffi.c | 635 ++++++++++++++++++++++++++++++++++------- 8 files changed, 736 insertions(+), 115 deletions(-) create mode 100755 t/t4059-diff-indent.sh -- 2.9.3 -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html