Re: [PATCH 3/4] diff-highlight: match up lines before highlighting
On Tue, Nov 17, 2015 at 12:18:28AM -0500, Jonathan Lebon wrote: > > In the other thread I mentioned earlier, the solution I cooked up was > > dropping highlighting entirely for hunks over a certain percentage of > > highlighting. I wonder if we could do something similar here (e.g., > > don't match lines where more than 50% of the line would be highlighted). > > I looked over but haven't tested the patches in the other thread yet. But > overall, the LCS definitely looks promising. I'm hoping to find some time > to have a more serious go at it and maybe pick it up where you left off. > [...] > Thanks again for reviewing these patches and apologies for the delayed > reply. Great! I look forward to seeing what you produce. Let me know if you want me to clarify anything from that earlier discussion. And don't worry about delayed replies; it's part of the normal workflow around here. -Peff -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/4] diff-highlight: match up lines before highlighting
On Tue, Nov 3, 2015 at 5:03 PM, Jeff King wrote: > Your is _much_ slower. I get: > > real0m25.538s > user0m25.420s > sys 0m0.120s > > for the old versus: > > real2m3.580s > user2m3.548s > sys 0m0.156s Thanks for investigating and trying it out. I got the same results here as well. > for your series. In an interactive setting, the latency may not be that > noticeable, but if you are digging far into history (e.g., "git log -p", > then using "/" in less to search for a commit or some test), I suspect > it would be very noticeable. Agreed. > I was thinking there was some low-hanging fruit in memoizing the > calculations, but even the prefix/suffix computation is pairwise. I'm > not really sure how to make this much faster. I gave memoization a try to see if it could improve the situation. I also lowered maxhunksize to 10. Doing `git log -p` on git.git went from 2m31 to 2m11. So I think it would require a whole other approach overall. > As for the output itself, the diff between the two looks promising. The > first several cases I looked at ar strict improvements. Some of them are > kind of weird, especially in English text. Yes, I'm very happy with the improvements and run with these patches all the time for now. > In the other thread I mentioned earlier, the solution I cooked up was > dropping highlighting entirely for hunks over a certain percentage of > highlighting. I wonder if we could do something similar here (e.g., > don't match lines where more than 50% of the line would be highlighted). I looked over but haven't tested the patches in the other thread yet. But overall, the LCS definitely looks promising. I'm hoping to find some time to have a more serious go at it and maybe pick it up where you left off. > > -Peff Thanks again for reviewing these patches and apologies for the delayed reply. Jonathan -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/4] diff-highlight: match up lines before highlighting
On Tue, Nov 03, 2015 at 04:44:16PM -0500, Jeff King wrote: > Have you looked at a diff of the old/new output on something like > git.git? This should be pretty easy to do (and time). I tried: git log --oneline --color -p >base time perl highlight.old old time perl highlight.new new diff -c old new | less where the "highlight.*" scripts are the versions at master, and master with your series applied. Your is _much_ slower. I get: real0m25.538s user0m25.420s sys 0m0.120s for the old versus: real2m3.580s user2m3.548s sys 0m0.156s for your series. In an interactive setting, the latency may not be that noticeable, but if you are digging far into history (e.g., "git log -p", then using "/" in less to search for a commit or some test), I suspect it would be very noticeable. I was thinking there was some low-hanging fruit in memoizing the calculations, but even the prefix/suffix computation is pairwise. I'm not really sure how to make this much faster. As for the output itself, the diff between the two looks promising. The first several cases I looked at ar strict improvements. Some of them are kind of weird, especially in English text. For example, see the RelNotes update in 2635c2b. The base diff is: * "git rebase -i" had a minor regression recently, which stopped considering a line that begins with an indented '#' in its insn - sheet not a comment, which is now fixed. - (merge 1db168e gr/rebase-i-drop-warn later to maint). + sheet not a comment. Further, the code was still too picky on + Windows where CRLF left by the editor is turned into a trailing CR + on the line read via the "read" built-in command of bash. Both of + these issues are now fixed. + (merge 39743cf gr/rebase-i-drop-warn later to maint). Before we highlighted nothing, and now we hone in on "now fixed". Which is _sort of_ a match, but really the whole sentence structure has changed. If this is the worst regression, I can certainly live with it. And even a proper LCS diff would probably end up making spaghetti of a text change like this. In the other thread I mentioned earlier, the solution I cooked up was dropping highlighting entirely for hunks over a certain percentage of highlighting. I wonder if we could do something similar here (e.g., don't match lines where more than 50% of the line would be highlighted). -Peff -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/4] diff-highlight: match up lines before highlighting
On Mon, Nov 02, 2015 at 09:05:33PM -0500, Jonathan Lebon wrote: > As mentioned in the README, one of the current limitations of > diff-highlight is that it only calculates highlights when the hunk > contains the same number of removed lines as added lines. > > A further limitation upon this is that diff-highlight assumes that the > first line removed matches the first line added, similarly with the > second, the third, etc... As was demonstrated in the "Bugs" section of > the README, this poses limitations since that assumption does not always > give the best result. > > With this patch, we eliminate those limitations by trying to match up > the removed and added lines before highlighting them. This is done using > a recursive algorithm. Hmm. So this seems like a reasonable incremental feature. I do think it is a hack (piled on the hack that is the existing script :) ), and the right solution is to actually do an LCS diff for each hunk that crosses line boundaries. I made some headway on that this summer as part of this discussion: http://thread.gmane.org/gmane.comp.version-control.git/271692 It's long, but there's some good stuff in there. But I think I came to the conclusion that this really needs to go inside of diff.c itself (which will also require some heavy refactoring of the whitespace code; see the referenced thread for details). But I'm not opposed to an incremental feature like this in the meantime. The real test for me is: how does it look in practice? These are all heuristics, so I don't think we have anything better than eyeballing the output. Have you looked at a diff of the old/new output on something like git.git? > Note that I did not bother with some common optimizations such as > memoization since the usual number of removed/added lines in a single > hunk are small. In practice, I have not felt any lag at all during > paging. I'd worry less about normal use, and more about hitting some extreme corner case. Your algorithm looks roughly quadratic in the size of the hunk. I guess that is canceled out by the max-hunk-size option in the next patch, though. I don't think it's easy to make your algorithm non-quadratic, either, as it inherently relies on pairwise comparisons (and not, say, generating a fingerprint of each line and sorting them or something like that). It might be worth memo-izing find_common_* calls, though, as that is just repeated work (quadratic or not). It should be easy to time. > + # prime the loop > + my ($besti, $bestj) = ($a_first, $b_first); > + my $bestn = calculate_match($a->[$a_first], $b->[$b_first]) + 1; > + > + for (my $i = $a_first; $i < $a_last; $i++) { > + for (my $j = $b_first; $j < $b_last; $j++) { > + my $n = calculate_match($a->[$i], $b->[$j]); > + if ($n < $bestn) { > + ($besti, $bestj, $bestn) = ($i, $j, $n); > + } > + } > + } > + > + # find the best matches in the lower pairs > + match_and_highlight_pairs($a, $a_first, $besti, $b, $b_first, $bestj, > $queue); Hmm, is this actually O(n^3)? We have a quadratic loop, and then we recurse for the remainder. If we have two candidates for which calculate_match returns the same value, how do we break the tie? It looks like we'll just pick the lowest-numbered match. I'd think we would want to prefer the one with the closest line number. But not having thought too hard about it, I wonder: 1. Does it actually make a difference which line we pick? The interesting bit is how much we highlight, so in that sense do we care only about the prefix and suffix sizes? 2. Do you end up picking the closest line with your algorithm anyway, as will tend to match as we go, skipping over only lines that will likely have no match? -Peff -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html