Re: [RFC] Diff (blame) optimization: how to go forward
Johan Corveleyn wrote on Fri, 8 Oct 2010 at 01:44 -: With suffix-lines-to-keep=50, you'd need to insert a block of text that has its last 50 lines identical to the 50 lines preceding the insertion point, to mess up the diff result. - If really necessary, we could say default=50, but it can be overridden by the user with another -x option. So the main question is: is such a change in behavior (unlikely to have an impact in most realistic cases) acceptable for this kind of performance improvement? Just to clarify: are you asking whether it's acceptable to have a not nice (but still semantically correct) diff output in case the user adds a block whose last 50 lines match the 50 lines prior to the first difference? And 'not nice' just means the preceding, rather than trailing, instance of the repeated block would be grabbed by the /^+/ lines in the diff, right? In this case, I think I'm going to answer your long mail with just... Yes. or actually... Yes, and let's move on to deeper problems (if any). (I'm not /particularly/ worried about a different-than-expected, but still semantically correct, diff --- though, yes, it's nice to output the expected diff if that's possible without too much effort and without breaking other use cases.) :-), Daniel (p.s. I don't have as much time to look into this as I'd like to...)
Re: [RFC] Diff (blame) optimization: how to go forward
On Thu, Oct 7, 2010 at 6:44 PM, Johan Corveleyn jcor...@gmail.com wrote: Hi all, This is a follow-up to the WIP-patches I posted last week [1], which are about improving performance of svn_diff (and therefor also blame on the client-side), especially for large files. To summarize: the idea was (is) to strip off the identical prefix and suffix, and then letting the existing diff algorithm work on the remainder. As stated in [2], I ran into some difficulties to get the exact same result of the existing diff algorithm (because of too eager suffix stripping). I've spent some time searching for a perfect solution, but I can't find one. So now I'm stuck and would like to have some feedback on what to do with it. First: the thing works, and it can reduce the time needed for svn_diff by 20% up to 80% for large files (depending on the distribution of the changes). An extreme example is a reduction from ~150ms to ~30ms for a 6-lines file with a small number of lines changed (say up to 100) located close together (so lots of identical prefix/suffix). Blame gets similar benefits, *if the server is fast enough*. I used svnserve built from Stefan^2's performance branch to test this. A normal svnserve with FSFS isn't really fast enough (unless maybe with an SSD, but I don't have one here) to serve up the 2500 revisions of the large file quickly enough. But the performance-enhanced-svnserve, with a hot cache filled with the necessary full-texts, is definitely fast enough to make the time of blame completely dependent on the client-side performance. Conclusion: the diff optimization we're talking about is much more useful for blame together with a server with the full-text membuffer caching of the performance branch. Now, the reason why I'm stuck: suffix scanning sometimes strips a little bit too much (see [2] for full explanation). In this case the resulting diff is still correct, but it's not as nice for the user. As I noted in [2], GNU diff also has this problem, but the user can help a little bit by specifying a large enough number of prefix/suffix-lines to keep (--horizon-lines). So, what to do? - I have not found a better solution than to use some form of horizon-lines to get the nice result, but still have a similar performance improvement (unless if I leave out suffix stripping entirely, only strip prefix, but that halves the power of the optimization). - I've tested with keeping up to 50 suffix-lines. It didn't have the slightest impact on the performance improvement (we're talking about stripping away 1000's of lines). This fixed all real-life examples I could find/test (diff/blame output is precisely identical to original svn). If I kept only 20 lines, I still ran into some differences (30 lines was enough for my example file, but I took it up to 50 to be sure). - I would like to avoid leaving the decision to the user to come up with an adequate number of suffix-lines-to-keep. So I'd like to just hardcode some value, say 50, or even 100. I think this would give good (=identical to original svn) diff/blame results in all but the most extreme cases. With suffix-lines-to-keep=50, you'd need to insert a block of text that has its last 50 lines identical to the 50 lines preceding the insertion point, to mess up the diff result. - If really necessary, we could say default=50, but it can be overridden by the user with another -x option. So the main question is: is such a change in behavior (unlikely to have an impact in most realistic cases) acceptable for this kind of performance improvement? Other options: - Come up with a clever solution to fix the optimization properly (but I don't know if that's possible - suggestions welcome :-)). - Throw this optimization away, and take a look at something like Patience diff (supposed to be fast as well, and give nice results). However, I don't know Patience diff well enough to code up something, or even to judge whether it would give good results in all (or most) cases. - Just throw it away and don't look back :-) - Something else? My not-having-looked-deeply-at-the-problem reaction is this: if we can introduce significant speed-ups in the common case, while still maintaining functionality, let's do it. There may be corner cases where somebody gets an ugly diff, but in those corner cases there are likely to be *other* issues as play as well. Let's not penalize everybody for the sake of a couple of corner cases. -Hyrum
Re: [RFC] Diff (blame) optimization: how to go forward
On Fri, Oct 8, 2010 at 3:08 PM, Daniel Shahaf d...@daniel.shahaf.name wrote: Johan Corveleyn wrote on Fri, 8 Oct 2010 at 01:44 -: With suffix-lines-to-keep=50, you'd need to insert a block of text that has its last 50 lines identical to the 50 lines preceding the insertion point, to mess up the diff result. - If really necessary, we could say default=50, but it can be overridden by the user with another -x option. So the main question is: is such a change in behavior (unlikely to have an impact in most realistic cases) acceptable for this kind of performance improvement? Just to clarify: are you asking whether it's acceptable to have a not nice (but still semantically correct) diff output in case the user adds a block whose last 50 lines match the 50 lines prior to the first difference? And 'not nice' just means the preceding, rather than trailing, instance of the repeated block would be grabbed by the /^+/ lines in the diff, right? Yep, that's what I meant. In this case, I think I'm going to answer your long mail with just... Yes. Ok, great. or actually... Yes, and let's move on to deeper problems (if any). What deeper problems? :-) Seriously: I saw this as the biggest problem in the scope of this optimization. If the result is not good the entire optimization would be worthless, so any other problem with my implementation would be irrelevant. Good to hear that it's not that big of a problem. (I realize this optimization is just a small thing compared to more important/fundamental stuff going on right now in subversion. But it's important to me personally. And it's the most I can do with my abilities/knowledge/time right now, so ...) (I'm not /particularly/ worried about a different-than-expected, but still semantically correct, diff --- though, yes, it's nice to output the expected diff if that's possible without too much effort and without breaking other use cases.) :-), Daniel (p.s. I don't have as much time to look into this as I'd like to...) No problem. Your feedback is very much appreciated. On Fri, Oct 8, 2010 at 8:46 PM, Hyrum K. Wright hyrum_wri...@mail.utexas.edu wrote: My not-having-looked-deeply-at-the-problem reaction is this: if we can introduce significant speed-ups in the common case, while still maintaining functionality, let's do it. There may be corner cases where somebody gets an ugly diff, but in those corner cases there are likely to be *other* issues as play as well. Let's not penalize everybody for the sake of a couple of corner cases. Ok, perfect, we seem to have a consensus then :-). I'll cook up a version of the patch which keeps 50 suffix lines. Thanks for the feedback, Daniel and Hyrum. Just one more thing: as I mentioned in my rather long mail, blame would benefit the most from my optimization if the server were fast enough. So if anyone could spare some review cycles (ok, I know they are scarce these days), reviewing Stefan^2's integrate-cache-membuffer branch would really help my cause as well ... Cheers, -- Johan
Re: [RFC] Diff (blame) optimization: how to go forward
On Fri, Oct 8, 2010 at 3:24 PM, Johan Corveleyn jcor...@gmail.com wrote: ... Just one more thing: as I mentioned in my rather long mail, blame would benefit the most from my optimization if the server were fast enough. So if anyone could spare some review cycles (ok, I know they are scarce these days), reviewing Stefan^2's integrate-cache-membuffer branch would really help my cause as well ... Yeah, just a question of tuits. If you've reviewed the branch, posting your conclusions here would be a great contribution. Speaking for myself, I feel you've got more experience in some areas of that branch than I do. :) (Apologies if you've already done so, and I've missed it.) -Hyrum
Re: [RFC] Diff (blame) optimization: how to go forward
On Fri, Oct 8, 2010 at 10:35 PM, Hyrum K. Wright hyrum_wri...@mail.utexas.edu wrote: On Fri, Oct 8, 2010 at 3:24 PM, Johan Corveleyn jcor...@gmail.com wrote: ... Just one more thing: as I mentioned in my rather long mail, blame would benefit the most from my optimization if the server were fast enough. So if anyone could spare some review cycles (ok, I know they are scarce these days), reviewing Stefan^2's integrate-cache-membuffer branch would really help my cause as well ... Yeah, just a question of tuits. If you've reviewed the branch, posting your conclusions here would be a great contribution. Speaking for myself, I feel you've got more experience in some areas of that branch than I do. :) (Apologies if you've already done so, and I've missed it.) No, sorry, I haven't either. I have compiled and run it (well actually the performance branch itself, not the integrate-cache-membuffer branch). But I haven't looked at the source code. Mmmm, tuits, I wish I had some spare ones lying around :-). Who knows, maybe I'll find some ... Cheers, -- Johan
[RFC] Diff (blame) optimization: how to go forward
Hi all, This is a follow-up to the WIP-patches I posted last week [1], which are about improving performance of svn_diff (and therefor also blame on the client-side), especially for large files. To summarize: the idea was (is) to strip off the identical prefix and suffix, and then letting the existing diff algorithm work on the remainder. As stated in [2], I ran into some difficulties to get the exact same result of the existing diff algorithm (because of too eager suffix stripping). I've spent some time searching for a perfect solution, but I can't find one. So now I'm stuck and would like to have some feedback on what to do with it. First: the thing works, and it can reduce the time needed for svn_diff by 20% up to 80% for large files (depending on the distribution of the changes). An extreme example is a reduction from ~150ms to ~30ms for a 6-lines file with a small number of lines changed (say up to 100) located close together (so lots of identical prefix/suffix). Blame gets similar benefits, *if the server is fast enough*. I used svnserve built from Stefan^2's performance branch to test this. A normal svnserve with FSFS isn't really fast enough (unless maybe with an SSD, but I don't have one here) to serve up the 2500 revisions of the large file quickly enough. But the performance-enhanced-svnserve, with a hot cache filled with the necessary full-texts, is definitely fast enough to make the time of blame completely dependent on the client-side performance. Conclusion: the diff optimization we're talking about is much more useful for blame together with a server with the full-text membuffer caching of the performance branch. Now, the reason why I'm stuck: suffix scanning sometimes strips a little bit too much (see [2] for full explanation). In this case the resulting diff is still correct, but it's not as nice for the user. As I noted in [2], GNU diff also has this problem, but the user can help a little bit by specifying a large enough number of prefix/suffix-lines to keep (--horizon-lines). So, what to do? - I have not found a better solution than to use some form of horizon-lines to get the nice result, but still have a similar performance improvement (unless if I leave out suffix stripping entirely, only strip prefix, but that halves the power of the optimization). - I've tested with keeping up to 50 suffix-lines. It didn't have the slightest impact on the performance improvement (we're talking about stripping away 1000's of lines). This fixed all real-life examples I could find/test (diff/blame output is precisely identical to original svn). If I kept only 20 lines, I still ran into some differences (30 lines was enough for my example file, but I took it up to 50 to be sure). - I would like to avoid leaving the decision to the user to come up with an adequate number of suffix-lines-to-keep. So I'd like to just hardcode some value, say 50, or even 100. I think this would give good (=identical to original svn) diff/blame results in all but the most extreme cases. With suffix-lines-to-keep=50, you'd need to insert a block of text that has its last 50 lines identical to the 50 lines preceding the insertion point, to mess up the diff result. - If really necessary, we could say default=50, but it can be overridden by the user with another -x option. So the main question is: is such a change in behavior (unlikely to have an impact in most realistic cases) acceptable for this kind of performance improvement? Other options: - Come up with a clever solution to fix the optimization properly (but I don't know if that's possible - suggestions welcome :-)). - Throw this optimization away, and take a look at something like Patience diff (supposed to be fast as well, and give nice results). However, I don't know Patience diff well enough to code up something, or even to judge whether it would give good results in all (or most) cases. - Just throw it away and don't look back :-) - Something else? Cheers, -- Johan [1] http://svn.haxx.se/dev/archive-2010-10/0032.shtml [2] http://svn.haxx.se/dev/archive-2010-10/0050.shtml