Re: [RFC] Diff (blame) optimization: how to go forward

2010-10-08 Thread Daniel Shahaf
Johan Corveleyn wrote on Fri, 8 Oct 2010 at 01:44 -:
 With suffix-lines-to-keep=50, you'd need to insert a
 block of text that has its last 50 lines identical to the 50 lines
 preceding the insertion point, to mess up the diff result.
 
 - If really necessary, we could say default=50, but it can be
 overridden by the user with another -x option.
 
 So the main question is: is such a change in behavior (unlikely to
 have an impact in most realistic cases) acceptable for this kind of
 performance improvement?

Just to clarify: are you asking whether it's acceptable to have a not
nice (but still semantically correct) diff output in case the user adds
a block whose last 50 lines match the 50 lines prior to the first
difference?

And 'not nice' just means the preceding, rather than trailing, instance
of the repeated block would be grabbed by the /^+/ lines in the diff,
right?

In this case, I think I'm going to answer your long mail with just...

Yes.

or actually...

Yes, and let's move on to deeper problems (if any).

(I'm not /particularly/ worried about a different-than-expected, but
still semantically correct, diff --- though, yes, it's nice to output
the expected diff if that's possible without too much effort and
without breaking other use cases.)

:-),

Daniel
(p.s. I don't have as much time to look into this as I'd like to...)


Re: [RFC] Diff (blame) optimization: how to go forward

2010-10-08 Thread Hyrum K. Wright
On Thu, Oct 7, 2010 at 6:44 PM, Johan Corveleyn jcor...@gmail.com wrote:
 Hi all,

 This is a follow-up to the WIP-patches I posted last week [1], which
 are about improving performance of svn_diff (and therefor also blame
 on the client-side), especially for large files.

 To summarize: the idea was (is) to strip off the identical prefix and
 suffix, and then letting the existing diff algorithm work on the
 remainder. As stated in [2], I ran into some difficulties to get the
 exact same result of the existing diff algorithm (because of too eager
 suffix stripping). I've spent some time searching for a perfect
 solution, but I can't find one. So now I'm stuck and would like to
 have some feedback on what to do with it.

 First: the thing works, and it can reduce the time needed for svn_diff
 by 20% up to 80% for large files (depending on the distribution of the
 changes). An extreme example is a reduction from ~150ms to ~30ms for a
 6-lines file with a small number of lines changed (say up to 100)
 located close together (so lots of identical prefix/suffix).

 Blame gets similar benefits, *if the server is fast enough*. I used
 svnserve built from Stefan^2's performance branch to test this. A
 normal svnserve with FSFS isn't really fast enough (unless maybe with
 an SSD, but I don't have one here) to serve up the 2500 revisions of
 the large file quickly enough. But the performance-enhanced-svnserve,
 with a hot cache filled with the necessary full-texts, is definitely
 fast enough to make the time of blame completely dependent on the
 client-side performance. Conclusion: the diff optimization we're
 talking about is much more useful for blame together with a server
 with the full-text membuffer caching of the performance branch.

 Now, the reason why I'm stuck: suffix scanning sometimes strips a
 little bit too much (see [2] for full explanation). In this case the
 resulting diff is still correct, but it's not as nice for the user. As
 I noted in [2], GNU diff also has this problem, but the user can help
 a little bit by specifying a large enough number of
 prefix/suffix-lines to keep (--horizon-lines).

 So, what to do?
 - I have not found a better solution than to use some form of
 horizon-lines to get the nice result, but still have a similar
 performance improvement (unless if I leave out suffix stripping
 entirely, only strip prefix, but that halves the power of the
 optimization).

 - I've tested with keeping up to 50 suffix-lines. It didn't have the
 slightest impact on the performance improvement (we're talking about
 stripping away 1000's of lines). This fixed all real-life examples I
 could find/test (diff/blame output is precisely identical to original
 svn). If I kept only 20 lines, I still ran into some differences (30
 lines was enough for my example file, but I took it up to 50 to be
 sure).

 - I would like to avoid leaving the decision to the user to come up
 with an adequate number of suffix-lines-to-keep. So I'd like to just
 hardcode some value, say 50, or even 100. I think this would give good
 (=identical to original svn) diff/blame results in all but the most
 extreme cases. With suffix-lines-to-keep=50, you'd need to insert a
 block of text that has its last 50 lines identical to the 50 lines
 preceding the insertion point, to mess up the diff result.

 - If really necessary, we could say default=50, but it can be
 overridden by the user with another -x option.

 So the main question is: is such a change in behavior (unlikely to
 have an impact in most realistic cases) acceptable for this kind of
 performance improvement?

 Other options:
 - Come up with a clever solution to fix the optimization properly (but
 I don't know if that's possible - suggestions welcome :-)).
 - Throw this optimization away, and take a look at something like
 Patience diff (supposed to be fast as well, and give nice results).
 However, I don't know Patience diff well enough to code up something,
 or even to judge whether it would give good results in all (or most)
 cases.
 - Just throw it away and don't look back :-)
 - Something else?

My not-having-looked-deeply-at-the-problem reaction is this: if we can
introduce significant speed-ups in the common case, while still
maintaining functionality, let's do it.  There may be corner cases
where somebody gets an ugly diff, but in those corner cases there are
likely to be *other* issues as play as well.  Let's not penalize
everybody for the sake of a couple of corner cases.

-Hyrum


Re: [RFC] Diff (blame) optimization: how to go forward

2010-10-08 Thread Johan Corveleyn
On Fri, Oct 8, 2010 at 3:08 PM, Daniel Shahaf d...@daniel.shahaf.name wrote:
 Johan Corveleyn wrote on Fri, 8 Oct 2010 at 01:44 -:
 With suffix-lines-to-keep=50, you'd need to insert a
 block of text that has its last 50 lines identical to the 50 lines
 preceding the insertion point, to mess up the diff result.

 - If really necessary, we could say default=50, but it can be
 overridden by the user with another -x option.

 So the main question is: is such a change in behavior (unlikely to
 have an impact in most realistic cases) acceptable for this kind of
 performance improvement?

 Just to clarify: are you asking whether it's acceptable to have a not
 nice (but still semantically correct) diff output in case the user adds
 a block whose last 50 lines match the 50 lines prior to the first
 difference?

 And 'not nice' just means the preceding, rather than trailing, instance
 of the repeated block would be grabbed by the /^+/ lines in the diff,
 right?

Yep, that's what I meant.

 In this case, I think I'm going to answer your long mail with just...

    Yes.

Ok, great.

 or actually...

    Yes, and let's move on to deeper problems (if any).

What deeper problems? :-)

Seriously: I saw this as the biggest problem in the scope of this
optimization. If the result is not good the entire optimization would
be worthless, so any other problem with my implementation would be
irrelevant.

Good to hear that it's not that big of a problem.

(I realize this optimization is just a small thing compared to more
important/fundamental stuff going on right now in subversion. But it's
important to me personally. And it's the most I can do with my
abilities/knowledge/time right now, so ...)

 (I'm not /particularly/ worried about a different-than-expected, but
 still semantically correct, diff --- though, yes, it's nice to output
 the expected diff if that's possible without too much effort and
 without breaking other use cases.)

 :-),

 Daniel
 (p.s. I don't have as much time to look into this as I'd like to...)

No problem. Your feedback is very much appreciated.

On Fri, Oct 8, 2010 at 8:46 PM, Hyrum K. Wright
hyrum_wri...@mail.utexas.edu wrote:
 My not-having-looked-deeply-at-the-problem reaction is this: if we can
 introduce significant speed-ups in the common case, while still
 maintaining functionality, let's do it.  There may be corner cases
 where somebody gets an ugly diff, but in those corner cases there are
 likely to be *other* issues as play as well.  Let's not penalize
 everybody for the sake of a couple of corner cases.

Ok, perfect, we seem to have a consensus then :-). I'll cook up a
version of the patch which keeps 50 suffix lines.

Thanks for the feedback, Daniel and Hyrum.

Just one more thing: as I mentioned in my rather long mail, blame
would benefit the most from my optimization if the server were fast
enough. So if anyone could spare some review cycles (ok, I know they
are scarce these days), reviewing Stefan^2's integrate-cache-membuffer
branch would really help my cause as well ...

Cheers,
-- 
Johan


Re: [RFC] Diff (blame) optimization: how to go forward

2010-10-08 Thread Hyrum K. Wright
On Fri, Oct 8, 2010 at 3:24 PM, Johan Corveleyn jcor...@gmail.com wrote:
 ...

 Just one more thing: as I mentioned in my rather long mail, blame
 would benefit the most from my optimization if the server were fast
 enough. So if anyone could spare some review cycles (ok, I know they
 are scarce these days), reviewing Stefan^2's integrate-cache-membuffer
 branch would really help my cause as well ...

Yeah, just a question of tuits.  If you've reviewed the branch,
posting your conclusions here would be a great contribution.  Speaking
for myself, I feel you've got more experience in some areas of that
branch than I do. :)

(Apologies if you've already done so, and I've missed it.)

-Hyrum


Re: [RFC] Diff (blame) optimization: how to go forward

2010-10-08 Thread Johan Corveleyn
On Fri, Oct 8, 2010 at 10:35 PM, Hyrum K. Wright
hyrum_wri...@mail.utexas.edu wrote:
 On Fri, Oct 8, 2010 at 3:24 PM, Johan Corveleyn jcor...@gmail.com wrote:
 ...

 Just one more thing: as I mentioned in my rather long mail, blame
 would benefit the most from my optimization if the server were fast
 enough. So if anyone could spare some review cycles (ok, I know they
 are scarce these days), reviewing Stefan^2's integrate-cache-membuffer
 branch would really help my cause as well ...

 Yeah, just a question of tuits.  If you've reviewed the branch,
 posting your conclusions here would be a great contribution.  Speaking
 for myself, I feel you've got more experience in some areas of that
 branch than I do. :)

 (Apologies if you've already done so, and I've missed it.)

No, sorry, I haven't either. I have compiled and run it (well actually
the performance branch itself, not the integrate-cache-membuffer
branch). But I haven't looked at the source code.

Mmmm, tuits, I wish I had some spare ones lying around :-). Who knows,
maybe I'll find some ...

Cheers,
-- 
Johan


[RFC] Diff (blame) optimization: how to go forward

2010-10-07 Thread Johan Corveleyn
Hi all,

This is a follow-up to the WIP-patches I posted last week [1], which
are about improving performance of svn_diff (and therefor also blame
on the client-side), especially for large files.

To summarize: the idea was (is) to strip off the identical prefix and
suffix, and then letting the existing diff algorithm work on the
remainder. As stated in [2], I ran into some difficulties to get the
exact same result of the existing diff algorithm (because of too eager
suffix stripping). I've spent some time searching for a perfect
solution, but I can't find one. So now I'm stuck and would like to
have some feedback on what to do with it.

First: the thing works, and it can reduce the time needed for svn_diff
by 20% up to 80% for large files (depending on the distribution of the
changes). An extreme example is a reduction from ~150ms to ~30ms for a
6-lines file with a small number of lines changed (say up to 100)
located close together (so lots of identical prefix/suffix).

Blame gets similar benefits, *if the server is fast enough*. I used
svnserve built from Stefan^2's performance branch to test this. A
normal svnserve with FSFS isn't really fast enough (unless maybe with
an SSD, but I don't have one here) to serve up the 2500 revisions of
the large file quickly enough. But the performance-enhanced-svnserve,
with a hot cache filled with the necessary full-texts, is definitely
fast enough to make the time of blame completely dependent on the
client-side performance. Conclusion: the diff optimization we're
talking about is much more useful for blame together with a server
with the full-text membuffer caching of the performance branch.

Now, the reason why I'm stuck: suffix scanning sometimes strips a
little bit too much (see [2] for full explanation). In this case the
resulting diff is still correct, but it's not as nice for the user. As
I noted in [2], GNU diff also has this problem, but the user can help
a little bit by specifying a large enough number of
prefix/suffix-lines to keep (--horizon-lines).

So, what to do?
- I have not found a better solution than to use some form of
horizon-lines to get the nice result, but still have a similar
performance improvement (unless if I leave out suffix stripping
entirely, only strip prefix, but that halves the power of the
optimization).

- I've tested with keeping up to 50 suffix-lines. It didn't have the
slightest impact on the performance improvement (we're talking about
stripping away 1000's of lines). This fixed all real-life examples I
could find/test (diff/blame output is precisely identical to original
svn). If I kept only 20 lines, I still ran into some differences (30
lines was enough for my example file, but I took it up to 50 to be
sure).

- I would like to avoid leaving the decision to the user to come up
with an adequate number of suffix-lines-to-keep. So I'd like to just
hardcode some value, say 50, or even 100. I think this would give good
(=identical to original svn) diff/blame results in all but the most
extreme cases. With suffix-lines-to-keep=50, you'd need to insert a
block of text that has its last 50 lines identical to the 50 lines
preceding the insertion point, to mess up the diff result.

- If really necessary, we could say default=50, but it can be
overridden by the user with another -x option.

So the main question is: is such a change in behavior (unlikely to
have an impact in most realistic cases) acceptable for this kind of
performance improvement?

Other options:
- Come up with a clever solution to fix the optimization properly (but
I don't know if that's possible - suggestions welcome :-)).
- Throw this optimization away, and take a look at something like
Patience diff (supposed to be fast as well, and give nice results).
However, I don't know Patience diff well enough to code up something,
or even to judge whether it would give good results in all (or most)
cases.
- Just throw it away and don't look back :-)
- Something else?

Cheers,
-- 
Johan

[1] http://svn.haxx.se/dev/archive-2010-10/0032.shtml
[2] http://svn.haxx.se/dev/archive-2010-10/0050.shtml