Re: svn diff optimization to make blame faster?

2010-09-20 Thread Branko Čibej
 On 15.09.2010 14:20, Johan Corveleyn wrote:
 Some update on this: I have implemented this for svn_diff (excluding
 the identical prefix and suffix of both files, and only then starting
 to fill up the token tree and let the lcs-agorithm to its thing). It
 makes a *huge* difference. On my bigfile.xml (1.5 Mb) with only one
 line changed, the call to svn_diff_diff is ~10 times faster (15-20 ms
 vs. 150-170 ms).


Hmmm ... looks to me like test data tailored to the optimization. :)

-- Brane


Re: svn diff optimization to make blame faster?

2010-09-20 Thread Johan Corveleyn
On Mon, Sep 20, 2010 at 11:52 AM, Branko Čibej br...@xbc.nu wrote:
  On 15.09.2010 14:20, Johan Corveleyn wrote:
 Some update on this: I have implemented this for svn_diff (excluding
 the identical prefix and suffix of both files, and only then starting
 to fill up the token tree and let the lcs-agorithm to its thing). It
 makes a *huge* difference. On my bigfile.xml (1.5 Mb) with only one
 line changed, the call to svn_diff_diff is ~10 times faster (15-20 ms
 vs. 150-170 ms).


 Hmmm ... looks to me like test data tailored to the optimization. :)

Nope, that's real data from a real repository, with a normal kind of
change that happens here every day.

Of course this optimization is most effective if there are a lot of
common prefix/suffix lines. If there is a single change in the first
line, and a single change in the last one, this optimization will do
nothing but introduce a little bit of extra overhead. And it will
obviously make the most impact on large files (in fact it's just
relative to the ratio of the number of common prefix/suffix lines to
the number of lines in between).

I'm sorry it takes me longer than expected to post a version of this
to the list, but I'm still having some problems with a couple of edge
conditions (I'm learning C as I go, and I'm struggling with a couple
of pointer calculations/comparisons). I plan to post something during
this week...

Cheers,
-- 
Johan


Re: svn diff optimization to make blame faster?

2010-09-15 Thread Johan Corveleyn
On Sun, Sep 5, 2010 at 1:53 AM, Johan Corveleyn jcor...@gmail.com wrote:
 On Tue, Aug 24, 2010 at 11:11 AM, Philip Martin
 philip.mar...@wandisco.com wrote:
 Johan Corveleyn jcor...@gmail.com writes:
 On Sun, Aug 22, 2010 at 4:02 PM, Branko Čibej br...@xbc.nu wrote:
 svn_diff uses basically the same algorithm as GNU diff but implemented
 slightly differently and IIRC it doesn't have some of GNU diff's
 optimizations. I'm sure it can be speeded up, but haven't a clue about
 how much.

 Ok, thanks. In the meantime I saw that there is not that much
 difference anymore between GNU diff and svn_diff, after running the
 latter from a release build, and disabling my anti-virus (which makes
 me wonder why my anti-virus slows down svn_diff (impact when opening
 the modified datasource), but not on GNU diff).

 The big difference between Subversion's diff and GNU diff is that GNU
 uses heuristics to cut short the diff algorithm whereas Subversion
 grinds on to the end.  Certain files trigger the heuristics and then
 GNU diff is much faster than Subversion.  Typically the file has a
 large number of matches and non-matches repeated through the file,
 machine generated files sometimes fit this pattern.

 GNU diff's heuristics work well so when they trigger the resulting
 diff is usually good enough.  They can be disabled using the --minimal
 option and using that makes GNU diff performance much more like
 Subversion.

 Hmm, I thought harder about this. If I try --minimal with GNU diff,
 it's just as fast, still significantly faster than svn diff.

 Then I read a little bit more about diff algorithms, like the blog
 entry that Greg Hudson suggested in the other thread [1], about the
 Patience Diff algorithm [2]. In there the following two lines caught
 my eye, first steps in the diff algo:

 [[[
 1) Match the first lines of both if they're identical, then match the
 second, third, etc. until a pair doesn't match.
 2) Match the last lines of both if they're identical, then match the
 next to last, second to last, etc. until a pair doesn't match.
 ]]]
 (then the longest common subsequence algorithm kicks in)

 Now, these steps are not specific to Patience Diff, they are common to
 most diff algorithms. And I bet it's a much more efficient to exclude
 the common prefix/suffix this way, than to include them into the
 entire lcs algorithm.

 Right now, svn diff doesn't do this AFAICS. However, GNU diff does,
 and I think that's where most of the performance difference comes
 from. For big files, where a couple of lines are modified somewhere in
 the middle (and the first 3 and last 3 lines are identical), I
 think this can make a big difference.

 Thoughts?

 If no-one else fancies this, I think I'll take a stab at implementing
 this optimization (first with a POC to see if it's significant).
 Unless feedback tells me it's not worth it, not a good idea, ...

Some update on this: I have implemented this for svn_diff (excluding
the identical prefix and suffix of both files, and only then starting
to fill up the token tree and let the lcs-agorithm to its thing). It
makes a *huge* difference. On my bigfile.xml (1.5 Mb) with only one
line changed, the call to svn_diff_diff is ~10 times faster (15-20 ms
vs. 150-170 ms).

This means blame is much faster for such big files as well, because
diff accounts for 90% of the client-side work of blame. This means
that in most cases the client cpu won't be the bottleneck anymore, so
it's more constrained by server-side and network. In my tests, blame
was about twice as fast for bigfile.xml (since all my tests were done
with client and server on the same machine, I'm guessing the server is
now the bottleneck (as well as them running on the same machine)).
Same factor of performance increase for a medium-sized file (a java
source file of 300 Kb).

The patch is still very crude, with lots of code duplication, and
everything stuffed in one big function (and I may have overlooked some
edge conditions), so I don't feel it's ready to post yet (don't want
to waste people's time :)). I'll try to clean it up as soon as I can
and post a reasonable version to the list.

Some notes already:
- I basically wrote a new function datasources_open in diff_file.c
(similar to datasource_open), that opens both files (original and
modified), and compares bytes from the beginning until the first
nonmatching byte (counting newlines when it sees them), and backwards
from the end to find the first (or last, depending how you look at it)
nonmatching byte. The number of prefix_lines is then used to determine
the starting offset of the tokens (lines) that will later be
extracted.
- I think I should generalize this function to take an array of
datasources, and not just two, so it can be used by diff3 and diff4 as
well. But for now I didn't want to take on that extra complication.
- The code is complicated by the fact that the svn_diff code
(diff_file.c) works with chunks of data (doesn't read the entire
file(s) 

Re: svn diff optimization to make blame faster?

2010-09-15 Thread Stefan Sperling
On Wed, Sep 15, 2010 at 02:20:05PM +0200, Johan Corveleyn wrote:
 Just to know how hard/fast I should pursue this: is there any chance
 that a change like this could still make it into 1.7 (maybe even
 backported to 1.6?), provided that the patch is of good quality and
 all... ?

Depending on how invasive the change is, and how quickly we get it
reviewed and tested, there's a chance that this can make the 1.7 release.
For 1.6 it's probably not worth the effort.

Stefan


Re: svn diff optimization to make blame faster?

2010-09-15 Thread Daniel Shahaf
Johan Corveleyn wrote on Wed, Sep 15, 2010 at 14:20:05 +0200:
 The patch is still very crude, with lots of code duplication, and
 everything stuffed in one big function (and I may have overlooked some
 edge conditions), so I don't feel it's ready to post yet (don't want
 to waste people's time :)). I'll try to clean it up as soon as I can
 and post a reasonable version to the list.

+1, sounds like you have some interesting stuff working there. :-)


Re: svn diff optimization to make blame faster?

2010-08-24 Thread Johan Corveleyn
On Sun, Aug 22, 2010 at 4:02 PM, Branko Čibej br...@xbc.nu wrote:
 On 18.08.2010 00:59, Johan Corveleyn wrote:
 Hi devs,

 While Looking to improve performance of svn annotate [1], I found
 that the current blame algorithm is mainly client-side bound, and that
 most of its time is spent on svn diff (calls to svn_diff_file_diff_2
 from add_file_blame in blame.c). Apart from avoiding to build
 full-texts and diffing them altogether (which is subject of further
 discussion in [1]), I'm wondering if optimization of svn diff
 wouldn't also be an interesting way to improve the speed of blame.

 So the main question is: is it worth it to spend time to analyze this
 further and try to improve performance? Or has this already been
 optimized in the past, or is it simply already as optimal as it can
 get? I have no idea really, so if anyone can shed some light ...

 Gut feeling tells me that there must be room for optimization, since
 GNU diff seems a lot faster than svn diff for the same large file
 (with one line changed) on my machine [1]. But maybe svn's diff
 algorithm is purposefully different (better? more accurate? ...) than
 GNU's, or there are specific things in the svn context so svn diff has
 to do more work.

 Any thoughts?


 svn_diff uses basically the same algorithm as GNU diff but implemented
 slightly differently and IIRC it doesn't have some of GNU diff's
 optimizations. I'm sure it can be speeded up, but haven't a clue about
 how much.

Ok, thanks. In the meantime I saw that there is not that much
difference anymore between GNU diff and svn_diff, after running the
latter from a release build, and disabling my anti-virus (which makes
me wonder why my anti-virus slows down svn_diff (impact when opening
the modified datasource), but not on GNU diff).

There may still be some slight speed difference (enough to be
significant for a blame operation doing 100's or 1000's of diffs), but
not that much as I thought at first. So I don't think I'm going to
spend more time on trying to speed up svn_diff (also, I'm not really
an expert at optimizing C code, so ... I'll leave that to others :-)).

Cheers,
-- 
Johan


Re: svn diff optimization to make blame faster?

2010-08-24 Thread Philip Martin
Johan Corveleyn jcor...@gmail.com writes:

 Ok, thanks. In the meantime I saw that there is not that much
 difference anymore between GNU diff and svn_diff, after running the
 latter from a release build, and disabling my anti-virus (which makes
 me wonder why my anti-virus slows down svn_diff (impact when opening
 the modified datasource), but not on GNU diff).

The big difference between Subversion's diff and GNU diff is that GNU
uses heuristics to cut short the diff algorithm whereas Subversion
grinds on to the end.  Certain files trigger the heuristics and then
GNU diff is much faster than Subversion.  Typically the file has a
large number of matches and non-matches repeated through the file,
machine generated files sometimes fit this pattern.

GNU diff's heuristics work well so when they trigger the resulting
diff is usually good enough.  They can be disabled using the --minimal
option and using that makes GNU diff performance much more like
Subversion.

-- 
Philip


Re: svn diff optimization to make blame faster?

2010-08-20 Thread Johan Corveleyn
On Thu, Aug 19, 2010 at 11:38 PM, Johan Corveleyn jcor...@gmail.com wrote:
 Feh, I just redid my apr_time_now+printf profiling with a release
 build (of trunk), instead of a debug build, and that makes a *big*
 difference. Total time of the svn_diff_diff call is now down to ~300
 ms. Still a lot slower than GNU diff (78 ms), but a lot better than
 with the debug build. That will teach me not to do performance tests
 with a debug build.

 Still, the challenge holds: why is svn diff ~4 times slower than GNU diff?

 Also, still puzzling to me: why does datasource_open take so long in
 the modified case? Now it takes twice as long as the while loop ...

@#...@#!!!
/me bangs head against wall

The slowness of datasource_open was because of my antivirus (AVG free
edition). After disabling it, it's completely gone.

Running a release build, with antivirus disabled, and minimal
instrumentation brings svn diff performance (after a couple of runs)
more or less on par with GNU diff:

$ time svn diff bigfile.xml
Starting diff ... (entered svn_diff_diff in diff.c)
Diff finished in 78125 usec (78 millis)
[snip]
real0m0.359s
user0m0.015s
sys 0m0.031s

I guess this closes this thread ... on to more useful stuff. There is
no point in trying to optimize svn diff, so I guess blame can only be
sped up by more fundamental changes (i.e. avoiding diffs, which still
amount to ~90% of blame time on the client side).

Sorry for any wasted time.

(on the bright side, I learned quite a lot about accurately measuring
performance)

Cheers,
-- 
Johan


Re: svn diff optimization to make blame faster?

2010-08-20 Thread Daniel Shahaf
Johan Corveleyn wrote on Fri, Aug 20, 2010 at 01:43:16 +0200:
 If anyone has any more suggestions, short of installing a Linux
 distribution and trying a linux profiler ...

Ask one of the other devs to do the profiling for you?


Re: svn diff optimization to make blame faster?

2010-08-19 Thread Johan Corveleyn
On Wed, Aug 18, 2010 at 3:44 PM, Johan Corveleyn jcor...@gmail.com wrote:
 On Wed, Aug 18, 2010 at 12:49 PM, Stefan Sperling s...@elego.de wrote:
 Can you show a profiler run that illustrates where the client is
 spending most of its time during diff? That would probably help with
 getting opinions from people, because it saves them from spending time
 doing this research themselves.
 You've already hinted at svn_diff__get_tokens() in another mail, but
 a real profiler run would show more candidates.

 Sorry, I'm still very much a beginner in C (I've been programming for
 10 years, but mainly in Java). Especially the tooling is a steep part
 of the learning curve :-), so I don't know (how to use) a profiler
 yet. Any suggestions? I'm on Windows (XP), using VCE 2008, and used
 cygwin to compare with GNU diff.

I googled around a bit for C profilers on Windows. I found several
which seem useful:
- very sleepy (http://www.codersnotes.com/sleepy/sleepy)
- Shiny (http://sourceforge.net/projects/shinyprofiler/)
- AMD CodeAnalyst (http://developer.amd.com/cpu/CodeAnalyst/Pages/default.aspx)
- AQTime - not free, but has a trial of 30 days
(http://www.automatedqa.com/products/aqtime/)

Before I dive in and try them out: any preference/favorites from the
windows devs on this list? Or other suggestions?

Further, some additional context to the manual-profiling numbers: see below.

 For the time being, I've helped myself using apr_time_now() from
 apr_time.h and printf statements. Not terribly accurate and probably
 somewhat overheadish, but it gives me some hints about the
 bottlenecks.

 FWIW, below is the output of a recent run with my instrumented trunk
 build. I've instrumented svn_diff_diff in diff.c and
 svn_diff__get_tokens in token.c. I checked out bigfile.xml from a
 repository, and edited a single line of it in my working copy. The two
 statements Starting diff and Diff finished are the first and last
 statements inside the svn_diff_diff function. These are numbers from
 the second run (any following runs show approximately the same
 numbers).

 $ time svn diff bigfile.xml
 Starting diff ... (entered svn_diff_diff in diff.c)
  - calling svn_diff__get_tokens for svn_diff_datasource_original
      == svn_diff__get_tokens datasource_open: 0 usec
      == svn_diff__get_tokens while loop: 265625 usec
         calls to datasource_get_next_token: 62500 usec
                svn_diff__tree_insert_token: 171875 usec
                                       rest: 15625 usec
  - done: 281250 usec
  - calling svn_diff__get_tokens for svn_diff_datasource_modified
      == svn_diff__get_tokens datasource_open: 234375 usec
      == svn_diff__get_tokens while loop: 312500 usec
         calls to datasource_get_next_token: 62500 usec
                svn_diff__tree_insert_token: 171875 usec
                                       rest: 46875 usec
  - done: 562500 usec
  - calling svn_diff__lcs
  - done: 0 usec
  - calling svn_diff__diff
  - done: 0 usec
 Diff finished in 875000 usec (875 millis)
 Index: bigfile.xml
 ===
 [snip]

 real    0m1.266s
 user    0m0.015s
 sys     0m0.031s

For comparison: a debug build from trunk, with only one
instrumentation spot (at start and end of svn_diff_diff):
$ time svn diff bigfile.xml
Starting diff ... (entered svn_diff_diff in diff.c)
Diff finished in 703125 usec (703 millis)
[snip]
real0m1.109s
user0m0.015s
sys 0m0.031s

So the instrumentation in the critical loop probably costs me around
150-200 ms.

A release build will also probably be faster, but I have no time to
make one and test it now. If I time my 1.6.5 build from tigris.org,
that's still a lot faster:
$ time svn diff bigfile.xml
[snip]
real0m0.468s
user0m0.015s
sys 0m0.031s

Maybe that can be totally attributed to the build (release vs. debug),
or maybe 1.6.5 was faster at svn diff than trunk is?

 Some observations:
 - svn_diff__get_tokens takes most of the time
 - for some reason, in the case of datasource_modified, the call to
 datasource_open takes a long time (234 ms). In case of
 datasource_original, it's instantaneous. Maybe because of translation,
 ... of the pristine file? But I'd think that would be the original??
 - apart from that, most of the time goes to the while loop in
 svn_diff__get_tokens.
 - Inside the while loop, most time is spent on calls to
 svn_diff__tree_insert_token (which compares tokens (=lines) to insert
 them into some tree structure). Calls to datasource_get_next_token
 also take some time.

 I'm not too sure of the accuracy of those last numbers with my simple
 profiling method, because it's the accumulated time of calls inside a
 while loop (with 61000 iterations).


 For completeness, the same diff with /usr/bin/diff (under cygwin), of
 the edited bigfile.xml vs. the original, as of the second diff run:

 $ ls -l settings.xml
 -rwxr-xr-x+ 1 User None 1447693 2010-08-17 14:20 bigfile.xml
 $ time diff 

RE: svn diff optimization to make blame faster?

2010-08-19 Thread Bert Huijben


 -Original Message-
 From: Johan Corveleyn [mailto:jcor...@gmail.com]
 Sent: donderdag 19 augustus 2010 4:55
 To: Subversion Development
 Subject: Re: svn diff optimization to make blame faster?
 


 
 I googled around a bit for C profilers on Windows. I found several
 which seem useful:
 - very sleepy (http://www.codersnotes.com/sleepy/sleepy)
 - Shiny (http://sourceforge.net/projects/shinyprofiler/)
 - AMD CodeAnalyst
 (http://developer.amd.com/cpu/CodeAnalyst/Pages/default.aspx)
 - AQTime - not free, but has a trial of 30 days
 (http://www.automatedqa.com/products/aqtime/)

Microsoft and Intel have some advanced profilers integrated in their C
products. (Only a bit of experience with those)

I had good results with very sleepy. When you have the PDB files available
it is as simple as just running the application and you get an easy to
understand result. (Make sure you have a recent version; the older versions
only allowed attaching to an already running process)

Bert



Re: svn diff optimization to make blame faster?

2010-08-18 Thread Stefan Sperling
On Wed, Aug 18, 2010 at 12:59:21AM +0200, Johan Corveleyn wrote:
 Hi devs,
 
 While Looking to improve performance of svn annotate [1], I found
 that the current blame algorithm is mainly client-side bound, and that
 most of its time is spent on svn diff (calls to svn_diff_file_diff_2
 from add_file_blame in blame.c). Apart from avoiding to build
 full-texts and diffing them altogether (which is subject of further
 discussion in [1]), I'm wondering if optimization of svn diff
 wouldn't also be an interesting way to improve the speed of blame.
 
 So the main question is: is it worth it to spend time to analyze this
 further and try to improve performance? Or has this already been
 optimized in the past, or is it simply already as optimal as it can
 get? I have no idea really, so if anyone can shed some light ...
 
 Gut feeling tells me that there must be room for optimization, since
 GNU diff seems a lot faster than svn diff for the same large file
 (with one line changed) on my machine [1]. But maybe svn's diff
 algorithm is purposefully different (better? more accurate? ...) than
 GNU's, or there are specific things in the svn context so svn diff has
 to do more work.
 
 Any thoughts?

Can you show a profiler run that illustrates where the client is
spending most of its time during diff? That would probably help with
getting opinions from people, because it saves them from spending time
doing this research themselves.
You've already hinted at svn_diff__get_tokens() in another mail, but
a real profiler run would show more candidates.

Thanks,
Stefan


svn diff optimization to make blame faster?

2010-08-17 Thread Johan Corveleyn
Hi devs,

While Looking to improve performance of svn annotate [1], I found
that the current blame algorithm is mainly client-side bound, and that
most of its time is spent on svn diff (calls to svn_diff_file_diff_2
from add_file_blame in blame.c). Apart from avoiding to build
full-texts and diffing them altogether (which is subject of further
discussion in [1]), I'm wondering if optimization of svn diff
wouldn't also be an interesting way to improve the speed of blame.

So the main question is: is it worth it to spend time to analyze this
further and try to improve performance? Or has this already been
optimized in the past, or is it simply already as optimal as it can
get? I have no idea really, so if anyone can shed some light ...

Gut feeling tells me that there must be room for optimization, since
GNU diff seems a lot faster than svn diff for the same large file
(with one line changed) on my machine [1]. But maybe svn's diff
algorithm is purposefully different (better? more accurate? ...) than
GNU's, or there are specific things in the svn context so svn diff has
to do more work.

Any thoughts?

-- 
Johan

[1] http://svn.haxx.se/dev/archive-2010-08/0408.shtml