Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

2010-10-19 Thread Johan Corveleyn
On Tue, Oct 12, 2010 at 12:10 PM, Julian Foad julian.f...@wandisco.com wrote:
 On Tue, 2010-10-12 at 00:31 +0200, Johan Corveleyn wrote:
 On Mon, Oct 11, 2010 at 11:53 AM, Julian Foad julian.f...@wandisco.com 
 wrote:
  On Sat, 2010-10-09, Johan Corveleyn wrote:
  On Sat, Oct 9, 2010 at 2:57 AM, Julian Foad julian.f...@wandisco.com 
  wrote:
   So I wrote a patch - attached - that refactors this into an array of 4
   sub-structures, and simplifies all the code that uses them.
  [...]
  Yes, great idea! That would indeed vastly simplify a lot of the code.
  So please go ahead and commit the refactoring.
 
  OK, committed in r1021282.

 Thanks, looks much more manageable now.

 I'd like to see a simplified version of your last patch, taking
 advantage of that, before you go exploring other options.

Ok, here's a new version of the patch, taking advantage of your
file_info refactoring. This vastly simplifies the code, so that it
might actually be understandable now :-).

Other things I've done in this version:
1) Generalized everything to handle an array of datasources/files,
instead of just two. This makes it slightly more complex here and
there (using for loops everywhere), but I think it's ok, and it's also
more consistent/generic. If anyone has better ideas to do those for
loops, suggestions welcome.

This makes the algorithm usable by diff3 as well (and diff4 if needed
(?)). I have not yet enabled it for diff3, because I haven't yet
understood how it handles the generation of its diff output (needs to
take into account the prefix_lines. I tried some quick hacks, but lots
of tests were failing, so I'll have to look more into it - that's for
a follow up patch). When I can enable it for diff3 (and diff4), I can
remove datasource_open (with one datasource).

2) Removed get_prefix_lines from svn_diff_fns_t (and its
implementations in diff_file.c and diff_memory.c). Instead I pass
prefix_lines directly to token.c#svn_diff__get_tokens.

3) If prefix scanning ended in the last chunk, the suffix scanning now
reuses that buffer which already contains the last chunk. As a special
case, this also avoids reading the file twice if it's smaller than 128
Kb.

4) Added doc strings everywhere. Feel free to edit those, I'm new at
documenting things in svn.


Still TODO:
- revv svn_diff_fns_t and maybe other stuff I've changed in public API.
- See if implementing the critical parts of increment_pointers and
decrement_pointers in a macro improves performance.
- Add support for -x-b, -x-w, and -x--ignore-eol-style options. For
this (and for other reasons), I'd still like to investigate pushing
this optimization into the token parsing/handling layer, to extract
entire tokens etc., even if this means the current patch has to be
thrown away. I'll take this up in a separate thread.

Log message:
[[[
Make svn_diff skip identical prefix and suffix to make diff and blame faster.

* subversion/include/svn_diff.h
  (svn_diff_fns_t): Added new function type datasources_open to the vtable.

* subversion/libsvn_diff/diff_memory.c
  (datasources_open): New function (does nothing).
  (svn_diff__mem_vtable): Added new function datasources_open.

* subversion/libsvn_diff/diff_file.c
  (svn_diff__file_baton_t): Added member prefix_lines, and inside the
   struct file_info the members suffix_start_chunk and suffix_offset_in_chunk.
  (increment_pointers, decrement_pointers): New functions.
  (is_one_at_bof, is_one_at_eof): New functions.
  (find_identical_prefix, find_identical_suffix): New functions.
  (datasources_open): New function, to open multiple datasources and find
   their identical prefix and suffix, so these can be excluded from the rest
   of the diff algorithm, as a performance optimization. From the identical
   suffix, 50 lines are kept to help the diff algorithm find the nicest
   possible diff representation in case of ambiguity.
  (datasource_get_next_token): Stop at start of identical suffix.
  (svn_diff__file_vtable): Added new function datasources_open.

* subversion/libsvn_diff/diff.h
  (svn_diff__get_tokens): Added argument datasource_opened, to indicate that
   the datasource was already opened, and argument prefix_lines, the number
   of identical prefix lines.and use
   this as the starting offset for the token we're getting.

* subversion/libsvn_diff/token.c
  (svn_diff__get_tokens): Added arguments datasource_opened and
   prefix_lines. Only open the datasource if datasource_opened is FALSE.
   Set the starting offset of the position list to the number of prefix_lines.

* subversion/libsvn_diff/lcs.c
  (svn_diff__lcs): Added argument prefix_lines. Use this to correctly set
   the offset of the sentinel position for EOF, even if one of the files
   became empty after eliminating the identical prefix.

* subversion/libsvn_diff/diff.c
  (svn_diff__diff): Add a chunk of common diff for identical prefix.
  (svn_diff_diff): Use new function datasources_open to open original and
   modified at once and find their 

Diff optimization: implement prefix/suffix-skipping in token-handling code (was: Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster)

2010-10-19 Thread Johan Corveleyn
On Tue, Oct 12, 2010 at 12:35 PM, Julian Foad julian.f...@wandisco.com wrote:
 On Sun, 2010-10-10 at 23:43 +0200, Johan Corveleyn wrote:
 On Sat, Oct 9, 2010 at 2:21 PM, Johan Corveleyn jcor...@gmail.com wrote:
  On Sat, Oct 9, 2010 at 2:57 AM, Julian Foad julian.f...@wandisco.com 
  wrote:
  But this makes me think, it looks to me like this whole
  prefix-suffix-skipping functionality would fit better inside the
  lower-level diff algorithm rather than inside the datasources_open
  function.  Do you agree?
 
  It works as it is, but you were talking about wanting it to obey the
  standard token-parsing rules such as ignore white space, and so on.
  It seems that things like this would be much better one level down.
 
  Yes, I've been struggling with this. But I can't easily see it fit in
  the lower levels right now. Problem is that everything in those lower
  levels always acts on 1 datasource at a time (getting all the tokens,
  inserting them into the token tree, ... then the same for the next
  datasource). The datasource_open seemed to me to be the easiest place
  to combine datasources to do things for both of them concurrently
  (with least impact on the rest of the code).
 
  Maybe those lower-level functions could also be made
  multi-datasource, but I have to think a little bit more about that.

 I've thought a little bit more about this, going through the code, and
 yeah, it might be a good idea to push the prefix/suffix scanning into
 the lower-level parts of the diff-algorithm (the token parsing /
 building the token tree).

 Something like:
 - Make token.c#svn_diff__get_tokens take multiple datasources.
 - In diff.c, diff3.c and diff4.c replace the multiple calls to this
 function to one call passing multiple datasources.
 - token.c#svn_diff__get_tokens (with multiple datasources) will take
 care of identical prefix and suffix scanning based on tokens (so can
 take advantage of the standard token-parsing rules with ignore-*
 options, by simply calling svn_diff_fns_t.datasource_get_next_token).

 One of the improvements we're looking for is to make use of the already
 existing diff options - ignore-white-space etc.

 We can get that improvement with a much smaller change: simply by
 calling the 'get_next_token' routine, or rather a part of it, from
 within your current 'find_identical_prefix' function, without touching
 any of the generic diffN.c/token.c code and APIs.

 Some things needed to support this:
 - svn_diff_fns_t.datasource_get_next_token: calculation of the hash
 should be made conditional (I don't want to waste time for the adler32
 hash for lines that are not going to be put in the token tree).

 Yes.  If you take this smaller change approach, then the way to do
 this would be to factor out the non-adler part of
 'datasource_get_next_token' into a separate private function, and call
 that.

 - For suffix scanning, I'll need another variation of
 datasource_get_next_token to get tokens from the end going backwards
 (say datasource_get_previous_token). No hash needed.

 Yes.

 Does that make sense? Or am I missing it completely?

 I can't really comment on the question of moving this right down into
 the diffN.c/token.c code, as I don't have time to study that right now.
 The possible benefits I can see are:

 * Sharing this feature across all kinds of diff implementations
 (currently: file and in-memory-string).  Good from a practical POV if
 and only if really long strings are being compared in memory.  Good from
 a code-structural POV.

Yes, that seems like a nice improvement, code-structurally. I don't
know if it will have noticeable performance impact. If I understand
the code correctly, the in-memory-diff code is used for diffing
properties. Some properties can be quite large (e.g. svn:mergeinfo),
but I don't know if they are large enough to have an impact (unless
some high-level operations perform a million property diffs or
something).

 * I'm curious about whether it would be possible to integrate prefix
 skipping into the main algorithm in such a way that the algorithm would
 see the skipped prefix as a possible match, just like any other chunk
 (including its adler32), but in a much quicker way than the algorithm
 currently finds such a prefix.  If so, it would be able to handle better
 the cases where one file has added a prefix that is a duplicate of
 existing text.  Same for the suffix.

Maybe that's possible, but I don't think it will help much. For one
thing, it introduces some overhead (adler32 calculation of the
prefix). And it will only help if the added piece of text is *exactly*
the same as the prefix (every line of it). If the added piece of text
misses the last line of the prefix, it will not match (different
adler32 hash). If you need to be able to match at different spots
inside the prefix, you'll have to hash (and compare) every line
separately, which voids the benefit of this optimization.

However, I've had another idea for optimization, which I 

Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

2010-10-11 Thread Julian Foad
On Sat, 2010-10-09, Johan Corveleyn wrote:
 On Sat, Oct 9, 2010 at 2:57 AM, Julian Foad julian.f...@wandisco.com wrote:
  On Sat, 2010-10-09, Johan Corveleyn wrote:
  Ok, third iteration of the patch in attachment. It passes make check.
 
  As discussed in [1], this version keeps 50 lines of the identical
  suffix around, to give the algorithm a good chance to generate a diff
  output of good quality (in all but the most extreme cases, this will
  be the same as with the original svn_diff algorithm).
 
  That's about the only difference with the previous iteration. So for
  now, I'm submitting this for review. Any feedback is very welcome :-).
 
  Hi Johan.
 
 Hi Julian,
 
 Thanks for taking a look.
 
  I haven't reviewed it, but after seeing today's discussion I had just
  scrolled quickly through the previous version of this patch.  I noticed
  that the two main functions - find_identical_suffix and
  find_identical_suffix - are both quite similar (but not quite similar
  enough to make them easily share implementation) and both quite long,
  and I noticed you wrote in an earlier email that you were finding it
  hard to make the code readable.  I have a suggestion that may help.
[...]
  So I wrote a patch - attached - that refactors this into an array of 4
  sub-structures, and simplifies all the code that uses them.
[...]
 Yes, great idea! That would indeed vastly simplify a lot of the code.
 So please go ahead and commit the refactoring.

OK, committed in r1021282.


 Also, maybe the last_chunk number could be included in the file_info
 struct? Now it's calculated in several places: last_chunk0 =
 offset_to_chunk(file_baton-size[idx0]), or I have to pass it on
 every time as an extra argument. Seems like the sort of info that
 could be part of file_info.

It looks to me like you only need to calculate it in exactly one place:
in increment_pointer_or_chunk().  I wouldn't worry about pre-calculating
for efficiency, as it's a trivial calculation.


 One more thing: you might have noticed that for find_identical_suffix
 I use other buffers, chunks, curp's, endp's, ... than for the prefix.
 For prefix scanning I can just use the stuff from the diff_baton,
 because after prefix scanning has finished, everything is buffered and
 pointing correctly for the normal algorithm to continue (i.e.
 everything points at the first byte of the first non-identical line).
 For suffix scanning I need to use other structures (newly alloc'd
 buffer etc), so as to not mess with those pointers/buffers from the
 diff_baton.

 Maybe I can give it a go,
 first suffix then prefix, and see if I can find real-life problems ...

 So: I think I'll need the file_info struct to be available out of the
 diff_baton_t struct as well, so I can use this in suffix scanning
 also.

Yes; I gave the struct type a name - struct file_info - so you can use
it in other places.

 (side-note: I considered first doing suffix scanning, then prefix
 scanning, so I could reuse the buffers/pointers from diff_baton all
 the time, and still have everything pointing correctly after
 eliminating prefix/suffix. But that could give vastly different
 results in some cases, for instance when original file is entirely
 identical to both the prefix and the suffix of the modified file. So I
 decided it's best to stick with first prefix, then suffix).

And in a later email you wrote:
 [...] Maybe I can give it a go,
 first suffix then prefix, and see if I can find real-life problems ...

There's no need to re-visit that.  I think it's fine as is it, using a
separate pair of buffers.


  Responding to some of the other points you mentioned in a much earlier
  mail:
 
  3) It's only implemented for 2 files. I'd like to generalize this for
  an array of datasources, so it can also be used for diff3 and diff4.
 
  4) As a small hack, I had to add a flag datasource_opened to
  token.c#svn_diff__get_tokens, so I could have different behavior for
  regular diff vs. diff3 and diff4. If 3) gets implemented, this hack is
  no longer needed.
 
  Yes, I'd like to see 3), and so hack 4) will go away.
 
 I'm wondering though how I should represent the datasources to pass
 into datasources_open. An array combined with a length parameter?
 Something like:
 
 static svn_error_t *
 datasources_open(void *baton, apr_off_t *prefix_lines,
  svn_diff_datasource_e[] datasources, int datasources_len)
 
 ? And then use for loops everywhere I now do things twice for the two
 datasources?

I haven't looked at how datasources_open() API is used or what form of
API would be best.

For the internal functions, you can now pass around an array of
file_info pointers rather than indexes.


  5) I've struggled with making the code readable/understandable. It's
  likely that there is still a lot of room for improvement. I also
  probably need to document it some more.
 
  You need to write a full doc string for datasources_open(), at least.
  It needs especially to say how 

Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

2010-10-10 Thread Johan Corveleyn
On Sat, Oct 9, 2010 at 2:21 PM, Johan Corveleyn jcor...@gmail.com wrote:
 On Sat, Oct 9, 2010 at 2:57 AM, Julian Foad julian.f...@wandisco.com wrote:
 But this makes me think, it looks to me like this whole
 prefix-suffix-skipping functionality would fit better inside the
 lower-level diff algorithm rather than inside the datasources_open
 function.  Do you agree?

 It works as it is, but you were talking about wanting it to obey the
 standard token-parsing rules such as ignore white space, and so on.
 It seems that things like this would be much better one level down.

 Yes, I've been struggling with this. But I can't easily see it fit in
 the lower levels right now. Problem is that everything in those lower
 levels always acts on 1 datasource at a time (getting all the tokens,
 inserting them into the token tree, ... then the same for the next
 datasource). The datasource_open seemed to me to be the easiest place
 to combine datasources to do things for both of them concurrently
 (with least impact on the rest of the code).

 Maybe those lower-level functions could also be made
 multi-datasource, but I have to think a little bit more about that.

I've thought a little bit more about this, going through the code, and
yeah, it might be a good idea to push the prefix/suffix scanning into
the lower-level parts of the diff-algorithm (the token parsing /
building the token tree).

Something like:
- Make token.c#svn_diff__get_tokens take multiple datasources.
- In diff.c, diff3.c and diff4.c replace the multiple calls to this
function to one call passing multiple datasources.
- token.c#svn_diff__get_tokens (with multiple datasources) will take
care of identical prefix and suffix scanning based on tokens (so can
take advantage of the standard token-parsing rules with ignore-*
options, by simply calling svn_diff_fns_t.datasource_get_next_token).

Some things needed to support this:
- svn_diff_fns_t.datasource_get_next_token: calculation of the hash
should be made conditional (I don't want to waste time for the adler32
hash for lines that are not going to be put in the token tree).
- For suffix scanning, I'll need another variation of
datasource_get_next_token to get tokens from the end going backwards
(say datasource_get_previous_token). No hash needed.

Does that make sense? Or am I missing it completely?

It would mean I'd have to throw away the current patch and rewrite it
in the token layer (but I don't really mind, as long as a good
solution takes its place :-)). And your file_info refactoring would
also not be used (although it's certainly a worthwhile refactoring by
itself, making the current code clearer).

Cheers,
-- 
Johan


Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

2010-10-09 Thread Johan Corveleyn
On Sat, Oct 9, 2010 at 2:57 AM, Julian Foad julian.f...@wandisco.com wrote:
 On Sat, 2010-10-09, Johan Corveleyn wrote:
 Ok, third iteration of the patch in attachment. It passes make check.

 As discussed in [1], this version keeps 50 lines of the identical
 suffix around, to give the algorithm a good chance to generate a diff
 output of good quality (in all but the most extreme cases, this will
 be the same as with the original svn_diff algorithm).

 That's about the only difference with the previous iteration. So for
 now, I'm submitting this for review. Any feedback is very welcome :-).

 Hi Johan.

Hi Julian,

Thanks for taking a look.

 I haven't reviewed it, but after seeing today's discussion I had just
 scrolled quickly through the previous version of this patch.  I noticed
 that the two main functions - find_identical_suffix and
 find_identical_suffix - are both quite similar (but not quite similar
 enough to make them easily share implementation) and both quite long,
 and I noticed you wrote in an earlier email that you were finding it
 hard to make the code readable.  I have a suggestion that may help.

 I think the existing structure of the svn_diff__file_baton_t is
 unhelpful:
 {
  const svn_diff_file_options_t *options;
  const char *path[4];

  apr_file_t *file[4];
  apr_off_t size[4];

  int chunk[4];
  char *buffer[4];
  char *curp[4];
  char *endp[4];

  /* List of free tokens that may be reused. */
  svn_diff__file_token_t *tokens;

  svn_diff__normalize_state_t normalize_state[4];

  apr_pool_t *pool;
 } svn_diff__file_baton_t;

 All those array[4] fields are logically related, but this layout forces
 the programmer to address them individually.

 So I wrote a patch - attached - that refactors this into an array of 4
 sub-structures, and simplifies all the code that uses them.

 I think this will help you to get better code clarity because then your
 increment_pointer_or_chunk() for example will be able to take a single
 pointer to a file_info structure instead of a lot of pointers to
 individual members of the same.

 Would you take a look and let me know if you agree.  If so, I can commit
 the refactoring straight away.

Yes, great idea! That would indeed vastly simplify a lot of the code.
So please go ahead and commit the refactoring.

Also, maybe the last_chunk number could be included in the file_info
struct? Now it's calculated in several places: last_chunk0 =
offset_to_chunk(file_baton-size[idx0]), or I have to pass it on
every time as an extra argument. Seems like the sort of info that
could be part of file_info.

One more thing: you might have noticed that for find_identical_suffix
I use other buffers, chunks, curp's, endp's, ... than for the prefix.
For prefix scanning I can just use the stuff from the diff_baton,
because after prefix scanning has finished, everything is buffered and
pointing correctly for the normal algorithm to continue (i.e.
everything points at the first byte of the first non-identical line).
For suffix scanning I need to use other structures (newly alloc'd
buffer etc), so as to not mess with those pointers/buffers from the
diff_baton.

So: I think I'll need the file_info struct to be available out of the
diff_baton_t struct as well, so I can use this in suffix scanning
also.

(side-note: I considered first doing suffix scanning, then prefix
scanning, so I could reuse the buffers/pointers from diff_baton all
the time, and still have everything pointing correctly after
eliminating prefix/suffix. But that could give vastly different
results in some cases, for instance when original file is entirely
identical to both the prefix and the suffix of the modified file. So I
decided it's best to stick with first prefix, then suffix).

 Responding to some of the other points you mentioned in a much earlier
 mail:

 3) It's only implemented for 2 files. I'd like to generalize this for
 an array of datasources, so it can also be used for diff3 and diff4.

 4) As a small hack, I had to add a flag datasource_opened to
 token.c#svn_diff__get_tokens, so I could have different behavior for
 regular diff vs. diff3 and diff4. If 3) gets implemented, this hack is
 no longer needed.

 Yes, I'd like to see 3), and so hack 4) will go away.

I'm wondering though how I should represent the datasources to pass
into datasources_open. An array combined with a length parameter?
Something like:

static svn_error_t *
datasources_open(void *baton, apr_off_t *prefix_lines,
 svn_diff_datasource_e[] datasources, int datasources_len)

? And then use for loops everywhere I now do things twice for the two
datasources?

 5) I've struggled with making the code readable/understandable. It's
 likely that there is still a lot of room for improvement. I also
 probably need to document it some more.

 You need to write a full doc string for datasources_open(), at least.
 It needs especially to say how it relates to datasource_open() - why
 should the caller call this 

Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

2010-10-09 Thread Daniel Shahaf
Johan Corveleyn wrote on Sat, Oct 09, 2010 at 14:21:09 +0200:
 (side-note: I considered first doing suffix scanning, then prefix
 scanning, so I could reuse the buffers/pointers from diff_baton all
 the time, and still have everything pointing correctly after
 eliminating prefix/suffix. But that could give vastly different
 results in some cases, for instance when original file is entirely
 identical to both the prefix and the suffix of the modified file. So I
 decided it's best to stick with first prefix, then suffix).

What Hyrum said.  How common /is/ this case?  And, anyway, in that case
both everything was appended and everything was prepended are
equally legitimate diffs.


Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

2010-10-09 Thread Johan Corveleyn
On Sat, Oct 9, 2010 at 5:19 PM, Daniel Shahaf d...@daniel.shahaf.name wrote:
 Johan Corveleyn wrote on Sat, Oct 09, 2010 at 14:21:09 +0200:
 (side-note: I considered first doing suffix scanning, then prefix
 scanning, so I could reuse the buffers/pointers from diff_baton all
 the time, and still have everything pointing correctly after
 eliminating prefix/suffix. But that could give vastly different
 results in some cases, for instance when original file is entirely
 identical to both the prefix and the suffix of the modified file. So I
 decided it's best to stick with first prefix, then suffix).

 What Hyrum said.  How common /is/ this case?  And, anyway, in that case
 both everything was appended and everything was prepended are
 equally legitimate diffs.

Hm, I'm not sure about this one. I just wanted to try the maximum
reasonably possible to keep the results identical to what they were.
Using another buffer for suffix scanning didn't seem that big of a
deal (only slight increase in memory use (2 chunks of 128K in current
implementation)). I made that decision pretty early, before I knew of
the other problem of suffix scanning, and the keep-50-suffix-lines
compromise we decided upon.

There may be more subtle cases than the one I described, I don't know.
OTOH, now that we have the keep-50-suffix-lines, that may help also in
this case. I'll have to think about that. Maybe I can give it a go,
first suffix then prefix, and see if I can find real-life problems ...

-- 
Johan


Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

2010-10-08 Thread Johan Corveleyn
Ok, third iteration of the patch in attachment. It passes make check.

As discussed in [1], this version keeps 50 lines of the identical
suffix around, to give the algorithm a good chance to generate a diff
output of good quality (in all but the most extreme cases, this will
be the same as with the original svn_diff algorithm).

That's about the only difference with the previous iteration. So for
now, I'm submitting this for review. Any feedback is very welcome :-).

I still consider this a WIP, because of the following remaining todo's
(which may have a lot of impact on the current implementation):
- Generalize for more than 2 datasources (for diff3 and diff4).
- revv svn_diff_fns_t and maybe other stuff I've changed in public API.
- Add support for -x-b, -x-w, and -x--ignore-eol-style options. Maybe
switch the implementation to read out entire lines before comparing
(like datasources_get_next_token does).

Log message:
[[[
Make svn_diff_diff skip identical prefix and suffix to make diff and blame
faster.

* subversion/include/svn_diff.h
  (svn_diff_fns_t): Added new function types datasources_open and
   get_prefix_lines to the vtable.

* subversion/libsvn_diff/diff_memory.c
  (datasources_open): New function (does nothing).
  (get_prefix_lines): New function (does nothing).
  (svn_diff__mem_vtable): Added new functions datasources_open and
   get_prefix_lines.

* subversion/libsvn_diff/diff_file.c
  (svn_diff__file_baton_t): Added members prefix_lines, suffix_start_chunk[4]
   and suffix_offset_in_chunk.
  (increment_pointer_or_chunk, decrement_pointer_or_chunk): New functions.
  (find_identical_prefix, find_identical_suffix): New functions.
  (datasources_open): New function, to open both datasources and find their
   identical prefix and suffix. From the identical suffix, 50 lines are kept to
   help the diff algorithm find the nicest possible diff representation
   in case of ambiguity.
  (get_prefix_lines): New function.
  (datasource_get_next_token): Stop at start of identical suffix.
  (svn_diff__file_vtable): Added new functions datasources_open and
   get_prefix_lines.

* subversion/libsvn_diff/diff.h
  (svn_diff__get_tokens): Added argument datasource_opened, to indicate that
   the datasource was already opened.

* subversion/libsvn_diff/token.c
  (svn_diff__get_tokens): Added argument datasource_opened. Only open the
   datasource if datasource_opened is FALSE. Set the starting offset of the
   position list to the number of prefix lines.

* subversion/libsvn_diff/lcs.c
  (svn_diff__lcs): Added argument prefix_lines. Use this to correctly set
   the offset of the sentinel position for EOF, even if one of the files
   became empty after eliminating the identical prefix.

* subversion/libsvn_diff/diff.c
  (svn_diff__diff): Add a chunk of common diff for identical prefix.
  (svn_diff_diff): Use new function datasources_open, to open original and
   modified at once, and find their identical prefix and suffix. Pass
   prefix_lines to svn_diff__lcs and to svn_diff__diff.

* subversion/libsvn_diff/diff3.c
  (svn_diff_diff3): Pass datasource_opened = FALSE to svn_diff__get_tokens.
   Pass prefix_lines = 0 to svn_diff__lcs.

* subversion/libsvn_diff/diff4.c
  (svn_diff_diff4): Pass datasource_opened = FALSE to svn_diff__get_tokens.
   Pass prefix_lines = 0 to svn_diff__lcs.
]]]


Cheers,
-- 
Johan

[1] http://svn.haxx.se/dev/archive-2010-10/0141.shtml
Index: subversion/include/svn_diff.h
===
--- subversion/include/svn_diff.h   (revision 1006020)
+++ subversion/include/svn_diff.h   (working copy)
@@ -112,6 +112,11 @@ typedef struct svn_diff_fns_t
   svn_error_t *(*datasource_open)(void *diff_baton,
   svn_diff_datasource_e datasource);
 
+  /** Open the datasources of type @a datasources. */
+  svn_error_t *(*datasources_open)(void *diff_baton, apr_off_t *prefix_lines,
+   svn_diff_datasource_e datasource0,
+   svn_diff_datasource_e datasource1);
+
   /** Close the datasource of type @a datasource. */
   svn_error_t *(*datasource_close)(void *diff_baton,
svn_diff_datasource_e datasource);
@@ -124,6 +129,9 @@ typedef struct svn_diff_fns_t
 void *diff_baton,
 svn_diff_datasource_e datasource);
 
+  /** Get the number of identical prefix lines from the @a diff_baton. */
+  apr_off_t (*get_prefix_lines)(void *diff_baton);
+
   /** A function for ordering the tokens, resembling 'strcmp' in functionality.
* @a compare should contain the return value of the comparison:
* If @a ltoken and @a rtoken are equal, return 0.  If @a ltoken is
Index: subversion/libsvn_diff/diff_file.c
===
--- subversion/libsvn_diff/diff_file.c  (revision 1006020)
+++ 

Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

2010-10-08 Thread Julian Foad
On Sat, 2010-10-09, Johan Corveleyn wrote:
 Ok, third iteration of the patch in attachment. It passes make check.
 
 As discussed in [1], this version keeps 50 lines of the identical
 suffix around, to give the algorithm a good chance to generate a diff
 output of good quality (in all but the most extreme cases, this will
 be the same as with the original svn_diff algorithm).
 
 That's about the only difference with the previous iteration. So for
 now, I'm submitting this for review. Any feedback is very welcome :-).

Hi Johan.

I haven't reviewed it, but after seeing today's discussion I had just
scrolled quickly through the previous version of this patch.  I noticed
that the two main functions - find_identical_suffix and
find_identical_suffix - are both quite similar (but not quite similar
enough to make them easily share implementation) and both quite long,
and I noticed you wrote in an earlier email that you were finding it
hard to make the code readable.  I have a suggestion that may help.

I think the existing structure of the svn_diff__file_baton_t is
unhelpful:
{
  const svn_diff_file_options_t *options;
  const char *path[4];

  apr_file_t *file[4];
  apr_off_t size[4];

  int chunk[4];
  char *buffer[4];
  char *curp[4];
  char *endp[4];

  /* List of free tokens that may be reused. */
  svn_diff__file_token_t *tokens;

  svn_diff__normalize_state_t normalize_state[4];

  apr_pool_t *pool;
} svn_diff__file_baton_t;

All those array[4] fields are logically related, but this layout forces
the programmer to address them individually.

So I wrote a patch - attached - that refactors this into an array of 4
sub-structures, and simplifies all the code that uses them.

I think this will help you to get better code clarity because then your
increment_pointer_or_chunk() for example will be able to take a single
pointer to a file_info structure instead of a lot of pointers to
individual members of the same.

Would you take a look and let me know if you agree.  If so, I can commit
the refactoring straight away.


Responding to some of the other points you mentioned in a much earlier
mail:

 3) It's only implemented for 2 files. I'd like to generalize this for
 an array of datasources, so it can also be used for diff3 and diff4.
 
 4) As a small hack, I had to add a flag datasource_opened to
 token.c#svn_diff__get_tokens, so I could have different behavior for
 regular diff vs. diff3 and diff4. If 3) gets implemented, this hack is
 no longer needed.

Yes, I'd like to see 3), and so hack 4) will go away.

 5) I've struggled with making the code readable/understandable. It's
 likely that there is still a lot of room for improvement. I also
 probably need to document it some more.

You need to write a full doc string for datasources_open(), at least.
It needs especially to say how it relates to datasource_open() - why
should the caller call this function rather than that function, and are
they both necessary - and how the caller is supposed to use the
'prefix_lines' parameter.  And doc strings for
increment/decrement_...().

But this makes me think, it looks to me like this whole
prefix-suffix-skipping functionality would fit better inside the
lower-level diff algorithm rather than inside the datasources_open
function.  Do you agree?

It works as it is, but you were talking about wanting it to obey the
standard token-parsing rules such as ignore white space, and so on.
It seems that things like this would be much better one level down.

 6) I've not paid too much attention to low level optimizations, so
 here also there's probably room for improvement, which may be
 significant for the critical loops.

Maybe. Not important at this stage.

 7) There are two warnings left conversion from 'apr_off_t' to 'int',
 in diff_file.c#find_identical_suffix. To completely eliminate this, I
 should probably make all chunks of type apr_off_t instead of int
 (but I have not done that yet, because the original implementation
 also used int for the chunk in the svn_diff__file_baton_t struct).
 Should I do this (also in the baton struct)? Or should I use an
 explicit cast somewhere?

I recommend using an explicit cast where needed, in this patch.
Changing the 'chunk' data type everywhere could be done in a separate
patch, either before or after this patch.

- Julian


 I still consider this a WIP, because of the following remaining todo's
 (which may have a lot of impact on the current implementation):
 - Generalize for more than 2 datasources (for diff3 and diff4).
 - revv svn_diff_fns_t and maybe other stuff I've changed in public API.
 - Add support for -x-b, -x-w, and -x--ignore-eol-style options. Maybe
 switch the implementation to read out entire lines before comparing
 (like datasources_get_next_token does).
 
 Log message:
 [[[
 Make svn_diff_diff skip identical prefix and suffix to make diff and blame
 faster.
 
 * subversion/include/svn_diff.h
   (svn_diff_fns_t): Added new function types datasources_open and

Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

2010-10-03 Thread Johan Corveleyn
On Sun, Oct 3, 2010 at 1:46 AM, Johan Corveleyn jcor...@gmail.com wrote:
 Hi,

 Here is a second iteration of the patch. It now passes make check.

 Differences from the previous version are:
 - Support for \r eol-style (\n and \r\n was already ok).
 - The number of prefix_lines is now passed to svn_diff__lcs, so it can
 use that value to set the position offset of the EOF marker
 correctly, in case one of both files has become empty after skipping
 the prefix. This fixes the crashes in blame_tests.py 2 and 7.

 The patch is pretty big, so please let me know if I should split it up
 to make it more reviewable (I could easily split it up between the
 prefix-finding and the suffix-finding, at the cost of having overview
 over the entire algorithm).

 Still to do:
 - Think about why results are sometimes different (because of
 eliminated suffix, the LCS can sometimes be slightly different), and
 what can be done about it.

Hm, this problem has made me reconsider whether my patch is a good
solution (more details below). I'm thinking of other ways of
implementing this kind of optimization, so for now I'm pulling this
patch. Please don't review it yet, as I might go for a radically
different approach (sorry for any time spent that may be lost).

The details:
The problem happens if there are two (or more) non-matching parts with
an identical part in between (so the prefix scanning stops early,
doesn't meet the suffix in one of the files), and the suffix scanning
goes too far because the end of the last change is identical to the
original at that point.

For example (best viewed with fixed-width font):
[[[
original | modified
---
This is line 1   | This is line 1
This is line 2   | This line is *changed*
This is line 3   | This is line 3
... (more identical lines)   | ...
existing_function()  | existing_function()
{| {
  // do stuff|   // do stuff
  return SVN_NO_ERROR;   |   return SVN_NO_ERROR;
}| }
 |
This is the end of the file  | new_function()
-| {
 |   // do other stuff
 |   return SVN_NO_ERROR;
 | }
 |
 | This is the end of the file
 |-
]]]

The following identical suffix will be eliminated:
[[[
  return SVN_NO_ERROR;
}

This is the end of the file
]]]

which means the LCS algorithm cannot do better than to say that the
following lines are new:
[[[
+  return SVN_NO_ERROR;
+}
+
+new_function()
+{
+  // do other stuff
]]]

Not quite what we want/expect. If the suffix is not eliminated, the
LCS algorithm gives the correct/expected result. Also if the change in
the beginning of the file isn't there, the result is correct (because
the prefix scanning goes first, and eliminates the identical stuff
from the start (I'll leave it as an exercise to the reader to see that
the result is better)).

Interestingly, GNU diff also has this problem. It mitigates it a bit
by keeping a number of identical prefix/suffix lines (at least the
number of context lines, or a higher number if supplied by
--horizon-lines=NUM). This cannot completely solve the problem though:
for any given number of horizon-lines one can always come up with an
example that doesn't give the correct result.

So I need to find a way to not eliminate the identical suffix, but
to mark those lines as identical and then include them in the
position_list that's going to be used by the LCS algorithm. I'd like
to do the same for the prefix. This means I need to approach this
problem on a different level.

To be continued ...

Cheers,
-- 
Johan


Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

2010-10-02 Thread Johan Corveleyn
Hi,

Here is a second iteration of the patch. It now passes make check.

Differences from the previous version are:
- Support for \r eol-style (\n and \r\n was already ok).
- The number of prefix_lines is now passed to svn_diff__lcs, so it can
use that value to set the position offset of the EOF marker
correctly, in case one of both files has become empty after skipping
the prefix. This fixes the crashes in blame_tests.py 2 and 7.

The patch is pretty big, so please let me know if I should split it up
to make it more reviewable (I could easily split it up between the
prefix-finding and the suffix-finding, at the cost of having overview
over the entire algorithm).

Still to do:
- Think about why results are sometimes different (because of
eliminated suffix, the LCS can sometimes be slightly different), and
what can be done about it.
- Generalize for more than 2 datasources (for diff3 and diff4).
- revv svn_diff_fns_t and maybe other stuff I've changed in public API.
- Add support for -x-b, -x-w, and -x--ignore-eol-style options.

But I'd like to do those things in follow-up patches, after this one
has been reviewed and digested a little bit. So at this point: review,
feedback, ... very welcome :-).

Log message:
[[[
Make svn_diff_diff skip identical prefix and suffix to make diff and blame
faster.

* subversion/include/svn_diff.h
  (svn_diff_fns_t): Added new function types datasources_open and
   get_prefix_lines to the vtable.

* subversion/libsvn_diff/diff_memory.c
  (datasources_open): New function (does nothing).
  (get_prefix_lines): New function (does nothing).
  (svn_diff__mem_vtable): Added new functions datasources_open and
   get_prefix_lines.

* subversion/libsvn_diff/diff_file.c
  (svn_diff__file_baton_t): Added members prefix_lines, suffix_start_chunk[4]
   and suffix_offset_in_chunk.
  (increment_pointer_or_chunk, decrement_pointer_or_chunk): New functions.
  (find_identical_prefix, find_identical_suffix): New functions.
  (datasources_open): New function, to open both datasources and find their
   identical prefix and suffix.
  (get_prefix_lines): New function.
  (datasource_get_next_token): Stop at start of identical suffix.
  (svn_diff__file_vtable): Added new functions datasources_open and
   get_prefix_lines.

* subversion/libsvn_diff/diff.h
  (svn_diff__get_tokens): Added argument datasource_opened, to indicate that
   the datasource was already opened.

* subversion/libsvn_diff/token.c
  (svn_diff__get_tokens): Added argument datasource_opened. Only open the
   datasource if datasource_opened is FALSE. Set the starting offset of the
   position list to the number of prefix lines.

* subversion/libsvn_diff/lcs.c
  (svn_diff__lcs): Added argument prefix_lines. Use this to correctly set
   the offset of the sentinel position for EOF, even if one of the files
   became empty after eliminating the identical prefix.

* subversion/libsvn_diff/diff.c
  (svn_diff__diff): Add a chunk of common diff for identical prefix.
  (svn_diff_diff): Use new function datasources_open, to open original and
   modified at once, and find their identical prefix and suffix. Pass
   prefix_lines to svn_diff__lcs and to svn_diff__diff.

* subversion/libsvn_diff/diff3.c
  (svn_diff_diff3): Pass datasource_opened = FALSE to svn_diff__get_tokens.
   Pass prefix_lines = 0 to svn_diff__lcs.

* subversion/libsvn_diff/diff4.c
 (svn_diff_diff4): Pass datasource_opened = FALSE to svn_diff__get_tokens.
   Pass prefix_lines = 0 to svn_diff__lcs.
]]]

Cheers,
-- 
Johan
Index: subversion/include/svn_diff.h
===
--- subversion/include/svn_diff.h   (revision 1003326)
+++ subversion/include/svn_diff.h   (working copy)
@@ -112,6 +112,11 @@ typedef struct svn_diff_fns_t
   svn_error_t *(*datasource_open)(void *diff_baton,
   svn_diff_datasource_e datasource);
 
+  /** Open the datasources of type @a datasources. */
+  svn_error_t *(*datasources_open)(void *diff_baton, apr_off_t *prefix_lines,
+   svn_diff_datasource_e datasource0,
+   svn_diff_datasource_e datasource1);
+
   /** Close the datasource of type @a datasource. */
   svn_error_t *(*datasource_close)(void *diff_baton,
svn_diff_datasource_e datasource);
@@ -124,6 +129,9 @@ typedef struct svn_diff_fns_t
 void *diff_baton,
 svn_diff_datasource_e datasource);
 
+  /** Get the number of identical prefix lines from the @a diff_baton. */
+  apr_off_t (*get_prefix_lines)(void *diff_baton);
+
   /** A function for ordering the tokens, resembling 'strcmp' in functionality.
* @a compare should contain the return value of the comparison:
* If @a ltoken and @a rtoken are equal, return 0.  If @a ltoken is
Index: subversion/libsvn_diff/diff_file.c

Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

2010-09-28 Thread Daniel Shahaf
 Index: subversion/include/svn_diff.h
 ===
 --- subversion/include/svn_diff.h (revision 1001548)
 +++ subversion/include/svn_diff.h (working copy)
 @@ -112,6 +112,11 @@
(personally I prefer 'svn diff -x-p' to show the function name here)
svn_error_t *(*datasource_open)(void *diff_baton,
svn_diff_datasource_e datasource);
  
 +  /** Open the datasources of type @a datasources. */
 +  svn_error_t *(*datasources_open)(void *diff_baton, apr_off_t *prefix_lines,
 +   svn_diff_datasource_e datasource0,
 +   svn_diff_datasource_e datasource1);
 +

So, you're extending the svn_diff_fns_t struct, which is defined in
a public header.

It's a public struct with no constructor function, so I believe you have
to revv it (into svn_diff_fns2_t) in order to extend it (for binary
compatibility: people allocating this struct and then using a newer
Subversion library at runtime).

If it did have a constructor function, you'd have to extend it only at
the end, and even then make sure that if the added elements are NULL (eg
because an old caller didn't know to fill them) then everything still
worked.

Probably more important to get the logic right than to revv it right
away, though; the latter is a SMOP.

/** Close the datasource of type @a datasource. */
svn_error_t *(*datasource_close)(void *diff_baton,
 svn_diff_datasource_e datasource);
 @@ -124,6 +129,9 @@
  void *diff_baton,
  svn_diff_datasource_e 
 datasource);
  
 +  /** Get the number of identical prefix lines from the @a diff_baton. */
 +  apr_off_t (*get_prefix_lines)(void *diff_baton);
 +
/** A function for ordering the tokens, resembling 'strcmp' in 
 functionality.
 * @a compare should contain the return value of the comparison:
 * If @a ltoken and @a rtoken are equal, return 0.  If @a ltoken is
 Index: subversion/libsvn_diff/diff_memory.c
 ===
 --- subversion/libsvn_diff/diff_memory.c  (revision 1001548)
 +++ subversion/libsvn_diff/diff_memory.c  (working copy)
 @@ -95,7 +95,23 @@
return SVN_NO_ERROR;
  }
  
 +/* Implements svn_diff_fns_t::datasources_open */
 +static svn_error_t *
 +datasources_open(void *baton, apr_off_t *prefix_lines,
 + svn_diff_datasource_e datasource0, 
 + svn_diff_datasource_e datasource1)
 +{
 +  /* Do nothing: everything is already there and initialized to 0 */
 +  return SVN_NO_ERROR;
 +}
  
 +/* Implements svn_diff_fns_t::datasource_get_prefix_lines */
 +static apr_off_t
 +get_prefix_lines(void *baton)
 +{
 +  return 0;
 +}
 +
  /* Implements svn_diff_fns_t::datasource_close */
  static svn_error_t *
  datasource_close(void *baton, svn_diff_datasource_e datasource)
 @@ -189,8 +205,10 @@
  static const svn_diff_fns_t svn_diff__mem_vtable =
  {
datasource_open,
 +  datasources_open,
datasource_close,
datasource_get_next_token,
 +  get_prefix_lines,
token_compare,
token_discard,
token_discard_all
 Index: subversion/libsvn_diff/diff_file.c
 ===
 --- subversion/libsvn_diff/diff_file.c(revision 1001548)
 +++ subversion/libsvn_diff/diff_file.c(working copy)
 @@ -77,6 +77,10 @@
char *curp[4];
char *endp[4];
  
 +  apr_off_t prefix_lines;
 +  int suffix_start_chunk[4];
 +  apr_off_t suffix_offset_in_chunk[4];
 +
/* List of free tokens that may be reused. */
svn_diff__file_token_t *tokens;
  
 @@ -233,7 +237,330 @@
  curp, length, 0, file_baton-pool);
  }
  
 +static svn_error_t *
 +increment_pointer_or_chunk(svn_diff__file_baton_t *file_baton,
 +   char **curp, char **endp, int *chunk_number,
 +   char *buffer, apr_off_t last_chunk_number, int 
 idx)
 +{
 +  apr_off_t length;
  
 +  if ((*curp) == (*endp) - 1)
 +{
 +  if (*chunk_number == last_chunk_number)
 +(*curp)++; /* *curp == *endp with last chunk signals end of file */
 +  else
 +{
 +  (*chunk_number)++;
 +  length = *chunk_number == last_chunk_number ?
 +offset_in_chunk(file_baton-size[idx]) : CHUNK_SIZE;
 +  SVN_ERR(read_chunk(file_baton-file[idx],
 + file_baton-path[idx],
 + buffer, length,
 + chunk_to_offset(*chunk_number),
 + file_baton-pool));
 +  *endp = buffer + length;
 +  *curp = buffer;
 +}
 +}
 +  else
 +{
 +  (*curp)++;
 +}
 +
 +  return SVN_NO_ERROR;
 +}
 +
 +static svn_error_t *
 +decrement_pointer_or_chunk(svn_diff__file_baton_t *file_baton,
 +  

Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

2010-09-28 Thread Johan Corveleyn
Hi Daniel,

Thanks for the feedback.

On Tue, Sep 28, 2010 at 4:11 PM, Daniel Shahaf d...@daniel.shahaf.name wrote:
 Index: subversion/include/svn_diff.h
 ===
 --- subversion/include/svn_diff.h     (revision 1001548)
 +++ subversion/include/svn_diff.h     (working copy)
 @@ -112,6 +112,11 @@
 (personally I prefer 'svn diff -x-p' to show the function name here)

Ok, will do next time.

    svn_error_t *(*datasource_open)(void *diff_baton,
                                    svn_diff_datasource_e datasource);

 +  /** Open the datasources of type @a datasources. */
 +  svn_error_t *(*datasources_open)(void *diff_baton, apr_off_t 
 *prefix_lines,
 +                                   svn_diff_datasource_e datasource0,
 +                                   svn_diff_datasource_e datasource1);
 +

 So, you're extending the svn_diff_fns_t struct, which is defined in
 a public header.

 It's a public struct with no constructor function, so I believe you have
 to revv it (into svn_diff_fns2_t) in order to extend it (for binary
 compatibility: people allocating this struct and then using a newer
 Subversion library at runtime).

 If it did have a constructor function, you'd have to extend it only at
 the end, and even then make sure that if the added elements are NULL (eg
 because an old caller didn't know to fill them) then everything still
 worked.

 Probably more important to get the logic right than to revv it right
 away, though; the latter is a SMOP.

Doh! I'm sure that observation was in the back of my head somewhere,
but I forgot about it while working on the solution. Anyway, you're
right: there is first some more work to get the algorithm right.

I've had some progress:
- The blame_tests.py now all pass (tests 2 and 7 provoked a core
dump). That makes all tests pass. However, although I fixed the
coredump, I don't fully understand the root cause (why they ended up
in the situation where they ended up). So I'm going to study that
first a bit more.
- I have now included support for files with \r eol-style.

I'll send a new version of the patch shortly.

I'm also thinking that I could try to take advantage of -x-b/-x-w or
-x--ignore-eol-style options to make it even faster (right now the
prefix/suffix scanning will stop at the first difference, regardless
if it's a whitespace or eol difference that could/should be ignored).

However, I'm not sure if I should implement this myself, as part of
the find_identical_prefix and find_identical_suffix functions, or
switch to the usage of datasource_get_next_token (which is the
function that is used by the real diff algorithm to extract the
lines, and which uses util.c#svn_diff__normalize_buffer to normalize
whitespace and eol's if needed).

Right now, I don't read entire lines (tokens) but compare each byte as
I go. But I could do it line-based as well (extract line from file1,
extract line from file2, memcmp lines). I would have to make the
calculation of the adler32 hash in datasource_get_next_token
conditional on some boolean, or factor out the part of the function
that's useful to me into a new static function.

There is one caveat to this approach: I'm not sure if it would work
backwards (for suffix scanning). Well, the normalization function
wouldn't have to be changed, but the extraction of lines would have to
go backward. Surely it's possible, but I have no idea how much I'd
have to change the code in get_next_token to get lines backwards...

I'm also not sure if one would be (significantly) faster than the
other: comparing byte-by-byte while going through both files, or
extracting entire lines and then comparing the lines.

Thoughts?

Cheers,
-- 
Johan


Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

2010-09-28 Thread Daniel Shahaf
Johan Corveleyn wrote on Tue, Sep 28, 2010 at 23:37:23 +0200:
 On Tue, Sep 28, 2010 at 4:11 PM, Daniel Shahaf d...@daniel.shahaf.name 
 wrote:
  Index: subversion/include/svn_diff.h
  ===
  --- subversion/include/svn_diff.h     (revision 1001548)
  +++ subversion/include/svn_diff.h     (working copy)
  @@ -112,6 +112,11 @@
  (personally I prefer 'svn diff -x-p' to show the function name here)
 
 Ok, will do next time.
 

Thanks.

 I've had some progress:
 - The blame_tests.py now all pass (tests 2 and 7 provoked a core
 dump). That makes all tests pass. However, although I fixed the
 coredump, I don't fully understand the root cause (why they ended up
 in the situation where they ended up). So I'm going to study that
 first a bit more.
 - I have now included support for files with \r eol-style.
 
 I'll send a new version of the patch shortly.
 

Sounds good.

 I'm also thinking that I could try to take advantage of -x-b/-x-w or
 -x--ignore-eol-style options to make it even faster (right now the
 prefix/suffix scanning will stop at the first difference, regardless
 if it's a whitespace or eol difference that could/should be ignored).
 
...
 
 Thoughts?

None, actually; I'm not (yet) sufficiently familiar with diff internals. 
But let's please have this work in small, digestible (read: reviewable) 
pieces --- support for -x--foo is exactly the sort of additional 
optimization that can be done in a separate patch (on top of this 
starting patch).


[WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

2010-09-26 Thread Johan Corveleyn
Hi devs,

As discussed in [1], here is a patch that makes svn_diff_diff
(libsvn_diff/diff.c) skip the identical prefix and suffix of the
original and modified files, before starting the LCS (longest common
subsequence) algorithm on the non-matching part.

This makes diff a lot faster (especially for large files), thereby
also speeding up blame in environments where the client is the
bottleneck (with a fast enough server and network, blame is
constrained by the speed of diff on the client side).

This patch still has some rough edges (see below), hence the WIP
marker. Nevertheless, it works most of the time, and it can
demonstrate the speed improvement that can be expected. I will
continue working on the rough edges, but in the meantime any
feedback/review/thoughts are very much appreciated.

The rough edges:
1) While scanning through the identical prefix, I have to count the
number of lines (to have correct line offsets for the rest of the
algorithm). To do that, I currently count the number of \n's, so it
won't work for files with \r eol-style. Not so difficult to fix
(similar to how diff_file.c#datasource_get_next_token does it), but I
haven't done it yet.

2) Two tests still fail:
- blame_tests.py 2: annotate a binary file
- blame_tests.py 7: blame with different eol styles
Maybe this is because of 1), I'll have to investigate.

3) It's only implemented for 2 files. I'd like to generalize this for
an array of datasources, so it can also be used for diff3 and diff4.

4) As a small hack, I had to add a flag datasource_opened to
token.c#svn_diff__get_tokens, so I could have different behavior for
regular diff vs. diff3 and diff4. If 3) gets implemented, this hack is
no longer needed.

5) I've struggled with making the code readable/understandable. It's
likely that there is still a lot of room for improvement. I also
probably need to document it some more.

6) I've not paid too much attention to low level optimizations, so
here also there's probably room for improvement, which may be
significant for the critical loops.

7) There are two warnings left conversion from 'apr_off_t' to 'int',
in diff_file.c#find_identical_suffix. To completely eliminate this, I
should probably make all chunks of type apr_off_t instead of int
(but I have not done that yet, because the original implementation
also used int for the chunk in the svn_diff__file_baton_t struct).
Should I do this (also in the baton struct)? Or should I use an
explicit cast somewhere?

8) A bigger problem: the output of diff/blame is sometimes different
from the original implementation. It's always syntactically correct,
but sometimes the less unique lines are taken differently (like when
a new function is added, and diff thinks the new block starts from the
closing brace of the previous function, ... stuff like that). This
happens only because of the identical-suffix-scanning (it doesn't
happen if I only enable the identical-prefix-scanning).

I think I know the cause of this change in behavior: I completely
eliminate the suffix from the LCS-building algorithm. And that
probably causes it to sometimes come up with another longest common
subsequence. Right now I don't know how to solve this completely. A
workaround might be to add a certain number of suffix-lines to the
non-matching-block, so they can be part of the LCS-searching. But
probably one can always come up with examples where the number of
suffix-lines is not enough. I'll have to look into this some more.
Ideas welcome :-).

Log message:
[[[
Make svn_diff_diff skip identical prefix and suffix to make diff and blame
faster.

* subversion/include/svn_diff.h
  (svn_diff_fns_t): Added new function types datasources_open and
   get_prefix_lines to the vtable.

* subversion/libsvn_diff/diff_memory.c
  (datasources_open): New function (does nothing).
  (get_prefix_lines): New function (does nothing).
  (svn_diff__mem_vtable): Added new functions datasources_open and
   get_prefix_lines.

* subversion/libsvn_diff/diff_file.c
  (svn_diff__file_baton_t): Added members prefix_lines, suffix_start_chunk[4]
   and suffix_offset_in_chunk.
  (increment_pointer_or_chunk, decrement_pointer_or_chunk): New functions.
  (find_identical_prefix, find_identical_suffix): New fuctions.
  (datasources_open): New function, to open both datasources and find their
   identical prefix and suffix.
  (get_prefix_lines): New function.
  (datasource_get_next_token): Stop at start of identical suffix.
  (svn_diff__file_vtable): Added new functions datasources_open and
   get_prefix_lines.

* subversion/libsvn_diff/diff.h
  (svn_diff__get_tokens): Added argument datasource_opened, to indicate that
   the datasource was already opened.

* subversion/libsvn_diff/token.c
  (svn_diff__get_tokens): Added argument datasource_opened. Only open the
   datasource if datasource_opened is FALSE. Set the starting offset of the
   position list to the number of prefix lines.

* subversion/libsvn_diff/diff.c
  (svn_diff__diff): Add a chunk of