Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

Julian Foad Fri, 08 Oct 2010 17:57:57 -0700

On Sat, 2010-10-09, Johan Corveleyn wrote:
> Ok, third iteration of the patch in attachment. It passes make check.
> 
> As discussed in [1], this version keeps 50 lines of the identical
> suffix around, to give the algorithm a good chance to generate a diff
> output of good quality (in all but the most extreme cases, this will
> be the same as with the original svn_diff algorithm).
> 
> That's about the only difference with the previous iteration. So for
> now, I'm submitting this for review. Any feedback is very welcome :-).


Hi Johan.

I haven't reviewed it, but after seeing today's discussion I had just
scrolled quickly through the previous version of this patch.  I noticed
that the two main functions - find_identical_suffix and
find_identical_suffix - are both quite similar (but not quite similar
enough to make them easily share implementation) and both quite long,
and I noticed you wrote in an earlier email that you were finding it
hard to make the code readable.  I have a suggestion that may help.

I think the existing structure of the svn_diff__file_baton_t is
unhelpful:
{
  const svn_diff_file_options_t *options;
  const char *path[4];

  apr_file_t *file[4];
  apr_off_t size[4];

  int chunk[4];
  char *buffer[4];
  char *curp[4];
  char *endp[4];

  /* List of free tokens that may be reused. */
  svn_diff__file_token_t *tokens;

  svn_diff__normalize_state_t normalize_state[4];

  apr_pool_t *pool;
} svn_diff__file_baton_t;

All those array[4] fields are logically related, but this layout forces
the programmer to address them individually.

So I wrote a patch - attached - that refactors this into an array of 4
sub-structures, and simplifies all the code that uses them.

I think this will help you to get better code clarity because then your
increment_pointer_or_chunk() for example will be able to take a single
pointer to a file_info structure instead of a lot of pointers to
individual members of the same.

Would you take a look and let me know if you agree.  If so, I can commit
the refactoring straight away.


Responding to some of the other points you mentioned in a much earlier
mail:

> 3) It's only implemented for 2 files. I'd like to generalize this for
> an array of datasources, so it can also be used for diff3 and diff4.
> 
> 4) As a small hack, I had to add a flag "datasource_opened" to
> token.c#svn_diff__get_tokens, so I could have different behavior for
> regular diff vs. diff3 and diff4. If 3) gets implemented, this hack is
> no longer needed.

Yes, I'd like to see 3), and so hack 4) will go away.

> 5) I've struggled with making the code readable/understandable. It's
> likely that there is still a lot of room for improvement. I also
> probably need to document it some more.

You need to write a full doc string for datasources_open(), at least.
It needs especially to say how it relates to datasource_open() - why
should the caller call this function rather than that function, and are
they both necessary - and how the caller is supposed to use the
'prefix_lines' parameter.  And doc strings for
increment/decrement_...().

But this makes me think, it looks to me like this whole
prefix-suffix-skipping functionality would fit better inside the
lower-level diff algorithm rather than inside the "datasources_open"
function.  Do you agree?

It works as it is, but you were talking about wanting it to obey the
standard token-parsing rules such as "ignore white space", and so on.
It seems that things like this would be much better one level down.

> 6) I've not paid too much attention to low level optimizations, so
> here also there's probably room for improvement, which may be
> significant for the critical loops.

Maybe. Not important at this stage.

> 7) There are two warnings left "conversion from 'apr_off_t' to 'int'",
> in diff_file.c#find_identical_suffix. To completely eliminate this, I
> should probably make all "chunks" of type apr_off_t instead of int
> (but I have not done that yet, because the original implementation
> also used int for the chunk in the svn_diff__file_baton_t struct).
> Should I do this (also in the baton struct)? Or should I use an
> explicit cast somewhere?

I recommend using an explicit cast where needed, in this patch.
Changing the 'chunk' data type everywhere could be done in a separate
patch, either before or after this patch.

- Julian


> I still consider this a WIP, because of the following remaining todo's
> (which may have a lot of impact on the current implementation):
> - Generalize for more than 2 datasources (for diff3 and diff4).
> - revv svn_diff_fns_t and maybe other stuff I've changed in public API.
> - Add support for -x-b, -x-w, and -x--ignore-eol-style options. Maybe
> switch the implementation to read out entire lines before comparing
> (like datasources_get_next_token does).
> 
> Log message:
> [[[
> Make svn_diff_diff skip identical prefix and suffix to make diff and blame
> faster.
> 
> * subversion/include/svn_diff.h
>   (svn_diff_fns_t): Added new function types datasources_open and
>    get_prefix_lines to the vtable.
> 
> * subversion/libsvn_diff/diff_memory.c
>   (datasources_open): New function (does nothing).
>   (get_prefix_lines): New function (does nothing).
>   (svn_diff__mem_vtable): Added new functions datasources_open and
>    get_prefix_lines.
> 
> * subversion/libsvn_diff/diff_file.c
>   (svn_diff__file_baton_t): Added members prefix_lines, suffix_start_chunk[4]
>    and suffix_offset_in_chunk.
>   (increment_pointer_or_chunk, decrement_pointer_or_chunk): New functions.
>   (find_identical_prefix, find_identical_suffix): New functions.
>   (datasources_open): New function, to open both datasources and find their
>    identical prefix and suffix. From the identical suffix, 50 lines are kept 
> to
>    help the diff algorithm find the nicest possible diff representation
>    in case of ambiguity.
>   (get_prefix_lines): New function.
>   (datasource_get_next_token): Stop at start of identical suffix.
>   (svn_diff__file_vtable): Added new functions datasources_open and
>    get_prefix_lines.
> 
> * subversion/libsvn_diff/diff.h
>   (svn_diff__get_tokens): Added argument "datasource_opened", to indicate that
>    the datasource was already opened.
> 
> * subversion/libsvn_diff/token.c
>   (svn_diff__get_tokens): Added argument "datasource_opened". Only open the
>    datasource if datasource_opened is FALSE. Set the starting offset of the
>    position list to the number of prefix lines.
> 
> * subversion/libsvn_diff/lcs.c
>   (svn_diff__lcs): Added argument "prefix_lines". Use this to correctly set
>    the offset of the sentinel position for EOF, even if one of the files
>    became empty after eliminating the identical prefix.
> 
> * subversion/libsvn_diff/diff.c
>   (svn_diff__diff): Add a chunk of "common" diff for identical prefix.
>   (svn_diff_diff): Use new function datasources_open, to open original and
>    modified at once, and find their identical prefix and suffix. Pass
>    prefix_lines to svn_diff__lcs and to svn_diff__diff.
> 
> * subversion/libsvn_diff/diff3.c
>   (svn_diff_diff3): Pass datasource_opened = FALSE to svn_diff__get_tokens.
>    Pass prefix_lines = 0 to svn_diff__lcs.
> 
> * subversion/libsvn_diff/diff4.c
>   (svn_diff_diff4): Pass datasource_opened = FALSE to svn_diff__get_tokens.
>    Pass prefix_lines = 0 to svn_diff__lcs.
> ]]]
> 
> 
> Cheers,

Re-factor a group of array variables into an array of structures, and
thereby simplify their usage.

* subversion/libsvn_diff/diff_file.c
  (svn_diff__file_baton_t): Move a set of related array variables into a new
    sub-structure array.
  (datasource_open, datasource_get_next_token, token_compare): By taking a
    pointer to the appropriate element of the sub-structure, simplify all
    references to these variables.
  (svn_diff_file_diff_2, svn_diff_file_diff3_2, svn_diff_file_diff4_2):
    Adjust references.
--This line, and those below, will be ignored--

Index: subversion/libsvn_diff/diff_file.c
===================================================================
--- subversion/libsvn_diff/diff_file.c	(revision 1006000)
+++ subversion/libsvn_diff/diff_file.c	(working copy)
@@ -67,21 +67,24 @@ typedef struct svn_diff__file_token_t
 typedef struct svn_diff__file_baton_t
 {
   const svn_diff_file_options_t *options;
-  const char *path[4];
 
-  apr_file_t *file[4];
-  apr_off_t size[4];
+  struct file_info {
+    const char *path;
 
-  int chunk[4];
-  char *buffer[4];
-  char *curp[4];
-  char *endp[4];
+    apr_file_t *file;
+    apr_off_t size;
+
+    int chunk;
+    char *buffer;
+    char *curp;
+    char *endp;
+
+    svn_diff__normalize_state_t normalize_state;
+  } files[4];
 
   /* List of free tokens that may be reused. */
   svn_diff__file_token_t *tokens;
 
-  svn_diff__normalize_state_t normalize_state[4];
-
   apr_pool_t *pool;
 } svn_diff__file_baton_t;
 
@@ -203,21 +206,19 @@ static svn_error_t *
 datasource_open(void *baton, svn_diff_datasource_e datasource)
 {
   svn_diff__file_baton_t *file_baton = baton;
-  int idx;
+  struct file_info *file = &file_baton->files[datasource_to_index(datasource)];
   apr_finfo_t finfo;
   apr_off_t length;
   char *curp;
   char *endp;
 
-  idx = datasource_to_index(datasource);
-
-  SVN_ERR(svn_io_file_open(&file_baton->file[idx], file_baton->path[idx],
+  SVN_ERR(svn_io_file_open(&file->file, file->path,
                            APR_READ, APR_OS_DEFAULT, file_baton->pool));
 
   SVN_ERR(svn_io_file_info_get(&finfo, APR_FINFO_SIZE,
-                               file_baton->file[idx], file_baton->pool));
+                               file->file, file_baton->pool));
 
-  file_baton->size[idx] = finfo.size;
+  file->size = finfo.size;
   length = finfo.size > CHUNK_SIZE ? CHUNK_SIZE : finfo.size;
 
   if (length == 0)
@@ -226,10 +227,10 @@ datasource_open(void *baton, svn_diff_da
   endp = curp = apr_palloc(file_baton->pool, (apr_size_t) length);
   endp += length;
 
-  file_baton->buffer[idx] = file_baton->curp[idx] = curp;
-  file_baton->endp[idx] = endp;
+  file->buffer = file->curp = curp;
+  file->endp = endp;
 
-  return read_chunk(file_baton->file[idx], file_baton->path[idx],
+  return read_chunk(file->file, file->path,
                     curp, length, 0, file_baton->pool);
 }
 
@@ -252,7 +253,7 @@ datasource_get_next_token(apr_uint32_t *
 {
   svn_diff__file_baton_t *file_baton = baton;
   svn_diff__file_token_t *file_token;
-  int idx;
+  struct file_info *file = &file_baton->files[datasource_to_index(datasource)];
   char *endp;
   char *curp;
   char *eol;
@@ -264,15 +265,13 @@ datasource_get_next_token(apr_uint32_t *
 
   *token = NULL;
 
-  idx = datasource_to_index(datasource);
+  curp = file->curp;
+  endp = file->endp;
 
-  curp = file_baton->curp[idx];
-  endp = file_baton->endp[idx];
-
-  last_chunk = offset_to_chunk(file_baton->size[idx]);
+  last_chunk = offset_to_chunk(file->size);
 
   if (curp == endp
-      && last_chunk == file_baton->chunk[idx])
+      && last_chunk == file->chunk)
     {
       return SVN_NO_ERROR;
     }
@@ -289,8 +288,8 @@ datasource_get_next_token(apr_uint32_t *
     }
 
   file_token->datasource = datasource;
-  file_token->offset = chunk_to_offset(file_baton->chunk[idx])
-                       + (curp - file_baton->buffer[idx]);
+  file_token->offset = chunk_to_offset(file->chunk)
+                       + (curp - file->buffer);
   file_token->raw_length = 0;
   file_token->length = 0;
 
@@ -311,7 +310,7 @@ datasource_get_next_token(apr_uint32_t *
             }
         }
 
-      if (file_baton->chunk[idx] == last_chunk)
+      if (file->chunk == last_chunk)
         {
           eol = endp;
           break;
@@ -320,21 +319,21 @@ datasource_get_next_token(apr_uint32_t *
       length = endp - curp;
       file_token->raw_length += length;
       svn_diff__normalize_buffer(&curp, &length,
-                                 &file_baton->normalize_state[idx],
+                                 &file->normalize_state,
                                  curp, file_baton->options);
       file_token->length += length;
       h = svn_diff__adler32(h, curp, length);
 
-      curp = endp = file_baton->buffer[idx];
-      file_baton->chunk[idx]++;
-      length = file_baton->chunk[idx] == last_chunk ?
-        offset_in_chunk(file_baton->size[idx]) : CHUNK_SIZE;
+      curp = endp = file->buffer;
+      file->chunk++;
+      length = file->chunk == last_chunk ?
+        offset_in_chunk(file->size) : CHUNK_SIZE;
       endp += length;
-      file_baton->endp[idx] = endp;
+      file->endp = endp;
 
-      SVN_ERR(read_chunk(file_baton->file[idx], file_baton->path[idx],
+      SVN_ERR(read_chunk(file->file, file->path,
                          curp, length,
-                         chunk_to_offset(file_baton->chunk[idx]),
+                         chunk_to_offset(file->chunk),
                          file_baton->pool));
 
       /* If the last chunk ended in a CR, we're done. */
@@ -349,7 +348,7 @@ datasource_get_next_token(apr_uint32_t *
 
   length = eol - curp;
   file_token->raw_length += length;
-  file_baton->curp[idx] = eol;
+  file->curp = eol;
 
   /* If the file length is exactly a multiple of CHUNK_SIZE, we will end up
    * with a spurious empty token.  Avoid returning it.
@@ -360,7 +359,7 @@ datasource_get_next_token(apr_uint32_t *
     {
       char *c = curp;
       svn_diff__normalize_buffer(&c, &length,
-                                 &file_baton->normalize_state[idx],
+                                 &file->normalize_state,
                                  curp, file_baton->options);
 
       file_token->norm_offset = file_token->offset;
@@ -388,13 +387,12 @@ token_compare(void *baton, void *token1,
   char buffer[2][COMPARE_CHUNK_SIZE];
   char *bufp[2];
   apr_off_t offset[2];
-  int idx[2];
+  struct file_info *file[2];
   apr_off_t length[2];
   apr_off_t total_length;
   /* How much is left to read of each token from the file. */
   apr_off_t raw_length[2];
   int i;
-  int chunk[2];
   svn_diff__normalize_state_t state[2];
 
   file_token[0] = token1;
@@ -420,17 +418,18 @@ token_compare(void *baton, void *token1,
 
   for (i = 0; i < 2; ++i)
     {
-      idx[i] = datasource_to_index(file_token[i]->datasource);
+      int idx = datasource_to_index(file_token[i]->datasource);
+
+      file[i] = &file_baton->files[idx];
       offset[i] = file_token[i]->norm_offset;
-      chunk[i] = file_baton->chunk[idx[i]];
       state[i] = svn_diff__normalize_state_normal;
 
-      if (offset_to_chunk(offset[i]) == chunk[i])
+      if (offset_to_chunk(offset[i]) == file[i]->chunk)
         {
           /* If the start of the token is in memory, the entire token is
            * in memory.
            */
-          bufp[i] = file_baton->buffer[idx[i]];
+          bufp[i] = file[i]->buffer;
           bufp[i] += offset_in_chunk(offset[i]);
 
           length[i] = total_length;
@@ -458,15 +457,15 @@ token_compare(void *baton, void *token1,
                                          NULL,
                                          _("The file '%s' changed unexpectedly"
                                            " during diff"),
-                                         file_baton->path[idx[i]]);
+                                         file[i]->path);
 
               /* Read a chunk from disk into a buffer */
               bufp[i] = buffer[i];
               length[i] = raw_length[i] > COMPARE_CHUNK_SIZE ?
                 COMPARE_CHUNK_SIZE : raw_length[i];
 
-              SVN_ERR(read_chunk(file_baton->file[idx[i]],
-                                 file_baton->path[idx[i]],
+              SVN_ERR(read_chunk(file[i]->file,
+                                 file[i]->path,
                                  bufp[i], length[i], offset[i],
                                  file_baton->pool));
               offset[i] += length[i];
@@ -623,8 +622,8 @@ svn_diff_file_diff_2(svn_diff_t **diff,
 
   memset(&baton, 0, sizeof(baton));
   baton.options = options;
-  baton.path[0] = original;
-  baton.path[1] = modified;
+  baton.files[0].path = original;
+  baton.files[1].path = modified;
   baton.pool = svn_pool_create(pool);
 
   SVN_ERR(svn_diff_diff(diff, &baton, &svn_diff__file_vtable, pool));
@@ -645,9 +644,9 @@ svn_diff_file_diff3_2(svn_diff_t **diff,
 
   memset(&baton, 0, sizeof(baton));
   baton.options = options;
-  baton.path[0] = original;
-  baton.path[1] = modified;
-  baton.path[2] = latest;
+  baton.files[0].path = original;
+  baton.files[1].path = modified;
+  baton.files[2].path = latest;
   baton.pool = svn_pool_create(pool);
 
   SVN_ERR(svn_diff_diff3(diff, &baton, &svn_diff__file_vtable, pool));
@@ -669,10 +668,10 @@ svn_diff_file_diff4_2(svn_diff_t **diff,
 
   memset(&baton, 0, sizeof(baton));
   baton.options = options;
-  baton.path[0] = original;
-  baton.path[1] = modified;
-  baton.path[2] = latest;
-  baton.path[3] = ancestor;
+  baton.files[0].path = original;
+  baton.files[1].path = modified;
+  baton.files[2].path = latest;
+  baton.files[3].path = ancestor;
   baton.pool = svn_pool_create(pool);
 
   SVN_ERR(svn_diff_diff4(diff, &baton, &svn_diff__file_vtable, pool));

Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster

Reply via email to