On Thu, Jun 3, 2021 at 12:57 PM Stefan Sperling <s...@elego.de> wrote:
>
> On Wed, Jun 02, 2021 at 02:20:12PM +0200, Stefan Sperling wrote:
> > The patch below implements an effort limit for Subversion's LCS diff.
>
> Here is an updated version of this patch. I took a closer look at
> other diff implementations again, and realized that they don't use
> a fixed limit but one which is scaled to the size of the Myers graph,
> with a square root applied to keep the effort within certain bounds
> as the files being compared become larger.
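>
> For comparison, if I remember correctly, git's xdiff computes its cost
> limit along these lines in xdl_do_diff() (paraphrased from memory, so
> the exact names may be slightly off):
>
>   /* Sketch of libxdiff's approach, not our code: the limit is an
>    * approximate square root of the number of diagonals in the
>    * Myers graph, clamped to a minimum of XDL_MAX_COST_MIN. */
>   ndiags = xe.xdf1.nreff + xe.xdf2.nreff + 3;
>   xenv.mxcost = xdl_bogosqrt(ndiags);
>   if (xenv.mxcost < XDL_MAX_COST_MIN)
>     xenv.mxcost = XDL_MAX_COST_MIN;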
>
> Compared to the previous patch, I have added comments, and renamed
> 'max_effort' to 'max_cost' to better align with how the algorithm is
> documented in the existing comments.
>
> We already maintain a variable 'p' which represents the cost of
> the algorithm. So instead of maintaining a separate counter I am
> now setting an upper bound on 'p'.
>
> With this new patch the limit ends up being 2048 for the files I have.
> For regular files in a working copy of SVN trunk, the loop typically
> needs fewer than 10 iterations.
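>
> To make the scaling concrete, here is a small throwaway program (not
> part of the patch; the input sizes are made up for illustration) which
> prints the limit the patch would pick for a few Myers graph sizes:
>
>   #include <stdio.h>
>
>   /* Same integer square root approximation as in the patch, with
>    * apr_off_t swapped for long long to keep this self-contained.
>    * For val > 0 it returns a power of two in (sqrt(val), 2*sqrt(val)]. */
>   static int
>   shift_sqrt(long long val)
>   {
>     int i;
>
>     for (i = 1; val > 0; val >>= 2)
>       i <<= 1;
>
>     return i;
>   }
>
>   int
>   main(void)
>   {
>     long long sizes[] = { 1000LL, 100000LL, 2000000LL, 50000000LL };
>     int i;
>
>     for (i = 0; i < 4; i++)
>       printf("graph size %lld -> limit %d\n", sizes[i], shift_sqrt(sizes[i]));
>
>     return 0;
>   }
>
> This prints limits of 32, 512, 2048, and 8192, respectively. With the
> minimum of 64 applied on top, small files always get at least 64
> iterations, and a graph size around two million lines yields the 2048
> I see for my test files.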
>
> For reference, run-times of 'svn diff' on my test files with various
> limits, including no limit:
>
> limit = 1024
>     0m06.82s real     0m05.40s user     0m01.41s system
> limit = 2048
>     0m08.15s real     0m06.55s user     0m01.56s system
> limit = 4096
>     0m11.10s real     0m09.59s user     0m01.52s system
> limit = 8192
>     0m18.40s real     0m16.57s user     0m01.57s system
> limit = 65536
>     7m55.40s real     7m53.81s user     0m01.58s system
> Full run without limit:
>   163m31.49s real   163m16.77s user     0m02.37s system
>
> Git/libxdiff somehow manages to produce a smaller collection of larger
> chunks on the problematic files, rather than the many small chunks with
> one giant chunk at the end which we end up producing when we bail out
> of the loop.
> I have made some attempts to produce a more readable diff result when
> the limit is reached, but I haven't been successful. Ideas welcome.
> Still, I don't think it makes sense to invest a lot of time into
> optimizing output in this case, under the assumption that nobody is
> going to be carefully reading affected diffs anyway.
>
> [[[
> * subversion/libsvn_diff/lcs.c
>   (shift_sqrt): New helper function. Borrowed from Game of Trees' diff.
>   (svn_diff__lcs): Limit the effort spent on finding an optimal diff to
>    avoid spinning on the CPU for a long time (30 minutes or more) for
>    some input files. Large XML documents were known to trigger this.
> ]]]
>
> Index: subversion/libsvn_diff/lcs.c
> ===================================================================
> --- subversion/libsvn_diff/lcs.c        (revision 1890390)
> +++ subversion/libsvn_diff/lcs.c        (working copy)
> @@ -227,7 +227,18 @@ prepend_lcs(svn_diff__lcs_t *lcs, apr_off_t lines,
>    return new_lcs;
>  }
>
> +/* Integer square root approximation */
> +static int
> +shift_sqrt(apr_off_t val)
> +{
> +  int i;
> +
> +  for (i = 1; val > 0; val >>= 2)
> +    i <<= 1;
> +
> +  return i;
> +}
> +
>  svn_diff__lcs_t *
>  svn_diff__lcs(svn_diff__position_t *position_list1, /* pointer to tail (ring) */
>                svn_diff__position_t *position_list2, /* pointer to tail (ring) */
> @@ -247,6 +258,9 @@ svn_diff__lcs(svn_diff__position_t *position_list1
>    apr_off_t k;
>    apr_off_t p = 0;
>    svn_diff__lcs_t *lcs, *lcs_freelist = NULL;
> +  int max_cost;
> +  /* The minimum cost 'p' (# moves away from k=0) we are willing to invest. */
> +  const int min_cost = 64;
>
>    svn_diff__position_t sentinel_position[2];
>
> @@ -337,6 +351,24 @@ svn_diff__lcs(svn_diff__position_t *position_list1
>    fp[d - 1].position[0] = sentinel_position[0].next;
>    fp[d - 1].position[1] = &sentinel_position[1];
>
> +  /* Limit the effort spent looking for a diff snake. If files have
> +   * very few lines in common, the effort spent to find nice diff
> +   * snakes is just not worth it; the diff result will still be
> +   * essentially minus everything on the left, plus everything on
> +   * the right, with a few useless matches here and there.
> +   *
> +   * Very large files which have a lot of common lines interleaved with
> +   * changed lines (such as humongous auto-generated XML documents) could
> +   * easily keep us busy here for a very long time, on the order of hours.
> +   * In such cases we abort the algorithm before it is finished and use
> +   * the most recently computed intermediate result. The diff will not be
> +   * pretty, but a quick suboptimal result is better than a perfect result
> +   * which takes hours to compute.
> +   */
> +  max_cost = shift_sqrt((length[0] + length[1]) / 2);
> +  if (max_cost < min_cost)
> +    max_cost = min_cost;
> +
>    p = 0;
>    do
>      {
> @@ -353,7 +385,7 @@ svn_diff__lcs(svn_diff__position_t *position_list1
>
>        p++;
>      }
> -  while (fp[0].position[1] != &sentinel_position[1]);
> +  while (fp[0].position[1] != &sentinel_position[1] && p < max_cost);
>
>    if (suffix_lines)
>      lcs->next = prepend_lcs(fp[0].lcs, suffix_lines,
>

Hmm, I tried this patch with my "annotate large XML file with deep
history" test, but the result isn't the same as with 1.14. I'd have to
investigate a bit more to find out where it took another turn, and
what that diff looks like (and whether or not the "other diff" is a
problem for me).

-- 
Johan
