On Thu, Jun 3, 2021 at 12:57 PM Stefan Sperling <s...@elego.de> wrote:
>
> On Wed, Jun 02, 2021 at 02:20:12PM +0200, Stefan Sperling wrote:
> > The patch below implements an effort limit for Subversion's LCS diff.
>
> Here is an updated version of this patch. I took a closer look at
> other diff implementations again and realized that they don't use a
> fixed limit, but one which is scaled to the size of the Myers graph,
> with a square root applied to keep the effort within certain bounds
> as the files being compared become larger.
>
> Compared to the previous patch, I have added comments and renamed
> 'max_effort' to 'max_cost' to better align with the documentation of
> the algorithm in existing comments.
>
> We already maintain a variable 'p' which represents the cost of
> the algorithm. So instead of maintaining a separate counter, I am
> now setting an upper bound on 'p'.
>
> With this new patch the limit ends up being 2048 for the files I have.
> For regular files in a working copy of SVN trunk, the loop typically
> needs fewer than 10 iterations.
>
> For reference, run-times of 'svn diff' on my test files with various
> limits, including no limit:
>
> limit = 1024
>     0m06.82s real     0m05.40s user     0m01.41s system
> limit = 2048
>     0m08.15s real     0m06.55s user     0m01.56s system
> limit = 4096
>     0m11.10s real     0m09.59s user     0m01.52s system
> limit = 8192
>     0m18.40s real     0m16.57s user     0m01.57s system
> limit = 65536
>     7m55.40s real     7m53.81s user     0m01.58s system
> Full run without limit:
>   163m31.49s real   163m16.77s user     0m02.37s system
>
> Git/libxdiff somehow manages to produce a smaller collection of larger
> chunks on the problematic files, rather than the many small chunks with
> one giant chunk at the end which we end up producing when we bail out
> of the loop.
> I have made some attempts to produce a more readable diff result in
> case the limit is reached, but I haven't been successful. Ideas welcome.
> Still, I don't think it makes sense to invest a lot of time into
> optimizing the output in this case, under the assumption that nobody
> is going to be carefully reading the affected diffs anyway.
>
> [[[
> * subversion/libsvn_diff/lcs.c
>   (shift_sqrt): New helper function. Borrowed from Game of Trees' diff.
>   (svn_diff__lcs): Limit the effort spent on finding an optimal diff to
>    avoid spinning on the CPU for a long time (30 minutes or more) for
>    some input files. Large XML documents were known to trigger this.
> ]]]
>
> Index: subversion/libsvn_diff/lcs.c
> ===================================================================
> --- subversion/libsvn_diff/lcs.c	(revision 1890390)
> +++ subversion/libsvn_diff/lcs.c	(working copy)
> @@ -227,7 +227,18 @@ prepend_lcs(svn_diff__lcs_t *lcs, apr_off_t lines,
>      return new_lcs;
>  }
>  
> +/* Integer square root approximation */
> +static int
> +shift_sqrt(apr_off_t val)
> +{
> +  int i;
> +
> +  for (i = 1; val > 0; val >>= 2)
> +    i <<= 1;
> +
> +  return i;
> +}
> +
>  svn_diff__lcs_t *
>  svn_diff__lcs(svn_diff__position_t *position_list1, /* pointer to tail (ring) */
>                svn_diff__position_t *position_list2, /* pointer to tail (ring) */
> @@ -247,6 +258,9 @@ svn_diff__lcs(svn_diff__position_t *position_list1
>    apr_off_t k;
>    apr_off_t p = 0;
>    svn_diff__lcs_t *lcs, *lcs_freelist = NULL;
> +  int max_cost;
> +  /* The minimum cost 'p' (# moves away from k=0) we are willing to invest. */
> +  const int min_cost = 64;
>  
>    svn_diff__position_t sentinel_position[2];
>  
> @@ -337,6 +351,24 @@ svn_diff__lcs(svn_diff__position_t *position_list1
>    fp[d - 1].position[0] = sentinel_position[0].next;
>    fp[d - 1].position[1] = &sentinel_position[1];
>  
> +  /* Limit the effort spent looking for a diff snake. If files have
> +   * very few lines in common, the effort spent to find nice diff
> +   * snakes is just not worth it; the diff result will still be
> +   * essentially minus everything on the left, plus everything on
> +   * the right, with a few useless matches here and there.
> +   *
> +   * Very large files which have a lot of common lines interleaved with
> +   * changed lines (such as humongous auto-generated XML documents) could
> +   * easily keep us busy here for a very long time, on the order of hours.
> +   * In such cases we abort the algorithm before it is finished and use
> +   * the most recently computed intermediate result. The diff will not be
> +   * pretty, but a quick suboptimal result is better than a perfect result
> +   * which takes hours to compute.
> +   */
> +  max_cost = shift_sqrt((length[0] + length[1]) / 2);
> +  if (max_cost < min_cost)
> +    max_cost = min_cost;
> +
>    p = 0;
>    do
>      {
> @@ -353,7 +385,7 @@ svn_diff__lcs(svn_diff__position_t *position_list1
>  
>        p++;
>      }
> -  while (fp[0].position[1] != &sentinel_position[1]);
> +  while (fp[0].position[1] != &sentinel_position[1] && p < max_cost);
>  
>    if (suffix_lines)
>      lcs->next = prepend_lcs(fp[0].lcs, suffix_lines,
Hmm, I tried this patch with my "annotate large XML file with deep
history" test, but the result isn't the same as with 1.14. I'd have to
investigate a bit more to find out where it took another turn, and what
that diff looks like (and whether or not the "other diff" is a problem
for me).

-- Johan