On Tue, Jun 08, 2021 at 02:57:34PM +0200, Johan Corveleyn wrote:
> Okay, I focused on the first revision causing the annotate to differ,
> and with some debug logging:
> - p went up to 139
> - length[0]=1942, length[1]=1817
>
> Now, 1942 lines on the left and 1817 on the right doesn't seem all
> that many (that's what remains after prefix-suffix stripping). I see
> no reason 'svn diff' shouldn't be able to calculate a minimal diff for
> those sizes in a reasonable amount of time. So if p can go up to such
> a "cost" for such "reasonable" lenghts, I imagine we should put the
> actual limit much higher. I suppose we can just as well set "min_cost"
> to 1024 or even higher. 64 is way too low.
Thanks!
I have set the value back to at least 1024 with this new version of patch.
> BTW, the actual revision (and proper diff) here was not really
> unreadable. It had 86 hunks. The log message was something like
> "refactor contact parameters", and the change was removing 4 lines and
> replacing them with 1 line, throughout an entire section. If the
> lcs-limit gets hit, the entire "remaining part" (which normally would
> have been some more small hunks) becomes one giant remove+add of 1000
> lines or so (i.e. quite useless for a diff that a human might want to
> review).
Indeed, we want to avoid such changes if possible, and only punish
the cases which have problematic run-time behaviour.
> Finally, a couple of small remarks related to the patch itself:
> [[[
> + int max_cost;
> + /* The minimum cost 'p' (# moves away from k=0) we are willing to invest.
> */
> + const int min_cost = 64;
> ]]]
>
> I get confused by the naming of the variables here (unless we use
> min_max_cost or something). Perhaps something like 'cost_limit' and
> 'min_cost_limit' or something like that?
Sure, done.
> [[[
> + max_cost = shift_sqrt(length[0] + length[1] / 2);
> ]]]
>
> Is that correct WRT operator precedence? It looks like you want to
> take the sqrt of the average length, which would be (length[0] +
> length[1]) / 2. Or maybe I misunderstand. I'm not really sure why this
> is used as an estimate for the "acceptable cost" / max_cost. But then
> again, it's been a while since I read the details of Myers algorithm.
Good catch. In my test case it doesn't make a difference. The limit ends
up begin 2048 in either case. But you're right, the code was incorrect.
[[[
* subversion/libsvn_diff/lcs.c
(shift_sqrt): New helper function. Borrowed from Game of Trees' diff.
(svn_diff__lcs): Limit the effort spent on finding an optimal diff to
avoid spinning on the CPU for a long time (30 minutes or more) for
some input files. Large XML documents were known to trigger this.
]]
Index: subversion/libsvn_diff/lcs.c
===================================================================
--- subversion/libsvn_diff/lcs.c (revision 1890390)
+++ subversion/libsvn_diff/lcs.c (working copy)
@@ -227,7 +227,18 @@ prepend_lcs(svn_diff__lcs_t *lcs, apr_off_t lines,
return new_lcs;
}
+/* Integer square root approximation */
+static int
+shift_sqrt(apr_off_t val)
+{
+ int i;
+ for (i = 1; val > 0; val >>= 2)
+ i <<= 1;
+
+ return i;
+}
+
svn_diff__lcs_t *
svn_diff__lcs(svn_diff__position_t *position_list1, /* pointer to tail (ring)
*/
svn_diff__position_t *position_list2, /* pointer to tail (ring)
*/
@@ -245,7 +256,10 @@ svn_diff__lcs(svn_diff__position_t *position_list1
svn_diff__snake_t *fp;
apr_off_t d;
apr_off_t k;
- apr_off_t p = 0;
+ apr_off_t p = 0; /* # moves away from k=0 */
+ /* The minimum cost 'p' (# moves away from k=0) we are willing to invest. */
+ const int min_cost_limit = 1024;
+ int max_cost_limit;
svn_diff__lcs_t *lcs, *lcs_freelist = NULL;
svn_diff__position_t sentinel_position[2];
@@ -337,6 +351,24 @@ svn_diff__lcs(svn_diff__position_t *position_list1
fp[d - 1].position[0] = sentinel_position[0].next;
fp[d - 1].position[1] = &sentinel_position[1];
+ /* Limit the effort spent looking for a diff snake. If files have
+ * very few lines in common, the effort spent to find nice diff
+ * snakes is just not worth it, the diff result will still be
+ * essentially minus everything on the left, plus everything on
+ * the right, with a few useless matches here and there.
+ *
+ * Very large files which have a lot of common lines interleaved with
+ * changed lines (such as humongous auto-generated XML documents) could
+ * easily keep us busy here for a very long time, in the order of hours.
+ * In such cases we abort the algorithm before it is finished and use
+ * the most recently computed intermediate result. The diff will not be
+ * pretty but a quick suboptimal result is better than a perfect result
+ * which takes hours to compute.
+ */
+ max_cost_limit = shift_sqrt((length[0] + length[1]) / 2);
+ if (max_cost_limit < min_cost_limit)
+ max_cost_limit = min_cost_limit;
+
p = 0;
do
{
@@ -353,7 +385,7 @@ svn_diff__lcs(svn_diff__position_t *position_list1
p++;
}
- while (fp[0].position[1] != &sentinel_position[1]);
+ while (fp[0].position[1] != &sentinel_position[1] && p < max_cost_limit);
if (suffix_lines)
lcs->next = prepend_lcs(fp[0].lcs, suffix_lines,