Re: [PATCH] limit diff effort to fix performance issue

Johan Corveleyn Fri, 01 Apr 2022 07:18:04 -0700

On Fri, Apr 1, 2022 at 1:12 PM Stefan Sperling <s...@apache.org> wrote:
>
> On Fri, Apr 01, 2022 at 12:44:24PM +0200, Johan Corveleyn wrote:
> > On Tue, Jun 8, 2021 at 5:58 PM Johan Corveleyn <jcor...@gmail.com> wrote:
> > > On Tue, Jun 8, 2021 at 3:24 PM Stefan Sperling <s...@stsp.name> wrote:
> > > > On Tue, Jun 08, 2021 at 02:57:34PM +0200, Johan Corveleyn wrote:
> > > > > Okay, I focused on the first revision causing the annotate to differ,
> > > > > and with some debug logging:
> > > > > - p went up to 139
> > > > > - length[0]=1942, length[1]=1817
> > > > >
> > > > > Now, 1942 lines on the left and 1817 on the right doesn't seem all
> > > > > that many (that's what remains after prefix-suffix stripping). I see
> > > > > no reason 'svn diff' shouldn't be able to calculate a minimal diff for
> > > > > those sizes in a reasonable amount of time. So if p can go up to such
> > > > > a "cost" for such "reasonable" lenghts, I imagine we should put the
> > > > > actual limit much higher. I suppose we can just as well set "min_cost"
> > > > > to 1024 or even higher. 64 is way too low.
> > > >
> > > > Thanks!
> > > > I have set the value back to at least 1024 with this new version of 
> > > > patch.
> > >
> > > Hmmm, apparently I'm still running into the limit. Even with 8192 I'm
> > > hitting it at least once. The actual diff there is not really pretty,
> > > but it's an actual manual change made by some developer years ago, and
> > > the "minimal diff" is more or less readable / understandable. The log
> > > message is something like "Group sections related to multimedia
> > > module", and it shuffles around a lot of xml sections to group them
> > > together.
> > >
> > > In this case, length[0]==length[1]==46780. Pretty big for the LCS, but
> > > not humongous. The value of 'p' goes up to 8279.
> > >
> > > With the limit set to 16384, I'm not hitting it, and the annotate
> > > comes out as with the original.
> > >
> > > Now, I'm a bit puzzled why limit==8192 already takes 18s in your
> > > tests. In my situation, with the above diff, I get:
> > > limit==8192: 3.3s
> > > limit==16384: 3.5s
> > >
> > > (that's including downloading both versions from the server over https)
> > >
> > > So I'm wondering why it takes so long in your case. One thing to keep
> > > in mind here is that, since this is cpu intensive, optimizations might
> > > matter. I remember back in 2011 that I did most of my measurements
> > > with a "Release" build for example. But the above tests I did were
> > > with a debug build, so ... dunno.
> >
> > [ snip part about another optimization by Morten Kloster in r1140247,
> > which seemed not to work for Stefan. ]
> >
> > Coming back to this thread from last year, just wondering if you want
> > to pick this up again, Stefan (or anyone else), since a 1.15 release
> > might be happening in the near future.
> >
> > I think we already concluded that changing this behaviour in a patch
> > release would be hard to justify. But for 1.15 it might be acceptable
> > in some form.
> >
> > Just wanted to bring this back up, I have no vested interest here. I'm
> > still skeptical about changing the output of diff & blame in cases
> > where we hit "this takes too much work", but I won't stand in the way
> > if that what others consense on :-). If we want to pick a limit of the
> > maximum effort, then I'd suggest 16384 (or above), but that's a purely
> > personal / local suggestion, because the use case I had above would
> > then not be impacted.
> >
> > BTW: it's a pity that we can's say: "limit the diff effort in case of
> > 'update' or 'merge', but not in case of 'diff' or 'blame'". Because in
> > the latter case, the user might care more about the actual output, and
> > in the former they might not (until they hit a text conflict of
> > course, then they will care again ...)
> >
> > It might be interesting if someone could take a more "profiling" look
> > at why Stefan's example takes much more time, even with limit==8192
> > (18 seconds), compared to my example (3.3 seconds). Is that purely
> > because of the total size of the files / number of lines? I find that
> > difference quite strange. There might be something that can be
> > optimized without changing behaviour. But I'm afraid I don't have time
> > to dig deeply into this.
>
> My assumption by now is that the algorithm needs to be changed.
> This would essentially amount to a complete rewrite of the diff code.
>
> There must be a fundamental difference to how 'git diff' works,
> for example, which succeeds on the large files I have tested
> in a relatively short amount of time.
>
> Even 'got diff' (from my side-project gameoftrees.org), which does not
> focus on performance as a first-class feature, completes a diff of the
> problematic test case discussed in this thread in about 3 minutes.
> It uses a diff implemention written by Neels, with the Patience algorithm.
>
> I have shelved this topic for now. I have no desire to rewrite Subversion's
> diff code. And the original use case is an edge case where people are
> versioning large auto-generated XML files for reasons I don't understand.
> They worked around it by configuring 'git diff' as a diff tool for
> Subversion to use, and have not complained since.
> I suspect they could also change their process to work around this issue,
> but such changes are not always easy to get pushed through in a large
> organization. But, in the end, these are home-made problems which could
> be solved by using Subversion as it is intended to be used.

I'm shelving this too, but just one more circle-back here ...

- Perhaps the fundamental diff algorithm in SVN is fine, but it has a
performance bug / low-level inefficiency? I think that should be
explored first, because it might fix most of this problem without
requiring a discussion about the precise output, and without huge
efforts of rewrites.

- I think a lot of other diff tools out there have an "effort-limit",
like the one you proposed (like --speed-large-files, aka
"use-heuristic" for GNU diff [1], [2], which might be the default
behaviour, dunno). I'm guessing 'git diff' has one too. So agreed,
perhaps we need one too. Just not entirely sure how high we should put
the limit, and what we have to communicate around "diff output might
have changed / might not be a minimal diff" (and whether we still need
an option to fallback to "go over the limit").

- Patience diff might also be an option (though it might have its own
set of problems and peculiar edge cases, I don't know). It might even
be possible to reuse part of the current code for this (Not sure this
requires a complete rewrite. Does it still require a LCS algorithm at
its core, after picking apart the problem space? I haven't studied it
enough to know how it works internally). In this case too there is the
back-compat vs behavioral-change discussion that will take time to
settle (Do we change the default algorithm / output? Do we make it
optional, or do we already have too manu knobs? ...).

[1]
https://github.com/Distrotech/diffutils/blob/9e70e1ce7aaeff0f9c428d1abc9821589ea054f1/NEWS#L238
[2]
https://github.com/Distrotech/diffutils/blob/9e70e1ce7aaeff0f9c428d1abc9821589ea054f1/lib/diffseq.h#L255

--
Johan

Re: [PATCH] limit diff effort to fix performance issue

Reply via email to