Re: [PATCH] limit diff effort to fix performance issue

2021-06-02 Thread Johan Corveleyn
On Wed, Jun 2, 2021 at 4:05 PM Daniel Shahaf  wrote:
>
> Stefan Sperling wrote on Wed, 02 Jun 2021 12:20 +00:00:
> > The test suite is passing, which implies that trivial diffs aren't affected
> > by this change. I expect that most, if not all, diffs which people actually
> > want to read will remain unaffected. But I cannot tell with 100% certainty.
>
> Not with 100% certainty, but you could run «svn diff -c $i ^/» on the
> last few thousand revisions of /repos/asf, diff the resulting diffs, and
> review the non-empty interdiffs.

Another thing I could try is running my old "blame huge XML file with
lots of revisions", which I used to do when I worked on diff
optimisations back in 2011 (prefix/suffix-scanning and some other
things -- also some other people back then did some other optimisation
contributions which I double checked this way). I'm guessing my huge
XML file is not huge enough (or our diffs not pathological enough) for
this limit to have much impact, because after the 2011 optimisations I
can't remember running into longer times than say 5 seconds or so per
diff. Or otherwise, I can at least give some feedback about a good
"effort limit" with a real-world case where we still consider the
actual diff output somewhat important (for assigning blame :-)).

I'll try to give it a go in a couple of days (but I'll have to dust
off my local build first).

BTW: for whoever is interested, some discussion about this on IRC just
now: https://colabti.org/irclogger/irclogger_log/svn-dev?date=2021-06-02
Feel free to join on #svn-dev on irc.libera.chat ...

-- 
Johan


Re: [PATCH] limit diff effort to fix performance issue

2021-06-02 Thread Nathan Hartman
On Wed, Jun 2, 2021 at 8:20 AM Stefan Sperling  wrote:

> I've heard of several instances where users complain that 'svn update'
> takes an extraordinary amount of time to run (in terms of hours).


Ouch!

P vs NP situation? In general I think it is reasonable to limit effort
in order to generate acceptable results (in this case diffs that work)
in a reasonable time (seconds).

Thoughts below:

(snip)

> As a reference, Git manages to produce a diff in about 13 seconds:
>
> $ time git diff 01.arxml 02.arxml > /dev/null
> 0m13.39s real 0m13.04s user 0m00.36s system


(snip)

> The patch below implements an effort limit for Subversion's LCS diff.
> Diffing the offending XML files in a Subversion working copy becomes
> fast enough to be usable:
>
> $ time svn diff > /dev/null
> 0m06.82s real 0m05.40s user 0m01.41s system
>
> The resulting diff doesn't look as pretty as Git's diff output, but I don't
> think this will matter in practice.


One observation is that Git appears to try for twice as long in this
case, so perhaps doubling our effort limit from 1024 to 2048 will give
(some) parity between the two VCSs' prettiness and performance here?
(Of course, this case is just one data point...)

Another thought is to limit based on time rather than iterations, or
based on (time x file size). But for files that hit the limit, that
would cause non-deterministic output (inconsistent from one run to
another), which is kind of A Bad Thing. Perhaps (iterations x file
size) would be better, as that would be deterministic.

> My only concern is that some corner-case diffs that are currently readable
> could become less readable. If such cases are found, we could easily address
> the problem by bumping the limit. Note: Any such diffs would still be
> *valid*.


(snip)

> Before anyone asks, without evidence that it is really needed I don't think
> adding a config option to set this limit is warranted. Other implementations
> are also using hard-coded limits. We could add an environment variable to
> allow for experimentation, perhaps. In any case, I would prefer to make
> such tweaks in follow-up patches and keep this initial fix simple so it
> can be backported easily.


Agreed that it doesn't justify a new config.

I think a config doesn't make sense anyway: If the user wants to
control diff quality, a command line argument or API parameter would
allow us to keep performance snappy by default but provide for a
deep(er) think when the user specifically asks for it. (How would it
be specified? A time limit in seconds?) But, agreed that anything
like that, if done, should be a totally separate patch.

Thanks for the detailed write-up and patch.

I'll run some experiments...

Cheers,
Nathan


Re: [PATCH] limit diff effort to fix performance issue

2021-06-02 Thread Daniel Shahaf
Stefan Sperling wrote on Wed, 02 Jun 2021 12:20 +00:00:
> The test suite is passing, which implies that trivial diffs aren't affected
> by this change. I expect that most, if not all, diffs which people actually
> want to read will remain unaffected. But I cannot tell with 100% certainty.

Not with 100% certainty, but you could run «svn diff -c $i ^/» on the
last few thousand revisions of /repos/asf, diff the resulting diffs, and
review the non-empty interdiffs.


Re: [PATCH] limit diff effort to fix performance issue

2021-06-02 Thread C. Michael Pilato
On Wed, Jun 2, 2021 at 8:20 AM Stefan Sperling  wrote:

> Other diff implementations limit the effort spent by aborting after a
> certain number of search iterations, and then simply use the valid
> traversal which was most recently discovered. The libxdiff code used by Git
> does this, as does the diff implementation which Neels Hofmeyr wrote for
> Game of Trees (see
> https://git.gameoftrees.org/gitweb/?p=diff.git;a=summary)


[Emerging without years of context, he says...]

Even GNU diff offers options along these lines, right?  By default, it
shoots for a valid but relatively snappy response at the cost of
conciseness.  But you can add --minimal if you really want it to work
harder at producing tighter output.

I say that not to propose that you make this behavior optional, but merely
to point out another (practically ancient) case of a diff
algorithm's default behavior favoring performance over conciseness.

Oh, and "hey!"

-- Mike


[PATCH] limit diff effort to fix performance issue

2021-06-02 Thread Stefan Sperling
I've heard of several instances where users complain that 'svn update'
takes an extraordinary amount of time to run (in terms of hours).

At elego we have received files which reproduce such behaviour.
These files are XML files that are almost 100MB in size. While versioning
such data is not the primary use case for Subversion, this should not get
in the way of people getting work done. So I would like to get this fixed.

The root of the performance problem is located in Subversion's diff code.

The files contain on the order of 1,200,000 lines of XML text. I cannot
share the full files, unfortunately. Here's one diff chunk from a diff
between two such files, just to give you an idea about the type of content
which the diff algorithm is trying to split into chunks. Simply imagine that
such content goes on for more than a million lines and you have a mental
image of what the file generally looks like:

  
  
 CounterDbCounter 
 ARRAY
-163  
+193^M
 
 
 
RamEccDoubleBitError_CounterDbCounter  
 

Generating a diff between two versions of such files with 'svn diff' takes
more than 40 minutes on my machine; I stopped it at that point. That is
obviously far too long, and exactly how long it would eventually take to
run to completion hardly matters.

As a reference, Git manages to produce a diff in about 13 seconds:

$ time git diff 01.arxml 02.arxml > /dev/null
0m13.39s real 0m13.04s user 0m00.36s system

Without my patch, the svn update/diff processes are spinning in the
following loop:

[[[
  do
    {
      /* For k < 0, insertions are free */
      for (k = (d < 0 ? d : 0) - p; k < 0; k++)
        {
          svn_diff__snake(fp + k, token_counts, &lcs_freelist, pool);
        }
      /* for k > 0, deletions are free */
      for (k = (d > 0 ? d : 0) + p; k >= 0; k--)
        {
          svn_diff__snake(fp + k, token_counts, &lcs_freelist, pool);
        }

      p++;
    }
  while (fp[0].position[1] != &sentinel_position[1]);
]]]

What this loop is trying to do is to find an optimal "diff snake",
a partial traversal of the Myers graph which builds up step-by-step
towards some optimal traversal of the entire graph.

In the Myers diff algorithm, the problem of finding an optimal graph
traversal doesn't have a single answer. There can be several valid
traversals, resulting in differences in the generated diff.
An extreme example is a traversal which produces a patch that first
removes all lines from file A, and then adds all lines from file B.
This isn't a useful diff chunk for humans to read, but it is a valid
diff and will work when applied with a patch program.
Finding this particular traversal is also trivial and very fast :-)

The search for an optimal traversal which will result in a human-readable
diff can take time, and this is exactly where Subversion is spending too
much effort when diffing large XML files.

Other diff implementations limit the effort spent by aborting after a
certain number of search iterations, and then simply use the valid
traversal which was most recently discovered. The libxdiff code used by Git
does this, as does the diff implementation which Neels Hofmeyr wrote for
Game of Trees (see https://git.gameoftrees.org/gitweb/?p=diff.git;a=summary)

The patch below implements an effort limit for Subversion's LCS diff.
Diffing the offending XML files in a Subversion working copy becomes
fast enough to be usable:

$ time svn diff > /dev/null
0m06.82s real 0m05.40s user 0m01.41s system

The resulting diff doesn't look as pretty as Git's diff output, but I don't
think this will matter in practice.

My only concern is that some corner-case diffs that are currently readable
could become less readable. If such cases are found, we could easily address
the problem by bumping the limit. Note: Any such diffs would still be *valid*.

The test suite is passing, which implies that trivial diffs aren't affected
by this change. I expect that most, if not all, diffs which people actually
want to read will remain unaffected. But I cannot tell with 100% certainty.

Before anyone asks, without evidence that it is really needed I don't think
adding a config option to set this limit is warranted. Other implementations
are also using hard-coded limits. We could add an environment variable to
allow for experimentation, perhaps. In any case, I would prefer to make
such tweaks in follow-up patches and keep this initial fix simple so it
can be backported easily.

[[[
* subversion/libsvn_diff/lcs.c
  (svn_diff__lcs): Limit the effort spent on finding an optimal diff to
   avoid