Re: [PATCH] limit diff effort to fix performance issue
On Wed, Jun 2, 2021 at 4:05 PM Daniel Shahaf wrote: > > Stefan Sperling wrote on Wed, 02 Jun 2021 12:20 +00:00: > > The test suite is passing, which implies that trivial diffs aren't affected > > by this change. I expect that most, if not all, diffs which people actually > > want to read will remain unaffected. But I cannot tell with 100% certainty. > > Not with 100% certainty, but you could run «svn diff -c $i ^/» on the > last few thousand revisions of /repos/asf, diff the resulting diffs, and > review the non-empty interdiffs. Another thing I could try is running my old "blame huge XML file with lots of revisions", which I used to do when I worked on diff optimisations back in 2011 (prefix/suffix-scanning and some other things -- also some other people back then did some other optimisation contributions which I double checked this way). I'm guessing my huge XML file is not huge enough (or our diffs not pathological enough) for this limit to have much impact, because after the 2011 optimisations I can't remember running into longer times than say 5 seconds or so per diff. Or otherwise, I can at least give some feedback about a good "effort limit" with a real-world case where we still consider the actual diff output somewhat important (for assigning blame :-)). I'll try to give it a go in a couple of days (but I'll have to dust off my local build first). BTW: for whoever is interested, some discussion about this on IRC just now: https://colabti.org/irclogger/irclogger_log/svn-dev?date=2021-06-02 Feel free to join on #svn-dev on irc.libera.chat ... -- Johan
Re: [PATCH] limit diff effort to fix performance issue
On Wed, Jun 2, 2021 at 8:20 AM Stefan Sperling wrote: > I've heard of several instances where users complain that 'svn update' > takes an extraordinary amount of time to run (in terms of hours). Ouch! P vs NP situation? In general I think it is reasonable to limit effort in order to generate acceptable results (in this case diffs that work) in a reasonable time (seconds). Thoughts below: (snip) As a reference, Git manages to produce a diff in about 13 seconds: > > $ time git diff 01.arxml 02.arxml > /dev/null > 0m13.39s real 0m13.04s user 0m00.36s system (snip) The patch below implements an effort limit for Subversion's LCS diff. > Diffing the offending XML files in a Subversion working copy becomes > fast enough to be usable: > > $ time svn diff > /dev/null > 0m06.82s real 0m05.40s user 0m01.41s system > > The resulting diff doesn't look as pretty as Git's diff output, but I don't > think this will matter in practice. One observation is that Git appears to try for twice as long in this case, so perhaps doubling our effort limit from 1024 to 2048 will give (some) parity between the two VCSs' prettiness and performance here? (Of course, this case is just one data point...) Another thought is to limit based on time rather than iterations, or based on (time x file size). But for files that hit the limit, that would cause non-deterministic output (inconsistent from one run to another), which is kind of A Bad Thing. Perhaps (iterations x file size) would be better, as that would be deterministic. My only concern is that some corner-case diffs that are currently readable > could become less readable. If such cases are found, we could easily > address > the problem by bumping the limit. Note: Any such diffs would still be > *valid*. (snip) Before anyone asks, without evidence that it is really needed I don't think > adding a config option to set this limit is warranted. Other > implementations > are also using hard-coded limits. We could add an environment variable to > allow for experimentation, perhaps. In any case, I would prefer to make > such tweaks in follow-up patches and keep this initial fix simple so it > can be backported easily. Agreed that it doesn't justify a new config. I think a config doesn't make sense anyway: If the user wants to control diff quality, a command line argument or API parameter would allow us to keep performance snappy by default but provide for a deep(er) think when the user specifically asks for it. (How would it be specified? A time limit in seconds?) But, agreed that anything like that, if done, should be a totally separate patch. Thanks for the detailed write-up and patch. I'll run some experiments... Cheers, Nathan
Re: [PATCH] limit diff effort to fix performance issue
Stefan Sperling wrote on Wed, 02 Jun 2021 12:20 +00:00: > The test suite is passing, which implies that trivial diffs aren't affected > by this change. I expect that most, if not all, diffs which people actually > want to read will remain unaffected. But I cannot tell with 100% certainty. Not with 100% certainty, but you could run «svn diff -c $i ^/» on the last few thousand revisions of /repos/asf, diff the resulting diffs, and review the non-empty interdiffs.
Re: [PATCH] limit diff effort to fix performance issue
On Wed, Jun 2, 2021 at 8:20 AM Stefan Sperling wrote: > ther diff implementations limit the effort spent by aborting after a > certain number of search iterations, and then simply use the valid > traversal which was most recently discovered. The libxdiff code used by Git > does this, as does the diff implementation which Neels Hofmyer wrote for > Game of Trees (see > https://git.gameoftrees.org/gitweb/?p=diff.git;a=summary) [Emerging without years of context, he says...] Even GNU diff offers options along these lines, right? By default, it shoots for a valid but relatively snappy response at the cost of conciseness. But can you can add --minimal if you really want it to work harder at producing tighter output. I say that not to propose that you make this behavior optional, but merely to point out another (practically ancient) case of a diff algorithm's default behavior favoring performance over conciseness. Oh, and "hey!" -- Mike
[PATCH] limit diff effort to fix performance issue
I've heard of several instances where users complain that 'svn update' takes an extraordinary amount of time to run (in terms of hours). At elego we have received files which reproduce such behaviour. These files are XML files that are almost 100MB in size. While versioning such data is not the primary use case for Subversion, this should not get in the way of people getting work done. So I would like to get this fixed. The root of the performance problem is located in Subversion's diff code. The files contain in the order of 1.200.000 lines of XML text. I cannot share the full files, unfortunately. Here's one diff chunk from a diff between two such files, just to give you an idea about the type of content which the diff algorithm is trying to split into chunks. Simply imagine that such content goes on for more than a million lines and you have a mental image of what the file generally looks like: CounterDbCounter ARRAY -163 +193^M RamEccDoubleBitError_CounterDbCounter Generating a diff between two versions of such files with 'svn diff' takes more than 40 minutes on my machine. I stopped it at that point. This is obviously taking too long and we don't have to care about exactly how long it will take to eventually run to completion. As a reference, Git manages to produce a diff in about 13 seconds: $ time git diff 01.arxml 02.arxml > /dev/null 0m13.39s real 0m13.04s user 0m00.36s system Without my patch, the svn update/diff processes are spinning in the following loop: [[[ do { /* For k < 0, insertions are free */ for (k = (d < 0 ? d : 0) - p; k < 0; k++) { svn_diff__snake(fp + k, token_counts, _freelist, pool); } /* for k > 0, deletions are free */ for (k = (d > 0 ? d : 0) + p; k >= 0; k--) { svn_diff__snake(fp + k, token_counts, _freelist, pool); } p++; } while (fp[0].position[1] != _position[1]); ]]] What this loop is trying to do is to find an optimal "diff snake", a partial traversal of the Myers graph which builds up step-by-step towards some optimal traversal of the entire graph. In the Myers diff algorithm, the problem of finding an optimal graph traversal doesn't have a single answer. There can be several valid traversals, resulting in differences in the generated diff. An extreme example is a traversal which produces a patch that first removes all lines from file A, and then adds all lines from file B. This isn't a useful diff chunk for humans to read, but it is a valid diff and will work when applied with a patch program. Finding this particular traversal is also trivial and very fast :-) The search for an optimal traversal which will result in a human-readable diff can take time, and this is exactly where Subversion is spending too much effort when diffing large XML files. Other diff implementations limit the effort spent by aborting after a certain number of search iterations, and then simply use the valid traversal which was most recently discovered. The libxdiff code used by Git does this, as does the diff implementation which Neels Hofmyer wrote for Game of Trees (see https://git.gameoftrees.org/gitweb/?p=diff.git;a=summary) The patch below implements an effort limit for Subversion's LCS diff. Diffing the offending XML files in a Subversion working copy becomes fast enough to be usable: $ time svn diff > /dev/null 0m06.82s real 0m05.40s user 0m01.41s system The resulting diff doesn't look as pretty as Git's diff output, but I don't think this will matter in practice. My only concern is that some corner-case diffs that are currently readable could become less readable. If such cases are found, we could easily address the problem by bumping the limit. Note: Any such diffs would still be *valid*. The test suite is passing, which implies that trivial diffs aren't affected by this change. I expect that most, if not all, diffs which people actually want to read will remain unaffected. But I cannot tell with 100% certainty. Before anyone asks, without evidence that it is really needed I don't think adding a config option to set this limit is warranted. Other implementations are also using hard-coded limits. We could add an environment variable to allow for experimentation, perhaps. In any case, I would prefer to make such tweaks in follow-up patches and keep this initial fix simple so it can be backported easily. [[[ * subversion/libsvn_diff/lcs.c (svn_diff__lcs): Limit the effort spent on finding an optimal diff to avoid