Re: [PATCH] limit diff effort to fix performance issue

2021-06-08 Thread Johan Corveleyn
On Tue, Jun 8, 2021 at 3:24 PM Stefan Sperling  wrote:
>
> On Tue, Jun 08, 2021 at 02:57:34PM +0200, Johan Corveleyn wrote:
> > Okay, I focused on the first revision causing the annotate to differ,
> > and with some debug logging:
> > - p went up to 139
> > - length[0]=1942, length[1]=1817
> >
> > Now, 1942 lines on the left and 1817 on the right doesn't seem all
> > that many (that's what remains after prefix-suffix stripping). I see
> > no reason 'svn diff' shouldn't be able to calculate a minimal diff for
> > those sizes in a reasonable amount of time. So if p can go up to such
> > a "cost" for such "reasonable" lenghts, I imagine we should put the
> > actual limit much higher. I suppose we can just as well set "min_cost"
> > to 1024 or even higher. 64 is way too low.
>
> Thanks!
> I have set the value back to at least 1024 with this new version of the patch.

Hmmm, apparently I'm still running into the limit. Even with 8192 I'm
hitting it at least once. The actual diff there is not really pretty,
but it's an actual manual change made by some developer years ago, and
the "minimal diff" is more or less readable / understandable. The log
message is something like "Group sections related to multimedia
module", and it shuffles around a lot of xml sections to group them
together.

In this case, length[0]==length[1]==46780. Pretty big for the LCS, but
not humongous. The value of 'p' goes up to 8279.

With the limit set to 16384, I'm not hitting it, and the annotate
comes out as with the original.

Now, I'm a bit puzzled why limit==8192 already takes 18s in your
tests. In my situation, with the above diff, I get:
limit==8192: 3.3s
limit==16384: 3.5s

(that's including downloading both versions from the server over https)

So I'm wondering why it takes so long in your case. One thing to keep
in mind here is that, since this is cpu intensive, optimizations might
matter. I remember back in 2011 that I did most of my measurements
with a "Release" build for example. But the above tests I did were
with a debug build, so ... dunno.

Also: one of the last things that Morten Kloster committed in 2011 was
another performance improvement, namely "restarting the LCS" in some
specific cases. At the time, I didn't have enough time to review this
properly, and had some reservations (but mainly lack of time), so I
asked him to commit it on a branch (diff-improvements). This feature
branch never made it to trunk. Just as another "shot", perhaps you
could try r1140247 and see if it helps?

-- 
Johan


Re: JavaHL: SVNClient.copy throws not very helpful exception

2021-06-08 Thread Karl Fogel

On 08 Jun 2021, Thomas Singer wrote:

Hi,

A user wanted to copy a working copy to a URL and it fails with
following exception:

org.apache.subversion.javahl.ClientException: Wrong or unexpected
property value
svn: Commit failed (details follow):
svn: Invalid property value

	at org.apache.subversion.javahl.SVNClient.copy(Native Method)
	at com.syntevo.smartsvn.client.SuSvnClient.copyImpl(SuSvnClient:1225)

...

Unfortunately, it is completely unclear what property value is
rejected by SVN. How to fix the working copy to make the copy
operation succeed?


This is just a guess, but

 https://rt.cpan.org/Public/Bug/Display.html?id=52024

discusses how the value of the "svn:ignore" property (on a 
directory) can cause this, if that value uses non-LF line endings. 
So try investigating the values of that property wherever it is 
used?


I agree that the error message you're getting is missing some 
useful information!


Best regards,
-Karl


Re: [PATCH] limit diff effort to fix performance issue

2021-06-08 Thread Nathan Hartman
On Tue, Jun 8, 2021 at 5:55 AM Stefan Sperling  wrote:
>
> On Tue, Jun 08, 2021 at 01:45:00AM -0400, Nathan Hartman wrote:
> > In order to do some testing, I needed some test data that reproduces
> > the issue; since stsp can't share the customer's 100MB XML file, and
> > we'd probably want other inputs or sizes anyway, I wrote a program
> > that attempts to generate such a thing. I'm attaching that program...
> >
> > To build, rename to .c extension and, e.g.,
> > $ gcc gen_diff_test_data.c -o gen_diff_test_data
> >
> > To run it, provide two parameters:
> >
> > The first is a 'seed' value like you'd provide to a pseudo random
> > number generator at init time.
> >
> > The second is a 'length' parameter that says how long (approximately)
> > you want the output data to be. (The program nearly always overshoots
> > this by a small amount.)
> >
> > Rather than using the system's pseudo random number generator, this
> > program includes its own implementation to ensure that users on any
> > system can get the same results when using the same parameters. So if
> > different people want to test with the same sets of input, you only
> > have to share 2 numbers, rather than send each other files >100MB of
> > useless junk.
> >
> > Example: Generate two files of approx 100 MB, containing lots of
> > differences and diff them:
> >
> > $ gen_diff_test_data 98 100m > one.txt
> > $ gen_diff_test_data 99 100m > two.txt
> > $ time diff one.txt two.txt > /dev/null
> >
> > With the above parameters, it takes my system's diff about 50 seconds
> > to come up with something that looks reasonable at a glance; svn's
> > diff has been crunching away for a while now...
>
> Thank you Nathan, this is incredibly useful!
>
> Would you consider committing this tool to our repository, e.g. somewhere
> within the tools/dev/ subtree?


Sure, done in r1890601.

It's in tools/dev/gen-test-data/gen_diff_test_data.c.

I added the gen-test-data directory in case we want to add other
sample data generators in the future.

Cheers,
Nathan


Re: [PATCH] limit diff effort to fix performance issue

2021-06-08 Thread Stefan Sperling
On Tue, Jun 08, 2021 at 02:57:34PM +0200, Johan Corveleyn wrote:
> Okay, I focused on the first revision causing the annotate to differ,
> and with some debug logging:
> - p went up to 139
> - length[0]=1942, length[1]=1817
> 
> Now, 1942 lines on the left and 1817 on the right doesn't seem all
> that many (that's what remains after prefix-suffix stripping). I see
> no reason 'svn diff' shouldn't be able to calculate a minimal diff for
> those sizes in a reasonable amount of time. So if p can go up to such
> a "cost" for such "reasonable" lenghts, I imagine we should put the
> actual limit much higher. I suppose we can just as well set "min_cost"
> to 1024 or even higher. 64 is way too low.

Thanks!
I have set the value back to at least 1024 with this new version of the patch.

> BTW, the actual revision (and proper diff) here was not really
> unreadable. It had 86 hunks. The log message was something like
> "refactor contact parameters", and the change was removing 4 lines and
> replacing them with 1 line, throughout an entire section. If the
> lcs-limit gets hit, the entire "remaining part" (which normally would
> have been some more small hunks) becomes one giant remove+add of 1000
> lines or so (i.e. quite useless for a diff that a human might want to
> review).

Indeed, we want to avoid such changes if possible, and only punish
the cases which have problematic run-time behaviour.

> Finally, a couple of small remarks related to the patch itself:
> [[[
> +  int max_cost;
> +  /* The minimum cost 'p' (# moves away from k=0) we are willing to invest. */
> +  const int min_cost = 64;
> ]]]
> 
> I get confused by the naming of the variables here (unless we use
> min_max_cost or something). Perhaps something like 'cost_limit' and
> 'min_cost_limit' or something like that?

Sure, done.

> [[[
> +  max_cost = shift_sqrt(length[0] + length[1] / 2);
> ]]]
> 
> Is that correct WRT operator precedence? It looks like you want to
> take the sqrt of the average length, which would be (length[0] +
> length[1]) / 2. Or maybe I misunderstand. I'm not really sure why this
> is used as an estimate for the "acceptable cost" / max_cost. But then
> again, it's been a while since I read the details of Myers algorithm.

Good catch. In my test case it doesn't make a difference. The limit ends
up being 2048 in either case. But you're right, the code was incorrect.

[[[
* subversion/libsvn_diff/lcs.c
  (shift_sqrt): New helper function. Borrowed from Game of Trees' diff.
  (svn_diff__lcs): Limit the effort spent on finding an optimal diff to
   avoid spinning on the CPU for a long time (30 minutes or more) for
   some input files. Large XML documents were known to trigger this.
]]]

Index: subversion/libsvn_diff/lcs.c
===================================================================
--- subversion/libsvn_diff/lcs.c	(revision 1890390)
+++ subversion/libsvn_diff/lcs.c	(working copy)
@@ -227,7 +227,18 @@ prepend_lcs(svn_diff__lcs_t *lcs, apr_off_t lines,
   return new_lcs;
 }
 
+/* Integer square root approximation */
+static int
+shift_sqrt(apr_off_t val)
+{
+  int i;
 
+  for (i = 1; val > 0; val >>= 2)
+    i <<= 1;
+
+  return i;
+}
+
 svn_diff__lcs_t *
svn_diff__lcs(svn_diff__position_t *position_list1, /* pointer to tail (ring) */
  svn_diff__position_t *position_list2, /* pointer to tail (ring) */
@@ -245,7 +256,10 @@ svn_diff__lcs(svn_diff__position_t *position_list1
   svn_diff__snake_t *fp;
   apr_off_t d;
   apr_off_t k;
-  apr_off_t p = 0;
+  apr_off_t p = 0; /* # moves away from k=0 */
+  /* The minimum cost 'p' (# moves away from k=0) we are willing to invest. */
+  const int min_cost_limit = 1024;
+  int max_cost_limit;
   svn_diff__lcs_t *lcs, *lcs_freelist = NULL;
 
   svn_diff__position_t sentinel_position[2];
@@ -337,6 +351,24 @@ svn_diff__lcs(svn_diff__position_t *position_list1
   fp[d - 1].position[0] = sentinel_position[0].next;
   fp[d - 1].position[1] = &sentinel_position[1];
 
+  /* Limit the effort spent looking for a diff snake. If files have
+   * very few lines in common, the effort spent to find nice diff
+   * snakes is just not worth it, the diff result will still be
+   * essentially minus everything on the left, plus everything on
+   * the right, with a few useless matches here and there.
+   *
+   * Very large files which have a lot of common lines interleaved with
+   * changed lines (such as humongous auto-generated XML documents) could
+   * easily keep us busy here for a very long time, in the order of hours.
+   * In such cases we abort the algorithm before it is finished and use
+   * the most recently computed intermediate result. The diff will not be
+   * pretty but a quick suboptimal result is better than a perfect result
+   * which takes hours to compute.
+   */
+  max_cost_limit = shift_sqrt((length[0] + length[1]) / 2);
+  if (max_cost_limit < min_cost_limit)
+    max_cost_limit = min_cost_limit;
+
   p = 0;
   do
 {
@@ -353,7 +385,7 @@ svn_diff__l

Re: [PATCH] limit diff effort to fix performance issue

2021-06-08 Thread Johan Corveleyn
On Sun, Jun 6, 2021 at 4:57 PM Stefan Sperling  wrote:
>
> On Sat, Jun 05, 2021 at 08:56:24PM +0200, Johan Corveleyn wrote:
> > Hmm, I tried this patch with my "annotate large XML file with deep
> > history test", but the result isn't the same as with 1.14. I'd have to
> > investigate a bit more to find out where it took another turn, and
> > what that diff looks like (and whether or not the "other diff" is a
> > problem for me).
>
> It would be good to know how many iterations your file needs to produce
> a diff without a limit. You can use something like SVN_DBG(("p=%d\n", p))
> to print this number.

Okay, I focused on the first revision causing the annotate to differ,
and with some debug logging:
- p went up to 139
- length[0]=1942, length[1]=1817

Now, 1942 lines on the left and 1817 on the right doesn't seem all
that many (that's what remains after prefix-suffix stripping). I see
no reason 'svn diff' shouldn't be able to calculate a minimal diff for
those sizes in a reasonable amount of time. So if p can go up to such
a "cost" for such "reasonable" lenghts, I imagine we should put the
actual limit much higher. I suppose we can just as well set "min_cost"
to 1024 or even higher. 64 is way too low.

BTW, the actual revision (and proper diff) here was not really
unreadable. It had 86 hunks. The log message was something like
"refactor contact parameters", and the change was removing 4 lines and
replacing them with 1 line, throughout an entire section. If the
lcs-limit gets hit, the entire "remaining part" (which normally would
have been some more small hunks) becomes one giant remove+add of 1000
lines or so (i.e. quite useless for a diff that a human might want to
review).

To give you some more numbers about this large XML file:
- it has 17701 revisions
- it has almost 140,000 lines in the HEAD revision
- it's a little less than 4 MB

Annotating the full history (i.e. performing 17700 diffs) takes around
10 minutes, with or without your lcs-limit patch. So it's not like any
of those 17700 diffs are taking a long time. I just tried running it
with min_cost set to 160, and I still get differences. Next I'll try
with 1024.

Finally, a couple of small remarks related to the patch itself:
[[[
+  int max_cost;
+  /* The minimum cost 'p' (# moves away from k=0) we are willing to invest. */
+  const int min_cost = 64;
]]]

I get confused by the naming of the variables here (unless we use
min_max_cost or something). Perhaps something like 'cost_limit' and
'min_cost_limit' or something like that?

[[[
+  max_cost = shift_sqrt(length[0] + length[1] / 2);
]]]

Is that correct WRT operator precedence? It looks like you want to
take the sqrt of the average length, which would be (length[0] +
length[1]) / 2. Or maybe I misunderstand. I'm not really sure why this
is used as an estimate for the "acceptable cost" / max_cost. But then
again, it's been a while since I read the details of Myers algorithm.

-- 
Johan


Re: [PATCH] limit diff effort to fix performance issue

2021-06-08 Thread Stefan Sperling
On Tue, Jun 08, 2021 at 01:45:00AM -0400, Nathan Hartman wrote:
> In order to do some testing, I needed some test data that reproduces
> the issue; since stsp can't share the customer's 100MB XML file, and
> we'd probably want other inputs or sizes anyway, I wrote a program
> that attempts to generate such a thing. I'm attaching that program...
> 
> To build, rename to .c extension and, e.g.,
> $ gcc gen_diff_test_data.c -o gen_diff_test_data
> 
> To run it, provide two parameters:
> 
> The first is a 'seed' value like you'd provide to a pseudo random
> number generator at init time.
> 
> The second is a 'length' parameter that says how long (approximately)
> you want the output data to be. (The program nearly always overshoots
> this by a small amount.)
> 
> Rather than using the system's pseudo random number generator, this
> program includes its own implementation to ensure that users on any
> system can get the same results when using the same parameters. So if
> different people want to test with the same sets of input, you only
> have to share 2 numbers, rather than send each other files >100MB of
> useless junk.
> 
> Example: Generate two files of approx 100 MB, containing lots of
> differences and diff them:
> 
> $ gen_diff_test_data 98 100m > one.txt
> $ gen_diff_test_data 99 100m > two.txt
> $ time diff one.txt two.txt > /dev/null
> 
> With the above parameters, it takes my system's diff about 50 seconds
> to come up with something that looks reasonable at a glance; svn's
> diff has been crunching away for a while now...

Thank you Nathan, this is incredibly useful!

Would you consider committing this tool to our repository, e.g. somewhere
within the tools/dev/ subtree?

Thanks,
Stefan


JavaHL: SVNClient.copy throws not very helpful exception

2021-06-08 Thread Thomas Singer

Hi,

A user wanted to copy a working copy to a URL and it fails with 
following exception:


org.apache.subversion.javahl.ClientException: Wrong or unexpected 
property value

svn: Commit failed (details follow):
svn: Invalid property value

at org.apache.subversion.javahl.SVNClient.copy(Native Method)
at com.syntevo.smartsvn.client.SuSvnClient.copyImpl(SuSvnClient:1225)
...

Unfortunately, it is completely unclear what property value is rejected 
by SVN. How to fix the working copy to make the copy operation succeed?


--
Best regards,
Thomas Singer
=
syntevo GmbH
www.syntevo.com