Re: Diff optimizations and generating big test files

2011-01-16 Thread Johan Corveleyn
On Wed, Jan 5, 2011 at 4:33 PM, Philip Martin
philip.mar...@wandisco.com wrote:
 Johan Corveleyn jcor...@gmail.com writes:

 Thanks for the script; it gives me some good inspiration.

 However, it doesn't fit well with the optimization that's currently
 being done on the diff-optimizations-bytes branch, because the
 differing lines are spread throughout the entire file.

 I thought you were working on two different prefix problems, but if it's
 all the same problem, that's fine.  That's why I want *you* to write the
 script, so that I can test your patches on my machine.  When you are
 thinking of replacing function calls with macros, that's very much
 hardware/OS/compiler specific, and testing on more than one platform is
 important.

Sorry it took so long (I was busy/interrupted with other things), but
attached is finally a Python script that generates two files suitable
for testing the prefix/suffix optimization of the
diff-optimizations-bytes branch:

- Without options, it generates two files, file1.txt and file2.txt,
with 100,000 lines of identical prefix and 100,000 lines of identical
suffix, and in between a mismatching section of 500 lines (with a
mismatch probability of 50%).

- Lines are randomly generated, with random lengths between 0 and 80
(by default).

- On my machine, it generates those two files of ~8 MB each in about 17 seconds.

- Usage: see below.
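
To give an idea of what the script does, here is a much-simplified
sketch (illustrative only, not the attachment itself; the real script
uses optparse and supports the options shown in the usage output
below):

[[[
#!/usr/bin/env python
# Simplified sketch of a generator like gen-big-files.py (hypothetical
# names; the attached script is the authoritative version).
import random
import string

def random_line(min_len=0, max_len=80):
    """Return one line of random letters/spaces, of random length."""
    length = random.randint(min_len, max_len)
    return ''.join(random.choice(string.ascii_letters + ' ')
                   for _ in range(length))

def generate(file1='file1.txt', file2='file2.txt',
             prefix_lines=10, suffix_lines=10,
             middle_lines=500, percent_mismatch=50):
    with open(file1, 'w') as f1, open(file2, 'w') as f2:
        # Identical prefix: both files receive the same random lines.
        for _ in range(prefix_lines):
            line = random_line() + '\n'
            f1.write(line)
            f2.write(line)
        # Middle section: each line mismatches with the given probability.
        for _ in range(middle_lines):
            line = random_line() + '\n'
            f1.write(line)
            if random.randint(1, 100) <= percent_mismatch:
                f2.write(random_line() + '\n')  # (almost certainly) differs
            else:
                f2.write(line)
        # Identical suffix.
        for _ in range(suffix_lines):
            line = random_line() + '\n'
            f1.write(line)
            f2.write(line)

if __name__ == '__main__':
    generate(prefix_lines=100000, suffix_lines=100000)
]]]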


Tests on my machine (Win XP 32 bit, Intel T2400 CPU @ 1.83 GHz) show
the following:

1) tools/diff/diff from trunk@1058723:
   1.020 s

2) tools/diff/diff from diff-optimizations@1058811:
   0.370 s

3) tools/diff/diff from diff-optimizations@1058811 with stefan2's
low-level optimizations [1]:
   0.290 s

4) GNU diff:
   0.157 s

(It should be noted that svn's tools/diff/diff has a much higher
startup cost than GNU diff, for whatever reason, so that alone
accounts for part of the difference with GNU diff.)

To really analyze the benefit of the low-level optimizations (and
which parts of them have the most impact), bigger sample data is
probably needed.
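
For example, bigger files could be generated with something like the
following (a hypothetical invocation; option names as in the usage
output below):

$ ./gen-big-files.py -p 1000000 -s 1000000 -m 500
$ time tools/diff/diff file1.txt file2.txt > /dev/null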


===
$ ./gen-big-files.py --help
Usage: Generate files for diff

Options:
  -h, --help            show this help message and exit
  -1 FILE1, --file1=FILE1
                        filename of left file of the diff, default file1.txt
  -2 FILE2, --file2=FILE2
                        filename of right file of the diff, default file2.txt
  -p PREFIX_LINES, --prefix-lines=PREFIX_LINES
                        number of prefix lines, default 10
  -s SUFFIX_LINES, --suffix-lines=SUFFIX_LINES
                        number of suffix lines, default 10
  -m MIDDLE_LINES, --middle-lines=MIDDLE_LINES
                        number of lines in the middle, non-matching section,
                        default 500
  --percent-mismatch=PERCENT_MISMATCH
                        percentage of mismatches in middle section, default 50
  --min-line-length=MIN_LINE_LENGTH
                        minimum length of randomly generated lines, default 0
  --max-line-length=MAX_LINE_LENGTH
                        maximum length of randomly generated lines, default 80

Cheers,
-- 
Johan

[1] http://svn.haxx.se/dev/archive-2011-01/0005.shtml - I have yet to
integrate (some of) these suggestions into the branch. That may take
me another couple of days (identifying which changes give the biggest
speed/weight gain, etc.).


Re: My take on the diff-optimizations-bytes branch

2011-01-16 Thread Johan Corveleyn
On Sat, Jan 8, 2011 at 6:50 AM, Stefan Fuhrmann
stefanfuhrm...@alice-dsl.de wrote:
 On 03.01.2011 02:14, Johan Corveleyn wrote:

 On Mon, Jan 3, 2011 at 12:13 AM, Johan Corveleyn jcor...@gmail.com wrote:

 On Sun, Jan 2, 2011 at 10:04 PM, Johan Corveleyn jcor...@gmail.com wrote:
 For now, some feedback on the rest of the patch:

 [[[

 -          DECREMENT_POINTERS(file, file_len, pool);
 -        }
 +        DECREMENT_POINTERS(file, file_len, pool);
      }

 +  return SVN_NO_ERROR;
 +}
 +
 +/* Find the prefix which is identical between all elements of the FILE array.
 + * Return the number of prefix lines in PREFIX_LINES.  REACHED_ONE_EOF will
 + * be set to TRUE if one of the FILEs reached its end while scanning prefix,
 + * i.e. at least one file consisted entirely of prefix.  Otherwise,
 + * REACHED_ONE_EOF is set to FALSE.
 + *
 + * After this function is finished, the buffers, chunks, curp's and endp's
 + * of the FILEs are set to point at the first byte after the prefix. */
 +static svn_error_t *
 +find_identical_prefix(svn_boolean_t *reached_one_eof, apr_off_t *prefix_lines,
 +                      struct file_info file[], apr_size_t file_len,
 +                      apr_pool_t *pool)
 +{
 +  if (file_len == 2)
 +    SVN_ERR(find_identical_prefix_fast(reached_one_eof, prefix_lines,
 +                                       file, pool));
 +  else
 +    SVN_ERR(find_identical_prefix_slow(reached_one_eof, prefix_lines,
 +                                       file, file_len, pool));
 +
 ]]]

 Small problem here (not really visible in practice, but potential
 change in behavior none the less): if either of the above function
 calls exited early because of reached_all_eof, this function should
 also exit early.

 Yep, that is just an artifact caused by moving code
 into a sub-function.

Ok, that's no problem anymore, since I removed the early-exit in
r1057435. It was actually not correct, and was the main reason why
some tests kept failing when I ported it to diff3 (see that revision's
commit message for more details).

[ ... snip ... ]

 Also, it would really help if you could split up this patch (though
 this was certainly a good way to try it out and give me an idea of the
 possibilities).

 Giving you some ideas was the main point of it.

 If I find some time, I will try to reduce the patch
 (no fast / slow separation). But since the heap
 to stack conversion spreads across all of the
 code, it is hard to split the patch.

 It would be interesting to see where the biggest gains are coming
 from. I'm guessing it's the per-machine-word reading/comparing; I'd
 like to try that first, maybe together with allocating the file[]
 array on the stack. I'd like to avoid special-casing file_len==2 and
 splitting functions into *_slow and *_fast variants, because that
 makes the code a lot more difficult imho (so I'd only like to do that
 if it's a significant contributor). But if you want to leave that as
 an exercise for the reader, that's fine as well :-).

 Exercise is certainly not a bad thing ;)

Ok, I'll give it a go one of these days, now that I've got diff3/diff4
working, and I've got a suitable script for generating big,
reproducible test files.

 But I think the stack variable is certainly helpful
 and easy to do. While chunky operation gives
 a *big* boost, it is much more difficult to code if
 you need to compare multiple sources. It should
 be possible, though.

 The great advantage of the file_len==2 case is
 that this is the only time-critical operation (a
 true merge over long files is rare). Also, many
 constructs collapse if the count is fixed to 2:

 for (i = 1, is_match = TRUE; i < file_len; i++)
   is_match = is_match && *file[0].curp == *file[i].curp;
 while (is_match)
 {
 ...
   for (i = 1, is_match = TRUE; i < file_len; i++)
     is_match = is_match && *file[0].curp == *file[i].curp;
 }

 becomes

 while (*file[0].curp == *file[1].curp)
 {
 ...
 }

Hm, that's true, but I'm not sure that will have a big impact (but
like you say, we'll need to do more tests). And now that I've read
about a really bad merge case on the users list [1], I'm thinking that
diff3 should probably also get all the perf-improvements that we can
give it.

Actually, what's a little bit troubling is that there are currently
only 3 possible file_len's, of which only 2 are used in practice:
diff2 and diff3 (diff4 is not used in svn core, only in
tools/diff/diff4). So, if we make a slow and a fast version, we could
just as well make an XXX2 and an XXX3 version explicitly, and dispense
with the need to pass arrays and array lengths. Hmmm... generic (but
more complex) code vs. a couple of specialized versions.
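
Just to illustrate what I mean (a hypothetical sketch of my own, not
code from the branch; identical_prefix2 and word_t are made-up names):
fixing the count to 2 would let the prefix scan combine the simple
loop condition with per-machine-word comparison, roughly like this:

[[[
#include <stddef.h>

typedef unsigned long word_t;  /* stand-in for "one machine word" */

/* Return the length in bytes of the identical prefix of the buffers
 * [P0, END0) and [P1, END1).  Alignment and strict-aliasing caveats
 * are ignored here; real code would have to deal with them. */
static size_t
identical_prefix2(const char *p0, const char *end0,
                  const char *p1, const char *end1)
{
  const char *start = p0;

  /* Word-at-a-time: advance while a full word remains in both
     buffers and the words are equal. */
  while (p0 + sizeof(word_t) <= end0 && p1 + sizeof(word_t) <= end1
         && *(const word_t *)p0 == *(const word_t *)p1)
    {
      p0 += sizeof(word_t);
      p1 += sizeof(word_t);
    }

  /* Byte-at-a-time to pin down the exact mismatch point or EOF. */
  while (p0 < end0 && p1 < end1 && *p0 == *p1)
    {
      p0++;
      p1++;
    }

  return (size_t)(p0 - start);
}
]]]

(Counting the prefix *lines*, i.e. scanning back to the last newline,
is left out of the sketch; the real find_identical_prefix has to do
that too.)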


 But basically, we will need to run some more tests.

 -- Stefan^2.



Cheers,
-- 
Johan

[1] http://svn.haxx.se/users/archive-2011-01/0245.shtml


Coding goals / requirements.

2011-01-16 Thread Gavin Beau Baumanis
Hi everyone,
I have a general question about Subversion development; instead of
assuming, I thought I had better seek an answer.
Outside of my limited exposure as the PM, I have no Open Source coding
experience, so there may well be an expectation about coding that I
simply don't know about.

This email stems from the last paragraph in a message by Johan
Corveleyn, which I have copied into this email.

If you're already familiar with the diff-optimisation work, you can
get the full message here;
http://svn.haxx.se/dev/archive-2011-01/0241.shtml

And if you want some more background, the full thread is here;
http://svn.haxx.se/dev/archive-2011-01/0004.shtml
and here;
http://svn.haxx.se/dev/archive-2011-01/0005.shtml

I was tempted to just reply to that thread, but thought I had better
not hijack it.

To save you from having to find the post, here is the paragraph I'm
referring to;

 Actually, what's a little bit troubling is that there are currently
 only 3 possible file_len's, of which only 2 are used in practice:
 diff2 and diff3 (diff4 is not used in svn core, only in
 tools/diff/diff4). So, if we make a slow and a fast version, we could
 just as well make an XXX2 and an XXX3 version explicitly, and dispense
 with the need to pass arrays and array lengths. Hmmm... generic (but
 more complex) code vs. a couple of specialized versions.


The goal of the thread / patch submission is to increase the speed of
diffing. So, does the design truly matter?

The underlying design of all software should be to create the simplest
/ cleanest / loosely-coupled code that addresses the problem. But
man-power is limited; so surely the answer is: as long as you observe
the coding guidelines / styles etc., code in the manner that provides
the greatest return on a developer's time for THIS problem?

Assuming the specialised version is indeed faster: if for whatever
reason it is later decided that an API is required (public or
private), couldn't an appropriate wrapper be added when having an API
for those functions becomes an actual requirement, as opposed to
simply being a nice design goal to aim for upfront?

As always, thanks for your patience / assistance.

Gavin.