Re: Diff optimizations and generating big test files

2011-01-16 Thread Johan Corveleyn
On Wed, Jan 5, 2011 at 4:33 PM, Philip Martin
philip.mar...@wandisco.com wrote:
> Johan Corveleyn jcor...@gmail.com writes:
>
>> Thanks for the script, it gives me some good inspiration.
>>
>> However, it doesn't fit well with the optimization that's currently
>> being done on the diff-optimizations-bytes branch, because the
>> differing lines are spread throughout the entire file.
>
> I thought you were working on two different prefix problems, but if it's
> all the same problem that's fine.  It's why I want *you* to write the
> script, then I can test your patches on my machine.  When you are
> thinking of replacing function calls with macros that's very much
> hardware/OS/compiler specific and testing on more than one platform is
> important.

Sorry it took so long (busy/interrupted with other things), but
attached is finally a Python script that generates two files suitable
for testing the prefix/suffix optimization of the
diff-optimizations-bytes branch:

- Without options, it generates two files, file1.txt and file2.txt,
with 100,000 lines of identical prefix, 100,000 lines of identical
suffix, and in between a mismatching section of 500 lines (each line
having a 50% probability of mismatch).

- Lines are randomly generated, with random lengths between 0 and 80
characters (by default).

- On my machine, it generates those two files of ~8 MB in about 17 seconds.

- Usage: see below.
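(The attached script itself is not reproduced in this archive, but the
core idea can be sketched roughly as follows. This is a minimal sketch,
not the actual gen-big-files.py; the function names and defaults here
are assumptions based on the description above.)

```python
import random
import string

# Character pool for the randomly generated lines (an assumption;
# the real script may use a different alphabet).
CHARS = string.ascii_letters + string.digits + ' '


def random_line(min_len=0, max_len=80):
    """One line of random characters, with a random length in [min_len, max_len]."""
    n = random.randint(min_len, max_len)
    return ''.join(random.choice(CHARS) for _ in range(n)) + '\n'


def gen_big_files(file1='file1.txt', file2='file2.txt',
                  prefix_lines=100000, suffix_lines=100000,
                  middle_lines=500, percent_mismatch=50):
    """Write two files sharing an identical prefix and suffix, with a
    middle section where each line differs with the given probability."""
    with open(file1, 'w') as f1, open(file2, 'w') as f2:
        for _ in range(prefix_lines):        # identical prefix
            line = random_line()
            f1.write(line)
            f2.write(line)
        for _ in range(middle_lines):        # possibly-mismatching middle
            line = random_line()
            f1.write(line)
            if random.random() < percent_mismatch / 100.0:
                f2.write(random_line())      # generate a differing line
            else:
                f2.write(line)               # keep the line identical
        for _ in range(suffix_lines):        # identical suffix
            line = random_line()
            f1.write(line)
            f2.write(line)
```

With the defaults above, everything outside the 500-line middle section
is byte-for-byte identical, which is exactly the shape the prefix/suffix
scanning optimization is meant to exploit.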


Tests on my machine (Win XP 32 bit, Intel T2400 CPU @ 1.83 GHz) show
the following:

1) tools/diff/diff from trunk@1058723:
   1.020 s

2) tools/diff/diff from diff-optimizations@1058811:
   0.370 s

3) tools/diff/diff from diff-optimizations@1058811 with stefan2's
low-level optimizations [1]:
   0.290 s

4) GNU diff:
   0.157 s

(It should be noted that svn's tools/diff/diff has a much higher
startup cost than GNU diff, for whatever reason, so startup alone
accounts for part of the difference with GNU diff.)

To really analyze the benefit of the low-level optimizations (and
which parts of them have the most impact), bigger sample data may be
needed.


===
$ ./gen-big-files.py --help
Usage: Generate files for diff

Options:
  -h, --help            show this help message and exit
  -1 FILE1, --file1=FILE1
filename of left file of the diff, default file1.txt
  -2 FILE2, --file2=FILE2
filename of right file of the diff, default file2.txt
  -p PREFIX_LINES, --prefix-lines=PREFIX_LINES
number of prefix lines, default 10
  -s SUFFIX_LINES, --suffix-lines=SUFFIX_LINES
number of suffix lines, default 10
  -m MIDDLE_LINES, --middle-lines=MIDDLE_LINES
number of lines in the middle, non-matching section,
default 500
  --percent-mismatch=PERCENT_MISMATCH
percentage of mismatches in middle section, default 50
  --min-line-length=MIN_LINE_LENGTH
minimum length of randomly generated lines, default 0
  --max-line-length=MAX_LINE_LENGTH
maximum length of randomly generated lines, default 80

Cheers,
-- 
Johan

[1] http://svn.haxx.se/dev/archive-2011-01/0005.shtml - I have yet to
integrate (some of) these suggestions into the branch. That may take
me another couple of days (identifying which changes have the biggest
speed/weight gain, etc.).


Re: Diff optimizations and generating big test files

2011-01-05 Thread Philip Martin
Johan Corveleyn jcor...@gmail.com writes:

> Thanks for the script, it gives me some good inspiration.
>
> However, it doesn't fit well with the optimization that's currently
> being done on the diff-optimizations-bytes branch, because the
> differing lines are spread throughout the entire file.

I thought you were working on two different prefix problems, but if it's
all the same problem that's fine.  It's why I want *you* to write the
script, then I can test your patches on my machine.  When you are
thinking of replacing function calls with macros that's very much
hardware/OS/compiler specific and testing on more than one platform is
important.

-- 
Philip