On Wed, Jan 5, 2011 at 4:33 PM, Philip Martin <philip.mar...@wandisco.com> wrote:
> Johan Corveleyn <jcor...@gmail.com> writes:
>
>> Thanks for the script, it gives me some good inspiration.
>>
>> However, it doesn't fit well with the optimization that's currently
>> being done on the diff-optimizations-bytes branch, because the
>> differing lines are spread throughout the entire file.
>
> I thought you were working on two different prefix problems, but if it's
> all the same problem that's fine. It's why I want *you* to write the
> script, then I can test your patches on my machine. When you are
> thinking of replacing function calls with macros that's very much
> hardware/OS/compiler specific and testing on more than one platform is
> important.

Sorry it took so long (busy/interrupted with other things), but attached is, finally, a Python script that generates two files suitable for testing the prefix/suffix optimization of the diff-optimizations-bytes branch:

- Without options, it generates two files, file1.txt and file2.txt, with 100,000 lines of identical prefix and 100,000 lines of identical suffix, and in between a mismatching section of 500 lines (with a probability of mismatch of 50%).
- Lines are randomly generated, with random lengths between 0 and 80 (by default).
- On my machine, it generates those two files of ~8 MB in about 17 seconds.
- Usage: see below.

Tests on my machine (Win XP 32 bit, Intel T2400 CPU @ 1.83 GHz) show the following:

1) tools/diff/diff from trunk@1058723: 1.020 s
2) tools/diff/diff from diff-optimizations@1058811: 0.370 s
3) tools/diff/diff from diff-optimizations@1058811 with stefan2's low-level optimizations [1]: 0.290 s
4) GNU diff: 0.157 s

(It should be noted that svn's tools/diff/diff has a much higher startup cost than GNU diff, for whatever reason, so that alone accounts for part of the difference with GNU diff.)

To really analyze the benefit of the low-level optimizations (and which of them have the most impact), bigger sample data may be needed.

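For anyone reading this without the attachment, the file layout described above (identical prefix, partially mismatching middle, identical suffix) can be sketched roughly as follows. This is not the attached gen-big-files.py, just a minimal illustration written for Python 3. The names random_line and generate_pair, the seed parameter, the character set, and the way a mismatch is produced (prefixing the line with "!") are all assumptions of this sketch.

```python
import random
import string

# Characters used for random lines (an assumption of this sketch).
ALPHABET = string.ascii_letters + string.digits + " "


def random_line(rng, min_len=0, max_len=80):
    """Return one random line of min_len..max_len characters plus a newline."""
    length = rng.randint(min_len, max_len)
    return "".join(rng.choice(ALPHABET) for _ in range(length)) + "\n"


def generate_pair(path1, path2, prefix_lines=100000, middle_lines=500,
                  suffix_lines=100000, percent_mismatch=50,
                  min_len=0, max_len=80, seed=None):
    """Write two files with an identical prefix and suffix and a middle
    section in which each line differs with the given probability."""
    rng = random.Random(seed)
    with open(path1, "w", newline="\n") as f1, \
         open(path2, "w", newline="\n") as f2:
        # Identical prefix.
        for _ in range(prefix_lines):
            line = random_line(rng, min_len, max_len)
            f1.write(line)
            f2.write(line)
        # Middle section: some lines differ between the two files.
        for _ in range(middle_lines):
            line = random_line(rng, min_len, max_len)
            f1.write(line)
            if rng.random() * 100 < percent_mismatch:
                # Simplification: force a difference by prefixing "!",
                # so the line in file 2 is one character longer.
                f2.write("!" + line)
            else:
                f2.write(line)
        # Identical suffix.
        for _ in range(suffix_lines):
            line = random_line(rng, min_len, max_len)
            f1.write(line)
            f2.write(line)


if __name__ == "__main__":
    generate_pair("file1.txt", "file2.txt")
```

Passing a fixed seed makes the generated pair reproducible, which may help when comparing timings across machines.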
===========
$ ./gen-big-files.py --help
Usage: Generate files for diff

Options:
  -h, --help            show this help message and exit
  -1 FILE1, --file1=FILE1
                        filename of left file of the diff, default file1.txt
  -2 FILE2, --file2=FILE2
                        filename of right file of the diff, default file2.txt
  -p PREFIX_LINES, --prefix-lines=PREFIX_LINES
                        number of prefix lines, default 100000
  -s SUFFIX_LINES, --suffix-lines=SUFFIX_LINES
                        number of suffix lines, default 100000
  -m MIDDLE_LINES, --middle-lines=MIDDLE_LINES
                        number of lines in the middle, non-matching section,
                        default 500
  --percent-mismatch=PERCENT_MISMATCH
                        percentage of mismatches in middle section, default 50
  --min-line-length=MIN_LINE_LENGTH
                        minimum length of randomly generated lines, default 0
  --max-line-length=MAX_LINE_LENGTH
                        maximum length of randomly generated lines, default 80

Cheers,
--
Johan

[1] http://svn.haxx.se/dev/archive-2011-01/0005.shtml - I have yet to integrate (some of) these suggestions into the branch. That may take me another couple of days (identifying which changes have the biggest speed/weight gain etc).