Re: Diff optimizations and generating big test files
On Wed, Jan 5, 2011 at 4:33 PM, Philip Martin <philip.mar...@wandisco.com> wrote:
> Johan Corveleyn <jcor...@gmail.com> writes:
>> Thanks for the script, it gives me some good inspiration. However, it
>> doesn't fit well with the optimization that's currently being done on
>> the diff-optimizations-bytes branch, because the differing lines are
>> spread throughout the entire file.
>
> I thought you were working on two different prefix problems, but if it's
> all the same problem that's fine. It's why I want *you* to write the
> script, then I can test your patches on my machine. When you are
> thinking of replacing function calls with macros, that's very much
> hardware/OS/compiler specific, and testing on more than one platform is
> important.

Sorry it took so long (busy/interrupted with other things), but here in attachment is finally a Python script that generates two files suitable for testing the prefix/suffix optimization of the diff-optimizations-bytes branch:

- Without options, it generates two files, file1.txt and file2.txt, with
  100,000 lines of identical prefix and 100,000 lines of identical
  suffix, and in between a mismatching section of 500 lines (with a
  probability of mismatch of 50%).
- Lines are randomly generated, with random lengths between 0 and 80 (by
  default).
- On my machine, it generates those two files of ~8 MB in about 17
  seconds.
- Usage: see below.
Tests on my machine (Win XP 32-bit, Intel T2400 CPU @ 1.83 GHz) show the following:

1) tools/diff/diff from trunk@1058723: 1.020 s
2) tools/diff/diff from diff-optimizations@1058811: 0.370 s
3) tools/diff/diff from diff-optimizations@1058811, with stefan2's
   low-level optimizations [1]: 0.290 s
4) GNU diff: 0.157 s

(It should be noted that svn's tools/diff/diff has a much higher startup cost than GNU diff (for whatever reason), so that alone accounts for part of the difference with GNU diff.)

For really analyzing the benefit of the low-level optimizations (and which parts of those have the most impact), maybe bigger sample data is needed.

===
$ ./gen-big-files.py --help
Usage: Generate files for diff

Options:
  -h, --help            show this help message and exit
  -1 FILE1, --file1=FILE1
                        filename of left file of the diff, default file1.txt
  -2 FILE2, --file2=FILE2
                        filename of right file of the diff, default file2.txt
  -p PREFIX_LINES, --prefix-lines=PREFIX_LINES
                        number of prefix lines, default 10
  -s SUFFIX_LINES, --suffix-lines=SUFFIX_LINES
                        number of suffix lines, default 10
  -m MIDDLE_LINES, --middle-lines=MIDDLE_LINES
                        number of lines in the middle, non-matching section,
                        default 500
  --percent-mismatch=PERCENT_MISMATCH
                        percentage of mismatches in middle section, default 50
  --min-line-length=MIN_LINE_LENGTH
                        minimum length of randomly generated lines, default 0
  --max-line-length=MAX_LINE_LENGTH
                        maximum length of randomly generated lines, default 80

Cheers,
--
Johan

[1] http://svn.haxx.se/dev/archive-2011-01/0005.shtml - I have yet to integrate (some of) these suggestions into the branch. That may take me another couple of days (identifying which changes have the biggest speed/weight gain, etc.).
Re: My take on the diff-optimizations-bytes branch
On Sat, Jan 8, 2011 at 6:50 AM, Stefan Fuhrmann <stefanfuhrm...@alice-dsl.de> wrote:
> On 03.01.2011 02:14, Johan Corveleyn wrote:
>> On Mon, Jan 3, 2011 at 12:13 AM, Johan Corveleyn <jcor...@gmail.com> wrote:
>>> On Sun, Jan 2, 2011 at 10:04 PM, Johan Corveleyn <jcor...@gmail.com> wrote:
>>>> For now, some feedback on the rest of the patch:
>>>>
>>>> [[[
>>>> -      DECREMENT_POINTERS(file, file_len, pool);
>>>> -    }
>>>> +  DECREMENT_POINTERS(file, file_len, pool);
>>>>    }
>>>> +  return SVN_NO_ERROR;
>>>> +}
>>>> +
>>>> +/* Find the prefix which is identical between all elements of the FILE array.
>>>> + * Return the number of prefix lines in PREFIX_LINES. REACHED_ONE_EOF will be
>>>> + * set to TRUE if one of the FILEs reached its end while scanning prefix,
>>>> + * i.e. at least one file consisted entirely of prefix. Otherwise,
>>>> + * REACHED_ONE_EOF is set to FALSE.
>>>> + *
>>>> + * After this function is finished, the buffers, chunks, curp's and endp's
>>>> + * of the FILEs are set to point at the first byte after the prefix. */
>>>> +static svn_error_t *
>>>> +find_identical_prefix(svn_boolean_t *reached_one_eof, apr_off_t *prefix_lines,
>>>> +                      struct file_info file[], apr_size_t file_len,
>>>> +                      apr_pool_t *pool)
>>>> +{
>>>> +  if (file_len == 2)
>>>> +    SVN_ERR(find_identical_prefix_fast(reached_one_eof, prefix_lines,
>>>> +                                       file, pool));
>>>> +  else
>>>> +    SVN_ERR(find_identical_prefix_slow(reached_one_eof, prefix_lines,
>>>> +                                       file, file_len, pool));
>>>> +
>>>> ]]]
>>>>
>>>> Small problem here (not really visible in practice, but a potential
>>>> change in behavior nonetheless): if either of the above function
>>>> calls exited early because of reached_all_eof, this function should
>>>> also exit early.
>
> Yep, that is just an artifact caused by moving code into a sub-function.

Ok, that's no problem anymore, since I removed the early-exit in r1057435. It was actually not correct, and was the main reason why some tests kept failing when I ported it to diff3 (see that revision's commit message for more details).

[ ... snip ... ]

>> Also, it would really help if you could split up this patch (though
>> this was certainly a good way to try it out and give me an idea of the
>> possibilities).
>
> Giving you some ideas was the main point of it. If I find some time, I
> will try to reduce the patch (no fast / slow separation). But since the
> heap-to-stack conversion spreads across all of the code, it is hard to
> split the patch.
>
>> It would be interesting to see where the biggest gains are coming from
>> (I'm guessing from the per-machine-word reading/comparing; I'd like to
>> try that first, maybe together with the allocating of the file[] on
>> the stack; I'd like to avoid special-casing file_len==2, splitting up
>> functions in *_slow and *_fast, because it makes it a lot more
>> difficult imho (so I'd only like to do that if it's a significant
>> contributor)). But if you want to leave that as an exercise for the
>> reader, that's fine as well :-).
>
> Exercise is certainly not a bad thing ;)

Ok, I'll give it a go one of the coming days, now that I've got diff3/diff4 working, and I've got a suitable big-files-generating script for creating reproducible test files.

> But I think the stack variable is certainly helpful and easy to do.
> While chunky operation gives a *big* boost, it is much more difficult
> to code if you need to compare multiple sources. It should be possible,
> though. The great advantage of the file_len==2 case is that this is the
> only time-critical operation (a true merge over long files is rare).
> Also, many constructs collapse if the count is fixed to 2:
>
>   for (i = 1, is_match = TRUE; i < file_len; i++)
>     is_match = is_match && *file[0].curp == *file[i].curp;
>   while (is_match)
>   {
>     ...
>     for (i = 1, is_match = TRUE; i < file_len; i++)
>       is_match = is_match && *file[0].curp == *file[i].curp;
>   }
>
> becomes
>
>   while (*file[0].curp == *file[1].curp)
>   {
>     ...
>   }

Hm, that's true, but I'm not sure that will have a big impact (but like you say, we'll need to do more tests).
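[As an illustration of the collapse Stefan describes, and of the "chunky", word-at-a-time comparison: the scan logic can be modelled outside of C. The Python sketch below is only a model, not the code on the branch; it counts common-prefix bytes rather than lines, and the names identical_prefix_generic and identical_prefix_two are invented.]

```python
def identical_prefix_generic(buffers):
    """Count common-prefix bytes across N byte buffers.
    Models the generic (_slow) path: an inner loop over all sources
    per position."""
    n = min(len(b) for b in buffers)
    count = 0
    while count < n:
        byte = buffers[0][count]
        # the per-position inner loop that disappears when N is fixed to 2
        if any(b[count] != byte for b in buffers[1:]):
            break
        count += 1
    return count


def identical_prefix_two(a, b, word=8):
    """Count common-prefix bytes of exactly two buffers.
    Models the specialized (_fast) path: compare word-sized chunks
    first, then finish byte by byte."""
    n = min(len(a), len(b))
    count = 0
    # chunky comparison: `word` bytes at a time
    while count + word <= n and a[count:count + word] == b[count:count + word]:
        count += word
    # tail: byte-by-byte until the first mismatch (or EOF of one buffer)
    while count < n and a[count] == b[count]:
        count += 1
    return count
```

The point of the specialization is visible even in this model: the two-source version has no inner loop over file_len, and the chunked loop does most of its comparisons word bytes at a time.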
And now that I've read about a really bad merge case on the users list [1], I'm thinking that diff3 should probably also get all the perf improvements we can give it.

Actually, what's a little bit troubling is that there are currently only 3 possible file_len's, of which only 2 are used in practice: diff2 and diff3 (diff4 is not used in svn core, only in tools/diff/diff4). So, if we make a slow and a fast version, we could just as well make an XXX2 and an XXX3 version explicitly, and dispense with the need to pass arrays and array lengths. Hmmm... generic (but more complex) code vs. a couple of specialized versions.

> But basically, we will need to run some more tests.
>
> -- Stefan^2.

Cheers,
--
Johan

[1] http://svn.haxx.se/users/archive-2011-01/0245.shtml
Coding goals / requirements.
Hi everyone,

I have a general question about Subversion development and, instead of assuming, thought I had better seek an answer. Outside of my limited exposure as the PM, I have no open source coding experience, so there may well be an expectation about coding that I simply don't know about.

This email stems from the last paragraph in a message by Johan Corveleyn, which I have copied into this email. If you're already familiar with the diff optimisation, you can get the full message here:

http://svn.haxx.se/dev/archive-2011-01/0241.shtml

And if you want some more background, the full thread is here:

http://svn.haxx.se/dev/archive-2011-01/0004.shtml and
http://svn.haxx.se/dev/archive-2011-01/0005.shtml

which I was tempted to just reply to, but thought I had better not hijack the existing thread. To save you from having to find the post, here is the paragraph I'm referring to:

> Actually, what's a little bit troubling is that there are currently
> only 3 possible file_len's, of which only 2 are used in practice: diff2
> and diff3 (diff4 is not used in svn core, only in tools/diff/diff4).
> So, if we make a slow and a fast version, we could just as well make an
> XXX2 and an XXX3 version explicitly, and dispense with the need to pass
> arrays and array lengths. Hmmm... generic (but more complex) code vs. a
> couple of specialized versions.

The goal of the thread / patch submission is to increase the speed of diffing. So does the design truly matter? The underlying design of all software should be the simplest / cleanest / most loosely-coupled code that addresses the problem. Surely the answer is: man-power is limited, so as long as you observe the coding guidelines / styles etc., is the answer not to code in the manner that provides the greatest return on a developer's time for THIS problem?
Assuming the specialised version is indeed faster: if, for whatever reason, it is decided that an API is required (public or private), then an appropriate wrapper can be added when having an API for those functions becomes a specific requirement, as opposed to simply being a nice design goal to aim for upfront?

As always, thanks for your patience / assistance.

Gavin.