Re: My take on the diff-optimizations-bytes branch
On 26.01.2011 03:09, Johan Corveleyn wrote:
> On Tue, Jan 25, 2011 at 1:37 AM, Stefan Fuhrmann <eq...@web.de> wrote:
>> [ ... snip ...]
>>
>> And, as promised, here some ideas how to get more speed from the
>> generic code. Your latest commit:
>>
>> +#if SVN_UNALIGNED_ACCESS_IS_OK
>> +
>> +  /* Skip quickly over the stuff between EOLs. */
>> +  for (i = 0, can_read_word = TRUE; i < file_len; i++)
>> +    can_read_word = can_read_word
>> +                    && (file[i].curp + sizeof(apr_size_t) < file[i].endp);
>> +  while (can_read_word)
>> +    {
>> +      for (i = 1, is_match = TRUE; i < file_len; i++)
>> +        is_match = is_match
>> +                   && (   *(const apr_size_t *)file[0].curp
>> +                       == *(const apr_size_t *)file[i].curp);
>> +
>> +      if (!is_match || contains_eol(*(const apr_size_t *)file[0].curp))
>> +        break;
>> +
>> +      for (i = 0; i < file_len; i++)
>> +        file[i].curp += sizeof(apr_size_t);
>> +      for (i = 0, can_read_word = TRUE; i < file_len; i++)
>> +        can_read_word = can_read_word
>> +                        && (file[i].curp + sizeof(apr_size_t) < file[i].endp);
>> +    }
>> +
>> +#endif
>>
>> could be changed to something like the following. Please note that I
>> haven't tested any of this:
>
> Thanks. There was one error in your suggestion, which I found out after
> testing. See below.

I was afraid so. I just hacked the code directly into Thunderbird and
was unsure whether the exit behavior was correct.

>> /* Determine how far we may advance with chunky ops without reaching
>>  * endp for any of the files.
>>  * Signedness is important here if curp gets close to endp.
>>  */
>> apr_ssize_t max_delta = file[0].endp - file[0].curp - sizeof(apr_size_t);
>> for (i = 1; i < file_len; i++)
>>   {
>>     apr_ssize_t delta = file[i].endp - file[i].curp - sizeof(apr_size_t);
>>     if (delta < max_delta)
>>       max_delta = delta;
>>   }
>>
>> /* the former while() loop */
>> is_match = TRUE;
>> for (delta = 0; delta < max_delta && is_match; delta += sizeof(apr_size_t))
>>   {
>>     apr_size_t chunk = *(const apr_size_t *)(file[0].curp + delta);
>>     if (contains_eol(chunk))
>>       break;
>>
>>     for (i = 1; i < file_len; i++)
>>       if (chunk != *(const apr_size_t *)(file[i].curp + delta))
>>         {
>>           is_match = FALSE;
>
> Here, I inserted:
>
>           delta -= sizeof(apr_size_t);
>
> because otherwise, delta will be increased too far (it will still be
> increased by the counting expression of the outer for-loop (after which
> it will stop because of !is_match)). Maybe there is a cleaner/clearer
> way to break out of the outer for-loop here, without incrementing delta
> again, but for now, I've committed it with this change (r1063565).

Yeah, I really wished for a goto there. The clean solution, I guess,
would be moving that section to some sub-routine returning the final
value of delta.

>>           break;
>>         }
>>   }
>>
>> /* We either found a mismatch or an EOL at or shortly behind curp+delta
>>  * or we cannot proceed with chunky ops without exceeding endp.
>>  * In any way, everything up to curp + delta is equal and not an EOL. */
>> for (i = 0; i < file_len; i++)
>>   file[i].curp += delta;
>
> Thanks. This gives on my machine/example another 15-20% performance
> increase (datasources_open time going down from ~21 s to ~17 s).
>
> We should probably do the same for suffix scanning, but I'm too tired
> right now :-) (and suffix scanning is more difficult to grok, so not a
> good idea to do at 3 am).

Don't get yourself burned out ;)

-- Stefan^2.
Re: My take on the diff-optimizations-bytes branch
On Tue, Jan 25, 2011 at 1:37 AM, Stefan Fuhrmann <eq...@web.de> wrote:
> [ ... snip ...]
>
> And, as promised, here some ideas how to get more speed from the
> generic code. Your latest commit:
>
> +#if SVN_UNALIGNED_ACCESS_IS_OK
> +
> +  /* Skip quickly over the stuff between EOLs. */
> +  for (i = 0, can_read_word = TRUE; i < file_len; i++)
> +    can_read_word = can_read_word
> +                    && (file[i].curp + sizeof(apr_size_t) < file[i].endp);
> +  while (can_read_word)
> +    {
> +      for (i = 1, is_match = TRUE; i < file_len; i++)
> +        is_match = is_match
> +                   && (   *(const apr_size_t *)file[0].curp
> +                       == *(const apr_size_t *)file[i].curp);
> +
> +      if (!is_match || contains_eol(*(const apr_size_t *)file[0].curp))
> +        break;
> +
> +      for (i = 0; i < file_len; i++)
> +        file[i].curp += sizeof(apr_size_t);
> +      for (i = 0, can_read_word = TRUE; i < file_len; i++)
> +        can_read_word = can_read_word
> +                        && (file[i].curp + sizeof(apr_size_t) < file[i].endp);
> +    }
> +
> +#endif
>
> could be changed to something like the following. Please note that I
> haven't tested any of this:

Thanks. There was one error in your suggestion, which I found out after
testing. See below.

> /* Determine how far we may advance with chunky ops without reaching
>  * endp for any of the files.
>  * Signedness is important here if curp gets close to endp. */
> apr_ssize_t max_delta = file[0].endp - file[0].curp - sizeof(apr_size_t);
> for (i = 1; i < file_len; i++)
>   {
>     apr_ssize_t delta = file[i].endp - file[i].curp - sizeof(apr_size_t);
>     if (delta < max_delta)
>       max_delta = delta;
>   }
>
> /* the former while() loop */
> is_match = TRUE;
> for (delta = 0; delta < max_delta && is_match; delta += sizeof(apr_size_t))
>   {
>     apr_size_t chunk = *(const apr_size_t *)(file[0].curp + delta);
>     if (contains_eol(chunk))
>       break;
>
>     for (i = 1; i < file_len; i++)
>       if (chunk != *(const apr_size_t *)(file[i].curp + delta))
>         {
>           is_match = FALSE;

Here, I inserted:

          delta -= sizeof(apr_size_t);

because otherwise, delta will be increased too far (it will still be
increased by the counting expression of the outer for-loop (after which
it will stop because of !is_match)).
Maybe there is a cleaner/clearer way to break out of the outer for-loop
here, without incrementing delta again, but for now, I've committed it
with this change (r1063565).

>           break;
>         }
>   }
>
> /* We either found a mismatch or an EOL at or shortly behind curp+delta
>  * or we cannot proceed with chunky ops without exceeding endp.
>  * In any way, everything up to curp + delta is equal and not an EOL. */
> for (i = 0; i < file_len; i++)
>   file[i].curp += delta;

Thanks. This gives on my machine/example another 15-20% performance
increase (datasources_open time going down from ~21 s to ~17 s).

We should probably do the same for suffix scanning, but I'm too tired
right now :-) (and suffix scanning is more difficult to grok, so not a
good idea to do at 3 am).

Cheers,
-- Johan
Re: My take on the diff-optimizations-bytes branch
On 23.01.2011 22:46, Johan Corveleyn wrote:
> On Sat, Jan 8, 2011 at 6:50 AM, Stefan Fuhrmann
> <stefanfuhrm...@alice-dsl.de> wrote:
>> On 03.01.2011 02:14, Johan Corveleyn wrote:
>>> It would be interesting to see where the biggest gains are coming
>>> from (I'm guessing from the per-machine-word reading/comparing; I'd
>>> like to try that first, maybe together with the allocating of the
>>> file[] on the stack; I'd like to avoid special-casing file_len==2,
>>> splitting up functions in *_slow and *_fast, because it makes it a
>>> lot more difficult imho (so I'd only like to do that if it's a
>>> significant contributor)). But if you want to leave that as an
>>> exercise for the reader, that's fine as well :-).
>>
>> Exercise is certainly not a bad thing ;)
>>
>> But I think, the stack variable is certainly helpful and easy to do.
>> While chunky operation gives a *big* boost, it is much more difficult
>> to code if you need to compare multiple sources. It should be
>> possible, though.
>
> Ok, I finished the exercise :-).
>
> - File info structures on stack instead of on heap: committed in
>   r1060625. Gives 10% boost.
> - chunky operation for multiple sources: committed in r1062532.
>   Gives ~250% boost on my machine.

On 64 bit machines, it should give another 50 to 100% throughput.
I think that kind of speed-up in an important function justifies the
somewhat unfavorable #if sections, mask voodoo etc.

> For suffix scanning, I made a couple of changes to the logic you sent
> in your patch (as you remember, there was a problem with the
> implementation of suffix scanning [1]). It was (in scan_backward_fast):
>
> [[[
> +#if SVN_UNALIGNED_ACCESS_IS_OK
> +
> +  if (   (file[0].curp - sizeof(apr_size_t) < file[0].curp)
> +      && (file[1].curp - sizeof(apr_size_t) < file[1].curp))

Args! Copy'n'pasto.
> +    {
> +      do
> +        {
> +          file[0].curp -= sizeof(apr_size_t);
> +          file[1].curp -= sizeof(apr_size_t);
> +        }
> +      while (   (file[0].curp >= min_curp0)
> +             && (file[1].curp >= min_curp1)
> +             && (   *(const apr_size_t*)(file[0].curp + 1)
> +                 == *(const apr_size_t*)(file[1].curp + 1)));
> +
> +      file[0].curp += sizeof(apr_size_t);
> +      file[1].curp += sizeof(apr_size_t);
> +    }
> ]]]

[ ... snip ... ]

> Now, you suggested in a previous post in this thread that hardcoding
> file_len to 2 should speed it up with up to 50%. And you are correct
> (on my machine, it's sped up by ~40% if I replace all file_len's in
> find_identical_* with 2).
>
> So, that leads me to conclude that I should probably throw away the
> generic code, and replace it with 3 variants: find_identical_*2,
> find_identical_*3 and find_identical_*4 (once I write *2, the rest
> should be mostly copy-n-paste). As you said, this should also vastly
> simplify the code (drop the for loops, simpler conditionals, ...).

As discussed on IRC yesterday, we should first exploit further
optimization potential in the general code. And only after that is
maxed out, use that as a template for the specific implementations.

And, as promised, here some ideas how to get more speed from the
generic code. Your latest commit:

+#if SVN_UNALIGNED_ACCESS_IS_OK
+
+  /* Skip quickly over the stuff between EOLs. */
+  for (i = 0, can_read_word = TRUE; i < file_len; i++)
+    can_read_word = can_read_word
+                    && (file[i].curp + sizeof(apr_size_t) < file[i].endp);
+  while (can_read_word)
+    {
+      for (i = 1, is_match = TRUE; i < file_len; i++)
+        is_match = is_match
+                   && (   *(const apr_size_t *)file[0].curp
+                       == *(const apr_size_t *)file[i].curp);
+
+      if (!is_match || contains_eol(*(const apr_size_t *)file[0].curp))
+        break;
+
+      for (i = 0; i < file_len; i++)
+        file[i].curp += sizeof(apr_size_t);
+      for (i = 0, can_read_word = TRUE; i < file_len; i++)
+        can_read_word = can_read_word
+                        && (file[i].curp + sizeof(apr_size_t) < file[i].endp);
+    }
+
+#endif

could be changed to something like the following.
Please note that I haven't tested any of this:

/* Determine how far we may advance with chunky ops without reaching
 * endp for any of the files.
 * Signedness is important here if curp gets close to endp. */
apr_ssize_t max_delta = file[0].endp - file[0].curp - sizeof(apr_size_t);
for (i = 1; i < file_len; i++)
  {
    apr_ssize_t delta = file[i].endp - file[i].curp - sizeof(apr_size_t);
    if (delta < max_delta)
      max_delta = delta;
  }

/* the former while() loop */
is_match = TRUE;
for (delta = 0; delta < max_delta && is_match; delta += sizeof(apr_size_t))
  {
    apr_size_t chunk = *(const apr_size_t *)(file[0].curp + delta);
    if (contains_eol(chunk))
      break;

    for (i = 1; i < file_len; i++)
      if (chunk != *(const apr_size_t *)(file[i].curp + delta))
        {
          is_match = FALSE;
          break;
        }
  }

/* We either found a mismatch or an EOL at or shortly behind curp+delta
 * or we cannot proceed with chunky ops without exceeding endp.
 * In any way, everything up
Re: My take on the diff-optimizations-bytes branch
On Sat, Jan 8, 2011 at 6:50 AM, Stefan Fuhrmann
<stefanfuhrm...@alice-dsl.de> wrote:
> On 03.01.2011 02:14, Johan Corveleyn wrote:
>> It would be interesting to see where the biggest gains are coming from
>> (I'm guessing from the per-machine-word reading/comparing; I'd like to
>> try that first, maybe together with the allocating of the file[] on
>> the stack; I'd like to avoid special-casing file_len==2, splitting up
>> functions in *_slow and *_fast, because it makes it a lot more
>> difficult imho (so I'd only like to do that if it's a significant
>> contributor)). But if you want to leave that as an exercise for the
>> reader, that's fine as well :-).
>
> Exercise is certainly not a bad thing ;)
>
> But I think, the stack variable is certainly helpful and easy to do.
> While chunky operation gives a *big* boost, it is much more difficult
> to code if you need to compare multiple sources. It should be possible,
> though.

Ok, I finished the exercise :-).

- File info structures on stack instead of on heap: committed in
  r1060625. Gives 10% boost.
- chunky operation for multiple sources: committed in r1062532.
  Gives ~250% boost on my machine.

For suffix scanning, I made a couple of changes to the logic you sent in
your patch (as you remember, there was a problem with the implementation
of suffix scanning [1]).
It was (in scan_backward_fast):

[[[
+#if SVN_UNALIGNED_ACCESS_IS_OK
+
+  if (   (file[0].curp - sizeof(apr_size_t) < file[0].curp)
+      && (file[1].curp - sizeof(apr_size_t) < file[1].curp))
+    {
+      do
+        {
+          file[0].curp -= sizeof(apr_size_t);
+          file[1].curp -= sizeof(apr_size_t);
+        }
+      while (   (file[0].curp >= min_curp0)
+             && (file[1].curp >= min_curp1)
+             && (   *(const apr_size_t*)(file[0].curp + 1)
+                 == *(const apr_size_t*)(file[1].curp + 1)));
+
+      file[0].curp += sizeof(apr_size_t);
+      file[1].curp += sizeof(apr_size_t);
+    }
]]]

I changed this to (the N-file equivalent of):

[[[
+#if SVN_UNALIGNED_ACCESS_IS_OK
+
+  if (   (file[0].curp - sizeof(apr_size_t) >= min_curp0)
+      && (file[1].curp - sizeof(apr_size_t) >= min_curp1))
+    {
+      do
+        {
+          file[0].curp -= sizeof(apr_size_t);
+          file[1].curp -= sizeof(apr_size_t);
+        }
+      while (   (file[0].curp - sizeof(apr_size_t) >= min_curp0)
+             && (file[1].curp - sizeof(apr_size_t) >= min_curp1)
+             && (   *(const apr_size_t*)(file[0].curp + 1)
+                 == *(const apr_size_t*)(file[1].curp + 1)));
+
+      file[0].curp += sizeof(apr_size_t);
+      file[1].curp += sizeof(apr_size_t);
+    }
]]]

(So, I changed the if-comparisons before the do-while-loop, to make sure
there is room to subtract a sizeof(apr_size_t) (the original didn't
really make sense to me), and changed the comparisons with min_curpX in
the while condition (check if there is still room to subtract a
sizeof(apr_size_t)).) After that, all tests went fine.

Now, you suggested in a previous post in this thread that hardcoding
file_len to 2 should speed it up with up to 50%. And you are correct (on
my machine, it's sped up by ~40% if I replace all file_len's in
find_identical_* with 2).

So, that leads me to conclude that I should probably throw away the
generic code, and replace it with 3 variants: find_identical_*2,
find_identical_*3 and find_identical_*4 (once I write *2, the rest
should be mostly copy-n-paste). As you said, this should also vastly
simplify the code (drop the for loops, simpler conditionals, ...).
Cheers, -- Johan [1] http://svn.haxx.se/dev/archive-2011-01/0022.shtml
Re: My take on the diff-optimizations-bytes branch
On Sun, Jan 23, 2011 at 10:46 PM, Johan Corveleyn <jcor...@gmail.com> wrote:
> On Sat, Jan 8, 2011 at 6:50 AM, Stefan Fuhrmann
> <stefanfuhrm...@alice-dsl.de> wrote:
>> On 03.01.2011 02:14, Johan Corveleyn wrote:
>>> It would be interesting to see where the biggest gains are coming
>>> from (I'm guessing from the per-machine-word reading/comparing; I'd
>>> like to try that first, maybe together with the allocating of the
>>> file[] on the stack; I'd like to avoid special-casing file_len==2,
>>> splitting up functions in *_slow and *_fast, because it makes it a
>>> lot more difficult imho (so I'd only like to do that if it's a
>>> significant contributor)). But if you want to leave that as an
>>> exercise for the reader, that's fine as well :-).
>>
>> Exercise is certainly not a bad thing ;)
>>
>> But I think, the stack variable is certainly helpful and easy to do.
>> While chunky operation gives a *big* boost, it is much more difficult
>> to code if you need to compare multiple sources. It should be
>> possible, though.
>
> Ok, I finished the exercise :-).
>
> - File info structures on stack instead of on heap: committed in
>   r1060625. Gives 10% boost.
> - chunky operation for multiple sources: committed in r1062532.
>   Gives ~250% boost on my machine.

Oops, as some people pointed out on IRC, that should be ~60% (I meant
2.5 times faster, going from ~50s to ~20s, so that's about ~60% less
time spent in datasources_open).

-- Johan
Re: My take on the diff-optimizations-bytes branch
On Wed, Jan 19, 2011 at 1:19 AM, Stefan Fuhrmann
<stefanfuhrm...@alice-dsl.de> wrote:
> On 18.01.2011 12:56, Johan Corveleyn wrote:
>> On Mon, Jan 17, 2011 at 3:12 AM, Johan Corveleyn <jcor...@gmail.com> wrote:
>>> On Sat, Jan 8, 2011 at 6:50 AM, Stefan Fuhrmann
>>> <stefanfuhrm...@alice-dsl.de> wrote:
>>>> [ ... snip ... ]
>>>>
>>>> But I think, the stack variable is certainly helpful and easy to do.
>>>
>>> Ok, I've done this (locally, still have to clean up a little and then
>>> commit). It gives a nice 10% speedup of prefix/suffix scanning, which
>>> is great.
>>>
>>> I have a couple of small questions though, about other small
>>> optimizations you did:
>>>
>>> Index: subversion/libsvn_diff/diff_file.c
>>> ===================================================================
>>> --- subversion/libsvn_diff/diff_file.c  (revision 1054044)
>>> +++ subversion/libsvn_diff/diff_file.c  (working copy)
>>> ...
>>> @@ -258,10 +259,10 @@
>>>                                                              \
>>>    for (i = 0; i < files_len; i++)                           \
>>>      {                                                       \
>>> -      if (all_files[i]->curp < all_files[i]->endp - 1)      \
>>> -        all_files[i]->curp++;                               \
>>> +      if (all_files[i].curp + 1 < all_files[i].endp)        \
>>> +        all_files[i].curp++;                                \
>>>
>>> Just curious: why did you change that condition (from 'X < Y - 1' to
>>> 'X + 1 < Y')? You mention in your log message: "minor re-arragements
>>> in arithmetic expressions to maximize reuse of results (e.g. in
>>> INCREMENT_POINTERS)".
>>>
>>> Why does this help, and what's the potential impact?
>
> There is a hidden common sub-expression. Variant 1 in pseudo-code:
>
> lhs = all_files[i].curp;
> temp1 = all_files[i].endp;
> rhs = temp1 - 1;
> if (lhs < rhs)
>   {
>     temp2 = lhs + 1;
>     all_files[i].curp = temp2;
>   }
>
> Variant 2:
>
> lhs = all_files[i].curp;
> temp = lhs + 1;
> rhs = all_files[i].endp;
> if (temp < rhs)
>   all_files[i].curp = temp;
>
> The problem is that the compiler cannot figure that out on its own
> because (x+1) and (y-1) don't overflow / wrap around for the same
> values. In particular 0 < (0-1) is true and (0+1) < 0 is false in
> unsigned arithmetic.

Thanks for the explanation. So, I understand why this should be more
efficient, except ... that it's not, for some reason. At least not after
testing it on my x86 machine.
It's about 1% slower, tested with two test files generated with the
gen-big-files.py script (sent in the other thread), with 1,000,000
prefix and suffix lines, and 500 mismatching lines in the middle:

$ ./gen-big-files.py -1 big1.txt -2 big2.txt -p 100 -s 100

So for now, I didn't commit this change. Maybe it's better on other
platforms, so it may still make sense to do it, but then again, at
around 1% it's probably not that important ...

>>> Second question:
>>>
>>> +find_identical_prefix_slow(svn_boolean_t *reached_one_eof,
>>> +                           apr_off_t *prefix_lines,
>>> +                           struct file_info file[], apr_size_t file_len,
>>> +                           apr_pool_t *pool)
>>> +{
>>> +  svn_boolean_t had_cr = FALSE;
>>>    svn_boolean_t is_match, reached_all_eof;
>>>    apr_size_t i;
>>> +  apr_off_t lines = 0;
>>>
>>>    *prefix_lines = 0;
>>>    for (i = 1, is_match = TRUE; i < file_len; i++)
>>> -    is_match = is_match && *file[0]->curp == *file[i]->curp;
>>> +    is_match = is_match && *file[0].curp == *file[i].curp;
>>> +
>>>    while (is_match)
>>>      {
>>>        /* ### TODO: see if we can take advantage of diff options like
>>>           ignore_eol_style or ignore_space. */
>>>        /* check for eol, and count */
>>> -      if (*file[0]->curp == '\r')
>>> +      if (*file[0].curp == '\r')
>>>          {
>>> -          (*prefix_lines)++;
>>> +          lines++;
>>>
>>> Here you added a local variable 'lines', which is incremented in the
>>> function, and copied to *prefix_lines at the end (you do the same in
>>> find_identical_prefix_fast). The original code just sets and
>>> increments *prefix_lines directly.
>>>
>>> So, out of curiosity: why does this help, and how much?
>
> Incrementing the temporary variable requires one indirection less than
> the original code and uses only one instead of two registers (in
> typical cases). The 'how much?' part depends on compiler / optimizer
> and even more on the target platform (x86 has very few registers to
> keep data in flight; x64 has about twice as many).

Ok, this apparently sped it up for around 1% on my machine. So I
committed this in r1062275. Thanks.

Cheers,
-- Johan
Re: My take on the diff-optimizations-bytes branch
On Mon, Jan 17, 2011 at 3:12 AM, Johan Corveleyn <jcor...@gmail.com> wrote:
> On Sat, Jan 8, 2011 at 6:50 AM, Stefan Fuhrmann
> <stefanfuhrm...@alice-dsl.de> wrote:
>> [ ... snip ... ]
>>
>> But I think, the stack variable is certainly helpful and easy to do.
>
> Ok, I've done this (locally, still have to clean up a little and then
> commit). It gives a nice 10% speedup of prefix/suffix scanning, which
> is great.
>
> I have a couple of small questions though, about other small
> optimizations you did:
>
> Index: subversion/libsvn_diff/diff_file.c
> ===================================================================
> --- subversion/libsvn_diff/diff_file.c  (revision 1054044)
> +++ subversion/libsvn_diff/diff_file.c  (working copy)
> ...
> @@ -258,10 +259,10 @@
>                                                              \
>    for (i = 0; i < files_len; i++)                           \
>      {                                                       \
> -      if (all_files[i]->curp < all_files[i]->endp - 1)      \
> -        all_files[i]->curp++;                               \
> +      if (all_files[i].curp + 1 < all_files[i].endp)        \
> +        all_files[i].curp++;                                \
>
> Just curious: why did you change that condition (from 'X < Y - 1' to
> 'X + 1 < Y')? You mention in your log message: "minor re-arragements in
> arithmetic expressions to maximize reuse of results (e.g. in
> INCREMENT_POINTERS)".
>
> Why does this help, and what's the potential impact?
>
> Second question:
>
> +find_identical_prefix_slow(svn_boolean_t *reached_one_eof,
> +                           apr_off_t *prefix_lines,
> +                           struct file_info file[], apr_size_t file_len,
> +                           apr_pool_t *pool)
> +{
> +  svn_boolean_t had_cr = FALSE;
>    svn_boolean_t is_match, reached_all_eof;
>    apr_size_t i;
> +  apr_off_t lines = 0;
>
>    *prefix_lines = 0;
>    for (i = 1, is_match = TRUE; i < file_len; i++)
> -    is_match = is_match && *file[0]->curp == *file[i]->curp;
> +    is_match = is_match && *file[0].curp == *file[i].curp;
> +
>    while (is_match)
>      {
>        /* ### TODO: see if we can take advantage of diff options like
>           ignore_eol_style or ignore_space. */
>        /* check for eol, and count */
> -      if (*file[0]->curp == '\r')
> +      if (*file[0].curp == '\r')
>          {
> -          (*prefix_lines)++;
> +          lines++;
>
> Here you added a local variable 'lines', which is incremented in the
> function, and copied to *prefix_lines at the end (you do the same in
> find_identical_prefix_fast).
> The original code just sets and increments *prefix_lines directly.
>
> So, out of curiosity: why does this help, and how much?

Thanks,
-- Johan
Re: My take on the diff-optimizations-bytes branch
On 18.01.2011 12:56, Johan Corveleyn wrote:
> On Mon, Jan 17, 2011 at 3:12 AM, Johan Corveleyn <jcor...@gmail.com> wrote:
>> On Sat, Jan 8, 2011 at 6:50 AM, Stefan Fuhrmann
>> <stefanfuhrm...@alice-dsl.de> wrote:
>>> [ ... snip ... ]
>>>
>>> But I think, the stack variable is certainly helpful and easy to do.
>>
>> Ok, I've done this (locally, still have to clean up a little and then
>> commit). It gives a nice 10% speedup of prefix/suffix scanning, which
>> is great.
>>
>> I have a couple of small questions though, about other small
>> optimizations you did:
>>
>> Index: subversion/libsvn_diff/diff_file.c
>> ===================================================================
>> --- subversion/libsvn_diff/diff_file.c  (revision 1054044)
>> +++ subversion/libsvn_diff/diff_file.c  (working copy)
>> ...
>> @@ -258,10 +259,10 @@
>>                                                             \
>>    for (i = 0; i < files_len; i++)                          \
>>      {                                                      \
>> -      if (all_files[i]->curp < all_files[i]->endp - 1)     \
>> -        all_files[i]->curp++;                              \
>> +      if (all_files[i].curp + 1 < all_files[i].endp)       \
>> +        all_files[i].curp++;                               \
>>
>> Just curious: why did you change that condition (from 'X < Y - 1' to
>> 'X + 1 < Y')? You mention in your log message: "minor re-arragements
>> in arithmetic expressions to maximize reuse of results (e.g. in
>> INCREMENT_POINTERS)".
>>
>> Why does this help, and what's the potential impact?

There is a hidden common sub-expression. Variant 1 in pseudo-code:

lhs = all_files[i].curp;
temp1 = all_files[i].endp;
rhs = temp1 - 1;
if (lhs < rhs)
  {
    temp2 = lhs + 1;
    all_files[i].curp = temp2;
  }

Variant 2:

lhs = all_files[i].curp;
temp = lhs + 1;
rhs = all_files[i].endp;
if (temp < rhs)
  all_files[i].curp = temp;

The problem is that the compiler cannot figure that out on its own
because (x+1) and (y-1) don't overflow / wrap around for the same
values. In particular 0 < (0-1) is true and (0+1) < 0 is false in
unsigned arithmetic.
>> Second question:
>>
>> +find_identical_prefix_slow(svn_boolean_t *reached_one_eof,
>> +                           apr_off_t *prefix_lines,
>> +                           struct file_info file[], apr_size_t file_len,
>> +                           apr_pool_t *pool)
>> +{
>> +  svn_boolean_t had_cr = FALSE;
>>    svn_boolean_t is_match, reached_all_eof;
>>    apr_size_t i;
>> +  apr_off_t lines = 0;
>>
>>    *prefix_lines = 0;
>>    for (i = 1, is_match = TRUE; i < file_len; i++)
>> -    is_match = is_match && *file[0]->curp == *file[i]->curp;
>> +    is_match = is_match && *file[0].curp == *file[i].curp;
>> +
>>    while (is_match)
>>      {
>>        /* ### TODO: see if we can take advantage of diff options like
>>           ignore_eol_style or ignore_space. */
>>        /* check for eol, and count */
>> -      if (*file[0]->curp == '\r')
>> +      if (*file[0].curp == '\r')
>>          {
>> -          (*prefix_lines)++;
>> +          lines++;
>>
>> Here you added a local variable 'lines', which is incremented in the
>> function, and copied to *prefix_lines at the end (you do the same in
>> find_identical_prefix_fast). The original code just sets and
>> increments *prefix_lines directly.
>>
>> So, out of curiosity: why does this help, and how much?

Incrementing the temporary variable requires one indirection less than
the original code and uses only one instead of two registers (in typical
cases). The 'how much?' part depends on compiler / optimizer and even
more on the target platform (x86 has very few registers to keep data in
flight; x64 has about twice as many).

So, both changes are more or less general coding patterns that will
improve performance somewhat without compromising readability.

-- Stefan^2.
Re: My take on the diff-optimizations-bytes branch
On Sat, Jan 8, 2011 at 6:50 AM, Stefan Fuhrmann
<stefanfuhrm...@alice-dsl.de> wrote:
> On 03.01.2011 02:14, Johan Corveleyn wrote:
>> On Mon, Jan 3, 2011 at 12:13 AM, Johan Corveleyn <jcor...@gmail.com> wrote:
>>> On Sun, Jan 2, 2011 at 10:04 PM, Johan Corveleyn <jcor...@gmail.com> wrote:
>>>> For now, some feedback on the rest of the patch:
>>>>
>>>> [[[
>>>> -      DECREMENT_POINTERS(file, file_len, pool);
>>>> -    }
>>>> +  DECREMENT_POINTERS(file, file_len, pool);
>>>>    }
>>>>
>>>> +  return SVN_NO_ERROR;
>>>> +}
>>>> +
>>>> +/* Find the prefix which is identical between all elements of the FILE array.
>>>> + * Return the number of prefix lines in PREFIX_LINES. REACHED_ONE_EOF will be
>>>> + * set to TRUE if one of the FILEs reached its end while scanning prefix,
>>>> + * i.e. at least one file consisted entirely of prefix. Otherwise,
>>>> + * REACHED_ONE_EOF is set to FALSE.
>>>> + *
>>>> + * After this function is finished, the buffers, chunks, curp's and endp's
>>>> + * of the FILEs are set to point at the first byte after the prefix. */
>>>> +static svn_error_t *
>>>> +find_identical_prefix(svn_boolean_t *reached_one_eof, apr_off_t *prefix_lines,
>>>> +                      struct file_info file[], apr_size_t file_len,
>>>> +                      apr_pool_t *pool)
>>>> +{
>>>> +  if (file_len == 2)
>>>> +    SVN_ERR(find_identical_prefix_fast(reached_one_eof, prefix_lines,
>>>> +                                       file, pool));
>>>> +  else
>>>> +    SVN_ERR(find_identical_prefix_slow(reached_one_eof, prefix_lines,
>>>> +                                       file, file_len, pool));
>>>> +
>>>> ]]]
>>>>
>>>> Small problem here (not really visible in practice, but potential
>>>> change in behavior none the less): if either of the above function
>>>> calls exited early because of reached_all_eof, this function should
>>>> also exit early.
>
> Yep, that is just an artifact caused by moving code into a sub-function.

Ok, that's no problem anymore, since I removed the early-exit in
r1057435. It was actually not correct, and was the main reason why some
tests kept failing when I ported it to diff3 (see that revision's commit
message for more details).

>>>> [ ... snip ... ]
>>>>
>>>> Also, it would really help if you could split up this patch (though
>>>> this was certainly a good way to try it out and give me an idea of
>>>> the possibilities).
>
> Giving you some ideas was the main point of it. If I find some time, I
> will try to reduce the patch (no fast / slow separation). But since the
> heap to stack conversion spreads across all of the code, it is hard to
> split the patch.
>
>> It would be interesting to see where the biggest gains are coming from
>> (I'm guessing from the per-machine-word reading/comparing; I'd like to
>> try that first, maybe together with the allocating of the file[] on
>> the stack; I'd like to avoid special-casing file_len==2, splitting up
>> functions in *_slow and *_fast, because it makes it a lot more
>> difficult imho (so I'd only like to do that if it's a significant
>> contributor)). But if you want to leave that as an exercise for the
>> reader, that's fine as well :-).
>
> Exercise is certainly not a bad thing ;)

Ok, I'll give it a go one of the coming days, now that I've got
diff3/diff4 working, and I've got a suitable big-files-generating-script
for creating reproducible test-files.

> But I think, the stack variable is certainly helpful and easy to do.
> While chunky operation gives a *big* boost, it is much more difficult
> to code if you need to compare multiple sources. It should be possible,
> though.
>
> The great advantage of the file_len==2 case is that this is the only
> time-critical operation (a true merge over long files is rare). Also,
> many constructs collapse if the count is fixed to 2:
>
> for (i = 1, is_match = TRUE; i < file_len; i++)
>   is_match = is_match && *file[0].curp == *file[i].curp;
> while (is_match)
>   {
>     ...
>     for (i = 1, is_match = TRUE; i < file_len; i++)
>       is_match = is_match && *file[0].curp == *file[i].curp;
>   }
>
> becomes
>
> while (*file[0].curp == *file[1].curp)
>   {
>     ...
>   }

Hm, that's true, but I'm not sure that will have big impact (but like
you say, we'll need to do more tests).
And now that I've read about a really bad merge-case on the users-list
[1], I'm thinking that diff3 should probably also get all the
perf-improvement that we can give it.

Actually, what's a little bit troubling is that there are currently only
3 possible file_len's, of which only 2 are used in practice: diff2 and
diff3 (diff4 is not used in svn core, only in tools/diff/diff4). So, if
we make a slow and a fast version, we could just as well make a XXX2 and
an XXX3 version explicitly, and dispense with the need to pass arrays
and array lengths.

> Hmmm... generic (but more complex) code vs. a couple of specialized
> versions. But basically, we will need to run some more tests.
>
> -- Stefan^2.

Cheers,
-- Johan

[1] http://svn.haxx.se/users/archive-2011-01/0245.shtml
Re: My take on the diff-optimizations-bytes branch
On 03.01.2011 13:16, Johan Corveleyn wrote:
> On Mon, Jan 3, 2011 at 12:03 PM, Stefan Fuhrmann
> <stefanfuhrm...@alice-dsl.de> wrote:
>> On 03.01.2011 02:14, Johan Corveleyn wrote:
>>> On Mon, Jan 3, 2011 at 12:13 AM, Johan Corveleyn <jcor...@gmail.com> wrote:
>>> [... snip ...]
>>>
>>> And it's fast too! It's taking only 58 seconds in diff, vs. 72 for
>>> the normal version. Splitting that up further, it's taking only 28
>>> seconds in prefix/suffix scanning, vs. ~47 for the normal version
>>> (for some reason, the lcs processing took a little longer, but that's
>>> probably unrelated (only did a single run)). And that's with
>>> scan_backward_slow. That's pretty amazing.
>>
>> On x64, datasource_open() got about 10x faster but my test-case was
>> the TSVN changelog where we (mainly) insert at the beginning instead
>> of close to end of the file. So, one broken scan direction seems
>> completely plausible ;)
>
> Strange, if you mainly insert at the beginning, it should be mostly
> broken, because then it's mostly identical suffix (the part which is
> broken). But maybe it's only broken under certain edge conditions, so
> that could still explain it...

It has also been a relatively short file (100k) and therefore, maybe,
only a single hunk.

> Maybe it's best to refer to datasource*s*_open(), as opposed to the
> original datasource_open(). In the original diff algorithm, datasources
> were opened each one by one (and then extracting tokens, inserting them
> into token tree, and then move on to the next datasource). In the
> diff-optimizations-bytes branch, I introduced datasource*s*_open, to
> open them together, in order to chop off the identical prefix/suffix.
>
> It's probably not a good function name, because it's so similar to the
> old one (maybe datasource_open_all or something would be better). But
> OTOH: when I started with this, I thought I could remove/deprecate
> datasource_open, once all diffN algorithms would use
> datasource*s*_open. Oh well ...
I didn't even notice that there were two such functions and simply
worked on the one that consumed a lot of CPU time ;)

>> The code of scan_backward_fast is pretty difficult to understand for
>> me. So I'm not sure if I can spot the problem. I'll continue staring
>> at it for a little while ...
>
> Ok, I can't find it right now. There must be some functional difference
> between scan_backward_fast and scan_backward_slow.

>> One way to get closer to the problem:
>> * create a backup of (rename) the current scan_backward_fast code
>> * copy scan_backward_slow to scan_backward_fast
>> * replace all file_len with a hard-coded "2"
>>   - can give a 50% boost depending on optimizer cleverness
>> * step-by-step apply changes from the old scan_backward_fast code
>
> Ok, if I have some time tonight, I will try to get closer ...
>
> For now, some feedback on the rest of the patch:
>
> [[[
> [... snip ...]
> +  if (had_cr)
> +    {
> +      /* Check if we ended in the middle of a \r\n for one file, but \r for
> +         another. If so, back up one byte, so the next loop will back up
> +         the entire line. Also decrement *prefix_lines, since we counted one
> +         too many for the \r. */
> +      if ((*file[0].curp == '\n') || (*file[1].curp == '\n'))
>
> Here (or below DECREMENT, but within this same if condition) you need:
>
>   (*prefix_lines)--;
>
> just like in the original (and *_slow) case. As is explained in the
> comment above: if we are here, we actually counted a full prefix line
> too much, because we already counted one for a matching '\r', but the
> character after that was '\n' in one file, but a different character in
> the other. So the eol-termination is different, so this is a
> non-matching line which needs to be fully rolled back (hence also the
> extra DECREMENT here).

You are right. I had some intermediate code in place where the increment
would happen later (thus sparing the extra decrement).

I'm not even sure the original code is correct at all. For instance, it
does not handle '\r' (Mac style) EOLs properly. Not really sure how to
fix that.
Are you sure it's not correct for Mac style EOLs? Can you give an example? You are correct, the extra check for '\n' before decrementing the line count fixes up the line-count for WIN EOLs only. find_prefix (a\rb, a\rc) - prefix_lines = 1, had_cr = TRUE find_prefix (a\nb, a\nc) - prefix_lines = 1, had_cr = FALSE The other case that I was thinking about seems to be unsupported by SVN anyways: find_prefix (a\r\nb, a\rc) - prefix_lines = 0, had_cr = TRUE find_prefix (a\n\rb, a\nc) - prefix_lines = 1, had_cr = FALSE So, you are right, I'm wrong ;) I thought I made it work correctly with '\r' EOLs (the very first iterations of this (when I was still sending it as patches to the dev-list) only worked for \r\n and \n, but somewhere along the way, I made it handle \r EOLs as well). That's what the entire block below /* check for eol, and count */ is for: - if \r: set had_cr = TRUE, count an extra prefix_line. - if \n: only count an additional prefix_line if the had_cr == FALSE, because otherwise it's the second half of a \r\n EOL,
Re: My take on the diff-optimizations-bytes branch
On 03.01.2011 02:14, Johan Corveleyn wrote: On Mon, Jan 3, 2011 at 12:13 AM, Johan Corveleyn jcor...@gmail.com wrote: On Sun, Jan 2, 2011 at 10:04 PM, Johan Corveleyn jcor...@gmail.com wrote: On Sat, Jan 1, 2011 at 11:25 PM, Stefan Fuhrmann stefanfuhrm...@alice-dsl.de wrote: Hi Johan, Thursday night I did something stupid and had a look at how svn blame could be made faster based on the HEAD code in your branch. One night and most of the following day later, I think I made it a few percent faster now. Some of the code I committed directly to /trunk and you need to pull these changes into your branch to compile the attached patch. Feel free to test it, take it apart and morph it any way you like -- I know the patch isn't very pretty in the first place. I tested the patch on x64 LINUX and would like to hear whether it at least somewhat improved performance on your system for your svn blame config.xml use-case. -- Stefan^2. [[[ Speed-up of datasource_open() and its sub-routines by a series of optimizations: * allocate the file[] array on stack instead of heap (all members become directly addressable through the stack pointer because all static sub-functions will actually be in-lined) * minor re-arrangements in arithmetic expressions to maximize reuse of results (e.g. in INCREMENT_POINTERS) * move hot spots to separate functions and provide a specialized version for file_len == 2 (many loops collapse to a single execution, other values can be hard-coded, etc.) * use separate loops to process data within a chunk so we don't need to call INCREMENT_POINTERS and friends that frequently * scan / compare strings in machine-word granularity instead of bytes ]]] Hi Stefan, Thanks for taking a look. I really appreciate it. When I saw your first couple of commits, to the adler32 stuff, I already thought: hmmm, he's up to something :-).
And after I saw your change to eol.c#svn_eol__find_eol_start, reading per machine-word, I was thinking: hey, I could really use something like that in the prefix/suffix scanning. Nice ... :-) (I had some trouble applying your patch. It doesn't apply cleanly anymore to HEAD of the branch (but most hunks were applied correctly, and I could manually change the rest, so no problem)). However, first tests are not so great. In fact, it's about 25% slower (when blaming my settings.xml file with 2200 revisions, it's spending about 90 seconds in diff, vs. around 72 with HEAD of diff-optimizations-bytes). Analyzing further, I can see it's spending significantly less time in prefix/suffix scanning, but more time in token.c#svn_diff__get_tokens (which is where the original algorithm gets the tokens/lines and inserts them into the token tree). This tells me it's not working correctly: either prefix/suffix scanning fails too soon, so there's much more left for the regular algorithm. Or it's just not functioning correctly. Looking at the result of the blame operation, it seems it's the latter. The final result of the blame is not correct anymore. I'll try to look some more into it, to see what's going wrong. Maybe you can also see it with a simple diff of a large file ... (for a good example in svn's own codebase, you could try subversion/tests/cmdline/merge-tests.py (pretty large, and almost 700 revisions)). Ok, by eliminating parts of the new stuff, I found out the problem is in scan_backward_fast (part of suffix scanning). If I bypass that, and always take scan_backward_slow, it works. And it's fast too! It's taking only 58 seconds in diff, vs. 72 for the normal version. Splitting that up further, it's taking only 28 seconds in prefix/suffix scanning, vs. ~47 for the normal version (for some reason, the lcs processing took a little longer, but that's probably unrelated (only did a single run)). And that's with scan_backward_slow. That's pretty amazing. 
The code of scan_backward_fast is pretty difficult to understand for me. So I'm not sure if I can spot the problem. I'll continue staring at it for a little while ... Ok, I can't find it right now. There must be some functional difference between scan_backward_fast and scan_backward_slow. For now, some feedback on the rest of the patch: [[[ - DECREMENT_POINTERS(file, file_len, pool); -} +DECREMENT_POINTERS(file, file_len, pool); } + return SVN_NO_ERROR; +} + +/* Find the prefix which is identical between all elements of the FILE array. + * Return the number of prefix lines in PREFIX_LINES. REACHED_ONE_EOF will be + * set to TRUE if one of the FILEs reached its end while scanning prefix, + * i.e. at least one file consisted entirely of prefix. Otherwise, + * REACHED_ONE_EOF is set to FALSE. + * + * After this function is finished, the buffers, chunks, curp's and endp's + * of the FILEs are set to point at the first byte after the prefix. */ +static svn_error_t * +find_identical_prefix(svn_boolean_t *reached_one_eof, apr_off_t *prefix_lines, + struct
Re: My take on the diff-optimizations-bytes branch
[... snip ...]
On x64, datasource_open() got about 10x faster but my test-case was the TSVN changelog where we (mainly) insert at the beginning instead of close to end of the file. So, one broken scan direction seems completely plausible ;) The code of scan_backward_fast is pretty difficult to understand for me. So I'm not sure if I can spot the problem. I'll continue staring at it for a little while ... Ok, I can't find it right now. There must be some functional difference between scan_backward_fast and scan_backward_slow. One way to get closer to the problem: * create a backup of (i.e. rename) the current scan_backward_fast code * copy scan_backward_slow to scan_backward_fast * replace all file_len with a hard-coded 2 (can give a 50% boost depending on optimizer cleverness) * step-by-step apply changes from the old scan_backward_fast code For now, some feedback on the rest of the patch: [[[ Index: subversion/libsvn_diff/diff_file.c =================================================================== --- subversion/libsvn_diff/diff_file.c (revision 1054044) +++ subversion/libsvn_diff/diff_file.c (working copy) ...
Re: My take on the diff-optimizations-bytes branch
On Mon, Jan 3, 2011 at 12:03 PM, Stefan Fuhrmann stefanfuhrm...@alice-dsl.de wrote: On 03.01.2011 02:14, Johan Corveleyn wrote: On Mon, Jan 3, 2011 at 12:13 AM, Johan Corveleynjcor...@gmail.com wrote: [... snip ...] And it's fast too! It's taking only 58 seconds in diff, vs. 72 for the normal version. Splitting that up further, it's taking only 28 seconds in prefix/suffix scanning, vs. ~47 for the normal version (for some reason, the lcs processing took a little longer, but that's probably unrelated (only did a single run)). And that's with scan_backward_slow. That's pretty amazing. On x64, datasource_open() got about 10x faster but my test-case was the TSVN changelog where we (mainly) insert at the beginning instead of close to end of the file. So, one broken scan direction seems completely plausible ;) Strange, if you mainly insert at the beginning, it should be mostly broken, because then it's mostly identical suffix (the part which is broken). But maybe it's only broken under certain edge conditions, so that could still explain it... Maybe it's best to refer to datasource*s*_open(), as opposed to the original datasource_open(). In the original diff algorithm, datasources were opened each one by one (and then extracting tokens, inserting them into token tree, and then move on to the next datasource). In diff-optimizations-bytes branch, I introduced datasource*s*_open, to open them together, in order to chop off the identical prefix/suffix. It's probably not a good function name, because it's so similar to the old one (maybe datasource_open_all or something would be better). But OTOH: when I started with this, I thought I could remove/deprecate the datasource_open, once all diffN algorithms would use datasource*s*_open. Oh well ... The code of scan_backward_fast is pretty difficult to understand for me. So I'm not sure if I can spot the problem. I'll continue staring at it for a little while ... Ok, I can't find it right now. 
There must be some functional difference between scan_backward_fast and scan_backward_slow. One way to get closer to the problem: * create a backup of (i.e. rename) the current scan_backward_fast code * copy scan_backward_slow to scan_backward_fast * replace all file_len with a hard-coded 2 (can give a 50% boost depending on optimizer cleverness) * step-by-step apply changes from the old scan_backward_fast code Ok, if I have some time tonight, I will try to get closer ... For now, some feedback on the rest of the patch: [[[ [... snip ...] + if (had_cr) + { + /* Check if we ended in the middle of a \r\n for one file, but \r for + another. If so, back up one byte, so the next loop will back up + the entire line. Also decrement *prefix_lines, since we counted one + too many for the \r. */ + if ((*file[0].curp == '\n') || (*file[1].curp == '\n')) Here (or below the DECREMENT, but within this same if condition) you need: (*prefix_lines)--; just like in the original (and *_slow) case. As is explained in the comment above: if we are here, we actually counted a full prefix line too much, because we already counted one for a matching '\r', but the character after that was '\n' in one file, but a different character in the other. So the eol-termination is different, so this is a non-matching line which needs to be fully rolled back (hence also the extra DECREMENT here). I'm not even sure the original code is correct at all. For instance, it does not handle '\r' (Mac style) EOLs properly. Not really sure how to fix that. Are you sure it's not correct for Mac style EOLs? Can you give an example? I thought I made it work correctly with '\r' EOLs (the very first iterations of this (when I was still sending it as patches to the dev-list) only worked for \r\n and \n, but somewhere along the way, I made it handle \r EOLs as well). That's what the entire block below /* check for eol, and count */ is for: - if \r: set had_cr = TRUE, count an extra prefix_line.
- if \n: only count an additional prefix_line if had_cr == FALSE, because otherwise it's the second half of a \r\n EOL, and we already counted it for the \r. - all the rest: don't care, just move on (setting had_cr = FALSE). - In the special case where prefix scanning ended with a mismatch between a '\r\n' and '\r<some other character>', decrement prefix_lines and go back one byte, because this line wasn't identical after all. (my older versions just counted the number of \n's, which is ok for both \n and \r\n). The only reason to look at EOLs is to count the number of prefix_lines (and to roll back the last non-matching line until the previous EOL, but that's done by going back until finding either a \r or an \n). All the rest is just comparing bytes. (note that it doesn't support ignoring EOL-style differences (see the comment /* ### TODO: see if we can take advantage of diff options like ignore_eol_style or ignore_space. */). Maybe
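The counting rules described above (count on '\r'; count on '\n' only when it is not the second half of a "\r\n" pair) can be sketched as a small stand-alone helper. This is a hypothetical illustration, not the actual diff_file.c code, which interleaves this logic with the byte-wise prefix comparison:

```c
#include <assert.h>
#include <stddef.h>

/* Count lines terminated by \r, \n or \r\n in BUF[0..LEN-1],
 * following the rules from the thread:
 *   - '\r': always counts a line, and remember it in had_cr
 *   - '\n': counts a line only if the previous byte was not '\r'
 *           (otherwise it is the second half of a "\r\n" EOL)
 *   - anything else: just move on, clearing had_cr. */
static int
count_eols(const char *buf, size_t len)
{
  int lines = 0;
  int had_cr = 0;
  size_t i;

  for (i = 0; i < len; i++)
    {
      if (buf[i] == '\r')
        {
          lines++;
          had_cr = 1;
        }
      else
        {
          if (buf[i] == '\n' && !had_cr)
            lines++;
          had_cr = 0;
        }
    }
  return lines;
}
```

The rollback special case (a '\r' matched but the following byte differed between the files) is exactly the situation where one counted line must be given back, which is what the extra `(*prefix_lines)--;` in the patch feedback is about.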
Re: My take on the diff-optimizations-bytes branch
On Sat, Jan 1, 2011 at 11:25 PM, Stefan Fuhrmann stefanfuhrm...@alice-dsl.de wrote: Hi Johan, Thursday night I did something stupid and had a look at how svn blame could be made faster based on the HEAD code in your branch. One night and most of the following day later, I think I made it a few percent faster now. Some of the code I committed directly to /trunk and you need to pull these changes into your branch to compile the attached patch. Feel free to test it, take it apart and morph it any way you like -- I know the patch isn't very pretty in the first place. I tested the patch on x64 LINUX and would like to hear whether it at least somewhat improved performance on your system for your svn blame config.xml use-case. -- Stefan^2. [[[ Speed-up of datasource_open() and its sub-routines by a series of optimizations: * allocate the file[] array on stack instead of heap (all members become directly addressable through the stack pointer because all static sub-functions will actually be in-lined) * minor re-arrangements in arithmetic expressions to maximize reuse of results (e.g. in INCREMENT_POINTERS) * move hot spots to separate functions and provide a specialized version for file_len == 2 (many loops collapse to a single execution, other values can be hard-coded, etc.) * use separate loops to process data within a chunk so we don't need to call INCREMENT_POINTERS and friends that frequently * scan / compare strings in machine-word granularity instead of bytes ]]] Hi Stefan, Thanks for taking a look. I really appreciate it. When I saw your first couple of commits, to the adler32 stuff, I already thought: hmmm, he's up to something :-). And after I saw your change to eol.c#svn_eol__find_eol_start, reading per machine-word, I was thinking: hey, I could really use something like that in the prefix/suffix scanning. Nice ... :-) (I had some trouble applying your patch.
It doesn't apply cleanly anymore to HEAD of the branch (but most hunks were applied correctly, and I could manually change the rest, so no problem)). However, first tests are not so great. In fact, it's about 25% slower (when blaming my settings.xml file with 2200 revisions, it's spending about 90 seconds in diff, vs. around 72 with HEAD of diff-optimizations-bytes). Analyzing further, I can see it's spending significantly less time in prefix/suffix scanning, but more time in token.c#svn_diff__get_tokens (which is where the original algorithm gets the tokens/lines and inserts them into the token tree). This tells me it's not working correctly: either prefix/suffix scanning fails too soon, so there's much more left for the regular algorithm. Or it's just not functioning correctly. Looking at the result of the blame operation, it seems it's the latter. The final result of the blame is not correct anymore. I'll try to look some more into it, to see what's going wrong. Maybe you can also see it with a simple diff of a large file ... (for a good example in svn's own codebase, you could try subversion/tests/cmdline/merge-tests.py (pretty large, and almost 700 revisions)). Cheers, -- Johan
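The last bullet of the quoted log message -- comparing in machine-word granularity instead of bytes -- can be illustrated with a small portable sketch. This is an assumption-laden stand-in, not the branch's actual code (which dereferences apr_size_t-sized words directly when unaligned access is known to be safe); using memcmp over word-sized steps avoids the unaligned-load question entirely:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return the length of the common prefix of A and B (both at least
 * LEN bytes long).  Advance a machine word at a time while whole
 * words match, then fall back to bytes to pin down the exact
 * mismatch position. */
static size_t
common_prefix_len(const char *a, const char *b, size_t len)
{
  size_t i = 0;

  /* Chunky part: whole-word compares. */
  while (i + sizeof(size_t) <= len
         && memcmp(a + i, b + i, sizeof(size_t)) == 0)
    i += sizeof(size_t);

  /* Byte-wise tail: locate the first differing byte (if any). */
  while (i < len && a[i] == b[i])
    i++;

  return i;
}
```

The win comes from the chunky loop doing one compare per word instead of per byte; the byte-wise tail is still needed because a mismatching word only tells you that a difference lies somewhere within it.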
Re: My take on the diff-optimizations-bytes branch
On Sun, Jan 2, 2011 at 10:04 PM, Johan Corveleyn jcor...@gmail.com wrote: [... snip ...]
[... snip ...] Ok, by eliminating parts of the new stuff, I found out the problem is in scan_backward_fast (part of suffix scanning). If I bypass that, and always take scan_backward_slow, it works. And it's fast too! It's taking only 58 seconds in diff, vs. 72 for the normal version. Splitting that up further, it's taking only 28 seconds in prefix/suffix scanning, vs. ~47 for the normal version (for some reason, the lcs processing took a little longer, but that's probably unrelated (only did a single run)). And that's with scan_backward_slow. That's pretty amazing. The code of scan_backward_fast is pretty difficult to understand for me. So I'm not sure if I can spot the problem. I'll continue staring at it for a little while ... Cheers, -- Johan
Re: My take on the diff-optimizations-bytes branch
On Mon, Jan 3, 2011 at 12:13 AM, Johan Corveleyn jcor...@gmail.com wrote: [... snip ...]
[... snip ...] The code of scan_backward_fast is pretty difficult to understand for me. So I'm not sure if I can spot the problem. I'll continue staring at it for a little while ... Ok, I can't find it right now.
There must be some functional difference between scan_backward_fast and scan_backward_slow. For now, some feedback on the rest of the patch: [[[ Index: subversion/libsvn_diff/diff_file.c =================================================================== --- subversion/libsvn_diff/diff_file.c (revision 1054044) +++ subversion/libsvn_diff/diff_file.c (working copy) ... snip ... @@ -363,53 +364,152 @@ /* Check whether one of the FILEs has its pointers at EOF (this is the case if * one of them has curp == endp (this can only happen at the last chunk)) */ static svn_boolean_t -is_one_at_eof(struct file_info *file[], apr_size_t file_len) +is_one_at_eof(struct file_info file[], apr_size_t file_len) { apr_size_t i; for (i = 0; i < file_len; i++) -if (file[i]->curp == file[i]->endp) +if (file[i].curp == file[i].endp) return TRUE; return FALSE;
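For reference, the hunk above boils down to the following stand-alone sketch (the struct is trimmed to the two members used here, which is an assumption; the file_len == 2 variant is a hypothetical illustration of the "specialized version for file_len == 2" idea from the quoted log message):

```c
#include <assert.h>
#include <stddef.h>

/* Trimmed-down stand-in for diff_file.c's file info struct. */
struct file_info
{
  const char *curp;  /* current position in the file buffer */
  const char *endp;  /* one past the last byte of the buffer */
};

/* Return 1 if any of the FILE_LEN files has curp == endp, i.e. has
 * reached the end of its buffer (the struct-array version from the
 * patch, instead of an array of pointers). */
static int
is_one_at_eof(struct file_info file[], size_t file_len)
{
  size_t i;

  for (i = 0; i < file_len; i++)
    if (file[i].curp == file[i].endp)
      return 1;
  return 0;
}

/* Hypothetical file_len == 2 specialization: the loop collapses to a
 * single expression the compiler can evaluate without loop overhead. */
static int
is_one_at_eof2(struct file_info file[2])
{
  return file[0].curp == file[0].endp || file[1].curp == file[1].endp;
}
```

Passing `struct file_info file[]` instead of `struct file_info *file[]` turns the double indirection (`file[i]->curp`) into a single one (`file[i].curp`), which is what the stack-allocation bullet in the log message enables.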