Diff optimization: implement prefix/suffix-skipping in token-handling code (was: Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster)
On Tue, Oct 12, 2010 at 12:35 PM, Julian Foad wrote: > On Sun, 2010-10-10 at 23:43 +0200, Johan Corveleyn wrote: >> On Sat, Oct 9, 2010 at 2:21 PM, Johan Corveleyn wrote: >> > On Sat, Oct 9, 2010 at 2:57 AM, Julian Foad >> > wrote: >> >> But this makes me think, it looks to me like this whole >> >> prefix-suffix-skipping functionality would fit better inside the >> >> lower-level diff algorithm rather than inside the "datasources_open" >> >> function. Do you agree? >> >> >> >> It works as it is, but you were talking about wanting it to obey the >> >> standard token-parsing rules such as "ignore white space", and so on. >> >> It seems that things like this would be much better one level down. >> > >> > Yes, I've been struggling with this. But I can't easily see it fit in >> > the lower levels right now. Problem is that everything in those lower >> > levels always acts on 1 datasource at a time (getting all the tokens, >> > inserting them into the token tree, ... then the same for the next >> > datasource). The datasource_open seemed to me to be the easiest place >> > to combine datasources to do things for both of them "concurrently" >> > (with least impact on the rest of the code). >> > >> > Maybe those lower-level functions could also be made >> > "multi-datasource", but I have to think a little bit more about that. >> >> I've thought a little bit more about this, going through the code, and >> yeah, it might be a good idea to push the prefix/suffix scanning into >> the lower-level parts of the diff-algorithm (the token parsing / >> building the token tree). >> >> Something like: >> - Make token.c#svn_diff__get_tokens take multiple datasources. >> - In diff.c, diff3.c and diff4.c replace the multiple calls to this >> function to one call passing multiple datasources. >> - token.c#svn_diff__get_tokens (with multiple datasources) will take >> care of identical prefix and suffix scanning based on "tokens" (so can >> take advantage of the standard token-parsing rules with ignore-* >> options, by simply calling svn_diff_fns_t.datasource_get_next_token). > > One of the improvements we're looking for is to make use of the already > existing diff options - ignore-white-space etc. > > We can get that improvement with a much smaller change: simply by > calling the 'get_next_token' routine, or rather a part of it, from > within your current 'find_identical_prefix' function, without touching > any of the generic diffN.c/token.c code and APIs. > >> Some things needed to support this: >> - svn_diff_fns_t.datasource_get_next_token: calculation of the hash >> should be made conditional (I don't want to waste time for the adler32 >> hash for lines that are not going to be put in the token tree). > > Yes. If you take this "smaller change" approach, then the way to do > this would be to factor out the non-adler part of > 'datasource_get_next_token' into a separate private function, and call > that. > >> - For suffix scanning, I'll need another variation of >> datasource_get_next_token to get tokens from the end going backwards >> (say "datasource_get_previous_token"). No hash needed. > > Yes. > >> Does that make sense? Or am I missing it completely? > > I can't really comment on the question of moving this right down into > the diffN.c/token.c code, as I don't have time to study that right now. > The possible benefits I can see are: > > * Sharing this feature across all kinds of diff implementations > (currently: file and in-memory-string). Good from a practical POV if > and only if really long strings are being compared in memory. Good from > a code-structural POV. Yes, that seems like a nice improvement, code-structurally. I don't know if it will have noticeable performance impact. If I understand the code correctly, the in-memory-diff code is used for diffing properties. Some properties can be quite large (e.g. svn:mergeinfo), but I don't know if they are large enough to have an impact (unless some high-level operations perform a million property diffs or something). > * I'm curious about whether it would be possible to integrate prefix > skipping into the main algorithm in such a way that the algorithm would > see the skipped prefix as a possible match, just like any other chunk > (including its adler32), but in a much quicker way than the algorithm > currently finds such a prefix. If so, it would be able to handle better > the cases where one file has added a prefix that is a duplicate of > existing text. Same for the suffix. Maybe that's possible, but I don't think it will help much. For one thing, it introduces some overhead (adler32 calculation of the prefix). And it will only help if the added piece of text is *exactly* the same as the prefix (every line of it). If the added piece of text misses the last line of the prefix, it will not match (different adler32 hash). If you need to be able to match at different spots inside the prefix, you'll have to hash (and compare) every line separately, which void
Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster
On Tue, Oct 12, 2010 at 12:10 PM, Julian Foad wrote: > On Tue, 2010-10-12 at 00:31 +0200, Johan Corveleyn wrote: >> On Mon, Oct 11, 2010 at 11:53 AM, Julian Foad >> wrote: >> > On Sat, 2010-10-09, Johan Corveleyn wrote: >> >> On Sat, Oct 9, 2010 at 2:57 AM, Julian Foad >> >> wrote: >> >> > So I wrote a patch - attached - that refactors this into an array of 4 >> >> > sub-structures, and simplifies all the code that uses them. >> > [...] >> >> Yes, great idea! That would indeed vastly simplify a lot of the code. >> >> So please go ahead and commit the refactoring. >> > >> > OK, committed in r1021282. >> >> Thanks, looks much more manageable now. > > I'd like to see a simplified version of your last patch, taking > advantage of that, before you go exploring other options. Ok, here's a new version of the patch, taking advantage of your file_info refactoring. This vastly simplifies the code, so that it might actually be understandable now :-). Other things I've done in this version: 1) Generalized everything to handle an array of datasources/files, instead of just two. This makes it slightly more complex here and there (using for loops everywhere), but I think it's ok, and it's also more consistent/generic. If anyone has better ideas to do those for loops, suggestions welcome. This makes the algorithm usable by diff3 as well (and diff4 if needed (?)). I have not yet enabled it for diff3, because I haven't yet understood how it handles the generation of its diff output (needs to take into account the prefix_lines. I tried some quick hacks, but lots of tests were failing, so I'll have to look more into it -> that's for a follow up patch). When I can enable it for diff3 (and diff4), I can remove datasource_open (with one datasource). 2) Removed get_prefix_lines from svn_diff_fns_t (and its implementations in diff_file.c and diff_memory.c). Instead I pass prefix_lines directly to token.c#svn_diff__get_tokens. 3) If prefix scanning ended in the last chunk, the suffix scanning now reuses that buffer which already contains the last chunk. As a special case, this also avoids reading the file twice if it's smaller than 128 Kb. 4) Added doc strings everywhere. Feel free to edit those, I'm new at documenting things in svn. Still TODO: - revv svn_diff_fns_t and maybe other stuff I've changed in public API. - See if implementing the critical parts of increment_pointers and decrement_pointers in a macro improves performance. - Add support for -x-b, -x-w, and -x--ignore-eol-style options. For this (and for other reasons), I'd still like to investigate pushing this optimization into the token parsing/handling layer, to extract entire tokens etc., even if this means the current patch has to be thrown away. I'll take this up in a separate thread. Log message: [[[ Make svn_diff skip identical prefix and suffix to make diff and blame faster. * subversion/include/svn_diff.h (svn_diff_fns_t): Added new function type datasources_open to the vtable. * subversion/libsvn_diff/diff_memory.c (datasources_open): New function (does nothing). (svn_diff__mem_vtable): Added new function datasources_open. * subversion/libsvn_diff/diff_file.c (svn_diff__file_baton_t): Added member prefix_lines, and inside the struct file_info the members suffix_start_chunk and suffix_offset_in_chunk. (increment_pointers, decrement_pointers): New functions. (is_one_at_bof, is_one_at_eof): New functions. (find_identical_prefix, find_identical_suffix): New functions. (datasources_open): New function, to open multiple datasources and find their identical prefix and suffix, so these can be excluded from the rest of the diff algorithm, as a performance optimization. From the identical suffix, 50 lines are kept to help the diff algorithm find the nicest possible diff representation in case of ambiguity. (datasource_get_next_token): Stop at start of identical suffix. (svn_diff__file_vtable): Added new function datasources_open. * subversion/libsvn_diff/diff.h (svn_diff__get_tokens): Added argument "datasource_opened", to indicate that the datasource was already opened, and argument "prefix_lines", the number of identical prefix lines.and use this as the starting offset for the token we're getting. * subversion/libsvn_diff/token.c (svn_diff__get_tokens): Added arguments "datasource_opened" and "prefix_lines". Only open the datasource if datasource_opened is FALSE. Set the starting offset of the position list to the number of prefix_lines. * subversion/libsvn_diff/lcs.c (svn_diff__lcs): Added argument "prefix_lines". Use this to correctly set the offset of the sentinel position for EOF, even if one of the files became empty after eliminating the identical prefix. * subversion/libsvn_diff/diff.c (svn_diff__diff): Add a chunk of "common" diff for identical prefix. (svn_diff_diff): Use new function datasources_open to open original and modified at once and find their identical prefix a
Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster
On Tue, 2010-10-12, Johan Corveleyn wrote: > On Tue, Oct 12, 2010 at 12:10 PM, Julian Foad > wrote: > > On Tue, 2010-10-12 at 00:31 +0200, Johan Corveleyn wrote: > >> On Mon, Oct 11, 2010 at 11:53 AM, Julian Foad > >> wrote: > >> > On Sat, 2010-10-09, Johan Corveleyn wrote: > >> >> On Sat, Oct 9, 2010 at 2:57 AM, Julian Foad > >> >> wrote: > >> >> > So I wrote a patch - attached - that refactors this into an array of 4 > >> >> > sub-structures, and simplifies all the code that uses them. > >> > [...] > >> >> Yes, great idea! That would indeed vastly simplify a lot of the code. > >> >> So please go ahead and commit the refactoring. > >> > > >> > OK, committed in r1021282. > >> > >> Thanks, looks much more manageable now. > > > > I'd like to see a simplified version of your last patch, taking > > advantage of that, before you go exploring other options. > > Yes, I'll certainly do that in the coming days. > > >> >> Also, maybe the last_chunk number could be included in the file_info > >> >> struct? Now it's calculated in several places: "last_chunk0 = > >> >> offset_to_chunk(file_baton->size[idx0])", or I have to pass it on > >> >> every time as an extra argument. Seems like the sort of info that > >> >> could be part of file_info. > >> > > >> > It looks to me like you only need to calculate it in exactly one place: > >> > in increment_pointer_or_chunk(). I wouldn't worry about pre-calculating > >> > for efficiency, as it's a trivial calculation. > >> > >> Hm, I don't know. That's recalculating it for every byte in the > >> prefix. In the files I'm testing there could be 1 Mb or more of > > > > No, no, not every byte - only 1 in 131072 of the calls need to calculate > > it - only when it reaches the end of a block. > > Oh right. I guess it's ok then. > > > May I ask whether you've tested the code that crosses from one chunk to > > another? I noticed the following, just now: > > > >> + *at_least_one_end_reached = > >> +file_baton->curp[idx0] == file_baton->endp[idx0] > >> +|| file_baton->curp[idx1] == file_baton->endp[idx1]; > >> +} > >> + > >> + /* If both files reached their end (i.e. are fully identical), > >> we're done */ > >> + if (file_baton->curp[idx0] == file_baton->endp[idx0] > >> +&& file_baton->curp[idx1] == file_baton->endp[idx1]) > > > > Those tests seem to be saying "if we're at the end of the current chunk > > then we're at the end of the file". That looks wrong. What do you > > think? > > Well, it does work :-). The code is correct, but it's kind of > difficult to see, so that's my fault. > > The reason is that increment_pointer_or_chunk takes care that curp > never equals endp (if curp==endp-1 it reads the next chunk), unless it > reaches end of file. See this snippet of increment_pointer_or_chunk: > + if (*chunk_number == last_chunk_number) > +(*curp)++; /* *curp == *endp with last chunk signals end of file */ > > If you think this is too obscure to handle this, I agree, but I > couldn't think of a better/easier way at that time. If you have any > suggestions, feel free :-). But maybe I should refactor the patch > first, so it's easier to see how this would work out best. > > The same "obscurity" also applies to reaching "beginning of file" > (which can happen in decrement_pointer_or_chunk, either during prefix > scanning (rolling back the first line) or during suffix scanning (if > there was no prefix, but still a difference in the first line)). There > it's done by setting the chunk number to -1, which is checked for in > find_identical_prefix and find_identical_suffix: > + if (*chunk_number == 0) > +(*chunk_number)--; /* *chunk_number == -1 signals beginning of file > */ > > Not ideal, but it works ... Oh OK, thanks for explaining. No further comment, m'lord :-) - Julian
Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster
On Tue, Oct 12, 2010 at 12:10 PM, Julian Foad wrote: > On Tue, 2010-10-12 at 00:31 +0200, Johan Corveleyn wrote: >> On Mon, Oct 11, 2010 at 11:53 AM, Julian Foad >> wrote: >> > On Sat, 2010-10-09, Johan Corveleyn wrote: >> >> On Sat, Oct 9, 2010 at 2:57 AM, Julian Foad >> >> wrote: >> >> > So I wrote a patch - attached - that refactors this into an array of 4 >> >> > sub-structures, and simplifies all the code that uses them. >> > [...] >> >> Yes, great idea! That would indeed vastly simplify a lot of the code. >> >> So please go ahead and commit the refactoring. >> > >> > OK, committed in r1021282. >> >> Thanks, looks much more manageable now. > > I'd like to see a simplified version of your last patch, taking > advantage of that, before you go exploring other options. Yes, I'll certainly do that in the coming days. >> >> Also, maybe the last_chunk number could be included in the file_info >> >> struct? Now it's calculated in several places: "last_chunk0 = >> >> offset_to_chunk(file_baton->size[idx0])", or I have to pass it on >> >> every time as an extra argument. Seems like the sort of info that >> >> could be part of file_info. >> > >> > It looks to me like you only need to calculate it in exactly one place: >> > in increment_pointer_or_chunk(). I wouldn't worry about pre-calculating >> > for efficiency, as it's a trivial calculation. >> >> Hm, I don't know. That's recalculating it for every byte in the >> prefix. In the files I'm testing there could be 1 Mb or more of > > No, no, not every byte - only 1 in 131072 of the calls need to calculate > it - only when it reaches the end of a block. Oh right. I guess it's ok then. > May I ask whether you've tested the code that crosses from one chunk to > another? I noticed the following, just now: > >> + *at_least_one_end_reached = >> + file_baton->curp[idx0] == file_baton->endp[idx0] >> + || file_baton->curp[idx1] == file_baton->endp[idx1]; >> + } >> + >> + /* If both files reached their end (i.e. are fully identical), >> we're done */ >> + if (file_baton->curp[idx0] == file_baton->endp[idx0] >> + && file_baton->curp[idx1] == file_baton->endp[idx1]) > > Those tests seem to be saying "if we're at the end of the current chunk > then we're at the end of the file". That looks wrong. What do you > think? Well, it does work :-). The code is correct, but it's kind of difficult to see, so that's my fault. The reason is that increment_pointer_or_chunk takes care that curp never equals endp (if curp==endp-1 it reads the next chunk), unless it reaches end of file. See this snippet of increment_pointer_or_chunk: + if (*chunk_number == last_chunk_number) +(*curp)++; /* *curp == *endp with last chunk signals end of file */ If you think this is too obscure to handle this, I agree, but I couldn't think of a better/easier way at that time. If you have any suggestions, feel free :-). But maybe I should refactor the patch first, so it's easier to see how this would work out best. The same "obscurity" also applies to reaching "beginning of file" (which can happen in decrement_pointer_or_chunk, either during prefix scanning (rolling back the first line) or during suffix scanning (if there was no prefix, but still a difference in the first line)). There it's done by setting the chunk number to -1, which is checked for in find_identical_prefix and find_identical_suffix: + if (*chunk_number == 0) +(*chunk_number)--; /* *chunk_number == -1 signals beginning of file */ Not ideal, but it works ... Cheers, -- Johan
Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster
On Sun, 2010-10-10 at 23:43 +0200, Johan Corveleyn wrote: > On Sat, Oct 9, 2010 at 2:21 PM, Johan Corveleyn wrote: > > On Sat, Oct 9, 2010 at 2:57 AM, Julian Foad > > wrote: > >> But this makes me think, it looks to me like this whole > >> prefix-suffix-skipping functionality would fit better inside the > >> lower-level diff algorithm rather than inside the "datasources_open" > >> function. Do you agree? > >> > >> It works as it is, but you were talking about wanting it to obey the > >> standard token-parsing rules such as "ignore white space", and so on. > >> It seems that things like this would be much better one level down. > > > > Yes, I've been struggling with this. But I can't easily see it fit in > > the lower levels right now. Problem is that everything in those lower > > levels always acts on 1 datasource at a time (getting all the tokens, > > inserting them into the token tree, ... then the same for the next > > datasource). The datasource_open seemed to me to be the easiest place > > to combine datasources to do things for both of them "concurrently" > > (with least impact on the rest of the code). > > > > Maybe those lower-level functions could also be made > > "multi-datasource", but I have to think a little bit more about that. > > I've thought a little bit more about this, going through the code, and > yeah, it might be a good idea to push the prefix/suffix scanning into > the lower-level parts of the diff-algorithm (the token parsing / > building the token tree). > > Something like: > - Make token.c#svn_diff__get_tokens take multiple datasources. > - In diff.c, diff3.c and diff4.c replace the multiple calls to this > function to one call passing multiple datasources. > - token.c#svn_diff__get_tokens (with multiple datasources) will take > care of identical prefix and suffix scanning based on "tokens" (so can > take advantage of the standard token-parsing rules with ignore-* > options, by simply calling svn_diff_fns_t.datasource_get_next_token). One of the improvements we're looking for is to make use of the already existing diff options - ignore-white-space etc. We can get that improvement with a much smaller change: simply by calling the 'get_next_token' routine, or rather a part of it, from within your current 'find_identical_prefix' function, without touching any of the generic diffN.c/token.c code and APIs. > Some things needed to support this: > - svn_diff_fns_t.datasource_get_next_token: calculation of the hash > should be made conditional (I don't want to waste time for the adler32 > hash for lines that are not going to be put in the token tree). Yes. If you take this "smaller change" approach, then the way to do this would be to factor out the non-adler part of 'datasource_get_next_token' into a separate private function, and call that. > - For suffix scanning, I'll need another variation of > datasource_get_next_token to get tokens from the end going backwards > (say "datasource_get_previous_token"). No hash needed. Yes. > Does that make sense? Or am I missing it completely? I can't really comment on the question of moving this right down into the diffN.c/token.c code, as I don't have time to study that right now. The possible benefits I can see are: * Sharing this feature across all kinds of diff implementations (currently: file and in-memory-string). Good from a practical POV if and only if really long strings are being compared in memory. Good from a code-structural POV. * I'm curious about whether it would be possible to integrate prefix skipping into the main algorithm in such a way that the algorithm would see the skipped prefix as a possible match, just like any other chunk (including its adler32), but in a much quicker way than the algorithm currently finds such a prefix. If so, it would be able to handle better the cases where one file has added a prefix that is a duplicate of existing text. Same for the suffix. I'm really not sure whether it's worth pursuing that line of investigation. If you do, and it turns out well, that would be great. If you want to just stick to the implementation in diff_file.c, that seems like a more practical place to stop, now that you have basically solved the problem. - Julian
Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster
On Tue, 2010-10-12 at 00:31 +0200, Johan Corveleyn wrote: > On Mon, Oct 11, 2010 at 11:53 AM, Julian Foad > wrote: > > On Sat, 2010-10-09, Johan Corveleyn wrote: > >> On Sat, Oct 9, 2010 at 2:57 AM, Julian Foad > >> wrote: > >> > So I wrote a patch - attached - that refactors this into an array of 4 > >> > sub-structures, and simplifies all the code that uses them. > > [...] > >> Yes, great idea! That would indeed vastly simplify a lot of the code. > >> So please go ahead and commit the refactoring. > > > > OK, committed in r1021282. > > Thanks, looks much more manageable now. I'd like to see a simplified version of your last patch, taking advantage of that, before you go exploring other options. > >> Also, maybe the last_chunk number could be included in the file_info > >> struct? Now it's calculated in several places: "last_chunk0 = > >> offset_to_chunk(file_baton->size[idx0])", or I have to pass it on > >> every time as an extra argument. Seems like the sort of info that > >> could be part of file_info. > > > > It looks to me like you only need to calculate it in exactly one place: > > in increment_pointer_or_chunk(). I wouldn't worry about pre-calculating > > for efficiency, as it's a trivial calculation. > > Hm, I don't know. That's recalculating it for every byte in the > prefix. In the files I'm testing there could be 1 Mb or more of No, no, not every byte - only 1 in 131072 of the calls need to calculate it - only when it reaches the end of a block. May I ask whether you've tested the code that crosses from one chunk to another? I noticed the following, just now: > + *at_least_one_end_reached = > +file_baton->curp[idx0] == file_baton->endp[idx0] > +|| file_baton->curp[idx1] == file_baton->endp[idx1]; > +} > + > + /* If both files reached their end (i.e. are fully identical), > we're done */ > + if (file_baton->curp[idx0] == file_baton->endp[idx0] > +&& file_baton->curp[idx1] == file_baton->endp[idx1]) Those tests seem to be saying "if we're at the end of the current chunk then we're at the end of the file". That looks wrong. What do you think? In order to test this more easily than creating huge files, I wonder if you can change CHUNK_SIZE and CHUNK_SHIFT to something much smaller such as 8 bytes. I'm trying this now - not with your patch, first, but with just a plain trunk build, to see if it works. - Julian > > And in a later email you wrote: > >> [...] Maybe I can give it a go, > >> first suffix then prefix, and see if I can find real-life problems ... > > > > There's no need to re-visit that. I think it's fine as is it, using a > > separate pair of buffers. > > I've been thinking some more about that. For relatively small files, > smaller than 1 chunk (128 Kb), this seems wasteful. If the file is 127 > Kb, it all fits in one chunk, yet I still read it twice, once for > suffix and once for prefix (and further token processing). Ok, caching > by the OS might make this insignificant, but still. > > Since most common source files in repositories are probably smaller > than 128 Kb I might be hurting the common case, in cases where it's > not compensated by performance gain from identical prefix/suffix. > > Then again, maybe this can be solved with a specific optimization, > without changing the current code too much: if file size is smaller > than chunk size, I point the suffix buffer to the prefix buffer and > I'm done.
Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster
On Mon, Oct 11, 2010 at 11:53 AM, Julian Foad wrote: > On Sat, 2010-10-09, Johan Corveleyn wrote: >> On Sat, Oct 9, 2010 at 2:57 AM, Julian Foad wrote: >> > So I wrote a patch - attached - that refactors this into an array of 4 >> > sub-structures, and simplifies all the code that uses them. > [...] >> Yes, great idea! That would indeed vastly simplify a lot of the code. >> So please go ahead and commit the refactoring. > > OK, committed in r1021282. Thanks, looks much more manageable now. >> Also, maybe the last_chunk number could be included in the file_info >> struct? Now it's calculated in several places: "last_chunk0 = >> offset_to_chunk(file_baton->size[idx0])", or I have to pass it on >> every time as an extra argument. Seems like the sort of info that >> could be part of file_info. > > It looks to me like you only need to calculate it in exactly one place: > in increment_pointer_or_chunk(). I wouldn't worry about pre-calculating > for efficiency, as it's a trivial calculation. Hm, I don't know. That's recalculating it for every byte in the prefix. In the files I'm testing there could be 1 Mb or more of prefix, so 1 million times this calculation (and that for every diff in say 2000 diffs for a blame operation). Unless the compiler can optimize that to a single calculation, because it never changes (I don't know enough about compilers to know if this is possible), it might have an impact, even if it's a trivial calculation. OTOH, I might be in premature-optimization-land here, the impact might be negligible. If I find some time I'll test it. > And in a later email you wrote: >> [...] Maybe I can give it a go, >> first suffix then prefix, and see if I can find real-life problems ... > > There's no need to re-visit that. I think it's fine as is it, using a > separate pair of buffers. I've been thinking some more about that. For relatively small files, smaller than 1 chunk (128 Kb), this seems wasteful. If the file is 127 Kb, it all fits in one chunk, yet I still read it twice, once for suffix and once for prefix (and further token processing). Ok, caching by the OS might make this insignificant, but still. Since most common source files in repositories are probably smaller than 128 Kb I might be hurting the common case, in cases where it's not compensated by performance gain from identical prefix/suffix. Then again, maybe this can be solved with a specific optimization, without changing the current code too much: if file size is smaller than chunk size, I point the suffix buffer to the prefix buffer and I'm done. Cheers, -- Johan
Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster
On Sat, 2010-10-09, Johan Corveleyn wrote: > On Sat, Oct 9, 2010 at 2:57 AM, Julian Foad wrote: > > On Sat, 2010-10-09, Johan Corveleyn wrote: > >> Ok, third iteration of the patch in attachment. It passes make check. > >> > >> As discussed in [1], this version keeps 50 lines of the identical > >> suffix around, to give the algorithm a good chance to generate a diff > >> output of good quality (in all but the most extreme cases, this will > >> be the same as with the original svn_diff algorithm). > >> > >> That's about the only difference with the previous iteration. So for > >> now, I'm submitting this for review. Any feedback is very welcome :-). > > > > Hi Johan. > > Hi Julian, > > Thanks for taking a look. > > > I haven't reviewed it, but after seeing today's discussion I had just > > scrolled quickly through the previous version of this patch. I noticed > > that the two main functions - find_identical_suffix and > > find_identical_suffix - are both quite similar (but not quite similar > > enough to make them easily share implementation) and both quite long, > > and I noticed you wrote in an earlier email that you were finding it > > hard to make the code readable. I have a suggestion that may help. [...] > > So I wrote a patch - attached - that refactors this into an array of 4 > > sub-structures, and simplifies all the code that uses them. [...] > Yes, great idea! That would indeed vastly simplify a lot of the code. > So please go ahead and commit the refactoring. OK, committed in r1021282. > Also, maybe the last_chunk number could be included in the file_info > struct? Now it's calculated in several places: "last_chunk0 = > offset_to_chunk(file_baton->size[idx0])", or I have to pass it on > every time as an extra argument. Seems like the sort of info that > could be part of file_info. It looks to me like you only need to calculate it in exactly one place: in increment_pointer_or_chunk(). I wouldn't worry about pre-calculating for efficiency, as it's a trivial calculation. > One more thing: you might have noticed that for find_identical_suffix > I use other buffers, chunks, curp's, endp's, ... than for the prefix. > For prefix scanning I can just use the stuff from the diff_baton, > because after prefix scanning has finished, everything is buffered and > pointing correctly for the normal algorithm to continue (i.e. > everything points at the first byte of the first non-identical line). > For suffix scanning I need to use other structures (newly alloc'd > buffer etc), so as to not mess with those pointers/buffers from the > diff_baton. > Maybe I can give it a go, > first suffix then prefix, and see if I can find real-life problems ... > So: I think I'll need the file_info struct to be available out of the > diff_baton_t struct as well, so I can use this in suffix scanning > also. Yes; I gave the struct type a name - "struct file_info" - so you can use it in other places. > (side-note: I considered first doing suffix scanning, then prefix > scanning, so I could reuse the buffers/pointers from diff_baton all > the time, and still have everything pointing correctly after > eliminating prefix/suffix. But that could give vastly different > results in some cases, for instance when original file is entirely > identical to both the prefix and the suffix of the modified file. So I > decided it's best to stick with first prefix, then suffix). And in a later email you wrote: > [...] Maybe I can give it a go, > first suffix then prefix, and see if I can find real-life problems ... There's no need to re-visit that. I think it's fine as is it, using a separate pair of buffers. > > Responding to some of the other points you mentioned in a much earlier > > mail: > > > >> 3) It's only implemented for 2 files. I'd like to generalize this for > >> an array of datasources, so it can also be used for diff3 and diff4. > >> > >> 4) As a small hack, I had to add a flag "datasource_opened" to > >> token.c#svn_diff__get_tokens, so I could have different behavior for > >> regular diff vs. diff3 and diff4. If 3) gets implemented, this hack is > >> no longer needed. > > > > Yes, I'd like to see 3), and so hack 4) will go away. > > I'm wondering though how I should represent the datasources to pass > into datasources_open. An array combined with a length parameter? > Something like: > > static svn_error_t * > datasources_open(void *baton, apr_off_t *prefix_lines, > svn_diff_datasource_e[] datasources, int datasources_len) > > ? And then use for loops everywhere I now do things twice for the two > datasources? I haven't looked at how datasources_open() API is used or what form of API would be best. For the internal functions, you can now pass around an array of file_info pointers rather than indexes. > >> 5) I've struggled with making the code readable/understandable. It's > >> likely that there is still a lot of room for improvement. I also > >> probably need to document it
Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster
On Sat, Oct 9, 2010 at 2:21 PM, Johan Corveleyn wrote: > On Sat, Oct 9, 2010 at 2:57 AM, Julian Foad wrote: >> But this makes me think, it looks to me like this whole >> prefix-suffix-skipping functionality would fit better inside the >> lower-level diff algorithm rather than inside the "datasources_open" >> function. Do you agree? >> >> It works as it is, but you were talking about wanting it to obey the >> standard token-parsing rules such as "ignore white space", and so on. >> It seems that things like this would be much better one level down. > > Yes, I've been struggling with this. But I can't easily see it fit in > the lower levels right now. Problem is that everything in those lower > levels always acts on 1 datasource at a time (getting all the tokens, > inserting them into the token tree, ... then the same for the next > datasource). The datasource_open seemed to me to be the easiest place > to combine datasources to do things for both of them "concurrently" > (with least impact on the rest of the code). > > Maybe those lower-level functions could also be made > "multi-datasource", but I have to think a little bit more about that. I've thought a little bit more about this, going through the code, and yeah, it might be a good idea to push the prefix/suffix scanning into the lower-level parts of the diff-algorithm (the token parsing / building the token tree). Something like: - Make token.c#svn_diff__get_tokens take multiple datasources. - In diff.c, diff3.c and diff4.c replace the multiple calls to this function to one call passing multiple datasources. - token.c#svn_diff__get_tokens (with multiple datasources) will take care of identical prefix and suffix scanning based on "tokens" (so can take advantage of the standard token-parsing rules with ignore-* options, by simply calling svn_diff_fns_t.datasource_get_next_token). Some things needed to support this: - svn_diff_fns_t.datasource_get_next_token: calculation of the hash should be made conditional (I don't want to waste time for the adler32 hash for lines that are not going to be put in the token tree). - For suffix scanning, I'll need another variation of datasource_get_next_token to get tokens from the end going backwards (say "datasource_get_previous_token"). No hash needed. Does that make sense? Or am I missing it completely? It would mean I'd have to throw away the current patch and rewrite it in the token layer (but I don't really mind, as long as a good solution takes its place :-)). And your file_info refactoring would also not be used (although it's certainly a worthwhile refactoring by itself, making the current code clearer). Cheers, -- Johan
Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster
On Sat, Oct 9, 2010 at 5:19 PM, Daniel Shahaf wrote: > Johan Corveleyn wrote on Sat, Oct 09, 2010 at 14:21:09 +0200: >> (side-note: I considered first doing suffix scanning, then prefix >> scanning, so I could reuse the buffers/pointers from diff_baton all >> the time, and still have everything pointing correctly after >> eliminating prefix/suffix. But that could give vastly different >> results in some cases, for instance when original file is entirely >> identical to both the prefix and the suffix of the modified file. So I >> decided it's best to stick with first prefix, then suffix). > > What Hyrum said. How common /is/ this case? And, anyway, in that case > both "everything was appended" and "everything was prepended" are > equally legitimate diffs. Hm, I'm not sure about this one. I just wanted to try the maximum reasonably possible to keep the results identical to what they were. Using another buffer for suffix scanning didn't seem that big of a deal (only slight increase in memory use (2 chunks of 128K in current implementation)). I made that decision pretty early, before I knew of the other problem of suffix scanning, and the keep-50-suffix-lines compromise we decided upon. There may be more subtle cases than the one I described, I don't know. OTOH, now that we have the keep-50-suffix-lines, that may help also in this case. I'll have to think about that. Maybe I can give it a go, first suffix then prefix, and see if I can find real-life problems ... -- Johan
Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster
Johan Corveleyn wrote on Sat, Oct 09, 2010 at 14:21:09 +0200: > (side-note: I considered first doing suffix scanning, then prefix > scanning, so I could reuse the buffers/pointers from diff_baton all > the time, and still have everything pointing correctly after > eliminating prefix/suffix. But that could give vastly different > results in some cases, for instance when original file is entirely > identical to both the prefix and the suffix of the modified file. So I > decided it's best to stick with first prefix, then suffix). What Hyrum said. How common /is/ this case? And, anyway, in that case both "everything was appended" and "everything was prepended" are equally legitimate diffs.
Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster
On Sat, Oct 9, 2010 at 2:57 AM, Julian Foad wrote: > On Sat, 2010-10-09, Johan Corveleyn wrote: >> Ok, third iteration of the patch in attachment. It passes make check. >> >> As discussed in [1], this version keeps 50 lines of the identical >> suffix around, to give the algorithm a good chance to generate a diff >> output of good quality (in all but the most extreme cases, this will >> be the same as with the original svn_diff algorithm). >> >> That's about the only difference with the previous iteration. So for >> now, I'm submitting this for review. Any feedback is very welcome :-). > > Hi Johan. Hi Julian, Thanks for taking a look. > I haven't reviewed it, but after seeing today's discussion I had just > scrolled quickly through the previous version of this patch. I noticed > that the two main functions - find_identical_suffix and > find_identical_suffix - are both quite similar (but not quite similar > enough to make them easily share implementation) and both quite long, > and I noticed you wrote in an earlier email that you were finding it > hard to make the code readable. I have a suggestion that may help. > > I think the existing structure of the svn_diff__file_baton_t is > unhelpful: > { > const svn_diff_file_options_t *options; > const char *path[4]; > > apr_file_t *file[4]; > apr_off_t size[4]; > > int chunk[4]; > char *buffer[4]; > char *curp[4]; > char *endp[4]; > > /* List of free tokens that may be reused. */ > svn_diff__file_token_t *tokens; > > svn_diff__normalize_state_t normalize_state[4]; > > apr_pool_t *pool; > } svn_diff__file_baton_t; > > All those array[4] fields are logically related, but this layout forces > the programmer to address them individually. > > So I wrote a patch - attached - that refactors this into an array of 4 > sub-structures, and simplifies all the code that uses them. > > I think this will help you to get better code clarity because then your > increment_pointer_or_chunk() for example will be able to take a single > pointer to a file_info structure instead of a lot of pointers to > individual members of the same. > > Would you take a look and let me know if you agree. If so, I can commit > the refactoring straight away. Yes, great idea! That would indeed vastly simplify a lot of the code. So please go ahead and commit the refactoring. Also, maybe the last_chunk number could be included in the file_info struct? Now it's calculated in several places: "last_chunk0 = offset_to_chunk(file_baton->size[idx0])", or I have to pass it on every time as an extra argument. Seems like the sort of info that could be part of file_info. One more thing: you might have noticed that for find_identical_suffix I use other buffers, chunks, curp's, endp's, ... than for the prefix. For prefix scanning I can just use the stuff from the diff_baton, because after prefix scanning has finished, everything is buffered and pointing correctly for the normal algorithm to continue (i.e. everything points at the first byte of the first non-identical line). For suffix scanning I need to use other structures (newly alloc'd buffer etc), so as to not mess with those pointers/buffers from the diff_baton. So: I think I'll need the file_info struct to be available out of the diff_baton_t struct as well, so I can use this in suffix scanning also. (side-note: I considered first doing suffix scanning, then prefix scanning, so I could reuse the buffers/pointers from diff_baton all the time, and still have everything pointing correctly after eliminating prefix/suffix. But that could give vastly different results in some cases, for instance when original file is entirely identical to both the prefix and the suffix of the modified file. So I decided it's best to stick with first prefix, then suffix). > Responding to some of the other points you mentioned in a much earlier > mail: > >> 3) It's only implemented for 2 files. I'd like to generalize this for >> an array of datasources, so it can also be used for diff3 and diff4. >> >> 4) As a small hack, I had to add a flag "datasource_opened" to >> token.c#svn_diff__get_tokens, so I could have different behavior for >> regular diff vs. diff3 and diff4. If 3) gets implemented, this hack is >> no longer needed. > > Yes, I'd like to see 3), and so hack 4) will go away. I'm wondering though how I should represent the datasources to pass into datasources_open. An array combined with a length parameter? Something like: static svn_error_t * datasources_open(void *baton, apr_off_t *prefix_lines, svn_diff_datasource_e[] datasources, int datasources_len) ? And then use for loops everywhere I now do things twice for the two datasources? >> 5) I've struggled with making the code readable/understandable. It's >> likely that there is still a lot of room for improvement. I also >> probably need to document it some more. > > You need to write a full doc string for datasources_open(), at least. > It needs especially to say ho
Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster
On Sat, 2010-10-09, Johan Corveleyn wrote: > Ok, third iteration of the patch in attachment. It passes make check. > > As discussed in [1], this version keeps 50 lines of the identical > suffix around, to give the algorithm a good chance to generate a diff > output of good quality (in all but the most extreme cases, this will > be the same as with the original svn_diff algorithm). > > That's about the only difference with the previous iteration. So for > now, I'm submitting this for review. Any feedback is very welcome :-). Hi Johan. I haven't reviewed it, but after seeing today's discussion I had just scrolled quickly through the previous version of this patch. I noticed that the two main functions - find_identical_suffix and find_identical_suffix - are both quite similar (but not quite similar enough to make them easily share implementation) and both quite long, and I noticed you wrote in an earlier email that you were finding it hard to make the code readable. I have a suggestion that may help. I think the existing structure of the svn_diff__file_baton_t is unhelpful: { const svn_diff_file_options_t *options; const char *path[4]; apr_file_t *file[4]; apr_off_t size[4]; int chunk[4]; char *buffer[4]; char *curp[4]; char *endp[4]; /* List of free tokens that may be reused. */ svn_diff__file_token_t *tokens; svn_diff__normalize_state_t normalize_state[4]; apr_pool_t *pool; } svn_diff__file_baton_t; All those array[4] fields are logically related, but this layout forces the programmer to address them individually. So I wrote a patch - attached - that refactors this into an array of 4 sub-structures, and simplifies all the code that uses them. I think this will help you to get better code clarity because then your increment_pointer_or_chunk() for example will be able to take a single pointer to a file_info structure instead of a lot of pointers to individual members of the same. Would you take a look and let me know if you agree. If so, I can commit the refactoring straight away. Responding to some of the other points you mentioned in a much earlier mail: > 3) It's only implemented for 2 files. I'd like to generalize this for > an array of datasources, so it can also be used for diff3 and diff4. > > 4) As a small hack, I had to add a flag "datasource_opened" to > token.c#svn_diff__get_tokens, so I could have different behavior for > regular diff vs. diff3 and diff4. If 3) gets implemented, this hack is > no longer needed. Yes, I'd like to see 3), and so hack 4) will go away. > 5) I've struggled with making the code readable/understandable. It's > likely that there is still a lot of room for improvement. I also > probably need to document it some more. You need to write a full doc string for datasources_open(), at least. It needs especially to say how it relates to datasource_open() - why should the caller call this function rather than that function, and are they both necessary - and how the caller is supposed to use the 'prefix_lines' parameter. And doc strings for increment/decrement_...(). But this makes me think, it looks to me like this whole prefix-suffix-skipping functionality would fit better inside the lower-level diff algorithm rather than inside the "datasources_open" function. Do you agree? It works as it is, but you were talking about wanting it to obey the standard token-parsing rules such as "ignore white space", and so on. It seems that things like this would be much better one level down. > 6) I've not paid too much attention to low level optimizations, so > here also there's probably room for improvement, which may be > significant for the critical loops. Maybe. Not important at this stage. > 7) There are two warnings left "conversion from 'apr_off_t' to 'int'", > in diff_file.c#find_identical_suffix. To completely eliminate this, I > should probably make all "chunks" of type apr_off_t instead of int > (but I have not done that yet, because the original implementation > also used int for the chunk in the svn_diff__file_baton_t struct). > Should I do this (also in the baton struct)? Or should I use an > explicit cast somewhere? I recommend using an explicit cast where needed, in this patch. Changing the 'chunk' data type everywhere could be done in a separate patch, either before or after this patch. - Julian > I still consider this a WIP, because of the following remaining todo's > (which may have a lot of impact on the current implementation): > - Generalize for more than 2 datasources (for diff3 and diff4). > - revv svn_diff_fns_t and maybe other stuff I've changed in public API. > - Add support for -x-b, -x-w, and -x--ignore-eol-style options. Maybe > switch the implementation to read out entire lines before comparing > (like datasources_get_next_token does). > > Log message: > [[[ > Make svn_diff_diff skip identical prefix and suffix to make diff and blame > faster. > > * subversion/include/svn_diff.h > (svn_diff_fns_t): Add
Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster
Ok, third iteration of the patch in attachment. It passes make check. As discussed in [1], this version keeps 50 lines of the identical suffix around, to give the algorithm a good chance to generate a diff output of good quality (in all but the most extreme cases, this will be the same as with the original svn_diff algorithm). That's about the only difference with the previous iteration. So for now, I'm submitting this for review. Any feedback is very welcome :-). I still consider this a WIP, because of the following remaining todo's (which may have a lot of impact on the current implementation): - Generalize for more than 2 datasources (for diff3 and diff4). - revv svn_diff_fns_t and maybe other stuff I've changed in public API. - Add support for -x-b, -x-w, and -x--ignore-eol-style options. Maybe switch the implementation to read out entire lines before comparing (like datasources_get_next_token does). Log message: [[[ Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster. * subversion/include/svn_diff.h (svn_diff_fns_t): Added new function types datasources_open and get_prefix_lines to the vtable. * subversion/libsvn_diff/diff_memory.c (datasources_open): New function (does nothing). (get_prefix_lines): New function (does nothing). (svn_diff__mem_vtable): Added new functions datasources_open and get_prefix_lines. * subversion/libsvn_diff/diff_file.c (svn_diff__file_baton_t): Added members prefix_lines, suffix_start_chunk[4] and suffix_offset_in_chunk. (increment_pointer_or_chunk, decrement_pointer_or_chunk): New functions. (find_identical_prefix, find_identical_suffix): New functions. (datasources_open): New function, to open both datasources and find their identical prefix and suffix. From the identical suffix, 50 lines are kept to help the diff algorithm find the nicest possible diff representation in case of ambiguity. (get_prefix_lines): New function. (datasource_get_next_token): Stop at start of identical suffix. (svn_diff__file_vtable): Added new functions datasources_open and get_prefix_lines. * subversion/libsvn_diff/diff.h (svn_diff__get_tokens): Added argument "datasource_opened", to indicate that the datasource was already opened. * subversion/libsvn_diff/token.c (svn_diff__get_tokens): Added argument "datasource_opened". Only open the datasource if datasource_opened is FALSE. Set the starting offset of the position list to the number of prefix lines. * subversion/libsvn_diff/lcs.c (svn_diff__lcs): Added argument "prefix_lines". Use this to correctly set the offset of the sentinel position for EOF, even if one of the files became empty after eliminating the identical prefix. * subversion/libsvn_diff/diff.c (svn_diff__diff): Add a chunk of "common" diff for identical prefix. (svn_diff_diff): Use new function datasources_open, to open original and modified at once, and find their identical prefix and suffix. Pass prefix_lines to svn_diff__lcs and to svn_diff__diff. * subversion/libsvn_diff/diff3.c (svn_diff_diff3): Pass datasource_opened = FALSE to svn_diff__get_tokens. Pass prefix_lines = 0 to svn_diff__lcs. * subversion/libsvn_diff/diff4.c (svn_diff_diff4): Pass datasource_opened = FALSE to svn_diff__get_tokens. Pass prefix_lines = 0 to svn_diff__lcs. ]]] Cheers, -- Johan [1] http://svn.haxx.se/dev/archive-2010-10/0141.shtml Index: subversion/include/svn_diff.h === --- subversion/include/svn_diff.h (revision 1006020) +++ subversion/include/svn_diff.h (working copy) @@ -112,6 +112,11 @@ typedef struct svn_diff_fns_t svn_error_t *(*datasource_open)(void *diff_baton, svn_diff_datasource_e datasource); + /** Open the datasources of type @a datasources. */ + svn_error_t *(*datasources_open)(void *diff_baton, apr_off_t *prefix_lines, + svn_diff_datasource_e datasource0, + svn_diff_datasource_e datasource1); + /** Close the datasource of type @a datasource. */ svn_error_t *(*datasource_close)(void *diff_baton, svn_diff_datasource_e datasource); @@ -124,6 +129,9 @@ typedef struct svn_diff_fns_t void *diff_baton, svn_diff_datasource_e datasource); + /** Get the number of identical prefix lines from the @a diff_baton. */ + apr_off_t (*get_prefix_lines)(void *diff_baton); + /** A function for ordering the tokens, resembling 'strcmp' in functionality. * @a compare should contain the return value of the comparison: * If @a ltoken and @a rtoken are "equal", return 0. If @a ltoken is Index: subversion/libsvn_diff/diff_file.c === --- subversion/libsvn_diff/diff_file.c (revision 1006020) +++
Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster
On Sun, Oct 3, 2010 at 1:46 AM, Johan Corveleyn wrote: > Hi, > > Here is a second iteration of the patch. It now passes make check. > > Differences from the previous version are: > - Support for \r eol-style (\n and \r\n was already ok). > - The number of prefix_lines is now passed to svn_diff__lcs, so it can > use that value to set the position offset of the "EOF" marker > correctly, in case one of both files has become empty after skipping > the prefix. This fixes the crashes in blame_tests.py 2 and 7. > > The patch is pretty big, so please let me know if I should split it up > to make it more reviewable (I could easily split it up between the > prefix-finding and the suffix-finding, at the cost of having overview > over the entire algorithm). > > Still to do: > - Think about why results are sometimes different (because of > eliminated suffix, the LCS can sometimes be slightly different), and > what can be done about it. Hm, this problem has made me reconsider whether my patch is a good solution (more details below). I'm thinking of other ways of implementing this kind of optimization, so for now I'm pulling this patch. Please don't review it yet, as I might go for a radically different approach (sorry for any time spent that may be lost). The details: The problem happens if there are two (or more) non-matching parts with an identical part in between (so the prefix scanning stops early, doesn't meet the suffix in one of the files), and the suffix scanning goes too far because the end of the last change is identical to the original at that point. For example (best viewed with fixed-width font): [[[ original | modified --- This is line 1 | This is line 1 This is line 2 | This line is *changed* This is line 3 | This is line 3 ... (more identical lines) | ... existing_function() | existing_function() {| { // do stuff| // do stuff return SVN_NO_ERROR; | return SVN_NO_ERROR; }| } | This is the end of the file | new_function() -| { | // do other stuff | return SVN_NO_ERROR; | } | | This is the end of the file |- ]]] The following identical suffix will be eliminated: [[[ return SVN_NO_ERROR; } This is the end of the file ]]] which means the LCS algorithm cannot do better than to say that the following lines are new: [[[ + return SVN_NO_ERROR; +} + +new_function() +{ + // do other stuff ]]] Not quite what we want/expect. If the suffix is not eliminated, the LCS algorithm gives the correct/expected result. Also if the change in the beginning of the file isn't there, the result is correct (because the prefix scanning goes first, and eliminates the identical stuff from the start (I'll leave it as an exercise to the reader to see that the result is better)). Interestingly, GNU diff also has this problem. It mitigates it a bit by keeping a number of identical prefix/suffix lines (at least the number of context lines, or a higher number if supplied by --horizon-lines=NUM). This cannot completely solve the problem though: for any given number of "horizon-lines" one can always come up with an example that doesn't give the correct result. So I need to find a way to not "eliminate" the identical suffix, but to mark those lines as identical and then include them in the position_list that's going to be used by the LCS algorithm. I'd like to do the same for the prefix. This means I need to approach this problem on a different level. To be continued ... Cheers, -- Johan
Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster
Hi, Here is a second iteration of the patch. It now passes make check. Differences from the previous version are: - Support for \r eol-style (\n and \r\n was already ok). - The number of prefix_lines is now passed to svn_diff__lcs, so it can use that value to set the position offset of the "EOF" marker correctly, in case one of both files has become empty after skipping the prefix. This fixes the crashes in blame_tests.py 2 and 7. The patch is pretty big, so please let me know if I should split it up to make it more reviewable (I could easily split it up between the prefix-finding and the suffix-finding, at the cost of having overview over the entire algorithm). Still to do: - Think about why results are sometimes different (because of eliminated suffix, the LCS can sometimes be slightly different), and what can be done about it. - Generalize for more than 2 datasources (for diff3 and diff4). - revv svn_diff_fns_t and maybe other stuff I've changed in public API. - Add support for -x-b, -x-w, and -x--ignore-eol-style options. But I'd like to do those things in follow-up patches, after this one has been reviewed and digested a little bit. So at this point: review, feedback, ... very welcome :-). Log message: [[[ Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster. * subversion/include/svn_diff.h (svn_diff_fns_t): Added new function types datasources_open and get_prefix_lines to the vtable. * subversion/libsvn_diff/diff_memory.c (datasources_open): New function (does nothing). (get_prefix_lines): New function (does nothing). (svn_diff__mem_vtable): Added new functions datasources_open and get_prefix_lines. * subversion/libsvn_diff/diff_file.c (svn_diff__file_baton_t): Added members prefix_lines, suffix_start_chunk[4] and suffix_offset_in_chunk. (increment_pointer_or_chunk, decrement_pointer_or_chunk): New functions. (find_identical_prefix, find_identical_suffix): New functions. (datasources_open): New function, to open both datasources and find their identical prefix and suffix. (get_prefix_lines): New function. (datasource_get_next_token): Stop at start of identical suffix. (svn_diff__file_vtable): Added new functions datasources_open and get_prefix_lines. * subversion/libsvn_diff/diff.h (svn_diff__get_tokens): Added argument "datasource_opened", to indicate that the datasource was already opened. * subversion/libsvn_diff/token.c (svn_diff__get_tokens): Added argument "datasource_opened". Only open the datasource if datasource_opened is FALSE. Set the starting offset of the position list to the number of prefix lines. * subversion/libsvn_diff/lcs.c (svn_diff__lcs): Added argument "prefix_lines". Use this to correctly set the offset of the sentinel position for EOF, even if one of the files became empty after eliminating the identical prefix. * subversion/libsvn_diff/diff.c (svn_diff__diff): Add a chunk of "common" diff for identical prefix. (svn_diff_diff): Use new function datasources_open, to open original and modified at once, and find their identical prefix and suffix. Pass prefix_lines to svn_diff__lcs and to svn_diff__diff. * subversion/libsvn_diff/diff3.c (svn_diff_diff3): Pass datasource_opened = FALSE to svn_diff__get_tokens. Pass prefix_lines = 0 to svn_diff__lcs. * subversion/libsvn_diff/diff4.c (svn_diff_diff4): Pass datasource_opened = FALSE to svn_diff__get_tokens. Pass prefix_lines = 0 to svn_diff__lcs. ]]] Cheers, -- Johan Index: subversion/include/svn_diff.h === --- subversion/include/svn_diff.h (revision 1003326) +++ subversion/include/svn_diff.h (working copy) @@ -112,6 +112,11 @@ typedef struct svn_diff_fns_t svn_error_t *(*datasource_open)(void *diff_baton, svn_diff_datasource_e datasource); + /** Open the datasources of type @a datasources. */ + svn_error_t *(*datasources_open)(void *diff_baton, apr_off_t *prefix_lines, + svn_diff_datasource_e datasource0, + svn_diff_datasource_e datasource1); + /** Close the datasource of type @a datasource. */ svn_error_t *(*datasource_close)(void *diff_baton, svn_diff_datasource_e datasource); @@ -124,6 +129,9 @@ typedef struct svn_diff_fns_t void *diff_baton, svn_diff_datasource_e datasource); + /** Get the number of identical prefix lines from the @a diff_baton. */ + apr_off_t (*get_prefix_lines)(void *diff_baton); + /** A function for ordering the tokens, resembling 'strcmp' in functionality. * @a compare should contain the return value of the comparison: * If @a ltoken and @a rtoken are "equal", return 0. If @a ltoken is Index: subversion/libsvn_diff/diff_file.c ==
Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster
Johan Corveleyn wrote on Tue, Sep 28, 2010 at 23:37:23 +0200: > On Tue, Sep 28, 2010 at 4:11 PM, Daniel Shahaf > wrote: > >> Index: subversion/include/svn_diff.h > >> === > >> --- subversion/include/svn_diff.h (revision 1001548) > >> +++ subversion/include/svn_diff.h (working copy) > >> @@ -112,6 +112,11 @@ > > (personally I prefer 'svn diff -x-p' to show the function name here) > > Ok, will do next time. > Thanks. > I've had some progress: > - The blame_tests.py now all pass (tests 2 and 7 provoked a core > dump). That makes all tests pass. However, although I fixed the > coredump, I don't fully understand the root cause (why they ended up > in the situation where they ended up). So I'm going to study that > first a bit more. > - I have now included support for files with \r eol-style. > > I'll send a new version of the patch shortly. > Sounds good. > I'm also thinking that I could try to take advantage of -x-b/-x-w or > -x--ignore-eol-style options to make it even faster (right now the > prefix/suffix scanning will stop at the first difference, regardless > if it's a whitespace or eol difference that could/should be ignored). > ... > > Thoughts? None, actually; I'm not (yet) sufficiently familiar with diff internals. But let's please have this work in small, digestible (read: reviewable) pieces --- support for -x--foo is exactly the sort of additional optimization that can be done in a separate patch (on top of this starting patch).
Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster
Hi Daniel, Thanks for the feedback. On Tue, Sep 28, 2010 at 4:11 PM, Daniel Shahaf wrote: >> Index: subversion/include/svn_diff.h >> === >> --- subversion/include/svn_diff.h (revision 1001548) >> +++ subversion/include/svn_diff.h (working copy) >> @@ -112,6 +112,11 @@ > (personally I prefer 'svn diff -x-p' to show the function name here) Ok, will do next time. >> svn_error_t *(*datasource_open)(void *diff_baton, >> svn_diff_datasource_e datasource); >> >> + /** Open the datasources of type @a datasources. */ >> + svn_error_t *(*datasources_open)(void *diff_baton, apr_off_t >> *prefix_lines, >> + svn_diff_datasource_e datasource0, >> + svn_diff_datasource_e datasource1); >> + > > So, you're extending the svn_diff_fns_t struct, which is defined in > a public header. > > It's a public struct with no constructor function, so I believe you have > to revv it (into svn_diff_fns2_t) in order to extend it (for binary > compatibility: people allocating this struct and then using a newer > Subversion library at runtime). > > If it did have a constructor function, you'd have to extend it only at > the end, and even then make sure that if the added elements are NULL (eg > because an old caller didn't know to fill them) then everything still > worked. > > Probably more important to get the logic right than to revv it right > away, though; the latter is a SMOP. Doh! I'm sure that observation was in the back of my head somewhere, but I forgot about it while working on the solution. Anyway, you're right: there is first some more work to get the algorithm right. I've had some progress: - The blame_tests.py now all pass (tests 2 and 7 provoked a core dump). That makes all tests pass. However, although I fixed the coredump, I don't fully understand the root cause (why they ended up in the situation where they ended up). So I'm going to study that first a bit more. - I have now included support for files with \r eol-style. I'll send a new version of the patch shortly. I'm also thinking that I could try to take advantage of -x-b/-x-w or -x--ignore-eol-style options to make it even faster (right now the prefix/suffix scanning will stop at the first difference, regardless if it's a whitespace or eol difference that could/should be ignored). However, I'm not sure if I should implement this myself, as part of the find_identical_prefix and find_identical_suffix functions, or switch to the usage of datasource_get_next_token (which is the function that is used by the "real" diff algorithm to extract the lines, and which uses util.c#svn_diff__normalize_buffer to normalize whitespace and eol's if needed). Right now, I don't read entire lines (tokens) but compare each byte as I go. But I could do it line-based as well (extract line from file1, extract line from file2, memcmp lines). I would have to make the calculation of the adler32 hash in datasource_get_next_token conditional on some boolean, or factor out the part of the function that's useful to me into a new static function. There is one caveat to this approach: I'm not sure if it would work backwards (for suffix scanning). Well, the normalization function wouldn't have to be changed, but the extraction of lines would have to go backward. Surely it's possible, but I have no idea how much I'd have to change the code in get_next_token to get lines backwards... I'm also not sure if one would be (significantly) faster than the other: comparing byte-by-byte while going through both files, or extracting entire lines and then comparing the lines. Thoughts? Cheers, -- Johan
Re: [WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster
> Index: subversion/include/svn_diff.h > === > --- subversion/include/svn_diff.h (revision 1001548) > +++ subversion/include/svn_diff.h (working copy) > @@ -112,6 +112,11 @@ (personally I prefer 'svn diff -x-p' to show the function name here) >svn_error_t *(*datasource_open)(void *diff_baton, >svn_diff_datasource_e datasource); > > + /** Open the datasources of type @a datasources. */ > + svn_error_t *(*datasources_open)(void *diff_baton, apr_off_t *prefix_lines, > + svn_diff_datasource_e datasource0, > + svn_diff_datasource_e datasource1); > + So, you're extending the svn_diff_fns_t struct, which is defined in a public header. It's a public struct with no constructor function, so I believe you have to revv it (into svn_diff_fns2_t) in order to extend it (for binary compatibility: people allocating this struct and then using a newer Subversion library at runtime). If it did have a constructor function, you'd have to extend it only at the end, and even then make sure that if the added elements are NULL (eg because an old caller didn't know to fill them) then everything still worked. Probably more important to get the logic right than to revv it right away, though; the latter is a SMOP. >/** Close the datasource of type @a datasource. */ >svn_error_t *(*datasource_close)(void *diff_baton, > svn_diff_datasource_e datasource); > @@ -124,6 +129,9 @@ > void *diff_baton, > svn_diff_datasource_e > datasource); > > + /** Get the number of identical prefix lines from the @a diff_baton. */ > + apr_off_t (*get_prefix_lines)(void *diff_baton); > + >/** A function for ordering the tokens, resembling 'strcmp' in > functionality. > * @a compare should contain the return value of the comparison: > * If @a ltoken and @a rtoken are "equal", return 0. If @a ltoken is > Index: subversion/libsvn_diff/diff_memory.c > === > --- subversion/libsvn_diff/diff_memory.c (revision 1001548) > +++ subversion/libsvn_diff/diff_memory.c (working copy) > @@ -95,7 +95,23 @@ >return SVN_NO_ERROR; > } > > +/* Implements svn_diff_fns_t::datasources_open */ > +static svn_error_t * > +datasources_open(void *baton, apr_off_t *prefix_lines, > + svn_diff_datasource_e datasource0, > + svn_diff_datasource_e datasource1) > +{ > + /* Do nothing: everything is already there and initialized to 0 */ > + return SVN_NO_ERROR; > +} > > +/* Implements svn_diff_fns_t::datasource_get_prefix_lines */ > +static apr_off_t > +get_prefix_lines(void *baton) > +{ > + return 0; > +} > + > /* Implements svn_diff_fns_t::datasource_close */ > static svn_error_t * > datasource_close(void *baton, svn_diff_datasource_e datasource) > @@ -189,8 +205,10 @@ > static const svn_diff_fns_t svn_diff__mem_vtable = > { >datasource_open, > + datasources_open, >datasource_close, >datasource_get_next_token, > + get_prefix_lines, >token_compare, >token_discard, >token_discard_all > Index: subversion/libsvn_diff/diff_file.c > === > --- subversion/libsvn_diff/diff_file.c(revision 1001548) > +++ subversion/libsvn_diff/diff_file.c(working copy) > @@ -77,6 +77,10 @@ >char *curp[4]; >char *endp[4]; > > + apr_off_t prefix_lines; > + int suffix_start_chunk[4]; > + apr_off_t suffix_offset_in_chunk[4]; > + >/* List of free tokens that may be reused. */ >svn_diff__file_token_t *tokens; > > @@ -233,7 +237,330 @@ > curp, length, 0, file_baton->pool); > } > > +static svn_error_t * > +increment_pointer_or_chunk(svn_diff__file_baton_t *file_baton, > + char **curp, char **endp, int *chunk_number, > + char *buffer, apr_off_t last_chunk_number, int > idx) > +{ > + apr_off_t length; > > + if ((*curp) == (*endp) - 1) > +{ > + if (*chunk_number == last_chunk_number) > +(*curp)++; /* *curp == *endp with last chunk signals end of file */ > + else > +{ > + (*chunk_number)++; > + length = *chunk_number == last_chunk_number ? > +offset_in_chunk(file_baton->size[idx]) : CHUNK_SIZE; > + SVN_ERR(read_chunk(file_baton->file[idx], > + file_baton->path[idx], > + buffer, length, > + chunk_to_offset(*chunk_number), > + file_baton->pool)); > + *endp = buffer + length; > + *curp = buffer; > +} > +} > + else > +{ > + (*curp)++; > +} > + > + ret
[WIP PATCH] Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster
Hi devs, As discussed in [1], here is a patch that makes svn_diff_diff (libsvn_diff/diff.c) skip the identical prefix and suffix of the original and modified files, before starting the LCS (longest common subsequence) algorithm on the "non-matching" part. This makes diff a lot faster (especially for large files), thereby also speeding up blame in environments where the client is the bottleneck (with a fast enough server and network, blame is constrained by the speed of diff on the client side). This patch still has some rough edges (see below), hence the WIP marker. Nevertheless, it works "most of the time", and it can demonstrate the speed improvement that can be expected. I will continue working on the "rough edges", but in the meantime any feedback/review/thoughts are very much appreciated. The rough edges: 1) While scanning through the identical prefix, I have to count the number of lines (to have correct line offsets for the rest of the algorithm). To do that, I currently count the number of \n's, so it won't work for files with \r eol-style. Not so difficult to fix (similar to how diff_file.c#datasource_get_next_token does it), but I haven't done it yet. 2) Two tests still fail: - blame_tests.py 2: annotate a binary file - blame_tests.py 7: blame with different eol styles Maybe this is because of 1), I'll have to investigate. 3) It's only implemented for 2 files. I'd like to generalize this for an array of datasources, so it can also be used for diff3 and diff4. 4) As a small hack, I had to add a flag "datasource_opened" to token.c#svn_diff__get_tokens, so I could have different behavior for regular diff vs. diff3 and diff4. If 3) gets implemented, this hack is no longer needed. 5) I've struggled with making the code readable/understandable. It's likely that there is still a lot of room for improvement. I also probably need to document it some more. 6) I've not paid too much attention to low level optimizations, so here also there's probably room for improvement, which may be significant for the critical loops. 7) There are two warnings left "conversion from 'apr_off_t' to 'int'", in diff_file.c#find_identical_suffix. To completely eliminate this, I should probably make all "chunks" of type apr_off_t instead of int (but I have not done that yet, because the original implementation also used int for the chunk in the svn_diff__file_baton_t struct). Should I do this (also in the baton struct)? Or should I use an explicit cast somewhere? 8) A bigger problem: the output of diff/blame is sometimes different from the original implementation. It's always syntactically correct, but sometimes the "less unique lines" are taken differently (like when a new function is added, and diff thinks the new block starts from the closing brace of the previous function, ... stuff like that). This happens only because of the identical-suffix-scanning (it doesn't happen if I only enable the identical-prefix-scanning). I think I know the cause of this change in behavior: I completely eliminate the suffix from the LCS-building algorithm. And that probably causes it to sometimes come up with another "longest common subsequence". Right now I don't know how to solve this completely. A workaround might be to add a certain number of suffix-lines to the "non-matching-block", so they can be part of the LCS-searching. But probably one can always come up with examples where the number of suffix-lines is not enough. I'll have to look into this some more. Ideas welcome :-). Log message: [[[ Make svn_diff_diff skip identical prefix and suffix to make diff and blame faster. * subversion/include/svn_diff.h (svn_diff_fns_t): Added new function types datasources_open and get_prefix_lines to the vtable. * subversion/libsvn_diff/diff_memory.c (datasources_open): New function (does nothing). (get_prefix_lines): New function (does nothing). (svn_diff__mem_vtable): Added new functions datasources_open and get_prefix_lines. * subversion/libsvn_diff/diff_file.c (svn_diff__file_baton_t): Added members prefix_lines, suffix_start_chunk[4] and suffix_offset_in_chunk. (increment_pointer_or_chunk, decrement_pointer_or_chunk): New functions. (find_identical_prefix, find_identical_suffix): New fuctions. (datasources_open): New function, to open both datasources and find their identical prefix and suffix. (get_prefix_lines): New function. (datasource_get_next_token): Stop at start of identical suffix. (svn_diff__file_vtable): Added new functions datasources_open and get_prefix_lines. * subversion/libsvn_diff/diff.h (svn_diff__get_tokens): Added argument "datasource_opened", to indicate that the datasource was already opened. * subversion/libsvn_diff/token.c (svn_diff__get_tokens): Added argument "datasource_opened". Only open the datasource if datasource_opened is FALSE. Set the starting offset of the position list to the number of prefix lines. * subversion/libsvn_diff/diff.c (svn_diff__