On Wed, May 18, 2011 at 7:31 PM, Morten Kloster <mor...@gmail.com> wrote:
> Thanks, Daniel.
>
> Johan:
> I've found your notes trunk/notes/diff-optimizations.txt. As you may
> have realized, my patch is a slightly different approach to the
> problem you discuss in section I.5, "Discarding non-matching
> lines before running the LCS (Longest Common Subsequence)
> algorithm." - rather than discarding the lines before calling LCS, I
> create count information that lets LCS skip the lines without
> counting them as (avoidable) changes. I believe my approach
> is easier to implement (not least now, when it's already done ;)
> and has less overhead, although the potential speed-up is
> somewhat less:
Yes, I noticed the similarity :-). Your approach is a nice one, and indeed
easier to implement. With the discarding approach, we'd have to give each
line an alternative line number (a "virtual line number", of the file
without its non-matching lines), then run the LCS on those lines, and then
go back to using the original line numbers when producing the diff output.
Or something like that (see the rough sketch in the P.S. below). But I like
your idea too; it sounds more light-weight, and might provide most of the
benefits.

> LCS has running time of O( N + P D ), where N is #tokens in
> longest source, P is number of mismatches in smallest source
> and D is total number of mismatches in both sources.
> Discarding the non-matching tokens would replace both P and D
> by their discarded versions P' and D', giving a potential speed-up
> of the SQUARE of the inverse of the fraction of changed tokens
> that are not unique to one source, whereas my approach
> effectively only replaces P by P', not D, so you only get one
> power, not two (at least I think so; I haven't checked carefully).

You seem to know your stuff :-). For me it's easier to understand by
explaining D as "the length of the shortest edit script (additions +
deletes) from the largest to the smallest source", and P as "the number
of deleted tokens in that shortest edit script". That's the terminology
used in the article referred to by lcs.c [1]. Or instead of "shortest
edit script", I usually use the term "minimal diff".

I'm not sure about your calculation though. I should probably study your
patch first, because I haven't quite grokked it yet. But could you
explain why your approach doesn't reduce D (the length of the minimal
diff)?

Also: the discarding approach reduces N as well, simply by making both
files smaller. In the "average case", where the expected running time is
O(N + PD), that's probably not a big deal. But according to the article
the worst-case running time is O(NP), so reducing N also has a big impact
on the worst case. Anyway, I'm speaking purely theoretically here; I have
no idea whether this worst case can occur in practice.

> There is another possible enhancement that is somewhat
> complementary to the one I've implemented: If the number of
> changed tokens is smaller than the number of uniquely
> matching tokens, and the changes are grouped in well-
> separated sections, then whenever you have found more
> unique matches than the number of changes you have "used",
> you can "restart" the problem from that point: you are
> guaranteed that there is no better match for the earlier
> part of either source.

I'm not sure about that. I have to chew on it a little bit :-). Can you
describe this idea in some more detail? Maybe give an example or
something?

> This reduces the running time from
> O( N + P' D ) to something like O( N + sum(P'n Dn)), I think,
> where the sum is over the different changed sections. I
> haven't worked out the details on how to best implement
> this approach, but it should work well with my initial patch.

--
Johan
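
P.S.: Just to make the "virtual line number" idea a bit more concrete,
here is a rough, untested sketch (plain C, not the actual libsvn_diff
code; all names are invented for illustration). It only shows the
discarding and the virtual/original number mapping; the LCS itself is
left out:

[[[
/* Sketch of the "virtual line number" idea (untested, illustrative only).
 * Lines that occur nowhere in the other file are discarded before the
 * LCS runs; the surviving lines get consecutive virtual numbers, the LCS
 * would then run on the (smaller) virtual problem, and the result is
 * mapped back to original line numbers when producing the diff output. */
#include <stdio.h>
#include <string.h>

/* Does 'line' occur anywhere in lines[0..n-1]?  A real implementation
 * would reuse the token hashes instead of a linear scan like this. */
static int occurs_in(const char *line, const char **lines, int n)
{
  for (int i = 0; i < n; i++)
    if (strcmp(line, lines[i]) == 0)
      return 1;
  return 0;
}

int main(void)
{
  const char *a[] = { "alpha", "only-in-a", "beta", "gamma" };
  const char *b[] = { "alpha", "beta", "only-in-b", "gamma" };
  int na = 4, nb = 4;

  int map_a[4], map_b[4];       /* virtual -> original line number */
  int va = 0, vb = 0;

  for (int i = 0; i < na; i++)
    if (occurs_in(a[i], b, nb))
      map_a[va++] = i;          /* keep line i of A as virtual line va */

  for (int j = 0; j < nb; j++)
    if (occurs_in(b[j], a, na))
      map_b[vb++] = j;

  /* ... run the LCS on the reduced va x vb problem here ... */

  /* Map back: virtual line k corresponds to original line map_x[k]. */
  for (int k = 0; k < va; k++)
    printf("A: virtual %d = original %d (%s)\n", k, map_a[k], a[map_a[k]]);
  for (int k = 0; k < vb; k++)
    printf("B: virtual %d = original %d (%s)\n", k, map_b[k], b[map_b[k]]);

  return 0;
}
]]]

Of course, the real thing would do the "occurs in the other file" test
with the token hashes we already compute, not with a linear search, and
the mapping would have to be carried through to the diff output stage.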