On Sun, Jun 26, 2011 at 11:23 AM, Morten Kloster <mor...@gmail.com> wrote:
> On Tue, Jun 14, 2011 at 11:12 PM, Johan Corveleyn <jcor...@gmail.com> wrote:
>> On Tue, Jun 14, 2011 at 7:32 PM, Morten Kloster <mor...@gmail.com> wrote:
>>> On Tue, Jun 14, 2011 at 2:49 AM, Johan Corveleyn <jcor...@gmail.com> wrote:
>>>> On Tue, Jun 7, 2011 at 10:16 PM, Morten Kloster <mor...@gmail.com> wrote:
>>>>>
>>> []
>>>>> Here's an updated patch based on the current HEAD, and with
>>>>> one or two minor improvements. While, as stated, the patch
>>>>> can give many-fold speed increases for ideal cases, it also slows
>>>>> down the LCS algorithm significantly for poorly matching files; I
>>>>> got about an 18% increase in running time on my system for files
>>>>> that were very poor matches but had no lines unique to either file.
>>>>
>>>> Hi Morten,
>>>>
>>>> Hmmm, that sounds like a lot of overhead in some cases (I actually
>>>> wonder why -- I haven't fully understood the implementation, but on
>>>> first sight it doesn't seem that computationally intensive). But the
>>>> most important question that's bugging me: do those "ideal cases"
>>>> occur often in real-world examples?
>>>>
>>>
>>> It's not surprising to get that kind of speed reduction, since it uses
>>> an extra test within the loops that take the bulk of the time when
>>> the files don't match.
>>
>> Indeed, that might be it :-). Just wondering though: isn't it by any
>> chance the case that, worst case, clear_fp is called too often (more
>> often than is useful, each time resetting the LCS run for only
>> marginal gain)?
>>
>> Can you verify that those 18% are spent because of those extra
>> comparisons, and not because of calls to clear_fp? IIUC, with those
>> poorly-matching files, there should never be a call to clear_fp,
>> because the unique-matching optimization never kicks in, right?
>>
>
> As you say, clear_fp is never called in this case.
> But there is
> actually a risk of slowdown for well-matching files due to
> calls to clear_fp (during the loop over k for p=0; clear_fp will
> clear more than it really needs to). It shouldn't be too hard
> to avoid that, though.
>
>> Also, those 18% you measured: is that the increase in
>> svn_diff_file_diff time, or in lcs time? Or is that approximately the
>> same, because the lcs time dominates because of the poor matching?
>>
>
> They are essentially the same in this case.
>
>>
>> I have two more, possibly stupid, questions on the implementation in the
>> patch:
>>
>> - You mentioned in your theoretical explanation that you still have to
>> scan the entire d-range for every p, even after you've found a "reset
>> point". But can't you just "break" from the for-loops, whenever you
>> encounter a reset point?
>>
>
> No, I don't have to scan after I find a reset point (the loops
> just continue in post-reset mode; there's no waste there). The
> issue is that svn_diff__lcs still looks for a p=0 solution even after
> a unique possible match has failed, which clearly is no longer
> possible (it's not restricted to p=0, that's just the simplest case).
>
>> - You calculated 'limit' as:
>>     limit = (d >= 0 ? 2 * p + d : 2 * p - d);
>> and made the "reset condition" be:
>>     if (fp[k].uniquematchcount > limit + k)  /* with k < 0 */
>> or:
>>     if (fp[k].uniquematchcount > limit - k)  /* with k >= 0 */
>>
>> I don't fully understand. Can you explain a bit more why this is the
>> right value for 'limit', and how this is the correct "reset
>> condition"?
>>
>
> Those limits are actually overkill. The correct condition is that the
> number of unique matches must be larger than (or equal to, I guess)
> the number of mismatches that could conceivably be corrected by
> matching the files otherwise. I have here used the total number of
> mismatches in BOTH files, but the larger number of mismatches in
> one file would be sufficient.
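[Editorial note: the following is not part of the original mails. It is a minimal standalone sketch of the 'limit'/reset-condition arithmetic quoted above, extracted from the quoted patch fragment so the k < 0 and k >= 0 cases can be checked in isolation. The helper function names are hypothetical; only the expressions themselves come from the thread.]

```c
#include <assert.h>

/* The 'limit' from the quoted patch: an upper bound on the number of
   mismatches for the current phase p on the diagonal at offset d. */
static int reset_limit(int p, int d)
{
  return (d >= 0 ? 2 * p + d : 2 * p - d);
}

/* The quoted reset condition: nonzero when the unique-match count on
   diagonal k exceeds the mismatch budget, so (per the thread) no
   alternative alignment could recover enough mismatches and the LCS
   state may be reset. The +k / -k terms fold the diagonal index into
   the budget, mirroring the two 'if' branches in the patch. */
static int should_reset(int uniquematchcount, int p, int d, int k)
{
  int limit = reset_limit(p, d);
  if (k < 0)
    return uniquematchcount > limit + k;
  else
    return uniquematchcount > limit - k;
}
```

For example, with p = 1 and d = 0 the budget on diagonal k = 0 is 2, so three unique matches trigger a reset while two do not. As Morten notes, this bound is conservative ("overkill"): it uses the total mismatch count of both files where the larger of the two per-file counts would already suffice.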
Hi Morten,

Thanks for explaining. Maybe you can start a feature branch for this
optimization (and maybe further LCS/diff optimizations). That would
give it some more visibility, and would allow for some more
experimentation and discussion. Now that you have your account with
(partial) commit access...

Speaking of which: have you added your name to the COMMITTERS file
yet? You should probably do that first.

-- 
Johan