Re: diff-optimizations-tokens branch: I think I'm going to abandon it

Johan Corveleyn Thu, 02 Dec 2010 12:22:08 -0800

On Thu, Dec 2, 2010 at 6:18 PM, Bill Tutt <[email protected]> wrote:
> Note: This email only tangentially relates to svn diff and more about
> reverse token scanning in general:
>
> As someone who has implemented suffix reverse token scanning before:


Thanks for the input. It's nice to see other people have also
struggled with this :-).

> * It simply isn't possible in DBCS code pages. Stick to byte only here.
>    SBCS and UTF-16 make reverse token stuff relatively
> straightforward. UTF-8 is a little trickier but still tractable.
>   At least UTF-8 is tractable in a way that DBCS isn't. You always
> know which part of a Unicode code point you are in. (i.e. byte 4 vs.
> byte 3 vs. etc...)

Ok, this further supports the decision to focus on the byte-based
approach. We'll only consider stuff identical if all bytes are
identical. That's the simplest route, and since it's only an
optimization anyway ...

> * I would recommend only supporting a subset of the diff options for
> reverse token scanning. i.e. ignore whitespace/ignore eol but skip
> ignore case (if svn has that, I forget...)

svn diff doesn't have an ignore-case option, so that's ok :-).

>    If tokens include keyword expansion operations then stop once you
> hit one. The possible source of bugs outways the perf gain in my mind
> here.

Haven't thought about keyword expansion yet. But as you suggest: I'm
not going to bother doing special stuff for (expanded) keywords. If we
find a mismatch, we'll stop with the optimized scanning, and fall back
to the default algorithm.

> * Suffix scanning does really require a seekable stream, if it isn't
> seekable then don't perform the reverse scanning.  It is only an
> optimization after all.

Hm, yes, we'll need to be careful about that. I'll start another mail
thread asking for known implementors of the svn_diff_fns_t functions,
to find out whether seeking around like that for suffix would be
supported.

> Additional ignore whitespace related comment:
> * IIRC, Perforce had an interesting twist on ignoring whitespace. You
> could ignore just line leading/ending whitespace instead of all
> whitespace differences but pay attention to any whitespace change
> after the "trim" operation had completed.
>
> e.g.:
> * "    aaa bbb   " vs "aaa bbb" would compare as equal
> * "    aaa  bbb  " vs "aaa  bbb" would compare as equal
> * "    aaa  bbb  " vs "aaa bbb" would compare as non-equal due to the
> white space change in the middle of the line

Cool (svn doesn't have that option). But I'm not sure what that would
be useful for (as a user, I can't immediately imagine an important use
case). Anyway, could still be a nice option...

Cheers,
-- 
Johan

Re: diff-optimizations-tokens branch: I think I'm going to abandon it

Reply via email to