On Thu, Dec 2, 2010 at 6:18 PM, Bill Tutt <b...@tutts.org> wrote:
> Note: This email only tangentially relates to svn diff and more about
> reverse token scanning in general:
>
> As someone who has implemented suffix reverse token scanning before:

Thanks for the input. It's nice to see other people have also struggled
with this :-).

> * It simply isn't possible in DBCS code pages. Stick to byte only here.
>   SBCS and UTF-16 make reverse token stuff relatively
>   straightforward. UTF-8 is a little trickier but still tractable.
>   At least UTF-8 is tractable in a way that DBCS isn't. You always
>   know which part of a Unicode code point you are in. (i.e. byte 4 vs.
>   byte 3 vs. etc...)

Ok, this further supports the decision to focus on the byte-based
approach. We'll only consider stuff identical if all bytes are
identical. That's the simplest route, and since it's only an
optimization anyway ...

> * I would recommend only supporting a subset of the diff options for
>   reverse token scanning. i.e. ignore whitespace/ignore eol but skip
>   ignore case (if svn has that, I forget...)

svn diff doesn't have an ignore-case option, so that's ok :-).

>   If tokens include keyword expansion operations then stop once you
>   hit one. The possible source of bugs outweighs the perf gain in my mind
>   here.

Haven't thought about keyword expansion yet. But as you suggest: I'm
not going to bother doing special stuff for (expanded) keywords. If we
find a mismatch, we'll stop with the optimized scanning, and fall back
to the default algorithm.

> * Suffix scanning does really require a seekable stream, if it isn't
>   seekable then don't perform the reverse scanning. It is only an
>   optimization after all.

Hm, yes, we'll need to be careful about that. I'll start another mail
thread asking for known implementors of the svn_diff_fns_t functions,
to find out whether seeking around like that for suffix would be
supported.

> Additional ignore whitespace related comment:
> * IIRC, Perforce had an interesting twist on ignoring whitespace. You
>   could ignore just line leading/ending whitespace instead of all
>   whitespace differences but pay attention to any whitespace change
>   after the "trim" operation had completed.
>
> e.g.:
> * " aaa bbb " vs "aaa bbb" would compare as equal
> * " aaa bbb " vs "aaa bbb" would compare as equal
> * " aaa  bbb " vs "aaa bbb" would compare as non-equal due to the
>   white space change in the middle of the line

Cool (svn doesn't have that option). But I'm not sure what that would
be useful for (as a user, I can't immediately imagine an important use
case). Anyway, could still be a nice option...

Cheers,
-- 
Johan
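
P.S. To make the byte-only suffix idea a bit more concrete, here is a
rough Python sketch. It is not svn code: the function name, the
block_size parameter and the "return None when not seekable" convention
are all made up for illustration. It assumes buffered binary streams
(io.BytesIO, or files opened with "rb"), counts how many trailing bytes
are identical, and gives up as soon as a stream can't seek so the caller
can fall back to the default algorithm. Rounding the result to a
line/token boundary is deliberately left out.

```python
import io
import os


def identical_suffix_len(a, b, block_size=4096):
    """Count trailing bytes that are byte-for-byte identical in a and b.

    Returns None if either stream is not seekable, so the caller can
    skip the optimization and run the normal diff on the whole input.
    (Hypothetical helper; assumes buffered binary streams.)
    """
    if not (a.seekable() and b.seekable()):
        return None

    size_a = a.seek(0, os.SEEK_END)
    size_b = b.seek(0, os.SEEK_END)
    limit = min(size_a, size_b)  # the suffix can't exceed the shorter stream

    matched = 0
    while matched < limit:
        n = min(block_size, limit - matched)
        # Read the next block, walking backwards from the end of each stream.
        a.seek(size_a - matched - n)
        b.seek(size_b - matched - n)
        chunk_a = a.read(n)
        chunk_b = b.read(n)
        if chunk_a == chunk_b:
            matched += n
            continue
        # Mismatch somewhere in this block: count equal bytes from its end.
        i = n
        while i > 0 and chunk_a[i - 1] == chunk_b[i - 1]:
            i -= 1
        return matched + (n - i)
    return matched
```

For example, comparing b"one\ntwo\nthree\n" with b"ONE\ntwo\nthree\n"
gives 11: everything from the first newline onwards is identical, and
the scan stops at the "e" vs. "E" mismatch. No attempt is made to
decode characters, which is the whole point of staying byte-only.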