Re: diff-optimizations-tokens branch: I think I'm going to abandon it

Bill Tutt Thu, 02 Dec 2010 09:18:50 -0800

Note: This email relates only tangentially to svn diff; it is more about
reverse token scanning in general:


As someone who has implemented suffix reverse token scanning before:

* It simply isn't possible in DBCS code pages. Stick to byte-only
   scanning there.
   SBCS and UTF-16 make reverse token scanning relatively
   straightforward. UTF-8 is a little trickier but still tractable.
   At least UTF-8 is tractable in a way that DBCS isn't: you can always
   tell whether you are on the lead byte or a continuation byte of a
   Unicode code point, so you can find where you are within it
   (i.e. byte 4 vs. byte 3, etc.).

* I would recommend supporting only a subset of the diff options for
   reverse token scanning, i.e. support ignore whitespace/ignore eol but
   skip ignore case (if svn has that, I forget...).
   If tokens include keyword expansion operations then stop once you
   hit one. The possible source of bugs outweighs the perf gain in my
   mind here.

* Suffix scanning really does require a seekable stream. If the stream
   isn't seekable then don't perform the reverse scanning. It is only an
   optimization after all.
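
To make the UTF-8 point concrete, here is a minimal Python sketch (not
svn code; the function names are made up for illustration, and it
assumes both inputs are valid UTF-8). Continuation bytes always look
like 10xxxxxx, so a byte-wise common suffix can be trimmed back to a
code point boundary:

```python
def is_continuation(byte: int) -> bool:
    """True if byte is a UTF-8 continuation byte (bit pattern 10xxxxxx)."""
    return (byte & 0xC0) == 0x80


def char_start(buf: bytes, pos: int) -> int:
    """Index of the first byte of the code point that contains buf[pos]."""
    while pos > 0 and is_continuation(buf[pos]):
        pos -= 1
    return pos


def common_suffix_len(a: bytes, b: bytes) -> int:
    """Length in bytes of the common suffix of a and b, shortened so that
    the suffix starts on a code point boundary."""
    n = 0
    while n < len(a) and n < len(b) and a[-1 - n] == b[-1 - n]:
        n += 1
    # Leading continuation bytes belong to a code point whose earlier
    # bytes did not match, so they cannot be part of the suffix.
    while n > 0 and is_continuation(a[len(a) - n]):
        n -= 1
    return n
```

Nothing like this works for DBCS, because a trail byte there can have
the same value as a lead byte or a single-byte character, so you cannot
classify a byte without scanning from the front.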

Additional ignore whitespace related comment:
* IIRC, Perforce had an interesting twist on ignoring whitespace. You
could ignore just line leading/ending whitespace instead of all
whitespace differences but pay attention to any whitespace change
after the "trim" operation had completed.

e.g.:
* "    aaa bbb   " vs "aaa bbb" would compare as equal
* "    aaa  bbb  " vs "aaa  bbb" would compare as equal
* "    aaa  bbb  " vs "aaa bbb" would compare as non-equal due to the
white space change in the middle of the line
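
A rough Python sketch of that comparison, as I remember the behaviour
(this is not Perforce's implementation; the function names are mine and
I am assuming "whitespace" means spaces and tabs). The second function
is the usual ignore-all-whitespace comparison, for contrast:

```python
def equal_ignoring_edge_whitespace(a: str, b: str) -> bool:
    """Compare after trimming leading/trailing spaces and tabs only;
    whitespace inside the line still counts."""
    return a.strip(" \t") == b.strip(" \t")


def equal_ignoring_all_whitespace(a: str, b: str) -> bool:
    """Compare with every whitespace character removed."""
    return "".join(a.split()) == "".join(b.split())
```

With the three examples above, the first function returns True, True
and False; the second returns True for all three.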

Fyi,
Bill
