Note: This email relates only tangentially to svn diff; it is more about reverse token scanning in general.

As someone who has implemented suffix (reverse) token scanning before:

* It simply isn't possible in DBCS code pages; stick to byte-only scanning there. SBCS and UTF-16 make reverse token scanning relatively straightforward. UTF-8 is a little trickier, but it is tractable in a way that DBCS isn't: you always know which part of a Unicode code point you are in (i.e. byte 4 vs. byte 3 vs. etc.).

* I would recommend supporting only a subset of the diff options for reverse token scanning, i.e. ignore whitespace / ignore EOL, but skip ignore case (if svn has that, I forget...). If tokens include keyword expansion operations, then stop once you hit one. The potential for bugs outweighs the perf gain in my mind here.

* Suffix scanning really does require a seekable stream; if the stream isn't seekable, then don't perform the reverse scanning. It is only an optimization after all.

Additional ignore-whitespace-related comment:

* IIRC, Perforce had an interesting twist on ignoring whitespace. You could ignore just the leading/trailing whitespace on a line instead of all whitespace differences, but still pay attention to any whitespace change left after the "trim" operation had completed. e.g.:

  * "  aaa bbb  " vs "aaa bbb" would compare as equal
  * " aaa bbb " vs "aaa bbb" would compare as equal
  * " aaa  bbb " vs "aaa bbb" would compare as non-equal due to the whitespace change in the middle of the line

Fyi,
Bill
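
P.S. To make two of the points above concrete, here is a rough Python sketch. It is not svn or Perforce code; the function names (prev_char_start, equal_ignoring_edge_whitespace) are made up for illustration. The first function shows why UTF-8 can be walked backwards: continuation bytes always have the bit pattern 10xxxxxx, so you can step back until you reach a lead byte. It assumes the buffer is valid UTF-8 and the starting position is on a character boundary. The second function is my recollection of the "trim only" comparison, using Python's own definition of whitespace.

```python
def prev_char_start(buf: bytes, pos: int) -> int:
    """Return the index where the UTF-8 character ending just before pos starts.

    Assumption: buf is valid UTF-8, 0 < pos <= len(buf), and pos sits on a
    character boundary.
    """
    i = pos - 1
    # Continuation bytes look like 0b10xxxxxx; skip back over them until we
    # land on a lead byte (or a plain ASCII byte).
    while i > 0 and (buf[i] & 0xC0) == 0x80:
        i -= 1
    return i


def equal_ignoring_edge_whitespace(a: str, b: str) -> bool:
    """Compare two lines, ignoring only leading and trailing whitespace.

    Whitespace changes inside the line still count as differences.
    """
    return a.strip() == b.strip()


if __name__ == "__main__":
    data = "a\u20acb".encode("utf-8")  # 'a', a 3-byte euro sign, 'b'
    pos = len(data)
    while pos > 0:
        start = prev_char_start(data, pos)
        print(start, data[start:pos].decode("utf-8"))
        pos = start

    print(equal_ignoring_edge_whitespace("  aaa bbb  ", "aaa bbb"))
    print(equal_ignoring_edge_whitespace(" aaa  bbb ", "aaa bbb"))
```

Nothing like prev_char_start works for DBCS, because a byte taken on its own does not tell you whether it is a lead byte or a trail byte.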