On Wed, Dec 1, 2010 at 3:38 AM, Daniel Shahaf <d...@daniel.shahaf.name> wrote:
> Johan Corveleyn wrote on Wed, Dec 01, 2010 at 00:25:27 +0100:
>> I am now considering abandoning the tokens approach, for the
>> following reasons:
> ...
>> So, unless someone can convince me otherwise, I'm probably going to
>> stop with the token approach. Because of 2), I don't think it's worth
>> it to spend the effort needed for 1), especially because the
>> byte-based approach already works.
>>
>
> In other words, you're saying that the token-based approach: (b) won't
> be as fast as the bytes-based approach can be, and (a) requires much
> effort to be spent on implementing the reverse reading of tokens?
> (i.e., a backwards gets())
Yes. The reverse reading is quite hard (in the diff_file.c case) because
of the chunked reading of files. A line may be split by a "chunk
boundary" (it may even be split in the middle of an eol sequence,
between \r and \n), and it all still needs to be
canonicalized/normalized correctly (that's why we'll also need a reverse
normalization function). The current forward get_next_token does this
very well, and the reverse should be analogous, but I just find it hard
to reason about, and to get it implemented correctly. It will take me a
lot of time, with a high probability of introducing subtle bugs for
special edge cases. (A toy sketch of the chunk-boundary problem is
appended below.)

>> Any thoughts?
>>
>
> -tokens/BRANCH-README mentions one of the advantages of the tokens
> approach being that the comparison is done only after whitespace and
> newlines have been canonicalized (if -x-w or -x--ignore-eol-style is
> in effect). IIRC, the -bytes approach doesn't currently take advantage
> of these -x flags?
>
> What is the practical effect of the fact the -bytes approach doesn't
> take advantage of these flags? If a file (with a moderately long
> history) has had all its newlines changed in rN, then I assume that
> your -bytes optimizations will speed up all the diffs that 'blame'
> performs on that file, except for the single diff between rN and its
> predecessor?

Yes, I thought that would be an important advantage of the tokens
approach, but as you suggest, the practical effect will most likely be
limited. Indeed, only this single diff between rN and its predecessor
(which may be 1 in 1000 diffs) will not benefit from
prefix/suffix-optimization. Even if the rate of such changes is more
like 1/100 (let's say you change leading tabs to spaces, and vice versa,
every 100 revisions), it will hardly be noticeable.

The perfectionist in me still thinks: hey, you can also implement this
normalization with the byte-based approach (in a streamy way). But that
will probably be quite hard, because I'd have to take into account that
I might have to read more bytes from one file than from the other file
(right now I can always simply read a single byte from both files; see
the second sketch below). And apart from that, it would at least mean
"algorithm duplication": a (streamy) re-implementation of the
normalization algorithm that's already used in the token layer. So all
in all it's probably not worth it ...

Cheers,
--
Johan
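P.S.: To make the chunk-boundary problem concrete, here's a toy sketch
(hypothetical helper names, not diff_file.c's actual API) of a reverse
line scanner with normalized eols. The key step is the peek at the byte
before a \n: in the real chunked reader, that byte may live in the
previous chunk, which may first have to be re-read from disk.

[[[
/* Toy model of reverse line scanning with normalized eols.
   "data" stands in for the file contents; in diff_file.c the
   (hypothetical) byte_at() helper would instead go through the
   chunk cache and might trigger a re-read of an earlier chunk. */
#include <stdio.h>
#include <string.h>

static int
byte_at(const char *data, long len, long pos)
{
  return (pos < 0 || pos >= len) ? -1 : (unsigned char)data[pos];
}

/* Scan backwards from offset END (exclusive) to the start of the
   preceding line, treating \r\n, \r and \n as equivalent eols.
   Returns the offset of the first byte of that line. */
static long
prev_line_start(const char *data, long len, long end)
{
  long pos = end - 1;

  /* Step over the eol terminating the line we're entering.  If it
     is \n, the byte before it -- possibly in the previous chunk --
     must be checked for a matching \r. */
  if (byte_at(data, len, pos) == '\n')
    {
      pos--;
      if (byte_at(data, len, pos) == '\r')   /* \r\n, maybe split */
        pos--;
    }
  else if (byte_at(data, len, pos) == '\r')
    pos--;

  /* Walk back to the previous eol (or the start of the file). */
  while (pos >= 0
         && byte_at(data, len, pos) != '\n'
         && byte_at(data, len, pos) != '\r')
    pos--;

  return pos + 1;
}

int
main(void)
{
  /* With 4-byte chunks, "abc\r\nde" splits as "abc\r" + "\nde":
     the eol pair straddles the chunk boundary. */
  const char *data = "abc\r\nde";
  long len = (long)strlen(data);
  long last = prev_line_start(data, len, len);   /* 5: line "de"  */
  long first = prev_line_start(data, len, last); /* 0: line "abc" */
  printf("line starts: %ld, %ld\n", first, last);
  return 0;
}
]]]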
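P.P.S.: And a sketch of why streamy normalization would break the
"read a single byte from both files" lockstep. With whitespace ignored
(isspace() here is only a stand-in for whatever rules -x-w would
actually apply), the two files get consumed at different rates:

[[[
/* Compare two buffers while ignoring all whitespace, reporting how
   many bytes of each were consumed for the common prefix.  A run of
   whitespace in one file may match a run of a different length (or
   none at all) in the other, so the offsets diverge. */
#include <ctype.h>
#include <stdio.h>

static void
common_prefix_ignore_ws(const char *a, const char *b,
                        size_t *a_used, size_t *b_used)
{
  size_t i = 0, j = 0;

  for (;;)
    {
      while (a[i] && isspace((unsigned char)a[i]))  /* may skip many bytes */
        i++;
      while (b[j] && isspace((unsigned char)b[j]))  /* ... or none at all  */
        j++;

      if (!a[i] || !b[j] || a[i] != b[j])
        break;
      i++;
      j++;
    }

  *a_used = i;
  *b_used = j;
}

int
main(void)
{
  size_t i, j;

  common_prefix_ignore_ws("int\tx = 1;", "int     x = 1;", &i, &j);
  printf("consumed %zu vs %zu bytes\n", i, j);  /* 10 vs 14 */
  return 0;
}
]]]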