Johan Corveleyn wrote on Wed, Dec 01, 2010 at 10:05:29 +0100: > On Wed, Dec 1, 2010 at 3:38 AM, Daniel Shahaf <d...@daniel.shahaf.name> wrote: > > Johan Corveleyn wrote on Wed, Dec 01, 2010 at 00:25:27 +0100: > >> I am now considering to abandon the tokens-approach, for the following > >> reasons: > > ... > >> So, unless someone can convince me otherwise, I'm probably going to > >> stop with the token approach. Because of 2), I don't think it's worth > >> it to spend the effort needed for 1), especially because the > >> byte-based approach already works. > >> > > > > In other words, you're saying that the token-based approach: (b) won't be > > as fast as the bytes-based approach can be, and (a) requires much effort > > to be spent on implementing the reverse reading of tokens? (i.e., > > a backwards gets()) > > Yes. > > The reverse reading is quite hard (in the diff_file.c case) because of > the chunked reading of files. A line may be split by a "chunk > boundary" (it may even be split in the middle of an eol sequence > (between \r and \n), and it all still needs to be > canonicalized/normalized correctly (that's why we'll also need a > reverse normalization function). The current forward get_next_token > does this very well, and the reverse should be analogous, but I just > find it hard to reason about, and to get it implemented correctly. It > will take me a lot of time, and with a high probability of introducing > subtle bugs for special edge cases. >
OK, so a reverse get_next_token() could be tricky to implement. But, worse, won't having it mean that implementors of svn_diff_fns_t can't have streamy access to their source? Since they would be required to provide sometimes a token from the start and sometimes a token from the end... Well, it's not streamy in our usual sense of the word, but it's "double-streamy" (no one requests the middle byte until either all bytes before or all bytes after it were transmitted already) > >> Any thoughts? > >> > > > > -tokens/BRANCH-README mentions one of the advantages of the tokens > > approach being that the comparison is done only after whitespace and > > newlines have been canonicalized (if -x-w or -x--ignore-eol-style is in > > effect). IIRC, the -bytes approach doesn't currently take advantage of > > these -x flags? > > > > What is the practical effect of the fact the -bytes approach doesn't > > take advantage of these flags? If a file (with a moderately long > > history) has had all its newlines changed in rN, then I assume that your > > -bytes optimizations will speed up all the diffs that 'blame' performs > > on that file, except for the single diff between rN and its predecessor? > > Yes, I thought that would be an important advantage of the tokens > approach, but as you suggest: the practical effect will most likely be > limited. Indeed, only this single diff between rN and its predecessor > (which may be 1 in 1000 diffs) will not benifit from > prefix/suffix-optimization. Even if the rate of such changes is like > 1/100th (let's say you change leading tabs to spaces, and vice versa, > every 100 revisions), it will hardly be noticeable. > > The perfectionist in me still thinks: hey, you can also implement this > normalization with the byte-based approach (in a streamy way). But > that will probably be quite hard, because I'd have to take into We can add that additional optimization after the basic -bytes support is working? No need to implement all possible optimizations in one shot (let alone those of them that might have little relative benefit). > account that I might have to read more bytes from one file than from > the other file (right now I can always simply read a single byte from > both files). And other than that, it will at least be "algorithm > duplication": it will be a (streamy) re-implementation of the > normalization algorithm that's already used in the token layer. So all > in all it's probably not worth it ... > > Cheers, > -- > Johan