On Thu, Aug 12, 2010 at 5:30 PM, Greg Hudson <ghud...@mit.edu> wrote:
> On Thu, 2010-08-12 at 10:57 -0400, Julian Foad wrote:
>> I'm wary of embedding any client functionality in the server, but I
>> guess it's worth considering if it would be that useful. If so, let's
>> take great care to ensure it's only lightly coupled to the core server
>> logic.
>
> Again, it's possible that binary diffs between sequential revisions
> could be used for blame purposes (not the binary deltas we have now, but
> edit-stream-style binary diffs), which would decouple the
> line-processing logic from the server.
>
> (But again, I haven't thought through the problem in enough detail to be
> certain.)
If such edit-stream-style binary diffs could do the job, and they are "fast enough" (I'm guessing that line-based vs. binary wouldn't make that much of a difference for the eventual blame processing), that seems like a good compromise: we get the performance benefits of blame-oriented deltas (supposedly fast and easy to calculate blame info from), possibly cached on the server, while still not introducing unnecessary coupling of the server to line-processing logic.

Greg, could you explain a bit more what you mean by "edit-stream-style binary diffs", as opposed to the binary deltas we have now? Could you perhaps give an example similar to Julian's? Wouldn't you have the same problem with pieces of the source text being copied out of order (100 bytes from the end/middle of the source being copied to the beginning of the target, followed by the rest of the source)? Wouldn't you also have to do the work of discovering the largest contiguous block of source text as "the main stream", i.e. determine that those first 100 bytes are to be interpreted as new bytes, etc.?

Caching this stuff on the server would of course be ideal. Whether it's done "post-commit" or on demand (the first guy requesting the blame takes the hit), both approaches seem good to me. Working on that would be severely out of my league though :-). At least for now.

Another thing that occurred to me: since most of the time of the current blame implementation is spent on "diff" (svn_diff_file_diff_2), maybe a quick win could be to simply (?) optimize the diff code, or to write a specialized, faster version for blame. In my tests with a 1.5 MB file (61,000 lines), svn diffing it takes about 500 ms on my machine. GNU diff is much faster (300 ms for the first run, 72 ms on subsequent runs). This seems to indicate that there is much room for optimizing svn diff. Or is there something extra that svn diff does that is necessary in the svn context?

I have looked a little at the svn diff code, and saw that most of the time is spent in the while loop inside svn_diff__get_tokens in token.c, presumably extracting the tokens (lines) from the file(s). I haven't looked any further/deeper. Does anybody have any brilliant ideas/suggestions? Or is this a bad idea, not worthy of further exploration :-)?

BTW, I also tested with Stefan Fuhrmann's performance bra...@r985697, just for kicks (I had some trouble building it on Windows, but eventually managed to get an svn.exe out of it). The timing of svn diff on such a large file was about the same, so that didn't help. But maybe the branch isn't ready for prime time just yet ...

Cheers,
-- 
Johan
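
P.S.: To make my question about "edit-stream-style" a bit more concrete, here is a toy sketch of how I currently picture it (my interpretation only -- the types below are hypothetical, not part of any actual svn API). The point would be that a strictly left-to-right edit stream tells you, for every target byte, whether it is carried over from the source or newly introduced, which seems to be exactly what blame needs; the copy instructions of our current binary deltas, by contrast, can jump around in the source:

[[[
/* Toy illustration: apply a left-to-right edit stream to a source
   buffer, marking each output byte as old (carried over) or NEW
   (introduced by this revision). Deletes would simply emit nothing,
   so they are omitted here. */
#include <stdio.h>

typedef enum { EDIT_KEEP, EDIT_INSERT } edit_kind_t;

typedef struct {
  edit_kind_t kind;
  size_t source_offset;  /* for EDIT_KEEP: position in the source,
                            always increasing (no out-of-order copies) */
  size_t length;         /* number of bytes to keep or insert */
  const char *new_data;  /* for EDIT_INSERT: the new bytes */
} edit_t;

int main(void)
{
  const char *source = "aaaa bbbb cccc";
  /* Edit stream turning the source into "aaaa XXXX cccc". */
  edit_t edits[] = {
    { EDIT_KEEP,   0, 5, NULL },
    { EDIT_INSERT, 0, 5, "XXXX " },
    { EDIT_KEEP,  10, 4, NULL },
  };
  size_t i, j;

  for (i = 0; i < sizeof(edits) / sizeof(edits[0]); i++)
    {
      for (j = 0; j < edits[i].length; j++)
        {
          if (edits[i].kind == EDIT_KEEP)
            printf("old: '%c'\n", source[edits[i].source_offset + j]);
          else
            printf("NEW: '%c'\n", edits[i].new_data[j]);
        }
    }
  return 0;
}
]]]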
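
P.P.S.: In case anyone wants to repeat the timing experiment against libsvn_diff directly (instead of going through the svn client like I did), a minimal harness along these lines should do it -- just a sketch, to be compiled against libsvn_diff and APR:

[[[
/* Time a single svn_diff_file_diff_2() call on two files. */
#include <stdio.h>
#include <apr_general.h>
#include <apr_time.h>
#include "svn_pools.h"
#include "svn_error.h"
#include "svn_diff.h"

int main(int argc, const char *argv[])
{
  apr_pool_t *pool;
  svn_diff_t *diff;
  svn_error_t *err;
  apr_time_t start;

  if (argc != 3)
    {
      fprintf(stderr, "usage: %s FILE1 FILE2\n", argv[0]);
      return 1;
    }

  apr_initialize();
  pool = svn_pool_create(NULL);

  start = apr_time_now();
  err = svn_diff_file_diff_2(&diff, argv[1], argv[2],
                             svn_diff_file_options_create(pool), pool);
  if (err)
    {
      svn_handle_error2(err, stderr, FALSE, "diff-timer: ");
      svn_error_clear(err);
      return 1;
    }

  /* apr_time_t is in microseconds. */
  printf("svn_diff_file_diff_2 took %" APR_TIME_T_FMT " usec\n",
         apr_time_now() - start);

  svn_pool_destroy(pool);
  apr_terminate();
  return 0;
}
]]]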