On Fri, Aug 20, 2010 at 9:11 PM, Johan Corveleyn <jcor...@gmail.com> wrote:
> On Fri, Aug 20, 2010 at 1:40 PM, Johan Corveleyn <jcor...@gmail.com> wrote:
>> After eliminating antivirus, and running with a release build instead
>> of a debug build, svn diff is just about on par with GNU diff. So this
>> eliminates the option of optimizing diff ...
>
> Unless ...
>
> For every diff during blame calculation, tokens (lines) are extracted
> and processed each time for the "original" and the "modified". This
> takes up most of the time of svn diff. However, the file that is
> playing the "modified" role in one step will be the "original" in the
> next step of blame processing. So in theory we already have those
> tokens from the previous step, and don't have to extract/compare/...
> them again.
>
> If one could find a way to recycle the tokens from the "modified" of
> the previous diff into the tokens of the "original" of the next diff,
> that would probably make the diff part of the blame operation almost
> twice as fast. And since diffing still accounts for ~90% of blame time
> on the client, that should speed it up considerably.
>
> Sounds like a plan?
>
> I'll try to write some sort of POC for this idea soon, unless someone
> tells me it's a stupid idea :-).
Hm, after several attempts to get this working, it seems to be a dead end. Whatever I do, I can't seem to avoid lifetime issues: dangling pointers, tokens referring to files which are no longer there, ...

I first tried simply passing the position_list[1] (the position_list from the modified file) through to the next diff run as the new position_list[0]. I did this by returning position_list[1] in an output parameter, up to the call in blame.c!add_file_blame, and then reusing it as an input parameter for the next diff call (by stashing it away in the file_rev_baton until the next run).

That works, but it doesn't help, because the real power of the algorithm is in the tree, in which all the tokens from both original and modified are sorted and made unique (i.e. each unique token is represented only once in the tree). Once the tree is correctly set up, the rest of the algorithm is very fast, because it only needs to compare token references to find the longest common subsequence.

At the end of each run the tree is filled with a combination of the tokens of original and modified, so I can't just reuse/recycle it: that would lead to a major memory leak (the tree would keep accumulating the tokens of every revision touched by the blame operation). And refilling a clean tree with the tokens already present in the passed-on position_list doesn't work either, because the token content really needs to remain accessible (either in memory, or readable from the original file), so that token_compare can work when adding the new tokens from the new modified file ...

Anyway, I don't know if this still makes sense. I'm just noting it down to order my thoughts, and maybe to help someone else thinking about this. I'm gonna let it rest now, because I seem to have hit a brick wall with this. I'll focus my efforts on other approaches to speed up blame, such as the more fundamental changes of binary blaming, avoiding diffs, ... but I'm not sure if I'll be able to (have the time to) climb the learning curve for that.

Cheers,
-- 
Johan