On Wed, May 18, 2011 at 1:56 AM, Morten Kloster <mor...@gmail.com> wrote: > Log message: > > Speed-up of libsvn_diff using token counts > By using indices, not node pointers, to refer to tokens, and counting > the number of each token, the longest common subsequence (lcs) > algorithm at the heart of libsvn_diff becomes much faster in many > situations by skipping tokens that are unique to one source or the other. > > * subversion/libsvn_diff/token.c > (svn_diff__node_t): Added 'index' member. > (svn_diff__tree_t): Added 'node_count' member. > (svn_diff__get_token_count): New method. > (tree_insert_token): Assigns indices to nodes. > (svn_diff__get_tokens): Uses index, not pointer, > to identify node in position struct. > > * subversion/libsvn_diff/lcs.c > (svn_diff__snake): Takes token counts as additional argument. > Skips tokens that are unique to one or the other file. > (svn_diff__lcs): Takes token counts and number of different tokens > as additional arguments. Calculates number of tokens unique to > each source and uses this to compute effective length difference. > > * subversion/libsvn_diff/diff.h > (svn_diff__position_t): Changed node member from being > a pointer to the node to an integer index for the node. > Updated method declarations. > > * subversion/libsvn_diff/diff.c > * subversion/libsvn_diff/diff3.c > * subversion/libsvn_diff/diff4.c > (svn_diff_diff_2, svn_diff_diff_3, svn_diff_diff_4): Count the > number of times each token appears in each source. > (svn_diff__resolve_conflict): Takes number of different tokens > as additional argument. Counts the number of times > each token appears in each (partial) source. > > > > > > > Comments: > This patch can reduce the time required for a diff by a large factor > when comparing long files with large differences. As an extreme case, > splitting sqlite.c into two roughly equal parts and diffing them against > each other took about a minute on my laptop with the original code, > and 7-8 seconds with my patch. The speed-up is gained only if a > large fraction of the changed tokens are unique to one source or the > other. I can not see that the change would cause a significant > performance degradation in any reasonable case. > > Counting the number of times each token appears is also useful for > potential additional features in the future, such as identifying moved > blocks outside of the longest common subsequence. > > I have tested only svn_diff_diff_2 (that it gives identical results as > before), but the changes to svn_diff_diff3 and svn_diff_diff4 are > completely analogous. Tested on a Windows 7 64-bit machine, > TortoiseSVN build script. > > > Morten Kloster (first patch)
Interesting! The patch seems to be missing though. The list software tends to strip off attachments if they don't have a text/plain mime-type. Can you try sending it again with a .txt extension? -- Johan