Re: svn diff optimization to make blame faster?
On 15.09.2010 14:20, Johan Corveleyn wrote: Some update on this: I have implemented this for svn_diff (excluding the identical prefix and suffix of both files, and only then starting to fill up the token tree and let the lcs-agorithm to its thing). It makes a *huge* difference. On my bigfile.xml (1.5 Mb) with only one line changed, the call to svn_diff_diff is ~10 times faster (15-20 ms vs. 150-170 ms). Hmmm ... looks to me like test data tailored to the optimization. :) -- Brane
Re: svn diff optimization to make blame faster?
On Mon, Sep 20, 2010 at 11:52 AM, Branko Čibej br...@xbc.nu wrote: On 15.09.2010 14:20, Johan Corveleyn wrote: Some update on this: I have implemented this for svn_diff (excluding the identical prefix and suffix of both files, and only then starting to fill up the token tree and let the lcs-agorithm to its thing). It makes a *huge* difference. On my bigfile.xml (1.5 Mb) with only one line changed, the call to svn_diff_diff is ~10 times faster (15-20 ms vs. 150-170 ms). Hmmm ... looks to me like test data tailored to the optimization. :) Nope, that's real data from a real repository, with a normal kind of change that happens here every day. Of course this optimization is most effective if there are a lot of common prefix/suffix lines. If there is a single change in the first line, and a single change in the last one, this optimization will do nothing but introduce a little bit of extra overhead. And it will obviously make the most impact on large files (in fact it's just relative to the ratio of the number of common prefix/suffix lines to the number of lines in between). I'm sorry it takes me longer than expected to post a version of this to the list, but I'm still having some problems with a couple of edge conditions (I'm learning C as I go, and I'm struggling with a couple of pointer calculations/comparisons). I plan to post something during this week... Cheers, -- Johan
Re: svn diff optimization to make blame faster?
On Sun, Sep 5, 2010 at 1:53 AM, Johan Corveleyn jcor...@gmail.com wrote: On Tue, Aug 24, 2010 at 11:11 AM, Philip Martin philip.mar...@wandisco.com wrote: Johan Corveleyn jcor...@gmail.com writes: On Sun, Aug 22, 2010 at 4:02 PM, Branko Čibej br...@xbc.nu wrote: svn_diff uses basically the same algorithm as GNU diff but implemented slightly differently and IIRC it doesn't have some of GNU diff's optimizations. I'm sure it can be speeded up, but haven't a clue about how much. Ok, thanks. In the meantime I saw that there is not that much difference anymore between GNU diff and svn_diff, after running the latter from a release build, and disabling my anti-virus (which makes me wonder why my anti-virus slows down svn_diff (impact when opening the modified datasource), but not on GNU diff). The big difference between Subversion's diff and GNU diff is that GNU uses heuristics to cut short the diff algorithm whereas Subversion grinds on to the end. Certain files trigger the heuristics and then GNU diff is much faster than Subversion. Typically the file has a large number of matches and non-matches repeated through the file, machine generated files sometimes fit this pattern. GNU diff's heuristics work well so when they trigger the resulting diff is usually good enough. They can be disabled using the --minimal option and using that makes GNU diff performance much more like Subversion. Hmm, I thought harder about this. If I try --minimal with GNU diff, it's just as fast, still significantly faster than svn diff. Then I read a little bit more about diff algorithms, like the blog entry that Greg Hudson suggested in the other thread [1], about the Patience Diff algorithm [2]. In there the following two lines caught my eye, first steps in the diff algo: [[[ 1) Match the first lines of both if they're identical, then match the second, third, etc. until a pair doesn't match. 2) Match the last lines of both if they're identical, then match the next to last, second to last, etc. until a pair doesn't match. ]]] (then the longest common subsequence algorithm kicks in) Now, these steps are not specific to Patience Diff, they are common to most diff algorithms. And I bet it's a much more efficient to exclude the common prefix/suffix this way, than to include them into the entire lcs algorithm. Right now, svn diff doesn't do this AFAICS. However, GNU diff does, and I think that's where most of the performance difference comes from. For big files, where a couple of lines are modified somewhere in the middle (and the first 3 and last 3 lines are identical), I think this can make a big difference. Thoughts? If no-one else fancies this, I think I'll take a stab at implementing this optimization (first with a POC to see if it's significant). Unless feedback tells me it's not worth it, not a good idea, ... Some update on this: I have implemented this for svn_diff (excluding the identical prefix and suffix of both files, and only then starting to fill up the token tree and let the lcs-agorithm to its thing). It makes a *huge* difference. On my bigfile.xml (1.5 Mb) with only one line changed, the call to svn_diff_diff is ~10 times faster (15-20 ms vs. 150-170 ms). This means blame is much faster for such big files as well, because diff accounts for 90% of the client-side work of blame. This means that in most cases the client cpu won't be the bottleneck anymore, so it's more constrained by server-side and network. In my tests, blame was about twice as fast for bigfile.xml (since all my tests were done with client and server on the same machine, I'm guessing the server is now the bottleneck (as well as them running on the same machine)). Same factor of performance increase for a medium-sized file (a java source file of 300 Kb). The patch is still very crude, with lots of code duplication, and everything stuffed in one big function (and I may have overlooked some edge conditions), so I don't feel it's ready to post yet (don't want to waste people's time :)). I'll try to clean it up as soon as I can and post a reasonable version to the list. Some notes already: - I basically wrote a new function datasources_open in diff_file.c (similar to datasource_open), that opens both files (original and modified), and compares bytes from the beginning until the first nonmatching byte (counting newlines when it sees them), and backwards from the end to find the first (or last, depending how you look at it) nonmatching byte. The number of prefix_lines is then used to determine the starting offset of the tokens (lines) that will later be extracted. - I think I should generalize this function to take an array of datasources, and not just two, so it can be used by diff3 and diff4 as well. But for now I didn't want to take on that extra complication. - The code is complicated by the fact that the svn_diff code (diff_file.c) works with chunks of data (doesn't read the entire file(s)
Re: svn diff optimization to make blame faster?
On Wed, Sep 15, 2010 at 02:20:05PM +0200, Johan Corveleyn wrote: Just to know how hard/fast I should pursue this: is there any chance that a change like this could still make it into 1.7 (maybe even backported to 1.6?), provided that the patch is of good quality and all... ? Depending on how invasive the change is, and how quickly we get it reviewed and tested, there's a chance that this can make the 1.7 release. For 1.6 it's probably not worth the effort. Stefan
Re: svn diff optimization to make blame faster?
Johan Corveleyn wrote on Wed, Sep 15, 2010 at 14:20:05 +0200: The patch is still very crude, with lots of code duplication, and everything stuffed in one big function (and I may have overlooked some edge conditions), so I don't feel it's ready to post yet (don't want to waste people's time :)). I'll try to clean it up as soon as I can and post a reasonable version to the list. +1, sounds like you have some interesting stuff working there. :-)
Re: svn diff optimization to make blame faster?
On Sun, Aug 22, 2010 at 4:02 PM, Branko Čibej br...@xbc.nu wrote: On 18.08.2010 00:59, Johan Corveleyn wrote: Hi devs, While Looking to improve performance of svn annotate [1], I found that the current blame algorithm is mainly client-side bound, and that most of its time is spent on svn diff (calls to svn_diff_file_diff_2 from add_file_blame in blame.c). Apart from avoiding to build full-texts and diffing them altogether (which is subject of further discussion in [1]), I'm wondering if optimization of svn diff wouldn't also be an interesting way to improve the speed of blame. So the main question is: is it worth it to spend time to analyze this further and try to improve performance? Or has this already been optimized in the past, or is it simply already as optimal as it can get? I have no idea really, so if anyone can shed some light ... Gut feeling tells me that there must be room for optimization, since GNU diff seems a lot faster than svn diff for the same large file (with one line changed) on my machine [1]. But maybe svn's diff algorithm is purposefully different (better? more accurate? ...) than GNU's, or there are specific things in the svn context so svn diff has to do more work. Any thoughts? svn_diff uses basically the same algorithm as GNU diff but implemented slightly differently and IIRC it doesn't have some of GNU diff's optimizations. I'm sure it can be speeded up, but haven't a clue about how much. Ok, thanks. In the meantime I saw that there is not that much difference anymore between GNU diff and svn_diff, after running the latter from a release build, and disabling my anti-virus (which makes me wonder why my anti-virus slows down svn_diff (impact when opening the modified datasource), but not on GNU diff). There may still be some slight speed difference (enough to be significant for a blame operation doing 100's or 1000's of diffs), but not that much as I thought at first. So I don't think I'm going to spend more time on trying to speed up svn_diff (also, I'm not really an expert at optimizing C code, so ... I'll leave that to others :-)). Cheers, -- Johan
Re: svn diff optimization to make blame faster?
Johan Corveleyn jcor...@gmail.com writes: Ok, thanks. In the meantime I saw that there is not that much difference anymore between GNU diff and svn_diff, after running the latter from a release build, and disabling my anti-virus (which makes me wonder why my anti-virus slows down svn_diff (impact when opening the modified datasource), but not on GNU diff). The big difference between Subversion's diff and GNU diff is that GNU uses heuristics to cut short the diff algorithm whereas Subversion grinds on to the end. Certain files trigger the heuristics and then GNU diff is much faster than Subversion. Typically the file has a large number of matches and non-matches repeated through the file, machine generated files sometimes fit this pattern. GNU diff's heuristics work well so when they trigger the resulting diff is usually good enough. They can be disabled using the --minimal option and using that makes GNU diff performance much more like Subversion. -- Philip
Re: svn diff optimization to make blame faster?
On Thu, Aug 19, 2010 at 11:38 PM, Johan Corveleyn jcor...@gmail.com wrote: Feh, I just redid my apr_time_now+printf profiling with a release build (of trunk), instead of a debug build, and that makes a *big* difference. Total time of the svn_diff_diff call is now down to ~300 ms. Still a lot slower than GNU diff (78 ms), but a lot better than with the debug build. That will teach me not to do performance tests with a debug build. Still, the challenge holds: why is svn diff ~4 times slower than GNU diff? Also, still puzzling to me: why does datasource_open take so long in the modified case? Now it takes twice as long as the while loop ... @#...@#!!! /me bangs head against wall The slowness of datasource_open was because of my antivirus (AVG free edition). After disabling it, it's completely gone. Running a release build, with antivirus disabled, and minimal instrumentation brings svn diff performance (after a couple of runs) more or less on par with GNU diff: $ time svn diff bigfile.xml Starting diff ... (entered svn_diff_diff in diff.c) Diff finished in 78125 usec (78 millis) [snip] real0m0.359s user0m0.015s sys 0m0.031s I guess this closes this thread ... on to more useful stuff. There is no point in trying to optimize svn diff, so I guess blame can only be sped up by more fundamental changes (i.e. avoiding diffs, which still amount to ~90% of blame time on the client side). Sorry for any wasted time. (on the bright side, I learned quite a lot about accurately measuring performance) Cheers, -- Johan
Re: svn diff optimization to make blame faster?
Johan Corveleyn wrote on Fri, Aug 20, 2010 at 01:43:16 +0200: If anyone has any more suggestions, short of installing a Linux distribution and trying a linux profiler ... Ask one of the other devs to do the profiling for you?
Re: svn diff optimization to make blame faster?
On Wed, Aug 18, 2010 at 3:44 PM, Johan Corveleyn jcor...@gmail.com wrote: On Wed, Aug 18, 2010 at 12:49 PM, Stefan Sperling s...@elego.de wrote: Can you show a profiler run that illustrates where the client is spending most of its time during diff? That would probably help with getting opinions from people, because it saves them from spending time doing this research themselves. You've already hinted at svn_diff__get_tokens() in another mail, but a real profiler run would show more candidates. Sorry, I'm still very much a beginner in C (I've been programming for 10 years, but mainly in Java). Especially the tooling is a steep part of the learning curve :-), so I don't know (how to use) a profiler yet. Any suggestions? I'm on Windows (XP), using VCE 2008, and used cygwin to compare with GNU diff. I googled around a bit for C profilers on Windows. I found several which seem useful: - very sleepy (http://www.codersnotes.com/sleepy/sleepy) - Shiny (http://sourceforge.net/projects/shinyprofiler/) - AMD CodeAnalyst (http://developer.amd.com/cpu/CodeAnalyst/Pages/default.aspx) - AQTime - not free, but has a trial of 30 days (http://www.automatedqa.com/products/aqtime/) Before I dive in and try them out: any preference/favorites from the windows devs on this list? Or other suggestions? Further, some additional context to the manual-profiling numbers: see below. For the time being, I've helped myself using apr_time_now() from apr_time.h and printf statements. Not terribly accurate and probably somewhat overheadish, but it gives me some hints about the bottlenecks. FWIW, below is the output of a recent run with my instrumented trunk build. I've instrumented svn_diff_diff in diff.c and svn_diff__get_tokens in token.c. I checked out bigfile.xml from a repository, and edited a single line of it in my working copy. The two statements Starting diff and Diff finished are the first and last statements inside the svn_diff_diff function. These are numbers from the second run (any following runs show approximately the same numbers). $ time svn diff bigfile.xml Starting diff ... (entered svn_diff_diff in diff.c) - calling svn_diff__get_tokens for svn_diff_datasource_original == svn_diff__get_tokens datasource_open: 0 usec == svn_diff__get_tokens while loop: 265625 usec calls to datasource_get_next_token: 62500 usec svn_diff__tree_insert_token: 171875 usec rest: 15625 usec - done: 281250 usec - calling svn_diff__get_tokens for svn_diff_datasource_modified == svn_diff__get_tokens datasource_open: 234375 usec == svn_diff__get_tokens while loop: 312500 usec calls to datasource_get_next_token: 62500 usec svn_diff__tree_insert_token: 171875 usec rest: 46875 usec - done: 562500 usec - calling svn_diff__lcs - done: 0 usec - calling svn_diff__diff - done: 0 usec Diff finished in 875000 usec (875 millis) Index: bigfile.xml === [snip] real 0m1.266s user 0m0.015s sys 0m0.031s For comparison: a debug build from trunk, with only one instrumentation spot (at start and end of svn_diff_diff): $ time svn diff bigfile.xml Starting diff ... (entered svn_diff_diff in diff.c) Diff finished in 703125 usec (703 millis) [snip] real0m1.109s user0m0.015s sys 0m0.031s So the instrumentation in the critical loop probably costs me around 150-200 ms. A release build will also probably be faster, but I have no time to make one and test it now. If I time my 1.6.5 build from tigris.org, that's still a lot faster: $ time svn diff bigfile.xml [snip] real0m0.468s user0m0.015s sys 0m0.031s Maybe that can be totally attributed to the build (release vs. debug), or maybe 1.6.5 was faster at svn diff than trunk is? Some observations: - svn_diff__get_tokens takes most of the time - for some reason, in the case of datasource_modified, the call to datasource_open takes a long time (234 ms). In case of datasource_original, it's instantaneous. Maybe because of translation, ... of the pristine file? But I'd think that would be the original?? - apart from that, most of the time goes to the while loop in svn_diff__get_tokens. - Inside the while loop, most time is spent on calls to svn_diff__tree_insert_token (which compares tokens (=lines) to insert them into some tree structure). Calls to datasource_get_next_token also take some time. I'm not too sure of the accuracy of those last numbers with my simple profiling method, because it's the accumulated time of calls inside a while loop (with 61000 iterations). For completeness, the same diff with /usr/bin/diff (under cygwin), of the edited bigfile.xml vs. the original, as of the second diff run: $ ls -l settings.xml -rwxr-xr-x+ 1 User None 1447693 2010-08-17 14:20 bigfile.xml $ time diff
RE: svn diff optimization to make blame faster?
-Original Message- From: Johan Corveleyn [mailto:jcor...@gmail.com] Sent: donderdag 19 augustus 2010 4:55 To: Subversion Development Subject: Re: svn diff optimization to make blame faster? I googled around a bit for C profilers on Windows. I found several which seem useful: - very sleepy (http://www.codersnotes.com/sleepy/sleepy) - Shiny (http://sourceforge.net/projects/shinyprofiler/) - AMD CodeAnalyst (http://developer.amd.com/cpu/CodeAnalyst/Pages/default.aspx) - AQTime - not free, but has a trial of 30 days (http://www.automatedqa.com/products/aqtime/) Microsoft and Intel have some advanced profilers integrated in their C products. (Only a bit of experience with those) I had good results with very sleepy. When you have the PDB files available it is as simple as just running the application and you get an easy to understand result. (Make sure you have a recent version; the older versions only allowed attaching to an already running process) Bert
Re: svn diff optimization to make blame faster?
On Wed, Aug 18, 2010 at 12:59:21AM +0200, Johan Corveleyn wrote: Hi devs, While Looking to improve performance of svn annotate [1], I found that the current blame algorithm is mainly client-side bound, and that most of its time is spent on svn diff (calls to svn_diff_file_diff_2 from add_file_blame in blame.c). Apart from avoiding to build full-texts and diffing them altogether (which is subject of further discussion in [1]), I'm wondering if optimization of svn diff wouldn't also be an interesting way to improve the speed of blame. So the main question is: is it worth it to spend time to analyze this further and try to improve performance? Or has this already been optimized in the past, or is it simply already as optimal as it can get? I have no idea really, so if anyone can shed some light ... Gut feeling tells me that there must be room for optimization, since GNU diff seems a lot faster than svn diff for the same large file (with one line changed) on my machine [1]. But maybe svn's diff algorithm is purposefully different (better? more accurate? ...) than GNU's, or there are specific things in the svn context so svn diff has to do more work. Any thoughts? Can you show a profiler run that illustrates where the client is spending most of its time during diff? That would probably help with getting opinions from people, because it saves them from spending time doing this research themselves. You've already hinted at svn_diff__get_tokens() in another mail, but a real profiler run would show more candidates. Thanks, Stefan
svn diff optimization to make blame faster?
Hi devs, While Looking to improve performance of svn annotate [1], I found that the current blame algorithm is mainly client-side bound, and that most of its time is spent on svn diff (calls to svn_diff_file_diff_2 from add_file_blame in blame.c). Apart from avoiding to build full-texts and diffing them altogether (which is subject of further discussion in [1]), I'm wondering if optimization of svn diff wouldn't also be an interesting way to improve the speed of blame. So the main question is: is it worth it to spend time to analyze this further and try to improve performance? Or has this already been optimized in the past, or is it simply already as optimal as it can get? I have no idea really, so if anyone can shed some light ... Gut feeling tells me that there must be room for optimization, since GNU diff seems a lot faster than svn diff for the same large file (with one line changed) on my machine [1]. But maybe svn's diff algorithm is purposefully different (better? more accurate? ...) than GNU's, or there are specific things in the svn context so svn diff has to do more work. Any thoughts? -- Johan [1] http://svn.haxx.se/dev/archive-2010-08/0408.shtml