request for new API function
Hi,

To find all files and folders that have a specific property set, I need to crawl the whole working copy, fetch the properties of each and every item, and then scan the returned property list for that property. But WC-NG uses an SQLite db, so this task should be much faster, with a lot less disk access. I'd like to ask for a new API which would allow me to search the database for properties much faster. Something like svn_wc_propsearch() with parameters:
* string to search for
* value indicating how the search is done (equal, contains, begins with, ... basically all the possible match functions SQLite offers)
* bool indicating whether the search is done on the NODES table or the ACTUAL_NODE table
* callback function which receives the search results

The callback function would receive not just the properties and the paths of the items the found properties belong to, but also any other information the table row contains (when information is available for free, an API should return it to avoid forcing clients to call yet another API to get that info; clients that don't need the info can just ignore it).

And while we're at it: new APIs to search other columns in the NODES and ACTUAL_NODE tables would also be nice and useful:
* search for 'changed_revision' and/or 'changed_date': allows showing a view of all files that haven't changed for a long time, or files that have changed recently, or ...
* search for 'changed_author': allows showing a quick view of who changed what last, gathering statistics, ...
* search for 'depth': quickly determine whether the working copy is full or may be missing something
* search for 'file_external'

With such new APIs, clients could use the advantages of WC-NG too. A lot of ideas I had couldn't be done before because they would have been just too slow. But now, with WC-NG and the database, it would be fast enough.

Thoughts?
Stefan -- ___ oo // \\ De Chelonian Mobile (_,\/ \_/ \ TortoiseSVN \ \_/_\_/The coolest Interface to (Sub)Version Control /_/ \_\ http://tortoisesvn.net
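The proposed svn_wc_propsearch() can be sketched in miniature with Python's stdlib sqlite3. Everything here is illustrative: the schema is a simplified stand-in (the real wc.db stores all of a node's properties together in a single serialized column, not one row per property), and the match-kind names are the hypothetical ones from the mail, not an existing API.

```python
import sqlite3

# Toy stand-in for wc.db's NODES table. One row per property here, purely
# to keep the matching logic visible; this is NOT the real wc.db layout.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE nodes (
    wc_id INTEGER, local_relpath TEXT, propname TEXT, propvalue TEXT)""")
conn.executemany(
    "INSERT INTO nodes VALUES (1, ?, ?, ?)",
    [("src/main.c", "svn:eol-style", "native"),
     ("docs/readme", "svn:mime-type", "text/plain"),
     ("src/util.c", "svn:eol-style", "native")])

# Hypothetical propsearch: the match kind selects the SQL operator, and
# the callback receives the whole row, mirroring the proposed API shape.
MATCH = {"equal": "propname = ?",
         "begins_with": "propname LIKE ? || '%'",
         "contains": "propname LIKE '%' || ? || '%'"}

def propsearch(conn, text, kind, callback):
    for row in conn.execute(
            "SELECT local_relpath, propname, propvalue FROM nodes "
            "WHERE wc_id = 1 AND " + MATCH[kind], (text,)):
        callback(*row)

hits = []
propsearch(conn, "svn:eol", "begins_with",
           lambda path, name, value: hits.append(path))
print(sorted(hits))  # -> ['src/main.c', 'src/util.c']
```

The point of the sketch is that one SELECT answers the whole-working-copy question that currently requires a full crawl.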
Re: SQLite and callbacks
On Fri, Feb 04, 2011 at 03:29:37AM +0200, Daniel Shahaf wrote:

(test script attached --- hope it goes through)

Stefan Sperling wrote on Wed, Feb 02, 2011 at 18:53:39 +0100:
I've made svn proplist issue per-directory queries in r1066541. Reviews of this change are most welcome. Also, if someone has a nice way of testing performance impacts of this change it would be interesting to see the results. I have not done any benchmarking myself yet.

Working copy of /tags/1.6.*, no local mods, r1066541:
[[[
== pristine-t1-1.out ==
233.55user 78.55system 5:12.08elapsed 100%CPU (0avgtext+0avgdata 32176maxresident)k
0inputs+0outputs (0major+6199minor)pagefaults 0swaps
== pristine-t1-2.out ==
237.68user 79.22system 5:16.88elapsed 100%CPU (0avgtext+0avgdata 32176maxresident)k
0inputs+0outputs (0major+6200minor)pagefaults 0swaps
== pristine-t1-3.out ==
232.96user 76.04system 5:08.98elapsed 100%CPU (0avgtext+0avgdata 32176maxresident)k
0inputs+0outputs (0major+6197minor)pagefaults 0swaps
]]]

Ditto, r1066540:
[[[
== pristine-t2-1.out ==
253.41user 82.90system 5:36.27elapsed 100%CPU (0avgtext+0avgdata 30480maxresident)k
0inputs+0outputs (0major+6099minor)pagefaults 0swaps
== pristine-t2-2.out ==
252.31user 82.52system 5:34.81elapsed 100%CPU (0avgtext+0avgdata 30480maxresident)k
0inputs+0outputs (0major+6090minor)pagefaults 0swaps
== pristine-t2-3.out ==
234.95user 75.25system 5:10.15elapsed 100%CPU (0avgtext+0avgdata 30480maxresident)k
0inputs+0outputs (0major+6092minor)pagefaults 0swaps
]]]

So, 5% more memory, but (excluding the t2-3 result) post-change is 7% faster than pre-change. Is this difference really significant?
Ditto, with local prop mods on all *.c/*.py files:
[[[
== pset-Rpyc-t1-1.out ==
219.41user 67.76system 4:47.13elapsed 100%CPU (0avgtext+0avgdata 32192maxresident)k
0inputs+0outputs (0major+6200minor)pagefaults 0swaps
== pset-Rpyc-t1-2.out ==
218.48user 65.74system 4:44.18elapsed 100%CPU (0avgtext+0avgdata 32192maxresident)k
0inputs+0outputs (0major+6205minor)pagefaults 0swaps
== pset-Rpyc-t1-3.out ==
219.14user 67.83system 4:46.94elapsed 100%CPU (0avgtext+0avgdata 32192maxresident)k
0inputs+0outputs (0major+6197minor)pagefaults 0swaps
]]]

Ditto, r1066540:
[[[
== pset-Rpyc-t2-1.out ==
217.34user 64.90system 4:42.20elapsed 100%CPU (0avgtext+0avgdata 30480maxresident)k
0inputs+0outputs (0major+6097minor)pagefaults 0swaps
== pset-Rpyc-t2-2.out ==
215.01user 67.10system 4:42.08elapsed 100%CPU (0avgtext+0avgdata 30464maxresident)k
0inputs+0outputs (0major+6097minor)pagefaults 0swaps
== pset-Rpyc-t2-3.out ==
212.82user 63.68system 4:36.46elapsed 100%CPU (0avgtext+0avgdata 30480maxresident)k
0inputs+0outputs (0major+6094minor)pagefaults 0swaps
]]]

Pre-change is faster.

I'll look at profiler output. It has much better granularity. It's possible that we still run queries elsewhere that affect the results. We need to know whether the runtime overhead of sqlite queries has decreased. Of course, with some usage patterns (e.g. working copies which contain only directories, or disproportionately more directories than files), we can expect little to no difference.

Stefan
Re: [Proposed] Split very long messages by paragraph for easy translate
On Fri, Feb 04, 2011 at 04:10:30PM +0800, Dongsheng Song wrote:

On Fri, Feb 4, 2011 at 15:59, Dongsheng Song dongsheng.s...@gmail.com wrote:

On Sun, Nov 14, 2010 at 01:19, Greg Hudson ghud...@mit.edu wrote:

On Sat, 2010-11-13 at 10:31 -0500, Daniel Shahaf wrote:
Sounds reasonable. What changes to the source code would be required? Do we just change N_("three\n\nparagraphs\n\nhere\n") to N_("three\n") N_("paragraphs\n") N_("here\n")?

No, that would just result in evaluating gettext on the combined string, same as before. I can see two options for help strings in particular:
1. Rev svn_opt_subcommand_desc2_t to include an array of help strings which are translated and displayed in sequence.
2. Change print_command_info2 to look at the help string and break it up at certain boundaries (such as blank lines or numbered list entries) before translating it.
(Mercurial is written in Python, so it has different constraints.)

Changing svn_opt_subcommand_desc2_t.help from 'const char *' to 'const char **' may break the ABI; I don't think we can do that in the near future.

We won't change svn_opt_subcommand_desc2_t. Instead, we would introduce a new svn_opt_subcommand_desc3_t.

Changing print_command_info2 has similar issues, and it's more complex to generate and store strings for translating.

There would be a new print_command_info3. It might be more complex to generate and store multiple strings. But if this helps translators, I think we should do it. We must find a nice way of splitting up help texts; large help texts like this put too much burden on translators.

I have another approach: introduce a new function to concatenate many strings, e.g.

const char *
svn_opt__string_concat(apr_pool_t *pool, const char *s1, ...)

The last parameter should be NULL to indicate it is the last parameter. Then the very long messages:

N_("para1 para2 para3 ... paraN")

can be changed to:

svn_opt__string_concat(N_("para1"), N_("para2"), N_("para3"), ..., N_("paraN"), NULL)

Why do I recall this thread?
Because when I was translating today, I found a message which has 255 lines (svn/main.c:676). It's super terrible! I cannot imagine the maintenance work when only one line changes! Any comments?

-- Dongsheng Song

Oops, apr_pstrcat can concatenate multiple strings; no need for svn_opt__string_concat. http://apr.apache.org/docs/apr/1.4/group__apr__strings.html#g7bd80c95ffb7b3f96bc78e7b5b5b0045

Unfortunately, the help text is part of the definition of a static array. So at the moment we cannot use apr_pstrcat() to split up the help text.

Stefan
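Greg's option 2 above, splitting the help string at blank-line boundaries and translating each paragraph separately, can be sketched like this. It's Python rather than C, with a stub translate() standing in for gettext, and the help text and the German string are invented examples; the point is only that a one-line change then invalidates a single paragraph in the translation catalog, not the whole 255-line message.

```python
# Stub translator: a tiny catalog standing in for gettext().
# Untranslated paragraphs fall through unchanged, as gettext does.
def translate(paragraph):
    catalog = {"Print the status.": "Status ausgeben."}
    return catalog.get(paragraph, paragraph)

def print_help(help_text):
    # Split at blank lines, translate each paragraph on its own,
    # then reassemble -- one catalog entry per paragraph.
    paragraphs = help_text.split("\n\n")
    return "\n\n".join(translate(p) for p in paragraphs)

help_text = "Print the status.\n\nusage: svn status [PATH...]"
print(print_help(help_text))
```

Here the first paragraph is translated and the usage line passes through untouched, since it has no catalog entry.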
Re: request for new API function
On Sat, Feb 05, 2011 at 10:28:20AM +0100, Stefan Küng wrote:
I'd like to ask for a new API which would allow me to search the database for properties much faster. [...] With such new APIs, clients could use the advantages of WC-NG too. A lot of ideas I had couldn't be done before because they would have been just too slow. But now, with WC-NG and the database, it would be fast enough. Thoughts?

I think we should go in this direction.
In fact, I think we should simply change the existing APIs to use the fastest possible way of getting at information. Most code we have still crawls the working copy, and that is an artifact of how the 1.6.x working copy was structured. We're now at a single DB, but we're not yet using the single DB to its full potential. We're currently treating it more or less like a key/value store for information about paths.

Have you seen r1039808 and the resulting 'SQLite and callbacks' thread on dev@? That thread describes some of the issues we're facing with the interaction between callbacks in our APIs and sqlite queries.

There were two approaches discussed in that thread. I am currently experimenting with the per-directory queries approach (see r1051452 and r1066541). I'm expecting this to be too slow, but I'm doing it anyway for two reasons. One is that we'll have real data to look at. The other is that we might need code that does per-directory queries anyway to satisfy backwards compatibility constraints (see the 'SQLite and callbacks' thread for details).

I think we will eventually need to query the database like people would normally query a database, letting sqlite do most of the work of pulling data out of the db. However, we need to agree on how to handle the implications this has for the existing APIs.

Stefan
Re: request for new API function
On Sat, Feb 05, 2011 at 01:56:41PM +0100, Stefan Sperling wrote: There were two approaches discussed in that thread. I am currently experimenting with the queries per-directory approach (see r1051452 and r1066541). Sorry, I meant r1050650, not r1051452.
Re: [Proposed] Split very long messages by paragraph for easy translate
On Sat, 2011-02-05, Stefan Sperling wrote:
It might be more complex to generate and store multiple strings. But if this helps translators, I think we should do it. We must find a nice way of splitting up help texts; large help texts like this put too much burden on translators.

Looking at this concern from the outside, I have a question. Do any of the commonly available translation tools do a good job of assisting with translating long strings? I would expect that merely showing a diff of the old and new English strings would be enough assistance in most cases. I can see how using a simple tool that doesn't even show a diff would make the job very difficult.

- Julian
Re: SQLite and callbacks
On Sat, Feb 05, 2011 at 01:11:28PM +0100, Stefan Sperling wrote:
I'll look at profiler output. It has much better granularity.

I took 3 profiled runs of svn pl -v -R on an svn-trunk working copy with and without the patch. The results were pretty much the same for all runs. The numbers below are from the second run. The change has reduced the number of queries we issue, but we are still issuing way too many queries for the impact to be significant.

Before:
[[[
                                called/total     parents
index  %time   self descendents  called+self    name          index
                                called/total     children
               0.14     1.03    8467/8467       sqlite3Step [8]
[10]    55.2   0.14     1.03    8467            sqlite3VdbeExec [10]
               [children of sqlite3VdbeExec omitted]

  %   cumulative   self              self     total
 time   seconds   seconds   calls  ms/call  ms/call  name
 22.7      0.48      0.48                            _mcount [19]
 12.3      0.74      0.26                            __mcount [25]
 10.4      0.96      0.22   478996    0.00     0.00  sqlite3BtreeMovetoUnpacked [18]
  6.6      1.10      0.14     8467    0.02     0.14  sqlite3VdbeExec [10]
]]]

After:
[[[
                                called/total     parents
index  %time   self descendents  called+self    name          index
                                called/total     children
               0.17     1.07    4832/4832       sqlite3Step [8]
[9]     64.4   0.17     1.07    4832            sqlite3VdbeExec [9]
               [again, children of sqlite3VdbeExec omitted]

  %   cumulative   self              self     total
 time   seconds   seconds   calls  ms/call  ms/call  name
 21.8      0.42      0.42                            _mcount [18]
 13.5      0.68      0.26   473315    0.00     0.00  sqlite3BtreeMovetoUnpacked [16]
  8.8      0.85      0.17     4832    0.04     0.26  sqlite3VdbeExec [9]
  8.8      1.02      0.17                            __mcount [20]
]]]

The number of sqlite3VdbeExec() invocations has been reduced from 8467 to 4832. But 4832 queries is still way too much when compared to not crawling the tree at all.

I don't have related gprof data anymore, but the log message of r1039808 contains some hints: gprof says this change speeds up 'svn proplist -R svn-trunk-wc > /dev/null' on my box from about 0.35s to about 0.11s. During profiled runs, the function sqlite3Step() went from using 27% of the total time down to 15%. The time spent within and below svn_client_proplist3() went down from 51% to 36%.
The profiler overhead accounted for about 31% of the total run time before this change and about 45% after.

So, as expected, r1039808 had a much greater effect than r1066541. Note that for our purposes sqlite3Step() is equivalent to sqlite3VdbeExec(). _mcount in the gprof output above shows the profiler overhead.

I think we should strongly consider revving the affected callbacks in the 1.7 API and documenting the restrictions they have to heed. Then we can bring the r1039808 code back. We can keep the code added in r1066541 for backwards compatibility if desired.

Stefan
Re: SQLite and callbacks
On Sat, Feb 05, 2011 at 03:29:23PM +0100, Stefan Sperling wrote:
I think we should strongly consider revving the affected callbacks in the 1.7 API and documenting the restrictions they have to heed. Then we can bring the r1039808 code back. We can keep the code added in r1066541 for backwards compatibility if desired.

By the way, I think this decision is very important for the progress of wc-ng. And I also think it has way too much impact to be decided on a lazy consensus basis. So I'm *not* going to move forward with this before I've heard at least a couple of affirmative opinions. Please consider taking the time to think about this issue and let your opinion be heard. Thanks.
Re: request for new API function
On 05.02.2011 13:56, Stefan Sperling wrote:
I think we should go in this direction. In fact, I think we should simply change the existing APIs to use the fastest possible way of getting at information.

Well, currently there is no API that does what I suggested (basically, return all results of a db query without even touching any files in the WC, or do this for every file/folder separately).

[...]

I've read up on that thread. It seems the problem you're facing comes from the fact that you need to stay compatible with pre-1.7 APIs and clients, and the fact that you can't force clients to behave, only ask them to behave and then hope for the best. However, what I'm asking for here are *new* APIs which do something no existing API currently does. So staying compatible wouldn't be a problem.

And if you're worried about clients not behaving properly, why not get rid of the callback completely and just return all information at once in one big chunk of memory? Talking about UI clients, this won't be a problem because they usually have to store all the information they receive in a callback anyway, so they have it ready to show in the UI. So for them, the memory use wouldn't be bigger at all.

Of course, the APIs I'm asking for might not be very useful for existing APIs or other stuff that is done in the svn library. They might only be useful for some svn clients. But I hope that's not a blocker for implementing them.

I also thought of just querying the SQLite db myself directly, but I don't like to do something that's not really allowed. However: I did a quick test with the Check-for-modifications dialog in TSVN. It has a feature where you can enable showing all properties. To do that, a separate thread is started which lists all properties of all items in the working copy. On one of my working copies, this takes about 50 seconds. Using a simple SQLite query on the NODES table took on average 1260 ms. Parsing the data and preparing it for use in the UI took another 3.5 seconds. Now *that's* a speed improvement I really like.

Stefan
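The 50-seconds-versus-1.3-seconds gap described above is mostly per-node overhead: one statement per path versus one statement for the whole table. The two access patterns can be sketched with Python's stdlib sqlite3 on an invented schema (the table layout, row count, and timings here are illustrative, not the real wc.db):

```python
import sqlite3
import time

# Toy NODES-like table with an invented layout and 10000 rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes "
             "(wc_id INTEGER, local_relpath TEXT, properties TEXT)")
conn.executemany("INSERT INTO nodes VALUES (1, ?, 'props')",
                 [(f"dir/file{i}",) for i in range(10000)])
conn.execute("CREATE INDEX i_relpath ON nodes (wc_id, local_relpath)")

# Pattern 1: one bulk query over the whole working copy.
t0 = time.perf_counter()
bulk = conn.execute("SELECT local_relpath, properties FROM nodes "
                    "WHERE wc_id = 1").fetchall()
t_bulk = time.perf_counter() - t0

# Pattern 2: one query per path (the crawl-style access pattern).
t0 = time.perf_counter()
per_path = [conn.execute("SELECT properties FROM nodes "
                         "WHERE wc_id = 1 AND local_relpath = ?",
                         (path,)).fetchone()
            for path, _ in bulk]
t_per_path = time.perf_counter() - t0

print(f"1 query:       {t_bulk:.4f}s for {len(bulk)} rows")
print(f"10000 queries: {t_per_path:.4f}s for the same rows")
```

Even with an index making each per-path lookup cheap, the fixed cost of preparing and stepping thousands of statements dominates, which is the same shape as the 8467-queries problem discussed in the 'SQLite and callbacks' thread.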
Re: request for new API function
On Sat, Feb 05, 2011 at 04:22:29PM +0100, Stefan Küng wrote:
Well, currently there is no API that does what I suggested (basically, return all results of a db query without even touching any files in the WC, or do this for every file/folder separately).

Well, the svn_proplist case I'm looking at is the same thing. I want to answer the request "Give me all properties within this working copy" by issuing as few sqlite queries as possible. You are talking about such requests in general. I am talking about one specific instance (proplist). But essentially we want the same thing.

I've read up on that thread. It seems the problem you're facing comes from the fact that you need to stay compatible with pre-1.7 APIs and clients, and the fact that you can't force clients to behave, only ask them to behave and then hope for the best.

Yes.

However, what I'm asking for here are *new* APIs which do something no existing API currently does. So staying compatible wouldn't be a problem.

New APIs don't make the problems we have with existing APIs go away. The backwards compat problem doesn't affect TortoiseSVN, since you will simply provide builds linked to 1.7 and tell all users to upgrade. But we need to keep existing clients that were compiled against 1.6.x and earlier working. So it's not 'not a problem', it's just not TortoiseSVN's problem :)

And if you're worried about clients not behaving properly, why not get rid of the callback completely and just return all information at once in one big chunk of memory? Talking about UI clients, this won't be a problem because they usually have to store all the information they receive in a callback anyway, so they have it ready to show in the UI. So for them, the memory use wouldn't be bigger at all.

Really? Even with gigantic working copies? What if the amount of information requested simply doesn't fit into memory? I'd prefer a system that cannot fail in this way. I'd prefer passing the information to the client piece by piece, and letting the client worry about where to store it. If at all possible, the library shouldn't allocate huge slabs of memory outside of the client's control.

Of course, the APIs I'm asking for might not be very useful for existing APIs or other stuff that is done in the svn library. They might only be useful for some svn clients. But I hope that's not a blocker for implementing them.

I hope that we'll get a good set of APIs for 1.7 that will satisfy all clients out there, including TortoiseSVN. What these APIs will look like isn't set in stone yet.

I also thought of just querying the SQLite db myself directly, but I don't like to do something that's not really allowed. However: I did a quick test with the Check-for-modifications dialog in TSVN. [...] On one of my working copies, this takes about 50 seconds. Using a simple SQLite query on the NODES table took on average 1260 ms. Parsing the data and preparing it for use in the UI took another 3.5 seconds. Now *that's* a speed improvement I really like.

How is your query any different from the new proplist API and implementation I added in r1039808? I think that provides what you need (for the proplist case). It opens the db, runs a query on it and streams the results to the client via a callback. Very low overhead.

Stefan
Re: request for new API function
On 05.02.2011 16:46, Stefan Sperling wrote:
Well, the svn_proplist case I'm looking at is the same thing. I want to answer the request "Give me all properties within this working copy" by issuing as few sqlite queries as possible. You are talking about such requests in general. I am talking about one specific instance (proplist). But essentially we want the same thing.

Yes, the proplist case is what I need. But of course, the other cases I listed are also valid requests. For example, "give me all entries that have the author XXX as the last commit" or "give me all paths that were changed since date XXX". These could also be answered by sqlite queries alone. Or even "give me all paths in this working copy" would also be a big help. Getting that information without the sqlite db would require crawling the whole working copy, resulting in a lot of disk access.

New APIs don't make the problems we have with existing APIs go away. The backwards compat problem doesn't affect TortoiseSVN, since you will simply provide builds linked to 1.7 and tell all users to upgrade. But we need to keep existing clients that were compiled against 1.6.x and earlier working. So it's not 'not a problem', it's just not TortoiseSVN's problem :)

Sorry, I didn't realize the problem already existed. I was under the impression that only the new API that was reverted a day later had that problem, and that's why it got reverted.

Really? Even with gigantic working copies? What if the amount of information requested simply doesn't fit into memory? I'd prefer a system that cannot fail in this way. I'd prefer passing the information to the client piece by piece, and letting the client worry about where to store it. If at all possible, the library shouldn't allocate huge slabs of memory outside of the client's control.

Sure, that would be the ideal way. But if it's not possible to force clients to behave, and if that's a requirement, allocating a lot of memory might be the only alternative.

How is your query any different from the new proplist API and implementation I added in r1039808? I think that provides what you need (for the proplist case). It opens the db, runs a query on it and streams the results to the client via a callback. Very low overhead.

I haven't tried that proplist API. I just reviewed it, and it really is what I need. Too bad it got reverted. I hope it will get put back in soon.

Stefan
Re: request for new API function
On Sat, Feb 05, 2011 at 05:15:22PM +0100, Stefan Küng wrote:
Sure, that would be the ideal way. But if it's not possible to force clients to behave, and if that's a requirement, allocating a lot of memory might be the only alternative.

The other alternative that's being considered is to run per-directory queries in the compat code. Then we can invoke callbacks once per directory, outside of an sqlite transaction context, so that callbacks written for 1.6.x and earlier cannot cause deadlocks on the DB. This might mean that code compiled against 1.6.x runs a bit slower against 1.7 libraries. But it will still run and function correctly. The amount of memory allocated would be a function of the contents of one directory (vs. the content of an entire working copy...).

I haven't tried that proplist API. I just reviewed it, and it really is what I need. Too bad it got reverted. I hope it will get put back in soon.

I hope to be able to put it back in. I would like to rev all affected callbacks and document what they're not allowed to do. If a callback behaves improperly, the entire working copy can deadlock. Maybe we can even try to detect deadlocks and error out (see http://svn.haxx.se/dev/archive-2011-01/0219.shtml).

This is really a design problem for the most part. Implementing it is fairly simple once we've decided what we want to do. It's nice to know that TortoiseSVN would be happy with this approach.

Stefan
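The per-directory compatibility scheme described above can be sketched with Python's stdlib sqlite3. The table layout and helper name are invented for illustration; the mechanism is the point: materialize one directory's rows first, finish the statement, and only then hand the rows to the (potentially misbehaving) legacy callback.

```python
import sqlite3

# Toy table: each node records its parent directory, loosely modeled on
# wc.db's parent_relpath column. Contents are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes "
             "(parent_relpath TEXT, local_relpath TEXT, properties TEXT)")
conn.executemany("INSERT INTO nodes VALUES (?, ?, ?)",
                 [("", "a", "p1"), ("", "b", "p2"),
                  ("sub", "sub/c", "p3"), ("sub", "sub/d", "p4")])

def walk_props(conn, callback):
    dirs = [row[0] for row in conn.execute(
        "SELECT DISTINCT parent_relpath FROM nodes ORDER BY parent_relpath")]
    for d in dirs:
        # fetchall() drains the statement before any callback runs, so no
        # sqlite statement is active while client code executes -- a
        # misbehaving 1.6-style callback cannot deadlock the database.
        rows = conn.execute(
            "SELECT local_relpath, properties FROM nodes "
            "WHERE parent_relpath = ? ORDER BY local_relpath",
            (d,)).fetchall()
        for path, props in rows:
            callback(path, props)

seen = []
walk_props(conn, lambda path, props: seen.append(path))
print(seen)  # -> ['a', 'b', 'sub/c', 'sub/d']
```

Peak memory is bounded by the largest single directory, which is the trade-off the mail describes: more queries than the one-shot approach, but no whole-working-copy allocation and no callback running inside a transaction.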
Re: SQLite and callbacks
On 05.02.2011 15:35, Stefan Sperling wrote:
On Sat, Feb 05, 2011 at 03:29:23PM +0100, Stefan Sperling wrote:
I think we should strongly consider revving the affected callbacks in the 1.7 API and documenting the restrictions they have to heed. Then we can bring the r1039808 code back. We can keep the code added in r1066541 for backwards compatibility if desired.

By the way, I think this decision is very important for the progress of wc-ng. And I also think it has way too much impact to be decided on a lazy consensus basis. So I'm *not* going to move forward with this before I've heard at least a couple of affirmative opinions. Please consider taking the time to think about this issue and let your opinion be heard. Thanks.

In that case it would make sense to create a completely new set of APIs (with a correspondingly new set of callbacks) with different restrictions, so that the implementation can be made fast, i.e., by doing one query per working copy instead of one per directory or per file.

I would not worry about existing clients -- simply mark the existing APIs as deprecated, but keep them and do not attempt to improve their performance. Clients that are being actively developed will move to the new, faster infrastructure sooner rather than later. Those that are not being maintained, or whose maintainers don't want to bother with changing them, will remain bog slow and slowly fall by the wayside. We should not be in the business of trying to improve other people's code for them, only give them the tools to do so if they want to.

In short -- +1 to this approach. If it's not taken, then the whole wc-ng effort will (almost) have been in vain as far as end users are concerned.

-- Brane
Effect of indices on SQLite (optimizer) performance
Yesterday on IRC, Bert, Philip and I were chatting about our SQLite perf issues and how Philip's findings in the past suggested that SQLite wasn't using its indices to optimize our queries. After searching and discussing its documentation, Philip suggested the (too obvious) "maybe we have the wrong indices". So, I went to work with his fake database generator script (attached as test.py).

The type of query we're seeing problematic performance with looks like the one below. The essential part is the WHERE clause.

SELECT * FROM nodes
WHERE wc_id = 1
  AND (local_relpath = 'foo' OR local_relpath LIKE 'foo%');

We discussed 3 ways to achieve the effect of this query:
1. The query itself
2. The query stated as a UNION of two queries
3. Running the two parts of the UNION manually ourselves

Ad (1): This query doesn't perform as well as we had hoped to get from using a database.
Ad (2): In the past, UNIONs were explicitly removed because they were creating temporary tables (on disk!). However, since then we have changed our SQLite setup to create temporary tables in memory, so the option should really be re-evaluated.
Ad (3): I'd hate to have to use two queries in all the places in our source where we want to run queries like these. As a result, I think this scenario should be avoided if we can.

So, I've created 'perf.py' to evaluate each of these scenarios, researching the effect on each of them under the influence of adding different indices. These are my findings: Scenario (1) [an AND combined with a complex OR] doesn't perform well under any circumstance. Scenario (2) performs differently, depending on the available indices. Scenario (3) performs roughly equal to scenario (2).

Scenario (2) takes ~0.27 seconds to evaluate in the unmodified database. Adding an index on (wc_id, local_relpath) makes the execution time drop to ~0.000156 seconds! Seems Philip was right :-) We need to carefully review the indices we have in our database to support good performance.

Bye, Erik.
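A miniature, hypothetical version of the perf.py experiment can be run with stdlib sqlite3, using EXPLAIN QUERY PLAN instead of timings to see whether the planner switches from a table scan to the index. The schema is cut down to the two relevant columns, the data is fake, and the exact plan text varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# With case-sensitive LIKE, the 'foo/%' prefix pattern can be rewritten
# as a range search over a BINARY-collated index.
conn.execute("PRAGMA case_sensitive_like = ON")
conn.execute("CREATE TABLE nodes (wc_id INTEGER, local_relpath TEXT)")
conn.executemany("INSERT INTO nodes VALUES (1, ?)",
                 [(f"foo/file{i}",) for i in range(1000)] + [("bar",)])

def plan(sql):
    # One detail string per step of the query plan.
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

# Scenario (2) from the mail: the OR split into a UNION of two queries.
union_sql = (
    "SELECT * FROM nodes WHERE wc_id = 1 AND local_relpath = 'foo' "
    "UNION "
    "SELECT * FROM nodes WHERE wc_id = 1 AND local_relpath LIKE 'foo/%'")

print("before:", plan(union_sql))  # typically full table scans
conn.execute("CREATE INDEX i_nodes ON nodes (wc_id, local_relpath)")
print("after: ", plan(union_sql))  # typically searches using i_nodes
```

This mirrors the reported ~0.27s to ~0.000156s drop: once the (wc_id, local_relpath) index exists, both arms of the UNION can be answered by index searches instead of scanning the whole table.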
Re: SQLite and callbacks
On Sat, Feb 05, 2011 at 06:47:35PM +0100, Branko Čibej wrote: I would not worry about existing clients -- simply mark the existing APIs as deprecated, but keep them and do not attempt to improve their performance.

Neglecting performance of backwards compat code is an interesting idea. It allows us to focus on the new APIs first and foremost. The existing APIs will continue to work using the node walker and issue queries per node as they do now. We could consider optimising them later, before or after the 1.7 release. The required changes are mostly mechanical.
Re: Effect of indices on SQLite (optimizer) performance
Now attached as text files (to be renamed to .py) to prevent the mailer software from dropping them... Bye, Erik.
#!/usr/bin/python
# perf.py: time the three query strategies against the generated wcx.db
import os, sqlite3, time

c = sqlite3.connect('wcx.db')
c.execute("pragma case_sensitive_like=1")
c.execute("pragma foreign_keys=on")
c.execute("pragma synchronous=off")
c.execute("pragma temp_store=memory")

start = time.clock()  # cpu clock as float in secs

#c.execute("drop index i_wc_id_rp;")
#c.execute("create index i_wc_id_rp on nodes (wc_id, local_relpath);")
# list the indices currently defined on the nodes table
print c.execute("pragma index_list('nodes')").fetchall()

# strategy 1
c.execute("select * from nodes where wc_id = 1"
          " AND (local_relpath like 'foo/%' OR local_relpath = 'foo');")

# strategy 2
#c.execute("select * from nodes where wc_id = 1 AND local_relpath like 'foo/%'"
#          " union select * from nodes where wc_id = 1 AND local_relpath = 'foo';")

# strategy 3
#c.execute("select * from nodes where wc_id = 1 AND local_relpath like 'foo/%';")
#c.execute("select * from nodes where wc_id = 1 AND local_relpath = 'foo';")

end = time.clock()
print "timing: %5f\n" % (end - start)

#!/usr/bin/python
# test.py: generate a fake wc.db-style database to experiment with
import os, sqlite3

try:
    os.remove('wcx.db')
except OSError:
    pass

c = sqlite3.connect('wcx.db')
c.execute("pragma case_sensitive_like=1")
c.execute("pragma foreign_keys=on")
c.execute("pragma synchronous=off")
c.execute("pragma temp_store=memory")

c.execute("""create table repository (
               id integer primary key autoincrement,
               root text unique not null,
               uuid text not null)""")
c.execute("create index i_uuid on repository (uuid)")
c.execute("create index i_root on repository (root)")

c.execute("""create table wcroot (
               id integer primary key autoincrement,
               local_abspath text unique)""")
c.execute("create unique index i_local_abspath on wcroot (local_abspath)")

c.execute("""create table nodes (
               wc_id integer not null references wcroot (id),
               local_relpath text not null,
               op_depth integer not null,
               parent_relpath text,
               repos_id integer references repository (id),
               repos_path text,
               revision integer,
               presence text not null,
               depth text,
               moved_here integer,
               moved_to text,
               kind text not null,
               changed_revision integer,
               changed_date integer,
               changed_author text,
               checksum text,
               properties blob,
               translated_size integer,
               last_mod_time integer,
               dav_cache blob,
               symlink_target text,
               file_external text,
               primary key (wc_id, local_relpath, op_depth))""")
c.execute("create index i_parent on nodes (wc_id, parent_relpath)")
Re: Effect of indices on SQLite (optimizer) performance
On Sat, Feb 5, 2011 at 1:05 PM, Erik Huelsmann ehu...@gmail.com wrote: Scenario (2) takes ~0.27 seconds to evaluate in the unmodified database. Adding an index on (wc_id, local_relpath) makes the execution time drop to ~0.000156 seconds! Seems Philip was right :-) We need to carefully review the indices we have in our database to support good performance.

I wish this document were fully fleshed out; it seems like it has some good info in it: http://web.utk.edu/~jplyon/sqlite/SQLite_optimization_FAQ.html

Getting indexes in place for the bulk of our reads is essential. It seems like now would be a good time to make that a priority. Of course, adding more indexes will further slow down write speed (which seems bad already), so maybe the above document will give ideas for other optimizations. Did anyone see the tests I posted on users@ of a checkout with 5000 files in a single folder? I really thought we would be faster than 1.6 already, but we are actually several times slower.

My background is all with DB2 on OS/400. Something I was looking for in the SQLite docs is whether it uses hints for the number of rows in a table. For example, DB2 optimizes a new table for 10,000 rows, with increments of 1,000 when you reach the limit. If you know you are inserting 100,000 rows, you can get a massive performance improvement by telling DB2 to optimize for a larger size. I was wondering if SQLite was doing something similar, optimizing for 100 rows or some other small number. I noticed the end of the checkout is really slow, which implies it does not insert the rows fast. Maybe this is just an area where we need to use transactions better?

Anyway, a big +1 on getting the right indexes in place. I know SQLite has an EXPLAIN statement. Not sure if there are tools you can use to just capture information and have it tell you the indexes you need. On databases like DB2 there are tools like that available, and they can save time. In fact, you could almost remove all indexes and run some tests to let the db tell you what indexes you need. Of course, our test suite probably does not have enough data in the db to make indexes any faster than a table scan, so you would probably have to do manual testing using a large working copy to see what you need.

--
Thanks

Mark Phippard
http://markphip.blogspot.com/
Re: Effect of indices on SQLite (optimizer) performance
On Sat, Feb 5, 2011 at 8:25 PM, Mark Phippard markp...@gmail.com wrote: I noticed the end of the checkout is really slow, which implies it does not insert the rows fast. Maybe this is just an area where we need to use transactions better?

Their FAQ (http://www.sqlite.org/faq.html#q19) sure suggests that it's not wise to do separate inserts: the document says SQLite easily does 50k inserts per second into a table on moderate hardware, but only roughly 60 transactions per second...
That would surely point in the direction of using transactions when we need mass inserts! I'm not sure exactly where in our code these inserts should be collected, though. Maybe one of the WC-NG regulars has an idea? Bye, Erik.
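The gap between those two FAQ numbers comes from the per-transaction commit overhead: in autocommit mode every INSERT is its own transaction, and each commit does journal work (and, with default sync settings, a disk sync), while one enclosing transaction pays that cost once. A small sketch of the difference (a simplified table, not Subversion code):

```python
import os, sqlite3, tempfile, time

# Use a file-backed database so commits actually touch the journal on disk.
path = os.path.join(tempfile.mkdtemp(), "wcx.db")
conn = sqlite3.connect(path, isolation_level=None)  # autocommit mode
conn.execute("CREATE TABLE nodes (wc_id INTEGER, local_relpath TEXT)")

# One implicit transaction (and thus one commit) per row.
t0 = time.time()
for i in range(200):
    conn.execute("INSERT INTO nodes VALUES (1, ?)", ("file-%d" % i,))
per_row = time.time() - t0

# One explicit transaction around the whole batch: a single commit at the end.
rows = [(1, "batch-%d" % i) for i in range(10000)]
t0 = time.time()
conn.execute("BEGIN")
conn.executemany("INSERT INTO nodes VALUES (?, ?)", rows)
conn.execute("COMMIT")
batched = time.time() - t0

print("200 per-row commits: %.3fs, 10000 rows in one txn: %.3fs"
      % (per_row, batched))
```

For a checkout, this suggests collecting the per-node INSERTs of a directory (or of the whole fetch) inside one BEGIN/COMMIT pair rather than committing node by node.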
Re: SQLite and callbacks
On Sat, Feb 5, 2011 at 7:14 PM, Stefan Sperling s...@elego.de wrote: We could consider optimising them later, before or after 1.7 release. The required changes are mostly mechanical.

I agree with most of what's been said here. I think it would be a pity to use WC-NG in a way that provides far from optimal performance.

FWIW, I just did a quick run of your per-directory proplist query vs. the per-node version, on my Windows XP platform, to have another data point. The performance improvement is significant, but not earth-shattering (around 20%). Just tested with a freshly checked out working copy of svn trunk:

1) Per-node queries (r1066540). Looking at the third run, to make sure everything is hot in cache:

$ time svn proplist -R . > /dev/null

real    0m1.532s
user    0m0.015s
sys     0m0.015s

2) Per-dir queries (r1066541). Looking at the third run, to make sure everything is hot in cache:

$ time svn proplist -R . > /dev/null

real    0m1.218s
user    0m0.015s
sys     0m0.031s

3) For comparison, I also tested with SlikSVN 1.6.13. This is still more than twice as fast:

$ time svn proplist -R . > /dev/null

real    0m0.578s
user    0m0.015s
sys     0m0.046s

Cheers,
-- Johan
Re: SQLite and callbacks
On Sat, Feb 5, 2011 at 9:37 PM, Johan Corveleyn jcor...@gmail.com wrote: FWIW, I just did a quick run of your per-directory proplist query vs. the per-node version, on my Windows XP platform, to have another data point.

I should've added one more test, to put things in perspective :-), namely the test with r1039808 (the per-wc query version):

$ time svn proplist -R . > /dev/null

real    0m0.328s
user    0m0.015s
sys     0m0.031s

(But, as you probably already know, this execution is not cancellable, which is pretty annoying :-), especially when I forget to redirect to /dev/null and let it write to the console.)

Cheers,
-- Johan
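The cancellability complaint is mostly a question of where the row loop lives: if the fast per-wc query streams its rows through a callback, the walker can consult a cancel function between rows instead of blocking until one monolithic fetch completes. A minimal sketch of that shape (hypothetical names, not the actual svn_wc API):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (local_relpath TEXT)")
conn.executemany("INSERT INTO nodes VALUES (?)",
                 [("file-%d" % i,) for i in range(1000)])

class Cancelled(Exception):
    pass

def proplist_walk(conn, cancel_func, receiver):
    """Stream rows to `receiver`, checking for cancellation between rows."""
    for row in conn.execute("SELECT local_relpath FROM nodes"):
        if cancel_func():
            raise Cancelled()
        receiver(row)

# Simulate the user pressing Ctrl-C after 10 rows have been delivered.
seen = []
calls = [0]
def cancel_after_10():
    calls[0] += 1
    return calls[0] > 10

try:
    proplist_walk(conn, cancel_after_10, seen.append)
except Cancelled:
    pass
print(len(seen))
```

Because SQLite cursors produce rows lazily, this keeps the single fast query while still giving the client a cancellation point per row, much like the existing cancel_func/cancel_baton convention in the svn APIs.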