request for new API function

2011-02-05 Thread Stefan Küng

Hi,

To find all files and folders that have a specific property set I need 
to crawl the whole working copy and fetch the properties of each and 
every item, then scan the returned property list for that property.
But WC-NG uses an SQLite db so this task should be much faster with a 
lot less disk access.


I'd like to ask for a new API which would allow me to search the 
database for properties much faster. Something like


svn_wc_propsearch()

with parameters:

* string to search for
* value indicating how the search is done (equal, contains, begins with, 
... basically all the possible match functions SQLite offers)
* bool indicating whether the search is done on the NODES table or the 
ACTUAL_NODE table

* callback function which receives the search results

The callback function would receive not just the properties and paths of 
the items the found properties belong to, but also any information that 
the table row contains (because when an information is available for 
free, an API should return it to avoid forcing clients to call yet 
another API to get that info - if clients don't need the info they can 
just ignore it).
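
Just to make the shape concrete, here is a rough sketch of what I mean 
(none of these type or function names exist in the current API; they only 
illustrate the idea):

[[[
/* Sketch only: illustrative names, not an existing or proposed header. */
#include <apr_pools.h>
#include <apr_hash.h>
#include "svn_types.h"
#include "svn_wc.h"

typedef enum svn_wc_propsearch_match_t
{
  svn_wc_propsearch_match_equals,
  svn_wc_propsearch_match_contains,
  svn_wc_propsearch_match_begins_with
  /* ... whatever other match styles SQLite's LIKE/GLOB can express ... */
} svn_wc_propsearch_match_t;

/* Invoked once per matching row.  ROW_INFO stands in for "everything else
 * the NODES/ACTUAL_NODE row contains", so callers never need a second call. */
typedef svn_error_t *(*svn_wc_propsearch_receiver_t)(
  void *baton,
  const char *local_abspath,
  apr_hash_t *props,
  const void *row_info,
  apr_pool_t *scratch_pool);

svn_error_t *
svn_wc_propsearch(svn_wc_context_t *wc_ctx,
                  const char *local_abspath,
                  const char *search_string,
                  svn_wc_propsearch_match_t match_kind,
                  svn_boolean_t search_actual_node,
                  svn_wc_propsearch_receiver_t receiver,
                  void *receiver_baton,
                  apr_pool_t *scratch_pool);
]]]

The match kind would map more or less directly onto the SQLite comparison 
used in the WHERE clause.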


And while we're at it: new APIs to search for other columns in the NODES 
and ACTUAL_NODE table would also be nice and useful:
* search for 'changed_revision' and/or 'changed_date': makes it possible 
to show a view with all files that haven't changed for a long time, or 
files that have changed recently, or ...
* search for 'changed_author': makes it possible to show a quick view of 
who changed what last, gather statistics, ...
* search for 'depth' to quickly determine whether the working copy is 
full or may be missing something

* search for 'file_external'

With such new APIs, clients could use the advantages of WC-NG too.

A lot of ideas I had couldn't be implemented before because they would 
have been just too slow. But now, with WC-NG and the database, it would be 
fast enough.


Thoughts?

Stefan

--
   ___
  oo  // \\  De Chelonian Mobile
 (_,\/ \_/ \ TortoiseSVN
   \ \_/_\_/The coolest Interface to (Sub)Version Control
   /_/   \_\ http://tortoisesvn.net


Re: SQLite and callbacks

2011-02-05 Thread Stefan Sperling
On Fri, Feb 04, 2011 at 03:29:37AM +0200, Daniel Shahaf wrote:
 (test script attached --- hope it goes through)
 
 Stefan Sperling wrote on Wed, Feb 02, 2011 at 18:53:39 +0100:
  I've made svn proplist issue per-directory queries in r1066541.
  Reviews of this change are most welcome.
  Also, if someone has a nice way of testing performance impacts of
  this change it would be interesting to see the results.
  I have not done any benchmarking myself yet.
  
 
 Working copy of /tags/1.6.*, no local mods, r1066541:
 [[[
 == pristine-t1-1.out ==
 233.55user 78.55system 5:12.08elapsed 100%CPU (0avgtext+0avgdata 
 32176maxresident)k
 0inputs+0outputs (0major+6199minor)pagefaults 0swaps
 
 == pristine-t1-2.out ==
 237.68user 79.22system 5:16.88elapsed 100%CPU (0avgtext+0avgdata 
 32176maxresident)k
 0inputs+0outputs (0major+6200minor)pagefaults 0swaps
 
 == pristine-t1-3.out ==
 232.96user 76.04system 5:08.98elapsed 100%CPU (0avgtext+0avgdata 
 32176maxresident)k
 0inputs+0outputs (0major+6197minor)pagefaults 0swaps
 ]]]
 
 Ditto, r1066540:
 [[[
 == pristine-t2-1.out ==
 253.41user 82.90system 5:36.27elapsed 100%CPU (0avgtext+0avgdata 
 30480maxresident)k
 0inputs+0outputs (0major+6099minor)pagefaults 0swaps
 
 == pristine-t2-2.out ==
 252.31user 82.52system 5:34.81elapsed 100%CPU (0avgtext+0avgdata 
 30480maxresident)k
 0inputs+0outputs (0major+6090minor)pagefaults 0swaps
 
 == pristine-t2-3.out ==
 234.95user 75.25system 5:10.15elapsed 100%CPU (0avgtext+0avgdata 
 30480maxresident)k
 0inputs+0outputs (0major+6092minor)pagefaults 0swaps
 ]]]
 
 So, 5% more memory, but (excluding the t2-3 result) post-change is 7%
 faster than pre-change.

Is this difference really significant?


 Ditto, with local prop mods on all *.c/*.py files:
 [[[
 == pset-Rpyc-t1-1.out ==
 219.41user 67.76system 4:47.13elapsed 100%CPU (0avgtext+0avgdata 
 32192maxresident)k
 0inputs+0outputs (0major+6200minor)pagefaults 0swaps
 
 == pset-Rpyc-t1-2.out ==
 218.48user 65.74system 4:44.18elapsed 100%CPU (0avgtext+0avgdata 
 32192maxresident)k
 0inputs+0outputs (0major+6205minor)pagefaults 0swaps
 
 == pset-Rpyc-t1-3.out ==
 219.14user 67.83system 4:46.94elapsed 100%CPU (0avgtext+0avgdata 
 32192maxresident)k
 0inputs+0outputs (0major+6197minor)pagefaults 0swaps
 ]]]
 
 Ditto, r1066540:
 [[[
 == pset-Rpyc-t2-1.out ==
 217.34user 64.90system 4:42.20elapsed 100%CPU (0avgtext+0avgdata 
 30480maxresident)k
 0inputs+0outputs (0major+6097minor)pagefaults 0swaps
 
 == pset-Rpyc-t2-2.out ==
 215.01user 67.10system 4:42.08elapsed 100%CPU (0avgtext+0avgdata 
 30464maxresident)k
 0inputs+0outputs (0major+6097minor)pagefaults 0swaps
 
 == pset-Rpyc-t2-3.out ==
 212.82user 63.68system 4:36.46elapsed 100%CPU (0avgtext+0avgdata 
 30480maxresident)k
 0inputs+0outputs (0major+6094minor)pagefaults 0swaps
 ]]]
 
 Pre-change is faster.

I'll look at profiler output. It has much better granularity.
It's possible that we still run queries elsewhere that affect the results.
We need to know whether the runtime overhead of sqlite queries has decreased.

Of course, with some usage patterns (e.g. working copies which contain only
directories, or disproportionally more directories than files),
we can expect little to no difference.

Stefan


Re: [Proposed] Split very long messages by paragraph for easy translate

2011-02-05 Thread Stefan Sperling
On Fri, Feb 04, 2011 at 04:10:30PM +0800, Dongsheng Song wrote:
 On Fri, Feb 4, 2011 at 15:59, Dongsheng Song dongsheng.s...@gmail.com wrote:
  On Sun, Nov 14, 2010 at 01:19, Greg Hudson ghud...@mit.edu wrote:
  On Sat, 2010-11-13 at 10:31 -0500, Daniel Shahaf wrote:
  Sounds reasonable.
 
  What changes to the source code would be required?
 
  Do we just change
        N_("three\n\nparagraphs\n\nhere\n")
  to
        N_("three\n") N_("paragraphs\n") N_("here\n")
 
  No, that would just result in evaluating gettext on the combined string,
  same as before.  I can see two options for help strings in particular:
 
  1. Rev svn_opt_subcommand_desc2_t to include an array of help strings
  which are translated and displayed in sequence.
 
  2. Change print_command_info2 to look at the help string and break it up
  at certain boundaries (such as blank lines or numbered list entries)
  before translating it.
 
  (Mercurial is written in Python, so it has different constraints.)
 
 
  Changing svn_opt_subcommand_desc2_t.help from 'const char *' to 'const
  char **' may break the ABI;
  I don't think we can do it in the near future.

We won't change svn_opt_subcommand_desc2_t.
Instead, we would introduce a new svn_opt_subcommand_desc3_t.

 
  Changing print_command_info2 has similar issues, and it's more
  complex to generate and store strings for translation.

There would be a new print_command_info3.
It might be more complex to generate and store multiple strings.
But if this helps translators I think we should do it.
We must find a nice way of splitting up help texts. Large help
texts like this put too much burden on translators.
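
For example, a desc3 struct could look roughly like the sketch below; it 
just mirrors svn_opt_subcommand_desc2_t with the single help string 
replaced by an array of paragraphs (SVN_OPT_MAX_PARAGRAPHS is a made-up 
limit, nothing is decided):

[[[
#include "svn_opt.h"

#define SVN_OPT_MAX_PARAGRAPHS 64  /* invented limit for illustration */

typedef struct svn_opt_subcommand_desc3_t
{
  const char *name;
  svn_opt_subcommand_t *cmd_func;
  const char *aliases[SVN_OPT_MAX_ALIASES];

  /* Translated and printed in sequence by a new print_command_info3();
     translators then see short, stable strings instead of one huge one. */
  const char *help_paragraphs[SVN_OPT_MAX_PARAGRAPHS];

  int valid_options[SVN_OPT_MAX_OPTIONS];
  struct { int optch; const char *desc; } desc_overrides[SVN_OPT_MAX_OPTIONS];
} svn_opt_subcommand_desc3_t;
]]]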

  I have another approach, introduce a new function to concatenate many
  strings, e.g.
  const char * svn_opt__string_concat(apr_pool_t *pool, const char *s1, ...)
 
  The last parameter should be NULL to indicate this is the last
  parameter, then the very long messages:
  N_("para1 para2 para3 ... paraN")
 
  Can be changed to:
  svn_opt__string_concat(N_("para1"), N_("para2"), N_("para3"), ...,
  N_("paraN"), NULL)
 
  Why do I recall this thread? Because while translating today, I found a
  message which has 255 lines (svn/main.c:676).
  It's super terrible! I cannot imagine the maintenance work when only
  one line changes!
 
  Any comments?
 
  --
  Dongsheng Song
 
 
 OOPS, apr_pstrcat can concatenate multiple strings, so there's no need for
 svn_opt__string_concat.
 http://apr.apache.org/docs/apr/1.4/group__apr__strings.html#g7bd80c95ffb7b3f96bc78e7b5b5b0045

Unfortunately, the help text is part of the definition of a static array.
So at the moment we cannot use apr_pstrcat() to split up the help text.
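
To illustrate the constraint (not actual svn code): a static initializer 
cannot contain a function call, and plain literal concatenation doesn't 
help gettext either:

[[[
/* Illustration only.  N_() just marks a string for xgettext. */
#define N_(x) x

/* Adjacent literals concatenate at compile time; xgettext extracts two
 * keys, but at runtime _() is applied to the combined string, so neither
 * translation would match (Greg's point above). */
static const char *help = N_("paragraph one.\n") N_("paragraph two.\n");

/* This is what apr_pstrcat() would require -- and it cannot appear in a
 * static initializer, because initializers of objects with static storage
 * must be compile-time constants:
 *
 *   static const char *help = apr_pstrcat(pool, N_("one"), N_("two"), NULL);
 */
]]]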

Stefan


Re: request for new API function

2011-02-05 Thread Stefan Sperling
On Sat, Feb 05, 2011 at 10:28:20AM +0100, Stefan Küng wrote:
 Hi,
 
 To find all files and folders that have a specific property set I
 need to crawl the whole working copy and fetch the properties of
 each and every item, then scan the returned property list for that
 property.
 But WC-NG uses an SQLite db so this task should be much faster with
 a lot less disk access.
 
 I'd like to ask for a new API which would allow me to search the
 database for properties much faster. Something like
 
 svn_wc_propsearch()
 
 with parameters:
 
 * string to search for
 * value indicating how the search is done (equal, contains, begins
 with, ... basically all the possible match functions SQLite offers)
 * bool indicating whether the search is done on the NODES table or
 the ACTUAL_NODE table
 * callback function which receives the search results
 
 The callback function would receive not just the properties and
 paths of the items the found properties belong to, but also any
 information that the table row contains (because when an information
 is available for free, an API should return it to avoid forcing
 clients to call yet another API to get that info - if clients don't
 need the info they can just ignore it).
 
 And while we're at it: new APIs to search for other columns in the
 NODES and ACTUAL_NODE table would also be nice and useful:
 * search for 'changed_revision' and/or 'changed_date': allows to
 show a view with all files that haven't changed for a long time, or
 files that have changed recently, or ...
 * search for 'changed_author': allows to show a quick view of who
 changed what last, gather statistics, ...
 * search for 'depth' to quickly determine whether the working copy
 is full or may be missing something
 * search for 'file_external'
 
 With such new APIs, clients could use the advantages of WC-NG too.
 
 A lot of ideas I had couldn't be done before because it would have
 been just too slow. But now with WC-NG and the database it would be
 fast enough.
 
 Thoughts?

I think we should go in this direction.
In fact, I think we should simply change the existing APIs to use
the fastest possible way of getting at information.

Most code we have still crawls the working copy, and that is an
artifact of how the 1.6.x working copy was structured.
We're now at a single DB, but we're not yet using it to
its full potential. We're currently treating it more or less like
a key/value store for information about paths.

Have you seen r1039808 and the resulting "SQLite and callbacks"
thread on dev@? That thread describes some of the issues we're facing
with the interaction of callbacks in our APIs and sqlite queries.

There were two approaches discussed in that thread. I am currently
experimenting with the per-directory queries approach (see r1051452
and r1066541). I'm expecting this to be too slow, but I'm doing it
anyway for two reasons. One is that we'll have real data to look at.
The other is that we might need code that does per-directory queries
anyway to satisfy backwards compatibility constraints (see the
"SQLite and callbacks" thread for details).

I think we will eventually need to query the database like people would
normally query a database, letting sqlite do most of the work of pulling data
out of the db. However, we first need to agree on how to deal with the
implications this has for the existing APIs.

Stefan


Re: request for new API function

2011-02-05 Thread Stefan Sperling
On Sat, Feb 05, 2011 at 01:56:41PM +0100, Stefan Sperling wrote:
 There were two approaches discussed in that thread. I am currently
 experimenting with the queries per-directory approach (see r1051452
 and r1066541).

Sorry, I meant r1050650, not r1051452.


Re: [Proposed] Split very long messages by paragraph for easy translate

2011-02-05 Thread Julian Foad
On Sat, 2011-02-05, Stefan Sperling wrote:
 It might be more complext to generate and store multiple strings.
 But if this helps translators I think we should do it.
 We must find a nice way of splitting up help texts. Large help
 texts like this put too much burden on translators.

Looking at this concern from the outside, I have a question.  Do any of
the commonly available translation tools do a good job of assisting with
translating long strings?  I would expect that merely showing a diff of
the old and new English strings would be enough assistance in most
cases.  I can see how using a simple tool that doesn't even show a diff
would make the job very difficult.

- Julian




Re: SQLite and callbacks

2011-02-05 Thread Stefan Sperling
On Sat, Feb 05, 2011 at 01:11:28PM +0100, Stefan Sperling wrote:
 I'll look at profiler output. It has much better granularity.

I took 3 profiled runs of svn pl -v -R on an svn-trunk working copy
with and without the patch.
The results were pretty much the same for all runs.
The numbers below are from the second run.

The change has reduced the number of queries we issue,
but we are still issuing way too many queries for the impact to
be significant.

Before:
  called/total   parents 
index  %timeself descendents  called+selfname   index
  called/total   children

0.141.038467/8467sqlite3Step [8]
[10]55.20.141.038467 sqlite3VdbeExec [10]
 [children of
  sqlite3VdbeExec omitted]


  %   cumulative   self  self total   
 time   seconds   secondscalls  ms/call  ms/call  name
 22.7   0.48 0.48 _mcount [19]
 12.3   0.74 0.26 __mcount [25]
 10.4   0.96 0.22   478996 0.00 0.00  
sqlite3BtreeMovetoUnpacked [18]
  6.6   1.10 0.14 8467 0.02 0.14  sqlite3VdbeExec [10]


After:
  called/total   parents 
index  %timeself descendents  called+selfname   index
  called/total   children

0.171.074832/4832sqlite3Step [8]
[9] 64.40.171.074832 sqlite3VdbeExec [9]
[again, children of
 sqlite3VdbeExec omitted]


  %   cumulative   self  self total   
 time   seconds   secondscalls  ms/call  ms/call  name
 21.8   0.42 0.42 _mcount [18]
 13.5   0.68 0.26   473315 0.00 0.00  
sqlite3BtreeMovetoUnpacked [16]
  8.8   0.85 0.17 4832 0.04 0.26  sqlite3VdbeExec [9]
  8.8   1.02 0.17 __mcount [20]


The number of sqlite3VdbeExec() invocations has been reduced from 8467 to 4832.
But 4832 queries is still way too much when compared to not crawling
the tree at all. I don't have related gprof data anymore, but the log
message of r1039808 contains some hints:

 gprof says this change speeds up svn proplist -R svn-trunk-wc > /dev/null
 on my box from about 0.35s to about 0.11s. During profiled runs, the function
 sqlite3Step() went from using 27% of the total time down to 15%. The time spent
 within and below svn_client_proplist3() went down from 51% to 36%.
 The profiler overhead accounted for about 31% of the total run time
 before this change and about 45% after.

So, as expected, r1039808 had a much greater effect than r1066541.
Note that for our purposes sqlite3Step() is equivalent to sqlite3VdbeExec().
_mcount in the gprof output above shows the profiler overhead.

I think we should strongly consider revving the affected callbacks in the
1.7 API and documenting the restrictions they have to heed. Then we can bring
the r1039808 code back. We can keep the code added in r1066541 for
backwards compatibility if desired.
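
To give an idea of what I mean, a revved receiver might document its
restriction roughly like this (just a sketch; the names and revision
numbers are made up, nothing is decided yet):

[[[
#include <apr_pools.h>
#include <apr_hash.h>
#include "svn_types.h"

/** Receiver for a hypothetical svn_client_proplist4(), called once per path.
 *
 * @warning May be invoked while a query on the working copy database is
 * still in progress.  The implementation must not call back into any
 * libsvn_wc/libsvn_client function that operates on the same working copy,
 * and must not block for long, otherwise the database can deadlock.
 */
typedef svn_error_t *(*svn_proplist_receiver2_t)(void *baton,
                                                 const char *path,
                                                 apr_hash_t *prop_hash,
                                                 apr_pool_t *scratch_pool);
]]]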

Stefan


Re: SQLite and callbacks

2011-02-05 Thread Stefan Sperling
On Sat, Feb 05, 2011 at 03:29:23PM +0100, Stefan Sperling wrote:
 I think we should strongly consider revving affected callbacks in the
 1.7 API and document restriction they have to heed. Then we can bring
 the r1039808 code back. We can keep the code added in r1066541 for
 backwards-compatibility if desired.

By the way, I think this decision is very important for progress of wc-ng.
And I also think it has way too much impact to be decided on a lazy consensus
basis. So I'm *not* going to move forward with this before I've heard at
least a couple of affirmative opinions. Please consider taking the time
to think about this issue and let your opinion be heard. Thanks.


Re: request for new API function

2011-02-05 Thread Stefan Küng

On 05.02.2011 13:56, Stefan Sperling wrote:


I think we should go into this direction.
In fact, I think we should simply change the existing APIs to use
the fastest possible way of getting at information.


Well, currently there is no API that does what I suggested (basically 
return all results of a db query without even touching any files in the 
WC or having to do this for every file/folder separately).



Most code we have still crawls the working copy, and that is an
artifact of how the 1.6.x working copy was structured.
We're now at single DB, but we're not yet using the single DB to
its full potential. We're currently treating it more or less like
a key/value store for information about paths.

Have you seen r1039808 and the resulting the Sqlite and callbacks
thread on dev@? That thread describes some of the issue we're facing
with the interaction of callbacks in our APIs and sqlite queries.

There were two approaches discussed in that thread. I am currently
experimenting with the queries per-directory approach (see r1051452
and r1066541). I'm expecting this to be too slow, but I'm doing it
anyway for two reasons. One is that we'll have real data to look at.
The other is that we might need code that does per-directory queries
anyway to satisfy backwards compatibility constraints (see the thread
sqlite and callbacks thread for details).

I think we will eventually need to query the database like people would
normally query a database, letting sqlite do most of the work of pulling data
out of the db. However we need to agree on how to solve problems with
implications this has on the existing APIs.


I've read up on that thread. It seems the problem you're facing comes 
from the fact that you need to stay compatible with pre-1.7 APIs and 
clients, and the fact that you can't force clients to behave, only 
ask them to behave and then hope for the best.


However, what I'm asking for here are *new* APIs which do something no 
existing API currently does. So staying compatible wouldn't be a 
problem. And if you're worried about clients not behaving properly, why 
not get rid of the callback completely and just return all information 
at once in one big chunk of memory?
For UI clients this won't be a problem, because they usually have to 
store all the information they receive in a callback anyway so they 
have it ready to show in the UI. So for them, the memory use wouldn't be 
any bigger at all.


Of course, the APIs I'm asking for might not be very useful for 
implementing existing APIs or other things done inside the svn library. 
They might only be useful for some svn clients. But I hope that's not a 
blocker for implementing them.


I also thought of just querying the SQLite db myself directly, but I 
don't like to do something that's not really allowed.
However: I did a quick test with the Check-for-modifications dialog in 
TSVN. It has a feature where you can enable showing all properties. To 
do that, a separate thread is started which lists all properties of all 
items in the working copy. On one of my working copies, this takes about 
50 seconds. Using a simple SQLite query on the NODES table took on 
average 1260 ms. Parsing the data and preparing it for use in the UI took 
another 3.5 seconds. Now *that's* a speed improvement I really like.



Stefan

--
   ___
  oo  // \\  De Chelonian Mobile
 (_,\/ \_/ \ TortoiseSVN
   \ \_/_\_/The coolest Interface to (Sub)Version Control
   /_/   \_\ http://tortoisesvn.net


Re: request for new API function

2011-02-05 Thread Stefan Sperling
On Sat, Feb 05, 2011 at 04:22:29PM +0100, Stefan Küng wrote:
 On 05.02.2011 13:56, Stefan Sperling wrote:
 
 I think we should go into this direction.
 In fact, I think we should simply change the existing APIs to use
 the fastest possible way of getting at information.
 
 Well, currently there is no API that does what I suggested
 (basically return all results of a db query without even touching
 any files in the WC or have do this for every file/folder
 separately).

Well, the svn_proplist case I'm looking at is the same thing.
I want to answer the request "Give me all properties within this
working copy" by issuing as few sqlite queries as possible.

You are talking about such requests in general.
I am talking about one specific instance (proplist).
But essentially we want the same thing.

 I've read up on that thread. It seems the problem you're facing
 comes from the fact that you need to stay compatible with pre 1.7
 APIs and clients, and the fact that you can't enforce clients to
 behave, only to ask them to behave and then hope for the best.

Yes.

 However what I'm asking for here are *new* APIs which do something
 no existing API currently does. So staying compatible wouldn't be a
 problem.

New APIs don't make the problems we have with existing APIs go away.

The backwards compat problem doesn't affect TortoiseSVN, since
you will simply provide builds linked to 1.7 and tell all users to upgrade.

But we need to keep existing clients that were compiled against 1.6.x
and earlier working.  So it's not "not a problem", it's just not
TortoiseSVN's problem :)

 And if you're worried about clients not behaving properly,
 why not get rid of the callback completely and just return all
 information at once in one big chunk of memory.
 Talking about UI clients, this won't be a problem because they
 usually have to store all information they receive in a callback
 anyway so they have it ready to show in the UI. So for them, the
 memory use wouldn't be bigger at all.

Really? Even with gigantic working copies?
What if the amount of information requested simply doesn't fit
into memory? I'd prefer a system that cannot fail in this way.
I'd prefer passing the information to the client piece by piece,
and letting the client worry about where to store it.
If at all possible, the library shouldn't allocate huge slabs of
memory outside of the client's control.
 
 Of course, those APIs I'm asking for might not be very useful for
 existing APIs or other stuff that is done in the svn library. Those
 might only be useful for some svn clients. But I hope that's not a
 blocker for implementing those.

I hope that we'll get a good set of APIs for 1.7 that will
satisfy all clients out there, including TortoiseSVN.
What these APIs will look like isn't set in stone yet.

 I also thought of just query the SQLite db myself directly, but then
 I don't like to do something that's not really allowed.
 However: I did a quick test with the Check-for-modifications dialog
 in TSVN. It has a feature where you can enable showing all
 properties. To do that, a separate thread is started which lists all
 properties of all items in the working copy. On one of my working
 copies, this takes about 50 seconds. Using a simple SQLite query on
 the NODE table took in average 1260ms. Parsing the data and
 preparing it for use in the UI took another 3.5 seconds. Now *that*
 a speed improvement I really like.

How is your query any different from the new proplist API and
implementation I added in r1039808? I think that provides what you need
(for the proplist case). It opens the db, runs a query on it and streams
the results to the client via a callback. Very low overhead.
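
The general shape of that approach is something like the following sketch
(raw sqlite3 C API rather than the svn_sqlite__ wrappers the real code
uses, no op_depth or ACTUAL_NODE handling, error handling trimmed; the
function and callback names are invented):

[[[
#include <sqlite3.h>

/* Invented callback type: called once per row while the statement is open. */
typedef int (*proplist_receiver_t)(void *baton,
                                   const char *local_relpath,
                                   const void *props_skel,
                                   int props_len);

static int
walk_props(const char *db_path, proplist_receiver_t receiver, void *baton)
{
  sqlite3 *db;
  sqlite3_stmt *stmt;

  if (sqlite3_open(db_path, &db) != SQLITE_OK)
    return -1;

  /* Table/column names follow the wc.db NODES schema; the WHERE clause is
     deliberately simplified. */
  if (sqlite3_prepare_v2(db,
        "SELECT local_relpath, properties FROM nodes "
        "WHERE wc_id = 1 AND properties IS NOT NULL;",
        -1, &stmt, NULL) != SQLITE_OK)
    {
      sqlite3_close(db);
      return -1;
    }

  while (sqlite3_step(stmt) == SQLITE_ROW)
    {
      const char *relpath = (const char *)sqlite3_column_text(stmt, 0);
      const void *props = sqlite3_column_blob(stmt, 1);
      int len = sqlite3_column_bytes(stmt, 1);

      /* The receiver runs while the statement (and thus the db) is busy --
         exactly the interaction the "SQLite and callbacks" thread is about. */
      if (receiver(baton, relpath, props, len) != 0)
        break;
    }

  sqlite3_finalize(stmt);
  sqlite3_close(db);
  return 0;
}
]]]

The open question is only what callbacks are allowed to do while the
statement is still open.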

Stefan


Re: request for new API function

2011-02-05 Thread Stefan Küng

On 05.02.2011 16:46, Stefan Sperling wrote:

On Sat, Feb 05, 2011 at 04:22:29PM +0100, Stefan Küng wrote:

On 05.02.2011 13:56, Stefan Sperling wrote:


I think we should go into this direction.
In fact, I think we should simply change the existing APIs to use
the fastest possible way of getting at information.


Well, currently there is no API that does what I suggested
(basically return all results of a db query without even touching
any files in the WC or have do this for every file/folder
separately).


Well, the svn_proplist case I'm looking at is the same thing.
I want to answer the request "Give me all properties within this
working copy" by issuing as few sqlite queries as possible.

You are talking about such requests in general.
I am talking about one specific instance (proplist).
But essentially we want the same thing.


Yes, the proplist case is what I need.
But of course, the other cases I listed are also valid requests. For 
example "give me all entries that have author XXX as the last-commit 
author" or "give me all paths that were changed since date XXX". 
These could also be answered by sqlite queries alone.


Or even "give me all paths in this working copy" would also be a big 
help. Getting that information without the sqlite db would require 
crawling the whole working copy, resulting in a lot of disk access.



I've read up on that thread. It seems the problem you're facing
comes from the fact that you need to stay compatible with pre 1.7
APIs and clients, and the fact that you can't enforce clients to
behave, only to ask them to behave and then hope for the best.


Yes.


However what I'm asking for here are *new* APIs which do something
no existing API currently does. So staying compatible wouldn't be a
problem.


New APIs don't make the problems we have with existing APIs go away.

The backwards compat problem doesn't affect TortoiseSVN, since
you will simply provide builds linked to 1.7 and tell all users to upgrade.

But we need to keep existing clients that were compiled against 1.6.x
and earlier working.  So it's not not a problem, it's just not
TortoiseSVN's problem :)


Sorry, I didn't realize the problem already exists. I was under the 
impression that only the new API that was reverted a day later had that 
problem, and that's why it got reverted.




And if you're worried about clients not behaving properly,
why not get rid of the callback completely and just return all
information at once in one big chunk of memory.
Talking about UI clients, this won't be a problem because they
usually have to store all information they receive in a callback
anyway so they have it ready to show in the UI. So for them, the
memory use wouldn't be bigger at all.


Really? Even with gigantic working copies?
What if the amount of information requested simply doesn't fit
into memory? I'd prefer a system that cannot fail in this way.
I'd prefer passing the information to the client piece by piece,
and letting the client worry about where to store it.
If at all possible, the library shouldn't allocate huge slabs of
memory outside of the client's control.


Sure, that would be the ideal way. But if it's not possible to force 
clients to behave and if that's a requirement, allocating a lot of 
memory might be the only alternative.



Of course, those APIs I'm asking for might not be very useful for
existing APIs or other stuff that is done in the svn library. Those
might only be useful for some svn clients. But I hope that's not a
blocker for implementing those.


I hope that we'll get a good set of APIs for 1.7 that will
satisfy all clients out there, including TortoiseSVN.
What these APIs will look like isn't set in stone yet.


I also thought of just query the SQLite db myself directly, but then
I don't like to do something that's not really allowed.
However: I did a quick test with the Check-for-modifications dialog
in TSVN. It has a feature where you can enable showing all
properties. To do that, a separate thread is started which lists all
properties of all items in the working copy. On one of my working
copies, this takes about 50 seconds. Using a simple SQLite query on
the NODE table took in average 1260ms. Parsing the data and
preparing it for use in the UI took another 3.5 seconds. Now *that*
a speed improvement I really like.


How is your query any different from the new proplist API and
implementation I added in r1039808? I think that provides what you need
(for the proplist case). It opens the db, runs a query on it and streams
the results to the client via a callback. Very low overhead.


Haven't tried that proplist API. I just reviewed it and it really is 
what I need. Too bad it got reverted. I hope it will get put back in soon.


Stefan

--
   ___
  oo  // \\  De Chelonian Mobile
 (_,\/ \_/ \ TortoiseSVN
   \ \_/_\_/The coolest Interface to (Sub)Version Control
   /_/   \_\ http://tortoisesvn.net


Re: request for new API function

2011-02-05 Thread Stefan Sperling
On Sat, Feb 05, 2011 at 05:15:22PM +0100, Stefan Küng wrote:
 On 05.02.2011 16:46, Stefan Sperling wrote:
 What if the amount of information requested simply doesn't fit
 into memory? I'd prefer a system that cannot fail in this way.
 I'd prefer passing the information to the client piece by piece,
 and letting the client worry about where to store it.
 If at all possible, the library shouldn't allocate huge slabs of
 memory outside of the client's control.
 
 Sure, that would be the ideal way. But if it's not possible to force
 clients to behave and if that's a requirement, allocating a lot of
 memory might be the only alternative.

The other alternative that's being considered is to run per-directory
queries in the compat code. Then we can invoke callbacks once per
directory, outside of an sqlite transaction context, so that callbacks
written for 1.6.x and earlier cannot cause deadlocks on the DB.

This might mean that code compiled against 1.6.x runs a bit slower
against 1.7 libraries. But it will still run and function correctly.
The amount of memory allocated would be a function of the contents of
one directory (vs. the content of an entire working copy...)

 How is your query any different from the new proplist API and
 implementation I added in r1039808? I think that provides what you need
 (for the proplist case). It opens the db, runs a query on it and streams
 the results to the client via a callback. Very low overhead.
 
 Haven't tried that proplist API. I just reviewed it and it really is
 what I need. Too bad it got reverted. I hope it will get put back in
 soon.

I hope to be able to put it back in.

I would like to rev all affected callbacks, and document what they're
not allowed to do. If a callback behaves improperly, the entire working
copy can deadlock. Maybe we can even try to detect deadlocks and error
out (see http://svn.haxx.se/dev/archive-2011-01/0219.shtml).

This is really a design problem for the most part.
Implementing it is fairly simple once we've decided what we want to do.

It's nice to know that TortoiseSVN would be happy with this approach.

Stefan


Re: SQLite and callbacks

2011-02-05 Thread Branko Čibej
On 05.02.2011 15:35, Stefan Sperling wrote:
 On Sat, Feb 05, 2011 at 03:29:23PM +0100, Stefan Sperling wrote:
 I think we should strongly consider revving affected callbacks in the
 1.7 API and document restriction they have to heed. Then we can bring
 the r1039808 code back. We can keep the code added in r1066541 for
 backwards-compatibility if desired.
 By the way, I think this decision is very important for progress of wc-ng.
 And I also think it has way too much impact to be decided on a lazy consensus
 basis. So I'm *not* going to move forward with this before I've heard at
 least a couple of affirmative opinions. Please consider taking the time
 to think about this issue and let your opinion be heard. Thanks.

In that case it would make sense to create a completely new set of APIs
(with a correspondingly new set of callbacks) with different
restrictions so that the implementation can be made fast, i.e., by doing
one query per working copy instead of one per directory or per-file.

I would not worry about existing clients -- simply mark the existing
APIs as deprecated, but keep them and do not attempt to improve their
performance. Clients that are being actively developed will move to the
new, faster infrastructure sooner rather than later. Those that are not
being maintained, or whose maintainers don't want to bother with
changing them, will remain bog slow and slowly fall by the wayside. We
should not be in the business of trying to improve other people's code
for them, only give them the tools to do so if they want to.

In short -- +1 to this approach. If it's not taken, then the whole wc-ng
effort will (almost) have been in vain as far as end users are concerned.

-- Brane


Effect of indices on SQLite (optimizer) performance

2011-02-05 Thread Erik Huelsmann
Yesterday on IRC, Bert, Philip and I were chatting about our SQLite
perf issues and how Philip's findings in the past suggested that
SQLite wasn't using its indices to optimize our queries.

After searching and discussing its documentation, Philip suggested the
-too obvious- "maybe we have the wrong indices".

So, I went to work with his fake database generator script (attached
as test.py).


The type of query we're seeing problematic performance with looks like
the one below. The essential part is the WHERE clause.

SELECT * FROM nodes WHERE wc_id = 1 AND (local_relpath = 'foo' OR
local_relpath like 'foo%');


We discussed 3 ways to achieve the effect of this query:

 1. The query itself
 2. The query stated as a UNION of two queries
 3. Running the two parts of the UNION manually ourselves.

Ad (1)
This query doesn't give the performance we had hoped to get from using a database.

Ad (2)
In the past, UNIONs have been explicitly removed because they were
creating temporary tables (on disk!). However, since then we have
changed our SQLite setup to create temporary tables in memory, so the
option should really be re-evaluated.

Ad (3)
I'd hate to have to use two queries in all places in our source where
we want to run queries like these. As a result, I think this scenario
should be avoided if we can.


So, I've created 'perf.py' to evaluate each of these scenarios,
researching the effect on each of them under the influence of adding
different indices.

This is my finding:

Scenario (1) [an AND combined with a complex OR] doesn't perform well
under any circumstance.

Scenario (2) performs differently, depending on the available indices.

Scenario (3) performs roughly equal to scenario (2).


Scenario (2) takes ~0.27 seconds to evaluate in the unmodified
database. Adding an index on (wc_id, local_relpath) makes the
execution time drop to ~0.000156 seconds!


Seems Philip was right :-) We need to carefully review the indices we
have in our database to support good performance.
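
For reference, the index and the UNION form of the query look roughly
like this when issued through the sqlite3 C API (sketch only;
'i_wc_id_rp' is the index name used in the attached perf.py):

[[[
#include <sqlite3.h>

static int
add_index_and_run_union(sqlite3 *db)
{
  /* The (wc_id, local_relpath) index that made scenario (2) drop from
     ~0.27s to ~0.000156s. */
  if (sqlite3_exec(db,
        "CREATE INDEX IF NOT EXISTS i_wc_id_rp "
        "ON nodes (wc_id, local_relpath);",
        NULL, NULL, NULL) != SQLITE_OK)
    return -1;

  /* Scenario (2): the OR rewritten as a UNION so each arm can use the index. */
  return sqlite3_exec(db,
        "SELECT * FROM nodes WHERE wc_id = 1 AND local_relpath = 'foo' "
        "UNION "
        "SELECT * FROM nodes WHERE wc_id = 1 AND local_relpath LIKE 'foo%';",
        NULL, NULL, NULL);
}
]]]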


Bye,


Erik.


Re: SQLite and callbacks

2011-02-05 Thread Stefan Sperling
On Sat, Feb 05, 2011 at 06:47:35PM +0100, Branko Čibej wrote:
 I would not worry about existing clients -- simply mark the existing
 APIs as deprecated, but keep them and do not attempt to improve their
 performance.

Neglecting performance of backwards compat code is an interesting idea.
It allows us to focus on the new APIs first and foremost.

The existing APIs will continue to work using the node walker and issue
queries per node as they do now. We could consider optimising them later,
before or after 1.7 release. The required changes are mostly mechanical.


Re: Effect of indices on SQLite (optimizer) performance

2011-02-05 Thread Erik Huelsmann
Now attached as text files (to be renamed to .py) to prevent the
mailer software from dropping them...

Bye,

Erik.

On Sat, Feb 5, 2011 at 7:05 PM, Erik Huelsmann ehu...@gmail.com wrote:
 Yesterday or IRC, Bert, Philip and I were chatting about our SQLite
 perf issues and how Philip's findings in the past suggested that
 SQLite wasn't using its indices to optimize our queries.

 After searching and discussing its documentation, Philip suggested the
 -too obvious- maybe we have the wrong indices.

 So, I went to work with his fake database generator script (attached
 as test.py).


 The type of query we're seeing problematic performance with looks like
 the one below. The essential part is the WHERE clause.

 SELECT * FROM nodes WHERE wc_id = 1 AND (local_relpath = 'foo' OR
 local_relpath like 'foo%');


 We discussed 3 ways to achieve the effect of this query:

  1. The query itself
  2. The query stated as a UNION of two queries
  3. Running the two parts of the UNION manually ourselves.

 Ad (1)
 This query doesn't perform as we had hoped to get from using a database.

 Ad (2)
 In the past, UNIONs have been explicitly removed because they were
 creating temporary tables (on disk!). However, since then we have
 changed our SQLite setup to create temporary tables in memory, so the
 option should really be re-evaluated.

 Ad (3)
 I'd hate to have to use two queries in all places in our source where
 we want to run queries like these. As a result, I think this scenario
 should be avoided if we can.


 So, I've created 'perf.py' to evaluate each of these scenarios,
 researching the effect on each of them under the influence of adding
 different indices.

 This is my finding:

 Scenario (1) [an AND combined with a complex OR] doesn't perform well
 under any circumstance.

 Scenario (2) performs differently, depending on the available indices.

 Scenario (3) performs roughly equal to scenario (2).


 Scenario (2) takes ~0.27 seconds to evaluate in the unmodified
 database. Adding an index on (wc_id, local_relpath) makes the
 execution time drop to ~0.000156 seconds!


 Seems Philip was right :-) We need to carefully review the indices we
 have in our database to support good performance.


 Bye,


 Erik.

#!/usr/bin/python

import os, sqlite3, time

c = sqlite3.connect('wcx.db')
c.execute("pragma case_sensitive_like=1")
c.execute("pragma foreign_keys=on")
c.execute("pragma synchronous=off")
c.execute("pragma temp_store=memory")

start = time.clock() # cpu clock as float in secs

#c.execute("drop index i_wc_id_rp;")
#c.execute("create index i_wc_id_rp on nodes (wc_id, local_relpath);")

# list the existing indices (".indices" is an sqlite3 shell command, not SQL)
print c.execute("select name from sqlite_master where type = 'index';").fetchall()

# strategy 1
c.execute("select * from nodes where wc_id = 1 AND "
          "(local_relpath like 'foo/%' OR local_relpath = 'foo');")

# strategy 2
#c.execute("select * from nodes where wc_id = 1 AND local_relpath like 'foo/%' "
#          "union select * from nodes where wc_id = 1 AND local_relpath = 'foo';")

# strategy 3
#c.execute("select * from nodes where wc_id = 1 AND local_relpath like 'foo/%';")
#c.execute("select * from nodes where wc_id = 1 AND local_relpath = 'foo';")

end = time.clock()

print "timing: %5f\n" % (end - start)


#!/usr/bin/python

import os, sqlite3

try: os.remove('wcx.db')
except: pass

c = sqlite3.connect('wcx.db')
c.execute("pragma case_sensitive_like=1")
c.execute("pragma foreign_keys=on")
c.execute("pragma synchronous=off")
c.execute("pragma temp_store=memory")
c.execute("""create table repository (
   id integer primary key autoincrement,
   root text unique not null,
   uuid text not null)""")
c.execute("create index i_uuid on repository (uuid)")
c.execute("create index i_root on repository (root)")
c.execute("""create table wcroot (
   id integer primary key autoincrement,
   local_abspath text unique)""")
c.execute("create unique index i_local_abspath on wcroot (local_abspath)")
c.execute("""create table nodes (
   wc_id integer not null references wcroot (id),
   local_relpath text not null,
   op_depth integer not null,
   parent_relpath text,
   repos_id integer references repository (id),
   repos_path text,
   revision integer,
   presence text not null,
   depth text,
   moved_here integer,
   moved_to text,
   kind text not null,
   changed_revision integer,
   changed_date integer,
   changed_author text,
   checksum text,
   properties blob,
   translated_size integer,
   last_mod_time integer,
   dav_cache blob,
   symlink_target text,
   file_external text,
   primary key(wc_id, local_relpath, op_depth))""")
c.execute(create index i_parent on nodes (wc_id,
 

Re: Effect of indices on SQLite (optimizer) performance

2011-02-05 Thread Mark Phippard
On Sat, Feb 5, 2011 at 1:05 PM, Erik Huelsmann ehu...@gmail.com wrote:

 Scenario (2) takes ~0.27 seconds to evaluate in the unmodified
 database. Adding an index on (wc_id, local_relpath) makes the
 execution time drop to ~0.000156 seconds!


 Seems Philip was right :-) We need to carefully review the indices we
 have in our database to support good performance.

I wish this document were fully fleshed out; it seems like it has some
good info in it:

http://web.utk.edu/~jplyon/sqlite/SQLite_optimization_FAQ.html

Getting indexes in place for the bulk of our reads is essential.  It
seems like now would be a good time to make that a priority.  Of
course adding more indexes will further slow down write speed (which
seems bad already) so maybe the above document will give ideas for
other optimizations.

Did anyone see the tests I posted on users@ of a checkout with 5000
files in a single folder?  I really thought we would be faster than 1.6
already, but we are actually several times slower.

My background is all with DB2 on OS/400.  Something I was looking for
in SQLite docs is whether it uses hints for the number of rows in a
table.  For example, DB2 optimizes a new table for 10,000 rows with
increments of 1,000 when you reach the limit.  If you know you are
inserting 100,000 rows you can get a massive performance improvement
by telling DB2 to optimize for a larger size.  I was wondering if
SQLite was doing something like optimizing for 100 rows or something
small.  I noticed the end of the checkout is really slow which implies
it does not insert the rows fast.  Maybe this is just an area where we
need to use transactions better?

Anyway, a big +1 on getting the right indexes in place.  I know SQLite
has an EXPLAIN statement.  Not sure if there are tools you can use to
just capture information and have it tell you the indexes you needed.
On databases like DB2 there are tools like that available and it can
save time.  In fact you could almost remove all indexes and run some
tests to let the db tell you what indexes you needed.  Of course our
test suite probably does not have enough data in the db to make
indexes any faster than a table scan, so you would probably have to do
manual testing using a large working copy to see what you need.
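
For what it's worth, that information can be captured programmatically
with EXPLAIN QUERY PLAN; a rough sketch through the sqlite3 C API (the
helper names are invented):

[[[
#include <sqlite3.h>
#include <stdio.h>

/* Print one row of the EXPLAIN QUERY PLAN output. */
static int
print_plan_row(void *baton, int ncols, char **vals, char **names)
{
  int i;
  for (i = 0; i < ncols; i++)
    printf("%s=%s ", names[i], vals[i] ? vals[i] : "NULL");
  printf("\n");
  return 0;
}

static void
show_query_plan(sqlite3 *db)
{
  /* The "detail" column tells you whether the nodes table is searched
     via an index or scanned in full. */
  sqlite3_exec(db,
               "EXPLAIN QUERY PLAN "
               "SELECT * FROM nodes WHERE wc_id = 1 AND local_relpath = 'foo';",
               print_plan_row, NULL, NULL);
}
]]]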

-- 
Thanks

Mark Phippard
http://markphip.blogspot.com/


Re: Effect of indices on SQLite (optimizer) performance

2011-02-05 Thread Erik Huelsmann
On Sat, Feb 5, 2011 at 8:25 PM, Mark Phippard markp...@gmail.com wrote:
 On Sat, Feb 5, 2011 at 1:05 PM, Erik Huelsmann ehu...@gmail.com wrote:

 Scenario (2) takes ~0.27 seconds to evaluate in the unmodified
 database. Adding an index on (wc_id, local_relpath) makes the
 execution time drop to ~0.000156 seconds!


 Seems Philip was right :-) We need to carefully review the indices we
 have in our database to support good performance.
 I wish this document were fully fleshed out, it seems like it has some
 good info in it:

 http://web.utk.edu/~jplyon/sqlite/SQLite_optimization_FAQ.html

 Getting indexes in place for the bulk of our reads is essential.  It
 seems like now would be a good time to make that a priority.  Of
 course adding more indexes will further slow down write speed (which
 seems bad already) so maybe the above document will give ideas for
 other optimizations.

 Did anyone see the tests I posted on users@ of a checkout with 5000
 files in single folder?  I really thought we would be faster than 1.6
 already but we are actually several factors slower.

 My background is all with DB2 on OS/400.  Something I was looking for
 in SQLite docs is whether it uses hints for the number of rows in a
 table.  For example, DB2 optimizes a new table for 10,000 rows with
 increments of 1,000 when you reach the limit.  If you know you are
 inserting 100,000 rows you can get a massive performance improvement
 by telling DB2 to optimize for a larger size.  I was wondering if
 SQLite was doing something like optimizing for 100 rows or something
 small.  I noticed the end of the checkout is really slow which implies
 it does not insert the rows fast.  Maybe this is just an area where we
 need to use transactions better?

Their FAQ (http://www.sqlite.org/faq.html#q19) sure suggests that it's
not wise to do separate inserts: the document says SQLite easily does
50k inserts per sec into a table on moderate hardware, but only
roughly 60 transactions per second...

That would surely point in the direction of using transactions when
we need mass inserts! I'm not sure exactly where in our code these
inserts should be collected though. Maybe one of the WC-NG regulars
has an idea?
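
As an illustration only (invented function name and values, but wc.db-style
columns), batching the inserts in one transaction looks like this:

[[[
#include <sqlite3.h>
#include <stdio.h>

static int
insert_nodes_batched(sqlite3 *db, int n_nodes)
{
  sqlite3_stmt *stmt;
  int i;

  /* One explicit transaction instead of one implicit commit per INSERT. */
  sqlite3_exec(db, "BEGIN TRANSACTION;", NULL, NULL, NULL);

  if (sqlite3_prepare_v2(db,
        "INSERT INTO nodes (wc_id, local_relpath, op_depth, presence, kind) "
        "VALUES (1, ?, 0, 'normal', 'file');",
        -1, &stmt, NULL) != SQLITE_OK)
    return -1;

  for (i = 0; i < n_nodes; i++)
    {
      char relpath[64];
      snprintf(relpath, sizeof(relpath), "dir/file%d", i);
      sqlite3_bind_text(stmt, 1, relpath, -1, SQLITE_TRANSIENT);
      sqlite3_step(stmt);
      sqlite3_reset(stmt);
    }

  sqlite3_finalize(stmt);
  return sqlite3_exec(db, "COMMIT;", NULL, NULL, NULL);
}
]]]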


Bye,

Erik.


Re: SQLite and callbacks

2011-02-05 Thread Johan Corveleyn
On Sat, Feb 5, 2011 at 7:14 PM, Stefan Sperling s...@elego.de wrote:
 On Sat, Feb 05, 2011 at 06:47:35PM +0100, Branko Čibej wrote:
 I would not worry about existing clients -- simply mark the existing
 APIs as deprecated, but keep them and do not attempt to improve their
 performance.

 Neglecting performance of backwards compat code is an interesting idea.
 It allows us to focus on the new APIs first and foremost.

 The existing APIs will continue to work using the node walker and issue
 queries per node as they do now. We could consider optimising them later,
 before or after 1.7 release. The required changes are mostly mechanical.

I agree with most of what's been said here. I think it would be a pity
to use WC-NG in a way that provides far from optimal performance.

FWIW, I just did a quick run of your per-directory proplist query vs.
the per-node version, on my Windows XP platform, to have another data
point. The performance improvement is significant, but not
earth-shattering (around 20%).

Just tested with a freshly checked out working copy of svn trunk:

1) Per-node queries (r1066540). Looking at the third run, to make sure
everything is hot in cache:

$ time svn proplist -R . > /dev/null

real0m1.532s
user0m0.015s
sys 0m0.015s


2) Per-dir queries (r1066541). Looking at the third run, to make sure
everything is hot in cache:

$ time svn proplist -R . > /dev/null

real0m1.218s
user0m0.015s
sys 0m0.031s


3) For comparison, I also tested with SlikSVN 1.6.13. This is still
more than twice as fast:

$ time svn proplist -R . > /dev/null

real0m0.578s
user0m0.015s
sys 0m0.046s


Cheers,
-- 
Johan


Re: SQLite and callbacks

2011-02-05 Thread Johan Corveleyn
On Sat, Feb 5, 2011 at 9:37 PM, Johan Corveleyn jcor...@gmail.com wrote:
 On Sat, Feb 5, 2011 at 7:14 PM, Stefan Sperling s...@elego.de wrote:
 On Sat, Feb 05, 2011 at 06:47:35PM +0100, Branko Čibej wrote:
 I would not worry about existing clients -- simply mark the existing
 APIs as deprecated, but keep them and do not attempt to improve their
 performance.

 Neglecting performance of backwards compat code is an interesting idea.
 It allows us to focus on the new APIs first and foremost.

 The existing APIs will continue to work using the node walker and issue
 queries per node as they do now. We could consider optimising them later,
 before or after 1.7 release. The required changes are mostly mechanical.

 I agree with most of what's been said here. I think it would be a pity
 to use WC-NG in a way that provides far from optimal performance.

 FWIW, I just did a quick run of your per-directory proplist query vs.
 the per-node version, on my Windows XP platform, to have another data
 point. The performance improvement is significant, but not
 earth-shattering (around 20%).

 Just tested with a freshly checked out working copy of svn trunk:

 1) Per-node queries (r1066540). Looking at the third run, to make sure
 everything is hot in cache:

 $ time svn proplist -R . > /dev/null

 real    0m1.532s
 user    0m0.015s
 sys     0m0.015s


 2) Per-dir queries (r1066541). Looking at the third run, to make sure
 everything is hot in cache:

 $ time svn proplist -R . > /dev/null

 real    0m1.218s
 user    0m0.015s
 sys     0m0.031s


 3) For comparison, I also tested with SlikSVN 1.6.13. This is still
 more than twice as fast:

 $ time svn proplist -R . > /dev/null

 real    0m0.578s
 user    0m0.015s
 sys     0m0.046s

I should've added one more test, to put things in perspective :-),
namely the test with r1039808 (the per-wc query version):

$ time svn proplist -R . > /dev/null

real0m0.328s
user0m0.015s
sys 0m0.031s

(but, as you probably already know, this execution is not cancellable,
which is pretty annoying :-), especially when I forget to redirect to
/dev/null and let it write to the console).

Cheers,
-- 
Johan