On Tue, Sep 28, 2010 at 08:46, Philip Martin <philip.mar...@wandisco.com> wrote:
> Julian Foad <julian.f...@wandisco.com> writes:
>
>>> I believe the answer is "often". A simple 'svn status' will need to
>>> distinguish between 'A' and 'R', and that is done by checking
>>> prior_deleted.
>>>
>>> And no... this amount of denormalization will not hurt us.
>>
>> OK. I know that "svn status" speed is a big deal.
>
> I'm not sure prior_deleted is an optimisation. When I last looked at
Great analysis below. I'll take your word for it :-)

prior_deleted is effectively something like base_shadowed, and yeah...
if we're iterating over all nodes within a parent, then it can be
generated "simply".

> the performance of SQLite I concluded that status would be too slow if
> it ran per-node queries, it has to run per-dir queries. So
>
>   SELECT ... FROM BASE_NODE WHERE wc_id = ?1 AND local_relpath = ?2
>
> is too slow, we need to run
>
>   SELECT ... FROM BASE_NODE WHERE wc_id = ?1 AND parent_relpath = ?2
>
> iterate over the rows caching the data and then generate the status
> for each child.

To do this, it seems that we're going to need to expose (from wc_db.h)
a structure containing "all" the row data. Thankfully, this big ol'
structure is *private* to libsvn_wc, and we can change it at will
(unlike svn_wc_status_t).

I would suggest that we use a callback -- wc_db can step through each
row of the result, fill in the structure, and execute a callback
(clearing an iterpool on each "step" with the cursor).

The one tricky part to a callback is eliding all but the highest
op_depth. Does an ORDER BY in the query affect its runtime at all?

>...
> but that nested SELECT is expensive. It's not as bad as doing
> per-node queries but it is still too slow, the database query alone is
> slower than 1.6 status. I don't think this is something that can be
> fixed using an index because on my test data the runtime generally
> goes up by orders of magnitude when the indexing is wrong.

Yeah... the more indexes present, the more that need to be maintained
when rows are modified.

> I can get round this by querying for all op_depth and using the cache
> to select the highest. The cache is a hash, keyed on local_relpath,
> that stores the data associated with the highest op_depth seen and it
> simply discards/overwrites lower op_depth.
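For illustration only (not the actual libsvn_wc code, which is C): the
hash-cache approach described above can be sketched in a few lines of
Python against a toy NODES table. The schema is cut down to the columns
the discussion mentions; one per-dir query fetches every op_depth, and a
dict keyed on local_relpath discards/overwrites lower op_depth as rows
stream past.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Toy subset of the NODES schema; the real wc.db has many more columns.
conn.execute("""CREATE TABLE NODES (
                    wc_id INTEGER,
                    local_relpath TEXT,
                    op_depth INTEGER,
                    parent_relpath TEXT,
                    presence TEXT)""")
conn.executemany("INSERT INTO NODES VALUES (?, ?, ?, ?, ?)", [
    (1, "trunk/a", 0, "trunk", "normal"),   # BASE row...
    (1, "trunk/a", 1, "trunk", "normal"),   # ...shadowed by an op_depth 1 row
    (1, "trunk/b", 0, "trunk", "normal"),   # plain BASE node
    (1, "trunk/c", 1, "trunk", "normal"),   # added node, no BASE underneath
])

def read_children(conn, wc_id, parent_relpath):
    """One per-dir query; keep only the highest op_depth per child."""
    cache = {}
    cursor = conn.execute(
        "SELECT local_relpath, op_depth, presence FROM NODES "
        "WHERE wc_id = ? AND parent_relpath = ?", (wc_id, parent_relpath))
    for relpath, op_depth, presence in cursor:
        seen = cache.get(relpath)
        if seen is None or op_depth > seen[0]:
            cache[relpath] = (op_depth, presence)
    return cache

children = read_children(conn, 1, "trunk")
```

Only `trunk/a`'s op_depth 1 row survives in the cache; the op_depth 0
row is overwritten when the higher one is seen.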
> I've prototyped this and

If we order by local_relpath, then the "cache" is simply one row
(hanging out in a structure, waiting to be passed to that callback).

> it's as fast as the per-dir BASE_NODE query on my test data. This is
> not surprising since my test data consists of mostly single op_depth
> with a few double op_depth. We have to have good performance on such
> working copies because they are the most common, I think it will be
> unusual to have a large working copy where lots of nodes have a large
> number of op_depth.

Agreed.

> Now to prior_deleted. The fundamental status query looks like
>
>   SELECT ... FROM NODES WHERE wc_id = ?1 AND parent_relpath = ?2
>
> and all op_depth are seen. It's quite simple to cache the two highest
> op_depth so prior_deleted only provides information that is already
> available. It's not an optimisation for status, if anything it will
> make the database bigger and slower.

Again, thanks for the analysis. Okay... let's skip it!

>...

Cheers,
-g
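A sketch of the ORDER BY variant, again as illustrative Python rather
than the real C implementation: with rows sorted by local_relpath the
"cache" shrinks to a single pending node, and remembering just the two
highest op_depth values per node is enough to distinguish a replacement
('R', something shadowed underneath) from a plain add ('A') -- which is
why the denormalized prior_deleted column adds nothing for status.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Toy subset of the NODES schema from the discussion.
conn.execute("""CREATE TABLE NODES (
    wc_id INTEGER, local_relpath TEXT, op_depth INTEGER,
    parent_relpath TEXT)""")
conn.executemany("INSERT INTO NODES VALUES (?, ?, ?, ?)", [
    (1, "trunk/a", 0, "trunk"),   # BASE row under a replacement
    (1, "trunk/a", 1, "trunk"),   # replacement -> status 'R'
    (1, "trunk/b", 1, "trunk"),   # add with no lower op_depth -> status 'A'
])

def walk_children(conn, wc_id, parent_relpath, callback):
    """Stream rows in order; hold one pending node, track two op_depths."""
    cursor = conn.execute(
        "SELECT local_relpath, op_depth FROM NODES "
        "WHERE wc_id = ? AND parent_relpath = ? "
        "ORDER BY local_relpath, op_depth DESC", (wc_id, parent_relpath))
    pending = None  # (relpath, highest op_depth, next-highest or None)
    for relpath, op_depth in cursor:
        if pending is None or pending[0] != relpath:
            if pending is not None:
                callback(*pending)          # flush the previous node
            pending = (relpath, op_depth, None)
        elif pending[2] is None:
            pending = (pending[0], pending[1], op_depth)  # keep two highest
    if pending is not None:
        callback(*pending)

statuses = {}
def status_cb(relpath, top, below):
    if top > 0:
        statuses[relpath] = "R" if below is not None else "A"
    else:
        statuses[relpath] = "normal"

walk_children(conn, 1, "trunk", status_cb)
```

`trunk/a` comes back as 'R' because a lower op_depth row was seen
beneath its top one; `trunk/b` has no shadowed row and reports 'A'.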