On Tue, Sep 28, 2010 at 08:46, Philip Martin <philip.mar...@wandisco.com> wrote:
> Julian Foad <julian.f...@wandisco.com> writes:
>
>>> I believe the answer is "often". A simple 'svn status' will need to
>>> distinguish between 'A' and 'R', and that is done by checking
>>> prior_deleted.
>>>
>>> And no... this amount of denormalization will not hurt us.
>>
>> OK. I know that "svn status" speed is a big deal.
>
> I'm not sure prior_deleted is an optimisation. When I last looked at
Great analysis below. I'll take your word for it :-)

prior_deleted is effectively something like base_shadowed, and yeah...
if we're iterating over all nodes within a parent, then it can be
generated "simply".

> the performance of SQLite I concluded that status would be too slow if
> it ran per-node queries, it has to run per-dir queries. So
>
>   SELECT ... FROM BASE_NODE WHERE wc_id = ?1 AND local_relpath = ?2
>
> is too slow, we need to run
>
>   SELECT ... FROM BASE_NODE WHERE wc_id = ?1 AND parent_relpath = ?2
>
> iterate over the rows caching the data and then generate the status
> for each child.

To do this, it seems that we're going to need to expose (from wc_db.h)
a structure containing "all" the row data. Thankfully, this big ol'
structure is *private* to libsvn_wc, and we can change it at will
(unlike svn_wc_status_t).

I would suggest that we use a callback -- wc_db can step through each
row of the result, fill in the structure, and execute a callback
(clearing an iterpool on each "step" with the cursor).

The one tricky part to a callback is eliding all but the highest
op_depth. Does an ORDER BY in the query affect its runtime at all?

>...
> but that nested SELECT is expensive. It's not as bad as doing
> per-node queries but it is still too slow, the database query alone is
> slower than 1.6 status. I don't think this is something that can be
> fixed using an index because on my test data the runtime generally
> goes up by orders of magnitude when the indexing is wrong.

Yeah... the more indexes present, the more that need to be maintained
when rows are modified.

> I can get round this by querying for all op_depth and using the cache
> to select the highest. The cache is a hash, keyed on local_relpath,
> that stores the data associated with the highest op_depth seen and it
> simply discards/overwrites lower op_depth.
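For illustration only (not the actual libsvn_wc code, which is C): the
hash-cache approach described above can be sketched in a few lines of
Python against a toy NODES table. The schema is cut down to the columns
the discussion mentions; one per-dir query fetches every op_depth, and a
dict keyed on local_relpath discards/overwrites lower op_depth as rows
stream past.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Toy subset of the NODES schema; the real wc.db has many more columns.
conn.execute("""CREATE TABLE NODES (
                    wc_id INTEGER,
                    local_relpath TEXT,
                    op_depth INTEGER,
                    parent_relpath TEXT,
                    presence TEXT)""")
conn.executemany("INSERT INTO NODES VALUES (?, ?, ?, ?, ?)", [
    (1, "trunk/a", 0, "trunk", "normal"),   # BASE row...
    (1, "trunk/a", 1, "trunk", "normal"),   # ...shadowed by an op_depth 1 row
    (1, "trunk/b", 0, "trunk", "normal"),   # plain BASE node
    (1, "trunk/c", 1, "trunk", "normal"),   # added node, no BASE underneath
])

def read_children(conn, wc_id, parent_relpath):
    """One per-dir query; keep only the highest op_depth per child."""
    cache = {}
    cursor = conn.execute(
        "SELECT local_relpath, op_depth, presence FROM NODES "
        "WHERE wc_id = ? AND parent_relpath = ?", (wc_id, parent_relpath))
    for relpath, op_depth, presence in cursor:
        seen = cache.get(relpath)
        if seen is None or op_depth > seen[0]:
            cache[relpath] = (op_depth, presence)
    return cache

children = read_children(conn, 1, "trunk")
```

Only `trunk/a`'s op_depth 1 row survives in the cache; the op_depth 0
row is overwritten when the higher one is seen.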
> I've prototyped this and

If we order by local_relpath, then the "cache" is simply one row
(hanging out in a structure, waiting to be passed to that callback).

> it's as fast as the per-dir BASE_NODE query on my test data. This is
> not surprising since my test data consists of mostly single op_depth
> with a few double op_depth. We have to have good performance on such
> working copies because they are the most common, I think it will be
> unusual to have a large working copy where lots of nodes have a large
> number of op_depth.

Agreed.

> Now to prior_deleted. The fundamental status query looks like
>
>   SELECT ... FROM NODES WHERE wc_id = ?1 AND parent_relpath = ?2
>
> and all op_depth are seen. It's quite simple to cache the two highest
> op_depth so prior_deleted only provides information that is already
> available. It's not an optimisation for status, if anything it will
> make the database bigger and slower.

Again, thanks for the analysis. Okay... let's skip it!

>...

Cheers,
-g
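A sketch of the ORDER BY variant, again as illustrative Python rather
than the real C implementation: with rows sorted by local_relpath the
"cache" shrinks to a single pending node, and remembering just the two
highest op_depth values per node is enough to distinguish a replacement
('R', something shadowed underneath) from a plain add ('A') -- which is
why the denormalized prior_deleted column adds nothing for status.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Toy subset of the NODES schema from the discussion.
conn.execute("""CREATE TABLE NODES (
    wc_id INTEGER, local_relpath TEXT, op_depth INTEGER,
    parent_relpath TEXT)""")
conn.executemany("INSERT INTO NODES VALUES (?, ?, ?, ?)", [
    (1, "trunk/a", 0, "trunk"),   # BASE row under a replacement
    (1, "trunk/a", 1, "trunk"),   # replacement -> status 'R'
    (1, "trunk/b", 1, "trunk"),   # add with no lower op_depth -> status 'A'
])

def walk_children(conn, wc_id, parent_relpath, callback):
    """Stream rows in order; hold one pending node, track two op_depths."""
    cursor = conn.execute(
        "SELECT local_relpath, op_depth FROM NODES "
        "WHERE wc_id = ? AND parent_relpath = ? "
        "ORDER BY local_relpath, op_depth DESC", (wc_id, parent_relpath))
    pending = None  # (relpath, highest op_depth, next-highest or None)
    for relpath, op_depth in cursor:
        if pending is None or pending[0] != relpath:
            if pending is not None:
                callback(*pending)          # flush the previous node
            pending = (relpath, op_depth, None)
        elif pending[2] is None:
            pending = (pending[0], pending[1], op_depth)  # keep two highest
    if pending is not None:
        callback(*pending)

statuses = {}
def status_cb(relpath, top, below):
    if top > 0:
        statuses[relpath] = "R" if below is not None else "A"
    else:
        statuses[relpath] = "normal"

walk_children(conn, 1, "trunk", status_cb)
```

`trunk/a` comes back as 'R' because a lower op_depth row was seen
beneath its top one; `trunk/b` has no shadowed row and reports 'A'.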