Can anyone spare a bit of time to take this forward? I'll tell you what I'm planning to do next. Maybe someone can get to this before I can.
The first thing I want to do is add a DB index on the 'refcount' column of the 'pristine' table. I'm convinced that will be necessary for adequate performance. To do so involves: * Adding the 'CREATE INDEX' SQL statement. That's simple - just copy/modify/paste one line in wc-metadata.sql. * Adding a WC format bump step that adds the index to existing dev WCs. Also quite simple. * I'd be happier if I ran a timing comparison, before and after, with searching for unreferenced pristines (those whose refcount is 0). That would give me confidence that adding the index has actually happened and that it has the desired effect. It would need a test with a large number of pristines, a scattered selection of them having refcount=0, and then time the running of svn_wc__db_pristine_cleanup(). With an index, this should find them all very quickly and the deletion time should dominate the whole cleanup operation, whereas without an index the time taken to scan the whole table should dominate when the table is very large. On Thu, 2011-01-13, Branko Čibej wrote: > On 13.01.2011 20:01, Julian Foad wrote: > > I have committed the ref counting for pristine texts (r1058523) and have > > had a bit more insight into when to perform the deletion. > > > > Deleting unreferenced texts on closing a "wcroot" is too late - too > > large a granularity - for the way the WC code is currently structured > > because this doesn't happen until a (long-lived) pool is cleaned up, as > > Bert pointed out. > > > > Deleting unreferenced texts after running the Work Queue is too soon - > > too fine a granularity - for the way "commit" is currently structured. > > A simple test of this approach showed that by the time the post-commit > > processing tries to install a reference to a pristine text whose > > checksum was noted earlier, in some cases that pristine row has already > > been purged. > > This would indicate that the reference counting happens too soon ... in > other words, that a pristine can be dereferenced whilst some part of the > code (or database) still refers to it. That breaks database consistency > -- what happens if the user aborts a commit and then calls 'svn > cleanup', for example? If what I said about 'commit' is correct, then yes, that's bad and we should look for a better way. But I haven't tested it properly; I noticed that the commit failed saying that the DB failed a 'REFERENCES' clause, and what I said here is a hypothesis about how that happened. Investingating this is the second thing I would want to do next. > > At the moment I think the practical solution is to insert calls to the > > pristine cleanup at the end of each main libsvn_wc public API that may > > have freed up one or more pristines. > > Mmm. Too many points of failure, don't you think? Of course, it's no > "worse" than the cancellation handlers, however, and I suppose missing a > chance to delete is better than having one too many. I'm thinking this is nevertheless the best option. The consequence of missing some places is not too bad at all. If we miss places where we should clean up, then all it means is that some operations will leak unused pristines, but they won't be leaked permanently, only until the user runs one of the operations that do have the cleanup call in them. That's a consequence of calling a global "clean up all unreferenced pristines" function rather than trying to keep track of which pristines have been freed up by *this* operation. > > Wherever we do it, if it's at all frequent, the search for unreferenced > > pristines needs to be efficient. At present I use "SELECT ... FROM > > PRISTINE WHERE refcount = 0". I believe that can most easily be made > > efficient by adding an index on the 'refcount' column, so that's what I > > intend to do. > > That's the only thing you can do, unless you want to enhance the > triggers to populate a queue of to-be-deleted pristines. I agree that adding an index on the table is a better solution than trying to maintain a separate queue of the pristines that are waiting to be cleaned up. The index makes it efficient for a simple query ('refcount = 0') to provide the same information as the separate queue would provide. - Julian