Hi all, a recent comment from Paul on the revision model RFC reminded me that 
we should have a discussion on how we maintain aggregate statistics about 
databases stored in FoundationDB. I’ll ignore the statistics associated with 
secondary indexes for the moment, assuming that the design we put in place for 
document data can serve as the basis for an extension there.

The first group of statistics comprises the ones we report in GET /<dbname>,
which are documented here:

http://docs.couchdb.org/en/stable/api/database/common.html#get--db

These fall into a few different classes:

doc_count, doc_del_count: these should be maintained using FoundationDB’s 
atomic operations. The revision model RFC enumerated all the possible update 
paths and showed that we always have enough information to know whether to 
increment or decrement each of these counters; i.e., we always know when we’re 
removing the last deleted=false branch, adding a new branch to a 
previously-deleted document, etc.
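
To illustrate, here’s roughly what that looks like with FDB’s atomic ADD
mutation, using the Python bindings; the subspace layout and key names below
are invented for the example rather than taken from the RFC:

    import struct

    import fdb
    fdb.api_version(620)

    @fdb.transactional
    def bump_doc_counts(tr, db_subspace, doc_delta, del_delta):
        # ADD is a conflict-free atomic mutation: concurrent writers bumping
        # the same counter never conflict with each other.
        if doc_delta != 0:
            tr.add(db_subspace.pack(("meta", "doc_count")),
                   struct.pack("<q", doc_delta))
        if del_delta != 0:
            tr.add(db_subspace.pack(("meta", "doc_del_count")),
                   struct.pack("<q", del_delta))

    @fdb.transactional
    def read_doc_count(tr, db_subspace):
        val = tr[db_subspace.pack(("meta", "doc_count"))]
        return struct.unpack("<q", bytes(val))[0] if val.present() else 0

So e.g. bump_doc_counts(db, db_subspace, 1, 0) for a brand-new document, or
bump_doc_counts(db, db_subspace, -1, 1) when removing the last deleted=false
branch.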

update_seq: this must _not_ be maintained as its own key; attempting to do so 
would cause every write to the database to conflict with every other write and 
kill throughput. Rather, we can do a limit=1 reverse range read at the end of
the ?CHANGES space to retrieve the current sequence of the database.
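
As a sketch (Python again, with an illustrative ?CHANGES layout where each
change entry is keyed on the transaction’s versionstamp), the write and read
sides would look something like:

    import fdb
    fdb.api_version(620)

    @fdb.transactional
    def append_change(tr, changes_subspace, doc_id):
        # Each change entry is keyed on the transaction's versionstamp, so
        # there is no shared update_seq key for concurrent writers to
        # conflict on.
        key = fdb.tuple.pack_with_versionstamp(
            (fdb.tuple.Versionstamp(),), prefix=changes_subspace.key())
        tr.set_versionstamped_key(key, fdb.tuple.pack((doc_id,)))

    @fdb.transactional
    def current_update_seq(tr, changes_subspace):
        # The update_seq is just the last key in the ?CHANGES space, fetched
        # with a limit=1 reverse range read.
        r = changes_subspace.range()
        for kv in tr.get_range(r.start, r.stop, limit=1, reverse=True):
            return changes_subspace.unpack(kv.key)[-1]
        return None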

sizes.*: things get a little weird here. Historically we relied on the 
relationship between sizes.active and sizes.file to know when to trigger a 
database compaction, but we don’t yet have a need for compaction in the 
FDB-based data model and it’s not clear how we should define these two 
quantities. The sizes.external field has also been a little fuzzy. Ignoring the 
various definitions of “size” for the moment, let’s agree that we’ll want to be 
tracking some set of byte counts for each database. I think the way to do this
is to extend the information stored in each edit branch in ?REVISIONS to
include the size(s) of the current revision. When we update a document we
compare the size(s) of the new revision with the size(s) of the parent and
update the database-level atomic counter(s) by the difference. This requires an
enhancement to RFC 001.
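
To make that concrete, here’s a rough sketch in Python; the ?REVISIONS and
counter key names are placeholders, and the real layout is whatever the RFC
001 enhancement ends up specifying:

    import struct

    import fdb
    fdb.api_version(620)

    @fdb.transactional
    def apply_revision_size(tr, db_subspace, doc_id, rev, new_size, parent_size):
        # Store this revision's size alongside its edit branch so the next
        # update on the branch can compute a delta against it.
        tr[db_subspace.pack(("revisions", doc_id, rev, "size"))] = \
            struct.pack("<q", new_size)
        # Apply only the delta to the database-level counter; ADD stores a
        # little-endian two's complement integer, so negative deltas work too.
        delta = new_size - parent_size
        if delta != 0:
            tr.add(db_subspace.pack(("meta", "sizes", "external")),
                   struct.pack("<q", delta))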

I’d like to further propose that we track byte counts not just at the database
level but also across the entire Directory associated with a single CouchDB
deployment, so that FoundationDB administrators hosting multiple applications
on a single cluster can get a view of per-Directory resource utilization
without walking every database stored inside it.
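
For example, the same size delta could be applied to a second counter living
at the root of the CouchDB Directory; the Directory path and key names here
are placeholders:

    import struct

    import fdb
    fdb.api_version(620)
    db = fdb.open()

    # Root Directory for this CouchDB deployment; individual databases would
    # live in subspaces underneath it.
    couch_dir = fdb.directory.create_or_open(db, ("couchdb",))

    @fdb.transactional
    def bump_external_size(tr, db_subspace, delta):
        packed = struct.pack("<q", delta)
        tr.add(db_subspace.pack(("meta", "sizes", "external")), packed)  # per database
        tr.add(couch_dir.pack(("meta", "sizes", "external")), packed)    # per Directory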

Looking past the DB info endpoint, one other statistic worth discussing is the 
“offset” field included with every response to an _all_docs request. This is 
not something that we get for free in FoundationDB, and I have to confess it 
seems to be of limited utility. We could support it by maintaining a tree of
aggregation keys on top of the keys stored in the _all_docs space, but I’m
skeptical that it’s worth baking that extra cost into every database update and
_all_docs operation. I’d like to hear others’ thoughts on this one.
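
To make the cost/benefit discussion concrete, here’s one deliberately
simplified (single-level rather than tree-shaped) version of the
aggregation-key idea, again in Python with invented key names. Every document
insert or delete pays an extra counter bump, and computing offset means
summing the buckets before the start key plus one range scan inside its bucket
(a real tree would bound that scan):

    import struct

    import fdb
    fdb.api_version(620)

    @fdb.transactional
    def bump_offset_bucket(tr, db_subspace, doc_id, delta):
        # Bucket _all_docs entries by the first byte of the doc id and keep a
        # running count per bucket. This is the extra write on every update.
        tr.add(db_subspace.pack(("all_docs_agg", doc_id[:1])),
               struct.pack("<q", delta))

    @fdb.transactional
    def all_docs_offset(tr, db_subspace, start_id):
        offset = 0
        # Sum the counters for buckets that sort strictly before the start
        # key's bucket...
        for kv in tr.get_range(db_subspace.pack(("all_docs_agg",)),
                               db_subspace.pack(("all_docs_agg", start_id[:1]))):
            offset += struct.unpack("<q", kv.value)[0]
        # ...then count the remaining ids within that bucket with a range read.
        offset += sum(1 for _ in tr.get_range(
            db_subspace.pack(("all_docs", start_id[:1])),
            db_subspace.pack(("all_docs", start_id))))
        return offset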

I haven’t yet looked closely at _stats and _system to see if any of those 
metrics require specific support from FDB.

Adam
