Hi all, a recent comment from Paul on the revision model RFC reminded me that we should have a discussion about how we maintain aggregate statistics for databases stored in FoundationDB. I’ll ignore the statistics associated with secondary indexes for the moment, assuming that the design we put in place for document data can serve as the basis for an extension there.

The first class of statistics are the ones we report in GET /<dbname>, which are documented here:

http://docs.couchdb.org/en/stable/api/database/common.html#get--db

These fall into a few different classes:

- doc_count, doc_del_count: these should be maintained using FoundationDB’s atomic operations. The revision model RFC enumerated all the possible update paths and showed that we always have enough information to know whether to increment or decrement each of these counters; i.e., we always know when we’re removing the last deleted=false branch, adding a new branch to a previously-deleted document, etc. (There’s a rough sketch of this bookkeeping in the P.S. below.)

- update_seq: this must _not_ be maintained as its own key; attempting to do so would cause every write to the database to conflict with every other write and kill throughput. Rather, we can do a limit=1, reverse range read against the end of the ?CHANGES space to retrieve the current sequence of the database.

- sizes.*: things get a little weird here. Historically we relied on the relationship between sizes.active and sizes.file to know when to trigger a database compaction, but we don’t yet have a need for compaction in the FDB-based data model and it’s not clear how we should define these two quantities. The sizes.external field has also been a little fuzzy. Ignoring the various definitions of “size” for the moment, let’s agree that we’ll want to track some set of byte counts for each database. I think the way to do this is to extend the information stored with each edit branch in ?REVISIONS to include the size(s) of that revision. When we update a document we compare the size(s) of the new revision with the size(s) of its parent, and apply the difference to the database-level atomic counter(s). This requires an enhancement to RFC 001; a sketch of the approach is also in the P.S. below.

I’d like to further propose that we track byte counts not just at the database level but also across the entire Directory associated with a single CouchDB deployment, so that FoundationDB administrators managing multiple applications on a single cluster can get a view of per-Directory resource utilization without walking every database stored inside it.

Looking past the DB info endpoint, one other statistic worth discussing is the “offset” field included with every response to an _all_docs request. This is not something we get for free in FoundationDB, and I have to confess it seems to be of limited utility. We could support it by maintaining a tree of additional aggregation keys on top of the keys stored in the _all_docs space, but I’m skeptical that it’s worth baking that extra cost into every database update and _all_docs operation. I’d like to hear others’ thoughts on this one.

I haven’t yet looked closely at _stats and _system to see whether any of those metrics require specific support from FDB.

Adam
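
P.S. To make the counter approach a bit more concrete, here’s a rough sketch using the FoundationDB Python bindings. The key layout (a ‘couchdb’/dbname subspace with ‘meta’ and ‘changes’ children), the function names, and the assumption that the sequence is the last element of the ?CHANGES key are all placeholders for illustration, not settled design.

    import struct
    import fdb

    fdb.api_version(600)
    db = fdb.open()

    def db_subspace(dbname):
        # Hypothetical per-database subspace; the real layout comes from the
        # Directory design elsewhere in the RFCs.
        return fdb.Subspace(('couchdb', dbname))

    @fdb.transactional
    def bump_doc_counts(tr, dbname, doc_count_delta, doc_del_count_delta):
        # The revision-model logic supplies the deltas (+1 / -1 / 0) for each
        # update path; here we only apply them as conflict-free atomic adds.
        meta = db_subspace(dbname)['meta']
        if doc_count_delta:
            tr.add(meta.pack(('doc_count',)), struct.pack('<q', doc_count_delta))
        if doc_del_count_delta:
            tr.add(meta.pack(('doc_del_count',)), struct.pack('<q', doc_del_count_delta))

    @fdb.transactional
    def read_update_seq(tr, dbname):
        # update_seq is not its own key; read the last entry in the ?CHANGES
        # space with a limit=1 reverse range read instead.
        changes = db_subspace(dbname)['changes']
        r = changes.range()
        for k, _v in tr.get_range(r.start, r.stop, limit=1, reverse=True):
            return changes.unpack(k)[0]  # placeholder: sequence as last key element
        return 0

The point is just that doc_count and doc_del_count become conflict-free atomic adds, and update_seq becomes a single reverse range read, so GET /<dbname> never has to touch a hot key.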
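
Similarly, here’s the shape of the size bookkeeping I have in mind for sizes.*. In reality the ?REVISIONS value would carry the revision tree metadata from RFC 001 plus the byte count (not just a bare integer), the exact definition of “size” is still open, and the deployment-level subspace standing in for the Directory here is likewise just for illustration.

    import struct
    import fdb

    fdb.api_version(600)
    db = fdb.open()

    deployment = fdb.Subspace(('couchdb',))  # stand-in for the deployment's Directory

    def revs_subspace(dbname):
        return fdb.Subspace(('couchdb', dbname, 'revisions'))

    def meta_subspace(dbname):
        return fdb.Subspace(('couchdb', dbname, 'meta'))

    @fdb.transactional
    def record_revision_size(tr, dbname, docid, parent_rev, new_rev, new_size):
        revs = revs_subspace(dbname)

        # Size recorded with the parent edit branch; zero for a brand-new document.
        old_size = 0
        if parent_rev is not None:
            parent_val = tr[revs.pack((docid, parent_rev))]
            if parent_val.present():
                old_size = fdb.tuple.unpack(parent_val)[0]

        # Store the size alongside the new edit branch (the proposed RFC 001 change).
        tr[revs.pack((docid, new_rev))] = fdb.tuple.pack((new_size,))

        # Apply the delta to both the database-level and Directory-level counters
        # with conflict-free atomic adds.
        delta = struct.pack('<q', new_size - old_size)
        tr.add(meta_subspace(dbname).pack(('sizes', 'external')), delta)
        tr.add(deployment.pack(('sizes', 'external')), delta)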
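
Finally, just to make concrete what I mean by aggregation keys for “offset”: the sketch below keeps a single level of per-bucket counters alongside _all_docs. A real implementation would need multiple levels (and rebalancing) to keep the offset query cheap, which is exactly the per-update cost I’m wary of. Bucketing by the first character of the doc id is an arbitrary choice for illustration.

    import struct
    import fdb

    fdb.api_version(600)
    db = fdb.open()

    # Hypothetical layout: the _all_docs keys for a database plus a parallel
    # subspace of per-bucket counters.
    all_docs = fdb.Subspace(('couchdb', 'dbname', 'all_docs'))
    buckets = fdb.Subspace(('couchdb', 'dbname', 'all_docs_counts'))

    @fdb.transactional
    def on_doc_created(tr, docid):
        # Every insert into _all_docs also bumps its bucket's counter (deletions
        # would decrement it) -- this is the extra write baked into every update.
        tr.add(buckets.pack((docid[:1],)), struct.pack('<q', 1))

    @fdb.transactional
    def offset_of(tr, start_docid):
        # offset = number of _all_docs keys before start_docid: sum the whole
        # buckets that sort before its bucket, then count the stragglers inside
        # that bucket with an ordinary range read.
        total = 0
        for _k, v in tr.get_range(buckets.range().start, buckets.pack((start_docid[:1],))):
            total += struct.unpack('<q', v)[0]
        for _kv in tr.get_range(all_docs.pack((start_docid[:1],)), all_docs.pack((start_docid,))):
            total += 1
        return total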
The first class of statistics are the ones we report in GET /<dbname>, which are documented here: http://docs.couchdb.org/en/stable/api/database/common.html#get--db These fall into a few different classes: doc_count, doc_del_count: these should be maintained using FoundationDB’s atomic operations. The revision model RFC enumerated all the possible update paths and showed that we always have enough information to know whether to increment or decrement each of these counters; i.e., we always know when we’re removing the last deleted=false branch, adding a new branch to a previously-deleted document, etc. update_seq: this must _not_ be maintained as its own key; attempting to do so would cause every write to the database to conflict with every other write and kill throughput. Rather, we can do a limit=1 range read on the end of the ?CHANGES space to retrieve the current sequence of the database. sizes.*: things get a little weird here. Historically we relied on the relationship between sizes.active and sizes.file to know when to trigger a database compaction, but we don’t yet have a need for compaction in the FDB-based data model and it’s not clear how we should define these two quantities. The sizes.external field has also been a little fuzzy. Ignoring the various definitions of “size” for the moment, let’s agree that we’ll want to be tracking some set of byte counts for each database. I think the way we should do this is by extending the information stored in each edit branch in ?REVISIONS to included the size(s) of the current revision. When we update a document we need to compare the size(s) of the new revision with the size(s) of the parent, and update the database level atomic counter(s) appropriately. This requires an enhancement to RFC 001. I’d like to further propose that we track byte counts not just at a database level but also across the entire Directory associated with a single CouchDB deployment, so that FoundationDB administrators managing multiple applications for a single cluster can have a better view of per-Directory resource utilization without walking every single database stored inside. Looking past the DB info endpoint, one other statistic worth discussing is the “offset” field included with every response to an _all_docs request. This is not something that we get for free in FoundationDB, and I have to confess it seems to be of limited utility. We could support this by implementing a tree structure by adding additional aggregation keys on top of the keys stored in the _all_docs space, but I’m skeptical that it’s worth baking this extra cost into every database update and _all_docs operation. I’d like to hear others’ thoughts on this one. I haven’t yet looked closely at _stats and _system to see if any of those metrics require specific support from FDB. Adam