Hi all,

Most of the discussions so far have focused on the core features that are 
fundamental to CouchDB: JSON documents, revision tracking, _changes. I thought 
I’d start a thread on something a bit different: the _db_updates feed.

The _db_updates feed is an API that enables users to discover database 
lifecycle events across an entire CouchDB instance. It’s primarily useful in 
deployments that have lots and lots of databases, where it’s impractical to 
keep connections open for every database, and where database creations and 
deletions may be an automated aspect of the application’s use of CouchDB.

There are really two topics for discussion here. The first is: do we need to 
keep it? The primary driver of applications creating lots of DBs is the per-DB 
granularity of access controls; if we go down the route of implementing the 
document-level _access proposal, perhaps users will naturally migrate away 
from the DB-per-user data model. I’d be curious to hear points of view there.

I’ll assume for now that we do want to keep it, and offer some thoughts on how 
to implement it. The main challenge with _db_updates is managing write 
contention; in write-heavy databases you have a lot of producers trying to tag 
that particular database as “updated”, but all the consumer really cares about 
is getting a single “dbname”:“updated” event as needed. In the current 
architecture we try to dedupe a lot of the events in-memory before updating a 
regular CouchDB database with this information, but this leaves us exposed to 
dropping events within a window of a few seconds.

## Option 1: Queue + Compaction

One way to tackle this in FoundationDB is to have an intermediate subspace 
reserved as a queue. Each transaction that modifies a database would insert a 
versionstamped KV into the queue like

Versionstamp = (DbName, EventType)
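
To make that concrete, here’s a minimal sketch of the producer side using the 
FoundationDB Python bindings. The subspace name and API version are 
placeholders, not a proposal:

    import fdb
    import fdb.tuple

    fdb.api_version(600)
    db = fdb.open()

    QUEUE = ('db_updates_queue',)  # hypothetical subspace prefix

    @fdb.transactional
    def enqueue_db_event(tr, db_name, event_type):
        # In the real system this runs inside the same transaction as the
        # document update itself, so no event can be dropped.
        key = fdb.tuple.pack_with_versionstamp(
            QUEUE + (fdb.tuple.Versionstamp(),))
        tr.set_versionstamped_key(key, fdb.tuple.pack((db_name, event_type)))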

Versionstamps are monotonically increasing, and inserting versionstamped keys 
is a conflict-free operation. We’d have a consumer of this queue responsible 
for “log compaction”; i.e., the consumer would do range reads on the queue 
subspace, toss out contiguous duplicate “dbname”:“updated” events, and update 
a second index that would look more like the _changes feed.

### Scaling Consumers

A single consumer can likely process 10k events/sec or more, but eventually 
we’ll need to scale. Borrowing from systems like Kafka, the typical way to do 
this is to divide the queue into partitions and map an individual consumer to 
each partition. A partition in this model would just be a prefix on the 
Versionstamp:

(PartitionID, Versionstamp) = (DbName, EventType)

Our consumers will be more efficient, and less likely to conflict with one 
another when updating the _db_updates index, if messages are keyed to a 
partition based on DbName, although this still runs the risk that a couple of 
high-throughput databases could swamp a partition.
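
As a sketch of that keying scheme (the partition count and hash choice here 
are arbitrary, and QUEUE is the prefix from the earlier sketch):

    import hashlib

    NUM_PARTITIONS = 16  # hypothetical tuning knob

    def partition_for(db_name):
        # A stable hash keeps all of a database's events in one partition,
        # which is what makes the per-partition dedupe effective
        digest = hashlib.md5(db_name.encode('utf-8')).digest()
        return digest[0] % NUM_PARTITIONS

    @fdb.transactional
    def enqueue_partitioned(tr, db_name, event_type):
        key = fdb.tuple.pack_with_versionstamp(
            QUEUE + (partition_for(db_name), fdb.tuple.Versionstamp()))
        tr.set_versionstamped_key(key, fdb.tuple.pack((db_name, event_type)))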

I’m not sure about the best path forward for handling that scenario. One could 
implement a rate-limiter that starts sloughing off additional messages for 
high-throughput databases (which has some tricky edge cases), split the 
messages for a single database across multiple partitions, rely on operators 
to blacklist certain databases from the _db_updates system, etc. Each option 
has downsides.

## Option 2: Atomic Ops + Consumer

In this approach we still have an intermediate subspace, and a consumer of that 
subspace which updates the _db_updates index. But this time we have at most one 
KV per database in the subspace, with an atomic counter as its value. When a 
document is updated, it bumps the counter for its database in that subspace. So 
we’ll have entries like

(“counters”, “db1235”) = 1
(“counters”, “db0001”) = 42
(“counters”, “telemetry-db”) = 12312

and so on. Like versionstamps, atomic operations are conflict-free, so we need 
not worry about introducing spurious conflicts on high-throughput databases.
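
The producer side is then a one-liner inside the document update transaction; 
a sketch, again with placeholder names:

    import struct

    COUNTERS = ('counters',)

    @fdb.transactional
    def bump_db_counter(tr, db_name):
        # ADD is applied server-side and never conflicts, no matter how
        # many writers hit the same database's counter concurrently
        tr.add(fdb.tuple.pack(COUNTERS + (db_name,)), struct.pack('<q', 1))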

The initial pass of the consumer logic would go something like this:

- Do a snapshot range read of the “counters” subspace (or whatever we call it)
- Record the current values for all counters in a separate summary KV (you’ll 
see why in a minute)
- Do a limit=1 reverse range read on the _changes subspace for each DB in the 
list to grab the latest Sequence
- Update the _db_updates index with the latest Sequence for each of these 
databases

On a second pass, the consumer would read the summary KV from the last pass and 
compare the previous counters with the current values. If any counters have not 
been updated in the interval, the consumer would try to clear those from the 
“counters” subspace (adding them as explicit conflict keys to ensure we don’t 
miss a concurrent update). It would then proceed with the rest of the logic 
from the initial pass. This is a careful balancing act:

- We don’t want to pollute the “counters” subspace with idle databases because 
each entry requires an extra read of _changes
- We don’t want to attempt to clear counters that are constantly updated 
because that’s going to fail with a conflict every time
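
Here’s a rough sketch of both passes in a single transactional function, 
reusing COUNTERS from the producer sketch. The summary subspace, the flattened 
_db_updates layout, and the _changes key structure are all assumptions I’ve 
made up for brevity:

    SUMMARY = ('counters_summary',)  # hypothetical
    DB_UPDATES = ('db_updates',)     # hypothetical, flattened index
    CHANGES = ('changes',)           # hypothetical _changes layout

    def latest_sequence(tr, db_name):
        # limit=1 reverse range read grabs the newest Sequence for this DB
        r = fdb.tuple.range(CHANGES + (db_name,))
        for kv in tr.get_range(r.start, r.stop, limit=1, reverse=True):
            return fdb.tuple.unpack(kv.key)[-1]
        return None

    @fdb.transactional
    def consumer_pass(tr):
        # Snapshot read so we take no read conflicts on the hot counters
        current = {}
        cr = fdb.tuple.range(COUNTERS)
        for kv in tr.snapshot.get_range(cr.start, cr.stop):
            _, db_name = fdb.tuple.unpack(kv.key)
            current[db_name] = struct.unpack('<q', kv.value)[0]

        # Summary KVs recorded by the previous pass
        previous = {}
        sr = fdb.tuple.range(SUMMARY)
        for kv in tr.get_range(sr.start, sr.stop):
            _, db_name = fdb.tuple.unpack(kv.key)
            previous[db_name] = struct.unpack('<q', kv.value)[0]

        for db_name, count in current.items():
            if previous.get(db_name) == count:
                # Idle since the last pass: retire the counter. The explicit
                # read conflict key aborts us if a writer bumps the counter
                # concurrently, so we cannot miss an update.
                counter_key = fdb.tuple.pack(COUNTERS + (db_name,))
                tr.add_read_conflict_key(counter_key)
                tr.clear(counter_key)
                tr.clear(fdb.tuple.pack(SUMMARY + (db_name,)))
            else:
                tr.set(fdb.tuple.pack(SUMMARY + (db_name,)),
                       struct.pack('<q', count))
                seq = latest_sequence(tr, db_name)
                tr.set(fdb.tuple.pack(DB_UPDATES + (db_name,)),
                       fdb.tuple.pack((seq,)))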

The scalability axis here is the number of databases updated within any short 
window of time (~1 second or less). If that number grows large we can have 
consumers responsible for ranges of the “counters” subspace, though I think 
that’s less likely to be needed than in the queue-based design.

I don’t know in detail what optimizations FoundationDB applies to atomic 
operations (e.g. coalescing them at a layer above the storage engine). That’s 
worth checking into, as otherwise I’d be concerned about introducing super-hot 
keys here.

This option does not handle the “created” and “deleted” lifecycle events for 
each database, but those are quite simple and could be inserted directly into 
the _db_updates index, as sketched below.
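
For instance, keeping the flattened DB_UPDATES layout from the sketch above, 
these could be written in the same transaction that creates or deletes the 
database:

    @fdb.transactional
    def record_lifecycle_event(tr, db_name, event_type):
        # event_type is 'created' or 'deleted'; these happen at most once
        # each per database, so they can skip the counter machinery
        tr.set(fdb.tuple.pack(DB_UPDATES + (db_name,)),
               fdb.tuple.pack((event_type,)))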

===

There are some additional details which can be fleshed out in an RFC, but this 
is the basic gist of things. Both designs would be more robust than the current 
architecture at capturing every single updated database (because the 
enqueue/increment operation would be part of the document update transaction). 
They would allow for a small delay between the document update and the 
appearance of the database in _db_updates, which is no different from what we 
have today. They each require a background process.

Let’s hear what you think, both about the interest level for this feature and 
about the designs themselves. I may take this one over to the FDB forums as 
well for feedback. Cheers,

Adam
