Hi all, as the project devs are working through the design for the _changes
feed in FoundationDB we’ve come across a limitation that is worth discussing
with the broader user community. FoundationDB currently imposes a 5 second
limit on all transactions, and read versions from old transactions are
inaccessible after that window. This means that, unlike a single CouchDB
storage shard, it is not possible to grab a long-lived snapshot of the entire
database.
In extant versions of CouchDB we rely on this long-lived snapshot behavior for
a number of operations, some of which are user-facing. For example, it is
possible to make a request to the _changes feed for a database of an arbitrary
size and, if you’ve got the storage space and time to spare, you can pull down
a snapshot of the entire database in a single request. That snapshot will
contain exactly one entry for each document in the database. In CouchDB 1.x the
documents appear in the order in which they were most recently updated. In
CouchDB 2.x there is no guaranteed ordering, although in practice the documents
are roughly ordered by most recent edit. Note that you really do have to
complete the operation in a single HTTP request; if you chunk up the requests
or have to retry because the connection was severed then the exactly-once
guarantees disappear.
We have a couple of different options for how we can implement _changes with
FoundationDB as a backing store. I’ll describe them below and discuss the
tradeoffs.
## Option A: Single Version Index, long-running operations as multiple transactions
In this option the internal index has exactly one entry for each document at
all times. A _changes request that cannot be satisfied within the 5 second
limit will be implemented as multiple FoundationDB transactions under the
covers. These transactions will have different read versions, and a document
that gets updated in between those read versions will show up *multiple times*
in the response body. The entire feed will be totally ordered, and later
occurrences of a particular document are guaranteed to represent more recent
edits than the earlier occurrences. In effect, it’s rather like the
semantics of a feed=continuous request today, but with much better ordering and
zero possibility of “rewinds” where large portions of the ID space get replayed
because of issues in the cluster.
This option is very efficient internally and does not require any background
maintenance. A future enhancement in FoundationDB’s storage engine is designed
to enable longer-running read-only transactions, so we will likely be able
to improve the semantics with this option over time.
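To make the duplicate-entry behavior of Option A concrete, here is a minimal in-memory sketch. It does not use the FoundationDB client at all: `SingleVersionIndex`, `read_range`, `changes`, and the `between_batches` hook are hypothetical names invented for illustration, and each `read_range` call simply stands in for one short transaction with its own read version.

```python
from typing import Callable, Dict, Iterator, List, Optional, Tuple


class SingleVersionIndex:
    """Toy stand-in for the Option A index: exactly one entry per document."""

    def __init__(self) -> None:
        self.seq = 0
        self.by_seq: Dict[int, str] = {}  # seq -> doc id
        self.by_id: Dict[str, int] = {}   # doc id -> current seq

    def update(self, doc_id: str) -> int:
        # Remove the previous entry so a document never has two index rows.
        old = self.by_id.pop(doc_id, None)
        if old is not None:
            del self.by_seq[old]
        self.seq += 1
        self.by_seq[self.seq] = doc_id
        self.by_id[doc_id] = self.seq
        return self.seq

    def read_range(self, after_seq: int, limit: int) -> List[Tuple[int, str]]:
        # Models one short transaction: a bounded, ordered read past after_seq.
        rows = sorted((s, d) for s, d in self.by_seq.items() if s > after_seq)
        return rows[:limit]


def changes(
    index: SingleVersionIndex,
    batch_size: int,
    between_batches: Optional[Callable[[], None]] = None,
) -> Iterator[Tuple[int, str]]:
    """Stream the feed as a series of short reads, resuming after the last seq."""
    last_seq = 0
    while True:
        rows = index.read_range(last_seq, batch_size)
        if not rows:
            return
        for seq, doc_id in rows:
            yield seq, doc_id
        last_seq = rows[-1][0]
        if between_batches is not None:
            between_batches()  # hook that simulates writes landing between reads


index = SingleVersionIndex()
for doc_id in ("foo", "bar", "baz"):
    index.update(doc_id)

pending = ["foo"]  # one write that lands between the first and second batch


def concurrent_writer() -> None:
    if pending:
        index.update(pending.pop())


feed = list(changes(index, batch_size=2, between_batches=concurrent_writer))
print(feed)  # [(1, 'foo'), (2, 'bar'), (3, 'baz'), (4, 'foo')]
```

In this toy run "foo" shows up twice because it was updated between the two reads, but the feed stays totally ordered and the later occurrence carries the higher sequence.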
## Option B: Multi-Version Index
In this design the internal index can contain multiple entries for a given
document. Each entry includes the sequence at which the document edit was made,
and may also include a sequence at which it was overwritten by a more recent
edit.
The implementation of a _changes request would start by getting the current
version of the datastore (call this the read version), and then as it examines
entries in the index it would skip over any entries where there’s a “tombstone”
sequence less than or equal to the read version. Crucially, if the request needs to be
implemented across multiple transactions, each transaction would use the same
read version when deciding whether to include entries in the index in the
_changes response. The readers would know to stop when and if they encounter an
entry where the created version is greater than the read version. Perhaps an
example helps to clarify. A simplified version of the internal index might look
like:
{"seq": 1, "id": "foo"}
{"seq": 2, "id": "bar", "tombstone": 5}
{"seq": 3, "id": "baz"}
{"seq": 4, "id": "bif", "tombstone": 6}
{"seq": 5, "id": "bar"}
{"seq": 6, "id": "bif"}
A _changes request which happens to commence when the database is at sequence 5
would return (ignoring the format of “seq” for simplicity):
{"seq": 1, "id": "foo"}
{"seq": 3, "id": "baz"}
{"seq": 4, "id": "bif"}
{"seq": 5, "id": "bar"}
i.e., the first instance of “bar” would be skipped over because a more recent
version exists within the time horizon, but the first instance of “bif” would
be included because “seq”: 6 is outside our horizon.
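The filtering rule in this example can be sketched in a few lines of Python. This is only an illustration over an in-memory list, not a FoundationDB implementation; the function name `visible_changes` is made up here, and the `<=` comparison on the tombstone is inferred from the example (the old "bar" entry with tombstone 5 is skipped at read version 5).

```python
from typing import Dict, Iterator, List

# The simplified index from the example, ordered by seq.
INDEX: List[Dict[str, object]] = [
    {"seq": 1, "id": "foo"},
    {"seq": 2, "id": "bar", "tombstone": 5},
    {"seq": 3, "id": "baz"},
    {"seq": 4, "id": "bif", "tombstone": 6},
    {"seq": 5, "id": "bar"},
    {"seq": 6, "id": "bif"},
]


def visible_changes(
    index: List[Dict[str, object]], read_version: int
) -> Iterator[Dict[str, object]]:
    """Yield the entries a reader pinned at read_version should see."""
    for entry in index:
        if entry["seq"] > read_version:
            break  # created after our snapshot: stop reading
        tombstone = entry.get("tombstone")
        if tombstone is not None and tombstone <= read_version:
            continue  # superseded within our horizon: skip
        yield {"seq": entry["seq"], "id": entry["id"]}


for row in visible_changes(INDEX, read_version=5):
    print(row)
```

Because the decision depends only on the fixed `read_version`, the scan could be split across any number of transactions and still produce the same four rows.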
The downside of this approach is that someone has to go in and clean up tombstoned
index entries eventually (or else provision lots and lots of storage space).
One way we could do this (inside CouchDB) would be to have each _changes
session record its read version somewhere, and then have a background process
go in and remove tombstoned entries where the tombstone is less than the
earliest read version of any active request. It’s doable, but definitely more
load on the server.
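A rough sketch of that cleanup pass, again over an in-memory list rather than a real datastore. The names `earliest_read_version` and `purge_tombstoned` are hypothetical, and falling back to the current database sequence when no requests are active is an assumption of this sketch, not something the design above specifies.

```python
from typing import Dict, Iterable, List

# Same simplified index as in the example above.
INDEX: List[Dict[str, object]] = [
    {"seq": 1, "id": "foo"},
    {"seq": 2, "id": "bar", "tombstone": 5},
    {"seq": 3, "id": "baz"},
    {"seq": 4, "id": "bif", "tombstone": 6},
    {"seq": 5, "id": "bar"},
    {"seq": 6, "id": "bif"},
]


def earliest_read_version(active: Iterable[int], current_seq: int) -> int:
    """Oldest read version recorded by any active _changes session.

    Assumption: with no active sessions, use the current sequence.
    """
    return min(active, default=current_seq)


def purge_tombstoned(
    index: List[Dict[str, object]], horizon: int
) -> List[Dict[str, object]]:
    """Drop entries whose tombstone is less than the earliest active read version."""
    return [
        entry
        for entry in index
        if entry.get("tombstone") is None or entry["tombstone"] >= horizon
    ]


# A reader pinned at 5 is still active, so nothing can be removed yet.
kept_with_old_reader = purge_tombstoned(INDEX, earliest_read_version([5, 6], 6))

# Once only a reader at 6 remains, the old "bar" entry (tombstone 5) can go.
kept_after_reader_done = purge_tombstoned(INDEX, earliest_read_version([6], 6))
print([e["seq"] for e in kept_after_reader_done])  # [1, 3, 4, 5, 6]
```

The extra cost mentioned above shows up here as the need to track every session's read version and to rescan the index periodically.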
Also, note that this approach does not guarantee that the older versions of the
documents referenced in those tombstoned entries are actually accessible. Much
like today, the changes feed would include a revisi