Hi Garren,

In general we wouldn’t know ahead of time whether we can complete in five 
seconds. I believe the way it works is that we start a transaction, issue a 
bunch of reads, and after 5 seconds any additional reads will start to fail 
with something like “read version too old”. That’s our cue to start a new 
transaction. All the reads that completed successfully are fine, and the 
CouchDB API layer can certainly choose to start streaming as soon as the first 
read completes (~2ms after the beginning of the transaction).
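
To make that concrete, here’s a rough sketch (made-up names, not the real CouchDB/FDB API) of the restart-and-resume loop. A read counter stands in for FDB’s 5-second transaction budget; in real FDB the failure is a transaction_too_old error on reads issued after the window closes.

```python
class ReadVersionTooOld(Exception):
    """Stands in for FDB's transaction_too_old error."""

def changes_feed(index, reads_per_txn):
    """Return all (seq, doc_id) pairs, restarting the 'transaction'
    whenever its read budget is exhausted."""
    streamed = []
    last_seq = 0
    while True:
        reads = 0                     # begin a fresh transaction
        try:
            for seq, doc_id in index:
                if seq <= last_seq:
                    continue          # already streamed by a prior txn
                if reads == reads_per_txn:
                    raise ReadVersionTooOld()
                streamed.append((seq, doc_id))
                last_seq = seq        # successful reads are kept
                reads += 1
            return streamed           # reached the end of the index
        except ReadVersionTooOld:
            continue                  # new txn, resume after last_seq
```

With an index of four entries and a budget of three reads per transaction, the caller sees all four entries streamed across two transactions, with no gaps and no duplicates (the index here is static; Option A below covers what happens when it isn’t).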

Agree with Bob that steering towards a larger number of short-lived operations 
is the way to go in general. But I also want to balance that with 
backwards-compatibility where it makes sense.

Adam

> On Mar 7, 2019, at 7:22 AM, Garren Smith <gar...@apache.org> wrote:
> 
> I agree that option A seems the most sensible. I just want to understand
> this comment:
> 
>>> A _changes request that cannot be satisfied within the 5 second limit
> will be implemented as multiple FoundationDB transactions under the covers
> 
> How will we know if a changes request cannot be completed in 5 seconds? Can
> we tell that beforehand? Or would we try to complete a changes request, have
> the transaction fail after 5 seconds, and then do multiple transactions to
> get the full changes? If that is the case, the response from CouchDB to the user
> will be really slow as they have already waited 5 seconds and have still
> not received anything. Or if we start streaming a result back to the user
> in the first transaction (Is this even possible?) then we would somehow
> need to know how to continue the changes feed after the transaction has
> failed.
> 
> Then Bob from your comment:
> 
>>> Forcing clients to do short (<5s) requests feels like a general good, as
> long as meaningful things can be done in that time-frame, which I strongly
> believe from what we've said elsewhere that they can.
> 
> That makes sense, but how would we do that? How do you help a user to make
> sure their request is under 5 seconds?
> 
> Cheers
> Garren
> 
> 
> 
> On Thu, Mar 7, 2019 at 11:15 AM Robert Newson <rnew...@apache.org> wrote:
> 
>> Hi,
>> 
>> Given that option A is the behaviour of feed=continuous today (barring the
>> initial whole-snapshot phase to catch up to "now") I think that's the right
>> move.  I confess to not reading your option B too deeply but I was there on
>> IRC when the first spark was lit. We can build some sort of temporary
>> multi-index on FDB today, that's clear, but it's equally clear that we
>> should avoid doing so if at all possible.
>> 
>> Perhaps the future Redwood storage engine for FDB will, as you say,
>> significantly improve on this, but, even if it does, I'm not 100% convinced
>> we should expose it. Forcing clients to do short (<5s) requests feels like
>> a general good, as long as meaningful things can be done in that
>> time-frame, which I strongly believe from what we've said elsewhere that
>> they can.
>> 
>> CouchDB's API, as we both know from rich (heh, and sometimes poor)
>> experience in production, has a lot of endpoints of wildly varying
>> performance characteristics. It's right that we evolve away from that where
>> possible, and this seems a great candidate given the replicator in ~all
>> versions of CouchDB will handle the change without blinking.
>> 
>> We have the same issue for _all_docs and _view and _find, in that the user
>> might ask for more data back than can be sent within a single FDB
>> transaction. I suggest that's a new thread, though.
>> 
>> --
>>  Robert Samuel Newson
>>  rnew...@apache.org
>> 
>> On Thu, 7 Mar 2019, at 01:24, Adam Kocoloski wrote:
>>> Hi all, as the project devs are working through the design for the
>>> _changes feed in FoundationDB we’ve come across a limitation that is
>>> worth discussing with the broader user community. FoundationDB
>>> currently imposes a 5 second limit on all transactions, and read
>>> versions from old transactions are inaccessible after that window. This
>>> means that, unlike a single CouchDB storage shard, it is not possible
>>> to grab a long-lived snapshot of the entire database.
>>> 
>>> In extant versions of CouchDB we rely on this long-lived snapshot
>>> behavior for a number of operations, some of which are user-facing. For
>>> example, it is possible to make a request to the _changes feed for a
>>> database of an arbitrary size and, if you’ve got the storage space and
>>> time to spare, you can pull down a snapshot of the entire database in a
>>> single request. That snapshot will contain exactly one entry for each
>>> document in the database. In CouchDB 1.x the documents appear in the
>>> order in which they were most recently updated. In CouchDB 2.x there is
>>> no guaranteed ordering, although in practice the documents are roughly
>>> ordered by most recent edit. Note that you really do have to complete
>>> the operation in a single HTTP request; if you chunk up the requests or
>>> have to retry because the connection was severed then the exactly-once
>>> guarantees disappear.
>>> 
>>> We have a couple of different options for how we can implement _changes
>>> with FoundationDB as a backing store, I’ll describe them below and
>>> discuss the tradeoffs.
>>> 
>>> ## Option A: Single Version Index, long-running operations as multiple
>>> transactions
>>> 
>>> In this option the internal index has exactly one entry for each
>>> document at all times. A _changes request that cannot be satisfied
>>> within the 5 second limit will be implemented as multiple FoundationDB
>>> transactions under the covers. These transactions will have different
>>> read versions, and a document that gets updated in between those read
>>> versions will show up *multiple times* in the response body. The entire
>>> feed will be totally ordered, and later occurrences of a particular
>>> document are guaranteed to represent more recent edits than the
>>> earlier occurrences. In effect, it’s rather like the semantics of a
>>> feed=continuous request today, but with much better ordering and zero
>>> possibility of “rewinds” where large portions of the ID space get
>>> replayed because of issues in the cluster.
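
To illustrate, here’s a toy sketch (made-up data, not a real API) of how a document edited between two read versions shows up twice under Option A, while the feed stays totally ordered:

```python
# Single-version index as seen by transaction 1 (read version v1):
txn1_view = [(1, "foo"), (2, "bar"), (3, "baz")]
# The client streams the first two entries, then the 5s limit hits.
streamed = txn1_view[:2]                      # [(1, "foo"), (2, "bar")]
last_seq = streamed[-1][0]

# Before transaction 2 begins, "bar" is edited again: its single index
# entry moves from seq 2 to seq 4.
txn2_view = [(1, "foo"), (3, "baz"), (4, "bar")]

# Transaction 2 resumes after last_seq; "bar" shows up a second time,
# and the later occurrence reflects the more recent edit.
streamed += [(s, d) for (s, d) in txn2_view if s > last_seq]
```

The resulting feed is (1, "foo"), (2, "bar"), (3, "baz"), (4, "bar"): one duplicate, strictly increasing sequence order, no rewinds.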
>>> 
>>> This option is very efficient internally and does not require any
>>> background maintenance. A future enhancement in FoundationDB’s storage
>>> engine is designed to enable longer-running read-only transactions, so
>>> we will likely be able to improve the semantics with this option
>>> over time.
>>> 
>>> ## Option B: Multi-Version Index
>>> 
>>> In this design the internal index can contain multiple entries for a
>>> given document. Each entry includes the sequence at which the document
>>> edit was made, and may also include a sequence at which it was
>>> overwritten by a more recent edit.
>>> 
>>> The implementation of a _changes request would start by getting the
>>> current version of the datastore (call this the read version), and then
>>> as it examines entries in the index it would skip over any entries
>>> where there’s a “tombstone” sequence at or below the read version.
>>> Crucially, if the request needs to be implemented across multiple
>>> transactions, each transaction would use the same read version when
>>> deciding whether to include entries in the index in the _changes
>>> response. The readers would know to stop when and if they encounter an
>>> entry where the created version is greater than the read version.
>>> Perhaps a diagram helps to clarify; a simplified version of the
>>> internal index might look like:
>>> 
>>> {"seq": 1, "id": "foo"}
>>> {"seq": 2, "id": "bar", "tombstone": 5}
>>> {"seq": 3, "id": "baz"}
>>> {"seq": 4, "id": "bif", "tombstone": 6}
>>> {"seq": 5, "id": "bar"}
>>> {"seq": 6, "id": "bif"}
>>> 
>>> A _changes request which happens to commence when the database is at
>>> sequence 5 would return (ignoring the format of “seq” for simplicity)
>>> 
>>> {"seq": 1, "id": "foo"}
>>> {"seq": 3, "id": "baz"}
>>> {"seq": 4, "id": "bif"}
>>> {"seq": 5, "id": "bar"}
>>> 
>>> i.e., the first instance of “bar” would be skipped over because a more
>>> recent version exists within the time horizon, but the first instance
>>> of “bif” would be included because “seq”: 6 is outside our horizon.
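
Against the sample index above, that read path can be sketched like this (illustrative Python structures only, not the on-disk layout):

```python
index = [
    {"seq": 1, "id": "foo"},
    {"seq": 2, "id": "bar", "tombstone": 5},
    {"seq": 3, "id": "baz"},
    {"seq": 4, "id": "bif", "tombstone": 6},
    {"seq": 5, "id": "bar"},
    {"seq": 6, "id": "bif"},
]

def changes(index, read_version):
    out = []
    for entry in index:
        if entry["seq"] > read_version:
            break                               # created after our snapshot
        if entry.get("tombstone", float("inf")) <= read_version:
            continue                            # superseded within horizon
        out.append({"seq": entry["seq"], "id": entry["id"]})
    return out
```

changes(index, 5) yields exactly the four entries shown above, and every transaction in a multi-transaction request reuses the same read_version, so the answer is stable across restarts.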
>>> 
>>> The downside of this approach is someone has to go in and clean up
>>> tombstoned index entries eventually (or else provision lots and lots of
>>> storage space). One way we could do this (inside CouchDB) would be to
>>> have each _changes session record its read version somewhere, and then
>>> have a background process go in and remove tombstoned entries where the
>>> tombstone is less than the earliest read version of any active request.
>>> It’s doable, but definitely more load on the server.
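
A rough sketch of that cleanup pass (names hypothetical; the real version would run as a background job against the index):

```python
def prune(index, active_read_versions):
    """Drop tombstoned entries whose tombstone is below the earliest
    read version still held by any active _changes session."""
    # With no active sessions, every tombstoned entry is reclaimable.
    horizon = min(active_read_versions, default=float("inf"))
    return [e for e in index
            if e.get("tombstone", float("inf")) >= horizon]
```

Entries without a tombstone are always kept; an entry tombstoned at sequence 5 survives only while some session still reads at version 5 or earlier.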
>>> 
>>> Also, note this approach is not guaranteeing that the older versions of
>>> the documents referenced in those tombstoned entries are actually
>>> accessible. Much like today, the changes feed would include a revision
>>> identifier which, upon closer inspection, has been superseded by a more
>>> recent version of the document. Unlike today, that older version would
>>> be expunged from the database immediately if a descendant revision
>>> exists.
>>> 
>>> —
>>> 
>>> OK, so those are the two basic options. I’d particularly like to hear
>>> if the behavior described in Option A would prove problematic for
>>> certain use cases, as it’s the simpler and more efficient of the two
>>> options. Thanks!
>>> 
>>> Adam
>> 
