Bah, our “cue”, not our “queue” ;) Adam
> On Mar 7, 2019, at 7:35 AM, Adam Kocoloski <[email protected]> wrote:
>
> Hi Garren,
>
> In general we wouldn’t know ahead of time whether we can complete in five
> seconds. I believe the way it works is that we start a transaction, issue a
> bunch of reads, and after 5 seconds any additional reads will start to fail
> with something like “read version too old”. That’s our queue to start a new
> transaction. All the reads that completed successfully are fine, and the
> CouchDB API layer can certainly choose to start streaming as soon as the
> first read completes (~2ms after the beginning of the transaction).
>
> Agree with Bob that steering towards a larger number of short-lived
> operations is the way to go in general. But I also want to balance that with
> backwards-compatibility where it makes sense.
>
> Adam
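A minimal sketch of the retry pattern Adam describes, assuming the FoundationDB
Python bindings: read until the transaction ages out, then resume from the last
sequence seen under a fresh transaction. The "changes" subspace and its
seq -> doc-id layout are hypothetical stand-ins here, not CouchDB's actual schema.

    # Read until the transaction ages out (error 1007, transaction_too_old),
    # then resume from the last sequence seen with a new read version.
    import fdb

    fdb.api_version(600)
    db = fdb.open()
    changes = fdb.Subspace(('changes',))  # hypothetical: maps seq -> doc id

    def stream_changes(db, since=0):
        """Yield (seq, doc_id) pairs across as many transactions as needed."""
        last_seq = since
        while True:
            tr = db.create_transaction()
            try:
                begin = changes.pack((last_seq + 1,))
                for kv in tr.get_range(begin, changes.range().stop):
                    (seq,) = changes.unpack(kv.key)
                    yield seq, kv.value   # start streaming immediately
                    last_seq = seq        # remember the resume point
                return                    # reached the end of the index
            except fdb.FDBError as e:
                if e.code != 1007:        # anything but transaction_too_old
                    raise
                # fall through: the loop opens a new transaction with a
                # newer read version and resumes just past last_seq

Everything streamed before the error stands; a document updated between the two
read versions simply shows up again at its new sequence, which is the Option A
behaviour discussed further down the thread.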
>> On Mar 7, 2019, at 7:22 AM, Garren Smith <[email protected]> wrote:
>>
>> I agree that option A seems the most sensible. I just want to understand
>> this comment:
>>
>>>> A _changes request that cannot be satisfied within the 5 second limit
>> will be implemented as multiple FoundationDB transactions under the covers
>>
>> How will we know if a change request cannot be completed in 5 seconds? Can
>> we tell that beforehand? Or would we try to complete a change request, have
>> the transaction fail after 5 seconds, and then do multiple transactions to
>> get the full changes? If that is the case the response from CouchDB to the
>> user will be really slow, as they have already waited 5 seconds and have
>> still not received anything. Or, if we start streaming a result back to the
>> user in the first transaction (is this even possible?), then we would
>> somehow need to know how to continue the changes feed after the transaction
>> has failed.
>>
>> Then Bob, from your comment:
>>
>>>> Forcing clients to do short (<5s) requests feels like a general good, as
>> long as meaningful things can be done in that time-frame, which I strongly
>> believe from what we've said elsewhere that they can.
>>
>> That makes sense, but how would we do that? How do you help a user to make
>> sure their request is under 5 seconds?
>>
>> Cheers
>> Garren
>>
>> On Thu, Mar 7, 2019 at 11:15 AM Robert Newson <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> Given that option A is the behaviour of feed=continuous today (barring the
>>> initial whole-snapshot phase to catch up to "now") I think that's the
>>> right move. I confess to not reading your option B too deeply but I was
>>> there on IRC when the first spark was lit. We can build some sort of
>>> temporary multi-index on FDB today, that's clear, but it's equally clear
>>> that we should avoid doing so if at all possible.
>>>
>>> Perhaps the future Redwood storage engine for FDB will, as you say,
>>> significantly improve on this, but, even if it does, I'm not 100%
>>> convinced we should expose it. Forcing clients to do short (<5s) requests
>>> feels like a general good, as long as meaningful things can be done in
>>> that time-frame, which I strongly believe from what we've said elsewhere
>>> that they can.
>>>
>>> CouchDB's API, as we both know from rich (heh, and sometimes poor)
>>> experience in production, has a lot of endpoints of wildly varying
>>> performance characteristics. It's right that we evolve away from that
>>> where possible, and this seems a great candidate given that the replicator
>>> in ~all versions of CouchDB will handle the change without blinking.
>>>
>>> We have the same issue for _all_docs and _view and _find, in that the user
>>> might ask for more data back than can be sent within a single FDB
>>> transaction. I suggest that's a new thread, though.
>>>
>>> --
>>> Robert Samuel Newson
>>> [email protected]
>>>
>>> On Thu, 7 Mar 2019, at 01:24, Adam Kocoloski wrote:
>>>> Hi all, as the project devs are working through the design for the
>>>> _changes feed in FoundationDB we’ve come across a limitation that is
>>>> worth discussing with the broader user community. FoundationDB
>>>> currently imposes a 5 second limit on all transactions, and read
>>>> versions from old transactions are inaccessible after that window. This
>>>> means that, unlike a single CouchDB storage shard, it is not possible
>>>> to grab a long-lived snapshot of the entire database.
>>>>
>>>> In extant versions of CouchDB we rely on this long-lived snapshot
>>>> behavior for a number of operations, some of which are user-facing. For
>>>> example, it is possible to make a request to the _changes feed for a
>>>> database of arbitrary size and, if you’ve got the storage space and
>>>> time to spare, you can pull down a snapshot of the entire database in a
>>>> single request. That snapshot will contain exactly one entry for each
>>>> document in the database. In CouchDB 1.x the documents appear in the
>>>> order in which they were most recently updated. In CouchDB 2.x there is
>>>> no guaranteed ordering, although in practice the documents are roughly
>>>> ordered by most recent edit. Note that you really do have to complete
>>>> the operation in a single HTTP request; if you chunk up the requests or
>>>> have to retry because the connection was severed then the exactly-once
>>>> guarantees disappear.
>>>>
>>>> We have a couple of different options for how we can implement _changes
>>>> with FoundationDB as a backing store. I’ll describe them below and
>>>> discuss the tradeoffs.
>>>>
>>>> ## Option A: Single Version Index, long-running operations as multiple
>>>> transactions
>>>>
>>>> In this option the internal index has exactly one entry for each
>>>> document at all times. A _changes request that cannot be satisfied
>>>> within the 5 second limit will be implemented as multiple FoundationDB
>>>> transactions under the covers. These transactions will have different
>>>> read versions, and a document that gets updated in between those read
>>>> versions will show up *multiple times* in the response body. The entire
>>>> feed will be totally ordered, and later occurrences of a particular
>>>> document are guaranteed to represent more recent edits than the
>>>> earlier occurrences. In effect, it’s rather like the semantics of a
>>>> feed=continuous request today, but with much better ordering and zero
>>>> possibility of “rewinds” where large portions of the ID space get
>>>> replayed because of issues in the cluster.
>>>>
>>>> This option is very efficient internally and does not require any
>>>> background maintenance. A future enhancement in FoundationDB’s storage
>>>> engine is designed to enable longer-running read-only transactions, so
>>>> we will likely be able to improve the semantics with this option
>>>> over time.
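To see what Option A implies for consumers, here is a small self-contained
simulation in plain Python (nothing FoundationDB-specific); the dict standing
in for the internal index is an illustrative assumption:

    # Option A: the index holds exactly one entry per document, keyed by
    # seq. An update moves the document's entry to a new, higher seq.
    index = {1: "foo", 2: "bar", 3: "baz"}   # seq -> doc id

    def read_batch(index, since):
        """One 'transaction': scan everything currently above `since`."""
        return sorted((seq, doc) for seq, doc in index.items() if seq > since)

    batch1 = read_batch(index, since=0)   # [(1, 'foo'), (2, 'bar'), (3, 'baz')]

    # "bar" is updated mid-feed: its single entry moves to a new sequence.
    del index[2]
    index[4] = "bar"

    batch2 = read_batch(index, since=3)   # [(4, 'bar')]

    # "bar" appears twice in the combined feed, but the later occurrence
    # is guaranteed to reflect the more recent edit, and nothing rewinds.
    print(batch1 + batch2)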
>>>> ## Option B: Multi-Version Index
>>>>
>>>> In this design the internal index can contain multiple entries for a
>>>> given document. Each entry includes the sequence at which the document
>>>> edit was made, and may also include a sequence at which it was
>>>> overwritten by a more recent edit.
>>>>
>>>> The implementation of a _changes request would start by getting the
>>>> current version of the datastore (call this the read version), and then
>>>> as it examines entries in the index it would skip over any entries
>>>> where there’s a “tombstone” sequence less than the read version.
>>>> Crucially, if the request needs to be implemented across multiple
>>>> transactions, each transaction would use the same read version when
>>>> deciding whether to include entries in the index in the _changes
>>>> response. The readers would know to stop when and if they encounter an
>>>> entry where the created version is greater than the read version.
>>>> Perhaps a diagram helps to clarify; a simplified version of the
>>>> internal index might look like
>>>>
>>>> {"seq": 1, "id": "foo"}
>>>> {"seq": 2, "id": "bar", "tombstone": 5}
>>>> {"seq": 3, "id": "baz"}
>>>> {"seq": 4, "id": "bif", "tombstone": 6}
>>>> {"seq": 5, "id": "bar"}
>>>> {"seq": 6, "id": "bif"}
>>>>
>>>> A _changes request which happens to commence when the database is at
>>>> sequence 5 would return (ignoring the format of “seq” for simplicity)
>>>>
>>>> {"seq": 1, "id": "foo"}
>>>> {"seq": 3, "id": "baz"}
>>>> {"seq": 4, "id": "bif"}
>>>> {"seq": 5, "id": "bar"}
>>>>
>>>> i.e., the first instance of “bar” would be skipped over because a more
>>>> recent version exists within the time horizon, but the first instance
>>>> of “bif” would be included because “seq”: 6 is outside our horizon.
>>>>
>>>> The downside of this approach is that someone has to go in and clean up
>>>> tombstoned index entries eventually (or else provision lots and lots of
>>>> storage space). One way we could do this (inside CouchDB) would be to
>>>> have each _changes session record its read version somewhere, and then
>>>> have a background process go in and remove tombstoned entries where the
>>>> tombstone is less than the earliest read version of any active request.
>>>> It’s doable, but definitely more load on the server.
>>>>
>>>> Also, note that this approach does not guarantee that the older versions
>>>> of the documents referenced in those tombstoned entries are actually
>>>> accessible. Much like today, the changes feed would include a revision
>>>> identifier which, upon closer inspection, has been superseded by a more
>>>> recent version of the document. Unlike today, that older version would
>>>> be expunged from the database immediately if a descendant revision
>>>> exists.
>>>>
>>>> —
>>>>
>>>> OK, so those are the two basic options. I’d particularly like to hear
>>>> whether the behavior described in Option A would prove problematic for
>>>> certain use cases, as it’s the simpler and more efficient of the two
>>>> options. Thanks!
>>>>
>>>> Adam
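To make the tombstone filtering in Option B concrete, a minimal self-contained
sketch in plain Python that reproduces the four-row example result above. Note
that the worked example treats a tombstone equal to the read version as
superseded, hence the <= test below:

    # Option B read path: pin a read version, skip entries whose tombstone
    # falls at or before it, and stop at entries created after it.
    index = [
        {"seq": 1, "id": "foo"},
        {"seq": 2, "id": "bar", "tombstone": 5},
        {"seq": 3, "id": "baz"},
        {"seq": 4, "id": "bif", "tombstone": 6},
        {"seq": 5, "id": "bar"},
        {"seq": 6, "id": "bif"},
    ]

    def changes(index, read_version):
        for entry in index:                  # index is ordered by seq
            if entry["seq"] > read_version:
                break                        # created after our snapshot
            if entry.get("tombstone", float("inf")) <= read_version:
                continue                     # superseded within the horizon
            yield {"seq": entry["seq"], "id": entry["id"]}

    print(list(changes(index, read_version=5)))
    # [{'seq': 1, 'id': 'foo'}, {'seq': 3, 'id': 'baz'},
    #  {'seq': 4, 'id': 'bif'}, {'seq': 5, 'id': 'bar'}]

Because every transaction pins the same read_version, a request resumed across
transactions keeps filtering against the same horizon, which is what buys the
snapshot-like semantics at the cost of the cleanup work described above.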
