Hi Garren,

In general we wouldn’t know ahead of time whether we can complete in five seconds. I believe the way it works is that we start a transaction, issue a bunch of reads, and after 5 seconds any additional reads will start to fail with something like “read version too old”. That’s our cue to start a new transaction. All the reads that completed successfully are fine, and the CouchDB API layer can certainly choose to start streaming as soon as the first read completes (~2ms after the beginning of the transaction).
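To make that concrete, here is a minimal sketch of that retry-and-resume pattern using the public FoundationDB Python bindings. The ('changes', 'dbname') key layout and the emit callback are hypothetical stand-ins rather than actual CouchDB code; the load-bearing detail is error 1007 (transaction_too_old), which FoundationDB raises once a transaction's read version expires:

    import fdb

    fdb.api_version(600)
    db = fdb.open()

    # Hypothetical key layout; CouchDB's real encoding will differ.
    changes = fdb.Subspace(('changes', 'dbname'))

    def stream_changes(db, emit):
        r = changes.range()
        begin, end = r.start, r.stop
        while True:
            tr = db.create_transaction()
            try:
                for kv in tr.get_range(begin, end):
                    emit(kv.key, kv.value)  # rows already streamed stay streamed
                    # Track our position so a retry resumes rather than restarts.
                    begin = fdb.KeySelector.first_greater_than(kv.key)
                return  # reached the end of the index
            except fdb.FDBError as e:
                if e.code != 1007:  # 1007 = transaction_too_old
                    raise
                # The read version expired (~5s in); loop around and resume
                # from `begin` under a fresh transaction with a newer read version.

Note that any document updated between two of those transactions can appear twice in the stream, which is exactly the Option A semantics described further down the thread.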
Agree with Bob that steering towards a larger number of short-lived operations is the way to go in general. But I also want to balance that with backwards compatibility where it makes sense.

Adam

> On Mar 7, 2019, at 7:22 AM, Garren Smith <gar...@apache.org> wrote:
> 
> I agree that option A seems the most sensible. I just want to understand
> this comment:
> 
>>> A _changes request that cannot be satisfied within the 5 second limit
> will be implemented as multiple FoundationDB transactions under the covers
> 
> How will we know if a changes request cannot be completed in 5 seconds? Can
> we tell that beforehand? Or would we try to complete a changes request, have
> the transaction fail after 5 seconds, and then do multiple transactions to
> get the full changes? If that is the case, the response from CouchDB to the
> user will be really slow, as they have already waited 5 seconds and have
> still not received anything. Or, if we start streaming a result back to the
> user in the first transaction (is this even possible?), then we would
> somehow need to know how to continue the changes feed after the transaction
> has failed.
> 
> Then Bob, from your comment:
> 
>>> Forcing clients to do short (<5s) requests feels like a general good, as
> long as meaningful things can be done in that time-frame, which I strongly
> believe from what we've said elsewhere that they can.
> 
> That makes sense, but how would we do that? How do you help a user make
> sure their request is under 5 seconds?
> 
> Cheers
> Garren
> 
> On Thu, Mar 7, 2019 at 11:15 AM Robert Newson <rnew...@apache.org> wrote:
> 
>> Hi,
>> 
>> Given that option A is the behaviour of feed=continuous today (barring the
>> initial whole-snapshot phase to catch up to "now") I think that's the right
>> move. I confess to not reading your option B too deeply, but I was there on
>> IRC when the first spark was lit. We can build some sort of temporary
>> multi-index on FDB today, that's clear, but it's equally clear that we
>> should avoid doing so if at all possible.
>> 
>> Perhaps the future Redwood storage engine for FDB will, as you say,
>> significantly improve on this, but, even if it does, I'm not 100% convinced
>> we should expose it. Forcing clients to do short (<5s) requests feels like
>> a general good, as long as meaningful things can be done in that
>> time-frame, which I strongly believe from what we've said elsewhere that
>> they can.
>> 
>> CouchDB's API, as we both know from rich (heh, and sometimes poor)
>> experience in production, has a lot of endpoints of wildly varying
>> performance characteristics. It's right that we evolve away from that where
>> possible, and this seems a great candidate given that the replicator in
>> ~all versions of CouchDB will handle the change without blinking.
>> 
>> We have the same issue for _all_docs and _view and _find, in that the user
>> might ask for more data back than can be sent within a single FDB
>> transaction. I suggest that's a new thread, though.
>> 
>> --
>> Robert Samuel Newson
>> rnew...@apache.org
>> 
>> On Thu, 7 Mar 2019, at 01:24, Adam Kocoloski wrote:
>>> Hi all, as the project devs are working through the design for the
>>> _changes feed in FoundationDB we’ve come across a limitation that is
>>> worth discussing with the broader user community. FoundationDB
>>> currently imposes a 5 second limit on all transactions, and read
>>> versions from old transactions are inaccessible after that window.
>>> This means that, unlike a single CouchDB storage shard, it is not
>>> possible to grab a long-lived snapshot of the entire database.
>>> 
>>> In extant versions of CouchDB we rely on this long-lived snapshot
>>> behavior for a number of operations, some of which are user-facing. For
>>> example, it is possible to make a request to the _changes feed for a
>>> database of an arbitrary size and, if you’ve got the storage space and
>>> time to spare, you can pull down a snapshot of the entire database in a
>>> single request. That snapshot will contain exactly one entry for each
>>> document in the database. In CouchDB 1.x the documents appear in the
>>> order in which they were most recently updated. In CouchDB 2.x there is
>>> no guaranteed ordering, although in practice the documents are roughly
>>> ordered by most recent edit. Note that you really do have to complete
>>> the operation in a single HTTP request; if you chunk up the requests or
>>> have to retry because the connection was severed, then the exactly-once
>>> guarantees disappear.
>>> 
>>> We have a couple of different options for how we can implement _changes
>>> with FoundationDB as a backing store. I’ll describe them below and
>>> discuss the tradeoffs.
>>> 
>>> ## Option A: Single Version Index, long-running operations as multiple transactions
>>> 
>>> In this option the internal index has exactly one entry for each
>>> document at all times. A _changes request that cannot be satisfied
>>> within the 5 second limit will be implemented as multiple FoundationDB
>>> transactions under the covers. These transactions will have different
>>> read versions, and a document that gets updated in between those read
>>> versions will show up *multiple times* in the response body. The entire
>>> feed will be totally ordered, and later occurrences of a particular
>>> document are guaranteed to represent more recent edits than the earlier
>>> occurrences. In effect, it’s rather like the semantics of a
>>> feed=continuous request today, but with much better ordering and zero
>>> possibility of “rewinds” where large portions of the ID space get
>>> replayed because of issues in the cluster.
>>> 
>>> This option is very efficient internally and does not require any
>>> background maintenance. A future enhancement in FoundationDB’s storage
>>> engine is designed to enable longer-running read-only transactions, so
>>> we will likely be able to improve the semantics with this option over
>>> time.
>>> 
>>> ## Option B: Multi-Version Index
>>> 
>>> In this design the internal index can contain multiple entries for a
>>> given document. Each entry includes the sequence at which the document
>>> edit was made, and may also include a sequence at which it was
>>> overwritten by a more recent edit.
>>> 
>>> The implementation of a _changes request would start by getting the
>>> current version of the datastore (call this the read version), and
>>> then, as it examines entries in the index, it would skip over any entry
>>> with a “tombstone” sequence at or below the read version. Crucially, if
>>> the request needs to be implemented across multiple transactions, each
>>> transaction would use the same read version when deciding whether to
>>> include entries in the index in the _changes response. The readers
>>> would know to stop when and if they encounter an entry whose created
>>> version is greater than the read version.
>>> Perhaps a diagram helps to clarify. A simplified version of the
>>> internal index might look like:
>>> 
>>> {"seq": 1, "id": "foo"}
>>> {"seq": 2, "id": "bar", "tombstone": 5}
>>> {"seq": 3, "id": "baz"}
>>> {"seq": 4, "id": "bif", "tombstone": 6}
>>> {"seq": 5, "id": "bar"}
>>> {"seq": 6, "id": "bif"}
>>> 
>>> A _changes request which happens to commence when the database is at
>>> sequence 5 would return (ignoring the format of "seq" for simplicity):
>>> 
>>> {"seq": 1, "id": "foo"}
>>> {"seq": 3, "id": "baz"}
>>> {"seq": 4, "id": "bif"}
>>> {"seq": 5, "id": "bar"}
>>> 
>>> i.e., the first instance of "bar" would be skipped over because a more
>>> recent version exists within the time horizon, but the first instance
>>> of "bif" would be included because "seq": 6 is outside our horizon.
>>> 
>>> The downside of this approach is that someone has to go in and clean up
>>> tombstoned index entries eventually (or else provision lots and lots of
>>> storage space). One way we could do this (inside CouchDB) would be to
>>> have each _changes session record its read version somewhere, and then
>>> have a background process go in and remove tombstoned entries where the
>>> tombstone is less than the earliest read version of any active request.
>>> It’s doable, but definitely more load on the server.
>>> 
>>> Also, note that this approach does not guarantee that the older
>>> versions of the documents referenced in those tombstoned entries are
>>> actually accessible. Much like today, the changes feed would include a
>>> revision identifier which, upon closer inspection, has been superseded
>>> by a more recent version of the document. Unlike today, that older
>>> version would be expunged from the database immediately if a descendant
>>> revision exists.
>>> 
>>> —
>>> 
>>> OK, so those are the two basic options. I’d particularly like to hear
>>> if the behavior described in Option A would prove problematic for
>>> certain use cases, as it’s the simpler and more efficient of the two
>>> options. Thanks!
>>> 
>>> Adam
>> 
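For readers following along, the skip/stop rule described under Option B fits in a few lines. This is an illustrative sketch over the toy index from the example above, not project code; note that reproducing the example output requires skipping entries whose tombstone is at or below the read version:

    def filter_changes(index_entries, read_seq):
        # index_entries: internal index entries, sorted by "seq".
        for entry in index_entries:
            if entry["seq"] > read_seq:
                break  # created after our read version: stop reading
            tombstone = entry.get("tombstone")
            if tombstone is not None and tombstone <= read_seq:
                continue  # superseded within the horizon: skip
            yield entry

    index = [
        {"seq": 1, "id": "foo"},
        {"seq": 2, "id": "bar", "tombstone": 5},
        {"seq": 3, "id": "baz"},
        {"seq": 4, "id": "bif", "tombstone": 6},
        {"seq": 5, "id": "bar"},
        {"seq": 6, "id": "bif"},
    ]

    assert [e["seq"] for e in filter_changes(index, 5)] == [1, 3, 4, 5]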