+1 to both changes, will echo that in the PR. -- Robert Samuel Newson rnew...@apache.org
On Wed, 6 Mar 2019, at 00:04, Adam Kocoloski wrote:
> Dredging this thread back up with an eye towards moving to an RFC …
>
> I was reading through the FoundationDB Record Layer preprint[1] a few
> weeks ago and noticed an enhancement to their version of _changes that
> I know would be beneficial to IBM and that I think is worth considering
> for inclusion in CouchDB directly. Quoting the paper:
>
> > To implement a sync index, CloudKit leverages the total order on
> > FoundationDB’s commit versions by using a VERSION index, mapping
> > versions to record identifiers. To perform a sync, CloudKit simply
> > scans the VERSION index.
> >
> > However, commit versions assigned by different FoundationDB clusters
> > are uncorrelated. This introduces a challenge when migrating data from
> > one cluster to another; CloudKit periodically moves users to improve
> > load balance and locality. The sync index must represent the order of
> > updates across all clusters, so updates committed after the move must
> > be sorted after updates committed before the move. CloudKit addresses
> > this with an application-level per-user count of the number of moves,
> > called the incarnation. Initially, the incarnation is 1, and CloudKit
> > increments it each time the user’s data is moved to a different
> > cluster. On every record update, we write the user’s current
> > incarnation to the record’s header; these values are not modified
> > during a move. The VERSION sync index maps (incarnation, version)
> > pairs to changed records, sorting the changes first by incarnation,
> > then by version.
>
> One of our goals in adopting FoundationDB is to eliminate rewinds of
> the _changes feed; we make significant progress towards that goal
> simply by adopting FoundationDB versionstamps as sequence identifiers,
> but in cases where user data might be migrated from one FoundationDB
> cluster to another we can lose this total ordering and rewind (or
> worse, possibly skip updates).
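[Editor's note: the ordering property described in the quoted paper can be illustrated with a small pure-Python sketch. No FoundationDB is involved; versionstamps are modeled as plain integers, which is an assumption for illustration only.]

```python
# Sketch: why sorting by versionstamp alone rewinds after a cluster move,
# and how an (incarnation, versionstamp) prefix restores total order.
# Versionstamps are modeled as plain integers purely for illustration.

# Updates committed on cluster A, then the database moves to cluster B,
# whose commit versions happen to be lower (versions are uncorrelated).
updates_cluster_a = [(1, 5000, "doc1"), (1, 5001, "doc2")]
updates_cluster_b = [(2, 100, "doc3")]  # incarnation bumped to 2 after the move

all_updates = updates_cluster_a + updates_cluster_b

# Sorting by versionstamp alone would rewind: doc3 sorts before doc1/doc2.
by_version = sorted(all_updates, key=lambda u: u[1])
assert by_version[0][2] == "doc3"  # wrong order for a changes feed

# Sorting by (incarnation, versionstamp) preserves the true update order.
by_incarnation = sorted(all_updates, key=lambda u: (u[0], u[1]))
assert [u[2] for u in by_incarnation] == ["doc1", "doc2", "doc3"]
```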
> The “incarnation” trick of prefixing the versionstamp with an integer
> which gets bumped whenever a user is moved is a good way to mitigate
> that. I’ll give some thought to how the per-database incarnation can be
> recorded and what facility we might have for intelligently bumping it
> automatically, but I wanted to bring this to folks’ attention and
> resurrect this ML thread.
>
> Another thought I had this evening is to record the number of edit
> branches for a given document in the value of the index. The reason I’d
> do this is to optimize the popular `style=all_docs` queries to _changes
> to avoid an extra range read in the very common case where a document
> has only a single edit branch.
>
> With the incarnation and branch count in place we’d be looking at a
> design where the KV pairs have the structure
>
> (“changes”, Incarnation, Versionstamp) = (ValFormat, DocID, RevFormat,
> RevPosition, RevHash, BranchCount)
>
> where ValFormat is an enumeration enabling schema evolution of the
> value format in the future, and RevFormat, RevPosition, and RevHash are
> associated with the winning edit branch for the document (not
> necessarily the edit that occurred at this version, matching current
> CouchDB behavior) and carry the meanings defined in the revision
> storage RFC[2].
>
> A regular _changes feed request can respond simply by scanning this
> index. A style=all_docs request can also be a simple scan if
> BranchCount is 1; if it’s greater than 1 we would need to do an
> additional range read of the “revisions” subspace to retrieve the leaf
> revision identifiers for the document in question. An include_docs=true
> request would need to do an additional range read in the document
> storage subspace for this revision.
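[Editor's note: a minimal pure-Python sketch of the proposed "changes" subspace layout and the style=all_docs fast path. Plain Python tuples stand in for the FDB tuple layer, and the value constants and helper name are illustrative, not part of any real CouchDB API.]

```python
# Model of the proposed index: ("changes", Incarnation, Versionstamp) ->
# (ValFormat, DocID, RevFormat, RevPosition, RevHash, BranchCount).
# VAL_FORMAT_V1 and changes_feed() are illustrative names only.
VAL_FORMAT_V1 = 0

changes = {
    ("changes", 1, 1001): (VAL_FORMAT_V1, "doc-a", 0, 3, "abc123", 1),
    ("changes", 1, 1002): (VAL_FORMAT_V1, "doc-b", 0, 7, "def456", 2),
}

def changes_feed(changes, style_all_docs=False):
    """Scan the changes index in (Incarnation, Versionstamp) order."""
    rows = []
    for key in sorted(changes):
        _fmt, doc_id, _rev_fmt, rev_pos, rev_hash, branch_count = changes[key]
        winner = "%d-%s" % (rev_pos, rev_hash)
        if style_all_docs and branch_count > 1:
            # Only here would we need the extra range read of the
            # "revisions" subspace to fetch all leaf revisions.
            rows.append((doc_id, winner, "needs_revisions_read"))
        else:
            # Single edit branch: the index row alone is sufficient.
            rows.append((doc_id, winner, None))
    return rows

rows = changes_feed(changes, style_all_docs=True)
assert rows[0] == ("doc-a", "3-abc123", None)
assert rows[1][2] == "needs_revisions_read"
```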
> I think both the incarnation and the branch count warrant a small
> update to the revision metadata RFC …
>
> Adam
>
> [1]: https://www.foundationdb.org/files/record-layer-paper.pdf
> [2]: https://github.com/apache/couchdb-documentation/pull/397
>
> > On Feb 5, 2019, at 12:20 PM, Mike Rhodes <couc...@dx13.co.uk> wrote:
> >
> > Solution (2) appeals to me for its conceptual simplicity -- and
> > having a stateless CouchDB layer I feel is super important in
> > simplifying overall CouchDB deployment going forward.
> >
> > --
> > Mike.
> >
> > On Mon, 4 Feb 2019, at 20:11, Adam Kocoloski wrote:
> >> Probably good to take a quick step back and note that FoundationDB’s
> >> versionstamps are an elegant and scalable solution to atomically
> >> maintaining the index of documents in the order in which they were
> >> most recently updated. I think that’s what you mean by the first
> >> part of the problem, but I want to make sure that on the ML here we
> >> collectively understand that FoundationDB actually nails this hard
> >> part of the problem *really* well.
> >>
> >> When you say “notify CouchDB about new updates”, are you referring
> >> to the feed=longpoll or feed=continuous options to the _changes API?
> >> I guess I see three different routes that can be taken here.
> >>
> >> One route is to use the same kind of machinery that we have in place
> >> today in CouchDB 2.x. As a reminder, the way this works is:
> >>
> >> - a client waiting for changes on a DB spawns one local process and
> >> also a rexi RPC process on each node hosting one of the DB shards of
> >> interest (see fabric_db_update_listener).
> >> - those RPC processes register as local couch_event listeners, where
> >> they receive {db_updated, ShardName} messages forwarded to them from
> >> the couch_db_updater processes.
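[Editor's note: the 2.x notification machinery described above can be modeled as a minimal pub/sub sketch. The real implementation uses Erlang processes, rexi RPC, and couch_event; the `EventBus` class and its method names below are illustrative only.]

```python
from collections import defaultdict

# Minimal model of couch_event-style fan-out: writers publish
# (db_updated, ShardName) events; registered listeners receive them.
class EventBus:
    def __init__(self):
        self.listeners = defaultdict(list)  # shard name -> callbacks

    def register(self, shard, callback):
        self.listeners[shard].append(callback)

    def publish(self, shard):
        for cb in self.listeners[shard]:
            cb(("db_updated", shard))

bus = EventBus()
received = []
# A _changes client registers interest in each shard of the database.
for shard in ("shards/00000000-7fffffff/db", "shards/80000000-ffffffff/db"):
    bus.register(shard, received.append)

# A writer commits an update and fires the event for its shard.
bus.publish("shards/00000000-7fffffff/db")
assert received == [("db_updated", "shards/00000000-7fffffff/db")]
```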
> >> Of course, in the FoundationDB design we don’t need to serialize
> >> updates in couch_db_updater processes, but individual writers could
> >> just as easily fire off those db_updated messages. This design is
> >> already heavily optimized for large numbers of listeners on large
> >> numbers of databases. The downside that I can see is it means the
> >> *CouchDB layer nodes would need to form a distributed Erlang
> >> cluster* in order for them to learn about the changes being
> >> committed from other nodes in the cluster.
> >>
> >> So let’s say we *didn’t* want to do that, and we rather are trying
> >> to design for completely independent layer nodes that have no
> >> knowledge of or communication with one another save through
> >> FoundationDB. There’s definitely something to be said for that
> >> constraint. One very simple approach might be to just poll
> >> FoundationDB. If the database is under a heavy write load there’s no
> >> overhead to this approach; every time we finish sending the output
> >> of one range query against the versionstamp space and we re-acquire
> >> a new read version there will be new updates to consume. Where it
> >> gets inefficient is if we have a lot of listeners on the feed and a
> >> very low-throughput database. But one can fiddle with polling
> >> intervals, or else have a layer of indirection so only one process
> >> on each layer node is doing the polling and then sending events to
> >> couch_event. I think this could scale quite far.
> >>
> >> The other option (which I think is the one you’re homing in on) is
> >> to leverage FoundationDB’s watches to get a push notification about
> >> updates to a particular key. I would be cautious about creating a
> >> specific key or set of keys specifically for this purpose, but if we
> >> find that there’s some other bit of metadata that we are needing to
> >> maintain anyway then this could work nicely.
> >> I think the same indirection that I described above (where each
> >> layer node has a maximum of one watch per database, and it
> >> re-broadcasts messages to all interested clients via couch_event)
> >> would help us not be too constrained by the limit on watches.
> >>
> >> So to recap, the three approaches:
> >>
> >> 1. Writers publish db_updated events to couch_event; listeners use
> >> distributed Erlang to subscribe to all nodes
> >> 2. Poll the _changes subspace; scale by nominating a specific
> >> process per node to do the polling
> >> 3. Same as #2, but using a watch on DB metadata that changes with
> >> every update instead of polling
> >>
> >> Adam
> >>
> >>> On Feb 4, 2019, at 2:18 PM, Ilya Khlopotov <iil...@apache.org> wrote:
> >>>
> >>> Hi,
> >>>
> >>> One of the features of CouchDB which doesn't map cleanly onto
> >>> FoundationDB is the changes feed. The essence of the feature is:
> >>> - Subscriber of the feed wants to receive notifications when the
> >>> database is updated.
> >>> - The notification includes the update_seq for the database and the
> >>> list of changes which happened at that time.
> >>> - The change itself includes docid and rev.
> >>>
> >>> There are multiple ways to easily solve this problem. Designing a
> >>> scalable way to do it is way harder.
> >>>
> >>> There are at least two parts to this problem:
> >>> - how to structure secondary indexes so we can provide what we need
> >>> in the notification event
> >>> - how to notify CouchDB about new updates
> >>>
> >>> For the second part of the problem we could set up a watch on one
> >>> of the keys we have to update on every transaction, for example the
> >>> key which tracks the database_size or the key which tracks the
> >>> number of documents, or we could add our own key. The problem is
> >>> that at some point we would hit a capacity limit for atomic updates
> >>> of a single key (FoundationDB doesn't redistribute the load among
> >>> servers on a per-key basis).
> >>> In such a case we would have to distribute the counter among
> >>> multiple keys to allow FoundationDB to split the hot range.
> >>> Therefore, we would have to set up multiple watches. FoundationDB
> >>> has a limit on the number of watches a client can set up (100000
> >>> watches), so we need to keep this number in mind when designing the
> >>> feature.
> >>>
> >>> The single-key update rate problem is very theoretical and we might
> >>> ignore it for the PoC version. Then we can measure the impact and
> >>> change the design accordingly. The reason I decided to bring it up
> >>> is to see whether maybe someone has a simple solution to avoid the
> >>> bottleneck.
> >>>
> >>> best regards,
> >>> iilyak
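[Editor's note: the sharded-counter idea in the last message can be sketched in pure Python. The key names and shard count are assumptions; a real implementation would use FoundationDB atomic ADD mutations and one watch per shard key.]

```python
import hashlib

# Sketch: spread a per-database "update counter" across N keys so that
# FoundationDB can split the hot range. Key layout and shard count are
# illustrative only; a real version would use tr.add() atomic mutations.
N_SHARDS = 8
counters = {("db", "update_counter", i): 0 for i in range(N_SHARDS)}

def bump(counters, txn_id):
    """Each writer increments one shard key, chosen pseudo-randomly."""
    shard = int(hashlib.md5(txn_id.encode()).hexdigest(), 16) % N_SHARDS
    counters[("db", "update_counter", shard)] += 1

def total(counters):
    """A reader (or watch fan-in) sums all the shard keys."""
    return sum(counters.values())

for i in range(100):
    bump(counters, "txn-%d" % i)

assert total(counters) == 100
# The write load is spread: no single key absorbed all 100 increments.
assert max(counters.values()) < 100
```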