+1 to both changes, will echo that in the PR.

-- 
  Robert Samuel Newson
  rnew...@apache.org

On Wed, 6 Mar 2019, at 00:04, Adam Kocoloski wrote:
> Dredging this thread back up with an eye towards moving to an RFC …
> 
> I was reading through the FoundationDB Record Layer preprint[1] a few 
> weeks ago and noticed an enhancement to their version of _changes that 
> I know would be beneficial to IBM and that I think is worth considering 
> for inclusion in CouchDB directly. Quoting the paper:
> 
> > To implement a sync index, CloudKit leverages the total order on 
> > FoundationDB’s commit versions by using a VERSION index, mapping versions 
> > to record identifiers. To perform a sync, CloudKit simply scans the VERSION 
> > index.
> > 
> > However, commit versions assigned by different FoundationDB clusters are 
> > uncorrelated. This introduces a challenge when migrating data from one 
> > cluster to another; CloudKit periodically moves users to improve load 
> > balance and locality. The sync index must represent the order of updates 
> > across all clusters, so updates committed after the move must be sorted 
> > after updates committed before the move. CloudKit addresses this with an 
> > application-level per-user count of the number of moves, called the 
> > incarnation. Initially, the incarnation is 1, and CloudKit increments it 
> > each time the user’s data is moved to a different cluster. On every record 
> > update, we write the user’s current incarnation to the record’s header; 
> > these values are not modified during a move. The VERSION sync index maps 
> > (incarnation, version) pairs to changed records, sorting the changes first 
> > by incarnation, then by version.
> 
> One of our goals in adopting FoundationDB is to eliminate rewinds of 
> the _changes feed; we make significant progress towards that goal 
> simply by adopting FoundationDB versionstamps as sequence identifiers, 
> but in cases where user data might be migrated from one FoundationDB 
> cluster to another we can lose this total ordering and rewind (or 
> worse, possibly skip updates). The “incarnation” trick of prefixing the 
> versionstamp with an integer which gets bumped whenever a user is moved 
> is a good way to mitigate that. I’ll give some thought to how the 
> per-database incarnation can be recorded and what facility we might 
> have for intelligently bumping it automatically, but I wanted to bring 
> this to folks’ attention and resurrect this ML thread.
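> 
> To make the idea concrete, here is a minimal sketch with the Python 
> FoundationDB bindings; the "meta"/"incarnation" key layout is purely 
> hypothetical, not something from the RFCs:
> 
>     import fdb
>     fdb.api_version(630)
>     db = fdb.open()
> 
>     @fdb.transactional
>     def get_incarnation(tr, dbname):
>         # Hypothetical per-database incarnation key; defaults to 1 if unset.
>         val = tr[fdb.tuple.pack((b"meta", dbname, b"incarnation"))]
>         return fdb.tuple.unpack(val)[0] if val.present() else 1
> 
>     @fdb.transactional
>     def bump_incarnation(tr, dbname):
>         # Bumped once whenever the database is migrated to another cluster;
>         # changes written afterwards sort after everything written before.
>         inc = get_incarnation(tr, dbname) + 1
>         tr[fdb.tuple.pack((b"meta", dbname, b"incarnation"))] = fdb.tuple.pack((inc,))
>         return inc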
> 
> Another thought I had this evening is to record the number of edit 
> branches for a given document in the value of the index. The reason I’d 
> do this is to optimize the popular `style=all_docs` queries to _changes 
> to avoid an extra range read in the very common case where a document 
> has only a single edit branch.
> 
> With the incarnation and branch count in place we’d be looking at a 
> design where the KV pairs have the structure
> 
> (“changes”, Incarnation, Versionstamp) = (ValFormat, DocID, RevFormat, 
> RevPosition, RevHash, BranchCount)
> 
> where ValFormat is an enumeration enabling schema evolution of the 
> value format in the future, and RevFormat, RevPosition, RevHash are 
> associated with the winning edit branch for the document (not 
> necessarily the edit that occurred at this version, matching current 
> CouchDB behavior) and carry the meanings defined in the revision 
> storage RFC[2].
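> 
> A minimal sketch of writing one such entry with the Python FoundationDB 
> bindings; the exact tuple layout, and the use of dbname as a stand-in for 
> the database’s subspace prefix, are illustrative only:
> 
>     import fdb
>     fdb.api_version(630)
>     db = fdb.open()
> 
>     VAL_FORMAT = 0  # enumeration value reserved for future schema evolution
> 
>     @fdb.transactional
>     def write_change(tr, dbname, incarnation, docid,
>                      rev_format, rev_pos, rev_hash, branch_count):
>         # Key: ("changes", Incarnation, <versionstamp filled in at commit>)
>         key = fdb.tuple.pack_with_versionstamp(
>             (b"changes", dbname, incarnation, fdb.tuple.Versionstamp()))
>         # Value: winning-branch revision info plus the branch count
>         value = fdb.tuple.pack(
>             (VAL_FORMAT, docid, rev_format, rev_pos, rev_hash, branch_count))
>         tr.set_versionstamped_key(key, value)
>         # Clearing the document's previous entry in the index is omitted here.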
> 
> A regular _changes feed request can respond simply by scanning this 
> index. A style=all_docs request can also be a simple scan if 
> BranchCount is 1; if it’s greater than 1 we would need to do an 
> additional range read of the “revisions” subspace to retrieve the leaf 
> revision identifiers for the document in question. An include_docs=true 
> request would need to do an additional range read in the document 
> storage subspace for this revision.
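> 
> Sketched as a range read with the same illustrative layout as above (the 
> conflict-handling path is only stubbed out):
> 
>     import fdb
>     fdb.api_version(630)
>     db = fdb.open()
> 
>     @fdb.transactional
>     def changes_feed(tr, dbname, limit=100):
>         # Scan the changes subspace in (incarnation, versionstamp) order.
>         rng = fdb.tuple.range((b"changes", dbname))
>         results = []
>         for kv in tr.get_range(rng.start, rng.stop, limit=limit):
>             _, _, incarnation, seq = fdb.tuple.unpack(kv.key)
>             (val_format, docid, rev_format,
>              rev_pos, rev_hash, branch_count) = fdb.tuple.unpack(kv.value)
>             if branch_count > 1:
>                 # style=all_docs with conflicts: an extra range read of the
>                 # "revisions" subspace would be needed here (not shown).
>                 pass
>             results.append(((incarnation, seq), docid, (rev_pos, rev_hash)))
>         return results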
> 
> I think both the incarnation and the branch count warrant a small 
> update to the revision metadata RFC …
> 
> Adam
> 
> [1]: https://www.foundationdb.org/files/record-layer-paper.pdf
> [2]: https://github.com/apache/couchdb-documentation/pull/397
> 
> 
> > On Feb 5, 2019, at 12:20 PM, Mike Rhodes <couc...@dx13.co.uk> wrote:
> > 
> > Solution (2) appeals to me for its conceptual simplicity -- and I feel a 
> > stateless CouchDB layer is super important in simplifying overall 
> > CouchDB deployment going forward.
> > 
> > -- 
> > Mike.
> > 
> > On Mon, 4 Feb 2019, at 20:11, Adam Kocoloski wrote:
> >> Probably good to take a quick step back and note that FoundationDB’s 
> >> versionstamps are an elegant and scalable solution to atomically 
> >> maintaining the index of documents in the order in which they were most 
> >> recently updated. I think that’s what you mean by the first part of the 
> >> problem, but I want to make sure that on the ML here we collectively 
> >> understand that FoundationDB actually nails this hard part of the 
> >> problem *really* well.
> >> 
> >> When you say “notify CouchDB about new updates”, are you referring to 
> >> the feed=longpoll or feed=continuous options to the _changes API? I 
> >> guess I see three different routes that can be taken here.
> >> 
> >> One route is to use the same kind of machinery that we have in place today 
> >> in CouchDB 2.x. As a reminder, the way this works is
> >> 
> >> - a client waiting for changes on a DB spawns one local process and 
> >> also a rexi RPC process on each node hosting one of the DB shards of 
> >> interest (see fabric_db_update_listener). 
> >> - those RPC processes register as local couch_event listeners, where 
> >> they receive {db_updated, ShardName} messages forwarded to them from 
> >> the couch_db_updater processes.
> >> 
> >> Of course, in the FoundationDB design we don’t need to serialize 
> >> updates in couch_db_updater processes, but individual writers could 
> >> just as easily fire off those db_updated messages. This design is 
> >> already heavily optimized for large numbers of listeners on large 
> >> numbers of databases. The downside that I can see is it means the 
> >> *CouchDB layer nodes would need to form a distributed Erlang cluster* 
> >> in order for them to learn about the changes being committed from other 
> >> nodes in the cluster.
> >> 
> >> So let’s say we *didn’t* want to do that and instead are trying to 
> >> design for completely independent layer nodes that have no knowledge of 
> >> or communication with one another save through FoundationDB. There’s 
> >> definitely something to be said for that constraint. One very simple 
> >> approach might be to just poll FoundationDB. If the database is under a 
> >> heavy write load there’s no overhead to this approach; every time we 
> >> finish sending the output of one range query against the versionstamp 
> >> space and we re-acquire a new read version there will be new updates to 
> >> consume. Where it gets inefficient is if we have a lot of listeners on 
> >> the feed and a very low-throughput database. But one could fiddle with 
> >> polling intervals, or else add a layer of indirection so that only one 
> >> process on each layer node does the polling and then sends events 
> >> to couch_event. I think this could scale quite far.
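> >> 
> >> A rough sketch of that polling loop with the Python FoundationDB bindings 
> >> (the "changes" key layout and the notify callback standing in for 
> >> couch_event are assumptions):
> >> 
> >>     import time
> >>     import fdb
> >>     fdb.api_version(630)
> >>     db = fdb.open()
> >> 
> >>     @fdb.transactional
> >>     def read_since(tr, dbname, since_key):
> >>         # Read everything written after the last key we already delivered.
> >>         end = fdb.tuple.range((b"changes", dbname)).stop
> >>         return list(tr.get_range(
> >>             fdb.KeySelector.first_greater_than(since_key), end))
> >> 
> >>     def poll_changes(dbname, notify, interval=0.1):
> >>         since_key = fdb.tuple.range((b"changes", dbname)).start
> >>         while True:
> >>             kvs = read_since(db, dbname, since_key)
> >>             if kvs:
> >>                 since_key = kvs[-1].key
> >>                 notify(kvs)           # e.g. rebroadcast via couch_event
> >>             else:
> >>                 time.sleep(interval)  # back off only when the DB is idle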
> >> 
> >> The other option (which I think is the one you’re homing in on) is to 
> >> leverage FoundationDB’s watchers to get a push notification about 
> >> updates to a particular key. I would be cautious about creating a key 
> >> or set of keys specifically for this purpose, but if we find that 
> >> there’s some other bit of metadata that we need to maintain anyway 
> >> then this could work nicely. I think the same indirection 
> >> that I described above (where each layer node has a maximum of one 
> >> watcher per database, and it re-broadcasts messages to all interested 
> >> clients via couch_event) would help us not be too constrained by the 
> >> limit on watches.
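> >> 
> >> For example, with the Python bindings (the "update_seq" metadata key being 
> >> watched here is just a stand-in for whatever we end up maintaining anyway):
> >> 
> >>     import fdb
> >>     fdb.api_version(630)
> >>     db = fdb.open()
> >> 
> >>     @fdb.transactional
> >>     def watch_db(tr, dbname):
> >>         # Watch a metadata key that every write to the database touches.
> >>         return tr.watch(fdb.tuple.pack((b"meta", dbname, b"update_seq")))
> >> 
> >>     def db_update_listener(dbname, notify):
> >>         # At most one watch per database per layer node; fan out to the
> >>         # interested clients via couch_event.
> >>         while True:
> >>             future = watch_db(db, dbname)
> >>             future.wait()    # blocks until the watched key changes
> >>             notify(dbname)
> >>             # A real implementation would re-read the changes subspace
> >>             # after re-arming the watch to avoid missing updates.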
> >> 
> >> So to recap, the three approaches are:
> >> 
> >> 1. Writers publish db_updated events to couch_event, listeners use 
> >> distributed Erlang to subscribe to all nodes
> >> 2. Poll the _changes subspace, scale by nominating a specific process 
> >> per node to do the polling
> >> 3. Same as #2 but using a watch on DB metadata that changes with every 
> >> update instead of polling
> >> 
> >> Adam
> >> 
> >>> On Feb 4, 2019, at 2:18 PM, Ilya Khlopotov <iil...@apache.org> wrote:
> >>> 
> >>> Hi, 
> >>> 
> >>> One of the features of CouchDB which doesn't map cleanly into 
> >>> FoundationDB is the changes feed. The essence of the feature is: 
> >>> - The subscriber of the feed wants to receive notifications when the 
> >>> database is updated. 
> >>> - The notification includes the update_seq for the database and the 
> >>> list of changes which happened at that time. 
> >>> - The change itself includes the docid and rev. 
> >>> 
> >>> There are multiple ways to easily solve this problem. Designing a 
> >>> scalable way to do it is way harder.  
> >>> 
> >>> There are at least two parts to this problem:
> >>> - how to structure secondary indexes so we can provide what we need in 
> >>> notification event
> >>> - how to notify CouchDB about new updates
> >>> 
> >>> For the second part of the problem we could set up a watcher on one of the 
> >>> keys we have to update on every transaction: for example, the key which 
> >>> tracks the database_size, the key which tracks the number of documents, or 
> >>> a new key we add ourselves. The problem is that at some point we would hit a 
> >>> capacity limit for atomic updates of a single key (FoundationDB doesn't 
> >>> redistribute the load among servers on a per-key basis). In that case we 
> >>> would have to distribute the counter among multiple keys to allow 
> >>> FoundationDB to split the hot range. Therefore, we would have to set up 
> >>> multiple watches. FoundationDB has a limit on the number of watches a 
> >>> client can set up (100,000), so we need to keep this number in mind 
> >>> when designing the feature. 
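> >>> 
> >>> A sketch of what such a sharded counter might look like with the Python 
> >>> FoundationDB bindings (the shard count and key layout are made up for 
> >>> illustration):
> >>> 
> >>>     import random
> >>>     import struct
> >>>     import fdb
> >>>     fdb.api_version(630)
> >>>     db = fdb.open()
> >>> 
> >>>     N_SHARDS = 16  # arbitrary; more shards spread the write load further
> >>> 
> >>>     @fdb.transactional
> >>>     def bump_update_counter(tr, dbname):
> >>>         # Pick a shard at random so no single key becomes a hot spot.
> >>>         shard = random.randrange(N_SHARDS)
> >>>         key = fdb.tuple.pack((b"meta", dbname, b"updates", shard))
> >>>         tr.add(key, struct.pack("<q", 1))  # atomic add, no read needed
> >>> 
> >>>     # A listener would then need one watch per shard key (N_SHARDS
> >>>     # watches per database), which is why the client-side watch limit
> >>>     # matters.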
> >>> 
> >>> The single-key update rate problem is largely theoretical and we might 
> >>> ignore it for the PoC version; then we can measure the impact and change 
> >>> the design accordingly. The reason I decided to bring it up is to see whether 
> >>> someone already has a simple solution to avoid the bottleneck. 
> >>> 
> >>> best regards,
> >>> iilyak
> >> 
> >> 
> 
>
