[ https://issues.apache.org/jira/browse/COUCHDB-2310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14293450#comment-14293450 ]

Jan Lehnardt commented on COUCHDB-2310:
---------------------------------------

[~benoitc] any news? :)

> Add a bulk API for revs & open_revs
> -----------------------------------
>
>                 Key: COUCHDB-2310
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-2310
>             Project: CouchDB
>          Issue Type: Bug
>      Security Level: public(Regular issues)
>          Components: HTTP Interface
>            Reporter: Nolan Lawson
>
> CouchDB replication is too slow.
> And what makes it so slow is that it's just so unnecessarily chatty. During
> replication, you have to do a separate GET for each individual document, in
> order to get the full {{_revisions}} object for that document (using the
> {{revs}} and {{open_revs}} parameters – refer to [the TouchDB
> writeup|https://github.com/couchbaselabs/TouchDB-iOS/wiki/Replication-Algorithm]
> or [Benoit's writeup|http://dataprotocols.org/couchdb-replication/] if you
> need a refresher).
> So for example, let's say you've got a database full of 10,000 documents, and
> you replicate using a batch size of 500 (batch sizes are configurable in
> PouchDB). The conversation for a single batch basically looks like this:
> {code}
> - REPLICATOR: gimme 500 changes since seq X (1 GET request)
> - SOURCE: okay
> - REPLICATOR: gimme the _revs_diff for these 500 docs/_revs (1 POST request)
> - SOURCE: okay
> - repeat 500 times:
>   - REPLICATOR: gimme the _revisions for doc n with _revs [...] (1 GET request)
>   - SOURCE: okay
> - REPLICATOR: here's a _bulk_docs with 500 documents (1 POST request)
> - TARGET: okay
> {code}
> See the problem here? That 500-loop, where we have to do a GET for each one
> of 500 documents, is a lot of unnecessary back-and-forth, considering that
> the replicator already knows what it needs before the loop starts. You can
> parallelize, but if you assume a browser (e.g. for PouchDB), most browsers
> only let you do ~8 simultaneous requests at once. Plus, there's latency and
> HTTP headers to consider. So overall, it's not cool.
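
To put a rough number on the conversation quoted above, here is a minimal sketch that counts HTTP requests for a whole replication run. It is only arithmetic on the figures in the description; `count_requests` and its `bulk_fetch` flag are made-up names for illustration, and it assumes every changed document is missing on the target and that a hypothetical bulk endpoint would need exactly one request per batch.

```python
import math


def count_requests(total_docs: int, batch_size: int, bulk_fetch: bool = False) -> int:
    """Count HTTP requests for a full replication run.

    Assumes every changed document is missing on the target, so each one
    has to be fetched from the source.
    """
    total = 0
    for batch in range(math.ceil(total_docs / batch_size)):
        # The last batch may be smaller than batch_size.
        docs = min(batch_size, total_docs - batch * batch_size)
        changes, revs_diff, bulk_docs = 1, 1, 1
        # Today: one GET per document. Assumed bulk API: one request per batch.
        fetches = 1 if bulk_fetch else docs
        total += changes + revs_diff + fetches + bulk_docs
    return total


print(count_requests(10_000, 500))                   # per-document GETs
print(count_requests(10_000, 500, bulk_fetch=True))  # assumed bulk endpoint
```

For the 10,000-document example that is 20 batches of 503 requests, 10,060 in total, against 80 if each batch's documents could be fetched in one request.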
> So why do we even need to do the separate requests? Shouldn't {{_all_docs}}
> be good enough? Turns out it's not, because we need this special
> {{_revisions}} object.
> For example, consider a document {{'foo'}} with 10 revisions. You may compact
> the database, in which case revisions {{1-x}} through {{9-x}} are no longer
> retrievable. However, if you query using {{revs}} and {{open_revs}}, those
> rev IDs are still available:
> {code}
> $ curl 'http://nolan.iriscouch.com/test/foo?revs=true&open_revs=all'
> {
>   "_id": "foo",
>   "_rev": "10-c78e199ad5e996b240c9d6482907088e",
>   "_revisions": {
>     "start": 10,
>     "ids": [
>       "c78e199ad5e996b240c9d6482907088e",
>       "f560283f1968a05046f0c38e468006bb",
>       "0091198554171c632c27c8342ddec5af",
>       "e0a023e2ea59db73f812ad773ea08b17",
>       "65d7f8b8206a244035edd9f252f206ad",
>       "069d1432a003c58bdd23f01ff80b718f",
>       "d21f26bb604b7fe9eba03ce4562cf37b",
>       "31d380f99a6e54875855e1c24469622d",
>       "3b4791360024426eadafe31542a2c34b",
>       "967a00dff5e02add41819138abb3284d"
>     ]
>   }
> }
> {code}
> And in the replication algorithm, _this full \_revisions object is required_
> at the point when you copy the document from one database to another, which
> is accomplished with a POST to {{_bulk_docs}} using {{new_edits=false}}. If
> you don't have the full {{_revisions}} object, CouchDB accepts the new
> revision, but considers it to be a conflict. (The exception is with
> generation-1 documents, since they have no history, so as it says in the
> TouchDB writeup, you can safely just use {{_all_docs}} as an optimization for
> such documents.)
> And unfortunately, this {{_revisions}} object is only available from the {{GET
> /:dbid/:docid}} endpoint. Trust me; I've tried the other APIs. You can't get
> it anywhere else.
> This is a huge problem, especially in PouchDB where we often have to deal
> with CORS, meaning the number of HTTP requests is doubled. So for those 500
> GETs, it's an extra 500 OPTIONs, which is just unacceptable.
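
For readers unfamiliar with the {{_revisions}} object quoted above: it stores the revision history compactly as a starting generation number plus a list of hashes, newest first. A minimal sketch of expanding it back into full rev IDs follows; `expand_revisions` is a made-up helper name, not part of CouchDB or PouchDB.

```python
def expand_revisions(revisions: dict) -> list[str]:
    """Turn a _revisions object into full "<generation>-<hash>" rev IDs.

    The first hash belongs to generation `start`; each following hash is
    one generation older.
    """
    start = revisions["start"]
    return [f"{start - i}-{rev_hash}" for i, rev_hash in enumerate(revisions["ids"])]


# The _revisions object from the 'foo' example in the description.
foo_revisions = {
    "start": 10,
    "ids": [
        "c78e199ad5e996b240c9d6482907088e",
        "f560283f1968a05046f0c38e468006bb",
        "0091198554171c632c27c8342ddec5af",
        "e0a023e2ea59db73f812ad773ea08b17",
        "65d7f8b8206a244035edd9f252f206ad",
        "069d1432a003c58bdd23f01ff80b718f",
        "d21f26bb604b7fe9eba03ce4562cf37b",
        "31d380f99a6e54875855e1c24469622d",
        "3b4791360024426eadafe31542a2c34b",
        "967a00dff5e02add41819138abb3284d",
    ],
}

for rev in expand_revisions(foo_revisions):
    print(rev)
```

The first line printed is the current {{_rev}}, `10-c78e199ad5e996b240c9d6482907088e`, and the last is the generation-1 revision, `1-967a00dff5e02add41819138abb3284d`.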
> Replication does not have to be slow. While we were experimenting with ways
> of fetching documents in bulk, we tried a technique that just relied on using
> {{_changes}} with {{include_docs=true}}
> ([#2472|https://github.com/pouchdb/pouchdb/pull/2472]). This pushed
> conflicts into the target database, but on the upside, you can sync ~95k
> documents from npm's skimdb repository to the browser in less than 20
> minutes! (See [npm-browser.com|http://npm-browser.com] for a demo.)
> What an amazing story we could tell about the beauty of CouchDB replication,
> if only this trick actually worked!
> My proposal is a simple one: just add the {{revs}} and {{open_revs}} options
> to {{_all_docs}}. Presumably this would be aligned with {{keys}}, so similar
> to how {{keys}} takes an array of docIds, {{open_revs}} would take an array
> of arrays of revisions. {{revs}} would just be a boolean.
> This only gets hairy in the case of deleted documents. In this example,
> {{bar}} is deleted but {{foo}} is not:
> {code}
> curl -g 'http://nolan.iriscouch.com/test/_all_docs?keys=["bar","foo"]&include_docs=true'
> {"total_rows":1,"offset":0,"rows":[
> {"id":"bar","key":"bar","value":{"rev":"2-eec205a9d413992850a6e32678485900","deleted":true},"doc":null},
> {"id":"foo","key":"foo","value":{"rev":"10-c78e199ad5e996b240c9d6482907088e"},"doc":{"_id":"foo","_rev":"10-c78e199ad5e996b240c9d6482907088e"}}
> ]}
> {code}
> The cleanest approach would be to attach the {{_revisions}} object to the
> {{doc}}, but if you use {{keys}}, then the deleted documents are returned with
> {{doc: null}}, even if you specify {{include_docs=true}}. One workaround would
> be to simply add a {{revisions}} object to the {{value}}.
> If all of this would be too difficult to implement under the hood in CouchDB,
> I'd also be happy to get the {{_revisions}} back in {{_changes}},
> {{_revs_diff}}, or even in a separate endpoint. I don't care, as long as
> there is some bulk API where I can get multiple {{_revisions}} for multiple
> documents at once.
> On the PouchDB end of things, we would really like to push forward on this.
> I'm happy to implement a Node.js proxy to stand in front of
> CouchDB/Cloudant/CSG and add this new API, plus adding it directly to PouchDB
> Server. I can invent whatever API I want, but the main thing is that I would
> like this API to be something that all the major players can agree upon
> (Apache, Cloudant, Couchbase) so that eventually the proxy is no longer
> necessary.
> Thanks for reading the WoT. Looking forward to a faster CouchDB replication
> protocol, since it's the thing that ties us all together and makes this crazy
> experiment worthwhile.
> Background: [this|https://github.com/pouchdb/pouchdb/issues/2686] and
> [this|https://gist.github.com/nolanlawson/340cb898f8ed9f3db8a0].

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)