[jira] [Commented] (COUCHDB-2310) Add a bulk API for revs & open_revs

Robert Newson (JIRA) Thu, 18 Dec 2014 12:22:00 -0800

    [ 
https://issues.apache.org/jira/browse/COUCHDB-2310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14252181#comment-14252181
 ]


Robert Newson commented on COUCHDB-2310:
----------------------------------------

Finally, the intent to make everything accessible in bulk using POSTs seems to 
ruin our RESTful nature. Is there another way to pursue performance 
enhancements without going that far? I personally hate all the bulk endpoints 
(each added pretty much ad hoc, for much the same reason that motivates this ticket).

> Add a bulk API for revs & open_revs
> -----------------------------------
>
>                 Key: COUCHDB-2310
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-2310
>             Project: CouchDB
>          Issue Type: Bug
>      Security Level: public(Regular issues) 
>          Components: HTTP Interface
>            Reporter: Nolan Lawson
>
> CouchDB replication is too slow.
> And what makes it so slow is that it's just so unnecessarily chatty. During 
> replication, you have to do a separate GET for each individual document, in 
> order to get the full {{_revisions}} object for that document (using the 
> {{revs}} and {{open_revs}} parameters; refer to [the TouchDB 
> writeup|https://github.com/couchbaselabs/TouchDB-iOS/wiki/Replication-Algorithm]
>  or [Benoit's writeup|http://dataprotocols.org/couchdb-replication/] if you 
> need a refresher).
> So for example, let's say you've got a database full of 10,000 documents, and 
> you replicate using a batch size of 500 (batch sizes are configurable in 
> PouchDB). The conversation for a single batch basically looks like this:
> {code}
> - REPLICATOR: gimme 500 changes since seq X (1 GET request)
>   - SOURCE: okay
> - REPLICATOR: gimme the _revs_diff for these 500 docs/_revs (1 POST request)
>   - TARGET: okay
> - repeat 500 times:
>   - REPLICATOR: gimme the _revisions for doc n with _revs [...] (1 GET request)
>     - SOURCE: okay
> - REPLICATOR: here's a _bulk_docs with 500 documents (1 POST request)
>   - TARGET: okay
> {code}
> See the problem here? That 500-loop, where we have to do a GET for each one 
> of 500 documents, is a lot of unnecessary back-and-forth, considering that 
> the replicator already knows what it needs before the loop starts. You can 
> parallelize, but if you assume a browser (e.g. for PouchDB), most browsers 
> only let you do ~8 simultaneous requests. Plus, there's latency and 
> HTTP headers to consider. So overall, it's not cool.
> So why do we even need to do the separate requests? Shouldn't {{_all_docs}} 
> be good enough? Turns out it's not, because we need this special 
> {{_revisions}} object.
> For example, consider a document {{'foo'}} with 10 revisions. You may compact 
> the database, in which case revisions {{1-x}} through {{9-x}} are no longer 
> retrievable. However, if you query using {{revs}} and {{open_revs}}, those 
> rev IDs are still available:
> {code}
> $ curl 'http://nolan.iriscouch.com/test/foo?revs=true&open_revs=all'
> {
>   "_id": "foo",
>   "_rev": "10-c78e199ad5e996b240c9d6482907088e",
>   "_revisions": {
>     "start": 10,
>     "ids": [
>       "c78e199ad5e996b240c9d6482907088e",
>       "f560283f1968a05046f0c38e468006bb",
>       "0091198554171c632c27c8342ddec5af",
>       "e0a023e2ea59db73f812ad773ea08b17",
>       "65d7f8b8206a244035edd9f252f206ad",
>       "069d1432a003c58bdd23f01ff80b718f",
>       "d21f26bb604b7fe9eba03ce4562cf37b",
>       "31d380f99a6e54875855e1c24469622d",
>       "3b4791360024426eadafe31542a2c34b",
>       "967a00dff5e02add41819138abb3284d"
>     ]
>   }
> }
> {code}
> And in the replication algorithm, _this full \_revisions object is required_ 
> at the point when you copy the document from one database to another, which 
> is accomplished with a POST to {{_bulk_docs}} using {{new_edits=false}}. If 
> you don't have the full {{_revisions}} object, CouchDB accepts the new 
> revision, but considers it to be a conflict. (The exception is with 
> generation-1 documents, since they have no history, so as it says in the 
> TouchDB writeup, you can safely just use {{_all_docs}} as an optimization for 
> such documents.)
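
To make the role of the {{_revisions}} object concrete, here is a minimal Python sketch that expands it into the full list of rev IDs and wraps the document in a {{_bulk_docs}} request body with {{new_edits}} set to false. The helper names `revision_history` and `bulk_docs_body` are made up for illustration and are not part of CouchDB or PouchDB; the sample data is the {{foo}} document quoted above.

```python
import json


def revision_history(revisions):
    """Expand a `_revisions` object into full rev IDs, newest first.

    `start` is the generation of the newest revision; each later entry in
    `ids` is one generation older.
    """
    start = revisions["start"]
    return [f"{start - i}-{rev_hash}" for i, rev_hash in enumerate(revisions["ids"])]


def bulk_docs_body(docs):
    """Build a JSON body for POST /:dbid/_bulk_docs with new_edits=false.

    Hypothetical helper: it only serializes the payload, it sends nothing.
    """
    return json.dumps({"docs": docs, "new_edits": False})


# Sample document copied from the curl output quoted above.
doc = {
    "_id": "foo",
    "_rev": "10-c78e199ad5e996b240c9d6482907088e",
    "_revisions": {
        "start": 10,
        "ids": [
            "c78e199ad5e996b240c9d6482907088e",
            "f560283f1968a05046f0c38e468006bb",
            "0091198554171c632c27c8342ddec5af",
            "e0a023e2ea59db73f812ad773ea08b17",
            "65d7f8b8206a244035edd9f252f206ad",
            "069d1432a003c58bdd23f01ff80b718f",
            "d21f26bb604b7fe9eba03ce4562cf37b",
            "31d380f99a6e54875855e1c24469622d",
            "3b4791360024426eadafe31542a2c34b",
            "967a00dff5e02add41819138abb3284d",
        ],
    },
}

history = revision_history(doc["_revisions"])
print(history[0])   # newest revision, equal to doc["_rev"]
print(history[-1])  # oldest known ancestor: 1-967a00dff5e02add41819138abb3284d
body = bulk_docs_body([doc])
```

The point of the sketch is that the ancestor rev IDs survive in {{_revisions}} even when the old revision bodies are gone, and it is that ancestry the target needs alongside the document.
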
> And unfortunately, this {{_revisions}} object is only available from the {{GET 
> /:dbid/:docid}} endpoint. Trust me; I've tried the other APIs. You can't get 
> it anywhere else.
> This is a huge problem, especially in PouchDB where we often have to deal 
> with CORS, meaning the number of HTTP requests is doubled. So for those 500 
> GETs, it's an extra 500 OPTIONs, which is just unacceptable.
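
The request counts in the quoted description can be tallied with a short Python sketch. It assumes the worst case in which every document in a batch needs its own GET (no generation-1 shortcut), and it takes the description's statement that CORS doubles the number of requests at face value. The function name and the `bulk_fetch` flag are hypothetical; `bulk_fetch=True` models some bulk endpoint that returns all {{_revisions}} for a batch in one request.

```python
import math


def replication_requests(total_docs, batch_size, cors=False, bulk_fetch=False):
    """Estimate the HTTP requests needed to replicate `total_docs` documents."""
    batches = math.ceil(total_docs / batch_size)
    total = 0
    remaining = total_docs
    for _ in range(batches):
        docs_in_batch = min(batch_size, remaining)
        remaining -= docs_in_batch
        # Either one GET per document, or a single bulk request (hypothetical).
        fetches = 1 if bulk_fetch else docs_in_batch
        # 1 GET _changes + 1 POST _revs_diff + fetches + 1 POST _bulk_docs
        total += 1 + 1 + fetches + 1
    # Assumption from the description: CORS adds one OPTIONS per request.
    return total * 2 if cors else total


print(replication_requests(10_000, 500))                   # 10060
print(replication_requests(10_000, 500, cors=True))        # 20120
print(replication_requests(10_000, 500, bulk_fetch=True))  # 80
```

Under these assumptions, the per-document GETs account for 10,000 of the 10,060 requests, which is why a bulk fetch changes the total so much.
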
> Replication does not have to be slow. While we were experimenting with ways 
> of fetching documents in bulk, we tried a technique that just relied on using 
> {{_changes}} with {{include_docs=true}} 
> ([#2472|https://github.com/pouchdb/pouchdb/pull/2472]). This pushed 
> conflicts into the target database, but on the upside, you can sync ~95k 
> documents from npm's skimdb repository to the browser in less than 20 
> minutes! (See [npm-browser.com|http://npm-browser.com] for a demo.)
> What an amazing story we could tell about the beauty of CouchDB replication, 
> if only this trick actually worked!
> My proposal is a simple one: just add the {{revs}} and {{open_revs}} options 
> to {{_all_docs}}. Presumably this would be aligned with {{keys}}, so similar 
> to how {{keys}} takes an array of docIds, {{open_revs}} would take an array 
> of arrays of revisions. {{revs}} would just be a boolean.
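
As a rough illustration of the proposed shape, the Python sketch below turns a {{_revs_diff}}-style response into aligned {{keys}} and {{open_revs}} arrays. This is only one reading of the proposal: no such {{_all_docs}} option exists in CouchDB, the function name `bulk_revs_request` is invented, and whether the options would travel in a POST body or in the query string is left open in the description.

```python
def bulk_revs_request(revs_diff):
    """Build a hypothetical request for the proposed bulk revs/open_revs API.

    `revs_diff` maps doc IDs to {"missing": [rev, ...]}, in the style of a
    _revs_diff response. The i-th entry of `open_revs` lists the revisions
    wanted for the i-th entry of `keys`.
    """
    keys = sorted(revs_diff)  # sorted only to make the output deterministic
    return {
        "keys": keys,
        "open_revs": [list(revs_diff[key]["missing"]) for key in keys],
        "revs": True,
    }


# Sample input using rev IDs that appear in the quoted examples.
sample = {
    "foo": {"missing": ["10-c78e199ad5e996b240c9d6482907088e"]},
    "bar": {"missing": ["2-eec205a9d413992850a6e32678485900"]},
}
request = bulk_revs_request(sample)
print(request["keys"])
print(request["open_revs"])
```

Keeping the two arrays index-aligned mirrors how {{keys}} already works, at the cost of making the request invalid if the arrays ever differ in length.
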
> This only gets hairy in the case of deleted documents. In this example, 
> {{bar}} is deleted but {{foo}} is not:
> {code}
> $ curl -g 'http://nolan.iriscouch.com/test/_all_docs?keys=["bar","foo"]&include_docs=true'
> {"total_rows":1,"offset":0,"rows":[
> {"id":"bar","key":"bar","value":{"rev":"2-eec205a9d413992850a6e32678485900","deleted":true},"doc":null},
> {"id":"foo","key":"foo","value":{"rev":"10-c78e199ad5e996b240c9d6482907088e"},"doc":{"_id":"foo","_rev":"10-c78e199ad5e996b240c9d6482907088e"}}
> ]}
> {code}
> The cleanest would be to attach the {{_revisions}} object to the {{doc}}, but 
> if you use {{keys}}, then the deleted documents are returned with {{doc: 
> null}}, even if you specify {{include_docs=true}}. One workaround would be to 
> simply add a {{revisions}} object to the {{value}}.
> If all of this would be too difficult to implement under the hood in CouchDB, 
> I'd also be happy to get the {{_revisions}} back in {{_changes}}, 
> {{_revs_diff}}, or even in a separate endpoint. I don't care, as long as 
> there is some bulk API where I can get multiple {{_revisions}} for multiple 
> documents at once.
> On the PouchDB end of things, we would really like to push forward on this. 
> I'm happy to implement a Node.js proxy to stand in front of 
> CouchDB/Cloudant/CSG and add this new API, plus adding it directly to PouchDB 
> Server. I can invent whatever API I want, but the main thing is that I would 
> like this API to be something that all the major players can agree upon 
> (Apache, Cloudant, Couchbase) so that eventually the proxy is no longer 
> necessary.
> Thanks for reading the WoT. Looking forward to a faster CouchDB replication 
> protocol, since it's the thing that ties us all together and makes this crazy 
> experiment worthwhile.
> Background: [this|https://github.com/pouchdb/pouchdb/issues/2686] and 
> [this|https://gist.github.com/nolanlawson/340cb898f8ed9f3db8a0].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
