[jira] [Commented] (COUCHDB-2310) Add a bulk API for revs & open_revs

Alexander Shorin (JIRA) Thu, 18 Dec 2014 14:28:51 -0800

    [ 
https://issues.apache.org/jira/browse/COUCHDB-2310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14252421#comment-14252421
 ]


Alexander Shorin commented on COUCHDB-2310:
-------------------------------------------

I agree with [~rnewson] about the name and bulk evilness. Having this feature 
for /_all_docs would be much more better.

{quote}
Is there another way to pursue performance enhancements without going that far?
{quote}

There is a tricky one by using 
/db/_changes?include_docs=true&style=all_docs&attachments=true which will let 
it emit documents with all leaf/conflicted revisions. But here we have other 
problems:
- we'll have always read all the leaves every time
- no multipart support

So reducing amount of requests will cost us a lot of more traffic to receive 
and memory usage. Implement multipart response type support isn't a big problem 
I think - this mimetype is made for streams. For other problem we could make 
some smart style which will emit all the leaves only once while after (for 
continuous feeds) will emit only new/updated ones. In the result replicator 
will not need to operate with /_revs_diff and document resources and just 
listen changes feed and fetch all the data from there.

> Add a bulk API for revs & open_revs
> -----------------------------------
>
>                 Key: COUCHDB-2310
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-2310
>             Project: CouchDB
>          Issue Type: Bug
>      Security Level: public(Regular issues) 
>          Components: HTTP Interface
>            Reporter: Nolan Lawson
>
> CouchDB replication is too slow.
> And what makes it so slow is that it's just so unnecessarily chatty. During 
> replication, you have to do a separate GET for each individual document, in 
> order to get the full {{_revisions}} object for that document (using the 
> {{revs}} and {{open_revs}} parameters &ndash; refer to [the TouchDB 
> writeup|https://github.com/couchbaselabs/TouchDB-iOS/wiki/Replication-Algorithm]
>  or [Benoit's writeup|http://dataprotocols.org/couchdb-replication/] if you 
> need a refresher).
> So for example, let's say you've got a database full of 10,000 documents, and 
> you replicate using a batch size of 500 (batch sizes are configurable in 
> PouchDB). The conversation for a single batch basically looks like this:
> {code}
> - REPLICATOR: gimme 500 changes since seq X (1 GET request)
>   - SOURCE: okay
> - REPLICATOR: gimme the _revs_diff for these 500 docs/_revs (1 POST request)
>   - SOURCE: okay
> - repeat 500 times:
>   - REPLICATOR: gimme the _revisions for doc n with _revs [...] (1 GET 
> request)
>     - SOURCE: okay
> - REPLICATOR: here's a _bulk_docs with 500 documents (1 POST request)
>     - TARGET: okay
> {code}
> See the problem here? That 500-loop, where we have to do a GET for each one 
> of 500 documents, is a lot of unnecessary back-and-forth, considering that 
> the replicator already knows what it needs before the loop starts. You can 
> parallelize, but if you assume a browser (e.g. for PouchDB), most browsers 
> only let you do ~8 simultaneous requests at once. Plus, there's latency and 
> HTTP headers to consider. So overall, it's not cool.
> So why do we even need to do the separate requests? Shouldn't {{_all_docs}} 
> be good enough? Turns out it's not, because we need this special 
> {{_revisions}} object.
> For example, consider a document {{'foo'}} with 10 revisions. You may compact 
> the database, in which case revisions {{1-x}} through {{9-x}} are no longer 
> retrievable. However, if you query using {{revs}} and {{open_revs}}, those 
> rev IDs are still available:
> {code}
> $ curl 'http://nolan.iriscouch.com/test/foo?revs=true&open_revs=all'
> {
>   "_id": "foo",
>   "_rev": "10-c78e199ad5e996b240c9d6482907088e",
>   "_revisions": {
>     "start": 10,
>     "ids": [
>       "c78e199ad5e996b240c9d6482907088e",
>       "f560283f1968a05046f0c38e468006bb",
>       "0091198554171c632c27c8342ddec5af",
>       "e0a023e2ea59db73f812ad773ea08b17",
>       "65d7f8b8206a244035edd9f252f206ad",
>       "069d1432a003c58bdd23f01ff80b718f",
>       "d21f26bb604b7fe9eba03ce4562cf37b",
>       "31d380f99a6e54875855e1c24469622d",
>       "3b4791360024426eadafe31542a2c34b",
>       "967a00dff5e02add41819138abb3284d"
>     ]
>   }
> }
> {code}
> And in the replication algorithm, _this full \_revisions object is required_ 
> at the point when you copy the document from one database to another, which 
> is accomplished with a POST to {{_bulk_docs}} using {{new_edits=false}}. If 
> you don't have the full {{_revisions}} object, CouchDB accepts the new 
> revision, but considers it to be a conflict. (The exception is with 
> generation-1 documents, since they have no history, so as it says in the 
> TouchDB writeup, you can safely just use {{_all_docs}} as an optimization for 
> such documents.)
> And unfortunately, this {{_revision}} object is only available from the {{GET 
> /:dbid/:docid}} endpoint. Trust me; I've tried the other APIs. You can't get 
> it anywhere else.
> This is a huge problem, especially in PouchDB where we often have to deal 
> with CORS, meaning the number of HTTP requests is doubled. So for those 500 
> GETs, it's an extra 500 OPTIONs, which is just unacceptable.
> Replication does not have to be slow. While we were experimenting with ways 
> of fetching documents in bulk, we tried a technique that just relied on using 
> {{_changes}} with {{include_docs=true}} 
> ([|\#2472|https://github.com/pouchdb/pouchdb/pull/2472]). This pushed 
> conflicts into the target database, but on the upside, you can sync ~95k 
> documents from npm's skimdb repository to the browser in less than 20 
> minutes! (See [npm-browser.com|http://npm-browser.com] for a demo.)
> What an amazing story we could tell about the beauty of CouchDB replication, 
> if only this trick actually worked!
> My proposal is a simple one: just add the {{revs}} and {{open_revs}} options 
> to {{_all_docs}}. Presumably this would be aligned with {{keys}}, so similar 
> to how {{keys}} takes an array of docIds, {{open_revs}} would take an array 
> of array of revisions. {{revs}} would just be a boolean.
> This only gets hairy in the case of deleted documents. In this example, 
> {{bar}} is deleted but {{foo}} is not:
> {code}
> curl -g 
> 'http://nolan.iriscouch.com/test/_all_docs?keys=["bar","foo"]&include_docs=true'
> {"total_rows":1,"offset":0,"rows":[
> {"id":"bar","key":"bar","value":{"rev":"2-eec205a9d413992850a6e32678485900","deleted":true},"doc":null},
> {"id":"foo","key":"foo","value":{"rev":"10-c78e199ad5e996b240c9d6482907088e"},"doc":{"_id":"foo","_rev":"10-c78e199ad5e996b240c9d6482907088e"}}
> ]}
> {code}
> The cleanest would be to attach the {{_revisions}} object to the {{doc}}, but 
> if you use {{keys}}, then the deleted documents are returned with {{doc: 
> null}}, even if you specify {{include_docs=true}}. One workaround would be to 
> simply add a {{revisions}} object to the {{value}}.
> If all of this would be too difficult to implement under the hood in CouchDB, 
> I'd also be happy to get the {{_revisions}} back in {{_changes}}, 
> {{_revs_diff}}, or even in a separate endpoint. I don't care, as long as 
> there is some bulk API where I can get multiple {{_revisions}} for multiple 
> documents at once.
> On the PouchDB end of things, we would really like to push forward on this. 
> I'm happy to implement a Node.js proxy to stand in front of 
> CouchDB/Cloudant/CSG and add this new API, plus adding it directly to PouchDB 
> Server. I can invent whatever API I want, but the main thing is that I would 
> like this API to be something that all the major players can agree upon 
> (Apache, Cloudant, Couchbase) so that eventually the proxy is no longer 
> necessary.
> Thanks for reading the WoT. Looking forward to a faster CouchDB replication 
> protocol, since it's the thing that ties us all together and makes this crazy 
> experiment worthwhile.
> Background: [this|https://github.com/pouchdb/pouchdb/issues/2686] and 
> [this|https://gist.github.com/nolanlawson/340cb898f8ed9f3db8a0].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (COUCHDB-2310) Add a bulk API for revs & open_revs

Reply via email to