Hi all, It's been some months now that I resumed Damien's initial work on a new replicator. It's now in a state that's perfectly usable (hopefully), functional and with all the features of the current replicator. Its code is at:
https://github.com/fdmanana/couchdb/compare/trunk_new_replicator (the ASF's svn branch "new_replicator" is too aged now and has several issues with pull replications) What it brings to the table: - When new revisions of a document need to be replicated to the target, it only replicates the attachments introduced in the missing revisions, that is, unlike the current replicator it doesn't replicate all the attachments (introduced since the first revision) - this makes a huge difference for databases with many and/or large attachments; - It exploits parallelism more aggressively - the current replicator uses a single process for finding which revisions are missing in the target, a single process to write documents to the target and for push replications a single process that reads documents from the local source (for pull replications it spawns several processes to read documents from the remote source). The new replicator uses several (# is configurable) processes to find the missing revisions and copy documents (read from source and write to target); - Progress is done faster when attachments are present. For a batch of documents, the current replicator fetches first the body of the documents and only when it decides to flush them to disk, for each attachment it opens a separate connection to download each. This means that only after all the attachments for the documents in the batch are fetched, a checkpoint is done. The new replicator fetches documents together with their attachments (in compressed form if they're compressed at the source) in a single request and flushes them as soon as possible; - Better error isolation. Currently, HTTP connections are shared by different replications. This behaviour is more like a "side effect" of using ibrowse's load balancer. The new replicator has its own load balancer implementation which ensures that all the requests in the pipeline of a connection belong to the same replication; - It was completely rewritten from scratch. Hopefully the code is better organised now and slightly shorter; - Better logging (more meaningful information about connectivity errors and document write failures) and integration with the replicator database. There are now some new .ini replicator configuration options ( https://github.com/fdmanana/couchdb/blob/trunk_new_replicator/etc/couchdb/default.ini.tpl.in#L138 ) under the [replicator] section: "worker_processes" - the number of processes that copy documents from the source database to the target database. For each one of them, a missing revisions process is also spawned. For example, if set to 4 (default) we get 8 processes: 4 for copying documents and 4 to find which revisions are missing in the target; "worker_batch_size" - the maximum number of consecutive source sequence numbers each worker processes at once. Default value is 1000. Lower values mean that checkpoints are done more frequently and are recommended when the source has very large documents and/or most documents with many and/or large attachments. Higher values can make the replicator process faster only when most documents are very small (below a few kilobytes) and there are none or very few with attachments; "http_connections" - the maximum number of HTTP connections per replication; "http_pipeline_size" - the maximum pipeline size for each HTTP connection; "connection_timeout" - the period of time (in milliseconds) after which a connection is considered dead because no data was received (and up to 10 retries are done). The default value is 30 000 milliseconds; "socket_options" - options to pass to the TCP sockets. All the available options are listed in the man page of the Erlang module 'inet' (as well as in the man page for the system call setsockopt). Some of these options, such as priority, sndbuf and recbuf, can make a significant difference if set to the correct values, which depend on the OS, hardware and network characteristics - a good tutorial for this can be found at: http://www.ibm.com/developerworks/linux/library/l-hisock.html Any of these options can also be specified for individual replications in the replication document/object. Example: $ curl -H 'Content-Type: application/json' -X POST http://myserver.com/_replicate -d '{ "source": "http://abc.com/foo", "target": "bar", "worker_processes": 6, "http_connections": 20, "http_pipeline_size": 100} Now, for a comparison with the current replicator, here are some results for replications done from scratch (empty target) in a local Wifi network and from my laptop (Europe, Portugal) to my CouchOne account ( http://fdmanana.couchone.com/_utils/ ). Time was measured using the 'time' command like this: $ time curl -X POST http://foobar.com/_replicate ..... --- P U L L R E P L I C A T I O N S --- 1) database "no_atts_db" - 79 743 documents, each with a size between 500 bytes and 3 Kb, no attachments In a Wifi network, current replicator: 5m2.898s new replicator: 4m6.517s >From remote source fdmanana.couchone.com/no_atts_db: current replicator: 4m33.673s new replicator: 4m7.709s 2) database "atts_db", 1100 documents, each with a size between 1 Kb and 2 Kb, and all with 2 or 3 attachments, with sizes 30 bytes, 41 Kb and 2 Mb In a Wifi network, current replicator: 17m27.785s new replicator: 16m1.450s >From remote source fdmanana.couchone.com/atts_db: current replicator: 39m45.223s (first run), 41m11.525s (second run) new replicator: 7m50.914s (first run), 7m38.588s (second run) 3) database "large_6_14", 93 750 documents, each with a size between 6 Kb and 14 Kb, no attachments In a Wifi network, current replicator: 21m40.060s new replicator: 18m59.212s >From remote source fdmanana.couchone.com/large_6_14: current replicator: 10m42.713s new replicator: 10m2.444s 4) database "large1kb", 341 298 documents, each with a size of about 1 Kb, no attachments In a Wifi network, current replicator: 16m17.937s new replicator: 11m7.115s >From remote source fdmanana.couchone.com/large1kb: current replicator: 21m49.032s new replicator: 17m44.142s --- P U S H R E P L I C A T I O N S --- 1) database "no_atts_db" - 79 743 documents, each with a size between 500 bytes and 3 Kb, no attachments In a Wifi network, current replicator: 3m58.723s new replicator: 3m23.846s To remote target fdmanana.couchone.com/no_atts_db: current replicator: 9m56.862s new replicator: 8m24.879s 2) database "atts_db", 1100 documents, each with a size between 1 Kb and 2 Kb, and all with 2 or 3 attachments, with sizes 30 bytes, 41 Kb and 2 Mb In a Wifi network, current replicator: 16m56.750s new replicator: 16m50.552s To remote target fdmanana.couchone.com/atts_db: current replicator: 56m3.895s new replicator: 40m58.584s 3) database "large_6_14", 93 750 documents, each with a size between 6 Kb and 14 Kb, no attachments In a Wifi network, current replicator: 26m56.383s new replicator: 22m22.682s To remote target fdmanana.couchone.com/large_6_14: current replicator: couldn't finish, waited for more than one hour, seemed to hang after replicating 1005 documents new replicator: 47m3.608s 4) database "large1kb", 341 298 documents, each with a size of about 1 Kb, no attachments In a Wifi network, current replicator: 11m19.052s new replicator: 7m26.456s To remote target fdmanana.couchone.com/large1kb: current replicator: 26m49.501s new replicator: 18m01.606s Each of these test replications was done at least 3 times, and the response times were about the same without a significant variance. Raising the number of http connections (and/or pipeline size) and worker process to 6 or 8 didn't offer to me very significant gains, since the whole process was already mostly network IO bound, but this surely depends on the specific network, hardware and eventually the OS. Memory and CPU usage is about the same (with the defaults settings) as the current replicator. Also the new replicator is more careful about buffering too much data and JSON encoding very large lists of documents (to send through the _bulk_docs API). This is a very big change to add to CouchDB, therefore I would like to have others test this branch and report eventual problems. If no one has an objection, I would like to apply this to trunk in about 1 week or so. I would also like to have Adam's review. I'll fill in a Jira ticket by tomorrow. Also, during the development many ibrowse issues were found, reported and fixed (mostly related to streaming and chunked responses). Therefore I would like to thank Chandru (ibrowse's author) for fixing many of them and accepting patches for the remaining. Those issues affected both the new and the current replicator. regards, -- Filipe David Manana, fdman...@gmail.com, fdman...@apache.org "Reasonable men adapt themselves to the world. Unreasonable men adapt the world to themselves. That's why all progress depends on unreasonable men."