new replicator

Filipe David Manana Mon, 10 Jan 2011 10:36:37 -0800

Hi all,

It's been some months now that I resumed Damien's initial work on a
new replicator. It's now in a state that's perfectly usable
(hopefully), functional and with all the features of the current
replicator.
Its code is at:


https://github.com/fdmanana/couchdb/compare/trunk_new_replicator

(the ASF's svn branch "new_replicator" is too aged now and has several
issues with pull replications)


What it brings to the table:

- When new revisions of a document need to be replicated to the
target, it only replicates the attachments introduced in the missing
revisions, that is, unlike the current replicator it doesn't replicate
all the attachments (introduced since the first revision) - this makes
a huge difference for databases with many and/or large attachments;

- It exploits parallelism more aggressively - the current replicator
uses a single process for finding which revisions are missing in the
target, a single process to write documents to the target and for push
replications a single process that reads documents from the local
source (for pull replications it spawns several processes to read
documents from the remote source). The new replicator uses several (#
is configurable) processes to find the missing revisions and copy
documents (read from source and write to target);

- Progress is done faster when attachments are present. For a batch of
documents, the current replicator fetches first the body of the
documents and only when it decides to flush them to disk, for each
attachment it opens a separate connection to download each. This means
that only after all the attachments for the documents in the batch are
fetched, a checkpoint is done. The new replicator fetches documents
together with their attachments (in compressed form if they're
compressed at the source) in a single request and flushes them as soon
as possible;

- Better error isolation. Currently, HTTP connections are shared by
different replications. This behaviour is more like a "side effect" of
using ibrowse's load balancer. The new replicator has its own load
balancer implementation which ensures that all the requests in the
pipeline of a connection belong to the same replication;

- It was completely rewritten from scratch. Hopefully the code is
better organised now and slightly shorter;

- Better logging (more meaningful information about connectivity
errors and document write failures) and integration with the
replicator database.



There are now some new .ini replicator configuration options (
https://github.com/fdmanana/couchdb/blob/trunk_new_replicator/etc/couchdb/default.ini.tpl.in#L138
) under the [replicator] section:

"worker_processes" - the number of processes that copy documents from
the source database to the target database. For each one of them, a
missing revisions process is also spawned. For example, if set to 4
(default) we get 8 processes: 4 for copying documents and 4 to find
which revisions are missing in the target;

"worker_batch_size" - the maximum number of consecutive source
sequence numbers each worker processes at once. Default value is 1000.
Lower values mean that checkpoints are done more frequently and are
recommended when the source has very large documents and/or most
documents with many and/or large attachments. Higher values can make
the replicator process faster only when most documents are very small
(below a few kilobytes) and there are none or very few with
attachments;

"http_connections" - the maximum number of HTTP connections per replication;

"http_pipeline_size" - the maximum pipeline size for each HTTP connection;

"connection_timeout" - the period of time (in milliseconds) after
which a connection is considered dead because no data was received
(and up to 10 retries are done). The default value is 30 000
milliseconds;

"socket_options" - options to pass to the TCP sockets. All the
available options are listed in the man page of the Erlang module
'inet' (as well as in the man page for the system call setsockopt).
Some of these options, such as priority, sndbuf and recbuf, can make a
significant difference if set to the correct values, which depend on
the OS, hardware and network characteristics - a good tutorial for
this can be found at:
http://www.ibm.com/developerworks/linux/library/l-hisock.html

Any of these options can also be specified for individual replications
in the replication document/object. Example:

$ curl -H 'Content-Type: application/json' -X POST
http://myserver.com/_replicate  -d  '{ "source": "http://abc.com/foo";,
"target": "bar", "worker_processes": 6, "http_connections": 20,
"http_pipeline_size": 100}


Now, for a comparison with the current replicator, here are some
results for replications done from scratch (empty target) in a local
Wifi network and from my laptop (Europe, Portugal) to my CouchOne
account ( http://fdmanana.couchone.com/_utils/ ). Time was measured
using the 'time' command like this:

$ time curl -X POST http://foobar.com/_replicate .....

--- P U L L   R E P L I C A T I O N S ---

1) database "no_atts_db" - 79 743 documents, each with a size between
500 bytes and 3 Kb, no attachments

In a Wifi network,

current replicator:   5m2.898s
new replicator:        4m6.517s

>From remote source fdmanana.couchone.com/no_atts_db:

current replicator:   4m33.673s
new replicator:        4m7.709s

2) database "atts_db", 1100 documents, each with a size between 1 Kb
and 2 Kb, and all with 2 or 3 attachments, with sizes 30 bytes, 41 Kb
and 2 Mb

In a Wifi network,

current replicator:    17m27.785s
new replicator:         16m1.450s

>From remote source fdmanana.couchone.com/atts_db:

current replicator:     39m45.223s  (first run),   41m11.525s  (second run)
new replicator:          7m50.914s (first run),   7m38.588s  (second run)

3) database "large_6_14", 93 750 documents, each with a size between 6
Kb and 14 Kb, no attachments

In a Wifi network,

current replicator:      21m40.060s
new replicator:          18m59.212s

>From remote source fdmanana.couchone.com/large_6_14:

current replicator:     10m42.713s
new replicator:          10m2.444s

4) database "large1kb", 341 298 documents, each with a size of about 1
Kb, no attachments

In a Wifi network,

current replicator:    16m17.937s
new replicator:         11m7.115s

>From remote source fdmanana.couchone.com/large1kb:

current replicator:     21m49.032s
new replicator:          17m44.142s



--- P U S H   R E P L I C A T I O N S ---

1) database "no_atts_db" - 79 743 documents, each with a size between
500 bytes and 3 Kb, no attachments

In a Wifi network,

current replicator:    3m58.723s
new replicator:         3m23.846s

To remote target fdmanana.couchone.com/no_atts_db:

current replicator:    9m56.862s
new replicator:         8m24.879s

2) database "atts_db", 1100 documents, each with a size between 1 Kb
and 2 Kb, and all with 2 or 3 attachments, with sizes 30 bytes, 41 Kb
and 2 Mb

In a Wifi network,

current replicator:      16m56.750s
new replicator:           16m50.552s

To remote target fdmanana.couchone.com/atts_db:

current replicator:     56m3.895s
new replicator:          40m58.584s

3) database "large_6_14", 93 750 documents, each with a size between 6
Kb and 14 Kb, no attachments

In a Wifi network,

current replicator:     26m56.383s
new replicator:          22m22.682s

To remote target fdmanana.couchone.com/large_6_14:

current replicator:    couldn't finish, waited for more than one hour,
seemed to hang after replicating 1005 documents
new replicator:         47m3.608s

4) database "large1kb", 341 298 documents, each with a size of about 1
Kb, no attachments

In a Wifi network,

current replicator:    11m19.052s
new replicator:         7m26.456s

To remote target fdmanana.couchone.com/large1kb:

current replicator:    26m49.501s
new replicator:         18m01.606s


Each of these test replications was done at least 3 times, and the
response times were about the same without a significant variance.

Raising the number of http connections (and/or pipeline size) and
worker process to 6 or 8 didn't offer to me very significant gains,
since the whole process was already mostly network IO bound, but this
surely depends on the specific network, hardware and eventually the
OS.

Memory and CPU usage is about the same (with the defaults settings)
as the current replicator. Also the new replicator is more careful
about buffering too much data and JSON encoding very large lists of
documents (to send through the _bulk_docs API).


This is a very big change to add to CouchDB, therefore I would like to
have others test this branch and report eventual problems. If no one
has an objection, I would like to apply this to trunk in about 1 week
or so. I would also like to have Adam's review.
I'll fill in a Jira ticket by tomorrow.


Also, during the development many ibrowse issues were found, reported
and fixed (mostly related to streaming and chunked responses).
Therefore I would like to thank Chandru (ibrowse's author) for fixing
many of them and accepting patches for the remaining. Those issues
affected both the new and the current replicator.


regards,

--
Filipe David Manana,
[email protected], [email protected]

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."

new replicator

Reply via email to