On May 25, 2010, at 2:50 AM, Filipe David Manana wrote:

> Hi all,
>
> I've reworked some implementation details. Namely, the replication
> gen_servers now have an ID that is no longer based on the replication
> document ID but instead on the md5 of the replication properties (source,
> target, etc., as is currently done when we POST to _replicate). This
> avoids having identical replications going on, at the expense of slightly
> more complex code.
>
> From the user's point of view, everything is pretty much the same as I
> announced before. The only differences are:
>
> - when a replication is started by adding a replication document to the
> _replicator DB, the replicator, besides adding the field "state" with value
> "triggered" to the replication document, also adds the field
> "replication_id" (with this field's value, we can access the replication
> log/checkpoint documents, as Adam suggested before).
>
> - if the user adds a second document that in fact describes a replication
> already triggered by a previous document (same source, target, etc.), this
> second document will not get a "state" field added to it. However, the
> replicator adds the "replication_id" field to it. This is nice IMO, since we
> can add a view whose keys are the "replication_id" values
> and see which replication documents are duplicates.
>
> - deleting a duplicate replication document (a document that didn't
> trigger a replication, since an earlier one already triggered that
> replication) doesn't stop the replication. To stop it, we have to delete the
> document that triggered the replication - we can find it by searching for a
> document with the same "replication_id" and "state" set to "triggered".
>
> For more details, check the JavaScript test suite:
> http://github.com/fdmanana/couchdb/blob/new_replicator_db/share/www/script/test/replicator_db.js
> It's maybe easier to understand the _replicator DB by looking at the tests.
> It's very simple from a user's point of view.
>
> The whole patch can be found in a new branch at:
> http://github.com/fdmanana/couchdb/compare/new_replicator_db
>
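To make the md5-based ID and the duplicate lookup described above a bit more concrete, here is a rough Python sketch. It is not the Erlang code from the branch: the choice of which properties go into the hash, the JSON canonicalization, and the helper names (replication_id, group_by_replication_id, triggering_doc) are all assumptions made for illustration. The grouping function only mimics what a view keyed on "replication_id" would return.

```python
import hashlib
import json
from collections import defaultdict


def replication_id(props):
    """Hypothetical ID: md5 over the properties that define a replication."""
    # Assumption for this sketch: only these keys take part in the hash.
    keys = ("source", "target", "continuous", "filter")
    defining = {k: props.get(k) for k in keys}
    # Sorted keys make the hash independent of property order.
    canonical = json.dumps(defining, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


def group_by_replication_id(docs):
    """Mimic a view keyed on "replication_id": key -> list of doc ids."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc["replication_id"]].append(doc["_id"])
    return dict(groups)


def triggering_doc(docs, rep_id):
    """Return the _id of the document that actually started the replication."""
    for doc in docs:
        if doc.get("replication_id") == rep_id and doc.get("state") == "triggered":
            return doc["_id"]
    return None


docs = [
    {"_id": "rep_a", "source": "http://a/db", "target": "http://b/db"},
    {"_id": "rep_b", "target": "http://b/db", "source": "http://a/db"},
    {"_id": "rep_c", "source": "http://a/db", "target": "http://c/db"},
]
for doc in docs:
    doc["replication_id"] = replication_id(doc)
docs[0]["state"] = "triggered"  # rep_a started the replication; rep_b is a duplicate
docs[2]["state"] = "triggered"

groups = group_by_replication_id(docs)
duplicates = {k: v for k, v in groups.items() if len(v) > 1}
```

Here rep_a and rep_b end up in the same group, and triggering_doc() picks out rep_a as the document to delete in order to stop that replication.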
hmm, 3rd time I've tried to send this...

I've been working on the test cases for the replicator db, to remove wait()
from the tests. I think this will make them more robust as well. Instead of
waiting, I wrote functions: one that checks a replication doc for
state == "complete" or another state, and one that waits for the update_seq
of two databases to match.

There are a couple of places where I had to leave wait() in. These are in
spots with assertions that a particular replication *did stop* when a
document is deleted. So you have to wait and then see if the docs are there
or not. I can't think of a way to test for this otherwise. (Unless maybe
active_tasks is accurate enough to use in these assertions.)

I plan to dig into the meat of the patch soon but wanted to start with the
tests. The commit is here:

http://github.com/jchris/couchdb/tree/fdm/nrd

Thanks for all the hard work, Filipe, and everyone who's giving feedback.

Chris

> Later on I'll add a patch to a Jira ticket.
>
> cheers
>
>
>
> On Wed, May 19, 2010 at 10:31 AM, Filipe David Manana
> <fdman...@gmail.com> wrote:
>
>> Dear all,
>>
>> I've been working on the _replicator DB along with Chris. Some of you have
>> already heard about this DB in the mailing list, IRC, or whatever.
>> Its purpose:
>>
>> - replications can be started by adding a replication document to the
>> replicator DB _replicator (its name can be configured in the .ini files)
>>
>> - replication documents are basically the same JSON structures that we
>> currently use when POSTing to _replicate/ (and we can give them an
>> arbitrary id)
>>
>> - to cancel a replication, we simply delete the replication document
>>
>> - after the replication is started, the replicator adds the field "state"
>> to the replication document with value "triggered"
>>
>> - when the replication finishes (for non-continuous replications), the
>> replication sets the doc's "state" field to "completed"
>>
>> - if an error occurs during a replication, the corresponding replication
>> document will have the "state" field set to "error"
>>
>> - after an error is detected, the replication is restarted
>> after some time (10s for now, but maybe it should be configurable)
>>
>> - after a server restart/crash, CouchDB will remember replications and will
>> restart them (this is especially useful for continuous replications)
>>
>> - in the replication document we can define a "user_ctx" property, which
>> defines the user name and/or role(s) under which the replication will
>> execute
>>
>>
>>
>> Some restrictions regarding the _replicator DB:
>>
>> - only server admins can add and delete replication documents
>>
>> - only the replicator itself can update replication documents - this is to
>> avoid having race conditions between the replicator and server admins trying
>> to update replication documents
>>
>> - the above point implies that to change a replication you have to add a
>> new replication document
>>
>> All these restrictions are in the replicator DB design doc -
>> http://github.com/fdmanana/couchdb/blob/replicator_db/src/couchdb/couch_def_js_funs.hrl<http://github.com/fdmanana/couchdb/blob/_replicator_db/src/couchdb/couch_def_js_funs.hrl>
>>
>>
>> The code is fully working and is located at:
>> http://github.com/fdmanana/couchdb/tree/replicator_db
>>
>> It includes a comprehensive JavaScript test case.
>>
>> Feel free to try it and give your feedback. There are still some TODOs as
>> comments in the code, so it's still subject to change.
>>
>>
>> For people more involved with CouchDB internals and development:
>>
>> That branch breaks the stats.js test and, occasionally, the
>> delayed_commits.js tests.
>>
>> It breaks stats.js because:
>>
>> - internally CouchDB uses the _changes API to be aware of the
>> addition/update/deletion of replication documents to/from the _replicator
>> DB. The _changes implementation constantly opens and closes the DB (opens
>> are triggered by a gen_event). This affects the stats open_databases and
>> open_os_files.
>>
>> It breaks delayed_commits.js occasionally because:
>>
>> - by listening to _replicator DB changes, an extra file descriptor is used,
>> which counts against the max_open_dbs config parameter. This parameter
>> limits the max number of DBs opened by users. This causes the error {error,
>> all_dbs_active} (from couch_server.erl) during the execution of
>> delayed_commits.js (as well as stats.js).
>>
>> I also have another branch that fixes these issues in a "dirty" way:
>> http://github.com/fdmanana/couchdb/tree/_replicator_db (has a big comment
>> in couch_server.erl explaining the hack)
>>
>> Basically it doesn't increment stats for the _replicator DB, bypasses
>> max_open_dbs when opening the _replicator DB, and doesn't allow it to
>> be closed in favour of a user-requested DB (as if it assigned a +infinite
>> LRU time to this DB).
>>
>> Sometimes (although very rarely) I also get the all_dbs_active error when
>> the authentication handlers are executing (because they open the _users DB).
>> This is not caused by my _replicator DB code at all, since I get it with
>> trunk as well.
>>
>> I would also like to collect feedback about what to do regarding these two
>> issues, especially max_open_dbs. Somehow I feel that no matter how many user
>> DBs are open, it should always be possible to open the _replicator DB
>> internally (and the _users DB).
>>
>>
>> cheers
>>
>>
>> --
>> Filipe David Manana,
>> fdman...@gmail.com
>>
>> "Reasonable men adapt themselves to the world.
>> Unreasonable men adapt the world to themselves.
>> That's why all progress depends on unreasonable men."
>>
>>
>
>
> --
> Filipe David Manana,
> fdman...@gmail.com
>
> "Reasonable men adapt themselves to the world.
> Unreasonable men adapt the world to themselves.
> That's why all progress depends on unreasonable men."
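
For reference, the idea from the reply of polling a replication document's "state" field instead of sleeping for a fixed time could look roughly like the Python sketch below. This is not the helper from the JavaScript test suite: wait_for_state, its parameters, and the in-memory FakeReplicatorDb are hypothetical names invented for this example, and a real test would fetch the document over HTTP instead.

```python
import time


def wait_for_state(fetch_doc, doc_id, wanted, timeout=5.0, interval=0.05):
    """Poll fetch_doc(doc_id) until its "state" equals wanted, or time out."""
    deadline = time.monotonic() + timeout
    while True:
        doc = fetch_doc(doc_id)
        if doc is not None and doc.get("state") == wanted:
            return doc
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "doc %r did not reach state %r in %.2fs" % (doc_id, wanted, timeout)
            )
        time.sleep(interval)


class FakeReplicatorDb:
    """Stand-in for the _replicator DB: reports "triggered" for a fixed
    number of fetches, then "completed"."""

    def __init__(self, polls_until_done):
        self.remaining = polls_until_done

    def fetch(self, doc_id):
        state = "triggered" if self.remaining > 0 else "completed"
        self.remaining -= 1
        return {"_id": doc_id, "state": state}


db = FakeReplicatorDb(polls_until_done=3)
doc = wait_for_state(db.fetch, "rep_a", "completed", timeout=2.0, interval=0.01)
```

The timeout keeps a test from hanging forever if the state never changes. As noted in the reply, this kind of polling does not help with asserting that a replication *did stop*, since there is no state to wait for in that case.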