On Aug 16, 2011, at 10:31 PM, Jason Smith wrote:

> On Tue, Aug 16, 2011 at 9:26 PM, Adam Kocoloski <kocol...@apache.org> wrote:
>> One of the principal uses of the replicator is to "make this database look 
>> like that one".  We're unable to do that in the general case today because 
>> of the combination of validation functions and out-of-order document 
>> transfers.  It's entirely possible for a document to be saved in the source 
>> DB prior to the installation of a ddoc containing a validation function that 
>> would have rejected the document, for the replicator to install the ddoc in 
>> the target DB before replicating the other document, and for the other 
>> document to then be rejected by the target DB.
> 
> Somebody asked about this on Stack Overflow. It was a very simple but
> challenging question, but now I can't find it. Basically, he made your
> point above.
> 
> Aren't you identifying two problems, though?
> 
> 1. Sometimes you need to ignore validation to just make a nice, clean copy.
> 2. Replication batches (an optimization) are disobeying the change
> sequence, which can screw up the replica.

As far as I know, the only reason one needs to ignore validation to make a nice, 
clean copy is that the replicator does not guarantee that updates are applied 
on the target in the order they were received on the source.  It's all one 
issue to me.
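
A minimal sketch of that ordering problem, using a hypothetical in-memory model. The names `TargetDB`, `validate`, and `replicate` are illustrative assumptions, not CouchDB's replicator API, and the validator is a plain Python stand-in for a validate_doc_update function:

```python
# Hypothetical in-memory model of a replication target; not CouchDB code.

def validate(doc):
    """Stand-in for a validate_doc_update function: require an 'author' field."""
    return "author" in doc


class TargetDB:
    def __init__(self):
        self.docs = {}
        self.validator = None  # set once a ddoc carrying a validator arrives

    def write(self, doc):
        """Apply one update; return False if the installed validator rejects it."""
        if doc["_id"].startswith("_design/"):
            # In this simplified model a design doc is always accepted
            # and installs its validator immediately.
            self.docs[doc["_id"]] = doc
            self.validator = doc["validate"]
            return True
        if self.validator is not None and not self.validator(doc):
            return False
        self.docs[doc["_id"]] = doc
        return True


def replicate(updates):
    """Apply updates in the given order; return (stored ids, rejected ids)."""
    target = TargetDB()
    rejected = [d["_id"] for d in updates if not target.write(d)]
    return sorted(target.docs), rejected


old_doc = {"_id": "note-1", "text": "saved before the validator existed"}
ddoc = {"_id": "_design/rules", "validate": validate}

in_order = replicate([old_doc, ddoc])      # same order as on the source
out_of_order = replicate([ddoc, old_doc])  # ddoc transferred first
```

With source ordering both documents land on the target; with the ddoc transferred first, `note-1` is rejected even though it is a legitimate document on the source.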

> I responded to #1 already.
> 
> But my feeling about #2 is that the optimization goes too far.
> Replication batches should always have boundaries immediately before
> and after design documents. In other words, batch all you want, but
> design documents [1] must always be in a batch size of 1. That will
> retain the semantics.
> 
> [1] Actually, the only ddocs needing their own private batches are
> those with a validate_doc_update field.
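
The batching rule proposed in the quoted text can be sketched as follows. This is only an illustration under stated assumptions: `batches` and `needs_own_batch` are hypothetical helpers, the changes feed is modeled as a plain list of dicts in sequence order, and none of this is the replicator's actual batching code.

```python
def needs_own_batch(doc):
    """True for design docs that carry a validate_doc_update field."""
    return doc["_id"].startswith("_design/") and "validate_doc_update" in doc


def batches(changes, max_size):
    """Group changes into batches of at most max_size, in sequence order,
    but give every validating ddoc a private batch of size 1."""
    current = []
    for doc in changes:
        if needs_own_batch(doc):
            if current:
                yield current  # close the batch before the ddoc
                current = []
            yield [doc]        # the ddoc travels alone
        else:
            current.append(doc)
            if len(current) == max_size:
                yield current
                current = []
    if current:
        yield current


changes = [
    {"_id": "a"},
    {"_id": "b"},
    {"_id": "_design/auth", "validate_doc_update": "function(newDoc, oldDoc, userCtx) {}"},
    {"_id": "c"},
    {"_id": "_design/views", "views": {}},  # no validator, so batched normally
    {"_id": "d"},
]
result = [[doc["_id"] for doc in batch] for batch in batches(changes, max_size=3)]
```

The boundaries only preserve the intended semantics if the batches themselves are applied to the target in sequence order.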

My standard retort to transaction boundaries is that there is no global 
ordering of events in a distributed system.  A clustered CouchDB can try to 
build a vector clock out of the change sequences of the individual servers and 
stick to that merged sequence during replication, but even then the ddoc entry 
in the feed could be "concurrent" with several other updates.  I rather like 
that the replicator aggressively mixes up the ordering of updates because it 
prevents us from making choices in the single-server case that aren't sensible 
in a cluster.
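
To illustrate what "concurrent" means here, a minimal vector clock comparison, assuming each clock is a dict mapping a node name to that node's update sequence. The `compare` function and the node names are hypothetical; this is not how any CouchDB cluster actually tracks ordering.

```python
def compare(a, b):
    """Compare two vector clocks given as {node: sequence} dicts.

    Returns "before", "after", "equal", or "concurrent".
    A node missing from a clock counts as sequence 0.
    """
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"


# The ddoc was written on node1 after sequence 4 there, but it has only
# seen node2 up to sequence 2; the other doc has seen node2 up to 3.
ddoc_clock = {"node1": 5, "node2": 2}
doc_clock = {"node1": 4, "node2": 3}

relation = compare(ddoc_clock, doc_clock)
```

Neither clock dominates the other, so no merged sequence can say whether the ddoc or the other update came first.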

By the way, I don't consider this line of discussion presumptuous in the least.
Cheers,

Adam
