Re: CouchDB 2.0 & out of order changes

Robert Samuel Newson Sat, 03 Sep 2016 09:42:13 -0700

Hi,

It is important to understand that the order of rows in the _changes response 
is not important. In couchdb before 2.0 the response was totally ordered, but 
this was never necessary for correctness. The essential contract for _changes 
is that you are guaranteed to see all changes made since the 'since' parameter 
you pass. The order of those changes is not guaranteed and it is also not 
guaranteed that changes from _before_ that 'since' value are _not_ also 
returned. The consequence of this contract is that all consumers of the 
_changes response must apply each row idempotently. This is true for the 
replicator, of course.
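
To make "apply each row idempotently" concrete, here is a minimal Python sketch. The in-memory target (a dict of _id to the set of revs it holds) and the name apply_row are assumptions made for illustration only; the row shape follows the usual _changes row with "seq", "id" and "changes" fields.

```python
from typing import Dict, Set


def apply_row(target: Dict[str, Set[str]], row: dict) -> bool:
    """Ensure the target holds every rev named in one _changes row.

    Returns True if the target changed, False if the row was already
    applied. Applying the same row again is a no-op.
    """
    revs = target.setdefault(row["id"], set())
    changed = False
    for change in row["changes"]:
        if change["rev"] not in revs:
            revs.add(change["rev"])
            changed = True
    return changed


# Hypothetical rows, shaped like the entries of a _changes response.
rows = [
    {"seq": "2-XXXX", "id": "docA", "changes": [{"rev": "2-abc"}]},
    {"seq": "3-XXXX", "id": "docB", "changes": [{"rev": "1-def"}]},
]

target: Dict[str, Set[str]] = {}
# Deliver every row twice, the second time in reverse order.
for row in rows + rows[::-1]:
    apply_row(target, row)
```

The final state of the target is the same whatever order the rows arrive in and however often they are repeated, which is all the contract asks of a consumer.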

The changes response in 2.0 is partially ordered. The changes from any given 
shard will be in a consistent order, but we merge the changes from each shard 
range of your database as they are collected from the various contributing 
nodes, and we don't apply a total ordering over that. The reason is simple: it's 
expensive and unnecessary. It's important to also remember that replication, 
even before 2.0, would not replicate in strict source update order either, due 
to (valuable) parallelism when reading changes and applying them.
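
As a toy illustration of "partially ordered" (not how the clustering code actually works), the Python sketch below interleaves per-shard streams in an arbitrary arrival order. The function merge_shards, the shard contents and the use of a seeded random generator to stand in for network timing are all assumptions for the example.

```python
import random
from collections import deque
from typing import List, Tuple

Row = Tuple[str, int]  # (shard name, per-shard update number)


def merge_shards(shard_streams: List[List[Row]], rng: random.Random) -> List[Row]:
    """Interleave shard streams in arbitrary order, keeping each shard's own order."""
    queues = [deque(stream) for stream in shard_streams if stream]
    merged: List[Row] = []
    while queues:
        # Whichever shard "responds" next contributes its next row.
        queue = rng.choice(queues)
        merged.append(queue.popleft())
        queues = [q for q in queues if q]  # drop exhausted shards
    return merged


shard_a = [("a", 1), ("a", 2), ("a", 3)]
shard_b = [("b", 1), ("b", 2)]
merged = merge_shards([shard_a, shard_b], random.Random(2016))

# Rows from one shard keep their relative order in the merged output.
from_a = [row for row in merged if row[0] == "a"]
from_b = [row for row in merged if row[0] == "b"]
```

Rows from different shards can land in any relative position, but filtering the merged output back down to a single shard always gives that shard's original order.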

Your question: "Is it possible for the changes feed to send older changes 
before newer changes for the same document ID across multiple calls?" requires 
a little further background knowledge before answering.

While we call it a changes "feed" it's important to remember what it really is, 
internally, first. Every database in couchdb, prior to 2.0, is a single file 
with multiple b+trees recorded inside it that are kept in absolute sync with 
each other. One b+tree allows you to look up a document by the _id parameter. 
The other b+tree allows you to look up a document by its update order. It is 
essential to note that these two b+trees have the same number of key/value 
pairs in them at all times.

To illustrate this more clearly, consider an empty database. We add one 
document to it. It is retrievable by its _id and is also visible in the 
_changes response as change number 1. Now, we update that document. It is now 
change number 2. Change number 1 will never again appear in the _changes 
response. That is, every document appears in the _changes response at its most 
recent update number.

When you call _changes without the continuous parameter, couchdb is simply 
traversing that second b+tree and returning each row it finds. It may do this 
from the beginning (which was 1 before our update and 2 after) or it may do so 
from some update seq you supply with the 'since' parameter.
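
A toy Python model of the two indexes may help here. It uses plain dicts in place of b+trees, and the class and method names (ToyDatabase, save, changes) are invented for the illustration; it only mirrors the behaviour described above, not couchdb's actual storage code.

```python
from typing import Dict, List, Tuple


class ToyDatabase:
    """Two indexes kept in sync: one by _id, one by update order."""

    def __init__(self) -> None:
        self.update_seq = 0
        self.by_id: Dict[str, Tuple[int, dict]] = {}  # _id -> (seq, body)
        self.by_seq: Dict[int, str] = {}              # seq -> _id

    def save(self, doc_id: str, body: dict) -> int:
        previous = self.by_id.get(doc_id)
        if previous is not None:
            # The document moves: its old position is removed entirely.
            del self.by_seq[previous[0]]
        self.update_seq += 1
        self.by_id[doc_id] = (self.update_seq, body)
        self.by_seq[self.update_seq] = doc_id
        return self.update_seq

    def changes(self, since: int = 0) -> List[Tuple[int, str]]:
        """Traverse the by-seq index in order, starting after 'since'."""
        return [(seq, self.by_seq[seq]) for seq in sorted(self.by_seq) if seq > since]


db = ToyDatabase()
db.save("docA", {"n": 1})
after_create = db.changes()   # docA appears as change number 1
db.save("docA", {"n": 2})
after_update = db.changes()   # docA now appears only as change number 2
```

After the update, change number 1 is gone from the model and both indexes still hold exactly one entry each.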

With that now understood, we can look at what changes when we do 
continuous=true which is what makes it a "feed" (that is, a potentially 
unending response of changes as they are made). This is sent in two phases. The 
first is exactly as the previous paragraph. Once all those changes have been 
sent, couchdb enters a loop where it returns updates as they happen (or shortly 
after).

It is only in a continuous=true response in couchdb before 2.0 that you would 
ever see more than one change for any given document.

So, to cut a long story short (too late), the answer to your question is "no". 
The changes feed is not a permanent history of all changes made to all 
documents. Once a document is updated, it is _moved_ to a newer position and no 
longer appears in its old one (and no record of that position is even 
preserved). Do note, though, that couchdb might return 'Doc A change (seq: 
2-XXXX)' even if your 'since' parameter is _after_ the last change to doc A. We 
won't return 'Doc A change (seq: 1-XXXX)' at all after it's updated to 2-XXXX.

The algorithm for correctly processing the changes response is as follows, and 
any variation on this is likely broken:

1) call /_changes?since=0
2) for each returned row, ensure the target has the change in question (either 
use _id + _rev to prevent duplicate application of the change or apply the 
change in a way that is idempotent)
3) periodically store the update seq of the last processed row to stable 
storage (a _local document is a good choice)

If you wish to resume applying changes after a shutdown, reboot, or crash, 
repeat the above process but substitute your stored update sequence in the 
?since= parameter.
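
The steps above can be sketched in Python roughly as follows. The four callables (fetch_changes, apply_row, load_checkpoint, save_checkpoint) are hypothetical hooks you would supply yourself, for example an HTTP GET of /_changes?since=... and a read/write of a _local document. The fake feed uses small integer strings as update seqs purely for the demo; real 2.0 update seqs should be treated as opaque.

```python
from typing import Callable, Optional


def process_changes(
    fetch_changes: Callable[[str], dict],
    apply_row: Callable[[dict], None],
    load_checkpoint: Callable[[], Optional[str]],
    save_checkpoint: Callable[[str], None],
    checkpoint_every: int = 100,
) -> str:
    """Run one pass over _changes, resuming from the stored update seq."""
    since = load_checkpoint() or "0"           # step 1, or resume after a crash
    response = fetch_changes(since)
    for count, row in enumerate(response["results"], start=1):
        apply_row(row)                         # step 2: must be idempotent
        if count % checkpoint_every == 0:
            save_checkpoint(row["seq"])        # step 3: only after applying
    save_checkpoint(response["last_seq"])
    return response["last_seq"]


# In-memory stand-ins for the demo (all hypothetical).
FEED = [
    {"seq": "1", "id": "docB", "changes": [{"rev": "1-def"}]},
    {"seq": "2", "id": "docA", "changes": [{"rev": "2-abc"}]},
    {"seq": "3", "id": "docC", "changes": [{"rev": "1-ghi"}]},
]
applied = {}
checkpoint = {"seq": None}


def fake_fetch(since: str) -> dict:
    rows = [row for row in FEED if int(row["seq"]) > int(since)]
    return {"results": rows, "last_seq": rows[-1]["seq"] if rows else since}


def remember_rev(row: dict) -> None:
    applied[row["id"]] = row["changes"][0]["rev"]


def load() -> Optional[str]:
    return checkpoint["seq"]


def save(seq: str) -> None:
    checkpoint["seq"] = seq


first_pass = process_changes(fake_fetch, remember_rev, load, save)
second_pass = process_changes(fake_fetch, remember_rev, load, save)
```

The checkpoint is written only after the corresponding rows have been applied, so a crash between the two can only cause rows to be re-delivered, never skipped, and re-delivery is harmless when step 2 is idempotent.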

There are many things that use the changes feed in this way. Within couchdb, 
there's database replication (obviously) but also couchdb views. Outside of the 
core, software like pouchdb and couchdb-lucene use the changes feed to 
replicate data or update search indexes.

I hope this was useful, and I think it might expose some problems in your 
couchdb-to-sqlite synchronisation protocol. Your email is obviously silent on 
many details there, but if you've predicated its design on the total ordering 
properties of couchdb < 2.0, you likely have some work to do.

B.

> On 3 Sep 2016, at 00:04, Robert Payne <[email protected]> wrote:
> 
> Hey Everyone,
> 
> Reading up on the CouchDB 2.0 migration guides and getting a bit antsy around 
> the mentions of out of order changes feed and sorts. Is it possible for the 
> changes feed to send older changes before newer changes for the same document 
> ID across multiple calls?
> 
> Assuming start at ?since=“” and always pass in the “last_seq” on every 
> additional call could a situation like this occur in a single or multiple 
> HTTP calls:
> 
> — Changes feed emits Doc A change (seq: 2-XXXX)
> — Changes feed emits Doc B change (seq: 3-XXXX)
> — Changes feed emits Doc A change (seq: 1-XXXX)
> 
> I’m really hoping the case is just that across different doc ids changes can 
> be out of order. Our use case on mobile is a bit particular as we duplicate 
> edits into a separate SQLite table and use the changes feed to keep the local 
> database up to date with winning revs from the server, it just increases the 
> performance of sync by a ton since there is only 1 check and set in SQLite 
> per change that comes in.
> 
> Cheers,
> Robert
