Isaac Z. Schlueter created COUCHDB-2236:
-------------------------------------------

             Summary: Weird _users doc conflict when replicating from 1.5.1 -> 1.6.0
                 Key: COUCHDB-2236
                 URL: https://issues.apache.org/jira/browse/COUCHDB-2236
             Project: CouchDB
          Issue Type: Bug
      Security Level: public (Regular issues)
            Reporter: Isaac Z. Schlueter


The upstream write-master for npm is a CouchDB 1.5.0.  (Since it is locked down 
at the IP level, we're not at risk from the DoS fixed in 1.5.1.)

All PUT/POST/DELETE requests are routed to this master box, as well as any 
request with `?write=true` on the URL.  (Used for cases where we still do the 
PUT/409/GET/PUT dance, rather than using a custom _update function.)
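
For readers unfamiliar with the PUT/409/GET/PUT dance, here is a minimal sketch of the pattern. It runs against an in-memory stand-in rather than a real CouchDB; `FakeDB`, `Conflict`, and `put_with_retry` are hypothetical names used only for illustration, and the revision strings are simplified.

```python
class Conflict(Exception):
    """Stands in for an HTTP 409 response from the database."""


class FakeDB:
    """In-memory stand-in for a database with revision checks (hypothetical)."""

    def __init__(self):
        self.docs = {}

    def get(self, doc_id):
        # Return a copy so callers cannot mutate stored state.
        return dict(self.docs[doc_id])

    def put(self, doc):
        current = self.docs.get(doc["_id"])
        # Reject the write if the caller's _rev is not the current one.
        if current is not None and current["_rev"] != doc.get("_rev"):
            raise Conflict(doc["_id"])
        generation = int(current["_rev"].split("-")[0]) + 1 if current else 1
        stored = dict(doc, _rev="{}-fake".format(generation))
        self.docs[doc["_id"]] = stored
        return stored["_rev"]


def put_with_retry(db, doc, max_attempts=3):
    """PUT; on a conflict, GET the latest _rev and PUT again."""
    for _ in range(max_attempts):
        try:
            return db.put(doc)
        except Conflict:
            latest = db.get(doc["_id"])
            doc = dict(doc, _rev=latest["_rev"])
    raise Conflict(doc["_id"])


db = FakeDB()
db.put({"_id": "some-doc", "value": 1})
new_rev = put_with_retry(db, {"_id": "some-doc", "value": 2})
print(new_rev, db.get("some-doc")["value"])
```

A custom _update function avoids the extra round trips because the read-modify-write happens on the server.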

This master box replicates to a replication hub.  The read slaves all replicate 
from the replication hub.  Both the /registry and /_users databases replicate 
continuously using a doc in the /_replicator database.
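
As a rough illustration of that setup, a continuous pull replication doc for the /_replicator database could be built like this. The hostname, doc id, and helper name are hypothetical, not our actual configuration; `source`, `target`, and `continuous` are the standard replication doc fields.

```python
import json


def make_replicator_doc(doc_id, source_url, target_db):
    """Build a continuous, one-way replication doc (illustrative only)."""
    return {
        "_id": doc_id,
        "source": source_url,   # remote database to pull from
        "target": target_db,    # local database to write into
        "continuous": True,     # keep replicating as new changes arrive
    }


# Hypothetical hub URL and doc id, for illustration.
users_doc = make_replicator_doc(
    "pull-users-from-hub",
    "http://hub.example.com:5984/_users",
    "_users",
)
print(json.dumps(users_doc, indent=2, sort_keys=True))
```

The resulting JSON would be written as a document into /_replicator on each read slave, with a matching doc for /registry.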

As I understand it, since replication only goes in one direction, and all 
writes go to the upstream master, conflicts should be impossible.

We brought a 1.6.0 read slave online, version 1.6.0+build.fauxton-91-g5a2864b.

On this 1.6.0 read slave (and only there), we're seeing /_users doc conflicts, 
and it looks like the conflicting revisions have different password_sha and 
salt values.  Here is one such example: 
https://gist.github.com/isaacs/63f332a15109bbfdb8ac  (actual password_sha and 
salt mostly redacted, but enough bytes left in so that you can see they're not 
matching.)
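
A sketch of how such a conflict can be inspected: fetching a doc with `?conflicts=true` adds a `_conflicts` array of losing revision ids, and each revision can then be fetched and compared. The helper names, base URL, user name, and field values below are made up for illustration; no request is actually sent.

```python
from urllib.parse import quote


def conflicts_url(base_url, db, doc_id):
    """URL that returns the doc with a _conflicts array, if it has conflicts."""
    return "{}/{}/{}?conflicts=true".format(
        base_url.rstrip("/"), db, quote(doc_id, safe="")
    )


def differing_fields(winner, other, fields=("password_sha", "salt")):
    """Names of the given fields whose values differ between two revisions."""
    return [f for f in fields if winner.get(f) != other.get(f)]


# Hypothetical revisions of one user doc; the values are placeholders.
winning = {"_id": "org.couchdb.user:alice", "_rev": "3-aaa",
           "password_sha": "1111", "salt": "2222", "_conflicts": ["3-bbb"]}
losing = {"_id": "org.couchdb.user:alice", "_rev": "3-bbb",
          "password_sha": "3333", "salt": "4444"}

print(conflicts_url("http://localhost:5984", "_users", winning["_id"]))
print(differing_fields(winning, losing))
```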

A few weeks ago, this issue popped up, affecting about 400 user docs, and we 
figured that it had to do with some instability or human error at the time when 
that box was set up.  We deleted all of the conflicts, and verified that all 
docs matched the upstream at that time.  We removed the /_replicator entries, 
and re-created them using the same script we use to create them on all the 
other read slaves.
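
For reference, one way to delete conflicting revisions is to post a tombstone for each entry in `_conflicts` to `_bulk_docs`. This is a sketch of the payload only, not the exact script we ran; the helper name and sample doc are hypothetical.

```python
import json


def conflict_deletions(doc):
    """Build a _bulk_docs body that deletes every conflicting revision.

    `doc` is expected to be a document fetched with ?conflicts=true.
    The winning revision (doc["_rev"]) is left untouched.
    """
    return {
        "docs": [
            {"_id": doc["_id"], "_rev": rev, "_deleted": True}
            for rev in doc.get("_conflicts", [])
        ]
    }


# Hypothetical doc with two losing revisions.
sample = {"_id": "org.couchdb.user:alice", "_rev": "3-aaa",
          "_conflicts": ["3-bbb", "2-ccc"]}
print(json.dumps(conflict_deletions(sample), indent=2))
```

After the deletions, each doc's winning revision can be compared against the upstream copy to confirm they match.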

If this were just one or two docs, or happening across more of the read slaves, 
I'd be more inclined to think that it has something to do with a particular 
user, or our particular setup.  However, the /_replicator docs on the 1.6.0 box 
are identical to those on the other read slaves.  This is affecting about 150 
users, and only on that one box.

We've taken the 1.6.0 read slave out of rotation for now, so it's not an urgent 
issue for us.  If anyone wants to log in and have a look around, I can grant 
access, but I hope that there's enough information here to track it down.  
Thanks.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
