[
https://issues.apache.org/jira/browse/COUCHDB-1364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alex Markham updated COUCHDB-1364:
----------------------------------
Attachment: do_checkpoint error push.txt
Hi Felipe - which couch (on which end of the replication) needs to be updated?
I looked at the Wireshark captures for the pull and push replication, from host28 -> host25.
For the pull, the replication seems to start, fetches the changes list from seq
390505 and then POSTs an _ensure_full_commit. There doesn't seem to be a reply
to this, so it just ends up hanging. My replication script cancels the ongoing
replication and restarts it every 5 minutes, and the commit call seems to take
much longer than that (if it ever completes):
POST /master_db/_ensure_full_commit?seq=3914198 HTTP/1.1
User-Agent: CouchDB/1.1.1
Accept: application/json
Accept-Encoding: gzip
Content-Type: application/json
Content-Length: 0
Host: host28:5984
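
For reference, this is how I have been poking that commit call by hand - a
minimal sketch (Python 3, standard library only), using the host/db from the
capture above and an arbitrary 60-second client timeout:

#!/usr/bin/env python3
# Probe the _ensure_full_commit call the replicator hangs on: POST it
# directly with an empty body (Content-Length: 0) and report whether it
# answers within the client timeout.
import json
import time
import urllib.request

URL = "http://host28:5984/master_db/_ensure_full_commit"

req = urllib.request.Request(
    URL,
    data=b"",                                   # empty JSON body
    headers={"Content-Type": "application/json"},
    method="POST",
)
start = time.time()
try:
    with urllib.request.urlopen(req, timeout=60) as resp:
        print("replied in %.1fs: %s" % (time.time() - start, json.load(resp)))
except Exception as exc:
    print("no reply after %.1fs: %s" % (time.time() - start, exc))

Normally this comes back almost immediately with an "ok" body; in the hung
state it just sits there until the client gives up.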
I also have a different stack trace for what I think is the same problem -
"do_checkpoint error.txt" - where the last Wireshark activity appeared to be a
POST to /_ensure_full_commit at 12:12:04; at 12:12:34 the timeout error
appeared and the replication failed:
POST /master_db/_ensure_full_commit HTTP/1.1
User-Agent: CouchDB/1.1.1
Accept: application/json
Accept-Encoding: gzip
Content-Type: application/json
Content-Length: 0
Host: host25:5984
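
To see where checkpointing actually stops, I have also been reading the
replication checkpoint (_local) document on both ends - a rough sketch below.
The replication ID is a placeholder (the real one shows up in the replicator
log lines), and source_last_seq is the field the checkpoint docs carry:

#!/usr/bin/env python3
# Read the replication checkpoint (_local) document on both ends and
# print the last checkpointed sequence. The replication ID below is a
# placeholder - the real one appears in the replicator log output.
import json
import urllib.request

REP_ID = "replication-id-goes-here"   # placeholder, taken from the logs
DB = "master_db"

for host in ("host28", "host25"):
    url = "http://%s:5984/%s/_local/%s" % (host, DB, REP_ID)
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            doc = json.load(resp)
        print(host, "source_last_seq:", doc.get("source_last_seq"))
    except Exception as exc:
        print(host, "error:", exc)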
> Replication hanging/failing on docs with lots of revisions
> ----------------------------------------------------------
>
> Key: COUCHDB-1364
> URL: https://issues.apache.org/jira/browse/COUCHDB-1364
> Project: CouchDB
> Issue Type: Bug
> Components: Replication
> Affects Versions: 1.0.3, 1.1.1
> Environment: Centos 5.6/x64 spidermonkey 1.8.5, couchdb 1.1.1 patched
> for COUCHDB-1340 and COUCHDB-1333
> Reporter: Alex Markham
> Labels: open_revs, replication
> Attachments: COUCHDB-1364-11x.patch, do_checkpoint error push.txt,
> replication error changes_loop died redacted.txt
>
>
> We have a setup where replication from a 1.1.1 couch is hanging - this is WAN
> replication which previously worked 1.0.3 <-> 1.0.3.
> Replicating from the 1.1.1 -> 1.0.3 showed an error very similar to
> COUCHDB-1340 - which I presumed meant the URL was too long. So I upgraded the
> 1.0.3 couch to our 1.1.1 build which had this patched.
> However, the replication between the two 1.1.1 couches is hanging at a certain
> point when doing continuous pull replication - it doesn't checkpoint, it just
> stays on "starting". When cancelled and restarted, though, it gets the latest
> documents (so doc counts are equal). The last calls I see to the source db
> when it hangs are multiple long GETs for a document with 2051 open revisions
> on the source and 498 on the target.
> When doing a push replication the _replicate call just gives a 500 error (at
> about the same seq id where the pull replication hangs), saying:
> [Thu, 15 Dec 2011 10:09:17 GMT] [error] [<0.11306.115>] changes_loop died
> with reason {noproc,
> {gen_server,call,
> [<0.6382.115>,
> {pread_iolist,
> 79043596434},
> infinity]}}
> when the last call in the target of the push replication is:
> [Thu, 15 Dec 2011 10:09:17 GMT] [info] [<0.580.50>] 10.35.9.79 - - 'POST'
> /master_db/_missing_revs 200
> with no stack trace.
> Comparing the open_revs=all count on the documents with many open revs shows
> differing numbers on each side of the replication WAN and between different
> couches in the same datacentre. Some of these documents have not been updated
> for months. Is it possible that 1.0.3 just skipped over this issue and
> carried on replicating, but 1.1.1 does not?
> I know I can hack the replication to work by updating the checkpoint seq past
> this point in the _local document, but I think there is a real bug here
> somewhere.
> If Wireshark/debug data is required, please say.
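
On the open_revs point in the quoted description above: this is the kind of
comparison I have been running - a rough sketch (Python 3, standard library
only). The document ID is a placeholder for the doc with ~2051 open revisions;
the host/db names are the ones from this ticket.

#!/usr/bin/env python3
# Count the open revisions of the suspect document on each side of the
# replication. Requesting application/json makes CouchDB return a JSON
# array of leaf revisions rather than a multipart/mixed response.
import json
import urllib.request

DOC_ID = "suspect-doc-id"      # placeholder - the doc with ~2051 open revs
DB = "master_db"

for host in ("host28", "host25"):
    url = "http://%s:5984/%s/%s?open_revs=all" % (host, DB, DOC_ID)
    req = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(req, timeout=300) as resp:
        revs = json.load(resp)
    print("%s: %d open revs" % (host, len(revs)))

In our case the counts differ between the two ends of the WAN link and between
couches in the same datacentre, even for documents not updated for months.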
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira