[ https://issues.apache.org/jira/browse/COUCHDB-1364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13170262#comment-13170262 ]

Filipe Manana commented on COUCHDB-1364:
----------------------------------------

Alex, the patch is for the side doing the push replication. It's only meant to 
fix the crash behind the following stack trace you pasted:

[Thu, 15 Dec 2011 10:09:17 GMT] [error] [<0.11306.115>] changes_loop died with 
reason {noproc,
        {gen_server,call,
         [<0.6382.115>,
          {pread_iolist, 79043596434},
          infinity]}}

Your error with the _ensure_full_commit seems to be because that HTTP request 
failed 10 times. At that point the replication process crashes.
Before the first retry it waits 0.5 seconds, before the 2nd retry 1 second, 
and so on (the wait always doubles), so the ten waits add up to 
0.5 + 1 + 2 + ... + 256 = 511.5 seconds, about 8.5 minutes, before it crashes.
Maybe your network is too busy.
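
As a quick sanity check of that figure, here is a minimal Erlang sketch (the 
module and function names are made up for illustration) that sums ten doubling 
waits starting at 0.5 seconds:

    -module(retry_wait).
    -export([total/0]).

    %% Sum of the ten doubling retry waits described above:
    %% 0.5, 1, 2, ..., 256 seconds.
    total() ->
        Waits = [0.5 * math:pow(2, N) || N <- lists:seq(0, 9)],
        lists:sum(Waits).  %% 511.5 seconds, roughly 8.5 minutes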
                
> Replication hanging/failing on docs with lots of revisions
> ----------------------------------------------------------
>
>                 Key: COUCHDB-1364
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-1364
>             Project: CouchDB
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 1.0.3, 1.1.1
>         Environment: CentOS 5.6/x64, SpiderMonkey 1.8.5, CouchDB 1.1.1 patched 
> for COUCHDB-1340 and COUCHDB-1333
>            Reporter: Alex Markham
>              Labels: open_revs, replication
>         Attachments: COUCHDB-1364-11x.patch, do_checkpoint error push.txt, 
> replication error changes_loop died redacted.txt
>
>
> We have a setup where replication from a 1.1.1 couch is hanging - this is WAN 
> replication which previously worked 1.0.3 <-> 1.0.3.
> Replicating from 1.1.1 -> 1.0.3 showed an error very similar to 
> COUCHDB-1340 - which I presumed meant the URL was too long. So I upgraded the 
> 1.0.3 couch to our 1.1.1 build, which had this patched.
> However, the replication between the two 1.1.1 couches hangs at a certain 
> point when doing continuous pull replication: it doesn't checkpoint, just 
> stays on "starting". When cancelled and restarted, though, it gets the latest 
> documents (so doc counts are equal). The last calls I see to the source db 
> when it hangs are multiple long GETs for a document with 2051 open revisions 
> on the source and 498 on the target.
> When doing a push replication, the _replicate call just gives a 500 error (at 
> about the same seq id as the one where the pull replication hangs) saying:
> [Thu, 15 Dec 2011 10:09:17 GMT] [error] [<0.11306.115>] changes_loop died 
> with reason {noproc,
>         {gen_server,call,
>          [<0.6382.115>,
>           {pread_iolist, 79043596434},
>           infinity]}}
> while the last call logged on the target of the push replication is:
> [Thu, 15 Dec 2011 10:09:17 GMT] [info] [<0.580.50>] 10.35.9.79 - - 'POST' 
> /master_db/_missing_revs 200
> with no stack trace.
> Comparing the open_revs=all count on the documents with many open revs shows 
> differing numbers on each side of the replication WAN, and between different 
> couches in the same datacentre. Some of these documents have not been updated 
> for months. Is it possible that 1.0.3 just skipped over this issue and 
> carried on replicating, but 1.1.1 does not?
> I know I can hack the replication into working by updating the checkpoint seq 
> past this point in the _local document, but I think there is a real bug here 
> somewhere.
> If wireshark/debug data is required, please say so.
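
As an aside for anyone trying to reproduce the comparison above: it can be 
scripted against CouchDB's document API, since GET /db/docid?open_revs=all 
with an Accept: application/json header returns a JSON array with one 
{"ok": ...} entry per available leaf revision. A rough Erlang sketch (the 
module name, and the substring count standing in for a real JSON parse, are 
illustrative only):

    -module(open_revs_check).
    -export([count/2]).

    %% Fetch DocId with ?open_revs=all and crudely count the leaf revisions
    %% returned; counting "ok" keys in the body is a stand-in for parsing
    %% the JSON array properly.
    count(BaseUrl, DocId) ->
        inets:start(),
        Url = BaseUrl ++ "/" ++ DocId ++ "?open_revs=all",
        {ok, {{_, 200, _}, _Headers, Body}} =
            httpc:request(get, {Url, [{"Accept", "application/json"}]}, [], []),
        length(string:split(Body, "\"ok\"", all)) - 1.

Running count/2 against both sides of the WAN (and against couches in the same 
datacentre) should surface the differing numbers Alex describes.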
