[ https://issues.apache.org/jira/browse/COUCHDB-1364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13170150#comment-13170150 ]

Filipe Manana commented on COUCHDB-1364:
----------------------------------------

Hi Alex.
For the push replication case, right before the error, was the local source 
database compacted?
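
Compaction status is visible from the database info document, so this can be 
checked directly. Below is a minimal sketch in Python against the standard 
GET /{db} info endpoint; the host, port and database name are placeholders, 
not taken from the report.

    # Minimal sketch: was the source database compacted around the error?
    # Host, port and database name are placeholders.
    import json
    import urllib.request

    DB_URL = "http://localhost:5984/source_db"  # assumed local source database

    with urllib.request.urlopen(DB_URL) as resp:
        info = json.load(resp)

    # "compact_running" is true while a compaction is in progress; a drop in
    # "disk_size" between two checks is another hint that the .couch file was
    # rewritten while a reader still held a handle to the old file.
    print("compact_running:", info.get("compact_running"))
    print("disk_size      :", info.get("disk_size"))
    print("update_seq     :", info.get("update_seq"))

A compaction finishing around 10:09 GMT would be consistent with the 
pread_iolist noproc quoted below, since the process serving reads from the 
pre-compaction file would no longer exist.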
                
> Replication hanging/failing on docs with lots of revisions
> ----------------------------------------------------------
>
>                 Key: COUCHDB-1364
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-1364
>             Project: CouchDB
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 1.0.3, 1.1.1
>         Environment: Centos 5.6/x64 spidermonkey 1.8.5, couchdb 1.1.1 patched 
> for COUCHDB-1340 and COUCHDB-1333
>            Reporter: Alex Markham
>              Labels: open_revs, replication
>         Attachments: replication error changes_loop died redacted.txt
>
>
> We have a setup where replication from a 1.1.1 couch is hanging - this is WAN 
> replication which previously worked 1.0.3 <-> 1.0.3.
> Replicating from 1.1.1 -> 1.0.3 showed an error very similar to COUCHDB-1340 - 
> which I presumed meant the URL was too long - so I upgraded the 1.0.3 couch to 
> our 1.1.1 build, which has this patched.
> However, the replication between the two 1.1.1 couches hangs at a certain 
> point when doing continuous pull replication - it doesn't checkpoint and just 
> stays on "starting". When cancelled and restarted, though, it does fetch the 
> latest documents (so doc counts are equal). The last calls I see to the source 
> db when it hangs are multiple long GETs for a document with 2051 open 
> revisions on the source and 498 on the target.
> When doing a push replication, the _replicate call just gives a 500 error (at 
> about the same seq id as the pull replication hangs at) saying:
> [Thu, 15 Dec 2011 10:09:17 GMT] [error] [<0.11306.115>] changes_loop died 
> with reason {noproc, {gen_server, call, [<0.6382.115>,
>                       {pread_iolist, 79043596434}, infinity]}}
> when the last call logged on the target of the push replication is:
> [Thu, 15 Dec 2011 10:09:17 GMT] [info] [<0.580.50>] 10.35.9.79 - - 'POST' 
> /master_db/_missing_revs 200
> with no stack trace.
> Comparing the open_revs=all count on the documents with many open revs shows 
> differing numbers on each side of the replication WAN and between different 
> couches in the same datacentre (a sketch for scripting this comparison follows 
> after this report). Some of these documents have not been updated for months. 
> Is it possible that 1.0.3 just skipped over this issue and carried on 
> replicating, but 1.1.1 does not?
> I know I can hack the replication to work by updating the checkpoint seq past 
> this point in the _local document, but I think there is a real bug here 
> somewhere.
> If Wireshark/debug data is required, please say so.
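
The open_revs=all comparison described above can be scripted. Below is a 
minimal sketch in Python using CouchDB's standard document API; the node URLs 
and document id are placeholders, not taken from the report.

    # Minimal sketch: compare the number of open (leaf) revisions of one
    # document on two CouchDB nodes. URLs and document id are placeholders.
    import json
    import urllib.request

    SOURCE = "http://source.example.com:5984/master_db"  # assumed source node
    TARGET = "http://target.example.com:5984/master_db"  # assumed target node
    DOC_ID = "doc-with-many-revs"                         # assumed document id

    def open_rev_count(base_url, doc_id):
        # open_revs=all returns one entry per open revision; asking for JSON
        # avoids the default multipart/mixed response.
        req = urllib.request.Request(
            "%s/%s?open_revs=all" % (base_url, doc_id),
            headers={"Accept": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return len(json.load(resp))

    print("source open revs:", open_rev_count(SOURCE, DOC_ID))
    print("target open revs:", open_rev_count(TARGET, DOC_ID))

Counts that stay different between couches that should be in sync, for 
documents that have not been updated in months, would match the divergence 
described in the report.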
