[ https://issues.apache.org/jira/browse/COUCHDB-1364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13170158#comment-13170158 ]
Alex Markham commented on COUCHDB-1364:
---------------------------------------
Yes - I compacted both sides in the hour preceding this.
> Replication hanging/failing on docs with lots of revisions
> ----------------------------------------------------------
>
> Key: COUCHDB-1364
> URL: https://issues.apache.org/jira/browse/COUCHDB-1364
> Project: CouchDB
> Issue Type: Bug
> Components: Replication
> Affects Versions: 1.0.3, 1.1.1
> Environment: CentOS 5.6/x64, SpiderMonkey 1.8.5, CouchDB 1.1.1 patched
> for COUCHDB-1340 and COUCHDB-1333
> Reporter: Alex Markham
> Labels: open_revs, replication
> Attachments: COUCHDB-1364-11x.patch, replication error changes_loop
> died redacted.txt
>
>
> We have a setup where replication from a 1.1.1 couch is hanging - this is WAN
> replication which previously worked fine between two 1.0.3 couches.
> Replicating from the 1.1.1 -> 1.0.3 showed an error very similar to
> COUCHDB-1340 - which I presumed meant the URL was too long, so I upgraded the
> 1.0.3 couch to our 1.1.1 build, which has that patched.
> However, the replication between the two 1.1.1 couches hangs at a certain
> point when doing continuous pull replication: it doesn't checkpoint and just
> stays on "starting". However, when cancelled and restarted it does fetch the
> latest documents (so doc counts are equal). The last calls I see to the source
> db when it hangs are multiple long GETs for a document with 2051 open
> revisions on the source and 498 on the target.
> When doing a push replication, the _replicate call just gives a 500 error (at
> about the same seq id that the pull replication hangs at) saying:
> [Thu, 15 Dec 2011 10:09:17 GMT] [error] [<0.11306.115>] changes_loop died
> with reason {noproc,
> {gen_server,call,
> [<0.6382.115>,
> {pread_iolist,
> 79043596434},
> infinity]}}
> when the last call in the target of the push replication is:
> [Thu, 15 Dec 2011 10:09:17 GMT] [info] [<0.580.50>] 10.35.9.79 - - 'POST'
> /master_db/_missing_revs 200
> with no stack trace.
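> For reference, a minimal sketch of how this push replication can be triggered
> via _replicate (hosts and db names below are placeholders/guesses, not the
> real ones):
>
>     # Sketch only: hosts, credentials and db names are placeholders.
>     import json
>     import urllib.request
>
>     SOURCE_COUCH = "http://127.0.0.1:5984"    # local 1.1.1 couch (assumed)
>     TARGET_COUCH = "http://10.35.9.79:5984"   # remote couch seen in the log above
>
>     body = json.dumps({
>         "source": "master_db",                # db name is a guess from the target log
>         "target": TARGET_COUCH + "/master_db",
>     }).encode("utf-8")
>
>     req = urllib.request.Request(
>         SOURCE_COUCH + "/_replicate",
>         data=body,
>         headers={"Content-Type": "application/json"},
>         method="POST",
>     )
>     # At the problem seq this raises HTTPError 500 with the changes_loop
>     # error shown above instead of returning a normal replication response.
>     print(urllib.request.urlopen(req).read().decode("utf-8"))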
> Comparing the open_revs=all count on the documents with many open revs shows
> differing numbers on each side of the replication WAN and between different
> couches in the same datacentre. Some of these documents have not been updated
> for months. Is it possible that 1.0.3 just skipped over this issue and
> carried on replicating, but 1.1.1 does not?
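> A rough sketch of how the open_revs=all counts are being compared (hosts, db
> name and doc id are placeholders; as far as I know couch returns a JSON array
> with one entry per open revision when Accept: application/json is set):
>
>     # Sketch only: compare open revision counts for one doc on two couches.
>     import json
>     import urllib.request
>
>     DOC_ID = "some-heavily-conflicted-doc"    # placeholder
>     COUCHES = {
>         "source": "http://source.example:5984/master_db",
>         "target": "http://target.example:5984/master_db",
>     }
>
>     for name, base in COUCHES.items():
>         req = urllib.request.Request(
>             "%s/%s?open_revs=all" % (base, DOC_ID),
>             headers={"Accept": "application/json"},
>         )
>         revs = json.load(urllib.request.urlopen(req))
>         print("%s: %d open revisions" % (name, len(revs)))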
> I know I can hack the replication to work by updating the checkpoint seq past
> this point in the _local document, but I think there is a real bug here
> somewhere.
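> For illustration only, the checkpoint hack mentioned above looks roughly like
> this (the _local checkpoint doc id, the field name and the seq value are all
> placeholders/assumptions about the 1.1.x checkpoint format - it works around
> the hang, it doesn't fix the bug):
>
>     # Sketch only: bump the replication checkpoint past the problem seq.
>     import json
>     import urllib.request
>
>     DB = "http://target.example:5984/master_db"   # placeholder
>     CHECKPOINT_URL = DB + "/_local/replication-checkpoint-id"  # real id comes from the couch logs
>     NEW_SEQ = 123456                              # placeholder: seq just past the bad document
>
>     doc = json.load(urllib.request.urlopen(CHECKPOINT_URL))
>     doc["source_last_seq"] = NEW_SEQ              # assumed field name in the 1.x checkpoint doc
>     req = urllib.request.Request(
>         CHECKPOINT_URL,
>         data=json.dumps(doc).encode("utf-8"),
>         headers={"Content-Type": "application/json"},
>         method="PUT",
>     )
>     print(urllib.request.urlopen(req).read().decode("utf-8"))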
> If Wireshark/debug data is required, please say so.