[ https://issues.apache.org/jira/browse/COUCHDB-1364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13170150#comment-13170150 ]
Filipe Manana commented on COUCHDB-1364:
----------------------------------------

Hi Alex. For the push replication case, right before the error, was the local source database compacted?

> Replication hanging/failing on docs with lots of revisions
> ----------------------------------------------------------
>
>                 Key: COUCHDB-1364
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-1364
>             Project: CouchDB
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 1.0.3, 1.1.1
>         Environment: CentOS 5.6/x64, SpiderMonkey 1.8.5, CouchDB 1.1.1 patched for COUCHDB-1340 and COUCHDB-1333
>            Reporter: Alex Markham
>              Labels: open_revs, replication
>         Attachments: replication error changes_loop died redacted.txt
>
>
> We have a setup where replication from a 1.1.1 couch is hanging - this is WAN replication which previously worked 1.0.3 <-> 1.0.3.
> Replicating 1.1.1 -> 1.0.3 showed an error very similar to COUCHDB-1340, which I presumed meant the URL was too long, so I upgraded the 1.0.3 couch to our 1.1.1 build, which has that patched.
> However, replication between the two 1.1.1 couches hangs at a certain point when doing continuous pull replication: it doesn't checkpoint and just stays on "starting". When cancelled and restarted it does get the latest documents (so doc counts are equal). The last calls I see to the source db when it hangs are multiple long GETs for a document with 2051 open revisions on the source and 498 on the target.
> When doing a push replication, the _replicate call just returns a 500 error (at about the same seq id as the pull replication hangs at) saying:
>
> [Thu, 15 Dec 2011 10:09:17 GMT] [error] [<0.11306.115>] changes_loop died with reason {noproc,
>     {gen_server,call,
>         [<0.6382.115>,
>          {pread_iolist, 79043596434},
>          infinity]}}
>
> when the last call in the target of the push replication is:
>
> [Thu, 15 Dec 2011 10:09:17 GMT] [info] [<0.580.50>] 10.35.9.79 - - 'POST' /master_db/_missing_revs 200
>
> with no stack trace.
> Comparing the open_revs=all count on the documents with many open revs shows differing numbers on each side of the replication WAN, and between different couches in the same datacentre (see the first sketch below). Some of these documents have not been updated for months. Is it possible that 1.0.3 just skipped over this issue and carried on replicating, but 1.1.1 does not?
> I know I can hack the replication to work by updating the checkpoint seq past this point in the _local document (see the second sketch below), but I think there is a real bug here somewhere.
> If wireshark/debug data is required, please say.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
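
The open_revs comparison the reporter describes can be done directly against the document API. Below is a minimal sketch in Python; the host names, database name and document id are placeholder assumptions (the report does not name them), and it relies on GET /db/doc?open_revs=all returning one JSON array entry per leaf revision when application/json is requested.

    # Minimal sketch: count the leaf ("open") revisions of a document on each
    # side of the replication and compare. Hosts, database and document id
    # below are placeholders.
    import json
    import urllib.request

    def open_rev_count(base_url, db, doc_id):
        # ?open_revs=all returns one array entry per leaf revision as JSON
        req = urllib.request.Request(
            "%s/%s/%s?open_revs=all" % (base_url, db, doc_id),
            headers={"Accept": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return len(json.load(resp))

    src = open_rev_count("http://source.example.com:5984", "master_db", "problem_doc")
    tgt = open_rev_count("http://target.example.com:5984", "master_db", "problem_doc")
    print(src, tgt)  # e.g. 2051 vs 498, as in the report above

If the counts differ on databases that should be in sync, that mirrors the divergence described in the report.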
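The checkpoint workaround mentioned in the report (moving the recorded seq in the replication's _local document past the problem point) could look roughly like the sketch below. This is a hedged illustration, not a recommended fix: the hosts, database, replication id and new sequence number are placeholders, and it assumes the 1.1.x checkpoint document keeps the sequence in a top-level source_last_seq field.

    # Hedged sketch of the checkpoint "hack": bump the recorded sequence in
    # _local/<replication id> past the seq where replication hangs.
    # All names (hosts, database, replication id, new sequence) are placeholders;
    # source_last_seq is assumed to be the field holding the checkpointed seq.
    import json
    import urllib.request

    def bump_checkpoint(base_url, db, rep_id, new_seq):
        url = "%s/%s/_local/%s" % (base_url, db, rep_id)
        with urllib.request.urlopen(url) as resp:
            doc = json.load(resp)              # current checkpoint document
        doc["source_last_seq"] = new_seq       # skip past the hanging seq
        req = urllib.request.Request(
            url,
            data=json.dumps(doc).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="PUT",
        )
        with urllib.request.urlopen(req) as resp:
            print(json.load(resp))

    # The replicator compares checkpoints on both ends, so the same edit
    # has to be applied to source and target for it to take effect.
    for host in ("http://source.example.com:5984", "http://target.example.com:5984"):
        bump_checkpoint(host, "master_db", "replication-id-placeholder", 123456)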