[
https://issues.apache.org/jira/browse/COUCHDB-1364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alex Markham updated COUCHDB-1364:
----------------------------------
Attachment: do_checkpoint error push.txt
Hi Felipe - which couch (on which end of the replication) needs to be updated?
I looked at the Wireshark captures for the pull and push replication, from host28 -> host25.
For the pull, the replication seems to start, fetches the changes list from seq
390505 and then POSTs an _ensure_full_commit. There doesn't seem to be a reply
to this, so it just ends up hanging. My replication script cancels the ongoing
replication and restarts it every 5 minutes, and the commit call seems to take
much longer than that (if it ever completes):
POST /master_db/_ensure_full_commit?seq=3914198 HTTP/1.1
User-Agent: CouchDB/1.1.1
Accept: application/json
Accept-Encoding: gzip
Content-Type: application/json
Content-Length: 0
Host: host28:5984
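
For reference, this is how I have been poking that commit call by hand - a
minimal sketch (Python 3, standard library only), using the host/db from the
capture above and an arbitrary 60-second client timeout:

#!/usr/bin/env python3
# Probe the _ensure_full_commit call the replicator hangs on: POST it
# directly with an empty body (Content-Length: 0) and report whether it
# answers within the client timeout.
import json
import time
import urllib.request

URL = "http://host28:5984/master_db/_ensure_full_commit"

req = urllib.request.Request(
    URL,
    data=b"",                                   # empty JSON body
    headers={"Content-Type": "application/json"},
    method="POST",
)
start = time.time()
try:
    with urllib.request.urlopen(req, timeout=60) as resp:
        print("replied in %.1fs: %s" % (time.time() - start, json.load(resp)))
except Exception as exc:
    print("no reply after %.1fs: %s" % (time.time() - start, exc))

Normally this comes back almost immediately with an "ok" body; in the hung
state it just sits there until the client gives up.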
I also have a different stack trace for what I think is the same problem -
"do_checkpoint error.txt" - where the last Wireshark activity appeared to be a
POST to /_ensure_full_commit at 12:12:04; at 12:12:34 the timeout error
appeared and the replication failed:
POST /master_db/_ensure_full_commit HTTP/1.1
User-Agent: CouchDB/1.1.1
Accept: application/json
Accept-Encoding: gzip
Content-Type: application/json
Content-Length: 0
Host: host25:5984
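
To see where checkpointing actually stops, I have also been reading the
replication checkpoint (_local) document on both ends - a rough sketch below.
The replication ID is a placeholder (the real one shows up in the replicator
log lines), and source_last_seq is the field the checkpoint docs carry:

#!/usr/bin/env python3
# Read the replication checkpoint (_local) document on both ends and
# print the last checkpointed sequence. The replication ID below is a
# placeholder - the real one appears in the replicator log output.
import json
import urllib.request

REP_ID = "replication-id-goes-here"   # placeholder, taken from the logs
DB = "master_db"

for host in ("host28", "host25"):
    url = "http://%s:5984/%s/_local/%s" % (host, DB, REP_ID)
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            doc = json.load(resp)
        print(host, "source_last_seq:", doc.get("source_last_seq"))
    except Exception as exc:
        print(host, "error:", exc)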
> Replication hanging/failing on docs with lots of revisions
> ----------------------------------------------------------
>
> Key: COUCHDB-1364
> URL: https://issues.apache.org/jira/browse/COUCHDB-1364
> Project: CouchDB
> Issue Type: Bug
> Components: Replication
> Affects Versions: 1.0.3, 1.1.1
> Environment: Centos 5.6/x64 spidermonkey 1.8.5, couchdb 1.1.1 patched
> for COUCHDB-1340 and COUCHDB-1333
> Reporter: Alex Markham
> Labels: open_revs, replication
> Attachments: COUCHDB-1364-11x.patch, do_checkpoint error push.txt,
> replication error changes_loop died redacted.txt
>
>
> We have a setup where replication from a 1.1.1 couch is hanging - this is WAN
> replication which previously worked 1.0.3 <-> 1.0.3.
> Replicating from the 1.1.1 -> 1.0.3 showed an error very similar to
> COUCHDB-1340 - which I presumed meant the URL was too long. So I upgraded the
> 1.0.3 couch to our 1.1.1 build which had this patched.
> However, the replication between the two 1.1.1 couches is hanging at a certain
> point when doing continuous pull replication - it doesn't checkpoint, it just
> stays on "starting". When cancelled and restarted, though, it gets the latest
> documents (so doc counts are equal). The last calls I see to the source db
> when it hangs are multiple long GETs for a document with 2051 open revisions
> on the source and 498 on the target.
> When doing a push replication the _replicate call just gives a 500 error (at
> about the same seq id where the pull replication hangs), saying:
> [Thu, 15 Dec 2011 10:09:17 GMT] [error] [<0.11306.115>] changes_loop died
> with reason {noproc,
> {gen_server,call,
> [<0.6382.115>,
> {pread_iolist,
> 79043596434},
> infinity]}}
> when the last call in the target of the push replication is:
> [Thu, 15 Dec 2011 10:09:17 GMT] [info] [<0.580.50>] 10.35.9.79 - - 'POST'
> /master_db/_missing_revs 200
> with no stack trace.
> Comparing the open_revs=all count on the documents with many open revs shows
> differing numbers on each side of the replication WAN and between different
> couches in the same datacentre. Some of these documents have not been updated
> for months. Is it possible that 1.0.3 just skipped over this issue and
> carried on replicating, but 1.1.1 does not?
> I know I can hack the replication to work by updating the checkpoint seq past
> this point in the _local document, but I think there is a real bug here
> somewhere.
> If Wireshark/debug data is required, please say.
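
On the open_revs point in the quoted description above: this is the kind of
comparison I have been running - a rough sketch (Python 3, standard library
only). The document ID is a placeholder for the doc with ~2051 open revisions;
the host/db names are the ones from this ticket.

#!/usr/bin/env python3
# Count the open revisions of the suspect document on each side of the
# replication. Requesting application/json makes CouchDB return a JSON
# array of leaf revisions rather than a multipart/mixed response.
import json
import urllib.request

DOC_ID = "suspect-doc-id"      # placeholder - the doc with ~2051 open revs
DB = "master_db"

for host in ("host28", "host25"):
    url = "http://%s:5984/%s/%s?open_revs=all" % (host, DB, DOC_ID)
    req = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(req, timeout=300) as resp:
        revs = json.load(resp)
    print("%s: %d open revs" % (host, len(revs)))

In our case the counts differ between the two ends of the WAN link and between
couches in the same datacentre, even for documents not updated for months.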
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira