[ https://issues.apache.org/jira/browse/COUCHDB-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alex Markham updated COUCHDB-1505:
----------------------------------

    Attachment: couchcrash171012redact.log

Attaching a new log - couchcrash171012redact.log
We had a couch die for 5 seconds yesterday, where it seems heart restarted it. The stack trace before the crash looks almost identical, except this time it took out the whole server rather than just printing errors.

> Error on cancelling replication - possibly related to hanging replications
> ---------------------------------------------------------------------------
>
>                 Key: COUCHDB-1505
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-1505
>             Project: CouchDB
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 1.2
>         Environment: CentOS 5.6 x64. WAN replication (between datacentres). Cronjob-controlled replication curls every 5 minutes. Using pull replication with a filter.
>            Reporter: Alex Markham
>              Labels: cancel, hang, replication
>         Attachments: couchcrash171012redact.log, couchjs.txt, replicationcancelerror1.log
>
>
> We run a cronjob to cancel replication and then start it again every 5 minutes (see the sketch below). Occasionally, when cancelling replication jobs, a stack trace appears in the couchdb log (attached).
>
> Other observations: perhaps unrelated, but over time we slowly accumulate "zombie" couchjs processes. After a month or so (different for each server) the count gets near our os_process_limit of 200 and we restart couchdb. "Zombie" is speculation here, but there seems to be no need for a hundred-plus couchjs processes when we are just replicating 10 databases with occasional indexing; after a restart the count drops right back down. The start times of those processes are also weeks old. This may be normal, we are not sure.
>
> Why do we cancel replication and restart it? We found that if we don't, WAN replications can hang: curling /_replicate would say that the continuous replication is already running, but the replications were not updating and the document counts in the databases would diverge. Immediately after resuming the "cancel": true POST to /_replicate beforehand, these stack traces re-appeared and the replication caught up.
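
For reference, a minimal sketch of the cancel-then-restart cycle described above, as it might be run from cron. The host, database, and filter names are placeholders, not the reporter's actual configuration. CouchDB 1.x cancels a replication started via /_replicate when it receives the same JSON body again with "cancel": true added:

    #!/bin/sh
    # Hypothetical values; substitute real hosts, database and filter names.
    COUCH="http://localhost:5984"
    REPL='{"source":"http://remote-dc:5984/mydb","target":"mydb","continuous":true,"filter":"ddoc/by_type"}'
    CANCEL='{"source":"http://remote-dc:5984/mydb","target":"mydb","continuous":true,"filter":"ddoc/by_type","cancel":true}'

    # Cancel the running replication. The cancel request must carry the
    # same fields as the body that started the replication, plus "cancel": true.
    curl -s -X POST -H 'Content-Type: application/json' -d "$CANCEL" "$COUCH/_replicate"

    # Start it again.
    curl -s -X POST -H 'Content-Type: application/json' -d "$REPL" "$COUCH/_replicate"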
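The divergence mentioned in the description can be checked crudely by comparing the doc_count field that GET /{db} returns on each side (placeholder names again):

    # Compare doc_count between the two datacentres; the databases have
    # drifted if the numbers stop converging over time.
    curl -s http://remote-dc:5984/mydb
    curl -s http://localhost:5984/mydb
    # each returns e.g. {"db_name":"mydb","doc_count":12345,...}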