[jira] [Updated] (COUCHDB-597) Replication tasks crash.
[ https://issues.apache.org/jira/browse/COUCHDB-597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jan Lehnardt updated COUCHDB-597:
    Fix Version/s: (was: 1.2) 1.3

Bump to 1.3.x

Replication tasks crash.

    Key: COUCHDB-597
    URL: https://issues.apache.org/jira/browse/COUCHDB-597
    Project: CouchDB
    Issue Type: Bug
    Components: Database Core
    Affects Versions: 0.11
    Reporter: Robert Newson
    Fix For: 1.3
    Attachments: 0001-Cleanup-597-fixes.patch, 0001-changes-replication-timeouts-and-att.-fixes-COUCHDB-.patch, 597_fixes.patch, couchdb_597.patch

If I kick off 10 replication tasks in quick succession, occasionally one or two of the replication tasks will die and not be resumed. It seems that the stat tracking is a little buggy, and under stress can eventually cause a permanent failure of the supervised replication task;

[Fri, 11 Dec 2009 19:00:08 GMT] [error] [0.80.0] {error_report,0.30.0, {0.80.0,supervisor_report, [{supervisor,{local,couch_rep_sup}}, {errorContext,shutdown_error}, {reason,killed}, {offender, [{pid,0.6700.11}, {name,fcbb13200a1618cf983b347f4d2c9835+create_target}, {mfa, {gen_server,start_link, [couch_rep, [fcbb13200a1618cf983b347f4d2c9835, {[{create_target,true}, {source,http://node:5984/perf-p2;}, {target,perf-p2}]}, {user_ctx,null,[_admin]}], []]}}, {restart_type,temporary}, {shutdown,1}, {child_type,worker}]}]}}

[Fri, 11 Dec 2009 19:00:08 GMT] [error] [emulator] Error in process 0.6705.11 with exit value: {badarg,[{ets,insert,[stats_hit_table,{{couchdb,open_os_files},-1}]},{couch_stats_collector,decrement,1}]}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Updated: (COUCHDB-597) Replication tasks crash.
[ https://issues.apache.org/jira/browse/COUCHDB-597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paul Joseph Davis updated COUCHDB-597:
    Skill Level: Regular Contributors Level (Easy to Medium)

Replication tasks crash.

    Key: COUCHDB-597
    URL: https://issues.apache.org/jira/browse/COUCHDB-597
    Project: CouchDB
    Issue Type: Bug
    Components: Database Core
    Affects Versions: 0.11
    Reporter: Robert Newson
    Fix For: 0.12
    Attachments: 0001-changes-replication-timeouts-and-att.-fixes-COUCHDB-.patch, 0001-Cleanup-597-fixes.patch, 597_fixes.patch, couchdb_597.patch

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (COUCHDB-597) Replication tasks crash.
[ https://issues.apache.org/jira/browse/COUCHDB-597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Randall Leeds updated COUCHDB-597:
    Attachment: 597_fixes.patch

Corrects problems with continuous replication timeouts introduced by r916518 and r916868.

Replication tasks crash.

    Key: COUCHDB-597
    URL: https://issues.apache.org/jira/browse/COUCHDB-597
    Project: CouchDB
    Issue Type: Bug
    Components: Database Core
    Affects Versions: 0.11
    Reporter: Robert Newson
    Fix For: 0.11
    Attachments: 0001-changes-replication-timeouts-and-att.-fixes-COUCHDB-.patch, 597_fixes.patch, couchdb_597.patch
[jira] Updated: (COUCHDB-597) Replication tasks crash.
[ https://issues.apache.org/jira/browse/COUCHDB-597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Randall Leeds updated COUCHDB-597:
    Attachment: 0001-changes-replication-timeouts-and-att.-fixes-COUCHDB-.patch

I went back and made this patch SUPER simple and straightforward. Applies to the very most current trunk. This should take no more than a minute to review; it's super simple now.

Replication tasks crash.

    Key: COUCHDB-597
    URL: https://issues.apache.org/jira/browse/COUCHDB-597
    Project: CouchDB
    Issue Type: Bug
    Components: Database Core
    Affects Versions: 0.11
    Reporter: Robert Newson
    Attachments: couchdb_597.patch
[jira] Updated: (COUCHDB-597) Replication tasks crash.
[ https://issues.apache.org/jira/browse/COUCHDB-597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Randall Leeds updated COUCHDB-597:
    Attachment: (was: 0001-changes-replication-timeouts-and-att.-fixes-COUCHDB-.patch)

Replication tasks crash.

    Key: COUCHDB-597
    URL: https://issues.apache.org/jira/browse/COUCHDB-597
    Project: CouchDB
    Issue Type: Bug
    Components: Database Core
    Affects Versions: 0.11
    Reporter: Robert Newson
    Attachments: couchdb_597.patch
[jira] Updated: (COUCHDB-597) Replication tasks crash.
[ https://issues.apache.org/jira/browse/COUCHDB-597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Randall Leeds updated COUCHDB-597:
    Attachment: 0001-changes-replication-timeouts-and-att.-fixes-COUCHDB-.patch

Forgot to check the inclusion box.

Replication tasks crash.

    Key: COUCHDB-597
    URL: https://issues.apache.org/jira/browse/COUCHDB-597
    Project: CouchDB
    Issue Type: Bug
    Components: Database Core
    Affects Versions: 0.11
    Reporter: Robert Newson
    Attachments: 0001-changes-replication-timeouts-and-att.-fixes-COUCHDB-.patch, couchdb_597.patch
[jira] Updated: (COUCHDB-597) Replication tasks crash.
[ https://issues.apache.org/jira/browse/COUCHDB-597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Randall Leeds updated COUCHDB-597:
    Attachment: couchdb_597.patch

I believe this patch fixes most of the problems we're seeing here. The solution, as discussed, is to remove the inactivity_timeout from the options passed to ibrowse and handle timeouts manually (here using the timer module).

In my testing, I could mostly reproduce timeouts caused by not reading data from ibrowse fast enough. In other words, replicating from a remote database was terminating because processing the changes took a long time to complete, and the socket would be inactive while couch_rep_changes_feed had a full queue of rows. Therefore, a timeout is not set unless the missing revs server is waiting for more changes. Timeouts should still occur if the socket is idle and the local queue of received changes is empty. Errors should be caught appropriately so that real problems still bubble up.

I implemented retry logic for attachments in a manner similar to couch_rep_httpc. I had to add some after clauses now that the inactivity_timeout is not set.

The patch applies cleanly to trunk and 0.11.x, so please review! I think this would be a very good patch to get into 0.11 so long as Noah hasn't built the artifacts yet.

Replication tasks crash.

    Key: COUCHDB-597
    URL: https://issues.apache.org/jira/browse/COUCHDB-597
    Project: CouchDB
    Issue Type: Bug
    Components: Database Core
    Affects Versions: 0.11
    Reporter: Robert Newson
    Attachments: couchdb_597.patch
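The timeout policy described in Randall's comment (no timeout while rows are still queued locally; a timeout only once the consumer is waiting on an empty queue) can be sketched roughly as below. This is an illustration in Python, not the Erlang patch itself: the class `ChangesFeedTimeout`, its method names, and the explicit `now` argument are all hypothetical, and the real code uses the Erlang timer module rather than a stored deadline.

```python
from collections import deque


class ChangesFeedTimeout:
    """Hypothetical model of a manually handled inactivity timeout.

    The deadline is armed only when the consumer asks for a row and the
    local queue is empty; it is cleared as soon as data arrives.
    """

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.rows = deque()
        self.deadline = None  # None means "no timeout pending"

    def row_received(self, row):
        # Data arrived on the socket, so the connection is not idle.
        self.deadline = None
        self.rows.append(row)

    def next_row(self, now):
        # Serve from the local queue first; a slow consumer with a full
        # queue must never trigger a timeout.
        if self.rows:
            return self.rows.popleft()
        # Queue is empty and the consumer is waiting: arm the deadline once.
        if self.deadline is None:
            self.deadline = now + self.timeout_s
        return None

    def timed_out(self, now):
        return self.deadline is not None and now >= self.deadline
```

Passing the clock value in explicitly keeps the sketch deterministic; a real implementation would read a monotonic clock or schedule a timer message instead.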
[jira] Updated: (COUCHDB-597) Replication tasks crash.
[ https://issues.apache.org/jira/browse/COUCHDB-597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adam Kocoloski updated COUCHDB-597:

Hi Robert, I can reproduce the crashes locally and I've discovered why they happen independently of the {ref(), integer()} problem. The basic issue is that attachment downloads do not employ the same retry checks that we do for regular document GETs. For instance, the attachment receiver process associated with a replication would wait indefinitely for response headers, when in fact it had an error message in its mailbox informing it that the request had failed. Eventually the changes feed times out and the replication crashes.

If I apply http://friendpaste.com/5IA5MlRx0OZhKmsLNPMeJe, crank up the changes feed timeout, and add the catchall handle_infos we've talked about before, I can successfully run the script you posted here. We have more work to do, though, namely:

1) Reworking the changes feed timeout. Currently it will trigger if there is no activity for X milliseconds on the connection handling the _changes feed. There are situations where this is actually normal, since the changes feed consumer is responsible for controlling the socket, and if the target is _really_ slow (or the documents are huge) it's quite possible that the changes feed will not be consulted for a long time. I think the solution is to handle inactivity timeouts in couch_rep_changes_feed.erl instead of in the underlying ibrowse system.

2a) Attachment retry logic that handles redirects and limits the number of retries. Basically, the same code as we have in couch_rep_httpc, but only applied until we receive the response headers. My friendpaste above is a primitive form of what I'd ultimately like to see here.

2b) When an attachment body download has started and then fails, we can't simply retry it. We need to do a Range request or find another way to skip the first N bytes of the retry. Currently we just give up on the entire replication if an attachment request ever fails mid-download.

Replication tasks crash.

    Key: COUCHDB-597
    URL: https://issues.apache.org/jira/browse/COUCHDB-597
    Project: CouchDB
    Issue Type: Bug
    Components: Database Core
    Affects Versions: 0.11
    Reporter: Robert Newson
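Points 2a) and 2b) in Adam's comment (a bounded number of retries, and a Range request that skips the bytes already received) might look something like the following. This is a Python sketch under stated assumptions, not the couch_rep_httpc code: `fetch_with_resume`, `range_header`, and the `fetch` callback are hypothetical names, redirect handling is left out, and the sketch assumes the server honors the Range header instead of resending the whole body.

```python
def range_header(bytes_received):
    """Build an HTTP Range header asking for everything after the first N bytes."""
    return {"Range": f"bytes={bytes_received}-"}


def fetch_with_resume(fetch, max_retries=3):
    """Download a body, resuming after mid-download failures.

    `fetch(headers)` is a hypothetical callback that yields body chunks and
    may raise ConnectionError partway through.
    """
    body = bytearray()
    attempts = 0
    while True:
        # On the first attempt ask for the full body; on a retry, ask only
        # for the part that has not arrived yet.
        headers = range_header(len(body)) if body else {}
        try:
            for chunk in fetch(headers):
                body.extend(chunk)
            return bytes(body)
        except ConnectionError:
            attempts += 1
            if attempts > max_retries:
                raise  # give up after a bounded number of retries
```

Keeping the received bytes across attempts is what makes the retry safe: a plain re-request would append a second copy of the first N bytes.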
[jira] Updated: (COUCHDB-597) Replication tasks crash.
[ https://issues.apache.org/jira/browse/COUCHDB-597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adam Kocoloski updated COUCHDB-597:

Also, I don't really understand why the 'after' clause is necessary in that paste. I tried adding a connect_timeout to ibrowse but didn't get any conn_failed messages. It really does seem like a connection is made but then the request just stalls. I suppose it's possible that a connection took 9 seconds (e.g. 3 consecutive TCP retransmits), and then CouchDB took more than 1 second to respond with the headers. Seems unlikely, though. It makes me think we need to add this 'after' clause to couch_rep_httpc too.

Replication tasks crash.

    Key: COUCHDB-597
    URL: https://issues.apache.org/jira/browse/COUCHDB-597
    Project: CouchDB
    Issue Type: Bug
    Components: Database Core
    Affects Versions: 0.11
    Reporter: Robert Newson
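The role of the 'after' clause discussed above, a bounded wait for response headers instead of an indefinite one, can be illustrated with a Python analogue. This is only a sketch: `wait_for_headers`, the `("headers", ...)` / `("error", ...)` message shapes, and the use of `queue.Queue` as a stand-in for an Erlang process mailbox are all assumptions made for the example.

```python
import queue


def wait_for_headers(mailbox, timeout_s):
    """Wait a bounded time for response headers.

    Returns ("ok", headers) when headers arrive, and ("retry", reason) when
    an error message is found in the mailbox or nothing arrives in time.
    """
    try:
        kind, payload = mailbox.get(timeout=timeout_s)
    except queue.Empty:
        # Equivalent of the 'after' branch: the request stalled.
        return ("retry", "timeout")
    if kind == "headers":
        return ("ok", payload)
    # Any other message (for example a failed request) is surfaced to the
    # caller instead of being left unread in the mailbox.
    return ("retry", payload)
```

The caller would then decide whether to re-issue the request, ideally under the same retry limit used for regular document GETs.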