[jira] [Updated] (COUCHDB-597) Replication tasks crash.

2012-01-05 Thread Jan Lehnardt (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/COUCHDB-597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Lehnardt updated COUCHDB-597:
-

Fix Version/s: 1.3 (was: 1.2)

Bump to 1.3.x

 Replication tasks crash.
 

 Key: COUCHDB-597
 URL: https://issues.apache.org/jira/browse/COUCHDB-597
 Project: CouchDB
  Issue Type: Bug
  Components: Database Core
Affects Versions: 0.11
Reporter: Robert Newson
 Fix For: 1.3

 Attachments: 0001-Cleanup-597-fixes.patch, 
 0001-changes-replication-timeouts-and-att.-fixes-COUCHDB-.patch, 
 597_fixes.patch, couchdb_597.patch


 If I kick off 10 replication tasks in quick succession, occasionally one or 
 two of the replication tasks will die and not be resumed. It seems that the 
 stat tracking is a little buggy, and under stress can eventually cause a 
 permanent failure of the supervised replication task;
 [Fri, 11 Dec 2009 19:00:08 GMT] [error] [0.80.0] {error_report,0.30.0,
 {0.80.0,supervisor_report,
  [{supervisor,{local,couch_rep_sup}},
   {errorContext,shutdown_error},
   {reason,killed},
   {offender,
   [{pid,0.6700.11},
{name,fcbb13200a1618cf983b347f4d2c9835+create_target},
{mfa,
{gen_server,start_link,
[couch_rep,
 [fcbb13200a1618cf983b347f4d2c9835,
  {[{create_target,true},
{source,http://node:5984/perf-p2},
{target,perf-p2}]},
  {user_ctx,null,[_admin]}],
 []]}},
{restart_type,temporary},
{shutdown,1},
{child_type,worker}]}]}}
 [Fri, 11 Dec 2009 19:00:08 GMT] [error] [emulator] Error in process 
 0.6705.11 with exit value: 
 {badarg,[{ets,insert,[stats_hit_table,{{couchdb,open_os_files},-1}]},{couch_stats_collector,decrement,1}]}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Updated: (COUCHDB-597) Replication tasks crash.

2010-10-09 Thread Paul Joseph Davis (JIRA)

 [ 
https://issues.apache.org/jira/browse/COUCHDB-597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Joseph Davis updated COUCHDB-597:
--

Skill Level: Regular Contributors Level (Easy to Medium)

 Replication tasks crash.
 

 Key: COUCHDB-597
 URL: https://issues.apache.org/jira/browse/COUCHDB-597
 Project: CouchDB
  Issue Type: Bug
  Components: Database Core
Affects Versions: 0.11
Reporter: Robert Newson
 Fix For: 0.12

 Attachments: 
 0001-changes-replication-timeouts-and-att.-fixes-COUCHDB-.patch, 
 0001-Cleanup-597-fixes.patch, 597_fixes.patch, couchdb_597.patch


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (COUCHDB-597) Replication tasks crash.

2010-02-27 Thread Randall Leeds (JIRA)

 [ 
https://issues.apache.org/jira/browse/COUCHDB-597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Randall Leeds updated COUCHDB-597:
--

Attachment: 597_fixes.patch

Corrects problems with continuous replication timeouts introduced by r916518 
and r916868.

 Replication tasks crash.
 

 Key: COUCHDB-597
 URL: https://issues.apache.org/jira/browse/COUCHDB-597
 Project: CouchDB
  Issue Type: Bug
  Components: Database Core
Affects Versions: 0.11
Reporter: Robert Newson
 Fix For: 0.11

 Attachments: 
 0001-changes-replication-timeouts-and-att.-fixes-COUCHDB-.patch, 
 597_fixes.patch, couchdb_597.patch





[jira] Updated: (COUCHDB-597) Replication tasks crash.

2010-02-26 Thread Randall Leeds (JIRA)

 [ 
https://issues.apache.org/jira/browse/COUCHDB-597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Randall Leeds updated COUCHDB-597:
--

Attachment: 0001-changes-replication-timeouts-and-att.-fixes-COUCHDB-.patch

I went back and made this patch as simple and straightforward as possible.
It applies to the current trunk and should take no more than a minute to
review.

 Replication tasks crash.
 

 Key: COUCHDB-597
 URL: https://issues.apache.org/jira/browse/COUCHDB-597
 Project: CouchDB
  Issue Type: Bug
  Components: Database Core
Affects Versions: 0.11
Reporter: Robert Newson
 Attachments: couchdb_597.patch





[jira] Updated: (COUCHDB-597) Replication tasks crash.

2010-02-26 Thread Randall Leeds (JIRA)

 [ 
https://issues.apache.org/jira/browse/COUCHDB-597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Randall Leeds updated COUCHDB-597:
--

Attachment: (was: 0001-changes-replication-timeouts-and-att.-fixes-COUCHDB-.patch)

 Replication tasks crash.
 

 Key: COUCHDB-597
 URL: https://issues.apache.org/jira/browse/COUCHDB-597
 Project: CouchDB
  Issue Type: Bug
  Components: Database Core
Affects Versions: 0.11
Reporter: Robert Newson
 Attachments: couchdb_597.patch





[jira] Updated: (COUCHDB-597) Replication tasks crash.

2010-02-26 Thread Randall Leeds (JIRA)

 [ 
https://issues.apache.org/jira/browse/COUCHDB-597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Randall Leeds updated COUCHDB-597:
--

Attachment: 0001-changes-replication-timeouts-and-att.-fixes-COUCHDB-.patch

Forgot to check the inclusion box.

 Replication tasks crash.
 

 Key: COUCHDB-597
 URL: https://issues.apache.org/jira/browse/COUCHDB-597
 Project: CouchDB
  Issue Type: Bug
  Components: Database Core
Affects Versions: 0.11
Reporter: Robert Newson
 Attachments: 
 0001-changes-replication-timeouts-and-att.-fixes-COUCHDB-.patch, 
 couchdb_597.patch





[jira] Updated: (COUCHDB-597) Replication tasks crash.

2010-02-25 Thread Randall Leeds (JIRA)

 [ 
https://issues.apache.org/jira/browse/COUCHDB-597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Randall Leeds updated COUCHDB-597:
--

Attachment: couchdb_597.patch

I believe this patch fixes most of the problems we're seeing here.

The solution, as discussed, is to remove the inactivity_timeout from options 
passed to ibrowse and handle timeouts manually (here using the timer module).

In my testing, I could mostly reproduce timeouts caused by not reading data 
from ibrowse fast enough. In other words, replicating from a remote database 
was terminating because processing the changes was taking a long time to 
complete and the socket would be inactive while couch_rep_changes_feed had a 
full queue of rows. Therefore, a timeout is not set unless the missing revs 
server is waiting for more changes.

Timeouts should still occur if the socket is idle and the local queue of 
received changes is empty. Errors should be caught appropriately so that real 
problems still bubble up.
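
The timeout rule described above can be sketched outside Erlang. This is a toy
Python model and not the code in the patch: the class and method names
(ChangesBuffer, on_row, next_change, timed_out) and the injected clock are
assumptions made for illustration. The inactivity deadline is armed only when a
consumer asks for a row and the local queue is empty, and it is cleared as soon
as data arrives.

```python
import time
from collections import deque


class ChangesBuffer:
    """Toy model of a changes-feed buffer with a manually managed timeout."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock      # injectable so the behaviour can be tested
        self.rows = deque()
        self.deadline = None    # None means no inactivity timer is armed

    def on_row(self, row):
        # Data arrived from the socket: queue it and cancel any armed timer.
        self.rows.append(row)
        self.deadline = None

    def next_change(self):
        # The consumer wants a row. Hand one over if we have it.
        if self.rows:
            return self.rows.popleft()
        # Queue is empty and someone is waiting: only now start the countdown.
        if self.deadline is None:
            self.deadline = self.clock() + self.timeout
        return None

    def timed_out(self):
        # True only if a timer was armed and has expired.
        return self.deadline is not None and self.clock() >= self.deadline
```

With this arrangement a slow consumer that leaves rows queued never trips the
timeout, while an idle socket with an empty queue still does.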

I implemented retry logic for attachments in a manner similar to 
couch_rep_httpc. I had to add some after clauses now that the 
inactivity_timeout is not set.

The patch applies cleanly to trunk and 0.11.x, so please review! I think this 
would be a very good patch to get into 0.11, so long as Noah hasn't built the 
artifacts yet.

 Replication tasks crash.
 

 Key: COUCHDB-597
 URL: https://issues.apache.org/jira/browse/COUCHDB-597
 Project: CouchDB
  Issue Type: Bug
  Components: Database Core
Affects Versions: 0.11
Reporter: Robert Newson
 Attachments: couchdb_597.patch





[jira] Updated: (COUCHDB-597) Replication tasks crash.

2009-12-19 Thread Adam Kocoloski (JIRA)

 [ 
https://issues.apache.org/jira/browse/COUCHDB-597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Kocoloski updated COUCHDB-597:
---


Hi Robert, I can reproduce the crashes locally and I've discovered why they 
happen independently of the {ref(), integer()} problem.  The basic issue is 
that attachment downloads do not employ the same retry checks that we use for 
regular document GETs.  For instance, the attachment receiver process 
associated with a replication would wait indefinitely for response headers 
when in fact it had an error message in its mailbox informing it that the 
request had failed.  Eventually the changes feed times out and the 
replication crashes.
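
The failure mode described above can be modelled outside Erlang. The sketch
below is a toy Python model, not CouchDB code: the mailbox is a plain list, and
the message shapes ("headers", req_id, status, headers) and
("error", req_id, reason) are assumptions made for illustration. It shows a
receiver that matches error messages as well as response headers, so a failed
request is noticed instead of being waited on forever.

```python
def wait_for_headers(mailbox, req_id):
    """Scan a mailbox for the first message that settles request req_id.

    Returns ("ok", status, headers), ("error", reason), or None if nothing
    relevant has arrived yet. The message shapes are hypothetical.
    """
    for i, msg in enumerate(mailbox):
        kind, msg_req = msg[0], msg[1]
        if msg_req != req_id:
            continue  # belongs to another request; leave it queued
        if kind == "headers":
            del mailbox[i]
            return ("ok", msg[2], msg[3])
        if kind == "error":
            # The case a headers-only receiver would never match.
            del mailbox[i]
            return ("error", msg[2])
    return None


mailbox = [("headers", 1, 200, {}), ("error", 2, "conn_failed")]
print(wait_for_headers(mailbox, 2))  # the error is reported, not ignored
```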

If I apply http://friendpaste.com/5IA5MlRx0OZhKmsLNPMeJe, crank up the changes 
feed timeout, and add the catchall handle_infos we've talked about before, I 
can successfully run the script you posted here.  We have more work to do, 
though, namely:

1) Reworking the changes feed timeout.  Currently it will trigger if there is 
no activity for X milliseconds on the connection handling the _changes feed.  
There are situations where this is actually normal, since the changes feed 
consumer is responsible for controlling the socket, and if the target is 
_really_ slow (or the documents are huge) it's quite possible that the changes 
feed will not be consulted for a long time.  I think the solution is to handle 
inactivity timeouts in couch_rep_changes_feed.erl instead of in the underlying 
ibrowse system.

2a) Attachment retry logic that handles redirects and limits the number of 
retries.  Basically, the same code as we have in couch_rep_httpc, but only 
applied until we receive the response headers.  My friendpaste above is a 
primitive form of what I'd ultimately like to see here.

2b) When an attachment body download has started and then fails, we can't 
simply retry it.  We need to do a Range request or find another way to skip the 
first N bytes of the retry.  Currently we just give up on the entire 
replication if an attachment request ever fails mid-download.
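
Items 2a and 2b can be sketched together. This is a minimal Python
illustration, not the friendpaste patch or couch_rep_httpc: the function name
open_attachment, the injected send(url, headers) transport, and the default
limits are all assumptions. It retries until response headers arrive, follows
redirects, caps both counts, and sends a standard HTTP Range header
("bytes=N-") when resuming after N bytes have already been received.

```python
def open_attachment(send, url, max_retries=3, max_redirects=5, offset=0):
    """Retry until response headers arrive; follow redirects; cap attempts.

    `send(url, headers)` is a hypothetical transport that returns
    (status, response_headers) or raises IOError when the request fails.
    """
    retries = redirects = 0
    while True:
        # When resuming mid-download, ask only for the bytes still missing.
        headers = {"Range": "bytes=%d-" % offset} if offset else {}
        try:
            status, resp_headers = send(url, headers)
        except IOError:
            retries += 1
            if retries > max_retries:
                raise  # give up after the configured number of retries
            continue
        if status in (301, 302, 303, 307):
            redirects += 1
            if redirects > max_redirects:
                raise IOError("too many redirects")
            url = resp_headers["Location"]
            continue
        return url, status, resp_headers
```

A caller resuming a download would still need to check that the server
answered with 206 Partial Content; a server that ignores Range and returns 200
sends the whole body again, and the first N bytes would have to be discarded.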

 Replication tasks crash.
 

 Key: COUCHDB-597
 URL: https://issues.apache.org/jira/browse/COUCHDB-597
 Project: CouchDB
  Issue Type: Bug
  Components: Database Core
Affects Versions: 0.11
Reporter: Robert Newson




[jira] Updated: (COUCHDB-597) Replication tasks crash.

2009-12-19 Thread Adam Kocoloski (JIRA)

 [ 
https://issues.apache.org/jira/browse/COUCHDB-597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Kocoloski updated COUCHDB-597:
---


Also, I don't really understand why the after clause is necessary in that 
paste.  I tried adding a connect_timeout to ibrowse but didn't get any 
conn_failed messages.  It really does seem like a connection is made but then 
the request just stalls.  I suppose it's possible that a connection took 9 
seconds (e.g. 3 consecutive TCP retransmits), and then CouchDB took more than 1 
second to respond with the headers.  Seems unlikely, though.  It makes me think 
we need to add this 'after' clause to couch_rep_httpc too.
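
For readers unfamiliar with the construct, an Erlang receive expression can
carry an after clause that bounds how long the process waits for a matching
message. A rough Python analogue, using the standard queue module rather than
anything from CouchDB or ibrowse, is shown below; the function name and the
("error", "timeout") return value are assumptions for illustration.

```python
import queue


def receive_headers(mailbox, timeout_s):
    """Wait for one message, but give up after timeout_s seconds.

    Plays the role of an Erlang `receive ... after Timeout -> ...` clause.
    """
    try:
        return mailbox.get(timeout=timeout_s)
    except queue.Empty:
        # Nothing arrived in time: report a timeout instead of blocking.
        return ("error", "timeout")


stalled = queue.Queue()  # connection made, but nothing ever arrives
print(receive_headers(stalled, 0.05))
```

Without the timeout, a request that stalls after the connection is made would
leave the receiver blocked indefinitely, which matches the symptom discussed
in this comment.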

 Replication tasks crash.
 

 Key: COUCHDB-597
 URL: https://issues.apache.org/jira/browse/COUCHDB-597
 Project: CouchDB
  Issue Type: Bug
  Components: Database Core
Affects Versions: 0.11
Reporter: Robert Newson
