Re: Issues blocking 1.2.0 release

2012-02-07 Thread Noah Slater
Old thread is old. But the new code works. Thanks.

On Mon, Jan 23, 2012 at 4:45 PM, Filipe David Manana fdman...@apache.orgwrote:

 On Mon, Jan 23, 2012 at 5:54 AM, Noah Slater nsla...@tumbolia.org wrote:
  Yep, that seems reasonable. Let me know when I can test again. :)

 Done. Can you test again?



Re: Issues blocking 1.2.0 release

2012-01-23 Thread Filipe David Manana
On Mon, Jan 23, 2012 at 5:54 AM, Noah Slater nsla...@tumbolia.org wrote:
 Yep, that seems reasonable. Let me know when I can test again. :)

Done. Can you test again?


Re: Issues blocking 1.2.0 release

2012-01-22 Thread Filipe David Manana
Noah, does it fail occasionally or every time for you?

I'm assuming you're with a slow machine or the machine is a bit overloaded.
Can you try with the following patch?

diff --git a/test/etap/242-replication-many-leaves.t
b/test/etap/242-replication-many-leaves.t
index d8d3eb9..ad9d180 100755
--- a/test/etap/242-replication-many-leaves.t
+++ b/test/etap/242-replication-many-leaves.t
@@ -77,6 +77,7 @@ test() -
 couch_server_sup:start_link(test_util:config_files()),
 ibrowse:start(),
 crypto:start(),
+couch_config:set(replicator, worker_processes, 1, false),

 Pairs = [
 {source_db_name(), target_db_name()},



On Sun, Jan 22, 2012 at 5:41 PM, Noah Slater nsla...@tumbolia.org wrote:
 Hello,

 The following test fails:

 ./test/etap/run test/etap/242-replication-many-leaves.t


 Log is:

 ok 38  - Document revisions updated with 2 attachments
 # Triggering replication again
 [info] [0.3567.0] 127.0.0.1 - - HEAD /couch_test_rep_db_b/ 200
 [info] [0.3567.0] 127.0.0.1 - - GET /couch_test_rep_db_b/ 200
 [info] [0.3567.0] 127.0.0.1 - - GET
 /couch_test_rep_db_b/_local/4296cc7705ed9d0c0cf63593b42a10b7 200
 [info] [0.6460.0] Replication `4296cc7705ed9d0c0cf63593b42a10b7` is
 using:
 4 worker processes
 a worker batch size of 500
 20 HTTP connections
 a connection timeout of 3 milliseconds
 10 retries per request
 socket options are: [{keepalive,true},{nodelay,false}]
 source start sequence 6
 [info] [0.2.0] starting new replication
 `4296cc7705ed9d0c0cf63593b42a10b7` at 0.6460.0 (`couch_test_rep_db_a` - `
 http://127.0.0.1:53314/couch_test_rep_db_b/`)
 [info] [0.3573.0] 127.0.0.1 - - POST /couch_test_rep_db_b/_revs_diff 200
 [info] [0.3569.0] 127.0.0.1 - - POST /couch_test_rep_db_b/_revs_diff 200
 [info] [0.3573.0] 127.0.0.1 - - POST /couch_test_rep_db_b/_bulk_docs 201
 [info] [0.3567.0] 127.0.0.1 - - POST /couch_test_rep_db_b/_revs_diff 200
 [info] [0.3567.0] 127.0.0.1 - - POST /couch_test_rep_db_b/_bulk_docs 201
 [info] [0.6468.0] Retrying POST request to
 http://127.0.0.1:53314/couch_test_rep_db_b/_bulk_docs in 0.25 seconds due
 to error req_timedout
 [info] [0.3573.0] 127.0.0.1 - - POST /couch_test_rep_db_b/_bulk_docs 201
 [info] [0.6468.0] Retrying POST request to
 http://127.0.0.1:53314/couch_test_rep_db_b/_bulk_docs in 0.5 seconds due to
 error req_timedout
 [info] [0.3568.0] 127.0.0.1 - - POST /couch_test_rep_db_b/_bulk_docs 201
 [info] [0.6468.0] Retrying POST request to
 http://127.0.0.1:53314/couch_test_rep_db_b/_bulk_docs in 1.0 seconds due to
 error req_timedout
 [info] [0.3571.0] 127.0.0.1 - - POST /couch_test_rep_db_b/_bulk_docs 201
 [info] [0.6468.0] Retrying POST request to
 http://127.0.0.1:53314/couch_test_rep_db_b/_bulk_docs in 2.0 seconds due to
 error req_timedout
 [info] [0.6463.0] 127.0.0.1 - - POST /couch_test_rep_db_b/_bulk_docs 201
 [info] [0.6468.0] Retrying POST request to
 http://127.0.0.1:53314/couch_test_rep_db_b/_bulk_docs in 4.0 seconds due to
 error req_timedout
 [info] [0.8537.0] 127.0.0.1 - - POST /couch_test_rep_db_b/_bulk_docs 201
 [info] [0.6468.0] Retrying POST request to
 http://127.0.0.1:53314/couch_test_rep_db_b/_bulk_docs in 8.0 seconds due to
 error req_timedout
 [info] [0.9151.0] 127.0.0.1 - - POST /couch_test_rep_db_b/_bulk_docs 201
 [info] [0.6468.0] Retrying POST request to
 http://127.0.0.1:53314/couch_test_rep_db_b/_bulk_docs in 16.0 seconds due
 to error req_timedout
 [info] [0.9768.0] 127.0.0.1 - - POST /couch_test_rep_db_b/_bulk_docs 201
 [info] [0.6468.0] Retrying POST request to
 http://127.0.0.1:53314/couch_test_rep_db_b/_bulk_docs in 32.0 seconds due
 to error req_timedout
 [info] [0.10390.0] 127.0.0.1 - - POST /couch_test_rep_db_b/_bulk_docs 201
 Bail out! Timeout waiting for replication to finish
 escript: exception error: bad argument
  in function  etap:diag/1
  in call from erl_eval:do_apply/5
  in call from erl_eval:exprs/5
  in call from escript:eval_exprs/5
  in call from erl_eval:local_func/5
  in call from escript:interpret/4
  in call from escript:start/1
  in call from init:start_it/1


 Please advise.

 Thanks,

 N



-- 
Filipe David Manana,

Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men.


Re: Issues blocking 1.2.0 release

2012-01-22 Thread Noah Slater
On Sun, Jan 22, 2012 at 6:01 PM, Filipe David Manana fdman...@apache.orgwrote:

 Noah, does it fail occasionally or every time for you?


Fails every time.


 I'm assuming you're with a slow machine or the machine is a bit overloaded.


Shouldn't be, I'm not doing anything else right now, and this is a new MBA.


 Can you try with the following patch?


Yes. Will report back.


Re: Issues blocking 1.2.0 release

2012-01-22 Thread Noah Slater
No change, still fails.

On Sun, Jan 22, 2012 at 6:08 PM, Noah Slater nsla...@tumbolia.org wrote:


 On Sun, Jan 22, 2012 at 6:01 PM, Filipe David Manana 
 fdman...@apache.orgwrote:

 Noah, does it fail occasionally or every time for you?


 Fails every time.


 I'm assuming you're with a slow machine or the machine is a bit
 overloaded.


 Shouldn't be, I'm not doing anything else right now, and this is a new MBA.


 Can you try with the following patch?


 Yes. Will report back.




Re: Issues blocking 1.2.0 release

2012-01-22 Thread Filipe David Manana
On Sun, Jan 22, 2012 at 6:47 PM, Noah Slater nsla...@tumbolia.org wrote:
 No change, still fails.

Noah, to try to find out if it's due to slowness of the machine or
some other issue, do you think you can try to increase the following
timeout in the test?

diff --git a/test/etap/242-replication-many-leaves.t
b/test/etap/242-replication-many-leaves.t
index d8d3eb9..737cd31 100755
--- a/test/etap/242-replication-many-leaves.t
+++ b/test/etap/242-replication-many-leaves.t
@@ -287,6 +287,6 @@ replicate(Source, Target) -
 receive
 {'DOWN', MonRef, process, Pid, Reason} -
 etap:is(Reason, normal, Replication finished successfully)
-after 30 -
+after 90 -
 etap:bail(Timeout waiting for replication to finish)
 end.


 On Sun, Jan 22, 2012 at 6:08 PM, Noah Slater nsla...@tumbolia.org wrote:


 On Sun, Jan 22, 2012 at 6:01 PM, Filipe David Manana 
 fdman...@apache.orgwrote:

 Noah, does it fail occasionally or every time for you?


 Fails every time.


 I'm assuming you're with a slow machine or the machine is a bit
 overloaded.


 Shouldn't be, I'm not doing anything else right now, and this is a new MBA.


 Can you try with the following patch?


 Yes. Will report back.





-- 
Filipe David Manana,

Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men.


Re: Issues blocking 1.2.0 release

2012-01-22 Thread Noah Slater
OVAR 9000! (Testing now...)

On Sun, Jan 22, 2012 at 6:56 PM, Filipe David Manana fdman...@apache.orgwrote:

 On Sun, Jan 22, 2012 at 6:47 PM, Noah Slater nsla...@tumbolia.org wrote:
  No change, still fails.

 Noah, to try to find out if it's due to slowness of the machine or
 some other issue, do you think you can try to increase the following
 timeout in the test?

 diff --git a/test/etap/242-replication-many-leaves.t
 b/test/etap/242-replication-many-leaves.t
 index d8d3eb9..737cd31 100755
 --- a/test/etap/242-replication-many-leaves.t
 +++ b/test/etap/242-replication-many-leaves.t
 @@ -287,6 +287,6 @@ replicate(Source, Target) -
 receive
 {'DOWN', MonRef, process, Pid, Reason} -
 etap:is(Reason, normal, Replication finished successfully)
 -after 30 -
 +after 90 -
 etap:bail(Timeout waiting for replication to finish)
 end.

 
  On Sun, Jan 22, 2012 at 6:08 PM, Noah Slater nsla...@tumbolia.org
 wrote:
 
 
  On Sun, Jan 22, 2012 at 6:01 PM, Filipe David Manana 
 fdman...@apache.orgwrote:
 
  Noah, does it fail occasionally or every time for you?
 
 
  Fails every time.
 
 
  I'm assuming you're with a slow machine or the machine is a bit
  overloaded.
 
 
  Shouldn't be, I'm not doing anything else right now, and this is a new
 MBA.
 
 
  Can you try with the following patch?
 
 
  Yes. Will report back.
 
 



 --
 Filipe David Manana,

 Reasonable men adapt themselves to the world.
  Unreasonable men adapt the world to themselves.
  That's why all progress depends on unreasonable men.



Re: Issues blocking 1.2.0 release

2012-01-22 Thread Filipe David Manana
On Sun, Jan 22, 2012 at 7:20 PM, Noah Slater nsla...@tumbolia.org wrote:
 Works. How do we proceed?

For how much time does the test runs? On 2 different physical
machines, it takes about 1 minute and 10 seconds for me.

Perhaps some manual replication tests could confirm if there's
something wrong with the codebase, your environment or if simply
increasing the timeout is not alarming.


 On Sun, Jan 22, 2012 at 7:05 PM, Noah Slater nsla...@tumbolia.org wrote:

 OVAR 9000! (Testing now...)


 On Sun, Jan 22, 2012 at 6:56 PM, Filipe David Manana 
 fdman...@apache.orgwrote:

 On Sun, Jan 22, 2012 at 6:47 PM, Noah Slater nsla...@tumbolia.org
 wrote:
  No change, still fails.

 Noah, to try to find out if it's due to slowness of the machine or
 some other issue, do you think you can try to increase the following
 timeout in the test?

 diff --git a/test/etap/242-replication-many-leaves.t
 b/test/etap/242-replication-many-leaves.t
 index d8d3eb9..737cd31 100755
 --- a/test/etap/242-replication-many-leaves.t
 +++ b/test/etap/242-replication-many-leaves.t
 @@ -287,6 +287,6 @@ replicate(Source, Target) -
     receive
     {'DOWN', MonRef, process, Pid, Reason} -
         etap:is(Reason, normal, Replication finished successfully)
 -    after 30 -
 +    after 90 -
         etap:bail(Timeout waiting for replication to finish)
     end.

 
  On Sun, Jan 22, 2012 at 6:08 PM, Noah Slater nsla...@tumbolia.org
 wrote:
 
 
  On Sun, Jan 22, 2012 at 6:01 PM, Filipe David Manana 
 fdman...@apache.orgwrote:
 
  Noah, does it fail occasionally or every time for you?
 
 
  Fails every time.
 
 
  I'm assuming you're with a slow machine or the machine is a bit
  overloaded.
 
 
  Shouldn't be, I'm not doing anything else right now, and this is a new
 MBA.
 
 
  Can you try with the following patch?
 
 
  Yes. Will report back.
 
 



 --
 Filipe David Manana,

 Reasonable men adapt themselves to the world.
  Unreasonable men adapt the world to themselves.
  That's why all progress depends on unreasonable men.






-- 
Filipe David Manana,

Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men.


Re: Issues blocking 1.2.0 release

2012-01-22 Thread Filipe David Manana
Noah, was able to reproduce your issue by tweaking the test to create
more leaf revisions for a document:

diff --git a/test/etap/242-replication-many-leaves.t
b/test/etap/242-replication-many-leaves.t
index d8d3eb9..4eb4765 100755
--- a/test/etap/242-replication-many-leaves.t
+++ b/test/etap/242-replication-many-leaves.t
@@ -56,7 +56,7 @@ doc_ids() -

 doc_num_conflicts(doc1) - 10;
 doc_num_conflicts(doc2) - 100;
-doc_num_conflicts(doc3) - 286.
+doc_num_conflicts(doc3) - 500.


 main(_) -


With that change, I get exactly the same timeout as you get when the
test runs the push replication. It turns out that some _bulk_docs
requests are taking more than 30 seconds (default replication
connection timeout) therefore the replication request retry messages.
Verified this by timing the _bulk_docs handler to log the time it
takes:

diff --git a/src/couchdb/couch_httpd_db.erl b/src/couchdb/couch_httpd_db.erl
index d7ecb4a..442571d 100644
--- a/src/couchdb/couch_httpd_db.erl
+++ b/src/couchdb/couch_httpd_db.erl
@@ -297,6 +297,7 @@
db_req(#httpd{path_parts=[_,_ensure_full_commit]}=Req, _Db) -
 send_method_not_allowed(Req, POST);

 db_req(#httpd{method='POST',path_parts=[_,_bulk_docs]}=Req, Db) -
+T0 = now(),
 couch_stats_collector:increment({httpd, bulk_requests}),
 couch_httpd:validate_ctype(Req, application/json),
 {JsonProps} = couch_httpd:json_body_obj(Req),
@@ -357,7 +358,9 @@
db_req(#httpd{method='POST',path_parts=[_,_bulk_docs]}=Req, Db)
-
 {ok, Errors} = couch_db:update_docs(Db, Docs, Options,
replicated_changes),
 ErrorsJson =
 lists:map(fun update_doc_result_to_json/1, Errors),
-send_json(Req, 201, ErrorsJson)
+Rr = send_json(Req, 201, ErrorsJson),
+?LOG_ERROR(BULK DOCS took ~p ms~n,
[timer:now_diff(now(), T0) / 1000]),
+Rr
 end
 end;
 db_req(#httpd{path_parts=[_,_bulk_docs]}=Req, _Db) -


I was getting _bulk_docs response times after 50 seconds.

This convinces me there's nothing wrong with the codebase, the
timeouts just needs to be increased:

diff --git a/test/etap/242-replication-many-leaves.t
b/test/etap/242-replication-many-leaves.t
index d8d3eb9..6508112 100755
--- a/test/etap/242-replication-many-leaves.t
+++ b/test/etap/242-replication-many-leaves.t
@@ -77,6 +77,7 @@ test() -
 couch_server_sup:start_link(test_util:config_files()),
 ibrowse:start(),
 crypto:start(),
+couch_config:set(replicator, connection_timeout, 9, false),

 Pairs = [
 {source_db_name(), target_db_name()},
@@ -287,6 +288,6 @@ replicate(Source, Target) -
 receive
 {'DOWN', MonRef, process, Pid, Reason} -
 etap:is(Reason, normal, Replication finished successfully)
-after 30 -
+after 90 -
 etap:bail(Timeout waiting for replication to finish)
 end.

Alternatively the test can be updated to create less revisions for the
document doc3. The current revisions # is 286 but for the tests'
purpose 205+ is enough, which should make it faster - 7000 (max url
length) / length(DocRevision) = 205

If it's ok for you, updating the timeouts plus reducing the # from 286
to 210 is fine for me.



On Mon, Jan 23, 2012 at 12:00 AM, Noah Slater nsla...@tumbolia.org wrote:
 I'm just the dumb QA guy.

 If you have some diagnostics you want me to run on my machine, I am happy
 to.

 On Sun, Jan 22, 2012 at 11:31 PM, Filipe David Manana
 fdman...@apache.orgwrote:

 On Sun, Jan 22, 2012 at 7:20 PM, Noah Slater nsla...@tumbolia.org wrote:
  Works. How do we proceed?

 For how much time does the test runs? On 2 different physical
 machines, it takes about 1 minute and 10 seconds for me.

 Perhaps some manual replication tests could confirm if there's
 something wrong with the codebase, your environment or if simply
 increasing the timeout is not alarming.

 
  On Sun, Jan 22, 2012 at 7:05 PM, Noah Slater nsla...@tumbolia.org
 wrote:
 
  OVAR 9000! (Testing now...)
 
 
  On Sun, Jan 22, 2012 at 6:56 PM, Filipe David Manana 
 fdman...@apache.orgwrote:
 
  On Sun, Jan 22, 2012 at 6:47 PM, Noah Slater nsla...@tumbolia.org
  wrote:
   No change, still fails.
 
  Noah, to try to find out if it's due to slowness of the machine or
  some other issue, do you think you can try to increase the following
  timeout in the test?
 
  diff --git a/test/etap/242-replication-many-leaves.t
  b/test/etap/242-replication-many-leaves.t
  index d8d3eb9..737cd31 100755
  --- a/test/etap/242-replication-many-leaves.t
  +++ b/test/etap/242-replication-many-leaves.t
  @@ -287,6 +287,6 @@ replicate(Source, Target) -
      receive
      {'DOWN', MonRef, process, Pid, Reason} -
          etap:is(Reason, normal, Replication finished successfully)
  -    after 30 -
  +    after 90 -
          etap:bail(Timeout waiting for replication to finish)
      end.
 
  
   On Sun, Jan 22, 2012 at 6:08 PM, Noah Slater nsla...@tumbolia.org
  wrote:
  
  
   On Sun, Jan 22, 2012 at 6:01 PM, Filipe 

Re: Issues blocking 1.2.0 release

2012-01-22 Thread Noah Slater
Yep, that seems reasonable. Let me know when I can test again. :)

On Mon, Jan 23, 2012 at 1:34 AM, Filipe David Manana fdman...@apache.orgwrote:

 Noah, was able to reproduce your issue by tweaking the test to create
 more leaf revisions for a document:

 diff --git a/test/etap/242-replication-many-leaves.t
 b/test/etap/242-replication-many-leaves.t
 index d8d3eb9..4eb4765 100755
 --- a/test/etap/242-replication-many-leaves.t
 +++ b/test/etap/242-replication-many-leaves.t
 @@ -56,7 +56,7 @@ doc_ids() -

  doc_num_conflicts(doc1) - 10;
  doc_num_conflicts(doc2) - 100;
 -doc_num_conflicts(doc3) - 286.
 +doc_num_conflicts(doc3) - 500.


  main(_) -


 With that change, I get exactly the same timeout as you get when the
 test runs the push replication. It turns out that some _bulk_docs
 requests are taking more than 30 seconds (default replication
 connection timeout) therefore the replication request retry messages.
 Verified this by timing the _bulk_docs handler to log the time it
 takes:

 diff --git a/src/couchdb/couch_httpd_db.erl
 b/src/couchdb/couch_httpd_db.erl
 index d7ecb4a..442571d 100644
 --- a/src/couchdb/couch_httpd_db.erl
 +++ b/src/couchdb/couch_httpd_db.erl
 @@ -297,6 +297,7 @@
 db_req(#httpd{path_parts=[_,_ensure_full_commit]}=Req, _Db) -
 send_method_not_allowed(Req, POST);

  db_req(#httpd{method='POST',path_parts=[_,_bulk_docs]}=Req, Db) -
 +T0 = now(),
 couch_stats_collector:increment({httpd, bulk_requests}),
 couch_httpd:validate_ctype(Req, application/json),
 {JsonProps} = couch_httpd:json_body_obj(Req),
 @@ -357,7 +358,9 @@
 db_req(#httpd{method='POST',path_parts=[_,_bulk_docs]}=Req, Db)
 -
 {ok, Errors} = couch_db:update_docs(Db, Docs, Options,
 replicated_changes),
 ErrorsJson =
 lists:map(fun update_doc_result_to_json/1, Errors),
 -send_json(Req, 201, ErrorsJson)
 +Rr = send_json(Req, 201, ErrorsJson),
 +?LOG_ERROR(BULK DOCS took ~p ms~n,
 [timer:now_diff(now(), T0) / 1000]),
 +Rr
 end
 end;
  db_req(#httpd{path_parts=[_,_bulk_docs]}=Req, _Db) -


 I was getting _bulk_docs response times after 50 seconds.

 This convinces me there's nothing wrong with the codebase, the
 timeouts just needs to be increased:

 diff --git a/test/etap/242-replication-many-leaves.t
 b/test/etap/242-replication-many-leaves.t
 index d8d3eb9..6508112 100755
 --- a/test/etap/242-replication-many-leaves.t
 +++ b/test/etap/242-replication-many-leaves.t
 @@ -77,6 +77,7 @@ test() -
 couch_server_sup:start_link(test_util:config_files()),
 ibrowse:start(),
 crypto:start(),
 +couch_config:set(replicator, connection_timeout, 9, false),

 Pairs = [
 {source_db_name(), target_db_name()},
 @@ -287,6 +288,6 @@ replicate(Source, Target) -
  receive
 {'DOWN', MonRef, process, Pid, Reason} -
 etap:is(Reason, normal, Replication finished successfully)
 -after 30 -
 +after 90 -
 etap:bail(Timeout waiting for replication to finish)
 end.

 Alternatively the test can be updated to create less revisions for the
 document doc3. The current revisions # is 286 but for the tests'
 purpose 205+ is enough, which should make it faster - 7000 (max url
 length) / length(DocRevision) = 205

 If it's ok for you, updating the timeouts plus reducing the # from 286
 to 210 is fine for me.



 On Mon, Jan 23, 2012 at 12:00 AM, Noah Slater nsla...@tumbolia.org
 wrote:
  I'm just the dumb QA guy.
 
  If you have some diagnostics you want me to run on my machine, I am happy
  to.
 
  On Sun, Jan 22, 2012 at 11:31 PM, Filipe David Manana
  fdman...@apache.orgwrote:
 
  On Sun, Jan 22, 2012 at 7:20 PM, Noah Slater nsla...@tumbolia.org
 wrote:
   Works. How do we proceed?
 
  For how much time does the test runs? On 2 different physical
  machines, it takes about 1 minute and 10 seconds for me.
 
  Perhaps some manual replication tests could confirm if there's
  something wrong with the codebase, your environment or if simply
  increasing the timeout is not alarming.
 
  
   On Sun, Jan 22, 2012 at 7:05 PM, Noah Slater nsla...@tumbolia.org
  wrote:
  
   OVAR 9000! (Testing now...)
  
  
   On Sun, Jan 22, 2012 at 6:56 PM, Filipe David Manana 
  fdman...@apache.orgwrote:
  
   On Sun, Jan 22, 2012 at 6:47 PM, Noah Slater nsla...@tumbolia.org
   wrote:
No change, still fails.
  
   Noah, to try to find out if it's due to slowness of the machine or
   some other issue, do you think you can try to increase the following
   timeout in the test?
  
   diff --git a/test/etap/242-replication-many-leaves.t
   b/test/etap/242-replication-many-leaves.t
   index d8d3eb9..737cd31 100755
   --- a/test/etap/242-replication-many-leaves.t
   +++ b/test/etap/242-replication-many-leaves.t
   @@ -287,6 +287,6 @@ replicate(Source, Target) -
   receive
   {'DOWN', MonRef, process, Pid, Reason} -
   etap:is(Reason, normal, Replication finished