Re: Issues blocking 1.2.0 release
Old thread is old. But the new code works. Thanks. On Mon, Jan 23, 2012 at 4:45 PM, Filipe David Manana fdman...@apache.orgwrote: On Mon, Jan 23, 2012 at 5:54 AM, Noah Slater nsla...@tumbolia.org wrote: Yep, that seems reasonable. Let me know when I can test again. :) Done. Can you test again?
Re: Issues blocking 1.2.0 release
On Mon, Jan 23, 2012 at 5:54 AM, Noah Slater nsla...@tumbolia.org wrote: Yep, that seems reasonable. Let me know when I can test again. :) Done. Can you test again?
Re: Issues blocking 1.2.0 release
Noah, does it fail occasionally or every time for you? I'm assuming you're with a slow machine or the machine is a bit overloaded. Can you try with the following patch? diff --git a/test/etap/242-replication-many-leaves.t b/test/etap/242-replication-many-leaves.t index d8d3eb9..ad9d180 100755 --- a/test/etap/242-replication-many-leaves.t +++ b/test/etap/242-replication-many-leaves.t @@ -77,6 +77,7 @@ test() - couch_server_sup:start_link(test_util:config_files()), ibrowse:start(), crypto:start(), +couch_config:set(replicator, worker_processes, 1, false), Pairs = [ {source_db_name(), target_db_name()}, On Sun, Jan 22, 2012 at 5:41 PM, Noah Slater nsla...@tumbolia.org wrote: Hello, The following test fails: ./test/etap/run test/etap/242-replication-many-leaves.t Log is: ok 38 - Document revisions updated with 2 attachments # Triggering replication again [info] [0.3567.0] 127.0.0.1 - - HEAD /couch_test_rep_db_b/ 200 [info] [0.3567.0] 127.0.0.1 - - GET /couch_test_rep_db_b/ 200 [info] [0.3567.0] 127.0.0.1 - - GET /couch_test_rep_db_b/_local/4296cc7705ed9d0c0cf63593b42a10b7 200 [info] [0.6460.0] Replication `4296cc7705ed9d0c0cf63593b42a10b7` is using: 4 worker processes a worker batch size of 500 20 HTTP connections a connection timeout of 3 milliseconds 10 retries per request socket options are: [{keepalive,true},{nodelay,false}] source start sequence 6 [info] [0.2.0] starting new replication `4296cc7705ed9d0c0cf63593b42a10b7` at 0.6460.0 (`couch_test_rep_db_a` - ` http://127.0.0.1:53314/couch_test_rep_db_b/`) [info] [0.3573.0] 127.0.0.1 - - POST /couch_test_rep_db_b/_revs_diff 200 [info] [0.3569.0] 127.0.0.1 - - POST /couch_test_rep_db_b/_revs_diff 200 [info] [0.3573.0] 127.0.0.1 - - POST /couch_test_rep_db_b/_bulk_docs 201 [info] [0.3567.0] 127.0.0.1 - - POST /couch_test_rep_db_b/_revs_diff 200 [info] [0.3567.0] 127.0.0.1 - - POST /couch_test_rep_db_b/_bulk_docs 201 [info] [0.6468.0] Retrying POST request to http://127.0.0.1:53314/couch_test_rep_db_b/_bulk_docs in 0.25 seconds due to error req_timedout [info] [0.3573.0] 127.0.0.1 - - POST /couch_test_rep_db_b/_bulk_docs 201 [info] [0.6468.0] Retrying POST request to http://127.0.0.1:53314/couch_test_rep_db_b/_bulk_docs in 0.5 seconds due to error req_timedout [info] [0.3568.0] 127.0.0.1 - - POST /couch_test_rep_db_b/_bulk_docs 201 [info] [0.6468.0] Retrying POST request to http://127.0.0.1:53314/couch_test_rep_db_b/_bulk_docs in 1.0 seconds due to error req_timedout [info] [0.3571.0] 127.0.0.1 - - POST /couch_test_rep_db_b/_bulk_docs 201 [info] [0.6468.0] Retrying POST request to http://127.0.0.1:53314/couch_test_rep_db_b/_bulk_docs in 2.0 seconds due to error req_timedout [info] [0.6463.0] 127.0.0.1 - - POST /couch_test_rep_db_b/_bulk_docs 201 [info] [0.6468.0] Retrying POST request to http://127.0.0.1:53314/couch_test_rep_db_b/_bulk_docs in 4.0 seconds due to error req_timedout [info] [0.8537.0] 127.0.0.1 - - POST /couch_test_rep_db_b/_bulk_docs 201 [info] [0.6468.0] Retrying POST request to http://127.0.0.1:53314/couch_test_rep_db_b/_bulk_docs in 8.0 seconds due to error req_timedout [info] [0.9151.0] 127.0.0.1 - - POST /couch_test_rep_db_b/_bulk_docs 201 [info] [0.6468.0] Retrying POST request to http://127.0.0.1:53314/couch_test_rep_db_b/_bulk_docs in 16.0 seconds due to error req_timedout [info] [0.9768.0] 127.0.0.1 - - POST /couch_test_rep_db_b/_bulk_docs 201 [info] [0.6468.0] Retrying POST request to http://127.0.0.1:53314/couch_test_rep_db_b/_bulk_docs in 32.0 seconds due to error req_timedout [info] [0.10390.0] 127.0.0.1 - - POST /couch_test_rep_db_b/_bulk_docs 201 Bail out! Timeout waiting for replication to finish escript: exception error: bad argument in function etap:diag/1 in call from erl_eval:do_apply/5 in call from erl_eval:exprs/5 in call from escript:eval_exprs/5 in call from erl_eval:local_func/5 in call from escript:interpret/4 in call from escript:start/1 in call from init:start_it/1 Please advise. Thanks, N -- Filipe David Manana, Reasonable men adapt themselves to the world. Unreasonable men adapt the world to themselves. That's why all progress depends on unreasonable men.
Re: Issues blocking 1.2.0 release
On Sun, Jan 22, 2012 at 6:01 PM, Filipe David Manana fdman...@apache.orgwrote: Noah, does it fail occasionally or every time for you? Fails every time. I'm assuming you're with a slow machine or the machine is a bit overloaded. Shouldn't be, I'm not doing anything else right now, and this is a new MBA. Can you try with the following patch? Yes. Will report back.
Re: Issues blocking 1.2.0 release
No change, still fails. On Sun, Jan 22, 2012 at 6:08 PM, Noah Slater nsla...@tumbolia.org wrote: On Sun, Jan 22, 2012 at 6:01 PM, Filipe David Manana fdman...@apache.orgwrote: Noah, does it fail occasionally or every time for you? Fails every time. I'm assuming you're with a slow machine or the machine is a bit overloaded. Shouldn't be, I'm not doing anything else right now, and this is a new MBA. Can you try with the following patch? Yes. Will report back.
Re: Issues blocking 1.2.0 release
On Sun, Jan 22, 2012 at 6:47 PM, Noah Slater nsla...@tumbolia.org wrote: No change, still fails. Noah, to try to find out if it's due to slowness of the machine or some other issue, do you think you can try to increase the following timeout in the test? diff --git a/test/etap/242-replication-many-leaves.t b/test/etap/242-replication-many-leaves.t index d8d3eb9..737cd31 100755 --- a/test/etap/242-replication-many-leaves.t +++ b/test/etap/242-replication-many-leaves.t @@ -287,6 +287,6 @@ replicate(Source, Target) - receive {'DOWN', MonRef, process, Pid, Reason} - etap:is(Reason, normal, Replication finished successfully) -after 30 - +after 90 - etap:bail(Timeout waiting for replication to finish) end. On Sun, Jan 22, 2012 at 6:08 PM, Noah Slater nsla...@tumbolia.org wrote: On Sun, Jan 22, 2012 at 6:01 PM, Filipe David Manana fdman...@apache.orgwrote: Noah, does it fail occasionally or every time for you? Fails every time. I'm assuming you're with a slow machine or the machine is a bit overloaded. Shouldn't be, I'm not doing anything else right now, and this is a new MBA. Can you try with the following patch? Yes. Will report back. -- Filipe David Manana, Reasonable men adapt themselves to the world. Unreasonable men adapt the world to themselves. That's why all progress depends on unreasonable men.
Re: Issues blocking 1.2.0 release
OVAR 9000! (Testing now...) On Sun, Jan 22, 2012 at 6:56 PM, Filipe David Manana fdman...@apache.orgwrote: On Sun, Jan 22, 2012 at 6:47 PM, Noah Slater nsla...@tumbolia.org wrote: No change, still fails. Noah, to try to find out if it's due to slowness of the machine or some other issue, do you think you can try to increase the following timeout in the test? diff --git a/test/etap/242-replication-many-leaves.t b/test/etap/242-replication-many-leaves.t index d8d3eb9..737cd31 100755 --- a/test/etap/242-replication-many-leaves.t +++ b/test/etap/242-replication-many-leaves.t @@ -287,6 +287,6 @@ replicate(Source, Target) - receive {'DOWN', MonRef, process, Pid, Reason} - etap:is(Reason, normal, Replication finished successfully) -after 30 - +after 90 - etap:bail(Timeout waiting for replication to finish) end. On Sun, Jan 22, 2012 at 6:08 PM, Noah Slater nsla...@tumbolia.org wrote: On Sun, Jan 22, 2012 at 6:01 PM, Filipe David Manana fdman...@apache.orgwrote: Noah, does it fail occasionally or every time for you? Fails every time. I'm assuming you're with a slow machine or the machine is a bit overloaded. Shouldn't be, I'm not doing anything else right now, and this is a new MBA. Can you try with the following patch? Yes. Will report back. -- Filipe David Manana, Reasonable men adapt themselves to the world. Unreasonable men adapt the world to themselves. That's why all progress depends on unreasonable men.
Re: Issues blocking 1.2.0 release
On Sun, Jan 22, 2012 at 7:20 PM, Noah Slater nsla...@tumbolia.org wrote: Works. How do we proceed? For how much time does the test runs? On 2 different physical machines, it takes about 1 minute and 10 seconds for me. Perhaps some manual replication tests could confirm if there's something wrong with the codebase, your environment or if simply increasing the timeout is not alarming. On Sun, Jan 22, 2012 at 7:05 PM, Noah Slater nsla...@tumbolia.org wrote: OVAR 9000! (Testing now...) On Sun, Jan 22, 2012 at 6:56 PM, Filipe David Manana fdman...@apache.orgwrote: On Sun, Jan 22, 2012 at 6:47 PM, Noah Slater nsla...@tumbolia.org wrote: No change, still fails. Noah, to try to find out if it's due to slowness of the machine or some other issue, do you think you can try to increase the following timeout in the test? diff --git a/test/etap/242-replication-many-leaves.t b/test/etap/242-replication-many-leaves.t index d8d3eb9..737cd31 100755 --- a/test/etap/242-replication-many-leaves.t +++ b/test/etap/242-replication-many-leaves.t @@ -287,6 +287,6 @@ replicate(Source, Target) - receive {'DOWN', MonRef, process, Pid, Reason} - etap:is(Reason, normal, Replication finished successfully) - after 30 - + after 90 - etap:bail(Timeout waiting for replication to finish) end. On Sun, Jan 22, 2012 at 6:08 PM, Noah Slater nsla...@tumbolia.org wrote: On Sun, Jan 22, 2012 at 6:01 PM, Filipe David Manana fdman...@apache.orgwrote: Noah, does it fail occasionally or every time for you? Fails every time. I'm assuming you're with a slow machine or the machine is a bit overloaded. Shouldn't be, I'm not doing anything else right now, and this is a new MBA. Can you try with the following patch? Yes. Will report back. -- Filipe David Manana, Reasonable men adapt themselves to the world. Unreasonable men adapt the world to themselves. That's why all progress depends on unreasonable men. -- Filipe David Manana, Reasonable men adapt themselves to the world. Unreasonable men adapt the world to themselves. That's why all progress depends on unreasonable men.
Re: Issues blocking 1.2.0 release
Noah, was able to reproduce your issue by tweaking the test to create more leaf revisions for a document: diff --git a/test/etap/242-replication-many-leaves.t b/test/etap/242-replication-many-leaves.t index d8d3eb9..4eb4765 100755 --- a/test/etap/242-replication-many-leaves.t +++ b/test/etap/242-replication-many-leaves.t @@ -56,7 +56,7 @@ doc_ids() - doc_num_conflicts(doc1) - 10; doc_num_conflicts(doc2) - 100; -doc_num_conflicts(doc3) - 286. +doc_num_conflicts(doc3) - 500. main(_) - With that change, I get exactly the same timeout as you get when the test runs the push replication. It turns out that some _bulk_docs requests are taking more than 30 seconds (default replication connection timeout) therefore the replication request retry messages. Verified this by timing the _bulk_docs handler to log the time it takes: diff --git a/src/couchdb/couch_httpd_db.erl b/src/couchdb/couch_httpd_db.erl index d7ecb4a..442571d 100644 --- a/src/couchdb/couch_httpd_db.erl +++ b/src/couchdb/couch_httpd_db.erl @@ -297,6 +297,7 @@ db_req(#httpd{path_parts=[_,_ensure_full_commit]}=Req, _Db) - send_method_not_allowed(Req, POST); db_req(#httpd{method='POST',path_parts=[_,_bulk_docs]}=Req, Db) - +T0 = now(), couch_stats_collector:increment({httpd, bulk_requests}), couch_httpd:validate_ctype(Req, application/json), {JsonProps} = couch_httpd:json_body_obj(Req), @@ -357,7 +358,9 @@ db_req(#httpd{method='POST',path_parts=[_,_bulk_docs]}=Req, Db) - {ok, Errors} = couch_db:update_docs(Db, Docs, Options, replicated_changes), ErrorsJson = lists:map(fun update_doc_result_to_json/1, Errors), -send_json(Req, 201, ErrorsJson) +Rr = send_json(Req, 201, ErrorsJson), +?LOG_ERROR(BULK DOCS took ~p ms~n, [timer:now_diff(now(), T0) / 1000]), +Rr end end; db_req(#httpd{path_parts=[_,_bulk_docs]}=Req, _Db) - I was getting _bulk_docs response times after 50 seconds. This convinces me there's nothing wrong with the codebase, the timeouts just needs to be increased: diff --git a/test/etap/242-replication-many-leaves.t b/test/etap/242-replication-many-leaves.t index d8d3eb9..6508112 100755 --- a/test/etap/242-replication-many-leaves.t +++ b/test/etap/242-replication-many-leaves.t @@ -77,6 +77,7 @@ test() - couch_server_sup:start_link(test_util:config_files()), ibrowse:start(), crypto:start(), +couch_config:set(replicator, connection_timeout, 9, false), Pairs = [ {source_db_name(), target_db_name()}, @@ -287,6 +288,6 @@ replicate(Source, Target) - receive {'DOWN', MonRef, process, Pid, Reason} - etap:is(Reason, normal, Replication finished successfully) -after 30 - +after 90 - etap:bail(Timeout waiting for replication to finish) end. Alternatively the test can be updated to create less revisions for the document doc3. The current revisions # is 286 but for the tests' purpose 205+ is enough, which should make it faster - 7000 (max url length) / length(DocRevision) = 205 If it's ok for you, updating the timeouts plus reducing the # from 286 to 210 is fine for me. On Mon, Jan 23, 2012 at 12:00 AM, Noah Slater nsla...@tumbolia.org wrote: I'm just the dumb QA guy. If you have some diagnostics you want me to run on my machine, I am happy to. On Sun, Jan 22, 2012 at 11:31 PM, Filipe David Manana fdman...@apache.orgwrote: On Sun, Jan 22, 2012 at 7:20 PM, Noah Slater nsla...@tumbolia.org wrote: Works. How do we proceed? For how much time does the test runs? On 2 different physical machines, it takes about 1 minute and 10 seconds for me. Perhaps some manual replication tests could confirm if there's something wrong with the codebase, your environment or if simply increasing the timeout is not alarming. On Sun, Jan 22, 2012 at 7:05 PM, Noah Slater nsla...@tumbolia.org wrote: OVAR 9000! (Testing now...) On Sun, Jan 22, 2012 at 6:56 PM, Filipe David Manana fdman...@apache.orgwrote: On Sun, Jan 22, 2012 at 6:47 PM, Noah Slater nsla...@tumbolia.org wrote: No change, still fails. Noah, to try to find out if it's due to slowness of the machine or some other issue, do you think you can try to increase the following timeout in the test? diff --git a/test/etap/242-replication-many-leaves.t b/test/etap/242-replication-many-leaves.t index d8d3eb9..737cd31 100755 --- a/test/etap/242-replication-many-leaves.t +++ b/test/etap/242-replication-many-leaves.t @@ -287,6 +287,6 @@ replicate(Source, Target) - receive {'DOWN', MonRef, process, Pid, Reason} - etap:is(Reason, normal, Replication finished successfully) - after 30 - + after 90 - etap:bail(Timeout waiting for replication to finish) end. On Sun, Jan 22, 2012 at 6:08 PM, Noah Slater nsla...@tumbolia.org wrote: On Sun, Jan 22, 2012 at 6:01 PM, Filipe
Re: Issues blocking 1.2.0 release
Yep, that seems reasonable. Let me know when I can test again. :) On Mon, Jan 23, 2012 at 1:34 AM, Filipe David Manana fdman...@apache.orgwrote: Noah, was able to reproduce your issue by tweaking the test to create more leaf revisions for a document: diff --git a/test/etap/242-replication-many-leaves.t b/test/etap/242-replication-many-leaves.t index d8d3eb9..4eb4765 100755 --- a/test/etap/242-replication-many-leaves.t +++ b/test/etap/242-replication-many-leaves.t @@ -56,7 +56,7 @@ doc_ids() - doc_num_conflicts(doc1) - 10; doc_num_conflicts(doc2) - 100; -doc_num_conflicts(doc3) - 286. +doc_num_conflicts(doc3) - 500. main(_) - With that change, I get exactly the same timeout as you get when the test runs the push replication. It turns out that some _bulk_docs requests are taking more than 30 seconds (default replication connection timeout) therefore the replication request retry messages. Verified this by timing the _bulk_docs handler to log the time it takes: diff --git a/src/couchdb/couch_httpd_db.erl b/src/couchdb/couch_httpd_db.erl index d7ecb4a..442571d 100644 --- a/src/couchdb/couch_httpd_db.erl +++ b/src/couchdb/couch_httpd_db.erl @@ -297,6 +297,7 @@ db_req(#httpd{path_parts=[_,_ensure_full_commit]}=Req, _Db) - send_method_not_allowed(Req, POST); db_req(#httpd{method='POST',path_parts=[_,_bulk_docs]}=Req, Db) - +T0 = now(), couch_stats_collector:increment({httpd, bulk_requests}), couch_httpd:validate_ctype(Req, application/json), {JsonProps} = couch_httpd:json_body_obj(Req), @@ -357,7 +358,9 @@ db_req(#httpd{method='POST',path_parts=[_,_bulk_docs]}=Req, Db) - {ok, Errors} = couch_db:update_docs(Db, Docs, Options, replicated_changes), ErrorsJson = lists:map(fun update_doc_result_to_json/1, Errors), -send_json(Req, 201, ErrorsJson) +Rr = send_json(Req, 201, ErrorsJson), +?LOG_ERROR(BULK DOCS took ~p ms~n, [timer:now_diff(now(), T0) / 1000]), +Rr end end; db_req(#httpd{path_parts=[_,_bulk_docs]}=Req, _Db) - I was getting _bulk_docs response times after 50 seconds. This convinces me there's nothing wrong with the codebase, the timeouts just needs to be increased: diff --git a/test/etap/242-replication-many-leaves.t b/test/etap/242-replication-many-leaves.t index d8d3eb9..6508112 100755 --- a/test/etap/242-replication-many-leaves.t +++ b/test/etap/242-replication-many-leaves.t @@ -77,6 +77,7 @@ test() - couch_server_sup:start_link(test_util:config_files()), ibrowse:start(), crypto:start(), +couch_config:set(replicator, connection_timeout, 9, false), Pairs = [ {source_db_name(), target_db_name()}, @@ -287,6 +288,6 @@ replicate(Source, Target) - receive {'DOWN', MonRef, process, Pid, Reason} - etap:is(Reason, normal, Replication finished successfully) -after 30 - +after 90 - etap:bail(Timeout waiting for replication to finish) end. Alternatively the test can be updated to create less revisions for the document doc3. The current revisions # is 286 but for the tests' purpose 205+ is enough, which should make it faster - 7000 (max url length) / length(DocRevision) = 205 If it's ok for you, updating the timeouts plus reducing the # from 286 to 210 is fine for me. On Mon, Jan 23, 2012 at 12:00 AM, Noah Slater nsla...@tumbolia.org wrote: I'm just the dumb QA guy. If you have some diagnostics you want me to run on my machine, I am happy to. On Sun, Jan 22, 2012 at 11:31 PM, Filipe David Manana fdman...@apache.orgwrote: On Sun, Jan 22, 2012 at 7:20 PM, Noah Slater nsla...@tumbolia.org wrote: Works. How do we proceed? For how much time does the test runs? On 2 different physical machines, it takes about 1 minute and 10 seconds for me. Perhaps some manual replication tests could confirm if there's something wrong with the codebase, your environment or if simply increasing the timeout is not alarming. On Sun, Jan 22, 2012 at 7:05 PM, Noah Slater nsla...@tumbolia.org wrote: OVAR 9000! (Testing now...) On Sun, Jan 22, 2012 at 6:56 PM, Filipe David Manana fdman...@apache.orgwrote: On Sun, Jan 22, 2012 at 6:47 PM, Noah Slater nsla...@tumbolia.org wrote: No change, still fails. Noah, to try to find out if it's due to slowness of the machine or some other issue, do you think you can try to increase the following timeout in the test? diff --git a/test/etap/242-replication-many-leaves.t b/test/etap/242-replication-many-leaves.t index d8d3eb9..737cd31 100755 --- a/test/etap/242-replication-many-leaves.t +++ b/test/etap/242-replication-many-leaves.t @@ -287,6 +287,6 @@ replicate(Source, Target) - receive {'DOWN', MonRef, process, Pid, Reason} - etap:is(Reason, normal, Replication finished