Ok, so I've tracked it down to the specific location where it happens:

- couch_rep_reader:spawn_document_request/2 is called
- in the SpawnFun defined in there, it calls couch_rep_reader:open_doc
- open_doc gets an error, not_found response (not sure why, shouldn't the doc be there already?)
- open_doc returns [] back to the SpawnFun
- SpawnFun calls gen_server:call(Server, {add_docs, nil, Results}... with Results being []
- handle_call(add_docs) calls handle_add_docs, which increments the document count... by 0
- and then returns {noreply,...}
- then everything just sits there, because each part is waiting for another part to do something
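To make the chain above concrete, here is a rough toy model of it. This is not the couch_rep_reader source: the module name, both function names, the returned tuple shape, and the exit reason are hypothetical, and FetchFun is just a stand-in for the open_doc call, assumed to return [] when the document is not found. It only shows where the empty list slips through, next to a variant that retries and then exits instead.

```erlang
%% Toy model only. All names here are hypothetical and this is
%% NOT the real couch_rep_reader code.
-module(rep_reader_sketch).
-export([lenient/1, guarded/2]).

%% Lenient path, as described above: whatever the fetch returns,
%% including [], is handed on, so the count grows by length(Results),
%% which is 0 for a failed fetch.
%% FetchFun stands in for open_doc (assumed to return [] on not_found).
lenient(FetchFun) when is_function(FetchFun, 0) ->
    Results = FetchFun(),
    {add_docs, Results, length(Results)}.

%% Guarded path: retry a few times, then exit so the failure is
%% visible to whoever monitors this process instead of being
%% silently absorbed as "0 documents added".
guarded(FetchFun, Retries)
  when is_function(FetchFun, 0), is_integer(Retries), Retries >= 0 ->
    case FetchFun() of
        [] when Retries > 0 ->
            guarded(FetchFun, Retries - 1);
        [] ->
            exit({open_doc_failed, no_results});
        Results ->
            {add_docs, Results, length(Results)}
    end.
```

With a fetch fun that always returns [], lenient/1 gives back {add_docs,[],0}, while guarded/2 exits with {open_doc_failed,no_results} once the retries run out. Whether a retry or an immediate failure is the right call in the real replicator depends on why the not_found happens in the first place, which I haven't figured out.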
It seems the solution here is to either add a retry into spawn_document_request's SpawnFun, or at the very least, fail when open_doc returns [], rather than continuing on, since that results in a set of deadlocked processes.

On Thu, Jun 10, 2010 at 9:28 AM, Paul Bonser <mister...@gmail.com> wrote:

> Nope, just a regular 7200RPM SATA drive.
>
> So you guys may already know this, but I've tracked it down to a couch_rep
> gen_server never terminating, and thus not calling do_terminate, and thus
> the call to gen_server:call(Server, get_result, infinity) in
> couch_rep:get_result just hangs forever.
>
>
> On Thu, Jun 10, 2010 at 4:39 AM, Jan Lehnardt <j...@apache.org> wrote:
>
>> Hi Paul,
>>
>> thanks for the report. Out of curiosity, are you running an SSD drive in
>> the box that reproduces the hangs?
>>
>> And anyone: Can you reproduce this on non-SSD machines?
>>
>> Cheers
>> Jan
>> --
>>
>> On 10 Jun 2010, at 02:26, Paul Bonser wrote:
>>
>> > Oh, I should also mention that I got the exact same error in multiple
>> > freezes.
>> > Twice it was in the same exact order, and once it was in this order:
>> >
>> > [info] [<0.95.0>] starting replication "15c25eda4ea6308af6bea9864d5319ef" at <0.1845.0>
>> > [debug] [<0.1207.0>] OAuth Params: [{"att_encoding_info","true"}]
>> > [info] [<0.1207.0>] 127.0.0.1 - - 'GET' /test_suite_rep_docs_db_a/foo2?att_encoding_info=true 200
>> > [debug] [<0.1207.0>] 'POST' /test_suite_rep_docs_db_b/_bulk_docs {1,1}
>> > Headers: [{'Accept',"application/json"},
>> >           {'Accept-Encoding',"gzip"},
>> >           {'Content-Length',"167"},
>> >           {'Host',"localhost:5985"},
>> >           {'User-Agent',"CouchDB/0.12.0a953193"},
>> >           {"X-Couch-Full-Commit","false"}]
>> > [debug] [<0.1207.0>] OAuth Params: []
>> > [info] [<0.1207.0>] 127.0.0.1 - - 'POST' /test_suite_rep_docs_db_b/_bulk_docs 201
>> > [debug] [<0.1076.0>] 'GET' /test_suite_rep_docs_db_a/foo666?att_encoding_info=true {1,1}
>> > Headers: [{'Accept',"application/json"},
>> >           {'Accept-Encoding',"gzip"},
>> >           {'Host',"localhost:5985"},
>> >           {'User-Agent',"CouchDB/0.12.0a953193"}]
>> > [debug] [<0.1076.0>] OAuth Params: [{"att_encoding_info","true"}]
>> > [debug] [<0.1076.0>] Minor error in HTTP request: {not_found,missing}
>> > [debug] [<0.1076.0>] Stacktrace: [{couch_httpd_db,couch_doc_open,4},
>> >           {couch_httpd_db,db_doc_req,3},
>> >           {couch_httpd_db,do_db_req,2},
>> >           {couch_httpd,handle_request_int,5},
>> >           {mochiweb_http,headers,5},
>> >           {proc_lib,init_p_do_apply,3}]
>> > [info] [<0.1076.0>] 127.0.0.1 - - 'GET' /test_suite_rep_docs_db_a/foo666?att_encoding_info=true 404
>> > [debug] [<0.1076.0>] httpd 404 error response:
>> >  {"error":"not_found","reason":"missing"}
>> >
>> >
>> > Could it be some sort of race condition?
>> >
>> >
>> > On Wed, Jun 9, 2010 at 8:22 PM, Paul Bonser <mister...@gmail.com> wrote:
>> >
>> >>
>> >> On Wed, Jun 9, 2010 at 7:33 PM, J Chris Anderson <jch...@apache.org> wrote:
>> >>
>> >>> Devs,
>> >>>
>> >>> Is anyone else seeing the replicator test hang and never finish?
>> >>>
>> >>> It never hangs the first few runs, but after running ten or so times, I'll
>> >>> end up with the test suite waiting for a replication that never finishes.
>> >>> It's the same story on 0.11.0, 0.11.x, and trunk.
>> >>>
>> >>> Is anyone else able to reproduce this? Am I crazy?
>> >>>
>> >>
>> >> It just froze for me on the first try, using 0.12.0a935298, then ran
>> >> successfully 3 times, then froze again. The last thing logged the first time
>> >> was a _bulk_docs request, the last thing logged this time was a PUT to
>> >> /test_suite_db_b/_local%2F6598a76aa55cd39645e4730b4cb28c00
>> >>
>> >> I'm running a Firefox 3.6 nightly build on Linux. For me, it froze the
>> >> first time when I did a "run all" and the second time when just directly
>> >> running the replication test.
>> >>
>> >> After svn up-ing to the latest in trunk, it froze on the first try,
>> >> directly running the replication test.
>> >>
>> >> Here's the debug output for the last _replicate request where it freezes.
>> >> It's requesting a document that isn't there.
>> >>
>> >>
>> >> [info] [<0.95.0>] starting new replication "15c25eda4ea6308af6bea9864d5319ef" at <0.848.0>
>> >> [debug] [<0.191.0>] 'GET' /test_suite_rep_docs_db_a/foo2?att_encoding_info=true {1,1}
>> >> Headers: [{'Accept',"application/json"},
>> >>           {'Accept-Encoding',"gzip"},
>> >>           {'Host',"localhost:5985"},
>> >>           {'User-Agent',"CouchDB/0.12.0a953193"}]
>> >> [debug] [<0.191.0>] OAuth Params: [{"att_encoding_info","true"}]
>> >> [info] [<0.191.0>] 127.0.0.1 - - 'GET' /test_suite_rep_docs_db_a/foo2?att_encoding_info=true 200
>> >> [debug] [<0.189.0>] 'GET' /test_suite_rep_docs_db_a/foo666?att_encoding_info=true {1,1}
>> >> Headers: [{'Accept',"application/json"},
>> >>           {'Accept-Encoding',"gzip"},
>> >>           {'Host',"localhost:5985"},
>> >>           {'User-Agent',"CouchDB/0.12.0a953193"}]
>> >> [debug] [<0.189.0>] OAuth Params: [{"att_encoding_info","true"}]
>> >> [debug] [<0.189.0>] Minor error in HTTP request: {not_found,missing}
>> >> [debug] [<0.189.0>] Stacktrace: [{couch_httpd_db,couch_doc_open,4},
>> >>           {couch_httpd_db,db_doc_req,3},
>> >>           {couch_httpd_db,do_db_req,2},
>> >>           {couch_httpd,handle_request_int,5},
>> >>           {mochiweb_http,headers,5},
>> >>           {proc_lib,init_p_do_apply,3}]
>> >> [info] [<0.189.0>] 127.0.0.1 - - 'GET' /test_suite_rep_docs_db_a/foo666?att_encoding_info=true 404
>> >> [debug] [<0.189.0>] httpd 404 error response:
>> >>  {"error":"not_found","reason":"missing"}
>> >>
>> >> [debug] [<0.191.0>] 'POST' /test_suite_rep_docs_db_b/_bulk_docs {1,1}
>> >> Headers: [{'Accept',"application/json"},
>> >>           {'Accept-Encoding',"gzip"},
>> >>           {'Content-Length',"167"},
>> >>           {'Host',"localhost:5985"},
>> >>           {'User-Agent',"CouchDB/0.12.0a953193"},
>> >>           {"X-Couch-Full-Commit","false"}]
>> >> [debug] [<0.191.0>] OAuth Params: []
>> >> [info] [<0.191.0>] 127.0.0.1 - - 'POST' /test_suite_rep_docs_db_b/_bulk_docs 201
>> >>
>> >>
>> >> --
>> >> Paul Bonser
>> >> http://probablyprogramming.com
>> >
>> >
>> > --
>> > Paul Bonser
>> > http://probablyprogramming.com
>>
>
>
> --
> Paul Bonser
> http://probablyprogramming.com
>

--
Paul Bonser
http://probablyprogramming.com