eric casteleijn wrote:
We have the following setup:

Two near-identical public-facing Django servers communicate with one CouchDB server. The CouchDB server is OAuth-authenticated, and people can access it directly (well, through an Apache proxy) if they have the tokens to do so. New users sign up through the Django servers, which then add the user and their tokens to CouchDB: the user through a POST to _users and the tokens through PUTs to _config.
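
For illustration, the two token writes might look roughly like the sketch below. It only builds the requests and does not send them. The function names and the base URL are hypothetical, OAuth signing is left out, and the key/value layout (token as key, user id or secret as value) is inferred from the error messages further down.

```python
import json
from urllib.parse import quote
from urllib.request import Request


def build_config_put(base_url, section, key, value):
    """Build (but do not send) a PUT that sets one key in a _config section."""
    url = "%s/_config/%s/%s" % (
        base_url.rstrip("/"),
        quote(section, safe=""),
        quote(key, safe=""),
    )
    # The new value travels as a JSON-encoded string in the request body.
    body = json.dumps(value).encode("utf-8")
    return Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json", "Accept": "application/json"},
    )


def build_token_requests(base_url, token, secret, user_id):
    """Return the two _config writes made for each new token (hypothetical helper)."""
    return [
        build_config_put(base_url, "oauth_token_users", token, str(user_id)),
        build_config_put(base_url, "oauth_token_secrets", token, secret),
    ]
```

Each new token therefore costs two separate calls into the config handler on the CouchDB side.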

We see this failing a lot, now to the point where we think it fails all the time. (All those systems have separate logs, not all of which we have access to, so this is not trivial to piece together.)

The errors the API servers get back all look like these (see the lines starting with '(500'):

'2009-10-27 22:35:15,357 ERROR UbuntuOne.couch: failed to add ***** = 40693 to section [oauth_token_users] of local.ini:

(500, (u'timeout', u'{gen_server,call,\n [couch_config,\n {set,"oauth_token_users","*****","40693",true}]}'))'

'2009-10-27 22:35:20,399 ERROR UbuntuOne.couch: failed to add ***** = ***** to section [oauth_token_secrets] of local.ini:

(500, (u'timeout', u'{gen_server,call,\n [couch_config,\n {set,"oauth_token_secrets","*****",\n "*****",\n true}]}'))'

Corresponding errors in the couchdb.log look like:

Oops, sent before pasting that in:

[Tue, 27 Oct 2009 23:29:42 GMT] [error] [<0.9591.67>] Uncaught error in HTTP request: {exit,
                                 {timeout,
                                  {gen_server,call,
                                   [couch_config,
                                    {set,"oauth_token_secrets",
                                     "*****",
                                     "*****",
                                     true}]}}}

[Tue, 27 Oct 2009 23:29:42 GMT] [info] [<0.9591.67>] Stacktrace: [{gen_server,call,2},
             {couch_httpd_misc_handlers,handle_config_req,1},
             {couch_httpd,handle_request,5},
             {mochiweb_http,headers,5},
             {proc_lib,init_p_do_apply,3}]

[Tue, 27 Oct 2009 23:29:42 GMT] [debug] [<0.9591.67>] httpd 500 error response: {"error":"timeout","reason":"{gen_server,call,\n [couch_config,\n {set,\"oauth_token_secrets\",\"*****\",\n \"*****\",\n true}]}"}


[Tue, 27 Oct 2009 23:29:42 GMT] [info] [<0.9591.67>] 91.189.89.54 - - 'PUT' /_config/oauth_token_secrets/***** 500


Another entry I see a *lot* in the couchdb logs, even more often than the 500s, is this:


[Tue, 27 Oct 2009 23:29:37 GMT] [debug] [<0.9589.67>] 'PUT' /_config/oauth_token_secrets/k10rBFRc6Lv20sWWBtSX {1,1}
Headers: [{'Accept',"application/json"},
          {'Accept-Encoding',"identity"},
          {'Content-Length',"82"},
          {'Host',"marang.canonical.com:9030"},
          {'User-Agent',"couchdb-python ?"}]

[Tue, 27 Oct 2009 23:29:37 GMT] [debug] [<0.9589.67>] OAuth Params: []

[Tue, 27 Oct 2009 23:29:37 GMT] [debug] [<0.9589.67>] Minor error in HTTP request: {unauthorized,<<"Authentication required.">>}

[Tue, 27 Oct 2009 23:29:37 GMT] [debug] [<0.9589.67>] Stacktrace: [{couch_httpd,authenticate_request,2},
             {couch_httpd,handle_request,5},
             {mochiweb_http,headers,5},
             {proc_lib,init_p_do_apply,3}]

[Tue, 27 Oct 2009 23:29:37 GMT] [debug] [<0.9589.67>] httpd 401 error response:
 {"error":"unauthorized","reason":"Authentication required."}

Which I think is just couch telling the client: sorry, you have to authenticate. But the logs don't show any subsequent resolution, just a lot of these 401s one after another (which may be because we retry a number of times when a request fails).

Could one of the problems be that (the way we use) couchdb-python does not always know that it needs to reauthenticate when it gets a 401?
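
To make the question concrete, here is a minimal sketch of a retry loop that treats the two failure modes differently: a 5xx is retried with backoff, while a 401 is raised immediately, because resending the same unsigned request cannot succeed. This is not how couchdb-python behaves (that is exactly what is unclear); `call_with_retries` and the `send` callable returning `(status, body)` are hypothetical.

```python
import time


def call_with_retries(send, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call send() until it succeeds, retrying only on 5xx responses.

    send is a hypothetical callable returning a (status, body) tuple.
    """
    status, body = None, None
    for attempt in range(attempts):
        status, body = send()
        if status < 400:
            return status, body
        if status in (401, 403):
            # Retrying the same unauthenticated request cannot succeed;
            # the caller has to re-sign it first.
            raise PermissionError("authentication required: %r" % (body,))
        if status < 500:
            break  # other client errors are not retried
        if attempt < attempts - 1:
            sleep(base_delay * (2 ** attempt))  # exponential backoff
    return status, body
```

With a loop like this, a run of identical 401s in the log would point at the signing step rather than at the retry logic.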

My theory was that these writes to _config fail because local.ini is somehow corrupted, but I can't access that file directly (since it contains users' secrets) or copy it to my machine to test this theory. Helping someone who is allowed to see it look for anything weird is like searching for the proverbial needle in a haystack: we have lots of users, and users can have multiple tokens. Add to that the fact that you can never delete a line from the .ini file: DELETEs against keys in _config just empty the value and leave a line like 'foo = \n'.
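
One way to search that haystack without exposing secrets would be a small script that whoever has access runs against the file, reporting only line numbers and the kind of problem. The following is a rough sketch under assumptions: `scan_ini` is a hypothetical helper, it assumes one `key = value` pair per line with `;` comments, and it does not replicate couch_config's actual parser.

```python
import re

SECTION_RE = re.compile(r"^\[[^\]]+\]\s*$")
KEY_VALUE_RE = re.compile(r"^([^=]+?)\s*=\s*(.*)$")


def scan_ini(text):
    """Return (line_number, problem) pairs without echoing keys or values."""
    problems = []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith(";"):
            continue  # blank line or comment
        if SECTION_RE.match(line):
            continue  # section header such as [oauth_token_users]
        match = KEY_VALUE_RE.match(line)
        if match is None:
            problems.append((number, "not a section, comment, or key = value line"))
        elif match.group(2).strip() == "":
            problems.append((number, "empty value (possibly left by a DELETE)"))
    return problems
```

Because the output contains no keys or values, it could be shared with people who are not allowed to see the file itself.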

After I spoke to Jan on the channel, he suggested that the gen_server's message inbox may be overflowing, so that calls to the gen_server time out.

Could that happen under high load, and if so, how can we solve it? Can we increase the size of this inbox, or could we have multiple processes handling the access? Whether it's high load, corruption, or something else again, right now it looks like NO new tokens can be added, and hence no new users can use our system. In short: HALP!






--
- eric casteleijn
https://launchpad.net/~thisfred
http://www.canonical.com
