eric casteleijn wrote:
We have the following setup:
Two near-identical public-facing Django servers communicating with one
CouchDB server. The CouchDB server is OAuth-authenticated, and people
can access it directly (well, through an Apache proxy) if they have
the tokens to do so. New users are signed up through these Django
servers, which then add the user and their tokens to CouchDB
(the user through a POST to _users and the tokens through PUTs to
_config).
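To make the sign-up writes concrete, here is a minimal sketch of the three requests involved for one new user. The helper name signup_requests, its arguments and the shape of the user document are illustrative assumptions, not our actual code; only the _users and _config paths and the oauth_token_users / oauth_token_secrets section names are taken from this message. It only builds the requests and does not send anything.

```python
import json
from urllib.parse import quote


def signup_requests(user_doc, username, token, token_secret):
    """Build the (method, path, body) triples for one sign-up.

    Hypothetical helper for illustration. user_doc is whatever document
    the _users database expects; its shape is not specified here.
    """
    # Tokens may contain characters that are not safe in a URL path.
    key = quote(token, safe="")
    return [
        # 1. Create the user document.
        ("POST", "/_users", json.dumps(user_doc)),
        # 2. Map the token to its user. _config values are JSON strings.
        ("PUT", "/_config/oauth_token_users/" + key, json.dumps(username)),
        # 3. Store the token secret under the same key.
        ("PUT", "/_config/oauth_token_secrets/" + key, json.dumps(token_secret)),
    ]
```

Each sign-up therefore costs two separate writes to _config, and each of those goes through the couch_config process named in the errors.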
We see this failing a lot, now to the point where we think it fails
all the time. (Since all those systems have separate logs, not all of
which we have access to, this is not trivial to piece together.)
The errors the API servers get back all look like these (see the lines
starting with '(500'):
'2009-10-27 22:35:15,357 ERROR UbuntuOne.couch: failed to add *****
= 40693 to section [oauth_token_users] of local.ini:
(500, (u'timeout', u'{gen_server,call,\n [couch_config,\n
{set,"oauth_token_users","*****","40693",true}]}'))'
'2009-10-27 22:35:20,399 ERROR UbuntuOne.couch: failed to add *****
= ***** to section [oauth_token_secrets] of local.ini:
(500, (u'timeout', u'{gen_server,call,\n [couch_config,\n
{set,"oauth_token_secrets","*****",\n
"*****",\n true}]}'))'
Corresponding errors in the couchdb.log look like:
Oops, sent before pasting that in:
[Tue, 27 Oct 2009 23:29:42 GMT] [error] [<0.9591.67>] Uncaught error in
HTTP request: {exit,
{timeout,
{gen_server,call,
[couch_config,
{set,"oauth_token_secrets",
"*****",
"*****",
true}]}}}
[Tue, 27 Oct 2009 23:29:42 GMT] [info] [<0.9591.67>] Stacktrace:
[{gen_server,call,2},
{couch_httpd_misc_handlers,handle_config_req,1},
{couch_httpd,handle_request,5},
{mochiweb_http,headers,5},
{proc_lib,init_p_do_apply,3}]
[Tue, 27 Oct 2009 23:29:42 GMT] [debug] [<0.9591.67>] httpd 500 error
response:
{"error":"timeout","reason":"{gen_server,call,\n
[couch_config,\n {set,\"oauth_token_secrets\",\"*****\",\n
\"*****\",\n true}]}"}
[Tue, 27 Oct 2009 23:29:42 GMT] [info] [<0.9591.67>] 91.189.89.54 - -
'PUT' /_config/oauth_token_secrets/***** 500
And another entry I see a *lot* in the couchdb logs, even more often
than the 500s, is this:
[Tue, 27 Oct 2009 23:29:37 GMT] [debug] [<0.9589.67>] 'PUT'
/_config/oauth_token_secrets/k10rBFRc6Lv20sWWBtSX {1,1}
Headers: [{'Accept',"application/json"},
{'Accept-Encoding',"identity"},
{'Content-Length',"82"},
{'Host',"marang.canonical.com:9030"},
{'User-Agent',"couchdb-python ?"}]
[Tue, 27 Oct 2009 23:29:37 GMT] [debug] [<0.9589.67>] OAuth Params: []
[Tue, 27 Oct 2009 23:29:37 GMT] [debug] [<0.9589.67>] Minor error in
HTTP request: {unauthorized,<<"Authentication required.">>}
[Tue, 27 Oct 2009 23:29:37 GMT] [debug] [<0.9589.67>] Stacktrace:
[{couch_httpd,authenticate_request,2},
{couch_httpd,handle_request,5},
{mochiweb_http,headers,5},
{proc_lib,init_p_do_apply,3}]
[Tue, 27 Oct 2009 23:29:37 GMT] [debug] [<0.9589.67>] httpd 401 error
response:
{"error":"unauthorized","reason":"Authentication required."}
I think this is just couch telling the client: sorry, you have to
authenticate. But the logs don't show any subsequent resolution, just
a lot of these 401s one after another (which may be because we retry a
number of times when a request fails).
Could one of the problems be that (the way we use) couchdb-python does
not always know that it needs to reauthenticate when it gets a 401?
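To illustrate the retry concern, here is a minimal sketch of a retry loop that treats the two failures differently. It is an assumption about what a sensible policy could look like, not how couchdb-python or our code behaves today. The names put_with_retry and send are hypothetical; send stands for any callable that performs one request and returns (status, body).

```python
import time


def put_with_retry(send, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Retry server errors with exponential backoff; never retry a 401.

    Hypothetical helper. send() must return a (status, body) tuple.
    """
    for attempt in range(max_attempts):
        status, body = send()
        if status < 500:
            # Success, or a client error such as 401. Repeating an
            # unsigned request unchanged cannot succeed, so hand it
            # back to the caller to re-sign instead of retrying.
            return status, body
        if attempt < max_attempts - 1:
            # 5xx (e.g. the couch_config timeout): wait longer each
            # time so retries do not add to the load.
            sleep(base_delay * 2 ** attempt)
    return status, body
```

With a policy like this, a run of identical 401s in the log would point at the signing step rather than at the retries.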
My theory was that these writes to _config fail because the local.ini
is somehow corrupted, but I can't access that file directly (since it
has users' secrets) or copy it to my machine to test this theory, and
helping someone who is allowed to see it look for anything weird is
like searching for the proverbial needle in the haystack: we have lots
of users, and users can have multiple tokens. Add to that the fact
that you cannot ever delete a line from the .ini file (DELETEs against
keys in _config just empty the value and leave a line like 'foo = \n')!
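Something like the following could help whoever is allowed to read the file look for oddities without anyone having to eyeball the secrets. It is a minimal sketch under simplifying assumptions: sections look like [name], entries look like key = value, comments start with ';', and multi-line values are not handled. The name audit_ini is hypothetical. It reports only counts and line numbers, never keys or values.

```python
import re

SECTION = re.compile(r"^\[(?P<name>[^\]]+)\]\s*$")
KEY_VALUE = re.compile(r"^(?P<key>[^=;\s][^=]*?)\s*=\s*(?P<value>.*)$")


def audit_ini(lines):
    """Return (empty, malformed) for an iterable of ini lines.

    empty:     {section: number of keys with an empty value}
    malformed: 1-based numbers of lines that fit no expected pattern
    """
    empty = {}
    malformed = []
    section = None
    for number, raw in enumerate(lines, start=1):
        stripped = raw.strip()
        if not stripped or stripped.startswith(";"):
            continue  # blank line or comment
        match = SECTION.match(stripped)
        if match:
            section = match.group("name")
            continue
        match = KEY_VALUE.match(stripped)
        if match and section is not None:
            if match.group("value").strip() == "":
                # Leftover of a DELETE: 'foo = ' with nothing after it.
                empty[section] = empty.get(section, 0) + 1
            continue
        # Not a section, comment, or key = value inside a section.
        malformed.append(number)
    return empty, malformed
```

It could be run as `audit_ini(open(path))`, where path is wherever local.ini lives on that machine.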
When I spoke to Jan on the channel, he suggested that the gen_server's
message inbox may be overflowing, so that calls to it time out.
Could that be, under high load, and how can we solve this? Can we
increase the size of this inbox, or can we possibly have multiple
processes handling the access? Whether it's high load or corruption or
something else again, right now it looks like NO new tokens can be
added, and hence no new users can use our system. In short: HALP!
--
- eric casteleijn
https://launchpad.net/~thisfred
http://www.canonical.com