We had a period - about 36 hours long - where the internal xmlrpc server was simply saturated and overloaded. During this period folks would see the backtrace described in https://bugs.launchpad.net/launchpad/+bug/674416.
The master bug for this is https://bugs.launchpad.net/launchpad-code/+bug/674305. I'm writing to let everyone know the current status in case either:
 - the issue is not fixed
 - the fix triggers a different failure

Francis and I had discussed this earlier and thought that disabling the importds (which are still disabled AFAIK) would mitigate the problem, but pretty much immediately after he signed off we got more reports. This was a pretty high visibility problem - while occurring it would happen reliably push after push; it's scary and, unlike our timeouts, unclear in its implications. So I escalated it within ISO - got Charlie and James at dinner, and James popped into IRC when they had finished.

What we did was to execute RT 41465 - the highest priority RT for LP at the time, because it was filed with the explicit goal of spreading the load from our XMLRPC services over more workers. We considered other options:
 - just leave it (not good: it was widespread, pervasive and very worrying for users - is their data safe? Yes, but how would they know?)
 - put a loop in the codehosting server (high risk: a code change with unknown knock-on effects)
 - disable more services (we couldn't: mailing lists, codehosting and the apache rewrites for codebrowse were all that remained driving traffic to the service)

Having weighed those up, James Troup spent a number of hours this evening reconfiguring the internal xmlrpc server: it's now served from the main lpnet cluster (basically the same configuration that e.g. qastaging and staging use), which gives these APIs 10 times the resources - 52 worker threads rather than 4 (though shared amongst many more users too). To revert this (in the event of things going pear-shaped):
 - DNS needs to be changed back
 - the /etc/hosts entries on the codehosting machines need to be changed back similarly

If we're right about the driving issue here, we should see many fewer timeouts for internal XMLRPC operations from tomorrow. As I write this, https://lpstats.canonical.com/graphs/CodehostingPerformance/20101107/20101114/ suggests the period of high-latency SSH connection spikes (which are driven by XMLRPC responsiveness - it's how we get the SSH keys; there's a small illustrative sketch of that lookup at the end of this mail) may be over. https://lpstats.canonical.com/graphs/CodehostingPerformance/20101113/20101114/ gives a closer view - you can clearly see that single requests were spiking up to > 10 seconds.

Similarly, the current day's OOPSes are at:

Time Out Counts by Page ID
Hard  Soft  Page ID
 384   578  CodehostingApplication:CodehostingAPI
  94   832  CodeImportSchedulerApplication:CodeImportSchedulerAPI
  89    14  Person:+commentedbugs
  14    39  BugTask:+index
  10    16  Archive:EntryResource:getBuildSummariesForSourceIds
   4     0  https://api.edge.launchpad.net
   4     0  ProjectGroup:+milestones
   2    37  Distribution:+bugtarget-portlet-bugfilters-stats
   2     7  DistroSeriesLanguage:+index
   2     4  Distribution:+archivemirrors

(from https://devpad.canonical.com/~lpqateam/lpnet-oops.html#time-outs)

Again, if over-saturation was causing the issue, we should expect the first two rows to stay constant (or nearly so) over the remainder of the day. It's been stable for a good 30+ minutes now - in fact Person:+commentedbugs just passed the CodeImportSchedulerAPI, with the XMLRPC resources staying constant. This is a good sign.

We have some follow-up work we should do:
 - reprovision the old xmlrpc server as a regular appserver
 - do the single-threaded appserver experiment
 - delete the now-unused OOPS prefix for the gone appserver instance and delete the production config for it
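For context on why SSH latency tracks the XMLRPC service so closely: the codehosting SSH front end looks up each connecting user's keys over this internal XMLRPC endpoint before it can accept the connection. A minimal sketch of that kind of lookup is below - the endpoint URL and method name are placeholders for illustration, not the production configuration:

    # Illustrative sketch only: the URL and method name are placeholders,
    # not the real production endpoint.
    from xmlrpc.client import ServerProxy

    # The codehosting SSH server makes a blocking call like this for every
    # incoming connection, so XMLRPC latency adds directly to SSH latency.
    authserver = ServerProxy("http://xmlrpc-internal.example.com:8087/authserver")

    def lookup_ssh_keys(username):
        # If the XMLRPC workers are saturated, this is the call that
        # stalls or times out.
        return authserver.getUserAndSSHKeys(username)

    if __name__ == "__main__":
        print(lookup_ssh_keys("example-user"))

With only 4 worker threads every slow request queued up behind calls like this one; with 52 threads on the lpnet cluster there is far more headroom.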
-Rob

