On Mar 25, 2011, at 12:21 AM, Ian Booth wrote:

> Hi folks
> 
> An update on the status of this. The system appears ok *for now*.

Thank you, Ian.  Also thanks to Danilo for his help today in looking at the 
symptoms and talking over possible solutions.

I'm regarding this as critical and no longer an incident, and I think that's 
where you left this.  Practically, that means I think I work on it until it is 
fixed, but we don't have to work on it over the weekend.

> A cowboy was applied to prod (the optimisation gary refers to below) but
> in reality it didn't make much difference to the sql being run. By far
> the latest number of sql statements were single selects to get
> subscribers' email details and subscription details. There were 100's of
> them. These should each be consolidated into a bulk select. Or something
> else done to address the issue. See bug
> https://bugs.launchpad.net/launchpad/+bug/742230

I analyzed the new bug and the logs more.  I used a trick to add some 
timestamps to our SQL logging: http://pastebin.ubuntu.com/585447/ .  It's a bit 
fussy, but gave a lot more visibility than nothing at all.  Running this on 
qastaging led to a red herring: sendmail was consuming 2/3 of the runtime on 
qastaging.  That's doubly odd because we don't even send anything out from 
qastaging to my knowledge.  However, returning to the production logs, this 
symptom does not seem to be the case: I don't believe sendmail is the culprit 
for the particular problem we are experiencing now, because correlating the 
logs with the code (and the timestamped version we saw on qastaging), we were 
most often sending emails quite fast--86 a second in one case Danilo looked at.

I got one run from the LOSAs of that SQL logging trick on production, but the 
bug notification queue was empty; the LOSAs did not get around to fulfilling my 
request for a second run.  Limited LOSA availability has made looking at this 
problem even harder than it might be otherwise.

I did some more work on optimizing the query problems we saw.  This branch 
reduces the pertinent queries in my analysis locally, and should make a 
particular difference for reducing queries for notifications based on team 
subscriptions.  That is, on a small scale of notifications, this code is 1/3 to 
1/2 better; but I believe the characteristics of these changes will make a much 
bigger difference in production, hopefully in an order of magnitude.

lp:~gary/launchpad/bug741684

I'll try to get that cowboyed Monday to see if that addresses the query 
situation better.

We also should run representative unique SQL generated by this code with 
EXPLAIN ANALYZE, ideally in the production database, since qastaging and 
staging do not generate notifications in the quantity that production does.  
Since I did not get a response for the next step of my previously mentioned 
request to the LOSAs, I never got to making this request of them.

> 
> The latest dbr report shows the bugnotification user using approx 10%
> cpu over the past few hours. Bug email is going out but the cpu is still
> far too high. Why is the system running "ok" now? I'm not sure and
> that's a worry.

As I saw some other people hypothesize, I'm guessing we're running just barely 
OK most of the time right now, but we can be flipped over into cascading errors 
whenever we tip over into extra capacity.

> 
> Another issue I noticed from the logs is as follows. The cronjob runs
> every 5 minutes. The logs over the past few days show that there are
> times when notifications are still being sent when the next script
> invocation occurs eg
> 
> 2011-03-23 15:15:08 INFO    Notifying xxx about bug 736049.
> 2011-03-23 15:15:08 INFO    Notifying xxx about bug 736049.
> 2011-03-23 15:15:08 INFO    Notifying xxx about bug 736049.
> 2011-03-23 15:15:08 INFO    Creating lockfile:
> /var/lock/launchpad-send-bug-notifications.lock
> 2011-03-23 15:15:08 INFO    Notifying xxx about bug 736049.
> 2011-03-23 15:15:08 INFO    Notifying xxx about bug 736049.
> ...
> 2011-03-23 15:15:16 INFO    Notifying xxx about bug 733732.
> 
> Often also the notification lines in the log are several seconds or more
> apart, indicating the call to sendmail() blocks for a time. So I have 2
> questions:
> 
> 1. How is the new script invocation happening if the old one appears to
> still be running? My theory is that the new script starts and blocks
> until the old one finishes. And if the next one is slow too, then it all
> compounds....

That doesn't quite jibe with what I think we see here, but I could be wrong.  
The core issue does appear to be that it seems to be possible for a script to 
run simultaneously, though we haven't caught that smoking gun yet.

I don't intend to address this at the moment, though I'll be interested in 
tackling it when we go to bug rotation if someone else has not already looked 
at it.

> 
> 2. Why do some calls to sendmail() take so long to complete. And given
> they do, what can be done about it.

This is a good question, but the production logs appear to show that the core 
problem at the moment is in LP code.

> 
> So that's all for now. There's no definitive fix that's been applied,
> but the logs have given perhaps a little more insight into where to
> start looking next.

Thanks again, Ian and Danilo.

Gary



> 
> 
> On 24/03/11 23:05, Gary Poster wrote:
>> Thanks for the heads up, Stuart.
>> 
>> Stuart clarified on IRC that this is currently a scalability problem rather 
>> than a performance problem, though that could change.
>> 
>> This started happening after the most recent DB deploy.  The most likely 
>> cause is my own work to address bug 164196, "Quickly-undone actions 
>> shouldn't send mail notifications" (code: 
>> http://bazaar.launchpad.net/~launchpad-pqm/launchpad/devel/revision/12533#lib/lp/bugs/scripts/bugnotification.py).
>> 
>> There is an obvious optimization to try (get the activity record along with 
>> the notification that points to it). I expect that will reduce the database 
>> usage, but I would be very surprised if it would get us down to the previous 
>> number.  We need to look at more data than before in order to answer these 
>> questions.
>> 
>> I've created https://bugs.launchpad.net/launchpad/+bug/741684 for this 
>> optimization.
>> 
>> Gary
>> 
>> On Mar 24, 2011, at 4:25 AM, Stuart Bishop wrote:
>> 
>>> Hi.
>>> 
>>> The database utilization report has picked up that bug notifications
>>> is now chewing 17% of a master database CPU core. 2 months ago, it was
>>> using <1% of a master database CPU core.
>>> 
>>> This seems excessive, and could be a red flag to making things more
>>> complex with more advanced filters.
>>> 
>>> -- 
>>> Stuart Bishop <[email protected]>
>>> http://www.stuartbishop.net/
>> 
>> 
>> _______________________________________________
>> Mailing list: https://launchpad.net/~launchpad-dev
>> Post to     : [email protected]
>> Unsubscribe : https://launchpad.net/~launchpad-dev
>> More help   : https://help.launchpad.net/ListHelp
>> 
> 
> _______________________________________________
> Mailing list: https://launchpad.net/~launchpad-dev
> Post to     : [email protected]
> Unsubscribe : https://launchpad.net/~launchpad-dev
> More help   : https://help.launchpad.net/ListHelp


_______________________________________________
Mailing list: https://launchpad.net/~launchpad-dev
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~launchpad-dev
More help   : https://help.launchpad.net/ListHelp

Reply via email to