As part of performance day, I wanted to pick something new to work on, and the oops reports (which sadly are not able to be made public as yet - they could disclose personal data) provide a good basis to do that. So I analysed the timeouts from the most recent edge report I had handy. Edge because it has most of our fixes.
I've elided non performance, and personal data, from this. A few over-view thoughts: - the misgrouping of timeout OOPS makes this *really really hard*. I wonder if turning it off entirely would help, and just group by urls ? Nearly all of the OOPS's I looked at were linked to an irrelevant bug. - getting ++profile++ in place on staging will really help for analysing the non-sql-slowness we're seeing on a growing fraction of OOPS. - Many were death-by-sql-roundtrips - some were python - a couple look like db-tuning exercises - some were just plain weird I'll happily dig out detail from OOPS if anyone wants to look into a particular timeout. On Mon, Aug 9, 2010 at 12:55 PM, OOPS Summaries <[email protected]> wrote: > * 64 Time Outs This is great! > === Top 10 Time Out Counts by Page ID === > > Hard / Soft Page ID > 13 / 44 BugTask:+index > 12 / 6 DistroSeries:+index > 8 / 1974 DistroSeries:EntryResource:getBuildRecords > 8 / 160 Person:+participation > 7 / 4 BugTask:+editstatus-page > 3 / 77 Sprint:+temp-meeting-export > 3 / 4 DistributionSourcePackage:+filebug > 3 / 3 Person:+map > 2 / 0 BugTask:+distrotask > 1 / 0 Milestone:+index > === Top 15 Time Outs (total of 25 unique items) === > > 10 SELECT Packaging.datecreated, Packaging.distroseries, Packaging.id, > Packaging.owner, Packaging.pa ... ductSeries.product = Product.id WHERE > Packaging.distroseries = %s ORDER BY SourcePackageName.name: > GET: 10 Robots: 1 Local: 5 > 5 https://edge.launchpad.net/ubuntu/lucid/+index (DistroSeries:+index) > OOPS-1681EB2995, OOPS-1681EB3871, OOPS-1681EB3901, OOPS-1681EC3753, > OOPS-1681ED3039 > 5 https://edge.launchpad.net/ubuntu/maverick/+index > (DistroSeries:+index) > OOPS-1681EB2638, OOPS-1681EC2433, OOPS-1681ED1407, OOPS-1681ED2293, > OOPS-1681ED2619 - this is https://bugs.edge.launchpad.net/launchpad-registry/+bug/612358 which is inprogress, but possibly we're getting confused signals from the oops - the 'obvious' reading may be wrong and the portlet is simply the first-after culprit from some slow non-sql code. > 10 SELECT PublishedPackage.archive, > PublishedPackage.binarypackagedescription, PublishedPackage.bina ... = $INT > AND PublishedPackage.archive IN ($INT, $INT) ORDER BY PublishedPackage.id > DESC LIMIT $INT: > POST: 10 Robots: 0 Local: 10 > 3 https://bugs.edge.launchpad.net/ubuntu/+source/linux/+filebug/BLOBID > (DistributionSourcePackage:+filebug) > OOPS-1681EA1958, OOPS-1681EA1963, OOPS-1681EB1966 > 1 https://bugs.edge.launchpad.net/ubuntu/+bug/615153/+editstatus > (BugTask:+editstatus-page) > OOPS-1681ED4365 > 1 > https://bugs.edge.launchpad.net/ubuntu/+source/firefox-3.5/+bug/614861/+editstatus > (BugTask:+editstatus-page) > OOPS-1681EB1816 > [5 other URLs] This is fixed in db-devel - its in our inventory now and should disappear completely midweek :). Its caused by guessPackageNames. > 9 SELECT Person.account, Person.addressline1, Person.addressline2, > Person.city, Person.country, Per ... r, Person.verbose_bugnotifications, > Person.visibility FROM Person WHERE Person.id = %s LIMIT $INT: > GET: 9 Robots: 0 Local: 4 > 2 > https://bugs.edge.launchpad.net/ubuntu/+source/flashplugin-nonfree/+bug/410407/+index > (BugTask:+index) > OOPS-1681ED2737, OOPS-1681ED3789 > 2 https://edge.launchpad.net/sprints/uds-m/+temp-meeting-export > (Sprint:+temp-meeting-export) > OOPS-1681EC3472, OOPS-1681ED873 > 2 https://edge.launchpad.net/ubuntu/lucid/+index (DistroSeries:+index) > OOPS-1681EC3318, OOPS-1681ED3079 > [3 other URLs] This is, I think, three different oops being misgrouped. The bug page one is likely being driven by a slow formatter - it has lots of python time between calls to is_valid_person. https://bugs.edge.launchpad.net/malone/+bug/607935 it does have lots of sql calls going on too, which won't be helping. > 5 SELECT COUNT(*) FROM Archive, BinaryPackageBuild, BuildFarmJob, > PackageBuild WHERE distro_arch_se ... tus=$INT AND Archive.purpose IN > ($INT,$INT) AND Archive.id = PackageBuild.archive AND ($INT=$INT): > GET: 5 Robots: 0 Local: 0 > 5 https://api.edge.launchpad.net/1.0/ubuntu/maverick > (DistroSeries:EntryResource:getBuildRecords) > OOPS-1681EA1667, OOPS-1681EB1345, OOPS-1681EB3523, OOPS-1681EC1367, > OOPS-1681ED346 -- https://lp-oops.canonical.com/oops.py/?oopsid=1681EA1667 is very odd - 4 second timeout, there is a bug about this already - I'd love it if stuart/gary could comment on it - https://bugs.edge.launchpad.net/launchpad-foundations/+bug/439945 -- https://lp-oops.canonical.com/oops.py/?oopsid=1681EB1345 is more reasonable: 10 seconds sql 4 seconds nonsql. - APIs are involved - 4 seconds of sql time this request shouldn't need: https://bugs.launchpad.net/soyuz/+bug/590708 - also the query mentioned in that same bug is still a problem and needs fixing. We have two different things on one bug, may want to split that for QA clarity. > 4 SELECT MailingList.date_activated, MailingList.date_registered, > MailingList.date_reviewed, Mailin ... gList.team = Person.id AND Person.name > = $STRING AND Person.teamowner IS NOT NULL AND ($INT=$INT): > GET: 4 Robots: 0 Local: 4 > 1 https://bugs.edge.launchpad.net/%7Ecody-somerville/+participation > (Person:+participation) > OOPS-1681ED4390 > 1 https://edge.launchpad.net/%7Elifeless/+participation > (Person:+participation) > OOPS-1681EC4106 > 1 https://edge.launchpad.net/%7Emcasadevall/+participation > (Person:+participation) > OOPS-1681EB2147 > [1 other URLs] This is death by a thousand cuts from mailing list attribute lookups - needs prepopulation a-la _all_members or similar, its in progress already. https://bugs.launchpad.net/launchpad-registry/+bug/607879 > 3 SELECT BinaryPackageBuild.distro_arch_series, BinaryPackageBuild.id, > BinaryPackageBuild.package_b ... kageBuild.archive AND ($INT=$INT) ORDER BY > BuildFarmJob.date_finished DESC LIMIT $INT OFFSET $INT: > GET: 3 Robots: 0 Local: 0 > 3 https://api.edge.launchpad.net/1.0/ubuntu/maverick > (DistroSeries:EntryResource:getBuildRecords) > OOPS-1681EB3256, OOPS-1681EB3322, OOPS-1681EB4189 This is the same issue two sections up - grouping issue again. > 3 SELECT PersonLocation.date_created, PersonLocation.date_last_modified, > PersonLocation.id, PersonL ... isible IS TRUE AND Person.id = > PersonLocation.person AND Person.teamowner IS NULL AND ($INT=$INT): > GET: 3 Robots: 3 Local: 0 > 3 https://edge.launchpad.net/%7Elocoteams/+map (Person:+map) > OOPS-1681EB2233, OOPS-1681EC4659, OOPS-1681EC951 These oops all seem to spend ~ 30 seconds in lala land after a 570ms query - we need a staging profile generated for this. ++profile++ would help immensely here. > 2 SELECT Bug.heat FROM Bug, Bugtask, DistroSeries WHERE Bugtask.bug = > Bug.id AND Bugtask.distroseries = DistroSeries.id AND > DistroSeries.distribution = $INT ORDER BY Bug.heat DESC LIMIT $INT: > POST: 2 Robots: 0 Local: 2 > 2 https://bugs.edge.launchpad.net/ubuntu/+bug/614983/+distrotask > (BugTask:+distrotask) > OOPS-1681ED2225, OOPS-1681ED2288 Also misgrouped in the oops server, these are timing out due to SELECT Bug.heat FROM Bug, Bugtask, DistroSeries WHERE Bugtask.bug = Bug.id AND Bugtask.distroseries = DistroSeries.id AND DistroSeries.distribution = 3 ORDER BY Bug.heat DESC LIMIT 1; taking forever - filed a bug and tagged for Stuart to look at. https://bugs.edge.launchpad.net/malone/+bug/615644 > 2 SELECT Person.account, Person.addressline1, Person.addressline2, > Person.city, Person.country, Per ... erson.id AND TeamParticipation.person = > %s AND ($INT=$INT) ORDER BY Person.datecreated LIMIT $INT: > GET: 2 Robots: 0 Local: 2 > 1 https://bugs.edge.launchpad.net/%7Epitti/+participation > (Person:+participation) > OOPS-1681EB1999 > 1 https://edge.launchpad.net/%7Emdz/+participation > (Person:+participation) > OOPS-1681ED615 This is +participation again, see under 'misgrouping' > 1 SELECT %s FROM (SELECT BranchRevision.branch, BranchRevision.id, > BranchRevision.revision, BranchR ... OT IN ( SELECT revision FROM > BranchRevision WHERE branch = $INT) LIMIT $INT) AS "_tmp" LIMIT $INT: > GET: 1 Robots: 0 Local: 1 > 1 > https://code.edge.launchpad.net/%7Ejames-w/launchpad/no-more-sampledata-2/+merge/31891/+index > (BranchMergeProposal:+index) > OOPS-1681EC3984 https://bugs.edge.launchpad.net/launchpad-code/+bug/615647 - There is a single query that is 3ms on staging, but took 13seconds on production, which may fall under the 'we need more diagnostics to be able to fix these things' category. > 1 SELECT %s FROM (SELECT BugSubscription.bug, BugSubscription.date_created, > BugSubscription.id, Bug ... (SELECT id FROM Bug WHERE duplicateof = $INT) > AND person = $INT LIMIT $INT) AS "_tmp" LIMIT $INT: > GET: 1 Robots: 0 Local: 0 > 1 > https://bugs.edge.launchpad.net/ubuntu/+source/arts/+bug/320915/+index > (BugTask:+index) > OOPS-1681EB4115 The OOPS is again misgrouped by oopstools. This is a death by lots-of-sql - 800 queries! Probably very amenable to the sort of thing my /participants branch does. The actual bug is https://bugs.launchpad.net/bugs/607935. It also has pretty high python time, which will need some attention to. And I ran out of steam here with 4 groups still to analyse - but we're down to 1-per-day, so fairly far down the 'look at it now pile'. >From here, I think its time to pop the top bug off the timeout pile. -Rob _______________________________________________ Mailing list: https://launchpad.net/~launchpad-dev Post to : [email protected] Unsubscribe : https://launchpad.net/~launchpad-dev More help : https://help.launchpad.net/ListHelp

