https://bugs.launchpad.net/launchpad/+bug/758587 describes a challenge we have; counting bugs is expensive.
portlets like those on https://bugs.staging.launchpad.net/ubuntu/+bugs?search=Search&field.status=In+Progress and https://bugs.staging.launchpad.net/ubuntu show various aggregates of bugs. They've been inaccurate (counting multiple tasks towards the count) in the past; I recently made them more accurate as part of making them faster; wgrant has just this week optimised the tags portlet. We're now about as fast as we can get while we do table scans of this data in our current architecture (*). Scanning the 700K bug and 900K bugtask tables together accounts for ~4 seconds of DB time once you factor in privacy [though there is some promise of improving on that, it won't be radically better]. A fact table (en.wikipedia.org/wiki/Fact_table) can answer a number of aggregrates much more quickly, but at a cost. The cost is twofold: we have to maintain the table, and its losing some precision, so somethings become more vague and best answered from the source data. I have put together a prototype at the SQL level; it performs adequately: we can satisfy queries for ubuntu scale tags (queries are at the end of this mail for those interested - or read the gory details in the bug): top 10 tags (logged in) completes in < 0.5 seconds and considers 33K rows open bugs (logged in) completes in < 0.3 seconds and considers 14K rows top 10 tags (anonymous) completes in < 0.2 seconds and considers 10K rows open bugs (with privacy) completes in < 0.2 seconds and considers 110 rows But what about the costs? And thats where we come to why I'm writing this mail. Maintaining the table is easy - we can start with a background job every few minutes, it takes 2 minutes to update globally on staging, I'm sure on prod it would be even snappier. We can also update it with triggers or from our appserver code. The *results* of queries on this table will be inaccurate on private bugs though, if a user has 2 or more paths to visibility. Consider bug X, which I am directly subscribed to, and which ~launchpad is also assigned to. The schema I've come up with for this table will have the bug X counted twice: - once with me as a viewer - once with ~launchpad as a viewer but because we've discarded the bug ids that contributed to those rows in the summary table, we cannot tell that we shouldn't add the counts together, so we see 1 too many bugs when querying the table. We have a choice: - make the data utterly precise - be a bit simpler and accept some inaccuracy I'd like to go with simpler an inaccurate for now because: - we went for years with inaccurate counts unnoticed by users [ who can tell that 700 should be 580, really?] - the error rate is pretty low - my tests so far suggest ~5% [of private bugs] - so the numbers won't be *far* off. - being more precise will be more complex - we'll need to query privacy separately So, specifically to jml & Deryck, but also to the whole team and any users following this discussion: is it ok to be slightly inflated but fast? -Rob (*) A horizontally sharded system could in principle calculate from raw data concurrently on 10 or 20 nodes and get the same numbers in 1/10th or 1/20th the time. Such system usually recommend having summary indices though which make that unnecessary. sample queries: tags (with privacy) explain analyze select tag, sum(count) from bugsummary left join teamparticipation on viewedby=teamparticipation.team where distribution=1 and (viewedby is NULL or teamparticipation.person=2) and sourcepackagename is NULL and tag is not null group by tag order by sum(count) desc limit 10; QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=68724.35..68724.38 rows=10 width=14) (actual time=430.363..430.367 rows=10 loops=1) -> Sort (cost=68724.35..68728.58 rows=1692 width=14) (actual time=430.363..430.364 rows=10 loops=1) Sort Key: (sum(bugsummary.count)) Sort Method: top-N heapsort Memory: 25kB -> HashAggregate (cost=68666.64..68687.79 rows=1692 width=14) (actual time=427.296..429.017 rows=4460 loops=1) -> Nested Loop Left Join (cost=0.00..66672.08 rows=398912 width=14) (actual time=106.304..414.371 rows=10341 loops=1) Filter: ((bugsummary.viewedby IS NULL) OR (teamparticipation.person = 2)) -> Seq Scan on bugsummary (cost=0.00..7440.35 rows=55700 width=18) (actual time=106.230..135.308 rows=33483 loops=1) Filter: ((sourcepackagename IS NULL) AND (tag IS NOT NULL) AND (distribution = 1)) -> Index Scan using teamparticipation_team_key on teamparticipation (cost=0.00..1.05 rows=1 width=8) (actual time=0.003..0.006 rows=5 loops=33483) Index Cond: (bugsummary.viewedby = teamparticipation.team) Total runtime: 430.495 ms (12 rows) tags (anonymous user) explain analyze select tag, sum(count) from bugsummary where distribution=1 and (viewedby is NULL) and sourcepackagename is NULL and tag is not null group by tag order by sum(count) desc limit 10; QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=7626.05..7626.07 rows=10 width=14) (actual time=191.697..191.704 rows=10 loops=1) -> Sort (cost=7626.05..7628.39 rows=935 width=14) (actual time=191.695..191.699 rows=10 loops=1) Sort Key: (sum(count)) Sort Method: top-N heapsort Memory: 25kB -> HashAggregate (cost=7594.16..7605.84 rows=935 width=14) (actual time=185.884..189.067 rows=4446 loops=1) -> Seq Scan on bugsummary (cost=0.00..7440.35 rows=30761 width=14) (actual time=142.195..168.853 rows=10006 loops=1) Filter: ((viewedby IS NULL) AND (sourcepackagename IS NULL) AND (tag IS NOT NULL) AND (distribution = 1)) Total runtime: 191.804 ms (8 rows) open bugs (with privacy) explain analyze select sum(count), status from bugsummary left join teamparticipation on bugsummary.viewedby=teamparticipation.team where distribution=1 and (viewedby is NULL or teamparticipation.person=2) and tag is NULL and status in (10, 15, 20, 21, 22, 25) and sourcepackagename is NULL group by status order by status; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------------------- Sort (cost=50096.87..50096.88 rows=1 width=8) (actual time=210.216..210.217 rows=6 loops=1) Sort Key: bugsummary.status Sort Method: quicksort Memory: 25kB -> HashAggregate (cost=50096.85..50096.86 rows=1 width=8) (actual time=210.195..210.198 rows=6 loops=1) -> Nested Loop Left Join (cost=2778.16..49589.40 rows=101490 width=8) (actual time=95.659..209.967 rows=133 loops=1) Filter: ((bugsummary.viewedby IS NULL) OR (teamparticipation.person = 2)) -> Bitmap Heap Scan on bugsummary (cost=2778.16..8831.38 rows=14171 width=12) (actual time=95.608..118.717 rows=6803 loops=1) Recheck Cond: (tag IS NULL) Filter: ((sourcepackagename IS NULL) AND (distribution = 1) AND (status = ANY ('{10,15,20,21,22,25}'::integer[]))) -> Bitmap Index Scan on bugsummary_tag (cost=0.00..2774.62 rows=162711 width=0) (actual time=35.177..35.177 rows=162711 loops=1) Index Cond: (tag IS NULL) -> Index Scan using teamparticipation_team_key on teamparticipation (cost=0.00..2.86 rows=1 width=8) (actual time=0.009..0.012 rows=3 loops=6803) Index Cond: (bugsummary.viewedby = teamparticipation.team) Total runtime: 210.336 ms (14 rows) Time: 281.852 ms open bugs(anonymous user) explain analyze select sum(count), status from bugsummary where distribution=1 and (viewedby is NULL ) and tag is NULL and status in (10, 15, 20, 21, 22, 25) and sourcepackagename is NULL group by status order by status; QUERY PLAN --------------------------------------------------------------------------------------------------------------------------------------------------------- Sort (cost=8868.95..8868.95 rows=1 width=8) (actual time=101.659..101.662 rows=6 loops=1) Sort Key: status Sort Method: quicksort Memory: 25kB -> HashAggregate (cost=8868.92..8868.94 rows=1 width=8) (actual time=101.634..101.639 rows=6 loops=1) -> Bitmap Heap Scan on bugsummary (cost=2776.57..8829.79 rows=7826 width=8) (actual time=91.224..101.527 rows=110 loops=1) Recheck Cond: (tag IS NULL) Filter: ((viewedby IS NULL) AND (sourcepackagename IS NULL) AND (distribution = 1) AND (status = ANY ('{10,15,20,21,22,25}'::integer[]))) -> Bitmap Index Scan on bugsummary_tag (cost=0.00..2774.62 rows=162711 width=0) (actual time=28.688..28.688 rows=162711 loops=1) Index Cond: (tag IS NULL) Total runtime: 101.770 ms (10 rows) Time: 103.844 ms _______________________________________________ Mailing list: https://launchpad.net/~launchpad-dev Post to : [email protected] Unsubscribe : https://launchpad.net/~launchpad-dev More help : https://help.launchpad.net/ListHelp

