On Wed, Apr 13, 2011 at 11:15 PM, John Arbash Meinel <[email protected]> wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > >> I have put together a prototype at the SQL level; it performs >> adequately: we can satisfy queries for ubuntu scale tags (queries are >> at the end of this mail for those interested - or read the gory >> details in the bug): >> top 10 tags (logged in) completes in < 0.5 seconds and considers 33K rows >> open bugs (logged in) completes in < 0.3 seconds and considers 14K rows >> top 10 tags (anonymous) completes in < 0.2 seconds and considers 10K rows >> open bugs (with privacy) completes in < 0.2 seconds and considers 110 rows > > I'm a little curious about 'blowout'. Because you have one row per > tag*status*milestone. 'status' is fortunately bounded, but milestone can > grow considerably. (In case of bzr, we currently have 150 milestones on > LP, though only a few open ones.) > > 'tag' also has the tendency to accrue many items. I think you mentioned > that Launchpad has 151 *official* tags, and I don't know how many more > unofficial tags you have. > > So for a bzr/launchpad target, you could have 150*151*10 = 226,500 rows. > I'm not 100% sure how many actual bugs we have, but it is on the order > of <5,000 bugs (including fixed, open, and invalid bugs). > ... > > I'm sure you could optimize and not include the full cross product. Any > rows that sum to 0 would not be included, etc. If I read your comments > correctly, the full expansion on staging was only 2.6M rows? (and 1.3M > were for ubuntu itself?) > > I do realize your queries are nicely efficient at getting a subset. But > turning a 900k BugTask table + 700k Bug table into a 2.6M row table > doesn't seem to be scaling in the right direction.
Its not that bad :) - and the selectivity is great. select id from product where name='bzr'; id ------ 1186 (1 row) Time: 257.583 ms lpmain_staging=> select count(*) from bugsummary where product=1186; count ------- 1170 (1 row) select count(*) from bugtask where product=1186; count ------- 5833 Time: 13.447 ms lpmain_staging=> select count(*) from bugsummary; count -------- 371308 So its 1/5th the size for bzr, and 1/3 the size for everything - but its one table not three, so its actually closer to 1/10th for bzr and 1/6th for everything. > I know that bzr has driven New and Critical to 0 quite often. If privacy > is the only inaccurate measure, you could always split the numbering up. > So you would see New 10 (~2 private). You could even include the 2 in > the count, and then put ~2 private in a mouse over tooltip. > > Certainly I found it annoying when closing the last New bug and I > returned to the page to have it say "1 New" and clicking on the New link > gave me a search result with 0 entries in it. (I don't know that I > bothered to file a bug, maybe.) That was memcache caching, which if we maintain this page in realtime won't be needed. > Below a certain threshold, I imagine the actual queries for counts are > actually pretty fast. I don't know the queries, but if you have only 10 > New bugs, it shouldn't be very slow to compute. For counts above some > threshold (where the DB wants to do a table scan), you could use > aggregate numbers and just truncate them. So instead of seeing 81,962 > New, you would see 82k New. (That makes it hard to truncate at, eg 100 > bugs, though.) We have some code to estimate counts; the problem is tying it all in :) - in some ways this aggregate is better than estimation because unlike estimation it won't be far out ever (estimation can sometimes be -very- wrong). > Anyway, my big concern is that it turns into a BranchRevision table. > Which AFAIK also has good selectivity but its sheer size makes it very > cumbersome to deal with. Also, as far as accuracy, if you only update it We have queries on it that bring back 200000 rows to do set differences; thats 10 times the data for the fact table - and its being joined on awkwardly; making those do a graphtraversal in the query (SELECT WITH RECURSIVE) should get it processing a few thousand rows most of the time - which will be a lot faster. > in Batch mode, then it is going to have the same accuracy problem that > memcache did. And you'll close the last bug without the satisfaction of > seeing "0 New". (There is a real feedback loop of 'I accomplished > something' that I think you want to preserve.) Sure, but we have that problem now with memcache. What we have now is: - stale figures (cached at all) - cached wrongly (you can see the figures Martin did a few minutes ago) I think we can easily move to - cached (for a few minutes) and inflated-on-private-bugs only with a second pass to get to - live but inflated-on-private-bugs only And thus the question - should we? :) -Rob _______________________________________________ Mailing list: https://launchpad.net/~launchpad-dev Post to : [email protected] Unsubscribe : https://launchpad.net/~launchpad-dev More help : https://help.launchpad.net/ListHelp

