I tried that, and the answers are very close. I need to do some more testing as my dataset grows to see if it ties up with what is required.
My SQL server takes ~4 minutes to do it "my" way, and ~1 minute to do:

    select avg(scantime+processingtime) + (std(scantime+processingtime)*2)
    from percentile where ...;

The table currently has around 8,000,000 rows (when in full production
use this will increase to ~100 million). Each calculation is based on a
selection of about 1-1.5 million rows. I also have an index as follows:

    create index total_idx on table (scantime,processingtime);

Thanks for all the help supplied,

Karl

-----Original Message-----
From: Adam Nelson [mailto:[EMAIL PROTECTED]]
Sent: 05 February 2003 18:32
To: 'Benjamin Pflugmann'; Karl Dyson
Cc: [EMAIL PROTECTED]
Subject: RE: Percentile calculations

I believe you can do:

    select avg(scantime) from percentile;
    select std(scantime) from percentile;

std() is a synonym for stddev(), which finds the standard deviation. If
the scantimes are gaussian (a pretty good assumption if it's a large
dataset), then avg() + 2*std() will land close to the 95th percentile
(strictly, two standard deviations above the mean is about the 97.7th
percentile; avg() + 1.645*std() is the 95th). This should be very fast
and quite accurate. Really, it's probably better, since I would think
that you want the value within which 2 standard deviations fall rather
than exactly 95%. That is more useful from a statistics point of view.

> -----Original Message-----
> From: Benjamin Pflugmann [mailto:[EMAIL PROTECTED]]
> Sent: Saturday, February 01, 2003 2:17 AM
> To: Karl Dyson
> Cc: [EMAIL PROTECTED]
> Subject: Re: Percentile calculations
>
> Hi.
>
> On Fri 2003-01-31 at 17:22:37 -0000, [EMAIL PROTECTED] wrote:
> [...]
> > select count(*) from percentile where <criteria>;
> >
> > Work out 95% of this value.
> >
> > create temporary table percentile (id int unsigned auto_increment
> > primary key, scantime decimal(20,10));
> >
> > insert into percentile (scantime) select processingtime from emails
> > where <same criteria> order by processingtime desc limit <value from
> > above>;
> >
> > select min(scantime) from percentile;
>
> If I am not mistaken, you can pick the row in question without a
> separate table:
>
> Basically, you limit the query to the first n rows (in descending
> order) and then pick the one with the lowest value, which should have
> been the last in the limited range. So you could pick that row
> directly, I think:
>
> SELECT processingtime FROM emails
> WHERE <blah> ORDER BY processingtime DESC LIMIT <value>,1
>
> Second, why use 95% of the table if you can do with 5%? In other
> words, calculate the remainder, sort ascending, and limit by the
> remainder you calculated. The difference should be especially
> noticeable in your case, where you copy those rows and can avoid
> doing so for 90% of the rows.
>
> HTH,
>
> Benjamin.
>
> --
> [EMAIL PROTECTED]

________________________________________________________________________
This email has been scanned for all viruses by the MessageLabs SkyScan
service. For more information on a proactive anti-virus service working
around the clock, around the globe, visit http://www.messagelabs.com
________________________________________________________________________
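[Editor's note: not part of the original thread.] A quick sanity check on the
stddev shortcut discussed above, sketched in Python with synthetic data (the
distribution parameters are made up for the demo): for gaussian data,
mean + 1.645*stdev approximates the 95th percentile, while mean + 2*stdev
lands near the 97.7th, i.e. slightly above the exact 95th-percentile value.

```python
import random
import statistics

random.seed(42)
# Synthetic "scantimes": gaussian with mean 100 and standard deviation 15.
times = [random.gauss(100.0, 15.0) for _ in range(100_000)]

mean = statistics.fmean(times)
sd = statistics.stdev(times)

times.sort()
p95_exact = times[int(0.95 * len(times))]  # empirical 95th percentile
p95_approx = mean + 1.645 * sd             # gaussian 95th-percentile estimate
two_sigma = mean + 2 * sd                  # ~97.7th percentile, not the 95th

print(round(p95_exact, 1), round(p95_approx, 1), round(two_sigma, 1))
```

As the thread notes, either estimate is a single pass over the data, which is
why it is so much faster than sorting out an exact percentile.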
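[Editor's note: not part of the original thread.] Benjamin's single-query
idea can be demonstrated without a MySQL server: the sketch below uses
Python's sqlite3 (SQLite accepts MySQL's `LIMIT offset,count` form, and the
table name, column, and values are invented for the demo). It shows that
skipping the top n-k rows in descending order and skipping k-1 rows in
ascending order land on the same 95th-percentile row, so the ascending
variant with the small remainder is equivalent.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE emails (processingtime REAL)")
con.executemany("INSERT INTO emails VALUES (?)",
                [(float(i),) for i in range(1, 1001)])  # values 1.0 .. 1000.0

n = con.execute("SELECT COUNT(*) FROM emails").fetchone()[0]
k = round(n * 0.95)  # rank (from the bottom) of the 95th-percentile row

# Descending order: skip the top n-k rows; the next row is the answer.
p95_desc = con.execute(
    "SELECT processingtime FROM emails "
    "ORDER BY processingtime DESC LIMIT ?,1", (n - k,)).fetchone()[0]

# Ascending order: skip k-1 rows to land on the very same row.
p95_asc = con.execute(
    "SELECT processingtime FROM emails "
    "ORDER BY processingtime ASC LIMIT ?,1", (k - 1,)).fetchone()[0]

print(p95_desc, p95_asc)  # both queries return the same row
```

Note the off-by-one detail: `LIMIT x,1` skips x rows and returns the row
after them, so the offsets above are n-k and k-1 rather than n-k+1 and k.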
---------------------------------------------------------------------
Before posting, please check:
   http://www.mysql.com/manual.php   (the manual)
   http://lists.mysql.com/           (the list archive)

To request this thread, e-mail <[EMAIL PROTECTED]>
To unsubscribe, e-mail <[EMAIL PROTECTED]>
Trouble unsubscribing? Try: http://lists.mysql.com/php/unsubscribe.php