I tried that, and the answers are very close. I need to do some more
testing as my dataset grows to see if it ties up with what is required.

My MySQL server takes ~4 minutes to do it "my" way, and ~1 minute to do:

select avg(scantime+processingtime)+(std(scantime+processingtime)*2)
from percentile where....
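For what it's worth, the single-pass nature of that query can be sketched
outside MySQL as well. Below is an illustrative Python/SQLite version (the
table and data are made up; SQLite has AVG() but no STD(), so the standard
deviation is computed as sqrt(E[x^2] - E[x]^2), which still needs only the
same single scan over the rows):

```python
import math
import sqlite3

# Hypothetical in-memory stand-in for the real `percentile` table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE percentile (scantime REAL, processingtime REAL)")
conn.executemany("INSERT INTO percentile VALUES (?, ?)",
                 [(i * 0.5, i * 0.25) for i in range(1, 101)])

# One scan yields both E[x] and E[x^2] for x = scantime + processingtime;
# the population standard deviation is then sqrt(E[x^2] - E[x]^2), which
# is essentially how a one-pass STD() can be computed.
mean, mean_sq = conn.execute(
    "SELECT AVG(scantime + processingtime),"
    "       AVG((scantime + processingtime) * (scantime + processingtime))"
    " FROM percentile"
).fetchone()
threshold = mean + 2 * math.sqrt(mean_sq - mean * mean)
print(threshold)
```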

The table currently has around 8,000,000 rows (in full production use
this will increase to ~100 million rows). Each calculation is based on
a selection of about 1-1.5 million rows. I also have an index as
follows:

create index total_idx on table (scantime,processingtime);

Thanks for all help supplied,

Karl

-----Original Message-----
From: Adam Nelson [mailto:[EMAIL PROTECTED]] 
Sent: 05 February 2003 18:32
To: 'Benjamin Pflugmann'; Karl Dyson
Cc: [EMAIL PROTECTED]
Subject: RE: Percentile calculations


I believe you can do:

select avg(scantime) from percentile;
select std(scantime) from percentile;

std() is a synonym for stddev(), which computes the standard deviation.
If the scantimes are roughly gaussian, then avg() + 2*std() is
approximately the 97.7th percentile; the 95th percentile proper would be
avg() + 1.645*std().
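As a quick sanity check of the normal approximation (synthetic data only,
not the real scantimes), the empirical 95th percentile of normally
distributed draws lands near avg + 1.645*std, while avg + 2*std sits a
bit higher:

```python
import random
import statistics

# Synthetic stand-in for scantime: 100,000 normal draws.  The real data
# may well not be this clean, so treat this as illustration only.
random.seed(42)
xs = sorted(random.gauss(100.0, 15.0) for _ in range(100_000))

mu = statistics.fmean(xs)
sigma = statistics.pstdev(xs)

# Empirical 95th percentile vs. the two normal-theory estimates.
p95 = xs[int(0.95 * len(xs))]
est_95 = mu + 1.645 * sigma   # one-sided 95th percentile
est_977 = mu + 2.0 * sigma    # roughly the 97.7th percentile
print(p95, est_95, est_977)
```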

This should be very fast and quite accurate. It is probably better
anyway, since I would think you want the value within which two standard
deviations fall rather than exactly 95%; that is more useful from a
statistics point of view.

> -----Original Message-----
> From: Benjamin Pflugmann [mailto:[EMAIL PROTECTED]]
> Sent: Saturday, February 01, 2003 2:17 AM
> To: Karl Dyson
> Cc: [EMAIL PROTECTED]
> Subject: Re: Percentile calculations
> 
> 
> Hi.
> 
> On Fri 2003-01-31 at 17:22:37 -0000, [EMAIL PROTECTED] wrote: 
> [...]
> > select count(*) from percentile where <criteria>;
> > 
> > Work out 95% of this value.
> > 
> > create temporary table percentile (id int unsigned auto_increment 
> > primary key, scantime decimal(20,10));
> > 
> > insert into percentile (scantime) select processingtime from emails 
> > where <same criteria> order by processingtime desc limit <value from
> > above>;
> > 
> > select min(scantime) from percentile;
> 
> If I am not mistaken, you can pick the row in question without a 
> separate table:
> 
> Basically, you limit the query to the first n rows (in descending
> order) and then pick the one with the lowest value, which should have
> been the last in the limited range. So you could pick that row 
> directly, I think:
> 
>   SELECT processingtime FROM emails
>   WHERE <blah> ORDER BY processingtime DESC LIMIT <value>,1
> 
> Second, why use 95% of the table if you can do with 5%? In other 
> words, calculate the remainder, sort ascending, and limit by the 
> remainder you calculated. The difference should be especially 
> noticeable in your case, where you copy those rows and can avoid 
> doing so for 90% of them.
> 
> HTH,
> 
>       Benjamin.
> 
> --
> [EMAIL PROTECTED]
> 
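The two suggestions quoted above (picking the row directly with a LIMIT
offset, and sorting ascending so only the small remainder is skipped) can
be sketched with SQLite and toy data. Table name is from the thread; the
values are made up so the ranks are easy to check:

```python
import sqlite3

# Toy stand-in for the `emails` table; the WHERE criteria from the
# thread are omitted.  Values 1..1000 make ranks easy to verify.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emails (processingtime REAL)")
conn.executemany("INSERT INTO emails VALUES (?)",
                 [(float(i),) for i in range(1, 1001)])

n = conn.execute("SELECT COUNT(*) FROM emails").fetchone()[0]
k = round(0.95 * n)  # the rank used in the original approach

# Pick the k-th row directly instead of copying rows to a temp table.
# MySQL's "LIMIT k-1,1" is spelled "LIMIT 1 OFFSET k-1" here; note the
# offset is k - 1 because OFFSET counts skipped rows.
(desc_val,) = conn.execute(
    "SELECT processingtime FROM emails"
    " ORDER BY processingtime DESC LIMIT 1 OFFSET ?", (k - 1,)).fetchone()

# Same row via the ascending sort and the small remainder: the k-th
# largest of n rows sits at offset n - k when sorted ascending.
(asc_val,) = conn.execute(
    "SELECT processingtime FROM emails"
    " ORDER BY processingtime ASC LIMIT 1 OFFSET ?", (n - k,)).fetchone()

print(desc_val, asc_val)
```

Both queries return the same row, so the ascending form does the same job
while skipping far fewer rows than the original copy-95%-of-the-table
approach.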




________________________________________________________________________
This email has been scanned for all viruses by the MessageLabs SkyScan
service. For more information on a proactive anti-virus service working
around the clock, around the globe, visit http://www.messagelabs.com
________________________________________________________________________


---------------------------------------------------------------------
Before posting, please check:
   http://www.mysql.com/manual.php   (the manual)
   http://lists.mysql.com/           (the list archive)

To request this thread, e-mail <[EMAIL PROTECTED]>
To unsubscribe, e-mail <[EMAIL PROTECTED]>
Trouble unsubscribing? Try: http://lists.mysql.com/php/unsubscribe.php
