Hi Hayden,

The most usual form of percentile function (I believe) is where you have record values in a selection set (or "sample" from a "population" for the astute statisticians), and you wish to know if you sorted them all by value, and then grouped them in clusters of 1% so that you have a 100 such clusters in a row all numbered from 1 to 100, then in which n-th of those little clusters would entry x be?

Of course in real maths the "clusters" are really exact values and non-integer and in 
a set [ c | 0 < c < 100 ].

One of the short-cut methods of getting a percentile is simple indexing, but it won't give us an accurate x.xxx-th percentile, only an integer n-th percentile - which might suffice for your application and is a lot quicker to do.

Indexing simply requires ordering the data-values from small to large, then finding "the first value bigger than your value" in the list, subtract 1, divide by the total number of values in the list ( Count() ) and round it to find value that will be very very close to the actual percentile (but not exact).

Similarly, finding which value represents the n-th percentile where you specify the percentile and need the value for it (as is your question's case), you could simply find the n-th percentile index or indices and see which value(s) are in there, the last value of which (in case of multiple indices) usually represents it the best, especially if you reverse-engineer it back to a percentile.

This shortcut in SQL terms then:
SELECT `latency` FROM `latencytable` WHERE 1 ORDER BY `latency` ASC LIMIT (SELECT (Count() / ( 100 / ?1 ) ) FROM `latencytable` WHERE 1), 1;

The value ?1 in this case needs to be the required percentile, so if you are 
looking for the 25th percentile, then ?1 = 25.
(Note: ?1 = 0 may throw exceptions )

Essentially the query counts the values, finds the x-th percentile index (call it n), then lists the table ordered ascending starting from offset n and listing only 1 value, et voila.

Now this will be accurate enough on large samples. On really small samples where values are widely differing it will be more accurate to find a set of every index that falls within a percentile and then averaging it - which can easily be added to the above, but should not be needed and will take much more processing.

And if you have sets that are very very small (<50 items) then you should not 
have a need to know the percentile.


Hope this helps!
Ryan

(PS. in the query, I use "WHERE 1" and lots of quotes and spaces which all help to see the query structure better, but they are very superfluous in SQLite terms and may well be omitted.)



On 2013/11/30 01:38, Hayden Livingston wrote:
Is there any documentation or use case for percentile queries? I've only
seen the source code checked in by "drh".

Is it like the EXCEL 2013 function?

I have a table {operationId, latency) with a set of rows that have floating
values for latencies of a given operation.

I want to find the 25th %ile latency for this rowset.

(A) How do I create a Percentile Query?
(B) How do I make it work for the SQLite System.Data.Sqlite adapter in .NET?

Many thanks!
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Reply via email to