Ken Geis wrote:
The docs for org.apache.commons.math.stat.descriptive.rank.Percentile state that "To compute percentiles, the data must be (totally) ordered." This is flat out wrong.

Sorted, no, ordered, yes. The comment says, "To compute percentiles, the data must be (totally) ordered. Input arrays are copied and then sorted using Arrays.sort(double[]). The ordering used by Arrays.sort(double[]) is the one determined by Double.compareTo(Double). This ordering makes Double.NaN larger than any other value (including Double.POSITIVE_INFINITY). Therefore, for example, the median (50th percentile) of {0, 1, 2, 3, 4, Double.NaN} evaluates to 2.5."


In order for order statistics to be *defined*, you need to have a total ordering defined over the set of values. That was the intention of the first sentence above. The last part of the comment effectively communicates what ordering we are using. Whatever implementation we use, we will still need to define and document a total ordering (natural one to use the one above). I understand that the first sentence could be interpreted to mean that there is no way to compute percentiles without copying and sorting the data, which I agree is not true.


Sorting first with quicksort leads to O(n lg n) time, and there are selection algorithms for selection that are O(n). Details can be found in the standard texts such as Introduction To Algorithms or Numerical Recipes.

If no one else cares to implement this, I certainly will within the next several weeks.

Patches welcome :-) Pls make sure to fully document the algorithm (similar to the current javadoc) and avoid lifting anything directly from Numerical Recipes (there are copyright restrictions on their algorithms and code / pseudocode). Thanks in advance.


Phil

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Reply via email to