[Rd] Suggestion for quantile.default()

Rob Hyndman Mon, 19 Jul 2004 21:45:23 -0700

I'm not sure who is responsible for quantile(), but I assume they read this list. Ivan Frohne and I have produced a revision of the quantile.default() function which enables the computation of alternative sample quantile definitions. The code and .Rd file are attached.

This enables the user to produce quantiles that are equivalent to those in various statistics package. There is a type argument that allows one to choose between the various sample quantile methods. type=7 gives identical results to the current R function quantile.default(). In our revised function, type=8 is the default following the recommendation of Hyndman and Fan (American Statistician, 1996).

We suggest the attached function replaces the current quantile.default() function in R as it provides additional functionality without increasing computation time or affecting ease-of-use. If backwards-compatibility is important, we are happy to set type=7 as the default. However, we prefer type=8. For moderate to large sample sizes, the difference is negligble.

The code and associated documentation is as close as possible to what already exists.

Regards,
Rob Hyndman


__________________________________________________
Professor Rob J Hyndman
Department of Econometrics & Business Statistics
Monash University, VIC 3800, Australia.
http://www-personal.buseco.monash.edu.au/~hyndman/

quantile.default <- function(x, probs = seq(0, 1, 0.25), na.rm = FALSE,
                           names = TRUE, type = 8, ...)                                
                                             
{
    if (na.rm) 
        x <- x[!is.na(x)]
    else if (any(is.na(x)))
        stop("Missing values and NaN's not allowed if `na.rm' is FALSE")
    if (any((p.ok <- !is.na(probs)) & (probs < 0 | probs > 1))) 
        stop("probs outside [0,1]")
    if (na.p <- any(!p.ok)) {
        o.pr <- probs
        probs <- probs[p.ok]
    }
    type <- as.integer(type)   
    if (is.na(type) || (type < 1 | type > 9))
        stop("type outside range [1,9]")
    np <- length(probs)
    n <- length(x)
    if (n > 0 && np > 0) {                              
        if (type <= 3) {
            ## Types 1, 2 and 3 are discontinuous sample qs.
            if (type == 3)
                nppm <- n * probs - .5          # n * probs + m; m = -0.5
            else 
                nppm <- n * probs               # m = 0
            j <- floor(nppm) 
            switch(type,
                        h <- ifelse(nppm > j, 1, 0),        # type 1
                        h <- ifelse(nppm > j, 1, 0.5),      # type 2
                        h <- ifelse((nppm == j) &&
                             ((j %% 2) == 0), 0, 1))        # type 3
        }
        else {
            ## Types 4 through 9 are continuous sample qs.
            switch(type - 3,
                            {a <- 0; b <- 1},               # type 4
                            a <- b <- 0.5,                  # type 5
                            a <- b <- 0,                    # type 6
                            a <- b <- 1,                    # type 7
                            a <- b <- 1 / 3,                # type 8
                            a <- b <- 3 / 8)                # type 9
            nppm <- a + probs * (n + 1 - a - b) # n*probs + m
            j <- floor(nppm)                    # m = a + probs*(1 - a - b)
            h <- nppm - j 
        }
        x <- sort(x, partial = unique(c(1,j[j>0 & j<=n],(j+1)[j>0 & j<n],n)))
        x <- c(x[1], x[1], x, x[n], x[n])
        qs <- (1 - h) * x[j + 2] + h * x[j + 3]
    }
    else {
        qs <- rep(as.numeric(NA), np)
    }
    if (names && np > 0) {
        dig <- max(2, getOption("digits"))
        names(qs) <- paste(if (np < 100) 
            formatC(100 * probs, format = "fg", wid = 1, digits = dig)
        else format(100 * probs, trim = TRUE, digits = dig), 
            "%", sep = "")
    }
    if (na.p) {
        o.pr[p.ok] <- qs
        names(o.pr) <- rep("", length(o.pr))
        names(o.pr)[p.ok] <- names(qs)
        o.pr
    }
    else qs
}

\name{quantile}
\title{Sample Quantiles}
\alias{quantile}
\alias{quantile.default}
\description{
    The generic function \code{quantile} produces sample quantiles corresponding to the
    given probabilities. The smallest observation corresponds to a probability of
    0 and the largest to a probability of 1.
}
\usage{
quantile(x, \dots)


\method{quantile}{default}(x, probs = seq(0, 1, 0.25), na.rm = FALSE,
         names = TRUE, type = 8, ...)
}
\arguments{
    \item{x}{numeric vector whose sample quantiles are wanted.}
    \item{probs}{numeric vector of probabilities with values in \eqn{[0,1]}{[0,1]}.}
    \item{na.rm}{logical; if \code{TRUE} any \code{NA} or \code{NaN}
        is removed from \code{x} before the quantiles are computed.  If
        \code{FALSE} the presence of \code{NA} or \code{NaN} in \code{x}
        aborts the function.}
    \item{names}{logical; if \code{TRUE}, the result has a \code{\link[base]{names}} 
attribute.
    Set to FALSE for efficiency when \code{probs} is long.}
    \item{type}{an integer between 1 and 9 selecting one of the
         nine quantile algorithms detailed below to be used.}
    \item{\dots}{further arguments passed to or from other methods}
}
\value{
A vector of the same length as \code{probs} is returned;
if \code{names = TRUE}, it has a \code{\link[base]{names}} attribute.

\code{\link[base]{NA}} and \code{\link[base]{NaN}} values in \code{probs} are 
propagated to the result.
}
\details{
\code{quantile} returns estimates of underlying distribution quantiles
based on one or two order statistics from the supplied elements in \code{x} at
probabilities in \code{probs}.  One of the nine quantile algorithms discussed in
Hyndman and Fan (1996), selected by \code{type}, is employed.

Sample quantiles of type \eqn{i} are defined by
\deqn{Q_{i}(p) = (1 - \gamma)x_{j} + \gamma x_{j+1}}{Q[i](p) = (1 - gamma) x[j] + 
gamma x[j+1],} where
\eqn{1 \le i \le 9}{1 <= i <= 9}, \eqn{  }
\eqn{\frac{j - m}{n} \le p < \frac{j - m + 1}{n}}{(j - m) / n <= p < (j - m + 1) / n}, 
\eqn{  }
\eqn{x_{j}}{x[j]} is the \eqn{j}th order statistic,
\eqn{n} is the sample size, and \eqn{m} is a constant determined by
the sample quantile type.
Here \eqn{\gamma}{gamma} depends on
\eqn{j = \lfloor np + m\rfloor}{j = floor(np + m),} and
\eqn{g = np + m - j.}

For the continuous sample quantile types (4 through 9), the sample quantiles can be 
obtained
by linear interpolation between the \eqn{k}th order statistic and \eqn{p(k)}:
\deqn{p(k) = \frac{k - \alpha} {n - \alpha - \beta + 1}}{p(k) = (k - alpha) / (n - 
alpha - beta + 1),}
where \eqn{\alpha}{alpha} and \eqn{\beta}{beta} are constants determined by the type.
Further,
\eqn{m = \alpha + p \left( 1 - \alpha - \beta \right)}{m = alpha + p(1 - alpha - 
beta),}
and \eqn{\gamma = g = np + m - j}{gamma = g = np + m - j.}


            \strong{Discontinuous sample quantile types 1, 2, and 3}

\describe{
\item{Type 1}{Inverse of empirical distribution function.}
\item{Type 2}{Similar to type 1 but with averaging at discontinuities.}
\item{Type 3}{SAS definition: nearest even order statistic.}
}


            \strong{Continuous sample quantile types 4 through 9}

\describe{
\item{Type 4}{\eqn{p(k) = \frac{k}{n}}{p(k) = k / n}.
    That is, linear interpolation of the empirical cdf.}

\item{Type 5}{\eqn{p(k) = \frac{k - 0.5}{n}}{p(k) = (k - 0.5) / n}.
    That is a piecewise linear function where the knots are the values
    midway through the steps of the empirical cdf. This is popular amongst 
hydrologists.}

\item{Type 6}{\eqn{p(k) = \frac{k}{n + 1}}{p(k) = k / (n + 1)}.
    Thus \eqn{p(k) = \mbox{E}[F(x_{k})]}{p(k) = E[F(x[k])]}. This is used by Minitab 
and by SPSS.}

\item{Type 7}{\eqn{p(k) = \frac{k - 1}{n - 1}}{p(k) = (k - 1) / (n - 1)}.
    In this case, \eqn{p(k) = \mbox{mode}[F(x_{k})]}{p(k) = mode[F(x[k])]}. This is 
used by S-Plus.}

\item{Type 8}{\eqn{p(k) = \frac{k - \frac{1}{3}}{n + \frac{1}{3}}}{p(k) = (k - 1/3) / 
(n + 1/3)}.
    Then \eqn{p(k) \approx \mbox{median}[F(x_{k})]}{p(k) =~ median[F(x[k])]}.
    The resulting quantile estimates are approximately median-unbiased regardless of 
the
    distribution of \code{x}.}

\item{Type 9}{\eqn{p(k) = \frac{k - \frac{3}{8}}{n + \frac{1}{4}}}{p(k) = (k - 3/8) / 
(n + 1/4)}.
    The resulting quantile estimates are approximately unbiased if \code{x} is 
normally distributed.}
}

Hyndman and Fan (1996) recommend type 8 which is the default method for this function.
}
\references{
Hyndman, R. J. and Fan, Y. (1996) Sample quantiles in
statistical packages, \emph{American Statistician}, \bold{50}, 361-365.
}
\seealso{
\code{\link[stats]{ecdf}} for empirical distributions of which quantile is
the \dQuote{inverse}; \code{\link[graphics]{boxplot.stats}} and 
\code{\link[stats]{fivenum}}
for computing \dQuote{versions} of quartiles, etc.
}
\examples{
quantile(x <- rnorm(1001))# Extremes & Quartiles by default

### Compare different methods
p <- c(0.1,0.5,1,2,5,10,50)/100
quantile(x,  p, type=1)
quantile(x,  p, type=2)
quantile(x,  p, type=3)
quantile(x,  p, type=4)
quantile(x,  p, type=5)
quantile(x,  p, type=6)
quantile(x,  p, type=7)
quantile(x,  p, type=8)
quantile(x,  p, type=9)
}
\keyword{univar}
\author{Ivan Frohne and Rob J Hyndman}

______________________________________________
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-devel

[Rd] Suggestion for quantile.default()

Reply via email to