[Rd] Statistical mode

2011-05-26 Thread Arni Magnusson
One descriptive statistic that is conspicuously missing from core R is the 
statistical mode - the most frequent value in a discrete distribution.


I would like to propose adding the attached 'statmode' (or a similar 
function) to the 'stats' package.


Currently, it can be quite cumbersome to calculate the mode of a 
distribution in R, both for experts and beginners. The lack of a function 
to do this is felt, both when teaching introductory R courses, and when 
using sapply() or the like.


Looking forward to your feedback,

Arni

statmode <- function(x, all=FALSE, ...)
{
  if(is.list(x))
  {
    output <- sapply(x, statmode, all=all, ...)
  }
  else
  {
    freq <- table(x, ...)
    if(all)
      output <- names(freq)[freq==max(freq)]
    else
      output <- names(freq)[which.max(freq)]
    ## Coerce to original data type, using any() to handle mts, xtabs, etc.
    if(any(class(x) %in%
           c("integer","numeric","ts","complex","matrix","table")))
      output <- as(output, storage.mode(x))
  }
  return(output)
}

\name{statmode}
\alias{statmode}
\title{Statistical Mode}
\description{
  Compute the statistical mode, the most frequent value in a discrete
  distribution.
}
\usage{
statmode(x, all = FALSE, \dots)
}
\arguments{
  \item{x}{an \R object, usually vector, matrix, or data frame.}
  \item{all}{whether all statistical modes should be returned.}
  \item{\dots}{further arguments passed to the \code{\link{table}}
    function.}
}
\details{The default is to return only the first statistical mode.}
\value{
  The most frequent value in \code{x}, possibly a vector or list,
  depending on the class of \code{x} and whether \code{all=TRUE}.
}
\seealso{
  \code{\link{mean}}, \code{\link{median}}, \code{\link{table}}.

  \code{\link{density}} can be used to compute the statistical mode of a
  continuous distribution.
}
\examples{
## Different location statistics
fw <- faithful$waiting
hist(fw)
barplot(table(fw))
mean(fw)
median(fw)
statmode(fw)
plot(density(fw))
with(density(fw), x[which.max(y)])

## Different classes
statmode(chickwts$feed)  # factor
statmode(volcano)        # matrix
statmode(discoveries)    # ts
statmode(mtcars)         # data frame

## Multiple modes
table(mtcars$carb)
statmode(mtcars$carb)
statmode(mtcars$carb, TRUE)
statmode(mtcars, TRUE)
}
\keyword{univar}
__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Statistical mode

2011-05-27 Thread Kevin Wright
Arni,

Here are two examples:

R> statmode(iris)
Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species
         "5"          "3"        "1.4"        "0.2"     "setosa"
R> table(iris$Species)

    setosa versicolor  virginica
        50         50         50

R> library(lattice)
R> statmode(barley)
         yield        variety           year           site
        "20.6"     "Svansota"         "1932" "Grand Rapids"

My thoughts:
1. The mode is not so interesting for continuous data.  I would much rather
use something like density().
2. Both the iris and barley data sets are balanced (each factor level
appears equally often), and the current output from the statmode function is
misleading by only showing one level.
3. I think the describe() function in the Hmisc package is much more useful
and informative, even for introductory stat classes.  I always use
describe() after importing data into R.

Kevin



On Thu, May 26, 2011 at 3:26 PM, Arni Magnusson wrote:

> One descriptive statistic that is conspicuously missing from core R is the
> statistical mode - the most frequent value in a discrete distribution.
>
> I would like to propose adding the attached 'statmode' (or a similar
> function) to the 'stats' package.
>
> Currently, it can be quite cumbersome to calculate the mode of a
> distribution in R, both for experts and beginners. The lack of a function to
> do this is felt, both when teaching introductory R courses, and when using
> sapply() or the like.
>
> Looking forward to your feedback,
>
> Arni


Re: [Rd] Statistical mode

2011-05-27 Thread Arni Magnusson

Thank you, Kevin, for the feedback.

> 1. The mode is not so interesting for continuous data. I would much
> rather use something like density().


Absolutely. The help page for statmode() says it is for discrete data, and 
points to density() for continuous data.



> 2. Both the iris and barley data sets are balanced (each factor level
> appears equally often), and the current output from the statmode
> function is misleading by only showing one level.


Try statmode(iris,TRUE). It points out that petal lengths 1.4 and 1.5 are 
equally common in the data. I decided to make all=FALSE the default 
behavior, but I'd be equally happy with all=TRUE as the default.
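
For illustration, the all=TRUE behavior can be approximated in base R as follows. This is only a minimal sketch, not the attached function: the helper name all_modes is hypothetical, and it returns the tied values as character strings without coercing them back to the original type.

```r
## Hypothetical helper (not part of the proposal): return every value
## that shares the highest frequency, as character strings.
all_modes <- function(x) {
  freq <- table(x)
  names(freq)[freq == max(freq)]
}

all_modes(iris$Species)       # all three levels, since each appears 50 times
all_modes(iris$Petal.Length)  # every petal length that shares the top count
```

With balanced factors such as iris$Species, returning all tied levels makes the tie visible instead of silently reporting the first level.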


As for the barley data, statmode(barley,TRUE) is just the honest answer. 
The yield is continuous, so the discrete mode is not of interest, and the 
factor levels are all equally common, as you point out.



> 3. I think the describe() function in the Hmisc package is much more
> useful and informative, even for introductory stat classes.  I always
> use describe() after importing data into R.


The describe() function is a verbose summary, usually of a data frame. The 
statmode() function is the discrete mode, usually of a vector. 
Importantly, describe(faithful$waiting) points out the mean, median and 
range, but not the mode.


---

Allow me to include two more valid comments, from Sarah Goslee and David 
Winsemius, respectively:




> 4. The 'modeest' package does this and more, see for example mfv().


I think core R should come with a basic function to get the mode of a 
discrete vector. One option would be to lift mfv() into the 'stats' 
package, but something like statmode() could also cover factors and 
strings. Might as well provide all=TRUE/FALSE functionality, too, and 
retain integers as integers.
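
One possible way to cover factors and strings while keeping integers as integers is sketched below. It is an assumption for illustration, not the attached statmode(): the name first_mode is hypothetical, and it uses unique(), match() and tabulate() instead of table(), so ties are broken by first appearance in the data and not by sorted order.

```r
## Hypothetical sketch: most frequent value, keeping the type of the input.
first_mode <- function(x) {
  ux <- unique(x)                        # candidate values, original type kept
  ux[which.max(tabulate(match(x, ux)))]  # first value with the highest count
}

first_mode(c(3L, 1L, 3L, 2L))         # integer in, integer out
first_mode(c("b", "a", "b"))          # character vector
first_mode(factor(c("u", "v", "v")))  # factor, levels preserved
```

Because the result is taken by subsetting unique(x), no conversion through character strings is needed.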


It's common to find rudimentary basic functionality in the 'stats' 
package, and dedicated packages for more details; time series models and 
robust statistics come to mind. The 'modeest' package is impressive 
indeed.




> 5. Isn't this just table(Vec)[which.max(table(Vec))]?


Yes it is, only less cumbersome. Much like sd(Vec) is less cumbersome than 
sqrt(var(Vec)). Moreover, I find it confusing to see the count as well,


  table(volcano)[which.max(table(volcano))]
  # 110
  # 177

although this can be debated. Finally, I think the examples

  statmode(mtcars)
  statmode(mtcars, TRUE)

demonstrate practical functionality beyond 
table(Vec)[which.max(table(Vec))].
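
For comparison, a column-wise first mode using only the table() idiom might look like the sketch below. The object name col_modes is hypothetical, and the values come back as character strings, not in the original type of each column.

```r
## Column-wise first mode of a data frame with the table() idiom alone.
col_modes <- sapply(mtcars, function(v) {
  freq <- table(v)
  names(freq)[which.max(freq)]  # drop the count, keep the value only
})

col_modes           # character vector, named by column
col_modes[["cyl"]]  # most frequent number of cylinders, as a string
```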


The mean, median, and mode are often mentioned together as fundamental 
descriptive statistics, and I just find it odd that statmode() is not 
already in core R. Sure, we could get by without the sd() function in core 
R, but why should we?


All the best,

Arni
