[Note: CC'ing to R-SIG-robust, the "Special Interest Group on using Robust Statistics in R" ]
>>>>> "PP" == Petr PIKAL <[EMAIL PROTECTED]> >>>>> on Tue, 19 Jun 2007 12:55:37 +0200 writes: PP> [EMAIL PROTECTED] napsal dne 19.06.2007 PP> 12:23:58: >> Hi >> >> It often depends on your attitude to limits for outlying >> observations. Boxplot has some identifying routine for >> selecting outlying points. >> >> Any procedure usually requires somebody to choose which >> observation is outlying and why. You can use e.g. all >> values which are beyond some threshold based on sd but >> that holds only if distribution is normal. yes, and that's never true for the "alternative", i.e. for the case where there *are* outliers. >> set.seed(1) >> x<-rnorm(x) PP> Sorry, it shall be PP> x <- rnorm(1000) PP> ul <- mean(x) +3*sd(x) PP> ll <- mean(x) -3*sd(x) PP> beyond <- (x>ul) | ( x <ll) PP> PP> > x[beyond] PP> [1] 3.810277 >> Regards Petr No, really, do NOT do the above! It only works with very few and relatively mild outliers. There are much more robust alternatives. I show them for the simple example x <- c(1:10, 100) 1) As mentioned by Petr, use instead what boxplot() does, just type boxplot.stats and ``see what to do''. This gives Median +/- 1.5 * IQR : i.e., ## Boxplot's default rule str(bp.st <- boxplot.stats(x)) bp.st$stats[ c(1,5) ] ## 1 10 2) Use the recommendations of Hampel (1985) @ARTICLE{HamF85, author = "Hampel, F.", title = "The breakdown points of the mean combined with some rejection rules", journal = "Technometrics", year = 1985, volume = 27, pages = "95--107", } i.e. Median +/- 5 * MAD where MAD = is the *NON*-scaled MAD, ~= mad(*, constant=1) i.e., in R M <- median(x) (FH.interval <- M + c(-5, 5) * mad(x, center=M, const=1)) ## -9 21 3) or something slightly more efficient (under approximate normality of the non-outliers), e.g., based on MASS::rlm() : n <- length(x) s.rm <- summary(robm <- MASS::rlm(x ~ 1)) s.rm (cc <- coef(s.rm)) ## "approximate" robust degrees of freedom; this is a hack ## which could well be correct ## asymptotically {where the weights would be 0/1} : (df.resid <- sum(robm$w) - robm$rank) (Tcrit <- qt(0.995, df = df.resid)) ## Std.error of mean ~= sqrt(1/n Var(X_i)) = 1/sqrt(n) sqrt(Var(X_i)) cc[,1] + c(-1,1) * sqrt(n) * Tcrit * cc[,"Std. Error"] ## -6.391201 18.555177 --- Martin Maechler, ETH Zurich ______________________________________________ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.