Re: [R] R code for to check outliers
Really appreciate the discussion on outliers. I come from an engineering signal processing background, and my thinking has generally been that an outlier is outside a threshold of

 - distance from the mean
 - rarity

that we don't need/want to capture in whatever model we're building.

In my recent work (bioinformatics), I've seen that it's common to Winsorize the data. I am a bit uncomfortable with this, though it seems to be standard practice. Do people have thoughts here?

Cheers,
-Gus

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
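For readers unfamiliar with the term, a minimal base-R sketch of Winsorizing follows. The function name `winsorize` and the 5%/95% limits are assumptions chosen for illustration; they are not taken from the thread or from any particular bioinformatics pipeline.

```r
# Winsorize a numeric vector: values beyond the chosen quantiles are
# pulled in to those quantiles rather than removed.
# `winsorize` and its default `probs` are hypothetical, for illustration only.
winsorize <- function(x, probs = c(0.05, 0.95)) {
  stopifnot(is.numeric(x), length(probs) == 2, probs[1] < probs[2])
  lim <- quantile(x, probs = probs, na.rm = TRUE, names = FALSE)
  pmin(pmax(x, lim[1]), lim[2])
}

x <- c(1:19, 1000)   # made-up data with one wild value
w <- winsorize(x)

range(x)             # original extremes
range(w)             # extremes after clamping
c(mean = mean(x), winsorized_mean = mean(w))
```

Note that the wild value is not dropped but replaced by a less extreme one, so the sample size is unchanged while the tails are altered. That is arguably the source of the discomfort expressed above.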
Re: [R] R code for to check outliers
Hi,

Check this link:
http://stackoverflow.com/questions/1444306/how-to-use-outlier-tests-in-r-code

Hope it is helpful.
A.K.

----- Original Message -----
From: Sajeeka Nanayakkara
To: "r-help@r-project.org"
Sent: Wednesday, July 18, 2012 9:27 AM
Subject: [R] R code for to check outliers

What is the R code to check whether data series have outliers or not?

Thanks,
Sajeeka Nanayakkara
Re: [R] R code for to check outliers
Hello,

Inline.

On 18-07-2012 18:44, Nordlund, Dan (DSHS/RDA) wrote:
> Sajeeka,
>
> You have been given lots of good information and appropriate warnings. Let me add another caveat to think about in the context of outliers/unusual values. A value may only be unusual in a multivariate context. If we have a dataset with human heights in it, a value of 73 inches would not be unusual. If we then learned that this particular individual was female, it would be somewhat unusual but certainly within the realm of possibility.

Uma Thurman!

> If we then learn that the individual is 3 years old, it would be highly unusual. So, you can see why people on the list are somewhat unwilling to say here is THE function "to check whether data series have outliers or not."
>
> Now having said that, can you define what YOU mean by "outlier" and why you are concerned about finding them. Someone may be able to offer advice that will help you achieve your goal.
>
> Dan

Agreeing with what has been said, and not wanting to misdirect anyone: there is a function in the graphics package that reports outliers, boxplot(). They are computed by boxplot.stats(), so see ?boxplot.stats, in particular the parameter coef and its default value. See also the return values of both functions.

Rui Barradas
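A short sketch of the boxplot.stats() behaviour described above. The data vector is made up; `coef`, `$stats` and `$out` are the documented argument and return components, and 1.5 is the documented default for `coef`.

```r
x <- c(2.1, 2.4, 2.5, 2.7, 2.8, 3.0, 3.1, 3.3, 3.4, 4.8, 9.5)  # made-up data

# Default coef = 1.5: points more than 1.5 times the hinge spread
# beyond the hinges are reported in $out.
s <- boxplot.stats(x)
s$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
s$out     # values lying beyond the whiskers

# A larger coef lengthens the whiskers, so fewer points are reported;
# coef = 0 extends the whiskers to the data extremes and reports nothing.
boxplot.stats(x, coef = 3)$out
boxplot.stats(x, coef = 0)$out

# boxplot() returns the same values in its `out` component
# (plot = FALSE suppresses the drawing).
b <- boxplot(x, plot = FALSE)
b$out
```

Which points are reported depends entirely on `coef`, which is a convention rather than a test. This is consistent with the warnings elsewhere in the thread.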
Re: [R] R code for to check outliers
> -----Original Message-----
> From: r-help-boun...@r-project.org [mailto:r-help-bounces@r-project.org] On Behalf Of Sajeeka Nanayakkara
> Sent: Wednesday, July 18, 2012 6:28 AM
> To: r-help@r-project.org
> Subject: [R] R code for to check outliers
>
> What is the R code to check whether data series have outliers or not?
>
> Thanks,
>
> Sajeeka Nanayakkara

Sajeeka,

You have been given lots of good information and appropriate warnings. Let me add another caveat to think about in the context of outliers/unusual values. A value may only be unusual in a multivariate context. If we have a dataset with human heights in it, a value of 73 inches would not be unusual. If we then learned that this particular individual was female, it would be somewhat unusual but certainly within the realm of possibility. If we then learn that the individual is 3 years old, it would be highly unusual. So, you can see why people on the list are somewhat unwilling to say here is THE function "to check whether data series have outliers or not."

Now having said that, can you define what YOU mean by "outlier" and why you are concerned about finding them? Someone may be able to offer advice that will help you achieve your goal.

Dan

Daniel J. Nordlund
Washington State Department of Social and Health Services
Planning, Performance, and Accountability
Research and Data Analysis Division
Olympia, WA 98504-5204
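The point that a value may be unusual only in a multivariate context can be sketched in base R. Everything below is made up for illustration: the linear height/age relationship, the deterministic "noise", and the choice of the classical Mahalanobis distance as the multivariate measure are assumptions, not something proposed in the thread.

```r
# Hypothetical children's data: height (inches) rises roughly linearly with age.
age    <- seq(2, 18, length.out = 200)
height <- 30 + 2.2 * age + 2 * sin(seq_along(age))   # deterministic "noise"

# Append the case from the message: a 3-year-old recorded as 73 inches tall.
d <- data.frame(age = c(age, 3), height = c(height, 73))
i <- nrow(d)

# Univariate view: z-score of that height among all heights
z <- (d$height - mean(d$height)) / sd(d$height)
z[i]

# Bivariate view: squared Mahalanobis distance from the centre of (age, height)
md <- mahalanobis(d, center = colMeans(d), cov = cov(d))
md[i]
which.max(md)   # index of the most extreme (age, height) pair
```

Looked at alone, the height is only moderately far from the mean. Jointly with age it is by far the most extreme pair. The classical mean and covariance used here are themselves sensitive to unusual points, so this is a sketch of the idea rather than a recommended detector.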
Re: [R] R code for to check outliers
> Bert Gunter
>     on Wed, 18 Jul 2012 07:14:31 -0700 writes:

> checkforoutliers <- function(series) NULL
>
> Cheers, Bert
>
> *Explanation: There is no such thing as a statistical outlier -- or, rather, "outlier" is a fraudulent statistical concept, defined arbitrarily and without scientific legitimacy. The typical unstated purpose of such identification is to remove contaminating or irrelevant data, but such a judgment can only be made by a subject matter expert with knowledge of the context and, usually, the specific cause for the unusual data. Do not be misled by the large body of statistical literature on this topic into believing that statistical analysis alone can provide objective criteria to do this. That is a path to scientific purgatory.
>
> For the record:
> 1. I am a statistician
> 2. Lots of highly knowledgeable, smart statisticians will condemn what I have just said as stupid ranting.

I entirely agree with you that outlier-removing procedures are mostly misused, and dangerous because of that misuse {and hence should typically NOT be taught, or not the way I have seen them taught (on occasions, not here at ETH!)...}, and I even more fervently agree with Michael Weylandt's recommendation to use robust statistics rather than outlier detection --- at least in those cases where "robust statistics" is *not* ill-re-defined as {outlier detection}+{classical stats}.

However, I don't think 'outlier' is a fraudulent concept. Rather, I think outliers can be pretty well defined along the line of "outlier WITH RESPECT TO A MODEL" (and 'model' means 'statistical model', i.e., with some randomness built in):

  Outlier wrt model M := an observation which is highly improbable to be observed under model M

(and "highly improbable" of course is somewhat vague, but that's not a problem per se.)

A version of the above is

  Outlier := an observation that has unduly large influence on the estimators/inference performed

where 'estimator / inference' imply a model of course.

So I think outlier is a useful concept for those who think about *models* (rather than just data sets), and I agree that without an implicit or explicit model, "outlier" is not well defined.

> The perils of a mailing list.
> -- Bert

:-)
Martin
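The two model-based definitions above can be illustrated with a simple linear model in base R. The data are made up, and the cut-off of 3 for a standardized residual is an arbitrary assumption ("highly improbable" is, as noted, somewhat vague). rstandard() and cooks.distance() are standard diagnostics in the stats package, used here as one possible reading of the definitions, not as the method proposed in the message.

```r
# Made-up data following a straight line, with one observation altered by hand.
x <- 1:30
y <- 2 + 0.5 * x + 0.3 * cos(x)   # deterministic wiggle standing in for noise
y[30] <- y[30] + 6                # observation 30 no longer fits the line

fit <- lm(y ~ x)                  # the model M

# "Highly improbable under model M": a large standardized residual
r <- rstandard(fit)
which(abs(r) > 3)                 # 3 is an arbitrary threshold

# "Unduly large influence on the estimators": a large Cook's distance
cd <- cooks.distance(fit)
which.max(cd)
```

Both diagnostics are relative to the fitted straight line. Under a different model for the same data, the same observation need not stand out at all.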
Re: [R] R code for to check outliers
On 18/07/2012 10:14 AM, Bert Gunter wrote:
> checkforoutliers <- function(series) NULL
>
> Cheers,
> Bert
>
> *Explanation: There is no such thing as a statistical outlier -- or, rather, "outlier" is a fraudulent statistical concept, defined arbitrarily and without scientific legitimacy. The typical unstated purpose of such identification is to remove contaminating or irrelevant data, but such a judgment can only be made by a subject matter expert with knowledge of the context and, usually, the specific cause for the unusual data. Do not be misled by the large body of statistical literature on this topic into believing that statistical analysis alone can provide objective criteria to do this. That is a path to scientific purgatory.
>
> For the record:
> 1. I am a statistician
> 2. Lots of highly knowledgeable, smart statisticians will condemn what I have just said as stupid ranting.
>
> The perils of a mailing list.

I think you are assuming that Sajeeka will handle the outliers incorrectly. It happens often enough, but I don't think it's polite to make that assumption.

My answer to the question would have been to ask the question, "how do you define outliers?" Certainly it's possible to define outliers in the context of a model, and their presence is an indication of problems with the model. The correct response might be to weaken the assumptions of your model and use a robust procedure as Michael suggested (which might mean throwing away the outliers), or it might be to change the model in some other way.

Your advice to consult a subject matter expert is good, but in my experience, they often put more faith in their models than they should, so as a statistician, I think you should point out discrepancies like outliers. Which means it's good to have a function to detect them.

Duncan Murdoch
Re: [R] R code for to check outliers
>>> What is the R code to check whether data series have
>>> outliers or not?

In case no one else has pointed you there, you could try the 'outliers' package. That contains some of the 'standard' methods of outlier testing for univariate data.

What you do with them when you find them is a rather more complicated and, as you have already seen, controversial question.

S Ellison
Re: [R] R code for to check outliers
To further what Bert says:

You would almost certainly prefer to use robust statistics rather than "outlier detection". I believe Greg Snow's TeachingDemos package has a data set "outliers" suggesting some of the perils of doing things the outlier-removal way.

Best,
Michael
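A small base-R illustration of the robust-statistics suggestion. The data are made up, and the median, MAD and trimmed mean are only the simplest robust estimators, chosen here as an assumption about what "robust statistics" might look like in the univariate case.

```r
x <- c(9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1, 100)  # one gross error

# Classical estimates are dragged towards the wild value ...
c(mean = mean(x), sd = sd(x))

# ... while robust estimates barely move, and no point had to be
# identified or deleted by hand.
c(median = median(x), mad = mad(x))

# A 20% trimmed mean is another simple robust estimate of location.
mean(x, trim = 0.2)
```

The robust estimates stay close to 10 without any decision about which observation to discard, which is the practical advantage over detect-then-delete.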
Re: [R] R code for to check outliers
checkforoutliers <- function(series) NULL

Cheers,
Bert

*Explanation: There is no such thing as a statistical outlier -- or, rather, "outlier" is a fraudulent statistical concept, defined arbitrarily and without scientific legitimacy. The typical unstated purpose of such identification is to remove contaminating or irrelevant data, but such a judgment can only be made by a subject matter expert with knowledge of the context and, usually, the specific cause for the unusual data. Do not be misled by the large body of statistical literature on this topic into believing that statistical analysis alone can provide objective criteria to do this. That is a path to scientific purgatory.

For the record:
1. I am a statistician
2. Lots of highly knowledgeable, smart statisticians will condemn what I have just said as stupid ranting.

The perils of a mailing list.

-- Bert

On Wed, Jul 18, 2012 at 6:27 AM, Sajeeka Nanayakkara wrote:
> What is the R code to check whether data series have outliers or not?
>
> Thanks,
>
> Sajeeka Nanayakkara

--
Bert Gunter
Genentech Nonclinical Biostatistics