Re: [R] R code for to check outliers

2012-07-20 Thread Angus Wallace
Really appreciate the discussion on outliers.

I come from an engineering signal processing background, and my thinking
has generally been that an outlier is outside a threshold of

   - distance from the mean
   - rarity

that we don't need/want to capture in whatever model we're building.

In my recent work (bioinformatics), I've seen that it's common to Winsorize
the data. I am a bit uncomfortable with this, though it seems to be
standard practice. Do people have thoughts here?
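
For concreteness, a minimal sketch of what Winsorizing does. The helper name winsorize, the 5%/95% cutoffs and the data are assumptions for illustration, not from any particular package: values beyond the chosen quantiles are clamped to those quantiles rather than dropped.

```r
# Hypothetical helper: clamp values outside the given quantiles.
winsorize <- function(x, probs = c(0.05, 0.95)) {
  q <- quantile(x, probs = probs, na.rm = TRUE, names = FALSE)
  pmin(pmax(x, q[1]), q[2])
}

# Made-up data: 19 ordinary values and one extreme value.
x <- c(1:19, 1000)
w <- winsorize(x)

range(x)   # original range, dominated by the extreme value
range(w)   # range after clamping; the sample size is unchanged
```

Note that the clamped values are still treated as real observations afterwards, which is part of why the practice can feel uncomfortable.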

Cheers,
-Gus

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] R code for to check outliers

2012-07-18 Thread arun
Hi,

Check this link:
http://stackoverflow.com/questions/1444306/how-to-use-outlier-tests-in-r-code

Hope this is helpful.
A.K.



- Original Message -
From: Sajeeka Nanayakkara 
To: "r-help@r-project.org" 
Cc: 
Sent: Wednesday, July 18, 2012 9:27 AM
Subject: [R] R code for to check outliers





 What is the R code to check whether data series have outliers or not?

Thanks,

Sajeeka Nanayakkara


Re: [R] R code for to check outliers

2012-07-18 Thread Rui Barradas

Hello,

Inline

On 18-07-2012 18:44, Nordlund, Dan (DSHS/RDA) wrote:

>> -Original Message-
>> From: r-help-boun...@r-project.org [mailto:r-help-bounces@r-project.org]
>> On Behalf Of Sajeeka Nanayakkara
>> Sent: Wednesday, July 18, 2012 6:28 AM
>> To: r-help@r-project.org
>> Subject: [R] R code for to check outliers
>>
>> What is the R code to check whether data series have outliers or not?
>>
>> Thanks,
>>
>> Sajeeka Nanayakkara
>
> Sajeeka,
>
> You have been given lots of good information and appropriate warnings.  Let me
> add another caveat to think about in the context of outliers/unusual values.  A
> value may only be unusual in a multivariate context.  If we have a dataset with
> human heights in it, a value of 73 inches would not be unusual.  If we then
> learned that this particular individual was female, it would be somewhat
> unusual but certainly within the realm of possibility.

Uma Thurman!

> If we then learn that the individual is 3 years old, it would be highly
> unusual.
>
> So, you can see why people on the list are somewhat unwilling to say here is
> THE function "to check whether data series have outliers or not."
>
> Now having said that, can you define what YOU mean by "outlier" and why you are
> concerned about finding them.  Someone may be able to offer advice that will
> help you achieve your goal.


Agreeing with what has been said, and not wanting to misdirect anyone,
there's a function in the graphics package that reports outliers: boxplot.
They are computed by boxplot.stats, so see


?boxplot.stats

in particular the parameter coef and its default value.
See also the return values of both functions.
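
A minimal sketch of what that looks like, assuming a made-up numeric vector x (the data are not from this thread). It shows the `out` component returned by both functions and how `coef` changes what gets flagged.

```r
# Hypothetical data: eleven unremarkable values plus one extreme value.
x <- c(2.1, 2.4, 2.2, 2.8, 2.5, 2.3, 2.6, 2.7, 2.4, 2.5, 2.2, 9.9)

# boxplot.stats() returns $stats (the five numbers drawn in the box),
# $n, $conf, and $out (the points lying beyond the whiskers).
bs <- boxplot.stats(x)            # coef = 1.5 is the default
bs$out

# coef scales the whisker length in units of the box length,
# so a larger coef flags fewer points.
wide <- boxplot.stats(x, coef = 20)
wide$out

# boxplot() returns the same information; plot = FALSE skips the drawing.
bp <- boxplot(x, plot = FALSE)
bp$out
```

The rule is purely descriptive: a point beyond the whiskers is only "far from the box", which is not the same thing as being wrong.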

Rui Barradas



> Dan
>
> Daniel J. Nordlund
> Washington State Department of Social and Health Services
> Planning, Performance, and Accountability
> Research and Data Analysis Division
> Olympia, WA 98504-5204



Re: [R] R code for to check outliers

2012-07-18 Thread Nordlund, Dan (DSHS/RDA)
> -Original Message-
> From: r-help-boun...@r-project.org [mailto:r-help-bounces@r-
> project.org] On Behalf Of Sajeeka Nanayakkara
> Sent: Wednesday, July 18, 2012 6:28 AM
> To: r-help@r-project.org
> Subject: [R] R code for to check outliers
> 
> 
> 
> 
> 
>  What is the R code to check whether data series have outliers or not?
> 
> Thanks,
> 
> Sajeeka Nanayakkara

Sajeeka,

You have been given lots of good information and appropriate warnings.  Let me 
add another caveat to think about in the context of outliers/unusual values.  A 
value may only be unusual in a multivariate context.  If we have a dataset with 
human heights in it, a value of 73 inches would not be unusual.  If we then 
learned that this particular individual was female, it would be somewhat 
unusual but certainly within the realm of possibility.  If we then learn that 
the individual is 3 years old, it would be highly unusual.
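
The height example can be sketched in code. Everything below is an assumption made for illustration: the simulated data, the linear growth curve, and the use of the squared Mahalanobis distance with a chi-squared cutoff as a rough yardstick. The point is only that the same individual looks ordinary on each variable separately and very unusual jointly.

```r
set.seed(1)
# Hypothetical data: 200 simulated children aged 2 to 18, height in inches.
age    <- runif(200, min = 2, max = 18)
height <- 32 + 2.2 * age + rnorm(200, sd = 2.5)

# Append the individual from the example: 3 years old, 73 inches tall.
d <- data.frame(age = c(age, 3), height = c(height, 73))
i <- nrow(d)

# Univariate view: z-scores of that individual's age and height.
z <- scale(d)[i, ]
z

# Multivariate view: squared Mahalanobis distance from the centroid,
# compared with a chi-squared quantile (a rough guide, since the
# simulated data are not exactly bivariate normal).
d2     <- mahalanobis(d, center = colMeans(d), cov = cov(d))
cutoff <- qchisq(0.999, df = ncol(d))
d2[i] > cutoff
```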

So, you can see why people on the list are somewhat unwilling to say here is 
THE function "to check whether data series have outliers or not."

Now having said that, can you define what YOU mean by "outlier" and why you are 
concerned about finding them.  Someone may be able to offer advice that will 
help you achieve your goal.

Dan

Daniel J. Nordlund
Washington State Department of Social and Health Services
Planning, Performance, and Accountability
Research and Data Analysis Division
Olympia, WA 98504-5204



Re: [R] R code for to check outliers

2012-07-18 Thread Martin Maechler
> Bert Gunter 
> on Wed, 18 Jul 2012 07:14:31 -0700 writes:

> checkforoutliers <- function(series) NULL 

  > Cheers, Bert

> *Explanation: There is no such thing as a statistical
> outlier -- or, rather, "outlier" is a fraudulent
> statistical concept, defined arbitrarily and without
> scientific legitimacy. The typical unstated purpose of
> such identification is to remove contaminating or
> irrelevant data, but such a judgment can only be made by a
> subject matter expert with knowledge of the context and,
> usually, the specific cause for the unusual data. Do not
> be misled by the large body of statistical literature on
> this topic into believing that statistical analysis alone
> can provide objective criteria to do this. That is a path
> to scientific purgatory.

> For the record: 1. I am a statistician 
> 2. Lots of highly knowledgeable, smart statisticians will condemn what I
> have just said as stupid ranting.

I entirely agree with you that  outlier-removing
procedures are mostly misused, and dangerous because of that
misuse {and hence should typically NOT be taught, or not the way
I have seen them taught (on occasions, not here at ETH!)...}

and I even more fervently agree with Michael Weylandt's 
recommendation to use robust statistics rather than outlier
detection --- at least in those cases where "robust statistics"
is *not* ill-re-defined  as  {outlier detection}+{classical stats}.

However, I don't think 'outlier' to be a fraudulent concept.
Rather I think outliers can be pretty well defined along the
line of "outlier WITH RESPECT TO A MODEL" 
 (and 'model' means 'statistical model', i.e., with some
 randomness built in) :

Outlier wrt model M := 
  an observation which is highly
  improbable to be observed under model M

(and "highly improbable" of course is somewhat vague, but that's
 not a problem per se.)
A version of the above is 

 Outlier := an observation that has unduly large influence on
 the estimators/inference performed

where 'estimator / inference'  imply a model of course.

So I think outlier is a useful concept for those who think about
*models* (rather than just data sets), and I agree that without
an implicit or explicit model, "outlier" is not well defined.
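
A minimal sketch of both definitions, assuming a simple linear regression as model M and simulated data with one deliberately shifted observation (all of this is illustrative, not from the thread). Studentized residuals address "improbable under M"; Cook's distance addresses "unduly large influence".

```r
set.seed(42)
# Hypothetical data following a straight line, then one point is shifted.
x <- 1:30
y <- 1 + 0.5 * x + rnorm(30, sd = 1)
y[30] <- y[30] + 10

fit <- lm(y ~ x)

# Improbable under the model: externally studentized residuals follow a
# t distribution with (residual df - 1) degrees of freedom if M holds.
r <- rstudent(fit)
p <- 2 * pt(-abs(r), df = df.residual(fit) - 1)
which.min(p)

# Undue influence: Cook's distance measures how much the fitted values
# change when each observation is left out.
cd <- cooks.distance(fit)
which.max(cd)
```

Both diagnostics are statements relative to the fitted model; change the model and the same observation may stop being remarkable.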

> The perils of a mailing list.
> -- Bert

:-)

Martin



> On Wed, Jul 18, 2012 at 6:27 AM, Sajeeka Nanayakkara .. wrote:

>> 
>> What is the R code to check whether data series have
>> outliers or not?
>> 
>> Thanks,
>> 
>> Sajeeka Nanayakkara


> -- 
> Bert Gunter Genentech Nonclinical Biostatistics


Re: [R] R code for to check outliers

2012-07-18 Thread Duncan Murdoch

On 18/07/2012 10:14 AM, Bert Gunter wrote:

> checkforoutliers <- function(series)NULL
>
> Cheers,
> Bert
>
> *Explanation: There is no such thing as a statistical outlier -- or,
> rather, "outlier" is a fraudulent statistical concept, defined arbitrarily
> and without scientific legitimacy. The typical unstated purpose of such
> identification is to remove contaminating or irrelevant data, but such a
> judgment can only be made by a subject matter expert with knowledge of the
> context and, usually, the specific cause for the unusual data. Do not be
> misled by the large body of statistical literature on this topic into
> believing that statistical analysis alone can provide objective criteria to
> do this. That is a path to scientific purgatory.
>
> For the record:
> 1. I am a statistician
> 2. Lots of highly knowledgeable, smart statisticians will condemn what I
> have just said as stupid ranting.
>
> The perils of a mailing list.


I think you are assuming that Sajeeka will handle the outliers 
incorrectly.   It happens often enough, but I don't think it's polite to 
make that assumption.


My answer to the question would have been to ask the question, "how do 
you define outliers?"  Certainly it's possible to define outliers in the 
context of a model, and their presence is an indication of problems with 
the model.  The correct response might be to weaken the assumptions of 
your model and use a robust procedure as Michael suggested (which might 
mean throwing away the outliers), or it might be to change the model in 
some other way.  Your advice to consult a subject matter expert is good, 
but in my experience, they often put more faith in their models than 
they should, so as a statistician, I think you should point out 
discrepancies like outliers.  Which means it's good to have a function 
to detect them.


Duncan Murdoch



> -- Bert
>
> On Wed, Jul 18, 2012 at 6:27 AM, Sajeeka Nanayakkara wrote:
>
>> What is the R code to check whether data series have outliers or not?
>>
>> Thanks,
>>
>> Sajeeka Nanayakkara


Re: [R] R code for to check outliers

2012-07-18 Thread S Ellison
 

> What is the R code to check whether data series have
> outliers or not?

In case no one else has pointed you there, you could try the 'outliers' package.
That contains some of the 'standard' methods of outlier testing for univariate 
data.
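
To give a flavour of what such a univariate test computes, here is a base-R sketch of the classical Grubbs statistic and its two-sided critical value. It is written by hand and is not taken from the package; the data and the 5% level are assumptions, and the test presumes the bulk of the data is normally distributed.

```r
# Hypothetical data: eight plausible readings and one gross error.
x <- c(9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 100)
n <- length(x)

# Grubbs statistic: largest absolute deviation from the mean, in SD units.
G <- max(abs(x - mean(x))) / sd(x)

# Two-sided critical value at level alpha, derived from the t distribution.
alpha  <- 0.05
t_crit <- qt(alpha / (2 * n), df = n - 2)
G_crit <- (n - 1) / sqrt(n) * sqrt(t_crit^2 / (n - 2 + t_crit^2))

# The value the test is pointing at, and whether it exceeds the cutoff.
suspect <- x[which.max(abs(x - mean(x)))]
G > G_crit
```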

What you do with them when you find them is a rather more complicated and, as 
you have already seen, controversial question.

S Ellison


Re: [R] R code for to check outliers

2012-07-18 Thread R. Michael Weylandt
To further what Bert says:

You would almost certainly prefer to use robust statistics than
"outlier detection".

I believe Greg Snow's TeachingDemos package has a data set "outliers"
suggesting some of the perils of doing things the outlier-removal way.
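
A small base-R sketch of the contrast, using made-up data (not the TeachingDemos data set): one wild value moves the mean and standard deviation a long way, while the median, MAD and a trimmed mean barely notice it, and nothing has to be declared an outlier or removed.

```r
# Hypothetical data: eight plausible readings and one gross error.
x <- c(9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 100)

# Classical summaries are dragged towards the extreme value.
classical <- c(location = mean(x), scale = sd(x))

# Robust summaries: median and median absolute deviation.
robust <- c(location = median(x), scale = mad(x))

# A trimmed mean drops a fixed fraction from each end before averaging.
trimmed <- mean(x, trim = 0.2)

classical
robust
trimmed
```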

Best,
Michael

On Wed, Jul 18, 2012 at 9:14 AM, Bert Gunter  wrote:
> checkforoutliers <- function(series)NULL
>
> Cheers,
> Bert
>
> *Explanation: There is no such thing as a statistical outlier -- or,
> rather, "outlier" is a fraudulent statistical concept, defined arbitrarily
> and without scientific legitimacy. The typical unstated purpose of such
> identification is to remove contaminating or irrelevant data, but such a
> judgment can only be made by a subject matter expert with knowledge of the
> context and, usually, the specific cause for the unusual data. Do not be
> misled by the large body of statistical literature on this topic into
> believing that statistical analysis alone can provide objective criteria to
> do this. That is a path to scientific purgatory.
>
> For the record:
> 1. I am a statistician
> 2. Lots of highly knowledgeable, smart statisticians will condemn what I
> have just said as stupid ranting.
>
> The perils of a mailing list.
>
> -- Bert
>
> On Wed, Jul 18, 2012 at 6:27 AM, Sajeeka Nanayakkara 
> wrote:
>
>>
>>
>>
>>
>>  What is the R code to check whether data series have outliers or not?
>>
>> Thanks,
>>
>> Sajeeka Nanayakkara
>
>
> --
>
> Bert Gunter
> Genentech Nonclinical Biostatistics
>
> Internal Contact Info:
> Phone: 467-7374
> Website:
> http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm
>

Re: [R] R code for to check outliers

2012-07-18 Thread Bert Gunter
checkforoutliers <- function(series)NULL

Cheers,
Bert

*Explanation: There is no such thing as a statistical outlier -- or,
rather, "outlier" is a fraudulent statistical concept, defined arbitrarily
and without scientific legitimacy. The typical unstated purpose of such
identification is to remove contaminating or irrelevant data, but such a
judgment can only be made by a subject matter expert with knowledge of the
context and, usually, the specific cause for the unusual data. Do not be
misled by the large body of statistical literature on this topic into
believing that statistical analysis alone can provide objective criteria to
do this. That is a path to scientific purgatory.

For the record:
1. I am a statistician
2. Lots of highly knowledgeable, smart statisticians will condemn what I
have just said as stupid ranting.

The perils of a mailing list.

-- Bert

On Wed, Jul 18, 2012 at 6:27 AM, Sajeeka Nanayakkara wrote:

>
>
>
>
>  What is the R code to check whether data series have outliers or not?
>
> Thanks,
>
> Sajeeka Nanayakkara


-- 

Bert Gunter
Genentech Nonclinical Biostatistics

Internal Contact Info:
Phone: 467-7374
Website:
http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm
