Re: [R] removing outlier --> use robust regression !

2015-09-15 Thread Martin Maechler
> Juli  
> on Sat, 12 Sep 2015 02:32:39 -0700 writes:

 > Hi Jim, thank you for your help. :)

 > My point is that there are outliers and I don't really
 > know how to deal with them.

 > I need the data frame for a regression and have often read that
 > only a few outliers can change your results very much. In
 > addition, the regression diagnostics didn't give me the
 > best results.  Yes, and I know it's not the core of
 > statistics to work in a way that gets you the results you would
 > like to have ;).

 > So what is your suggestion?

Use robust regression, e.g.
MASS::rlm()  {part of every R installation},
or a somewhat better and more sophisticated version,
lmrob()  from package 'robustbase' {yes, shameless promotion}.
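
As a minimal sketch of the suggestion above: the data below are made up for
illustration, and only MASS::rlm() and robustbase::lmrob() come from the
message itself. The lmrob() call is guarded because 'robustbase' may not be
installed.

```r
library(MASS)  # provides rlm()

# Made-up toy data: a roughly linear trend with one gross outlier at x = 16
x <- 1:20
y <- c(1, 3, 4, 2, 5, 6, 18, 8, 10, 8, 11, 13, 14, 14, 15, 85, 17, 19, 19, 20)

fit_ls  <- lm(y ~ x)    # ordinary least squares, pulled toward the outlier
fit_rlm <- rlm(y ~ x)   # M-estimation: large residuals are downweighted

print(coef(fit_ls))
print(coef(fit_rlm))
print(round(fit_rlm$w, 2))  # robustness weights; small values mark downweighted points

# Optional: the MM-estimator from 'robustbase', only if that package is available
if (requireNamespace("robustbase", quietly = TRUE)) {
  fit_mm <- robustbase::lmrob(y ~ x)
  print(coef(fit_mm))
}
```

No observation is deleted here; the weights show how strongly each point
contributes to the robust fit.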

Further: 

1) Removing outliers is not at all the best way to deal with such
  problems (intuitively, because it is a *dis*continuous method).
  Rather, they should be downweighted (continuously, as
  happens with the methods used in  rlm() or lmrob(), see above).

2) Removing outliers in a *multivariate* setting, if you want to do
   it in spite of 1), by using univariate treatment {each column
   separately, as you do here} is often completely insufficient.  E.g.
   the bivariate outlier  in
  xy <- cbind(x= c(2,1:9), y=c(8,1:9));  plot(xy)
   cannot be found by looking at 'x' and 'y' separately.
   
3) If, in spite of 1) and 2), you are considering univariate
   treatment, using mean() and sd() for detecting univariate outliers
   was shown to be insufficient more than 50 years ago (*1), and
   if one looks closer into the literature (say "L_1") even
   considerably longer ago.
   Using  median() and mad() instead is one possibility (*2) of
   what you should do. Hampel's rule (*3)
   proposes declaring as outliers the observations outside
   the interval   median(x) +/- 3.5*mad(x)
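
Point 3) can be sketched as follows. The function name hampel_outliers and
the toy vector are assumptions made for illustration; the rule itself
(median +/- 3.5 * mad) is the one quoted above. With only ten observations,
the 3-sd rule cannot flag even a huge value, because the outlier inflates
sd() itself.

```r
# Hypothetical helper: TRUE for values outside median(x) +/- k * mad(x)
hampel_outliers <- function(x, k = 3.5) {
  center <- median(x, na.rm = TRUE)
  spread <- mad(x, na.rm = TRUE)   # mad() rescales by 1.4826 by default
  abs(x - center) > k * spread
}

x <- c(1:9, 1000)   # made-up data with one obvious outlier

flagged_hampel <- which(hampel_outliers(x))
flagged_sd     <- which(abs(x - mean(x)) > 3 * sd(x))

print(flagged_hampel)  # positions flagged by Hampel's rule
print(flagged_sd)      # positions flagged by the mean/sd rule
```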

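
Point 2) can be checked on the 'xy' example given there. This is only a
sketch: the helper hampel_flag is a made-up name, and the classical
Mahalanobis distance is used just to keep the code in base R. Since mean and
covariance are themselves non-robust, a robust covariance estimate (for
instance covMcd() from 'robustbase') would be the more consistent choice.

```r
xy <- cbind(x = c(2, 1:9), y = c(8, 1:9))

# Univariate check, column by column (median/mad rule)
hampel_flag <- function(v, k = 3.5) abs(v - median(v)) > k * mad(v)
univariate_flags <- apply(xy, 2, hampel_flag)
print(any(univariate_flags))   # is anything flagged when columns are checked separately?

# Bivariate check: squared Mahalanobis distance of each row from the centre
d2 <- mahalanobis(xy, center = colMeans(xy), cov = cov(xy))
print(round(d2, 2))
print(which(d2 > qchisq(0.975, df = 2)))   # rows beyond the chi-square cutoff
```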

*1 Tukey, J. W. (1960) A survey of sampling from contaminated distributions. 
   In Contributions to Probability and Statistics, 
   eds I. Olkin, S. Ghurye, W. Hoeffding, W. Madow and H. Mann,
   pp. 448–485. Stanford: Stanford University Press.

*2 Another (less robust, but still infinitely better than mean/sd) approach
   uses  median() and IQR() which is
   basically/approximately what boxplots do to identify outliers.


*3 Frank R. Hampel (1985)
   The Breakdown Points of the Mean Combined With Some Rejection Rules,
   Technometrics, 27:2, 95-107
  [ http://dx.doi.org/10.1080/00401706.1985.10488027 ]

   See also section
   "1.4b. How Well Are Objective and Subjective Methods for
   the Rejection of Outliers Doing in the Context of Robust
   Estimation?",
   page 62 ff. of
   Frank R. Hampel, Elvezio M. Ronchetti, Peter J. Rousseeuw and Werner A. Stahel
   (1986) Robust Statistics: The Approach Based on Influence Functions.
   John Wiley & Sons, Inc.

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] removing outlier

2015-09-13 Thread Bert Gunter
... and this, of course, is a nice example of how statistics
contributes to the "irreproducibility crisis" now roiling Science.

Cheers,
Bert

(Quote from a long ago engineering colleague: "Whenever I see an
outlier, I never know whether to throw it away or patent it.")


Bert Gunter

"Data is not information. Information is not knowledge. And knowledge
is certainly not wisdom."
   -- Clifford Stoll


On Sat, Sep 12, 2015 at 9:52 AM, David Winsemius  wrote:
>
> On Sep 12, 2015, at 2:32 AM, Juli wrote:
>
>> And if I remove the outliers, my problem is that, as you said, they differ
>> in length. I need the data frame for a regression, so can I remove the whole
>> column or is there a call to exclude the data?
>
> Most regression methods have a 'subset' parameter which would allow you to 
> distort the data to your desired specification. But why not think about 
> examining a different statistical model or using robust methods? That way you 
> can keep all your data. (Sounds like you don't really have a lot.)
>
> --
> David.

Re: [R] removing outlier

2015-09-13 Thread David Winsemius


If this mailing list accepted formatted submissions I would have used the
trèsModernSarcastic font for my first sentence. Failing the availability of
that mode of communication I am (top) posting through Nabble (perhaps)  in
"Comic Sans".

On Sat, Sep 12, 2015 at 9:52 AM, David Winsemius dwinsemius@ wrote:
>
> On Sep 12, 2015, at 2:32 AM, Juli wrote:

>> And if I remove the outliers, my problem is that, as you said, they differ
>> in length. I need the data frame for a regression, so can I remove the
>> whole column or is there a call to exclude the data?
>
> *Most regression methods have a 'subset' parameter which would allow you
> to distort the data to your desired specification.*


Bert Gunter-2 wrote
> 
> /... and this, of course, is a nice example of how statistics
> contributes to the "irreproducibility crisis" now roiling Science./

--
View this message in context: 
http://r.789695.n4.nabble.com/removing-outlier-tp4712137p4712208.html
Sent from the R help mailing list archive at Nabble.com.


Re: [R] removing outlier

2015-09-12 Thread David Winsemius

On Sep 12, 2015, at 2:32 AM, Juli wrote:

> Hi Jim, 
> 
> thank you for your help. :)
> 
> My point is that there are outliers and I don't really know how to deal with
> them.
> 
> I need the data frame for a regression and have often read that only a few outliers
> can change your results very much. In addition, the regression diagnostics
> didn't give me the best results.
> Yes, and I know it's not the core of statistics to work in a way that gets you
> the results you would like to have ;).
> 
> So what is your suggestion?
> 
> And if I remove the outliers, my problem is that, as you said, they differ
> in length. I need the data frame for a regression, so can I remove the whole
> column or is there a call to exclude the data?

Most regression methods have a 'subset' parameter which would allow you to 
distort the data to your desired specification. But why not think about 
examining a different statistical model or using robust methods? That way you 
can keep all your data. (Sounds like you don't really have a lot.)
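
For reference, the 'subset' argument mentioned above can be used roughly as
in this sketch. The data frame 'd' and the exclusion rule are made up for
illustration; the point is only that the rows stay in the data frame and are
left out of the fit, so no column changes length.

```r
# Made-up data frame with one extreme response value
d <- data.frame(
  x = 1:20,
  y = c(1, 3, 4, 2, 5, 6, 18, 8, 10, 8, 11, 13, 14, 14, 15, 85, 17, 19, 19, 20)
)

# Logical vector: TRUE for rows to keep (here a median/mad rule on y, as an example)
keep <- abs(d$y - median(d$y)) <= 3.5 * mad(d$y)

fit_all <- lm(y ~ x, data = d)                 # uses every row
fit_sub <- lm(y ~ x, data = d, subset = keep)  # same data frame, flagged rows skipped

print(nobs(fit_all))
print(nobs(fit_sub))
print(coef(fit_sub))
```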

-- 
David.

David Winsemius
Alameda, CA, USA



Re: [R] removing outlier

2015-09-12 Thread Juli
Hi Jim, 

thank you for your help. :)

My point is that there are outliers and I don't really know how to deal with
them.

I need the data frame for a regression and have often read that only a few outliers
can change your results very much. In addition, the regression diagnostics
didn't give me the best results.
Yes, and I know it's not the core of statistics to work in a way that gets you
the results you would like to have ;).

So what is your suggestion?

And if I remove the outliers, my problem is that, as you said, they differ
in length. I need the data frame for a regression, so can I remove the whole
column or is there a call to exclude the data?

JULI



--
View this message in context: 
http://r.789695.n4.nabble.com/removing-outlier-tp4712137p4712170.html
Sent from the R help mailing list archive at Nabble.com.


Re: [R] removing outlier

2015-09-11 Thread Jim Lemon
Hi Juli,
What you can do is to make your outlier remover into a function like this:

remove_outlier_by_sd<-function(x,nsd=3) {
 # keep only the values that lie within nsd standard deviations of the mean
 meanx<-mean(x,na.rm=TRUE)
 sdx<-sd(x,na.rm=TRUE)
 return(x[abs(x-meanx) < nsd*sdx])
}

Then apply the function to your data frame ("table")

newDA<-sapply(DA,remove_outlier_by_sd)

newDA will be a list, as it is likely that its elements will be of
different lengths. You may be told that you really shouldn't remove
outliers and should learn to love them, but I will leave that to others.
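
If the columns have to stay aligned for a later regression, one alternative
(a sketch, not part of the suggestion above) is to flag values column by
column and then drop whole rows. The data frame 'DA' below is made up, and
is_outlier is a hypothetical helper using a median/mad rule; any other
per-column rule could be substituted.

```r
# Made-up stand-in for the poster's data frame
DA <- data.frame(a = c(1:9, 1000), b = c(-500, 2:10), c = 1:10)

# Hypothetical helper: TRUE where a value lies far from the column median
is_outlier <- function(x, k = 3.5) {
  abs(x - median(x, na.rm = TRUE)) > k * mad(x, na.rm = TRUE)
}

flags <- sapply(DA, is_outlier)   # logical matrix, one column per variable

# Keep only the rows in which no column is flagged
DA_clean <- DA[rowSums(flags, na.rm = TRUE) == 0, ]
print(DA_clean)
```

The result is still a rectangular data frame, so it can be passed straight
to lm().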

Jim


On Sat, Sep 12, 2015 at 12:15 AM, Juli  wrote:

> Hey,
>
> I want to remove outliers, so I tried to do this:
>
> # 1 define mean and sd
> sd.AT_ZU_SPAET <- sd(AT_ZU_SPAET)
> mitt.AT_ZU_SPAET <- mean(AT_ZU_SPAET)
> #
> sd.Anzahl_BAF <- sd(Anzahl_BAF)
> mitt.Anzahl_BAF <- mean(Anzahl_BAF)
> #
> sd.Änderungsintervall <- sd(Änderungsintervall)
> mitt.Änderungsintervall <- mean(Änderungsintervall)
> #
> # 2 identify outliers
> DA[ abs(AT_ZU_SPAET - mitt.AT_ZU_SPAET) > ( 3 * sd.AT_ZU_SPAET)  , ]
> DA[ abs(Anzahl_BAF - mitt.Anzahl_BAF) > ( 3 * sd.Anzahl_BAF)  , ]
> DA[ abs(Änderungsintervall - mitt.Änderungsintervall) > ( 3 *
> sd.Änderungsintervall)  , ]
> #
> # 3 remove outliers
> AT_ZU_SPAET.clean <- DA[ (abs(AT_ZU_SPAET - mitt.AT_ZU_SPAET) <
> (3*sd.AT_ZU_SPAET)), ]
> Anzahl_BAF.clean <- DA[ (abs(Anzahl_BAF - mitt.Anzahl_BAF) <
> (3*sd.Anzahl_BAF)), ]
> Änderungsintervall.clean <- DA[ (abs(Änderungsintervall -
> mitt.Änderungsintervall) <
> (3*sd.Änderungsintervall)), ]
>
> My problem is that I am only able to remove the outliers of one column of
> my table, but I want to remove the outliers of every column of the table.
>
> Could anybody help me?
>

[R] removing outlier

2015-09-11 Thread Juli
Hey,

I want to remove outliers, so I tried to do this:

# 1 define mean and sd
sd.AT_ZU_SPAET <- sd(AT_ZU_SPAET)
mitt.AT_ZU_SPAET <- mean(AT_ZU_SPAET)
#
sd.Anzahl_BAF <- sd(Anzahl_BAF)
mitt.Anzahl_BAF <- mean(Anzahl_BAF)
#
sd.Änderungsintervall <- sd(Änderungsintervall)
mitt.Änderungsintervall <- mean(Änderungsintervall)
#
# 2 identify outliers 
DA[ abs(AT_ZU_SPAET - mitt.AT_ZU_SPAET) > ( 3 * sd.AT_ZU_SPAET)  , ]
DA[ abs(Anzahl_BAF - mitt.Anzahl_BAF) > ( 3 * sd.Anzahl_BAF)  , ]
DA[ abs(Änderungsintervall - mitt.Änderungsintervall) > ( 3 *
sd.Änderungsintervall)  , ]
#
# 3 remove outliers
AT_ZU_SPAET.clean <- DA[ (abs(AT_ZU_SPAET - mitt.AT_ZU_SPAET) <
(3*sd.AT_ZU_SPAET)), ]
Anzahl_BAF.clean <- DA[ (abs(Anzahl_BAF - mitt.Anzahl_BAF) <
(3*sd.Anzahl_BAF)), ]
Änderungsintervall.clean <- DA[ (abs(Änderungsintervall -
mitt.Änderungsintervall) <
(3*sd.Änderungsintervall)), ]

My problem is that I am only able to remove the outliers of one column of
my table, but I want to remove the outliers of every column of the table.

Could anybody help me?




--
View this message in context: 
http://r.789695.n4.nabble.com/removing-outlier-tp4712137.html
Sent from the R help mailing list archive at Nabble.com.


[R] removing outlier function / dataset update

2011-01-26 Thread kirtau

Hi,

I have a few lines of code that remove outliers for a regression test
based on the studentized residuals being above 3 or below -3. I have to do
this multiple times and have attempted to create a function to lessen the
amount of copying, pasting and replacing.

I run into trouble with the function and receive the error

Error in `$<-.data.frame`(`*tmp*`, varpredicted, value = c(0.114285714285714,  :
  replacement has 20 rows, data has 19


Any help would be appreciated. The code is listed below.

Thank you for your time!

x = c(1:20)
y = c(1,3,4,2,5,6,18,8,10,8,11,13,14,14,15,85,17,19,19,20)
data1 = data.frame(x,y)

# remove outliers for regression by studentized residuals being greater than 3
data1$predicted = predict(lm(data1$y~data1$x))
data1$stdres = rstudent(lm(data1$y~data1$x));
i=length(which(data1$stdres>3|data1$stdres< -3))
while(i >= 1){
remove<-which(data1$stdres>3|data1$stdres< -3)
print(data1[remove,])
data1 = data1[-remove,]
data1$predicted = predict(lm(data1$y~data1$x))
data1$stdres = rstudent(lm(data1$y~data1$x))
i = with(data1,length(which(stdres>3|stdres< -3)))
 }

# attempt to create a function to perform the same idea as above
rm.outliers = function(dataset,var1, var2) {

  dataset$varpredicted = predict(lm(var1~var2))
  dataset$varstdres = rstudent(lm(var1~var2))
  i = length(which(dataset$varstdres > 3 | dataset$varstdres < -3))
  while(i >= 1){
    removed = which(dataset$varstdres > 3 | dataset$varstdres < -3)
    print(dataset[removed,])
    dataset = dataset[-removed,]
    dataset$varpredicted = predict(lm(var1~var2))
    dataset$varstdres = rstudent(lm(var1~var2))
    i = with(dataset,length(varstdres > 3 | varstdres < -3))
  }
}
-- 
View this message in context: 
http://r.789695.n4.nabble.com/removing-outlier-function-dataset-update-tp3238394p3238394.html
Sent from the R help mailing list archive at Nabble.com.



Re: [R] removing outlier function / dataset update

2011-01-26 Thread Ista Zahn
Hi,
x and y are being picked up from your global environment, not from the
x and y in dataset. Here is a version that seems to work:

rm.outliers = function(dataset, var1, var2) {
  # var1 and var2 are column names given as character strings, e.g. "y" and "x"
  form = as.formula(paste(var1, var2, sep = "~"))
  dataset$varpredicted = predict(lm(form, data = dataset))
  dataset$varstdres = rstudent(lm(form, data = dataset))
  i = length(which(dataset$varstdres > 3 | dataset$varstdres < -3))
  while (i >= 1) {
    removed = which(dataset$varstdres > 3 | dataset$varstdres < -3)
    print(dataset[removed, ])
    dataset = dataset[-removed, ]
    # refit on the reduced data so predictions and residuals match its rows
    dataset$varpredicted = predict(lm(form, data = dataset))
    dataset$varstdres = rstudent(lm(form, data = dataset))
    i = with(dataset, length(which(varstdres > 3 | varstdres < -3)))
  }
  dataset  # return the cleaned data frame
}
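
A variant that avoids building the formula from strings is to pass a formula
object directly. This is only a sketch: the function name
drop_rstudent_outliers and the 'cutoff' argument are made up for
illustration, and the data are the example values from the original post.

```r
# Hypothetical helper: refit and drop rows until no |studentized residual| > cutoff
drop_rstudent_outliers <- function(formula, data, cutoff = 3) {
  repeat {
    fit <- lm(formula, data = data)         # variables are looked up in 'data'
    bad <- which(abs(rstudent(fit)) > cutoff)
    if (length(bad) == 0) break             # nothing left to remove
    print(data[bad, ])                      # show what is being dropped
    data <- data[-bad, , drop = FALSE]
  }
  data
}

data1 <- data.frame(
  x = 1:20,
  y = c(1, 3, 4, 2, 5, 6, 18, 8, 10, 8, 11, 13, 14, 14, 15, 85, 17, 19, 19, 20)
)

cleaned <- drop_rstudent_outliers(y ~ x, data1)
print(nrow(cleaned))
```

Because lm() is always called with data = data, the fit never falls back to
vectors of the same name in the global environment.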


Best,
Ista


-- 
Ista Zahn
Graduate student
University of Rochester
Department of Clinical and Social Psychology
http://yourpsyche.org



Re: [R] removing outlier function / dataset update

2011-01-26 Thread kirtau

First off, thank you for the help with the global environment.

I have, however, attempted to run the code and am now presented with a new
error, which is
Error in formula.default(eval(parse(text = x)[[1L]])) : invalid formula
and I am not sure what to make of it. I have tried a few different workarounds
with no luck.

Any help will continue to be appreciated! 

-
- AK
-- 
View this message in context: 
http://r.789695.n4.nabble.com/removing-outlier-function-dataset-update-tp3238394p3239080.html
Sent from the R help mailing list archive at Nabble.com.
