Re: [R] randomForest outlier return NA
There's a bug in the code. If you add row names to the X matrix before you call randomForest(), you'd get:

R> summary(outlier(mdl.rf))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
-1.0580 -0.5957      0.  0.6406  1.2650  9.5200

I'll fix this in the next release. Thanks for reporting.

Best,
Andy

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Pau Carrio Gaspar
Sent: Wednesday, July 14, 2010 6:36 AM
To: r-help@r-project.org
Subject: [R] randomForest outlier return NA

Dear R-users,

I have a problem with outlier() in randomForest. After running the following code (which produces a silly data set and builds a model with randomForest):

###
library(randomForest)
set.seed(0)

## build data set
X <- rbind( matrix( runif(n=400,min=-1,max=1), ncol = 10 ) , rep(1,times= 10 ) )
Y <- matrix( nrow = nrow(X), ncol = 1)
for( i in (1:nrow(X))){ Y[i,1] <- sign( sum ( X[i,])) }

## build model
mdl.rf <- randomForest( x = X, y = as.factor(Y) , proximity=TRUE , mtry = 10 , ntree = 500)
summary (outlier(mdl.rf) )
###

I get the following output:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
                                                     41

Can anyone explain why the output of outlier() only returns NAs?

Thanks
Pau

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
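Andy's workaround amounts to one extra line before the model is fitted. The following is a minimal sketch, assuming the randomForest package is installed and that the bug only affects matrices without row names, as described in the reply. The row labels "1" to "41" are an arbitrary choice, `rowSums()` replaces the original loop, and the model call is skipped when the package is not available.

```r
set.seed(0)

## same silly data set as in the original post: 40 random rows plus one row of ones
X <- rbind(matrix(runif(n = 400, min = -1, max = 1), ncol = 10), rep(1, times = 10))
Y <- factor(sign(rowSums(X)))

## the workaround: give the matrix row names before calling randomForest()
rownames(X) <- as.character(seq_len(nrow(X)))

## fit and summarise only if the package is installed (an assumption of this sketch)
if (requireNamespace("randomForest", quietly = TRUE)) {
  mdl.rf <- randomForest::randomForest(x = X, y = Y, proximity = TRUE,
                                       mtry = 10, ntree = 500)
  print(summary(randomForest::outlier(mdl.rf)))
}
```

With row names in place, the summary should show numeric values rather than 41 NAs on the affected release; releases containing the fix should not need the extra line.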
Re: [R] randomForest outlier return NA
Hi Andy,

thanks for your reply and for the upcoming fix. Until the next release is available, I have rewritten my code following your suggestion, in case it helps anyone.

###
library(randomForest)
set.seed(0)

## build data set in a data frame (data frames always carry row names)
X <- rbind( matrix( runif(n=400,min=-1,max=1), ncol = 10 ) , rep(1,times= 10 ) )
Y <- matrix( nrow = nrow(X) , ncol = 1)
for( i in (1:nrow(X))){ Y[i,1] <- sign( sum ( as.numeric(X[i,]))) }
df <- data.frame( X , Y )

## remove the originals so that only df is used
rm(X,Y)

## build model
mdl.rf <- randomForest( formula = as.factor(Y) ~ . , data = df , proximity=TRUE , mtry = 10 , ntree = 500 )
summary (outlier(mdl.rf) )
##

Regards
Pau

2010/7/15 Liaw, Andy <andy_l...@merck.com>

> There's a bug in the code. If you add row names to the X matrix before
> you call randomForest(), you'd get:
>
> R> summary(outlier(mdl.rf))
>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
> -1.0580 -0.5957      0.  0.6406  1.2650  9.5200
>
> I'll fix this in the next release. Thanks for reporting.
>
> Best,
> Andy
[R] randomForest outlier return NA
Dear R-users,

I have a problem with outlier() in randomForest. After running the following code (which produces a silly data set and builds a model with randomForest):

###
library(randomForest)
set.seed(0)

## build data set
X <- rbind( matrix( runif(n=400,min=-1,max=1), ncol = 10 ) , rep(1,times= 10 ) )
Y <- matrix( nrow = nrow(X), ncol = 1)
for( i in (1:nrow(X))){ Y[i,1] <- sign( sum ( X[i,])) }

## build model
mdl.rf <- randomForest( x = X, y = as.factor(Y) , proximity=TRUE , mtry = 10 , ntree = 500)
summary (outlier(mdl.rf) )
###

I get the following output:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
                                                     41

Can anyone explain why the output of outlier() only returns NAs?

Thanks
Pau
Re: [R] randomForest outlier
Perhaps if you follow the posting guide more closely, you might get more (useful) replies, but without looking at your data, I doubt there's much anyone can do for you.

The fact that the range of the outlying measures is -1 to 2 would tell me there are no potential outliers by this measure. Please see the Value section of ?outlier to see how this measure is computed.

Andy

From: Birgitle

> Still the same question:
>
> Birgitle wrote:
> > I am trying to use randomForest to find the variables that are most
> > important for dividing my dataset (continuous and categorical
> > variables) into two given groups.
> >
> > But when I plot the outlier:
> >
> > plot(outlier(rfObject, cls=groupingVariable), type="p", col=c("red","green")[as.numeric(groupingVariable)])
> >
> > it seems to me that all my values appear as outliers.
> > Does anybody have suggestions as to what is going wrong in my analysis?
>
> Additional remark: the scaling of the y-axis is quite small, between -1 and 2.
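As far as the Value section of ?outlier goes, the measure is described as n / sum(squared proximity), normalized within each class by subtracting the median and dividing by the MAD. The base-R sketch below follows that description only; `outlying_measure` and the toy proximity matrix are hypothetical illustrations, not the package's source code, and real random forest proximities would come from a fitted model.

```r
## Hypothetical re-implementation of the description in ?outlier (not the package source).
outlying_measure <- function(prox, cls = rep(1, nrow(prox))) {
  n <- nrow(prox)
  cls <- factor(cls)
  res <- rep(NA_real_, n)
  for (lvl in levels(cls)) {
    idx <- cls == lvl
    ## raw measure: n / sum of squared proximities to cases of the same class
    raw <- n / rowSums(prox[idx, idx, drop = FALSE]^2)
    ## normalize within the class: subtract the median, divide by the MAD
    res[idx] <- (raw - median(raw)) / mad(raw)
  }
  res
}

## Toy proximity matrix (an assumption): five nearby points and one far away at x = 3
x <- c(0, 0.1, 0.2, 0.3, 0.4, 3)
prox <- exp(-abs(outer(x, x, "-")))

m <- outlying_measure(prox)
print(round(m, 2))
```

In this toy example the isolated point receives a far larger value than the other five, whereas a range of roughly -1 to 2 means that no case stands out by this measure.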
Re: [R] randomForest outlier
Thanks anyway for your answer. That was also an option that I had taken into account (no potential outliers), and I will have a look at the Value section of ?outlier.

B.

On 16.07.2008 at 14:11, Liaw, Andy wrote:

> Perhaps if you follow the posting guide more closely, you might get
> more (useful) replies, but without looking at your data, I doubt
> there's much anyone can do for you.
>
> The fact that the range of the outlying measures is -1 to 2 would tell
> me there are no potential outliers by this measure. Please see the
> Value section of ?outlier to see how this measure is computed.
>
> Andy
>
> Notice: This e-mail message, together with any attachments, contains
> information of Merck & Co., Inc. (One Merck Drive, Whitehouse Station,
> New Jersey, USA 08889), and/or its affiliates (which may be known
> outside the United States as Merck Frosst, Merck Sharp & Dohme or MSD
> and in Japan, as Banyu - direct contact information for affiliates is
> available at http://www.merck.com/contact/contacts.html) that may be
> confidential, proprietary copyrighted and/or legally privileged. It is
> intended solely for the use of the individual or entity named on this
> message. If you are not the intended recipient, and have received this
> message in error, please notify us immediately by reply e-mail and
> then delete it from your system.

===
Birgit Lemcke
Institut of Systematic Botany
University of Zurich
Zollikerstrasse 107
CH-8008 Zürich
Switzerland
Ph: +41 (0)44 634 8351
mail: [EMAIL PROTECTED]
===
Re: [R] randomForest outlier
I use a different dissimilarity measure (library(analogue); Gower's index). I just wanted to check whether there are similar values in both tables.

My main aim is to find the best model to explain my predefined groups, using a bunch of different variables (factors, counts, numeric, ordered factors). I am also fiddling around with a logistic regression.

B.

On 16.07.2008 at 14:58, Liaw, Andy wrote:

> Note that I did say "by this measure": what you may want to consider
> as an outlier may not be what this measure picks out. After all, RF
> proximities are a bit unusual as a similarity measure.
>
> -----Original Message-----
> From: Birgit Lemcke [mailto:[EMAIL PROTECTED]]
> Sent: Wednesday, July 16, 2008 8:55 AM
> To: Liaw, Andy
> Cc: R Hilfe
> Subject: Re: [R] randomForest outlier
Re: [R] randomForest outlier
Still the same question:

Birgitle wrote:
> I am trying to use randomForest to find the variables that are most
> important for dividing my dataset (continuous and categorical
> variables) into two given groups.
>
> But when I plot the outlier:
>
> plot(outlier(rfObject, cls=groupingVariable), type="p", col=c("red","green")[as.numeric(groupingVariable)])
>
> it seems to me that all my values appear as outliers.
> Does anybody have suggestions as to what is going wrong in my analysis?

Additional remark: the scaling of the y-axis is quite small, between -1 and 2.

-----
The art of living is more like wrestling than dancing.
(Marcus Aurelius)
--
View this message in context: http://www.nabble.com/randomForest-outlier-tp17979182p18466832.html
Sent from the R help mailing list archive at Nabble.com.
[R] randomForest outlier
I am trying to use randomForest to find the variables that are most important for dividing my dataset (continuous and categorical variables) into two given groups.

But when I plot the outliers:

plot(outlier(FemMalSex_NAavoid88.rf33, cls=FemMalSex_NAavoid88$Sex), type="h", col=c("red","green")[as.numeric(FemMalSex_NAavoid88$Sex)])

it seems to me that all my values appear as outliers.
Does anybody have suggestions as to what is going wrong in my analysis?

-----
The art of living is more like wrestling than dancing.
(Marcus Aurelius)
--
View this message in context: http://www.nabble.com/randomForest-outlier-tp17979182p17979182.html
Sent from the R help mailing list archive at Nabble.com.