Re: [R] logistic regression tree
On 08/22/2010 01:51 PM, Kay Cichini wrote:
> achim, thank you for the very kind offer!! sorry, i'm not around vienna in
> the near future, otherwise i'd be glad to take you up on your invitation.

Not that it's any of my business, but I don't think you need to go THAT far to visit Achim these days...

-pd

--
Peter Dalgaard
Center for Statistics, Copenhagen Business School
Phone: (+45)38153501
Email: pd@cbs.dk  Priv: pda...@gmail.com

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: [R] logistic regression tree
dear all,

thank you everyone for the thorough answers and the helpful references!

achim, thank you for the very kind offer!! sorry, i'm not around vienna in the near future, otherwise i'd be glad to take you up on your invitation.

yours,
kay

-
Kay Cichini
Postgraduate student
Institute of Botany
Univ. of Innsbruck
Re: [R] logistic regression tree
On Fri, 20 Aug 2010, Kay Cichini wrote:
> hello,
>
> my data-collection is not yet finished, but i have already started
> investigating possible analysis methods.
>
> below i give a very close simulation of my future data-set, however there
> might be more nominal explanatory variables - there will be no continuous
> ones at all (maybe some ordered nominal..).
>
> i tried several packages today, but the one i fancied most was ctree of the
> party package.
> i can't see why the given no. of datapoints (n=100) might pose a problem
> here - but please teach me better, as i might be naive..

See http://biostat.mc.vanderbilt.edu/wiki/Main/ComplexDataJournalClub#Sebastiani_et_al_Nature_Genetics

The recursive partitioning simulation there will give you an idea - you can modify the R code to simulate a situation more like yours. When you simulate the true patterns and see how far the tree is from discovering them, you'll be surprised.

Frank

> i'd be very glad about comments on the use of ctree on such a dataset and
> if i overlook possible pitfalls
>
> thank you all,
> kay
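A minimal sketch of the kind of check Frank describes: simulate the known pattern many times and count how often a tree picks up each factor. This is not the code from the linked page. It assumes rpart (a recommended package normally shipped with R) as a stand-in for party::ctree, and the helpers simulate_stands() and split_vars() and the choice of 200 simulations are hypothetical.

```r
# Sketch only: rpart stands in for party::ctree; helper names are made up.
library(rpart)

simulate_stands <- function() {
  dat <- expand.grid(fac1 = c("I", "II"),
                     fac2 = LETTERS[1:5],
                     fac3 = letters[1:10])
  # true pattern from the original post: presence with prob. 0.75 only
  # where fac1 == "I" and fac3 is a, b or c; prob. 0 elsewhere
  p <- ifelse(dat$fac1 == "I" & dat$fac3 %in% c("a", "b", "c"), 0.75, 0)
  dat$Y <- factor(rbinom(nrow(dat), 1, p), levels = c(0, 1))
  dat
}

# names of the variables a fitted rpart tree actually splits on
split_vars <- function(fit) {
  setdiff(unique(as.character(fit$frame$var)), "<leaf>")
}

set.seed(1)
n_sim <- 200
used <- replicate(n_sim, {
  fit <- rpart(Y ~ fac1 + fac2 + fac3, data = simulate_stands(),
               method = "class")
  c("fac1", "fac2", "fac3") %in% split_vars(fit)
})
rownames(used) <- c("fac1", "fac2", "fac3")

# proportion of simulated data sets in which each factor appears in the tree
recovery <- rowMeans(used)
print(recovery)
```

In this setup fac1 and fac3 carry the signal and fac2 is pure noise, so the proportions for fac1 and fac3 show how reliably the true structure is found and the proportion for fac2 shows how often a spurious split appears.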
Re: [R] logistic regression tree
On Fri, 2010-08-20 at 14:46 -0700, Kay Cichini wrote:
> hello,
>
> my data-collection is not yet finished, but i have already started
> investigating possible analysis methods.
>
> below i give a very close simulation of my future data-set, however there
> might be more nominal explanatory variables - there will be no continuous
> ones at all (maybe some ordered nominal..).
>
> i tried several packages today, but the one i fancied most was ctree of the
> party package.
> i can't see why the given no. of datapoints (n=100) might pose a problem
> here - but please teach me better, as i might be naive..

I'm no expert, but single trees are unstable predictors; change your data slightly and you might get a totally different model/tree. I hope that worries you?

Frank's comment was that, depending upon the signal-to-noise ratio in your sample of data, you might need a very large data set indeed, much larger than your 100 data points/samples, to have any confidence in the single fitted tree.

For this reason, ensemble or committee methods have been developed that combine the predictions from many trees fitted to perturbed versions of the training data. Such methods include boosting and random forests.

We are venturing into territory not suited to the email list format: statistical consultancy. As Achim is local to you and has kindly offered to meet you, I would strongly suggest you take up his offer.

In the meantime, here are a couple of references to look at if you aren't familiar with these statistical machine learning techniques.

Cutler et al. (2007) Random forests for classification in ecology. Ecology 88(11), 2783-2792.

Elith, J., Leathwick, J.R., and Hastie, T. (2008) A working guide to boosted regression trees. Journal of Animal Ecology 77, 802-813.

Also, don't dismiss the logistic regression model. Modern techniques like the lasso and elastic net are available for GLMs such as this and include model selection as part of their fitting. These are underused by ecologists (IMHO), who seem to like (abuse?) the information-theoretic approaches and step-wise selection procedures... (apologies to ecologists here [I am one too] for generalising!) See:

Dahlgren, J.P. (2010) Alternative regression methods are not considered in Murtaugh (2009) or by ecologists in general. Ecology Letters 13(5), E7-E9.

HTH

G

--
Dr. Gavin Simpson             [t] +44 (0)20 7679 0522
ECRC, UCL Geography,          [f] +44 (0)20 7679 0565
Pearson Building,             [e] gavin.simpsonATNOSPAMucl.ac.uk
Gower Street, London          [w] http://www.ucl.ac.uk/~ucfagls/
UK. WC1E 6BT.                 [w] http://www.freshwaters.org.uk
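The instability Gavin mentions can be probed with a small bootstrap experiment: refit a tree on resampled versions of one data set and see whether the first split changes. The following is only a sketch under stated assumptions. It uses rpart (a recommended package normally shipped with R) rather than party::ctree, reuses the simulated design from the original post, and the helper root_split() and the choice of 200 resamples are hypothetical.

```r
# Sketch only: rpart stands in for party::ctree; root_split() is made up.
library(rpart)

set.seed(42)
dat <- expand.grid(fac1 = c("I", "II"),
                   fac2 = LETTERS[1:5],
                   fac3 = letters[1:10])
p <- ifelse(dat$fac1 == "I" & dat$fac3 %in% c("a", "b", "c"), 0.75, 0)
dat$Y <- factor(rbinom(nrow(dat), 1, p), levels = c(0, 1))

# variable used for the first (root) split, or NA if the tree has no split
root_split <- function(d) {
  fit <- rpart(Y ~ fac1 + fac2 + fac3, data = d, method = "class")
  v <- as.character(fit$frame$var[1])
  if (v == "<leaf>") NA_character_ else v
}

# refit on bootstrap resamples (rows drawn with replacement)
n_boot <- 200
roots <- replicate(n_boot,
                   root_split(dat[sample(nrow(dat), replace = TRUE), ]))

# how often each variable (or no split at all) ends up at the root
root_counts <- table(roots, useNA = "ifany")
print(root_counts)
```

If the counts are spread over several variables, the single tree fitted to the original data should not be read as the one true structure; ensemble methods average over exactly this kind of variation.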
Re: [R] logistic regression tree
hello,

my data-collection is not yet finished, but i have already started investigating possible analysis methods.

below i give a very close simulation of my future data-set, however there might be more nominal explanatory variables - there will be no continuous ones at all (maybe some ordered nominal..).

i tried several packages today, but the one i fancied most was ctree of the party package.
i can't see why the given no. of datapoints (n=100) might pose a problem here - but please teach me better, as i might be naive..

i'd be very glad about comments on the use of ctree on such a dataset and if i overlook possible pitfalls

thank you all,
kay

##
# an example with 3 nominal explanatory variables:
# Y is presence of a certain invasive plant species
# introduced effect for fac1 and fac3, fac2 without effect.
# presence with prob. 0.75 in factor combination fac1=I (say fac1 is geogr. region) and
# fac3 = a|b|c (say all richer substrates).
# presence is not influenced by fac2, which might be, e.g., vegetation type.
##
library(party)

# full factorial design: 2 x 5 x 10 = 100 rows
dat <- expand.grid(fac1 = c("I", "II"),
                   fac2 = LETTERS[1:5],
                   fac3 = letters[1:10])
dat <- dat[order(dat$fac1, dat$fac2, dat$fac3), ]
print(dat)

# presence probability per row: 0.75 where fac1 is I and fac3 is a, b or c; 0 otherwise
p <- ifelse(dat$fac1 == "I" & dat$fac3 %in% c("a", "b", "c"), 0.75, 0)

# draw all responses in one vectorised call instead of looping over rows
dat$Y <- factor(rbinom(nrow(dat), 1, p))

tr <- ctree(Y ~ fac1 + fac2 + fac3, data = dat)
plot(tr)
##

-
Kay Cichini
Postgraduate student
Institute of Botany
Univ. of Innsbruck
Re: [R] logistic regression tree
It would be good to tell us the frequency of observations in each category of Y, and the number of continuous X's. Recursive partitioning will require perhaps 50,000 observations in the less frequent Y category for its structure and predicted values to validate, depending on X and the signal:noise ratio. Hence the use of combinations of trees nowadays as opposed to single trees. Or logistic regression.

Frank

Frank E Harrell Jr
Professor and Chairman
School of Medicine
Department of Biostatistics
Vanderbilt University
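The two things Frank asks about, the frequency of each Y category and the logistic regression alternative, can be sketched in base R as follows. This is an illustration only, not an analysis of the real data. It assumes the simulated design from the original post, a presence probability of 0.10 outside the favourable cells (the original used 0, which tends to cause separation problems in a GLM), and a hypothetical pre-specified indicator `rich` for the richer substrates a, b and c.

```r
# Sketch only: probabilities and the `rich` grouping are assumptions.
set.seed(7)
dat <- expand.grid(fac1 = c("I", "II"),
                   fac2 = LETTERS[1:5],
                   fac3 = letters[1:10])

# hypothetical grouping of substrates decided before looking at Y
dat$rich <- dat$fac3 %in% c("a", "b", "c")

# assumed presence probabilities: 0.75 in region I on rich substrate,
# 0.10 everywhere else
p <- ifelse(dat$fac1 == "I" & dat$rich, 0.75, 0.10)
dat$Y <- factor(rbinom(nrow(dat), 1, p), levels = c(0, 1))

# frequency of observations in each category of Y
y_freq <- table(dat$Y)
print(y_freq)

# plain main-effects logistic regression as the simple alternative to a tree
fit <- glm(Y ~ fac1 + rich, family = binomial, data = dat)
print(coef(summary(fit)))

# observations in the rarer category per estimated slope
rare_per_slope <- min(y_freq) / (length(coef(fit)) - 1)
print(rare_per_slope)
```

The size of the rarer category, not the total of 100 rows, is what limits how much structure either a tree or a regression model can support.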
Re: [R] logistic regression tree
On Fri, 20 Aug 2010, Kay Cichini wrote:
> hello gavin & achim,
>
> thanks for responding.
>
> by logistic regression tree i meant a regression tree for a binary response
> variable. but as you say i could also use a classification tree - in my case
> with only two outcomes. i'm not aware if there are substantial differences
> to expect for the two approaches (logistic regression tree vs.
> classification tree with two outcomes).

I don't think that there is a universally accepted terminology for this. Classification tree typically pertains to categorical responses (independent of the number of categories, i.e., also for binary responses). Logistic regression tree is (to the best of my knowledge) not typically used as a term for binary classification trees.

Technical excursion: However, logistic regression tree may mean a specific algorithm (LOTUS - LOgistic regression Tree with Unbiased Splits) developed by Kin-Yee Chan and Wei-Yin Loh. This algorithm shares various ideas with the LMT (Logistic Model Trees) algorithm developed by Niels Landwehr with co-authors (available in R through "RWeka") and the MOB (MOdel-Based partitioning) algorithm when employed with binary GLMs (as available in the "party" package).

> as i'm new to trees / boosting / etc. i also might be advised to use the
> more comprehensible method / a function whose arguments can be understood
> without having to climb a steep learning ladder. at the moment i don't know
> which this would be.

Trees may be a good starting point. As I wrote to you off-list: Feel free to drop by my office if you want to chat about this.

Best,
Z

> regarding the meaning of absences at stands: as these species are frequent
> in the area and hence there is no limitation by propagules i guess absence
> is really due to unfavourable conditions.
>
> thanks a lot,
> kay
Re: [R] logistic regression tree
hello gavin & achim,

thanks for responding.

by logistic regression tree i meant a regression tree for a binary response variable. but as you say i could also use a classification tree - in my case with only two outcomes. i'm not aware if there are substantial differences to expect for the two approaches (logistic regression tree vs. classification tree with two outcomes).

as i'm new to trees / boosting / etc. i also might be advised to use the more comprehensible method / a function whose arguments can be understood without having to climb a steep learning ladder. at the moment i don't know which this would be.

regarding the meaning of absences at stands: as these species are frequent in the area and hence there is no limitation by propagules i guess absence is really due to unfavourable conditions.

thanks a lot,
kay

-
Kay Cichini
Postgraduate student
Institute of Botany
Univ. of Innsbruck
Re: [R] logistic regression tree
On Thu, 19 Aug 2010, Gavin Simpson wrote:
> On Thu, 2010-08-19 at 13:42 -0700, Kay Cichini wrote:
>> hello everyone,
>>
>> i sampled 100 stands at 20 restoration sites and recorded the presence of
>> 3 different invasive plant species.
>> i came across logistic regression trees and wonder if this is suited for my
>> purpose - predicting presence of these problematic invasive plant species
>> (one by one) by a set of recorded ecological / geographical parameters.
>> i'd be glad if someone would comment on applying this method to such data -
>> maybe someone could point me to useful references.
>> also, i was not able to find out if there is a package implementing logistic
>> regression?
>
> Not sure what a logistic regression tree is, but a classification tree
> would be useful here: Treat each species as present (== 1) or absent
> (== 0) and try to fit a tree consisting of a set of splits in X
> covariates that minimise a suitable deviance criterion. If you want to
> fit all three species at once, try multivariate trees, but IIRC, they
> (in package mvpart at least) expect a count-based data set, i.e. the
> deviance criterion they use (sum of squares) is probably not suited to
> binary type data.

To add to Gavin's comments about the modeling techniques: ctree() in package "party" supports recursive partitioning of multivariate responses of arbitrary types (numeric, categorical, censored, etc.). Function mob() in the same package can also be used for partitioning based on logistic regressions. See the manual pages for further references.

Also the Machine Learning and Environmetrics task views at
http://CRAN.R-project.org/view=MachineLearning
http://CRAN.R-project.org/view=Environmetrics
have some more pointers.

Z
Re: [R] logistic regression tree
On Thu, 2010-08-19 at 13:42 -0700, Kay Cichini wrote:
> hello everyone,
>
> i sampled 100 stands at 20 restoration sites and recorded the presence of
> 3 different invasive plant species.
> i came across logistic regression trees and wonder if this is suited for my
> purpose - predicting presence of these problematic invasive plant species
> (one by one) by a set of recorded ecological / geographical parameters.
> i'd be glad if someone would comment on applying this method to such data -
> maybe someone could point me to useful references.
> also, i was not able to find out if there is a package implementing logistic
> regression?

Not sure what a logistic regression tree is, but a classification tree would be useful here: Treat each species as present (== 1) or absent (== 0) and try to fit a tree consisting of a set of splits in X covariates that minimise a suitable deviance criterion. If you want to fit all three species at once, try multivariate trees, but IIRC, they (in package mvpart at least) expect a count-based data set, i.e. the deviance criterion they use (sum of squares) is probably not suited to binary type data.

The one problem I foresee is that you only have 100 data points, and even that number is pseudo-replicated as you have multiple samples from just 20 "sites". Trees are unstable at the best of times and work best when given a lot of data. Boosting, bagging and random forests can help, but they again work best with large data sets. I suppose large will be relative to the signal-to-noise ratio in your data.

Ecologically, one needs to consider what a 0 value means (an absence): was the invasive not present due to the environment being bad, or just because it hasn't got there yet despite the environment being good? How you deal with that is anybody's guess. Try the R-SIG-Ecology list for further help.

G

> thanks in advance,
> kay
>
> -
>
> Kay Cichini
> Postgraduate student
> Institute of Botany
> Univ. of Innsbruck
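To make "splits that minimise a suitable deviance criterion" concrete, here is a minimal base-R sketch that scores three hand-picked candidate splits of a 0/1 response by the reduction in binomial deviance. It is an illustration under assumptions, not the algorithm of any particular package: the helpers node_deviance() and split_gain() are hypothetical, the data follow the simulated design posted later in the thread, and a real tree algorithm would search over all candidate splits (ctree in "party" selects splits with permutation tests rather than by deviance).

```r
# Sketch only: helper names and the candidate splits are made up.

# binomial deviance of a 0/1 vector around its own mean
node_deviance <- function(y) {
  p <- mean(y)
  if (p == 0 || p == 1) return(0)  # pure node: no residual deviance
  -2 * sum(y * log(p) + (1 - y) * log(1 - p))
}

# reduction in deviance from splitting y into y[left] and y[!left]
split_gain <- function(y, left) {
  node_deviance(y) - (node_deviance(y[left]) + node_deviance(y[!left]))
}

set.seed(3)
dat <- expand.grid(fac1 = c("I", "II"),
                   fac2 = LETTERS[1:5],
                   fac3 = letters[1:10])
p <- ifelse(dat$fac1 == "I" & dat$fac3 %in% c("a", "b", "c"), 0.75, 0)
y <- rbinom(nrow(dat), 1, p)  # presence (1) / absence (0)

# one candidate binary split per factor
gains <- c(
  fac1 = split_gain(y, dat$fac1 == "I"),
  fac2 = split_gain(y, dat$fac2 %in% c("A", "B")),
  fac3 = split_gain(y, dat$fac3 %in% c("a", "b", "c"))
)
print(gains)
```

A deviance-based tree would take the candidate with the largest gain as the first split and then repeat the search within each resulting subset of stands.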