On Fri, 2010-08-20 at 14:46 -0700, Kay Cichini wrote:
> hello,
>
> my data-collection is not yet finished, but i though have started
> investigating possible analysis methods.
>
> below i give a very close simulation of my future data-set, however there
> might be more nominal explanatory variables - there will be no continous at
> all (maybe some ordered nominal..).
>
> i tried several packages today, but the one i fancied most was ctree of the
> party package.
> i can't see why the given no. of datapoints (n=100) might pose a problem
> here - but please teach me better, as i might be naive..

I'm no expert, but single trees are unstable predictors: change your data
slightly and you might get a totally different model/tree. I hope that
worries you? Frank's comment was that, depending upon the signal-to-noise
ratio in your sample of data, you might need a very large data set indeed,
much larger than your 100 data points/samples, to have any confidence in
the single fitted tree.

For this reason, ensemble or committee methods have been developed that
combine the predictions from many trees fitted to perturbed versions of
the training data. Such methods include boosting and random forests.

We are venturing into territory not suited to the email list format:
statistical consultancy. As Achim is local to you and has kindly offered
to meet you, I would strongly suggest you take up his offer.

In the meantime, here are a couple of references to look at if you aren't
familiar with these statistical machine learning techniques.

Cutler et al. (2007) Random forests for classification in ecology.
Ecology 88(11), 2783-2792.

Elith, J., Leathwick, J.R., and Hastie, T. (2008) A working guide to
boosted regression trees. Journal of Animal Ecology 77, 802-813.

Also, don't dismiss the logistic regression model. Modern techniques like
the lasso and elastic net are available for GLMs such as this and include
model selection as part of their fitting. These are underused by
ecologists (IMHO), who seem to like (abuse?) the information-theoretic
approaches and step-wise selection procedures... (apologies to ecologists
here [I am one too] for being general!) See:

Dahlgren, J.P. (2010) Alternative regression methods are not considered in
Murtaugh (2009) or by ecologists in general. Ecology Letters 13(5), E7-E9.

HTH

G

> i'd be very glad about comments on the use of ctree on suchalike dataset and
> if i oversee possible pitfalls....
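
On the pitfalls question, the instability of a single tree can be checked
directly. The following is only a minimal sketch, assuming the party
package is installed. It rebuilds data like the simulation quoted below
(in vectorised form), refits ctree() on bootstrap resamples, and tabulates
which variable ends up at the root split. The seed, the number of
resamples B and the helper name root_var are arbitrary choices made for
illustration; reading the split variable from tr@tree$psplit$variableName
relies on party's internal node structure, which is an assumption and may
differ between versions.

```r
library(party)

set.seed(1)  # arbitrary seed, only for reproducibility

# Same design as the quoted simulation: 2 x 5 x 10 = 100 factor combinations
dat <- expand.grid(fac1 = c("I", "II"),
                   fac2 = LETTERS[1:5],
                   fac3 = letters[1:10])

# Presence with probability 0.75 where fac1 is "I" and fac3 is a, b or c;
# probability 0 everywhere else
p <- ifelse(dat$fac1 == "I" & dat$fac3 %in% c("a", "b", "c"), 0.75, 0)
dat$Y <- factor(rbinom(nrow(dat), 1, p), levels = c(0, 1))

# Hypothetical helper: name of the root split variable, or "none" if the
# fitted tree has no split (assumes party's internal node layout)
root_var <- function(tr) {
  if (tr@tree$terminal) "none" else tr@tree$psplit$variableName
}

B <- 50  # number of bootstrap resamples (arbitrary)
root_vars <- character(B)
for (b in seq_len(B)) {
  idx <- sample(nrow(dat), replace = TRUE)  # perturb the training data
  fit <- ctree(Y ~ fac1 + fac2 + fac3, data = dat[idx, ])
  root_vars[b] <- root_var(fit)
}

# How often each variable (or no split at all) appears at the root
print(table(root_vars))
```

If the table is spread over several entries rather than concentrated on
one variable, the single tree is sensitive to small changes in the data,
which is the concern raised above.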
>
> thank you all,
> kay
>
> ######################################################################################
> # an example with 3 nominal explanatory variables:
> # Y is presence of a certain invasive plant species
> # introduced effect for fac1 and fac3, fac2 without effect.
> # presence with prob. 0.75 in factor combination fac1=I (say fac1 is geogr.
> # region) and
> # fac3 = a|b|c (say all richer substrates).
> # presence is not influenced by fac2, which might be vegetation type, i.e.
> ######################################################################################
> library(party)
> dat<-cbind(
>   expand.grid(fac1=c("I","II"),
>               fac2=LETTERS[1:5],
>               fac3=letters[1:10]))
>
> print(dat<-dat[order(dat$fac1,dat$fac2,dat$fac3),])
>
> dat$fac13<-paste(dat$fac1,dat$fac3,sep="")
> for(i in 1:nrow(dat)){
>   ifelse(dat$fac13[i]=="Ia"|dat$fac13[i]=="Ib"|dat$fac13[i]=="Ic",
>          dat$Y[i]<-rbinom(1,1,0.75),
>          dat$Y[i]<-rbinom(1,1,0))
> }
> dat$Y<-as.factor(dat$Y)
>
> tr<-ctree(Y~fac1+fac2+fac3,data=dat)
> plot(tr)
> ######################################################################################
>
>
> -----
> ------------------------
> Kay Cichini
> Postgraduate student
> Institute of Botany
> Univ. of Innsbruck
> ------------------------
>

-- 
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
 Dr. Gavin Simpson             [t] +44 (0)20 7679 0522
 ECRC, UCL Geography,          [f] +44 (0)20 7679 0565
 Pearson Building,             [e] gavin.simpsonATNOSPAMucl.ac.uk
 Gower Street, London          [w] http://www.ucl.ac.uk/~ucfagls/
 UK. WC1E 6BT.                 [w] http://www.freshwaters.org.uk
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.