Re: [R] logistic regression tree

2010-08-22 Thread Kay Cichini

dear all,
thank you everyone for the profound answers and the needful references!

achim, thank you for the very kind offer!! sadly i'm not around vienna in
the near future, otherwise i'd be glad to come back to your invitation.

yours,
kay

-

Kay Cichini
Postgraduate student
Institute of Botany
Univ. of Innsbruck


-- 
View this message in context: 
http://r.789695.n4.nabble.com/logistic-regression-tree-tp2331847p2334106.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] logistic regression tree

2010-08-22 Thread Peter Dalgaard
On 08/22/2010 01:51 PM, Kay Cichini wrote:

> achim, thank you for the very kind offer!! sadly i'm not around vienna in
> the near future, otherwise i'd be glad to come back to your invitation.

Not that it's any of my business, but I don't think you need to go THAT
far to visit Achim these days... <grin>

-pd

-- 
Peter Dalgaard
Center for Statistics, Copenhagen Business School
Phone: (+45)38153501
Email: pd@cbs.dk  Priv: pda...@gmail.com



Re: [R] logistic regression tree

2010-08-20 Thread Kay Cichini

hello gavin & achim,

thanks for responding.

by logistic regression tree i meant a regression tree for a binary response
variable.
but as you say i could also use a classification tree - in my case with only
two outcomes.

i don't know whether substantial differences are to be expected between the two
approaches (logistic regression tree vs. classification tree with two
outcomes).

as i'm new to trees / boosting / etc., i might also be well advised to use the
more comprehensible method, i.e. a function whose reasoning can be understood
without having to climb a steep learning ladder. at the moment
i don't know which this would be.

regarding the meaning of absences at stands: as these species are frequent
in the area and hence there is no limitation by propagules i guess absence
is really due to unfavourable conditions. 

thanks a lot,
kay

 

-

Kay Cichini
Postgraduate student
Institute of Botany
Univ. of Innsbruck


-- 
View this message in context: 
http://r.789695.n4.nabble.com/logistic-regression-tree-tp2331847p2332447.html
Sent from the R help mailing list archive at Nabble.com.



Re: [R] logistic regression tree

2010-08-20 Thread Achim Zeileis

On Fri, 20 Aug 2010, Kay Cichini wrote:



> hello gavin & achim,
>
> thanks for responding.
>
> by logistic regression tree i meant a regression tree for a binary
> response variable. but as you say i could also use a classification tree
> - in my case with only two outcomes.
>
> i don't know whether substantial differences are to be expected between the two
> approaches (logistic regression tree vs. classification tree with two
> outcomes).


I don't think that there is a universally accepted terminology for this.
"Classification tree" typically pertains to categorical responses
(independent of the number of categories, i.e., also for binary responses).

"Logistic regression tree" is (to the best of my knowledge) not typically
used as a term for binary classification trees.


Technical excursion:
However, "logistic regression tree" may mean a specific algorithm (LOTUS -
LOgistic regression Tree with Unbiased Splits) developed by Kin-Yee Chan
and Wei-Yin Loh. This algorithm shares various ideas with the LMT
(Logistic Model Trees) algorithm developed by Niels Landwehr with
co-authors (available in R through RWeka) and the MOB (MOdel-Based
partitioning) algorithm when employed with binary GLMs (as available in
the party package).


> as i'm new to trees / boosting / etc., i might also be well advised to use the
> more comprehensible method, i.e. a function whose reasoning can be
> understood without having to climb a steep learning ladder.
> at the moment i don't know which this would be.


Trees may be a good starting point. As I wrote to you off-list: Feel free 
to drop by my office if you want to chat about this.


Best,
Z

> regarding the meaning of absences at stands: as these species are
> frequent in the area and hence there is no limitation by propagules i
> guess absence is really due to unfavourable conditions.
>
> thanks a lot,
> kay





Re: [R] logistic regression tree

2010-08-20 Thread Frank Harrell


It would be good to tell us of the frequency of observations in each 
category of Y, and the number of continuous X's.  Recursive 
partitioning will require perhaps 50,000 observations in the less 
frequent Y category for its structure and predicted values to 
validate, depending on X and the signal:noise ratio.  Hence the use of 
combinations of trees nowadays as opposed to single trees.  Or 
logistic regression.
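
As a rough sketch of the two things mentioned above (the frequency of each Y category and a plain logistic regression), assuming made-up data: the variables `region` and `substrate`, the effect sizes and the seed are all hypothetical and only stand in for real stand data.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

# Made-up data: 100 stands, two nominal predictors, binary presence Y.
dat <- expand.grid(region = c("I", "II"), substrate = letters[1:5], rep = 1:10)
p   <- plogis(-1.5 + 1.2 * (dat$region == "I") +
              0.8 * (dat$substrate %in% c("a", "b")))
dat$Y <- rbinom(nrow(dat), 1, p)

# Frequency of each outcome category; the rarer one limits model complexity.
y_counts <- table(dat$Y)
print(y_counts)

# Main-effects logistic regression on the nominal predictors.
fit <- glm(Y ~ region + substrate, family = binomial, data = dat)
print(summary(fit)$coefficients)

# Observations in the rarer category per estimated coefficient.
events_per_coef <- min(y_counts) / length(coef(fit))
print(events_per_coef)
```

The last number is only a crude indicator of how much information the data carry per estimated parameter; it is not a validated threshold.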


Frank

Frank E Harrell Jr   Professor and Chairman        School of Medicine
                     Department of Biostatistics   Vanderbilt University



Re: [R] logistic regression tree

2010-08-20 Thread Kay Cichini

hello,

my data collection is not yet finished, but i have already started
investigating possible analysis methods.

below i give a very close simulation of my future data set; however, there
might be more nominal explanatory variables - there will be no continuous ones at
all (maybe some ordered nominal ones).

i tried several packages today, but the one i liked most was ctree of the
party package.
i can't see why the given no. of data points (n=100) might pose a problem
here - but please correct me, as i might be naive.

i'd be very glad about comments on the use of ctree on such a data set and
on whether i am overlooking possible pitfalls.

thank you all,
kay

##
# an example with 3 nominal explanatory variables:
# Y is presence of a certain invasive plant species
# introduced effect for fac1 and fac3, fac2 without effect.
# presence with prob. 0.75 in factor combination fac1=I (say fac1 is
# geogr. region) and fac3 = a|b|c (say all richer substrates).
# presence is not influenced by fac2, which might be vegetation type, e.g.
##
library(party)
dat <- cbind(
  expand.grid(fac1 = c("I", "II"),
              fac2 = LETTERS[1:5],
              fac3 = letters[1:10]))

print(dat <- dat[order(dat$fac1, dat$fac2, dat$fac3), ])

dat$fac13 <- paste(dat$fac1, dat$fac3, sep = "")
for (i in 1:nrow(dat)) {
  ifelse(dat$fac13[i] == "Ia" | dat$fac13[i] == "Ib" | dat$fac13[i] == "Ic",
         dat$Y[i] <- rbinom(1, 1, 0.75),
         dat$Y[i] <- rbinom(1, 1, 0))
}
dat$Y <- as.factor(dat$Y)

tr <- ctree(Y ~ fac1 + fac2 + fac3, data = dat)
plot(tr)
##


-

Kay Cichini
Postgraduate student
Institute of Botany
Univ. of Innsbruck


-- 
View this message in context: 
http://r.789695.n4.nabble.com/logistic-regression-tree-tp2331847p2333073.html
Sent from the R help mailing list archive at Nabble.com.



Re: [R] logistic regression tree

2010-08-20 Thread Gavin Simpson
On Fri, 2010-08-20 at 14:46 -0700, Kay Cichini wrote:
> hello,
>
> my data collection is not yet finished, but i have already started
> investigating possible analysis methods.
>
> below i give a very close simulation of my future data set; however, there
> might be more nominal explanatory variables - there will be no continuous ones at
> all (maybe some ordered nominal ones).
>
> i tried several packages today, but the one i liked most was ctree of the
> party package.
> i can't see why the given no. of data points (n=100) might pose a problem
> here - but please correct me, as i might be naive.

I'm no expert, but single trees are unstable predictors; change your
data slightly and you might get a totally different model/tree. I hope
that worries you?

Frank's comment was that depending upon the signal-to-noise ratio in
your sample of data, you might need a very large data set indeed, much
larger than your 100 data points/samples, to have any confidence in the
single fitted tree.

For this reason, ensemble or committee methods have been developed that
combine the predictions from many trees fitted to perturbed versions of
the training data. Such methods include boosting and random forests.
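
A minimal sketch of that instability, assuming simulated data loosely modelled on the example in this thread (the 0.05 background presence probability, the seed and the helper `lr_pvalue` are made up for illustration). Instead of growing full trees, it only asks which variable would be chosen for the first split, here taken as the one with the smallest likelihood-ratio p-value in a one-variable logistic model, and repeats that choice over bootstrap resamples of the same data.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Simulated stands: presence is likely only for fac1 = I and fac3 in a, b, c.
dat <- expand.grid(fac1 = c("I", "II"), fac2 = LETTERS[1:5],
                   fac3 = letters[1:10])
p <- ifelse(dat$fac1 == "I" & dat$fac3 %in% c("a", "b", "c"), 0.75, 0.05)
dat$Y <- rbinom(nrow(dat), 1, p)

# Hypothetical helper: likelihood-ratio p-value of a one-variable logistic
# model. Warnings about fitted probabilities of 0 or 1 are suppressed.
lr_pvalue <- function(v, d) {
  fit <- suppressWarnings(glm(reformulate(v, "Y"), family = binomial, data = d))
  pchisq(fit$null.deviance - fit$deviance,
         df = fit$df.null - fit$df.residual, lower.tail = FALSE)
}

vars <- c("fac1", "fac2", "fac3")

# Refit on 200 bootstrap resamples and record the variable picked first.
first_split <- replicate(200, {
  boot <- dat[sample(nrow(dat), replace = TRUE), ]
  vars[which.min(sapply(vars, lr_pvalue, d = boot))]
})

print(table(first_split))
```

If the table is spread over more than one variable, the top of a single tree depends on which stands happened to be sampled. Ensemble methods average over exactly this kind of variation rather than committing to one resample.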

We are venturing into territory not suited to email list format;
statistical consultancy. As Achim is local to you and has kindly offered
to meet you, I would strongly suggest you take up his offer.

In the meantime, here are a couple of references to look at if you
aren't familiar with these statistical machine learning techniques.

Cutler et al. (2007) Random forests for classification in ecology.
Ecology 88(11), 2783-2792.

Elith, J., Leathwick, J.R., and Hastie, T. (2008) A working guide to
boosted regression trees. Journal of Animal Ecology 77, 802-813.

Also, don't dismiss the logistic regression model. Modern techniques
like the lasso and elastic net are available for GLMs such as this and
include model selection as part of their fitting. These are underused by
ecologists (IMHO) who seem to like (abuse?) the information theoretic
approaches and step-wise selection procedures... (apologies to
ecologists here [I am one too] for generalising!) See:

Dahlgren, J.P. (2010) Alternative regression methods are not considered
in Murtaugh (2009) or by ecologists in general. Ecology Letters 13(5),
E7-E9.

HTH

G


-- 
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
 Dr. Gavin Simpson [t] +44 (0)20 7679 0522
 ECRC, UCL Geography,  [f] +44 (0)20 7679 0565
 Pearson Building, [e] gavin.simpsonATNOSPAMucl.ac.uk
 Gower Street, London  [w] http://www.ucl.ac.uk/~ucfagls/
 UK. WC1E 6BT. [w] http://www.freshwaters.org.uk
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%



Re: [R] logistic regression tree

2010-08-20 Thread Frank Harrell



On Fri, 20 Aug 2010, Kay Cichini wrote:



> hello,
>
> my data collection is not yet finished, but i have already started
> investigating possible analysis methods.
>
> below i give a very close simulation of my future data set; however, there
> might be more nominal explanatory variables - there will be no continuous ones at
> all (maybe some ordered nominal ones).
>
> i tried several packages today, but the one i liked most was ctree of the
> party package.
> i can't see why the given no. of data points (n=100) might pose a problem
> here - but please correct me, as i might be naive.


See

http://biostat.mc.vanderbilt.edu/wiki/Main/ComplexDataJournalClub#Sebastiani_et_al_Nature_Genetics

The recursive partitioning simulation there will give you an idea - 
you can modify the R code to simulate a situation more like yours. 
When you simulate the true patterns and see how far the tree is from 
discovering the true patterns, you'll be surprised.


Frank

 



[R] logistic regression tree

2010-08-19 Thread Kay Cichini

hello everyone,

i sampled 100 stands at 20 restoration sites and recorded the presence of 3 different
invasive plant species.
i came across logistic regression trees and wonder if this method is suited to my
purpose - predicting the presence of these problematic invasive plant species
(one by one) from a set of recorded ecological / geographical parameters.
i'd be glad if someone would comment on applying this method to such data -
maybe someone could point me to useful references.
also, i was not able to find out if there is a package implementing logistic
regression trees?

thanks in advance,
kay

-

Kay Cichini
Postgraduate student
Institute of Botany
Univ. of Innsbruck


-- 
View this message in context: 
http://r.789695.n4.nabble.com/logistic-regression-tree-tp2331847p2331847.html
Sent from the R help mailing list archive at Nabble.com.



Re: [R] logistic regression tree

2010-08-19 Thread Gavin Simpson
On Thu, 2010-08-19 at 13:42 -0700, Kay Cichini wrote:
> hello everyone,
>
> i sampled 100 stands at 20 restoration sites and recorded the presence of 3 different
> invasive plant species.
> i came across logistic regression trees and wonder if this method is suited to my
> purpose - predicting the presence of these problematic invasive plant species
> (one by one) from a set of recorded ecological / geographical parameters.
> i'd be glad if someone would comment on applying this method to such data -
> maybe someone could point me to useful references.
> also, i was not able to find out if there is a package implementing logistic
> regression trees?

Not sure what a logistic regression tree is, but a classification tree
would be useful here: Treat each species as present (== 1) or absent (==
0) and try to fit a tree consisting of a set of splits in X covariates
that minimise a suitable deviance criterion.
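
To make "a split that minimises a deviance criterion" concrete, here is a small sketch using the binomial deviance for a presence/absence response. The toy data and the covariate names `moisture` and `slope` are invented for illustration; real tree software searches over all candidate splits rather than the two fixed cut points compared here.

```r
# Binomial deviance of a node: -2 * log-likelihood evaluated at the
# node's observed proportion of presences.
node_deviance <- function(y) {
  n <- length(y)
  k <- sum(y)
  if (k == 0 || k == n) return(0)  # pure (or empty) node: zero deviance
  p <- k / n
  -2 * (k * log(p) + (n - k) * log(1 - p))
}

# Reduction in deviance obtained by splitting a node into `left` and `!left`.
split_gain <- function(y, left) {
  node_deviance(y) - (node_deviance(y[left]) + node_deviance(y[!left]))
}

# Invented toy data: presence (1) / absence (0) and two candidate covariates.
y        <- c(1, 1, 1, 0, 0, 0, 0, 0)
moisture <- c(9, 8, 7, 3, 2, 4, 1, 5)  # separates presences from absences
slope    <- c(1, 5, 2, 6, 3, 7, 4, 8)  # only weakly related to y

gain_moisture <- split_gain(y, moisture > 6)
gain_slope    <- split_gain(y, slope > 4.5)

print(c(moisture = gain_moisture, slope = gain_slope))
```

The split on `moisture` yields two pure nodes and so removes all of the root deviance, whereas the split on `slope` removes very little; a tree algorithm using this criterion would prefer the former.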

If you want to fit all three species at once, try multivariate trees,
but IIRC, they (in package mvpart at least) expect a count-based data
set, i.e. the deviance criterion they used (sum of squares) is probably
not suited to binary type data.

The one problem I foresee is that you only have 100 data points and even
that number is pseudoreplicated as you have multiple samples from just
20 sites. Trees are unstable at the best of times and work best when
given a lot of data. Boosting, bagging and random forests can help but
they again work best/well with large data sets. I suppose large will be
relative to the signal to noise ratio in your data.

Ecologically, one needs to consider what a 0 value means (an absence):
was the invasive not present due to the environment being bad or just
because it hasn't got there yet despite environment being good? How you
deal with that is anybody's guess.

Try the R-SIG-Ecology list for further help.

G

 
 

-- 
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
 Dr. Gavin Simpson [t] +44 (0)20 7679 0522
 ECRC, UCL Geography,  [f] +44 (0)20 7679 0565
 Pearson Building, [e] gavin.simpsonATNOSPAMucl.ac.uk
 Gower Street, London  [w] http://www.ucl.ac.uk/~ucfagls/
 UK. WC1E 6BT. [w] http://www.freshwaters.org.uk
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%



Re: [R] logistic regression tree

2010-08-19 Thread Achim Zeileis

On Thu, 19 Aug 2010, Gavin Simpson wrote:


> On Thu, 2010-08-19 at 13:42 -0700, Kay Cichini wrote:
>> hello everyone,
>>
>> i sampled 100 stands at 20 restoration sites and recorded the presence of 3 different
>> invasive plant species.
>> i came across logistic regression trees and wonder if this method is suited to my
>> purpose - predicting the presence of these problematic invasive plant species
>> (one by one) from a set of recorded ecological / geographical parameters.
>> i'd be glad if someone would comment on applying this method to such data -
>> maybe someone could point me to useful references.
>> also, i was not able to find out if there is a package implementing logistic
>> regression trees?
>
> Not sure what a logistic regression tree is, but a classification tree
> would be useful here: Treat each species as present (== 1) or absent (==
> 0) and try to fit a tree consisting of a set of splits in X covariates
> that minimise a suitable deviance criterion.
>
> If you want to fit all three species at once, try multivariate trees,
> but IIRC, they (in package mvpart at least) expect a count-based data
> set, i.e. the deviance criterion they used (sum of squares) is probably
> not suited to binary type data.


To add to Gavin's comments about the modeling techniques:

ctree() in package party supports recursive partitioning of multivariate 
responses of arbitrary types (numeric, categorical, censored, etc.).


Function mob() in the same package can also be used for partitioning based 
on logistic regressions. See the manual pages for further references.


Also, the Machine Learning and Environmetrics task views at

  http://CRAN.R-project.org/view=MachineLearning
  http://CRAN.R-project.org/view=Environmetrics

have some more pointers.
Z


> The one problem I foresee is that you only have 100 data points and even
> that number is pseudoreplicated as you have multiple samples from just
> 20 sites. Trees are unstable at the best of times and work best when
> given a lot of data. Boosting, bagging and random forests can help but
> they again work best/well with large data sets. I suppose large will be
> relative to the signal to noise ratio in your data.
>
> Ecologically, one needs to consider what a 0 value means (an absence):
> was the invasive not present due to the environment being bad or just
> because it hasn't got there yet despite environment being good? How you
> deal with that is anybody's guess.
>
> Try the R-SIG-Ecology list for further help.
>
> G


