On Fri, 15 Jun 2007, ronggui wrote:

> Dear all,
>
> I would like to model the relationship between y and x. y is a binary
> variable, and x is a count variable which may be Poisson-distributed.
>
> I think it is better to divide x into intervals and change it to a
> factor before calling glm(y~x,data=dat,family=binomial).
>
> I tried rpart. As y is binary, I used the "class" method and got the
> following result.
>
>> rpart(y~x,data=dat,method="class")
> n=778 (22 observations deleted due to missingness)
>
> node), split, n, loss, yval, (yprob)
>       * denotes terminal node
>
> 1) root 778 67 0 (0.91388175 0.08611825) *
>
> With the default method, I get this result.
>
>> rpart(y~x,data=dat)
> n=778 (22 observations deleted due to missingness)
>
> node), split, n, deviance, yval
>       * denotes terminal node
>
> 1) root 778 61.230080 0.08611825
>   2) x< 19.5 750 53.514670 0.07733333
>     4) x< 1.25 390 17.169230 0.04615385 *
>     5) x>=1.25 360 35.555560 0.11111110 *
>   3) x>=19.5 28 6.107143 0.32142860 *
>
> If I use 1.25 and 19.5 as the cutting points and change x into a factor by
>
>> x2 <- cut(q34b,breaks=c(0,1.25,19.5,200),right=F)
>
> the coef in y~x2 is significant and makes sense.
>
> My problem is: is it OK to use the default method in rpart when the
> response variable is a binary one? Thanks.

Not unless you want a least-squares fit. Note that you have only 8.6%
of one class, and for such an unbalanced classification problem you are
unlikely to do better than declaring class 1 for all examples.

-- 
Brian D. Ripley,                  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
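
A minimal sketch of what "least-squares fit" means for this data, using only base R and the numbers visible in the two printouts. The split into 67 ones and 711 zeros is an assumption inferred from n = 778 and the root loss of 67; the variable names are made up for illustration and rpart itself is not called.

```r
# Assumed counts, inferred from the printouts: n = 778, root loss = 67.
n_total <- 778
n_ones  <- 67
y <- rep(c(0, 1), times = c(n_total - n_ones, n_ones))

# The default method treats y as numeric: the node value is the mean of y
# and the reported "deviance" is the sum of squares about that mean.
root_mean <- mean(y)
root_ss   <- sum((y - root_mean)^2)
print(c(mean = root_mean, sum_of_squares = root_ss))

# Leaves of the default-method tree, copied from the printout.
leaf_n    <- c(390, 360, 28)
leaf_mean <- c(0.04615385, 0.11111110, 0.32142860)
leaf_ones <- round(leaf_n * leaf_mean)

# For a 0/1 response the within-leaf sum of squares is n * p * (1 - p).
leaf_ss <- leaf_n * leaf_mean * (1 - leaf_mean)
print(data.frame(leaf_n, leaf_mean, leaf_ones, leaf_ss))

# Classifying by majority vote within each leaf: every leaf mean is
# below 0.5, so every leaf still predicts class 0.
leaf_class  <- as.integer(leaf_mean > 0.5)
errors_tree <- sum(ifelse(leaf_class == 1, leaf_n - leaf_ones, leaf_ones))
errors_root <- n_ones
print(c(errors_tree = errors_tree, errors_root = errors_root))
```

The root sum of squares comes out at about 61.23, the same figure the default method prints as the root deviance, so its splits are chosen to reduce squared error in the 0/1 values. The three leaves contain 18, 40 and 9 ones, and none reaches a majority, so the split tree misclassifies the same 67 cases as the single-node "class" tree.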