Hi Lynch,

First, thank you for your attention.
I am not a statistician either and have only taken several statistics classes, so it is natural for us to ask questions that seem naive to statisticians.

I am sorry, but I cannot agree with your point that we must always include the intercept in our model, because if the true intercept is zero, your strategy (or your textbook's) incurs two losses.

First, there is an interpretation problem. If the true intercept is zero and your estimate of it is not zero, the regression result is misleading. This may not be as serious as judging coefficients that are actually zero to be nonzero, but the misjudgment is still a loss to some extent.

Second, if the true intercept is zero, the predictive ability of your strategy is often lower than that of strategies which do not always include the intercept.

If you are interested in the performance of such strategies, e.g. maximizing adjusted R^2 while always including the intercept, you can run the code I put in the attachment. It shows that maximizing adjusted R^2 without always including the intercept beats maximizing adjusted R^2 while always including it.

Junjie

2007/5/22, Paul Lynch <[EMAIL PROTECTED]>:
Junjie,

First, a disclaimer: I am not a statistician, and have only taken one statistics class, but I just took it this Spring, so the concepts of linear regression are relatively fresh in my head and hopefully I will not be too inaccurate.

According to my statistics textbook, when selecting variables for a model, the intercept term is always present. The "variables" under consideration do not include the constant "1" that multiplies the intercept term. I don't think it makes sense to compare models with and without an intercept term. (Also, I don't know what the point of using a model without an intercept term would be, but that is probably just my ignorance.)

Similarly, the formula you were using for R**2 seems to only be useful in the context of a standard linear regression (i.e., one that includes an intercept term). As your example shows, it is easy to construct a "fit" (e.g. y = 10,000,000*x) so that SSR > SST if one is not deriving the fit from the regular linear regression process.

--Paul

On 5/19/07, 李俊杰 <[EMAIL PROTECTED]> wrote:
> I know that "-1" indicates to remove the intercept term. But my question is
> why the intercept term CANNOT be treated as a variable term when we place a
> column consisting of 1s in the predictor matrix.
>
> If I insist on comparing a model with an intercept and one without an
> intercept on adjusted R2, I now think the strategy is always to
> use another definition of r-square or adjusted r-square, in which
> r-square=sum((y.hat)^2)/sum((y)^2).
>
> Am I on the right track?
>
> Thanks
>
> Li Junjie
>
>
> 2007/5/19, Paul Lynch <[EMAIL PROTECTED]>:
> > In case you weren't aware, the meaning of the "-1" in y ~ x - 1 is to
> > remove the intercept term that would otherwise be implied.
> > --Paul
> >
> > On 5/17/07, 李俊杰 <[EMAIL PROTECTED]> wrote:
> > > Hi, everybody,
> > >
> > > 3 questions about R-square:
> > > ---------(1)----------- Does R2 always increase as variables are added?
> > > ---------(2)----------- Is R2 always no greater than 1?
> > > ---------(3)----------- How is R2 in summary(lm(y~x-1))$r.squared
> > > calculated? It is different from (r.square=sum((y.hat-mean
> > > (y))^2)/sum((y-mean(y))^2))
> > >
> > > I will illustrate these problems with the following code:
> > > ---------(1)----------- R2 doesn't always increase as variables are added
> > >
> > > > x=matrix(rnorm(20),ncol=2)
> > > > y=rnorm(10)
> > > >
> > > > lm=lm(y~1)
> > > > y.hat=rep(1*lm$coefficients,length(y))
> > > > (r.square=sum((y.hat-mean(y))^2)/sum((y-mean(y))^2))
> > > [1] 2.646815e-33
> > > >
> > > > lm=lm(y~x-1)
> > > > y.hat=x%*%lm$coefficients
> > > > (r.square=sum((y.hat-mean(y))^2)/sum((y-mean(y))^2))
> > > [1] 0.4443356
> > > >
> > > > ################ This is the biggest model, but its R2 is not the biggest, why?
> > > > lm=lm(y~x)
> > > > y.hat=cbind(rep(1,length(y)),x)%*%lm$coefficients
> > > > (r.square=sum((y.hat-mean(y))^2)/sum((y-mean(y))^2))
> > > [1] 0.2704789
> > >
> > >
> > > ---------(2)----------- R2 can be greater than 1
> > >
> > > > x=rnorm(10)
> > > > y=runif(10)
> > > > lm=lm(y~x-1)
> > > > y.hat=x*lm$coefficients
> > > > (r.square=sum((y.hat-mean(y))^2)/sum((y-mean(y))^2))
> > > [1] 3.513865
> > >
> > >
> > > ---------(3)----------- How is R2 in summary(lm(y~x-1))$r.squared
> > > calculated? It is different from (r.square=sum((y.hat-mean
> > > (y))^2)/sum((y-mean(y))^2))
> > > > x=matrix(rnorm(20),ncol=2)
> > > > xx=cbind(rep(1,10),x)
> > > > y=x%*%c(1,2)+rnorm(10)
> > > > ### r2 calculated by lm(y~x)
> > > > lm=lm(y~x)
> > > > summary(lm)$r.squared
> > > [1] 0.9231062
> > > > ### r2 calculated by lm(y~xx-1)
> > > > lm=lm(y~xx-1)
> > > > summary(lm)$r.squared
> > > [1] 0.9365253
> > > > ### r2 calculated by me
> > > > y.hat=xx%*%lm$coefficients
> > > > (r.square=sum((y.hat-mean(y))^2)/sum((y-mean(y))^2))
> > > [1] 0.9231062
> > >
> > >
> > > Thanks a lot for any cue:)
> > >
> > >
> > > --
> > > Junjie Li, [EMAIL PROTECTED]
> > > Undergraduate in DEP of Tsinghua University,
> > >
> > > ______________________________________________
> > > R-help@stat.math.ethz.ch mailing list
> > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > > and provide commented, minimal, self-contained, reproducible code.
> > >
> >
> >
> > --
> > Paul Lynch
> > Aquilent, Inc.
> > National Library of Medicine (Contractor)
> >
>
>
> --
> Junjie Li, [EMAIL PROTECTED]
> Undergraduate in DEP of Tsinghua University,

--
Paul Lynch
Aquilent, Inc.
National Library of Medicine (Contractor)
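On question (3) in the quoted thread, here is a minimal R sketch of how the reported value can be reproduced. For a model without an intercept, summary() on an lm fit uses uncentered sums of squares, so for lm(y ~ x - 1) the reported R-squared should equal sum(y.hat^2)/sum(y^2), which is the alternative definition proposed in the 5/19 message. The centered ratio sum((y.hat-mean(y))^2)/sum((y-mean(y))^2) is only guaranteed to stay within [0, 1] when the residuals sum to zero, which an intercept ensures. The data are simulated, and the names fit0, fit1 and r2_* are made up for illustration.

```r
# Sketch only: simulated data, illustrative variable names.
set.seed(1)
x <- rnorm(10)
y <- runif(10)

# Model without an intercept
fit0  <- lm(y ~ x - 1)
y_hat <- fitted(fit0)

# Centered ratio used in the thread. Without an intercept the identity
# SST = SSR + SSE need not hold, so this ratio is not bounded by 1.
r2_centered <- sum((y_hat - mean(y))^2) / sum((y - mean(y))^2)

# Uncentered ratio: sum of squared fitted values over sum of squared y.
r2_uncentered <- sum(y_hat^2) / sum(y^2)

# Value reported by summary() for the no-intercept fit
r2_reported <- summary(fit0)$r.squared

# Model with an intercept: summary() uses the centered total sum of squares.
fit1        <- lm(y ~ x)
r2_with_int <- 1 - sum(residuals(fit1)^2) / sum((y - mean(y))^2)

print(c(centered = r2_centered,
        uncentered = r2_uncentered,
        reported = r2_reported))

# The uncentered ratio matches what summary() reports without an intercept,
# and the centered version matches what it reports with one.
stopifnot(isTRUE(all.equal(r2_uncentered, r2_reported)))
stopifnot(isTRUE(all.equal(r2_with_int, summary(fit1)$r.squared)))
```

Because the two fits are measured against different totals (sum(y^2) versus sum((y-mean(y))^2)), their reported R-squared values are not directly comparable.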
--
Junjie Li, [EMAIL PROTECTED]
Undergraduate in DEP of Tsinghua University,
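On the predictive-ability point at the top of this message: the attachment is not reproduced here, so the following is only a small stand-in sketch. It is not the attached code, and it does not run the adjusted R^2 selection comparison. It only checks the simpler underlying claim, on simulated data whose true intercept is exactly zero, that a fit forced to estimate an intercept tends to predict new data slightly worse than a fit without one. The sample sizes, the slope of 2, and the names one_rep, fit_int and fit_noint are all assumptions made for illustration.

```r
# Sketch only: simulated data with a true intercept of exactly zero.
set.seed(42)
n_train <- 10     # small training sample, as in the thread's examples
n_test  <- 1000   # fresh data used to measure prediction error
n_rep   <- 1000   # number of simulated data sets

one_rep <- function() {
  # Training and test data from y = 2 * x + noise (no intercept)
  train <- data.frame(x = rnorm(n_train))
  train$y <- 2 * train$x + rnorm(n_train)
  test <- data.frame(x = rnorm(n_test))
  y_test <- 2 * test$x + rnorm(n_test)

  fit_int   <- lm(y ~ x, data = train)      # intercept always included
  fit_noint <- lm(y ~ x - 1, data = train)  # intercept omitted

  # Out-of-sample mean squared prediction error for each fit
  c(with_intercept    = mean((y_test - predict(fit_int, test))^2),
    without_intercept = mean((y_test - predict(fit_noint, test))^2))
}

# Average the test errors over all replications
mse <- rowMeans(replicate(n_rep, one_rep()))
print(mse)
```

If the true intercept were not zero, the comparison could go the other way, so this says nothing about which strategy is better when the intercept is unknown.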