Hi all,
Below are the last three steps from a long step() trace. The AIC goes down
at first, then goes up slightly at the last step. Should I use the last
model, or the model with the lowest AIC?

Step:  AIC= 154.85
 col1 ~ col2 + col3 + col4 + s(col2, 2) + s(col3, 2) + s(col4, 2) + s(col2, 3) + s(col3, 3)

Step:  AIC= 153.49
 col1 ~ col2 + col3 + col4 + s(col2, 2) + s(col3, 2) + s(col4, 2) + s(col2, 3)

Step:  AIC= 154.31
 col1 ~ col2 + col3 + col4 + s(col2, 2) + s(col3, 2) + s(col4, 2)

On 3/16/06, Michael <[EMAIL PROTECTED]> wrote:
>
> Hi L.Y,
>
> Thank you for your advice.
>
> Are you talking about Trevor Hastie's gam()?
>
> I did not see anywhere in the result that it does an automatic
> cross-validation.
>
> I also could not verify that the gam() function will automatically find
> the degrees of freedom if I don't specify the df and just use terms
> such as
>
> s(col1) + s(col2) ...
>
> Does the "step()" function also include the gam() with CV and
> auto-tweaking for df?
>
> I wondered whether I have called "step()" correctly, because it looks to
> me as if it ran for only a very short time (1 second) and immediately
> returned two models, which in fact have an even larger residual deviance
> than the model I provided to it initially... (obviously I've included
> every possibility in the initial model, and rely on the step() function
> to cut off some terms for me...)
>
> Thanks a lot!
>
> On 3/16/06, Dr L. Y Hin <[EMAIL PROTECTED]> wrote:
> >
> > The engine of gam() lies in a function called smooth.spline() that is
> > found in the library splines. If you do not specify the degrees of
> > freedom in the formula, it will automatically specify them for you via
> > cross-validation. The results of the model fit obtainable via
> > summary(mygam) will show you the "degree of freedom as chosen by the
> > cross-validation method".
> > On a more philosophical plane, Buja et al. (Ann Stat.
> > 1989;17(2):453-510) pointed out that linear smoothers such as cubic
> > splines and smoothing splines are linear because they are x-dependent
> > and not y-dependent. By using cross-validation, you will invariably
> > involve the use of y, which renders the determination of the degrees
> > of freedom y-dependent, hence the smoothing parameter \lambda
> > y-dependent, and in such a case the smoothing matrix is, strictly
> > speaking, non-linear, because S = (I + \lambda * K)^-1 in the
> > non-weighted form with unique x-points.
> >
> > If you increase the degrees of freedom, \lambda decreases, to a point
> > where you will effectively have a straightforward interpolation of
> > points on the graph. Conversely, if \lambda is increased, the smoothing
> > line reduces to a linear regression line through all the points.
> >
> > In my opinion, AIC and residual sum of squares are competing tools
> > looking for the best fit. The minimum of AIC and that of RSS may not
> > concur. If you believe in AIC, then I would assume you also believe
> > that it is a better tool than RSS in that the former uses an
> > information-theoretic approach, which is not sensitive to offset in
> > accuracy due to penalization of outliers. Following that, I would
> > disregard RSS and go according to what AIC tells me.
> >
> > I don't think you have used step.gam incorrectly, but I think you have
> > been observant enough to realize that not all statistical tools agree
> > all the time :)
> >
> > Lin
> >
> > ----- Original Message -----
> > From: "Michael" <[EMAIL PROTECTED]>
> > To: <R-help@stat.math.ethz.ch>
> > Sent: Thursday, March 16, 2006 5:30 PM
> > Subject: [R] Did I use "step" function correctly? (Is R's step()
> > function reliable?)
> >
> > > Hi all,
> > >
> > > I put up an exhaustive model to use R's "step" function:
> > >
> > > ------------------------
> > >
> > > mygam=gam(col1 ~ 1
> > > + col2 + col3 + col4
> > > + col2 ^ 2 + col3 ^ 2 + col4 ^ 2
> > > + col2 ^ 3 + col3 ^ 3 + col4 ^ 3
> > > + s(col2, 1) + s(col3, 1) + s(col4, 1)
> > > + s(col2, 2) + s(col3, 2) + s(col4, 2)
> > > + s(col2, 3) + s(col3, 3) + s(col4, 3)
> > > + s(col2, 4) + s(col3, 4) + s(col4, 4)
> > > + s(col2, 5) + s(col3, 5) + s(col4, 5)
> > > + s(col2, 6) + s(col3, 6) + s(col4, 6)
> > > + s(col2, 7) + s(col3, 7) + s(col4, 7)
> > > + s(col2, 8) + s(col3, 8) + s(col4, 8)
> > > + s(col2, 9) + s(col3, 9) + s(col4, 9),
> > > data=X);
> > >
> > > mystep=step(mygam);
> > >
> > > ---------------------
> > > After a long list, the following are the two lowest AIC values:
> > >
> > > Step: AIC= 152.1
> > > col1 ~ col2 + col3 + col4 + s(col2, 3) + s(col3, 3) + s(col4, 3)
> > >
> > > Step: AIC= 153.45
> > > col1 ~ col2 + col3 + col4 + s(col2, 3) + s(col3, 3)
> > > -----------------------------------------------
> > >
> > > However, the lowest-AIC model, "col1 ~ col2 + col3 + col4 +
> > > s(col2, 3) + s(col3, 3) + s(col4, 3)", does not give the best
> > > residual deviance.
> > >
> > > Instead, the model "mygam3=gam(col1 ~ s(col2, 6) + s(col3, 6) +
> > > s(col4, 6), data=X)" is the best. In fact, I found that as I
> > > increase the "degree-of-freedom", it always gives a better residual
> > > deviance, lower than that of the "best" model returned by the "step"
> > > function... Please see below.
> > >
> > > I am wondering if I need to increase "degree-of-freedom" all the way
> > > up... Perhaps to avoid overfitting, I should do a cross-validation.
> > > Is there an automatic cross-validation inside "step" or "gam"?
> > >
> > > Is the "step" function result reliable? Or perhaps I used it
> > > incorrectly?
> > >
> > > Thanks a lot,
> > >
> > > Michael.
> > >
> > > --------------------------
> > >
> > >> mygam1=gam(col1 ~ col2 + col3 + col4 + s(col2, 3) + s(col3, 3) + s(col4, 3), data=X);
> > >> mygam2=gam(col1 ~ col2 + col3 + col4 , data=X);
> > >> mygam3=gam(col1 ~ s(col2, 6) + s(col3, 6) + s(col4, 6), data=X);
> > >>
> > >> mygam1
> > > Call:
> > > gam(formula = col1 ~ col2 + col3 + col4 +
> > >     s(col2, 3) + s(col3, 3) + s(col4, 3), data = X)
> > >
> > > Degrees of Freedom: 110 total; 100.9999 Residual
> > > Residual Deviance: 20.98365
> > >> mygam2
> > > Call:
> > > gam(formula = col1 ~ col2 + col3 + col4, data = X)
> > >
> > > Degrees of Freedom: 110 total; 107 Residual
> > > Residual Deviance: 27.84808
> > >> mygam3
> > > Call:
> > > gam(formula = col1 ~ s(col2, 6) + s(col3, 6) +
> > >     s(col4, 6), data = X)
> > >
> > > Degrees of Freedom: 110 total; 91.99957 Residual
> > > Residual Deviance: 18.45776
> > >>
> > >> anova(mygam1, mygam2, mygam3);
> > > Analysis of Deviance Table
> > >
> > > Model 1: col1 ~ col2 + col3 + col4 + s(col2, 3) + s(col3, 3) + s(col4, 3)
> > > Model 2: col1 ~ col2 + col3 + col4
> > > Model 3: col1 ~ s(col2, 6) + s(col3, 6) + s(col4, 6)
> > >   Resid. Df Resid. Dev       Df Deviance P(>|Chi|)
> > > 1  100.9999    20.9836
> > > 2  107.0000    27.8481  -6.0001  -6.8644 6.115e-06
> > > 3   91.9996    18.4578  15.0004   9.3903 3.958e-05
> > >
> > > ______________________________________________
> > > R-help@stat.math.ethz.ch mailing list
> > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > PLEASE do read the posting guide!
> > > http://www.R-project.org/posting-guide.html
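
For anyone following the thread, here is a minimal sketch of the point discussed above: the residual sum of squares keeps dropping as the degrees of freedom go up, while an AIC-style criterion charges a penalty for each extra df, so the two need not pick the same model. It uses only smooth.spline() from base R on simulated data, not the poster's X and not gam() itself. The names x, y, dfs, fits, rss, aic_like and fit_cv are made up for the illustration, and the AIC here is the Gaussian form up to an additive constant, which is an assumption rather than what step() prints.

```r
# Simulated data, assumed for illustration only (not the poster's data set X).
set.seed(1)
n <- 110
x <- seq(0, 1, length.out = n)             # unique x values
y <- sin(2 * pi * x) + rnorm(n, sd = 0.4)

# Fit smoothing splines at several fixed effective degrees of freedom.
dfs  <- c(3, 4, 6, 9)
fits <- lapply(dfs, function(d) smooth.spline(x, y, df = d))

# Residual sum of squares of each fit at the observed x values.
rss <- sapply(fits, function(f) sum((y - predict(f, x)$y)^2))

# Gaussian AIC up to an additive constant: n * log(RSS / n) + 2 * df,
# using the effective df reported by each fit.
aic_like <- n * log(rss / n) + 2 * sapply(fits, function(f) f$df)

print(data.frame(df = dfs, rss = rss, aic_like = aic_like))

# With df (and lambda/spar) omitted, smooth.spline() chooses the amount of
# smoothing itself: generalized cross-validation by default, or ordinary
# leave-one-out cross-validation when cv = TRUE.
fit_cv <- smooth.spline(x, y, cv = TRUE)
cat("effective df chosen by leave-one-out CV:", fit_cv$df, "\n")
```

The rss column should fall steadily as df grows, which is the same pattern as the residual deviances of mygam1 and mygam3 above, whereas aic_like is free to turn upward once the extra df stop paying for themselves. Whether gam() picks df automatically for a bare s(col2) is worth checking in the package help: as far as I know, s() in the gam package is documented with a fixed default of df = 4 rather than a cross-validated choice. Also note that in an R formula, col2 ^ 2 is formula arithmetic and not a squared term; a squared term needs I(col2^2).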