Re: [R] Did I use "step" function correctly? (Is R's step() function reliable?)

Michael Thu, 16 Mar 2006 23:44:40 -0800

Hi all,


The following are the last three steps printed after a long list of output.

The AIC goes down at first, but at the last step it goes up slightly.

Should I use the last model, or the model with the lowest AIC?


Step:  AIC= 154.85
 col1 ~ col2 + col3 + col4 + s(col2,
    2) + s(col3, 2) + s(col4, 2) + s(col2, 3) + s(col3,
    3)


Step:  AIC= 153.49
 col1 ~ col2 + col3 + col4 + s(col2,
    2) + s(col3, 2) + s(col4, 2) + s(col2, 3)


Step:  AIC= 154.31
 col1 ~ col2 + col3 + col4 + s(col2,
    2) + s(col3, 2) + s(col4, 2)
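
On the question of which model to pick, here is a minimal sketch. It assumes simulated data and plain lm() polynomial fits in place of the gam() models above; x, y, degrees and fits are illustrative names, not objects from this thread. It shows that the residual sum of squares keeps falling as terms are added, whereas AIC charges for each extra parameter and can therefore turn back up.

```r
# Simulated data (an assumption; it stands in for the poster's data frame X)
set.seed(1)
n <- 110
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)

# Nested candidate models of increasing flexibility (polynomial degree 1..8)
degrees <- 1:8
fits <- lapply(degrees, function(d) lm(y ~ poly(x, d)))

rss <- sapply(fits, deviance)  # residual sum of squares; never increases with degree
aic <- sapply(fits, AIC)       # -2 * logLik + 2 * (number of estimated parameters)

print(data.frame(degree = degrees, rss = rss, aic = aic))

# If AIC is the criterion, take the candidate with the smallest value
best <- degrees[which.min(aic)]
cat("Lowest-AIC degree:", best, "\n")
```

If AIC is the criterion, the lowest-AIC model among those compared is the natural choice. AIC differences of about 1, like those in the output above, are conventionally read as weak evidence for one model over the other.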



On 3/16/06, Michael <[EMAIL PROTECTED]> wrote:
>
>  Hi L.Y,
>
> Thank you for your advice.
>
> Are you talking about Trevor Hastie's gam()?
>
> I did not see anything in the output indicating that it performs automatic
> cross-validation.
>
> I also could not verify that gam() will automatically choose the degrees
> of freedom if I leave out the df and just use terms such as
>
> s(col1) + s(col2) ...
>
> Does the "step()" function also run gam() with CV and automatic tuning of
> the df?
>
> I wonder whether I called "step()" correctly, because it ran for only a very
> short time (about 1 second) and immediately returned two models, which in
> fact have even larger residual deviance than the model I gave it
> initially. (I deliberately included every possibility in the initial model
> and relied on step() to drop some of the terms for me.)
>
> Thanks a lot!
>
>
>
>  On 3/16/06, Dr L. Y Hin <[EMAIL PROTECTED]> wrote:
> >
> > The engine of gam() lies in a function called smooth.spline(), which is
> > found in the library splines. If you leave out the degrees of freedom
> > when specifying the formula, it will choose them for you automatically via
> > cross-validation. The model fit summary obtained via summary(mygam) will
> > show you the "degrees of freedom as chosen by the cross-validation
> > method".
> >
> > On a more philosophical plane, Buja et al. (Ann Stat. 1989;17(2):453-510)
> > pointed out that smoothers such as cubic splines and smoothing splines
> > are linear because the smoother matrix depends on x and not on y. By
> > using cross-validation, you will invariably involve y, which makes the
> > choice of degrees of freedom y-dependent, and hence the smoothing
> > parameter \lambda y-dependent. In that case the smoother is, strictly
> > speaking, non-linear, because the smoothing matrix is
> > S = (I + \lambda * K)^-1 in the non-weighted form with unique x-points.
> >
> > If you increase the degrees of freedom, \lambda decreases, to the point
> > where you effectively have a straightforward interpolation of the points
> > on the graph. Conversely, as \lambda is increased, the smooth reduces to
> > a linear regression line through the points.
> >
> > In my opinion, AIC and the residual sum of squares (RSS) are competing
> > tools looking for the best fit. The minimum of AIC and that of RSS may not
> > concur. If you believe in AIC, then I would assume you also believe that
> > it is a better tool than RSS, in that the former uses an
> > information-theoretic approach, which is not sensitive to offset in
> > accuracy due to penalization of outliers. Following that, I would
> > disregard RSS and go with what AIC tells me.
> >
> > I don't think you have used step.gam incorrectly, but I think you have
> > been observant enough to realize that not all statistical tools agree
> > all the time :)
> >
> > Lin
> >
> > ----- Original Message -----
> > From: "Michael" <[EMAIL PROTECTED]>
> > To: <R-help@stat.math.ethz.ch >
> > Sent: Thursday, March 16, 2006 5:30 PM
> > Subject: [R] Did I use "step" function correctly? (Is R's step() function reliable?)
> >
> >
> > > Hi all,
> > >
> > > I set up an exhaustive model to pass to R's "step" function:
> > >
> > > ------------------------
> > >
> > > mygam=gam(col1 ~ 1
> > > + col2     + col3     + col4
> > > + col2 ^ 2 + col3 ^ 2 + col4 ^ 2
> > > + col2 ^ 3 + col3 ^ 3 + col4 ^ 3
> > > + s(col2, 1) + s(col3, 1) + s(col4, 1)
> > > + s(col2, 2) + s(col3, 2) + s(col4, 2)
> > > + s(col2, 3) + s(col3, 3) + s(col4, 3)
> > > + s(col2, 4) + s(col3, 4) + s(col4, 4)
> > > + s(col2, 5) + s(col3, 5) + s(col4, 5)
> > > + s(col2, 6) + s(col3, 6) + s(col4, 6)
> > > + s(col2, 7) + s(col3, 7) + s(col4, 7)
> > > + s(col2, 8) + s(col3, 8) + s(col4, 8)
> > > + s(col2, 9) + s(col3, 9) + s(col4, 9),
> > > data=X);
> > >
> > > mystep=step(mygam);
> > >
> > > ---------------------
> > > After a long list of output, the following are the two lowest-AIC steps:
> > >
> > > Step:  AIC= 152.1
> > > col1 ~ col2 + col3 + col4 + s(col2, 3) + s(col3, 3) + s(col4, 3)
> > >
> > >
> > > Step:  AIC= 153.45
> > > col1 ~ col2 + col3 + col4 + s(col2, 3) + s(col3, 3)
> > > -----------------------------------------------
> > >
> > > However, the lowest-AIC model, "col1 ~ col2 + col3 + col4 + s(col2, 3) +
> > > s(col3, 3) + s(col4, 3)", does not give the best residual deviance.
> > >
> > > Instead, the model "mygam3=gam(col1 ~ s(col2, 6) + s(col3, 6) +
> > > s(col4, 6), data=X)" is the best.
> > >
> > > In fact, I found that as I increase the degrees of freedom, it always
> > > gives a better residual deviance, lower than that of the "best" model
> > > returned by the "step" function... Please see below.
> > >
> > > I am wondering whether I need to increase the degrees of freedom all the
> > > way up... Perhaps to avoid overfitting, I should do a cross-validation.
> > > Is there an automatic cross-validation inside "step" or "gam"?
> > >
> > > Is the result of the "step" function reliable? Or perhaps I used it
> > > incorrectly?
> > >
> > > Thanks a lot,
> > >
> > > Michael.
> > >
> > > --------------------------
> > >
> > >>
> > >> mygam1=gam(col1 ~ col2 + col3 + col4 + s(col2, 3) + s(col3, 3) + s(col4, 3), data=X);
> > >>
> > >> mygam2=gam(col1 ~ col2 + col3 + col4 , data=X);
> > >>
> > >> mygam3=gam(col1 ~ s(col2, 6) + s(col3, 6) + s(col4, 6), data=X);
> > >>
> > >> mygam1
> > > Call:
> > > gam(formula = col1 ~ col2 + col3 + col4 +
> > >    s(col2, 3) + s(col3, 3) + s(col4, 3), data = X)
> > >
> > > Degrees of Freedom: 110 total; 100.9999 Residual
> > > Residual Deviance: 20.98365
> > >> mygam2
> > > Call:
> > > gam(formula = col1 ~ col2 + col3 + col4, data = X)
> > >
> > > Degrees of Freedom: 110 total; 107 Residual
> > > Residual Deviance: 27.84808
> > >> mygam3
> > > Call:
> > > gam(formula = col1 ~ s(col2, 6) + s(col3, 6) +
> > >    s(col4, 6), data = X)
> > >
> > > Degrees of Freedom: 110 total; 91.99957 Residual
> > > Residual Deviance: 18.45776
> > >>
> > >> anova(mygam1, mygam2, mygam3);
> > > Analysis of Deviance Table
> > >
> > > Model 1: col1 ~ col2 + col3 + col4 + s(col2,
> > >    3) + s(col3, 3) + s(col4, 3)
> > > Model 2: col1 ~ col2 + col3 + col4
> > > Model 3: col1 ~ s(col2, 6) + s(col3, 6) + s(col4, 6)
> > >  Resid. Df Resid. Dev       Df Deviance P(>|Chi|)
> > > 1  100.9999    20.9836
> > > 2  107.0000    27.8481  -6.0001  -6.8644 6.115e-06
> > > 3   91.9996    18.4578  15.0004   9.3903 3.958e-05
> > >
> > > [[alternative HTML version deleted]]
> > >
> > > ______________________________________________
> > > R-help@stat.math.ethz.ch mailing list
> > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
> > >
> >
> >
> >
>
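
A note on the model formula in the original post. Inside an R formula, `^` is a formula operator (crossing to a given order), not arithmetic, so a term such as `col2 ^ 2` reduces to plain `col2`. A genuine squared term has to be wrapped in I(). A minimal sketch on simulated data, where d, x, y and the two fit names are illustrative and not from the thread:

```r
# Simulated data with a true quadratic relationship (an assumption for illustration)
set.seed(1)
d <- data.frame(x = runif(50))
d$y <- 1 + 2 * d$x - 3 * d$x^2 + rnorm(50, sd = 0.1)

fit_caret <- lm(y ~ x + x^2, data = d)     # x^2 is read as a formula operator and collapses to x
fit_I     <- lm(y ~ x + I(x^2), data = d)  # I() protects the arithmetic, giving a real quadratic term

names(coef(fit_caret))  # "(Intercept)" "x"
names(coef(fit_I))      # "(Intercept)" "x" "I(x^2)"
```

This would be consistent with the step output in the thread, where only col2, col3 and col4 appear and no power terms are ever listed.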

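The relationship between degrees of freedom and \lambda described in the quoted reply above can be tried out directly with smooth.spline(), which in current versions of R lives in the stats package. This is a minimal sketch on simulated data; all object names are illustrative, and it does not use the gam package itself.

```r
# Simulated data (an assumption; not the poster's data)
set.seed(1)
x <- seq(0, 1, length.out = 110)
y <- sin(2 * pi * x) + rnorm(110, sd = 0.3)

fit_stiff  <- smooth.spline(x, y, df = 3)   # few degrees of freedom  -> large lambda
fit_wiggly <- smooth.spline(x, y, df = 20)  # many degrees of freedom -> small lambda
fit_gcv    <- smooth.spline(x, y)           # no df given: smoothing parameter chosen by GCV

# Residual sum of squares of a fitted smoothing spline on the training data
rss <- function(fit) sum((y - predict(fit, x)$y)^2)

print(c(lambda_stiff = fit_stiff$lambda, lambda_wiggly = fit_wiggly$lambda))
print(c(rss_stiff = rss(fit_stiff), rss_wiggly = rss(fit_wiggly)))
print(fit_gcv$df)  # equivalent degrees of freedom selected by GCV
```

When neither df nor lambda is supplied, smooth.spline() selects the smoothing parameter by generalized cross-validation (cv = TRUE switches to leave-one-out). As far as I know, s() in Hastie's gam package instead defaults to a fixed df = 4, so it is worth checking ?s before assuming that gam() cross-validates on its own.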