Re: [R] {Spam?} RE: {Spam?} Re: mgcv, testing gamm vs lme, which degrees of freedom?

Christos Argyropoulos Mon, 21 Jun 2010 15:56:25 -0700

Hi, 
From my understanding, GAMs as well as GAMMs can be represented as
(generalized) mixed models (Simon discusses this in the last chapter of his
book; a condensed overview from a Bayesian perspective is given in the
following paper by Y. Zhao, J. Staudenmayer, B. A. Coull and M. P. Wand:
http://arxiv.org/pdf/math/0606491).


In this representation the un-penalized part appears as part of the fixed 
effects, whereas the penalized part appears in the random effect design matrix. 
The fixed effects part is determined by the basis (and the penalty order for TP 
splines), so that a comparison of the following: 
   
f1 <- gamm(y ~ s(x, k=2, bs="cr"), random = list(id=~1), method="ML")
f2 <- gamm(y ~ s(x, k=10, bs="cr"), random = list(id=~1), method="ML")

can be undertaken via ANOVA tests since the 2 models have the same fixed 
effects. 
(If I remember correctly, the fixed effects for the cubic regression splines in
GAM(M)s involve a design matrix with 2 columns: the intercept and the linear
function.)
The difference in degrees of freedom of the two models arises from the 
additional "knots" that the larger model has relative to the smaller one.
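
A rough sketch of how one might check the "same fixed effects" claim, using the simulated data from Carlo's original post. The names dat, fa and fb are mine, and k = 5 for the smaller basis is my own choice (an assumption, not something from the thread). It only prints the fixed effects and log-likelihoods of the lme representation; it is not meant as a definitive test procedure.

```r
library(mgcv)  # gamm(); mgcv depends on nlme
library(nlme)  # fixef()

# Simulated data as in Carlo's original post (same seed and formula).
set.seed(123)
x  <- runif(100, 1, 10)
b0 <- rep(rnorm(10, mean = 1, sd = 2), each = 10)
id <- factor(rep(1:10, each = 10))
y  <- b0 + x - 0.1 * x^3 + rnorm(100, 0, 1)
dat <- data.frame(y = y, x = x, id = id)

# Two cubic regression spline fits that differ only in the basis dimension k.
# k = 5 is an illustrative choice for the smaller model.
fa <- gamm(y ~ s(x, k = 5,  bs = "cr"), random = list(id = ~1),
           data = dat, method = "ML")
fb <- gamm(y ~ s(x, k = 10, bs = "cr"), random = list(id = ~1),
           data = dat, method = "ML")

# Fixed effects of the lme representation: intercept plus the
# unpenalized (linear) part of the smooth.
fixef(fa$lme)
fixef(fb$lme)

# Log-likelihoods and the df that the lme representation reports.
logLik(fa$lme)
logLik(fb$lme)
```

If the representation is as I described, both fits should list the same number of fixed effects, and the extra basis functions of the larger model should show up only among the random effects.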

Now when you try to compare the following models:

f3 <- gamm(y ~ x + I(x^2), random = list(id=~1), method="ML")
f2 <- gamm(y ~ s(x, k=10, bs="cr"), random = list(id=~1), method="ML")

the I(x^2) is not part of the (implicit) fixed effects design matrix of the 
cubic regression smoother and the models are not nested. However if you were to 
use a smooth term, whose fixed effects decomposition included the constant, 
linear and quadratic functions, then one could resort to an ANOVA type of test 
(or at least so I think).
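
To make this concrete, here is a hedged sketch of one smooth whose unpenalized part should contain the quadratic: a thin plate regression spline with a third-order penalty (m = 3). That choice of smooth is my assumption, as are the names f_quad, f_tp, lr and p_naive. The likelihood ratio is computed by hand from the two log-likelihoods, and the chi-square(1) p-value is the naive one, so Simon's caveat about testing on the boundary of the parameter space still applies.

```r
library(mgcv)
library(nlme)

# Simulated data as in Carlo's original post.
set.seed(123)
x  <- runif(100, 1, 10)
b0 <- rep(rnorm(10, mean = 1, sd = 2), each = 10)
id <- factor(rep(1:10, each = 10))
y  <- b0 + x - 0.1 * x^3 + rnorm(100, 0, 1)
dat <- data.frame(y = y, x = x, id = id)

# Parametric quadratic mixed model, fitted directly with lme.
f_quad <- lme(y ~ x + I(x^2), random = ~ 1 | id, data = dat, method = "ML")

# Thin plate regression spline with m = 3: in one dimension its
# unpenalized part should span constant, linear and quadratic terms.
f_tp <- gamm(y ~ s(x, k = 10, bs = "tp", m = 3), random = list(id = ~1),
             data = dat, method = "ML")

# Both models should now have the same number of fixed effects.
length(fixef(f_quad))
length(fixef(f_tp$lme))

# Hand-computed likelihood ratio; the only extra parameter in f_tp is the
# variance component (smoothing parameter) of the penalized part.
lr <- as.numeric(2 * (logLik(f_tp$lme) - logLik(f_quad)))
p_naive <- pchisq(lr, df = 1, lower.tail = FALSE)
lr
p_naive
```

Because the null hypothesis puts a variance component at zero, the naive p-value is generally regarded as conservative; I would read it as a rough guide only.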

Having used GAM(M)s extensively over the last 14 months, I have to point out
that the appeal of the methodology to the analyst is precisely its ability to
let the data distinguish between f2 and f3 without fitting both models. Rather
than try to distinguish between a quadratic and a more general relationship, it
might be easier to fit the "f2" model and note the resulting effective degrees
of freedom of the smooth; if it is close to 2, then the data have essentially
told you that a quadratic function is appropriate for the problem at hand and
you can comment on this when summarizing inferences. Note that I do not
recommend fitting the quadratic after a GAMM has suggested this relationship :)
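
For what it is worth, reading off the effective degrees of freedom takes one line once the smooth model is fitted. A minimal sketch on Carlo's simulated data (the names dat, f_s and edf_x are mine):

```r
library(mgcv)

# Simulated data as in Carlo's original post.
set.seed(123)
x  <- runif(100, 1, 10)
b0 <- rep(rnorm(10, mean = 1, sd = 2), each = 10)
id <- factor(rep(1:10, each = 10))
y  <- b0 + x - 0.1 * x^3 + rnorm(100, 0, 1)
dat <- data.frame(y = y, x = x, id = id)

# The smooth model from the discussion (cubic regression spline, k = 10).
f_s <- gamm(y ~ s(x, k = 10, bs = "cr"), random = list(id = ~1),
            data = dat, method = "ML")

# Effective degrees of freedom of s(x), taken from the gam part of the fit.
edf_x <- summary(f_s$gam)$edf
edf_x
```

The data here are generated from a cubic, so I would not expect the value to sit near 2 in this particular example; the point is only where to look.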

One last thing you should be aware of concerns the numerical performance of
gamm (versus its cousin gamm4); the lme4 package (lmer), on which gamm4 is
built, is much faster and numerically more stable for large problems, so you
should prefer the second interface in that case.
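
A sketch of the same model through gamm4, assuming the gamm4 package is installed (the code skips the fit otherwise). Note that the random effects are given in lme4 notation and that REML = FALSE requests ML. The names dat and fit4 are mine.

```r
library(mgcv)

# Simulated data as in Carlo's original post.
set.seed(123)
x  <- runif(100, 1, 10)
b0 <- rep(rnorm(10, mean = 1, sd = 2), each = 10)
id <- factor(rep(1:10, each = 10))
y  <- b0 + x - 0.1 * x^3 + rnorm(100, 0, 1)
dat <- data.frame(y = y, x = x, id = id)

fit4 <- NULL
if (requireNamespace("gamm4", quietly = TRUE)) {
  library(gamm4)
  # Random intercept per id, written lme4-style; ML rather than REML.
  fit4 <- gamm4(y ~ s(x, k = 10, bs = "cr"), random = ~ (1 | id),
                data = dat, REML = FALSE)
  print(summary(fit4$gam)$edf)  # edf of the smooth, from the gam part
  print(logLik(fit4$mer))       # log-likelihood of the lme4 representation
}
```

The returned list has a $gam part, as with gamm, and a $mer part in place of $lme.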


Christos Argyropoulos
  
> Date: Mon, 21 Jun 2010 23:05:55 +0100
> From: c.fe...@uea.ac.uk
> To: gunter.ber...@gene.com
> CC: r-help@r-project.org
> Subject: Re: [R] {Spam?} RE: {Spam?} Re: mgcv, testing gamm vs lme, which
> degrees of freedom?
> 
> Thanks for your help, Bert,
> 
> on your points:
> 
> 1. I was not aware of such a list, I will check it out.
> 
> 2. As you said, Pinheiro and Bates (2000) advocate the anova LR test for
> random effects with the SAME fixed effects. My question was exactly related to
> this point... are the fixed effects the same in this gamm example? My
> understanding is that they are not.
> 
> 3. I read Wood's (2006) book and believe it is really a great source.
> However, I do not have access to it at the moment because I am away from the
> office for a visiting period overseas. If I recall correctly there is not
> an example such as the comparison of f2 vs f3, hence my question on the
> list... but I may well be mistaken and I will double check as soon as I am
> back. Hopefully this does not sound too unreasonable.
> 
> Best wishes,
> 
> Carlo
> 
> >
> >
> > Bert Gunter
> > Genentech Nonclinical Statistics
> >
> >> -----Original Message-----
> >> From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org]
> >> On Behalf Of Joris Meys
> >> Sent: Monday, June 21, 2010 12:09 PM
> >> To: Carlo Fezzi
> >> Cc: r-help@r-project.org
> >> Subject: Re: [R] {Spam?} Re: mgcv, testing gamm vs lme,which degrees of
> >> freedom?
> >>
> >> Hi Carlo,
> >>
> >> You should get the book of Simon Wood and read it thoroughly. It's all
> >> explained in there, but it will lead me too far to copy it all in a
> >> mail. In short : random effects are part of the error structure of the
> >> model, not of the model itself. They're added to correct the error on
> >> the parameters of the fixed model, and are inherent to the data
> >> structure and not to the hypotheses. Hence, you rarely test their
> >> significance.
> >
> > 1. This discussion probably belongs on the sig-mixed-models list.
> >
> > 2. Your claim is incorrect, I think. The structure of the random errors =
> > model covariance can be parameterized in various ways, and one can try to
> > test significance of nested parameterizations (for a particular fixed
> > effects parameterization). Whether you can do so meaningfully, especially in
> > the gamm context, is another issue, but if you check e.g. Bates and
> > Pinheiro, anova for different random effects parameterizations is advocated,
> > and is implemented in the anova.lme() function of nlme.
> >
> > 3. But I strongly endorse your suggestion to consult an authoritative
> > resource; I believe it inherently unreasonable (but alas not unusual) that
> > posters somehow expect brief explanations on this list to illuminate and
> > resolve complex statistical issues.
> >
> > -- Bert
> >
> >
> >>
> >> Cheers
> >> Joris
> >>
> >>
> >> On Mon, Jun 21, 2010 at 6:54 PM, Carlo Fezzi <c.fe...@uea.ac.uk> wrote:
> >> > Hi Joris (CC Simon),
> >> >
> >> > Thanks for your kind replies and for being so responsive.
> >> >
> >> > I think this post boils down to two main questions (which I feel are very
> >> > important for gam modelling):
> >> >
> >> > 1- Is it appropriate to use LR tests in "gamm" to test model reduction?
> >> > 2- If yes, which degrees of freedom should be used?
> >> >
> >> > I do not think we should always use the df from "model$lme". For example,
> >> > compare the two models (again my first example for the data generation):
> >> >
> >> > f1 <- gamm(y ~ s(x, k=2, bs="cr"), random = list(id=~1), method="ML")
> >> > f2 <- gamm(y ~ s(x, k=10, bs="cr"), random = list(id=~1), method="ML")
> >> >
> >> > The difference between the two models is in the random effects. Model "f2"
> >> > has, if I interpret the output correctly, 7 more random effects than
> >> > model "f1", but the fixed effects are the same. So H0 = "the 7 random
> >> > effects are not significant". In this case the (approx.) likelihood ratio
> >> > test should have 7 df... is my interpretation correct?
> >> >
> >> > On the other hand, to compare the following models:
> >> >
> >> > f3 <- gamm(y ~ x + I(x^2), random = list(id=~1), method="ML")
> >> > f2 <- gamm(y ~ s(x, k=10, bs="cr"), random = list(id=~1), method="ML")
> >> >
> >> > Model "f3" has 1 more fixed effect than model "f2", but model "f2" has 7
> >> > more random effects... again, if I understand the output correctly. In
> >> > this case I don't know if we can do a LR test; the models are not strictly
> >> > nested, I think...
> >> >
> >> > What do you think?
> >> >
> >> > Again many thanks,
> >> >
> >> > Carlo
> >> >
> >> >
> >> >
> >> >> I don't use an LR test for non-nested models, as I fail to formulate a
> >> >> sensible null hypothesis for such tests. Again, everything I write is
> >> >> a personal opinion, and inference in the case of these models is still
> >> >> a subject of discussion to date. If you find a plausible way of
> >> >> explaining the result, by all means use the LR test.
> >> >>
> >> >> Personally, I'd go for the AIC / BIC, but these are based on the
> >> >> likelihood themselves. So in the case where the effective complexity
> >> >> of the models appears the same, they're completely equivalent to the
> >> >> likelihood. It's just the inference (i.e. the p-value) I don't trust.
> >> >> But then again, I'm a cautious statistician. If I'm not sure about a
> >> >> method, I'd rather not use it and go with what I know. In my view,
> >> >> there is not one correct method for a particular problem and/or
> >> >> dataset. Every method makes assumptions and has shortcomings. Only if
> >> >> I know which ones can I take them into account when interpreting the
> >> >> results.
> >> >>
> >> >> It also depends on the focus. If the focus is prediction, you
> >> >> might even want to consider testing whether the variance of the
> >> >> residuals differs significantly with a simple F-test. This indicates
> >> >> whether the predictive power differs significantly between the models.
> >> >> But these tests tend to get very sensitive when you have many
> >> >> datapoints, rendering them practically useless again.
> >> >>
> >> >> So in the end, it always boils down to interpretation.
> >> >>
> >> >> Cheers
> >> >> Joris
> >> >>
> >> >> On Fri, Jun 18, 2010 at 10:29 PM, Carlo Fezzi <c.fe...@uea.ac.uk> wrote:
> >> >>> Thanks Joris,
> >> >>>
> >> >>> I understand your point regarding the need for the two models to be
> >> >>> nested. So, according to you, in the example case the LR test is not
> >> >>> appropriate and the two models should be compared with other criteria
> >> >>> such as AIC or BIC for example.
> >> >>>
> >> >>> On the other hand, Simon Wood indicated that such a LR test is
> >> >>> (approximately) correct in his previous reply... I am a bit confused:
> >> >>> which is the correct approach to test the two models? Is the LR test
> >> >>> correct only if the parametric model is linear in the x variables maybe?
> >> >>> In this case, which is the best approach to compare a "gamm" vs a "lme"
> >> >>> with quadratic specification?
> >> >>>
> >> >>> Best wishes,
> >> >>>
> >> >>> Carlo
> >> >>>
> >> >>>> Just realized something: You should take into account that the LR test
> >> >>>> is actually only valid for _nested_ models. Your models are not
> >> >>>> nested. Hence, you shouldn't use the anova function to compare them,
> >> >>>> and you shouldn't compare the df. In fact, if you're interested in the
> >> >>>> contribution of a term, then using anova to compare the model with
> >> >>>> that term and without that term gives you an answer on the hypothesis
> >> >>>> whether that term with spline contributes significantly to the model.
> >> >>>>
> >> >>>>> f2 <- gamm(y ~ s(x), random = list(id=~1), method="ML")
> >> >>>>
> >> >>>>> f3 <- gamm(y ~ x, random = list(id=~1), method="ML" )
> >> >>>>
> >> >>>>> f4 <- gamm(y ~ 1, random = list(id=~1), method="ML" )
> >> >>>>
> >> >>>>> anova(f3$lme,f2$lme)
> >> >>>>        Model df AIC BIC logLik   Test L.Ratio p-value
> >> >>>> f3$lme     1  4 760 770   -376
> >> >>>> f2$lme     2  5 381 394   -186 1 vs 2     380  <.0001
> >> >>>>
> >> >>>>> anova(f4$lme,f2$lme)
> >> >>>>        Model df AIC BIC logLik   Test L.Ratio p-value
> >> >>>> f4$lme     1  3 945 953   -470
> >> >>>> f2$lme     2  5 381 394   -186 1 vs 2     568  <.0001
> >> >>>>
> >> >>>>> anova(f3$lme,f4$lme)
> >> >>>>        Model df AIC BIC logLik   Test L.Ratio p-value
> >> >>>> f3$lme     1  4 760 770   -376
> >> >>>> f4$lme     2  3 945 953   -470 1 vs 2     188  <.0001
> >> >>>>
> >> >>>> This is the correct application of a likelihood ratio test. You see
> >> >>>> that adding the spline increases the df by 1 compared to the linear
> >> >>>> model, as part of the spline gets into the random component. Notice as
> >> >>>> well that the interpretation of a test in case of a random component
> >> >>>> is not the same as in case of a fixed component. If I understood
> >> >>>> correctly, this LR test specifically says something about the effect of
> >> >>>> X, without being interested in the shape of the spline. The
> >> >>>> "significance of a spline" is a difficult concept anyway, as a spline
> >> >>>> can be seen as a form of local regression. It's exactly the use of the
> >> >>>> randomization that allows for a general hypothesis about the added
> >> >>>> value of the spline, without focusing on its actual shape. Hence the
> >> >>>> "freedom" connected to that actual shape should not be used in the df
> >> >>>> used to test the general hypothesis.
> >> >>>>
> >> >>>> Hope this makes sense someway...
> >> >>>>
> >> >>>> Cheers
> >> >>>> Joris
> >> >>>>
> >> >>>>
> >> >>>> On Fri, Jun 18, 2010 at 6:27 PM, Carlo Fezzi <c.fe...@uea.ac.uk> wrote:
> >> >>>>> Dear Simon,
> >> >>>>>
> >> >>>>> thanks a lot for your prompt reply.
> >> >>>>>
> >> >>>>> Unfortunately I am still confused about which is the correct way to
> >> >>>>> test the two models... as you point out: why in my example do the two
> >> >>>>> models have the same degrees of freedom?
> >> >>>>>
> >> >>>>> Intuitively it seems to me the gamm model is more flexible since, as I
> >> >>>>> understand also from your response, it should contain more random
> >> >>>>> effects than the other model because some of the smooth function
> >> >>>>> parameters are represented as such. Should this not be taken into
> >> >>>>> account when testing one model vs the other?
> >> >>>>>
> >> >>>>> Continuing with my example, the two models:
> >> >>>>>
> >> >>>>> f2 <- gamm(y ~ s(x), random = list(id=~1), method="ML")
> >> >>>>> f3 <- gamm(y ~ x + I(x^2), random = list(id=~1), method="ML")
> >> >>>>>
> >> >>>>> can be tested with:
> >> >>>>>
> >> >>>>> anova(f3$lme,f2$lme)
> >> >>>>>
> >> >>>>> But why are the df the same? Model f2 appears to be more flexible and,
> >> >>>>> as such, should have more (random) parameters. Should not a test of one
> >> >>>>> model vs the other take this into account?
> >> >>>>>
> >> >>>>> Sorry if this may sound dull, many thanks for your help,
> >> >>>>>
> >> >>>>> Carlo
> >> >>>>>
> >> >>>>>
> >> >>>>>
> >> >>>>>> On Wednesday 16 June 2010 20:33, Carlo Fezzi wrote:
> >> >>>>>>> Dear all,
> >> >>>>>>>
> >> >>>>>>> I am using the "mgcv" package by Simon Wood to estimate an additive
> >> >>>>>>> mixed model in which I assume normal distribution for the residuals. I
> >> >>>>>>> would like to test this model vs a standard parametric mixed model,
> >> >>>>>>> such as the ones which are possible to estimate with "lme".
> >> >>>>>>>
> >> >>>>>>> Since the smoothing splines can be written as random effects, is it
> >> >>>>>>> correct to use an (approximate) likelihood ratio test for this?
> >> >>>>>> -- yes this is ok (subject to the usual caveats about testing on the
> >> >>>>>> boundary of the parameter space) but your 2 example models below will
> >> >>>>>> have the same number of degrees of freedom!
> >> >>>>>>
> >> >>>>>>> If so,
> >> >>>>>>> which is the correct number of degrees of freedom?
> >> >>>>>> --- The edf from the lme object, if you are testing using the log
> >> >>>>>> likelihood returned by the lme representation of the model.
> >> >>>>>>
> >> >>>>>>> Sometimes the function
> >> >>>>>>> logLik() seems to provide strange results regarding the number of
> >> >>>>>>> degrees of freedom (df) for the gam, for instance in the example I
> >> >>>>>>> copied below the df for the "gamm" are equal to the ones for the
> >> >>>>>>> "lme", but the summary(model.gam) seems to indicate a much higher
> >> >>>>>>> edf for the gamm.
> >> >>>>>> --- the edf for the lme representation of the model counts only the
> >> >>>>>> fixed effects + the variance parameters (which include smoothing
> >> >>>>>> parameters). Each smooth typically contributes only one or two fixed
> >> >>>>>> effect parameters, with the rest of the coefficients for the smooth
> >> >>>>>> treated as random effects.
> >> >>>>>>
> >> >>>>>> --- the edf for the gam representation of the same model differs in
> >> >>>>>> that it also counts the *effective* number of parameters used to
> >> >>>>>> represent each smooth: this includes contributions from all those
> >> >>>>>> coefficients that the lme representation treated as strictly random.
> >> >>>>>>
> >> >>>>>> best,
> >> >>>>>> Simon
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>> I would be very grateful to anybody who could point out a
> >> solution,
> >> >>>>>>>
> >> >>>>>>> Best wishes,
> >> >>>>>>>
> >> >>>>>>> Carlo
> >> >>>>>>>
> >> >>>>>>> Example below:
> >> >>>>>>>
> >> >>>>>>> ----
> >> >>>>>>>
> >> >>>>>>> rm(list = ls())
> >> >>>>>>> library(mgcv)
> >> >>>>>>> library(nlme)
> >> >>>>>>>
> >> >>>>>>> set.seed(123)
> >> >>>>>>>
> >> >>>>>>> x  <- runif(100,1,10)                                # regressor
> >> >>>>>>> b0 <- rep(rnorm(10,mean=1,sd=2),each=10)     # random intercept
> >> >>>>>>> id <- rep(1:10, each=10)                     # identifier
> >> >>>>>>>
> >> >>>>>>> y <- b0 + x - 0.1 * x^3 + rnorm(100,0,1)  # dependent variable
> >> >>>>>>>
> >> >>>>>>> f1 <- lme(y ~ x + I(x^2), random = list(id=~1), method="ML")  # lme model
> >> >>>>>>>
> >> >>>>>>> f2 <- gamm(y ~ s(x), random = list(id=~1), method="ML")       # gamm
> >> gamm
> >> >>>>>>>
> >> >>>>>>> ## same number of "df" according to logLik:
> >> >>>>>>> logLik(f1)
> >> >>>>>>> logLik(f2$lme)
> >> >>>>>>>
> >> >>>>>>> ## much higher edf according to summary:
> >> >>>>>>> summary(f2$gam)
> >> >>>>>>>
> >> >>>>>>> -----------
> >> >>>>>>>
> >> >>>>>>> ______________________________________________
> >> >>>>>>> R-help@r-project.org mailing list
> >> >>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
> >> >>>>>>> PLEASE do read the posting guide
> >> >>>>>>> http://www.R-project.org/posting-guide.html and provide
> >> commented,
> >> >>>>>>> minimal,
> >> >>>>>>> self-contained, reproducible code.
> >> >>>>>>
> >> >>>>>> --
> >> >>>>>>> Simon Wood, Mathematical Sciences, University of Bath, Bath, BA2 7AY UK
> >> >>>>>>> +44 1225 386603  www.maths.bath.ac.uk/~sw283
> >> >>>>>>
> >> >>>>>
> >> >>>>>
> >> >>>>
> >> >>>>
> >> >>>>
> >> >>>> --
> >> >>>> Joris Meys
> >> >>>> Statistical consultant
> >> >>>>
> >> >>>> Ghent University
> >> >>>> Faculty of Bioscience Engineering
> >> >>>> Department of Applied mathematics, biometrics and process control
> >> >>>>
> >> >>>> tel : +32 9 264 59 87
> >> >>>> joris.m...@ugent.be
> >> >>>> -------------------------------
> >> >>>> Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php
> >> >>>>
> >> >>>
> >> >>>
> >> >>>
> >> >>
> >> >>
> >> >>
> >> >>
> >> >
> >> >
> >> >
> >>
> >>
> >>
> >
> >
> 
                                          