hi,

thanks for the reply.

1) sorry for the mistake (GCV and UBRE do indeed estimate df in both
cases); I was confused.

2) I use mgcv 1.1-8 (!), so indeed I need an update.

my questions:

3) I can see an estimated rank (Est.rank) in the latest version of
summary.gam: how do I interpret it?

4) I'm still confused by this question: should I look at R-sq rather
than deviance explained, or both?

5) How can I estimate numerically the contribution of each smooth
relative to the others? In other words, is there a way to quantify
significance, like a percentage of how much the model is improved by
each of my smooths? (a rough sketch of what I mean is below)
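
For example, would something like this crude comparison make sense? Just
a sketch on simulated data (gamSim() ships with mgcv; the model is
hypothetical), comparing deviance explained with and without one term:

library(mgcv)
set.seed(1)
dat  <- gamSim(1, n = 400)                    # simulated example data
full <- gam(y ~ s(x0) + s(x1) + s(x2), data = dat)
nox0 <- gam(y ~ s(x1) + s(x2), data = dat)    # same model minus s(x0)
## crude "contribution" of s(x0), in percentage points of deviance explained
100 * (summary(full)$dev.expl - summary(nox0)$dev.expl)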

6) mgcv::gam and gam::gam can't be compared directly; they are two
distinct approaches to GAMs, aren't they?

for Denis:

7) to "validate" the significance of a smooth whose p-value is between
0.01 and 0.05, there are no perfect rules...
you have to decide according to your needs: look at the df, the se of
your smooth, how much your model is improved,
whether the parametrization looks sensible given what you want/know
about your predictor, etc. But that's my point of view; maybe someone
has a more precise criterion.

Yves

On Thu 29/09/2005 at 09:55, Henric Nilsson wrote:

> Yves Magliulo said the following on 2005-09-28 17:05:
> > hi,
> > 
> > I'll try to help you. I sent a mail about this subject last week... and
> > I did not get any response...
> > 
> > I'm using gam from package mgcv. 
> > 
> > 1)
> > How to interpret the significance of smooth terms is hard for me to
> > understand perfectly:
> > using UBRE, you fix the df, and p-values are estimated from a chi-sq
> > distribution;
> > using GCV, the best df are estimated by gam (that's what I want), and
> > p-values
> 
> This is not correct. The df are estimated in both cases (i.e. UBRE and 
> GCV), but the scale parameter is fixed in the UBRE case. Hence, by 
> default UBRE is used for family = binomial or poisson since the scale 
> parameter is assumed to be 1. Similarly, GCV is the default for family = 
> gaussian since we most often want the scale (usually denoted sigma^2) to 
> be estimated.
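> 
> For instance (a minimal sketch on simulated data, not from the help
> pages; gamSim() ships with mgcv):
> 
>  > library(mgcv)
>  > set.seed(1)
>  > ## gaussian response: scale (sigma^2) unknown, so GCV is the default
>  > dat <- gamSim(1, n = 400)
>  > b1 <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data = dat)
>  > ## poisson response: scale assumed known (= 1), so UBRE is the default
>  > datp <- gamSim(1, n = 400, dist = "poisson", scale = 0.25)
>  > b2 <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), family = poisson,
>  +           data = datp)
> 
> (If I read ?gam correctly, the scale argument can override this; e.g. a
> negative scale treats the scale parameter as unknown and hence uses GCV.)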
> 
> > are estimated from an F distribution. But in that case they say "use at
> > your own risk" in ?summary.gam
> 
> The warning applies in both cases. The p-values are conditional on the 
> smoothing parameters, and the uncertainty of the smooths is not taken 
> into account when computing the p-values.
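> 
> To make "conditional on the smoothing parameters" concrete (my sketch,
> reusing the simulated dat above): the reported p-values are computed as
> if the smoothing parameters had been fixed in advance, i.e. as if the
> model had been fitted like this:
> 
>  > b <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data = dat)
>  > ## refit with the smoothing parameters fixed at their estimates b$sp;
>  > ## the p-values ignore the uncertainty in estimating b$sp
>  > b.fixed <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), sp = b$sp,
>  +                data = dat)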
> 
> > so you can also look at the chi.sq: but I don't know how to choose a
> 
> No...
> 
> > criterion like for p-values... for me, chi.sq shows the best predictor in
> > a model, but it's hard to reject one with it.
> 
> Which version of mgcv do you use? The confusion probably stems from 
> earlier versions of mgcv (< 1.3-5): the summary and anova methods used 
> to have a column denoted Chi.sq even when the displayed statistic was 
> computed as F. Recent versions of mgcv have
> 
>  > summary(b)
> 
> Family: gaussian
> Link function: identity
> 
> Formula:
> y ~ s(x0) + s(x1) + s(x2) + s(x3)
> 
> Parametric coefficients:
>              Estimate Std. Error t value Pr(>|t|)
> (Intercept)   7.9150     0.1049   75.44   <2e-16 ***
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> 
> Approximate significance of smooth terms:
>          edf Est.rank      F  p-value
> s(x0) 5.173    9.000  3.785 0.000137 ***
> s(x1) 2.357    9.000 34.631  < 2e-16 ***
> s(x2) 8.517    9.000 84.694  < 2e-16 ***
> s(x3) 1.000    1.000  0.444 0.505797
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> 
> R-sq.(adj) =  0.726   Deviance explained = 73.7%
> GCV score =  4.611   Scale est. = 4.4029    n = 400
> 
> 
> If we assume that the scale is known and fixed at 4.4029, we get
> 
>  > summary(b, dispersion = 4.4029)
> 
> Family: gaussian
> Link function: identity
> 
> Formula:
> y ~ s(x0) + s(x1) + s(x2) + s(x3)
> 
> Parametric coefficients:
>              Estimate Std. Error z value Pr(>|z|)
> (Intercept)   7.9150     0.1049   75.44   <2e-16 ***
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> 
> Approximate significance of smooth terms:
>          edf Est.rank  Chi.sq p-value
> s(x0) 5.173    9.000  34.067 8.7e-05 ***
> s(x1) 2.357    9.000 311.679 < 2e-16 ***
> s(x2) 8.517    9.000 762.255 < 2e-16 ***
> s(x3) 1.000    1.000   0.444   0.505
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> 
> R-sq.(adj) =  0.726   Deviance explained = 73.7%
> GCV score =  4.611   Scale est. = 4.4029    n = 400
> 
> Note that t/F changed into z/Chi.sq.
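> 
> (Numerically the two outputs agree: Chi.sq = F * Est.rank here, e.g.
> 3.785 * 9 ~ 34.07 for s(x0), since the dispersion supplied equals the
> scale estimate used by the F statistics.)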
> 
> > 
> > so as far as I'm concerned, I use the GCV method and set a 5% threshold
> > on the null hypothesis (p-value) to select significant predictors. After
> > that, I look at my smooths, and if the parametrization looks fine to me,
> > I validate.
> > 
> > generally, for p-values smaller than 0.001, you can be confident; above
> > 0.001, you have to check.
> > 
> > 2)
> > for the difference between packages gam and mgcv, I sent a mail about this
> 
> The underlying algorithms are very different.
> 
> 
> HTH,
> Henric
> 
> > one year ago; here's the response:
> > 
> > "
> > - package gam is based very closely on the GAM approach presented in
> > Hastie and Tibshirani's "Generalized Additive Models" book. Estimation
> > is by back-fitting, and model selection uses step-wise regression
> > methods based on approximate distributional results. A particular
> > strength of this approach is that local regression smoothers (`lo()'
> > terms) can be included in GAM models.
> > 
> > - gam in package mgcv represents GAMs using penalized regression
> > splines. Estimation is by direct penalized likelihood maximization with
> > integrated smoothness estimation via GCV or related criteria (there is
> > also an alternative `gamm' function based on a mixed model approach).
> > Strengths of this approach are that s() terms can be functions of more
> > than one variable and that tensor product smooths are available via
> > te() terms - these are useful when different degrees of smoothness are
> > appropriate relative to different arguments of a smooth.
> > 
> > (...)
> > 
> > Basically, if you want integrated smoothness selection, an underlying 
> > parametric representation, or want smooth interactions in your models 
> > then mgcv is probably worth a try (but I would say that). If you want to
> > use local regression smoothers and/or prefer the stepwise selection 
> > approach then package gam is for you. 
> > "
> > 
> > I think the difference in p-values between gam::gam and mgcv::gam is
> > because you don't have the same number of iterations: mgcv::gam chooses
> > the number of steps itself, while with gam::gam you have to choose it...
> > 
> > hope it helps, and that someone gives us more details...
> > 
> > Yves
> > 
> > 
> > On Wed 28/09/2005 at 15:30, Denis Chabot wrote:
> > 
> > 
> >>I only got one reply to my message:
> >>
> >>
> >>>No, this won't work.  The problem is the usual one with model  
> >>>selection: the p-value is calculated as if the df had been fixed,  
> >>>when really it was estimated.
> >>>
> >>>It is likely to be quite hard to get an honest p-value out of  
> >>>something that does adaptive smoothing.
> >>>
> >>>    -thomas
> >>
> >>I do not understand this: it seems that a lot of people choose df=4
> >>for no particular reason, and yet their p-levels are correct. If
> >>instead I choose df=8 because a previous model has estimated this to
> >>be an optimal df, the p-levels are no good because the df were estimated?
> >>
> >>Furthermore, shouldn't packages gam and mgcv give similar results  
> >>when the same data and df are used? I tried this:
> >>
> >>library(gam)
> >>data(kyphosis)
> >>kyp1 <- gam(Kyphosis ~ s(Age, 4), family=binomial, data=kyphosis)
> >>kyp2 <- gam(Kyphosis ~ s(Number, 4), family=binomial, data=kyphosis)
> >>kyp3 <- gam(Kyphosis ~ s(Start, 4), family=binomial, data=kyphosis)
> >>anova.gam(kyp1)
> >>anova.gam(kyp2)
> >>anova.gam(kyp3)
> >>
> >>detach(package:gam)
> >>library(mgcv)
> >>kyp4 <- gam(Kyphosis ~ s(Age, k=4, fx=TRUE), family=binomial, data=kyphosis)
> >>kyp5 <- gam(Kyphosis ~ s(Number, k=4, fx=TRUE), family=binomial, data=kyphosis)
> >>kyp6 <- gam(Kyphosis ~ s(Start, k=4, fx=TRUE), family=binomial, data=kyphosis)
> >>anova.gam(kyp4)
> >>anova.gam(kyp5)
> >>anova.gam(kyp6)
> >>
> >>
> >>p-values for these models, by pair:
> >>
> >>kyp1 vs kyp4: p= 0.083 and 0.068 respectively (not too bad)
> >>kyp2 vs kyp5: p= 0.445 and 0.03 (wow!)
> >>kyp3 vs kyp6: p= 0.053 and 0.008 (wow again)
> >>
> >>Also if you plot all these, you find that the mgcv plots are smoother
> >>than the gam plots, even though the same df are used all the time.
> >>
> >>I am really confused now!
> >>
> >>Denis
> >>
> >>Begin forwarded message:
> >>
> >>
> >>>>From: Denis Chabot <[EMAIL PROTECTED]>
> >>>>Date: 26 September 2005 12:25:04 EDT
> >>>>To: r-help@stat.math.ethz.ch
> >>>>Subject: p-level in packages mgcv and gam
> >>>>
> >>>>
> >>>>Hi,
> >>>>
> >>>>I am fairly new to GAM and started using package mgcv. I like the  
> >>>>fact that optimal smoothing is automatically used (i.e. df are not  
> >>>>determined a priori but calculated by the gam procedure).
> >>>>
> >>>>But the mgcv manual warns that p-level for the smooth can be  
> >>>>underestimated when df are estimated by the model. Most of the  
> >>>>time my p-levels are so small that even doubling them would not  
> >>>>result in a value close to the P=0.05 threshold, but I have one  
> >>>>case with P=0.033.
> >>>>
> >>>>I thought, probably naively, of running a second model with
> >>>>fixed df, using the value of df found in the first model. I could
> >>>>not achieve this with mgcv: its gam function does not seem to
> >>>>accept fractional values of df (in my case 8.377).
> >>>>
> >>>>So I used the gam package and fixed df to 8.377. The P-value I  
> >>>>obtained was slightly larger than with mgcv (0.03655 instead of  
> >>>>0.03328), but it is still < 0.05.
> >>>>
> >>>>Was this a correct way to get around the "underestimated P-level"?
> >>>>
> >>>>Furthermore, although the gam.check function of the mgcv package  
> >>>>suggests to me that the gaussian family (and identity link) are  
> >>>>adequate for my data, I must say the instructions in R help for  
> >>>>"family" and in Hastie, T. and Tibshirani, R. (1990) Generalized  
> >>>>Additive Models are too technical for me. If someone knows a  
> >>>>reference that explains how to choose model and link, i.e. what  
> >>>>tests to run on your data before choosing, I would really  
> >>>>appreciate it.
> >>>>
> >>>>Thanks in advance,
> >>>>
> >>>>Denis Chabot
> >>>>
> >>>
> 
> 


______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
