Re: [R] mgcv: how select significant predictor vars when using gam(...select=TRUE) using automatic optimization

Simon Wood Thu, 18 Apr 2013 08:28:12 -0700

Jan,

Thanks for the data (off list). The p-value computations are based on the approximation that things are approximately normal on the linear predictor scale, but actually they are nowhere close to normal in this case, which is why the p-values look inconsistent. The reason that the approximate normality assumption doesn't hold is that the model is quite a poor fit. If you take a look at gam.check(fit) you'll see that the constant variance assumption of quasi(link=log) is violated quite badly, and the residual distribution is really quite odd (plot residuals against fitted as well). Also see plot(fit,pages=1,scale=0) - it shows ballooning confidence intervals and smooth estimates that are so low in places that they might as well be minus infinity (given log link) - clearly something is wrong with this model!
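
The checks mentioned above can be sketched roughly as follows. This is only an illustration on simulated stand-in data from mgcv's gamSim(), not the data discussed in this thread; the names dat, y and x0..x3 come from gamSim and are assumptions for the sake of a runnable example.

```r
library(mgcv)

set.seed(1)
# Simulated stand-in data (NOT the data from this thread): gamSim example 1
# returns a data frame with response y and covariates x0..x3.
dat <- gamSim(1, n = 400, dist = "normal", scale = 2, verbose = FALSE)

fit <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data = dat, method = "REML")

# Standard residual diagnostics (QQ plot, residuals vs linear predictor,
# residual histogram, response vs fitted) plus basis dimension checks.
gam.check(fit)

# Residuals against fitted values: look for non-constant spread.
plot(fitted(fit), residuals(fit, type = "deviance"),
     xlab = "fitted values", ylab = "deviance residuals")
abline(h = 0, lty = 2)

# All smooths on one page, each with its own y-axis scale, so that
# ballooning confidence bands are easy to spot.
plot(fit, pages = 1, scale = 0)
```

With a badly fitting model one would expect the residual plots to show a clear trend in spread, and the smooth plots to show very wide bands.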

I would be inclined to set all the zeros back to 0 (rather than 0.01), and then to try Tweedie(p=1.5,link=log) as the family. Also the predictor variables are very skewed, which is giving leverage problems, so I would transform them to give less skew, e.g. something like


fit <- gam(target ~ s(log(mgs)) + s(I(gsd^.5)) + s(I(mud^.25)) + s(log(ssCmax)),
           family=Tweedie(p=1.6,link=log), data=df, method="REML")

gives a model that is closer to being reasonable (p-values are then consistent between select=TRUE and FALSE).
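
One way to check that consistency is to put the smooth-term p-values from the two fits side by side. The sketch below does this on simulated count data from gamSim() (an assumption, standing in for the real data), with a Tweedie family at a fixed, illustrative p = 1.5; the object names are hypothetical.

```r
library(mgcv)

set.seed(2)
# Simulated stand-in data (NOT the data from this thread): non-negative
# counts with covariates x0..x3, where x3 has no true effect.
dat <- gamSim(1, n = 400, dist = "poisson", scale = 0.1, verbose = FALSE)

form <- y ~ s(x0) + s(x1) + s(x2) + s(x3)
fam  <- Tweedie(p = 1.5, link = log)  # p = 1.5 is an illustrative choice

# Same model, without and with the extra selection penalties.
fit_plain  <- gam(form, family = fam, data = dat, method = "REML", select = FALSE)
fit_select <- gam(form, family = fam, data = dat, method = "REML", select = TRUE)

# Approximate p-values of the smooth terms from both fits, side by side.
p_tab <- cbind(
  select_FALSE = summary(fit_plain)$s.table[, "p-value"],
  select_TRUE  = summary(fit_select)$s.table[, "p-value"]
)
print(signif(p_tab, 3))
```

If the model is adequate, the two columns should lead to the same conclusions about which terms matter.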


best,
Simon

On 18/04/13 14:24, Simon Wood wrote:

Jan,

Thanks for this. Is there any chance that you could send me the data off
list and I'll try to figure out what is happening? (Under the
understanding that I'll only use the data for investigating this issue,
of course).

best,
Simon

On 18/04/13 11:11, Jan Holstein wrote:

Simon,

thanks for the reply. I guess I'm pretty much up to date, using
mgcv 1.7-22.
Upgrading to R 3.0.0 also didn't change anything.

Unfortunately using method="REML" does not make any difference:

####### first with select=FALSE

fit<-gam(target~s(mgs)+s(gsd)+s(mud)+s(ssCmax),
family=quasi(link=log),data=wspe1,method="REML",select=FALSE)

summary(fit)


Family: quasi
Link function: log
Formula:
target ~ s(mgs) + s(gsd) + s(mud) + s(ssCmax)
Parametric coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   -4.724      7.462  -0.633    0.527
Approximate significance of smooth terms:
             edf Ref.df      F p-value
s(mgs)    3.118  3.492  0.099   0.974
s(gsd)    6.377  7.044 15.596  <2e-16 ***
s(mud)    8.837  8.971 18.832  <2e-16 ***
s(ssCmax) 3.886  4.051  2.342   0.052 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) =  0.403   Deviance explained = 40.6%
REML score =  33186  Scale est. = 8.7812e+05  n = 4511





#### Then using select=TRUE

fit2<-gam(target~s(mgs)+s(gsd)+s(mud)+s(ssCmax),
family=quasi(link=log),data=wspe1,method="REML",select=TRUE)

summary(fit2)

Family: quasi
Link function: log
Formula:
target ~ s(mgs) + s(gsd) + s(mud) + s(ssCmax)
Parametric coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   -6.406      5.239  -1.223    0.222
Approximate significance of smooth terms:
             edf Ref.df     F p-value
s(mgs)    2.844      8 25.43  <2e-16 ***
s(gsd)    6.071      9 14.50  <2e-16 ***
s(mud)    6.875      8 21.79  <2e-16 ***
s(ssCmax) 3.787      8 18.42  <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) =    0.4   Deviance explained = 40.1%
REML score =  33203  Scale est. = 8.8359e+05  n = 4511







I played around with other families/link functions, with no success
regarding the "select" behaviour.

Well, look at the structure of my data:
<http://r.789695.n4.nabble.com/file/n4664586/screen-capture-1.png>

All possible predictor variables in principle look like this, and taken
alone, each and every one is significant according to its p-value (but
they cannot all be significant at the same time).
In theory, the target variable should be a hypersurface in
11-dimensional space with lots of noise, but interactions of more than
2 variables get costly (not to think of 11), and often enough (also
without interaction) the solution does not converge at minimal step
size. If it does, results are usually not as good as without
interaction.

Any comment/advice on model setup is warmly welcome here.

Since I don't want to try out all 2047 possible combinations of up to
eleven predictor variables for each target variable, I currently see no
other way than educated manual guessing.
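
For reference, the figure of 2047 is the number of non-empty subsets of eleven predictors, which can be checked with a couple of lines of R (the variable names here are only illustrative):

```r
n_vars <- 11

# Count the subsets of size 1, 2, ..., 11 and add them up.
n_subsets <- sum(choose(n_vars, 1:n_vars))

# Equivalent closed form: all 2^11 subsets minus the empty one.
n_closed <- 2^n_vars - 1

print(c(n_subsets, n_closed))  # both are 2047
```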

If you know another way of (semi-)automated model tuning/reduction, I
would very much appreciate it.

best regards,
Jan






--
View this message in context:
http://r.789695.n4.nabble.com/mgcv-how-select-significant-predictor-vars-when-using-gam-select-TRUE-using-automatic-optimization-tp4664510p4664586.html

Sent from the R help mailing list archive at Nabble.com.

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



--
Simon Wood, Mathematical Science, University of Bath BA2 7AY UK
+44 (0)1225 386603               http://people.bath.ac.uk/sw283
