Dear R list, I’m trying to do a GAM model / variable importance using the mgcv R package. I’ve been googling around, and found posts <https://people.maths.bris.ac.uk/~sw15190/mgcv/tampere/mgcv-advanced.pdf> stating that select=T is a good way to "penalized out” less important variables.
m1 <- bam(Y ~ s(time, by=grp, k=40, bs="cc") + s(host_id, bs="re") + s(grp, bs="re") + s(tempmax_mean) + s(rain_mean) + s(area_tot) + s(area_unique) + s(diet_PCA1) + s(diet_PCA2) + s(diet_PCA3), data=df_subset, select=T) >From the summary of m1 we can see that s(grp) and s(area_tot) are not >significant (they weren’t significant for select=F either, but their F and >p-value is smaller when select=T). Ok, so far so good (continues after >summary). summary(m1) Family: gaussian Link function: identity Formula: Dim1 ~ s(time2, by = grp, k = 40, bs = "cc") + s(host_id, bs = "re") + s(grp, bs = "re") + s(tempmax_mean) + s(rain_mean) + s(area_tot) + s(frac_uniq) + s(diet_PC1) + s(diet_PC2) + s(diet_PC3) Parametric coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.03382 0.03636 0.93 0.352 Approximate significance of smooth terms: edf Ref.df F p-value s(time):grp1.1 31.398 38 14.326 < 2e-16 *** s(time):grp1.21 31.370 38 22.103 < 2e-16 *** s(time):grp1.22 33.178 38 38.372 < 2e-16 *** s(time):grp2.1 31.892 38 17.257 < 2e-16 *** s(time):grp2.2 27.425 38 38.245 < 2e-16 *** s(host_id) 379.858 453 14.665 < 2e-16 *** s(grp) 1.874 4 8.478 0.29624 s(tempmax_mean) 8.064 9 17.769 < 2e-16 *** s(rain_mean) 8.056 9 15.347 < 2e-16 *** s(area_tot) 1.597 9 0.649 0.08644 . s(area_unique) 4.302 9 8.697 0.00113 ** s(diet_PC1) 5.243 9 32.816 < 2e-16 *** s(diet_PC2) 4.020 9 3.679 1.07e-05 *** s(diet_PC3) 5.116 9 11.008 < 2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 R-sq.(adj) = 0.468 Deviance explained = 49.2% fREML = 15059 Scale est. = 0.52766 n = 12823 I came across another thread <http://r.789695.n4.nabble.com/Re-variance-explained-by-each-predictor-in-GAM-td896222.html> which stated that you could calculate variance explained by each predictor doing as below. So I did this for my model m1 (please see Figure1) and plotted deviance explained (please see Figure2) and AIC (please see Figure3) for each model. While the deviance explained for Model b-j only somewhat increases, the AIC between Model e-f decreases a quite lot (from 34422.59 to 29508.39). So it would seem that the inclusion of s(area_tot) increases model fit. However, as seen in the summary of m1 above (as well as the summary for model f; not shown), the term s(area_tot) is not significant in the models. So what may cause this discrepancy? In this case, can I not use AIC to compare between models? While I would use model m1 with select=T, I figured this could be a good way to visualize model fit / variable importance. Many thanks for any pointers, Cheers, Johannes ------------------------------------------ Johannes R Björk, PhD Postdoctoral research fellow Archie Lab Department of Biological Sciences University of Notre Dame, IN, USA Alt. email: rbj...@nd.edu Skype: johannesbjork ## Example on how to calculate variance explained by each predictor dat <- gamSim(1,n=400,dist="normal",scale=2) b<-gam(y~s(x0)+s(x1)+s(x2)+s(x3),data=dat) b2<-gam(y~s(x0)+s(x1)+s(x3),sp=b$sp[-3],data=dat) summary(b2)$dev.expl summary(b)$dev.expl Figure1 Figure2 Figure3 _______________________________________________ R-sig-ecology mailing list R-sig-ecology@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-ecology