Thank you for the replies, Gavin and Bob. You make some great points, particularly about the variables being correlated. The model is really for prediction rather than for generating meaningful coefficients, so I'll look into the lasso and elastic net (it looks like "glmnet" is a good package to start with?). The other interesting complication is that my data are zero-inflated and overdispersed, so a lot of techniques that are easily applied to typical regression models don't work as well with the particular functions I've been using (zero-inflated and hurdle models in the "pscl" package).
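For anyone following the thread, a first elastic-net fit with glmnet might look roughly like the sketch below. Everything in it is an assumption made for illustration: the data are simulated, the names `seeds`, `x` and `temp1` to `temp7` are hypothetical, and `alpha = 0.5` is an arbitrary mixing value. glmnet's built-in families do not include a zero-inflated or hurdle model, so the Poisson family here is only a stand-in and would not address the zero inflation or overdispersion mentioned above.

```r
# Minimal sketch: elastic net on a count response with correlated predictors.
# All data are simulated; object names are hypothetical.
library(glmnet)

set.seed(1)
n <- 200
shared <- rnorm(n)

# Seven correlated "temperature" predictors: a shared signal plus noise
x <- sapply(1:7, function(j) shared + rnorm(n, sd = 0.5))
colnames(x) <- paste0("temp", 1:7)

# Simulated count response that depends on the first predictor only
seeds <- rpois(n, lambda = exp(0.5 + 0.3 * x[, 1]))

# Hold out 50 observations to estimate out-of-sample error
test_idx <- sample(n, 50)

# Cross-validated elastic net (alpha = 1 would be the lasso, alpha = 0 ridge)
cv_fit <- cv.glmnet(x[-test_idx, ], seeds[-test_idx],
                    family = "poisson", alpha = 0.5)

# Shrunken coefficients at the more conservative "1se" penalty
print(coef(cv_fit, s = "lambda.1se"))

# Predicted mean counts for the held-out observations
pred <- as.numeric(predict(cv_fit, newx = x[test_idx, ],
                           s = "lambda.1se", type = "response"))

# Root mean squared prediction error on the held-out set
rmse <- sqrt(mean((seeds[test_idx] - pred)^2))
print(rmse)
```

The held-out error gives a common yardstick for comparing this fit with other candidates, for example single-predictor models or the pscl fits, provided they are all scored on the same held-out observations.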
On Fri, May 25, 2012 at 1:18 AM, Gavin Simpson <gavin.simp...@ucl.ac.uk> wrote:

> On Thu, 2012-05-24 at 15:00 -0700, J Straka wrote:
> > Hello,
> >
> > I'm planning on using a regression model to describe seed set of plants
> > (my response) using some sort of predictor based on temperature. I have a
> > number of temperature variables calculated from the same set of data
> > (hourly temperatures for the growing season, converted to variables such
> > as average temperature, maximum temperature, minimum temperature,
> > degree-days above zero Celsius, degree days above ten Celsius, etc...),
> > and I want to decide which one should be included in my model. I know
> > that I would ideally select one based on "prior knowledge" of the system
> > (e.g. so-called "planned comparisons" or choosing a temperature threshold
> > that is known to be important for the development of seeds), but not much
> > is known about this system.
>
> What is the model for? Understanding so you want to interpret the
> coefficients directly as something meaningful or for prediction?
>
> If the latter I would say it doesn't really matter; choose the model
> that gives the best out-of-sample predictions (lowest error etc), or
> average predictions over a set of best/good models. Simply choosing the
> best model via some sort of selection procedure may result in a model
> with high variance (change the data a bit and different variables would
> be selected). If so, consider a regression method that applies shrinkage
> to the coefficients such as the lasso or the elastic net; this will lead
> to a small bit of bias in the estimates of the coefficients but should
> reduce the variance of the final model because you are considering the
> selection of variables as part of the model itself.
>
> If you want to interpret the model coefficients as something real then
> you have to be very careful doing any form of selection; the stepwise
> procedures and best subsets can all potentially lead to strong bias in
> the model coefficients. By removing a variable from the model, in effect
> you are saying that the sample estimate of the effect of that variable
> on the response is 0, not some small (statistically insignificant)
> value.
>
> This is a very tricky thing to get right and I'm not sure I know the
> right answer (or even if there is one!?).
>
> > I've been warned against testing the significance of multiple predictors
> > using p-values, unless I use Bonferroni correction (or some equivalent).
> > Unfortunately, using Bonferroni correction would result in something
> > like p = 0.05/7 (for seven different temperature variables); a rather
> > small value for detecting anything! I was wondering whether it would be
> > appropriate to instead use likelihood-based techniques (direct
> > comparisons of log-likelihoods or AIC scores) to compare a series of
> > models using each of the alternative predictors in turn, and choose the
> > most relevant temperature variable (i.e. predictor) based on that.
>
> Choosing models by AIC or BIC is just the same as doing it using
> p-values; the selection procedure has all the problems I mention above.
> LRTs require a significance test of the ratio of the two likelihoods, so
> you are still doing a series of sequential tests that you might want to
> control the overall error rate of.
>
> There are other corrections for multiple testing. For example, see the
> p.adjust() function in R for some options.
>
> HTH
>
> G
>
> > Thoughts on the validity of this approach? Would any adjustments have to
> > be made for multiple comparisons if I used this strategy?
> >
> > Jason Straka
> > University of Victoria
> >
> > _______________________________________________
> > R-sig-ecology mailing list
> > R-sig-ecology@r-project.org
> > https://stat.ethz.ch/mailman/listinfo/r-sig-ecology
>
> --
> Dr. Gavin Simpson             [t] +44 (0)20 7679 0522
> ECRC, UCL Geography,          [f] +44 (0)20 7679 0565
> Pearson Building,             [e] gavin.simpsonATNOSPAMucl.ac.uk
> Gower Street, London          [w] http://www.ucl.ac.uk/~ucfagls/
> UK. WC1E 6BT.                 [w] http://www.freshwaters.org.uk