Thank you for the replies, Gavin and Bob.  You make some great points -
particularly about the variables being correlated.
The model is really for prediction rather than for generating meaningful
coefficients.  I'll look into the lasso and the elastic net (it looks like
"glmnet" is a good package to start with?).  The other complication is
that my data are zero-inflated and overdispersed, so many techniques that
are easily applied to typical regression models don't carry over as well
to the particular functions I've been using (the zero-inflated and hurdle
models in the "pscl" package).
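
A minimal sketch of what the elastic net route might look like with
"glmnet", for orientation only.  Everything here is an assumption made for
illustration: the data are simulated, the variable names (mean_temp, dd10,
etc.) are hypothetical, and a plain Poisson family is used, which does not
account for zero inflation or overdispersion.

```r
library(glmnet)

set.seed(1)
n <- 200

# Simulated, hypothetical stand-ins for correlated temperature summaries
base_temp <- rnorm(n, mean = 12, sd = 3)
X <- cbind(
  mean_temp = base_temp,
  max_temp  = base_temp + rnorm(n, 6, 1),
  min_temp  = base_temp - rnorm(n, 5, 1),
  dd0       = 100 * base_temp + rnorm(n, 0, 50),
  dd10      = 100 * pmax(base_temp - 10, 0) + rnorm(n, 0, 50)
)
X <- scale(X)

# Simulated seed counts (plain Poisson, no excess zeros)
seeds <- rpois(n, lambda = exp(1 + 0.4 * X[, "mean_temp"]))

# alpha = 1 is the lasso, alpha = 0 is ridge; 0.5 is an elastic net mix.
# cv.glmnet picks the penalty strength lambda by 10-fold cross-validation.
cv_fit <- cv.glmnet(X, seeds, family = "poisson", alpha = 0.5)

# Shrunken coefficients at the lambda with the lowest CV deviance;
# some may be shrunk exactly to zero
coefs <- coef(cv_fit, s = "lambda.min")
print(coefs)

# Predicted mean counts on the response scale
pred <- predict(cv_fit, newx = X, s = "lambda.min", type = "response")
```

The predictions above are in-sample; for an honest estimate of predictive
error they would need to be made on data held out from the fit.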

On Fri, May 25, 2012 at 1:18 AM, Gavin Simpson <gavin.simp...@ucl.ac.uk> wrote:

> On Thu, 2012-05-24 at 15:00 -0700, J Straka wrote:
> > Hello,
> >
> > I'm planning on using a regression model to describe seed set of plants
> > (my response) using some sort of predictor based on temperature.  I have
> > a number of temperature variables calculated from the same set of data
> > (hourly temperatures for the growing season, converted to variables such
> > as average temperature, maximum temperature, minimum temperature,
> > degree-days above zero Celsius, degree-days above ten Celsius, etc.), and
> > I want to decide which one should be included in my model. I know that I
> > would ideally select one based on "prior knowledge" of the system (e.g.
> > so-called "planned comparisons" or choosing a temperature threshold that
> > is known to be important for the development of seeds), but not much is
> > known about this system.
>
> What is the model for? Is it for understanding, so that you want to
> interpret the coefficients directly as something meaningful, or is it
> for prediction?
>
> If the latter I would say it doesn't really matter; choose the model
> that gives the best out-of-sample predictions (lowest error etc), or
> average predictions over a set of best/good models. Simply choosing the
> best model via some sort of selection procedure may result in a model
> with high variance (change the data a bit and different variables would
> be selected). If so, consider a regression method that applies shrinkage
> to the coefficients such as the lasso or the elastic net; this will lead
> to a small bit of bias in the estimates of the coefficients but should
> reduce the variance of the final model because you are considering the
> selection of variables as part of the model itself.
>
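
One way the "best out-of-sample predictions" idea could be sketched in base
R is k-fold cross-validation over the candidate predictors.  This is an
illustration under stated assumptions, not a recommendation: the data are
simulated, the column names are hypothetical, and an ordinary Poisson GLM
stands in for the zero-inflated and hurdle models discussed in the thread.

```r
set.seed(42)
n <- 150

# Simulated, hypothetical temperature summaries and seed counts
mean_temp <- rnorm(n, 12, 3)
dat <- data.frame(
  mean_temp = mean_temp,
  max_temp  = mean_temp + rnorm(n, 6, 2),
  dd10      = pmax(mean_temp - 10, 0) * 100 + rnorm(n, 0, 40)
)
dat$seeds <- rpois(n, exp(0.5 + 0.12 * dat$mean_temp))

candidates <- c("mean_temp", "max_temp", "dd10")
k <- 5
# Randomly assign each observation to one of k folds
fold <- sample(rep(seq_len(k), length.out = n))

# For each candidate predictor: fit on k - 1 folds, predict the held-out
# fold, and collect the squared prediction errors over all observations
cv_rmse <- sapply(candidates, function(v) {
  sq_err <- numeric(n)
  for (i in seq_len(k)) {
    train <- dat[fold != i, ]
    test  <- dat[fold == i, ]
    fit  <- glm(reformulate(v, response = "seeds"),
                family = poisson, data = train)
    pred <- predict(fit, newdata = test, type = "response")
    sq_err[fold == i] <- (test$seeds - pred)^2
  }
  sqrt(mean(sq_err))
})

# Lower out-of-sample RMSE means better predictions on held-out data
print(sort(cv_rmse))
```

With correlated predictors the differences in cross-validated error can be
small and unstable, which is the high-variance problem described above.
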
> If you want to interpret the model coefficients as something real then
> you have to be very careful doing any form of selection; the stepwise
> procedures and best subsets all can potentially lead to strong bias in
> the model coefficients. By removing a variable from the model in effect
> you are saying that the sample estimate of the effect of that variable
> on the response is 0, not some small (statistically insignificant)
> value.
>
> This is a very tricky thing to get right and I'm not sure I know the
> right answer (or even if there is one!?).
>
> > I've been warned against testing the significance of multiple predictors
> > using p-values, unless I use Bonferroni correction (or some equivalent).
> > Unfortunately, using Bonferroni correction would result in something
> > like p = 0.05/7 (for seven different temperature variables); a rather
> > small value for detecting anything! I was wondering whether it would be
> > appropriate to instead use likelihood-based techniques (direct
> > comparisons of log-likelihoods or AIC scores) to compare a series of
> > models using each of the alternative predictors in turn, and choose the
> > most relevant temperature variable (i.e. predictor) based on that.
>
> Choosing models by AIC or BIC is just the same as doing it using
> p-values; the selection procedure has all the problems I mention above.
> LRTs require a significance test of the ratio of the two likelihoods, so
> you are still doing a series of sequential tests whose overall error
> rate you might want to control.
>
> There are other corrections for multiple testing. For example, see the
> p.adjust() function in R for some options.
>
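
A short sketch of how p.adjust() could be applied to the seven-predictor
case from the original question.  The p-values and predictor names below
are made up purely for illustration.

```r
# Hypothetical raw p-values for seven temperature predictors (invented)
p_raw <- c(mean_temp = 0.004, max_temp = 0.012, min_temp = 0.030,
           dd0 = 0.041, dd10 = 0.008, dd5 = 0.200, temp_range = 0.650)

# Bonferroni per-test threshold for a family-wise alpha of 0.05
0.05 / length(p_raw)

# Adjusted p-values; compare each against 0.05 rather than 0.05/7.
# "bonferroni" and "holm" control the family-wise error rate,
# "BH" (Benjamini-Hochberg) controls the false discovery rate.
p_adj <- cbind(
  raw        = p_raw,
  bonferroni = p.adjust(p_raw, method = "bonferroni"),
  holm       = p.adjust(p_raw, method = "holm"),
  BH         = p.adjust(p_raw, method = "BH")
)
print(round(p_adj, 3))
```

Holm's method is never more conservative than Bonferroni while controlling
the same error rate, so it is a common first alternative.
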
> HTH
>
> G
>
> > Thoughts on the validity of this approach? Would any adjustments have to
> > be made for multiple comparisons if I used this strategy?
> >
> > Jason Straka
> > University of Victoria
> >
> > _______________________________________________
> > R-sig-ecology mailing list
> > R-sig-ecology@r-project.org
> > https://stat.ethz.ch/mailman/listinfo/r-sig-ecology
> >
>
> --
> %~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
>  Dr. Gavin Simpson             [t] +44 (0)20 7679 0522
>  ECRC, UCL Geography,          [f] +44 (0)20 7679 0565
>  Pearson Building,             [e] gavin.simpsonATNOSPAMucl.ac.uk
>  Gower Street, London          [w] http://www.ucl.ac.uk/~ucfagls/
>  UK. WC1E 6BT.                 [w] http://www.freshwaters.org.uk
> %~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
>
>
>
