First of all, thank you for replying to me.

I believe that using statistical software (like R) and understanding the 
statistical issues behind the software are two strongly linked things, but 
I understand that your purpose is to provide information on how R works (that 
is, how to use it). For that reason I know my questions were not really 
pertinent (or not pertinent at all!).
The truth is that it's summer time and I am having serious trouble getting the 
statistical advice I really need. I'm studying and reading a lot and in 
fact I'm improving my poor statistical knowledge every day (I'm a biologist), 
but, you know, it's not easy...
Your response has been very useful to me and I appreciate it a lot. Below I 
give some commentary (to explain things better and to thank you again) and I 
ask a few very short questions.
I appreciate your effort even if you don't answer me again. I hope I don't 
seem insincere (I'm not); I'm very grateful to you.
I promise to write again only with more pertinent questions.

Best wishes


> Date: Fri, 10 Jul 2009 17:45:50 -0700
> From: bol...@ufl.edu
> To: r-help@r-project.org
> Subject: Re: [R] generalized linear model (glm) and "stepAIC"
> 
> 
> 
>   If you possibly can, you should get some local statistical advice.
> There are a number of pitfalls involved in what you're doing.
> Frank Harrell's book is a good reference, but may be too advanced
> if you are a beginner.  You are right on the edge of having too little
> data for what you want to do (the rule of thumb is that you should
> have at least 10-20 responses per predictor), and stepwise regression is
> known to bias p-values (see e.g. Whittingham et al., J. Animal Ecology 2006).
> That said, here are some specific responses:
> 
> 
> Simone Santoro wrote:
> > 
> > 
> > I have 12 response variables (species growth rates) and two
> > environmental factors, and I want to test for a possible relation
> > between them.
> > 
> > The sample size is quite small (7 < n < 12, depending on the species).
> > 
> > I performed a Shapiro test (shapiro.test) to check the response
> > variables for normality, and 10 of the 12 response variables came out
> > normally distributed.
> > 
> 
>   The Shapiro test is probably not very powerful for such a small
> data set -- i.e., the data could be non-normal (in fact it almost
> certainly *is* non-normal) but the deviation
> is not detectable ...  where do your growth rates come from?
> Can you make a guess at their probable distribution?
>
> 
>>> The growth rates are calculated as ΔXt, where ΔXt = X(t+1) - Xt, Xt is
>>> loge(Nt), and Nt is the population size at time t.
>>> I use this rather than the population size directly because in a few cases
>>> (species population-size trends) I found autocorrelation (time lag = 1),
>>> whereas ΔXt did not show autocorrelation and was equivalent for my purpose:
>>> investigating whether "x1" or "x2" affects the population dynamics of these
>>> species. I would expect ΔXt to be normally distributed.
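>>> (On your point about the power of the Shapiro test at this sample size: a
>>> small sketch with simulated data (not my real data) would be something like
>>>
>>>   set.seed(1)
>>>   # proportion of simulated, clearly non-normal samples (n = 10) in which
>>>   # shapiro.test() actually detects the non-normality
>>>   mean(replicate(1000, shapiro.test(rexp(10))$p.value < 0.05))
>>>
>>> With n around 10 that proportion is well below 1, so a non-significant
>>> Shapiro test with my sample sizes says rather little.)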
>
> 
> > I fitted a generalized linear model in R and selected models by automatic
> > backward stepwise selection (the stepAIC procedure from the MASS package),
> > taking as the starting model the one with the additive effects of both
> > factors. This is the case for six of the species growth rates (six cases),
> > but for the other six I tested the effect of just one factor ("x2", see
> > below) using the "glm" procedure alone.
> > 
> 
> If you're assuming normality you don't need glm(), just lm(),
> although glm() will give you the same answer less efficiently
> (a common confusion, probably due in part to SAS's PROC GLM,
> which is about the same as lm()).
>
>
>>> Ok, I understand.
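>>> Using the same names as above, the two calls below should then give
>>> identical coefficients (just a sketch, to record the point for myself):
>>>
>>>   fit.glm <- glm(sp1 ~ x1 + x2, family = gaussian, data = data)
>>>   fit.lm  <- lm(sp1 ~ x1 + x2, data = data)
>>>   coef(fit.glm)   # same estimates ...
>>>   coef(fit.lm)    # ... lm() just gets there more directly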
>
>
> Why different procedures for different cases?
>
> 
>>> I'm not sure whether you are suggesting that I use different procedures for
>>> different cases, or asking me why I used different procedures for different
>>> cases; if the latter: I didn't.
>
> 
> > So, my object containing the data is called "data" and, this is the editor
> > for the first species (sp1):
> > 
> > GLM1<-glm(growth.sp1~x1+x2,family=gaussian, data)
> > 
> > MOD.SELECTION<-stepAIC(GLM1, trace=TRUE) 
> > 
> > summary(MOD.SELECTION)
> > 
> > 
> > Here I attach an example of one of these analyses and after I finally
> > give you my questions (I hope not to be too long-winded!!):
> > 
> > 
> > 
> >> sp1.starting.model<-glm(sp1~x1+x2,family=gaussian, data)
> > 
> >> sp1.back<-stepAIC(sp1.starting.model, trace=TRUE)
> > 
> > Start:  AIC=63.6
> > 
> > sp1 ~ x1 + x2
> > 
> >        Df Deviance     AIC
> > 
> > - x2     1   73.490  61.801
> > 
> > <none>      72.278  63.602
> > 
> > - x1    1  122.659  67.949
> > 
> > 
> > 
> > Step:  AIC=61.8
> > 
> > gpf ~ x1
> > 
> > 
> > 
> >        Df Deviance     AIC
> > 
> > <none>      73.490  61.801
> > 
> > - x1    1  126.400  66.309
> > 
> >> summary(sp1.back)
> > 
> > 
> > 
> > Call:
> > 
> > glm(formula = sp1 ~ x1, family = gaussian, data = data)
> > 
> > 
> > 
> > Deviance Residuals: 
> > 
> >      Min        1Q    Median        3Q       Max  
> > 
> > -5.04833  -1.15233  -0.06802   0.81325   5.11464  
> > 
> > 
> > 
> > Coefficients:
> > 
> >             Estimate Std. Error t value Pr(>|t|)  
> > 
> > (Intercept) -7.62399    3.11127  -2.450   0.0342 *
> > 
> > x1           0.20595    0.07675   2.683   0.0230 *
> > 
> > ---
> > 
> > Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
> > 
> > 
> > 
> > (Dispersion parameter for gaussian family taken to be 7.348965)
> > 
> > 
> > 
> >     Null deviance: 126.40  on 11  degrees of freedom
> > 
> > Residual deviance:  73.49  on 10  degrees of freedom
> > 
> >   (1 observation deleted due to missingness)
> > 
> > AIC: 61.801
> > 
> > 
> > 
> > Number of Fisher Scoring iterations: 2
> > 
> > 
> > 
> 
>  You would probably be better off just doing summary() and
> looking at the p-values of the two predictors (if you must ...)
> 
> Why are you using AIC if you're interested in testing
> relationships rather than prediction?
>
> 
>>> So, from reading the Whittingham et al. paper (thank you very much) and from
>>> your comments, I understand that I would be better off using the "full"
>>> model (just the two predictors) and looking at its p-values, rather than
>>> using the stepAIC procedure. Is that right?
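>>> That is, something like this (a sketch, reusing the names from above):
>>>
>>>   fit <- lm(sp1 ~ x1 + x2, data = data)
>>>   summary(fit)   # read the p-values for x1 and x2 directly,
>>>                  # with no stepwise selection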
>
>  
> > THE QUESTIONS:
> > 
> > 1) Can I trust in the existence of such a statistical relation? I mean: is
> > there a way to know the power of this test in R?
> > 
> 
> There are power tests in R, but I don't know if there are any specifically
> for this case (two-predictor regression).  Remember that power applies to
> the probability of type II (falsely failing to reject null hypothesis)
> errors.
>
>
>>> OK. On the other hand, I suppose that finding a statistical relation between
>>> the predictor and the response variable despite the small sample size makes
>>> it even more reliable, doesn't it?
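>>> (A possibly relevant option is pwr.f2.test() from the add-on "pwr" package,
>>> which does power calculations for the general linear model. A sketch, with
>>> n = 12 as in the example above and a placeholder effect size:
>>>
>>>   # install.packages("pwr")   # not part of base R
>>>   library(pwr)
>>>   # u = number of predictors, v = residual df = n - u - 1,
>>>   # f2 = assumed effect size (the 0.35 here is only a placeholder)
>>>   pwr.f2.test(u = 2, v = 12 - 2 - 1, f2 = 0.35, sig.level = 0.05)
>>>
>>> I am not sure this is exactly right for my case, though.)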
> 
> 
> > 2) I decided to always use "family=gaussian" because I also have negative
> > values in my response variable and I cannot handle them in any other way.
> > In fact, I was not able to use a negative binomial family, as I had
> > previously done in SAS, because of the negative values of the response
> > variable (as R "told" me when I tried).
> > 
> 
> Is this a question?  As above, glm() with gaussian family and
> identity (default) link is equivalent to lm().
> 
>
>>> Yes, now I understand, and I'm embarrassed because I'm aware it is a very
>>> basic statistical issue (I'm sorry!). But if I strongly believe the response
>>> variable is normally distributed, even though the small sample size makes it
>>> difficult to test for normality, can I use lm() without testing for
>>> normality? In other words: can I trust, on logical grounds, that the
>>> statistical population behind the sample is normally distributed, and
>>> consequently use lm()?
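>>> (For what it's worth, my understanding is that the assumption concerns the
>>> residuals rather than the raw response, so perhaps a quick visual check of
>>> the fitted model is more useful than a formal test. A sketch:
>>>
>>>   fit <- lm(sp1 ~ x1 + x2, data = data)
>>>   qqnorm(residuals(fit))   # rough check of residual normality
>>>   qqline(residuals(fit))
>>>
>>> Please correct me if that is wrong.)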
>
> 
> > 3) How should I interpret the dispersion value R gives me (in the case
> > reported it was "7.3")? I mean, what range of values (if such a range
> > exists) should I expect in order for the result to be reliable with
> > "family=gaussian" (I'm not interested in prediction but just in finding
> > a statistical relation)?
> > 
> 
> In the case of the Gaussian, the dispersion is just the variance.
> If you're worried about overdispersion, that's really just a problem
> with models (binomial, Poisson) that assume a fixed relationship
> between mean and variance.
> 
>
>>> I had been looking everywhere for this piece of information but couldn't
>>> find it: thank you very much!!
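>>> (Checking against the output above: 73.49 / 10 = 7.349, i.e. the residual
>>> deviance divided by the residual degrees of freedom, which matches the
>>> reported dispersion parameter, so it really is just the residual variance.)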
>
>
>   If you really were going to use AIC, you should strongly
> consider AICc -- corrected for small sample sizes ...
> 
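>>> (Noted, thank you. For my own reference, the small-sample correction is, if
>>> I have it right, AICc = AIC + 2*k*(k+1)/(n - k - 1), with k the number of
>>> estimated parameters. A sketch in R (the function name is just my own):
>>>
>>>   AICc <- function(fit) {
>>>       k <- attr(logLik(fit), "df")    # parameters, including the dispersion
>>>       n <- length(residuals(fit))     # observations actually used
>>>       AIC(fit) + 2 * k * (k + 1) / (n - k - 1)
>>>   }
>>>   AICc(sp1.back)                      # e.g. for the selected model above
>>>
>>> I believe add-on packages such as MuMIn or AICcmodavg provide AICc directly.)
> 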
>   Ben Bolker
> 
