First of all, thank you for replying to me. I believe that using statistical software (like R) and understanding the statistical issues behind the software are two strongly linked things, but I understand that your scope is providing information on the way R works (that is, how to use it). For that reason I know my questions were not really pertinent (or not at all!). The truth is that it's summer-time and I'm having serious trouble getting the statistical advice I really need. I'm studying and reading a lot, and in fact I'm improving my poor statistical knowledge every day (I'm a biologist), but, you know, it's not easy... Your response has been very useful to me and I appreciate it a lot. Below I give some commentary (to better explain things and to thank you again) and I ask a few very short questions. I will appreciate your effort even if you don't answer me again. I hope I don't seem hypocritical (I'm not), but I'm very grateful to you. I promise to write again only with more pertinent questions.
Best wishes

> Date: Fri, 10 Jul 2009 17:45:50 -0700
> From: bol...@ufl.edu
> To: r-help@r-project.org
> Subject: Re: [R] generalized linear model (glm) and "stepAIC"
>
> If you possibly can, you should get some local statistical advice.
> There are a number of pitfalls involved in what you're doing.
> Frank Harrell's book is a good reference, but may be too advanced
> if you are a beginner. You are right on the edge of having too little
> data for what you want to do (the rule of thumb is that you should
> have at least 10-20 responses per predictor), and stepwise regression
> is known to inflate p-values (see e.g. Whittingham et al., J. Animal
> Ecology 2006). That said, here are some specific responses:
>
> Simone Santoro wrote:
> >
> > I have 12 response variables (species growth rates) and two
> > environmental factors that I want to test to find out a possible
> > relation.
> >
> > The sample size is quite small (7 < n < 12, depending on each
> > species-case).
> >
> > I performed a Shapiro test (shapiro.test) to test for normal
> > distribution of the response variables, and they were normally
> > distributed in 10 of the 12 cases (i.e. 12 response variables).
>
> The Shapiro test is probably not very powerful for such a small
> data set -- i.e., the data could be non-normal (in fact it almost
> certainly *is* non-normal) but the deviation is not detectable ...
> Where do your growth rates come from? Can you make a guess at their
> probable distribution?

>>> The growth rates are calculated as ΔXt, where ΔXt = Xt+1 - Xt,
>>> Xt is loge(Nt), and Nt is the population size at time t.
>>> I use this, rather than the population size directly, because in a
>>> few cases (species population-size trends) I found autocorrelation
>>> (time lag = 1); the ΔXt, however, showed no autocorrelation and was
>>> equivalent for my purpose: investigating whether "x1" or "x2"
>>> affected the population dynamics of these species.
>>> I would expect ΔXt to be normally distributed.

> > I performed a generalized linear model in R (MASS package), and I
> > selected models by automatic backward stepwise selection (the
> > stepAIC procedure), taking as the starting model the one with the
> > additive effects of both factors. This is the case for six species
> > growth rates (six cases), but for the other six I tested the effect
> > of just one factor ("x2", see below) using just the "glm" procedure.
>
> If you're assuming normality you don't need glm(), just lm(),
> although glm() will give you the same answer less efficiently
> (a common confusion, probably due in part to SAS's PROC GLM,
> which is about the same as lm()).

>>> OK, I understand.

> Why different procedures for different cases?

>>> I don't understand whether you are suggesting that I use different
>>> procedures for different cases, or asking me why I used different
>>> procedures for different cases. If the latter: I didn't.
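For what it's worth, the two points above can be checked with a minimal, self-contained R sketch. The data here are simulated purely for illustration (they are not the real Nt, x1, or x2): it computes the log growth rates ΔXt and shows that glm() with the Gaussian family gives the same fit as lm().

```r
## Illustrative data only: hypothetical population sizes and predictors.
set.seed(1)
Nt <- rpois(13, lambda = 50) + 1   # hypothetical population sizes N_t
Xt <- log(Nt)                      # X_t = log_e(N_t)
dXt <- diff(Xt)                    # growth rates: ΔX_t = X_{t+1} - X_t (n = 12)

x1 <- rnorm(12)                    # hypothetical environmental factors
x2 <- rnorm(12)

## Gaussian-family glm() with the default identity link is ordinary
## least squares, i.e. the same model lm() fits:
fit.lm  <- lm(dXt ~ x1 + x2)
fit.glm <- glm(dXt ~ x1 + x2, family = gaussian)

all.equal(coef(fit.lm), coef(fit.glm))   # TRUE
```

Because the Gaussian/identity glm() is just least squares, summary(fit.lm) and summary(fit.glm) report the same coefficient t-tests and p-values; lm() additionally gives R-squared and the overall F-test directly.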
> > So, my object containing the data is called "data", and this is the
> > code for the first species (sp1):
> >
> > GLM1 <- glm(growth.sp1 ~ x1 + x2, family = gaussian, data)
> > MOD.SELECTION <- stepAIC(GLM1, trace = TRUE)
> > summary(MOD.SELECTION)
> >
> > Here I attach an example of one of these analyses, and afterwards I
> > finally give you my questions (I hope not to be too long-winded!!):
> >
> > > sp1.starting.model <- glm(sp1 ~ x1 + x2, family = gaussian, data)
> > > sp1.back <- stepAIC(sp1.starting.model, trace = TRUE)
> > Start:  AIC=63.6
> > sp1 ~ x1 + x2
> >
> >        Df Deviance    AIC
> > - x2    1   73.490 61.801
> > <none>      72.278 63.602
> > - x1    1  122.659 67.949
> >
> > Step:  AIC=61.8
> > sp1 ~ x1
> >
> >        Df Deviance    AIC
> > <none>      73.490 61.801
> > - x1    1  126.400 66.309
> >
> > > summary(sp1.back)
> >
> > Call:
> > glm(formula = sp1 ~ x1, family = gaussian, data = data)
> >
> > Deviance Residuals:
> >      Min        1Q    Median        3Q       Max
> > -5.04833  -1.15233  -0.06802   0.81325   5.11464
> >
> > Coefficients:
> >             Estimate Std. Error t value Pr(>|t|)
> > (Intercept) -7.62399    3.11127  -2.450   0.0342 *
> > x1           0.20595    0.07675   2.683   0.0230 *
> > ---
> > Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> >
> > (Dispersion parameter for gaussian family taken to be 7.348965)
> >
> >     Null deviance: 126.40  on 11  degrees of freedom
> > Residual deviance:  73.49  on 10  degrees of freedom
> >   (1 observation deleted due to missingness)
> > AIC: 61.801
> >
> > Number of Fisher Scoring iterations: 2
>
> You would probably be better off just doing summary() and
> looking at the p-values of the two predictors (if you must ...)
>
> Why are you using AIC if you're interested in testing
> relationships rather than prediction?

>>> So, having read the Whittingham et al. paper (thank you very much)
>>> and your comments, I understand that I would be better off fitting
>>> the "full" (just two predictors) model and looking at the p-values
>>> of that model, rather than using the stepAIC procedure. Is that
>>> right?

> > THE QUESTIONS:
> >
> > 1) Can I trust in the existence of such a statistical relation? I
> > mean: is there a way to know the power of this test in R?
>
> There are power tests in R, but I don't know if there are any
> specifically for this case (two-predictor regression). Remember that
> power applies to the probability of type II errors (falsely failing
> to reject the null hypothesis).

>>> OK. On the other hand, I suppose that the small sample size makes a
>>> detected statistical relation between the predictor and the response
>>> variable even more reliable. Is that right?

> > 2) I decided always to use "family=gaussian" because I also have
> > negative values in my response variable and cannot manage them in
> > any other way. In fact, I was not able to use a "negative binomial"
> > family, as I previously did in SAS, because of the negative values
> > of the response variable (as R "told" me when I tried).
>
> Is this a question? As above, glm() with gaussian family and
> identity (default) link is equivalent to lm().

>>> Yes, now I understand, and I'm ashamed because I'm aware it is a
>>> very basic statistical issue (I'm sorry!). But if I strongly believe
>>> the response variable is normally distributed, although the small
>>> sample size makes its normality difficult to test, can I use lm()
>>> without testing for normality? In other words: can I trust, on
>>> logical grounds, that the statistical population behind the sample
>>> is normally distributed, and consequently use lm()?

> > 3) How should I interpret the dispersion value R gives me (in the
> > case reported it was 7.3)? I mean, what range of values (if such a
> > range exists) would I expect to make the result reliable in the case
> > of "family=gaussian" (I'm not interested in prediction but just in
> > finding a statistical relation)?
>
> In the case of the Gaussian, the dispersion is just the variance.
> If you're worried about overdispersion, that's really just a problem
> with models (binomial, Poisson) that assume a fixed relationship
> between mean and variance.

>>> I had looked hard for this information but couldn't find it: thank
>>> you very much!!

> If you really were going to use AIC, you should strongly
> consider AICc -- corrected for small sample sizes ...
>
> Ben Bolker
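On the AICc point: there is no AICc function in base R, but the correction is easy to compute by hand as AICc = AIC + 2k(k+1)/(n - k - 1), where k is the number of estimated parameters (for a Gaussian model this includes the residual variance) and n the sample size. A minimal sketch, using the built-in cars data purely for illustration:

```r
## Small-sample corrected AIC: AICc = AIC + 2k(k+1)/(n - k - 1).
aicc <- function(fit) {
  k <- attr(logLik(fit), "df")  # parameters estimated (incl. sigma^2)
  n <- nobs(fit)                # observations actually used in the fit
  AIC(fit) + 2 * k * (k + 1) / (n - k - 1)
}

fit <- lm(dist ~ speed, data = cars)  # illustrative model only
AIC(fit)   # ordinary AIC
aicc(fit)  # slightly larger; the correction grows as n shrinks
```

I believe add-on packages (e.g. MuMIn or AICcmodavg) also provide ready-made AICc functions, which would be worth checking against this hand computation.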
______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.