Re: [R-sig-eco] R-sig-ecology Digest, Vol 190, Issue 8
Rich, Your specific question was "whether the geometric mean of a set of chemical concentrations (e.g., in mg/L) is an appropriate representation of the expected value." The short answer to that specific question is no. There are various estimators of the expected value of a positively skewed distribution: 1) The arithmetic average 2) Assuming a specific distribution, e.g., log normal ( most common but not restricted to that), and applying the distribution-specific relationship between parameters and expected value. 3) A non-parametric, e.g. "Duan" estimator (very rare). However, I suggest your question is too narrow. The better question is "what single value best characterizes the location of a skewed distribution"? If the skew is sufficiently large (e.g. large log-scale sd), 90 - 99% of the population can be less than the expected value. In my mind, the answer to this question depends on what you "want to do" with the estimated location. 1) summarize the observations - I suggest the median is the best estimator of the typical value for an individual observation. For log normal data, the geometric mean is an estimate of the median, because the log operator and median operator can be interchanged algebraically. 2) estimate a total, e.g. over space or over time. This is the relevant goal when trying to estimate nutrient loading into a receiving body or total release of a contaminant. For this, you do want the expected value. I have seen instances where load/release was estimated by multiplying the geometric mean daily release times the number of days. Justified because "the data are skewed so we computed the geometric mean". I'll leave them nameless because the geometric mean computation is totally wrong for this goal (although favorable to the polluter because the answer is a smaller value than the arithmetic average times number of days). Best, Philip Dixon [[alternative HTML version deleted]] ___ R-sig-ecology mailing list R-sig-ecology@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-ecology
Re: [R-sig-eco] Determine if a genotype is resistant or susceptible based on insect survival - Possible?
Marcelo, I wrote quite a bit based on your code, but then realized that your code probably completely misrepresents the data. Your code, which indicates that 5 is 'alive at the end of the data recording period'. You compute that by whether Alive = 5, in other words you define # Alive = 5 represents 5 insects living past that day. No - because you actually have day of death for many of those insects. Yes, it is possible to compare survival curves among genotypes, but this will be very difficult to do carefully because of 3 aspects of your data: 1) 61% of your observations are censored (based on your code, which indicates that 5 is 'alive at the end of the data recording period'). When considered by genotype, all but one genotype have over 50% censoring. Hence simple summaries like the median survival time at uninformative: median = "alive at end of study" for all but one genotype. 2) Your study design produces hierarchical data: insects within leaves within plants for each of the 7 genotypes. Any good analysis will account for that structure, e.g. by incorporating random effects. 3) Your data collection scheme produces daily data over 5 days, i.e. did an insect die between day x-1 and day x, i.e. is it newly dead on day x for up to 5 days + alive at end. These measurements are quite coarse relative to the length of the study. You can certainly estimate survival curves for each genotype but the usual standard error and confidence intervals are wrong because of the hierarchical sampling. There are random effects survival models, especially well developed for the Cox proportional hazard model, but I suspect these will be difficult to fit because of the coarse sampling and large proportion of censored observations. Most of the information in the data seems to be summarized by the proportion alive at the end (i.e. whether or not Alive =5). My simple suggestion would be further collapse the data to "alive at the end": yes (Alive = 5), no (Alive < 5) and counting the # yes per . Then eliminate one level of random effect by counting the # 5's for alive at the end for each leaf (one response per leaf) and use a binomial logistic regression with a random effect for plants. # --- Much of the previous information still applies, but the comments about median death day are completely wrong. When I looked more closely at the data, "Alive" always decreases over time within each leaf. This suggests that Alive is the # alive on each day. These are the survival curves for each leaf. To use survival models, you need to convert the data to one observation per insect, where the response is the number of individuals that died that day. You can do that by taking the difference between Alive values on subsequent days and being careful about computing the number alive on the last day. Because the survival doesn't (to my knowledge) allow for case frequencies, you need to then expand that data set to one row per insect with its day of death. Then you can use survival models with censoring defined at day = 5. Best wishes, Philip Dixon [[alternative HTML version deleted]] ___ R-sig-ecology mailing list R-sig-ecology@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-ecology
[R-sig-eco] Modeling variability along a gradient using GAMLSS
Jeff, And, if you don't want to dive into Bayesian methods, the GAMLSS package allows you to specify models for location, for scale (what you want) and/or for shape. Those models can be parametric or nonparametric, including multiple types of splines. https://www.gamlss.com/books-articles/ provides resources, including links to short course notes and a J. Stat Software article. Best, Philip [[alternative HTML version deleted]] ___ R-sig-ecology mailing list R-sig-ecology@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-ecology
[R-sig-eco] mvabund: Extracting a subset of comparisons between interactions
Diogo, In similar situations, I have done the following: Estimate all pairwise differences between site:season combinations. Use paste() or something equivalent so that the group means encode both site and season. No p-value adjustment at this step. Post-process that table to extract only the comparisons of interest, e.g. comparisons of season means within site or comparison of site means with season. Determine the number of comparisons you are interested in (likely much, much smaller than 498). Apply a Bonferroni adjustment to your subset of comparisons. I don't have R code to do this with anova.manyglm( pairwise.com= ) output. Sorry. Here's an outline of what I would do: If you save the results from anova.manyglm(), the information you need to extract comparisons of interest will be in the row dimnames of $pairwise.comp.table. E.g.: all.pairs <- anova.manyglm(full_mod, p.uni="adjusted , pairwise.comp = Catch$SiteSeason) pair.names <- dimnames( all.pairs$pairwise.comp.table)[[1]] Then it's just a matter of using strsplit() three times. Once to separate the two group names (using ' vs '). Then to each group name to separate the site from the season (using '_' or whatever you used as the separator character in your paste). Best, Philip Dixon [[alternative HTML version deleted]] ___ R-sig-ecology mailing list R-sig-ecology@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-ecology
[R-sig-eco] p-values from randomization tests
One small revision to Eduard's answer to Korla's question. When you sample a subset of the possible data configurations under the null hypothesis (what some call a randomization test), the observed data is considered one of the possible samples. When you generate N (e.g. = 199) random samples, you actually have N+1 (= 200) samples. So the p-value is (R+1) / (N+1), where R is the number of random samples with test statistics equal to or more extreme than the observed sample. So in Korla's analysis, when the observed data is more extreme than all the random samples, p = 1/(199+1). The +1 issue is why you frequently see N=99, N=199, N=999, or even N= random samples. If you enumerate all possible data configurations under the null (what some call a permutation test), the observed data is automatically included as one sample in the set of possible samples. Then p = R/N, because the observed value of the test statistic occurs at least once in {R} and that sample is included by definition in {N}. The spirit of Eduard's answer is spot-on. The p-values are the same because every pairwise comparison is more extreme than anything in the randomized set. Best, Philip Dixon [[alternative HTML version deleted]] ___ R-sig-ecology mailing list R-sig-ecology@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-ecology
[R-sig-eco] Predicting using spatial filtering
Maarten, Fascinating question! Yes, you should not see that behaviour. Median has nothing to do with the issue, because you should get linear relationships for any constant value of the non-essential covariates. And, thanks for telling me about SpatialFiltering. I have been using PCNM() in the vegan package to extract Moran Eigenvector Maps. I use spdep for lots of things but hadn't (yet) discovered SpatialFiltering(). Good to find out about other implementations! For the benefit of the list, Stephane Dray has a wonderful vignette on the use of Moran Eigenvector Maps (MEM) to "pull" spatial covariance out of the error covariance into the fixed effect model structure. Pedro Peres-Neto used the name principal components of neighbor matrices when he popularized the method in the ecological literature but MEM is earlier. Stephane uses functions in adespatial, but the principles apply equally well to the vegan implementation (pcnm) or spdep implementation (SpatialFiltering). Her vignette is https://cran.r-project.org/web/packages/adespatial/vignettes/tutorial.html I explored various possible explanations for your issue without success. These included: High correlation between elev and intercept Coding the factor variable (soil) All were ruled out when I constructed alternatives that showed the same problem. Eventually, I decided that the issue was something subtle with predict when you have a matrix argument (EV_fit1 in your code) in the formula. It looks like predict then ignores the newdata= argument and gives you predictions for the original observations. Since those don't have the same values for "non-essential" covariates, it is no surprise that you get a random plot. This is not apparent when you predict for the same number of observations as the original data set, but it is immediately apparent when you have a different number of rows in the newdata. I simplified the problem by extracting the first two MEM eigenvectors and explicitly naming them in the formula. Here's the code (using base graphics). It gives straight lines for the prediction vs elev plot, as expected. This is not a reprex. It piggy-backs on your code. meuse$sp1 <- EV_fit1[,1] meuse$sp2 <- EV_fit1[,2] Model_example<-lm(cadmium ~ elev + dist + factor(soil) + sp1 + sp2, data=meuse) # bad predictions using the meuse data meuse$predmod1 <- predict(Model_example, type="response") with(meuse, plot(elev, predmod1, col=as.integer(soil))) # this plot shows all sorts of variation vs elev because the other covariates are not constant # predictions using a newly constructed data set testdata <- expand.grid(elev=seq(5,10, 0.1), soil=1:3) testdata$soil <- factor(testdata$soil) # You can choose any constant you like for dist, sp1 and sp2 testdata$dist <- 0.211 testdata$sp1 <- median(meuse$sp1) testdata$sp2 <- median(meuse$sp2) testdata$predmod1 <- predict(Model_example, newdata=testdata) with(testdata, plot(elev, predmod1, col=as.integer(soil))) # this plot shows straight lines vs elev because other covariates are held constant, as expected Best, Philip Dixon [[alternative HTML version deleted]] ___ R-sig-ecology mailing list R-sig-ecology@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-ecology
[R-sig-eco] Simulating spatial point patterns that have spatial structure
Liudas, It sounds like you are trying to do what is called pattern reconstruction. That is a method to simulate a point pattern that is constrained in one or more ways. The possible ways include the intensity function and the K or L function. The original paper is Tscheschel and Stoyan, 2006, Statistical reconstruction of random point patterns, Computational Statistics and Data Analysis 51:859-871. A textbook treatment is in Wiegand and Moloney, 2014, Handbook of spatial point-pattern analysis in ecology, CRC Press. The methods are implemented in the R shar package. Look for the reconstruct functions. In response to your comments, I note that matching the second order properties (K function, variogram) usually results in a pattern that "appears" quite different, in the sense that locations of high intensity are in different places. That's to be expected because matching the correlation structure does not mean matching the intensity. I also note that when you match on a lot of characteristics, you will always simulate patterns that look a lot like your original one because patterns can only differ in so many ways. Best wishes, Philip Dixon [[alternative HTML version deleted]] ___ R-sig-ecology mailing list R-sig-ecology@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-ecology
[R-sig-eco] Thoughts on ordinal regression
Mahnaz, You only need to send one request. Many folks do not check the list daily and I usually wait to see if others respond before I respond. Likert scale responses are among the most difficult types of data to analyze well. Many folks use approximate analyses. A good textbook treatment of ordinal data analysis is the relevant chapters in Frank Harrell's book on Regression modeling strategies. My go-to reference for categorical predictors is Shah and Madden, who deal with plant disease ratings on an ordinal 0-10 scale. Title: Nonparametric analysis of ordinal data in designed factorial experiments Author(s): Shah, DA (Shah, DA); Madden, LV (Madden, LV) Source: PHYTOPATHOLOGY Volume: 94 Issue: 1 Pages: 33-43 DOI: 10.1094/PHYTO.2004.94.1.33 Published: JAN 2004 Since your response is the average of 5 Likert scores, you may be able to use standard methods for Gaussian data. The important issues are: 1) the analysis assumes the scale is linear. I.e. the average of 2 and 4 is the same (in a subject matter sense) as the average of 1 and 5. Also, the change between 2 and 3 is the same (in a biological sense) as the change between 4 and 5. As you can tell, the assumption of linearity has consequences both for averaging 5 responses to get one summary score and for the form of the regression model. 2) that the variance of the response is smaller when the mean is close to 1 or close to 5. This may not be a problem if most of the average responses are between 2 and 3. You may decide to do an approximate analysis and ignore the unequal variances. Best wishes and good luck! Philip Dixon [[alternative HTML version deleted]] ___ R-sig-ecology mailing list R-sig-ecology@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-ecology
[R-sig-eco] Doing repeated measures on a randomized block design
Rick, There are (at least) two ways to set up a model for your experiment. The difference is a detail of your sampling. The two most likely schemes are: 1) Each time you sampled, you went to the same branch (or perhaps same leaf). Here, the repeated measurements are on the branch, not the tree. Exposure (N/S) is characteristic of the branch. In terms of the model variables, a branch is identified by combination of tree and exposure. This leads to the traditional between/within repeated measures design. If you assume compound symmetry (nested random effects), the lmer model would be FvFm ~ Tree + Exposure + Date + Date:Exposure + (1|Tree:Exposure) 2) Each time you sampled, you went to different branches. Now the only level of repeatedness is the tree. Now there is one observation per branch; each is characterized by exposure and date. The model is simply a 2 way factorial inside blocks. If you wanted to set up something like a compound symmetry model, you would have tree and observation as the random effects. Since there are only 4 trees, I would treat trees as fixed. The lmer model is FvFm ~ Tree + Exposure + Date + Date:Exposure With 18 dates, I would stongly advice against a compound symmetry (split plot in time) model. There is probably some structure to the correlations. This means using lme() with something like a corAR1() correlation structure. You can add that to either model, (to get CS+AR(1) for the first option or AR(1) in blocks for the second). Or you could a "just an AR(1)" model for the first option by replace (1|Tree:Exposure) with the corAR1(). Best, Philip Dixon ___ R-sig-ecology mailing list R-sig-ecology@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-ecology
[R-sig-eco] Which R package for Second-Stage nMDS ?
Pierre, I don't know a function that does this, but it is extremely easy to code. Dist objects are vectors containing the 1st stage pairwise dissimilarities. Call those dist1, dist2, dist3, ... So alldist <- cbind(dist1=dist1, dist2=dist2, ...) will assemble the matrix of dissimilarities with useful column names. stage2 <- as.dist(1-cor(alldist)) will compute the matrix of correlations, convert from similarity (the correlation) to distance (1-correlation) and convert to a distance object. Then just run your favorite MDS on stage2. Note: sometimes folks prefer sqrt(1-cor) as the "correlation distance", instead of 1-cor. I don't know which Clarke prefers. Best, Philip Dixon ___ R-sig-ecology mailing list R-sig-ecology@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-ecology
[R-sig-eco] Regression when Y has an estimation variance
Roy, One relevant literature is that on meta-regression (a generalization of meta-analysis). There is a very good handbook by Koricheva, Gurevitch and Mengerson. Meta analysis mostly deals with Gaussian responses (or transformable to approximately Gaussian). If there has been any work on non-gaussian responses, I expect it would be summarized in Koricheva. The metafor package is one implementation specifically for meta analysis. A resource on metafor and other R packages is Schwarzer and Carpenter, Meta-analysis with R. Other programs can also fit the models as mixed models with heterogenerous specified variances, lme() and lmer() do not let you do this. lmer() doesn't allow heterogeneous variances; lme() does, but only of the form k_i*sigma^2 (i.e. variances relative to a common scaling factor, not absolutely specified variances). For a meta analysis, you need to specify the absolute variance for each estimate. If someone knows how to trick lme() to use exactly the error variances that have been specified, I would love to hear about it. Best, Philip Dixon ___ R-sig-ecology mailing list R-sig-ecology@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-ecology
Re: [R-sig-eco] mixed model with proportion data
Mariano, There is a huge and important difference between the two approaches suggested for your data. The log ratio of proportions (i.e. the empirical logit of the Yes proportion) estimates the residual variance. The binomial model assumes the residual variance is determined by the arbitrary (and made-up) sample size of 20 "tries" per response, in combination with the estimated mean proportions. To see the arbitrariness, if you don't already, re-express your proportions out of 200, instead of 20, because 0/200, 10/200, ... 200/200 also give your observed responses. The coefficient estimates will be the approximately same but their variances will not. (If you didn't have additional random effects in the model, the coefficient estimates would be exactly the same but the variances would be 1/10's those from N=20). If you are going to use the binomial GLM, I believe you must add overdispersion to the model. Either as an individual random effect, or by using a quasibinomial response distribution. Overdispersion is not necessary for the log proportion response because the residual error variance conceptually estimates that overdispersion. Philip ___ R-sig-ecology mailing list R-sig-ecology@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-ecology
[R-sig-eco] Permutation test for SAR model
Feng, You say you are interested in the relationship between two variables, so the quantity you want to assess is something that describes that relationship. Usually that is the regression slope. A much more serious problem is: how are you going to generate random permutations that honor the spatial correlation implicit in a SAR model? You can not simply permute X variables, Y variables, or residuals because permutation assumes exchangability, a slightly more general version of independence. You might consider toroidal rotations, but that makes other assumptions (e.g. 1st order stationarity) Pesarin has been developing permutation tests for complex data (see his book with that title). I do not know whether he has considered data with a SAR structure. Best, Philip Dixon ___ R-sig-ecology mailing list R-sig-ecology@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-ecology
Re: [R-sig-eco] Periodic regression
Thiago, I would not try to interpret each periodic component separately. That linear regression model is just a convenient way to fit a model with a phase-shifted cosine curve: log (mean index) = b0 + A*cos(c*time - theta) The amplitude of the oscillation is A; the phase shift is theta. Those can be estimated from the regression coefficients in the linear model: log (mean index) = b0 + b1 cos(c*t) + b2 sin(c*t) as: A = sqrt(b1^2 + b2^2) theta = atan(b2/b1) Here's R code to implement these calculations: tad.lm<-glm(index~(cos(2*pi*(hora/24)))+(sin(2*pi*(hora/24))), family=quasipoisson) tad.coef <- coef(tad.lm)[2:3] A <- sqrt(sum(tad.coef^2)) # where tad.coef is only periodic components phase <- atan2(tad.coef[2],tad.coef[1]) # in radians, in the appropriate quadrant maxhora <- phase*24/(2*pi) maxhora <- ifelse(maxhora < 0, 24+maxhora, maxhora) # and in hours, shifting by 24 hr if negative Standard errors are obtained by the delta method. Confidence intervals are best obtained by bootstrapping. You should think hard about what sort of bootstrap to use (observations, standardized residuals, parametric on the estimates, parametric on the observations). Chester Bliss's pamphlet on Periodic regression in Biology and ?? is a good introduction, but remember that his computing resources is 60 years old. The test of "no aggregation" that I suggested (one of many possibilities) = test of "no periodic components". That is the test that both b1 = 0 and b2 = 0 in the linear model above. Can use either a drop in quasi-deviance or a C Beta test on the estimates. Asymptotically equivalent; usually similar but rarely identical in practical samples. R code is: tad.lm<-glm(index~(cos(2*pi*(hora/24)))+(sin(2*pi*(hora/24))), family=quasipoisson) tad.lm0 <- glm(index~1, family=quasipoisson) # F (drop in quasi deviance) test for both periodic coefficients = 0 anova(tad.lm0, tad.lm, test='F') # C Beta test for both periodic coefficients = 0 tad.coef <- coef(tad.lm)[2:3] tad.vc <- vcov(tad.lm)[2:3,2:3] # extract coefficients and variance-covariance matrix for periodic terms # i.e. dropping the estimated intercept npar <- length(tad.coef) nobs <- length(hora) # extract # parameters in test and # observations in data set tad.f <- (t(tad.coef) %*% solve(tad.vc) %*% tad.coef) / npar c(F = tad.f, probF=pf(tad.f, npar, nobs-(npar+1), lower=F)) Unless there are issues of general interest to the list, I suggest any further discussion be done off-list. Best wishes, Philip Dixon ___ R-sig-ecology mailing list R-sig-ecology@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-ecology
[R-sig-eco] Test for circular data
Thiago, I can see why you're not having much luck in the usual circular statistics literature, which focuses on responses that are circular quantities. In your case, the explanatory variable (time) is circular. I would answer your second question by fitting a periodic regression, i.e. using sin(2 pi* Time / 24) and cos(2 pi*Time/24) as the two X variables. Choose the appropriate distribution for your response and include other fixed or random effects as appropriate. The amplitude and phase of the apparent oscillation can be computed from the regression coefficients. That model allows one peak per 24 hours. You can add more components or change the frequency as appropriate for the biology and your hypothesis about the within-day pattern. The period model also provides a simple test for 'aggregation' (i.e., test both periodic coefficients = 0). There are many other possible tests of aggregation, depending on the form of the presumed 'no aggregation' distribution of your response. That model is the temporal equivalent of a simple linear regression model: simple, able to detect various forms of departure from a constant mean, but definitely not an exact description of the pattern over time. The model can be made more flexible by using a spline function that is appropriately smooth across the 24:00 hr to the 0:30 period. Look in the smoothing spline and/or GAM literature if you want to explore this. Philip ___ R-sig-ecology mailing list R-sig-ecology@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-ecology
[R-sig-eco] Back transforming standardized model parameters when there is an interaction
Matt, The interaction term is why the usual transformation doesn't work, at least initially. The R parameterization of factor effects makes it easy to fall into the trap of thinking that the coefficient for a main effect "means" something when there is an interaction in the model. Sometimes yes, sometimes no. The best way to untangle model effects when there are factors in the model, especially when a factor changes the slope (because the model includes air.ice:time) is to work out the regression model for each level of the factor (i.e. for air then for ice). Using the unstandardized version to get illustrate: air.ice, assuming you used the default contrasts, see options("contrasts"), has the value of 0 for air and 1 for ice so the regression line for air is E Y = 12.52 + 0.0081 time - 0.8019 * 0 - 0.00442*0*time = 12.52 + 0.0081 time and that for ice is is E Y = 12.52 + 0.0081 time - 0.8019 * 1 - 0.00442*1*time = 11.71 + 0.0037 time Two of the coefficients in the full model correspond to coefficients for a factor-specific regression line ONLY because the factor variable has the value 0 for the baseline level. If you only standardize time (your model 3), the same calculation works because you haven't changed the air.ice variable. You should be able to take the coefficients, apply the backtransformation suggested by Drew to each regression line separately and get the same coefficients when time is the X variable. It looks like standardize also centers (or centers and standardizes) the air.ice variable. The standardized value for air is no longer 0 and that for ice is no longer 1. Probably something like -0.5 and 0.5 (if centered only with equal sample sizes). I will assume -0.5 for air and 0.5 for ice to illustrate. That means the factor-specific regression lines are now: for air: E Y = 12.48 + 0.508 z.time - 1.07 * (-0.5) - 0.38 * (-0.5)*z.time for ice: E Y = 12.48 + 0.508 z.time - 1.07 * (0.5) - 0.38 * (0.5)*z.time After standardizing, none of the model coefficients can be interpreted as a coefficient for some baseline level. You have to unpack the factor effects. Note: you get similar behavior if you use sum-to-zero contrasts. And, if you use contr.SAS(), the last factor level is used as the baseline group. Aside: the different intercept is no surprise because it is EY at a different timevalue (the average time, not time=0) and the different slope is no surprise because one unit change in standardized time is different than one unit change in raw time. So, I suggest you look at how the variables are coded by standardize(). model.matrix() will give you this information. Use that to calculate the coefficients for each regression line. Best wishes, Philip [[alternative HTML version deleted]] ___ R-sig-ecology mailing list R-sig-ecology@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-ecology
[R-sig-eco] evaulating multiple responses using Chi-square
Yes, the idea has been considered, by no less than RA Fisher. If you want a name, it is sometimes known as Fisher's combined probability test. Usually applied to one-sided tests of the same null hypothesis. Yes, the p-values can come from different "sources". Before using it in the community context, I would want to think carefully about what it tells me when applied to two-sided tests of different species. Philip [[alternative HTML version deleted]] ___ R-sig-ecology mailing list R-sig-ecology@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-ecology
[R-sig-eco] Fitting a Weibull with a 0 dose
Katie, I second Ben's suggestion to use functions in the drc library, not nls(). Not only do those functions model binomial or overdispersed binomial data, but they provide estimates and confidence intervals for the EC50 and any other EC, e.g. EC10. However, neither handles x=0 very well. The problem is the way the Weibull function is coded. Instead of x^power, both use exp(power*log(x) ...). The first is defined for x=0, the second is not. A work around is to use a small positive value for x, e.g. 0.01, as you suggest. If there is some background level of x, so experimental dose = 0 is really exposure to a small background concentration, that background concentration is an easily justified replacement for 0. If not, I rerun the analysis with different choices of small positive value. If the low dose asymptote is well determined by the data, you should get similar parameter values and estimated ECx's for any small positive value. This is a robustness check to make sure that an arbitrary choice doesn't affect the results. Best wishes, Philip ___ R-sig-ecology mailing list R-sig-ecology@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-ecology
[R-sig-eco] Logistic regression with 2 categorical predictors
Andrew, Your problems arise from two issues: 1) The overall test of equal P[prefer] in each age and the tests of equal P[prefer] in a specific pair of ages are based on different testing philosophies (Score or LR test for the overall test and Wald test for the pairwise comparisons). 2) The behaviour of the logit function when one empirical probability is 0 (or 1). You are absolutely correct when you highlight the ages with all avoid. If you want just my suggestion for how to conduct informative pairwise comparisons: Do 15 two-by-two Chi-square tests, or 15 runs of glm with only two ages in each analysis. Because there is no pooling of information from ages uninvolved in the null hypothesis, these 2x2 tests are exactly the same as the pairwise hypothesis for the full 6-age data. I don't know whether there is a R packge to automate this. I've always written a pair of for loops and used subset= with the appropriate logical in the glm(). Details (which are a very very small glimpse of a huge area): Fleiss, analysis of counts and proportions, and the tome: Agresti, categorical data analysis are extended treatments). It doesn't help that the analysis of this type of data has some non-trivial complications. Upton, in his classic JRSS discussion paper on the 2x2 contingency table, said something like the 2x2 contingency table was a seemingly simple problem that had a surprisingly large number of complications. 1) Why is a 2x2 contingency table comparison of ages 1 and 2 the same as the pairwise comparison of ages 1 and 2 in an analysis of all 6 ages. When the data are normally distributed and analyzed by ANOVA, the error variance is (usually) estimated from all 6 groups. The analysis of just two groups estimates the error variance from fewer observations, hence the 6-group pooled error variance provides improved power for the comparison of groups 1 and 2. The null hypothesis in both cases specifies equality of ages 1 and 2, but all 6 groups provide information about the error variance. With count data, there is no error variance that has to be estimated. (I'm not considering overdispersion, which if present would be estimated from all 6 groups, so the 2x2 and "6x2 followed by a pairwise comparison" analyses are no longer equivalent). Again, the null hypothesis is the same in both analyses. One way to consider the specific comparison of ages 1 and 2 in the 6-age analysis is ! that you are fitting a model with 5 probabilities to the data: one probability that is shared by ages 1 and 2 (hence they have equal probabilities) and four more probabilities, one each for ages 3 through 6. The fit of that 5 parameter model is compared to the fit of a 6 parameter model in which each age has its own probability. Both the 2 x 2 and the pairwise comparison are 1 df tests. In neither case, can the comparison of ages 1 and 2 benefit from any information from the other four groups. 2) What's going on the anova(model_sg, test="Chisq") analysis? This is a comparison of two models. The data and the models can be set up in at least three asymptotically equivalent ways (see Agresti and probably Fleiss for "sampling models for the contingency table" for the details). I'll use the independent binomial model. That model has up to 6 parameters, i.e. a P[prefer] for each age, where P[prefer] is the probability that a fish in a specified age group has a "Prefer" response, not an "Avoid" response. The null hypothesis (reduced) model is that each age has the same probability, so there is one parameter. The alternative (full) model is that each age has a different probability. Yes, the alternative model is fully specified. Yes, the alternative model has 0 "error" df. When you fit these models using glm, you are using deviance = -2 log likelihood as the measure of fit. Yes, the deviance is 0 for the full model (under the usual way of writing the log likelih! ood). That is exactly how things should be. The comparison between these two models is a measure of whether the reduced model fits the data. (semi-Aside: You can write the log-likelihood for the full model so the deviance is not 0. If so, the log-likelihoods for both the null and alternative models are shifted by the same additive constant so the change in deviance is the same value for either way to write the log-likelihood). This comparison of models is exactly the same comparison done by the usual (K-1) df Chi-square test of equal proportions in a Kx2 contingency table. The only difference is that the Chi-square test uses the Pearson Chi-square statistic as the measure of how well a model fits the data; glm uses deviance. We regularly calculate the Chi-square statistic for the null hypothesis. We never explicitly calculate Chi-square for the alternative hypothesis, because Chi-square for the full model is exactly 0 with exactly 0 df. The change between the two ! models is the usual C
[R-sig-eco] Help obtaining F statistics
Luis, The Analysis of deviance table you provided indicates that your data analysis has a serious issue, one much larger than calculating an F (or Chi-square, which is more common for GLM analyses) statistic. The df for the sex:behavior interaction in your table is 7. It should be 8, the product of sex df and behavior df. I'm almost certain that one combination of sex and behavior does not occur in your data. Appropriate analyses of data with missing cells is almost always very difficult. You have to think carefully about the interpretation of each of the model effects (or at least, the p-values you got do not answer questions that I care about). I recommend you find a local statistician to advise, because this is not an R issue. Best wishes, Philip Dixon [[alternative HTML version deleted]] ___ R-sig-ecology mailing list R-sig-ecology@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-ecology
[R-sig-eco] Seasonal Mann Kendall with multiple detection limits
Tim, The issue with multiple detection limits with Kendall's tau or the Seasonal Mann-Kendall variation is computational, not conceptual. The strong advantage of Kendall's tau, over rank-based methods, is that tau handles multiple detection limits in a common-sense way. The pair (5, 10) can be ranked, the pair (<5, 10) can be ranked, the pair (<5, 3) can not be ranked, and the pair (<5, <6) can not be ranked. Having said this, I don't believe either the NADA or the current version of EnvStats implementations handles multiple detection limits. I've always done this by hand when needed, but that code is written for specific data sets and unlikely to be usable by anyone else. ( I will talk to Steve Millard about adding tau with multiple censoring limits to EnvStats, because we are writing a 2nd edition of the text to go with the R library). If you want to use imputation (what you are doing by simulating values for the censored observations), you need to use multiple imputation (i.e., generating more than one simulated data set). I didn't look at your code carefully enough to tell if it currently does multiple imputation. Rubin's books (or web links to multiple imputation) will show you how to combine results from the multiple imputations. If you only simulate one data set, your conclusions are about that specific simulated data set, not a Monte-Carlo approximation to conclusions from the data at hand with censoring. Best wishes, Philip [[alternative HTML version deleted]] ___ R-sig-ecology mailing list R-sig-ecology@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-ecology
[R-sig-eco] xPOD statistic
Alexandre, I don't know any prewritten function to calculate the collection of xPOD statisitics for a multiptype point pattern. However, it won't be difficult to code that yourself using functions in the spatstat library. Brown et al. define A_{ij} for each pair of species as Int( log g_{ij}(r) dr, where g_{ij}(r) is the pair correlation function for species i and j at distance r. That is numerically approximated by a sum of log \hat{g}_{ij}(r) computed at 10 (their suggestion) values of r. spatstat provides the pcf() function to compute g(r) and the alltypes() function to compute g(r) for all combinations of pairs of species. After that, it is just a matter of manipulating the function value array (see fasp()). Best wishes, Philip Dixon ___ R-sig-ecology mailing list R-sig-ecology@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-ecology
[R-sig-eco] NA error in envfit
Kendra, I wonder if the problem is a factor level with no observations. One of the frustrating things about factors (class variables) in R is that the list of levels is stored separately from the data. This can cause all sorts of problems if you create the factor, then subset the data, and the subset is missing one or more levels of the factor. You are subsetting your data, so this may be the source of the problem. My working philosophy is to keep variables as character strings or numbers until just before I need the factors. That avoids any issues with extraneous levels. That means reading data sets (.txt or .csv files) with as.is=TRUE to avoid default creation of factors. relevel() may recreate the list of levels. I usually use factor(as.character(variable)) to flip a factor to a vector of character strings then back to a factor with the correct set of levels. Best wishes, Philip Dixon [[alternative HTML version deleted]] ___ R-sig-ecology mailing list R-sig-ecology@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-ecology
[R-sig-eco] Modeling overdispersed count data
Bayesians commonly use the log normal to account for overdispersion. I learnt the trick from Ben Bolker. His book may be a citable source. Philip Dixon ___ R-sig-ecology mailing list R-sig-ecology@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-ecology
[R-sig-eco] Observation-level random effects to account for overdispersion
All, I am a real fan of using observation-level random effects to account for overdispersion with count data. Using the default log link, an observation-level normally-distributed random effect models the data as a convolution of a Poisson and log-normal distribution. The negative binomial is the convolution of a Poisson and gamma distribution. For all practical purposes, the log-normal and gamma distribution are indistinguishable. Computationally, the normal model is much easier especially if you have additional normally distributed random effects. The sum of normals is easy to model; the sum of some normals and a log gamma is not. Philip Dixon ___ R-sig-ecology mailing list R-sig-ecology@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-ecology
[R-sig-eco] Autocorrelation structures and simplicity of analysis
Jonathan, The AR(1) + compound symmetry model does make sense. It is a model where the correlation over time declines with the separation between times points, but it does not decline to zero. There is a constant non-zero (at least potentially) correlation arising from the compound symmetry (constant correlation between all pairs of observations on the same plot) part of the model. I have found it fits many data sets better than either AR(1) only or CS only. However, I suggest you completely rethink the analysis. I am a great fan of the approach recommended by Paul Murtaugh a few years ago. The paper is "Simplicity and complexity in ecological data analysis" Author(s): Murtaugh, Paul A. Source: ECOLOGY Volume: 88 Issue: 1 Pages: 56-62 Published: JAN 2007 Your situation is almost identical to one he considers in his paper. His advice: remember that you have experimental units randomly assigned to warming treatments. Your analysis currently ignores time, except for accounting for the autocorrelation. You seem to care only about the treatment effect. If that's the case, you don't need to worry about days. Just focus the analysis on plots and the treatment effect. Specifically, figure out an informative summary statistic for each plot (e.g., average over time, median over time), calculate those 12 numbers, then analyze those. No need to worry about the autocorrelation structure then! The beauty of this approach is that if you want to ask a different question, e.g. about day-day variability instead of the average, you just change your summary statistic to something that measures variability, e.g. log sd. Best wishes, Philip Dixon ___ R-sig-ecology mailing list R-sig-ecology@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-ecology
[R-sig-eco] Logistic function between arbitrary parameter values
Phil, One reasonable solution is add two additional parameters for the bottom and top (or bottom and span) of your logistic.: Your function: K / (1 + ((K - x.0)/x.0) * exp(-r * x)) becomesalpha + beta*K / (1 + ((K - x.0)/x.0) * exp(-r * x)), This logistic spans Y values from -alpha to -alpha + beta. You may have trouble fitting it, because Yhat will be undefined if alpha is not large enough and I would be surprised if there were high collinearity between \hat\alpha and \hat\x.0. However, the choices of alpha and beta are no longer arbitrary. Best wishes, Philip Dixon ___ R-sig-ecology mailing list R-sig-ecology@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-ecology
[R-sig-eco] Offsets in Poisson or Neg. Bin regression
Matias, The situation when you most want to use an offset is when FECUND differs across individuals. If FECUND is a constant for all observations, you could ignore it if you chose to. If you did that, its constant effect gets rolled into the intercept. When it's not constant, the logic is: log(mu_i / F_i) = your model <==> Log(mu_i) = your model + log(F_i) where mu_i is the mean count and F_i is the fecundity, both for individual i. When F_i is a constant = F, the logic is: log(mu_i / F) = your model <==> Log(mu_i) = B0 + rest of your model + log(F) <==> B0 + log(F) + rest of your model where B0 + log(F) is the new "intercept" Best wishes, Philip Dixon ___ R-sig-ecology mailing list R-sig-ecology@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-ecology
[R-sig-eco] Hurdle model: singular variance-covariance matrix
Yaiza, The problem is that two coefficients are perfectly correlated with each other. Since this happens in the models with interactions but not additive effects, my hunch is that the cross-product variable is very highly correlated with one of the original variables. This can happen if one variable has a large mean and large variability and the second has a large mean and small variability. If so, this is a numerical issue, not a conceptual, statistical or ecological issue. I suggest you center all X variables to mean 0. That often clears up numerical issues. If it doesn't standardizing each to unit variance can also help. Centering the X variables also has the statistical advantage that the intercept is directly interpretable. Best wishes, Philip Dixon ___ R-sig-ecology mailing list R-sig-ecology@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-ecology
[R-sig-eco] Zuur / Pinierho / Faraway
I agree with all the previous comments and second Tom's recommendations of Faraway as an 'in between' Zuur and Piniehro & Bates. One thing to be careful of: While the advice in Faraway is sound, there are more than a few mistakes in his equations. Philip Dixon ___ R-sig-ecology mailing list R-sig-ecology@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-ecology
[R-sig-eco] Clustering estimated regression coefficients
Chris, I think hclust will work just fine; you just have to construct an appropriate distance matrix from the coefficients. A major part of the beauty of distance-based analyses is that you have to choose a distance measure. When you think hard about that choice, the analysis better represents what you want to know. First, variables absent from a model have 0 values for their coefficients, not NA. A model that omits X3 says there is no relationship between Y and X3, hence 0 for that beta. What would be an appropriate distance matrix for your goals? Those are " I would like to cluster the data so that observations are grouped according to: the similarity in the direction of the relationship ( i.e. +ve or -ve) and the presence/absence of variables." I'm not sure if these are two criteria leading to two clusterings or an aggregate criterion. Easiest if they are two separate criteria. For similarity in direction, use a simple matching coefficient (proportion of coefficients in the same direction). Ignore any coefficient with 0 in either model (NA's may simplify this computation). For presence/absence, again a simple matching coefficient (proportion of coefficients present in both models or absent in both models). You could easily elaborate on these distance measures, e.g. by incorporating magnitude of coefficient. That's where the thinking becomes really important. Once you have an appropriate distance matrix (or matrices), then you turn the hclust crank. Philip Dixon ___ R-sig-ecology mailing list R-sig-ecology@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-ecology
[R-sig-eco] Measuring community similarity when total abundance differs
Tim, I don't know of any automatic way to deal with unequal sampling effort. Even if there were, I would argue strongly that you need to think carefully about the data and how they relate to how you want to define 'similar' and 'less similar' community composition. Fundamentally, this is an issue about how to define the distance between two vectors of species abundances. Bray-Curtis is sensitive to differences in total abundance. Perhaps the difference between a sample with 100 individuals and a sample with 12000 individuals is informative. Probably it isn't. If so, you probably (but not certainly) want to compute percent similarity (Bray-Curtis after dividing each abundance by total abundance in the sample). That's a very common recommendation, but even that may be misleading if the reason for the 12000 is an emergence flush of one species. Adjusting by total sqrt(abundance) may be appropriate, but it is certainly not common. Similar levels of abundance heterogeneity occur in benthic macrofauna data sets. Clarke is fond of Bray-Curtis (without total abundance correction, if I remember correctly) of sqrt(sqrt(abundance)), i.e. abundance^(1/4). Anne Chou and Robin Chazdon have developed and used abundance corrected similarity measures, similar in spirit to the Chou measures of richness. I'm not fond of them because I believe the adjustments for sampling effort presume that two samples come from the same underlying community. Any distance measure reduces a pair of vectors to a single number. It always throws away information. You need to make sure that your choice of measure keeps the features you feel are important. I strongly suggest you look at the abundance vectors for pairs of samples with small distances and pairs of samples with large distances to make sure that your quantitative measure matches your intuitive sense of 'similar' and 'not similar'. Philip Dixon ___ R-sig-ecology mailing list R-sig-ecology@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-ecology
[R-sig-eco] Using residuals as dependent variables
1. Regress a suite of ecological and socioeconomic variables against the residuals from the oceanographic model to determine which factors cause some countries to be above and some below. I.E as trophic level increase the residuals become increasingly negative. All my answers assume that all variables are measured at the same scale, so a multi-level model is not appropriate. Absolutely no problem. This is essentially equivalent to fitting a larger model with both oceanographic and ecological/socioeconomic variables. If you go one step further, and first calculate residuals of the ecol/socio variables when regressed against oceanographic variables, the regression of the oceanographic residuals on the ecol/socio residuals is exactly the same as fitting the larger model. The theory supporting this leads to added variable (or partial regression residual) plots. 2. Ideally ( and I am unsure how or if it is possible) work out for each country which variables/s cause the poor fit of that country to the oceanographic model. This may be a bit tricky. The problem is potential confounding. The regression of two sets of residuals suggested above tells you that a particular ecol/socio variable contributes to the larger only to the extent that that ecol/socio variable contains information not present in the set of oceanographic variables. If an ecol/socio variable is causally important but confounded with some oceanographic variables, the approach will not detect the role of that ecol/socio variable. If two ecol/socio variables are confounded, a regression model can not distinguish their contributions. You have to think about your variables and their relationships to assess this. Assuming confounding is not an issue, fit the larger model with both oceanographic and ecol/socio variables. For each country, calculate X\hat{\beta} for each ecol/socio variable. That's the contribution of that variable for that country to the model prediction. Compare those components to their sum for the country. Philip Dixon ___ R-sig-ecology mailing list R-sig-ecology@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-ecology
[R-sig-eco] Residuals in a mixed effect model
Malin, I would double check exactly which residuals you are getting before worrying further about heterscedasticity. In a linear mixed model (Y = Xb + Zu + e), there are two definitions of the residual: Y-X bhat and Y - (X bhat + Z uhat). You want the second to investigate unequal variance of the e's. In mixed models with only random intercepts, this is not an issue because Var(Yi) is constant. When the slope is random, Var(Yi) is not constant. Var(Yi) depends on Xi even when Var u and Var e are constants. I believe resid(mer object) gives you the first (Y-Xbhat), based on faint memory and a quick read of ?'mer-class'. If you do need a stronger transformation, did you try Y^(-1/2) or Y^(-1), both of which are stronger than log(Y) in the Box-Cox family? For any transformation, I would worry much more about linearity of the regression on the transformed scale than about heterogeneity of variances. If the relationship isn't linear, the beta's are at best approximations and at worst irrelevant. You may be able to model the error variance as a function of logRe or a function of something else (sample size in the study). Finally, you may be able / willing to ignore the heterogeneity. David Fletcher and I investigated this in a simpler meta regression problem (each study providing one observation in the meta analysis). That came out in the March 2012 Methods in Ecology and Evolution. Best wishes, Philip Dixon ___ R-sig-ecology mailing list R-sig-ecology@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-ecology
[R-sig-eco] Balanced split plot analysis in R
Markus, The presence or absence of an interaction has no bearing on the choice of analysis. If you're a design-based person, as I am, you have two sizes of experimental unit (main plot and split plot). If you're a model-based person, the multiple measurements (the split plots) on each main plot induces a correlation between split plot observations that needs to be accounted for in the model. Both facts are different way of saying the same thing. The presence or absence of an interaction doesn't change either fact. You're concerned about appropriate comparisons of the main plot marginal means. You're lucky that your design is balanced (equal numbers of observations for each combination of main plot and split plot factors). Otherwise, the appropriate R analysis gets very hard very quickly. (and there is considerable disagreement about what that appropriate analysis should be). My suggestion: Leave the interaction in the model. It is part of the treatment design. Test and do multiple comparisons among the split plot levels using the model you have. If you are comparing cell means, the only correct comparisons are those between disturbance treatments within a grazing treatment. If there is no evidence of an interaction, you're probably not comparing cell means, so that limitation is not an issue. To get comparisons among levels of the main plot treatment: compute the average Y for each main plot, i.e. average over the split plots in each main plot. There is now only one treatment factor (grazing regime) and one blocking factor in the analysis. The error in this analysis is the variability among main plots, which is what you want for comparisons among grazing levels. Straightforward test and multiple comparisons. Best wishes, Philip Dixon ___ R-sig-ecology mailing list R-sig-ecology@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-ecology
[R-sig-eco] Estimating Q10
Scott and Alan, The two forms of your regression: 1) Y = a exp(b x), and 2) log Y = log a + b x are mathematically the same model if you ignore random variation. If you include random variation, these are different models for the variance of the observations around the mean. Carroll and Rupert explore this in the statistical literature (their book on transformations and weighting, and particularly their paper on the difference between three approaches to fit the Michaelis-Menton model). They call this approach 'transform both sides', which might be a useful search term. In 1), observations are assumed to have equal variances. In 2), observations are assumed to have the same coefficient of variation, i.e. variance increases with the predicted value. Because of this, the estimated regression coefficients will not be the same. But, you can use plots of predicted values vs residuals to decide which variance assumption is most reasonable for your data. Philip Dixon ___ R-sig-ecology mailing list R-sig-ecology@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-ecology
[R-sig-eco] Time-varying spatial correlation
Penelope, I agree completely with your wish to keep the model simple. Spatial correlation, as something separate from spatial trend, may be something instrinic to the biology, but it can be a consequence of mis-specification of the model. Along those lines, I would use all the available information to think about the rest of the model before considering the spatial correlation. In particular here, you have two levels of spatial information: the large scale hexagon level, sampled every year, and the small scale location within hexagon level, sampled only once per location. You describe your model as containing fixed effects for year and hexagon-level random coefficients for a linear regression on year. That means your model includes a random intercept for hexagon, so the large-scale spatial trend is already described by your model. Since locations change within the hexagon each year, the small-scale spatail variation is only known after fitting the linear regression for year. Any lack-of-fit there will complicate understanding the spatial correlation. My sense is that you don't have enough of the right sort of data to account for small-scale spatial correlation unless there is some simple model for a spatial covariate that varies within hexagons (e.g depth). Otherwise, I would be strongly tempted to ignore what you're calling spatial correlation under the principle that locations within hexagons are randomly sampled. Philip Dixon ___ R-sig-ecology mailing list R-sig-ecology@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-ecology
[R-sig-eco] Random or repeated?
Kyle, The eco-stat list seems the wrong place to ask this question. The people to ask are the quantitative plant breeders, e.g. at Texas A&M. Some general advice about random effects and correlated observations, for either trait data or ecological data: 1) random effects provide a way to model correlations among errors. So do 'repeated measures'. 2) what matters are the correlations among ERRORS. If I remember the traditional QTL model correctly, errors are independent conditional on the state of the flanking markers. So, you don't need random effects for RIL's, if you're analyzing a single trait at a time. Your observation is correct: the random effect and the marker fixed effects are both trying to model the same thing. The same trait measured in multiple environments is often considered as a multi-trait analysis, to estimate the correlation between pairs of environments. It could also be analyzed as a single trait, with a model that includes fixed effects of environments. Different models answering different questions. 3) yes, pedigrees (or relatedness coefficients) can be used to parameterize a correlation matrix. This provides one way (not the only way) to account for phylogenetic relatedness. Animal breeders use pedigree-based correlation matrices all the time. Philip Dixon ___ R-sig-ecology mailing list R-sig-ecology@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-ecology
[R-sig-eco] Bootstrapping with pseudo-replicates
Johannes, A very good question to ask, but you can't use a bootstrap, or boot(), to investigate it. You can define strata and then bootstrap observations within strata, but all bootstrap data sets will have the same structure as the original data. That's the point of the bootstrap. In your example, you have observations from 4 sites, 1 obs from site 1, 2 from site 2, 1 from site 3, and 2 from site 4. Every stratified bootstrap sample will have 1 from site 1, 2 from site 2, 1 from site 3 and 2 from site 4. I believe you have to construct your own code, probably along the lines of defining a vector for one obs per site, then for each site: extracting the set of pseudoreplicates for one site, using sample() to grab one value from that set, then storing in the vector. Best wishes, Philip Dixon ___ R-sig-ecology mailing list R-sig-ecology@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-ecology
[R-sig-eco] MLEs for wrapped Cauchy
Lene, The problem could be a bad starting value, optim wandering into places it shouldn't, insufficient data to estimate c, or something else. A good way to understand what's going on is to plot profile likelihood against values of c. For a specified value of c, use optim() to find the 2 parameter mle of (mu0, rho0) given c. Evaluating lnl at (\hat mu0, \hat rho0, c) gives you profile lnl(c) = max lnl (over mu0, rho0 given c). Do that for a range of c and plot profile lnl(c) against c. If optim() refuses to work well with the three parameter problem, you can always use optimize() on the profile likelihood. Golden section search (in optimize) for a 1 parameter optimization is more numerically robust than conjugate gradient optimization. Philip Dixon ___ R-sig-ecology mailing list R-sig-ecology@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-ecology
[R-sig-eco] Average Regression and the bootstrap
Johannes, You are doing what many folks call the parametric bootstrap. The bootstrap that resamples the data is the non-parametric bootstrap. Often it is easier to code by hand, as you have done. If you want access to all the helper functions for the bootstrap, you can use boot() in the boot library and specify sim='parametric'. The details of your random uniform distribution go in the function specified in the ran.gen= argument. Once you've generated the bootstrap samples (parametric or nonparametric), there is no difference in subsequent processing. Best wishes, Philip Dixon ___ R-sig-ecology mailing list R-sig-ecology@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-ecology
[R-sig-eco] Bivalve study
Chris, This is an example of a study with 2 factors that is much easier to analyze as a 1 way ANOVA. You have three treatments. The structure of those treatments implies two interesting treatment contrasts: the effect of raking, estimated by the difference between control and procedural control, and the effect of fishing in disturbed plots, estimated by the difference between fished and procedural control. Because these follow from the treatment structure, I would do inference about those differences (tests, confidence intervals) without any adjustment for multiple comparisons. You could also set up the problem as a regression problem with two indicator variables (the raking and fishing variables in your treatment table). I think it is more straight forward to think about means of treatments. A 2 way factorial with interactions will run into serious problems because you do not have all 4 cells of the raking x fishing table. Best wishes, Philip Dixon ___ R-sig-ecology mailing list R-sig-ecology@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-ecology
Re: [R-sig-eco] R-sig-ecology Digest, Vol 34, Issue 5
Matt, Standardization is a transformation of the X variables to mean 0 and variance 1. You do that before you fit the regression or after by multiplying beta_i by sd(X_i). The goal of standardization is to put all X's on the same scale. There are other ways to do that, and (important point) you can (and should) choose the biologically/ecologically most relevant way to put X's on the same scale. Using sd(x) is often reasonable, but there may be something more appropriate. In particular, sd(x) is not appropriate (in my mind) when one of the X's has a very skewed distribution (e.g. mostly values from 1 to 2 and one value of 100). My recommendation to evaluate the "impact" of each X (considered in the usual partial sense for a multiple regression coefficient) is to define equivalent changes for each X variable. E.g. a change in temperature of 5C might be equivalent to a change in altitude of 1000m or a change in latitude of 1degree. Then calculate the change in Y induced by each change in X. Those are 5*beta for temp, 1000*beta for altitude and beta for latitude. Best wishes, Philip Dixon ___ R-sig-ecology mailing list R-sig-ecology@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-ecology