from:"Dixon, Philip M \[STAT\]"

Re: [R-sig-eco] R-sig-ecology Digest, Vol 190, Issue 8

2024-01-23 Thread Dixon, Philip M [STAT]

Rich,

Your specific question was "whether the geometric mean of a set of chemical 
concentrations (e.g., in mg/L) is an appropriate representation of the expected 
value."  The short answer to that specific question is no.

There are various estimators of the expected value of a positively skewed 
distribution:
1) The arithmetic average
2) Assuming a specific distribution, e.g., log normal ( most common but not 
restricted to that), and applying the distribution-specific relationship 
between parameters and expected value.
3) A non-parametric, e.g. "Duan" estimator (very rare).

However, I suggest your question is too narrow.  The better question is "what 
single value best characterizes the location of a skewed distribution"?  If the 
skew is sufficiently large (e.g. large log-scale sd), 90 - 99% of the 
population can be less than the expected value.  In my mind, the answer to this 
question depends on what you "want to do" with the estimated location.
1) summarize the observations - I suggest the median is the best estimator of 
the typical value for an individual observation. For log normal data, the 
geometric mean is an estimate of the median, because the log operator and 
median operator can be interchanged algebraically.
2) estimate a total, e.g. over space or over time.  This is the relevant goal 
when trying to estimate nutrient loading into a receiving body or total release 
of a contaminant.  For this, you do want the expected value.  I have seen 
instances where load/release was estimated by multiplying the geometric mean 
daily release times the number of days.   Justified because "the data are 
skewed so we computed the geometric mean".   I'll leave them nameless because 
the geometric mean computation is totally wrong for this goal (although 
favorable to the polluter because the answer is a smaller value than the 
arithmetic average times number of days).

Best,
Philip Dixon


[[alternative HTML version deleted]]

___
R-sig-ecology mailing list
R-sig-ecology@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-ecology

Re: [R-sig-eco] Determine if a genotype is resistant or susceptible based on insect survival - Possible?

2023-09-20 Thread Dixon, Philip M [STAT]

Marcelo,

I wrote quite a bit based on your code, but then realized that your code 
probably completely misrepresents the data.

Your code, which indicates that 5 is 'alive at the end of the data recording 
period'.  You compute that by whether Alive = 5, in other words you define # 
Alive = 5 represents 5 insects living past that day.  No - because you actually 
have day of death for many of those insects.

Yes, it is possible to compare survival curves among genotypes, but this will 
be very difficult to do carefully because of 3 aspects of your data:
1) 61% of your observations are censored (based on your code, which indicates 
that 5 is 'alive at the end of the data recording period').  When considered by 
genotype, all but one genotype have over 50% censoring.   Hence simple 
summaries like the median survival time at uninformative: median = "alive at 
end of study" for all but one genotype.
2) Your study design produces hierarchical data: insects within leaves within 
plants for each of the 7 genotypes.  Any good analysis will account for that 
structure, e.g. by incorporating random effects.
3) Your data collection scheme produces daily data over 5 days, i.e. did an 
insect die between day x-1 and day x, i.e. is it newly dead on day x for up to 
5 days + alive at end.  These measurements are quite coarse relative to the 
length of the study.

You can certainly estimate survival curves for each genotype but the usual 
standard error and confidence intervals are wrong because of the hierarchical 
sampling.  There are random effects survival models, especially well developed 
for the Cox proportional hazard model, but I suspect these will be difficult to 
fit because of the coarse sampling and large proportion of censored 
observations.

Most of the information in the data seems to be summarized by the proportion 
alive at the end (i.e. whether or not Alive =5).  My simple suggestion would be 
further collapse the data to "alive at the end": yes (Alive = 5), no (Alive < 
5) and counting the # yes per .  Then eliminate one level of random effect by 
counting the # 5's for alive at the end for each leaf (one response per leaf) 
and use a binomial logistic regression with a random effect for plants.

# ---
Much of the previous information still applies, but the comments about median 
death day are completely wrong.

When I looked more closely at the data, "Alive" always decreases over time 
within each leaf.  This suggests that Alive is the # alive on each day.  These 
are the survival curves for each leaf.  To use survival models, you need to 
convert the data to one observation per insect, where the response is the 
number of individuals that died that day.  You can do that by taking the 
difference between Alive values on subsequent days and being careful about 
computing the number alive on the last day.  Because the survival doesn't (to 
my knowledge) allow for case frequencies, you need to then expand that data set 
to one row per insect with its day of death.  Then you can use survival models 
with censoring defined at day = 5.

Best wishes,
Philip Dixon


[[alternative HTML version deleted]]

___
R-sig-ecology mailing list
R-sig-ecology@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-ecology

[R-sig-eco] Modeling variability along a gradient using GAMLSS

2022-10-29 Thread Dixon, Philip M [STAT]

Jeff,

And, if you don't want to dive into Bayesian methods, the GAMLSS package allows 
you to specify models for location, for scale (what you want) and/or for shape. 
 Those models can be parametric or nonparametric, including multiple types of 
splines.

https://www.gamlss.com/books-articles/ provides resources, including links to 
short course notes and a J. Stat Software article.
Best,
Philip


[[alternative HTML version deleted]]

___
R-sig-ecology mailing list
R-sig-ecology@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-ecology

[R-sig-eco] mvabund: Extracting a subset of comparisons between interactions

2020-09-23 Thread Dixon, Philip M [STAT]

Diogo,

In similar situations, I have done the following:
Estimate all pairwise differences between site:season combinations.  Use 
paste() or something equivalent so that the group means encode both site and 
season.  No p-value adjustment at this step.
Post-process that table to extract only the comparisons of interest, e.g. 
comparisons of season means within site or comparison of site means with season.
Determine the number of comparisons you are interested in (likely much, much 
smaller than 498).  Apply a Bonferroni adjustment to your subset of comparisons.

I don't have R code to do this with anova.manyglm(  pairwise.com= ) output.  
Sorry.  Here's an outline of what I would do:
If you save the results from anova.manyglm(), the information you need to 
extract comparisons of interest will be in the row dimnames of 
$pairwise.comp.table.
E.g.:

all.pairs <- anova.manyglm(full_mod, p.uni="adjusted , pairwise.comp = 
Catch$SiteSeason)
pair.names <- dimnames( all.pairs$pairwise.comp.table)[[1]]

Then it's just a matter of using strsplit() three times.  Once to separate the 
two group names (using ' vs ').  Then to each group name to separate the site 
from the season (using '_' or whatever you used as the separator character in 
your paste).
Best,
Philip Dixon


[[alternative HTML version deleted]]

___
R-sig-ecology mailing list
R-sig-ecology@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-ecology

[R-sig-eco] p-values from randomization tests

2020-06-21 Thread Dixon, Philip M [STAT]

One small revision to Eduard's answer to Korla's question.

When you sample a subset of the possible data configurations under the null 
hypothesis (what some call a randomization test), the observed data is 
considered one of the possible samples.  When you generate N (e.g. = 199) 
random samples, you actually have N+1 (= 200) samples.  So the p-value is (R+1) 
/ (N+1), where R is the number of random samples with test statistics equal to 
or more extreme than the observed sample.  So in Korla's analysis, when the 
observed data is more extreme than all the random samples, p = 1/(199+1).  The 
+1 issue is why you frequently see N=99, N=199, N=999, or even N= random 
samples.

If you enumerate all possible data configurations under the null (what some 
call a permutation test), the observed data is automatically included as one 
sample in the set of possible samples.  Then p = R/N, because the observed 
value of the test statistic occurs at least once in {R} and that sample is 
included by definition in {N}.

The spirit of Eduard's answer is spot-on.  The p-values are the same because 
every pairwise comparison is more extreme than anything in the randomized set.

Best,
Philip Dixon


[[alternative HTML version deleted]]

___
R-sig-ecology mailing list
R-sig-ecology@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-ecology

[R-sig-eco] Predicting using spatial filtering

2020-05-24 Thread Dixon, Philip M [STAT]

Maarten,

Fascinating question!  Yes, you should not see that behaviour.   Median has 
nothing to do with the issue, because you should get linear relationships for 
any constant value of the non-essential covariates.

And, thanks for telling me about SpatialFiltering.  I have been using PCNM() in 
the vegan  package to extract Moran Eigenvector Maps.  I use spdep for lots of 
things but hadn't (yet) discovered SpatialFiltering().  Good to find out about 
other implementations!

For the benefit of the list, Stephane Dray has a wonderful vignette on the use 
of Moran Eigenvector Maps (MEM) to "pull" spatial covariance out of the error 
covariance into the fixed effect model structure.  Pedro Peres-Neto used the 
name principal components of neighbor matrices when he popularized the method 
in the ecological literature but MEM is earlier.  Stephane uses functions in 
adespatial, but the principles apply equally well to the vegan implementation 
(pcnm) or spdep implementation (SpatialFiltering).   Her vignette is  
https://cran.r-project.org/web/packages/adespatial/vignettes/tutorial.html

I explored various possible explanations for your issue without success.  These 
included:
High correlation between elev and intercept
Coding the factor variable (soil)
All were ruled out when I constructed alternatives that showed the same problem.

Eventually, I decided that the issue was something subtle with predict when you 
have a matrix argument (EV_fit1 in your code) in the formula.  It looks like 
predict then ignores the newdata= argument and gives you predictions for the 
original observations.  Since those don't have the same values for 
"non-essential" covariates, it is no surprise that you get a random plot.  This 
is not apparent when you predict for the same number of observations as the 
original data set, but it is immediately apparent when you have a different 
number of rows in the newdata.

I simplified the problem by extracting the first two MEM eigenvectors and 
explicitly naming them in the formula.  Here's the code (using base graphics).  
It gives straight lines for the prediction vs elev plot, as expected.  This is 
not a reprex.  It piggy-backs on your code.

meuse$sp1 <- EV_fit1[,1]
meuse$sp2 <- EV_fit1[,2]

Model_example<-lm(cadmium ~ elev + dist  + factor(soil) + sp1 + sp2,
  data=meuse)

# bad predictions using the meuse data
meuse$predmod1 <- predict(Model_example, type="response")

with(meuse, plot(elev, predmod1, col=as.integer(soil)))
# this plot shows all sorts of variation vs elev because the other covariates 
are not constant

# predictions using a newly constructed data set
testdata <- expand.grid(elev=seq(5,10, 0.1), soil=1:3)
testdata$soil <- factor(testdata$soil)

# You can choose any constant you like for dist, sp1 and sp2
testdata$dist <- 0.211
testdata$sp1 <- median(meuse$sp1)
testdata$sp2 <- median(meuse$sp2)

testdata$predmod1 <- predict(Model_example, newdata=testdata)

with(testdata, plot(elev, predmod1, col=as.integer(soil)))
# this plot shows straight lines vs elev because other covariates are held 
constant, as expected

Best,
Philip Dixon


[[alternative HTML version deleted]]

___
R-sig-ecology mailing list
R-sig-ecology@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-ecology

[R-sig-eco] Simulating spatial point patterns that have spatial structure

2020-04-06 Thread Dixon, Philip M [STAT]

Liudas,

It sounds like you are trying to do what is called pattern reconstruction.  
That is a method to simulate a point pattern that is constrained in one or more 
ways.  The possible ways include the intensity function and the K or L 
function.  The original paper is Tscheschel and Stoyan, 2006,  Statistical 
reconstruction of random point patterns, Computational Statistics and Data 
Analysis 51:859-871.  A textbook treatment is in Wiegand and Moloney, 2014, 
Handbook of spatial point-pattern analysis in ecology, CRC Press.  The methods 
are implemented in the R shar package.  Look for the reconstruct functions.

In response to your comments, I note that matching the second order properties 
(K function, variogram) usually results in a pattern that "appears" quite 
different, in the sense that locations of high intensity are in different 
places.  That's to be expected because matching the correlation structure does 
not mean matching the intensity.

I also note that when you match on a lot of characteristics, you will always 
simulate patterns that look a lot like your original one because patterns can 
only differ in so many ways.

Best wishes,
Philip Dixon


[[alternative HTML version deleted]]

___
R-sig-ecology mailing list
R-sig-ecology@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-ecology

[R-sig-eco] Thoughts on ordinal regression

2019-07-18 Thread Dixon, Philip M [STAT]

Mahnaz,

You only need to send one request.  Many folks do not check the list daily and 
I usually wait to see if others respond before I respond.

Likert scale responses are among the most difficult types of data to analyze 
well.  Many folks use approximate analyses.

A good textbook treatment of ordinal data analysis is the relevant chapters in 
Frank Harrell's book on Regression modeling strategies.

My go-to reference for categorical predictors is Shah and Madden, who deal with 
plant disease ratings on an ordinal 0-10 scale.
Title: Nonparametric analysis of ordinal data in designed factorial experiments
Author(s): Shah, DA (Shah, DA); Madden, LV (Madden, LV)
Source: PHYTOPATHOLOGY  Volume: 94  Issue: 1  Pages: 33-43  DOI: 
10.1094/PHYTO.2004.94.1.33  Published: JAN 2004

Since your response is the average of 5 Likert scores, you may be able to use 
standard methods for Gaussian data.  The important issues are:
1) the analysis assumes the scale is linear.  I.e. the average of 2 and 4 is 
the same (in a subject matter sense) as the average of 1 and 5.  Also, the 
change between 2 and 3 is the same (in a biological sense) as the change 
between 4 and 5.  As you can tell, the assumption of linearity has consequences 
both for averaging 5 responses to get one summary score and for the form of the 
regression model.
2)  that the variance of the response is smaller when the mean is close to 1 or 
close to 5.  This may not be a problem if most of the average responses are 
between 2 and 3.  You may decide to do an approximate analysis and ignore the 
unequal variances.

Best wishes and good luck!
Philip Dixon


[[alternative HTML version deleted]]

___
R-sig-ecology mailing list
R-sig-ecology@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-ecology

[R-sig-eco] Doing repeated measures on a randomized block design

2019-06-16 Thread Dixon, Philip M [STAT]

Rick,

There are (at least) two ways to set up a model for your experiment.  The 
difference is a detail of your sampling.  The two most likely schemes are:
1)  Each time you sampled, you went to the same branch (or perhaps same 
leaf).  Here, the repeated measurements are on the branch, not the tree.  
Exposure (N/S) is characteristic of the branch.  In terms of the model 
variables, a branch is identified by combination of tree and exposure.  This 
leads to the traditional between/within repeated measures design.  If you 
assume compound symmetry (nested random effects), the lmer model  would be FvFm 
~ Tree + Exposure + Date + Date:Exposure + (1|Tree:Exposure)
2)  Each time you sampled, you went to different branches.  Now the only 
level of repeatedness is the tree.  Now there is one observation per branch; 
each is characterized by exposure and date.  The model is simply a 2 way 
factorial inside blocks.  If you wanted to set up something like a compound 
symmetry model, you would have tree and observation as the random effects.  
Since there are only 4 trees, I would treat trees as fixed. The lmer model is 
FvFm ~ Tree + Exposure + Date + Date:Exposure 

With 18 dates, I would stongly advice against a compound symmetry (split plot 
in time) model.  There is probably some structure to the correlations.  This 
means using lme() with something like a corAR1() correlation structure.  You 
can add that to either model, (to get CS+AR(1) for the first option or AR(1) in 
blocks for the second).  Or you could a "just an AR(1)" model for the first 
option by replace (1|Tree:Exposure) with the corAR1().
Best,
Philip Dixon

___
R-sig-ecology mailing list
R-sig-ecology@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-ecology

[R-sig-eco] Which R package for Second-Stage nMDS ?

2019-03-22 Thread Dixon, Philip M [STAT]

Pierre,

I don't know a function that does this, but it is extremely easy to code.

Dist objects are vectors containing the 1st stage pairwise dissimilarities.   
Call those dist1, dist2, dist3, ...  So alldist <- cbind(dist1=dist1, 
dist2=dist2, ...) will assemble the matrix of dissimilarities with useful 
column names.  stage2 <- as.dist(1-cor(alldist)) will compute the matrix of 
correlations, convert from similarity (the correlation) to distance 
(1-correlation) and convert to a distance object.  Then just run your favorite 
MDS on stage2.  

Note: sometimes folks prefer sqrt(1-cor) as the "correlation distance", instead 
of 1-cor.  I don't know which Clarke prefers.

Best,
Philip Dixon

___
R-sig-ecology mailing list
R-sig-ecology@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-ecology

[R-sig-eco] Regression when Y has an estimation variance

2019-01-14 Thread Dixon, Philip M [STAT]

Roy,

One relevant literature is that on meta-regression (a generalization of 
meta-analysis).  There is a very good handbook by Koricheva, Gurevitch and 
Mengerson.  Meta analysis mostly deals with Gaussian responses (or 
transformable to approximately Gaussian).  If there has been any work on 
non-gaussian responses, I expect it would be summarized in Koricheva.  

The metafor package is one implementation specifically for meta analysis.  A 
resource on metafor and other R packages is Schwarzer and Carpenter, 
Meta-analysis with R.  

Other programs can also fit the models as mixed models with heterogenerous 
specified variances, lme() and lmer() do not let you do this.  lmer() doesn't 
allow heterogeneous variances; lme() does, but only of the form k_i*sigma^2 
(i.e. variances relative to a common scaling factor, not absolutely specified 
variances).   For a meta analysis, you need to specify the absolute variance 
for each estimate.  If someone knows how to trick lme() to use exactly the 
error variances that have been specified, I would love to hear about it. 

Best,
Philip Dixon

___
R-sig-ecology mailing list
R-sig-ecology@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-ecology

Re: [R-sig-eco] mixed model with proportion data

2017-03-08 Thread Dixon, Philip M [STAT]

Mariano,

There is a huge and important difference between the two approaches suggested 
for your data.  The log ratio of proportions (i.e. the empirical logit of the 
Yes proportion) estimates the residual variance.  The binomial model assumes 
the residual variance is determined by the arbitrary (and made-up) sample size 
of 20 "tries" per response, in combination with the estimated mean proportions. 
 To see the arbitrariness, if you don't already, re-express your proportions 
out of 200, instead of 20, because 0/200, 10/200, ... 200/200 also give your 
observed responses.  The coefficient estimates will be the approximately same 
but their variances will not.  (If you didn't have additional random effects in 
the model, the coefficient estimates would be exactly the same but the 
variances would be 1/10's those from N=20).

If you are going to use the binomial GLM, I believe you must add overdispersion 
to the model.  Either as an individual random effect, or by using a 
quasibinomial response distribution.  Overdispersion is not necessary for the 
log proportion response because the residual error variance conceptually 
estimates that overdispersion.

Philip 

___
R-sig-ecology mailing list
R-sig-ecology@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-ecology

[R-sig-eco] Permutation test for SAR model

2016-12-03 Thread Dixon, Philip M [STAT]

Feng,

You say you are interested in the relationship between two variables, so the 
quantity you want to assess is something that describes that relationship.  
Usually that is the regression slope.

A much more serious problem is: how are you going to generate random 
permutations that honor the spatial correlation implicit in a SAR model?  You 
can not simply permute X variables, Y variables, or residuals because 
permutation assumes exchangability, a slightly more general version of 
independence.  You might consider toroidal rotations, but that makes other 
assumptions (e.g. 1st order stationarity) Pesarin has been developing 
permutation tests for complex data (see his book with that title).  I do not 
know whether he has considered data with a SAR structure.

Best,
Philip Dixon

___
R-sig-ecology mailing list
R-sig-ecology@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-ecology

Re: [R-sig-eco] Periodic regression

2016-11-22 Thread Dixon, Philip M [STAT]

Thiago,

I would not try to interpret each periodic component separately.  That linear 
regression model is just a convenient way to fit a model with a phase-shifted 
cosine curve:

log (mean index) = b0 + A*cos(c*time - theta)

The amplitude of the oscillation is A; the phase shift is theta.  Those can be 
estimated from the regression coefficients in the linear model:
log (mean index) = b0 + b1 cos(c*t) + b2 sin(c*t)

as:
A = sqrt(b1^2 + b2^2)
theta = atan(b2/b1)

Here's R code to implement these calculations:
tad.lm<-glm(index~(cos(2*pi*(hora/24)))+(sin(2*pi*(hora/24))), 
  family=quasipoisson)
tad.coef <- coef(tad.lm)[2:3]

A <- sqrt(sum(tad.coef^2))   # where tad.coef is only periodic components
phase <- atan2(tad.coef[2],tad.coef[1])
#  in radians, in the appropriate quadrant

maxhora <- phase*24/(2*pi)
maxhora <- ifelse(maxhora < 0, 24+maxhora, maxhora)
#  and in hours, shifting by 24 hr if negative

Standard errors are obtained by the delta method.
Confidence intervals are best obtained by bootstrapping.  You should think hard 
about what sort of bootstrap to use (observations, standardized residuals, 
parametric on the estimates, parametric on the observations).

Chester Bliss's pamphlet on Periodic regression in Biology and ?? is a good 
introduction, but remember that his computing resources is 60 years old.

The test of "no aggregation" that I suggested (one of many possibilities) = 
test of "no periodic components".  That is the test that both b1 = 0 and b2 = 0 
in the linear model above.  Can use either a drop in quasi-deviance or a C Beta 
test on the estimates.  Asymptotically equivalent; usually similar but rarely 
identical in practical samples.  

R code is:

tad.lm<-glm(index~(cos(2*pi*(hora/24)))+(sin(2*pi*(hora/24))), 
  family=quasipoisson)
tad.lm0 <- glm(index~1, family=quasipoisson)

# F (drop in quasi deviance) test for both periodic coefficients = 0
anova(tad.lm0, tad.lm, test='F')

# C Beta test for both periodic coefficients = 0

tad.coef <- coef(tad.lm)[2:3]
tad.vc <- vcov(tad.lm)[2:3,2:3]
# extract coefficients and variance-covariance matrix for periodic terms
#  i.e. dropping the estimated intercept

npar <- length(tad.coef)
nobs <- length(hora)
# extract # parameters in test and # observations in data set

tad.f <- (t(tad.coef) %*% solve(tad.vc) %*% tad.coef) / npar
c(F = tad.f, probF=pf(tad.f, npar, nobs-(npar+1), lower=F))
 
Unless there are issues of general interest to the list, I suggest any further 
discussion be done off-list.

Best wishes,
Philip Dixon

___
R-sig-ecology mailing list
R-sig-ecology@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-ecology

[R-sig-eco] Test for circular data

2016-11-18 Thread Dixon, Philip M [STAT]

Thiago,

I can see why you're not having much luck in the usual circular statistics 
literature, which focuses on responses that are circular quantities.  In your 
case, the explanatory variable (time) is circular. 

I would answer your second question by fitting a periodic regression, i.e. 
using sin(2 pi* Time / 24) and cos(2 pi*Time/24) as the two X variables.  
Choose the appropriate distribution for your response and include other fixed 
or random effects as appropriate.   The amplitude and phase of the apparent 
oscillation can be computed from the regression coefficients.  That model 
allows one peak per 24 hours.  You can add more components or change the 
frequency as appropriate for the biology and your hypothesis about the 
within-day pattern.  The period model also provides a simple test for 
'aggregation' (i.e., test both periodic coefficients = 0).  There are many 
other possible tests of aggregation, depending on the form of the presumed 'no 
aggregation' distribution of your response.

That model is the temporal equivalent of a simple linear regression model: 
simple, able to detect various forms of departure from a constant mean, but 
definitely not an exact description of the pattern over time.  The model can be 
made more flexible by using a spline function that is appropriately smooth 
across the 24:00 hr to the 0:30 period.  Look in the smoothing spline and/or 
GAM literature if you want to explore this.

Philip

___
R-sig-ecology mailing list
R-sig-ecology@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-ecology

[R-sig-eco] Back transforming standardized model parameters when there is an interaction

2016-01-12 Thread Dixon, Philip M [STAT]

Matt,

The interaction term is why the usual transformation doesn't work, at least 
initially.

The R parameterization of factor effects makes it easy to fall into the trap of 
thinking that the coefficient for a main effect "means" something when there is 
an interaction in the model.  Sometimes yes, sometimes no.

The best way to untangle model effects when there are factors in the model, 
especially when a factor changes the slope (because the model includes 
air.ice:time) is to work out the regression model for each level of the factor 
(i.e. for air then for ice).

Using the unstandardized version to get illustrate:
air.ice, assuming you used the default contrasts, see options("contrasts"), has 
the value of 0 for air and 1 for ice
so the regression line for air is E Y = 12.52 + 0.0081 time - 0.8019 * 0 - 
0.00442*0*time =  12.52 + 0.0081 time
and that for ice is is E Y = 12.52 + 0.0081 time - 0.8019 * 1 - 0.00442*1*time 
= 11.71 + 0.0037 time

Two of the coefficients in the full model correspond to coefficients for a 
factor-specific regression line ONLY because the factor variable has the value 
0 for the baseline level.

If you only standardize time (your model 3), the same calculation works because 
you haven't changed the air.ice variable.
You should be able to take the coefficients, apply the backtransformation 
suggested by Drew to each regression line separately and get the same 
coefficients when time is the X variable.

It looks like standardize also centers (or centers and standardizes) the 
air.ice variable.  The standardized value for air is no longer 0 and that for 
ice is no longer 1.  Probably something like -0.5 and 0.5 (if centered only 
with equal sample sizes).  I will assume -0.5 for air and 0.5 for ice to 
illustrate.  That means the factor-specific regression lines are now:
for air: E Y = 12.48 + 0.508 z.time - 1.07 * (-0.5) - 0.38 * (-0.5)*z.time
for ice: E Y = 12.48 + 0.508 z.time - 1.07 * (0.5) - 0.38 * (0.5)*z.time

After standardizing, none of the model coefficients can be interpreted as a 
coefficient for some baseline level.  You have to unpack the factor effects.  
Note: you get similar behavior if you use sum-to-zero contrasts.  And, if you 
use contr.SAS(), the last factor level is used as the baseline group.

Aside: the different intercept is no surprise because it is EY at a different 
timevalue (the average time, not time=0) and the different slope is no surprise 
because one unit change in standardized time is different than one unit change 
in raw time.

So, I suggest you look at how the variables are coded by standardize().  
model.matrix() will give you this information.  Use that to calculate the 
coefficients for each regression line.

Best wishes,
Philip


[[alternative HTML version deleted]]

___
R-sig-ecology mailing list
R-sig-ecology@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-ecology

[R-sig-eco] evaulating multiple responses using Chi-square

2015-06-11 Thread Dixon, Philip M [STAT]

Yes, the idea has been considered, by no less than RA Fisher.  If you want a 
name, it is sometimes known as Fisher's combined probability test.  Usually 
applied to one-sided tests of the same null hypothesis.   Yes, the p-values can 
come from different "sources".  Before using it in the community context, I 
would want to think carefully about what it tells me when applied to two-sided 
tests of different species.

Philip


[[alternative HTML version deleted]]

___
R-sig-ecology mailing list
R-sig-ecology@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-ecology

[R-sig-eco] Fitting a Weibull with a 0 dose

2015-01-04 Thread Dixon, Philip M [STAT]

Katie,

I second Ben's suggestion to use functions in the drc library, not nls().  Not 
only do those functions model binomial or overdispersed binomial data, but they 
provide estimates and confidence intervals for the EC50 and any other EC, e.g. 
EC10.

However, neither handles x=0 very well.  The problem is the way the Weibull 
function is coded.  Instead of x^power, both use exp(power*log(x) ...).  The 
first is defined for x=0, the second is not.  

A work around is to use a small positive value for x, e.g. 0.01, as you 
suggest.  If there is some background level of x, so experimental dose = 0  is 
really exposure to a small background concentration, that background 
concentration is an easily justified replacement for 0.  If not, I rerun the 
analysis with different choices of small positive value.  If the low dose 
asymptote is well determined by the data, you should get similar parameter 
values and estimated ECx's for any small positive value.  This is a robustness 
check to make sure that an arbitrary choice doesn't affect the results.

Best wishes,
Philip 

___
R-sig-ecology mailing list
R-sig-ecology@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-ecology

[R-sig-eco] Logistic regression with 2 categorical predictors

2014-10-24 Thread Dixon, Philip M [STAT]

Andrew,

Your problems arise from two issues: 
1) The overall test of equal P[prefer] in each age and the tests of equal 
P[prefer] in a specific pair of ages  are based on different testing 
philosophies (Score or LR test for the overall test and Wald test for the 
pairwise comparisons).
2) The behaviour of the logit function when one empirical probability is 0 (or 
1).
You are absolutely correct when you highlight the ages with all avoid.  

If you want  just my suggestion for how to conduct informative pairwise 
comparisons: Do 15 two-by-two Chi-square tests, or 15 runs of glm with only two 
ages in each analysis.  Because there is no pooling of information from ages 
uninvolved in the null hypothesis, these 2x2 tests are exactly the same as the 
pairwise hypothesis for the full 6-age data.  I don't know whether there is a R 
packge to automate this. I've always written a pair of for loops and used 
subset= with the appropriate logical in the glm().

Details (which are a very very small glimpse of a huge area):  Fleiss, analysis 
of counts and proportions, and the tome: Agresti, categorical data analysis are 
extended treatments).  It doesn't help that the analysis of this type of data 
has some non-trivial complications.  Upton, in his classic JRSS discussion 
paper on the 2x2 contingency table, said something like the 2x2 contingency 
table was a seemingly simple problem that had a surprisingly large number of 
complications.

1) Why is a 2x2 contingency table comparison of ages 1 and 2 the same as the 
pairwise comparison of ages 1 and 2 in an analysis of all 6 ages.  When the 
data are normally distributed and analyzed by ANOVA, the error variance is 
(usually) estimated from all 6 groups.  The analysis of just two groups 
estimates the error variance from fewer observations, hence the 6-group pooled 
error variance provides improved power for the comparison of groups 1 and 2.  
The null hypothesis in both cases specifies equality of ages 1 and 2, but all 6 
groups provide information about the error variance.  With count data, there is 
no error variance that has to be estimated.  (I'm not considering 
overdispersion, which if present would be estimated from all 6 groups, so the 
2x2 and "6x2 followed by a pairwise comparison" analyses are no longer 
equivalent).  Again, the null hypothesis is the same in both analyses.  One way 
to consider the specific comparison of ages 1 and 2 in the 6-age analysis is !
 that you are fitting a model with 5 probabilities to the data: one probability 
that is shared by ages 1 and 2 (hence they have equal probabilities) and four 
more probabilities, one each for ages 3 through 6.   The fit of that 5 
parameter model is compared to the fit of a 6 parameter model in which each age 
has its own probability.  Both the 2 x 2 and the pairwise comparison are 1 df 
tests.  In neither case, can the comparison of ages 1 and 2 benefit from any 
information from the other four groups.

2) What's going on the anova(model_sg, test="Chisq") analysis?  This is a 
comparison of two models.  The data and the models can be set up in at least 
three asymptotically equivalent ways (see Agresti and probably Fleiss for 
"sampling models for the contingency table" for the details).  I'll use the 
independent binomial model.  That model has up to 6 parameters, i.e. a 
P[prefer] for each age, where P[prefer] is the probability that a fish in a 
specified age group has a "Prefer" response, not an "Avoid" response.  The null 
hypothesis (reduced) model is that each age has the same probability, so there 
is one parameter.  The alternative (full) model is that each age has a 
different probability.  Yes, the alternative model is fully specified.  Yes, 
the alternative model has 0 "error" df.  When you fit these models using glm, 
you are using deviance = -2 log likelihood as the measure of fit.  Yes, the 
deviance is 0 for the full model (under the usual way of writing the log 
likelih!
 ood).  That is exactly how things should be.  The comparison between these two 
models is a measure of whether the reduced model fits the data.  (semi-Aside: 
You can write the log-likelihood for the full model so the deviance is not 0.  
If so, the log-likelihoods for both the null and alternative models are shifted 
by the same additive constant so the change in deviance is the same value for 
either way to write the log-likelihood).  This comparison of models is exactly 
the same comparison done by the usual (K-1) df Chi-square test of equal 
proportions in a Kx2 contingency table.  The only difference is that the 
Chi-square test uses the Pearson Chi-square statistic as the measure of how 
well  a model fits the data; glm uses deviance.  We regularly calculate the 
Chi-square statistic for the null hypothesis.  We never explicitly calculate 
Chi-square for the alternative hypothesis, because Chi-square for the full 
model is exactly 0 with exactly 0 df.  The change between the two !
 models is the usual C

[R-sig-eco] Help obtaining F statistics

2014-09-16 Thread Dixon, Philip M [STAT]

Luis,

The Analysis of deviance table you provided indicates that your data analysis 
has a serious issue, one much larger than calculating an F (or Chi-square, 
which is more common for GLM analyses) statistic.

The df for the sex:behavior interaction in your table is 7.  It should be 8, 
the product of sex df and behavior df.  I'm almost certain that one combination 
of sex and behavior does not occur in your data.  Appropriate analyses of data 
with missing cells is almost always very difficult.  You have to think 
carefully about the interpretation of each of the model effects (or at least, 
the p-values you got do not answer questions that I care about).  I recommend 
you find a local statistician to advise, because this is not an R issue.

Best wishes,
Philip Dixon


[[alternative HTML version deleted]]

___
R-sig-ecology mailing list
R-sig-ecology@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-ecology

[R-sig-eco] Seasonal Mann Kendall with multiple detection limits

2014-07-28 Thread Dixon, Philip M [STAT]

Tim,

The issue with multiple detection limits with Kendall's tau or the Seasonal 
Mann-Kendall variation is computational, not conceptual.  The strong advantage 
of Kendall's tau, over rank-based methods, is that tau handles multiple 
detection limits in a common-sense way.  The pair (5, 10) can be ranked, the 
pair (<5, 10) can be ranked, the pair (<5, 3) can not be ranked, and the pair 
(<5, <6) can not be ranked.  Having said this, I don't believe either the NADA 
or the current version of EnvStats implementations handles multiple detection 
limits.  I've always done this by hand when needed, but that code is written 
for specific data sets and unlikely to be usable by anyone else.  ( I will talk 
to Steve Millard about adding tau with multiple censoring limits to EnvStats, 
because we are writing a 2nd edition of the text to go with the R library).

If you want to use imputation (what you are doing by simulating values for the 
censored observations), you need to use multiple imputation (i.e., generating 
more than one simulated data set).   I didn't look at your code carefully 
enough to tell if it currently does multiple imputation.  Rubin's books (or web 
links to multiple imputation) will show you how to combine results from the 
multiple imputations.  If you only simulate one data set, your conclusions are 
about that specific simulated data set, not a Monte-Carlo approximation to 
conclusions from the data at hand with censoring.

Best wishes,
Philip


[[alternative HTML version deleted]]

___
R-sig-ecology mailing list
R-sig-ecology@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-ecology

[R-sig-eco] xPOD statistic

2014-01-26 Thread Dixon, Philip M [STAT]

Alexandre,

I don't know any prewritten function to calculate the collection of xPOD 
statisitics for a multiptype point pattern.  However, it won't be difficult to 
code that yourself using functions in the spatstat library.

Brown et al. define A_{ij} for each pair of species as Int( log g_{ij}(r) dr, 
where g_{ij}(r) is the pair correlation function for species i and j at 
distance r.  That is numerically approximated by a sum of log \hat{g}_{ij}(r) 
computed at 10 (their suggestion) values of r.

spatstat provides the pcf() function to compute g(r) and the alltypes() 
function to compute g(r) for all combinations of pairs of species.  After that, 
it is just a matter of manipulating the function value array (see fasp()).

Best wishes,
Philip Dixon

___
R-sig-ecology mailing list
R-sig-ecology@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-ecology

[R-sig-eco] NA error in envfit

2013-12-05 Thread Dixon, Philip M [STAT]

Kendra,

I wonder if the problem is a factor level with no observations.  One of the 
frustrating things about factors (class variables) in R is that the list of 
levels is stored separately from the data.  This can cause all sorts of 
problems if you create the factor, then subset the data, and the subset is 
missing one or more levels of the factor.  You are subsetting your data, so 
this may be the source of the problem.

My working philosophy is to keep variables as character strings or numbers 
until just before I need the factors.  That avoids any issues with extraneous 
levels.  That means reading data sets (.txt or .csv files) with as.is=TRUE to 
avoid default creation of factors.  relevel() may recreate the list of levels.  
I usually use factor(as.character(variable)) to flip a factor to a vector of 
character strings then back to a factor with the correct set of levels.

Best wishes,
Philip Dixon


[[alternative HTML version deleted]]

___
R-sig-ecology mailing list
R-sig-ecology@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-ecology

[R-sig-eco] Modeling overdispersed count data

2013-11-27 Thread Dixon, Philip M [STAT]

Bayesians commonly use the log normal to account for overdispersion.  I learnt 
the trick from Ben Bolker.  His book may be a citable source.

Philip Dixon

___
R-sig-ecology mailing list
R-sig-ecology@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-ecology

[R-sig-eco] Observation-level random effects to account for overdispersion

2013-11-27 Thread Dixon, Philip M [STAT]

All,

I am a real fan of using observation-level random effects to account for 
overdispersion with count data.  Using the default log link, an 
observation-level normally-distributed random effect models the data as a 
convolution of a Poisson and log-normal distribution.  The negative binomial is 
the convolution of a Poisson and gamma distribution.  For all practical 
purposes, the log-normal and gamma distribution are indistinguishable.  
Computationally, the normal model is much easier especially if you have 
additional normally distributed random effects.  The sum of normals is easy to 
model; the sum of some normals and a log gamma is not.

Philip Dixon

___
R-sig-ecology mailing list
R-sig-ecology@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-ecology

[R-sig-eco] Autocorrelation structures and simplicity of analysis

2013-07-16 Thread Dixon, Philip M [STAT]

Jonathan,

The AR(1) + compound symmetry model does make sense.  It is a model where the 
correlation over time declines with the separation between times points, but it 
does not decline to zero.  There is a constant non-zero (at least potentially) 
correlation arising from the compound symmetry (constant correlation between 
all pairs of observations on the same plot) part of the model.  I have found it 
fits many data sets better than either AR(1) only or CS only.

However, I suggest you completely rethink the analysis.  I am a great fan of 
the approach recommended by Paul Murtaugh a few years ago.  The paper is 
"Simplicity and complexity in ecological data analysis" Author(s): Murtaugh, 
Paul A.  Source: ECOLOGY  Volume: 88   Issue: 1   Pages: 56-62   Published: JAN 
2007

Your situation is almost identical to one he considers in his paper.  His 
advice: remember that you have experimental units randomly assigned to warming 
treatments.  Your analysis currently ignores time, except for accounting for 
the autocorrelation.  You seem to care only about the treatment effect.  If 
that's the case, you don't need to worry about days.  Just focus the analysis 
on plots and the treatment effect.  Specifically, figure out an informative 
summary statistic for each plot (e.g., average over time, median over time), 
calculate those 12 numbers, then analyze those.  No need to worry about the 
autocorrelation structure then!

The beauty of this approach is that if you want to ask a different question, 
e.g. about day-day variability instead of the average, you just change your 
summary statistic to something that measures variability, e.g. log sd.

Best wishes,
Philip Dixon

___
R-sig-ecology mailing list
R-sig-ecology@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-ecology

[R-sig-eco] Logistic function between arbitrary parameter values

2013-06-19 Thread Dixon, Philip M [STAT]

Phil,

One reasonable solution is add two additional parameters for the bottom and top 
(or bottom and span) of your logistic.:

Your function: K / (1 + ((K - x.0)/x.0) * exp(-r * x))  
becomesalpha + beta*K / (1 + ((K - x.0)/x.0) * exp(-r * x)),

This logistic spans Y values from -alpha to -alpha + beta.
You may have trouble fitting it, because Yhat will be undefined if alpha is not 
large enough and I would be surprised if there were high collinearity between 
\hat\alpha and \hat\x.0.  However, the choices of alpha and beta are no longer 
arbitrary.

Best wishes,
Philip Dixon

___
R-sig-ecology mailing list
R-sig-ecology@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-ecology

[R-sig-eco] Offsets in Poisson or Neg. Bin regression

2013-06-15 Thread Dixon, Philip M [STAT]

Matias,

The situation when you most want to use an offset is when FECUND differs across 
individuals.  If FECUND is a constant for all observations, you could ignore it 
if you chose to.  If you did that, its constant effect gets rolled into the 
intercept.

When it's not constant, the logic is:

log(mu_i / F_i) = your model   <==> Log(mu_i) = your model + log(F_i)
where mu_i is the mean count and F_i is the fecundity, both for individual i.  

When F_i is a constant = F, the logic is:
log(mu_i / F) = your model   <==> Log(mu_i) = B0 + rest of your model + log(F)  
<==> B0 + log(F)  + rest of your model 
where B0 + log(F) is the new "intercept"

Best wishes,
Philip Dixon

___
R-sig-ecology mailing list
R-sig-ecology@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-ecology

[R-sig-eco] Hurdle model: singular variance-covariance matrix

2012-12-01 Thread Dixon, Philip M [STAT]

Yaiza,

The problem is that two coefficients are perfectly correlated with each other.  
Since this happens in the models with interactions but not additive effects, my 
hunch is that the cross-product variable is very highly correlated with one of 
the original variables.  This can happen if one variable has a large mean and 
large variability and the second has a large mean and small variability.  If 
so, this is a numerical issue, not a conceptual, statistical or ecological 
issue.

I suggest you center all X variables to mean 0.  That often clears up numerical 
issues.  If it doesn't standardizing each to unit variance can also help.  
Centering the X variables also has the statistical advantage that the intercept 
is directly interpretable.

Best wishes,
Philip Dixon

___
R-sig-ecology mailing list
R-sig-ecology@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-ecology

[R-sig-eco] Zuur / Pinierho / Faraway

2012-11-29 Thread Dixon, Philip M [STAT]

I agree with all the previous comments and second Tom's recommendations of 
Faraway as an 'in between' Zuur and Piniehro & Bates.  One thing to be careful 
of: While the advice in Faraway is sound, there are more than a few mistakes in 
his equations.

Philip Dixon

___
R-sig-ecology mailing list
R-sig-ecology@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-ecology

[R-sig-eco] Clustering estimated regression coefficients

2012-10-23 Thread Dixon, Philip M [STAT]

Chris,

I think hclust will work just fine; you just have to construct an appropriate 
distance matrix from the coefficients.  A major part of the beauty of 
distance-based analyses is that you have to choose a distance measure.  When 
you think hard about that choice, the analysis better represents what you want 
to know.

First, variables absent from a model have 0 values for their coefficients, not 
NA.  A model that omits X3 says there is no relationship between Y and X3, 
hence 0 for that beta.

What would be an appropriate distance matrix for your goals?  Those are " I 
would like to cluster the data so that observations are grouped according to: 
the similarity in the direction of the relationship ( i.e. +ve or -ve) and the 
presence/absence of variables."
I'm not sure if these are two criteria leading to two clusterings or an 
aggregate criterion.  Easiest if they are two separate criteria.  For 
similarity in direction, use a simple matching coefficient (proportion of 
coefficients in the same direction).  Ignore any coefficient with 0 in either 
model (NA's may simplify this computation).  For presence/absence, again a 
simple matching coefficient (proportion of coefficients present in both models 
or absent in both models).   You could easily elaborate on these distance 
measures, e.g. by incorporating magnitude of coefficient.  That's where the 
thinking becomes really important.

Once you have an appropriate distance matrix (or matrices), then you turn the 
hclust crank.

Philip Dixon

___
R-sig-ecology mailing list
R-sig-ecology@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-ecology

[R-sig-eco] Measuring community similarity when total abundance differs

2012-08-04 Thread Dixon, Philip M [STAT]

Tim,

I don't know of any automatic way to deal with unequal sampling effort.  Even 
if there were, I would argue strongly that you need to think carefully about 
the data and how they relate to how you want to define 'similar' and 'less 
similar' community composition.  

Fundamentally, this is an issue about how to define the distance between two 
vectors of species abundances.  Bray-Curtis is sensitive to differences in 
total abundance.  Perhaps the difference between a sample with 100 individuals 
and a sample with 12000 individuals is informative.  Probably it isn't.  If so, 
you probably (but not certainly) want to compute percent similarity 
(Bray-Curtis after dividing each abundance by total abundance in the sample).  
That's a very common recommendation, but even that may be misleading if the 
reason for the 12000 is an emergence flush of one species.  Adjusting by total 
sqrt(abundance) may be appropriate, but it is certainly not common.  Similar 
levels of abundance heterogeneity occur in benthic macrofauna data sets.  
Clarke is fond of Bray-Curtis (without total abundance correction, if I 
remember correctly) of sqrt(sqrt(abundance)), i.e. abundance^(1/4).  

Anne Chou and Robin Chazdon have developed and used abundance corrected 
similarity measures, similar in spirit to the Chou measures of richness.  I'm 
not fond of them because I believe the adjustments for sampling effort presume 
that two samples come from the same underlying community.  

Any distance measure reduces a pair of vectors to a single number.  It always 
throws away information.  You need to make sure that your choice of measure 
keeps the features you feel are important.  I strongly suggest you look at the 
abundance vectors for pairs of samples with small distances and pairs of 
samples with large distances to make sure that your quantitative measure 
matches your intuitive sense of 'similar' and 'not similar'.

Philip Dixon

___
R-sig-ecology mailing list
R-sig-ecology@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-ecology

[R-sig-eco] Using residuals as dependent variables

2012-06-22 Thread Dixon, Philip M [STAT]

1.   Regress a suite of ecological and socioeconomic variables against
the residuals from the oceanographic model to determine which factors cause 
some countries to be above and some below. I.E as trophic level increase the 
residuals become increasingly negative.

All my answers assume that all variables are measured at the same scale, so a 
multi-level model is not appropriate.

Absolutely no problem.  This is essentially equivalent to fitting a larger 
model with both oceanographic and ecological/socioeconomic variables.  If you 
go one step further, and first calculate residuals of the ecol/socio variables 
when regressed against oceanographic variables, the regression of the 
oceanographic residuals on the ecol/socio residuals is exactly the same as 
fitting the larger model.  The theory supporting this leads to added variable 
(or partial regression residual) plots.  

2.   Ideally ( and I am unsure how or if it is possible) work out for
each country which variables/s cause the poor fit of that country to the 
oceanographic model.

This may be a bit tricky.  The problem is potential confounding.  The 
regression of two sets of residuals suggested above tells you that a particular 
ecol/socio variable contributes to the larger only to the extent that that 
ecol/socio variable contains information not present in the set of 
oceanographic variables.  If an ecol/socio variable is causally important but 
confounded with some oceanographic variables, the approach will not detect the 
role of that ecol/socio variable.  If two ecol/socio variables are confounded, 
a regression model can not distinguish their contributions.  You have to think 
about your variables and their relationships to assess this.   

Assuming confounding is not an issue, fit the larger model with both 
oceanographic and ecol/socio variables.  For each country, calculate 
X\hat{\beta} for each ecol/socio variable.  That's the contribution of that 
variable for that country to the model prediction.  Compare those components to 
their sum for the country.
 
Philip Dixon

___
R-sig-ecology mailing list
R-sig-ecology@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-ecology

[R-sig-eco] Residuals in a mixed effect model

2012-03-24 Thread Dixon, Philip M [STAT]

Malin,

I would double check exactly which residuals you are getting before worrying 
further about heterscedasticity.
In a linear mixed model (Y = Xb + Zu + e), there are two definitions of the 
residual: Y-X bhat and Y - (X bhat + Z uhat).  You want the second to 
investigate unequal variance of the e's.

In mixed models with only random intercepts, this is not an issue because 
Var(Yi) is constant.  When the slope is random, Var(Yi) is not constant.  
Var(Yi) depends on Xi even when Var u and Var e are constants.

I believe resid(mer object) gives you the first (Y-Xbhat), based on faint 
memory and a quick read of ?'mer-class'.

If you do need a stronger transformation, did you try Y^(-1/2) or Y^(-1), both 
of which are stronger than log(Y) in the Box-Cox family?

For any transformation, I would worry much more about linearity of the 
regression on the transformed scale than about heterogeneity of variances.  If 
the relationship isn't linear, the beta's are at best approximations and at 
worst irrelevant.

You may be able to model the error variance as a function of logRe or a 
function of something else (sample size in the study).

Finally, you may be able / willing to ignore the heterogeneity.  David Fletcher 
and I investigated this in a simpler meta regression problem (each study 
providing one observation in the meta analysis).  That came out in the March 
2012 Methods in Ecology and Evolution.

Best wishes,
Philip Dixon


 

___
R-sig-ecology mailing list
R-sig-ecology@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-ecology

[R-sig-eco] Balanced split plot analysis in R

2012-03-22 Thread Dixon, Philip M [STAT]

Markus,

The presence or absence of an interaction has no bearing on the choice of 
analysis.  If you're a design-based person, as I am, you have two sizes of 
experimental unit (main plot and split plot).  If you're a model-based person, 
the multiple measurements (the split plots) on each main plot induces a 
correlation between split plot observations that needs to be accounted for in 
the model.  Both facts are different way of saying the same thing.  The 
presence or absence of an interaction doesn't change either fact.  You're 
concerned about appropriate comparisons of the main plot marginal means.

You're lucky that your design is balanced (equal numbers of observations for 
each combination of main plot and split plot factors).  Otherwise, the 
appropriate R analysis gets very hard very quickly.  (and there is considerable 
disagreement about what that appropriate analysis should be).

My suggestion:
Leave the interaction in the model.  It is part of the treatment design.  Test 
and do multiple comparisons among the split plot levels using the model you 
have.  If you are comparing cell means, the only correct comparisons are those 
between disturbance treatments within a grazing treatment.  If there is no 
evidence of an interaction, you're probably not comparing cell means, so that 
limitation is not an issue.
To get comparisons among levels of the main plot treatment: compute the average 
Y for each main plot, i.e. average over the split plots in each main plot.  
There is now only one treatment factor (grazing regime) and one blocking factor 
in the analysis.  The error in this analysis is the variability among main 
plots, which is what you want for comparisons among grazing levels.  
Straightforward test and multiple comparisons.

Best wishes,
Philip Dixon

___
R-sig-ecology mailing list
R-sig-ecology@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-ecology

[R-sig-eco] Estimating Q10

2012-01-31 Thread Dixon, Philip M [STAT]

Scott and Alan,

The two forms of your regression:
1) Y = a exp(b x),  and
2) log Y = log a + b x

are mathematically the same model if you ignore random variation.  If you 
include random variation, these are different models for the variance of the 
observations around the mean.  Carroll and Rupert explore this in the 
statistical literature (their book on transformations and weighting, and 
particularly their paper on the difference between three approaches to fit the 
Michaelis-Menton model).  They call this approach 'transform both sides', which 
might be a useful search term.

In 1), observations are assumed to have equal variances.
In 2), observations are assumed to have the same coefficient of variation, i.e. 
variance increases with the predicted value.
Because of this, the estimated regression coefficients will not be the same.  
But, you can use plots of predicted values vs residuals to decide which 
variance assumption is most reasonable for your data.

Philip Dixon

___
R-sig-ecology mailing list
R-sig-ecology@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-ecology

[R-sig-eco] Time-varying spatial correlation

2012-01-14 Thread Dixon, Philip M [STAT]

Penelope,

I agree completely with your wish to keep the model simple.  Spatial 
correlation, as something separate from spatial trend, may be something 
instrinic to the biology, but it can be a consequence of mis-specification of 
the model.  Along those lines, I would use all the available information to 
think about the rest of the model before considering the spatial correlation.  
In particular here, you have two levels of spatial information: the large scale 
hexagon level, sampled every year, and the small scale location within hexagon 
level, sampled only once per location.

You describe your model as containing fixed effects for year and hexagon-level 
random coefficients for a linear regression on year.  That means your model 
includes a random intercept for hexagon, so the large-scale spatial trend is 
already described by your model.  Since locations change within the hexagon 
each year, the small-scale spatail variation is only known after fitting the 
linear regression for year.  Any lack-of-fit there will complicate 
understanding the spatial correlation.

My sense is that you don't have enough of the right sort of data to account for 
small-scale spatial correlation unless there is some simple model for a spatial 
covariate that varies within hexagons (e.g depth).  Otherwise, I would be 
strongly tempted to ignore what you're calling spatial correlation under the 
principle that locations within hexagons are randomly sampled.  

Philip Dixon
___
R-sig-ecology mailing list
R-sig-ecology@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-ecology

[R-sig-eco] Random or repeated?

2011-12-09 Thread Dixon, Philip M [STAT]

Kyle,

The eco-stat list seems the wrong place to ask this question.  The people to 
ask are the quantitative plant breeders, e.g. at Texas A&M.

Some general advice about random effects and correlated observations, for 
either trait data or ecological data:
1) random effects provide a way to model correlations among errors.  So do 
'repeated measures'.
2) what matters are the correlations among ERRORS.  If I remember the 
traditional QTL model correctly, errors are independent conditional on the 
state of the flanking markers.  So, you don't need random effects for RIL's, if 
you're analyzing a single trait at a time.  Your observation is correct: the 
random effect and the marker fixed effects are both trying to model the same 
thing.  The same trait measured in multiple environments is often considered as 
a multi-trait analysis, to estimate the correlation between pairs of 
environments.  It could also be analyzed as a single trait, with a model that 
includes fixed effects of environments.  Different models answering different 
questions.
3) yes, pedigrees (or relatedness coefficients) can be used to parameterize a 
correlation matrix.  This provides one way (not the only way) to account for 
phylogenetic relatedness.  Animal breeders use pedigree-based correlation 
matrices all the time.

Philip Dixon

___
R-sig-ecology mailing list
R-sig-ecology@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-ecology

[R-sig-eco] Bootstrapping with pseudo-replicates

2011-11-16 Thread Dixon, Philip M [STAT]

Johannes,

A very good question to ask, but you can't use a bootstrap, or boot(), to 
investigate it.  

You can define strata and then bootstrap observations within strata, but all 
bootstrap data sets will have the same structure as the original data.  That's 
the point of the bootstrap.  In your example, you have observations from 4 
sites, 1 obs from site 1, 2 from site 2, 1 from site 3, and 2 from site 4.  
Every stratified bootstrap sample will have 1 from site 1, 2 from site 2, 1 
from site 3 and 2 from site 4.

I believe you have to construct your own code, probably along the lines of 
defining a vector for one obs per site, then for each site: extracting the set 
of pseudoreplicates for one site, using sample() to grab one value from that 
set, then storing in the  vector.

Best wishes,
Philip Dixon

___
R-sig-ecology mailing list
R-sig-ecology@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-ecology

[R-sig-eco] MLEs for wrapped Cauchy

2011-09-24 Thread Dixon, Philip M [STAT]

Lene,

The problem could be a bad starting value, optim wandering into places it 
shouldn't, insufficient data to estimate c, or something else.

A good way to understand what's going on is to plot profile likelihood against 
values of c.  For a specified value of c, use optim() to find the 2 parameter 
mle of  (mu0, rho0) given c.  Evaluating lnl at (\hat mu0, \hat rho0, c) gives 
you profile lnl(c) = max lnl (over mu0, rho0 given c).  Do that for a range of 
c and plot profile lnl(c) against c.  

If optim() refuses to work well with the three parameter problem, you can 
always use optimize() on the profile likelihood.  Golden section search (in 
optimize) for a 1 parameter optimization is more numerically robust than 
conjugate gradient optimization.

Philip Dixon

___
R-sig-ecology mailing list
R-sig-ecology@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-ecology

[R-sig-eco] Average Regression and the bootstrap

2011-08-30 Thread Dixon, Philip M [STAT]

Johannes,

You are doing what many folks call the parametric bootstrap.  The bootstrap 
that resamples the data is the non-parametric bootstrap.   Often it is easier 
to code by hand, as you have done.  If you want access to all the helper 
functions for the bootstrap, you can use boot() in the boot library and specify 
sim='parametric'.  The details of your random uniform distribution go in the 
function specified in the ran.gen= argument.  Once you've generated the 
bootstrap samples (parametric or nonparametric), there is no difference in 
subsequent processing.

Best wishes,
Philip Dixon

___
R-sig-ecology mailing list
R-sig-ecology@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-ecology

[R-sig-eco] Bivalve study

2011-08-24 Thread Dixon, Philip M [STAT]

Chris,

This is an example of a study with 2 factors that is much easier to analyze as 
a 1 way ANOVA.  You  have three treatments.   The structure of those treatments 
implies two interesting treatment contrasts:
the effect of raking, estimated by the difference between control and 
procedural control, and
the effect of fishing in disturbed plots, estimated by the difference between 
fished and procedural control.

Because these follow from the treatment structure, I would do inference about 
those differences (tests, confidence intervals) without any adjustment for 
multiple comparisons.

You could also set up the problem as a regression problem with two indicator 
variables (the raking and fishing variables in your treatment table).  I think 
it is more straight forward to think about means of treatments.

A 2 way factorial with interactions will run into serious problems because you 
do not have all 4 cells of the raking x fishing table.

Best wishes,
Philip Dixon

___
R-sig-ecology mailing list
R-sig-ecology@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-ecology

Re: [R-sig-eco] R-sig-ecology Digest, Vol 34, Issue 5

2011-01-11 Thread Dixon, Philip M [STAT]

Matt,

Standardization is a transformation of the X variables to mean 0 and variance 
1.  You do that before you fit the regression or after by multiplying beta_i by 
sd(X_i).

The goal of standardization is to put all X's on the same scale.  There are 
other ways to do that, and (important point) you can (and should) choose the 
biologically/ecologically most relevant way to put X's on the same scale.  
Using sd(x) is often reasonable, but there may be something more appropriate.  
In particular, sd(x) is not appropriate (in my mind) when one of the X's has a 
very skewed distribution (e.g. mostly values from 1 to 2 and one value of 100). 
 

My recommendation to evaluate the "impact" of each X (considered in the usual 
partial sense for a multiple regression coefficient) is to define equivalent 
changes for each X variable.  E.g. a change in temperature of 5C might be 
equivalent to a change in altitude of 1000m or a change in latitude of 1degree. 
 Then calculate the change in Y induced by each change in X.  Those are 5*beta 
for temp, 1000*beta for altitude and beta for latitude.

Best wishes,
Philip Dixon

___
R-sig-ecology mailing list
R-sig-ecology@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-ecology

43 matches

Mail list logo