Re: [R] question in using nlme and lme4 for unbalanced data

2010-11-02 Thread Mike Marchywka
> Date: Mon, 1 Nov 2010 17:38:54 -0700
> From: djmu...@gmail.com
> To: cy...@email.arizona.edu
> CC: r-help@r-project.org
> Subject: Re: [R] question in using nlme and lme4 for unbalanced data
>
> Hi:
>
> On Mon, Nov 1, 2010 at 3:59 PM, Chi Yuan  wrote:
>
> > Hello:
> > I need some help with using a mixed model for unbalanced data. I
> > have a two-factorial randomized block design. It's an ecology
[...]
>
> Unbalanced data is not a problem in either package. However, five blocks is
> rather at the boundary of whether or not one can compute reliable variance
> components and random effects. Given that the variance estimate of blocks in
> your models was nearly zero, you're probably better off treating them as
> fixed rather than random and analyzing the data with a fixed effects model
> instead.
>
> Another question is about p values.
> > I kind of heard the P value does not matter that much in the mixed
> > model because it's not calculated properly.
>
>
> No. p-values are not calculated in lme4 (as I understand it) because,
> especially in the case of severely unbalanced data, the true sampling
> distributions of the test statistics in small to moderate samples are not
> necessarily close to the asymptotic distributions used to compute the
> corresponding p-values. It's the (sometimes gross) disparity between the
> small-sample and asymptotic distributions that makes the reported p-values
> based on the latter unreliable, not an inability to calculate the p-value
> properly. I can assure you that Prof. Bates knows how to compute a p-value.

To add my own question on terminology [even the statements here should be
taken as questions]: assuming the null hypothesis is true and you have some
underlying population distribution of various attributes, you get some
distribution for your test statistic over repeated experiments. The asymptotic
distribution, I take it, is the large-sample limit of that distribution, which
may not be well reflected in your (small) sample? Usually people justify
non-parametrics by saying they help in the small-sample/outlier cases.
Alternatively, if you have some reasonable basis for knowing the true
population distributions, you could use that for p-value calculation and/or do
Monte Carlo and just measure the number of times you incorrectly reject the
null hypothesis, etc. Of course, Monte Carlo code needs to be debugged too, so
nothing will be a sure thing. Introducing new things like an independently
known population distribution may not be statistically rigorous by some
criteria (comments welcome LOL), but you are free to examine it for analysis.
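The Monte Carlo idea can be sketched in a few lines of R. This is a hedged
illustration on made-up data (not the poster's com_summer): simulate new
responses from the fitted null model, refit both models, and count how often
the simulated F statistic is at least as extreme as the observed one.

```r
## Parametric-bootstrap (Monte Carlo) p-value; toy data for illustration only
set.seed(1)
x <- rep(0:1, length.out = 25)
y <- rnorm(25)                          # no true effect of x here
m0  <- lm(y ~ 1)                        # null model
m1  <- lm(y ~ x)                        # model with the treatment
obs <- anova(m0, m1)$F[2]               # observed F statistic
null.F <- replicate(1000, {
  y.sim <- simulate(m0)[[1]]            # draw new data under the null
  anova(lm(y.sim ~ 1), lm(y.sim ~ x))$F[2]
})
p.mc <- mean(null.F >= obs)             # fraction of null draws at least as extreme
p.mc
```

As noted above, the simulation itself has to be debugged and the null model
has to be believable, so this is a check, not a sure thing.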


>
> Is there any other way I can
> > tell whether the treatment has a effect not? I know AIC is for model
> > comparison,

Get more data? In this case, it would seem the goal of statistical analysis
is to make some guesses about causality. Presumably this is one piece of
evidence in a larger "case" that includes theory and other observations.
To paraphrase the popular legal phrase,
"if the model doesn't fit you must not quit." Or, as someone at the US FDA
is quoted as saying, "A p-value is no substitute for a brain."



> > do I report this in formal publication?

I guess that depends on the journal (LOL). Personally I'd be more worried about
getting a convincing story together than playing to a specific audience.
However, many questions of detail do relate to the audience and journal: you
want to use the math to determine reality; what you present depends on the
publication. There is nothing wrong with presenting novel analyses in enough
detail to the right audience, but it may not be for everyone :)

> >
>
> As mentioned above, I would suggest analyzing this as a fixed effects
> problem. Since the imbalance is not too bad, and it is not unusual in field
> experiments to have more control EUs than treatment EUs within each level of
> treatment, a fixed effects analysis may be sufficient. It wouldn't hurt to
> consult with a local statistician to discuss the options.

  
__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] question in using nlme and lme4 for unbalanced data

2010-11-01 Thread Ben Bolker
Chi Yuan  email.arizona.edu> writes
> 
> Hello:
>  I need some help with using a mixed model for unbalanced data. I
> have a two-factorial randomized block design. It's an ecology
> experiment. My two factors are guild removal and enfa removal. Both
> have two levels: 0 (no removal), 1 (removal). I have 5 blocks. But
> within each block, it's unbalanced at the plot level because I have 5
> plots instead of 4 in each block. Within each block, I have 1 plot
> with only guild removal, 1 plot with only enfa removal, 1 plot for
> control with no removal, and 2 plots with both guild and enfa removal. I am
> looking at how these treatments affect the enfa mortality rate. I
> decided to use a mixed model to treat block as a random effect. So I tried
> both nlme and lme4. But I don't know whether they handle the unbalanced
> data properly. So my question is, do lme in nlme and lmer in lme4
> handle unbalanced data? How do I know the analysis is done in a proper way?

  Didn't Bert Gunter and I already provide answers to this question
last week? Can you please clarify what about those answers you didn't
understand?

> Another question is about p values.
> I kind of heard the P value does not matter that much in the mixed
> model because it's not calculated properly. Is there any other way I can
> tell whether the treatment has an effect or not? I know AIC is for model
> comparison;
> do I report this in a formal publication?

  It is indeed hard to compute p-values, but ... if you use AIC,
you are essentially making the same assumption as if you assumed
that the denominator degrees of freedom were infinite in an F test
(or if you used the likelihood ratio test).
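That equivalence can be checked numerically in R: an F distribution with
infinite denominator df is a chi-squared divided by its numerator df, so the
F-test p-value with ddf = Inf coincides with the chi-squared
(likelihood-ratio-style) p-value, while any finite ddf gives a larger, more
conservative p-value. A small sketch (the statistic value is arbitrary):

```r
## F(df1, Inf) is chisq(df1)/df1, so these p-values coincide:
Fval <- 3.84; df1 <- 1
p.F   <- pf(Fval, df1, Inf, lower.tail = FALSE)
p.chi <- pchisq(df1 * Fval, df1, lower.tail = FALSE)
all.equal(p.F, p.chi)                          # TRUE
## With a finite ddf the F test is (appropriately) more conservative:
pf(Fval, df1, 17, lower.tail = FALSE) > p.F    # TRUE
```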

[snip results]

  Ben Bolker



Re: [R] question in using nlme and lme4 for unbalanced data

2010-11-01 Thread Dennis Murphy
Hi:

On Mon, Nov 1, 2010 at 3:59 PM, Chi Yuan  wrote:

> Hello:
>  I need some help with using a mixed model for unbalanced data. I
> have a two-factorial randomized block design. It's an ecology
> experiment. My two factors are guild removal and enfa removal. Both
> have two levels: 0 (no removal), 1 (removal). I have 5 blocks. But
> within each block, it's unbalanced at the plot level because I have 5
> plots instead of 4 in each block. Within each block, I have 1 plot
> with only guild removal, 1 plot with only enfa removal, 1 plot for
> control with no removal, and 2 plots with both guild and enfa removal. I am
> looking at how these treatments affect the enfa mortality rate. I
> decided to use a mixed model to treat block as a random effect. So I tried
> both nlme and lme4. But I don't know whether they handle the unbalanced
> data properly. So my question is, do lme in nlme and lmer in lme4
> handle unbalanced data? How do I know the analysis is done in a proper way?
>

Unbalanced data is not a problem in either package. However, five blocks is
rather at the boundary of whether or not one can compute reliable variance
components and random effects. Given that the variance estimate of blocks in
your models was nearly zero, you're probably better off treating them as
fixed rather than random and analyzing the data with a fixed effects model
instead.

Another question is about p values.
> I kind of heard the P value does not matter that much in the mixed
> model because it's not calculated properly.


No. p-values are not calculated in lme4 (as I understand it) because,
especially in the case of severely unbalanced data, the true sampling
distributions of the test statistics in small to moderate samples are not
necessarily close to the asymptotic distributions used to compute the
corresponding p-values. It's the (sometimes gross) disparity between the
small-sample and asymptotic distributions that makes the reported p-values
based on the latter unreliable, not an inability to calculate the p-value
properly. I can assure you that Prof. Bates knows how to compute a p-value.

Is there any other way I can
>  tell whether the treatment has an effect or not? I know AIC is for model
> comparison;
> do I report this in a formal publication?
>

As mentioned above, I would suggest analyzing this as a fixed effects
problem. Since the imbalance is not too bad, and it is not unusual in field
experiments to have more control EUs than treatment EUs within each level of
treatment, a fixed effects analysis may be sufficient. It wouldn't hurt to
consult with a local statistician to discuss the options.
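As a sketch of that fixed-effects route: toy data shaped like the poster's
design (5 blocks, 5 plots each, 2 plots receiving both removals). The variable
names follow the thread, but the numbers are simulated, so only the mechanics
carry over, not the conclusions.

```r
## Fixed-effects analysis with block as a fixed factor; simulated toy data
set.seed(2)
com_toy <- data.frame(
  block         = factor(rep(1:5, each = 5)),
  guild_removal = rep(c(0, 1, 0, 1, 1), 5),   # 2 plots per block get both
  enfa_removal  = rep(c(0, 0, 1, 1, 1), 5)
)
com_toy$enfa_mortality <- with(com_toy,
  0.45 - 0.1 * guild_removal - 0.37 * enfa_removal + rnorm(25, sd = 0.19))

## Block as a fixed effect instead of random:
m.fix <- lm(enfa_mortality ~ guild_removal * enfa_removal + block,
            data = com_toy)
anova(m.fix)        # sequential F tests for treatments and blocks
```

Note the residual df is 17 here, matching the denominator df lme reported for
the poster's data.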

HTH,
Dennis


> Here is my code and the result for each method.
>  I first try nlme
>  library(nlme)
>
>  
> m=lme(enfa_mortality~guild_removal*enfa_removal,random=~1|block,data=com_summer)
> >  It gave me the following result:
>  Linear mixed-effects model fit by REML
>  Data: com_summer
>  AIC  BIC   logLik
>  8.552254 14.81939 1.723873
>
> Random effects:
>  Formula: ~1 | block
>(Intercept)  Residual
> StdDev: 9.722548e-07 0.1880945
>
> Fixed effects: enfa_mortality ~ guild_removal * enfa_removal
>                             Value Std.Error DF   t-value p-value
> (Intercept)                 0.450 0.0841184 17  5.349603  0.0001
> guild_removal              -0.100 0.1189614 17 -0.840609  0.4122
> enfa_removal               -0.368 0.1189614 17 -3.093441  0.0066
> guild_removal:enfa_removal  0.197 0.1573711 17  1.251818  0.2276
>  Correlation:
>  (Intr) gld_rm enf_rm
> guild_removal  -0.707
> enfa_removal   -0.707  0.500
> guild_removal:enfa_removal  0.535 -0.756 -0.756
>
> Standardized Within-Group Residuals:
>  Min Q1Med Q3Max
> -1.7650706 -0.7017751  0.1594943  0.7974717  1.9139320
>
> Number of Observations: 25
> Number of Groups: 5
>
>
>  I then tried lme4; it gives a similar result, but won't tell me the p value.
> library(lme4)
> m<-lmer(enfa_mortality ~ guild_removal*enfa_removal +(1|block),
> data=com_summer)
> here is the result
>  Linear mixed model fit by REML
> Formula: enfa_mortality ~ guild_removal * enfa_removal + (1 | block)
>  Data: com_summer
>  AIC   BIC logLik deviance REMLdev
>  8.552 15.87  1.724   -16.95  -3.448
> Random effects:
>  Groups   Name        Variance Std.Dev.
>  block    (Intercept) 0.00     0.0
>  Residual             0.035380 0.18809
> Number of obs: 25, groups: block, 5
>
> Fixed effects:
>                            Estimate Std. Error t value
> (Intercept)                 0.45000    0.08412   5.350
> guild_removal              -0.1        0.11896  -0.841
> enfa_removal               -0.36800    0.11896  -3.093
> guild_removal:enfa_removal  0.19700    0.15737   1.252
>
> Correlation of Fixed Effects:
>   (Intr) gld_rm enf_rm
> guild_remvl -0.707
> enfa_removl -0.707  0.500
> gld_rmvl:n_  0.535 -0.756 -0.756
>
>
> I really appreciate any suggestion!
> Thank you!
> --
> Chi Yuan
> Graduate Student
> Department of Eco

Re: [R] question in using nlme and lme4 for unbalanced data

2010-10-25 Thread Ben Bolker
On 10-10-25 04:59 PM, Bert Gunter wrote:
>  ...ignore the block variation entirely, "
> 
> If the between-block variability is large, this will lose precision;
> with imbalance, it could also result in bias (perhaps not in this
> study...). The mixed-or-fixed-effects choice is arbitrary; ignoring
> blocks is not -- blocks should be modeled, not ignored.
> 
> -- Bert

   I agree that in general blocking factors should not be ignored!  I
don't even think they should be discarded if they are
small/non-significant (in the ecological literature this is called
'sacrificial pseudoreplication' sensu Hurlbert (1984)).

 But the point is that for this particular case the *estimated* block
variation is *exactly* zero for one function and trivially small for the
other (I think it's exactly zero for lmer, but not sure), so in fact the
original poster will (I think?) get exactly (or almost exactly) the same
answer if they simply use lm() and ignore blocks entirely ...

  Or am I missing something?

  (Taking the liberty of cc'ing back to the list so I can look like an
idiot in public if necessary ...)

  cheers
Ben



Re: [R] question in using nlme and lme4 for unbalanced data

2010-10-25 Thread Ben Bolker
Bert Gunter  gene.com> writes:

  Thanks for starting this.  I don't really disagree, much.

> I would say that these are not really R questions, but about
> statistical procedures in general. However, there are some important
> issues that you touch on, so I cannot resist commenting. Hopefully my
> comments might pique others to add theirs -- and perhaps to disagree
> with me.
> 
> 1. Unbalanced data?
> Yes, that's what they're for. However, you need to recognize that with
> unbalanced data, interpretability of coefficients may be lost (the
> estimates depend on what factors you include in your model).

  Agreed.

> 2. Proper analysis?
> TRUST in DOUG. (Doug Bates, the author of the packages and an
> acknowledged expert in the field).

  Yes, you seem to have encoded your design (randomized complete
block, with a twist) correctly.  You can in principle quantify
the variation in mean response among blocks [(1|block)], but not
the variation in treatments among blocks [(ttt|block) , where ttt
is a factor].  It is not surprising, but a good sign, that you
get identical coefficient estimates from lme and lmer -- this
might not be the case if you had an extremely exotic experimental
design (in which case you might uncover a bug), or if your estimates
are numerically unstable (in which case the different algorithms
underlying the two different functions might give different
answers).

 
> 3. Without P values, how do I tell what's significant?
> Well, first of all, P values/statistical significance are only
> meaningful when hypotheses are pre-specified; and even then,
> statistical significance is a function of effect magnitude, sample
> size, and experimental noise, so it usually doesn't accord with
> scientific importance anyway unless the experiments have been powered
> appropriately, which is rarely the case outside clinical trials. So
> statistical significance is probably irrelevant to the scientific
> questions in any case.
> 
> I would say that you need to focus on effect (coefficient) sizes and
> perhaps do sensitivity analyses to see how much predicted results
> change as you change them. Good graphs of your data are also always a
> good idea.

  This is true, in general, but it's also true that you can't always
fight against the machine. If you can decide, on the basis of your
design and examination of a standard statistical reference (Gotelli &
Ellison, and Quinn & Keough, are good for ecologists), what the
appropriate 'denominator degrees of freedom' would be for a nearly
equivalent situation (it would be good to check the DF that lme tells
you), then you can do the t or F tests yourself -- if uncertain, pick
conservative values (i.e. the smaller of the plausible denominator DFs).
Or trust what lme (representing a younger and ?less wise? Doug Bates)
tells you.
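Concretely, taking the t value that lmer reports for enfa_removal together
with the 17 denominator df that lme used, the do-it-yourself p-value
reproduces lme's (assuming you accept 17 df as defensible for this design):

```r
## t value for enfa_removal from the lmer output, 17 df as lme reports:
tval <- -3.093441
p <- 2 * pt(-abs(tval), df = 17)   # two-sided t test by hand
round(p, 4)                        # 0.0066, matching the lme p-value above
```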

 Be warned that the values you get from summary() represent marginal
tests, which very likely don't make sense when there is an interaction
present in the model.  Read ?anova.lme , and make sure you know the
difference between sequential and marginal tests.

> 4. With only 5 blocks, you have practically no information to estimate
> variance anyway. You're probably better off treating them as fixed
> effects and just estimating the model via lm. You might find a sqrt or
> log transformation of the enfa numbers to be useful, though...

  Note also that the estimates you are getting from both lmer and lm
say that there is zero among-block variance in the intercept. As Bert
says, that really just indicates that you don't have very much
information about among-block variation (the data are consistent
with all the error being among individuals). If there is someone in
a position of power over you who will review your work, the pragmatic
thing to do is to ask them what they recommend, because anything you
do at this point (ignore the block variation entirely, treat it as
random, or treat it as fixed) will be unacceptable to some members
of the statistical or ecological community.  It won't make very
much difference to your answers either way (might change a p-value
from <0.05 to >0.05 or vice versa, but in reality that does not
mean the answers are much different).

> 
> Cheers,
> Bert
>



Re: [R] question in using nlme and lme4 for unbalanced data

2010-10-25 Thread Bert Gunter
Chi:

I would say that these are not really R questions, but about
statistical procedures in general. However, there are some important
issues that you touch on, so I cannot resist commenting. Hopefully my
comments might pique others to add theirs -- and perhaps to disagree
with me.

1. Unbalanced data?
Yes, that's what they're for. However, you need to recognize that with
unbalanced data, interpretability of coefficients may be lost (the
estimates depend on what factors you include in your model).
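A toy illustration of that point (all numbers hypothetical): with unbalanced
data, two factors are non-orthogonal, so the estimate for one factor shifts
when the other enters the model.

```r
## Unbalanced design: factor b is correlated with factor a
set.seed(3)
d <- data.frame(a = rep(0:1, c(10, 15)),
                b = c(rep(0:1, 5), rep(0:1, c(10, 5))))
d$y <- 1 + 0.5 * d$a + 3 * d$b + rnorm(25, sd = 0.1)
fit1 <- lm(y ~ a,     data = d)
fit2 <- lm(y ~ a + b, data = d)
coef(fit1)["a"]   # a alone: absorbs part of b's effect
coef(fit2)["a"]   # a adjusted for b: a clearly different estimate
```

In a balanced design the two estimates would agree (up to noise in the
residual se); here they do not, which is Bert's point about interpretability.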

2. Proper analysis?
TRUST in DOUG. (Doug Bates, the author of the packages and an
acknowledged expert in the field).

3. Without P values, how do I tell what's significant?
Well, first of all, P values/statistical significance are only
meaningful when hypotheses are pre-specified; and even then,
statistical significance is a function of effect magnitude, sample
size, and experimental noise, so it usually doesn't accord with
scientific importance anyway unless the experiments have been powered
appropriately, which is rarely the case outside clinical trials. So
statistical significance is probably irrelevant to the scientific
questions in any case.

I would say that you need to focus on effect (coefficient) sizes and
perhaps do sensitivity analyses to see how much predicted results
change as you change them. Good graphs of your data are also always a
good idea.

4. With only 5 blocks, you have practically no information to estimate
variance anyway. You're probably better off treating them as fixed
effects and just estimating the model via lm. You might find a sqrt or
log transformation of the enfa numbers to be useful, though...

Cheers,
Bert

On Mon, Oct 25, 2010 at 11:44 AM, Chi Yuan  wrote:
> Hello:
>  I have a two-factorial randomized block design. It's an ecology
> experiment. My two factors are guild removal and enfa removal. Both
> have two levels: 0 (no removal), 1 (removal). I have 5 blocks. But
> within each block, it's unbalanced at the plot level because I have 5
> plots instead of 4 in each block. Within each block, I have 1 plot
> with only guild removal, 1 plot with only enfa removal, 1 plot for
> control with no removal, and 2 plots with both guild and enfa removal. I am
> looking at how these treatments affect the enfa mortality rate. I
> decided to use a mixed model to treat block as a random effect. So I tried
> both nlme and lme4. But I don't know whether they handle the unbalanced
> data properly. So my question is, do lme in nlme and lmer in lme4
> handle unbalanced data? How do I know the analysis is done in a proper way?
> Here is my code and the result for each method.
>  I first try nlme
>  library(nlme)
>  m=lme(enfa_mortality~guild_removal*enfa_removal,random=~1|block,data=com_summer)
>  It gave me the following result:
>  Linear mixed-effects model fit by REML
>  Data: com_summer
>       AIC      BIC   logLik
>  8.552254 14.81939 1.723873
>
> Random effects:
>  Formula: ~1 | block
>         (Intercept)  Residual
> StdDev: 9.722548e-07 0.1880945
>
> Fixed effects: enfa_mortality ~ guild_removal * enfa_removal
>                            Value Std.Error DF   t-value p-value
> (Intercept)                 0.450 0.0841184 17  5.349603  0.0001
> guild_removal              -0.100 0.1189614 17 -0.840609  0.4122
> enfa_removal               -0.368 0.1189614 17 -3.093441  0.0066
> guild_removal:enfa_removal  0.197 0.1573711 17  1.251818  0.2276
>  Correlation:
>                           (Intr) gld_rm enf_rm
> guild_removal              -0.707
> enfa_removal               -0.707  0.500
> guild_removal:enfa_removal  0.535 -0.756 -0.756
>
> Standardized Within-Group Residuals:
>       Min         Q1        Med         Q3        Max
> -1.7650706 -0.7017751  0.1594943  0.7974717  1.9139320
>
> Number of Observations: 25
> Number of Groups: 5
>
> I kind of heard the P value does not matter that much in the mixed
> model. Is there any other way I can tell whether the treatment has a
> significant effect or not?
>  I then tried lme4; it gives a similar result, but won't tell me the p value.
> library(lme4)
> m<-lmer(enfa_mortality ~ guild_removal*enfa_removal +(1|block), 
> data=com_summer)
> here is the result
>  Linear mixed model fit by REML
> Formula: enfa_mortality ~ guild_removal * enfa_removal + (1 | block)
>   Data: com_summer
>   AIC   BIC logLik deviance REMLdev
>  8.552 15.87  1.724   -16.95  -3.448
> Random effects:
>  Groups   Name        Variance Std.Dev.
>  block    (Intercept) 0.00 0.0
>  Residual             0.035380 0.18809
> Number of obs: 25, groups: block, 5
>
> Fixed effects:
>                           Estimate Std. Error t value
> (Intercept)                 0.45000    0.08412   5.350
> guild_removal              -0.1    0.11896  -0.841
> enfa_removal               -0.36800    0.11896  -3.093
> guild_removal:enfa_removal  0.19700    0.15737   1.252
>
> Correlation of Fixed Effects:
>            (Intr) gld_rm enf_rm
> guild_remvl -0.707
> enfa_removl -0.707  0.500
> gld_rmvl:n_  0.535 -0.756 -0.756