Laura de Ruiter wrote:
Dear R-users and -experts,

I am performing a rather simple analysis on a small data set (pasted below this email) and keep getting a result that is inexplicable to me. Perhaps I am missing something here - it would be great if someone could point out what I am doing wrong.

I want to test whether the factor "Info" (which has three levels: "new", "given", "accessible") is a significant predictor for the binary variable "DeaccYN". The random factor is "Subject". The distribution of the data looks as follows:

-----------------------------------------------------------------------------
                    Info
DeaccYN    given   new   accessible
  no          25    42           21
  yes         11     0            1
-----------------------------------------------------------------------------
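
(For reference, a cross-tabulation like this can be produced directly in R,
e.g. with

  xtabs(~ DeaccYN + Info, data = dat)

using the data frame and column names from the model below.)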

This is the model:

-----------------------------------------------------------------------------
deacc.lmer = lmer(DeaccYN ~ Info + (1|Subject), data = dat, family = "binomial")
-----------------------------------------------------------------------------

However, given the distribution above, this outcome seems rather weird to me:

-----------------------------------------------------------------------------
> summary(deacc.lmer)
Generalized linear mixed model fit using Laplace
Formula: DeaccYN ~ Info + (1 | Subject)
   Data: dat
 Family: binomial(logit link)
   AIC   BIC logLik deviance
  60.4 70.82  -26.2     52.4
Random effects:
 Groups  Name        Variance Std.Dev.
 Subject (Intercept) 0.18797  0.43356
Number of obs: 100, groups: Subject, 21

Estimated scale (compare to 1)  0.7316067

Fixed effects:
               Estimate Std. Error z value Pr(>|z|)
(Intercept)     -0.8635     0.3795 -2.2754   0.0229 *
Infonew        -18.7451  2764.2445 -0.0068   0.9946
Infoaccessible  -2.2496     1.1186 -2.0110   0.0443 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

> [...]

----------------------------------------------------------------------------------------------------

Why should the difference between 25/11 and 21/1 be significant, but the difference between 25/11 and 42/0 not? The standard error of 2764 also seems very odd to me!

> [...]

I was wondering: Is it perhaps a problem for the model that there are no cases in the DeaccYN == "yes" category for Info == "given"? And if this
                                                      ^^^^^
                                         I believe you mean "new" here.
is the case, why?
Am I overlooking something here?

Dear Laura,

Independently of the issue that Florian is raising... you are right that the lack of "yes" cases for Info == "new" is a problem for the model (to be precise, it's a problem for estimating the significance of the parameter estimate using the z value). The z value, which is the basis of the significance level reported in the lmer summary, is based on the Wald statistic, which is

  z = \hat{b} / StdError(\hat{b})

where \hat{b} is the parameter estimate. However, for parameter estimates of large magnitude (as is the case for "new" here), the standard error is inflated, which leads to a small Wald statistic. This problem is discussed in a number of places, including Agresti (1996, 2002); here's a decent mention of it online:

  http://userwww.sfsu.edu/~efc/classes/biol710/logistic/logisticreg.htm
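
To make this concrete, the p-values in your summary can be recovered from the
printed estimates and standard errors (a quick check using only the numbers
above):

  2 * pnorm(-abs(-0.8635 / 0.3795))      # (Intercept):    ~0.0229
  2 * pnorm(-abs(-18.7451 / 2764.2445))  # Infonew:        ~0.9946
  2 * pnorm(-abs(-2.2496 / 1.1186))      # Infoaccessible: ~0.0443

And the standard-error inflation itself shows up even in a plain glm on toy
data that mimics your given/new cells (a sketch; the counts come from your
table, everything else is made up for illustration):

  toy <- data.frame(y    = c(rep("no", 25), rep("yes", 11), rep("no", 42)),
                    Info = c(rep("given", 36), rep("new", 42)))
  summary(glm(factor(y) ~ Info, data = toy, family = binomial))
  # the Infonew estimate is a large negative number with an enormous standard
  # error, because no "yes" cases are observed for Info == "new"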

You can see this quite clearly if you add just one "yes/new" example at the end of your data frame, for example:

  101 132 new yes
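
A minimal sketch of how one might do that, assuming for illustration a data
frame with just the columns Subject, Info and DeaccYN used in the model
formula (the subject code is taken from the example row above):

  dat2 <- rbind(dat, data.frame(Subject = 132, Info = "new", DeaccYN = "yes"))
  deacc.lmer <- lmer(DeaccYN ~ Info + (1 | Subject), data = dat2, family = "binomial")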

Then the model gives (being sure to correctly specify the levels of "Info"):

> summary(deacc.lmer)
Generalized linear mixed model fit by the Laplace approximation
Formula: Deacc ~ Info + (1 | Subject)
   Data: dat
   AIC   BIC logLik deviance
 69.78 80.24 -30.89    61.78
Random effects:
 Groups  Name        Variance Std.Dev.
 Subject (Intercept) 0.34332  0.58594
Number of obs: 101, groups: Subject, 21

Fixed effects:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -0.8996     0.3936  -2.286  0.02226 *
Iaccessible  -2.2808     1.1460  -1.990  0.04656 *
Inew         -3.0050     1.1395  -2.637  0.00836 **


The next question, of course, is how to deal with the dataset you actually have. If you weren't using a mixed-effects model, you could use a likelihood-ratio test by comparing your full model with a simpler model; the likelihood-ratio test isn't susceptible to problems with large parameter estimates the way the Wald test is. For example:

> deacc.glm = glm (Deacc ~ Info, data = dat, family = "binomial")
> deacc.glm1 = glm (Deacc ~ Info1, data = dat, family = "binomial")
> anova(deacc.glm,deacc.glm1,test="Chisq")
Analysis of Deviance Table

Model 1: Deacc ~ Info
Model 2: Deacc ~ Info1
  Resid. Df Resid. Dev Df Deviance P(>|Chi|)
1        97     52.452
2        98     71.600 -1  -19.148 1.210e-05
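
Here Info1 is a two-level recoding of Info that collapses "given" and "new"
into a single level, so that exactly the "new" contrast is dropped. One way to
construct such a factor (a sketch):

  dat$Info1 <- factor(ifelse(dat$Info == "accessible", "accessible", "other"))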

In principle, you could do this with your mixed-effects model, being sure to use ML instead of REML fitting:

> deacc.lmer = lmer (Deacc ~ Info + (1 | Subject), data = dat, family = "binomial", REML=F)
> deacc.lmer1 = lmer (Deacc ~ Info1 + (1 | Subject), data = dat, family = "binomial", REML=F)
> anova(deacc.lmer,deacc.lmer1)
Data: dat
Models:
deacc.lmer1: Deacc ~ Info1 + (1 | Subject)
deacc.lmer: Deacc ~ Info + (1 | Subject)
            Df     AIC     BIC  logLik Chisq Chi Df Pr(>Chisq)
deacc.lmer1  3  77.600  85.416 -35.800
deacc.lmer   4  60.400  70.821 -26.200  19.2      1  1.177e-05 ***
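
As a sanity check, the chi-square statistic and p-value in this table can be
reproduced by hand from the two log-likelihoods:

  2 * (-26.200 - (-35.800))                 # 19.2
  pchisq(19.2, df = 1, lower.tail = FALSE)  # ~1.18e-05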

There is an argument that the likelihood-ratio test is anti-conservative and hence inappropriate for comparing mixed-effects models differing only in fixed-effects structure. (See Pinheiro & Bates, 2000, around page 76 or so. The argument is made only for linear mixed-effects models and I'm not sure of the status for logit mixed-effects models.) That being said, it's not clear how strong the anti-conservativity is, and in your case it seems like you have such an exceedingly powerful effect that you might be safe in using the likelihood-ratio test here and just mentioning the potential anti-conservativity as a caveat.

So the summary is: believe it or not, how to assess the significance of the parameter estimate for "new" for your model & dataset is a bit of an open question, but it seems pretty clear that the estimate is significantly non-zero.

Best & hope this helps.

Roger



--

Roger Levy                      Email: rl...@ucsd.edu
Assistant Professor             Phone: 858-534-7219
Department of Linguistics       Fax:   858-534-4789
UC San Diego                    Web:   http://ling.ucsd.edu/~rlevy
