[R] problem with glm(family=binomial) when some levels have only 0 proportion values

2011-03-02 Thread Jürg Schulze

Hello everybody

I want to compare the proportions of germinated seeds (seed batches of  
size 10) of three plant types (1,2,3) with a glm with binomial data  
(following the method in Crawley: Statistics,an introduction using R,  
p.247).
The problem seems to be that in two plant types (2,3) all plants have  
proportions = 0.

I give you my data and the model I'm running:

  success failure type
 [1,]   0   103
 [2,]   0   102
 [3,]   0   102
 [4,]   0   102
 [5,]   0   102
 [6,]   0   102
 [7,]   0   102
 [8,]   461
 [9,]   461
[10,]   371
[11,]   551
[12,]   731
[13,]   461
[14,]   0   103
[15,]   0   103
[16,]   0   103
[17,]   0   103
[18,]   0   103
[19,]   0   103
[20,]   0   102
[21,]   0   102
[22,]   0   102
[23,]   911
[24,]   641
[25,]   461
[26,]   0   103
[27,]   0   103

 y- cbind(success, failure)

 Call:
glm(formula = y ~ type, family = binomial)

Deviance Residuals:
   Min  1Q  Median  3Q
-1.3521849  -0.427  -0.427  -0.427
   Max
 2.6477556

Coefficients:
  Estimate Std. Error z value Pr(|z|)
(Intercept)0.044450.21087   0.2110.833
typeFxC  -23.16283 6696.13233  -0.0030.997
typeFxD  -23.16283 6696.13233  -0.0030.997

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 134.395  on 26  degrees of freedom
Residual deviance:  12.622  on 24  degrees of freedom
AIC: 42.437

Number of Fisher Scoring iterations: 20


Huge standard errors are calculated and there is no difference between  
plant type 1 and 2 or between plant type 1 and 3.
If I add 1 to all successes, so that all the 0 values disappear, the  
standard error becomes lower and I find highly significant differences  
between the plant types.


suc- success + 1
fail- 11 - suc
Y- cbind(suc,fail)

Call:
glm(formula = Y ~ type, family = binomial)

Deviance Residuals:
   Min  1Q  Median  3Q
-1.279e+00  -4.712e-08  -4.712e-08   0.000e+00
   Max
 2.584e+00

Coefficients:
Estimate Std. Error z value Pr(|z|)
(Intercept)   0.2231 0.2023   1.103 0.27
typeFxC  -2.5257 0.4039  -6.253 4.02e-10 ***
typeFxD  -2.5257 0.4039  -6.253 4.02e-10 ***
---
Signif. codes:  0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 86.391  on 26  degrees of freedom
Residual deviance: 11.793  on 24  degrees of freedom
AIC: 76.77

Number of Fisher Scoring iterations: 4


So I think the 0 values of all plants of group 2 and 3 are the  
problem, do you agree?
I don't know why this is a problem, or how I can explain to a reviewer  
why a data transformation (+ 1) is necessary with such a dataset.


I would greatly appreciate any comments.
Juerg
__

Jürg Schulze
Department of Environmental Sciences
Section of Conservation Biology
University of Basel
St. Johanns-Vorstadt 10
4056 Basel, Switzerland
Tel.: ++41/61/267 08 47

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] problem with glm(family=binomial) when some levels have only 0 proportion values

2011-03-02 Thread Robert A LaBudde

The algorithm is not converging. Your iterations are at the maximum.

It won't do any good to add a fractional number 
to all data, as the result will depend on the 
number added (try 1.0, 0.5 and 0.1 to see this).


The root problem is that your data are 
degenerate. Firstly, your types '2' and '3' are 
indistinguishable in your data. Secondly, 
consider the case without 'type'. If you have all 
zero data for 10 trials, you cannot discriminate 
among mu = 0, 0.1, 0.0001, 0.001 or 0.01. 
This leads to numerical instability. Thirdly, the 
variance estimate in the IRLS will start at 0.0, which gives a singularity.


Fundamentally, the algorithm is failing because 
you are at the boundary of possibilities  for a 
parameter, so special techniques are needed to do 
maximum likelihood estimation.


The simple solution is to deal with the data for 
your types separately. Another is to do more 
batches for '2' and '3' to get an observed failure.




At 05:01 AM 3/2/2011, Jürg Schulze wrote:

Hello everybody

I want to compare the proportions of germinated seeds (seed batches of
size 10) of three plant types (1,2,3) with a glm with binomial data
(following the method in Crawley: Statistics,an introduction using R,
p.247).
The problem seems to be that in two plant types (2,3) all plants have
proportions = 0.
I give you my data and the model I'm running:

  success failure type
 [1,]   0   103
 [2,]   0   102
 [3,]   0   102
 [4,]   0   102
 [5,]   0   102
 [6,]   0   102
 [7,]   0   102
 [8,]   461
 [9,]   461
[10,]   371
[11,]   551
[12,]   731
[13,]   461
[14,]   0   103
[15,]   0   103
[16,]   0   103
[17,]   0   103
[18,]   0   103
[19,]   0   103
[20,]   0   102
[21,]   0   102
[22,]   0   102
[23,]   911
[24,]   641
[25,]   461
[26,]   0   103
[27,]   0   103

 y- cbind(success, failure)

 Call:
glm(formula = y ~ type, family = binomial)

Deviance Residuals:
   Min  1Q  Median  3Q
-1.3521849  -0.427  -0.427  -0.427
   Max
 2.6477556

Coefficients:
  Estimate Std. Error z value Pr(|z|)
(Intercept)0.044450.21087   0.2110.833
typeFxC  -23.16283 6696.13233  -0.0030.997
typeFxD  -23.16283 6696.13233  -0.0030.997

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 134.395  on 26  degrees of freedom
Residual deviance:  12.622  on 24  degrees of freedom
AIC: 42.437

Number of Fisher Scoring iterations: 20


Huge standard errors are calculated and there is no difference between
plant type 1 and 2 or between plant type 1 and 3.
If I add 1 to all successes, so that all the 0 values disappear, the
standard error becomes lower and I find highly significant differences
between the plant types.

suc- success + 1
fail- 11 - suc
Y- cbind(suc,fail)

Call:
glm(formula = Y ~ type, family = binomial)

Deviance Residuals:
   Min  1Q  Median  3Q
-1.279e+00  -4.712e-08  -4.712e-08   0.000e+00
   Max
 2.584e+00

Coefficients:
Estimate Std. Error z value Pr(|z|)
(Intercept)   0.2231 0.2023   1.103 0.27
typeFxC  -2.5257 0.4039  -6.253 4.02e-10 ***
typeFxD  -2.5257 0.4039  -6.253 4.02e-10 ***
---
Signif. codes:  0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 86.391  on 26  degrees of freedom
Residual deviance: 11.793  on 24  degrees of freedom
AIC: 76.77

Number of Fisher Scoring iterations: 4


So I think the 0 values of all plants of group 2 and 3 are the
problem, do you agree?
I don't know why this is a problem, or how I can explain to a reviewer
why a data transformation (+ 1) is necessary with such a dataset.

I would greatly appreciate any comments.
Juerg
__

Jürg Schulze
Department of Environmental Sciences
Section of Conservation Biology
University of Basel
St. Johanns-Vorstadt 10
4056 Basel, Switzerland
Tel.: ++41/61/267 08 47

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



Robert A. LaBudde, PhD, PAS, Dpl. ACAFS  e-mail: r...@lcfltd.com
Least Cost Formulations, Ltd.URL: http://lcfltd.com/
824 Timberlake Drive Tel: 757-467-0954
Virginia Beach, VA 23464-3239Fax: 757-467-2947

Vere scire est per causas scire


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and 

Re: [R] problem with glm(family=binomial) when some levels have only 0 proportion values

2011-03-02 Thread csrabak

Em 2/3/2011 08:01, Jürg Schulze escreveu:

Hello everybody


This is not a R related problem, but rather more theoretic one, anyway:



I want to compare the proportions of germinated seeds (seed batches of
size 10) of three plant types (1,2,3) with a glm with binomial data
(following the method in Crawley: Statistics,an introduction using R,
p.247).
The problem seems to be that in two plant types (2,3) all plants have
proportions = 0.
I give you my data and the model I'm running:

success failure type
[1,] 0 10 3


[snipped]


[26,] 0 10 3
[27,] 0 10 3

y- cbind(success, failure)

Call:
glm(formula = y ~ type, family = binomial)

Deviance Residuals:
Min 1Q Median 3Q
-1.3521849 -0.427 -0.427 -0.427
Max
2.6477556

Coefficients:
Estimate Std. Error z value Pr(|z|)
(Intercept) 0.04445 0.21087 0.211 0.833
typeFxC -23.16283 6696.13233 -0.003 0.997
typeFxD -23.16283 6696.13233 -0.003 0.997

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 134.395 on 26 degrees of freedom
Residual deviance: 12.622 on 24 degrees of freedom
AIC: 42.437

Number of Fisher Scoring iterations: 20


Huge standard errors are calculated and there is no difference between
plant type 1 and 2 or between plant type 1 and 3.
If I add 1 to all successes, so that all the 0 values disappear, the
standard error becomes lower and I find highly significant differences
between the plant types.

suc- success + 1
fail- 11 - suc
Y- cbind(suc,fail)

Call:
glm(formula = Y ~ type, family = binomial)

Deviance Residuals:
Min 1Q Median 3Q
-1.279e+00 -4.712e-08 -4.712e-08 0.000e+00
Max
2.584e+00

Coefficients:
Estimate Std. Error z value Pr(|z|)
(Intercept) 0.2231 0.2023 1.103 0.27
typeFxC -2.5257 0.4039 -6.253 4.02e-10 ***
typeFxD -2.5257 0.4039 -6.253 4.02e-10 ***
---
Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 86.391 on 26 degrees of freedom
Residual deviance: 11.793 on 24 degrees of freedom
AIC: 76.77

Number of Fisher Scoring iterations: 4


So I think the 0 values of all plants of group 2 and 3 are the problem,
do you agree?


It depends on the definition of problem here, if the result of your 
experiment, maybe, for the difference in the two regressions, not.



I don't know why this is a problem, or how I can explain to a reviewer
why a data transformation (+ 1) is necessary with such a dataset.


You need to ascertain the modeling of your statistic test against the 
epistemological analysis you're performing. Caveat: I'm not an expert in 
agriculture, so this is just a comment.


If the success rates of your dataframe are the germinations of three 
types of plants in a certain period of time, then perhaps it could make 
sense to add one to all the values in the success column (and subtract 
ones from the failure?) because that would cope with the possibility 
that a certain time after the experiment has been stopped, it could have 
germinated.


If in the other hand, the non germinated seeds are known to not 
germinate anymore, then the calculation device would put you on wrong path.




I would greatly appreciate any comments.


Get a look at the zero inflated (and perhaps hurdle as well) 
distributions and the regressions associated with them.


Using sos I get more than 100 entries to look at, so I'll refrain to put 
specific links here.


HTH

--
Cesar Rabak
DC Consulting LTDA

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.