[R] A question on modelling binary response data using factors

Cormac Long Wed, 15 Sep 2010 04:55:05 -0700

 Dear all,

A question on modelling proportional data in R.  I have a test experiment
that was designed in a particular way, and which I can analyse "by hand" to
an extent.  I am really struggling to get R to give me sensible results in
modelling it "properly", so I must be doing something wrong here.


As background, I conduct a series of experiments and count the "hits" and
"misses" - I believe this could be modelled using a glm.  The experiments
are done under a range of conditions, altering three different "factors",
let's call them A, B, and C.  Factors A and B each have 2 levels (A1, A2,
B1, B2) while factor C has 5 levels (C1, C2, C3, C4, C5).  The experiment
has only partial coverage, that is, not every A is tested with every B and
every C.  However, I was careful in the experimental design to ensure that
every A and every B was tested against at least one C.

 Here is my experimental data:

  FactorA  FactorB  FactorC   Hit  Miss
  A1       B1       C1         17    83
  A1       B1       C2         17    83
  A1       B1       C3         18    82
  A1       B1       C4         NA    NA
  A1       B1       C5         NA    NA
  A1       B2       C1         11    89
  A1       B2       C2         17    83
  A1       B2       C3         17    83
  A1       B2       C4         NA    NA
  A1       B2       C5         NA    NA
  A2       B1       C1         NA    NA
  A2       B1       C2         NA    NA
  A2       B1       C3         23    77
  A2       B1       C4         19    81
  A2       B1       C5         29    71
  A2       B2       C1         NA    NA
  A2       B2       C2         NA    NA
  A2       B2       C3         13    87
  A2       B2       C4         20    80
  A2       B2       C5         24    76
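For anyone wanting to reproduce this, the table can be entered in R roughly
as follows.  This is only a sketch: the object names (dat, tested) are
placeholders of my own choosing, and expand.grid is used purely to generate
the factor combinations in the same row order as the table.

```r
# Factor combinations in table order: FactorC varies fastest, FactorA slowest.
dat <- expand.grid(FactorC = paste0("C", 1:5),
                   FactorB = c("B1", "B2"),
                   FactorA = c("A1", "A2"),
                   stringsAsFactors = TRUE)
dat <- dat[, c("FactorA", "FactorB", "FactorC")]

# Counts copied from the table; NA marks combinations that were never tested.
dat$Hit  <- c(17, 17, 18, NA, NA,   11, 17, 17, NA, NA,
              NA, NA, 23, 19, 29,   NA, NA, 13, 20, 24)
dat$Miss <- c(83, 83, 82, NA, NA,   89, 83, 83, NA, NA,
              NA, NA, 77, 81, 71,   NA, NA, 87, 80, 76)

# Keep only the combinations that were actually run.
tested <- dat[complete.cases(dat), ]

sum(tested$Hit)                 # total hits in the pool
sum(tested$Hit + tested$Miss)   # total trials in the pool
```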


I want to compare individual rates against the overall rate, and to look at
the differences in pooled rate across the levels of each factor.

Doing hand calculations, it is possible to calculate the individual hit
rates, the overall hit rate of the pool, and the standard errors on these
using

  hitrate      = hits/(hits+misses)
  hitrateerror = sqrt((hits+misses)*(hits/(hits+misses))*(misses/(hits+misses)))/(hits+misses)

The second expression is the usual binomial standard error
sqrt(p*(1-p)/n), with p = hitrate and n = hits+misses.
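As a sketch, the two hand formulas above could be written as R functions
like this.  The function names hit_rate and hit_rate_se are placeholders of
my own, not anything built into R.

```r
# Hit rate: proportion of hits among all trials.
hit_rate <- function(hits, misses) {
  hits / (hits + misses)
}

# Standard error of the hit rate.
# sqrt(p * (1 - p) / n) is algebraically the same as sqrt(n * p * q) / n
# with q = misses / n, which is the form used in the hand calculation.
hit_rate_se <- function(hits, misses) {
  n <- hits + misses
  p <- hits / n
  sqrt(p * (1 - p) / n)
}

# One cell of the table (A1 B1 C1: 17 hits, 83 misses).
hit_rate(17, 83)       # 0.17
hit_rate_se(17, 83)    # roughly 0.038

# Overall pool (225 hits, 975 misses).
hit_rate(225, 975)     # 0.1875
hit_rate_se(225, 975)  # roughly 0.011
```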

Again doing hand calculations, it is possible to extract the influence of
each variant of each factor, by producing a "pool" and comparing the
influence of each variant to the pool.  For the overall pool, there were
225 hits from 1200 trials, for a pooled hitrate of 18.8%.

For factor C the analysis is relatively straightforward, in that we
calculate the hitrate for each level: C1 (28 hits from 200 trials = 14.0%),
C2, etc.  We then calculate the influence of a given level compared to the
pool - for example C1 has a hitrate of 14.0% compared to a pooled hitrate of
18.8%, so has an influence of (14.0%/18.8%)-100% = -25%.
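The factor C calculation can be sketched in R as below.  The per-level
totals are simply the Hit and Hit+Miss columns of the table summed within
each level of C; the variable names are my own.

```r
# Hits and trials summed over all tested A/B combinations for each C level.
hits_C   <- c(C1 = 28,  C2 = 34,  C3 = 71,  C4 = 39,  C5 = 53)
trials_C <- c(C1 = 200, C2 = 200, C3 = 400, C4 = 200, C5 = 200)

# Overall pooled hit rate: 225 / 1200.
pooled <- sum(hits_C) / sum(trials_C)

# Hit rate per level, and its influence relative to the pool.
rate_C      <- hits_C / trials_C
influence_C <- rate_C / pooled - 1

# Influence as a percentage; C1 comes out at about -25%.
round(100 * influence_C, 1)
```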

For factor A, the analysis is slightly complicated by the fact that C3 is
the only level of factor C that "fairly" tests factor A (none of C1, C2, C4
or C5 has experiments for both A1 and A2).  Therefore, just taking the C3
data, we can show that the pool of 400 trials had 71 hits, for a hitrate of
17.8%.  We can then calculate the hitrate for A1 and A2 as 17.5% and 18.0%
respectively.  From this we can say that the effect of A1 and A2 was -1.4%
and +1.4%, that is to say (17.5%/17.8%)-100% and (18.0%/17.8%)-100%
(using the unrounded pool hitrate of 17.75%).
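The same factor A calculation, restricted to the C3 rows, might look like
this in R.  It is only a sketch of the hand calculation; the object names
are placeholders.

```r
# The four C3 rows of the table.
c3 <- data.frame(
  FactorA = c("A1", "A1", "A2", "A2"),
  FactorB = c("B1", "B2", "B1", "B2"),
  Hit     = c(18, 17, 23, 13),
  Miss    = c(82, 83, 77, 87)
)

# Pool for the C3 data only: 71 hits from 400 trials.
pool_c3 <- sum(c3$Hit) / sum(c3$Hit + c3$Miss)

# Hit rate for each level of A within C3.
hits_A   <- tapply(c3$Hit, c3$FactorA, sum)
trials_A <- tapply(c3$Hit + c3$Miss, c3$FactorA, sum)
rate_A   <- hits_A / trials_A

# Influence of each level of A relative to the C3 pool (about -1.4% and +1.4%).
influence_A <- rate_A / pool_c3 - 1
round(100 * influence_A, 1)
```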

The trick here is that the pooled hitrate for factor A is 17.8%, i.e. the
hitrate for C3 alone, not 18.8% which is the overall pooled hitrate.
However, we have taken out the influence of C3 by normalising to the C3
hitrate, so we could calculate the expected hitrate of the combination
A2 B2 C1, which was never tested, by combining the influence of A2, B2 and
C1 on the overall pool hitrate of 18.8%.
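One possible reading of "combining the influence" is sketched below.  Two
things here are assumptions and not stated above: that the influences
combine multiplicatively, and that the influence of B2 is taken relative to
the overall pool (B1 and B2 were both run in every tested A/C combination,
so 102 hits from 600 trials is the B2 total over the whole table).

```r
# Overall pool: 225 hits from 1200 trials.
pool_all <- 225 / 1200

# Influence of C1 relative to the overall pool (28 hits from 200 trials).
infl_C1 <- (28 / 200) / pool_all

# Influence of A2 relative to the C3-only pool (36 of 200 vs 71 of 400).
infl_A2 <- (36 / 200) / (71 / 400)

# ASSUMPTION: influence of B2 relative to the overall pool (102 of 600).
infl_B2 <- (102 / 600) / pool_all

# ASSUMPTION: influences combine multiplicatively on the pooled hit rate.
expected_A2B2C1 <- pool_all * infl_A2 * infl_B2 * infl_C1
round(100 * expected_A2B2C1, 1)   # about 12.9% under these assumptions
```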

It should be possible to model this using e.g. a glm with appropriate
contrasts and model, however for the life of me I cannot work out how to get
this to work.  I cannot get R to provide a sensible "intercept" to the
model, i.e. the overall hitrate of 18.8%, or to give sensible indications
of the importance or otherwise of each factor.
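For concreteness, here is a minimal sketch of the kind of fit being
described.  The choice of a main-effects-only formula, a logit link and
sum-to-zero contrasts (contr.sum) are all assumptions on my part, and may
well not be the right ones.  Note in particular that with sum-to-zero
contrasts the intercept is an unweighted average over factor levels on the
logit scale, so it need not reproduce the pooled hitrate of 18.8% exactly.

```r
# The twelve tested combinations from the table.
dat <- data.frame(
  FactorA = factor(rep(c("A1", "A2"), each = 6)),
  FactorB = factor(rep(rep(c("B1", "B2"), each = 3), times = 2)),
  FactorC = factor(c("C1", "C2", "C3", "C1", "C2", "C3",
                     "C3", "C4", "C5", "C3", "C4", "C5")),
  Hit     = c(17, 17, 18, 11, 17, 17, 23, 19, 29, 13, 20, 24),
  Miss    = c(83, 83, 82, 89, 83, 83, 77, 81, 71, 87, 80, 76)
)

# Binomial glm on (hits, misses), main effects only, sum-to-zero contrasts.
fit <- glm(cbind(Hit, Miss) ~ FactorA + FactorB + FactorC,
           family = binomial, data = dat,
           contrasts = list(FactorA = "contr.sum",
                            FactorB = "contr.sum",
                            FactorC = "contr.sum"))

summary(fit)

# Intercept transformed back from the logit scale to a probability.
plogis(coef(fit)[["(Intercept)"]])

# Likelihood-ratio test for each factor, adjusted for the other two.
drop1(fit, test = "Chisq")
```

Using drop1 rather than looking at factor A on its own means that A is
assessed after allowing for C, which is the comparison of interest here.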

In particular, I am really struggling to get R to test the relevance of
factor A - this is because when tested using all levels of factor C, not
just C3, there is a huge bias in that C1 and C2 have comparatively low hit
rates, and were only tested against A1, whereas C4 and C5 have comparatively
high hit rates, but were only tested against A2.  That means that it
appears, unless you look at the C3 data only, that factor A is highly
important - whereas it clearly is not.
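The bias can be seen directly by comparing A1 with A2 over the whole table
and then over the C3 rows only.  The sketch below uses prop.test as a quick
two-sample comparison of proportions; that choice of test is my own and is
only meant as an illustration, not as the proper analysis.

```r
# Hits and trials for A1 and A2 over every tested combination.
hits_all   <- c(A1 = 17 + 17 + 18 + 11 + 17 + 17,
                A2 = 23 + 19 + 29 + 13 + 20 + 24)
trials_all <- c(A1 = 600, A2 = 600)

# Hits and trials for A1 and A2 using the C3 rows only.
hits_c3   <- c(A1 = 18 + 17, A2 = 23 + 13)
trials_c3 <- c(A1 = 200, A2 = 200)

# Hit rates: a clear gap over all the data, almost none within C3.
hits_all / trials_all
hits_c3 / trials_c3

# Two-sample tests of equal proportions for each comparison.
p_all <- prop.test(hits_all, trials_all)$p.value
p_c3  <- prop.test(hits_c3, trials_c3)$p.value
c(all_C = p_all, C3_only = p_c3)
```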

The reason for using R to model this "correctly", rather than just doing
the hand calculations above, is that I would then like to be able to test
for interactions as well (and the significance thereof), and perform rather
more complex sets of experiments using the same general ideas.

Apologies for the massive post, but I have been banging my head against
this one for more than a week now and am absolutely sure there is something
simple that I am doing wrong!

 Yours sincerely,
 Cormac Long.


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
