On Aug 5, 2009, at 12:51 AM, Thomas Mang wrote:

Marc Schwartz wrote:
On Aug 3, 2009, at 12:06 AM, Thomas Mang wrote:
Hi,

Suppose a binomial GLM with both continuous and categorical predictors (sometimes referred to as a GLM-ANCOVA, if I remember correctly). For the categorical predictors (i.e. indicator variables), is there a suggested minimum frequency for each level? Would such a rule/recommendation also depend on the y-side?

Example: N is quite large, a bit over 100. However, only 0/1s are observed (so Bernoulli random variables, not binomial, because the covariates come from observations and are in general always different between observations). There are two categorical predictors, each with 2 levels. Structurally it would probably also make sense to allow an interaction between them, yielding de facto a single categorical predictor with 4 levels. Is there then a minimum number of observations falling into each of the 4 level categories (either absolute or relative), or that plus also considering the y-side?
Must be the day for sample size questions for logistic regression; a similar query is on MedStats today.

The typical minimum sample size recommendation for logistic regression is based upon covariate degrees of freedom (or columns in the model matrix). The guidance is that there should be 10 to 20 *events* per covariate degree of freedom. So if you have 2 factors, each with two levels, that gives you two covariate degrees of freedom in total (two columns in the model matrix). At the high end of the above range, you would need 40 events in your sample. If the event incidence in your sample is 10%, you would need 400 cases to observe 40 events to support the model with the two two-level covariates (Y ~ X1 + X2). An interaction term (in addition to the 2 main effect terms, Y ~ X1 * X2) would in this case add another column to the model matrix, so you would need an additional 20 events, or another 200 cases in your sample. So you could include the two two-level factors and the interaction term if you have 60 events, or in my example, about 600 cases.
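As a rough back-of-the-envelope sketch of that arithmetic in R (illustrative only; the function and argument names below are made up for this post, not from any package):

required_cases <- function(n_df, incidence, events_per_df = 20) {
  events_needed <- events_per_df * n_df   # events needed per the rule of thumb
  ceiling(events_needed / incidence)      # total cases needed to expect that many events
}

required_cases(n_df = 2, incidence = 0.10)   # Y ~ X1 + X2  -> 400 cases
required_cases(n_df = 3, incidence = 0.10)   # Y ~ X1 * X2  -> 600 cases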

Thanks for that. I suppose your term 'event' does not refer to anything technical in GLMs, so I assume that both the number of observed 0s and the number of observed 1s have to be >= 10/20 for each df (since it is arbitrary which of them is the event and which is the non-event).

Sorry for any confusion. In my applications (clinical), we are typically modeling/predicting the probability of a discrete event (e.g. death, stroke, repeat intervention) or, perhaps more generally, the presence/absence of some characteristic (e.g. renal failure). So I think in terms of events, which also corresponds to Cox regression, where similar 'event'/sample size guidelines are in place for time-based event models.

As you note, the count/sample size requirements importantly refer to the smaller incidence/proportion of the two possible response variable values. So you may be interested in modeling/predicting a response value that has a probability of 0.7, but the requirements will be based upon the 0.3 probability response value.
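In R terms, the limiting count is simply the frequency of the less common response value; a trivial sketch with simulated data:

set.seed(1)
y <- rbinom(500, 1, 0.7)                    # simulated 0/1 response, P(Y = 1) = 0.7
n_events <- min(sum(y == 1), sum(y == 0))   # the rarer value drives the requirement
n_events / 20                               # rough number of covariate df supportable at 20 events/df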


OK, two questions: The model also contains continuous predictors (call them W, so the model is Y ~ X1*X2 + W). Does the same apply here too, i.e. 10-20 more events for each df of these? [If the answer to the former is yes, this question is redundant:] If there are interactions between the continuous covariates and a categorical predictor (Y ~ X1 * (X2 + W)), how many more events do I need? Does the rule for the categorical predictors apply, or the one for the continuous covariates?

I tend to think in terms of the number of columns that would be in the model matrix, where each column corresponds to one covariate degree of freedom. So if you create a model matrix using contrived data that reflects your expected actual data, along with a given formula, you can perhaps better quantify the requirements. See ?model.matrix for more information.
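For example (contrived data, made up purely to count the columns):

set.seed(1)
d <- data.frame(X1 = factor(sample(c("a", "b"), 20, replace = TRUE)),
                X2 = factor(sample(c("c", "d"), 20, replace = TRUE)),
                W  = rnorm(20))

mm <- model.matrix(~ X1 * X2 + W, data = d)
colnames(mm)    # "(Intercept)" "X1b" "X2d" "W" "X1b:X2d"
ncol(mm) - 1    # 4 covariate df, excluding the intercept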

Each continuous variable, as a main effect term, creates a single column in the model matrix and therefore adds one degree of freedom, requiring 10-20 'events' for each and a corresponding increase in the total number of cases.

A single interaction term between a factor and a continuous variable (Factor * Continuous) results in 'nlevels(factor) - 1' additional columns in the model matrix. So again, for each additional column, the 'event'/sample size requirements are in place.
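For example, with a hypothetical 3-level factor Fac and a continuous W:

d2 <- data.frame(Fac = factor(rep(c("a", "b", "c"), each = 10)),
                 W   = rnorm(30))
colnames(model.matrix(~ Fac * W, data = d2))
# "(Intercept)" "Facb" "Facc" "W" "Facb:W" "Facc:W"
# the interaction itself adds nlevels(Fac) - 1 = 2 columns beyond the main effects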

Of course, more complex interaction terms and formulae will impact the model matrix accordingly, so as noted, it may be best to create one using dummy data, if your model formulae will be more complicated.
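For instance, the formula from your second question can be checked directly on dummy data (again just a sketch):

d3 <- data.frame(X1 = factor(rep(c("a", "b"), 10)),
                 X2 = factor(rep(c("c", "d"), each = 10)),
                 W  = rnorm(20))
ncol(model.matrix(~ X1 * (X2 + W), data = d3)) - 1
# 5 covariate df (X1, X2, W, X1:X2, X1:W), so 50-100 events at 10-20 per df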

HTH,

Marc Schwartz

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.