On Aug 5, 2009, at 12:51 AM, Thomas Mang wrote:

Marc Schwartz wrote:
On Aug 3, 2009, at 12:06 AM, Thomas Mang wrote:
Hi,

Suppose a binomial GLM with both continuous and categorical predictors (sometimes referred to as a GLM-ANCOVA, if I remember correctly). For the categorical predictors (i.e. indicator variables), is there a suggested minimum frequency for each level? Would such a rule/recommendation also depend on the y-side?

Example: N is quite large, a bit over 100. However, only 0/1s are observed (so Bernoulli random variables, not binomial, because the covariates come from observations and are in general always different between observations). There are two categorical predictors, each with 2 levels. Structurally it would probably also make sense to allow an interaction between them, yielding de facto a single categorical predictor with 4 levels. Is there then a minimum number of observations falling into each of the 4 level categories (either absolute or relative), or that plus also considering the y-side?
Must be the day for sample size questions for logistic regression; a similar query is on MedStats today.

The typical minimum sample size recommendation for logistic regression is based upon covariate degrees of freedom (or columns in the model matrix). The guidance is that there should be 10 to 20 *events* per covariate degree of freedom. So if you have 2 factors, each with two levels, that gives you two covariate degrees of freedom in total (two columns in the model matrix). At the high end of the above range, you would need 40 events in your sample. If the event incidence in your sample is 10%, you would need 400 cases to observe 40 events to support the model with the two two-level covariates (Y ~ X1 + X2). An interaction term (in addition to the 2 main effect terms, Y ~ X1 * X2) would in this case add another column to the model matrix, so you would need an additional 20 events, or another 200 cases in your sample. So you could include the two two-level factors and the interaction term if you have 60 events, or in my example, about 600 cases.
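As a rough back-of-the-envelope sketch of that arithmetic in R (illustrative only; the function and argument names below are made up for this post, not from any package):

required_cases <- function(n_df, incidence, events_per_df = 20) {
  events_needed <- events_per_df * n_df   # events needed per the rule of thumb
  ceiling(events_needed / incidence)      # total cases needed to expect that many events
}

required_cases(n_df = 2, incidence = 0.10)   # Y ~ X1 + X2  -> 400 cases
required_cases(n_df = 3, incidence = 0.10)   # Y ~ X1 * X2  -> 600 cases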

Thanks for that. I suppose your term 'event' does not refer to anything technical in GLMs, so I assume that both the number of observed 0s and the number of observed 1s have to be >= 10/20 for each df (since it is arbitrary which of them is the event and which is the non-event).

Sorry for any confusion. In my applications (clinical), we are typically modeling/predicting the probability of a discrete event (e.g. death, stroke, repeat intervention) or, perhaps more generally, the presence/absence of some characteristic (e.g. renal failure). So I think in terms of events, which also corresponds to Cox regression, where similar 'event'/sample size guidelines are in place for time-based event models.

As you note, the count/sample size requirements importantly refer to the smaller incidence/proportion of the two possible response variable values. So you may be interested in modeling/predicting a response value that has a probability of 0.7, but the requirements will be based upon the 0.3 probability response value.
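In R terms, the limiting count is simply the frequency of the less common response value; a trivial sketch with simulated data:

set.seed(1)
y <- rbinom(500, 1, 0.7)                    # simulated 0/1 response, P(Y = 1) = 0.7
n_events <- min(sum(y == 1), sum(y == 0))   # the rarer value drives the requirement
n_events / 20                               # rough number of covariate df supportable at 20 events/df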


OK, two questions: The model also contains continuous predictors (call them W, so the model is Y ~ X1*X2 + W). Does the same apply here too, i.e. 10-20 more events for each df of these? [If the answer to the former is yes, this question is redundant:] If there are interactions between the continuous covariates and a categorical predictor (Y ~ X1 * (X2 + W)), how many more events do I need? Does the rule for the categorical predictors apply, or the one for the continuous covariates?

I tend to think in terms of the number of columns that would be in the model matrix, where each column corresponds to one covariate degree of freedom. So if you create a model matrix using contrived data that reflects your expected actual data, along with a given formula, you can perhaps better quantify the requirements. See ?model.matrix for more information.
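For example (contrived data, made up purely to count the columns):

set.seed(1)
d <- data.frame(X1 = factor(sample(c("a", "b"), 20, replace = TRUE)),
                X2 = factor(sample(c("c", "d"), 20, replace = TRUE)),
                W  = rnorm(20))

mm <- model.matrix(~ X1 * X2 + W, data = d)
colnames(mm)    # "(Intercept)" "X1b" "X2d" "W" "X1b:X2d"
ncol(mm) - 1    # 4 covariate df, excluding the intercept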

Each continuous variable, as a main effect term, creates a single column in the model matrix and therefore adds one degree of freedom, requiring 10-20 'events' for each and a corresponding increase in the total number of cases.

A single interaction term between a factor and a continuous variable (Factor * Continuous) results in 'nlevels(factor) - 1' additional columns in the model matrix. So again, for each additional column, the 'event'/sample size requirements are in place.
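For example, with a hypothetical 3-level factor Fac and a continuous W:

d2 <- data.frame(Fac = factor(rep(c("a", "b", "c"), each = 10)),
                 W   = rnorm(30))
colnames(model.matrix(~ Fac * W, data = d2))
# "(Intercept)" "Facb" "Facc" "W" "Facb:W" "Facc:W"
# the interaction itself adds nlevels(Fac) - 1 = 2 columns beyond the main effects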

Of course, more complex interaction terms and formulae will impact the model matrix accordingly, so as noted, it may be best to create one using dummy data, if your model formulae will be more complicated.
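For instance, the formula from your second question can be checked directly on dummy data (again just a sketch):

d3 <- data.frame(X1 = factor(rep(c("a", "b"), 10)),
                 X2 = factor(rep(c("c", "d"), each = 10)),
                 W  = rnorm(20))
ncol(model.matrix(~ X1 * (X2 + W), data = d3)) - 1
# 5 covariate df (X1, X2, W, X1:X2, X1:W), so 50-100 events at 10-20 per df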

HTH,

Marc Schwartz

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.