Hi,

I have a question regarding small/sparsely-populated datasets and the
reliability of statistical inference using traditional ML estimation
in logistic regression models.

I'm working with a sample of several hundred observations and up
to about a dozen plausible covariates, each of which takes discrete
ordinal values {-1, 0, 1}.  On theoretical grounds (involving the
problem domain) I believe it's pretty safe not to use interaction
terms, at least until I've thoroughly investigated first-order models.
The sample is relatively well-balanced with respect to any individual
covariate, but obviously it is sparse with respect to the full
potential covariate space.  As I understand it, a sparsely populated
covariate space does NOT mean that the ML estimates of covariate
parameters are biased, so for purposes of, say, prediction and overall
model goodness of fit I can safely use traditional asymptotic ML
estimation techniques.  However, according to my (vague) understanding
a sparsely-populated covariate space DOES mean that inference
regarding significance of parameter values and parameter confidence
intervals by ML techniques WILL be unreliable.  Herein lies my
confusion.
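
To make the setup concrete, here is a minimal sketch of the kind of
model I mean, in Python with statsmodels on purely synthetic data (the
sample size, coefficient values, and seed are arbitrary placeholders,
not my actual data):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n, k = 400, 12                            # a few hundred observations, a dozen covariates
    X = rng.choice([-1, 0, 1], size=(n, k))   # ternary ordinal covariates
    beta_true = np.r_[0.5, -0.4, 0.3, np.zeros(k - 3)]  # arbitrary illustrative effects
    eta = 0.2 + X @ beta_true                 # first-order model: no interaction terms
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))

    res = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
    print(res.params)    # ML point estimates
    print(res.bse)       # large-sample (Wald) standard errors
    print(res.pvalues)   # Wald p-values (the asymptotic inference in question)

Even at n = 400 the 3^12 cell space is almost entirely empty, yet the
ML fit goes through without complaint; my question is whether the
standard errors and p-values it reports can be trusted.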

From what I've read (e.g., Agresti 1996), the large-sample ML-based
approximation of the standard errors of parameter estimates is based on
the law of large numbers.  For a small covariate space, then, I can see
why a sparse population means unreliable SE estimates: few cells with
few observations apiece add up to a small sample overall.  If you have a
sample size of a few hundred, on the other hand, it seems like the law
of large numbers would apply even if the covariate space is large and
sparsely populated.  Now, I've empirically found with exact logistic
regression (LogXact) that, using only a half-dozen of my covariates, I
get considerable divergence between the ML and exact p-values for some
parameters (technically, the exact method is the network Monte Carlo
algorithm of Mehta and Patel 2000).  But I don't understand the
theoretical reason this is happening.
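
I can't reproduce LogXact's exact conditional inference here, but as a
rough stand-in, comparing two tests that agree asymptotically (Wald
versus likelihood-ratio) already exposes the kind of small-sample
divergence I mean.  Continuing from the synthetic X and y in the sketch
above (the choice of the first covariate is arbitrary):

    from scipy import stats

    Xc = sm.add_constant(X)
    full = sm.Logit(y, Xc).fit(disp=0)
    wald_p = full.pvalues[1]                  # Wald p-value for the first covariate

    # refit without that covariate and compare maximized log-likelihoods
    reduced = sm.Logit(y, np.delete(Xc, 1, axis=1)).fit(disp=0)
    lr_stat = 2.0 * (full.llf - reduced.llf)  # likelihood-ratio statistic
    lr_p = stats.chi2.sf(lr_stat, df=1)       # asymptotic chi-square(1) p-value

    print(wald_p, lr_p)  # equivalent asymptotically, but can drift apart in sparse data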

Any explanation would be greatly appreciated!  I'd also welcome
references to a somewhat more technical discussion of the large-sample
standard-error approximation used by asymptotic ML techniques.

Many thanks,

Roger Levy