Richard Ulrich <[EMAIL PROTECTED]> wrote in message news:<[EMAIL PROTECTED]>...
> On 8 Apr 2004 09:40:26 -0700, [EMAIL PROTECTED] (Roger Levy) wrote:
>
> > Hi,
> >
> > I have a question regarding small/sparsely-populated datasets and the
> > reliability of statistical inference when using traditional ML
> > estimation in logistic regression models.
> >
> > I'm working with a sample with several hundred observations, with up
> > to about a dozen plausible covariates, each of which has discrete
> > ordinal values {1,0,-1}. On theoretical grounds (involving the
> > problem domain) I believe it's pretty safe not to use interaction
> > terms, at least until I've thoroughly investigated first-order models.
> > The sample is relatively well-balanced with respect to any individual
> > covariate, but obviously it is sparse with respect to the full
> > potential covariate space. As I understand it, a sparsely populated
> > covariate space does NOT mean that the ML estimates of covariate
> > parameters are biased, so for purposes of, say, prediction and overall
> > model goodness of fit I can safely use traditional asymptotic ML
> > estimation techniques. However, according to my (vague) understanding
> > a sparsely-populated covariate space DOES mean that inference
> > regarding significance of parameter values and parameter confidence
> > intervals by ML techniques WILL be unreliable. Herein lies my
> > confusion.
> >
> > From what I've read (e.g., Agresti 1996), the large-sample ML-based
> > approximation of standard error for parameter values rests on the
> > asymptotic normality of the ML estimator. For a small covariate
> > space, then, I can see why sparse population would mean unreliable
> > SE estimates. If you have a sample size of a few hundred, on the
> > other hand, it seems like the large-sample approximation should
> > apply even if the covariate space is large and sparsely
> > populated. Now, I've empirically found with exact logistic
> > regression (LogXact) that using only a half-dozen of my covariates I
> > get considerable divergence in p-values for some parameters with ML
> > versus exact techniques (technically, the network Monte Carlo method
> > from Mehta and Patel 2000). But I don't understand the theoretical
> > underpinning of why this is happening.
>
> How many do you have in your smaller group? If you have only
> (say) 5 cases, you may be lucky to find anything with *one*
> variable, even though your total N is 300. - And, once the cases
> are 'predicted' adequately, there is little for your extra variables
> to do that won't show up as artifacts of over-fitting.
> If you reach perfect prediction, then your likelihood surface
> has a hole in it -- Not allowable. Or, effectively, your predictors
> can become collinear, when any predictor can substitute for
> some other one: That makes a flat likelihood surface, where
> the SEs become large because they are measured by the
> curvature of the surface -- the flatter the surface, the
> bigger the SE.
By 'cases' I presume you mean distinct covariate vectors? Sorry, I
should have mentioned this -- the number of covariate vectors is on
the order of the sample size (i.e., in the hundreds). So I'm pretty
sure that overfitting and collinearity are not really issues here
(since I'm not including any interaction terms in the model).
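For what it's worth, here's a toy sketch (Python with statsmodels; the
data are simulated, not mine) of the flat-likelihood effect you
describe: as a binary outcome approaches complete separation on a
single predictor, the Wald SE of that predictor's coefficient blows up.

# Toy illustration (simulated data): as a single {-1,0,1} predictor
# approaches complete separation, the ML slope estimate grows and its
# Wald SE explodes, because the log-likelihood surface flattens out.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 300
for overlap in (0.4, 0.2, 0.05):  # smaller overlap -> nearer separation
    x = rng.choice([-1, 0, 1], size=n).astype(float)
    # P(y=1) rises steeply with x; 'overlap' is the fraction of
    # observations left on the "wrong" side of the boundary.
    p = np.clip(0.5 + 0.5 * x, overlap, 1 - overlap)
    y = rng.binomial(1, p)
    fit = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
    print(f"overlap={overlap:.2f}  beta={fit.params[1]:7.2f}  "
          f"SE={fit.bse[1]:7.2f}")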
>
> I don't know why the LogXact method is not affected.
> I suspect that it is, depending on which variety of SE it
> reports.
LogXact determines (or samples from, depending on the method chosen)
the exact conditional distribution of the sufficient statistics,
conditioning on the sufficient statistics for the nuisance parameters,
and uses it to calculate p-values and confidence intervals. Since it
works from the exact distribution rather than a large-sample
approximation, it isn't susceptible to small-sample effects (similar
to the way that, for 2x2 contingency table analysis, Fisher's exact
test can be used for any configuration of cell counts, whereas the
chi-squared test requires adequate expected cell counts).
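A quick illustration of that analogy (Python with scipy; the cell
counts are made up):

# Hypothetical 2x2 table with small cell counts: the large-sample
# chi-squared approximation and Fisher's exact test can disagree.
from scipy.stats import chi2_contingency, fisher_exact

table = [[2, 10],   # made-up counts, small enough to strain the
         [12, 4]]   # chi-squared approximation

chi2_stat, p_chi2, dof, expected = chi2_contingency(table)
odds_ratio, p_exact = fisher_exact(table)
print(f"chi-squared p = {p_chi2:.4f}, Fisher exact p = {p_exact:.4f}")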
>
> >
> > Any explanation would be greatly appreciated! Also, I'd greatly
> > appreciate any references to somewhat more technical discussion of the
> > large-sample approximation of standard error for asymptotic ML
> > techniques.
> >
>
> Check further references in your Agresti?
Not as far as I've found...
Any other thoughts?
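In case it helps make the question concrete, here's a small simulation
sketch (Python/statsmodels) of the kind of setup I have: a few hundred
observations, a dozen {-1,0,1} covariates, no interactions. I can't
reproduce LogXact's exact conditional method in a few lines, so as a
stand-in I compare the Wald p-value for one coefficient against the
likelihood-ratio p-value; the data and effect size are invented.

# Simulated stand-in for the ML-vs-exact comparison: Wald vs
# likelihood-ratio p-values for one coefficient in a logistic model
# with twelve {-1,0,1} covariates and n=300.
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(1)
n, k = 300, 12
X = rng.choice([-1, 0, 1], size=(n, k)).astype(float)
beta = np.zeros(k)
beta[0] = 0.8                       # one genuinely active covariate
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ beta))))

Xc = sm.add_constant(X)
full = sm.Logit(y, Xc).fit(disp=0)

j = 1                               # covariate 0 sits after the constant
p_wald = full.pvalues[j]

# Likelihood-ratio test: refit without column j and compare fits.
reduced = sm.Logit(y, np.delete(Xc, j, axis=1)).fit(disp=0)
p_lr = chi2.sf(2 * (full.llf - reduced.llf), df=1)

print(f"Wald p = {p_wald:.4f}   LR p = {p_lr:.4f}")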
Many thanks,
Roger