On 8 Apr 2004 09:40:26 -0700, [EMAIL PROTECTED] (Roger Levy) wrote:
> Hi,
>
> I have a question regarding small/sparsely-populated datasets and the
> reliability of statistical inference in using traditional ML
> estimation in logistic regression models.
>
> I'm working with a sample with several hundred observations, with up
> to about a dozen plausible covariates, each of which has discrete
> ordinal values {1,0,-1}. On theoretical grounds (involving the
> problem domain) I believe it's pretty safe not to use interaction
> terms, at least until I've thoroughly investigated first-order models.
> The sample is relatively well-balanced with respect to any individual
> covariate, but obviously it is sparse with respect to the full
> potential covariate space. As I understand it, a sparsely populated
> covariate space does NOT mean that the ML estimates of covariate
> parameters are biased, so for purposes of, say, prediction and overall
> model goodness of fit I can safely use traditional asymptotic ML
> estimation techniques. However, according to my (vague) understanding
> a sparsely-populated covariate space DOES mean that inference
> regarding significance of parameter values and parameter confidence
> intervals by ML techniques WILL be unreliable. Herein is where my
> confusion lies.
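Just to put a number on that sparsity: with a dozen covariates each
taking values in {-1, 0, 1}, the full covariate space has 3^12 =
531,441 cells, so a few hundred observations leave almost every cell
empty and almost every observed covariate pattern unique. A throwaway
sketch in Python/numpy (the n of 300 and the uniform {-1,0,1} values
are just stand-ins for the data you describe):

    import numpy as np

    rng = np.random.default_rng(0)
    n, k = 300, 12                        # "several hundred" obs, a dozen covariates
    X = rng.integers(-1, 2, size=(n, k))  # each covariate in {-1, 0, 1} (assumed uniform here)

    print(3 ** k)                         # 531441 cells in the full covariate space
    print(len(np.unique(X, axis=0)))      # ~300 distinct patterns: nearly one per observation

So even though each covariate is well balanced on its own, the joint
covariate space is essentially all empty cells.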
>
> From what I've read (e.g., Agresti 1996), the large-sample ML-based
> approximation of standard error for parameter values is based on the
> law of large numbers. For a small covariate space, then, I can see
> why sparsely populated means unreliable SE estimates. If you have a
> sample size of a few hundred, on the other hand, it seems like the law
> of large numbers would apply even if the covariate space is large and
> sparsely populated. Now, I've empirically found with exact logistic
> regression (LogXact) that using only a half-dozen of my covariates I
> get considerable divergence in p-values for some parameters with ML
> versus exact techniques (technically, the network Monte Carlo method
> from Mehta and Patel 2000). But I don't understand the theoretical
> underpinning of why this is happening.
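Before getting to questions about your data: it may help to write down
what that large-sample approximation actually is. The MLE is treated
as approximately normal, with covariance equal to the inverse of the
observed information (the negative Hessian of the log-likelihood at
its maximum), and the reported SEs are the square roots of its
diagonal. Here is a minimal sketch in Python with numpy/scipy -- the
sample size, number of covariates, and coefficients are all made up
for illustration, and it only shows where the numbers come from, not
any particular package's output:

    import numpy as np
    from scipy import optimize, stats

    rng = np.random.default_rng(1)
    n, k = 300, 6                                 # sample size and covariate count: made up
    X = np.column_stack([np.ones(n), rng.integers(-1, 2, size=(n, k))])
    beta_true = np.array([0.0, 1.0, -0.5, 0.5, 0.0, 0.0, 0.0])   # arbitrary coefficients
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ beta_true))))

    def negloglik(beta):
        eta = X @ beta
        return -(y @ eta - np.logaddexp(0.0, eta).sum())   # -sum[ y*eta - log(1+e^eta) ]

    def grad(beta):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        return -(X.T @ (y - p))

    beta_hat = optimize.minimize(negloglik, np.zeros(X.shape[1]),
                                 jac=grad, method="BFGS").x

    # Observed information = negative Hessian of the log-likelihood at the MLE;
    # for logistic regression it is X' W X with W = diag(p_i * (1 - p_i)).
    p_hat = 1.0 / (1.0 + np.exp(-(X @ beta_hat)))
    info = X.T @ (X * (p_hat * (1.0 - p_hat))[:, None])

    se = np.sqrt(np.diag(np.linalg.inv(info)))    # asymptotic (Wald) standard errors
    z = beta_hat / se
    p_wald = 2.0 * stats.norm.sf(np.abs(z))       # the usual large-sample p-values
    print(np.column_stack([beta_hat, se, p_wald]))

Everything here leans on the normal approximation to the distribution
of the estimates. The exact conditional approach in LogXact (the Mehta
and Patel network algorithm, or its Monte Carlo variant) does not use
that approximation at all, so the two sets of p-values can drift apart
whenever the approximation is poor.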
How many do you have in your smaller outcome group? If you have only
(say) 5 cases, you will be lucky to find anything with even *one*
variable, even though your total N is 300. And once those cases are
'predicted' adequately, there is little left for your extra variables
to do that won't show up as artifacts of over-fitting.
If you reach perfect prediction (complete separation), the likelihood
has no finite maximum: the estimates run off toward infinity, which is
not allowable. Or, effectively, your predictors can become collinear,
when any predictor can substitute for some other one. That makes a
nearly flat likelihood surface, and the SEs become large because they
are measured by the curvature of the log-likelihood at its maximum;
the flatter the surface, the larger the SE.
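To see that flatness in numbers, run the same curvature-based SE
calculation from the sketch above on data that are only a whisker away
from perfect prediction. This is simulated, single-predictor data,
nothing like yours -- the only point is that when the two outcome
groups barely overlap, the log-likelihood is nearly flat around its
maximum and the SEs balloon:

    import numpy as np
    from scipy import optimize
    from scipy.special import expit                 # numerically stable 1/(1+exp(-eta))

    def logit_mle_se(X, y):
        """MLE and curvature-based (Wald) SEs for a logistic regression."""
        negloglik = lambda b: -(y @ (X @ b) - np.logaddexp(0.0, X @ b).sum())
        grad = lambda b: -(X.T @ (y - expit(X @ b)))
        b = optimize.minimize(negloglik, np.zeros(X.shape[1]),
                              jac=grad, method="BFGS").x
        w = expit(X @ b) * (1.0 - expit(X @ b))
        info = X.T @ (X * w[:, None])               # curvature of the log-likelihood at its peak
        return b, np.sqrt(np.diag(np.linalg.inv(info)))

    rng = np.random.default_rng(2)
    n = 300
    x = np.sort(rng.normal(size=n))
    X = np.column_stack([np.ones(n), x])

    # Well-behaved outcome: plenty of overlap between the y=0 and y=1 groups.
    y_mixed = rng.binomial(1, expit(x))

    # Nearly separated outcome: the sign of x predicts y almost perfectly; a few
    # labels right at the boundary are flipped so the MLE still exists.
    y_near = (x > 0).astype(float)
    j = np.searchsorted(x, 0.0)                     # index of the smallest positive x
    y_near[j:j + 3] = 0.0
    y_near[j - 3:j] = 1.0

    print(logit_mle_se(X, y_mixed))   # modest slope, modest SE
    print(logit_mle_se(X, y_near))    # huge slope, huge SE: the surface is nearly flat

Push those few flipped labels back and you have perfect separation:
then there is no finite maximum at all, and no curvature to measure.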
I don't know why the LogXact method is not affected.
I suspect that it is, depending on which variety of SE it
reports.
>
> Any explanation would be greatly appreciated! Also, I'd greatly
> appreciate any references to somewhat more technical discussion of the
> large-sample approximation of standard error for asymptotic ML
> techniques.
>
Check further references in your Agresti?
--
Rich Ulrich, [EMAIL PROTECTED]
http://www.pitt.edu/~wpilib/index.html