On 10 Apr 2004 09:37:16 -0700, [EMAIL PROTECTED] (Roger Levy) wrote:

[snip, earlier posts of his and mine]

me >
> > Now you have confused me, a lot.
> > By 'cases in the smaller group', I am using the common metaphor
> > of logistic regression, where the prediction is being made between
> > cases and non-cases.

RL >
> Ah, I think I misunderstood you.  I'm not familiar with the
> cases/non-cases terminology of logistic regression -- could you
> explain this usage?
I will explain by way of an extract from a useful reference, which
includes the point I was making -- from
http://www2.chass.ncsu.edu/garson/pa765/logistic.htm
[after a number of pages]

"How many independents can I have?

"There is no precise answer to this question, but the more
independents, the more likelihood of multicollinearity.  Also, if you
have 20 independents, at the .05 level of significance you would
expect one to be found to be significant just by chance.  A rule of
thumb is that there should be no more than 1 independent for each 10
cases in the sample.  In applying this rule of thumb, keep in mind
that if there are categorical independents, such as dichotomies, the
number of cases should be considered to be the lesser of the groups
(ex., in a dichotomy with 500 0's and 10 1's, effective size would
be 10)."

 ---- end of extract from Garson.  You might find the whole document
interesting to scan.  (The effective-size arithmetic is spelled out
in the first sketch at the end of this post.)

[snip, more of mine]

> By a "distinct covariate vector" I mean the following: with n
> covariates (i.e., predictors) X_1,...,X_n, a covariate vector is a
> value [x_1,...,x_n] for a given data point.  So, for example, if I
> have a half-dozen binary covariates, there are 2^6=64 logically
> possible covariate vectors.

Now I wonder what computer program you are using.  What you describe
was once a concern for the packages, about 20 years ago.  I remember
a program that wanted me to sort my cases into order, so that each
'possible covariate vector' (as you say) would be contiguous and the
program could form the actual groups.  I do not *think* that is a
concern any more for modern packages, even though I find an ambiguous
reference to your concern in the Garson document I cited.

> Each of my covariates is three-valued.  So the situation for which
> ML and exact logistic regression were giving me substantially
> different results was with a half-dozen covariates, i.e. 3^6=729
> possible covariate vectors, and 300 datapoints, therefore the
> covariate space was sparsely populated.  I was not including any
> interaction terms, and in most cases each datapoint had a unique
> set of predictor values, so there were only seven parameters in my
> model and overfitting is almost certainly not an issue.
>
> So to restate my confusion, what I don't understand is the
> technical reason why asymptotic ML estimates for parameter
> confidence intervals and p-values would be unreliable in such a
> situation, since sample size is relatively large in absolute terms.

(A quick simulation at the end of this post shows just how sparsely
300 points populate 729 cells.)

Well, for one thing, there are two different versions of the p-values
available these days.  You want to look at the tests that are defined
by subtraction -- the likelihood-ratio tests, taken as the difference
in -2 log-likelihood between nested models -- rather than the Wald
test.  If you have an old program, it might only feature the Wald,
which is based on the ratio of the coefficient to its ASE (asymptotic
standard error).  See Garson for details and commentary; there is
also a small sketch of both tests at the end of this post.

As an alternative step, to diagnose your whole dataset and problem, I
suggest that you run an ordinary regression with the 0/1 criterion as
the outcome, or a two-group discriminant function.  Those two OLS
procedures are mathematically the same as each other, and they give
practically identical tests to logistic regression for most data with
Ns in the hundreds.  They are more robust than logistic regression
against overfitting, and they also give better diagnostics if that is
any threat; the last sketch below shows the comparison.
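To make Garson's rule of thumb concrete, here is a minimal sketch in
Python -- my own illustration of the arithmetic, not anything taken
from Garson's document, using his 500/10 example:

    # Garson's rule of thumb: at most 1 independent per 10 "effective"
    # cases, where the effective N for a dichotomous criterion is the
    # size of the smaller group.

    def effective_n(n_zeros, n_ones):
        """Effective sample size for a dichotomous criterion."""
        return min(n_zeros, n_ones)

    def max_independents(n_zeros, n_ones, cases_per_independent=10):
        """Rule-of-thumb ceiling on the number of independents."""
        return effective_n(n_zeros, n_ones) // cases_per_independent

    # Garson's example: a dichotomy with 500 0's and 10 1's.
    print(effective_n(500, 10))       # 10, not 510
    print(max_independents(500, 10))  # 1 -- one independent, at most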
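On the sparseness itself, a quick simulation.  I assume, purely for
illustration, that the 300 covariate vectors scatter uniformly over
the 729 cells -- real data surely are not uniform, but the point
survives:

    # How sparsely do 300 data points populate 3^6 = 729 possible
    # covariate vectors?  (Uniform scattering is assumed only for
    # the sake of illustration.)
    import random
    from collections import Counter

    random.seed(1)
    n_points, n_covariates, n_levels = 300, 6, 3
    data = [tuple(random.randrange(n_levels)
                  for _ in range(n_covariates))
            for _ in range(n_points)]

    counts = Counter(data)
    print(n_levels ** n_covariates)  # 729 possible vectors
    print(len(counts))               # distinct vectors actually seen
    print(max(counts.values()))      # size of the fullest cell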
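Here is a sketch of the Wald test next to the likelihood-ratio
("subtraction") test, assuming the Python statsmodels and scipy
packages and wholly simulated data -- a hedged illustration of the
two computations, not a recipe for your dataset:

    # Wald vs. likelihood-ratio test for one coefficient in a
    # logistic regression, on simulated data.
    import numpy as np
    import statsmodels.api as sm
    from scipy import stats

    rng = np.random.default_rng(0)
    n = 300
    x1 = rng.integers(0, 3, n)      # a three-valued covariate
    x2 = rng.integers(0, 3, n)
    y = rng.binomial(1, 1 / (1 + np.exp(-(-0.5 + 0.6 * x1))))

    X_full = sm.add_constant(np.column_stack([x1, x2]))
    X_red = sm.add_constant(x2)     # drop x1, to test its coefficient

    full = sm.Logit(y, X_full).fit(disp=0)
    red = sm.Logit(y, X_red).fit(disp=0)

    # Wald: (coefficient / ASE)^2, referred to chi-square(1).
    wald = (full.params[1] / full.bse[1]) ** 2
    p_wald = stats.chi2.sf(wald, df=1)

    # Likelihood-ratio: difference in -2 log-likelihood between the
    # nested models, also referred to chi-square(1).
    lr = 2 * (full.llf - red.llf)
    p_lr = stats.chi2.sf(lr, df=1)

    print(p_wald, p_lr)  # close here; they can diverge in sparse data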
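And the OLS cross-check, under the same assumptions (statsmodels,
simulated data).  The OLS regression on the 0/1 criterion stands in
for the two-group discriminant function here, since the two give the
same test:

    # OLS on the 0/1 criterion as a cross-check on the logistic fit.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    n = 300
    x = rng.normal(size=n)
    y = rng.binomial(1, 1 / (1 + np.exp(-0.5 * x)))

    X = sm.add_constant(x)
    logit = sm.Logit(y, X).fit(disp=0)
    ols = sm.OLS(y, X).fit()

    # With N in the hundreds, the two slope p-values are usually close.
    print(logit.pvalues[1], ols.pvalues[1])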
--
Rich Ulrich, [EMAIL PROTECTED]
http://www.pitt.edu/~wpilib/index.html
