On Mon, 14 Jun 2010, Joris Meys wrote:

Hi,

Marc's explanation is valid to a certain extent, but I don't agree with
his conclusion. I'd like to point out "the curse of
dimensionality" (Hughes effect), which starts to play a role rather quickly.


Ahem!

... minimal, self-contained, reproducible code ...

The OPs situation may not be an impossible one:

set.seed(54321)
dat <- as.data.frame(matrix(rnorm(1800*50), ncol = 50))  # 1800 obs, 50 standard-normal predictors
dat$y <- rbinom(1800, 1, plogis(rowSums(dat)/7))         # every true coefficient is 1/7
fit <- glm(y ~ ., data = dat, family = binomial)
1/7 # the true coef
[1] 0.1428571
sd(coef(fit)) # roughly, the common standard error
[1] 0.05605597
colMeans(coef(summary(fit))) # what glm() got
  Estimate Std. Error    z value   Pr(>|z|)
0.14590836 0.05380666 2.71067328 0.06354820
# a trickier case: same design, but now every true coefficient is 1
set.seed(54321)
dat <- as.data.frame(matrix(rnorm(1800*50), ncol = 50))
dat$y <- rbinom(1800, 1, plogis(rowSums(dat)))
fit <- glm(y ~ ., data = dat, family = binomial)
colMeans(coef(summary(fit)))
   Estimate  Std. Error     z value    Pr(>|z|)
0.982944012 0.119063631 8.222138491 0.008458002


Finer examination of the latter fit will show coefficient estimates that differ from the true value of 1.0 by more than the asymptotic standard error would suggest.

sd(coef(fit)) # rather bigger than 0.119
[1] 0.1827462
range(coef(fit))
[1] -0.08128586  1.25797057

And near-separability may be playing a role here:

cbind(
  table(
    cut(
      plogis(abs(predict(fit))),
      c(0, 0.9, 0.99, 0.999, 0.9999, 0.99999, 1))))
                 [,1]
(0,0.9]           453
(0.9,0.99]        427
(0.99,0.999]      313
(0.999,0.9999]    251
(0.9999,0.99999]  173
(0.99999,1]       183


Recall that the observed information contains a factor of

        plogis(predict(fit)) * plogis(-predict(fit))

hist(plogis(predict(fit)) * plogis(-predict(fit)))

So the effective sample size here was much reduced.
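
A rough gauge of that reduction (a crude sketch, not a formal effective-sample-size calculation) is to sum those weights and compare with their maximum possible value of 0.25, which is reached when the fitted probability is 0.5:

w <- plogis(predict(fit)) * plogis(-predict(fit))  # per-observation variance weights
sum(w) / 0.25    # "effective" n on the scale of the nominal n = 1800
mean(w < 0.01)   # fraction of observations carrying almost no information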

But to the OP's question, whether what you get is reasonable depends on what the setup is.

I wouldn't call the first of the above cases 'highly unstable'.

Which is not to say that one cannot generate difficult cases (especially with correlated covariates and/or one or more highly influential covariates), or that the OP's case is not one of them.
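
For instance, here is a rough sketch of a correlated-covariate variant of the simulation above (the AR(1)-style correlation of 0.9 is an arbitrary choice, and MASS::mvrnorm is used only for convenience):

library(MASS)                                    # for mvrnorm()
S  <- 0.9 ^ abs(outer(1:50, 1:50, "-"))          # strongly correlated design
X  <- mvrnorm(1800, mu = rep(0, 50), Sigma = S)
d2 <- as.data.frame(X)
d2$y <- rbinom(1800, 1, plogis(rowSums(d2)/7))   # same true coefficients of 1/7
f2 <- glm(y ~ ., data = d2, family = binomial)
sd(coef(f2))   # typically well above the ~0.056 seen in the uncorrelated case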


HTH,

Chuck

The curse of dimensionality is easily demonstrated by looking at the
proximity between your data points. Say we scale the interval in one
dimension to be 1 unit. If you have 20 evenly spaced observations, the
distance between observations is 0.05 units. To have a comparable
proximity in a 2-dimensional space, you need 20^2 = 400 observations; in
a 10-dimensional space this becomes 20^10 ~ 10^13 data points. The
distance between your observations matters, because a sparse dataset
will definitely make your model misbehave.
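
The same arithmetic in R (0.05 spacing per unit-length axis, as above):

spacing <- 0.05
(1 / spacing) ^ c(1, 2, 10)   # observations needed in 1, 2 and 10 dimensions
# roughly 20, 400 and 1e13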

Even with about 35 samples per variable, using 50 independent
variables will yield a highly unstable model, as your data space is
about as sparse as it can get. On top of that, interpreting a model
with 50 variables is close to impossible, and that is before even
considering interactions. No point in trying, I'd say. If you really
need all that information, you might want to look at some
dimension-reduction methods first.
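
For instance, a minimal sketch using principal components (the data frame dat with 0/1 response y, and the choice of 10 components, are placeholders):

pr   <- prcomp(dat[ , setdiff(names(dat), "y")], scale. = TRUE)  # PCA on the predictors
dat2 <- data.frame(y = dat$y, pr$x[ , 1:10])                     # keep the first 10 components
fit2 <- glm(y ~ ., data = dat2, family = binomial)               # logistic model on the components
summary(fit2)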

Cheers
Joris

On Mon, Jun 14, 2010 at 2:55 PM, Marc Schwartz <marc_schwa...@me.com> wrote:
On Jun 13, 2010, at 10:20 PM, array chip wrote:

Hi, this is not an R technical question per se. I know there are many excellent
statisticians on this list, so here are my questions: I have a dataset with ~1800
observations and 50 independent variables, so there are about 35 samples per
variable. Is it wise to build a stable multiple logistic model with 50
independent variables? Any problem with this approach? Thanks

John


The general rule of thumb is to have 10-20 'events' per covariate degree of 
freedom. Frank has suggested that in some cases that number should be as high 
as 25.

The number of events is the count of the less frequent of the two possible
outcomes of your binary dependent variable.

Covariate degrees of freedom refer to the number of columns in the model
matrix: continuous variables contribute 1, binary factors 1, and K-level
factors K - 1.

So if, out of your 1800 records, you have at least 500 to 1000 events, then depending
upon how many of your 50 variables are K-level factors and whether or not you
need to consider interactions, you may be OK. Better if you are towards the high
end of that range, especially if the model is for prediction rather than explanation.
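
A quick way to tally this in R (the data frame dat and the 0/1 response column y are placeholders for your own data):

n_events <- min(table(dat$y))   # events = count of the less frequent outcome

## covariate df: 1 per continuous or binary column, K - 1 per K-level factor
covar_df <- sum(sapply(dat[setdiff(names(dat), "y")],
                       function(x) if (is.factor(x)) nlevels(x) - 1 else 1))

n_events / covar_df             # aim for roughly 10-20 (or even 25) events per df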

Two excellent references would be Frank's book:

http://www.amazon.com/Regression-Modeling-Strategies-Frank-Harrell/dp/0387952322/

and Steyerberg's book:

http://www.amazon.com/Clinical-Prediction-Models-Development-Validation/dp/038777243X/

to assist in providing guidance for model building/validation techniques.

HTH,

Marc Schwartz


--
Joris Meys
Statistical consultant

Ghent University
Faculty of Bioscience Engineering
Department of Applied mathematics, biometrics and process control

tel : +32 9 264 59 87
joris.m...@ugent.be
-------------------------------
Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php


Charles C. Berry                            (858) 534-2098
                                            Dept of Family/Preventive Medicine
E mailto:cbe...@tajo.ucsd.edu               UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901