[R] permutation test for PLS/PLSDA

2012-12-22 Thread khosoda
Hi,

Is there any R package doing permutation/randomization test for
PLS/PLSDA? I found some codes for MatLab, but I want to use R program.

Thank you very much in advance.

Kohkichi Hosoda

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] How to use PC1 of PCA and dim1 of MCA as a predictor in logistic regression model for data reduction

2011-08-20 Thread khosoda

Dear Mark,

Thank you very much for your advice.
I will try it.

I really appreciate your all kind advice.
Thanks a lot again.

Best regards,

Kohkichi


(11/08/19 22:28), Mark Difford wrote:

On Aug 19, 2011 khosoda wrote:


I used x10.homals4$objscores[, 1] as a predictor for logistic regression
as in the same way as PC1 in PCA.
Am I going the right way?


Hi Kohkichi,

Yes, but maybe explore the "sets=" argument (set "Response" as the target
variable and the others as the predictor variables). Then use Dim1 scores.
Also think about fitting a rank-1 restricted model, combined with the sets=
option.

See the vignette to the package and look at

@ARTICLE{MIC98,
   author = {Michailides, G. and de Leeuw, J.},
   title = {The {G}ifi system of descriptive multivariate analysis},
   journal = {Statistical Science},
   year = {1998},
   volume = {13},
   pages = {307--336},
   abstract = {}
}

Regards, Mark.

-
Mark Difford (Ph.D.)
Research Associate
Botany Department
Nelson Mandela Metropolitan University
Port Elizabeth, South Africa
--
View this message in context: 
http://r.789695.n4.nabble.com/How-to-use-PC1-of-PCA-and-dim1-of-MCA-as-a-predictor-in-logistic-regression-model-for-data-reduction-tp3750251p3755163.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] How to use PC1 of PCA and dim1 of MCA as a predictor in logistic regression model for data reduction

2011-08-19 Thread khosoda

Dear Mark,

Thank you very much for your kind advice.

Actually, I already performed penalized logistic regression by pentrace 
and lrm in package "rms".


The reason why I wanted to reduce dimensionality of those 9 variables 
was that these variables were not so important according to the subject 
matter knowledge and that I wanted to avoid events per variable problem.


Your answer about dudi.mix$l1 helped me a lot.
I finally was able to perform penalized logistic regression for data 
consisting of 4 important variables and x18.dudi.mix$l1[, 1]. Thanks a 
lot again.


One more question, I investigated homals package too. I found it has 
"ndim" option.


mydata is followings;

> head(x10homals.df)
  age sex  symptom   HT   DM  IHD  smoking 
hyperlipidemia   Statin Response
1  62   M asymptomatic positive negative negative positive 
positive positive negative
2  82   M  symptomatic positive negative negative negative 
positive positive negative
3  64   M asymptomatic negative positive negative negative 
positive positive negative
4  55   M  symptomatic positive positive positive negative 
positive positive negative
5  67   M  symptomatic positive negative negative negative 
negative positive negative
6  79   M asymptomatic positive positive negative negative 
positive positive negative


age is continuous variable, and Response should not be active for 
computation, so, ...


x10.homals4 <- homals(x10homals.df, active = c(rep(TRUE, 9), FALSE), 
level=c("numerical", rep("nominal", 9)), ndim=4)


I did it with ndim from 2 to 9, compared Classification rate of Response 
by predict(x10.homals).


> p.x10.homals4

Classification rate:
 Variable Cl. Rate %Cl. Rate
1 age   0.4712 47.12
2 sex   0.9808 98.08
3 symptom   0.8269 82.69
4  HT   0.9135 91.35
5  DM   0.8558 85.58
6 IHD   0.8750 87.50
7 smoking   0.9423 94.23
8  hyperlipidemia   0.9519 95.19
9  Statin   0.8942 89.42
10   Response   0.6154 61.54

This is the best for classification of Response, so, I selected ndim=4. 
Then, I found objscores.


> head(x10.homals4$objscores)
D1   D2   D3  D4
1 -0.002395321 -0.034032230 -0.008140378  0.02369123
2  0.036788626 -0.010308707  0.005725984 -0.02751958
3  0.014363031  0.049594466 -0.025627467  0.06254055
4  0.083092285  0.065147519  0.045903394 -0.03751551
5 -0.013692504  0.005106661 -0.007656776 -0.04107009
6  0.002320747  0.024375393 -0.017785415 -0.01752556

I used x10.homals4$objscores[, 1] as a predictor for logistic regression 
as in the same way as PC1 in PCA.


Am I going the right way?

Thanks a lot for your help in advance.

Best regards

--
Kohkichi Hosoda


(11/08/19 4:21), Mark Difford wrote:

On Aug 18, 2011 khosoda wrote:


I'm trying to do model reduction for logistic regression.


Hi Kohkichi,

My general advice to you would be to do this by fitting a penalized logistic
model (see lrm in package rms and glmnet in package glmnet; there are
several others).

Other points are that the amount of variance explained by mixed PCA and MCA
are not comparable. Furthermore, homals() is a much better choice than MCA
because it handles different types of variables whereas MCA is for
categorical variables.

On the more specific question of whether you should use dudi.mix$l1 or
dudi.mix$li, it doesn't matter: the former is a scaled version of the
latter. Same for dudi.acm. To see this do the following:

##
plot(x18.dudi.mix$li[, 1], x18.dudi.mix$l1[, 1])

Regards, Mark.

-
Mark Difford (Ph.D.)
Research Associate
Botany Department
Nelson Mandela Metropolitan University
Port Elizabeth, South Africa
--
View this message in context: 
http://r.789695.n4.nabble.com/How-to-use-PC1-of-PCA-and-dim1-of-MCA-as-a-predictor-in-logistic-regression-model-for-data-reduction-tp3750251p3753437.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.




--
*
 神戸大学大学院医学研究科 脳神経外科学分野
 細田 弘吉
 
 〒650-0017 神戸市中央区楠町7丁目5-1
Phone: 078-382-5966
Fax  : 078-382-5979
E-mail address
Office: khos...@med.kobe-u.ac.jp
Home  : khos...@venus.dti.ne.jp

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] How to use PC1 of PCA and dim1 of MCA as a predictor in logistic regression model for data reduction

2011-08-18 Thread khosoda

Dear Mark,

Thank you very much for your mail. This is what I really wanted!
I tried dudi.mix in ade4 package.

> ade4plaque.df <- x18.df[c("age", "sex", "symptom", "HT", "DM", "IHD", 
"smoking", "DL", "Statin")]


> head(ade4plaque.df)
  age sex  symptom   HT   DM  IHD  smoking 
hyperlipidemia   Statin
1  62   M asymptomatic positive negative negative positive 
positive positive
2  82   M  symptomatic positive negative negative negative 
positive positive
3  64   M asymptomatic negative positive negative negative 
positive positive
4  55   M  symptomatic positive positive positive negative 
positive positive
5  67   M  symptomatic positive negative negative negative 
negative positive
6  79   M asymptomatic positive positive negative negative 
positive positive


> x18.dudi.mix <- dudi.mix(ade4plaque.df)
> x18.dudi.mix$eig
[1] 1.7750557 1.4504641 1.2178640 1.0344946 0.8496640 0.8248379 
0.7011151 0.6367328 0.5097718

> x18.dudi.mix$eig[1:9]/sum(x18.dudi.mix$eig)
[1] 0.19722841 0.16116268 0.13531822 0.11494385 0.09440711 0.09164866 
0.07790168 0.07074809 0.05664131


Still first component explained only 19.8% of the variances, right?

Then, I investigated values of dudi.mix corresponding to PC1 of PCA. 
Help file say;

l1   principal components, data frame with n rows and nf columns
li   row coordinates, data frame with n rows and nf columns

So, I guess I should use x18.dudi.mix$l1[, 1].
Am I right?

Or should I use multiple correpondence analysis because the first plane 
explained 43% of the variance?


Thank you for your help in advance.

Kohkichi


(11/08/18 18:33), Mark Difford wrote:

On Aug 17, 2011 khosoda wrote:


1. Is it O.K. to perform PCA for data consisting of 1 continuous
variable and 8 binary variables?
2. Is it O.K to perform transformation of age from continuous variable
to factor variable for MCA?
3. Is "mjca1$rowcoord[, 1]" the correct values as a predictor of
logistic regression model like PC1 of PCA?


Hi Kohkichi,

If you want to do this, i.e. PCA-type analysis with different
variable-types, then look at dudi.mix() in package ade4 and homals() in
package homals.

Regards, Mark.

-
Mark Difford (Ph.D.)
Research Associate
Botany Department
Nelson Mandela Metropolitan University
Port Elizabeth, South Africa
--
View this message in context: 
http://r.789695.n4.nabble.com/How-to-use-PC1-of-PCA-and-dim1-of-MCA-as-a-predictor-in-logistic-regression-model-for-data-reduction-tp3750251p3752168.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.




--
*
 神戸大学大学院医学研究科 脳神経外科学分野
 細田 弘吉
 
 〒650-0017 神戸市中央区楠町7丁目5-1
Phone: 078-382-5966
Fax  : 078-382-5979
E-mail address
Office: khos...@med.kobe-u.ac.jp
Home  : khos...@venus.dti.ne.jp

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] How to use PC1 of PCA and dim1 of MCA as a predictor in logistic regression model for data reduction

2011-08-18 Thread khosoda

Dear Daniel,

Thank you for your mail.
Your comment is exactly what I was worried about.

I konw very little about latent class analysis. So, I would like to use 
multiple correspondence analysis (MCA) for data redution. Besides, the 
first plane of the MCA captured 43% of the variance.


Do you think my use of "mjca1$rowcoord[, 1]" in ca package for data 
reduction in the previous mail is O.K.?


Thank you for your help.

--
Kohkichi Hosoda

(11/08/18 17:39), Daniel Malter wrote:

Pooling nominal with numeric variables and running pca on them sounds like
conceptual nonsense to me. You use PCA to reduce the dimensionality of the
data if the data are numeric. For categorical data analysis, you should use
latent class analysis or something along those lines.

The fact that your first PC captures only 20 percent of the variance
indicates that either you apply the wrong technique or that dimensionality
reduction is of little use for these data more generally. The first step
should generally be to check the correlations/associations between the
variables to inspect whether what you intend to do makes sense.

HTH,
Daniel



khosoda wrote:


Hi all,

I'm trying to do model reduction for logistic regression. I have 13
predictor (4 continuous variables and 9 binary variables). Using subject
matter knowledge, I selected 4 important variables. Regarding the rest 9
variables, I tried to perform data reduction by principal component
analysis (PCA). However, 8 of 9 variables were binary and only one
continuous. I transformed the data by transcan of rms package and did
PCA with princomp. PC1 explained only 20% of the variance. Still, I used
the PC1 as a predictor of the logistic model and obtained some results.

Then, I tried multiple correspondence analysis (MCA). The only one
continuous variable was age. I transformed "age" variable to "age_Q"
factor variable as the followings.


quantile(mydata.df$age)

0%   25%   50%   75%  100%
53.00 66.75 72.00 76.25 85.00

age_Q<- cut(x17.df$age, right=TRUE, breaks=c(-Inf, 66, 72, 76, Inf),

labels=c("53-66", "67-72", "73-76", "77-85"))

table(age_Q)

age_Q
53-66 67-72 73-76 77-85
26272526

Then, I used mjca of ca pacakge for MCA.


mjca1<-  mjca(mydata.df[, c("age_Q","sex","symptom", "HT", "DM",

"IHD","smoking","DL", "Statin")])


summary(mjca1)


Principal inertias (eigenvalues):

  dimvalue  %   cum%   scree plot
  1  0.009592  43.4  43.4  *
  2  0.003983  18.0  61.4  **
  3  0.001047   4.7  66.1  **
  4  0.000367   1.7  67.8
  -
  Total: 0.022111

The dimension 1 explained 43% of the variance. Then, I was wondering
which values I could use like PC1 in PCA. I explored in mjca1 and found
"rowcoord".


mjca1$rowcoord

   [,1]  [,2][,3] [,4]
   [1,]  0.07403748  0.8963482181  0.10828273  1.581381849
   [2,]  0.92433996 -1.1497911361  1.28872517  0.304065865
   [3,]  0.49833354  0.6482940556 -2.4314  0.365023261
   [4,]  0.18998290 -1.4028117048 -1.70962159  0.451951744
   [5,] -0.13008173  0.2557656854  1.16561601 -1.012992485
.
.
[101,] -1.86940216  0.5918128751  0.87352987 -1.118865117
[102,] -2.19096615  1.2845448725  0.25227354 -0.938612155
[103,]  0.77981265 -1.1931087587  0.23934034  0.627601413
[104,] -2.37058237 -1.4014005013 -0.73578248 -1.455055095

Then, I used mjca1$rowcoord[, 1] as the followings.


mydata.df$NewScore<- mjca1$rowcoord[, 1]


I used this "NewScore" as one of the predictors for the model instead of
original 9 variables.

The final logistic model obtained by use of MCA was similar to the one
obtained by use of PCA.

My questions are;

1. Is it O.K. to perform PCA for data consisting of 1 continuous
variable and 8 binary variables?

2. Is it O.K to perform transformation of age from continuous variable
to factor variable for MCA?

3. Is "mjca1$rowcoord[, 1]" the correct values as a predictor of
logistic regression model like PC1 of PCA?

I would appreciate your help in advance.

--
Kohkichi Hosoda

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



--
View this message in context: 
http://r.789695.n4.nabble.com/How-to-use-PC1-of-PCA-and-dim1-of-MCA-as-a-predictor-in-logistic-regression-model-for-data-reduction-tp3750251p3752062.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.

[R] How to use PC1 of PCA and dim1 of MCA as a predictor in logistic regression model for data reduction

2011-08-17 Thread khosoda
Hi all,

I'm trying to do model reduction for logistic regression. I have 13
predictor (4 continuous variables and 9 binary variables). Using subject
matter knowledge, I selected 4 important variables. Regarding the rest 9
variables, I tried to perform data reduction by principal component
analysis (PCA). However, 8 of 9 variables were binary and only one
continuous. I transformed the data by transcan of rms package and did
PCA with princomp. PC1 explained only 20% of the variance. Still, I used
the PC1 as a predictor of the logistic model and obtained some results.

Then, I tried multiple correspondence analysis (MCA). The only one
continuous variable was age. I transformed "age" variable to "age_Q"
factor variable as the followings.

> quantile(mydata.df$age)
   0%   25%   50%   75%  100%
53.00 66.75 72.00 76.25 85.00
> age_Q <- cut(x17.df$age, right=TRUE, breaks=c(-Inf, 66, 72, 76, Inf),
labels=c("53-66", "67-72", "73-76", "77-85"))
> table(age_Q)
age_Q
53-66 67-72 73-76 77-85
   26272526

Then, I used mjca of ca pacakge for MCA.

> mjca1 <-  mjca(mydata.df[, c("age_Q","sex","symptom", "HT", "DM",
"IHD","smoking","DL", "Statin")])

> summary(mjca1)

Principal inertias (eigenvalues):

 dimvalue  %   cum%   scree plot
 1  0.009592  43.4  43.4  *
 2  0.003983  18.0  61.4  **
 3  0.001047   4.7  66.1  **
 4  0.000367   1.7  67.8
 -
 Total: 0.022111

The dimension 1 explained 43% of the variance. Then, I was wondering
which values I could use like PC1 in PCA. I explored in mjca1 and found
"rowcoord".

> mjca1$rowcoord
  [,1]  [,2][,3] [,4]
  [1,]  0.07403748  0.8963482181  0.10828273  1.581381849
  [2,]  0.92433996 -1.1497911361  1.28872517  0.304065865
  [3,]  0.49833354  0.6482940556 -2.4314  0.365023261
  [4,]  0.18998290 -1.4028117048 -1.70962159  0.451951744
  [5,] -0.13008173  0.2557656854  1.16561601 -1.012992485
.
.
[101,] -1.86940216  0.5918128751  0.87352987 -1.118865117
[102,] -2.19096615  1.2845448725  0.25227354 -0.938612155
[103,]  0.77981265 -1.1931087587  0.23934034  0.627601413
[104,] -2.37058237 -1.4014005013 -0.73578248 -1.455055095

Then, I used mjca1$rowcoord[, 1] as the followings.

> mydata.df$NewScore <- mjca1$rowcoord[, 1]

I used this "NewScore" as one of the predictors for the model instead of
original 9 variables.

The final logistic model obtained by use of MCA was similar to the one
obtained by use of PCA.

My questions are;

1. Is it O.K. to perform PCA for data consisting of 1 continuous
variable and 8 binary variables?

2. Is it O.K to perform transformation of age from continuous variable
to factor variable for MCA?

3. Is "mjca1$rowcoord[, 1]" the correct values as a predictor of
logistic regression model like PC1 of PCA?

I would appreciate your help in advance.

--
Kohkichi Hosoda

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] How to calculate confidence interval of C statistic by rcorr.cens

2011-05-22 Thread khosoda

Dear Prof. Harrell,

I'm sorry to say this, but I'm afraid I cannot understand what you write 
very well. Do you mean that the method to calculate confidence intervals 
for Dxy or C statistics in logistic model penalized for overfitting has 
not been established yet and what I did is wrong?

Could you elaborate it or teach me some reference point?

Kohkichi

(11/05/23 4:22), Frank Harrell wrote:

Hi Kohkichi,
What we really need to figure out is how to make validate give you
confidence intervals for Dxy or C while it is penalizing for overfitting.
Some people have ad hoc solutions for that but nothing is nailed down yet.
Frank

khosoda wrote:


Thank you for your comment, Prof Harrell.

I changed the function;

CstatisticCI<- function(x)   # x is object of rcorr.cens.
{
  se<- x["S.D."]/2
  Low95<- x["C Index"] - 1.96*se
  Upper95<- x["C Index"] + 1.96*se

  cbind(x["C Index"], Low95, Upper95)
}

  >  CstatisticCI(MyModel.lrm.penalized.rcorr)
Low95   Upper95
C Index 0.8222785 0.7195828 0.9249742

I obtained wider CI than the previous incorrect one.
Regarding your comments on overfitting, this is a sample used in model
development. However, I performed penalization by pentrace and lrm in
rms package. The CI above is CI of penalized model. Results of
validation of each model are followings;

First model
  >  validate(MyModel.lrm, bw=F, B=1000)
index.orig trainingtest optimism index.correctedn
Dxy   0.6385   0.6859  0.6198   0.0661  0.5724 1000
R20.3745   0.4222  0.3388   0.0834  0.2912 1000
Intercept 0.   0. -0.1446   0.1446 -0.1446 1000
Slope 1.   1.  0.8266   0.1734  0.8266 1000
Emax  0.   0.  0.0688   0.0688  0.0688 1000
D 0.2784   0.3248  0.2474   0.0774  0.2010 1000
U-0.0192  -0.0192  0.0200  -0.0392  0.0200 1000
Q 0.2976   0.3440  0.2274   0.1166  0.1810 1000
B 0.1265   0.1180  0.1346  -0.0167  0.1431 1000
g 1.7010   2.0247  1.5763   0.4484  1.2526 1000
gp0.2414   0.2512  0.2287   0.0225  0.2189 1000

penalized model
  >  validate(MyModel.lrm.penalized, bw=F, B=1000)
index.orig trainingtest optimism index.correctedn
Dxy   0.6446   0.6898  0.6256   0.0642  0.5804 1000
R20.3335   0.3691  0.3428   0.0264  0.3072 1000
Intercept 0.   0.  0.0752  -0.0752  0.0752 1000
Slope 1.   1.  1.0547  -0.0547  1.0547 1000
Emax  0.   0.  0.0249   0.0249  0.0249 1000
D 0.2718   0.2744  0.2507   0.0236  0.2481 1000
U-0.0192  -0.0192 -0.0027  -0.0165 -0.0027 1000
Q 0.2910   0.2936  0.2534   0.0402  0.2508 1000
B 0.1279   0.1192  0.1336  -0.0144  0.1423 1000
g 1.3942   1.5259  1.5799  -0.0540  1.4482 1000
gp0.2141   0.2188  0.2298  -0.0110  0.2251 1000

Optimism of slope and intercept were improved from 0.1446 and 0.1734 to
-0.0752 and -0.0547, respectively. Emax was improved from 0.0688 to
0.0249. Therefore, I thought overfitting was improved at least to some
extent. Well, I'm not sure whether this is enough improvement though.

--
Kohkichi

(11/05/22 23:27), Frank Harrell wrote:

S.D. is the standard deviation (standard error) of Dxy.  It already
includes
the effective sample size in its computation so the sqrt(n) terms is not
needed.  The help file for rcorr.cens has an example where the confidence
interval for C is computed.  Note that you are making the strong
assumption
that there is no overfitting in the model or that you are evaluating C on
a
sample not used in model development.
Frank


Kohkichi wrote:


Hi,

I'm trying to calculate 95% confidence interval of C statistic of
logistic regression model using rcorr.cens in rms package. I wrote a
brief function for this purpose as the followings;

CstatisticCI<- function(x)   # x is object of rcorr.cens.
{
  se<- x["S.D."]/sqrt(x["n"])
  Low95<- x["C Index"] - 1.96*se
  Upper95<- x["C Index"] + 1.96*se
  cbind(x["C Index"], Low95, Upper95)
}

Then,


MyModel.lrm.rcorr<- rcorr.cens(x=predict(MyModel.lrm), S=df$outcome)
MyModel.lrm.rcorr

 C IndexDxy   S.D.  n
missing uncensored
   0.8222785  0.6445570  0.1047916104.000
0.000104.000
Relevant Pairs Concordant  Uncertain
3950.000   3248.000  0.000


CstatisticCI(x5factor_final.lrm.pen.rcorr)

Low95   Upper95
C Index 0.8222785 0.8021382 0.8424188

I'm not sure what "S.D." in object of rcorr

Re: [R] How to calculate confidence interval of C statistic by rcorr.cens

2011-05-22 Thread khosoda

Thank you for your comment, Prof Harrell.

I changed the function;

CstatisticCI <- function(x)   # x is object of rcorr.cens.
  {
se <- x["S.D."]/2
Low95 <- x["C Index"] - 1.96*se
Upper95 <- x["C Index"] + 1.96*se

cbind(x["C Index"], Low95, Upper95)
  }

> CstatisticCI(MyModel.lrm.penalized.rcorr)
  Low95   Upper95
C Index 0.8222785 0.7195828 0.9249742

I obtained wider CI than the previous incorrect one.
Regarding your comments on overfitting, this is a sample used in model 
development. However, I performed penalization by pentrace and lrm in 
rms package. The CI above is CI of penalized model. Results of 
validation of each model are followings;


First model
> validate(MyModel.lrm, bw=F, B=1000)
  index.orig trainingtest optimism index.correctedn
Dxy   0.6385   0.6859  0.6198   0.0661  0.5724 1000
R20.3745   0.4222  0.3388   0.0834  0.2912 1000
Intercept 0.   0. -0.1446   0.1446 -0.1446 1000
Slope 1.   1.  0.8266   0.1734  0.8266 1000
Emax  0.   0.  0.0688   0.0688  0.0688 1000
D 0.2784   0.3248  0.2474   0.0774  0.2010 1000
U-0.0192  -0.0192  0.0200  -0.0392  0.0200 1000
Q 0.2976   0.3440  0.2274   0.1166  0.1810 1000
B 0.1265   0.1180  0.1346  -0.0167  0.1431 1000
g 1.7010   2.0247  1.5763   0.4484  1.2526 1000
gp0.2414   0.2512  0.2287   0.0225  0.2189 1000

penalized model
> validate(MyModel.lrm.penalized, bw=F, B=1000)
  index.orig trainingtest optimism index.correctedn
Dxy   0.6446   0.6898  0.6256   0.0642  0.5804 1000
R20.3335   0.3691  0.3428   0.0264  0.3072 1000
Intercept 0.   0.  0.0752  -0.0752  0.0752 1000
Slope 1.   1.  1.0547  -0.0547  1.0547 1000
Emax  0.   0.  0.0249   0.0249  0.0249 1000
D 0.2718   0.2744  0.2507   0.0236  0.2481 1000
U-0.0192  -0.0192 -0.0027  -0.0165 -0.0027 1000
Q 0.2910   0.2936  0.2534   0.0402  0.2508 1000
B 0.1279   0.1192  0.1336  -0.0144  0.1423 1000
g 1.3942   1.5259  1.5799  -0.0540  1.4482 1000
gp0.2141   0.2188  0.2298  -0.0110  0.2251 1000

Optimism of slope and intercept were improved from 0.1446 and 0.1734 to 
-0.0752 and -0.0547, respectively. Emax was improved from 0.0688 to 
0.0249. Therefore, I thought overfitting was improved at least to some 
extent. Well, I'm not sure whether this is enough improvement though.


--
Kohkichi

(11/05/22 23:27), Frank Harrell wrote:

S.D. is the standard deviation (standard error) of Dxy.  It already includes
the effective sample size in its computation so the sqrt(n) terms is not
needed.  The help file for rcorr.cens has an example where the confidence
interval for C is computed.  Note that you are making the strong assumption
that there is no overfitting in the model or that you are evaluating C on a
sample not used in model development.
Frank


Kohkichi wrote:


Hi,

I'm trying to calculate 95% confidence interval of C statistic of
logistic regression model using rcorr.cens in rms package. I wrote a
brief function for this purpose as the followings;

CstatisticCI<- function(x)   # x is object of rcorr.cens.
   {
 se<- x["S.D."]/sqrt(x["n"])
 Low95<- x["C Index"] - 1.96*se
 Upper95<- x["C Index"] + 1.96*se
 cbind(x["C Index"], Low95, Upper95)
   }

Then,


MyModel.lrm.rcorr<- rcorr.cens(x=predict(MyModel.lrm), S=df$outcome)
MyModel.lrm.rcorr

C IndexDxy   S.D.  n
missing uncensored
  0.8222785  0.6445570  0.1047916104.000
0.000104.000
Relevant Pairs Concordant  Uncertain
   3950.000   3248.000  0.000


CstatisticCI(x5factor_final.lrm.pen.rcorr)

   Low95   Upper95
C Index 0.8222785 0.8021382 0.8424188

I'm not sure what "S.D." in object of rcorr.cens means. Is this standard
deviation of "C Index" or standard deviation of "Dxy"?
I thought it is standard deviation of "C Index". Therefore, I wrote the
code above. Am I right?

I would appreciate any help in advance.

--
Kohkichi Hosoda M.D.

 Department of Neurosurgery,
 Kobe University Graduate School of Medicine,

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.




-
Frank Harrell
Department of Biostatistics, Vanderbilt University
--
View this message in context: 
http://r.789695.n4.nabble.com/How-to-calculate-confidence-interval-of-C-statistic-by-rcorr-cens-tp3541709p3542163.html
Sent from t

[R] How to calculate confidence interval of C statistic by rcorr.cens

2011-05-22 Thread khosoda
Hi,

I'm trying to calculate 95% confidence interval of C statistic of
logistic regression model using rcorr.cens in rms package. I wrote a
brief function for this purpose as the followings;

CstatisticCI <- function(x)   # x is object of rcorr.cens.
  {
se <- x["S.D."]/sqrt(x["n"])
Low95 <- x["C Index"] - 1.96*se
Upper95 <- x["C Index"] + 1.96*se
cbind(x["C Index"], Low95, Upper95)
  }

Then,

> MyModel.lrm.rcorr <- rcorr.cens(x=predict(MyModel.lrm), S=df$outcome)
> MyModel.lrm.rcorr
   C IndexDxy   S.D.  n
missing uncensored
 0.8222785  0.6445570  0.1047916104.000
0.000104.000
Relevant Pairs Concordant  Uncertain
  3950.000   3248.000  0.000

> CstatisticCI(x5factor_final.lrm.pen.rcorr)
  Low95   Upper95
C Index 0.8222785 0.8021382 0.8424188

I'm not sure what "S.D." in object of rcorr.cens means. Is this standard
deviation of "C Index" or standard deviation of "Dxy"?
I thought it is standard deviation of "C Index". Therefore, I wrote the
code above. Am I right?

I would appreciate any help in advance.

--
Kohkichi Hosoda M.D.

Department of Neurosurgery,
Kobe University Graduate School of Medicine,

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Question on approximations of full logistic regression model

2011-05-18 Thread khosoda
Thank you for your advice, Tim.

I am reading your paper and other materials in your website.
I could not find R package of your bootknife method. Is there any R
package for this procedure?

(11/05/17 14:13), Tim Hesterberg wrote:
> My usual rule is that whatever gives the widest confidence intervals
> in a particular problem is most accurate for that problem :-)
> 
> Bootstrap percentile intervals tend to be too narrow.
> Consider the case of the sample mean; the usual formula CI is
>  xbar +- t_alpha sqrt( (1/(n-1)) sum((x_i - xbar)^2)) / sqrt(n)
> The bootstrap percentile interval for symmetric data is roughly
>  xbar +- z_alpha sqrt( (1/(n  )) sum((x_i - xbar)^2)) / sqrt(n)
> It is narrower than the formula CI because
>* z quantiles rather than t quantiles
>* standard error uses divisor of n rather than (n-1)
> 
> In stratified sampling, the narrowness factor depends on the
> stratum sizes, not the overall n.
> In regression, estimates for some quantities may be based on a small
> subset of the data (e.g. coefficients related to rare factor levels).
> 
> This doesn't mean we should give up on the bootstrap.
> There are remedies for the bootstrap biases, see e.g.
> Hesterberg, Tim C. (2004), Unbiasing the Bootstrap-Bootknife Sampling
> vs. Smoothing, Proceedings of the Section on Statistics and the
> Environment, American Statistical Association, 2924-2930.
> http://home.comcast.net/~timhesterberg/articles/JSM04-bootknife.pdf
> 
> And other methods have their own biases, particularly in nonlinear
> applications such as logistic regression.
> 
> Tim Hesterberg
> 
>> Thank you for your reply, Prof. Harrell.
>>
>> I agree with you. Dropping only one variable does not actually help a lot.
>>
>> I have one more question.
>> During analysis of this model I found that the confidence
>> intervals (CIs) of some coefficients provided by bootstrapping (bootcov
>> function in rms package) was narrower than CIs provided by usual
>> variance-covariance matrix and CIs of other coefficients wider.  My data
>> has no cluster structure. I am wondering which CIs are better.
>> I guess bootstrapping one, but is it right?
>>
>> I would appreciate your help in advance.
>> --
>> KH
>>
>>
>>
>> (11/05/16 12:25), Frank Harrell wrote:
>>> I think you are doing this correctly except for one thing.  The validation
>>> and other inferential calculations should be done on the full model.  Use
>>> the approximate model to get a simpler nomogram but not to get standard
>>> errors.  With only dropping one variable you might consider just running the
>>> nomogram on the entire model.
>>> Frank
>>>
>>>
>>> KH wrote:

 Hi,
 I am trying to construct a logistic regression model from my data (104
 patients and 25 events). I build a full model consisting of five
 predictors with the use of penalization by rms package (lrm, pentrace
 etc) because of events per variable issue. Then, I tried to approximate
 the full model by step-down technique predicting L from all of the
 componet variables using ordinary least squares (ols in rms package) as
 the followings. I would like to know whether I am doing right or not.

> library(rms)
> plogit<- predict(full.model)
> full.ols<- ols(plogit ~ stenosis+x1+x2+ClinicalScore+procedure, sigma=1)
> fastbw(full.ols, aics=1e10)

Deleted   Chi-Sq d.f. P  Residual d.f. P  AICR2
stenosis   1.41  10.2354   1.41   10.2354  -0.59 0.991
x216.78  10.  18.19   20.0001  14.19 0.882
procedure 26.12  10.  44.31   30.  38.31 0.711
ClinicalScore 25.75  10.  70.06   40.  62.06 0.544
x183.42  10. 153.49   50. 143.49 0.000

 Then, fitted an approximation to the full model using most imprtant
 variable (R^2 for predictions from the reduced model against the
 original Y drops below 0.95), that is, dropping "stenosis".

> full.ols.approx<- ols(plogit ~ x1+x2+ClinicalScore+procedure)
> full.ols.approx$stats
 n  Model L.R.d.f.  R2   g   Sigma
 104.000 487.9006640   4.000   0.9908257   1.3341718   0.1192622

 This approximate model had R^2 against the full model of 0.99.
 Therefore, I updated the original full logistic model dropping
 "stenosis" as predictor.

> full.approx.lrm<- update(full.model, ~ . -stenosis)

> validate(full.model, bw=F, B=1000)
 index.orig trainingtest optimism index.correctedn
 Dxy   0.6425   0.7017  0.6131   0.0887  0.5539 1000
 R20.3270   0.3716  0.3335   0.0382  0.2888 1000
 Intercept 0.   0.  0.0821  -0.0821  0.0821 1000
 Slope 1.   1.  1.0548  -0.0548  1.0548 1000
 Emax  0.   0.  0.0263   0.0263  0.0263 1000
>

Re: [R] Question on approximations of full logistic regression model

2011-05-16 Thread khosoda

Thank you for your comment, Prof. Harrell.
I would appreciate it very much if you could teach me how to simulate 
for the estimation. For reference, following codes are what I did 
(bootcov, summary, and validation).


MyFullModel.boot <- bootcov(MyFullModel, B=1000, coef.reps=T)

> summary(MyFullModel, stenosis=c(70, 80), x1=c(1.5, 2.0), x2=c(1.5, 2.0))
 Effects  Response : outcome

 Factor  Low  High Diff. Effect S.E. Lower 0.95 Upper 0.95
 stenosis70.0 80   10.0  -0.11  0.24 -0.59  0.37
  Odds Ratio 70.0 80   10.0   0.90NA  0.56  1.45
 x1   1.5  20.5   1.21  0.37  0.49  1.94
  Odds Ratio  1.5  20.5   3.36NA  1.63  6.95
 x2   1.5  20.5  -0.29  0.19 -0.65  0.08
  Odds Ratio  1.5  20.5   0.75NA  0.52  1.08
 ClinicalScore3.0  52.0   0.61  0.38 -0.14  1.36
  Odds Ratio  3.0  52.0   1.84NA  0.87  3.89
 procedure - CA:CE2.0  1 NA   0.83  0.46 -0.07  1.72
  Odds Ratio  2.0  1 NA   2.28NA  0.93  5.59

> summary(MyFullModel.boot, stenosis=c(70, 80), x1=c(1.5, 2.0), 
x2=c(1.5, 2.0))

 Effects  Response : outcome

 Factor  Low  High Diff. Effect S.E. Lower 0.95 Upper 0.95
 stenosis70.0 80   10.0  -0.11  0.28 -0.65  0.43
  Odds Ratio 70.0 80   10.0   0.90NA  0.52  1.54
 x1   1.5  20.5   1.21  0.29  0.65  1.77
  Odds Ratio  1.5  20.5   3.36NA  1.92  5.89
 x2   1.5  20.5  -0.29  0.16 -0.59  0.02
  Odds Ratio  1.5  20.5   0.75NA  0.55  1.02
 ClinicalScore3.0  52.0   0.61  0.45 -0.28  1.50
  Odds Ratio  3.0  52.0   1.84NA  0.76  4.47
 procedure - CAS:CEA  2.0  1 NA   0.83  0.38  0.07  1.58
  Odds Ratio  2.0  1 NA   2.28NA  1.08  4.85

> validate(MyFullModel, bw=F, B=1000)
  index.orig trainingtest optimism index.correctedn
Dxy   0.6425   0.7054  0.6122   0.0932  0.5493 1000
R20.3270   0.3745  0.3330   0.0415  0.2855 1000
Intercept 0.   0.  0.0683  -0.0683  0.0683 1000
Slope 1.   1.  1.0465  -0.0465  1.0465 1000
Emax  0.   0.  0.0221   0.0221  0.0221 1000
D 0.2715   0.2795  0.2424   0.0371  0.2345 1000
U-0.0192  -0.0192 -0.0035  -0.0157 -0.0035 1000
Q 0.2908   0.2987  0.2460   0.0528  0.2380 1000
B 0.1265   0.1164  0.1332  -0.0168  0.1433 1000
g 1.3366   1.5041  1.5495  -0.0455  1.3821 1000
gp0.2082   0.2172  0.2258  -0.0087  0.2169 1000

> validate(MyFullModel.boot, bw=F, B=1000)
  index.orig trainingtest optimism index.correctedn
Dxy   0.6425   0.7015  0.6139   0.0877  0.5549 1000
R20.3270   0.3738  0.3346   0.0392  0.2878 1000
Intercept 0.   0.  0.0613  -0.0613  0.0613 1000
Slope 1.   1.  1.0569  -0.0569  1.0569 1000
Emax  0.   0.  0.0226   0.0226  0.0226 1000
D 0.2715   0.2805  0.2438   0.0367  0.2348 1000
U-0.0192  -0.0192 -0.0039  -0.0153 -0.0039 1000
Q 0.2908   0.2997  0.2477   0.0521  0.2387 1000
B 0.1265   0.1177  0.1329  -0.0153  0.1417 1000
g 1.3366   1.5020  1.5523  -0.0503  1.3869 1000
gp0.2082   0.2191  0.2263  -0.0072  0.2154 1000



(11/05/16 22:01), Frank Harrell wrote:

The choice is not clear, and requires some simulations to estimate the
average absolute error of the covariance matrix estimators.
Frank


細田弘吉 wrote:


Thank you for your reply, Prof. Harrell.

I agree with you. Dropping only one variable does not actually help a lot.

I have one more question.
During analysis of this model I found that the confidence
intervals (CIs) of some coefficients provided by bootstrapping (bootcov
function in rms package) was narrower than CIs provided by usual
variance-covariance matrix and CIs of other coefficients wider.  My data
has no cluster structure. I am wondering which CIs are better.
I guess bootstrapping one, but is it right?

I would appreciate your help in advance.
--
KH



(11/05/16 12:25), Frank Harrell wrote:

I think you are doing this correctly except for one thing.  The
validation
and other inferential calculations should be done on the full model.  Use
the approximate model to get a simpler nomogram but not to get standard
errors.  With only dropping one variable you might consider just running
the
nomogram on the entire model.
Frank


KH wrote:


Hi,
I am trying to construct a logistic regression model from my data (104
patients and 25 events). I build a full model consisting of five

Re: [R] Question on approximations of full logistic regression model

2011-05-15 Thread khosoda

Thank you for your reply, Prof. Harrell.

I agree with you. Dropping only one variable does not actually help a lot.

I have one more question.
During analysis of this model I found that the confidence
intervals (CIs) of some coefficients provided by bootstrapping (bootcov 
function in rms package) was narrower than CIs provided by usual 
variance-covariance matrix and CIs of other coefficients wider.  My data 
has no cluster structure. I am wondering which CIs are better.

I guess bootstrapping one, but is it right?

I would appreciate your help in advance.
--
KH



(11/05/16 12:25), Frank Harrell wrote:

I think you are doing this correctly except for one thing.  The validation
and other inferential calculations should be done on the full model.  Use
the approximate model to get a simpler nomogram but not to get standard
errors.  With only dropping one variable you might consider just running the
nomogram on the entire model.
Frank


KH wrote:


Hi,
I am trying to construct a logistic regression model from my data (104
patients and 25 events). I build a full model consisting of five
predictors with the use of penalization by rms package (lrm, pentrace
etc) because of events per variable issue. Then, I tried to approximate
the full model by step-down technique predicting L from all of the
componet variables using ordinary least squares (ols in rms package) as
the followings. I would like to know whether I am doing right or not.


library(rms)
plogit<- predict(full.model)
full.ols<- ols(plogit ~ stenosis+x1+x2+ClinicalScore+procedure, sigma=1)
fastbw(full.ols, aics=1e10)


  Deleted   Chi-Sq d.f. P  Residual d.f. P  AICR2
  stenosis   1.41  10.2354   1.41   10.2354  -0.59 0.991
  x216.78  10.  18.19   20.0001  14.19 0.882
  procedure 26.12  10.  44.31   30.  38.31 0.711
  ClinicalScore 25.75  10.  70.06   40.  62.06 0.544
  x183.42  10. 153.49   50. 143.49 0.000

Then, fitted an approximation to the full model using most imprtant
variable (R^2 for predictions from the reduced model against the
original Y drops below 0.95), that is, dropping "stenosis".


full.ols.approx<- ols(plogit ~ x1+x2+ClinicalScore+procedure)
full.ols.approx$stats

   n  Model L.R.d.f.  R2   g   Sigma
104.000 487.9006640   4.000   0.9908257   1.3341718   0.1192622

This approximate model had R^2 against the full model of 0.99.
Therefore, I updated the original full logistic model dropping
"stenosis" as predictor.


full.approx.lrm<- update(full.model, ~ . -stenosis)



validate(full.model, bw=F, B=1000)

   index.orig trainingtest optimism index.correctedn
Dxy   0.6425   0.7017  0.6131   0.0887  0.5539 1000
R20.3270   0.3716  0.3335   0.0382  0.2888 1000
Intercept 0.   0.  0.0821  -0.0821  0.0821 1000
Slope 1.   1.  1.0548  -0.0548  1.0548 1000
Emax  0.   0.  0.0263   0.0263  0.0263 1000


validate(full.approx.lrm, bw=F, B=1000)

   index.orig trainingtest optimism index.correctedn
Dxy   0.6446   0.6891  0.6265   0.0626  0.5820 1000
R20.3245   0.3592  0.3428   0.0164  0.3081 1000
Intercept 0.   0.  0.1281  -0.1281  0.1281 1000
Slope 1.   1.  1.1104  -0.1104  1.1104 1000
Emax  0.   0.  0.0444   0.0444  0.0444 1000

Validatin revealed this approximation was not bad.
Then, I made a nomogram.


full.approx.lrm.nom<- nomogram(full.approx.lrm,

fun.at=c(0.05,0.1,0.2,0.4,0.6,0.8,0.9,0.95), fun=plogis)

plot(full.approx.lrm.nom)


Another nomogram using ols model,


full.ols.approx.nom<- nomogram(full.ols.approx,

fun.at=c(0.05,0.1,0.2,0.4,0.6,0.8,0.9,0.95), fun=plogis)

plot(full.ols.approx.nom)


These two nomograms are very similar but a little bit different.

My questions are;

1. Am I doing right?

2. Which nomogram is correct

I would appreciate your help in advance.

--
KH

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.




-
Frank Harrell
Department of Biostatistics, Vanderbilt University
--
View this message in context: 
http://r.789695.n4.nabble.com/Question-on-approximations-of-full-logistic-regression-model-tp3524294p3525372.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



E-mail address
Office: khos...@med.kobe-u.ac.jp

[R] Question on approximations of full logistic regression model

2011-05-15 Thread khosoda
Hi,
I am trying to construct a logistic regression model from my data (104
patients and 25 events). I build a full model consisting of five
predictors with the use of penalization by rms package (lrm, pentrace
etc) because of events per variable issue. Then, I tried to approximate
the full model by step-down technique predicting L from all of the
componet variables using ordinary least squares (ols in rms package) as
the followings. I would like to know whether I am doing right or not.

> library(rms)
> plogit <- predict(full.model)
> full.ols <- ols(plogit ~ stenosis+x1+x2+ClinicalScore+procedure, sigma=1)
> fastbw(full.ols, aics=1e10)

 Deleted   Chi-Sq d.f. P  Residual d.f. P  AICR2
 stenosis   1.41  10.2354   1.41   10.2354  -0.59 0.991
 x216.78  10.  18.19   20.0001  14.19 0.882
 procedure 26.12  10.  44.31   30.  38.31 0.711
 ClinicalScore 25.75  10.  70.06   40.  62.06 0.544
 x183.42  10. 153.49   50. 143.49 0.000

Then, fitted an approximation to the full model using most imprtant
variable (R^2 for predictions from the reduced model against the
original Y drops below 0.95), that is, dropping "stenosis".

> full.ols.approx <- ols(plogit ~ x1+x2+ClinicalScore+procedure)
> full.ols.approx$stats
  n  Model L.R.d.f.  R2   g   Sigma
104.000 487.9006640   4.000   0.9908257   1.3341718   0.1192622

This approximate model had R^2 against the full model of 0.99.
Therefore, I updated the original full logistic model dropping
"stenosis" as predictor.

> full.approx.lrm <- update(full.model, ~ . -stenosis)

> validate(full.model, bw=F, B=1000)
  index.orig trainingtest optimism index.correctedn
Dxy   0.6425   0.7017  0.6131   0.0887  0.5539 1000
R20.3270   0.3716  0.3335   0.0382  0.2888 1000
Intercept 0.   0.  0.0821  -0.0821  0.0821 1000
Slope 1.   1.  1.0548  -0.0548  1.0548 1000
Emax  0.   0.  0.0263   0.0263  0.0263 1000

> validate(full.approx.lrm, bw=F, B=1000)
  index.orig trainingtest optimism index.correctedn
Dxy   0.6446   0.6891  0.6265   0.0626  0.5820 1000
R20.3245   0.3592  0.3428   0.0164  0.3081 1000
Intercept 0.   0.  0.1281  -0.1281  0.1281 1000
Slope 1.   1.  1.1104  -0.1104  1.1104 1000
Emax  0.   0.  0.0444   0.0444  0.0444 1000

Validatin revealed this approximation was not bad.
Then, I made a nomogram.

> full.approx.lrm.nom <- nomogram(full.approx.lrm,
fun.at=c(0.05,0.1,0.2,0.4,0.6,0.8,0.9,0.95), fun=plogis)
> plot(full.approx.lrm.nom)

Another nomogram using ols model,

> full.ols.approx.nom <- nomogram(full.ols.approx,
fun.at=c(0.05,0.1,0.2,0.4,0.6,0.8,0.9,0.95), fun=plogis)
> plot(full.ols.approx.nom)

These two nomograms are very similar but a little bit different.

My questions are;

1. Am I doing right?

2. Which nomogram is correct

I would appreciate your help in advance.

-- 
KH

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Bootstrapping confidence intervals

2011-05-02 Thread khosoda
Hi,
Sorry for repeated question.

I performed logistic regression using lrm and penalized it with pentrace
function. I wanted to get confidence intervals of odds ratio of each
predictor and summary(MyModel) gave them. I also tried to get
bootstrapping standard errors in the logistic regression. bootcov
function in rms package provided them. Then, I found that the confidence
intervals provided by bootstrapping (bootcov) was narrower than CIs
provided by usual variance-covariance matrix in the followings.

My data has no cluster structure.
I am wondering which confidence interval is better. I guess
bootstrapping one, but is it right?

I would appreciate anybody's help in advance.

> summary(MyModel, stenosis=c(70, 80), x1=c(1.5, 2.0), x2=c(1.5, 2.0))
 Effects  Response : outcome

 Factor  Low  High Diff. Effect S.E. Lower 0.95 Upper 0.95
 stenosis70.0 80   10.0  -0.11  0.24 -0.59  0.37
  Odds Ratio 70.0 80   10.0   0.90NA  0.56  1.45
 x1   1.5  20.5   1.21  0.37  0.49  1.94
  Odds Ratio  1.5  20.5   3.36NA  1.63  6.95
 x2   1.5  20.5  -0.29  0.19 -0.65  0.08
  Odds Ratio  1.5  20.5   0.75NA  0.52  1.08
 ClinicalScore3.0  52.0   0.61  0.38 -0.14  1.36
  Odds Ratio  3.0  52.0   1.84NA  0.87  3.89
 procedure - CA:CE2.0  1 NA   0.83  0.46 -0.07  1.72
  Odds Ratio  2.0  1 NA   2.28NA  0.93  5.59

> summary(MyModel.boot, stenosis=c(70, 80), x1=c(1.5, 2.0), x2=c(1.5, 2.0))
 Effects  Response : outcome

 Factor  Low  High Diff. Effect S.E. Lower 0.95 Upper 0.95
 stenosis70.0 80   10.0  -0.11  0.28 -0.65  0.43
  Odds Ratio 70.0 80   10.0   0.90NA  0.52  1.54
 x1   1.5  20.5   1.21  0.29  0.65  1.77
  Odds Ratio  1.5  20.5   3.36NA  1.92  5.89
 x2   1.5  20.5  -0.29  0.16 -0.59  0.02
  Odds Ratio  1.5  20.5   0.75NA  0.55  1.02
 ClinicalScore3.0  52.0   0.61  0.45 -0.28  1.50
  Odds Ratio  3.0  52.0   1.84NA  0.76  4.47
 procedure - CAS:CEA  2.0  1 NA   0.83  0.38  0.07  1.58
  Odds Ratio  2.0  1 NA   2.28NA  1.08  4.85

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] bootcov or robcov for odds ratio?

2011-04-30 Thread khosoda
Dear list,

I made a logistic regression model (MyModel) using lrm and penalization
by pentrace for data of 104 patients, which consists of 5 explanatory
variables and one binary outcome (poor/good). Then, I found bootcov and
robcov function in rms package for calculation of confidence range of
coefficients and odds ratio by bootstrap covariance matrix and
Huber-White sandwich method, respectively.

> MyModel.boot <- bootcov(MyModel, B=1000, coef.reps=T)
> MyModel.robcov <- robcov(MyModel)

> anova(MyModel)
Wald Statistics  Response: outcome

 FactorChi-Square d.f. P
 stenosis   0.20  10.6547
 x110.69  10.0011
 x2 2.33  10.1270
 procedure  3.27  10.0708
 ClinicalScore  2.55  10.1102
 TOTAL 18.71  50.0022

> anova(MyModel.boot)
Wald Statistics  Response: outcome

 FactorChi-Square d.f. P
 stenosis   0.16  10.6921
 x117.90  1<.0001
 x2 3.36  10.0669
 procedure  4.62  10.0316
 ClinicalScore  1.82  10.1774
 TOTAL 31.82  5<.0001

> anova(MyModel.robcov)
Wald Statistics  Response: outcome

 FactorChi-Square d.f. P
 stenosis   0.17  10.6758
 x120.52  1<.0001
 x2 3.83  10.0505
 procedure  5.09  10.0241
 ClinicalScore  1.84  10.1744
 TOTAL 34.80  5<.0001

The confidence intervals are narrower in bootcov model, and further
narrower in robcov model than in original model, as demonstrated in the
followings.

I am wondering which confidence interval should be used.
I would appreciate anybody's help in advance.
--
KH

> summary(MyModel, stenosis=c(70, 80), x1=c(1.5, 2.0), x2=c(1.5, 2.0))
 Effects  Response : outcome

 Factor  Low  High Diff. Effect S.E. Lower 0.95 Upper 0.95
 stenosis70.0 80   10.0  -0.11  0.24 -0.59  0.37
  Odds Ratio 70.0 80   10.0   0.90NA  0.56  1.45
 x1   1.5  20.5   1.21  0.37  0.49  1.94
  Odds Ratio  1.5  20.5   3.36NA  1.63  6.95
 x2   1.5  20.5  -0.29  0.19 -0.65  0.08
  Odds Ratio  1.5  20.5   0.75NA  0.52  1.08
 ClinicalScore3.0  52.0   0.61  0.38 -0.14  1.36
  Odds Ratio  3.0  52.0   1.84NA  0.87  3.89
 procedure - CA:CE2.0  1 NA   0.83  0.46 -0.07  1.72
  Odds Ratio  2.0  1 NA   2.28NA  0.93  5.59

> summary(MyModel.boot, stenosis=c(70, 80), x1=c(1.5, 2.0), x2=c(1.5, 2.0))
 Effects  Response : outcome

 Factor  Low  High Diff. Effect S.E. Lower 0.95 Upper 0.95
 stenosis70.0 80   10.0  -0.11  0.28 -0.65  0.43
  Odds Ratio 70.0 80   10.0   0.90NA  0.52  1.54
 x1   1.5  20.5   1.21  0.29  0.65  1.77
  Odds Ratio  1.5  20.5   3.36NA  1.92  5.89
 x2   1.5  20.5  -0.29  0.16 -0.59  0.02
  Odds Ratio  1.5  20.5   0.75NA  0.55  1.02
 ClinicalScore3.0  52.0   0.61  0.45 -0.28  1.50
  Odds Ratio  3.0  52.0   1.84NA  0.76  4.47
 procedure - CAS:CEA  2.0  1 NA   0.83  0.38  0.07  1.58
  Odds Ratio  2.0  1 NA   2.28NA  1.08  4.85

> summary(MyModel.robcov, stenosis=c(70, 80), T1=c(1.5, 2.0), T2=c(1.5,
2.0))
 Effects  Response : outcome

 Factor  Low  High Diff. Effect S.E. Lower 0.95 Upper 0.95
 stenosis70.0 80   10.0  -0.11  0.26 -0.62  0.40
  Odds Ratio 70.0 80   10.0   0.90NA  0.54  1.50
 x1   1.5  20.5   1.21  0.27  0.69  1.74
  Odds Ratio  1.5  20.5   3.36NA  1.99  5.68
 x2   1.5  20.5  -0.29  0.15 -0.57  0.00
  Odds Ratio  1.5  20.5   0.75NA  0.56  1.00
 ClinicalScore3.0  52.0   0.61  0.45 -0.27  1.49
  Odds Ratio  3.0  52.0   1.84NA  0.76  4.44
 procedure - CAS:CEA  2.0  1 NA   0.83  0.37  0.11  1.54
  Odds Ratio  2.0  1 NA   2.28NA  1.11  4.68

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Questions about lrm, validate, pentrace

2011-04-29 Thread khosoda

(11/04/29 22:09), Frank Harrell wrote:

Yes I would select that as the final model.


Thank you for your comment. I am able to be confident about my model now.

The difference you saw is caused

by different treatment of penalization of factor variables, related to the
use of the sum squared differences between the estimate at one category from
the average over all categories.  I think that as long as you code it one
way consistently and pick the penalty using that coding you are OK.  But if
the coefficients of the non-factor variables depend on how the binary
predictor is coded, there is a bit more concern.


A lot of previous studies have demonstrated that poor outcome is more 
frequent in treat2 than in treat 1. So, I coded treat1 as 0, and treat2 
as 1 in the first mail. Then, I came back to the original coding of 
treat1 and treat2 in the newer mail. According to your answer, I guess I 
am OK. :-)


Prof Harrell, Your book (Rregression Modeling Strategies) and many kind 
comments helped me a lot. Thank you very much again.


--
KH



Frank


細田弘吉 wrote:


Thank you for you quick reply, Prof. Harrell.
According to your advice, I ran pentrace using a very wide range.

  >  pentrace.x6factor<- pentrace(x6factor.lrm, seq(0, 100, by=0.5))
  >  plot(pentrace.x6factor)

I attached this figure. Then,

  >  pentrace.x6factor<- pentrace(x6factor.lrm, seq(0, 10, by=0.05))

It seems reasonable that the best penalty is 2.55.

  >  x6factor.lrm.pen<- update(x6factor.lrm, penalty=2.55)
  >  cbind(coef(x6factor.lrm), coef(x6factor.lrm.pen),
abs(coef(x6factor.lrm)-coef(x6factor.lrm.pen)))
   [,1][,2][,3]
Intercept -4.32434556 -3.86816460 0.456180958
stenosis  -0.01496757 -0.01091755 0.004050025
T1 3.04248257  2.42443034 0.618052225
T2-0.75335619 -0.57194342 0.181412767
procedure -1.20847252 -0.82589263 0.382579892
ClinicalScore  0.37623189  0.30524628 0.070985611

  >  validate(x6factor.lrm, bw=F, B=200)
index.orig trainingtest optimism index.corrected   n
Dxy   0.6324   0.6849  0.5955   0.0894  0.5430 200
R20.3668   0.4220  0.3231   0.0989  0.2679 200
Intercept 0.   0. -0.1924   0.1924 -0.1924 200
Slope 1.   1.  0.7796   0.2204  0.7796 200
Emax  0.   0.  0.0915   0.0915  0.0915 200
D 0.2716   0.3229  0.2339   0.0890  0.1826 200
U-0.0192  -0.0192  0.0243  -0.0436  0.0243 200
Q 0.2908   0.3422  0.2096   0.1325  0.1582 200
B 0.1272   0.1171  0.1357  -0.0186  0.1457 200
g 1.6328   1.9879  1.4940   0.4939  1.1389 200
gp0.2367   0.2502  0.2216   0.0286  0.2080 200


  >  validate(x6factor.lrm.pen, bw=F, B=200)
index.orig trainingtest optimism index.corrected   n
Dxy   0.6375   0.6857  0.6024   0.0833  0.5542 200
R20.3145   0.3488  0.3267   0.0221  0.2924 200
Intercept 0.   0.  0.0882  -0.0882  0.0882 200
Slope 1.   1.  1.0923  -0.0923  1.0923 200
Emax  0.   0.  0.0340   0.0340  0.0340 200
D 0.2612   0.2571  0.2370   0.0201  0.2411 200
U-0.0192  -0.0192 -0.0047  -0.0145 -0.0047 200
Q 0.2805   0.2763  0.2417   0.0346  0.2458 200
B 0.1292   0.1224  0.1355  -0.0132  0.1423 200
g 1.2704   1.3917  1.5019  -0.1102  1.3805 200
gp0.2020   0.2091  0.2229  -0.0138  0.2158 200

In the penalized model (x6factor.lrm.pen), the apparent Dxy is 0.64, and
bias-corrected Dxy is 0.55. The maximum absolute error is estimated to
be 0.034, smaller than non-penalized model (0.0915 in x6factor.lrm) The
changes in slope and intercept are substantially reduced in penalized
model. I think overfitting is improved at least to some extent. Should I
select this as a final model?

I have one more question. The "procedure" variable was defined as 0/1
value in the previous mail. For some graphical reason, I redefined it as
treat1/treat2 value. Then, the best penalty value was changed from 3.05
to 2.55. I guess change from numeric to factorial caused this reduction
in penalty. Which set up should I select?

I appreciate your help in advance.

--
KH

(11/04/26 0:21), Frank Harrell wrote:

You've done a lot of good work on this.  Yes I would say you have
moderate
overfitting with the first model.  The only thing that saved you from
having
severe overfitting is that there seems to be a signal present [I am
assume
this model is truly pre-specified and was not developed at all by looking
at
patterns of responses Y.]

The use of backwards stepdown demonstrated much worse overfitting.  This
is
in line with what we know about the damage of stepwise selection methods
that do not incorporate shrinkage.  I would throw away the

[R] Questions about lrm, validate, pentrace (Re: BMA, logistic regression, odds ratio, model reduction etc)

2011-04-23 Thread khosoda

According to the advice, I tried rms package.
Just to make sure, I have data of 104 patients (x6.df), which consists 
of 5 explanatory variables and one binary outcome (poor/good) (previous 
model 2 strategy). The outcome consists of 25 poor results and 79 good 
results. Therefore, My events per variable (EPV) is only 5 (much less 
than the rule of thumb of 10).


My questions are about validate and pentrace in rms package.
I present some codes and results.
I appreciate anybody's help in advance.

> x6.lrm <- lrm(outcome ~ stenosis+x1+x2+procedure+ClinicalScore, 
data=x6.df, x=T, y=T)


> x6.lrm
...
Obs  104LR chi2  29.24R2   0.367C   0.816
 negative 79d.f. 5g1.633Dxy 0.632
 positive 25Pr(> chi2) <0.0001   gr5.118gamma   0.632
max |deriv| 1e-08gp0.237tau-a   0.233
 Brier   0.127

   CoefS.E.   Wald Z Pr(>|Z|)
Intercept  -5.5328 2.6287 -2.10  0.0353
stenosis   -0.0150 0.0284 -0.53  0.5979
x1  3.0425 0.9100  3.34  0.0008
x2 -0.7534 0.4519 -1.67  0.0955
procedure   1.2085 0.5717  2.11  0.0345
ClinicalScore   0.3762 0.2287  1.65  0.0999

It seems not too bad. Next, validation by bootstrap ...

> validate(x6.lrm, B=200, bw=F)
  index.orig trainingtest optimism index.corrected   n
Dxy   0.6324   0.6960  0.5870   0.1091  0.5233 200
R20.3668   0.4370  0.3154   0.1216  0.2453 200
Intercept 0.   0. -0.2007   0.2007 -0.2007 200
Slope 1.   1.  0.7565   0.2435  0.7565 200
Emax  0.   0.  0.0999   0.0999  0.0999 200
D 0.2716   0.3368  0.2275   0.1093  0.1623 200
U-0.0192  -0.0192  0.0369  -0.0561  0.0369 200
Q 0.2908   0.3560  0.1906   0.1654  0.1254 200
B 0.1272   0.1155  0.1384  -0.0229  0.1501 200
g 1.6328   2.0740  1.4647   0.6093  1.0235 200
gp0.2367   0.2529  0.2189   0.0341  0.2026 200

The apparent Dxy is 0.63, and bias-corrected Dxy is 0.52. The maximum 
absolute error is estimated to be 0.099. The changes in slope and 
intercept are also more substantial. In all, there is evidence that I am 
somewhat overfitting the data, right?.


Furthermore, using step-down variable selection ...

> validate(x6.lrm, B=200, bw=T)

Backwards Step-down - Original Model

 DeletedChi-Sq d.f. P  Residual d.f. P  AIC
 stenosis   0.28   10.5979 0.28 10.5979 -1.72
 ClinicalScore  2.60   10.1068 2.88 20.2370 -1.12
 x2 2.86   10.0910 5.74 30.1252 -0.26

Approximate Estimates after Deleting Factors

 Coef   S.E. Wald Z P
Intercept  -5.865 1.4136 -4.149 3.336e-05
x1  2.915 0.8685  3.357 7.889e-04
procedure   1.072 0.5590  1.918 5.508e-02

Factors in Final Model

[1] x1 procedure
  index.orig trainingtest optimism index.corrected   n
Dxy   0.5661   0.6755  0.5559   0.1196  0.4464 200
R20.2876   0.4085  0.2784   0.1301  0.1575 200
Intercept 0.   0. -0.2459   0.2459 -0.2459 200
Slope 1.   1.  0.7300   0.2700  0.7300 200
Emax  0.   0.  0.1173   0.1173  0.1173 200
D 0.2038   0.3130  0.1970   0.1160  0.0877 200
U-0.0192  -0.0192  0.0382  -0.0574  0.0382 200
Q 0.2230   0.3323  0.1589   0.1734  0.0496 200
B 0.1441   0.1192  0.1452  -0.0261  0.1702 200
g 1.2628   1.9524  1.3222   0.6302  0.6326 199
gp0.2041   0.2430  0.2043   0.0387  0.1654 199

If I select only two variables (x1 and procedure), bias-corrected Dxy 
goes down to 0.45.


[Question 1]
I have EPV problem. Even so, should I keep the full model (5-variable 
model)? or can I use the 2-variable (x1 and procedure) model which the 
validate() with step-down provides?


[Question 2]
If I use 2-variable model, should I do
x2.lrm <- lrm(postopDWI_HI ~ T1+procedure2, data=x6.df, x=T, y=T)?
or keep the value showed above by validate function?

Next, shrinkage ...

> pentrace(x6.lrm, seq(0, 5.0, by=0.05))
Best penalty:
penalty df
   3.05   4.015378

The best penalty is 3.05. So, I update it with this penalty to obtain 
the corresponding penalized model:


> x6.lrm.pen <- update(x6.lrm, penalty=3.05, x=T, y=T)
> x6.lrm.pen
.
Penalty factors

 simple nonlinear interaction nonlinear.interaction
   3.05  3.053.05  3.05
Final penalty on -2 log L
 [,1]
[1,]  3.8

Obs 104LR chi2  28.18R2   0.313C   0.818
 negative79d.f. 4.015g1.264Dxy 0.635
 positive25   Pr(> chi2) <0.0001 gr   3.538gamma   0.637
max |deriv| 3e-05 

Re: [R] BMA, logistic regression, odds ratio, model reduction etc

2011-04-21 Thread khosoda

Thank you for your comment.
I forgot to mention that varclus and pvclust showed similar results for 
my data.


BTW, I did not realize rms is a replacement for the Design package.
I appreciate your suggestion.
--
KH

(11/04/21 8:00), Frank Harrell wrote:

I think it's OK.  You can also use the Hmisc package's varclus function.
Frank


細田弘吉 wrote:


Dear Prof. Harrel,

Thank you very much for your quick advice.
I will try rms package.

Regarding model reduction, is my model 2 method (clustering and recoding
that are blinded to the outcome) permissible?

Sincerely,

--
KH

(11/04/20 22:01), Frank Harrell wrote:

Deleting variables is a bad idea unless you make that a formal part of
the
BMA so that the attempt to delete variables is penalized for.  Instead of
BMA I recommend simple penalized maximum likelihood estimation (see the
lrm
function in the rms package) or pre-modeling data reduction that is
blinded
to the outcome variable.
Frank


細田弘吉 wrote:


Hi everybody,
I apologize for long mail in advance.

I have data of 104 patients, which consists of 15 explanatory variables
and one binary outcome (poor/good). The outcome consists of 25 poor
results and 79 good results. I tried to analyze the data with logistic
regression. However, the 15 variables and 25 events means events per
variable (EPV) is much less than 10 (rule of thumb). Therefore, I used R
package, "BMA" to perform logistic regression with BMA to avoid this
problem.

model 1 (full model):
x1, x2, x3, x4 are continuous variables and others are binary data.


x16.bic.glm<- bic.glm(outcome ~ ., data=x16.df,

glm.family="binomial", OR20, strict=FALSE)

summary(x16.bic.glm)

(The output below has been cut off at the right edge to save space)

62  models were selected
   Best  5  models (cumulative posterior probability =  0.3606 ):

   p!=0EV SDmodel 1model2
Intercept100-5.1348545  1.652424-4.4688  -5.15
-5.1536
age3.3   0.0001634  0.007258  .
sex4.0
 .M   -0.0243145  0.220314  .
side  10.8
  .R   0.0811227  0.301233  .
procedure 46.9  -0.5356894  0.685148  .  -1.163
symptom3.8  -0.0099438  0.129690  .  .
stenosis   3.4  -0.0003343  0.005254  .
x13.7  -0.0061451  0.144084  .
x2   100.0   3.1707661  0.892034 3.2221 3.11
x351.3  -0.4577885  0.551466-0.9154 .
HT 4.6
.positive  0.0199299  0.161769  .  .
DM 3.3
.positive -0.0019986  0.105910  .  .
IHD3.5
 .positive 0.0077626  0.122593  .  .
smoking9.1
 .positive 0.0611779  0.258402  .  .
hyperlipidemia16.0
.positive  0.1784293  0.512058  .  .
x4 8.2   0.0607398  0.267501  .  .


nVar   2  2
   1  3  3
BIC   -376.9082
-376.5588  -376.3094  -375.8468  -374.5582
post prob0.104
0.087  0.077  0.061  0.032

[Question 1]
Is it O.K to calculate odds ratio and its 95% confidence interval from
"EV" (posterior distribution mean) and“SD”(posterior distribution
standard deviation)?
For example, 95%CI of EV of x2 can be calculated as;

exp(3.1707661)

[1] 23.82573 ->   odds ratio

exp(3.1707661+1.96*0.892034)

[1] 136.8866

exp(3.1707661-1.96*0.892034)

[1] 4.146976
-->   95%CI (4.1 to 136.9)
Is this O.K.?

[Question 2]
Is it permissible to delete variables with small value of "p!=0" and
"EV", such as age (3.3% and 0.0001634) to reduce the number of
explanatory variables and reconstruct new model without those variables
for new session of BMA?

model 2 (reduced model):
I used R package, "pvclust", to reduce the model. The result suggested
x1, x2 and x4 belonged to the same cluster, so I picked up only x2.
Based on the subject knowledge, I made a simple unweighted sum, by
counting the number of clinical features. For 9 features (sex, side,
HT2, hyperlipidemia, DM, IHD, smoking, symptom, age), the sum ranges
from 0 to 9. This score was defined as ClinicalScore. Consequently, I
made up new data set (x6.df), which consists of 5 variables (stenosis,
x2, x3, procedure, and ClinicalScore) and one binary outcome
(poor/good). Then, for alternative BMA session...


BMAx6.glm<- bic.glm(postopDWI_HI ~ ., data=x6.df,

glm.family="binomial", OR=20, strict=FALSE)

summary(BMAx6.glm)

(The output below has been cut off at the right edge to 

Re: [R] BMA, logistic regression, odds ratio, model reduction etc

2011-04-20 Thread khosoda

Dear Prof. Harrel,

Thank you very much for your quick advice.
I will try rms package.

Regarding model reduction, is my model 2 method (clustering and recoding 
that are blinded to the outcome) permissible?


Sincerely,

--
KH

(11/04/20 22:01), Frank Harrell wrote:

Deleting variables is a bad idea unless you make that a formal part of the
BMA so that the attempt to delete variables is penalized for.  Instead of
BMA I recommend simple penalized maximum likelihood estimation (see the lrm
function in the rms package) or pre-modeling data reduction that is blinded
to the outcome variable.
Frank


細田弘吉 wrote:


Hi everybody,
I apologize for long mail in advance.

I have data of 104 patients, which consists of 15 explanatory variables
and one binary outcome (poor/good). The outcome consists of 25 poor
results and 79 good results. I tried to analyze the data with logistic
regression. However, the 15 variables and 25 events means events per
variable (EPV) is much less than 10 (rule of thumb). Therefore, I used R
package, "BMA" to perform logistic regression with BMA to avoid this
problem.

model 1 (full model):
x1, x2, x3, x4 are continuous variables and others are binary data.


x16.bic.glm<- bic.glm(outcome ~ ., data=x16.df,

glm.family="binomial", OR20, strict=FALSE)

summary(x16.bic.glm)

(The output below has been cut off at the right edge to save space)

   62  models were selected
  Best  5  models (cumulative posterior probability =  0.3606 ):

  p!=0EV SDmodel 1model2
Intercept100-5.1348545  1.652424-4.4688  -5.15
-5.1536
age3.3   0.0001634  0.007258  .
sex4.0
.M   -0.0243145  0.220314  .
side  10.8
 .R   0.0811227  0.301233  .
procedure 46.9  -0.5356894  0.685148  .  -1.163
symptom3.8  -0.0099438  0.129690  .  .
stenosis   3.4  -0.0003343  0.005254  .
x13.7  -0.0061451  0.144084  .
x2   100.0   3.1707661  0.892034 3.2221 3.11
x351.3  -0.4577885  0.551466-0.9154 .
HT 4.6
   .positive  0.0199299  0.161769  .  .
DM 3.3
   .positive -0.0019986  0.105910  .  .
IHD3.5
.positive 0.0077626  0.122593  .  .
smoking9.1
.positive 0.0611779  0.258402  .  .
hyperlipidemia16.0
   .positive  0.1784293  0.512058  .  .
x4 8.2   0.0607398  0.267501  .  .


nVar   2  2
  1  3  3
BIC   -376.9082
-376.5588  -376.3094  -375.8468  -374.5582
post prob0.104
0.087  0.077  0.061  0.032

[Question 1]
Is it O.K to calculate odds ratio and its 95% confidence interval from
"EV" (posterior distribution mean) and“SD”(posterior distribution
standard deviation)?
For example, 95%CI of EV of x2 can be calculated as;

exp(3.1707661)

[1] 23.82573 ->  odds ratio

exp(3.1707661+1.96*0.892034)

[1] 136.8866

exp(3.1707661-1.96*0.892034)

[1] 4.146976
-->  95%CI (4.1 to 136.9)
Is this O.K.?

[Question 2]
Is it permissible to delete variables with small value of "p!=0" and
"EV", such as age (3.3% and 0.0001634) to reduce the number of
explanatory variables and reconstruct new model without those variables
for new session of BMA?

model 2 (reduced model):
I used R package, "pvclust", to reduce the model. The result suggested
x1, x2 and x4 belonged to the same cluster, so I picked up only x2.
Based on the subject knowledge, I made a simple unweighted sum, by
counting the number of clinical features. For 9 features (sex, side,
HT2, hyperlipidemia, DM, IHD, smoking, symptom, age), the sum ranges
from 0 to 9. This score was defined as ClinicalScore. Consequently, I
made up new data set (x6.df), which consists of 5 variables (stenosis,
x2, x3, procedure, and ClinicalScore) and one binary outcome
(poor/good). Then, for alternative BMA session...


BMAx6.glm<- bic.glm(postopDWI_HI ~ ., data=x6.df,

glm.family="binomial", OR=20, strict=FALSE)

summary(BMAx6.glm)

(The output below has been cut off at the right edge to save space)
Call:
bic.glm.formula(f = postopDWI_HI ~ ., data = x6.df, glm.family =
"binomial", strict = FALSE, OR = 20)


   13  models were selected
  Best  5  models (cumulative posterior probability =  0.7626 ):

 p!=0EV SD   model 1model 2
Intercept   100-5.6918362  1.81220-4.4688-6.3166
stenosis 

[R] BMA, logistic regression, odds ratio, model reduction etc

2011-04-20 Thread khosoda
Hi everybody,
I apologize for long mail in advance.

I have data of 104 patients, which consists of 15 explanatory variables
and one binary outcome (poor/good). The outcome consists of 25 poor
results and 79 good results. I tried to analyze the data with logistic
regression. However, the 15 variables and 25 events means events per
variable (EPV) is much less than 10 (rule of thumb). Therefore, I used R
package, "BMA" to perform logistic regression with BMA to avoid this
problem.

model 1 (full model):
x1, x2, x3, x4 are continuous variables and others are binary data.

> x16.bic.glm <- bic.glm(outcome ~ ., data=x16.df,
glm.family="binomial", OR20, strict=FALSE)
> summary(x16.bic.glm)
(The output below has been cut off at the right edge to save space)

  62  models were selected
 Best  5  models (cumulative posterior probability =  0.3606 ):

 p!=0EV SDmodel 1model2
Intercept100-5.1348545  1.652424-4.4688  -5.15
-5.1536
age3.3   0.0001634  0.007258  .
sex4.0
   .M   -0.0243145  0.220314  .
side  10.8
.R   0.0811227  0.301233  .
procedure 46.9  -0.5356894  0.685148  .  -1.163
symptom3.8  -0.0099438  0.129690  .  .
stenosis   3.4  -0.0003343  0.005254  .
x13.7  -0.0061451  0.144084  .
x2   100.0   3.1707661  0.892034 3.2221 3.11
x351.3  -0.4577885  0.551466-0.9154 .
HT 4.6
  .positive  0.0199299  0.161769  .  .
DM 3.3
  .positive -0.0019986  0.105910  .  .
IHD3.5
   .positive 0.0077626  0.122593  .  .
smoking9.1
   .positive 0.0611779  0.258402  .  .
hyperlipidemia16.0
  .positive  0.1784293  0.512058  .  .
x4 8.2   0.0607398  0.267501  .  .


nVar   2  2
 1  3  3
BIC   -376.9082
-376.5588  -376.3094  -375.8468  -374.5582
post prob0.104
0.087  0.077  0.061  0.032

[Question 1]
Is it O.K to calculate odds ratio and its 95% confidence interval from
"EV" (posterior distribution mean) and“SD”(posterior distribution
standard deviation)?
For example, 95%CI of EV of x2 can be calculated as;
> exp(3.1707661)
[1] 23.82573 -> odds ratio
> exp(3.1707661+1.96*0.892034)
[1] 136.8866
> exp(3.1707661-1.96*0.892034)
[1] 4.146976
--> 95%CI (4.1 to 136.9)
Is this O.K.?

[Question 2]
Is it permissible to delete variables with small value of "p!=0" and
"EV", such as age (3.3% and 0.0001634) to reduce the number of
explanatory variables and reconstruct new model without those variables
for new session of BMA?

model 2 (reduced model):
I used R package, "pvclust", to reduce the model. The result suggested
x1, x2 and x4 belonged to the same cluster, so I picked up only x2.
Based on the subject knowledge, I made a simple unweighted sum, by
counting the number of clinical features. For 9 features (sex, side,
HT2, hyperlipidemia, DM, IHD, smoking, symptom, age), the sum ranges
from 0 to 9. This score was defined as ClinicalScore. Consequently, I
made up new data set (x6.df), which consists of 5 variables (stenosis,
x2, x3, procedure, and ClinicalScore) and one binary outcome
(poor/good). Then, for alternative BMA session...

> BMAx6.glm <- bic.glm(postopDWI_HI ~ ., data=x6.df,
glm.family="binomial", OR=20, strict=FALSE)
> summary(BMAx6.glm)
(The output below has been cut off at the right edge to save space)
Call:
bic.glm.formula(f = postopDWI_HI ~ ., data = x6.df, glm.family =
"binomial", strict = FALSE, OR = 20)


  13  models were selected
 Best  5  models (cumulative posterior probability =  0.7626 ):

p!=0EV SD   model 1model 2
Intercept   100-5.6918362  1.81220-4.4688-6.3166
stenosis  8.1  -0.0008417  0.00815  .  .
x2  100.0   3.0606165  0.87765 3.2221 3.1154
x3   46.5  -0.3998864  0.52688-0.9154  .
procedure   49.3   0.5747013  0.70164  . 1.1631
ClinicalScore   27.1   0.0966633  0.19645  .  .


nVar 2  2  1
 3  3
BIC -376.9082  -376.5588
-376.3094  -375.8468  -375.5025
post prob  0.208  0.175
0.154  0.122  0.103

[Question 3]
Am I doing it correctly or not?
I 

Re: [R] A question on glmnet analysis

2011-03-28 Thread khosoda

(11/03/27 22:49), KH wrote:

(11/03/25 22:40), Nick Sabbe wrote:


2. Which model, I mean lasso or elastic net, should be selected? and
why? Both models chose the same variables but different coefficient values.
You may want to read 'the elements of statistical learning' to find some
info on the advantages of ridge/lasso/elnet compared. Lasso should work fine
in this relatively low-dimensional setting, although it depends on the
correlation structure of your covariates.


I should have used vif from car package for logistic model.

library(car)
test3 <- glm(y ~ x1+x2+x3+x4+x5+x6+x7+x8+x9+x10+x11+x12+x13+x14+x15, 
family="binomial", data=MyData)

vif(test3)
 x1x2x3x4x5   x6x7 x8 x9 
x10   x11   x12   x13   x14   x15
1.339349  1.477299   1.292232   1.309631   1.375251 
1.192694   1.763012   2.358474  1.755591   1.281404 
1.229909   1.353517   1.304637   1.486188   1.428996


Anyway, multicollinearity is unlikely to be a problem.

KH


I also checked correlation structure of my covariates.

test<- lm(y ~ x15std)
library(DAAG)
vif(test)
x15std1  x15std2  x15std3  x15std4  x15std5  x15std6  x15std7  x15std8
x15std9 x15std10 x15std11 x15std12 x15std13 x15std14
   1.2299   1.2880   1.1011   1.1559   1.3033   1.0774   1.5369   1.9604
   1.4664   1.1754   1.1396   1.2683   1.1685   1.1667
x15std15
   1.5534

Variance inflation are less than 5 suggesting that multicollinearity is
unlikely to be a problem.

Therefore, Lasso model should be selected?

Thanks a lot in advance,

KH


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.