Re: [R] FW: variable format

2007-09-07 Thread Frank E Harrell Jr
Martin Becker wrote:
 Dear Cory,
 
 I am not familiar with SAS, but is this what you are looking for?
 
 divisionTable <- matrix(c(1, 'New England',
   2, 'Middle Atlantic',
   3, 'East North Central',
   4, 'West North Central',
   5, 'South Atlantic',
   6, 'East South Central',
   7, 'West South Central',
   8, 'Mountain',
   9, 'Pacific'),
 ncol=2, byrow=TRUE)

How about just divisionTable <- c('New England', 'Middle Atlantic', ...) 
then factor(old, 1:9, divisionTable) ?
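
A minimal sketch of that approach (not from the original post), using a hypothetical vector old that contains a code (0) with no label:

old <- c(0,1,2,3,4,5)
divisionTable <- c('New England', 'Middle Atlantic', 'East North Central',
                   'West North Central', 'South Atlantic', 'East South Central',
                   'West South Central', 'Mountain', 'Pacific')
as.character(factor(old, 1:9, divisionTable))
# codes with no matching level (here 0) become NA; unused labels give no warning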

Frank

 a <- NULL
 a$divisionOld <- c(0,1,2,3,4,5)
 a$divisionNew <-
 as.character(factor(a$divisionOld, levels=divisionTable[,1], labels=divisionTable[,2]))
 a$divisionNew
 
 [1] NA                   "New England"        "Middle Atlantic"
 [4] "East North Central" "West North Central" "South Atlantic"
 
 
 Kind regards,
 
   Martin
 
 
 Cory Nissen schrieb:
   

 Anybody?  


 

 From: Cory Nissen
 Sent: Tue 9/4/2007 9:30 AM
 To: r-help@stat.math.ethz.ch
 Subject: variable format


 Okay, I want to do something similar to SAS proc format.

 I usually do this...

 a <- NULL
 a$divisionOld <- c(1,2,3,4,5)
 divisionTable <- matrix(c(1, 'New England',
   2, 'Middle Atlantic',
   3, 'East North Central',
   4, 'West North Central',
   5, 'South Atlantic'),
 ncol=2, byrow=TRUE)
 a$divisionNew[match(a$divisionOld, divisionTable[,1])] <- divisionTable[,2]

 But how do I handle the case where...
 a$divisionOld <- c(0,1,2,3,4,5)   # no format available for 0; this throws an error.
 OR
 divisionTable <- matrix(c(1, 'New England',
   2, 'Middle Atlantic',
   3, 'East North Central',
   4, 'West North Central',
   5, 'South Atlantic',
   6, 'East South Central',
   7, 'West South Central',
   8, 'Mountain',
   9, 'Pacific'),
 ncol=2, byrow=TRUE)
 There are extra formats available... this throws a warning.

 Thanks

 Cory



 
 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University



Re: [R] Stratified weighted statistics

2007-08-29 Thread Frank E Harrell Jr
Vikas Rawal wrote:
 I need to compute weighted descriptive statistics (mean, median,
 variance) for a vector of data disaggregated by a set of index
 variables. The weights to be used are a vector in the same data frame
 and have to be disaggregated by the same index variables.
 
 Package Hmisc has functions for computing weighted statistics. But
 both by/aggregate and summarize (in Hmisc) allow a function to have
 only a single argument. In my case, I need to pass both the variable
 and the weights as arguments to the function.
 
 Could anyone please advise how to do this.

Please look again at the help files for summarize and wtd.mean which 
show how to do this.

Frank
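
A minimal sketch of the pattern shown in those help files (all data and names below are made up):

library(Hmisc)
d <- data.frame(town   = rep(c('A','B','C'), each = 50),   # hypothetical data
                income = rexp(150, 1/30000),
                wt     = runif(150, 0.5, 2))
# pass the variable and its weights together as columns of a matrix, so that
# FUN receives both and can call the Hmisc weighted-statistic functions
with(d, summarize(cbind(income, wt), llist(town),
                  function(y) wtd.mean(y[, 1], weights = y[, 2]),
                  stat.name = 'wtd.mean.income'))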

 
 As an example, say I have a data.frame giving incomes of persons in
 different towns along with weights to be given for each sampled
 person. How will I compute weighted statistics for each town?
 
 Vikas Rawal
 JNU, New Delhi.
 
 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University



Re: [R] Excel

2007-08-29 Thread Frank E Harrell Jr
Rolf Turner wrote:
 On 30/08/2007, at 8:49 AM, Greg Snow wrote:
 
 Erich Neuwirth said:

 -Original Message-
 From: [EMAIL PROTECTED]
 [mailto:[EMAIL PROTECTED] On Behalf Of Erich Neuwirth
 Sent: Wednesday, August 29, 2007 12:43 PM
 To: r-help
 Subject: Re: [R] Excel

 Excel bashing can be fun but also can be dangerous because
 you are making your life harder than necessary.
 My experience differs; so far, using Excel (other than as a table-layout
 program) has made my life harder more times than it has made it
 easier.
 
   Remainder of message deleted.
 
   Bravo!!!  Very well and very cogently expressed!
 
   cheers,
 
   Rolf Turner

Yes!  In addition I'd like to stress that the Excel model represents 
non-reproducible research.

Frank Harrell


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University



Re: [R] validate (package Design): error message subscript out of bounds

2007-08-28 Thread Frank E Harrell Jr
Wentzel-Larsen, Tore wrote:
 Thanks,
 I use Design version 2.1-1 (and as provided, R 2.5.1 on Windows XP).

Sorry I missed the latter.

 The redundancies in object names were for my own convenience, as part of
 a larger command file, and the validation of this univariate model was
 only included to ease comparison with the main multivariate model. I have

A minor point: adding more variables to the right hand side makes the 
model a multivariable model.  It is still univariate as it has only one 
dependent variable.

 tried to access Design version 2.0_12, but only managed to access version
 2.1_1 of Design in my Windows installation of R 2.5.1 (the choice of
 operating system is made by my institution and I am only entitled to use
 Windows).

My mistake again.  I forgot that the Debian repositories for CRAN are 
behind.  I updated to the latest Design on CRAN and found the bug.  I will 
send a separate note with a new version of the function in question for 
you to source() to override the current function.

Frank

 Best, Tore
 
 -----Original Message-----
 From: Frank E Harrell Jr [mailto:[EMAIL PROTECTED] 
 Sent: 28 August 2007 03:17
 To: Wentzel-Larsen, Tore
 Cc: r-help@stat.math.ethz.ch
 Subject: Re: [R] validate (package Design): error message subscript out of 
 bounds
 
 Wentzel-Larsen, Tore wrote:
 Dear R users 

 I use Windows XP, R 2.5.1 (I have read the posting guide, I have 
 contacted the package maintainer first, and it is not homework).

 In a research project on renal cell carcinoma we want to compute 
 Harrell's c index, with optimism correction, for a multivariate 
 Cox regression and also for some univariate Cox models.
 For some of these univariate models I have encountered an error
 message (and no result produced) from the function validate in
 Frank Harrell's Design package:

 Error in Xb(x[, xcol, drop = FALSE], coef, non.slopes, non.slopes.in.x,  : 
 subscript out of bounds

 The following is an artificial example wherein I have been able to 
 reproduce this error message (actual data has been changed to preserve
 confidentiality):
 
 I could not reproduce the error on R 2.5.1 on linux using version 2.0-12 
 of Design (you did not provide this information).
 
 Your code involved a good deal of extra typing.  Here is a streamlined 
 version:
 
 bc <- data.frame(time1 = c(9,24,28,43,58,62,66,107,116,118,123,
   127,129,131,137,138,139,140,148,169,176,179,188,196,210,218,
 
 bc
 
 library(Design)
 
 dd <- with(bc, datadist(bc1, age, adjto.cat='first'))
 options(datadist = 'dd')
 
 f <- cph(Surv(time1,status1) ~ bc1,
   data = bc, x=TRUE, y=TRUE, surv=TRUE)
 anova(f)
 f
 summary(f)
 
 val <- validate(f, B=200, dxy=TRUE)
 
 I don't get much value out of putting the type of an object in the
 object's name, as information within objects defines the object type/class.
 
 There is little reason to validate a one degree of freedom model.
 
 Frank
 
 library(Design)

 # an example data frame:
 frame.bc <- data.frame(time1 = c(9,24,28,43,58,62,66,107,116,118,123,
  127,129,131,137,138,139,140,148,169,176,179,188,196,210,218,
  1,1,1,2,2,3,4,8,23,32,33,34,43,44,48,51,52,54,59,59,60,60,62,
  65,65,68,70,72,73,74,81,84,88,98,99,106,107,115,115,117,119,
  120,122,122,122,122,126,128,130,135,136,136,138,149,151,154,
  157,159,161,164,164,164,166,172,172,176,179,180,183,183,184,
  187,190,197,201,201,203,203,203,209,210,214,219,227,233,4,18,
  49,113,147,1,1,2,2,2,2,2,3,4,6,6,6,6,6,6,6,6,9,9,9,9,9,10,10,
  10,11,12,12,12,13,14,14,17,18,18,19,19,20,20,21,21,21,21,22,23,
  23,24,28,28,29,29,32,34,35,38,38,48,48,52,52,54,54,56,64,67,67,
  69,70,70,72,84,88,90,114,115,140,142,154,171,195),
  status1 = c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
  1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
  1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
  1,1,1,1,1),
  bc1 = factor(c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
  2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
  2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
  2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,
  2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
  2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2),
  labels=c('bc.1','bc.2')),
  age = c(58,68,23,20,50,43,41,69,20,48,19,27,39,20,65,49,70,59,31,43,25,
  61,60,45,34,59,32,58,30,62,26,44,52,29,40,57,33,18,50,50,55,51,38,34,
  69,56,67,38,66,21,48,39,62,62,29,68,66,19,60,39,55,42,24,29,56,61,40,
  52,19,40,33,67,66,51,48,63,60,58,68,60,53,20,45,62,37,38,61,63,43,67

Re: [R] how to include bar values in a barplot?

2007-08-27 Thread Frank E Harrell Jr
Donatas G. wrote:
 On Tuesday 07 August 2007 22:09:52 Donatas G. wrote:
 How do I include bar values in a barplot (or other R graphics, where this
 could be applicable)?

 To make sure I am clear I am attaching a barplot created with
 OpenOffice.org which has barplot values written on top of each barplot.
 
 Here is the barplot mentioned above:
 http://dg.lapas.info/wp-content/barplot-with-values.jpg
 
 it appears that this list does not allow attachments...
 
That is a TERRIBLE graphic.  Can't we finally leave this subject alone?

Frank Harrell



Re: [R] grouping scat1d/rug and plotting to 2 axes

2007-08-27 Thread Frank E Harrell Jr
Mike wrote:
 Hi,
 
 I'm wondering if anybody can offer a bit of  guidance on how to add a 
 couple of features to a plot. 
 
 I'm using Frank Harrell's Design library to model some survival data in 
 R (2.3.1, windows platform).  I'm fairly comfortable with the survival 
 modeling in Design, but am still at a frustratingly low level of 
 competence when it comes to creating anything beyond simple plots in R.
 
 A simplified version of the model is:
 
 fit <- cph(Surv(survtime,deceased) ~ rcs(smw,4),
 data=survdata, x=TRUE, y=TRUE, surv=TRUE)
 
 And the basic plot is:
 
 plot(fit,smw=NA, fun=function(x) 1/(1+exp(-x)))

or plot(fit, smw=NA, fun=plogis).  But what does the logistic model have 
to do with the Cox model you fitted?  You can instead do plot(fit, 
smw=NA, time=1) to plot estimated 1-year survival prob.

 
 I know that if I add
 
 scat1d(smw)
 
 I get a nice jittered rug plot of all values of the predictor smw on the 
 top axis.
 
 What I'd like to do, however, is to plot on the bottom axis the values of 
 smw for only those participants who are alive, and then on the top axis, 
 plot the values of smw for those who are deceased.  I'd appreciate any 
 tips as to how I might approach this.

That isn't so well defined because of variable follow-up time.  I would 
not get very much out of such a plot.

Frank
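
For anyone who does want to try it, a rough sketch (assuming smw and deceased are columns of survdata; scat1d's side argument selects the bottom (1) or top (3) axis):

with(survdata, scat1d(smw[deceased == 0], side = 1))   # alive: bottom axis
with(survdata, scat1d(smw[deceased == 1], side = 3))   # deceased: top axis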

 
 Thanks,
 
 Mike Babyak
 
 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University



Re: [R] validate (package Design): error message subscript out of bounds

2007-08-27 Thread Frank E Harrell Jr
  1.000
 Slope  1.00  0.8075246
 D  0.025717  0.0147536
 U -0.002447  0.0005348
 Q  0.028163  0.0142188
 Error in Xb(x[, xcol, drop = FALSE], coef, non.slopes, non.slopes.in.x,  : 
 subscript out of bounds
 
 
 Any help/suggestions will be highly appreciated.
 
 
 Sincerely,
 Tore Wentzel-Larsen
 statistician
 Centre for Clinical research
 Armauer Hansen house 
 Haukeland University Hospital
 N-5021 Bergen
 tlf   +47 55 97 55 39 (a)
 faks  +47 55 97 60 88 (a)
 email [EMAIL PROTECTED]
 
 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University



[R] Random number expert needed

2007-08-24 Thread Frank E Harrell Jr
Colleagues:  An agency contacted me that has need of a true expert in 
pseudo random number generation who can perform an audit of a system 
currently in use.  If you live within say 400 miles of Nashville TN and 
have in-depth expertise, please contact me and I will put you in touch 
with the agency.

Thanks
Frank
-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University



Re: [R] Ask for functions to obtain partial R-square (squared partial correlation coefficients)

2007-08-20 Thread Frank E Harrell Jr
Ye Xingwang wrote:
 The partial R-square (or coefficient of partial determination, or
 squared partial correlation coefficients) measures the marginal
 contribution of one explanatory variable when all others are already
 included in the multiple linear regression model.
 
 The following link has very clear explanations on partial and
 semi-partial correlation:
 http://www.psy.jhu.edu/~ashelton/courses/stats315/week2.pdf
 
 In SAS, the options are PCORR2 and SCORR2.
 For example(from 
 http://www.ats.ucla.edu/stat/sas/examples/alsm/alsmsasch7.htm)
 
 data ch7tab01;
   input X1 X2 X3 Y;
   label x1 = 'Triceps'
 x2 = 'Thigh cir.'
 x3 = 'Midarm cir.'
  y = 'body fat';
   cards;
   19.5  43.1  29.1  11.9
   24.7  49.8  28.2  22.8
   30.7  51.9  37.0  18.7
   29.8  54.3  31.1  20.1
   19.1  42.2  30.9  12.9
   25.6  53.9  23.7  21.7
   31.4  58.5  27.6  27.1
   27.9  52.1  30.6  25.4
   22.1  49.9  23.2  21.3
   25.5  53.5  24.8  19.3
   31.1  56.6  30.0  25.4
   30.4  56.7  28.3  27.2
   18.7  46.5  23.0  11.7
   19.7  44.2  28.6  17.8
   14.6  42.7  21.3  12.8
   29.5  54.4  30.1  23.9
   27.7  55.3  25.7  22.6
   30.2  58.6  24.6  25.4
   22.7  48.2  27.1  14.8
   25.2  51.0  27.5  21.1
 ;
 run;
 
 proc reg data = ch7tab01;
   model y = x1 x2 / pcorr2 SCORR2;
   model y = x1-x3 / pcorr2 SCORR2;
 run;
 quit;
 
 There has been a post in
 http://tolstoy.newcastle.edu.au/R/help/05/03/0437.html
 
 It will be great appreciated if someone could write a general function
 to work with class lm or glm to obtain the
 pcorr2 (squared partial correlation coefficients using Type II sums of 
 squares)
 and scorr2 (squared semi-partial correlation coefficients using Type
 II sums of squares)
 for all independent variables (3 variables) simultaneously?
 
 Thank you.
 Xingwang Ye
 

library(Design)  # requires Hmisc
f <- ols(y ~ x1 + x2)
p <- plot(anova(f), what='partial R2')
p

The anova.Design function called above handles pooling related degrees 
of freedom and pools main effects with related interaction effects to 
get the total partial effect.

Frank
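
For single-degree-of-freedom terms the same quantities can also be had in base R; a minimal sketch, assuming a hypothetical data frame dat with columns y, x1, x2, x3:

fit <- lm(y ~ x1 + x2 + x3, data = dat)
sse <- deviance(fit)                   # full-model residual sum of squares
ss  <- drop1(fit)[-1, 'Sum of Sq']     # Type II (marginal) SS for each term
pcorr2 <- ss / (ss + sse)                      # squared partial correlations
scorr2 <- ss / sum((dat$y - mean(dat$y))^2)    # squared semi-partial correlations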


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University



Re: [R] Regulatory Compliance and Validation Issues

2007-08-20 Thread Frank E Harrell Jr
Cody Hamilton wrote:
 Dear Thomas,
 
 Thank you for your reply.  You are of course quite right - the R Foundation 
 couldn't be responsible for any individually contributed package.
 
 I am curious as to how an organization operating in a regulated environment 
 could safely use a contributed package.  What if the author/maintainer 
 retires or loses interest in maintaining the package?  The organization would 
 then find itself in the awkward position of being reliant on software for 
 which there is no technical support and which may not be compatible with 
 future versions of the base R software.  I suppose the organization could 
 take responsibility for maintaining the individual functions within a package 
 on its own (one option made possible by the open source nature of R), but 
 this would require outstanding programming resources which the company may 
 not have (Thomas Lumleys are sadly rare).  In addition, as the organization 
 is claiming the functions as their own (and not as out-of-the-box software), 
 the level of required validation would be truly extraordinary.  I also wonder 
 if an everyone-maintain-their-own-copy approach could lead to multiple
 mutated versions of a package's functions across the R universe (e.g. Edwards' version of
 sas.get() vs. Company X's version of sas.get(), etc.).
 
 Regards,
-Cody

Cody,

I think of this issue as not unlike an organization using its own code 
written by its own analysts or SAS programmers.  Code is reused all the 
time.

Frank

 
 As always, I am speaking for myself and not necessarily for Edwards 
 Lifesciences.
 
 -Original Message-
 From: Thomas Lumley [mailto:[EMAIL PROTECTED]
 Sent: Sunday, August 19, 2007 8:50 AM
 To: Cody Hamilton
 Cc: r-help@stat.math.ethz.ch
 Subject: Re: [R] Regulatory Compliance and Validation Issues
 
 On Fri, 17 Aug 2007, Cody Hamilton wrote:
 
 snip
 I have a few specific comments/questions that I would like to present to
 the R help list.
 snip
 2. While the document's scope is limited to base R plus recommended
 packages, I believe most companies will need access to functionalities
 provided by packages not included in the base or recommended packages.
 (For example, I don't think I could survive without the sas.get()
 function from the Design library.)  How can a company address the issues
 covered in the document for packages outside its scope?  For example,
 what if a package's author does not maintain historical archive versions
 of the package?  What if the author no longer maintains the package?
 Is the solution to add more packages to the recommended list (I'm fairly
 certain that this would not be a simple process) or is there another
 solution?
 
 This will have to be taken up with the package maintainer.  The R
 Foundation doesn't have any definitive knowledge about, eg, Frank
 Harrell's development practices and I don't think the FDA would regard our
 opinions as relevant.
 
 Archiving, at least, is addressed by CRAN: all the previously released
 versions of packages are available
 
 3. At least at my company, each new version must undergo basically the
 same IQ/OQ/PQ as the first installation.  As new versions of R seem to
 come at least once a year, the ongoing validation effort would be
 painful if the most up-to-date version of R is to be maintained within
 the company.  Is there any danger in delaying the updates (say updating
 R within the company every two years or so)?
 
 It's worse than that: there are typically 4 releases of R per year (the
 document you are commenting on actually gives dates).  The ongoing
 validation effort may indeed be painful, and this was mentioned as an
 issue in the talk by David James and Tony Rossini.
 
 The question of what is missed by delaying updates can be answered by
 looking at the NEWS file. The question of whether it is dangerous is
 really an internal risk management issue for you.
 
 -thomas
 
 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University



Re: [R] Subject: Re: how to include bar values in a barplot?

2007-08-09 Thread Frank E Harrell Jr
[EMAIL PROTECTED] wrote:
 Greg, I'm going to join issue with you here! Not that I'll go near
 advocating Excel-style graphics (abominable, and the Patrick Burns
 URL which you cite is remarkable in its restraint). Also, I'm aware
 that this is potential flame-war territory --  again, I want to avoid
 that too.
 
 However, this is the second time you have intervened on this theme
 (previously Mon 6 August), along with John Kane on Wed 1 August and
 again today on similar lines, and I think it's time an alternative
 point of view was presented, to counteract (I hope usefully) what
 seems to be a draconianly prescriptive approach to the presentation
 of information.


---snip---

Ted,

You make many excellent points and provide much food for thought.  I 
still think that Greg's points are valid too, and in this particular 
case, bar plots are a bad choice and adding numbers at variable heights 
causes a perception error as I wrote previously.

Thanks for your elaboration on this important subject.

Frank

 
 On 07-Aug-07 21:37:50, Greg Snow wrote:
 Generally adding the numbers to a graph accomplishes 2 things:

 1) it acts as an admission that your graph is a failure
 
 Generally, I disagree. Different elements in a display serve different
 purposes, according to the psychological aspects of visual perception.

. . .

-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University



Re: [R] Subject: Re: how to include bar values in a barplot?

2007-08-09 Thread Frank E Harrell Jr
Gabor Grothendieck wrote:
 You could put the numbers inside the bars in which
 case it would not add to the height of the bar:

I think the Cleveland/Tufte prescription would be much different: 
horizontal dot charts with the numbers in the right margin.  I do this 
frequently with great effect.  The Hmisc dotchart2 function makes this easy.

Frank
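
A minimal sketch of that kind of display (made-up data; dotchart2's auxdata/auxtitle arguments print the values in the right margin):

library(Hmisc)
x <- c(a = 1, b = 2, c = 3, d = 4, e = 5)
dotchart2(x, labels = names(x), auxdata = x, auxtitle = 'Value', xlab = 'Value')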

 
 x <- 1:5
 names(x) <- letters[1:5]
 bp <- barplot(x)
 text(bp, x - .02 * diff(par('usr')[3:4]), x)
 
 On 8/9/07, Frank E Harrell Jr [EMAIL PROTECTED] wrote:
 [EMAIL PROTECTED] wrote:
 Greg, I'm going to join issue with your here! Not that I'll go near
 advocating Excel-style graphics (abominable, and the Patrick Burns
 URL which you cite is remarkable in its restraint). Also, I'm aware
 that this is potential flame-war territory --  again, I want to avoid
 that too.

 However, this is the second time you have intervened on this theme
 (previously Mon 6 August), along with John Kane on Wed 1 August and
 again today on similar lines, and I think it's time an alternative
 point of view was presented, to counteract (I hope usefully) what
 seems to be a draconianly prescriptive approach to the presentation
 of information.

 ---snip---

 Ted,

 You make many excellent points and provide much food for thought.  I
 still think that Greg's points are valid too, and in this particular
 case, bar plots are a bad choice and adding numbers at variable heights
 causes a perception error as I wrote previously.

 Thanks for your elaboration on this important subject.

 Frank

 On 07-Aug-07 21:37:50, Greg Snow wrote:
 Generally adding the numbers to a graph accomplishes 2 things:

 1) it acts as an admission that your graph is a failure
 Generally, I disagree. Different elements in a display serve different
 purposes, according to the psychological aspects of visual perception.
 . . .


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University



Re: [R] how to include bar values in a barplot?

2007-08-07 Thread Frank E Harrell Jr
Donatas G. wrote:
 How do I include bar values in a barplot (or other R graphics, where this 
 could be applicable)? 
 
 To make sure I am clear I am attaching a barplot created with OpenOffice.org 
 which has barplot values written on top of each barplot. 
 

This causes an optical illusion in which the viewer adds some of the 
heights of the numbers to the heights of the bars.  And in general dot 
charts are greatly preferred to bar plots, not the least because of 
their reduced ink:information ratio.

-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University



Re: [R] (Censboot, Z-score, Cox) How to use Z-score as the statistic within censboot?

2007-08-06 Thread Frank E Harrell Jr
David Lloyd wrote:
 Dear R Help list,
 
 My question is regarding extracting the standard error or Z-score from a
 cph or coxph call. My Cox model is: -
 
 modz <- cph(Surv(TSURV,STATUS) ~ RAGE+DAGE+REG_WTIME_M+CLD_ISCH+POLY_VS,
 data=kidneyT, method='breslow', x=TRUE, y=TRUE)

You are assuming linearity for all predictors, which is unlikely.  When 
you expand predictors into multiple parameters to allow for 
nonlinearity, or when you have factors with more than 2 levels, individual 
P-values will be meaningless.  You can get all Wald multiple d.f. test 
stats and P-values by saving the result of anova(modz) using 
anova.Design, which creates a matrix.

Frank
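
If the individual Wald statistics are still wanted, a rough sketch (both coxph and cph fits keep the covariance matrix in the $var component; treat the details as assumptions to check against str(modz)):

se <- sqrt(diag(modz$var))   # standard errors of the coefficients
z  <- coef(modz) / se        # Wald z-scores
# for coxph fits, summary(modz)$coefficients also has se(coef) and z columns
anova(modz)                  # for cph: matrix of Wald chi-square, d.f., P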

 
 I've used names(modz) but can't see anything that will let me extract
 the Z scores for each coefficient or the standard errors in the same way
 one can use coef (modz).
 
 I need to get either the se or Zscore as I am hoping to be able to use
 the Zscore within a statistic function used in censboot. I wish to find
 the proportion of times each of the variables in the cox model are
 significant out of the R=1000 bootstrap resamples.
 
 My current function for us with the bootstrap only gives me the
 coefficients: - 
 
 func <- function(kidneyT, indices)
 {
 
 kidneyT <- kidneyT[indices, ]
 modz <- coxph(Surv(TSURV,STATUS) ~ RAGE+DAGE+REG_WTIME_M+CLD_ISCH+POLY_VS,
 data=kidneyT, method='breslow')
 
 coefficients(modz)
 
 }
 
 
 I would like to be able to modify the following code so modz.cens.boot$t
 below gives me an array of the Z-scores for each variable and each
 resample - not the coefficients
 
 modz.cens.boot <- censboot(kidneyT, func, R=1000, sim='ordinary')
 modz.cens.boot
 
 
 I hope this is clear enough? I'm still getting to terms with the
 Bootstrap, having only just started trying to use it a few days ago.
 
 Any help on this would clear a massive headache for me as I've been
 going around in circles for hours and hours on this.
 
 Thx 
 
 DaveL
 
 
 
 
 
 
 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University



Re: [R] ROC curve in R

2007-08-04 Thread Frank E Harrell Jr
Dylan Beaudette wrote:
 On Thursday 26 July 2007 10:45, Frank E Harrell Jr wrote:
 Dylan Beaudette wrote:
 On Thursday 26 July 2007 06:01, Frank E Harrell Jr wrote:
 Note that even though the ROC curve as a whole is an interesting
 'statistic' (its area is a linear translation of the
 Wilcoxon-Mann-Whitney-Somers-Goodman-Kruskal rank correlation
 statistics), each individual point on it is an improper scoring rule,
 i.e., a rule that is optimized by fitting an inappropriate model.  Using
 curves to select cutoffs is a low-precision and arbitrary operation, and
 the cutoffs do not replicate from study to study.  Probably the worst
 problem with drawing an ROC curve is that it tempts analysts to try to
 find cutoffs where none really exist, and it makes analysts ignore the
 whole field of decision theory.

 Frank Harrell
 Frank,

 This thread has caught may attention for a couple reasons, possibly
 related to my novice-level experience.

 1. in a logistic regression study, where i am predicting the probability
 of the response being 1 (for example) - there exists a continuum of
 probability values - and a finite number of {1,0} realities when i either
 look within the original data set, or with a new 'verification' data set.
 I understand that drawing a line through the probabilities returned from
 the logistic regression is a loss of information, but there are times
 when a 'hard' decision requiring prediction of {1,0} is required. I have
 found that the ROCR package (not necessarily the ROC Curve) can be useful
 in identifying the probability cutoff where accuracy is maximized. Is
 this an unreasonable way of using logistic regression as a predictor?
 
 Thanks for the detailed response Frank. My follow-up questions are below:
 
 Logistic regression (with suitable attention to not assuming linearity
 and to avoiding overfitting) is a great way to estimate P[Y=1].  Given
 good predicted P[Y=1] and utilities (losses, costs) for incorrect
 positive and negative decisions, an optimal decision is one that
 optimizes expected utility.  The ROC curve does not play a direct role
 in this regard.  
 
 Ok.
 
 If per-subject utilities are not available, the analyst 
 may make various assumptions about utilities (including the unreasonable
 but often used assumption that utilities do not vary over subjects) to
 find a cutoff on P[Y=1]. 
 
 Can you elaborate on what exactly a per-subject utility is? In my case, I 
 am 
 trying to predict the occurrence of specific soil features based on two 
 predictor variables: 1 continuous, the other categorical.  Thus far my 
 evaluation of how well this method works is based on how often I can 
 correctly predict (a categorical) quality.

This could be called a per-unit utility in your case.  It is the 
consequence of decisions at the point in which you decide Y=0 or Y=1. 
If consequences are the same over all units, you just have to deal with 
the single ratio of cost of false positive to cost of false negative.

One way to limit bad consequences is to not make any decision when the 
predicted probability is in the middle, i.e., the decision is 'obtain 
more data'.  That is a real advantage of having a continuous risk estimate.

 
 
 A very nice feature of P[Y=1] is that error 
 probabilities are self-contained.  For example if P[Y=1] = .02 for a
 single subject and you predict Y=0, the probability of an error is .02
 by definition.  One doesn't need to compute an overall error probability
 over the whole distribution of subjects' risks.  If the cost of a false
 negative is C, the expected cost is .02*C in this example.
 
 Interesting. The hang-up that I am having is that I need to predict from
 {0,1}, as the direct users of this information are not currently interested
 in raw probabilities. As far as I know, in order to predict a class from a 
 probability I need use a cutoff... How else can I accomplish this without 
 imposing a cutoff on the entire dataset? One thought, identify a cutoff for 
 each level of the categorical predictor term in the model... (?)

You're right you have to ultimately use a cutoff (or better still, 
educate the users about the meaning of probabilities and let them make 
the decision without exposing the cutoff).  And see the comment 
regarding gray zones above.

 
 2. The ROC curve can be a helpful way of communicating false positives /
 false negatives to other users who are less familiar with the output and
 interpretation of logistic regression.
 What is more useful than that is a rigorous calibration curve estimate
 to demonstrate the faithfulness of predicted P[Y=1] and a histogram
 showing the distribution of predicted P[Y=1]
 
 Ok. I can make that histogram - how would one go about making the 'rigorous 
 calibration curve' ? Note that I have a training set, from which the model is 
 built, and a smaller testing set for evaluation. 

See the val.prob function in the Design package.  This assumes your test 
samples and training samples
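
A rough sketch of that validation step (hypothetical train/test data frames and model formula):

library(Design)
f    <- lrm(y ~ x1 + x2, data = train, x = TRUE, y = TRUE)
phat <- predict(f, newdata = test, type = 'fitted')  # predicted P[Y=1] on the test set
val.prob(phat, test$y)   # calibration curve plus summary indexes (Brier score, C, etc.)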

Re: [R] proportional odds model

2007-08-02 Thread Frank E Harrell Jr
Ramon Martínez Coscollà wrote:
 Hi all!!
 
 I am using a proportional odds model to study some ordered categorical
 data. I am trying to predict one ordered categorical variable taking
 into account only another categorical variable.
 
 I am using polr from the R MASS library. It seems to work ok, but I'm
 still getting familiar and I don't know how to assess goodness of fit.
 I have this output, when using response ~ independent variable:
 
 Residual Deviance: 327.0956
 AIC: 333.0956
 polr.out$df.residual
 [1] 278
 polr.out$edf
 [1] 3
 
 When taking out every variable... (i.e., making formula: response ~ 1), I 
 have:
 
 Residual Deviance: 368.2387
 AIC: 372.2387
 
 How can I test if the model fits well? How can I check that the
 independent variable effectively explains the model? Is there any
 test?
 
 Moreover, sendig summary(polr.out) I get this error:
 
 
 Error in optim(start, fmin, gmin, method = "BFGS", hessian = Hess, ...) :
 initial value in 'vmmin' is not finite
 
 Something to do with the optimitation procedure... but, how can I fix
 it? Any help would be greatly appreciated.
 
 Thanks.

You might also look at lrm and residuals.lrm in the Design package, 
which provides partial and score residual plots to check PO model 
assumptions.
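
A minimal sketch of those checks (hypothetical variable names; the fit needs x=TRUE, y=TRUE, and see ?residuals.lrm for the exact options):

library(Design)
f <- lrm(response ~ predictor, data = mydata, x = TRUE, y = TRUE)
resid(f, 'score.binary', pl = TRUE)   # binary score residuals, one panel per cutoff
resid(f, 'partial', pl = TRUE)        # partial residual plots for the PO assumption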

-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University



Re: [R] logistic regression

2007-07-26 Thread Frank E Harrell Jr
Mary,

The 10-group approach results in a low-resolution and fairly arbitrary 
calibration curve.  Also, it is the basis of the original 
Hosmer-Lemeshow goodness of fit statistic, which has been superseded by 
the Hosmer et al single degree of freedom GOF test that does not require 
any binning.  The Design package handles both.  Do ?calibrate.lrm, 
?residuals.lrm, ?lrm for details.

Frank Harrell
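
A minimal sketch of the binning-free approach (hypothetical data frame and variable names):

library(Design)
f   <- lrm(y ~ x1 + x2, data = d, x = TRUE, y = TRUE)
cal <- calibrate(f, B = 200)   # bootstrap overfitting-corrected calibration curve
plot(cal)
resid(f, 'gof')                # single d.f. unweighted sum-of-squares GOF test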


Sullivan, Mary M wrote:
 Greetings,
  
 
 I am working on a logistic regression model in R and I am struggling with the 
 code, as it is a relatively new program for me.  In searching Google for 
 'logistic regression diagnostics' I came across Elizabeth Brown's Lecture 14 from 
 her Winter 2004 Biostatistics 515 course 
 (http://courses.washington.edu/b515/l14.pdf).  I found most of the code to 
 be very helpful, but I am struggling with the lines used to calculate the 
 observed and expected values in the 10 groups created by the cut function.  I 
 get error messages in trying to create the E and O matrices:  R won't accept 
 assignment of fitc==j and it won't calculate the sum.
 
  
 
 I am wondering whether someone might be able to offer me some assistance...my 
 search of the archives was not fruitful.
 
  
 
 Here is the code that I adapted from the lecture notes:
 
  
 
 fit <- fitted(glm.lyme)
 
 fitc <- cut(fit, br = c(0, quantile(fit, p = seq(.1, .9, .1)), 1))
 
 t <- table(fitc)
 
 fitc <- cut(fit, br = c(0, quantile(fit, p = seq(.1, .9, .1)), 1), labels = F)
 
 t <- table(fitc)
 
  
 
 #Calculate observed and expected values in ea group
 
 E <- matrix(0, nrow=10, ncol=2)
 
 O <- matrix(0, nrow=10, ncol=2)
 
 for (j in 1:10) {
 
   E[j, 2] = sum(fit[fitc==j])
 
   E[j, 1] = sum((1- fit)[fitc==j])
 
   O[j, 2] = sum(pcdata$lymdis[fitc==j])
 
   O[j, 1] = sum((1-pcdata$lymdis)[fitc==j])
 
 
 
 }
 
  
 
 Here is the error message:  Error in Summary.factor(..., na.rm = na.rm) : 
 sum not meaningful for factors
 
  
 
  
 
 I understand what it means; I just can't figure out how to get around it or 
 how to get the output printed in table form.  Thank you in advance for any 
 assistance.
 
  
 
 Mary Sullivan
 
 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University



Re: [R] ROC curve in R

2007-07-26 Thread Frank E Harrell Jr
Note that even though the ROC curve as a whole is an interesting 
'statistic' (its area is a linear translation of the 
Wilcoxon-Mann-Whitney-Somers-Goodman-Kruskal rank correlation 
statistics), each individual point on it is an improper scoring rule, 
i.e., a rule that is optimized by fitting an inappropriate model.  Using 
curves to select cutoffs is a low-precision and arbitrary operation, and 
the cutoffs do not replicate from study to study.  Probably the worst 
problem with drawing an ROC curve is that it tempts analysts to try to 
find cutoffs where none really exist, and it makes analysts ignore the 
whole field of decision theory.

Frank Harrell


[EMAIL PROTECTED] wrote:
 http://search.r-project.org/cgi-bin/namazu.cgi?query=ROC&max=20&result=normal&sort=score&idxname=Rhelp02a&idxname=functions&idxname=docs
 
 there is a lot of help; trying help.search("ROC curve") gave
 Help files with alias or concept or title matching 'ROC curve' using fuzzy 
 matching:
 
 
 
 granulo(ade4) Granulometric Curves
 plot.roc(analogue)Plot ROC curves and associated 
 diagnostics
 roc(analogue) ROC curve analysis
 colAUC(caTools)   Column-wise Area Under ROC Curve 
 (AUC)
 DProc(DPpackage)  Semiparametric Bayesian ROC 
 curve analysis
 cv.enet(elasticnet)   Computes K-fold cross-validated 
 error curve for elastic net
 ROC(Epi)  Function to compute and draw 
 ROC-curves.
 lroc(epicalc) ROC curve
 cv.lars(lars) Computes K-fold cross-validated 
 error curve for lars
 roc.demo(TeachingDemos)   Demonstrate ROC curves by 
 interactively building one
 
 HTH
 see the help and examples those will suffice
 
 Type 'help(FOO, package = PKG)' to inspect entry 'FOO(PKG) TITLE'.
 
 
 
 Regards,
 
 Gaurav Yadav
 +++
 Assistant Manager, CCIL, Mumbai (India)
 Mob: +919821286118 Email: [EMAIL PROTECTED]
 Bhagavad Gita:  Man is made by his Belief, as He believes, so He is
 
 
 
 Rithesh M. Mohan [EMAIL PROTECTED] 
 Sent by: [EMAIL PROTECTED]
 07/26/2007 11:26 AM
 
 To
 R-help@stat.math.ethz.ch
 cc
 
 Subject
 [R] ROC curve in R
 
 
 
 
 
 
 Hi,
 
  
 
 I need to build ROC curve in R, can you please provide data steps / code
 or guide me through it.
 
  
 
 Thanks and Regards
 
 Rithesh M Mohan
 
 
 
-
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University



Re: [R] ROC curve in R

2007-07-26 Thread Frank E Harrell Jr
Dylan Beaudette wrote:
 On Thursday 26 July 2007 06:01, Frank E Harrell Jr wrote:
 Note that even though the ROC curve as a whole is an interesting
 'statistic' (its area is a linear translation of the
 Wilcoxon-Mann-Whitney-Somers-Goodman-Kruskal rank correlation
 statistics), each individual point on it is an improper scoring rule,
 i.e., a rule that is optimized by fitting an inappropriate model.  Using
 curves to select cutoffs is a low-precision and arbitrary operation, and
 the cutoffs do not replicate from study to study.  Probably the worst
 problem with drawing an ROC curve is that it tempts analysts to try to
 find cutoffs where none really exist, and it makes analysts ignore the
 whole field of decision theory.

 Frank Harrell
 
 Frank,
 
 This thread has caught may attention for a couple reasons, possibly related 
 to 
 my novice-level experience. 
 
 1. in a logistic regression study, where i am predicting the probability of 
 the response being 1 (for example) - there exists a continuum of probability 
 values - and a finite number of {1,0} realities when i either look within the 
 original data set, or with a new 'verification' data set. I understand that 
 drawing a line through the probabilities returned from the logistic 
 regression is a loss of information, but there are times when a 'hard' 
 decision requiring prediction of {1,0} is required. I have found that the 
 ROCR package (not necessarily the ROC Curve) can be useful in identifying the 
 probability cutoff where accuracy is maximized. Is this an unreasonable way 
 of using logistic regression as a predictor? 

Logistic regression (with suitable attention to not assuming linearity 
and to avoiding overfitting) is a great way to estimate P[Y=1].  Given 
good predicted P[Y=1] and utilities (losses, costs) for incorrect 
positive and negative decisions, an optimal decision is one that 
optimizes expected utility.  The ROC curve does not play a direct role 
in this regard.  If per-subject utilities are not available, the analyst 
may make various assumptions about utilities (including the unreasonable 
but often used assumption that utilities do not vary over subjects) to 
find a cutoff on P[Y=1].  A very nice feature of P[Y=1] is that error 
probabilities are self-contained.  For example if P[Y=1] = .02 for a 
single subject and you predict Y=0, the probability of an error is .02 
by definition.  One doesn't need to compute an overall error probability 
over the whole distribution of subjects' risks.  If the cost of a false 
negative is C, the expected cost is .02*C in this example.

 
 2. The ROC curve can be a helpful way of communicating false positives / 
 false 
 negatives to other users who are less familiar with the output and 
 interpretation of logistic regression. 

What is more useful than that is a rigorous calibration curve estimate 
to demonstrate the faithfulness of predicted P[Y=1] and a histogram 
showing the distribution of predicted P[Y=1].  Models that put a lot of 
predictions near 0 or 1 are the most discriminating.  Calibration curves 
and risk distributions are easier to explain than ROC curves.  Too often 
a statistician will solve for a cutoff on P[Y=1], imposing her own 
utility function without querying any subjects.

 
 
 3. I have been using the area under the ROC Curve, kendall's tau, and cohen's 
 kappa to evaluate the accuracy of a logistic regression based prediction, the 
 last two statistics based on a some probability cutoff identified before 
 hand. 

ROC area (equiv. to Wilcoxon-Mann-Whitney and Somers' Dxy rank 
correlation between pred. P[Y=1] and Y) is a measure of pure 
discrimination, not a measure of accuracy per se.  Rank correlation 
(concordance) measures do not require the use of cutoffs.

 
 
 How does the topic of decision theory relate to some of the circumstances 
 described above? Is there a better way to do some of these things?

See above re: expected loses/utilities.

Good questions.

Frank
 
 Cheers,
 
 Dylan
 
 
 
 [EMAIL PROTECTED] wrote:
 http://search.r-project.org/cgi-bin/namazu.cgi?query=ROC&max=20&result=normal&sort=score&idxname=Rhelp02a&idxname=functions&idxname=docs

 there is a lot of help; trying help.search("ROC curve") gave
 Help files with alias or concept or title matching 'ROC curve' using
 fuzzy matching:



 granulo(ade4) Granulometric Curves
 plot.roc(analogue)Plot ROC curves and associated
 diagnostics
 roc(analogue) ROC curve analysis
 colAUC(caTools)   Column-wise Area Under ROC
 Curve (AUC)
 DProc(DPpackage)  Semiparametric Bayesian ROC
 curve analysis
 cv.enet(elasticnet)   Computes K-fold cross-validated
 error curve for elastic net
 ROC(Epi)  Function to compute and draw
 ROC-curves.
 lroc(epicalc) ROC curve
 cv.lars(lars

Re: [R] Constructing bar charts with standard error bars

2007-07-25 Thread Frank E Harrell Jr
John Zabroski wrote:
 I am new to R.
 
 I want to graph group data using a Traditional Bar Chart with Standard
 Error Bar, like the kind shown here:
 http://samiam.colorado.edu/~mcclella/ftep/twoGroups/twoGroupGraphs.html

There are severe problems with dynamite plots such as these.  See 
http://biostat.mc.vanderbilt.edu/DynamitePlots for a list of problems 
and solutions.

Frank
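
One common alternative, as a minimal sketch using the testdata object defined below (Hmisc's errbar draws the means as points with error bars rather than bars):

library(Hmisc)
with(testdata, errbar(group, xbar, xbar + se, xbar - se,
                      xlab = 'Group', ylab = 'xbar'))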

 
 Is there a simple way to do this?
 
 So far, I have only figured out how to plot the bars using barplot.
 
 testdata <- scan("", list(group=0, xbar=0, se=0))
 400 0.36038 0.02154
 200 0.35927 0.02167
 100 0.35925 0.02341
 50 0.35712 0.01968
 25 0.35396 0.01931
 
 barplot(testdata$xbar, names.arg=as.character(testdata$group), main='a=4.0',
 xlab='Group', ylab='xbar')
 xvalues <- c(0.7, 1.9, 3.1, 4.3, 5.5)
 arrows(xvalues, testdata$xbar, xvalues, testdata$xbar+testdata$se, length=0.4,
 angle=90, code=3)
 
 
 The best clue I have so far is Rtips #5.9:
 http://pj.freefaculty.org/R/Rtips.html#5.9 which is what I based my present
 solution off of.
 
 However, I do not understand how this works.  It seems like there is no
 concrete way to determine the arrow drawing parameters x0 and x1 for a
 barplot.  Moreover, the bars seem to be cut off.
 
 
 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University



Re: [R] Subscript out of bounds when using datadist() from Design library

2007-07-25 Thread Frank E Harrell Jr
Cody Hamilton wrote:
 I am running R version 2.4.1 on Windows XP.  I have a question regarding the 
 datadist() function from the Design library.  I have a data.frame (call it 
 my.data) with 4 columns.  When I submit the code
 
 datadist(data=my.data)

You can just use
dd <- datadist(my.data); options(datadist='dd')

If that doesn't fix it, generate a test dataset that fails or do 
save(..., compress=TRUE) and send me your dataset so I can debug.

Frank

 
 I get the following error message:
 
 Error in X[[1]] : subscript out of bounds
 
 I suspect there may be something wrong with my data.frame (I'm certain there 
 is nothing wrong with datadist()), but I don't know what.  Has anyone 
 experienced the mistake I seem to be making?
 
 Regards,
 Cody Hamilton
 Edwards Lifesciences
 
 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University



Re: [R] Off-topic: simulate time dependent covariates

2007-07-22 Thread Frank E Harrell Jr
Jin Lo wrote:
 Dear R friends,
 
 this is an off-topic question. Could you please point
 me in ways (e.g., references or even R code) for
 simulating time varying covariates in a survival
 analysis setting.
 
 Thanks in advance for any responses.
 
 yours sincerely,
 
 Jin

I'm not sure if this article addresses time-dependent covariates but it 
may be worth a look:

@Article{ben05gen,
  author  = {Bender, Ralf and Augustin, Thomas and Blettner, Maria},
  title   = {Generating survival times to simulate {Cox} proportional hazards models},
  journal = {Stat in Med},
  year    = 2005,
  volume  = 24,
  pages   = {1713-1723},
  annote  = {simulation setup; simulating survival times; censoring; Cox model; errata 25:1978-1979}
}

The spower function in the Hmisc package can simulate data for a 
non-constant-hazard ratio treatment effect.  The user specifies the 
hazard ratio as a function of time.
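
For the simplest (time-fixed covariate) case, a rough sketch of the inverse-CDF idea used in that literature, assuming an exponential baseline hazard (all names and values made up):

n       <- 200
lambda0 <- 0.1                 # baseline hazard
beta    <- log(2)              # true log hazard ratio
x       <- rbinom(n, 1, 0.5)   # a single binary covariate
u       <- runif(n)
t       <- -log(u) / (lambda0 * exp(beta * x))   # invert S(t|x) at u
cens    <- runif(n, 0, 15)                       # uniform censoring times
time    <- pmin(t, cens)
status  <- as.numeric(t <= cens)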

-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University



Re: [R] exclude1 in summary.formula from Hmisc

2007-07-20 Thread Frank E Harrell Jr
david dav wrote:
 Dear Users,
 Still having trouble with sharing my results, I'm trying to display a
 contingency table using summary.formula.
 Computation works well, but I'd like to display information on redundant
 entries, say males AND females.

Please tell us what output you got.  And it is sometimes helpful to get 
a trivial example of the failure with simulated data.

Also see a related parameter long.

Frank

 I wonder if this code is correct :
 
 desqualjum <- summary.formula(varqual1$X ~ .,  data = subset(varqual1,
 select = -X), method = 'reverse',  overall = T, test = T, exclude1 = FALSE)
 where varqual1 is the data frame containing the variable sex.
 
 I'm using R 2.5.1 and Hmisc 3.4-2 on a Mac (OSX.4, intel)  but I had the
 same trouble with  former versions of R and the package.
 
 Thank you for your help.
 David
 
   [[alternative HTML version deleted]]
 
 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University



Re: [R] exclude1 in summary.formula from Hmisc

2007-07-20 Thread Frank E Harrell Jr
david dav wrote:
 Here is a piece of the data and code:
 
 sex <- c(2,2,1,1,1,1,1,1,1,2,1,1,1,1,1,2,2,1,1,2,2)
 AGN <-
 c('C','C','C','C','C','A','A','C','B','B','C','C','C','C','C','C','B','B','C','C','C')
 
 
 X <- c(2,2,2,2,1,2,2,2,2,2,1,1,1,1,1,1,1,1,1,2,2)
 
 varqual <- data.frame(sex, AGN, X)
 
 desqual <- summary.formula(varqual$X ~ .,  data = subset(varqual, select
 = -X), method = 'reverse',  overall = T, test = T, long = T, exclude1 = F)
 
 desqual doesn't show the results for sex == 1, as it is redundant.
 I also tried long = T, which didn't change anything here.

Oh yes.  exclude1 is not an argument to summary.formula but is an 
argument to the print, plot, and latex methods.  So do print(desqual, 
exclude1=FALSE, long=TRUE).  Thanks for the reproducible example.

Note that you can just say summary() and don't need to type out summary.formula.

 
 Bonus question, if I may:
 This function is a little bit complex for me and I couldn't figure out
 how to get either Yates' continuity correction or Fisher's exact test instead of
 the chi-square test. Does it require a lot of programming?

Neither of those is recommended so they are not automatically supported. 
  But users can add their own test functions - see the help file for the 
catTest argument to summary.formula.  Fisher's test is conservative. 
The Yates' continuity correction tries to mimic the conservatism of 
Fisher's test.  I don't like to get larger P-values when I can avoid it. 
  And the recommendations about worrying about the chi-square 
approximation when some cell sizes are small are a bit overdone.  Better 
might be to use the likelihood ratio tests, and many other tests are 
available.

Frank
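
If you do want Fisher's exact test anyway, a rough sketch of a user-supplied catTest (the required return components are documented under catTest in ?summary.formula; treat the exact list structure below as an assumption to verify there):

fisherTest <- function(tab) {
  ft <- fisher.test(tab)
  list(P = ft$p.value, stat = NA, df = NA,
       testname = "Fisher's exact test", statname = '',
       latexstat = '', plotmathstat = '')
}
desqual <- summary(varqual$X ~ ., data = subset(varqual, select = -X),
                   method = 'reverse', test = TRUE, catTest = fisherTest)
print(desqual, exclude1 = FALSE, long = TRUE)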

 
 Regards.
 
 David


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University



Re: [R] Package for .632 (and .632+) bootstrap and the cross-validation of ROC Parameters

2007-07-13 Thread Frank E Harrell Jr
spime wrote:
 Suppose I have
 
 Training data: my.train
 Testing data: my.test

The bootstrap does not need split samples.

 
 I want to calculate bootstrap error rate for logistic model. My wrapper
 function for prediction
 
 pred.glm <- function(object, newdata) {
 ret <- as.factor(ifelse(predict.glm(object, newdata,
 type='response') < 0.4, 0, 1))
 return(ret)
 }
 
 But I can't work out what should go in the data argument in the following
 formula if I want to calculate the misclassification error for my testing data.

Misclassification error has many problems because it is not a proper 
scoring rule, i.e., it is optimized by bogus models.

Frank

 
 errorest(RES ~ ., data=???, model=glm, estimator='boot', predict=pred.glm,
est.para=control.errorest(nboot = 10))
 
 Using my.test got following error,
 
 Error in predict(mymodel, newdata = outbootdata) : 
 unused argument(s) (newdata = list(RES = c(1, 0, 0, 0, 1, 0, 0, 0,
 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1,
 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0,
 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1,
 0), CAT01 = c(4, 4, 2, 4, 4, 4, 4, 4, 4, 2, 1, 2, 2, 4, 4, 4, 1, 1, 2, 2, 1,
 4, 1, 4, 1, 4, 2, 4, 1, 4, 2, 3, 1, 1, 3, 3, 4, 2, 4, 2, 1, 2, 2, 1, 1, 
 
 please reply...
 
 
 
 
 
 
 Frank E Harrell Jr wrote:
 spime wrote:
 Hi users,

 I need to calculate .632 (and .632+) bootstrap and the cross-validation
 of
 area under curve (AUC) to compare my models. Is there any package for the
 same. I know about 'ipred' and using it i can calculate misclassification
 errors. 

 Please help. It's urgent. 
 See the validate* functions in the Design package.

 Note that some simulations (see http://biostat.mc.vanderbilt.edu/rms) 
 indicate that the advantages of .632 and .632+ over the ordinary 
 bootstrap are highly dependent on the choice of the accuracy measure 
 being validated.  The bootstrap variants seem to have advantages mainly 
 if an improper, inefficient, discontinuous scoring rule such as the 
 percent classified correct is used.

 -- 
 Frank E Harrell Jr   Professor and Chair   School of Medicine
   Department of Biostatistics   Vanderbilt University

 __
 R-help@stat.math.ethz.ch mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.


 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Package for .632 (and .632+) bootstrap and the cross-validation of ROC Parameters

2007-07-12 Thread Frank E Harrell Jr
spime wrote:
 
 Hi users,
 
 I need to calculate .632 (and .632+) bootstrap and the cross-validation of
 area under curve (AUC) to compare my models. Is there any package for the
 same. I know about 'ipred' and using it i can calculate misclassification
 errors. 
 
 Please help. It's urgent. 

See the validate* functions in the Design package.

Note that some simulations (see http://biostat.mc.vanderbilt.edu/rms) 
indicate that the advantages of .632 and .632+ over the ordinary 
bootstrap are highly dependent on the choice of the accuracy measure 
being validated.  The bootstrap variants seem to have advantages mainly 
if an improper, inefficient, discontinuous scoring rule such as the 
percent classified correct is used.
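
A minimal sketch of that, assuming a binary outcome y and predictors x1, x2 
in a data frame d (names are placeholders):

library(Design)
f <- lrm(y ~ x1 + x2, data = d, x = TRUE, y = TRUE)  # keep design matrix and response for resampling
v <- validate(f, method = "boot", B = 200)           # or method = ".632"
v["Dxy", ]                                           # optimism-corrected Dxy; AUC = (Dxy + 1)/2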

-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] variable and value labels

2007-07-11 Thread Frank E Harrell Jr
Steve Powell wrote:
 Dear list
 I am new here - enjoying the power of R compared to SPSS.
 Looking for sets of tips and tricks for people with old SPSS habits.
 In particular, would like to know an easy way to set variable labels 
 across a dataset and to set value labels for sets of variables.
 Grateful for any help,
 Steve Powell

In the Hmisc package see the functions spss.get, label, upData, and 
describe.
Frank

-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Incidence estimated from Kaplan-Meier

2007-07-05 Thread Frank E Harrell Jr
Nguyen Dinh Nguyen wrote:
 Dear all,
 
 I have a stat question that may not be related to R, but I would like to
 have your advice.
 
  
 
 I have just read a medical paper in which the authors report the 1-p (where
 p is the cumulative survival probability from the Kaplan Meier curve) as
 incidence of disease.  
 
  
 
 Specifically, the study followed ~12000 women on drug A and ~2 women on
 drug B for 12 months.  During that period 29 women on drug A and 80 on drug
 B had the disease.  The incidence of disease for A and B was 0.24% and 0.30%
 respectively.  However, instead of reporting these numbers, they report the
 1-p figure which was 0.3% for A and 0.6% for B. 
 
  
 
 So, the incidence from 1-p was substantially higher than the actual
 incidence.  My question is: is it appropriate to use 1-p estimated from
 Kaplan-Meier as the incidence of disease?  If not, why not? 
 
  
 
 Regards,
 
 Nguyen

Yes it's appropriate, and it makes you state the cumulative incidence by 
time t rather than leaving time unspecified.  In your example it is 
likely that all women weren't followed completely, so simple incidences 
are not appropriate to compute because the denominator is not constant.
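
A minimal sketch of reading off the cumulative incidence at a fixed time, 
assuming a data frame d with placeholder variables time (months), status, 
and drug:

library(survival)
fit <- survfit(Surv(time, status) ~ drug, data = d)
s12 <- summary(fit, times = 12)
data.frame(group = s12$strata, cuminc.12mo = 1 - s12$surv)  # 1 - S(12) per group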

Frank

 
  
 
  
 Nguyen Dinh Nguyen, 
 
 Bone and Mineral Research Program 
 Garvan Institute of Medical Research 
 St Vincent's Hospital 
 384 Victoria Street, Darlinghurst 
 Sydney, NSW 2010 
 Australia 
 Tel; 61-2-9295 8274 
 Fax: 61-2-9295 8241 
 E-mail: [EMAIL PROTECTED] 
 
  
 
 
   [[alternative HTML version deleted]]
 
 __
 R-help@stat.math.ethz.ch mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.
 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Formula syntax question

2007-07-03 Thread Frank E Harrell Jr
Isaac Kohane wrote:
 Forgive me if this is obvious:
 
   I have a frame of data with the variables in each column (e.g.  
 Discrete_Variable1, ContinuousVariable_1, ContinuousVariable_2,  ...   
 ContinuousVariable_n)
 
   and I want to create a model using lrm i.e.
   model - lrm(Discrete_Variable1 ~ ContinuousVariable_1,  
 data=lotsofdata)
 
   Is there a syntax for having all the continuous variables referenced  
 in the formula without having to enumerate them all?
 
   I've seen the ~ . notation but when I try
 
 
   model - lrm(Discrete_Variable1 ~  ., data=lotsofdata)
 
   I get this error:
 
   Error in terms.formula(formula, specials = strat) :
   '.' in formula and no 'data' argument
   
 
   Any help is appreciated.
 
 -Zak

It may be best to write a function to determine what is continuous (>= 
10 unique values for example, and numeric) and to run sapply on that 
function, over your data frame.  Then you could use lrm(y ~ ., 
data=mydata[continuous]) if it were not for a problem with lrm which 
Charles Thomas Dupont (the Design package maintainer) and I will work 
on.  Until then you can write a command to compose a formula, e.g.,

form <- as.formula(paste('y', paste(names(mydata)[continuous], 
collapse='+'), sep='~'))
lrm(form, data=mydata)
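
A minimal sketch of the "determine what is continuous" step, assuming the 
response is named y and sits in mydata along with the candidate predictors:

continuous <- sapply(mydata, function(x) is.numeric(x) && length(unique(x)) >= 10)
continuous["y"] <- FALSE   # keep the response off the right-hand side
# then build form as above and call lrm(form, data = mydata)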


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Harrell's C

2007-07-03 Thread Frank E Harrell Jr
Madero Jarabo, Rosario wrote:
 I need to calculate Harrell's C for some survival analyses using the Design 
 package with R version 2.4.1. How can I do it?
 
 Rosario Madero
 Sección de Bioestadística
 Hospital Universitario La Paz
 Pºde la Castellana, 261
 28046 Madrid, España
 Tfno: 917277112
 [EMAIL PROTECTED]

It's in the documentation for the Hmisc package.  Type
?rcorr.cens
?rcorrp.cens
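
A minimal sketch, assuming a cph or coxph fit f and placeholder vectors 
time and status:

library(Hmisc)
library(survival)
S <- Surv(time, status)
rcorr.cens(-predict(f), S)["C Index"]  # negate the linear predictor so larger values mean longer survival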

Frank

-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] logistic regression and dummy variable coding

2007-06-29 Thread Frank E Harrell Jr
Bingshan Li wrote:
 Hi All,
 
 Now it works. Thanks for all your answers and the explanations are  
 very clear.
 
 Bingshan

But note that you are not using R correctly unless you are doing a 
simulation and have some special speed issues.  Let the model functions 
do all this for you.
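
A minimal sketch of letting the model function do the coding, assuming a 
data frame d with a binary outcome y and a categorical predictor grp 
(placeholder names):

d$grp <- relevel(factor(d$grp), ref = "b")       # make 'b' the reference level
fit <- glm(y ~ grp, data = d, family = binomial)
summary(fit)                                     # coefficients are log odds ratios vs. level 'b'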

Frank

 
 On Jun 28, 2007, at 7:44 PM, Seyed Reza Jafarzadeh wrote:
 
 NewVar <- relevel(factor(OldVar), ref = "b")
 should create a dummy variable, and change the reference category  
 for the model.

 Reza


 On 6/28/07, Bingshan Li [EMAIL PROTECTED] wrote:
 Hello everyone,

 I have a variable with several categories and I want to convert this
 into dummy variables and do logistic regression on it. I used
 model.matrix to create dummy variables but it always picked the
 smallest one as the reference. For example,

 model.matrix(~.,data=as.data.frame(letters[1:5]))

 will code 'a' as '0 0 0 0'. But I want to code another category as
 reference, say 'b'. How to do it in R using model.matrix? Is there
 other way to do it if model.matrix  has no such functionality?

 Thanks!



 [[alternative HTML version deleted]]

 __
 R-help@stat.math.ethz.ch mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting- 
 guide.html
 and provide commented, minimal, self-contained, reproducible code.

 
 __
 R-help@stat.math.ethz.ch mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.
 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Warning message in Design ...

2007-06-29 Thread Frank E Harrell Jr
Petar Milin wrote:
 Hello!
 There were some questions regarding warning messages in Design library,
 but no answer how to fix them. What about:
 
 use of storage.mode(x) <- "single" is deprecated: use mode<- instead
 
 What does it mean? How to use properly simple model like:
 
 m1 = ols(Y ~ A + B + C, data = someData)
 
 After specifying the model above, warning message to use mode(x)
 instead of storage.mode(x) appears.
 
 Best,
 PM

It's just a warning to be ignored.  It will go away in a future release 
of Design.

Frank

 
 __
 R-help@stat.math.ethz.ch mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.
 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Marginal Effects of continuous variable in lrm model (Design package)

2007-06-26 Thread Frank E Harrell Jr
Minyu Chen wrote:
 Dear all:
 
 When I am trying to get the marginal effects:
 
 summary(result7, adv_inc_ratio=mean(m9201$adv_inc_ratio),
         adv_price_ratio=mean(m9201$adv_price_ratio),
         ...(SOME MORE CONTINUOUS AND DISCRETE VARIABLES BUT I AM NOT LISTING)...,
         regW=c(0,mean(m9201$regW),1), regWM=c(0,mean(m9201$regWM),1))
 
 It gave out an error message:
 
 Error in summary.Design(result7, adv_inc_ratio = mean(m9201$adv_inc_ratio),  :
   ranges not defined here or with datadist for adv_inc_ratio
 adv_price_ratio agem1 change2 yr1Libor yr1Libor1mtLag yr1Libor1yrLag
 change21yrLag change21mtLag fwdReal4yr fwdInfl4yr
 
 But I remember from my previous operation a few months ago (I  
 recorded the commands) that to summary the marginal effect, I don't  
 have to specify the ranges for the discrete variables. However, I use  
 the command again (with slight modification because of the newly  
 added variables) it doesn't work. I don't know what went wrong.
 
 Thank you very much.
 
 Thanks,
 Minyu Chen

Please read the package's help files especially about the datadist function.
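
A minimal sketch of the datadist setup that the error message points to, 
assuming your data frame m9201:

library(Design)
dd <- datadist(m9201)
options(datadist = "dd")
# refit result7 (or just re-run summary) once the datadist is in effect;
# summary(result7) then finds default ranges for the continuous predictors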

Frank


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] help on the use of ldBand

2007-06-22 Thread Frank E Harrell Jr
Tomas Goicoa wrote:
   Hi R Users,
 
   I am trying to use the ldBand package. Together 
 with the package, I have downloaded the ld98 
 program (version for windows) as indicated in the 
 help page on ldBand. I did it, but obtained an 
 error message Error in (head + 1):length(w) : 
 Argument NA/NaN when I copied the help examples, 
 so it seems that a conection between R and ld98 
 is not well performed in my computer.  Did I put 
 ld98.exe in the wrong place? If so,  where should 
 I put it? Thanks a lot in advance,
 
 
   Berta Ibañez Beroiz

Do you mean the Hmisc package?  Do you mean the ldBands function?  Did 
you put ld98.exe in a place that is in your system path as specified in 
the ldBands help file?  And please read the posting guide referenced below.

Frank

 
 __
 R-help@stat.math.ethz.ch mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.
 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] BIC and Hosmer-Lemeshow statistic for logistic regression

2007-06-19 Thread Frank E Harrell Jr
spime wrote:
 
 I haven't find any helpful thread. How can i calculate BIC and
 Hosmer-Lemeshow statistic for a logistic regression model. I have used glm
 for logistic fit.

See the Design package's lrm function and residuals.lrm for a better GOF 
test.
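
A minimal sketch, assuming a binary outcome y and predictors x1, x2 in a 
data frame d (placeholder names):

library(Design)
f <- lrm(y ~ x1 + x2, data = d, x = TRUE, y = TRUE)  # x, y needed for residuals
resid(f, "gof")   # le Cessie-van Houwelingen global goodness-of-fit test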

-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] BIC and Hosmer-Lemeshow statistic for logistic regression

2007-06-19 Thread Frank E Harrell Jr
spime wrote:
 
 Is there any windows version of Design package???

Soon the new version will make its way to Windows, probably in a 
day or two.

Frank

 
 
 
 
 
 
 Frank E Harrell Jr wrote:
 spime wrote:
 I haven't find any helpful thread. How can i calculate BIC and
 Hosmer-Lemeshow statistic for a logistic regression model. I have used
 glm
 for logistic fit.
 See the Design package's lrm function and residuals.lrm for a better GOF 
 test.



 -- 
 Frank E Harrell Jr   Professor and Chair   School of Medicine
   Department of Biostatistics   Vanderbilt University

 __
 R-help@stat.math.ethz.ch mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.




__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] psm/survreg coefficient values ?

2007-06-18 Thread Frank E Harrell Jr
sj wrote:
 I am using psm to model some parametric survival data, the data is for
 length of stay in an emergency department. There are several ways a
 patient's stay in the emergency department can end (discharge, admit, etc..)
 so I am looking at modeling the effects of several covariates on the various
 outcomes. Initially I am trying to fit a  survival model for each type of
 outcome using the psm function in the design package,  i.e., all  patients
 who's visits  come to an end  due to  any event other than the event of
 interest are considered to be censored.  Being new to the psm and  survreg
 packages (and to parametric survival modeling) I am not entirely sure how to
 interpret the coefficient values that psm returns. I have included the
 following code to illustrate code similar to what I am using on my data. I
 suppose that the coefficients are somehow rescaled , but I am not sure how
 to return them to the original scale and make sense out of the coefficients,
 e.g., estimate the the effect of higher acuity on time to event in minutes.
 Any explanation or direction on how to interpret the  coefficient values
 would be greatly appreciated.
 
 this is from the documentation for survreg.object.
 coefficientsthe coefficients of the linear.predictors, which multiply the
 columns of the model matrix. It does not include the estimate of error
 (sigma). The names of the coefficients are the names of the
 single-degree-of-freedom effects (the columns of the model matrix). If the
 model is over-determined there will be missing values in the coefficients
 corresponding to non-estimable coefficients.
 
 code:
 LOS <- sort(rweibull(1000, 1.4, 108))
 AGE <- sort(rnorm(1000, 41, 12))
 ACUITY <- sort(rep(1:5, 200))
 EVENT <- sample(x = c(0, 1), replace = TRUE, 1000)
 psm(Surv(LOS, EVENT) ~ AGE + as.factor(ACUITY), dist = 'weibull')
 
 output:
 
 psm(formula = Surv(LOS, CENS) ~ AGE + as.factor(ACUITY), dist = "weibull")
 
       Obs  Events  Model L.R.  d.f.  P    R2
      1000     513     2387.62     5  0  0.91
 
               Value  Std. Error      z         p
 (Intercept)  1.1055     0.04425  24.98 8.92e-138
 AGE          0.0772     0.00152  50.93  0.00e+00
 ACUITY=2     0.0944     0.01357   6.96  3.39e-12
 ACUITY=3     0.1752     0.02111   8.30  1.03e-16
 ACUITY=4     0.1391     0.02722   5.11  3.18e-07
 ACUITY=5    -0.0544     0.03789  -1.43  1.51e-01
 Log(scale)  -2.7287     0.03780 -72.18  0.00e+00
 
 Scale= 0.0653
 
 best,
 
 Spencer

I have a case study using psm (survreg wrapper) in my book.  Briefly, 
coefficients are on the log median survival time scale.

Frank


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] error bars on survival curve

2007-06-17 Thread Frank E Harrell Jr
Murray Pung wrote:
 I am using plot(survfit(Surv(time,status) ~...) and would like to add
 error bars rather than the confidence intervals. Am I able to do this
 at specified times? e.g. when time = 20  40.
 
 
 leukemia.surv - survfit(Surv(time, status) ~ x, data = aml)
 plot(leukemia.surv, lty = 2:3,xlim = c(0,50))
 #can i add error bars at times 20  40?
 legend(100, .9, c(Maintenance, No Maintenance), lty = 2:3)
 
 
 Thanks
 Murray

In the minds of many, error bars done correctly are the same thing as 
confidence intervals.

When the Design package is available again for the latest version of R, 
you can have more control using the survplot function, with bars and 
bands options.  Do ?survplot.survfit
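
A rough sketch; the conf and time.inc arguments are as I recall them from 
?survplot.survfit, so treat this as something to check against the help page:

library(Design)
leukemia.surv <- survfit(Surv(time, status) ~ x, data = aml)
survplot(leukemia.surv, conf = "bars", time.inc = 20)  # interval bars at times 20, 40, ...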

Frank

-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] selecting cut-off in Logistic regression using ROCR package

2007-06-16 Thread Frank E Harrell Jr
Tirthadeep wrote:
 
 Hi,
 
 I am using logistic regression to classify a binary psychometric data. using
 glm() and then predict.glm() i got the predicted odds ratio of the testing
 data. Next i am going to plot ROC curve for the analysis of my study.
 
 Now what i will do:
 
 1. first select a cut-off (say 0.4) and classify the output of predict.glm()
 into {0,1} segment and then use it to draw ROC curve using ROCR package 
 
 OR
 
 2. just use the predicted odds ratio in ROCR package to get error rate and
 use the minimum error rate (as new cut-off) to draw new ROC curve.
 
 waiting for reply.
 
 with regards and thanks.
 
 Tirtha.

It's not clear why any cutoff or ROC curve is needed.  Please give us 
more information about why a continuous variable should be dichotomized, 
and read

@Article{roy06dic,
  author  = {Royston, Patrick and Altman, Douglas G. and Sauerbrei, Willi},
  title   = {Dichotomizing continuous predictors in multiple regression: a bad idea},
  journal = {Statistics in Medicine},
  year    = 2006,
  volume  = 25,
  pages   = {127-141},
  annote  = {continuous covariates; dichotomization; categorization; regression;
             efficiency; clinical research; residual confounding; destruction of
             statistical inference when cutpoints are chosen using the response
             variable; varying effect estimates from change in cutpoints; difficult
             to interpret effects when dichotomized; nice plot showing effect of
             categorization; PBC data}
}

Frank

-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] complex contrasts and logistic regression

2007-06-16 Thread Frank E Harrell Jr
Nicholas Lewin-Koh wrote:
 Hi,
 I am doing a retrospective analysis on a cohort from a designed trial,
 and I am fitting
 the model
 
 fit <- glmD(survived ~ Covariate*Therapy + confounder, myDat, X=TRUE,
 Y=TRUE, family=binomial()) 

For logistic regression you can also use Design's lrm function which 
gives you more options.

 
 My covariate has three levels (A,B and C) and therapy has two
 (treated and control), confounder is a continuous variable.
 Also patients were randomized to treatment in the trial, but Covariate
 is something that is measured
 posthoc and can vary in the population.

If by posthoc you mean that the covariate is measured after baseline, it 
is difficult to get an interpretable analysis.

  
 I am trying to wrap my head around how to calculate a few quantities
 from the model
 and get reasonable confidence intervals for them, namely I would like to
 test
 
 H0: gamma=0, where gamma is the regression coefficient of the odds
 ratios of surviving
  under treatment vs control at each level of Covariate
  (adjusted for the confounder)

You mean regression coefficient on the log odds ratio scale.  This is 
easy to do with the contrast( ) function in Design.  Do ?contrast.Design 
for details and examples.
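
A minimal sketch using your variable names (placeholders here); a datadist 
must be in effect so that unspecified predictors such as the confounder are 
set to reference values:

k <- contrast(fit,
              list(Covariate = "A", Therapy = "treated"),
              list(Covariate = "A", Therapy = "control"))
print(k)  # log odds ratio for treated vs. control at Covariate = A, with SE and confidence limits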

 
 and I would like to get the odds of surviving at each level of Covariate
 under treatment and control
 for each level of covariate adjusted for the confounder. I have looked
 at contrast in the Design 
 library but I don't think it gives me the right quantity, for instance 
 
 contrast(fit, list(covariate="A", Therapy="Treated",
 confounder=median(myDat$confounder)), X=TRUE)
 ( A is the baseline level of Covariate) 
 
 gives me beta0 + beta_Treated + beta_confounder*68  
 
 Is this correctly interpreted as the conditional odds of dying? 
 As to the 1st contrast I am not sure how to get it, would it be using
 type = 'average' with some weights 
 in contrast? The answers are probably staring me in the face, i am just
 not seeing them today.

contrast( ) is for contrasts (differences).  Sounds like you want 
predicted values.  Do ?predict  ?predict.lrm  ?predict.Design.  Also do 
?gendata which will generate a data frame for getting predictors, with 
unspecified predictors set to reference values such as medians.
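
A minimal sketch of the predicted-value route, again assuming the model is 
fit with lrm and a datadist is in effect:

nd <- gendata(fit, Covariate = c("A", "B", "C"), Therapy = c("treated", "control"))
cbind(nd, prob = predict(fit, nd, type = "fitted"))  # adjusted predicted probabilities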

Frank

 
 Nicholas
 
 
 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Design library installation problem

2007-06-13 Thread Frank E Harrell Jr
Uwe Ligges wrote:
 
 
 Ian Watson wrote:
 Dear Listers

 I have tried to install Frank Harrell's two libaries: Hmisc and Design.

 I found that Hmisc was listed in the list of packages from the Install 
 Packages command on the Packages menu, but Design was not. I installed 
 Hmisc from this list, and when I   issued the library(Hmisc) command, 
 it loaded into memory correctly.

 I then copied the Design 1.1-1.zip file from the 
 http://lib.stat.cmu.edu/S/Harrell/library/r/ site and used the Install 
 Packages from Local Zip file command.
 I received no error messages and a visual inspection of the R\library 
 directory shows Design has been installed.

 However, when I issued the library(Design) command I get the following 
 error message:

 Error in library(Design) : 'Design' is not a valid package -- 
 installed < 2.0.0?


 I also notice, from a visual inspection of the R\library\Design\R 
 directory that there is only one file: design. In other directories, 
 eg. R\library\Hmisc\R there are usually 3 files:
 Hmisc
 Hmisc.rdx
 Hmisc.rdb

 I am new to R, and a bit lost. I have read the R-admin.pdf 
 documentation on packages but am still unsure how to proceed from here.

 I would appreciate any advice, and any answers to the following 
 questions:

 1) is there a reason why Design is not listed in the Install Packages 
 list as Hmisc is?
 
 Yes. The current version does not pass the checks under Windows. Please 
 convince the maintainer to fix the package, and a binary will be made 
 available shortly.

Please note that this would be a lot easier if we did not get a segfault 
when running R CMD CHECK --- a segfault that does not happen when the 
example code is run outside of CMD CHECK.  Charles Thomas Dupont is 
working hard on this.

Frank

-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] R vs. Splus in Pharma/Devices Industry

2007-06-11 Thread Frank E Harrell Jr
[EMAIL PROTECTED] wrote:
 Following up to some extent on Friday's discussion regarding the
 'validation' of R, could I ask the list group's opinion on possible
 advantages of R over Splus from a pharma/devices perspective?  I wish to
 exclude the obvious price difference, which doesn’t seem to carry as much
 weight as I would have thought.  Besides, I have noticed many former Splus
 users gravitating towards R, and I suspect that the reasons are not purely
 economic.
 
 I can think of a few advantages of Splus:
 1. SeqTrial (of course that means more $)
 2. Tech support
 3. The warm fuzzies that management seems to get from proprietary software
 
 I can also think of a few advantages of R:
 1. Based on my personal experiences, simulations requiring a lot of looping
 seem to run faster.
 2. R interfaces with BUGS, for example through BRUGS.
 3. The wonderful help list!
 
 As always, I am speaking for myself and not necessarily for Edwards
 Lifesciences.
 
 Regards,
-Cody
 
 Cody Hamilton, PhD
 Edwards Lifesciences
   [[alternative HTML version deleted]]

A big one for us is plotmath (for clinical trial reports we put a lot of 
greek letters and subscripts on plots), and later we will consider 
migrating a lot of our stuff to the ggplot package which I don't think 
is available in S-Plus.  Lexical scoping is another advantage of R as is 
the ability to reference files on the internet.
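
A tiny example of the plotmath point, using only base graphics:

plot(1:10, 1:10,
     xlab = expression(hat(beta)[1]),
     ylab = expression(mu %+-% 2*sigma),
     main = expression(paste("Pr(", Y == 1, " | ", X, ")")))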

Frank

 
 
 
 
 
 __
 R-help@stat.math.ethz.ch mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] R is not a validated software package..

2007-06-08 Thread Frank E Harrell Jr
Giovanni Parrinello wrote:
 Dear All,
 discussing with a statistician of a pharmaceutical company I received 
 this answer about the statistical package that I have planned to use:
 
 As R is not a validated software package, we would like to ask if it 
 would rather be possible for you to use SAS, SPSS or another approved 
 statistical software system.
 
 Could someone suggest me a 'polite' answer?
 TIA
 Giovanni
 

Search the archives and you'll find a LOT of responses.

Briefly, in my view there are no requirements, just some pharma 
companies that think there are.  FDA is required to accept all 
submissions, and they get some where only Excel was used, or Minitab, 
and lots more.  There is a session on this at the upcoming R 
International Users Meeting in Iowa in August.  The session will include 
discussions of federal regulation compliance for R, for those users who 
feel that such compliance is actually needed.

Frank

-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] R is not a validated software package..

2007-06-08 Thread Frank E Harrell Jr
Sicotte, Hugues Ph.D. wrote:
 People, don't get angry at the pharma statistician, he is just trying to
 abide by an FDA requirement that is designed to insure that test perform
 reliably the same. There is no point in getting into which product is
 better. As far as the FDA rules are concerned a validated system beats a
 better system any day of the week.

There is no such requirement.

 
 Here is your polite answer.
 You can develop and try your software in R.
 Should they need to use those results in a report that will matter to
 the FDA, then you can work together with him to set up a validated
 environment for S-plus. You then have to commit to port your code to
 S-plus.

That doesn't follow.  What matters is good statistical analysis practice 
no matter which environment you use.  Note that more errors are made in 
the data preparation / derived variables stage than are made by 
statistical software.

Frank

 
 As I assume that you do not work in a regulated environment, you
 probably wouldn't have access to a validated SAS environment anyways. It
 is not usually enough to install a piece of software, you have to
 validate every step of the installation. Since AFAIK the FDA uses
 S-plus, it would be to your pharma person's advantage to speed-up
 submissions if they also had a validated S-plus environment.
 
 http://www.msmiami.com/custom/downloads/S-PLUSValidationdatasheet_Final.pdf
 
 
 -Original Message-
 From: [EMAIL PROTECTED]
 [mailto:[EMAIL PROTECTED] On Behalf Of Wensui Liu
 Sent: Friday, June 08, 2007 9:24 AM
 To: Giovanni Parrinello
 Cc: r-help@stat.math.ethz.ch
 Subject: Re: [R] R is not a validated software package..
 
 I like to know the answer as well.
 To be honest, I really have hard time to understand the mentality of
 clinical trial guys and rather believe it is something related to job
 security.
 
 On 6/8/07, Giovanni Parrinello [EMAIL PROTECTED] wrote:
 Dear All,
 discussing with a statistician of a pharmaceutical company I received
 this answer about the statistical package that I have planned to use:

 As R is not a validated software package, we would like to ask if it
 would rather be possible for you to use SAS, SPSS or another approved
 statistical software system.

 Could someone suggest me a 'polite' answer?
 TIA
 Giovanni

 --
 dr. Giovanni Parrinello
 External Lecturer
 Medical Statistics Unit
 Department of Biomedical Sciences
 Viale Europa, 11 - 25123 Brescia Italy
 Tel: +390303717528
 Fax: +390303717488
 email: [EMAIL PROTECTED]


 [[alternative HTML version deleted]]

 __
 R-help@stat.math.ethz.ch mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.

 
 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] R is not a validated software package..

2007-06-08 Thread Frank E Harrell Jr
Bert Gunter wrote:
 Frank et. al:
 
 I believe this is a bit too facile. 21 CFR Part 11 does necessitate a
 software validation **process** -- but this process does not require any

For database software and for medical devices -

 particular software. Rather, it requires that those using whatever software
 demonstrate to the FDA's satisfaction that the software does what it's
 supposed to do appropriately. This includes a lot more than assuring, say,
 the numerical accuracy of computations; I think it also requires
 demonstration that the data are secure, that it is properly transferred
 from one source to another, etc. I assume that the statistical validation of
 R would be relatively simple, as R already has an extensive test suite, and
 it would simply be a matter of providing that test suite info. A bit more
 might be required, but I don't think it's such a big deal. 
 
 I think Wensui Liu's characterization of clinical statisticians as having a
 mentality related to job security is a canard. Although I work in
 nonclinical, my observation is that clinical statistics is complex and
 difficult, not only because of many challenging statistical issues, but also
 because of the labyrinthian complexities of the regulated and extremely
 costly environment in which they work. It is certainly a job that I could
 not do.
 
 That said, probably the greatest obstacle to change from SAS is neither
 obstinacy nor ignorance, but rather inertia: pharmaceutical companies have
 over the decades made a huge investment in SAS infrastructure to support the
 collection, organization, analysis, and submission of data for clinical
 trials. To convert this to anything else would be a herculean task involving
 huge expense, risk, and resources. R, S-Plus (and much else -- e.g. numerous
 unvalidated data mining software packages) are routinely used by clinical
 statisticians to better understand their data and for exploratory analyses
 that are used to supplement official analyses (e.g. for trying to justify
 collection of tissue samples or a pivotal study in a patient subpopulation).
 But it is difficult for me to see how one could make a business case to
 change clinical trial analysis software infrastructure from SAS to S-Plus,
 SPSS, or anything else.

What I would love to have is some efficiency estimates for SAS macro 
programming as done in pharma vs. using a high-level language.  My bias 
is that SAS macro programming, which costs companies more than SAS 
licenses, is incredibly inefficient.

Frank

 
 **DISCLAIMER** 
 My opinions only. They do not in any way represent the view of my company or
 its employees.
 
 
 Bert Gunter
 Genentech Nonclinical Statistics
 South San Francisco, CA 94404
 650-467-7374
 
 
 -Original Message-
 From: [EMAIL PROTECTED]
 [mailto:[EMAIL PROTECTED] On Behalf Of Frank E Harrell Jr
 Sent: Friday, June 08, 2007 7:45 AM
 To: Giovanni Parrinello
 Cc: r-help@stat.math.ethz.ch
 Subject: Re: [R] R is not a validated software package..
 
 Giovanni Parrinello wrote:
 Dear All,
 discussing with a statistician of a pharmaceutical company I received 
 this answer about the statistical package that I have planned to use:

 As R is not a validated software package, we would like to ask if it 
 would rather be possible for you to use SAS, SPSS or another approved 
 statistical software system.

 Could someone suggest me a 'polite' answer?
 TIA
 Giovanni

 
 Search the archives and you'll find a LOT of responses.
 
 Briefly, in my view there are no requirements, just some pharma 
 companies that think there are.  FDA is required to accept all 
 submissions, and they get some where only Excel was used, or Minitab, 
 and lots more.  There is a session on this at the upcoming R 
 International Users Meeting in Iowa in August.  The session will include 
 discussions of federal regulation compliance for R, for those users who 
 feel that such compliance is actually needed.
 
 Frank
 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Tools For Preparing Data For Analysis

2007-06-08 Thread Frank E Harrell Jr
Dale Steele wrote:
 For windows users, EpiData Entry http://www.epidata.dk/ is an
 excellent (free) tool for data entry and documentation.--Dale

Note that EpiData seems to work well under linux using wine.
Frank

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Tools For Preparing Data For Analysis

2007-06-07 Thread Frank E Harrell Jr


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Harrell's C

2007-06-02 Thread Frank E Harrell Jr
Wachtel, Mitchell wrote:
 R will not let me load Design or Hmisc anymore. I need to calculate Harrell's 
 C for some survival analyses. Can anyone provide an R formula for Harrell's C?
 With kindest regards.
 Mitchell Wachtel, MD

You provided very little information.  We have had no reports of Hmisc 
not loading under Windows but Design is not currently available for 
Windows for R 2.5.  We are working to fix that.

Frank
-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Problem with Weighted Variance in Hmisc

2007-06-01 Thread Frank E Harrell Jr
jiho wrote:
 On 2007-June-01  , at 01:03 , Tom La Bone wrote:
 The function wtd.var(x,w) in Hmisc calculates the weighted variance  
 of x
 where w are the weights.  It appears to me that wtd.var(x,w) = var 
 (x) if all
 of the weights are equal, but this does not appear to be the case. Can
 someone point out to me where I am going wrong here?  Thanks.
 
 The true formula of weighted variance is this one:
   http://www.itl.nist.gov/div898/software/dataplot/refman2/ch2/ 
 weighvar.pdf
 But for computation purposes, wtd.var uses another definition which  
 considers the weights as repeats instead of true weights. However if  
 the weights are normalized (sum to one) to two formulas are equal. If  
 you consider weights as real weights instead of repeats, I would  
 recommend to use this option.
 With normwt=T, your issue is solved:
 
   a=1:10
   b=a
   b[]=2
   b
 [1] 2 2 2 2 2 2 2 2 2 2
   wtd.var(a,b)
 [1] 8.68421
 # all weights equal 2 => there are two repeats of each element of a
   var(c(a,a))
 [1] 8.68421
   wtd.var(a,b,normwt=T)
 [1] 9.17
   var(a)
 [1] 9.17
 
 Cheers,
 
 JiHO

The issue is what is being assumed for N in the denominator of the 
variance formula, since the unbiased estimator subtracts one.  Using 
normwt=TRUE means you are in effect assuming N is the number of elements 
in the data vector, ignoring the weights.

Frank Harrell

 ---
 http://jo.irisson.free.fr/
 
 __
 R-help@stat.math.ethz.ch mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.
 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] R's Spearman

2007-05-31 Thread Frank E Harrell Jr
Mendiburu, Felipe (CIP) wrote:
 Dear Ray,
 
 The Spearman correlation calculated by R is correct whether or not there are 
 ties; what is not correct is the probability (p-value) in the case of ties. I am 
 sending you the formula for the correlation with ties, which agrees with R. 
 
 Regards,
 
 Felipe de Mendiburu
 Statistician

Just use midranks for ties (as with rank()) and compute the ordinary 
correlation on those (see also the spearman2 and rcorr functions in 
Hmisc package).  No need to use complex formulas.  And the t 
approximation for p-values works pretty well.
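
A minimal sketch, assuming two numeric vectors x and y:

cor(rank(x), rank(y))            # Pearson correlation of midranks
cor(x, y, method = "spearman")   # same value; ties handled by midranks
library(Hmisc)
spearman2(x, y)                  # rho^2 with an F-based test
rcorr(cbind(x, y), type = "spearman")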

Frank Harrell

 
 
 # Spearman correlation rs with ties or no ties
 rs <- function(x, y) {
   d  <- rank(x) - rank(y)
   tx <- as.numeric(table(x))
   ty <- as.numeric(table(y))
   Lx <- sum((tx^3 - tx)/12)
   Ly <- sum((ty^3 - ty)/12)
   N  <- length(x)
   SX2 <- (N^3 - N)/12 - Lx
   SY2 <- (N^3 - N)/12 - Ly
   rs  <- (SX2 + SY2 - sum(d^2))/(2*sqrt(SX2*SY2))
   return(rs)
 }
 
 # Aplicacion
 cor(y[,1], y[,2], method="spearman")
 [1] 0.2319084
 rs(y[,1],y[,2])
 [1] 0.2319084
 
 
 
 -Original Message-
 From: [EMAIL PROTECTED]
 [mailto:[EMAIL PROTECTED] Behalf Of Raymond Wan
 Sent: Monday, May 28, 2007 10:29 PM
 To: r-help@stat.math.ethz.ch
 Subject: [R] R's Spearman
 
 
 
 Hi all,
 
 I am trying to figure out the formula used by R's Spearman rho (using 
 cor(method=spearman)) because I can't seem to get the same value as by 
 calculating by hand.  Perhaps I'm using cor wrong, but I don't know 
 where.  Basically, I am running these commands:
 
   y=read.table(file="tmp", header=TRUE, sep="\t")
   y
      IQ Hours
 1   106     7
 2    86     0
 3    97    20
 4   113    12
 5   120    12
 6   110    17
   cor(y[1], y[2], method="spearman")
           Hours
 IQ 0.2319084
 
 [it's an abbreviated example of one I took from Wikipedia].  I 
 calculated by hand (apologies if the table looks strange when pasted 
 into e-mail):
 
       IQ  Hours  rank(IQ)  rank(hours)  diff  diff^2
 1    106      7         3            2     1    1
 2     86      0         1            1     0    0
 3     97     20         2            6    -4   16
 4    113     12         5          3.5   1.5    2.25
 5    120     12         6          3.5   2.5    6.25
 6    110     17         4            5    -1    1
                                    sum diff^2   26.5

   rho=0.242857
 
 where rho = 1 - (6 * 26.5) / (6 * (6^2 - 1)).  I kept modifying the 
 table and realized that the difference in result comes from ties.  i.e., 
 if I remove the tie in rows 4 and 5, I get the same result from both cor 
 and calculating by hand.  Perhaps I'm handling ties wrong...does anyone 
 know how R does it or perhaps I need to change how I'm using it?
 
 Thank you!
 
 Ray
 
 __
 R-help@stat.math.ethz.ch mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.
 
 __
 R-help@stat.math.ethz.ch mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.
 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] AIC for lrm(Hmisc/Design) model.

2007-05-29 Thread Frank E Harrell Jr
Milton Cezar Ribeiro wrote:
 Dear all,
 
 I am adjusting a logistic regression model using the lrm() function of the 
 Hmisc/Design package. Now I would like to compute AIC for this model. How can 
 I do that?
 
 Kind regards,
 
 miltinho
 Brazil

I like to change AIC to have it on the chi-square scale.  For that you 
can do

aic <- function(fit)
   round(unname(fit$stats['Model L.R.'] - 2*fit$stats['d.f.']), 2)

f <- lrm( )
aic(f)

If unname doesn't exist in S-Plus as it does in R, you can remove that part.

-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] normality tests

2007-05-25 Thread Frank E Harrell Jr
[EMAIL PROTECTED] wrote:
 Hi all,
 
 apologies for seeking advice on a general stats question. I've run
 normality tests using 8 different methods:
 - Lilliefors
 - Shapiro-Wilk
 - Robust Jarque Bera
 - Jarque Bera
 - Anderson-Darling
 - Pearson chi-square
 - Cramer-von Mises
 - Shapiro-Francia
 
 All show that the null hypothesis that the data come from a normal
 distro cannot be rejected. Great. However, I don't think it looks nice
 to report the values of 8 different tests on a report. One note is
 that my sample size is really tiny (less than 20 independent cases).
 Without wanting to start a flame war, are there any advices of which
 one/ones would be more appropriate and should be reported (along with
 a Q-Q plot). Thank you.
 
 Regards,
 

Wow - I have so many concerns with that approach that it's hard to know 
where to begin.  But first of all, why care about normality?  Why not 
use distribution-free methods?

You should examine the power of the tests for n=20.  You'll probably 
find it's not good enough to reach a reliable conclusion.

Frank


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] normality tests [Broadcast]

2007-05-25 Thread Frank E Harrell Jr
Lucke, Joseph F wrote:
  Most standard tests, such as t-tests and ANOVA, are fairly resistant to
 non-normality for significance testing. It's the sample means that have
 to be normal, not the data.  The CLT kicks in fairly quickly.  Testing
 for normality prior to choosing a test statistic is generally not a good
 idea. 

I beg to differ Joseph.  I have had many datasets in which the CLT was 
of no use whatsoever, i.e., where bootstrap confidence limits were 
asymmetric because the data were so skewed, and where symmetric 
normality-based confidence intervals had bad coverage in both tails 
(though correct on the average).  I see this the opposite way: 
nonparametric tests works fine if normality holds.

Note that the CLT helps with type I error but not so much with type II 
error.
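
A small illustration of the skewness point with simulated lognormal data 
(the numbers are made up purely for illustration):

set.seed(1)
x <- rlnorm(30, meanlog = 0, sdlog = 1.5)               # heavily right-skewed sample
boot.means <- replicate(2000, mean(sample(x, replace = TRUE)))
quantile(boot.means, c(.025, .975))                     # asymmetric percentile interval
mean(x) + c(-1, 1) * qt(.975, 29) * sd(x)/sqrt(30)      # symmetric t interval for comparison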

Frank

 
 -Original Message-
 From: [EMAIL PROTECTED]
 [mailto:[EMAIL PROTECTED] On Behalf Of Liaw, Andy
 Sent: Friday, May 25, 2007 12:04 PM
 To: [EMAIL PROTECTED]; Frank E Harrell Jr
 Cc: r-help
 Subject: Re: [R] normality tests [Broadcast]
 
 From: [EMAIL PROTECTED]
 On 25/05/07, Frank E Harrell Jr [EMAIL PROTECTED] wrote:
 [EMAIL PROTECTED] wrote:
 Hi all,

 apologies for seeking advice on a general stats question. I ve run
 
 normality tests using 8 different methods:
 - Lilliefors
 - Shapiro-Wilk
 - Robust Jarque Bera
 - Jarque Bera
 - Anderson-Darling
 - Pearson chi-square
 - Cramer-von Mises
 - Shapiro-Francia

 All show that the null hypothesis that the data come from a normal
 
 distro cannot be rejected. Great. However, I don't think
 it looks nice
 to report the values of 8 different tests on a report. One note is
 
 that my sample size is really tiny (less than 20
 independent cases).
 Without wanting to start a flame war, are there any
 advices of which
 one/ones would be more appropriate and should be reported
 (along with
 a Q-Q plot). Thank you.

 Regards,

 Wow - I have so many concerns with that approach that it's
 hard to know
 where to begin.  But first of all, why care about
 normality?  Why not
 use distribution-free methods?

 You should examine the power of the tests for n=20.  You'll probably
 
 find it's not good enough to reach a reliable conclusion.
 And wouldn't it be even worse if I used non-parametric tests?
 
 I believe what Frank meant was that it's probably better to use a
 distribution-free procedure to do the real test of interest (if there is
 one) instead of testing for normality, and then use a test that assumes
 normality.
 
 I guess the question is, what exactly do you want to do with the outcome
 of the normality tests?  If those are going to be used as basis for
 deciding which test(s) to do next, then I concur with Frank's
 reservation.
 
 Generally speaking, I do not find goodness-of-fit for distributions very
 useful, mostly for the reason that failure to reject the null is no
 evidence in favor of the null.  It's difficult for me to imagine why
 "there's insufficient evidence to show that the data did not come from a
 normal distribution" would be interesting.
 
 Andy
 
  
 Frank


 --
 Frank E Harrell Jr   Professor and Chair   School 
 of Medicine
   Department of Biostatistics   
 Vanderbilt University

 --
 yianni

 __
 R-help@stat.math.ethz.ch mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.



 
 
 
 --
 Notice:  This e-mail message, together with any
 attachments,...{{dropped}}
 
 __
 R-help@stat.math.ethz.ch mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.
 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] normality tests [Broadcast]

2007-05-25 Thread Frank E Harrell Jr
[EMAIL PROTECTED] wrote:
 Thank you all for your replies they have been more useful... well
 in my case I have chosen to do some parametric tests (more precisely
 correlation and linear regressions among some variables)... so it
 would be nice if I had an extra bit of support on my decisions... If I
 understood well from all your replies... I shouldn't pay so much
 attention to the normality tests, so it wouldn't matter which one/ones
 I use to report... but rather focus on issues such as the power of the
 test...

If doing regression I assume your normality tests were on residuals 
rather than raw data.
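
A minimal sketch, assuming a linear model fit on a data frame d 
(placeholder names):

fit <- lm(y ~ x1 + x2, data = d)
r <- residuals(fit)
qqnorm(r); qqline(r)   # Q-Q plot of the residuals
shapiro.test(r)        # if a formal test is still wanted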

Frank

 
 Thanks again.
 
 On 25/05/07, Lucke, Joseph F [EMAIL PROTECTED] wrote:
  Most standard tests, such as t-tests and ANOVA, are fairly resistant to
 non-normality for significance testing. It's the sample means that have
 to be normal, not the data.  The CLT kicks in fairly quickly.  Testing
 for normality prior to choosing a test statistic is generally not a good
 idea.

 -Original Message-
 From: [EMAIL PROTECTED]
 [mailto:[EMAIL PROTECTED] On Behalf Of Liaw, Andy
 Sent: Friday, May 25, 2007 12:04 PM
 To: [EMAIL PROTECTED]; Frank E Harrell Jr
 Cc: r-help
 Subject: Re: [R] normality tests [Broadcast]

 From: [EMAIL PROTECTED]
 
  On 25/05/07, Frank E Harrell Jr [EMAIL PROTECTED] wrote:
   [EMAIL PROTECTED] wrote:
Hi all,
   
apologies for seeking advice on a general stats question. I ve run

normality tests using 8 different methods:
- Lilliefors
- Shapiro-Wilk
- Robust Jarque Bera
- Jarque Bera
- Anderson-Darling
- Pearson chi-square
- Cramer-von Mises
- Shapiro-Francia
   
All show that the null hypothesis that the data come from a normal

distro cannot be rejected. Great. However, I don't think
  it looks nice
to report the values of 8 different tests on a report. One note is

that my sample size is really tiny (less than 20
  independent cases).
Without wanting to start a flame war, are there any
  advices of which
one/ones would be more appropriate and should be reported
  (along with
a Q-Q plot). Thank you.
   
Regards,
   
  
   Wow - I have so many concerns with that approach that it's
  hard to know
   where to begin.  But first of all, why care about
  normality?  Why not
   use distribution-free methods?
  
   You should examine the power of the tests for n=20.  You'll probably

   find it's not good enough to reach a reliable conclusion.
 
  And wouldn't it be even worse if I used non-parametric tests?

 I believe what Frank meant was that it's probably better to use a
 distribution-free procedure to do the real test of interest (if there is
 one) instead of testing for normality, and then use a test that assumes
 normality.

 I guess the question is, what exactly do you want to do with the outcome
 of the normality tests?  If those are going to be used as basis for
 deciding which test(s) to do next, then I concur with Frank's
 reservation.

 Generally speaking, I do not find goodness-of-fit for distributions very
 useful, mostly for the reason that failure to reject the null is no
 evidence in favor of the null.  It's difficult for me to imagine why
 there's insufficient evidence to show that the data did not come from a
 normal distribution would be interesting.

 Andy


  
   Frank
  
  
   --
   Frank E Harrell Jr   Professor and Chair   School
  of Medicine
 Department of Biostatistics
  Vanderbilt University
  
 
 
  --
  yianni
 
  __
  R-help@stat.math.ethz.ch mailing list
  https://stat.ethz.ch/mailman/listinfo/r-help
  PLEASE do read the posting guide
  http://www.R-project.org/posting-guide.html
  and provide commented, minimal, self-contained, reproducible code.
 
 
 


 
 --
 Notice:  This e-mail message, together with any
 attachments,...{{dropped}}

 __
 R-help@stat.math.ethz.ch mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.

 
 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] normality tests [Broadcast]

2007-05-25 Thread Frank E Harrell Jr
[EMAIL PROTECTED] wrote:
 Following up on Frank's thought, why is it that parametric tests are so
 much more popular than their non-parametric counterparts?  As
 non-parametric tests require fewer assumptions, why aren't they the
 default?  The relative efficiency of the Wilcoxon test as compared to the
 t-test is 0.955, and yet I still see t-tests in the medical literature all
 the time.  Granted, the Wilcoxon still requires the assumption of symmetry
 (I'm curious as to why the Wilcoxon is often used when asymmetry is
 suspected, since the Wilcoxon assumes symmetry), but that's less stringent
 than requiring normally distributed data.  In a similar vein, one usually
 sees the mean and standard deviation reported as summary statistics for a
 continuous variable - these are not very informative unless you assume the
 variable is normally distributed.  However, clinicians often insist that I
 included these figures in reports.
 
 Cody Hamilton, PhD
 Edwards Lifesciences

Well said Cody, just want to add that Wilcoxon does not assume symmetry 
if you are interested in testing for stochastic ordering and not just 
for a mean.

Frank

 
 
 

  Frank E Harrell Jr [EMAIL PROTECTED]
  To: Lucke, Joseph F [EMAIL PROTECTED]
  Cc: r-help r-help@stat.math.ethz.ch
  Subject: Re: [R] normality tests [Broadcast]
  Date: 05/25/2007 02:42 PM





 
 
 
 
 Lucke, Joseph F wrote:
  Most standard tests, such as t-tests and ANOVA, are fairly resistant to
 non-normality for significance testing. It's the sample means that have
 to be normal, not the data.  The CLT kicks in fairly quickly.  Testing
 for normality prior to choosing a test statistic is generally not a good
 idea.
 
 I beg to differ Joseph.  I have had many datasets in which the CLT was
 of no use whatsoever, i.e., where bootstrap confidence limits were
 asymmetric because the data were so skewed, and where symmetric
 normality-based confidence intervals had bad coverage in both tails
 (though correct on the average).  I see this the opposite way:
 nonparametric tests works fine if normality holds.
 
 Note that the CLT helps with type I error but not so much with type II
 error.
 
 Frank
 
 -Original Message-
 From: [EMAIL PROTECTED]
 [mailto:[EMAIL PROTECTED] On Behalf Of Liaw, Andy
 Sent: Friday, May 25, 2007 12:04 PM
 To: [EMAIL PROTECTED]; Frank E Harrell Jr
 Cc: r-help
 Subject: Re: [R] normality tests [Broadcast]

 From: [EMAIL PROTECTED]
 On 25/05/07, Frank E Harrell Jr [EMAIL PROTECTED] wrote:
 [EMAIL PROTECTED] wrote:
 Hi all,

 apologies for seeking advice on a general stats question. I've run
 normality tests using 8 different methods:
 - Lilliefors
 - Shapiro-Wilk
 - Robust Jarque Bera
 - Jarque Bera
 - Anderson-Darling
 - Pearson chi-square
 - Cramer-von Mises
 - Shapiro-Francia

 All show that the null hypothesis that the data come from a normal
 distro cannot be rejected. Great. However, I don't think
 it looks nice
 to report the values of 8 different tests on a report. One note is
 that my sample size is really tiny (less than 20
 independent cases).
 Without wanting to start a flame war, are there any
 advices of which
 one/ones would be more appropriate and should be reported
 (along with
 a Q-Q plot). Thank you.

 Regards,

 Wow - I have so many concerns with that approach that it's
 hard to know
 where to begin.  But first of all, why care about
 normality?  Why not
 use distribution-free methods?

 You should examine the power of the tests for n=20.  You'll probably
 find it's not good enough to reach a reliable conclusion.
 And wouldn't it be even worse if I used non-parametric tests?
 I believe what Frank meant was that it's probably better to use a
 distribution-free procedure to do the real test of interest (if there is
 one) instead of testing for normality, and then use a test that assumes
 normality.

 I guess the question is, what exactly do you want

Re: [R] clustered standarderrors using design package

2007-05-19 Thread Frank E Harrell Jr
Anders Eklund wrote:
 Please help,
 
 I have a strange problem. I've got a balanced panel data set. I use dummy
 variable regression and I've got results with lm function.
 
 summary(lm(y ~ post + t19961 + t19962 + t19963 + t19964 + t19971 + t19972
 + t19973 + t19974 + t19981+factor( id)))
 
 
 The problem is that I would like to get my standard errors clustered but
 then gets the following error message:
 
 f <- lm(y ~ post + t19961 + t19962 + t19963 + t19964 + t19971 + t19972 +
 t19973 + t19974 + t19981 + factor(id))

Omit id from the model (usually) and do
library(Design)

f <- ols(y ~ post + . . . , x=TRUE, y=TRUE)
g.clust1 <- robcov(f, id)

Frank

 library(Design)
 g.clust1 <- robcov(f, id)
 Error in match.arg(type) : 'arg' should be one of working, response,
 deviance, pearson, partial
 
 All my variables is vectors and I've tried with other variables inside and
 outside the model and all results in the same errormessage.
 
 Best regards
 
 Anders Eklund
 Stockholm, Sweden.
 
 __
 R-help@stat.math.ethz.ch mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.
 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] How to analyse simple study: Placebo-controlled (2 groups) repeated measurements (ANOVA, ANCOA???)

2007-05-17 Thread Frank E Harrell Jr
Karl Knoblick wrote:
 Hallo!
 
 I have two groups (placebo/verum), every subject is measured at 5 times, the 
 first time t0 is the baseline measurement, t1 to t4 are the measurements 
 after applying the medication (placebo or verum). The question is, if there 
 is a significant difference in the two groups and how large the differnce is 
 (95% confidence intervals).
 
 Let me give sample data
 # Data
 ID <- factor(rep(1:50, each=5)) # 50 subjects
 GROUP <- factor(c(rep("Verum", 115), rep("Placebo", 135)))
 TIME <- factor(rep(paste("t", 0:4, sep=""), 50))
 set.seed(1234)
 Y <- rnorm(250)
 # to have an effect:
 Y[GROUP=="Verum" & TIME=="t1"] <- Y[GROUP=="Verum" & TIME=="t1"] + 0.6
 Y[GROUP=="Verum" & TIME=="t2"] <- Y[GROUP=="Verum" & TIME=="t2"] + 0.3
 Y[GROUP=="Verum" & TIME=="t3"] <- Y[GROUP=="Verum" & TIME=="t3"] + 0.9
 Y[GROUP=="Verum" & TIME=="t4"] <- Y[GROUP=="Verum" & TIME=="t4"] + 0.9
 DF <- data.frame(Y, ID, GROUP, TIME)
 
 I have heard of different ways to analyse the data
 1) Comparing the endpoint t4 between the groups (t-test), ignoring baseline

Don't even consider this

 2) Comparing the difference t4 minus t0 between the two groups (t-test)

This is not optimal

 3) Comparing the endpoint t4 with t0 as a covariate between the groups (ANOVA 
 - how can this model be calculated in R?)

Using t0 as a covariate is the way to go.  A question is whether to just 
use t4.  Generally this is not optimum.

 4) Taking a summary score (im not sure but this may be a suggestion of 
 Altman) istead of t4
 5) ANOVA (repeated measurements) times t0 to t5, group placebo/verum), 
 subject as random factor - interested in interaction times*groups (How to do 
 this in R?)
 6) as 5) but times t1 to t5, ignoring baseline (How to do this in R?)
 7) as 6) but additional covariate baseline t0 (How to do this in R?)
 
 What will be best? - (Advantages / disadvantages?)
 How to analyse these models in R with nested and random effects and possible 
 covariate(ID, group - at least I think so) and random parameter ID)? Or is 
 there a more simple possibility?

It's not obvious that random effects are needed if you take the 
correlation into account in a good way.  Generalized least squares using 
for example an AR1 correlation structure (and there are many others) is 
something I often prefer.  A detailed case study with R code (similar to 
your situation) is in http://biostat.mc.vanderbilt.edu/FrankHarrellGLS . 
  This includes details about why t0 is best to consider as a covariate. 
  One reason is that the t0 effect may not be linear.
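
For instance, a minimal sketch along these lines, assuming the simulated DF 
above and the nlme package (y0 here is just the baseline response copied onto 
each post-baseline record; this is only an illustration, not the case study code):

  library(nlme)
  base <- with(subset(DF, TIME == "t0"), setNames(Y, as.character(ID)))
  post <- subset(DF, TIME != "t0")
  post$TIME <- factor(post$TIME)            # drop the unused t0 level
  post$y0   <- base[as.character(post$ID)]  # baseline as a covariate
  f <- gls(Y ~ GROUP * TIME + y0, data = post,
           correlation = corAR1(form = ~ 1 | ID))
  anova(f)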

If you want to focus on t4 it is easy to specify a contrast (after 
fitting is completed) that tests t4.  If time is continuous this 
contrast would involve predicted values at the 4th time, otherwise 
testing single parameters.

Frank Harrell

 
 Perhaps somebody can recommend a book or weblink where these different 
 strategies of analysing are discussed - preferable with examples with raw 
 data which I can recalculate. And if there is the R syntax includede - this 
 would be best!
 
 Any help will be appreciate!
 
 Thanks!
 Karl

-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Re : Bootstrap sampling for repeated measures

2007-05-15 Thread Frank E Harrell Jr
justin bem wrote:
 Hi, 
 
 If it was me I would have done this
 - First reshape the data frame to get something like
 
 header    measure1  measure2  measure3
 1280001   2.47      1.48      2.23     ...
 
 Since you have the same number of measures for every subject, you can define your 
 statistic with the data frame in this form, and then use the boot function 
 in boot or the Hmisc bootstrap function.
 
 
 Justin BEM

I don't think that's the best way to go.  As in the Design package (see 
the predab.resample, validate, calibrate functions) and the Hmisc 
rm.boot function you can easily sample from subject IDs and put together 
the needed records.
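
A minimal sketch of that kind of ID-level resampling, assuming a data frame d 
with a subject identifier column header and a user-written statistic f(data):

  boot.by.id <- function(d, f, B = 1000) {
    ids <- unique(d$header)
    replicate(B, {
      take <- sample(ids, replace = TRUE)   # resample whole subjects
      star <- do.call(rbind,
                      lapply(take, function(i) d[d$header == i, ]))
      f(star)                               # statistic on the bootstrap data set
    })
  }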

Frank Harrell

 Elève Ingénieur Statisticien Economiste
 BP 294 Yaoundé.
 Tél (00237)9597295.
 
 - Message d'origine 
 De : Niccolò Bassani [EMAIL PROTECTED]
 À : r-help@stat.math.ethz.ch
 Envoyé le : Mardi, 15 Mai 2007, 11h15mn 51s
 Objet : [R] Bootstrap sampling for repeated measures
 
 Dear R users,
 I'm having some problems trying to create a routine for a bootstrap
 resampling issue. Suppose I've got a dataset like this:
 
 Header    inr    weeks  ...
 1280001   2.47   0      ...
 1280001   1.48   1      ...
 1280001   2.23   2      ...
 
 
 1280369  2.5   ...   56
 
 i.e. a dataset with n subjects identified by the column header, with a set
 of repeated measures. The number of repeated measures for each subject is
 57, with a few subjects having more or fewer. That is, generalizing,
 I haven't got the same number of observations for each patient.
 I've created a function allowing me to reorder, subset and calculate
 some statistics over this dataset, but now I need to bootstrap it all. I was
 looking for a routine in R that could resample longitudinal data, in order
 to resample on the ID of the subjects. This means that while resampling
 (suppose m samples of n length) I wish to consider (better with replacement)
 either none or all of the observations related to a subject.
 So, if my bootstrap 1st sample takes the patient with header 1280001, I want
 the routine to consider all of the observations related with a subject with
 such a header.
 Thus, I shall obtain a bootstrap sample of my original dataset to wich apply
 the function cited before (whose only argument is the dataset).
 Can anybody help me? I'm trying to understand how the rm.boot function from
 Hmisc package resamples this way, but it's not that easy, so if anyone could
 help me I'd be very grateful.
 Thanks in advance
 Niccolò
 
 [[alternative HTML version deleted]]
 
 
 __
 R-help@stat.math.ethz.ch mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.
 
 
 
 
 
 
 
   
   
   
 
 
 
 
 
   [[alternative HTML version deleted]]
 
 
 
 
 
 __
 R-help@stat.math.ethz.ch mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Nicely formatted summary table with mean, standard deviation or number and proportion

2007-05-13 Thread Frank E Harrell Jr
Keith Wong wrote:
 Dear all,
 
 The incredibly useful Hmisc package provides a method to generate 
 summary tables that can be typeset in latex. The Alzola and Harrell book 
   An introduction to S and the Hmisc and Design libraries provides an 
 example that generates mean and quartiles for continuous variables, and 
 numbers and percentages for count variables: summary() with method = 
 'reverse'.
 
 I wonder if there is a way to change it so the mean and standard 
 deviation are reported instead for continuous variables.
 
 I illustrate my question below using an example from the book.
 
 Thank you.
 
 Keith

Newer versions of Hmisc have an option to add mean and SD for 
method='reverse'.  Quartiles are always there.
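
For example (a sketch, assuming a version of Hmisc whose print and latex 
methods for method='reverse' accept a prmsd argument):

  s <- summary(treatment ~ age + sex, method='reverse')
  print(s, prmsd=TRUE)   # adds mean and SD after the quartiles
  latex(s, prmsd=TRUE)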

Frank

 
 
   
   library(Hmisc)
  
   set.seed(173)
   sex = factor(sample(c("m", "f"), 500, rep = T))
   age = rnorm(500, 50, 5)
   treatment = factor(sample(c("Drug", "Placebo"), 500, rep = T))
   summary(sex ~ treatment, fun = table)
 sexN=500
 
 +-+---+---+---+---+
 | |   |N  |f  |m  |
 +-+---+---+---+---+
 |treatment|Drug   |263|140|123|
 | |Placebo|237|133|104|
 +-+---+---+---+---+
 |Overall  |   |500|273|227|
 +-+---+---+---+---+
  
  
  
   (x = summary(treatment ~ age + sex, method = "reverse"))
   # generates quartiles for continuous variables
 
 
 Descriptive Statistics by treatment
 
 +---+--+--+
 |   |Drug  |Placebo   |
 |   |(N=263)   |(N=237)   |
 +---+--+--+
 |age|46.5/49.9/53.2|46.7/50.0/53.4|
 +---+--+--+
 |sex : m|   47% (123)  |   44% (104)  |
 +---+--+--+
  
  
   # latex(x) generates a very nicely formatted table
   # but I'd like mean (standard deviation) instead of quartiles.
 
 
 
   # this function from 
 http://tolstoy.newcastle.edu.au/R/e2/help/06/11/4713.html
   g <- function(y) {
 +   s <- apply(y, 2,
 +  function(z) {
 +    z <- z[!is.na(z)]
 +    n <- length(z)
 +    if(n==0) c(NA,NA,NA,0) else
 +    if(n==1) c(z, NA,NA,1) else {
 +      m <- mean(z)
 +      s <- sd(z)
 +      c(N=n, Mean=m, SD=s)
 +    }
 +  })
 +   w <- as.vector(s)
 +   names(w) <- as.vector(outer(rownames(s), colnames(s), paste, sep=''))
 +   w
 + }
 
  
   summary(treatment ~ age + sex, method = "reverse", fun = g)
   # does not work, 'fun' or 'FUN argument is ignored.
 
 
 Descriptive Statistics by treatment
 
 +---+--+--+
 |   |Drug  |Placebo   |
 |   |(N=263)   |(N=237)   |
 +---+--+--+
 |age|46.5/49.9/53.2|46.7/50.0/53.4|
 +---+--+--+
 |sex : m|   47% (123)  |   44% (104)  |
 +---+--+--+
  
  
   (x1 = summarize(cbind(age), llist(treatment), FUN = g, 
 stat.name=c("n", "mean", "sd")))
treatment   n mean   sd
 1  Drug 263 49.9 4.94
 2   Placebo 237 50.1 4.97
  
   # this works but table is rotated, and it count data has to be
   # treated separately.
 
 
 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Read SAS data into R

2007-05-11 Thread Frank E Harrell Jr
John Kane wrote:
 Have a look at the Hmisc package: sas.get or
 sasxport.get.  In either case you need to have a
 working SAS installation on the same machine. 
 
 Another way, although you lose labels, is simply to
 export the SAS data file as a csv or delimited file and
 import it.

I'm also looking into running the SAS Viewer under wine on linux to read 
non-transport SAS files in a new Hmisc function.  But so far there is a 
problem - SAS in all its wisdom doesn't quote character string output 
when creating csv files with the Viewer, so you can't have commas in 
your character strings.  There is a tab delimiter option as long as you 
don't have tabs in the strings.  Amazing that SAS couldn't do better 
than that.

Frank

 
 
 --- AbouEl-Makarim Aboueissa
 [EMAIL PROTECTED] wrote:
 
 Dear ALL:

 Could you please let me know how to read SAS data
 file into R.

 Thank you so much for your helps.

 Regards;

 Abou


 ==
 AbouEl-Makarim Aboueissa, Ph.D.
 Assistant Professor of Statistics
 Department of Mathematics & Statistics
 University of Southern Maine
 96 Falmouth Street
 P.O. Box 9300
 Portland, ME 04104-9300

 Tel: (207) 228-8389
 Email: [EMAIL PROTECTED]
   [EMAIL PROTECTED]
 Office: 301C Payson Smith

 __
 R-help@stat.math.ethz.ch mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained,
 reproducible code.

 
 __
 R-help@stat.math.ethz.ch mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.
 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Follow-up about ordinal logit with mixtures: how about 'continuation ratio' strategy?

2007-05-10 Thread Frank E Harrell Jr
Paul Johnson wrote:
 This is a follow up to the message I posted 3 days ago about how to
 estimate mixed ordinal logit models.  I hope you don't mind that I am
 just pasting in the code and comments from an R file for your
 feedback.  Actual estimates are at the end of the post.

. . .

Paul,

lrm does not give an incorrect sign on the intercepts.  Just look at how 
it states the model in terms of Prob(Y >= j) so that its coefficients are 
consistent with the way people state binary models.

I'm not clear on your generation of simulated data.  I specify the 
population logit, anti-logit that, and generate binary responses with 
those probabilities.  I don't use rlogis.
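
A small sketch of that data-generating approach (the names here are 
hypothetical, not taken from your code):

  set.seed(1)
  n <- 200
  x <- rnorm(n)
  p <- plogis(-1 + 0.8*x)   # anti-logit of the population logit
  y <- rbinom(n, 1, p)      # binary responses with those probabilities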

See if using the PO model with lrm with penalization on the factor does 
what you need.

lrm is not set up to omit an intercept with the -1 notation.

My book goes into details about the continuation ratio model.

Frank Harrell

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Mantel-Haenszel relative risk with Greenland-Robins variance estimate

2007-05-08 Thread Frank E Harrell Jr
Does anyone know of an R function for computing the Greenland-Robins 
variance for Mantel-Haenszel relative risks?

Thanks
Frank
-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Mantel-Haenszel relative risk with Greenland-Robins variance estimate

2007-05-08 Thread Frank E Harrell Jr
[EMAIL PROTECTED] wrote:
 Would this function help:
 http://www.csm.ornl.gov/~frome/ES/RRMHex/MHanalysis.txt ?
 
 Regards, -Cody

I think so.  Thank you Cody.  If you have time would you mind defining, 
probably offline, the input data elements?  I assume that one of them is 
a stratification factor other than occupational group.

Thanks again
Frank

 
 
 

  Frank E Harrell Jr [EMAIL PROTECTED]
  To: R list r-help@stat.math.ethz.ch
  Subject: [R] Mantel-Haenszel relative risk with Greenland-Robins variance estimate
  Date: 05/08/2007 02:38 PM





 
 
 
 
 Does anyone know of an R function for computing the Greenland-Robins
 variance for Mantel-Haenszel relative risks?
 
 Thanks
 Frank
 --

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] pseudo-R2 or GOF for regression trees?

2007-05-05 Thread Frank E Harrell Jr
Prof. Jeffrey Cardille wrote:
 Hello,
 
 Is there an accepted way to convey, for regression trees, something  
 akin to R-squared?
 
 I'm developing regression trees for a continuous y variable and I'd  
 like to say how well they are doing. In particular, I'm analyzing the  
 results of a simulation model having highly non-linear behavior, and  
 asking what characteristics of the inputs are related to a particular  
 output measure.  I've got a very large number of points: n=4000.  I'm  
 not able to do a model sensitivity analysis because of the large  
 number of inputs and the model run time.
 
 I've been googling around both on the archives and on the rest of the  
 web for several hours, but I'm still having trouble getting a firm  
 sense of the state of the art.  Could someone help me to quickly  
 understand what strategy, if any, is acceptable to say something like  
 The regression tree in Figure 3 captures 42% of the variance?  The  
 target audience is readers who will be interested in the subsequent  
 verbal explanation of the relationship, but only once they are  
 comfortable that the tree really does capture something.  I've run  
 across methods to say how well a tree does relative to a set of trees  
 on the same data, but that doesn't help much unless I'm sure the  
 trees in question are really capturing the essence of the system.
 
 I'm happy to be pointed to a web site or to a thread I may have  
 missed that answers this exact question.
 
 Thanks very much,
 
 Jeff
 
 --
 Prof. Jeffrey Cardille
 [EMAIL PROTECTED]
 R-help@stat.math.ethz.ch mailing list

Ye (below) has a method to get a nearly unbiased estimate of R^2 from 
recursive partitioning.  In his examples the result was similar to using 
the formula for adjusted R^2 with regression degrees of freedom equal to 
about 3n/4.  You can also use something like 10-fold cross-validation 
repeated 20 times to get a fairly precise and unbiased estimate of R^2.
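
A rough sketch of that repeated cross-validation, assuming an rpart tree for a 
data frame dat with response y (names hypothetical):

  library(rpart)
  cvr2 <- function(dat, reps=20, k=10) {
    press <- replicate(reps, {
      fold <- sample(rep(1:k, length.out=nrow(dat)))
      sum(sapply(1:k, function(i) {
        fit <- rpart(y ~ ., data=dat[fold != i, ])
        sum((dat$y[fold == i] - predict(fit, dat[fold == i, ]))^2)
      }))
    })
    1 - mean(press)/sum((dat$y - mean(dat$y))^2)   # cross-validated R^2
  }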

Frank


@ARTICLE{ye98mea,
   author = {Ye, Jianming},
   year = 1998,
   title = {On measuring and correcting the effects of data mining and model
   selection},
   journal = JASA,
   volume = 93,
   pages = {120-131},
   annote = {generalized degrees of freedom;GDF;effective degrees of
freedom;data mining;model selection;model
uncertainty;overfitting;nonparametric regression;CART;simulation
setup}
}
-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Error in if (!length(fname) || !any(fname == zname)) { :

2007-05-04 Thread Frank E Harrell Jr
hongyuan cao wrote:
 Dear R users,
 
 I tried to fit a cox proportional hazard model to get estimation of 
 stratified survival probability. my R code is as follows:
 
 cph(Surv(time.sur, status.sur)~ strat(colon[,13])+colon[,18] 
 +colon[,20]+colon[,9], surv=TRUE)
 Error in if (!length(fname) || !any(fname == zname)) { : 
 missing value where TRUE/FALSE needed
 Here colon[,13] is the one that I want to stratify and the others are all 
 coefficients. How can I solve this problem?  Thanks a lot!
 
 Grace

The Design package does not like you to have complex variable names like 
that, and in general storing your data in a matrix when you want to 
treat columns as separate variables is not the best approach.  I would 
use something like

S <- with(d, Surv(  ))   # d = data frame
h <- as.data.frame(colon)  # assumes colon is a matrix; make sure it has 
column names
cph(S ~ strat(v1)+v2+v3+v4, data=h)

-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Hmisc curve label size cex

2007-04-28 Thread Frank E Harrell Jr
Brian O'Connor wrote:
 R-Masters,
 
 I need to produce high resolution line plots and place labels on the 
 curves. It seems that cex must be high relative to the other cex 
 values in order to produce sufficiently large  legible tick labels 
 at high resolutions. But high cex values cause the curve labels to 
 become gigantic when using Hmisc. I've struggled and searched the 
 archives, but cannot find a way of controlling the sizes of the curve 
 labels in this situation.
 
 These commands produce the problem on my PC using XP:
 
 
 png("trial.png", width=3000, height=2400, res = 600, pointsize=12 )
 par(ann=F, font.main=1, font.lab=1, font.axis=1, cex=5, cex.main=1, 
 cex.lab=1, cex.axis=1,
 lwd=12, las=1, mar=c(4, 4, 2, 2)   )
 
 x = seq(-2.5, 2.5, length=100)
 
 labcurve( list( One=  list( x,sin(x)), Two=  list( x,cos(x)),
Three=list( x,(x*x)), Four= list( x,exp(x)) ),
keys=c('1','2','3','4'),  keyloc='none', pl=TRUE )
 
 dev.off()
 
 
 Thanks for your time.
 

cex.main, .lab, .axis etc. are relative, so in your case you need to 
specify something like cex.axis=1/5

Not sure why you are using keys of 1-4 when you've already given nice 
labels.  I tried

  labcurve( list( One=  list( x,sin(x)), Two=  list( x,cos(x)),
 Three=list( x,(x*x)), Four= list( x,exp(x)) ),
 pl=TRUE )

and got some nice output after reducing cex.*

Frank

-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Integrating R-programs into larger systems

2007-04-27 Thread Frank E Harrell Jr
Ralf Finne wrote:
 Hi experts!
 
 Does anybody have experience including an R-program 
 as part of a larger system?  In Matlab there is a toolbox
 that converts a m-script into C-code. 
 One application in mind is that I do the model building in R,
 for estimating the risk for cancer based on clinical measurements.
 
 When the model is ready, a small R-program can simulate
 the model to estimate the risk for a new patient. The idea is
 that a doctor gets the measurements for the patient sitting in his
 office.  Then he goes to Internet and types in the test measuremnets
 and gets the estimated risk.
 Look at www.finne.info for an early version to get the idea.
 There I developed the model in Matlab and converted it to Excel.
 Don't use the results!  Much better are available in R!
 There are many more applications that need a higher degree
 of integration.
 
 Thank you in advance.
 
 Ralf Finne
 SYH, University of Applied Sciences
 Vasa Finland

The Design package for R has a version for S-Plus.  In S-Plus you can 
use its Dialog function to automatically create a GUI for getting 
predicted values from a series of fitted models.  We will be working on 
an R version but it will publish the model to a web server.

Frank
-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Not showing dvi with Hmisc latex()

2007-04-26 Thread Frank E Harrell Jr
Duncan Murdoch wrote:
 On 4/26/2007 9:20 PM, Gad Abraham wrote:
 Hi,

 I'm using latex() from Frank Harrell's Hmisc library to produce LaTeX 
 files. By default, it calls xdvi and displays the dvi.

 How can I make xdvi not show? I couldn't find a clue in the extensive 
 documentation.
 
 Unclass the result so it doesn't print as a latex object.  For example,
 
   unclass(latex(1, file="test.tex"))
 $file
 [1] "test.tex"
 
 $style
 character(0)
 
 Alternatively, if you just assign the result you can print it later. 
 It's when you print that the latex'ing happens.
 
 Duncan Murdoch

Just say

w <- latex(object, file='foo.tex')

Frank

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Coercing data types for use in model.frame

2007-04-25 Thread Frank E Harrell Jr
Prof Brian Ripley wrote:
 Moved to R-devel 
 
 What is the 'data class'?  In particular what is its underlying type? 
 And where in model.frame[.default] are you trying to use it (in the 
 formula, data, in ..., etc).
 
 This is an example of where some reproducible code and the error 
 messages would be very helpful indeed.
 
 Brian

Brian,

Sorry - this was one of those too late in the day errors.  The problem 
was in a function called just before model.frame.  model.frame seems to 
work fine with an object of class c('mChoice', 'labelled').  It keeps 
mChoice variables as mChoice.  After model.frame is finished I'll change 
such variables to factors or matrices.

Frank


 
 On Tue, 24 Apr 2007, Frank E Harrell Jr wrote:
 
 In the Hmisc package there is a new data class 'mChoice' for multiple
 choice variables.  There are format and as.numeric methods (the latter
 creates a matrix of dummy variables).  mChoice variables are not allowed
 by model.frame.  Is there a way to specify a conversion function that
 model.frame will use automatically?  I would use as.factor here.
 model.frame does not seem to use as.data.frame.foo for individual 
 variables.

 Thanks
 Frank

 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Biostatistician Opportunities at Vanderbilt

2007-04-25 Thread Frank E Harrell Jr
The Department of Biostatistics at Vanderbilt University's School of 
Medicine has openings for biostatisticians at all levels.  Details and 
application procedures may be found at 
http://biostat.mc.vanderbilt.edu/JobOpenings .  For M.S. and B.S. 
biostatisticians we are especially interested in statisticians 
proficient in R, S-Plus, or Stata.  We have faculty positions available 
at the Assistant, Associate, and Professor levels.

Frank Harrell
Chairman, Dept. of Biostatistics

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Coercing data types for use in model.frame

2007-04-24 Thread Frank E Harrell Jr
In the Hmisc package there is a new data class 'mChoice' for multiple 
choice variables.  There are format and as.numeric methods (the latter 
creates a matrix of dummy variables).  mChoice variables are not allowed 
by model.frame.  Is there a way to specify a conversion function that 
model.frame will use automatically?  I would use as.factor here. 
model.frame does not seem to use as.data.frame.foo for individual variables.

Thanks
Frank
-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] help comparing two median with R

2007-04-18 Thread Frank E Harrell Jr
Thomas Lumley wrote:
 On Tue, 17 Apr 2007, Frank E Harrell Jr wrote:
 
 The points that Thomas and Brian have made are certainly correct, if 
 one is truly interested in testing for differences in medians or 
 means.  But the Wilcoxon test provides a valid test of x > y more 
 generally.  The test is consonant with the Hodges-Lehmann estimator: 
 the median of all possible differences between an X and a Y.

 
 Yes, but there is no ordering of distributions (taken one at a time) 
 that agrees with the Wilcoxon two-sample test, only orderings of pairs 
 of distributions.
 
 The Wilcoxon test provides a test of x > y if it is known a priori that 
 the two distributions are stochastically ordered, but not under weaker 
 assumptions.  Otherwise you can get x > y > z > x. This is in contrast to the 
 t-test, which orders distributions (by their mean) whether or not they 
 are stochastically ordered.
 
 Now, it is not unreasonable to say that the problems are unlikely to 
 occur very often and aren't worth worrying too much about. It does imply 
 that it cannot possibly be true that there is any summary of a single 
 distribution that the Wilcoxon test tests for (and the same is true for 
 other two-sample rank tests, eg the logrank test).
 
 I know Frank knows this, because I gave a talk on it at Vanderbilt, but 
 most people don't know it. (I thought for a long time that the Wilcoxon 
 rank-sum test was a test for the median pairwise mean, which is actually 
 the R-estimator corresponding to the *one*-sample Wilcoxon test).
 
 
 -thomas
 

Thanks for your note Thomas.  I do feel that the problems you have 
rightly listed occur infrequently and that often I only care about two 
groups.  Rank tests generally are good at relatives, not absolutes.  We 
have an efficient test (Wilcoxon) for relative shift but for estimating 
an absolute one-sample quantity (e.g., median) the nonparametric 
estimator is not very efficient.  Ironically there is an exact 
nonparametric confidence interval for the median (unrelated to Wilcoxon) 
but none exists for the mean.

Cheers,
Frank
-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] help comparing two median with R

2007-04-18 Thread Frank E Harrell Jr
[EMAIL PROTECTED] wrote:
 Has anyone proposed using a bootstrap for Pedro's problem?
 
 What about taking a bootstrap sample from x, a bootstrap sample from y, take
 the difference in the medians for these two bootstrap samples, repeat the
 process 1,000 times and calculate the 95th percentiles of the 1,000
 computed differences?  You would get a CI on the difference between the
 medians for these two groups, with which you could determine whether the
 difference was greater/less than zero.  Too crude?
 
 Regards,
-Cody

As hinted at by Brian Ripley, the following code will approximate that. 
  It gets the nonparametric confidence interval for the median and 
solves for the variance that would give the same confidence interval 
width if normality of the median held.

  g <- function(y) {
 y <- sort(y[!is.na(y)])
 n <- length(y)
 if(n < 4) return(c(median=median(y), q1=NA, q3=NA, variance=NA))
 qu <- quantile(y, c(.5,.25,.75))
 names(qu) <- NULL
 r <- pmin(qbinom(c(.025,.975), n, .5) + 1, n)  ## Exact 0.95 C.L.
 w <- y[r[2]] - y[r[1]]  ## Width of C.L.
 var.med <- ((w/1.96)^2)/4  ## Approximate variance of median
 c(median=qu[1], q1=qu[2], q3=qu[3], variance=var.med)
   }

Run g separately by group, add the two variances, and take the square 
root to approximate the standard error of the difference in medians and get a 
confidence interval.
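
For instance, with two samples x and y and the g function above, something like:

  gx <- g(x); gy <- g(y)
  dmed <- gx["median"] - gy["median"]
  se   <- sqrt(gx["variance"] + gy["variance"])
  dmed + c(-1.96, 1.96)*se   # approximate 0.95 CI for the difference in medians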

Frank
 
 
 
 

  Frank E Harrell Jr [EMAIL PROTECTED]
  To: Thomas Lumley [EMAIL PROTECTED]
  Cc: r-help@stat.math.ethz.ch
  Subject: Re: [R] help comparing two median with R
  Date: 04/18/2007 05:02 AM





 
 
 
 
 Thomas Lumley wrote:
 On Tue, 17 Apr 2007, Frank E Harrell Jr wrote:

 The points that Thomas and Brian have made are certainly correct, if
 one is truly interested in testing for differences in medians or
 means.  But the Wilcoxon test provides a valid test of x > y more
 generally.  The test is consonant with the Hodges-Lehmann estimator:
 the median of all possible differences between an X and a Y.

 Yes, but there is no ordering of distributions (taken one at a time)
 that agrees with the Wilcoxon two-sample test, only orderings of pairs
 of distributions.

 The Wilcoxon test provides a test of x > y if it is known a priori that
 the two distributions are stochastically ordered, but not under weaker
 assumptions.  Otherwise you can get x > y > z > x. This is in contrast to the
 t-test, which orders distributions (by their mean) whether or not they
 are stochastically ordered.

 Now, it is not unreasonable to say that the problems are unlikely to
 occur very often and aren't worth worrying too much about. It does imply
 that it cannot possibly be true that there is any summary of a single
 distribution that the Wilcoxon test tests for (and the same is true for
 other two-sample rank tests, eg the logrank test).

 I know Frank knows this, because I gave a talk on it at Vanderbilt, but
 most people don't know it. (I thought for a long time that the Wilcoxon
 rank-sum test was a test for the median pairwise mean, which is actually
 the R-estimator corresponding to the *one*-sample Wilcoxon test).


 -thomas

 
 Thanks for your note Thomas.  I do feel that the problems you have
 rightly listed occur infrequently and that often I only care about two
 groups.  Rank tests generally are good at relatives, not absolutes.  We
 have an efficient test (Wilcoxon) for relative shift but for estimating
 an absolute one-sample quantity (e.g., median) the nonparametric
 estimator is not very efficient.  Ironically there is an exact
 nonparametric confidence interval for the median (unrelated to Wilcoxon)
 but none exists for the mean.
 
 Cheers,
 Frank
 --
 Frank E Harrell Jr   Professor and Chair   School of Medicine
   Department of Biostatistics   Vanderbilt University

Re: [R] help comparing two median with R

2007-04-17 Thread Frank E Harrell Jr
Thomas Lumley wrote:
 On Tue, 17 Apr 2007, Robert McFadden wrote:
 
 -Original Message-
 From: [EMAIL PROTECTED]
 [mailto:[EMAIL PROTECTED] On Behalf Of Jim Lemon
 Sent: Tuesday, April 17, 2007 12:37 PM
 To: Pedro A Reche
 Cc: r-help@stat.math.ethz.ch
 Subject: Re: [R] help comparing two median with R

 Pedro A Reche wrote:
 Dear R users,
 I am new to R and  I would like to ask your help with the following
 topic. I have three sets of numeral data, 2 sets are paired and a
 third is independent of the other two. For each of these sets I have
 obtained their basic statistics (mean, median, stdv, range ...).
 Now I want to compare if these sets differ. I could compare
 the mean
 doing a basic T test . However, I was looking for a test to compare
 the medians using R.   If that is possible I would love to
 hear the
 specifics.
 Hi Pedro,
 You can use the Mann-Whitney test (wilcox with two
 samples), but you would have to check that the second and
 third moments of the variable distributions were the same, I think.

 Jim
 Use Mann-Whitney U test, but remember about 2 assumptions:
 1. samples come from a continuous distribution (there are no tied
 observations)
 2. distributions are identical in shape. It's very similar to t-test but
 Mann-Whitney U test is not as affected by violation of the homogeneity of
 variance assumption as t-test is.

 
 This turns out not to be quite correct.
 
 If the two distributions differ only by a location shift then the 
 hypothesis that the shift is zero is equivalent to the medians being the 
 same (or the means, or the 3.14159th percentile), and the Mann-Whitney U 
 test will test this hypothesis. Otherwise the Mann-Whitney U test does not 
 test for equal medians.
 
 The assumption that the distributions are continuous is for convenience -- 
 it makes the distribution of the test statistic easier to calculate and 
 otherwise R uses a approximation.  The assumption of a location shift is 
 critical -- otherwise it is easy to construct three data sets x,y,z so 
 that the Mann-Whitney U test thinks x is larger than y, y is larger than z 
 and z is larger than x (Google for Efron Dice). That is, the Mann-Whitney 
 U test cannot be a test for any location statistic.
 
 There actually is an exact test for the median that does not assume a 
 location shift:  dichotomize your data at the pooled median to get a 2x2 
 table of above/below median by group, and do Fisher's exact test on the 
 table.  This is almost never useful (because it doesn't come with an 
 interval estimate), but is interesting because it (and the generalizations 
 to other quantiles) is the only exactly distribution-free location test 
 that does not have the 'non-transitivity' problem of the Mann-Whitney U 
 test.  I believe this median test is attributed to Mood, but I have not 
 seen the primary source.
 
   -thomas
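
A minimal sketch of the median test described above, assuming numeric vectors 
x and y:

  med <- median(c(x, y))                              # pooled median
  grp <- rep(c("x", "y"), c(length(x), length(y)))
  fisher.test(table(grp, c(x, y) > med))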

The Mood test is so inefficient that its use is no longer recommended:

@Article{fri00sho,
   author =   {Freidlin, Boris and Gastwirth, Joseph L.},
   title ={Should the median test be retired from 
general use?},
   journal =  American Statistician,
   year = 2000,
   volume =   54,
   number =   3,
   pages ={161-164},
   annote =   {inefficiency of Mood median test}
}

The points that Thomas and Brian have made are certainly correct, if one 
is truly interested in testing for differences in medians or means.  But 
the Wilcoxon test provides a valid test of x > y more generally.  The 
test is consonant with the Hodges-Lehmann estimator: the median of all 
possible differences between an X and a Y.
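
For two samples x and y, wilcox.test reports this estimator along with an 
interval (a small illustration):

  w <- wilcox.test(x, y, conf.int=TRUE)
  w$estimate   # Hodges-Lehmann: median of all pairwise x - y differences
  w$conf.int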

Frank


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] read.spss (package foreign) and SPSS 15.0 files

2007-04-17 Thread Frank E Harrell Jr
John Kane wrote:
 --- Charilaos Skiadas [EMAIL PROTECTED] wrote:
 
 On Apr 16, 2007, at 10:41 AM, John Kane wrote:
 --- Charilaos Skiadas [EMAIL PROTECTED] wrote:
 It is not an export option, it is a save as
 option. I don't have
 a 14 to check, but on a 15 I just go to File -
 Save
 As, and change
 the Save as type field to Comma Delimited (*.csv). (I suppose
 (.*.csv). (I suppose
 tab delimited would be another option). Then
 there
 are two check-
 boxes below the window that allow a bit further
 customizing, one of
 them is about using value labels where defined
 instead of data values.
 I'm now back on a machine with SPSS 14.  No csv
 option
 that I can see.  Perhaps an enhancement to v15.
 I don't have a 14, but I did check a 13 today and
 you are correct, no  
 csv option is there, which in my opinion is quite
 unacceptable for a  
 statistical package that is on its 13/14th version.
 But there was an  
 option for Excel 97 and ..., and that seemed to
 allow using value  
 labels instead of the values (again you have to
 check the  
 corresponding box). So perhaps that would be an
 option.
 
 Not my problem at the moment but a colleague pointed
 out that he had 750+ variables. Excel handles 256. 
 
 Actually my problem is now a SAS one where I can get a
 clean csv export but lose the variable labels.  My
 crude workaround was just to do a proc contents and
 cut-and-paste the results.  A  pain but it worked.  
 
 I've got to figure out why I cannot get Hmisc to work!

And note that Hmisc has other SAS import options besides sas.get that 
keep labels.  Some of them require you to run PROC CONTENTS CNTLOUT= to 
save metadata in a dataset.

Frank

 
 
 Thanks for the suggestion.
 
 __
 R-help@stat.math.ethz.ch mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.
 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Nonparametric Effect size indices

2007-04-13 Thread Frank E Harrell Jr
Chuck Cleland wrote:
 Martin Plöderl wrote:
 Hello!

 For comparing two non-normally distributed samples, Leech (2002) suggested
 to report nonparametric effect size indices, such as Vargha & Delaney's A or
 Cliff's d. I tried to search the R-sites, but could not find related
 procedures or packages that include nonparametric effect sizes. 
 Thank you for your help!

 Citation: Leech (2002). A call for greater use of nonparametric statistics.
 Paper presented at the Annual Meeting of the Mid-South Educational Research
 Association, Chattanooga, TN, November 6-8.

Please beware.  That literature is just giving new names to much older 
concepts.  Look at Mann-Whitney, Wilcoxon, Kendall, Somers for better 
citations.  And Cohen's d in the pdf below should just be called a z-score.

Frank Harrell

 
   Based on the description of Cliff's d in
 http://www.florida-air.org/romano06.pdf, you could do something like the
 following:
 
 g1 <- c(1,1,2,2,2,3,3,3,4,5)
 g2 <- c(1,2,3,4,4,5)
 
 # Dominance matrix
 sign(outer(g1, g2, FUN="-"))
   [,1] [,2] [,3] [,4] [,5] [,6]
  [1,]0   -1   -1   -1   -1   -1
  [2,]0   -1   -1   -1   -1   -1
  [3,]10   -1   -1   -1   -1
  [4,]10   -1   -1   -1   -1
  [5,]10   -1   -1   -1   -1
  [6,]110   -1   -1   -1
  [7,]110   -1   -1   -1
  [8,]110   -1   -1   -1
  [9,]11100   -1
 [10,]111110
 
 mean(rowMeans(sign(outer(g1, g2, FUN="-"))))
 [1] -0.25
 
   If you can point us to a description of Vargha & Delaney's A, someone
 can likely suggest a way of obtaining that too.
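
For what it's worth, Vargha & Delaney's A is P(X > Y) + 0.5*P(X = Y), so a 
sketch with the same vectors would be:

  A <- (sum(outer(g1, g2, ">")) + 0.5*sum(outer(g1, g2, "=="))) /
       (length(g1) * length(g2))
  A   # equals (Cliff's d + 1)/2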
 
 Regards, 

 Martin Plöderl PhD
 Suicide Prevention Research Program, Institute of Public Health
 Paracelsus Private Medical University
 Dept. of Suicide Prevention, University Clinic of Psychiatry and
 Psychotherapy
 Christian - Doppler - Klinik
 Ignaz Harrerstrasse 79
 A-5020 Salzburg
 AUSTRIA
 Phone: +43-662-4483-4345
 Fax: +43-662-4483-4344
 E-mail: [EMAIL PROTECTED]

 __
 [EMAIL PROTECTED] mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code. 
 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
[EMAIL PROTECTED] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] sas.get problem

2007-04-11 Thread Frank E Harrell Jr
,
 format.library = F:/sas,
 2: \C:/Program not found
 -
 I really don't see why the sagprog does not work
 unless  an early error is falling through.

 Thanks for all the help

 __
 R-help@stat.math.ethz.ch mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained,
 reproducible code.





 
 __
 R-help@stat.math.ethz.ch mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.
 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Reasons to Use R

2007-04-10 Thread Frank E Harrell Jr
Taylor, Z Todd wrote:
 On Monday, April 09, 2007 3:23 PM, someone named Wilfred wrote:
 
 So what's the big deal about S using files instead of memory
 like R. I don't get the point. Isn't there enough swap space
 for S? (Who cares anyway: it works, doesn't it?) Or are there
 any problems with S and large datasets? I don't get it. You
 use them, Greg. So you might discuss that issue.
 
 S's one-to-one correspondence between S objects and filesystem
 objects is the single remaining reason I haven't completely
 converted over to R.  With S I can manage my objects via
 makefiles.  Corrections to raw data or changes to analysis
 scripts get applied to all objects in the project (and there
 are often thousands of them) by simply typing 'make'.  That
 includes everything right down to the graphics that will go
 in the report.
 
 How do people live without that?

Personally I'd rather have R's save( ) and load( ).

Frank

 
 --Todd


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] read.spss (package foreign) and SPSS 15.0 files

2007-04-06 Thread Frank E Harrell Jr
Charilaos Skiadas wrote:
 On Apr 6, 2007, at 12:32 PM, John Kane wrote:
 
 I have simply moved to exporting the SPSS file to a
 delimited file and loading it. Unfortunately I'm
 losing all the labelling which can be time-consuming
 to redo. Some of the data has something like 10
 categories for a variable.
 
 I save as csv format all the time, and it offers me a choice to use  
 the labels instead of the corresponding numbers. So you shouldn't  
 have to lose that labelling.
 
 Haris Skiadas
 Department of Mathematics and Computer Science
 Hanover College

That's a different point.  The great advantage of read.spss (and the 
spss.get function in Hmisc that uses it) is that long variable labels 
are supported in addition to variable names.  That's why I like getting 
SPSS or Stata files instead of csv files.  I'm going to enhance csv.get 
in Hmisc to allow a row number to be specified, to contain long variable 
labels.
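
A minimal sketch of the import side (the file name study.sav is just a 
placeholder):

  library(foreign)
  d <- read.spss("study.sav", to.data.frame=TRUE, use.value.labels=TRUE)
  # or, keeping long variable labels as Hmisc label() attributes:
  library(Hmisc)
  d <- spss.get("study.sav")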

Frank

 
 __
 R-help@stat.math.ethz.ch mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.
 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Doing partial-f test for stepwise regression

2007-04-01 Thread Frank E Harrell Jr
Petr Klasterecky wrote:
 And what about to read the help page ?anova ...?
 
  
 When given a sequence of objects, 'anova' tests the models against
   one another in the order specified.
 
 
 Generally you almost never fit a full model (including all possible 
 interactions etc) - no one can interpret such complicated models. Anova 
 gives you a comparison between a broader model (the first argument to 
 anova) and its submodel(s).
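
A small sketch, assuming a data frame dat with response y and candidate 
predictors x1, x2, x3 (names hypothetical):

  full    <- lm(y ~ x1 + x2 + x3, data=dat)
  reduced <- lm(y ~ x1, data=dat)   # drops x2 and x3
  anova(full, reduced)              # partial F test for the dropped terms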

True you might not fit a model with high-order interactions, but the 
full pre-specified model is the only one whose standard errors and test 
statistics work as advertised.

Frank

 
 Petr
 
 [EMAIL PROTECTED] napsal(a):
 Hello all,
 I am trying to figure out an optimal linear model by using stepwise
 regression which requires partial f-test, I did some Googling on the
 Internet and realised that someone seemed to ask the question before:

 Jim Milks [EMAIL PROTECTED] writes: 
 Dear all: 

 I have a regression model that has collinearity problems (between 
 three regressor variables). I need a F-test that will allow me to 
 compare between full (with all variables) and partial models (minus 
 1= variables). The general F-test formula I'm using is: 

 F = {[SS(full model) - SS(reduced model)] / (#variables taken out)} / 
 MSS(full model) 

 Unfortunately, the ANOVA table parses the SS and MSS between the 
 variables and does not give the statistics for the regression model as 
 a whole, otherwise I'd do this by hand. 

 So, really, I have two questions: 1) Can I just add up all the SS and 
 MSS for all the variables to get the model SS and MSS and 2) Are 
 there any functions or packages I can use to calculate the F-statistic? 
 Just use anova(model1, model2). 
 (One potential catch: Make sure that both models are fitted to the same
 data set. Missing values in predictors may interfere.) 
 However, in the answer provided by Mr. Peter Dalgaard,(use
 anova(model1,model2) I could not understand what model1 and model2 are
 supposed to referring to, which one is supposedly to be the full model and
 which one is to be the partial model? Or it does not matter?

 Thanks in advance for help from anyone!

 Regards,
 Anyi Zhu

 __
 R-help@stat.math.ethz.ch mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.

 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] what is the difference between survival analysis and (...)

2007-03-29 Thread Frank E Harrell Jr
Thomas Lumley wrote:
 On Wed, 28 Mar 2007, Frank E Harrell Jr wrote:
 
 Eric Elguero wrote:
 Hi everybody,

 recently I had to teach a course on Cox model, of which I am
 not a specialist, to an audience of medical epidemiologists.
 Not a good idea you might say.. anyway, someone in the
 audience was very hostile. At some point, he said that
 Cox model was useless, since all you have to do is count
 who dies and who survives, divide by the sample sizes
 and compute a relative risk, and if there was significant
 censoring, use cumulated follow-up instead of sample
 sizes and that's it!
 I began arguing that in Cox model you could introduce
 several variables, interactions, etc, then I remembered
 of logistic models ;-)
 The only (and poor) argument I could think of was that
 if mr Cox took pains to devise his model, there should
 be some reason...

 That is a very ignorant person, concerning statistical
 efficiency/power/precision and how to handle incomplete follow-up
 (variable follow-up duration).  There are papers in the literature (I
 wish I had them at my fingertips) that go into the efficiency loss of
 just counting events.  If the events are very rare, knowing the time
 doesn't help as much, but the Cox model still can handle censoring
 correctly and that person's approach doesn't.

 
 Certainly just counting the events is inefficient -- the simplest 
 example would be studies of some advanced cancers where nearly everyone 
 dies during followup, so that there is little or no censoring but simple 
 counts are completely uninformative.
 
 It's relatively hard to come up with an example where using the 
 total-time-on-test (rather than sample size) as a denominator is much 
 worse than the Cox model, though. You need the baseline hazard to vary a 
 lot over time and the censoring patterns to be quite different in the 
 groups, but proportional hazards to still hold.
 
 I think the advantages of the Cox model over a reasonably sensible 
 person-time analysis are real, but not dramatic -- it would be hard to 
 find a data set that would convince the sort of person who would make 
 that sort of claim.
 
 I would argue that computational convenience on the one hand, and the 
 ability to exercise lots of nice mathematical tools on the other hand 
 have also contributed to the continuing popularity of the Cox model.
 
 
 -thomas
 
 Thomas LumleyAssoc. Professor, Biostatistics
 [EMAIL PROTECTED]University of Washington, Seattle
 
 
 

Nicely put Thomas.  I have seen examples from surgical research where 
the hazard function is bathtub shaped and the epidemiologist's use of 
the exponential distribution is very problematic.  I have also seen 
examples in acute illness and medical treatment where time until death 
is important even with only 30-day follow-up.
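
A minimal sketch (simulated data) of the comparison under discussion: a crude 
person-time rate ratio versus the Cox model hazard ratio.  Here the hazard is 
constant and proportional hazards holds, so the two agree closely even with 
heavy censoring:

library(survival)
set.seed(2)
n     <- 2000
grp   <- rep(0:1, each = n/2)
time  <- rexp(n, rate = 0.1 * exp(log(2) * grp))   # true hazard ratio = 2
cens  <- runif(n, 0, 5)                            # heavy censoring
event <- as.numeric(time <= cens)
obs   <- pmin(time, cens)

rates <- tapply(event, grp, sum) / tapply(obs, grp, sum)
rates[2] / rates[1]                                # person-time rate ratio

exp(coef(coxph(Surv(obs, event) ~ grp)))           # Cox hazard ratio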

Frank

-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] what is the difference between survival analysis and (...)

2007-03-28 Thread Frank E Harrell Jr
Eric Elguero wrote:
 Hi everybody,
 
 recently I had to teach a course on Cox model, of which I am
 not a specialist, to an audience of medical epidemiologists.
 Not a good idea you might say.. anyway, someone in the
 audience was very hostile. At some point, he said that
 Cox model was useless, since all you have to do is count
 who dies and who survives, divide by the sample sizes
 and compute a relative risk, and if there was significant
 censoring, use cumulated follow-up instead of sample
 sizes and that's it!
 I began arguing that in Cox model you could introduce
 several variables, interactions, etc, then I remembered
 of logistic models ;-)
 The only (and poor) argument I could think of was that
 if Mr Cox took pains to devise his model, there should
 be some reason...

That is a very ignorant person, concerning statistical 
efficiency/power/precision and how to handle incomplete follow-up 
(variable follow-up duration).  There are papers in the literature (I 
wish I had them at my fingertips) that go into the efficiency loss of 
just counting events.  If the events are very rare, knowing the time 
doesn't help as much, but the Cox model still can handle censoring 
correctly and that person's approach doesn't.

Frank

 
 but the story doesn't end here. When I came back to my office,
 I tried these two methods on a couple of data sets, and true,
 crude RRs are very close to those coming from Cox model.
 
 hence this question: could someone provide me with a
 dataset (preferably real) where there is a striking
 difference between estimated RRs and/or between
 P-values? and of course I am interested in theoretical
 arguments and references.
 
 sorry that this question has nothing to do with R
 and thank you in advance for your leniency.
 
 Eric Elguero
 GEMI-UMR 2724 IRD-CNRS,
 Équipe Évolution des Systèmes Symbiotiques
 911 avenue Agropolis, BP 64501,
 34394 Montpellier cedex 5 FRANCE
 
 __
 R-help@stat.math.ethz.ch mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.
 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] weight factor in somers2 function

2007-03-28 Thread Frank E Harrell Jr
[EMAIL PROTECTED] wrote:
 Hi!
 I'm trying to calculate the C index (concordance probability) with the
 somers2 function (library Hmisc). I'm interested in including the
 sampling effort as a weight factor when evaluating model predictions
 against real data. I have some questions about that: first of all, I'm not
 really sure whether I can include sampling effort as a weight factor. Since
 the weight factor should be a numeric vector of observation weights (usually
 frequencies), I would expect that sampling effort could be a surrogate for
 the frequency count of the number of subjects (i.e. frequency of
 observation). However, when I use sampling effort as a weight factor, I
 get a C index larger than one, which I know is statistically wrong.
 If these values were frequencies of observation, what is working
 incorrectly? What should the characteristics of the weight vector be, and
 what exactly can be included as a weight factor?
 Thank you very much!

Send me the smallest artificial example you can construct and I'll work 
on it.
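
In the meantime, a minimal sketch (simulated data) of the intended use of the 
weights argument, with weights as frequency counts of subjects; used this way 
the C index stays within [0, 1]:

library(Hmisc)
set.seed(3)
p <- runif(50)                        # predicted probabilities
y <- rbinom(50, 1, p)                 # observed binary outcome
w <- sample(1:5, 50, replace = TRUE)  # frequency weights (subject counts)

somers2(p, y)                         # unweighted C index and Dxy
somers2(p, y, weights = w)            # frequency-weighted version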

Frank


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] how to get lsmeans?

2007-03-22 Thread Frank E Harrell Jr
John Fox wrote:
 Dear Brian et al.,
 
 My apologies for chiming in late: It's been a busy day.
 
 First some general comments on least-squares means and effect displays.
... much good stuff deleted ...

John - the one thing I didn't get from your post is a motivation to do 
all that as opposed to easy-to-explain predicted values.  I would 
appreciate your thoughts.

Thanks
Frank

-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] how to get lsmeans?

2007-03-22 Thread Frank E Harrell Jr
John Fox wrote:
 Dear Frank,
 
 The object is to focus on a high-order term of the model, holding other
 terms constant at typical values; in the case of a factor, a typical
 value is ambiguous, so an average is taken over the levels of the factor.
 If the factor is, e.g., gender, one could produce an estimate of the average
 fitted value for a population composed equally of men and women, or of a
 population composed of men and women in proportion to their distribution in
 the data. Otherwise, one would have to produce separate sets of fitted
 values for men and women, with the number of such sets increasing as the
 number of levels of the factors held constant increase. On the scale of the
 linear predictor, these sets would differ only by constants.
 
 Regards,
  John

Makes sense.  I just set other factors to the mode, and if it is 
important to see estimates for other categories, I give estimates for 
each factor level.  If I want to uncondition on a variable (not often) I 
take the proper weighted average of predicted values.
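
A minimal sketch (simulated data, hypothetical names) of both approaches: 
predicting at the modal level of a factor, and unconditioning on it by a 
weighted average of predictions over its observed distribution:

set.seed(4)
d <- data.frame(x   = rnorm(120),
                sex = factor(sample(c("F", "M"), 120, replace = TRUE,
                                    prob = c(.6, .4))))
d$y <- 1 + 2*d$x + 0.5*(d$sex == "M") + rnorm(120)
fit <- lm(y ~ x + sex, data = d)

## prediction at x = 1 with sex held at its modal level
mode.sex <- names(which.max(table(d$sex)))
predict(fit, data.frame(x = 1, sex = factor(mode.sex, levels = levels(d$sex))))

## unconditioning on sex: weight the level-specific predictions by the
## observed proportions
p <- predict(fit, data.frame(x = 1,
                             sex = factor(levels(d$sex), levels = levels(d$sex))))
sum(p * prop.table(table(d$sex)))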

Cheers
Frank

 
 
 John Fox
 Department of Sociology
 McMaster University
 Hamilton, Ontario
 Canada L8S 4M4
 905-525-9140x23604
 http://socserv.mcmaster.ca/jfox 
  
 
 -Original Message-
 From: Frank E Harrell Jr [mailto:[EMAIL PROTECTED] 
 Sent: Thursday, March 22, 2007 8:41 AM
 To: John Fox
 Cc: 'Prof Brian Ripley'; 'r-help'; 'Chuck Cleland'; 'Liaw, Andy'
 Subject: Re: [R] how to get lsmeans?

 John Fox wrote:
 Dear Brian et al.,

 My apologies for chiming in late: It's been a busy day.

 First some general comments on least-squares means and 
 effect displays.
 ... much good stuff deleted ...

 John - the one thing I didn't get from your post is a 
 motivation to do all that as opposed to easy-to-explain 
 predicted values.  I would appreciate your thoughts.

 Thanks
 Frank

 -- 
 Frank E Harrell Jr   Professor and Chair   School of Medicine
   Department of Biostatistics   
 Vanderbilt University
 
 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] how to get lsmeans?

2007-03-21 Thread Frank E Harrell Jr
 do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.
 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Stepwise Logistic Regression

2007-03-21 Thread Frank E Harrell Jr
Sergio Della Franca wrote:
 Dear R-Helpers,
 
 I'd like to perform a Logistic Regression with a Stepwise method.
 
 
 Can you tell me how I can proceed to develop this procedure?
 
 
 Thank you in advance.
 
 
 Sergio Della Franca.

If the number of events is not incredibly large, you can get almost the 
same result with the following code :-)

candidates <- setdiff(names(mydataframe), 'y')
p <- length(candidates)
sample(candidates, sample(round(p/4):round(p/2), 1))
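
For reference, a minimal sketch (simulated data) of what is usually meant, 
step() applied to a glm() fit; the code above is a pointed joke about how 
close stepwise selection is to picking variables at random:

set.seed(5)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- rbinom(n, 1, plogis(-0.5 + 1.2*d$x1))

full <- glm(y ~ x1 + x2 + x3, data = d, family = binomial)
step(full, direction = "both", trace = 0)   # AIC-based stepwise selection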

-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] R and clinical studies

2007-03-20 Thread Frank E Harrell Jr
[EMAIL PROTECTED] wrote:
 Thank you to all those that responded to Delphine's original post on R and
 clinical studies.  They have provided much food for thought.
 
 I had a couple of follow up questions/comments.  Andrew is very correct in
 pointing out that there are classes and workshops available for R.  It's my
 understanding that there are even commercial versions of R that now provide
 formal commercial-style courses.  And at any rate, the money saved by
 potentially avoiding pricey software could certainly justify any training
 expense in time or money  - this assumes of course that the pricey software
 could be dispensed with (I suspect that would take considerable time at my
 current company as so many legacy projects have been done in proprietary
 software).  I still think that R provides less 'hand-holding' and requires
 more initiative (which may be more or less present on a per
 programmer/statistician basis).
 
 I guess one could always integrate R/Splus in with SAS, as Terry's group
 has done at Mayo - I will probably do this at least as a start.  I have a
 few concerns with regards to this approach (these may be needless concerns,
 but I will venture expressing them anyway).  First, I'm worried about the
 possibility of compatibility concerns (will anyone be worried about a SAS
 dataset read into R or vice-versa?).  Second, I would prefer focusing all
 my learning on one package if possible.  I actually have more experience
 with SAS (as do others in my group), and if the switch to R is to be made I
 would like to make that switch as complete as possible.   This would also
 avoid requiring new hires to know both languages.  Third, if SAS is to be
 kept around, it defeats one of the main advantages of having open source
 code in the first place (R is wonderfully free!).  Like Mayo, Baylor Health
 (my previous employer) used both Splus and SAS.  I was warned that data
 manipulation would be much more difficult in R/Splus than it was in SAS.
 To be honest, and I say this humbly realizing that most posters to this
 list have much more experience than I, I haven't found data manipulation to
 be that much more difficult in R/Splus (at least as I have gained
 experience in R/Splus).   I can think of two exceptions (1) large datasets
 and (2) SAS seems to play nicer with MS products (e.g. PROC IMPORT seemed
 to read in messy Excel spreadsheets better than importData in Splus).  Is
 it possible (and I again say this with MUCH humility) that the perceived
 advantages of SAS with regards to data manipulation may be due in part to
 some users only using R/Splus for stat modeling and graphics (thus never
 becoming familiar with the data manipulation capabilities of R/Splus) or to
 the reluctance of SAS-trained individuals and companies to make the
 complete switch?

You are exactly correct on this point.  Some graduate programs only 
teach students how to use R/S-Plus for modeling and graphics.  R/S-Plus 
are wonderful for data manipulation - more powerful than SAS but not 
easy to learn (plus in R there are sometimes too many ways to do 
something; new users get lost - e.g. the reshape and reShape functions 
and the reshape package). 
http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf has many 
examples of complex data manipulation as do some web sites.  We do 
analysis for pharmaceutical companies with 100% of the data manipulation 
done in R after importing say 50 SAS datasets into R.  Doing tasks such 
as finding a lab value measured the closest in time to some event is 
much more elegant in R/S-Plus than in SAS.
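
For example, a minimal sketch (simulated data, hypothetical names) of that 
closest-in-time task in base R:

set.seed(6)
labs   <- data.frame(id  = rep(1:3, each = 4),
                     day = c(1, 5, 9, 14, 2, 6, 11, 15, 3, 7, 10, 16),
                     val = round(rnorm(12, 100, 10)))
events <- data.frame(id = 1:3, eventday = c(8, 4, 12))

m <- merge(labs, events, by = "id")
closest <- do.call(rbind, lapply(split(m, m$id), function(d)
  d[which.min(abs(d$day - d$eventday)), ]))   # one row per subject
closest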

Frank

 
 Tony, the story about the famous software and the certain operating
 system at the large company was priceless.
 
 In closing, I should mention that in all posts I am speaking for myself and
 not for Edwards LifeSciences.
 
 Regards,
 -Cody
 
 __
 R-help@stat.math.ethz.ch mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.
 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] R and clinical studies

2007-03-16 Thread Frank E Harrell Jr
Delphine Fontaine wrote:
 Thanks for your answer which was very helpful. I have another question:
 
 I have read in this document  
 (http://cran.r-project.org/doc/manuals/R-intro.pdf) that most of the  
 programs written in R are ephemeral and that new releases are not  
 always compatible with previous releases. What I would like to know is  
 if R functions are already validated and, if not, what we should do to  
 validate an R function?
 

In the sense in which most persons use the term 'validate', it means to 
show with one or more datasets that the function is capable of producing 
the right answer.  It doesn't mean that it produces the right answer for 
every dataset although we hope it does.  [As an aside, most errors are 
in the data manipulation phase, not in the analysis phase.]  So I think 
that instead of validating functions we should spend more effort on 
validating analyses [and validating analysis file derivation].  Pivotal 
analyses can be re-done a variety of ways, in R or in separate 
programmable packages such as Stata.

-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] ols Error : missing value where TRUE/FALSE needed

2007-03-14 Thread Frank E Harrell Jr
Charles Evans wrote:
 I have installed Hmisc and Design.  When I use ols, I get the  
 following error message:
 
 Error in if (!length(fname) || !any(fname == zname)) { :
   missing value where TRUE/FALSE needed
 
 The model that I am running is:
 
   ecools <- ols(eco$exp ~ eco$age + eco$own + eco$inc + inc2, x=TRUE)

ecools <- ols(exp ~ age + own + inc + inc2, data=eco, x=TRUE)

Watch out for variables named exp but probably OK.
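
If the end goal is robust standard errors, one route is robcov() in the Design 
package (a sketch, assuming that is what is wanted; the fit must store both the 
design matrix and the response):

ecools <- ols(exp ~ age + own + inc + inc2, data=eco, x=TRUE, y=TRUE)
robcov(ecools)   # Huber-White (sandwich) covariance estimates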

Frank Harrell

 
 I have tried several other combinations of arguments that take TRUE/ 
 FALSE values, but no luck.
 
 Ultimately, I am trying to calculate robust standard errors.
 
 Any help would be appreciated.
 
 Charles Evans
 Executive Director
 Free Curricula Center
 
 
 
   [[alternative HTML version deleted]]
 
 __
 R-help@stat.math.ethz.ch mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.
 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] R and SAS proc format

2007-03-07 Thread Frank E Harrell Jr
Ulrike Grömping wrote:
 
 
 The down side to R's factor solution: 
 The numerical values of factors are always 1 to number of levels. Thus, it
 can be tough and requires great care to work with studies that have both
 numerical values different from this and value labels. This situation is
 currently not well-supported by R.
 
 Regards, Ulrike
 
 P.S.: I fully agree with Frank regarding the annoyance one sometimes
 encounters with formats in SAS! 
 
   You can add an attribute to a variable.  In the sas.get function in the
   Hmisc package for example, when importing SAS variables that have PROC
   FORMAT value labels, an attribute 'sas.codes' keeps the original codes;
   these can be retrieved using sas.codes(variable name).  This could be
   done outside the SAS import context also.
 
   Frank
   --
   Frank E Harrell Jr   Professor and Chair   School of Medicine
 Department of Biostatistics   Vanderbilt University
 
 Frank,
 
 are these attributes preserved when merging or subsetting a data frame?
 Are they used in R packages other than Hmisc and Design (e.g. in a 
 simple table request)?

no; would need to add functions like those that are used by the Hmisc 
label or impute functions.  And they are not used outside Hmisc/Design. 
  In fact I have little need for them, as I always find the final labels 
 to be the key to analysis.

 
 If this is the case, my wishlist items 8658 and 8659 
 (http://bugs.r-project.org/cgi-bin/R/wishlist?id=8658;user=guest, 
 http://bugs.r-project.org/cgi-bin/R/wishlist?id=8659;user=guest) can be 
 closed.
 Otherwise, I maintain the opinion that there are workarounds but that R 
 is not satisfactorily able to handle this type of data.

R gives the framework for doing this elegantly but the user has an 
overhead of implementing new methods for such attributes.

Cheers
Frank

 
 Regards, Ulrike
 
 
 *--- End of Original Message ---*


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] R and SAS proc format

2007-03-06 Thread Frank E Harrell Jr
lamack lamack wrote:
 Dear all, Is there an R equivalent to SAS's proc format?
 
 Best regards
 
 J. Lamack

Fortunately not.  SAS is one of the few large systems that does not 
implicitly support value labels and that separates label information 
from the database [I can't count the number of times someone has sent me 
a SAS dataset and forgotten to send the PROC FORMAT value labels].  See 
the factor function for information about how R does this.

Frank Harrell

-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] R and SAS proc format

2007-03-06 Thread Frank E Harrell Jr
Ulrike Grömping wrote:
 The down side to R's factor solution: 
 The numerical values of factors are always 1 to number of levels. Thus, it
 can be tough and requires great care to work with studies that have both
 numerical values different from this and value labels. This situation is
 currently not well-supported by R.

You can add an attribute to a variable.  In the sas.get function in the 
Hmisc package for example, when importing SAS variables that have PROC 
FORMAT value labels, an attribute 'sas.codes' keeps the original codes; 
these can be retrieved using sas.codes(variable name).  This could be 
done outside the SAS import context also.
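
A minimal base-R sketch of the same idea outside the SAS import context (the 
attribute name 'codes' below is made up for illustration; it is not the Hmisc 
mechanism itself):

old <- c(2, 1, 3, 3, 2)                 # original numeric codes
grp <- factor(old, levels = 1:3,
              labels = c("low", "medium", "high"))
attr(grp, "codes") <- old               # keep the original codes alongside
grp
attr(grp, "codes")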

Frank

 
 Regards, Ulrike
 
 P.S.: I fully agree with Frank regarding the annoyance one sometimes
 encounters with formats in SAS! 
 
 
 lamack lamack wrote:
 Dear all, Is there an R equivalent to SAS's proc format?

 Best regards

 J. Lamack

 _
 O Windows Live Spaces é seu espaço na internet com fotos (500 por mês),
 blog 
 e agora com rede social http://spaces.live.com/

 __
 R-help@stat.math.ethz.ch mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.


 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] bootstrap

2007-02-28 Thread Frank E Harrell Jr
Indermaur Lukas wrote:
 Hi,
 I would like to evaluate the frequency of the variables within the best 
 selected model by AIC among a set of 12 competing models (I fit them with 
 GLM) with a bootstrap procedure  to get unbiased results. So I would ike to 
 do the ranking of the 12-model-set 10'000 times separately and calculate the 
 frequency of variables of the 10'000 best ranked models. I wrote a script 
 doing the model ranking (with some difficulty) for the pre-defined 
 12-model-set, that works. My model results (one row per model) are stored in 
 a dataframe called mydataframe. How can I implement my script simply into a 
 bootstrap and sum up the frequencies of variables over the 10'000 bootstrapped 
 results?
  
 I am not so good in R and would appreciate any hint.
  
 Regards
 Lukas

I'm not entirely clear on your goals but the bootstrap selection 
frequency is mainly a replay of the significance of variables in the 
full model.  I don't see that it provides new information.  And 
selection frequency is marred by collinearity.
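
For reference, a minimal sketch (simulated data, a small fixed candidate set) 
of the procedure being described: refit the candidates on each bootstrap 
sample, record which variables appear in the AIC-best model, and tabulate the 
frequencies.  As noted above, these frequencies largely restate full-model 
significance and are distorted by collinearity:

set.seed(7)
n <- 100
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- rbinom(n, 1, plogis(0.8*d$x1 - 0.5*d$x2))

forms  <- list(y ~ x1, y ~ x2, y ~ x1 + x2, y ~ x1 + x2 + x3)
B      <- 200
counts <- c(x1 = 0, x2 = 0, x3 = 0)
for (b in seq_len(B)) {
  db   <- d[sample(n, replace = TRUE), ]
  aics <- sapply(forms, function(f) AIC(glm(f, family = binomial, data = db)))
  vars <- all.vars(forms[[which.min(aics)]])[-1]   # predictors in the best model
  counts[vars] <- counts[vars] + 1
}
counts / B   # bootstrap selection frequencies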

Frank

  
  
 °°° 
 Lukas Indermaur, PhD student 
 eawag / Swiss Federal Institute of Aquatic Science and Technology 
 ECO - Department of Aquatic Ecology
 Überlandstrasse 133
 CH-8600 Dübendorf
 Switzerland
  
 Phone: +41 (0) 71 220 38 25
 Fax: +41 (0) 44 823 53 15 
 Email: [EMAIL PROTECTED]
 www.lukasindermaur.ch
 
 __
 R-help@stat.math.ethz.ch mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.
 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] fitting of all possible models

2007-02-27 Thread Frank E Harrell Jr
Indermaur Lukas wrote:
 Hi,
 Fitting all possible models (GLM) with 10 predictors will result in loads of 
 (2^10 - 1) models. I want to do that in order to get the importance of 
 variables (having an unbalanced variable design) by summing up the 
 AIC-weights of models including the same variable, for every variable 
 separately. It's time consuming and annoying to define all possible models by 
 hand. 
  
 Is there a command, or easy solution to let R define the set of all possible 
 models itself? I defined models in the following way to process them with a 
 batch job:
  
 # e.g. model 1
 preference <- formula(Y ~ Lwd + N + Sex + YY)
 
 # e.g. model 2
 preference_heterogeneity <- formula(Y ~ Ri + Lwd + N + Sex + YY)  
 etc.
 etc.
  
  
 I appreciate any hint
 Cheers
 Lukas

If you choose the model from among the 2^10 - 1 having the best AIC, that model 
will be badly biased.  Why look at so many?  Pre-specification of 
models, or fitting full models with penalization, or using data 
reduction (masked to Y) may work better.
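
For the purely mechanical part of the question, a minimal sketch (hypothetical 
predictor names) that builds every non-empty subset formula programmatically; 
the caveats above about choosing among so many models still apply:

preds <- c("Lwd", "N", "Sex", "YY", "Ri")       # hypothetical predictor names
forms <- list()
for (k in seq_along(preds)) {
  cmb <- combn(preds, k)
  for (j in seq_len(ncol(cmb)))
    forms[[length(forms) + 1]] <-
      as.formula(paste("Y ~", paste(cmb[, j], collapse = " + ")))
}
length(forms)   # 2^5 - 1 = 31 candidate formulas
## fits <- lapply(forms, function(f) glm(f, data = mydata, family = binomial))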

Frank

  
  
  
  
  
 °°° 
 Lukas Indermaur, PhD student 
 eawag / Swiss Federal Institute of Aquatic Science and Technology 
 ECO - Department of Aquatic Ecology
 Überlandstrasse 133
 CH-8600 Dübendorf
 Switzerland
  
 Phone: +41 (0) 71 220 38 25
 Fax: +41 (0) 44 823 53 15 
 Email: [EMAIL PROTECTED]
 www.lukasindermaur.ch
 
 __
 R-help@stat.math.ethz.ch mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.
 


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] rpart minimum sample size

2007-02-27 Thread Frank E Harrell Jr
Amy Uhrin wrote:
 Is there an optimal / minimum sample size for attempting to construct a 
 classification tree using /rpart/?
 
 I have 27 seagrass disturbance sites (boat groundings) that have been 
 monitored for a number of years.  The monitoring protocol for each site 
 is identical.  From the monitoring data, I am able to determine the 
 level of recovery that each site has experienced.  Recovery is our 
 categorical dependent variable with values of none, low, medium, high 
 which are based upon percent seagrass regrowth into the injury over 
 time.  I wish to be able to predict the level of recovery of future 
 vessel grounding sites based upon a number of categorical / continuous 
 predictor variables used here including (but not limited to) such 
 parameters as:  sediment grain size, wave exposure, original size 
 (volume) of the injury, injury age, injury location.
 
 When I run /rpart/, the data is split into only two terminal nodes based 
 solely upon values of the original volume of each injury.  No other 
 predictor variables are considered, even though I have included about 
 six of them in the model.  When I remove volume from the model the same 
 thing happens but with injury area - two terminal nodes are formed based 
 upon area values and no other variables appear.  I was hoping that this 
 was a programming issue, me being a newbie and all, but I really think 
 I've got the code right.  Now I am beginning to wonder if my N is too 
 small for this method?
 

In my experience N needs to be around 20,000 to get both good accuracy 
and replicability of patterns if the number of potential predictors is 
not tiny.  In general, the R^2 from rpart is not competitive with that 
from an intelligently fitted regression model.  It's just a difficult 
problem, when relying on a single tree (hence the popularity of random 
forests, bagging, boosting).
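
One thing to check with such a small N: the rpart defaults minsplit = 20 and 
cp = 0.01 by themselves limit a 27-observation tree to very few splits.  A 
minimal sketch (hypothetical data frame and variable names) of relaxing them, 
at the obvious cost of overfitting:

library(rpart)
fit <- rpart(recovery ~ grain + exposure + volume + age + location,
             data = sites, method = "class",
             control = rpart.control(minsplit = 5, cp = 0.001))
printcp(fit)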

Frank
-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] fitting of all possible models

2007-02-27 Thread Frank E Harrell Jr
Indermaur Lukas wrote:
 Hi Frank
 I fitted a set of 12 candidate models and evaluated the importance of 
 variables based on model averaged coefficients and SE (model weights =0.9). 
 Variables in my models were not distributed in equal numbers across all 
 models, so I was not able to evaluate the importance of variables just by 
 summing up the AIC-weights of the models including a specific variable. Now, why 
 so many models to fit: I was curious whether the ranking of the importance of 
 variables is similar when just summing up the AIC-weights over an 
 all-possible-models set and looking at the ordered model-averaged 
 coefficients (ordered by CV = SE/coefficient).
  
 Any hint for me?
 Cheers
 Lukas

I have seen the literature on Bayesian model averaging which uses 
weights from Bayes factors, related to BIC, but not the approach you are 
using.

Frank

 
  
 
 Indermaur Lukas wrote:
 Hi,
 Fitting all possible models (GLM) with 10 predictors will result in loads of 
 (2^10 - 1) models. I want to do that in order to get the importance of 
 variables (having an unbalanced variable design) by summing up the 
 AIC-weights of models including the same variable, for every variable 
 separately. It's time consuming and annoying to define all possible models 
 by hand.

 Is there a command, or easy solution to let R define the set of all possible 
 models itself? I defined models in the following way to process them with a 
 batch job:

 # e.g. model 1
 preference <- formula(Y ~ Lwd + N + Sex + YY)   
 
 # e.g. model 2
 preference_heterogeneity <- formula(Y ~ Ri + Lwd + N + Sex + YY) 
 etc.
 etc.


 I appreciate any hint
 Cheers
 Lukas
 
 If you choose the model from among the 2^10 - 1 having the best AIC, that model
 will be badly biased.  Why look at so many?  Pre-specification of
 models, or fitting full models with penalization, or using data
 reduction (masked to Y) may work better.
 
 Frank
 




 °°°
 Lukas Indermaur, PhD student
 eawag / Swiss Federal Institute of Aquatic Science and Technology
 ECO - Department of Aquatic Ecology
 Überlandstrasse 133
 CH-8600 Dübendorf
 Switzerland

 Phone: +41 (0) 71 220 38 25
 Fax: +41 (0) 44 823 53 15
 Email: [EMAIL PROTECTED]
 www.lukasindermaur.ch

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] fitting of all possible models

2007-02-27 Thread Frank E Harrell Jr
Bert Gunter wrote:
 ... Below
 
 -- Bert 
 
 Bert Gunter
 Genentech Nonclinical Statistics
 South San Francisco, CA 94404
 650-467-7374
 
 
 -Original Message-
 From: [EMAIL PROTECTED]
 [mailto:[EMAIL PROTECTED] On Behalf Of Frank E Harrell Jr
 Sent: Tuesday, February 27, 2007 5:14 AM
 To: Indermaur Lukas
 Cc: r-help@stat.math.ethz.ch
 Subject: Re: [R] fitting of all possible models
 
 Indermaur Lukas wrote:
 Hi,
 Fitting all possible models (GLM) with 10 predictors will result in loads
 of (2^10 - 1) models. I want to do that in order to get the importance of
 variables (having an unbalanced variable design) by summing the up the
 AIC-weights of models including the same variable, for every variable
 separately. It's time consuming and annoying to define all possible models
 by hand. 
  
 Is there a command, or easy solution to let R define the set of all
 possible models itself? I defined models in the following way to process
 them with a batch job:
  
 # e.g. model 1
 preference <- formula(Y ~ Lwd + N + Sex + YY)
 
 # e.g. model 2
 preference_heterogeneity <- formula(Y ~ Ri + Lwd + N + Sex + YY)  
 etc.
 etc.
  
  
 I appreciate any hint
 Cheers
 Lukas
 
 If you choose the model from among the 2^10 - 1 having the best AIC, that model 
 will be badly biased.  Why look at so many?  Pre-specification of 
 models, or fitting full models with penalization, 
 
 --- ...the rub being how much to penalize. My impression from what I've read
 is that, for prediction, it is close to the more [penalization], the better
 the predictor... .  Nature rewards parsimony.
 
 Cheers,
 Bert

Bert,

In my experience nature rewards complexity, if done right.  See Savage's 
antiparsimony principle  -Frank

@Article{gre00whe,
   author =   {Greenland, Sander},
   title ={When should epidemiologic regressions use random coefficients?},
   journal =  Biometrics,
   year = 2000,
   volume =   56,
   pages ={915-921},
   annote =   {Bayesian methods;causal inference;empirical Bayes
estimators;epidemiologic method;hierarchical regression;mixed
models;multilevel modeling;random-coefficient
regression;shrinkage;variance components;use of statistics in
epidemiology is largely primitive;stepwise variable selection on
confounders leaves important confounders uncontrolled;composition
matrix;example with far too many significant predictors with many
regression coefficients absurdly inflated when
overfit;lack of evidence for dietary effects mediated through
constituents;shrinkage instead of variable selection;larger effect on
confidence interval width than on point estimates with variable
selection;uncertainty about variance of random effects is just
uncertainty about prior opinion;estimation of variance is
pointless;instead the analysis shuld be repeated using different
values;if one feels compelled to estimate $\tau^2$, I would recommend
giving it a proper prior concentrated around contextually reasonable
values;claim about ordinary MLE being unbiased is misleading because
it assumes the model is correct and is the only model
entertained;shrinkage towards compositional model;models need to be
complex to capture uncertainty about the relations...an honest
uncertainty assessment requires parameters for all effects that we
know may be present.  This advice is implicit in an antiparsimony
principle often attributed to L. J. Savage 'All models should be as
big as an elephant (see Draper, 1995)'.  See also gus06per.}
}

 
 
 Frank
 
  
  
  
  
  
 °°° 
 Lukas Indermaur, PhD student 
 eawag / Swiss Federal Institute of Aquatic Science and Technology 
 ECO - Department of Aquatic Ecology
 Überlandstrasse 133
 CH-8600 Dübendorf
 Switzerland
  
 Phone: +41 (0) 71 220 38 25
 Fax: +41 (0) 44 823 53 15 
 Email: [EMAIL PROTECTED]
 www.lukasindermaur.ch



-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] bootcov and cph error

2007-02-27 Thread Frank E Harrell Jr
Williams Scott wrote:
 Hi all,
 I am trying to get bootstrap resampled estimates of covariates in a Cox
 model using cph (Design library).
 
 Using the following I get the error:
 
 ddist2.abr <- datadist(data2.abr)
 options(datadist='ddist2.abr') 
 cph1.abr <- cph(Surv(strt3.abr,loc3.abr)~cov.a.abr+cov.b.abr,
 data=data2.abr, x=T, y=T) 
 boot.cph1 <- bootcov(cph1.abr, B=100, coef.reps=TRUE, pr=T)
 1 Error in oosl(f, matxv(X, cof), Y) : not implemented for cph models
 
 Removing coef.reps argument works fine, but I really need the
 coefficients if at all possible. I can't find anything in the help files
 suggesting that I can't use coef.reps in a cph model. Any help
 appreciated.
 
 Cheers
 
 Scott

Sorry it's taken so long to get to this.  The documentation needs to be 
clarified.  Add loglik=FALSE to allow coef.reps=TRUE to work for cph models.
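
Following that advice, the call from the original post becomes (hypothetical 
objects as before):

boot.cph1 <- bootcov(cph1.abr, B=100, coef.reps=TRUE, loglik=FALSE, pr=TRUE)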

Frank

 
 _
 
  
 
 Dr. Scott Williams MD
 
 Peter MacCallum Cancer Centre
 
 Melbourne, Australia
 
 [EMAIL PROTECTED]


-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Double-banger function names: preferences and suggestions

2007-02-25 Thread Frank E Harrell Jr
hadley wickham wrote:
 What do you prefer/recommend for double-banger function names:
 
  1 scale.colour
  2 scale_colour
  3 scaleColour
 
 1 is more R-like, but conflicts with S3.  2 is a modern version of
 number 1, but not many packages use it.  Number 3 is more java-like.
 (I like number 2 best)
 
 Any suggestions?
 
 Thanks,
 
 Hadley

Personal preference is for 3.  I wish I had used that throughout the 
Hmisc and Design packages for non-generics.

Frank

-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

