Re: [R] FW: variable format
Martin Becker wrote: Dear Cory, I am not familiar with SAS, but is this what you are looking for?

divisionTable <- matrix(c(1, 'New England', 2, 'Middle Atlantic',
                          3, 'East North Central', 4, 'West North Central',
                          5, 'South Atlantic', 6, 'East South Central',
                          7, 'West South Central', 8, 'Mountain',
                          9, 'Pacific'), ncol=2, byrow=TRUE)

[Frank:] How about just divisionTable <- c('New England', 'Middle Atlantic', ...) then factor(old, 1:9, divisionTable)? Frank

a <- NULL
a$divisionOld <- c(0,1,2,3,4,5)
a$divisionNew <- as.character(factor(a$divisionOld,
                                     levels=divisionTable[,1],
                                     labels=divisionTable[,2]))
a$divisionNew
[1] NA                   "New England"        "Middle Atlantic"
[4] "East North Central" "West North Central" "South Atlantic"

Kind regards, Martin

Cory Nissen schrieb: Anybody?

From: Cory Nissen
Sent: Tue 9/4/2007 9:30 AM
To: r-help@stat.math.ethz.ch
Subject: variable format

Okay, I want to do something similar to SAS proc format. I usually do this...

a <- NULL
a$divisionOld <- c(1,2,3,4,5)
divisionTable <- matrix(c(1, 'New England', 2, 'Middle Atlantic',
                          3, 'East North Central', 4, 'West North Central',
                          5, 'South Atlantic'), ncol=2, byrow=TRUE)
a$divisionNew[match(a$divisionOld, divisionTable[,1])] <- divisionTable[,2]

But how do I handle the case where...

a$divisionOld <- c(0,1,2,3,4,5)  # no format available for 0; this throws an error

OR

divisionTable <- matrix(c(1, 'New England', 2, 'Middle Atlantic',
                          3, 'East North Central', 4, 'West North Central',
                          5, 'South Atlantic', 6, 'East South Central',
                          7, 'West South Central', 8, 'Mountain',
                          9, 'Pacific'), ncol=2, byrow=TRUE)

There are extra formats available... this throws a warning.

Thanks, Cory

--
Frank E Harrell Jr, Professor and Chair, School of Medicine, Department of Biostatistics, Vanderbilt University
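A minimal sketch of the factor() approach Frank suggests, reusing the labels from this thread; unmatched codes map to NA instead of throwing an error, and unused labels only matter if some level is never observed:

divisionTable <- c('New England', 'Middle Atlantic', 'East North Central',
                   'West North Central', 'South Atlantic', 'East South Central',
                   'West South Central', 'Mountain', 'Pacific')
old <- c(0, 1, 2, 3, 4, 5)    # 0 has no format defined
new <- as.character(factor(old, levels = 1:9, labels = divisionTable))
new                            # 0 becomes NA; extra labels (6-9) cause no warning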
Re: [R] Stratified weighted statistics
Vikas Rawal wrote: I need to compute weighted descriptive statistics (mean, median, variance) for a vector of data disaggregated by a set of index variables. The weights to be used are a vector in the same data frame and have to be disaggregated by the same index variables. Package Hmisc has functions for computing weighted statistics. But both by/aggregate and summarize (in Hmisc) allow a function to have only a single argument. In my case, I need to pass both the variable and the weights as arguments to the function. Could anyone please advise how to do it?

[Frank:] Please look again at the help files for summarize and wtd.mean, which show how to do this. Frank

[Vikas:] As an example, say I have a data.frame giving incomes of persons in different towns along with weights to be given for each sampled person. How will I compute weighted statistics for each town?

Vikas Rawal, JNU, New Delhi.

--
Frank E Harrell Jr, Professor and Chair, School of Medicine, Department of Biostatistics, Vanderbilt University
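A minimal sketch along the lines of the help-file examples Frank points to, assuming a data frame d with columns income, wt, and town (names hypothetical); passing a matrix to summarize lets the function see both the variable and the weights:

library(Hmisc)
with(d, summarize(cbind(income, wt), llist(town),
                  function(y) wtd.mean(y[, 1], weights = y[, 2]),
                  stat.name = 'wtd.income'))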
Re: [R] Excel
Rolf Turner wrote: On 30/08/2007, at 8:49 AM, Greg Snow wrote:

Erich Neuwirth said (Wednesday, August 29, 2007 12:43 PM, to r-help, subject Re: [R] Excel): Excel bashing can be fun but also can be dangerous, because you are making your life harder than necessary.

[Greg Snow:] My experience differs; so far, using Excel (other than as a table layout program) has made my life harder more times than it has made it easier. [Remainder of message deleted.]

[Rolf:] Bravo!!! Very well and very cogently expressed! cheers, Rolf Turner

[Frank:] Yes! In addition I'd like to stress that the Excel model represents non-reproducible research.

Frank Harrell

--
Frank E Harrell Jr, Professor and Chair, School of Medicine, Department of Biostatistics, Vanderbilt University
Re: [R] validate (package Design): error message subscript out of bounds
Wentzel-Larsen, Tore wrote: Thanks, I use Design version 2.1-1 (and, as provided, R 2.5.1 on Windows XP). Sorry I missed the latter. The redundancies in object names were for my own convenience, as part of a larger command file, and the validation of this univariate model was only included to ease comparison with the main multivariate model.

[Frank:] A minor point: adding more variables to the right-hand side makes the model a multivariable model. It is still univariate, as it has only one dependent variable.

[Tore:] I have tried to access Design version 2.0-12, but only managed to access version 2.1-1 of Design in my Windows implementation of R 2.5.1 (the choice of operating system is made by my institution and I am only entitled to use Windows).

[Frank:] My mistake again. I forgot that the Debian repositories for CRAN are behind. I updated to the latest Design in CRAN and found the bug. I will send a separate note with a new version of the function in question for you to source() to override the current function. Frank

Best, Tore

-----Original Message-----
From: Frank E Harrell Jr
Sent: 28 August 2007 03:17
To: Wentzel-Larsen, Tore
Cc: r-help@stat.math.ethz.ch
Subject: Re: [R] validate (package Design): error message subscript out of bounds

Wentzel-Larsen, Tore wrote: Dear R users, I use Windows XP, R 2.5.1 (I have read the posting guide, I have contacted the package maintainer first, it is not homework). In a research project on renal cell carcinoma we want to compute Harrell's c index, with optimism correction, for a multivariate Cox regression and also for some univariate Cox models. For some of these univariate models I have encountered an error message (and no result produced) from the function validate in Frank Harrell's Design package:

Error in Xb(x[, xcol, drop = FALSE], coef, non.slopes, non.slopes.in.x, :
  subscript out of bounds

The following is an artificial example in which I have been able to reproduce this error message (actual data have been changed to preserve confidentiality):

[Frank:] I could not reproduce the error on R 2.5.1 on Linux using version 2.0-12 of Design (you did not provide this information). Your code involved a good deal of extra typing. Here is a streamlined version:

bc <- data.frame(time1 = c(9,24,28,43,58,62,66,107,116,118,123,
  127,129,131,137,138,139,140,148,169,176,179,188,196,210,218,
bc
library(Design)
dd <- with(bc, datadist(bc1, age, adjto.cat='first'))
options(datadist = 'dd')
f <- cph(Surv(time1,status1) ~ bc1, data = bc, x=TRUE, y=TRUE, surv=TRUE)
anova(f)
f
summary(f)
val <- validate(f, B=200, dxy=TRUE)

I don't get much value out of putting the type of an object as part of the object's name, as information within objects defines the object type/class. There is little reason to validate a one-degree-of-freedom model.
Frank

library(Design)
# an example data frame:
frame.bc <- data.frame(time1 = c(9,24,28,43,58,62,66,107,116,118,123,
  127,129,131,137,138,139,140,148,169,176,179,188,196,210,218,
  1,1,1,2,2,3,4,8,23,32,33,34,43,44,48,51,52,54,59,59,60,60,62,
  65,65,68,70,72,73,74,81,84,88,98,99,106,107,115,115,117,119,
  120,122,122,122,122,126,128,130,135,136,136,138,149,151,154,
  157,159,161,164,164,164,166,172,172,176,179,180,183,183,184,
  187,190,197,201,201,203,203,203,209,210,214,219,227,233,4,18,
  49,113,147,1,1,2,2,2,2,2,3,4,6,6,6,6,6,6,6,6,9,9,9,9,9,10,10,
  10,11,12,12,12,13,14,14,17,18,18,19,19,20,20,21,21,21,21,22,23,
  23,24,28,28,29,29,32,34,35,38,38,48,48,52,52,54,54,56,64,67,67,
  69,70,70,72,84,88,90,114,115,140,142,154,171,195),
  status1 = c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
  1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
  1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
  1,1,1,1,1),
  bc1 = factor(c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
  2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
  2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
  2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,
  2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
  2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2),
  labels=c('bc.1','bc.2')),
  age = c(58,68,23,20,50,43,41,69,20,48,19,27,39,20,65,49,70,59,31,43,25,
  61,60,45,34,59,32,58,30,62,26,44,52,29,40,57,33,18,50,50,55,51,38,34,
  69,56,67,38,66,21,48,39,62,62,29,68,66,19,60,39,55,42,24,29,56,61,40,
  52,19,40,33,67,66,51,48,63,60,58,68,60,53,20,45,62,37,38,61,63,43,67
Re: [R] how to include bar values in a barplot?
Donatas G. wrote: On Tuesday 07 August 2007 22:09:52 Donatas G. wrote: How do I include bar values in a barplot (or other R graphics, where this could be applicable)? To make sure I am clear, I am attaching a barplot created with OpenOffice.org which has bar values written on top of each bar. Here is the barplot mentioned above: http://dg.lapas.info/wp-content/barplot-with-values.jpg - it appears that this list does not allow attachments...

[Frank:] That is a TERRIBLE graphic. Can't we finally leave this subject alone?

Frank Harrell
Re: [R] grouping scat1d/rug and plotting to 2 axes
Mike wrote: Hi, I'm wondering if anybody can offer a bit of guidance on how to add a couple of features to a plot. I'm using Frank Harrell's Design library to model some survival data in R (2.3.1, Windows platform). I'm fairly comfortable with the survival modeling in Design, but am still at a frustratingly low level of competence when it comes to creating anything beyond simple plots in R. A simplified version of the model is:

fit <- cph(Surv(survtime, deceased) ~ rcs(smw, 4),
           data = survdata, x=TRUE, y=TRUE, surv=TRUE)

And the basic plot is:

plot(fit, smw=NA, fun=function(x) 1/(1+exp(-x)))   # or plot(fit, smw=NA, fun=plogis)

[Frank:] But what does the logistic model have to do with the Cox model you fitted? You can instead do plot(fit, smw=NA, time=1) to plot the estimated 1-year survival probability.

[Mike:] I know that if I add scat1d(smw) I get a nice jittered rug plot of all values of the predictor smw on the top axis. What I'd like to do, however, is to plot on the bottom axis the values of smw for only those participants who are alive, and then on the top axis plot the values of smw for those who are deceased. I'd appreciate any tips as to how I might approach this.

[Frank:] That isn't so well defined because of variable follow-up time. I would not get very much out of such a plot. Frank

Thanks, Mike Babyak

--
Frank E Harrell Jr, Professor and Chair, School of Medicine, Department of Biostatistics, Vanderbilt University
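Should the two rugs still be wanted despite Frank's caveat, a minimal sketch with Hmisc's scat1d, using the variable names from the post (side = 1 is the bottom axis, side = 3 the top):

plot(fit, smw = NA, time = 1)   # estimated 1-year survival probability
with(subset(survdata, deceased == 0), scat1d(smw, side = 1))  # alive: bottom
with(subset(survdata, deceased == 1), scat1d(smw, side = 3))  # deceased: top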
Re: [R] validate (package Design): error message subscript out of bounds
[validate() output, as printed before the error:]

        1.000
Slope   1.00       0.8075246
D       0.025717   0.0147536
U      -0.002447   0.0005348
Q       0.028163   0.0142188

Error in Xb(x[, xcol, drop = FALSE], coef, non.slopes, non.slopes.in.x, :
  subscript out of bounds

Any help/suggestions will be highly appreciated.

Sincerely,
Tore Wentzel-Larsen, statistician
Centre for Clinical Research
Armauer Hansen House
Haukeland University Hospital
N-5021 Bergen
tel +47 55 97 55 39 (work), fax +47 55 97 60 88 (work)
email [EMAIL PROTECTED]

--
Frank E Harrell Jr, Professor and Chair, School of Medicine, Department of Biostatistics, Vanderbilt University
[R] Random number expert needed
Colleagues: An agency contacted me that has need of a true expert in pseudo-random number generation who can perform an audit of a system currently in use. If you live within, say, 400 miles of Nashville, TN and have in-depth expertise, please contact me and I will put you in touch with the agency. Thanks, Frank

--
Frank E Harrell Jr, Professor and Chair, School of Medicine, Department of Biostatistics, Vanderbilt University
Re: [R] Ask for functions to obtain partial R-square (squared partial correlation coefficients)
Ye Xingwang wrote: The partial R-square (or coefficient of partial determination, or squared partial correlation coefficient) measures the marginal contribution of one explanatory variable when all others are already included in a multiple linear regression model. The following link has very clear explanations of partial and semi-partial correlation: http://www.psy.jhu.edu/~ashelton/courses/stats315/week2.pdf

In SAS, the options are PCORR2 and SCORR2. For example (from http://www.ats.ucla.edu/stat/sas/examples/alsm/alsmsasch7.htm):

data ch7tab01;
  input X1 X2 X3 Y;
  label x1 = 'Triceps' x2 = 'Thigh cir.' x3 = 'Midarm cir.' y = 'body fat';
  cards;
19.5 43.1 29.1 11.9
24.7 49.8 28.2 22.8
30.7 51.9 37.0 18.7
29.8 54.3 31.1 20.1
19.1 42.2 30.9 12.9
25.6 53.9 23.7 21.7
31.4 58.5 27.6 27.1
27.9 52.1 30.6 25.4
22.1 49.9 23.2 21.3
25.5 53.5 24.8 19.3
31.1 56.6 30.0 25.4
30.4 56.7 28.3 27.2
18.7 46.5 23.0 11.7
19.7 44.2 28.6 17.8
14.6 42.7 21.3 12.8
29.5 54.4 30.1 23.9
27.7 55.3 25.7 22.6
30.2 58.6 24.6 25.4
22.7 48.2 27.1 14.8
25.2 51.0 27.5 21.1
;
run;
proc reg data = ch7tab01;
  model y = x1 x2 / pcorr2 scorr2;
  model y = x1-x3 / pcorr2 scorr2;
run;
quit;

There has been a post at http://tolstoy.newcastle.edu.au/R/help/05/03/0437.html. It would be greatly appreciated if someone could write a general function to work with class lm or glm to obtain pcorr2 (squared partial correlation coefficients using Type II sums of squares) and scorr2 (squared semi-partial correlation coefficients using Type II sums of squares) for all independent variables (3 variables) simultaneously. Thank you. Xingwang Ye

[Frank:]

library(Design)  # requires Hmisc
f <- ols(y ~ x1 + x2)
p <- plot(anova(f), what='partial R2')
p

The anova.Design function called above handles pooling related degrees of freedom and pools main effects with related interaction effects to get the total partial effect.

Frank

--
Frank E Harrell Jr, Professor and Chair, School of Medicine, Department of Biostatistics, Vanderbilt University
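A minimal sketch making Frank's snippet self-contained, assuming the SAS 'cards' data above have been read into a data frame ch7tab01 with lower-case column names (the what='partial R2' option is from Frank's reply):

library(Design)                       # loads Hmisc as well
dd <- datadist(ch7tab01); options(datadist = 'dd')
f <- ols(y ~ x1 + x2 + x3, data = ch7tab01)
plot(anova(f), what = 'partial R2')   # one entry per predictor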
Re: [R] Regulatory Compliance and Validation Issues
Cody Hamilton wrote: Dear Thomas, Thank you for your reply. You are of course quite right - the R Foundation couldn't be responsible for any individually contributed package. I am curious as to how an organization operating in a regulated environment could safely use a contributed package. What if the author/maintainer retires or loses interest in maintaining the package? The organization would then find itself in the awkward position of being reliant on software for which there is no technical support and which may not be compatible with future versions of the base R software. I suppose the organization could take responsibility for maintaining the individual functions within a package on its own (one option made possible by the open-source nature of R), but this would require outstanding programming resources which the company may not have (Thomas Lumleys are sadly rare). In addition, as the organization is claiming the functions as its own (and not as out-of-the-box software), the level of required validation would be truly extraordinary. I also wonder if an everyone-maintain-their-own-copy approach could lead to multiple mutated versions of a package's functions across the R universe (e.g. Edwards' version of sas.get() vs. Company X's version of sas.get(), etc.). Regards, -Cody

[Frank:] Cody, I think of this issue as not unlike an organization using its own code written by its own analysts or SAS programmers. Code is reused all the time. Frank

[Cody:] As always, I am speaking for myself and not necessarily for Edwards Lifesciences.

-----Original Message-----
From: Thomas Lumley
Sent: Sunday, August 19, 2007 8:50 AM
To: Cody Hamilton
Cc: r-help@stat.math.ethz.ch
Subject: Re: [R] Regulatory Compliance and Validation Issues

On Fri, 17 Aug 2007, Cody Hamilton wrote: ---snip--- I have a few specific comments/questions that I would like to present to the R help list. ---snip---

2. While the document's scope is limited to base R plus recommended packages, I believe most companies will need access to functionality provided by packages not included in the base or recommended packages. (For example, I don't think I could survive without the sas.get() function from the Design library.) How can a company address the issues covered in the document for packages outside its scope? For example, what if a package's author does not maintain historical archive versions of the package? What if the author no longer maintains the package? Is the solution to add more packages to the recommended list (I'm fairly certain that this would not be a simple process) or is there another solution?

[Thomas:] This will have to be taken up with the package maintainer. The R Foundation doesn't have any definitive knowledge about, e.g., Frank Harrell's development practices, and I don't think the FDA would regard our opinions as relevant. Archiving, at least, is addressed by CRAN: all the previously released versions of packages are available.

3. At least at my company, each new version must undergo basically the same IQ/OQ/PQ as the first installation. As new versions of R seem to come at least once a year, the ongoing validation effort would be painful if the most up-to-date version of R is to be maintained within the company. Is there any danger in delaying the updates (say updating R within the company every two years or so)?

[Thomas:] It's worse than that: there are typically 4 releases of R per year (the document you are commenting on actually gives dates).
The ongoing validation effort may indeed be painful, and this was mentioned as an issue in the talk by David James and Tony Rossini. The question of what is missed by delaying updates can be answered by looking at the NEWS file. The question of whether it is dangerous is really an internal risk-management issue for you.

-thomas

--
Frank E Harrell Jr, Professor and Chair, School of Medicine, Department of Biostatistics, Vanderbilt University
Re: [R] Subject: Re: how to include bar values in a barplot?
[EMAIL PROTECTED] wrote: Greg, I'm going to join issue with you here! Not that I'll go near advocating Excel-style graphics (abominable, and the Patrick Burns URL which you cite is remarkable in its restraint). Also, I'm aware that this is potential flame-war territory - again, I want to avoid that too. However, this is the second time you have intervened on this theme (previously Mon 6 August), along with John Kane on Wed 1 August and again today on similar lines, and I think it's time an alternative point of view was presented, to counteract (I hope usefully) what seems to be a draconianly prescriptive approach to the presentation of information. ---snip---

[Frank:] Ted, You make many excellent points and provide much food for thought. I still think that Greg's points are valid too, and in this particular case bar plots are a bad choice, and adding numbers at variable heights causes a perception error, as I wrote previously. Thanks for your elaboration on this important subject. Frank

On 07-Aug-07 21:37:50, Greg Snow wrote: Generally adding the numbers to a graph accomplishes 2 things: 1) it acts as an admission that your graph is a failure

[Ted:] Generally, I disagree. Different elements in a display serve different purposes, according to the psychological aspects of visual perception. . . .

--
Frank E Harrell Jr, Professor and Chair, School of Medicine, Department of Biostatistics, Vanderbilt University
Re: [R] Subject: Re: how to include bar values in a barplot?
Gabor Grothendieck wrote: You could put the numbers inside the bars, in which case they would not add to the height of the bars:

[Frank:] I think the Cleveland/Tufte prescription would be much different: horizontal dot charts with the numbers in the right margin. I do this frequently with great effect. The Hmisc dotchart2 function makes this easy. Frank

[Gabor:]

x <- 1:5
names(x) <- letters[1:5]
bp <- barplot(x)
text(bp, x - .02 * diff(par('usr')[3:4]), x)

On 8/9/07, Frank E Harrell Jr [EMAIL PROTECTED] wrote: ---snip (quoted message above)---

--
Frank E Harrell Jr, Professor and Chair, School of Medicine, Department of Biostatistics, Vanderbilt University
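A minimal sketch of the dotchart2 alternative Frank describes, reusing x from above; auxdata prints the values in the right margin (argument names per Hmisc's dotchart2):

library(Hmisc)
dotchart2(x, labels = names(x), auxdata = x, auxtitle = 'value')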
Re: [R] how to include bar values in a barplot?
Donatas G. wrote: How do I include bar values in a barplot (or other R graphics, where this could be applicable)? To make sure I am clear, I am attaching a barplot created with OpenOffice.org which has bar values written on top of each bar.

[Frank:] This causes an optical illusion in which the viewer adds some of the heights of the numbers to the heights of the bars. And in general, dot charts are greatly preferred to bar plots, not least because of their reduced ink:information ratio.

--
Frank E Harrell Jr, Professor and Chair, School of Medicine, Department of Biostatistics, Vanderbilt University
Re: [R] (Censboot, Z-score, Cox) How to use Z-score as the statistic within censboot?
David Lloyd wrote: Dear R Help list, My question is regarding extracting the standard error or Z-score from a cph or coxph call. My Cox model is:

modz <- cph(Surv(TSURV,STATUS) ~ RAGE + DAGE + REG_WTIME_M + CLD_ISCH + POLY_VS,
            data=kidneyT, method='breslow', x=TRUE, y=TRUE)

[Frank:] You are assuming linearity for all predictors, which is unlikely. When you expand predictors into multiple parameters to allow for nonlinearity, or when you have factors with more than 2 levels, individual P-values will be meaningless. You can get all Wald multiple d.f. test stats and P-values by saving the result of anova(modz) using anova.Design, which creates a matrix. Frank

[David:] I've used names(modz) but can't see anything that will let me extract the Z-scores for each coefficient, or the standard errors, in the same way one can use coef(modz). I need to get either the se or Z-score as I am hoping to be able to use the Z-score within a statistic function used in censboot. I wish to find the proportion of times each of the variables in the Cox model is significant out of the R=1000 bootstrap resamples. My current function for use with the bootstrap only gives me the coefficients:

func <- function(kidneyT, indices) {
  kidneyT <- kidneyT[indices,]
  modz <- coxph(Surv(TSURV,STATUS) ~ RAGE + DAGE + REG_WTIME_M + CLD_ISCH + POLY_VS,
                data=kidneyT, method='breslow')
  coefficients(modz)
}

I would like to be able to modify the following code so that modz.cens.boot$t below gives me an array of the Z-scores for each variable and each resample - not the coefficients:

modz.cens.boot <- censboot(kidneyT, func, R=1000, sim='ordinary')
modz.cens.boot

I hope this is clear enough? I'm still getting to terms with the bootstrap, having only just started trying to use it a few days ago. Any help on this would clear a massive headache for me as I've been going around in circles for hours and hours on this. Thx, DaveL

--
Frank E Harrell Jr, Professor and Chair, School of Medicine, Department of Biostatistics, Vanderbilt University
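A minimal sketch of one way to return Wald Z scores from the statistic function, keeping the variable names from the post; the standard errors come from the diagonal of the coxph covariance matrix (Z = coefficient / se):

func.z <- function(kidneyT, indices) {
  kidneyT <- kidneyT[indices, ]
  modz <- coxph(Surv(TSURV, STATUS) ~ RAGE + DAGE + REG_WTIME_M +
                  CLD_ISCH + POLY_VS,
                data = kidneyT, method = 'breslow')
  coef(modz) / sqrt(diag(vcov(modz)))   # Wald Z for each coefficient
}
# then: modz.cens.boot <- censboot(kidneyT, func.z, R = 1000, sim = 'ordinary')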
Re: [R] ROC curve in R
Dylan Beaudette wrote: On Thursday 26 July 2007 10:45, Frank E Harrell Jr wrote: Dylan Beaudette wrote: On Thursday 26 July 2007 06:01, Frank E Harrell Jr wrote: ---snip (the ROC note and first round of questions, quoted in full later in this digest)---

[Dylan:] Thanks for the detailed response, Frank. My follow-up questions are below:

[Frank, previously:] Logistic regression (with suitable attention to not assuming linearity and to avoiding overfitting) is a great way to estimate P[Y=1]. Given good predicted P[Y=1] and utilities (losses, costs) for incorrect positive and negative decisions, an optimal decision is one that optimizes expected utility. The ROC curve does not play a direct role in this regard.

[Dylan:] Ok.

[Frank, previously:] If per-subject utilities are not available, the analyst may make various assumptions about utilities (including the unreasonable but often used assumption that utilities do not vary over subjects) to find a cutoff on P[Y=1].

[Dylan:] Can you elaborate on what exactly a per-subject utility is? In my case, I am trying to predict the occurrence of specific soil features based on two predictor variables: one continuous, the other categorical. Thus far my evaluation of how well this method works is based on how often I can correctly predict (a categorical) quality.

[Frank:] This could be called a per-unit utility in your case. It is the consequence of decisions at the point at which you decide Y=0 or Y=1. If consequences are the same over all units, you just have to deal with the single ratio of the cost of a false positive to the cost of a false negative. One way to limit bad consequences is to not make any decision when the predicted probability is in the middle, i.e., the decision is 'obtain more data'. That is a real advantage of having a continuous risk estimate.

[Frank, previously:] A very nice feature of P[Y=1] is that error probabilities are self-contained. For example, if P[Y=1] = .02 for a single subject and you predict Y=0, the probability of an error is .02 by definition. One doesn't need to compute an overall error probability over the whole distribution of subjects' risks. If the cost of a false negative is C, the expected cost is .02*C in this example.

[Dylan:] Interesting.
The hang-up that I am having is that I need to predict from {0,1}, as the direct users of this information are not currently interested in raw probabilities. As far as I know, in order to predict a class from a probability I need to use a cutoff... How else can I accomplish this without imposing a cutoff on the entire dataset? One thought: identify a cutoff for each level of the categorical predictor term in the model... (?)

[Frank:] You're right that you have to ultimately use a cutoff (or better still, educate the users about the meaning of probabilities and let them make the decision without exposing the cutoff). And see the comment regarding gray zones above.

[Dylan:] 2. The ROC curve can be a helpful way of communicating false positives / false negatives to other users who are less familiar with the output and interpretation of logistic regression.

[Frank:] What is more useful than that is a rigorous calibration curve estimate to demonstrate the faithfulness of predicted P[Y=1] and a histogram showing the distribution of predicted P[Y=1].

[Dylan:] Ok. I can make that histogram - how would one go about making the 'rigorous calibration curve'? Note that I have a training set, from which the model is built, and a smaller testing set for evaluation.

[Frank:] See the val.prob function in the Design package. This assumes your test samples and training samples
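A minimal sketch of a test-set calibration curve with Design's val.prob, assuming an lrm fit from the training set and a held-out data frame testdata with binary outcome y (names hypothetical):

library(Design)
phat <- predict(fit, newdata = testdata, type = 'fitted')  # predicted P[Y=1]
val.prob(phat, testdata$y, m = 20)   # calibration plot plus summary indexes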
Re: [R] proportional odds model
Ramon Martínez Coscollà wrote: Hi all!! I am using a proportional odds model to study some ordered categorical data. I am trying to predict one ordered categorical variable taking into account only another categorical variable. I am using polr from the R MASS library. It seems to work OK, but I'm still getting familiar with it and I don't know how to assess goodness of fit. I have this output, when using response ~ independent variable:

Residual Deviance: 327.0956
AIC: 333.0956
polr.out$df.residual
[1] 278
polr.out$edf
[1] 3

When taking out every variable (i.e., making the formula response ~ 1), I have:

Residual Deviance: 368.2387
AIC: 372.2387

How can I test if the model fits well? How can I check that the independent variable effectively explains the model? Is there any test? Moreover, sending summary(polr.out) I get this error:

Error in optim(start, fmin, gmin, method = "BFGS", hessian = Hess, ...) :
  initial value in 'vmmin' is not finite

Something to do with the optimization procedure... but how can I fix it? Any help would be greatly appreciated. Thanks.

[Frank:] You might also look at lrm and residuals.lrm in the Design package, which provide partial and score residual plots to check PO model assumptions.

--
Frank E Harrell Jr, Professor and Chair, School of Medicine, Department of Biostatistics, Vanderbilt University
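A minimal sketch of the Design-based check Frank mentions, assuming an ordered response y and a categorical predictor x (names hypothetical); x=TRUE, y=TRUE are needed for the residual methods:

library(Design)
f <- lrm(y ~ x, data = mydata, x = TRUE, y = TRUE)
f                                     # model LR chi-square tests the predictor
resid(f, 'score.binary', pl = TRUE)   # graphical check of the PO assumption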
Re: [R] logistic regression
Mary, The 10-group approach results in a low-resolution and fairly arbitrary calibration curve. Also, it is the basis of the original Hosmer-Lemeshow goodness-of-fit statistic, which has been superseded by the Hosmer et al. single degree of freedom GOF test that does not require any binning. The Design package handles both. Do ?calibrate.lrm, ?residuals.lrm, ?lrm for details. Frank Harrell

Sullivan, Mary M wrote: Greetings, I am working on a logistic regression model in R and I am struggling with the code, as it is a relatively new program for me. In searching Google for 'logistic regression diagnostics' I came across Elizabeth Brown's Lecture 14 from her Winter 2004 Biostatistics 515 course (http://courses.washington.edu/b515/l14.pdf). I found most of the code to be very helpful, but I am struggling with the lines that calculate the observed and expected values in the 10 groups created by the cut function. I get error messages in trying to create the E and O matrices: R won't accept assignment with fitc==j and it won't calculate the sum. I am wondering whether someone might be able to offer me some assistance... my search of the archives was not fruitful. Here is the code that I adapted from the lecture notes:

fit <- fitted(glm.lyme)
fitc <- cut(fit, br = c(0, quantile(fit, p = seq(.1, .9, .1)), 1))
t <- table(fitc)
fitc <- cut(fit, br = c(0, quantile(fit, p = seq(.1, .9, .1)), 1), labels = F)
t <- table(fitc)

# Calculate observed and expected values in each group
E <- matrix(0, nrow=10, ncol=2)
O <- matrix(0, nrow=10, ncol=2)
for (j in 1:10) {
  E[j, 2] <- sum(fit[fitc==j])
  E[j, 1] <- sum((1 - fit)[fitc==j])
  O[j, 2] <- sum(pcdata$lymdis[fitc==j])
  O[j, 1] <- sum((1 - pcdata$lymdis)[fitc==j])
}

Here is the error message:

Error in Summary.factor(..., na.rm = na.rm) :
  sum not meaningful for factors

I understand what it means; I just can't figure out how to get around it or how to get the output printed in table form. Thank you in advance for any assistance. Mary Sullivan

--
Frank E Harrell Jr, Professor and Chair, School of Medicine, Department of Biostatistics, Vanderbilt University
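A minimal sketch of the binning-free route Frank points to, assuming the outcome and predictors behind glm.lyme (predictor names x1, x2 are hypothetical):

library(Design)
f <- lrm(lymdis ~ x1 + x2, data = pcdata, x = TRUE, y = TRUE)
resid(f, 'gof')               # single d.f. goodness-of-fit test, no binning
cal <- calibrate(f, B = 200)  # bootstrap overfitting-corrected calibration curve
plot(cal)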
Re: [R] ROC curve in R
Note that even though the ROC curve as a whole is an interesting 'statistic' (its area is a linear translation of the Wilcoxon-Mann-Whitney-Somers-Goodman-Kruskal rank correlation statistics), each individual point on it is an improper scoring rule, i.e., a rule that is optimized by fitting an inappropriate model. Using curves to select cutoffs is a low-precision and arbitrary operation, and the cutoffs do not replicate from study to study. Probably the worst problem with drawing an ROC curve is that it tempts analysts to try to find cutoffs where none really exist, and it makes analysts ignore the whole field of decision theory. Frank Harrell

[EMAIL PROTECTED] wrote: http://search.r-project.org/cgi-bin/namazu.cgi?query=ROCmax=20result=normalsort=scoreidxname=Rhelp02aidxname=functionsidxname=docs - there is a lot of help. Trying help.search('ROC curve') gave:

Help files with alias or concept or title matching 'ROC curve' using fuzzy matching:

granulo(ade4)            Granulometric Curves
plot.roc(analogue)       Plot ROC curves and associated diagnostics
roc(analogue)            ROC curve analysis
colAUC(caTools)          Column-wise Area Under ROC Curve (AUC)
DProc(DPpackage)         Semiparametric Bayesian ROC curve analysis
cv.enet(elasticnet)      Computes K-fold cross-validated error curve for elastic net
ROC(Epi)                 Function to compute and draw ROC-curves.
lroc(epicalc)            ROC curve
cv.lars(lars)            Computes K-fold cross-validated error curve for lars
roc.demo(TeachingDemos)  Demonstrate ROC curves by interactively building one

See the help and examples; those will suffice. Type 'help(FOO, package = PKG)' to inspect entry 'FOO(PKG) TITLE'.

Regards, Gaurav Yadav
Assistant Manager, CCIL, Mumbai (India)
Mob: +919821286118, Email: [EMAIL PROTECTED]
Bhagavad Gita: Man is made by his Belief; as He believes, so He is.

Rithesh M. Mohan wrote (07/26/2007 11:26 AM, to R-help@stat.math.ethz.ch, subject [R] ROC curve in R): Hi, I need to build a ROC curve in R. Can you please provide data steps / code or guide me through it? Thanks and Regards, Rithesh M Mohan

--
Frank E Harrell Jr, Professor and Chair, School of Medicine, Department of Biostatistics, Vanderbilt University
Re: [R] ROC curve in R
Dylan Beaudette wrote: On Thursday 26 July 2007 06:01, Frank E Harrell Jr wrote: ---snip (the ROC note quoted in full above)---

[Dylan:] Frank, this thread has caught my attention for a couple of reasons, possibly related to my novice-level experience.

1. In a logistic regression study, where I am predicting the probability of the response being 1 (for example), there exists a continuum of probability values - and a finite number of {1,0} realities when I either look within the original data set or at a new 'verification' data set. I understand that drawing a line through the probabilities returned from the logistic regression is a loss of information, but there are times when a 'hard' decision requiring prediction of {1,0} is required. I have found that the ROCR package (not necessarily the ROC curve) can be useful in identifying the probability cutoff where accuracy is maximized. Is this an unreasonable way of using logistic regression as a predictor?

[Frank:] Logistic regression (with suitable attention to not assuming linearity and to avoiding overfitting) is a great way to estimate P[Y=1]. Given good predicted P[Y=1] and utilities (losses, costs) for incorrect positive and negative decisions, an optimal decision is one that optimizes expected utility. The ROC curve does not play a direct role in this regard. If per-subject utilities are not available, the analyst may make various assumptions about utilities (including the unreasonable but often used assumption that utilities do not vary over subjects) to find a cutoff on P[Y=1]. A very nice feature of P[Y=1] is that error probabilities are self-contained. For example, if P[Y=1] = .02 for a single subject and you predict Y=0, the probability of an error is .02 by definition. One doesn't need to compute an overall error probability over the whole distribution of subjects' risks. If the cost of a false negative is C, the expected cost is .02*C in this example.

[Dylan:] 2. The ROC curve can be a helpful way of communicating false positives / false negatives to other users who are less familiar with the output and interpretation of logistic regression.

[Frank:] What is more useful than that is a rigorous calibration curve estimate to demonstrate the faithfulness of predicted P[Y=1] and a histogram showing the distribution of predicted P[Y=1]. Models that put a lot of predictions near 0 or 1 are the most discriminating. Calibration curves and risk distributions are easier to explain than ROC curves. Too often a statistician will solve for a cutoff on P[Y=1], imposing her own utility function without querying any subjects.

[Dylan:] 3. I have been using the area under the ROC curve, Kendall's tau, and Cohen's kappa to evaluate the accuracy of a logistic regression based prediction, the last two statistics based on some probability cutoff identified beforehand.

[Frank:] ROC area (equivalent to Wilcoxon-Mann-Whitney and Somers' Dxy rank correlation between pred.
P[Y=1] and Y) is a measure of pure discrimination, not a measure of accuracy per se. Rank correlation (concordance) measures do not require the use of cutoffs.

[Dylan:] How does the topic of decision theory relate to some of the circumstances described above? Is there a better way to do some of these things?

[Frank:] See above re: expected losses/utilities. Good questions. Frank

Cheers, Dylan

[EMAIL PROTECTED] wrote: ---snip (the help.search('ROC curve') listing quoted in full earlier in this digest)---
Re: [R] Constructing bar charts with standard error bars
John Zabroski wrote: I am new to R. I want to graph group data using a traditional bar chart with standard error bars, like the kind shown here: http://samiam.colorado.edu/~mcclella/ftep/twoGroups/twoGroupGraphs.html

[Frank:] There are severe problems with dynamite plots such as these. See http://biostat.mc.vanderbilt.edu/DynamitePlots for a list of problems and solutions. Frank

[John:] Is there a simple way to do this? So far, I have only figured out how to plot the bars using barplot.

testdata <- scan(, list(group=0, xbar=0, se=0))
400 0.36038 0.02154
200 0.35927 0.02167
100 0.35925 0.02341
50 0.35712 0.01968
25 0.35396 0.01931

barplot(testdata$xbar, names.arg=as.character(testdata$group),
        main='a=4.0', xlab='Group', ylab='xbar')
xvalues <- c(0.7, 1.9, 3.1, 4.3, 5.5)
arrows(xvalues, testdata$xbar, xvalues, testdata$xbar + testdata$se,
       length=0.4, angle=90, code=3)

The best clue I have so far is Rtips #5.9 (http://pj.freefaculty.org/R/Rtips.html#5.9), which is what I based my present solution on. However, I do not understand how this works. It seems like there is no concrete way to determine the arrow-drawing parameters x0 and x1 for a barplot. Moreover, the bars seem to be cut off.

--
Frank E Harrell Jr, Professor and Chair, School of Medicine, Department of Biostatistics, Vanderbilt University
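Two asides on the code above. First, barplot() returns the bar midpoints, so the hard-coded xvalues can be replaced by bp <- barplot(...). Second, a minimal sketch of one alternative in the spirit of the page Frank links (a dot plot with +/- 1 SE bars), reusing testdata from the scan() above:

with(testdata, {
  ord <- seq_along(group)
  plot(xbar, ord, xlim = range(xbar - se, xbar + se),
       yaxt = 'n', xlab = 'xbar', ylab = 'Group', pch = 19)
  axis(2, at = ord, labels = group)
  segments(xbar - se, ord, xbar + se, ord)   # +/- 1 standard error
})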
Re: [R] Subscript out of bounds when using datadist() from Design library
Cody Hamilton wrote: I am running R version 2.4.1 on Windows XP. I have a question regarding the datadist() function from the Design library. I have a data.frame (call it my.data) with 4 columns. When I submit the code

datadist(data=my.data)

[Frank:] You can just use

dd <- datadist(my.data); options(datadist='dd')

If that doesn't fix it, generate a test dataset that fails, or do save(..., compress=TRUE) and send me your dataset so I can debug. Frank

[Cody:] I get the following error message:

Error in X[[1]] : subscript out of bounds

I suspect there may be something wrong with my data.frame (I'm certain there is nothing wrong with datadist()), but I don't know what. Has anyone experienced the mistake I seem to be making? Regards, Cody Hamilton, Edwards Lifesciences

--
Frank E Harrell Jr, Professor and Chair, School of Medicine, Department of Biostatistics, Vanderbilt University
Re: [R] Off-topic: simulate time dependent covariates
Jin Lo wrote: Dear R friends, this is an off-topic question. Could you please point me to ways (e.g., references or even R code) of simulating time-varying covariates in a survival analysis setting? Thanks in advance for any responses. yours sincerely, Jin

[Frank:] I'm not sure if this article addresses time-dependent covariates, but it may be worth a look:

@Article{ben05gen,
  author  = {Bender, Ralf and Augustin, Thomas and Blettner, Maria},
  title   = {Generating survival times to simulate {Cox} proportional hazards models},
  journal = {Statistics in Medicine},
  year    = 2005,
  volume  = 24,
  pages   = {1713-1723},
  annote  = {simulation setup; simulating survival times; censoring; Cox model; errata 25:1978-1979}
}

The spower function in the Hmisc package can simulate data for a non-constant-hazard-ratio treatment effect. The user specifies the hazard ratio as a function of time.

--
Frank E Harrell Jr, Professor and Chair, School of Medicine, Department of Biostatistics, Vanderbilt University
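A minimal sketch (assumptions entirely mine, not from the thread) of the piecewise-exponential trick behind simulating a hazard ratio that changes at t = 1; the memoryless property of the exponential makes the two-piece draw valid:

h0 <- 0.2                                  # constant control-arm hazard
hr <- function(t) ifelse(t < 1, 1, 0.6)    # hypothetical time-varying hazard ratio
rcontrol <- function(n) rexp(n, h0)
rtreat <- function(n) {
  t1 <- rexp(n, h0 * hr(0))                # hazard in effect before t = 1
  ifelse(t1 < 1, t1, 1 + rexp(n, h0 * hr(1)))  # survived to 1: restart with new rate
}
summary(rtreat(10000))                     # compare with summary(rcontrol(10000))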
Re: [R] exclude1 in summary.formula from Hmisc
david dav wrote: Dear Users, Still having trouble sharing my results, I'm trying to display a contingency table using summary.formula. Computing works well, but I'd like to display information on redundant entries, say, males AND females.

[Frank:] Please tell us what output you got. And it is sometimes helpful to get a trivial example of the failure with simulated data. Also see a related parameter, long. Frank

[David:] I wonder if this code is correct:

desqualjum <- summary.formula(varqual1$X ~ ., data = subset(varqual1, select = -X),
                              method = 'reverse', overall = TRUE, test = TRUE,
                              exclude1 = FALSE)

where varqual1 is the data frame containing the variable sex. I'm using R 2.5.1 and Hmisc 3.4-2 on a Mac (OS X 10.4, Intel), but I had the same trouble with former versions of R and the package. Thank you for your help. David

--
Frank E Harrell Jr, Professor and Chair, School of Medicine, Department of Biostatistics, Vanderbilt University
Re: [R] exclude1 in summary.formula from Hmisc
david dav wrote: Here is a piece of the data and code:

sex <- c(2,2,1,1,1,1,1,1,1,2,1,1,1,1,1,2,2,1,1,2,2)
AGN <- c('C','C','C','C','C','A','A','C','B','B','C','C','C','C','C','C','B','B','C','C','C')
X <- c(2,2,2,2,1,2,2,2,2,2,1,1,1,1,1,1,1,1,1,2,2)
varqual <- data.frame(sex, AGN, X)
desqual <- summary.formula(varqual$X ~ ., data = subset(varqual, select = -X),
                           method = 'reverse', overall = TRUE, test = TRUE,
                           long = TRUE, exclude1 = FALSE)
desqual

desqual doesn't show the results for sex == 1, as it is redundant. I also tried long = TRUE, which didn't change anything here.

[Frank:] Oh yes. exclude1 is not an argument to summary.formula but is an argument to the print, plot, and latex methods. So do print(desqual, exclude1=FALSE, long=TRUE). Thanks for the reproducible example. Note that you can just say summary( ) and don't need to type out summary.formula.

[David:] Bonus question, if I may: this function is a little bit complex for me and I couldn't figure out how to get either Yates' continuity correction or the Fisher test instead of the chi-square test. Does it require a lot of program coding?

[Frank:] Neither of those is recommended, so they are not automatically supported. But users can add their own test functions - see the help file for the catTest argument to summary.formula. Fisher's test is conservative. Yates' continuity correction tries to mimic the conservatism of Fisher's test. I don't like to get larger P-values when I can avoid it. And the recommendations about worrying about the chi-square approximation when some cell sizes are small are a bit overdone. Better might be to use likelihood ratio tests, and many other tests are available. Frank

Regards. David

--
Frank E Harrell Jr, Professor and Chair, School of Medicine, Department of Biostatistics, Vanderbilt University
Re: [R] Package for .632 (and .632+) bootstrap and the cross-validation of ROC Parameters
spime wrote: Suppose I have training data my.train and testing data my.test.

[Frank:] The bootstrap does not need split samples.

[spime:] I want to calculate the bootstrap error rate for a logistic model. My wrapper function for prediction:

pred.glm <- function(object, newdata) {
  ret <- as.factor(ifelse(predict.glm(object, newdata, type='response') < 0.4, 0, 1))
  return(ret)
}

But I can't understand: if I want to calculate the misclassification error for my testing data, what should go in `data` in the following formula?

[Frank:] Misclassification error has many problems because it is not a proper scoring rule, i.e., it is optimized by bogus models. Frank

[spime:]

errorest(RES ~ ., data=???, model=glm, estimator='boot',
         predict=pred.glm, est.para=control.errorest(nboot = 10))

Using my.test I got the following error:

Error in predict(mymodel, newdata = outbootdata) :
  unused argument(s) (newdata = list(RES = c(1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0,
  0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
  1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0),
  CAT01 = c(4, 4, 2, 4, 4, 4, 4, 4, 4, 2, 1, 2, 2, 4, 4, 4, 1, 1, 2, 2, 1, 4,
  1, 4, 1, 4, 2, 4, 1, 4, 2, 3, 1, 1, 3, 3, 4, 2, 4, 2, 1, 2, 2, 1, 1,

please reply...

Frank E Harrell Jr wrote: ---snip (the earlier exchange, reproduced in full in the next message)---
Re: [R] Package for .632 (and .632+) bootstrap and the cross-validation of ROC Parameters
spime wrote: Hi users, I need to calculate the .632 (and .632+) bootstrap and the cross-validation of the area under the curve (AUC) to compare my models. Is there any package for the same? I know about 'ipred', and using it I can calculate misclassification errors. Please help. It's urgent.

[Frank:] See the validate* functions in the Design package. Note that some simulations (see http://biostat.mc.vanderbilt.edu/rms) indicate that the advantages of .632 and .632+ over the ordinary bootstrap are highly dependent on the choice of the accuracy measure being validated. The bootstrap variants seem to have advantages mainly if an improper, inefficient, discontinuous scoring rule such as the percent classified correct is used.

--
Frank E Harrell Jr, Professor and Chair, School of Medicine, Department of Biostatistics, Vanderbilt University
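A minimal sketch of validate() on a logistic model, assuming my.train with the binary response RES and the predictor CAT01 from the related thread above (add the remaining predictors to the formula); Dxy relates to AUC via AUC = (Dxy + 1)/2:

library(Design)
f <- lrm(RES ~ CAT01, data = my.train, x = TRUE, y = TRUE)
validate(f, method = 'boot', B = 200)   # ordinary bootstrap, optimism-corrected
validate(f, method = '.632', B = 200)   # .632 variant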
Re: [R] variable and value labels
Steve Powell wrote: Dear list, I am new here - enjoying the power of R compared to SPSS. I am looking for sets of tips and tricks for people with old SPSS habits. In particular, I would like to know an easy way to set variable labels across a dataset and to set value labels for sets of variables. Grateful for any help, Steve Powell

[Frank:] In the Hmisc package, see the functions spss.get, label, upData, and describe. Frank

--
Frank E Harrell Jr, Professor and Chair, School of Medicine, Department of Biostatistics, Vanderbilt University
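A minimal sketch of those Hmisc functions (file name and labels hypothetical); value labels from SPSS become factor levels, and variable labels travel with the data:

library(Hmisc)
w <- spss.get('mydata.sav', use.value.labels = TRUE)  # SPSS value labels -> factors
label(w$age) <- 'Age at interview (years)'            # one variable label
w <- upData(w, labels = c(sex = 'Gender', inc = 'Monthly income'))  # several at once
describe(w)                                           # labels appear in the output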
Re: [R] Incidence estimated from Kaplan-Meier
Nguyen Dinh Nguyen wrote: Dear all, I have a stat question that may not be related to R, but I would like to have your advice. I have just read a medical paper in which the authors report the 1-p (where p is the cumulative survival probability from the Kaplan-Meier curve) as incidence of disease. Specifically, the study followed ~12000 women on drug A and ~2 women on drug B for 12 months. During that period 29 women on drug A and 80 on drug B had the disease. The incidence of disease for A and B was 0.24% and 0.30% respectively. However, instead of reporting these numbers, they report the 1-p figure, which was 0.3% for A and 0.6% for B. So the incidence from 1-p was substantially higher than the actual incidence. My question is: is it appropriate to use 1-p estimated from Kaplan-Meier as the incidence of disease? If not, why not? Regards, Nguyen Yes it's appropriate, and it makes you state the cumulative incidence by time t rather than leaving time unspecified. In your example it is likely that not all women were followed completely, so simple incidences are not appropriate to compute because the denominator is not constant. Frank Nguyen Dinh Nguyen, Bone and Mineral Research Program Garvan Institute of Medical Research St Vincent's Hospital 384 Victoria Street, Darlinghurst Sydney, NSW 2010 Australia Tel: 61-2-9295 8274 Fax: 61-2-9295 8241 E-mail: [EMAIL PROTECTED] [[alternative HTML version deleted]] __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
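A minimal sketch of the 1 - Kaplan-Meier estimate at 12 months that Frank endorses (mydat and its columns time, event, drug are hypothetical):

library(survival)
fit <- survfit(Surv(time, event) ~ drug, data = mydat)
s <- summary(fit, times = 12)
1 - s$surv     # cumulative incidence by 12 months, one value per drug group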
Re: [R] Formula syntax question
Isaac Kohane wrote: Forgive me if this is obvious: I have a frame of data with the variables in each column (e.g. Discrete_Variable1, ContinuousVariable_1, ContinuousVariable_2, ... ContinuousVariable_n) and I want to create a model using lrm, i.e.

model <- lrm(Discrete_Variable1 ~ ContinuousVariable_1, data=lotsofdata)

Is there a syntax for having all the continuous variables referenced in the formula without having to enumerate them all? I've seen the ~ . notation, but when I try

model <- lrm(Discrete_Variable1 ~ ., data=lotsofdata)

I get this error: Error in terms.formula(formula, specials = strat) : '.' in formula and no 'data' argument Any help is appreciated. -Zak

It may be best to write a function to determine what is continuous (>= 10 unique values, for example, and numeric) and to run sapply on that function, over your data frame. Then you could use lrm(y ~ ., data=mydata[continuous]) if it were not for a problem with lrm which Charles Thomas Dupont (the Design package maintainer) and I will work on. Until then you can write a command to compose a formula, e.g.,

form <- as.formula(paste('y', paste(names(mydata)[continuous], collapse='+'), sep='~'))
lrm(form, data=mydata)

-- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
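A minimal sketch of both steps (mydata and the outcome y are hypothetical, and "10 or more unique numeric values" is only one workable definition of continuous):

continuous <- sapply(mydata, function(z) is.numeric(z) && length(unique(z)) >= 10)
continuous["y"] <- FALSE    # keep the response off the right-hand side

form <- as.formula(paste("y", paste(names(mydata)[continuous], collapse = "+"), sep = "~"))
library(Design)
fit <- lrm(form, data = mydata)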
Re: [R] Harrell's C
Madero Jarabo, Rosario wrote: I need to calculate Harrell's C for some survival analyses using the Design package with R version 2.4.1. How can I do it? Rosario Madero Sección de Bioestadística Hospital Universitario La Paz Pº de la Castellana, 261 28046 Madrid, España Tfno: 917277112 [EMAIL PROTECTED] It's in the documentation for the Hmisc package. Type ?rcorr.cens or ?rcorrp.cens Frank -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
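A minimal sketch of rcorr.cens, assuming hypothetical vectors: a follow-up time, an event indicator status, and a model risk score lp:

library(Hmisc)
library(survival)
rc <- rcorr.cens(-lp, Surv(time, status))  # negate: higher risk should mean shorter survival
rc["C Index"]                              # Harrell's C; rc["Dxy"] is the rank correlation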
Re: [R] logistic regression and dummy variable coding
Bingshan Li wrote: Hi All, Now it works. Thanks for all your answers and the explanations are very clear. Bingshan But note that you are not using R correctly unless you are doing a simulation and have some special speed issues. Let the model functions do all this for you. Frank On Jun 28, 2007, at 7:44 PM, Seyed Reza Jafarzadeh wrote:

NewVar <- relevel(factor(OldVar), ref = "b")

should create a dummy variable, and change the reference category for the model. Reza On 6/28/07, Bingshan Li [EMAIL PROTECTED] wrote: Hello everyone, I have a variable with several categories and I want to convert this into dummy variables and do logistic regression on it. I used model.matrix to create dummy variables, but it always picked the smallest one as the reference. For example,

model.matrix(~ ., data=as.data.frame(letters[1:5]))

will code 'a' as '0 0 0 0'. But I want to code another category as reference, say 'b'. How to do it in R using model.matrix? Is there another way to do it if model.matrix has no such functionality? Thanks! [[alternative HTML version deleted]] __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
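A minimal sketch combining relevel with Frank's advice to let the model function build the dummies (mydata, y, and grp are hypothetical):

mydata$grp <- relevel(factor(mydata$grp), ref = "b")   # "b" becomes the reference level
fit <- glm(y ~ grp, family = binomial, data = mydata)  # glm constructs the dummy columns itself
summary(fit)                                           # coefficients are contrasts vs. level "b"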
Re: [R] Warning message in Design ...
Petar Milin wrote: Hello! There were some questions regarding warning messages in the Design library, but no answer on how to fix them. What about:

use of storage.mode(x) <- "single" is deprecated: use mode<- instead

What does it mean? How do I properly use a simple model like:

m1 = ols(Y ~ A + B + C, data = someData)

After specifying the model above, the warning message to use mode(x) instead of storage.mode(x) appears. Best, PM It's just a warning to be ignored. It will go away in a future release of Design. Frank __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Marginal Effects of continuous variable in lrm model (Design package)
Minyu Chen wrote: Dear all: When I am trying to get the marginal effects:

summary(result7, adv_inc_ratio=mean(m9201$adv_inc_ratio), adv_price_ratio=mean(m9201$adv_price_ratio), ...(SOME MORE CONTINUOUS AND DISCRETE VARIABLES BUT I AM NOT LISTING)... regW=c(0, mean(m9201$regW), 1), regWM=c(0, mean(m9201$regWM), 1))

It gave out an error message:

Error in summary.Design(result7, adv_inc_ratio = mean(m9201$adv_inc_ratio), : ranges not defined here or with datadist for adv_inc_ratio adv_price_ratio agem1 change2 yr1Libor yr1Libor1mtLag yr1Libor1yrLag change21yrLag change21mtLag fwdReal4yr fwdInfl4yr

But I remember from my previous operation a few months ago (I recorded the commands) that to summarize the marginal effect, I didn't have to specify the ranges for the discrete variables. However, when I use the command again (with slight modification because of the newly added variables) it doesn't work. I don't know what went wrong. Thank you very much. Thanks, Minyu Chen Please read the package's help files, especially about the datadist function. Frank -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
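A minimal sketch of the datadist setup the error message is asking for (the model formula is abbreviated; the outcome y is hypothetical, the other names are the poster's):

library(Design)
dd <- datadist(m9201)        # records ranges/quantiles for every predictor
options(datadist = "dd")     # must be in effect when summary() is called
result7 <- lrm(y ~ adv_inc_ratio + adv_price_ratio + regW + regWM, data = m9201)
summary(result7)             # ranges now come from dd; override per variable as needed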
Re: [R] help on the use of ldBand
Tomas Goicoa wrote: Hi R Users, I am trying to use the ldBand package. Together with the package, I have downloaded the ld98 program (version for windows) as indicated in the help page on ldBand. I did it, but obtained an error message Error in (head + 1):length(w) : Argument NA/NaN when I copied the help examples, so it seems that a conection between R and ld98 is not well performed in my computer. Did I put ld98.exe in the wrong place? If so, where should I put it? Thanks a lot in advance, Berta Ibañez Beroiz Do you mean the Hmisc package? Do you mean the ldBands function? Did you put ld98.exe in a place that is in your system path as specified in the ldBands help file? And please read the posting guide referenced below. Frank __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] BIC and Hosmer-Lemeshow statistic for logistic regression
spime wrote: Is there any Windows version of the Design package? Soon the new version will make its way to Windows, probably in a day or two. Frank Frank E Harrell Jr wrote: spime wrote: I haven't found any helpful thread. How can I calculate BIC and the Hosmer-Lemeshow statistic for a logistic regression model? I have used glm for the logistic fit. See the Design package's lrm function and residuals.lrm for a better GOF test. -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
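A minimal sketch of the GOF test Frank points to, plus BIC from a plain glm fit (d, y, x1, x2 are hypothetical):

library(Design)
f <- lrm(y ~ x1 + x2, data = d, x = TRUE, y = TRUE)  # x, y needed by residuals.lrm
residuals(f, type = "gof")                           # le Cessie-van Houwelingen GOF test

g <- glm(y ~ x1 + x2, family = binomial, data = d)
AIC(g, k = log(nrow(d)))                             # BIC is AIC with penalty k = log(n)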
Re: [R] psm/survreg coefficient values ?
sj wrote: I am using psm to model some parametric survival data; the data is for length of stay in an emergency department. There are several ways a patient's stay in the emergency department can end (discharge, admit, etc.), so I am looking at modeling the effects of several covariates on the various outcomes. Initially I am trying to fit a survival model for each type of outcome using the psm function in the Design package, i.e., all patients whose visits come to an end due to any event other than the event of interest are considered to be censored. Being new to the psm and survreg packages (and to parametric survival modeling) I am not entirely sure how to interpret the coefficient values that psm returns. I have included the following code to illustrate code similar to what I am using on my data. I suppose that the coefficients are somehow rescaled, but I am not sure how to return them to the original scale and make sense of the coefficients, e.g., estimate the effect of higher acuity on time to event in minutes. Any explanation or direction on how to interpret the coefficient values would be greatly appreciated. This is from the documentation for survreg.object:

coefficients: the coefficients of the linear.predictors, which multiply the columns of the model matrix. It does not include the estimate of error (sigma). The names of the coefficients are the names of the single-degree-of-freedom effects (the columns of the model matrix). If the model is over-determined there will be missing values in the coefficients corresponding to non-estimable coefficients.

code:

LOS <- sort(rweibull(1000, 1.4, 108))
AGE <- sort(rnorm(1000, 41, 12))
ACUITY <- sort(rep(1:5, 200))
EVENT <- sample(x=c(0,1), replace=TRUE, 1000)
psm(Surv(LOS, EVENT) ~ AGE + as.factor(ACUITY), dist='weibull')

output:

psm(formula = Surv(LOS, CENS) ~ AGE + as.factor(ACUITY), dist = weibull)

Obs   Events  Model L.R.  d.f.  P  R2
1000  513     2387.62     5     0  0.91

             Value    Std. Error  z       p
(Intercept)   1.1055  0.04425     24.98   8.92e-138
AGE           0.0772  0.00152     50.93   0.00e+00
ACUITY=2      0.0944  0.01357      6.96   3.39e-12
ACUITY=3      0.1752  0.02111      8.30   1.03e-16
ACUITY=4      0.1391  0.02722      5.11   3.18e-07
ACUITY=5     -0.0544  0.03789     -1.43   1.51e-01
Log(scale)   -2.7287  0.03780    -72.18   0.00e+00

Scale = 0.0653

best, Spencer I have a case study using psm (survreg wrapper) in my book. Briefly, coefficients are on the log median survival time scale. Frank -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
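A minimal sketch of moving the Weibull coefficients back to the time scale, using the simulated fit above; since the Weibull psm models log survival time, exp(coef) acts multiplicatively on the median (treat the exact interpretation as Frank's book describes it):

f <- psm(Surv(LOS, EVENT) ~ AGE + as.factor(ACUITY), dist = "weibull")
exp(coef(f)["AGE"])   # e.g. exp(0.0772) = 1.08: about an 8% longer median LOS per year of AGE
exp(coef(f))          # each ACUITY level's multiplicative effect on median LOS vs. ACUITY = 1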
Re: [R] error bars on survival curve
Murray Pung wrote: I am using plot(survfit(Surv(time, status) ~ ...)) and would like to add error bars rather than the confidence intervals. Am I able to do this at specified times, e.g. when time = 20 and 40?

leukemia.surv <- survfit(Surv(time, status) ~ x, data = aml)
plot(leukemia.surv, lty = 2:3, xlim = c(0,50))  # can I add error bars at times 20 and 40?
legend(100, .9, c("Maintenance", "No Maintenance"), lty = 2:3)

Thanks Murray Error bars, when done correctly, should equal confidence intervals in the minds of many. When the Design package is available again for the latest version of R, you can have more control using the survplot function, with the bars and bands options. Do ?survplot.survfit Frank -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
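A minimal sketch of the survplot options Frank mentions; the argument values follow ?survplot.survfit and should be treated as assumptions if your Design version differs:

library(Design)
f <- survfit(Surv(time, status) ~ x, data = aml)
survplot(f, conf = "bars", time.inc = 20)   # CI bars at times 20, 40, ...
# survplot(f, conf = "bands")               # shaded confidence bands instead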
Re: [R] selecting cut-off in Logistic regression using ROCR package
Tirthadeep wrote: Hi, I am using logistic regression to classify binary psychometric data. Using glm() and then predict.glm() I got the predicted probabilities for the testing data. Next I am going to plot a ROC curve for the analysis of my study. Now what should I do:

1. first select a cut-off (say 0.4), classify the output of predict.glm() into {0,1}, and then use it to draw a ROC curve using the ROCR package, OR
2. just use the predicted probabilities in the ROCR package to get the error rate and use the minimum error rate (as a new cut-off) to draw a new ROC curve.

waiting for reply. with regards and thanks. Tirtha. It's not clear why any cutoff or ROC curve is needed. Please give us more information about why a continuous variable should be dichotomized, and read

@Article{roy06dic,
  author  = {Royston, Patrick and Altman, Douglas G. and Sauerbrei, Willi},
  title   = {Dichotomizing continuous predictors in multiple regression: a bad idea},
  journal = Stat in Med,
  year    = 2006,
  volume  = 25,
  pages   = {127-141},
  annote  = {continuous covariates; dichotomization; categorization; regression; efficiency; clinical research; residual confounding; destruction of statistical inference when cutpoints are chosen using the response variable; varying effect estimates from change in cutpoints; difficult to interpret effects when dichotomizing; nice plot showing effect of categorization; PBC data}
}

Frank -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
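In line with Frank's point, a minimal sketch showing that ROCR needs no cutoff at all: it works directly on the continuous predicted probabilities (pred.prob and truth are hypothetical vectors of predictions and observed 0/1 outcomes):

library(ROCR)
pr <- prediction(pred.prob, truth)
performance(pr, "auc")@y.values[[1]]   # AUC from the continuous predictions
plot(performance(pr, "tpr", "fpr"))    # the full ROC curve, no dichotomization needed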
Re: [R] complex contrasts and logistic regression
Nicholas Lewin-Koh wrote: Hi, I am doing a retrospective analysis on a cohort from a designed trial, and I am fitting the model

fit <- glmD(survived ~ Covariate*Therapy + confounder, myDat, X=TRUE, Y=TRUE, family=binomial())

For logistic regression you can also use Design's lrm function which gives you more options. My covariate has three levels (A, B and C) and therapy has two (treated and control); confounder is a continuous variable. Also, patients were randomized to treatment in the trial, but Covariate is something that is measured posthoc and can vary in the population. If by posthoc you mean that the covariate is measured after baseline, it is difficult to get an interpretable analysis. I am trying to wrap my head around how to calculate a few quantities from the model and get reasonable confidence intervals for them. Namely, I would like to test H0: gamma=0, where gamma is the regression coefficient of the odds ratios of surviving under treatment vs control at each level of Covariate (adjusted for the confounder). You mean the regression coefficient on the log odds ratio scale. This is easy to do with the contrast( ) function in Design. Do ?contrast.Design for details and examples. And I would like to get the odds of surviving at each level of Covariate under treatment and control, adjusted for the confounder. I have looked at contrast in the Design library but I don't think it gives me the right quantity; for instance,

contrast(fit, list(Covariate="A", Therapy="Treated", confounder=median(myDat$confounder)), X=TRUE)

("A" is the baseline level of Covariate) gives me beta0 + beta_Treated + beta_confounder*68. Is this correctly interpreted as the conditional odds of dying? As to the 1st contrast I am not sure how to get it; would it be using type = 'average' with some weights in contrast? The answers are probably staring me in the face, I am just not seeing them today. contrast( ) is for contrasts (differences). Sounds like you want predicted values. Do ?predict ?predict.lrm ?predict.Design. Also do ?gendata which will generate a data frame for getting predictions, with unspecified predictors set to reference values such as medians. Frank Nicholas -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
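A minimal sketch of the treated-vs-control contrast at each level of Covariate, along the lines of ?contrast.Design; names follow the poster's model, and the exact level labels are assumptions:

library(Design)
fit <- lrm(survived ~ Covariate * Therapy + confounder, data = myDat, x = TRUE, y = TRUE)
k <- contrast(fit,
              list(Therapy = "Treated", Covariate = c("A", "B", "C"),
                   confounder = median(myDat$confounder)),
              list(Therapy = "Control", Covariate = c("A", "B", "C"),
                   confounder = median(myDat$confounder)))
k                  # log odds ratios with CIs, one row per Covariate level
exp(k$Contrast)    # the corresponding odds ratios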
Re: [R] Design library installation problem
Uwe Ligges wrote: Ian Watson wrote: Dear Listers, I have tried to install Frank Harrell's two libraries: Hmisc and Design. I found that Hmisc was listed in the list of packages from the Install Packages command on the Packages menu, but Design was not. I installed Hmisc from this list, and when I issued the library(Hmisc) command, it loaded into memory correctly. I then copied the Design 1.1-1.zip file from the http://lib.stat.cmu.edu/S/Harrell/library/r/ site and used the Install Packages from Local Zip file command. I received no error messages, and a visual inspection of the R\library directory shows Design has been installed. However, when I issued the library(Design) command I got the following error message:

Error in library(Design) : 'Design' is not a valid package -- installed < 2.0.0?

I also notice, from a visual inspection of the R\library\Design\R directory, that there is only one file: design. In other directories, e.g. R\library\Hmisc\R, there are usually 3 files: Hmisc, Hmisc.rdx, Hmisc.rdb. I am new to R, and a bit lost. I have read the R-admin.pdf documentation on packages but am still unsure how to proceed from here. I would appreciate any advice, and any answers to the following questions: 1) is there a reason why Design is not listed in the Install Packages list as Hmisc is? Yes. The current version does not pass the checks under Windows. Please convince the maintainer to fix the package, and a binary will be made available shortly. Please note that this would be a lot easier if we did not get a segfault when running R CMD CHECK --- a segfault that does not happen when the example code is run outside of CMD CHECK. Charles Thomas Dupont is working hard on this. Frank -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] R vs. Splus in Pharma/Devices Industry
[EMAIL PROTECTED] wrote: Following up to some extent on Friday's discussion regarding the 'validation' of R, could I ask the list group's opinion on possible advantages of R over Splus from a pharma/devices perspective? I wish to exclude the obvious price difference, which doesn’t seem to carry as much weight as I would have thought. Besides, I have noticed many former Splus users gravitating towards R, and I suspect that the reasons are not purely economic. I can think of a few advantages of Splus: 1. SeqTrial (of course that means more $) 2. Tech support 3. The warm fuzzies that management seems to get from proprietary software I can also think of a few advantages of R: 1. Based on my personal experiences, simulations requiring a lot of looping seem to run faster. 2. R interfaces with BUGS, for example through BRUGS. 3. The wonderful help list! As always, I am speaking for myself and not necessarily for Edwards Lifesciences. Regards, -Cody Cody Hamilton, PhD Edwards Lifesciences [[alternative HTML version deleted]] A big one for us is plotmath (for clinical trial reports we put a lot of greek letters and subscripts on plots), and later we will consider migrating a lot of our stuff to the ggplot package which I don't think is available in S-Plus. Lexical scoping is another advantage of R as is the ability to reference files on the internet. Frank __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] R is not a validated software package..
Giovanni Parrinello wrote: Dear All, discussing with a statistician at a pharmaceutical company, I received this answer about the statistical package that I have planned to use: As R is not a validated software package, we would like to ask if it would rather be possible for you to use SAS, SPSS or another approved statistical software system. Could someone suggest a 'polite' answer? TIA Giovanni Search the archives and you'll find a LOT of responses. Briefly, in my view there are no requirements, just some pharma companies that think there are. FDA is required to accept all submissions, and they get some where only Excel was used, or Minitab, and lots more. There is a session on this at the upcoming R International Users Meeting in Iowa in August. The session will include discussions of federal regulation compliance for R, for those users who feel that such compliance is actually needed. Frank -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] R is not a validated software package..
Sicotte, Hugues Ph.D. wrote: People, don't get angry at the pharma statistician; he is just trying to abide by an FDA requirement that is designed to ensure that tests perform reliably the same. There is no point in getting into which product is better. As far as the FDA rules are concerned, a validated system beats a better system any day of the week. There is no such requirement. Here is your polite answer. You can develop and try your software in R. Should they need to use those results in a report that will matter to the FDA, then you can work together with him to set up a validated environment for S-plus. You then have to commit to port your code to S-plus. That doesn't follow. What matters is good statistical analysis practice no matter which environment you use. Note that more errors are made in the data preparation / derived variables stage than are made by statistical software. Frank As I assume that you do not work in a regulated environment, you probably wouldn't have access to a validated SAS environment anyway. It is not usually enough to install a piece of software; you have to validate every step of the installation. Since AFAIK the FDA uses S-plus, it would be to your pharma person's advantage to speed up submissions if they also had a validated S-plus environment. http://www.msmiami.com/custom/downloads/S-PLUSValidationdatasheet_Final.pdf -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Wensui Liu Sent: Friday, June 08, 2007 9:24 AM To: Giovanni Parrinello Cc: r-help@stat.math.ethz.ch Subject: Re: [R] R is not a validated software package.. I'd like to know the answer as well. To be honest, I really have a hard time understanding the mentality of clinical trial guys and rather believe it is something related to job security. On 6/8/07, Giovanni Parrinello [EMAIL PROTECTED] wrote: Dear All, discussing with a statistician at a pharmaceutical company, I received this answer about the statistical package that I have planned to use: As R is not a validated software package, we would like to ask if it would rather be possible for you to use SAS, SPSS or another approved statistical software system. Could someone suggest a 'polite' answer? TIA Giovanni -- dr. Giovanni Parrinello External Lecturer Medical Statistics Unit Department of Biomedical Sciences Viale Europa, 11 - 25123 Brescia Italy Tel: +390303717528 Fax: +390303717488 email: [EMAIL PROTECTED] [[alternative HTML version deleted]] __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] R is not a validated software package..
Bert Gunter wrote: Frank et al.: I believe this is a bit too facile. 21 CFR Part 11 does necessitate a software validation **process** -- but this process does not require any For database software and for medical devices - particular software. Rather, it requires that those using whatever software demonstrate to the FDA's satisfaction that the software does what it's supposed to do appropriately. This includes a lot more than assuring, say, the numerical accuracy of computations; I think it also requires demonstration that the data are secure, that it is properly transferred from one source to another, etc. I assume that the statistical validation of R would be relatively simple, as R already has an extensive test suite, and it would simply be a matter of providing that test suite info. A bit more might be required, but I don't think it's such a big deal. I think Wensui Liu's characterization of clinical statisticians as having a mentality related to job security is a canard. Although I work in nonclinical, my observation is that clinical statistics is complex and difficult, not only because of many challenging statistical issues, but also because of the labyrinthine complexities of the regulated and extremely costly environment in which they work. It is certainly a job that I could not do. That said, probably the greatest obstacle to change from SAS is neither obstinacy nor ignorance, but rather inertia: pharmaceutical companies have over the decades made a huge investment in SAS infrastructure to support the collection, organization, analysis, and submission of data for clinical trials. To convert this to anything else would be a herculean task involving huge expense, risk, and resources. R, S-Plus (and much else -- e.g. numerous unvalidated data mining software packages) are routinely used by clinical statisticians to better understand their data and for exploratory analyses that are used to supplement official analyses (e.g. for trying to justify collection of tissue samples or a pivotal study in a patient subpopulation). But it is difficult for me to see how one could make a business case to change clinical trial analysis software infrastructure from SAS to S-Plus, SPSS, or anything else. What I would love to have is some efficiency estimates for SAS macro programming as done in pharma vs. using a high-level language. My bias is that SAS macro programming, which costs companies more than SAS licenses, is incredibly inefficient. Frank **DISCLAIMER** My opinions only. They do not in any way represent the view of my company or its employees. Bert Gunter Genentech Nonclinical Statistics South San Francisco, CA 94404 650-467-7374 -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Frank E Harrell Jr Sent: Friday, June 08, 2007 7:45 AM To: Giovanni Parrinello Cc: r-help@stat.math.ethz.ch Subject: Re: [R] R is not a validated software package.. Giovanni Parrinello wrote: Dear All, discussing with a statistician at a pharmaceutical company, I received this answer about the statistical package that I have planned to use: As R is not a validated software package, we would like to ask if it would rather be possible for you to use SAS, SPSS or another approved statistical software system. Could someone suggest a 'polite' answer? TIA Giovanni Search the archives and you'll find a LOT of responses. Briefly, in my view there are no requirements, just some pharma companies that think there are.
FDA is required to accept all submissions, and they get some where only Excel was used, or Minitab, and lots more. There is a session on this at the upcoming R International Users Meeting in Iowa in August. The session will include discussions of federal regulation compliance for R, for those users who feel that such compliance is actually needed. Frank -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Tools For Preparing Data For Analysis
Dale Steele wrote: For Windows users, EpiData Entry http://www.epidata.dk/ is an excellent (free) tool for data entry and documentation. --Dale Note that EpiData seems to work well under Linux using Wine. Frank __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Harrell's C
Wachtel, Mitchell wrote: R will not let me load Design or Hmisc anymore. I need to calculate Harrell's C for some survival analyses. Can anyone provide an R formula for Harrell's C? With kindest regards. Mitchell Wachtel, MD You provided very little information. We have had no reports of Hmisc not loading under Windows but Design is not currently available for Windows for R 2.5. We are working to fix that. Frank -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Problem with Weighted Variance in Hmisc
jiho wrote: On 2007-June-01, at 01:03, Tom La Bone wrote: The function wtd.var(x,w) in Hmisc calculates the weighted variance of x where w are the weights. It appears to me that wtd.var(x,w) = var(x) if all of the weights are equal, but this does not appear to be the case. Can someone point out to me where I am going wrong here? Thanks. The true formula of weighted variance is this one: http://www.itl.nist.gov/div898/software/dataplot/refman2/ch2/weighvar.pdf But for computation purposes, wtd.var uses another definition which considers the weights as repeats instead of true weights. However, if the weights are normalized (sum to one) the two formulas are equal. If you consider weights as real weights instead of repeats, I would recommend using this option. With normwt=T, your issue is solved:

a = 1:10
b = a; b[] = 2
b
[1] 2 2 2 2 2 2 2 2 2 2
wtd.var(a, b)   # all weights equal 2 => two repeats of each element of a
[1] 8.68421
var(c(a, a))
[1] 8.68421
wtd.var(a, b, normwt = TRUE)
[1] 9.17
var(a)
[1] 9.17

Cheers, JiHO The issue is what is being assumed for N in the denominator of the variance formula, since the unbiased estimator subtracts one. Using normwt=TRUE means you are in effect assuming N is the number of elements in the data vector, ignoring the weights. Frank Harrell --- http://jo.irisson.free.fr/ __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] R's Spearman
Mendiburu, Felipe (CIP) wrote: Dear Ray, The Spearman correlation calculated by R is correct with or without ties; what is not correct is the probability in the case of ties. I send you the formula for the correlation with ties, which agrees with R. Regards, Felipe de Mendiburu Statistician Just use midranks for ties (as with rank()) and compute the ordinary correlation on those (see also the spearman2 and rcorr functions in the Hmisc package). No need to use complex formulas. And the t approximation for p-values works pretty well. Frank Harrell

# Spearman correlation rs with ties or no ties
rs <- function(x, y) {
  d <- rank(x) - rank(y)
  tx <- as.numeric(table(x))
  ty <- as.numeric(table(y))
  Lx <- sum((tx^3 - tx)/12)
  Ly <- sum((ty^3 - ty)/12)
  N <- length(x)
  SX2 <- (N^3 - N)/12 - Lx
  SY2 <- (N^3 - N)/12 - Ly
  rs <- (SX2 + SY2 - sum(d^2))/(2*sqrt(SX2*SY2))
  return(rs)
}

# Application
cor(y[,1], y[,2], method="spearman")
[1] 0.2319084
rs(y[,1], y[,2])
[1] 0.2319084

-Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Behalf Of Raymond Wan Sent: Monday, May 28, 2007 10:29 PM To: r-help@stat.math.ethz.ch Subject: [R] R's Spearman

Hi all, I am trying to figure out the formula used by R's Spearman rho (using cor(method="spearman")) because I can't seem to get the same value as by calculating by hand. Perhaps I'm using cor wrong, but I don't know where. Basically, I am running these commands:

y = read.table(file="tmp", header=TRUE, sep="\t")
y
   IQ Hours
1 106     7
2  86     0
3  97    20
4 113    12
5 120    12
6 110    17
cor(y[1], y[2], method="spearman")
       Hours
IQ 0.2319084

[it's an abbreviated example of one I took from Wikipedia]. I calculated by hand (apologies if the table looks strange when pasted into e-mail):

    IQ  Hours  rank(IQ)  rank(Hours)  diff  diff^2
1  106      7         3            2     1       1
2   86      0         1            1     0       0
3   97     20         2            6    -4      16
4  113     12         5          3.5   1.5    2.25
5  120     12         6          3.5   2.5    6.25
6  110     17         4            5    -1       1
                                              26.5

rho = 0.242857, where rho = 1 - (6 * 26.5) / (6 * (6^2 - 1)). I kept modifying the table and realized that the difference in result comes from ties, i.e., if I remove the tie in rows 4 and 5, I get the same result from both cor and calculating by hand. Perhaps I'm handling ties wrong... does anyone know how R does it, or perhaps I need to change how I'm using it? Thank you! Ray

__ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
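A minimal sketch of Frank's midranks suggestion on the same data; with ties, Pearson's r on the ranks reproduces cor(..., method = "spearman"):

cor(rank(y$IQ), rank(y$Hours))           # Pearson on midranks = Spearman with ties
library(Hmisc)
rcorr(y$IQ, y$Hours, type = "spearman")  # adds the t-approximation p-value Frank mentions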
Re: [R] AIC for lrm(Hmisc/Design) model.
Milton Cezar Ribeiro wrote: Dear all, I am fitting a logistic regression model using the lrm() function of the Hmisc/Design packages. Now I would like to compute AIC for this model. How can I do that? Kind regards, miltinho Brazil I like to change AIC to have it on the chi-square scale. For that you can do

aic <- function(fit) round(unname(fit$stats['Model L.R.'] - 2*fit$stats['d.f.']), 2)
f <- lrm( ... )
aic(f)

If unname doesn't exist in S-Plus as it does in R, you can remove that part. -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] normality tests
[EMAIL PROTECTED] wrote: Hi all, apologies for seeking advice on a general stats question. I've run normality tests using 8 different methods:

- Lilliefors
- Shapiro-Wilk
- Robust Jarque Bera
- Jarque Bera
- Anderson-Darling
- Pearson chi-square
- Cramer-von Mises
- Shapiro-Francia

All show that the null hypothesis that the data come from a normal distro cannot be rejected. Great. However, I don't think it looks nice to report the values of 8 different tests in a report. One note is that my sample size is really tiny (less than 20 independent cases). Without wanting to start a flame war, is there any advice on which one/ones would be more appropriate and should be reported (along with a Q-Q plot)? Thank you. Regards, Wow - I have so many concerns with that approach that it's hard to know where to begin. But first of all, why care about normality? Why not use distribution-free methods? You should examine the power of the tests for n=20. You'll probably find it's not good enough to reach a reliable conclusion. Frank -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] normality tests [Broadcast]
Lucke, Joseph F wrote: Most standard tests, such as t-tests and ANOVA, are fairly resistant to non-normality for significance testing. It's the sample means that have to be normal, not the data. The CLT kicks in fairly quickly. Testing for normality prior to choosing a test statistic is generally not a good idea. I beg to differ, Joseph. I have had many datasets in which the CLT was of no use whatsoever, i.e., where bootstrap confidence limits were asymmetric because the data were so skewed, and where symmetric normality-based confidence intervals had bad coverage in both tails (though correct on the average). I see this the opposite way: nonparametric tests work fine if normality holds. Note that the CLT helps with type I error but not so much with type II error. Frank -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Liaw, Andy Sent: Friday, May 25, 2007 12:04 PM To: [EMAIL PROTECTED]; Frank E Harrell Jr Cc: r-help Subject: Re: [R] normality tests [Broadcast] From: [EMAIL PROTECTED] On 25/05/07, Frank E Harrell Jr [EMAIL PROTECTED] wrote: [EMAIL PROTECTED] wrote: Hi all, apologies for seeking advice on a general stats question. I've run normality tests using 8 different methods: - Lilliefors - Shapiro-Wilk - Robust Jarque Bera - Jarque Bera - Anderson-Darling - Pearson chi-square - Cramer-von Mises - Shapiro-Francia All show that the null hypothesis that the data come from a normal distro cannot be rejected. Great. However, I don't think it looks nice to report the values of 8 different tests in a report. One note is that my sample size is really tiny (less than 20 independent cases). Without wanting to start a flame war, is there any advice on which one/ones would be more appropriate and should be reported (along with a Q-Q plot)? Thank you. Regards, Wow - I have so many concerns with that approach that it's hard to know where to begin. But first of all, why care about normality? Why not use distribution-free methods? You should examine the power of the tests for n=20. You'll probably find it's not good enough to reach a reliable conclusion. And wouldn't it be even worse if I used non-parametric tests? I believe what Frank meant was that it's probably better to use a distribution-free procedure to do the real test of interest (if there is one) instead of testing for normality, and then use a test that assumes normality. I guess the question is, what exactly do you want to do with the outcome of the normality tests? If those are going to be used as the basis for deciding which test(s) to do next, then I concur with Frank's reservation. Generally speaking, I do not find goodness-of-fit for distributions very useful, mostly for the reason that failure to reject the null is no evidence in favor of the null. It's difficult for me to imagine why "there's insufficient evidence to show that the data did not come from a normal distribution" would be interesting. Andy Frank -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University -- yianni __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. 
-- Notice: This e-mail message, together with any attachments,...{{dropped}} __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
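A minimal sketch of Frank's point about skewed data and small n, with simulated data (n = 20): the percentile bootstrap interval for the mean is asymmetric where the normality-based interval is forced to be symmetric.

set.seed(1)
x <- rexp(20)                                          # strongly skewed sample
m <- replicate(5000, mean(sample(x, replace = TRUE)))  # bootstrap means
quantile(m, c(.025, .975))                             # asymmetric about mean(x)
mean(x) + c(-1, 1) * qt(.975, 19) * sd(x)/sqrt(20)     # symmetric t interval for comparison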
Re: [R] normality tests [Broadcast]
[EMAIL PROTECTED] wrote: Thank you all for your replies; they have been most useful... well, in my case I have chosen to do some parametric tests (more precisely correlation and linear regressions among some variables)... so it would be nice if I had an extra bit of support on my decisions... If I understood well from all your replies... I shouldn't pay so much attention to the normality tests, so it wouldn't matter which one/ones I use to report... but rather focus on issues such as the power of the test... If doing regression I assume your normality tests were on residuals rather than raw data. Frank Thanks again. On 25/05/07, Lucke, Joseph F [EMAIL PROTECTED] wrote: Most standard tests, such as t-tests and ANOVA, are fairly resistant to non-normality for significance testing. It's the sample means that have to be normal, not the data. The CLT kicks in fairly quickly. Testing for normality prior to choosing a test statistic is generally not a good idea. -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Liaw, Andy Sent: Friday, May 25, 2007 12:04 PM To: [EMAIL PROTECTED]; Frank E Harrell Jr Cc: r-help Subject: Re: [R] normality tests [Broadcast] From: [EMAIL PROTECTED] On 25/05/07, Frank E Harrell Jr [EMAIL PROTECTED] wrote: [EMAIL PROTECTED] wrote: Hi all, apologies for seeking advice on a general stats question. I've run normality tests using 8 different methods: - Lilliefors - Shapiro-Wilk - Robust Jarque Bera - Jarque Bera - Anderson-Darling - Pearson chi-square - Cramer-von Mises - Shapiro-Francia All show that the null hypothesis that the data come from a normal distro cannot be rejected. Great. However, I don't think it looks nice to report the values of 8 different tests in a report. One note is that my sample size is really tiny (less than 20 independent cases). Without wanting to start a flame war, is there any advice on which one/ones would be more appropriate and should be reported (along with a Q-Q plot)? Thank you. Regards, Wow - I have so many concerns with that approach that it's hard to know where to begin. But first of all, why care about normality? Why not use distribution-free methods? You should examine the power of the tests for n=20. You'll probably find it's not good enough to reach a reliable conclusion. And wouldn't it be even worse if I used non-parametric tests? I believe what Frank meant was that it's probably better to use a distribution-free procedure to do the real test of interest (if there is one) instead of testing for normality, and then use a test that assumes normality. I guess the question is, what exactly do you want to do with the outcome of the normality tests? If those are going to be used as the basis for deciding which test(s) to do next, then I concur with Frank's reservation. Generally speaking, I do not find goodness-of-fit for distributions very useful, mostly for the reason that failure to reject the null is no evidence in favor of the null. It's difficult for me to imagine why "there's insufficient evidence to show that the data did not come from a normal distribution" would be interesting. Andy Frank -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University -- yianni __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. 
-- Notice: This e-mail message, together with any attachments,...{{dropped}} __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] normality tests [Broadcast]
[EMAIL PROTECTED] wrote: Following up on Frank's thought, why is it that parametric tests are so much more popular than their non-parametric counterparts? As non-parametric tests require fewer assumptions, why aren't they the default? The relative efficiency of the Wilcoxon test as compared to the t-test is 0.955, and yet I still see t-tests in the medical literature all the time. Granted, the Wilcoxon still requires the assumption of symmetry (I'm curious as to why the Wilcoxon is often used when asymmetry is suspected, since the Wilcoxon assumes symmetry), but that's less stringent than requiring normally distributed data. In a similar vein, one usually sees the mean and standard deviation reported as summary statistics for a continuous variable - these are not very informative unless you assume the variable is normally distributed. However, clinicians often insist that I include these figures in reports. Cody Hamilton, PhD Edwards Lifesciences Well said Cody, just want to add that the Wilcoxon does not assume symmetry if you are interested in testing for stochastic ordering and not just for a mean. Frank

From: Frank E Harrell Jr [EMAIL PROTECTED]
To: Lucke, Joseph F
Cc: r-help r-help@stat.math.ethz.ch
Sent: 05/25/2007 02:42 PM
Subject: Re: [R] normality tests [Broadcast]

Lucke, Joseph F wrote: Most standard tests, such as t-tests and ANOVA, are fairly resistant to non-normality for significance testing. It's the sample means that have to be normal, not the data. The CLT kicks in fairly quickly. Testing for normality prior to choosing a test statistic is generally not a good idea. I beg to differ, Joseph. I have had many datasets in which the CLT was of no use whatsoever, i.e., where bootstrap confidence limits were asymmetric because the data were so skewed, and where symmetric normality-based confidence intervals had bad coverage in both tails (though correct on the average). I see this the opposite way: nonparametric tests work fine if normality holds. Note that the CLT helps with type I error but not so much with type II error. Frank -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Liaw, Andy Sent: Friday, May 25, 2007 12:04 PM To: [EMAIL PROTECTED]; Frank E Harrell Jr Cc: r-help Subject: Re: [R] normality tests [Broadcast] From: [EMAIL PROTECTED] On 25/05/07, Frank E Harrell Jr [EMAIL PROTECTED] wrote: [EMAIL PROTECTED] wrote: Hi all, apologies for seeking advice on a general stats question. I've run normality tests using 8 different methods: - Lilliefors - Shapiro-Wilk - Robust Jarque Bera - Jarque Bera - Anderson-Darling - Pearson chi-square - Cramer-von Mises - Shapiro-Francia All show that the null hypothesis that the data come from a normal distro cannot be rejected. Great. However, I don't think it looks nice to report the values of 8 different tests in a report. One note is that my sample size is really tiny (less than 20 independent cases). Without wanting to start a flame war, is there any advice on which one/ones would be more appropriate and should be reported (along with a Q-Q plot)? Thank you. Regards, Wow - I have so many concerns with that approach that it's hard to know where to begin. But first of all, why care about normality? Why not use distribution-free methods? You should examine the power of the tests for n=20. You'll probably find it's not good enough to reach a reliable conclusion. And wouldn't it be even worse if I used non-parametric tests? 
I believe what Frank meant was that it's probably better to use a distribution-free procedure to do the real test of interest (if there is one) instead of testing for normality, and then use a test that assumes normality. I guess the question is, what exactly do you want
Re: [R] clustered standarderrors using design package
Anders Eklund wrote: Please help, I have a strange problem. I have a balanced panel data set. I use dummy variable regression and I've got results with the lm function:

summary(lm(y ~ post + t19961 + t19962 + t19963 + t19964 + t19971 + t19972 + t19973 + t19974 + t19981 + factor(id)))

The problem is that I would like to get my standard errors clustered, but then I get the following error message:

f <- lm(y ~ post + t19961 + t19962 + t19963 + t19964 + t19971 + t19972 + t19973 + t19974 + t19981 + factor(id))

Omit id from the model (usually) and do

library(Design)
f <- ols(y ~ post + . . . , x=TRUE, y=TRUE)
g.clust1 <- robcov(f, id)

Frank

library(Design)
g.clust1 <- robcov(f, id)
Error in match.arg(type) : 'arg' should be one of working, response, deviance, pearson, partial

All my variables are vectors, and I've tried with other variables inside and outside the model and all result in the same error message. Best regards, Anders Eklund Stockholm, Sweden. __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
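A minimal sketch of the robcov route Frank describes, assuming the poster's vectors are collected in a hypothetical data frame pd with cluster variable id; only a few of the time dummies are shown:

library(Design)
f <- ols(y ~ post + t19961 + t19962 + t19963,   # id omitted from the model, per Frank
         data = pd, x = TRUE, y = TRUE)         # x, y are required by robcov
g <- robcov(f, pd$id)   # Huber cluster-sandwich covariance, clustered on id
g                       # same coefficients, cluster-corrected standard errors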
Re: [R] How to analyse simple study: Placebo-controlled (2 groups) repeated measurements (ANOVA, ANCOA???)
Karl Knoblick wrote: Hallo! I have two groups (placebo/verum); every subject is measured at 5 times, the first time t0 is the baseline measurement, t1 to t4 are the measurements after applying the medication (placebo or verum). The question is whether there is a significant difference between the two groups and how large the difference is (95% confidence intervals). Let me give sample data:

# Data
ID <- factor(rep(1:50, each=5))    # 50 subjects
GROUP <- factor(c(rep("Verum", 115), rep("Placebo", 135)))
TIME <- factor(rep(paste("t", 0:4, sep=""), 50))
set.seed(1234)
Y <- rnorm(250)
# to have an effect:
Y[GROUP=="Verum" & TIME=="t1"] <- Y[GROUP=="Verum" & TIME=="t1"] + 0.6
Y[GROUP=="Verum" & TIME=="t2"] <- Y[GROUP=="Verum" & TIME=="t2"] + 0.3
Y[GROUP=="Verum" & TIME=="t3"] <- Y[GROUP=="Verum" & TIME=="t3"] + 0.9
Y[GROUP=="Verum" & TIME=="t4"] <- Y[GROUP=="Verum" & TIME=="t4"] + 0.9
DF <- data.frame(Y, ID, GROUP, TIME)

I have heard of different ways to analyse the data:

1) Comparing the endpoint t4 between the groups (t-test), ignoring baseline. Don't even consider this.
2) Comparing the difference t4 minus t0 between the two groups (t-test). This is not optimal.
3) Comparing the endpoint t4 with t0 as a covariate between the groups (ANOVA - how can this model be calculated in R?). Using t0 as a covariate is the way to go. A question is whether to just use t4. Generally this is not optimum.
4) Taking a summary score (I'm not sure, but this may be a suggestion of Altman) instead of t4.
5) ANOVA (repeated measurements), times t0 to t4, group placebo/verum, subject as random factor - interested in the interaction times*groups (How to do this in R?)
6) as 5) but times t1 to t4, ignoring baseline (How to do this in R?)
7) as 6) but with the additional covariate baseline t0 (How to do this in R?)

What will be best? (Advantages / disadvantages?) How do I analyse these models in R with nested and random effects and a possible covariate (ID, group - at least I think so) and random parameter ID? Or is there a simpler possibility? It's not obvious that random effects are needed if you take the correlation into account in a good way. Generalized least squares using, for example, an AR1 correlation structure (and there are many others) is something I often prefer. A detailed case study with R code (similar to your situation) is at http://biostat.mc.vanderbilt.edu/FrankHarrellGLS . This includes details about why t0 is best considered as a covariate. One reason is that the t0 effect may not be linear. If you want to focus on t4 it is easy to specify a contrast (after fitting is completed) that tests t4. If time is continuous this contrast would involve predicted values at the 4th time, otherwise testing single parameters. Frank Harrell Perhaps somebody can recommend a book or weblink where these different strategies of analysis are discussed - preferably with examples with raw data which I can recalculate. And if the R syntax is included - this would be best! Any help will be appreciated! Thanks! Karl -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
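A minimal sketch of the GLS approach Frank describes, applied to the simulated data above: baseline t0 enters as a covariate and an AR1 structure handles the within-subject correlation over t1-t4 (the linear Y0 term is a simplification; Frank's case study discusses allowing it to be nonlinear):

library(nlme)
Y0  <- DF$Y[DF$TIME == "t0"]          # one baseline value per subject (rows are ordered by ID)
DF2 <- subset(DF, TIME != "t0")
DF2$Y0 <- rep(Y0, each = 4)           # attach each subject's baseline to its 4 follow-up rows
DF2$TIME <- factor(DF2$TIME)          # drop the now-empty t0 level
fit <- gls(Y ~ Y0 + GROUP * TIME, data = DF2,
           correlation = corAR1(form = ~ 1 | ID))
summary(fit)                          # GROUP:TIME terms carry the treatment effect over time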
Re: [R] Re : Bootstrap sampling for repeated measures
justin bem wrote: Hi, If it was me I would have done this - first reshape the data frame to get something like

header  measure1 measure2 measure3 ...
1280001 2.47     1.48     2.23     ...

since you have the same number of measures for every subject. Then you define your statistic with the data frame in this form, and you can use the boot function in the boot package or the Hmisc bootstrap functions. Justin BEM

I don't think that's the best way to go. As in the Design package (see the predab.resample, validate, calibrate functions) and the Hmisc rm.boot function, you can easily sample from subject IDs and put together the needed records. Frank Harrell

Elève Ingénieur Statisticien Economiste BP 294 Yaoundé. Tél (00237) 9597295.

Original message
From: Niccolò Bassani [EMAIL PROTECTED]
To: r-help@stat.math.ethz.ch
Sent: Tuesday, 15 May 2007, 11:15:51
Subject: [R] Bootstrap sampling for repeated measures

Dear R users, I'm having some problems trying to create a routine for a bootstrap resampling issue. Suppose I've got a dataset like this:

header  inr  weeks
1280001 2.47 0
1280001 1.48 1
1280001 2.23 2
...
1280369 2.5  56

i.e. a dataset with n subjects identified by the column header, with a set of repeated measures. The number of repeated measures for each subject is 57, with a few of them being more or less frequent. That is, generalizing, I haven't got the same number of observations for each patient. I've created a function allowing me to reorder, subset and calculate some statistics over this dataset, but now I need to bootstrap it all. I was looking for a routine in R that could resample longitudinal data, in order to resample on the ID of the subjects. This means that while resampling (suppose m samples of length n) I wish to consider (preferably with replacement) either none or all of the observations related to a subject. So, if my first bootstrap sample takes the patient with header 1280001, I want the routine to consider all of the observations related to the subject with that header. Thus, I shall obtain a bootstrap sample of my original dataset to which I can apply the function cited before (whose only argument is the dataset). Can anybody help me? I'm trying to understand how the rm.boot function from the Hmisc package resamples this way, but it's not that easy, so if anyone could help me I'd be very grateful. Thanks in advance Niccolò [[alternative HTML version deleted]] -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
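[A minimal sketch of resampling whole subjects as Frank describes: draw subject IDs with replacement, assemble every record of each drawn subject, recompute the statistic. The data and the statistic (median inr) are made up for illustration.]

set.seed(42)
d <- data.frame(header = rep(1:20, times = sample(5:7, 20, replace = TRUE)))
d$inr <- rnorm(nrow(d), 2.5, 0.5)

boot.stat <- function(data) median(data$inr)   # replace with the real statistic

ids <- unique(d$header)
B   <- 1000
res <- numeric(B)
for (b in 1:B) {
  take <- sample(ids, length(ids), replace = TRUE)   # resample subject IDs
  # keep all records of each drawn subject (duplicates kept)
  samp <- do.call(rbind, lapply(take, function(i) d[d$header == i, ]))
  res[b] <- boot.stat(samp)
}
quantile(res, c(.025, .975))   # percentile bootstrap interval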
Re: [R] Nicely formatted summary table with mean, standard deviation or number and proportion
Keith Wong wrote: Dear all, The incredibly useful Hmisc package provides a method to generate summary tables that can be typeset in LaTeX. The Alzola and Harrell book "An Introduction to S and the Hmisc and Design Libraries" provides an example that generates the mean and quartiles for continuous variables, and numbers and percentages for count variables: summary() with method = 'reverse'. I wonder if there is a way to change it so that the mean and standard deviation are reported instead for continuous variables. I illustrate my question below using an example from the book. Thank you. Keith

Newer versions of Hmisc have an option to add the mean and SD for method='reverse'. Quartiles are always there. Frank

library(Hmisc)
set.seed(173)
sex = factor(sample(c("m", "f"), 500, rep = T))
age = rnorm(500, 50, 5)
treatment = factor(sample(c("Drug", "Placebo"), 500, rep = T))
summary(sex ~ treatment, fun = table)

sex    N=500
+---------+-------+---+---+---+
|         |       |N  |f  |m  |
+---------+-------+---+---+---+
|treatment|Drug   |263|140|123|
|         |Placebo|237|133|104|
+---------+-------+---+---+---+
|Overall  |       |500|273|227|
+---------+-------+---+---+---+

(x = summary(treatment ~ age + sex, method = "reverse"))
# generates quartiles for continuous variables

Descriptive Statistics by treatment
+-------+--------------+--------------+
|       |Drug          |Placebo       |
|       |(N=263)       |(N=237)       |
+-------+--------------+--------------+
|age    |46.5/49.9/53.2|46.7/50.0/53.4|
+-------+--------------+--------------+
|sex : m|  47%  (123)  |  44%  (104)  |
+-------+--------------+--------------+

# latex(x) generates a very nicely formatted table
# but I'd like mean (standard deviation) instead of quartiles.
# this function from http://tolstoy.newcastle.edu.au/R/e2/help/06/11/4713.html
g <- function(y) {
  s <- apply(y, 2,
    function(z) {
      z <- z[!is.na(z)]
      n <- length(z)
      if(n==0) c(NA,NA,NA,0) else
      if(n==1) c(z, NA,NA,1) else {
        m <- mean(z)
        s <- sd(z)
        c(N=n, Mean=m, SD=s)
      }
    })
  w <- as.vector(s)
  names(w) <- as.vector(outer(rownames(s), colnames(s), paste, sep=''))
  w
}
summary(treatment ~ age + sex, method = "reverse", fun = g)
# does not work; the 'fun' argument is ignored.

Descriptive Statistics by treatment
+-------+--------------+--------------+
|       |Drug          |Placebo       |
|       |(N=263)       |(N=237)       |
+-------+--------------+--------------+
|age    |46.5/49.9/53.2|46.7/50.0/53.4|
+-------+--------------+--------------+
|sex : m|  47%  (123)  |  44%  (104)  |
+-------+--------------+--------------+

(x1 = summarize(cbind(age), llist(treatment), FUN = g, stat.name=c("n", "mean", "sd")))

  treatment   n mean   sd
1      Drug 263 49.9 4.94
2   Placebo 237 50.1 4.97

# this works, but the table is rotated, and count data have to be
# treated separately.

-- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
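[If I recall the Hmisc interface of that era correctly, the option Frank mentions is the prmsd argument of the print and latex methods for method='reverse' summaries; treat the argument name as an assumption to check against your Hmisc version.]

x <- summary(treatment ~ age + sex, method = "reverse")
print(x, prmsd = TRUE)   # prmsd (assumed name): adds mean and SD to the quartiles
# latex(x, prmsd = TRUE) # same addition in the LaTeX table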
Re: [R] Read SAS data into R
John Kane wrote: Have a look at the Hmisc package: sas.get or sasxport.get. In either case you need to have a working SAS installation on the same machine. Another way, although you lose the labels, is simply to export the SAS data file as a csv or delimited file and import it.

I'm also looking into running the SAS Viewer under wine on Linux to read non-transport SAS files in a new Hmisc function. But so far there is a problem - SAS in all its wisdom doesn't quote character-string output when creating csv files with the Viewer, so you can't have commas in your character strings. There is a tab-delimiter option, as long as you don't have tabs in the strings. Amazing that SAS couldn't do better than that. Frank

--- AbouEl-Makarim Aboueissa [EMAIL PROTECTED] wrote: Dear ALL: Could you please let me know how to read a SAS data file into R. Thank you so much for your help. Regards; Abou == AbouEl-Makarim Aboueissa, Ph.D. Assistant Professor of Statistics Department of Mathematics & Statistics University of Southern Maine 96 Falmouth Street P.O. Box 9300 Portland, ME 04104-9300 Tel: (207) 228-8389 Email: [EMAIL PROTECTED] [EMAIL PROTECTED] Office: 301C Payson Smith -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
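[A hedged sketch of the two Hmisc routes mentioned; the paths and dataset names are hypothetical, and sas.get shells out to SAS, so SAS must be installed for that route.]

library(Hmisc)
d1 <- sas.get("/path/to/saslib", member = "mydata")   # from a SAS data library
d2 <- sasxport.get("/path/to/mydata.xpt")             # from a SAS transport file
contents(d1)   # variable labels are carried over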
Re: [R] Follow-up about ordinal logit with mixtures: how about 'continuation ratio' strategy?
Paul Johnson wrote: This is a follow-up to the message I posted 3 days ago about how to estimate mixed ordinal logit models. I hope you don't mind that I am just pasting in the code and comments from an R file for your feedback. Actual estimates are at the end of the post. . . .

Paul, lrm does not give an incorrect sign on the intercepts. Just look at how it states the model, in terms of Prob(Y >= j), so that its coefficients are consistent with the way people state binary models. I'm not clear on your generation of simulated data. I specify the population logit, anti-logit that, and generate binary responses with those probabilities. I don't use rlogis. See if using the PO model with lrm, with penalization on the factor, does what you need. lrm is not set up to omit an intercept with the -1 notation. My book goes into detail about the continuation ratio model. Frank Harrell __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
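[A minimal sketch of the simulation recipe Frank describes: specify the population logit, anti-logit it, and draw binary responses with those probabilities. The covariate, coefficients, and sample size are invented.]

set.seed(1)
n  <- 500
x  <- rnorm(n)
lp <- -0.5 + 1.2 * x     # population logit
p  <- plogis(lp)         # anti-logit
y  <- rbinom(n, 1, p)    # binary responses with those probabilities
library(Design)
lrm(y ~ x)               # recovers approximately -0.5 and 1.2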
[R] Mantel-Haenszel relative risk with Greenland-Robins variance estimate
Does anyone know of an R function for computing the Greenland-Robins variance for Mantel-Haenszel relative risks? Thanks Frank -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Mantel-Haenszel relative risk with Greenland-Robins variance estimate
[EMAIL PROTECTED] wrote: Would this function help: http://www.csm.ornl.gov/~frome/ES/RRMHex/MHanalysis.txt ? Regards, -Cody

I think so. Thank you Cody. If you have time, would you mind defining, probably offline, the input data elements? I assume that one of them is a stratification factor other than occupational group. Thanks again Frank

From: Frank E Harrell Jr [EMAIL PROTECTED]
To: R list r-help@stat.math.ethz.ch
Sent: 05/08/2007 02:38 PM
Subject: [R] Mantel-Haenszel relative risk with Greenland-Robins variance estimate

Does anyone know of an R function for computing the Greenland-Robins variance for Mantel-Haenszel relative risks? Thanks Frank -- __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] pseudo-R2 or GOF for regression trees?
Prof. Jeffrey Cardille wrote: Hello, Is there an accepted way to convey, for regression trees, something akin to R-squared? I'm developing regression trees for a continuous y variable and I'd like to say how well they are doing. In particular, I'm analyzing the results of a simulation model having highly non-linear behavior, and asking what characteristics of the inputs are related to a particular output measure. I've got a very large number of points: n=4000. I'm not able to do a model sensitivity analysis because of the large number of inputs and the model run time. I've been googling around, both in the archives and on the rest of the web, for several hours, but I'm still having trouble getting a firm sense of the state of the art. Could someone help me to quickly understand what strategy, if any, is acceptable for saying something like "The regression tree in Figure 3 captures 42% of the variance"? The target audience is readers who will be interested in the subsequent verbal explanation of the relationship, but only once they are comfortable that the tree really does capture something. I've run across methods to say how well a tree does relative to a set of trees on the same data, but that doesn't help much unless I'm sure the trees in question are really capturing the essence of the system. I'm happy to be pointed to a web site or to a thread I may have missed that answers this exact question. Thanks very much, Jeff -- Prof. Jeffrey Cardille [EMAIL PROTECTED]

Ye (below) has a method to get a nearly unbiased estimate of R^2 from recursive partitioning. In his examples the result was similar to using the formula for adjusted R^2 with regression degrees of freedom equal to about 3n/4. You can also use something like 10-fold cross-validation, repeated 20 times, to get a fairly precise and unbiased estimate of R^2. Frank

@ARTICLE{ye98mea,
  author  = {Ye, Jianming},
  year    = 1998,
  title   = {On measuring and correcting the effects of data mining and model selection},
  journal = JASA,
  volume  = 93,
  pages   = {120-131},
  annote  = {generalized degrees of freedom; GDF; effective degrees of freedom; data mining; model selection; model uncertainty; overfitting; nonparametric regression; CART; simulation setup}
}

-- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
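[A minimal sketch of the repeated 10-fold cross-validation Frank suggests, estimating R^2 for an rpart regression tree; the data-generating mechanism is invented for illustration.]

library(rpart)
set.seed(7)
n <- 4000
X <- data.frame(x1 = runif(n), x2 = runif(n), x3 = runif(n))
y <- sin(8 * X$x1) + (X$x2 > .5) * X$x3 + rnorm(n, 0, .3)
d <- cbind(y, X)

r2 <- replicate(20, {                       # 20 repeats of 10-fold CV
  fold <- sample(rep(1:10, length.out = n))
  pred <- numeric(n)
  for (k in 1:10) {
    fit <- rpart(y ~ ., data = d[fold != k, ])
    pred[fold == k] <- predict(fit, d[fold == k, ])
  }
  1 - sum((y - pred)^2) / sum((y - mean(y))^2)
})
mean(r2)   # nearly unbiased estimate of R^2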
Re: [R] Error in if (!length(fname) || !any(fname == zname)) { :
hongyuan cao wrote: Dear R users, I tried to fit a Cox proportional hazards model to get estimates of stratified survival probabilities. My R code is as follows:

cph(Surv(time.sur, status.sur) ~ strat(colon[,13]) + colon[,18] + colon[,20] + colon[,9], surv=TRUE)

Error in if (!length(fname) || !any(fname == zname)) { : missing value where TRUE/FALSE needed

Here colon[,13] is the one that I want to stratify on and the others are all covariates. How can I solve this problem? Thanks a lot! Grace

The Design package does not like you to have complex variable names like that, and in general storing your data in a matrix when you want to treat columns as separate variables is not the best approach. I would use something like

S <- with(d, Surv( ) )     # d = data frame
h <- as.data.frame(colon)  # assumes colon is a matrix; make sure it has column names
cph(S ~ strat(v1) + v2 + v3 + v4, data=h)

-- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
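[A self-contained version of Frank's advice, with simulated data standing in for the colon matrix; the variable names v1-v4 and all numbers are placeholders.]

library(Design)   # provides cph() and strat()
set.seed(3)
n <- 200
h <- data.frame(v1 = factor(sample(1:3, n, replace = TRUE)),
                v2 = rnorm(n), v3 = rnorm(n), v4 = rnorm(n))
time   <- rexp(n)
status <- rbinom(n, 1, 0.7)
S <- Surv(time, status)
f <- cph(S ~ strat(v1) + v2 + v3 + v4, data = h, surv = TRUE)
f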
Re: [R] Hmisc curve label size cex
Brian O'Connor wrote: R-Masters, I need to produce high-resolution line plots and place labels on the curves. It seems that cex must be high relative to the other cex values in order to produce sufficiently large, legible tick labels at high resolutions. But high cex values cause the curve labels to become gigantic when using Hmisc. I've struggled and searched the archives, but cannot find a way of controlling the sizes of the curve labels in this situation. These commands produce the problem on my PC running XP:

png("trial.png", width=3000, height=2400, res=600, pointsize=12)
par(ann=FALSE, font.main=1, font.lab=1, font.axis=1, cex=5, cex.main=1, cex.lab=1, cex.axis=1, lwd=12, las=1, mar=c(4, 4, 2, 2))
x = seq(-2.5, 2.5, length=100)
labcurve(list(One=list(x, sin(x)), Two=list(x, cos(x)), Three=list(x, x*x), Four=list(x, exp(x))), keys=c('1','2','3','4'), keyloc="none", pl=TRUE)
dev.off()

Thanks for your time.

cex.main, cex.lab, cex.axis etc. are relative, so for your case you need to specify something like cex.axis=1/5. Not sure why you are using keys of 1-4 when you've already given nice labels. I tried

labcurve(list(One=list(x, sin(x)), Two=list(x, cos(x)), Three=list(x, x*x), Four=list(x, exp(x))), pl=TRUE)

and got some nice output after reducing cex.*. Frank -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Integrating R-programs into larger systems
Ralf Finne wrote: Hi experts! Does anybody have experience of including an R program as part of a larger system? In Matlab there is a toolbox that converts an m-script into C code. One application I have in mind is that I do the model building in R, for estimating the risk of cancer based on clinical measurements. When the model is ready, a small R program can simulate the model to estimate the risk for a new patient. The idea is that a doctor gets the measurements for the patient sitting in his office. Then he goes to the Internet, types in the test measurements and gets the estimated risk. Look at www.finne.info for an early version to get the idea. There I developed the model in Matlab and converted it to Excel. Don't use the results! Much better ones are available in R! There are many more applications that need a higher degree of integration. Thank you in advance. Ralf Finne SYH, University of Applied Sciences Vasa Finland

The Design package for R has a version for S-Plus. In S-Plus you can use its Dialog function to automatically create a GUI for getting predicted values from a series of fitted models. We will be working on an R version, but it will publish the model to a web server. Frank -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Not showing dvi with Hmisc latex()
Duncan Murdoch wrote: On 4/26/2007 9:20 PM, Gad Abraham wrote: Hi, I'm using latex() from Frank Harrell's Hmisc library to produce LaTeX files. By default, it calls xdvi and displays the dvi. How can I make xdvi not show? I couldn't find a clue in the extensive documentation.

Unclass the result so it doesn't print as a latex object. For example,

unclass(latex(1, file="test.tex"))
$file
[1] "test.tex"
$style
character(0)

Alternatively, if you just assign the result you can print it later. It's when you print that the latex'ing happens. Duncan Murdoch

Just say

w <- latex(object, file='foo.tex')

Frank __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Coercing data types for use in model.frame
Prof Brian Ripley wrote: Moved to R-devel. What is the 'data class'? In particular, what is its underlying type? And where in model.frame[.default] are you trying to use it (in the formula, in data, in ..., etc.)? This is an example of where some reproducible code and the error messages would be very helpful indeed. Brian

Brian, Sorry - this was one of those too-late-in-the-day errors. The problem was in a function called just before model.frame. model.frame seems to work fine with an object of class c('mChoice', 'labelled'). It keeps mChoice variables as mChoice. After model.frame is finished I'll change such variables to factors or matrices. Frank

On Tue, 24 Apr 2007, Frank E Harrell Jr wrote: In the Hmisc package there is a new data class 'mChoice' for multiple-choice variables. There are format and as.numeric methods (the latter creates a matrix of dummy variables). mChoice variables are not allowed by model.frame. Is there a way to specify a conversion function that model.frame will use automatically? I would use as.factor here. model.frame does not seem to use as.data.frame.foo for individual variables. Thanks Frank -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] Biostatistician Opportunities at Vanderbilt
The Department of Biostatistics at Vanderbilt University's School of Medicine has openings for biostatisticians at all levels. Details and application procedures may be found at http://biostat.mc.vanderbilt.edu/JobOpenings . For M.S. and B.S. biostatisticians we are especially interested in statisticians proficient in R, S-Plus, or Stata. We have faculty positions available at the Assistant, Associate, and Professor levels. Frank Harrell Chairman, Dept. of Biostatistics __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] Coercing data types for use in model.frame
In the Hmisc package there is a new data class 'mChoice' for multiple choice variables. There are format and as.numeric methods (the latter creates a matrix of dummy variables). mChoice variables are not allowed by model.frame. Is there a way to specify a conversion function that model.frame will use automatically? I would use as.factor here. model.frame does not seem to use as.data.frame.foo for individual variables. Thanks Frank -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] help comparing two median with R
Thomas Lumley wrote: On Tue, 17 Apr 2007, Frank E Harrell Jr wrote: The points that Thomas and Brian have made are certainly correct, if one is truly interested in testing for differences in medians or means. But the Wilcoxon test provides a valid test of x > y more generally. The test is consonant with the Hodges-Lehmann estimator: the median of all possible differences between an X and a Y.

Yes, but there is no ordering of distributions (taken one at a time) that agrees with the Wilcoxon two-sample test, only orderings of pairs of distributions. The Wilcoxon test provides a test of x > y if it is known a priori that the two distributions are stochastically ordered, but not under weaker assumptions. Otherwise you can get x > y > z > x. This is in contrast to the t-test, which orders distributions (by their mean) whether or not they are stochastically ordered. Now, it is not unreasonable to say that the problems are unlikely to occur very often and aren't worth worrying too much about. It does imply that it cannot possibly be true that there is any summary of a single distribution that the Wilcoxon test tests for (and the same is true for other two-sample rank tests, e.g. the logrank test). I know Frank knows this, because I gave a talk on it at Vanderbilt, but most people don't know it. (I thought for a long time that the Wilcoxon rank-sum test was a test for the median pairwise mean, which is actually the R-estimator corresponding to the *one*-sample Wilcoxon test.) -thomas

Thanks for your note Thomas. I do feel that the problems you have rightly listed occur infrequently, and that often I only care about two groups. Rank tests generally are good at relatives, not absolutes. We have an efficient test (Wilcoxon) for a relative shift, but for estimating an absolute one-sample quantity (e.g., the median) the nonparametric estimator is not very efficient. Ironically, there is an exact nonparametric confidence interval for the median (unrelated to Wilcoxon) but none exists for the mean. Cheers, Frank -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
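[A minimal sketch of the Hodges-Lehmann shift estimate Frank mentions, on toy data; wilcox.test's "difference in location" estimate should match the median of all pairwise differences, up to how ties are handled.]

set.seed(2)
x <- rnorm(30)
y <- rnorm(25, mean = 0.5)
median(outer(x, y, "-"))            # Hodges-Lehmann estimate
wilcox.test(x, y, conf.int = TRUE)  # same estimate, with a confidence interval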
Re: [R] help comparing two median with R
[EMAIL PROTECTED] wrote: Has anyone proposed using a bootstrap for Pedro's problem? What about taking a bootstrap sample from x, a bootstrap sample from y, taking the difference in the medians for these two bootstrap samples, repeating the process 1,000 times and calculating a 95% percentile interval from the 1,000 computed differences? You would get a CI on the difference between the medians for these two groups, with which you could determine whether the difference was greater/less than zero. Too crude? Regards, -Cody

As hinted at by Brian Ripley, the following code will approximate that. It gets the nonparametric confidence interval for the median and solves for the variance that would give the same confidence interval width if normality of the median held.

g <- function(y) {
  y <- sort(y[!is.na(y)])
  n <- length(y)
  if(n < 4) return(c(median=median(y), q1=NA, q3=NA, variance=NA))
  qu <- quantile(y, c(.5,.25,.75))
  names(qu) <- NULL
  r <- pmin(qbinom(c(.025,.975), n, .5) + 1, n)  ## Exact 0.95 C.L.
  w <- y[r[2]] - y[r[1]]                         ## Width of C.L.
  var.med <- ((w/1.96)^2)/4                      ## Approximate variance of median
  c(median=qu[1], q1=qu[2], q3=qu[3], variance=var.med)
}

Run g separately by group, add the two variances, and take the square root to approximate the standard error of the difference in medians, from which a confidence interval follows. Frank
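[A minimal sketch of combining g() across two groups as Frank describes, on toy data, using a normal-approximation interval for the difference in medians.]

set.seed(5)
x <- rnorm(40)
y <- rnorm(35, mean = 0.4)
gx <- g(x); gy <- g(y)
diff.med <- unname(gx["median"] - gy["median"])
se       <- unname(sqrt(gx["variance"] + gy["variance"]))
diff.med + c(-1.96, 1.96) * se   # approximate 0.95 CI for the difference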
Re: [R] help comparing two median with R
Thomas Lumley wrote: On Tue, 17 Apr 2007, Robert McFadden wrote: -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Jim Lemon Sent: Tuesday, April 17, 2007 12:37 PM To: Pedro A Reche Cc: r-help@stat.math.ethz.ch Subject: Re: [R] help comparing two median with R

Pedro A Reche wrote: Dear R users, I am new to R and I would like to ask your help with the following topic. I have three sets of numerical data, 2 sets are paired and a third is independent of the other two. For each of these sets I have obtained their basic statistics (mean, median, stdev, range...). Now I want to compare whether these sets differ. I could compare the means using a basic t-test. However, I was looking for a test to compare the medians using R. If that is possible I would love to hear the specifics.

Hi Pedro, You can use the Mann-Whitney test (wilcox with two samples), but you would have to check that the second and third moments of the variable distributions were the same, I think. Jim

Use the Mann-Whitney U test, but remember 2 assumptions: 1. the samples come from a continuous distribution (there are no tied observations); 2. the distributions are identical in shape. It's very similar to the t-test, but the Mann-Whitney U test is not as affected by violation of the homogeneity-of-variance assumption as the t-test is.

This turns out not to be quite correct. If the two distributions differ only by a location shift, then the hypothesis that the shift is zero is equivalent to the medians being the same (or the means, or the 3.14159th percentile), and the Mann-Whitney U test will test this hypothesis. Otherwise the Mann-Whitney U test does not test for equal medians. The assumption that the distributions are continuous is for convenience -- it makes the distribution of the test statistic easier to calculate, and otherwise R uses an approximation. The assumption of a location shift is critical -- otherwise it is easy to construct three data sets x, y, z so that the Mann-Whitney U test thinks x is larger than y, y is larger than z, and z is larger than x (Google for "Efron dice"). That is, the Mann-Whitney U test cannot be a test for any location statistic. There actually is an exact test for the median that does not assume a location shift: dichotomize your data at the pooled median to get a 2x2 table of above/below median by group, and do Fisher's exact test on the table. This is almost never useful (because it doesn't come with an interval estimate), but is interesting because it (and the generalizations to other quantiles) is the only exactly distribution-free location test that does not have the 'non-transitivity' problem of the Mann-Whitney U test. I believe this median test is attributed to Mood, but I have not seen the primary source. -thomas

The Mood test is so inefficient that its use is no longer recommended:

@Article{fri00sho,
  author  = {Freidlin, Boris and Gastwirth, Joseph L.},
  title   = {Should the median test be retired from general use?},
  journal = {American Statistician},
  year    = 2000,
  volume  = 54,
  number  = 3,
  pages   = {161-164},
  annote  = {inefficiency of Mood median test}
}

The points that Thomas and Brian have made are certainly correct, if one is truly interested in testing for differences in medians or means. But the Wilcoxon test provides a valid test of x > y more generally. The test is consonant with the Hodges-Lehmann estimator: the median of all possible differences between an X and a Y.
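[A minimal sketch of the exact median test Thomas describes: dichotomize at the pooled median and apply Fisher's exact test. The data are made up for illustration.]

set.seed(4)
x <- rnorm(30)
y <- rnorm(30, mean = 0.8)
pooled <- median(c(x, y))
tab <- rbind(x = table(factor(x > pooled, levels = c(FALSE, TRUE))),
             y = table(factor(y > pooled, levels = c(FALSE, TRUE))))
fisher.test(tab)   # exact test that the two medians are equal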
Frank -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] read.spss (package foreign) and SPSS 15.0 files
John Kane wrote: --- Charilaos Skiadas [EMAIL PROTECTED] wrote: On Apr 16, 2007, at 10:41 AM, John Kane wrote: --- Charilaos Skiadas [EMAIL PROTECTED] wrote: It is not an export option, it is a "save as" option. I don't have a 14 to check, but on a 15 I just go to File - Save As, and change the "Save as type" field to Comma Delimited (*.csv). (I suppose tab-delimited would be another option.) Then there are two check-boxes below the window that allow a bit further customizing; one of them is about using value labels where defined instead of data values.

I'm now back on a machine with SPSS 14. No csv option that I can see. Perhaps an enhancement in v15.

I don't have a 14, but I did check a 13 today and you are correct, no csv option is there, which in my opinion is quite unacceptable for a statistical package that is on its 13th/14th version. But there was an option for Excel 97 and later, and that seemed to allow using value labels instead of the values (again you have to check the corresponding box). So perhaps that would be an option.

Not my problem at the moment, but a colleague pointed out that he had 750+ variables; Excel handles 256 columns. Actually my problem is now a SAS one, where I can get a clean csv export but lose the variable labels. My crude workaround was just to do a PROC CONTENTS and cut-and-paste the results. A pain, but it worked. I've got to figure out why I cannot get Hmisc to work!

And note that Hmisc has other SAS import options besides sas.get that keep labels. Some of them require you to run PROC CONTENTS CNTLOUT= to save metadata in a dataset. Frank

Thanks for the suggestion. -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Nonparametric Effect size indices
Chuck Cleland wrote: Martin Plöderl wrote: Hello! For comparing two non-normally distributed samples, Leech (2002) suggested reporting nonparametric effect-size indices, such as Vargha & Delaney's A or Cliff's d. I tried to search the R sites, but could not find related procedures or packages that include nonparametric effect sizes. Thank you for your help! Citation: Leech (2002). A call for greater use of nonparametric statistics. Paper presented at the Annual Meeting of the Mid-South Educational Research Association, Chattanooga, TN, November 6-8.

Please beware. That literature is just giving new names to much older concepts. Look at Mann-Whitney, Wilcoxon, Kendall, and Somers for better citations. And "Cohen's d" in the pdf below should just be called a z-score. Frank Harrell

Based on the description of Cliff's d in http://www.florida-air.org/romano06.pdf, you could do something like the following:

g1 <- c(1,1,2,2,2,3,3,3,4,5)
g2 <- c(1,2,3,4,4,5)
# Dominance matrix
sign(outer(g1, g2, FUN="-"))
      [,1] [,2] [,3] [,4] [,5] [,6]
 [1,]    0   -1   -1   -1   -1   -1
 [2,]    0   -1   -1   -1   -1   -1
 [3,]    1    0   -1   -1   -1   -1
 [4,]    1    0   -1   -1   -1   -1
 [5,]    1    0   -1   -1   -1   -1
 [6,]    1    1    0   -1   -1   -1
 [7,]    1    1    0   -1   -1   -1
 [8,]    1    1    0   -1   -1   -1
 [9,]    1    1    1    0    0   -1
[10,]    1    1    1    1    1    0
mean(rowMeans(sign(outer(g1, g2, FUN="-"))))
[1] -0.25

If you can point us to a description of Vargha & Delaney's A, someone can likely suggest a way of obtaining that too.

Regards, Martin Plöderl PhD Suicide Prevention Research Program, Institute of Public Health Paracelsus Private Medical University Dept. of Suicide Prevention, University Clinic of Psychiatry and Psychotherapy Christian-Doppler-Klinik Ignaz-Harrerstrasse 79 A-5020 Salzburg AUSTRIA Phone: +43-662-4483-4345 Fax: +43-662-4483-4344 E-mail: [EMAIL PROTECTED] -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ [EMAIL PROTECTED] mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
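[A hedged sketch connecting Cliff's d to the older quantity Frank points to: with a binary group indicator, Hmisc's somers2 returns Somers' Dxy, which equals P(X > Y) - P(X < Y), i.e. Cliff's d for the two groups.]

library(Hmisc)
g1 <- c(1,1,2,2,2,3,3,3,4,5)
g2 <- c(1,2,3,4,4,5)
somers2(c(g1, g2), c(rep(1, length(g1)), rep(0, length(g2))))["Dxy"]
# Dxy = -0.25, matching mean(rowMeans(sign(outer(g1, g2, "-"))))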
Re: [R] sas.get problem
, format.library = "F:/sas", 2: \C:/Program not found - I really don't see why the sasprog does not work unless an early error is falling through. Thanks for all the help -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Reasons to Use R
Taylor, Z Todd wrote: On Monday, April 09, 2007 3:23 PM, someone named Wilfred wrote: So what's the big deal about S using files instead of memory like R? I don't get the point. Isn't there enough swap space for S? (Who cares anyway: it works, doesn't it?) Or are there any problems with S and large datasets? I don't get it. You use them, Greg. So you might discuss that issue.

S's one-to-one correspondence between S objects and filesystem objects is the single remaining reason I haven't completely converted over to R. With S I can manage my objects via makefiles. Corrections to raw data or changes to analysis scripts get applied to all objects in the project (and there are often thousands of them) by simply typing 'make'. That includes everything right down to the graphics that will go in the report. How do people live without that? --Todd

Personally I'd rather have R's save( ) and load( ). Frank

-- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
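[A minimal sketch of the save()/load() workflow Frank prefers; the object and file names are made up.]

fit <- lm(dist ~ speed, data = cars)
save(fit, file = "fit.RData")   # write the object to disk
rm(fit)
load("fit.RData")               # restores 'fit' into the workspace
coef(fit)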
Re: [R] read.spss (package foreign) and SPSS 15.0 files
Charilaos Skiadas wrote: On Apr 6, 2007, at 12:32 PM, John Kane wrote: I have simply moved to exporting the SPSS file to a delimited file and loading it. Unfortunately I'm losing all the labelling, which can be time-consuming to redo. Some of the data have something like 10 categories for a variable.

I save in csv format all the time, and it offers me a choice to use the labels instead of the corresponding numbers. So you shouldn't have to lose that labelling. Haris Skiadas Department of Mathematics and Computer Science Hanover College

That's a different point. The great advantage of read.spss (and the spss.get function in Hmisc that uses it) is that long variable labels are supported in addition to variable names. That's why I like getting SPSS or Stata files instead of csv files. I'm going to enhance csv.get in Hmisc to allow a row number to be specified that contains long variable labels. Frank -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Doing partial-f test for stepwise regression
Petr Klasterecky wrote: And what about reading the help page, ?anova? "When given a sequence of objects, 'anova' tests the models against one another in the order specified." Generally you almost never fit a full model (including all possible interactions etc.) - no one can interpret such complicated models. Anova gives you a comparison between a broader model (the first argument to anova) and its submodel(s).

True, you might not fit a model with high-order interactions, but the full pre-specified model is the only one whose standard errors and test statistics work as advertised. Frank

Petr

[EMAIL PROTECTED] wrote: Hello all, I am trying to figure out an optimal linear model by using stepwise regression, which requires a partial F-test. I did some Googling on the Internet and realised that someone seemed to have asked the question before:

Jim Milks [EMAIL PROTECTED] writes: Dear all: I have a regression model that has collinearity problems (between three regressor variables). I need an F-test that will allow me to compare between full (with all variables) and partial models (minus 1 or more variables). The general F-test formula I'm using is:

F = {[SS(full model) - SS(reduced model)] / (# variables taken out)} / MSS(full model)

Unfortunately, the ANOVA table parses the SS and MSS between the variables and does not give the statistics for the regression model as a whole, otherwise I'd do this by hand. So, really, I have two questions: 1) Can I just add up all the SS and MSS for all the variables to get the model SS and MSS, and 2) Are there any functions or packages I can use to calculate the F-statistic?

Just use anova(model1, model2). (One potential catch: make sure that both models are fitted to the same data set. Missing values in predictors may interfere.)

However, in the answer provided by Mr. Peter Dalgaard (use anova(model1, model2)), I could not understand what model1 and model2 are supposed to be referring to. Which one is supposed to be the full model and which one the partial model? Or does it not matter? Thanks in advance for help from anyone! Regards, Anyi Zhu -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
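[A minimal sketch of Peter Dalgaard's advice, using R's built-in mtcars data; the order of the two fits changes only the sign of the printed Df difference, not the F statistic or p-value.]

full    <- lm(mpg ~ wt + hp + disp, data = mtcars)
reduced <- lm(mpg ~ wt, data = mtcars)
anova(reduced, full)   # partial F-test for dropping hp and disp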
Re: [R] what is the difference between survival analysis and (...)
Thomas Lumley wrote: On Wed, 28 Mar 2007, Frank E Harrell Jr wrote: Eric Elguero wrote: Hi everybody, recently I had to teach a course on the Cox model, of which I am not a specialist, to an audience of medical epidemiologists. Not a good idea, you might say... anyway, someone in the audience was very hostile. At some point, he said that the Cox model was useless, since all you have to do is count who dies and who survives, divide by the sample sizes and compute a relative risk, and if there was significant censoring, use cumulated follow-up instead of sample sizes, and that's it! I began arguing that in a Cox model you could introduce several variables, interactions, etc.; then I remembered logistic models ;-) The only (and poor) argument I could think of was that if Mr Cox took pains to devise his model, there should be some reason...

That is a very ignorant person, concerning statistical efficiency/power/precision and how to handle incomplete follow-up (variable follow-up duration). There are papers in the literature (I wish I had them at my fingertips) that go into the efficiency loss of just counting events. If the events are very rare, knowing the time doesn't help as much, but the Cox model still handles censoring correctly, and that person's approach doesn't.

Certainly just counting the events is inefficient -- the simplest example would be studies of some advanced cancers where nearly everyone dies during follow-up, so that there is little or no censoring but simple counts are completely uninformative. It's relatively hard to come up with an example where using the total time on test (rather than sample size) as a denominator is much worse than the Cox model, though. You need the baseline hazard to vary a lot over time and the censoring patterns to be quite different in the groups, but proportional hazards to still hold. I think the advantages of the Cox model over a reasonably sensible person-time analysis are real, but not dramatic -- it would be hard to find a data set that would convince the sort of person who would make that sort of claim. I would argue that computational convenience on the one hand, and the ability to exercise lots of nice mathematical tools on the other, have also contributed to the continuing popularity of the Cox model. -thomas Thomas Lumley Assoc. Professor, Biostatistics [EMAIL PROTECTED] University of Washington, Seattle

Nicely put Thomas. I have seen examples from surgical research where the hazard function is bathtub-shaped and the epidemiologist's use of the exponential distribution is very problematic. I have also seen examples in acute illness and medical treatment where time until death is important even with only 30-day follow-up. Frank -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] what is the difference between survival analysis and (...)
Eric Elguero wrote: Hi everybody, recently I had to teach a course on the Cox model, of which I am not a specialist, to an audience of medical epidemiologists. Not a good idea, you might say... anyway, someone in the audience was very hostile. At some point, he said that the Cox model was useless, since all you have to do is count who dies and who survives, divide by the sample sizes and compute a relative risk, and if there was significant censoring, use cumulated follow-up instead of sample sizes and that's it! I began arguing that in a Cox model you could introduce several variables, interactions, etc.; then I remembered logistic models ;-) The only (and poor) argument I could think of was that if Mr Cox took pains to devise his model, there should be some reason...

That is a very ignorant person, concerning statistical efficiency/power/precision and how to handle incomplete follow-up (variable follow-up duration). There are papers in the literature (I wish I had them at my fingertips) that go into the efficiency loss of just counting events. If the events are very rare, knowing the time doesn't help as much, but the Cox model still handles censoring correctly, and that person's approach doesn't. Frank

but the story doesn't end here. When I came back to my office, I tried these two methods on a couple of data sets, and true, the crude RRs are very close to those coming from the Cox model. Hence this question: could someone provide me with a dataset (preferably real) where there is a striking difference between estimated RRs and/or between P-values? And of course I am interested in theoretical arguments and references. Sorry that this question has nothing to do with R, and thank you in advance for your leniency. Eric Elguero GEMI-UMR 2724 IRD-CNRS, Équipe Évolution des Systèmes Symbiotiques, 911 avenue Agropolis, BP 64501, 34394 Montpellier cedex 5 FRANCE -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] weight factor in somers2 function
[EMAIL PROTECTED] wrote: Hi! I'm trying to calculate the C index (concordance probability) with the somers2 function (Hmisc library). I'm interested in including the sampling effort as a weight factor in the evaluation of model predictions against real data. I have some questions about that: first of all, I'm not really sure whether I can include sampling effort as a weight factor. Since the weight factor should be a numeric vector of observations (usually frequencies), I would expect that sampling effort could be a surrogate for the frequency count of the number of subjects (i.e. frequency of observation). However, when I use sampling effort as a weight factor, I get a C index larger than one. I guess/know this is statistically wrong. Then, if these values were frequencies of observation, what is working incorrectly? What should the characteristics of the weight vector be? Or what exactly can be included as a weight factor? Thank you very much!

Send me the smallest artificial example you can construct and I'll work on it. Frank -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] how to get lsmeans?
John Fox wrote: Dear Brian et al., My apologies for chiming in late: It's been a busy day. First some general comments on least-squares means and effect displays. ... much good stuff deleted ... John - the one thing I didn't get from your post is a motivation to do all that as opposed to easy-to-explain predicted values. I would appreciate your thoughts. Thanks Frank -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] how to get lsmeans?
John Fox wrote: Dear Frank, The object is to focus on a high-order term of the model, holding other terms constant at typical values; in the case of a factor, a typical value is ambiguous, so an average is taken over the levels of the factor. If the factor is, e.g., gender, one could produce an estimate of the average fitted value for a population composed equally of men and women, or of a population composed of men and women in proportion to their distribution in the data. Otherwise, one would have to produce separate sets of fitted values for men and women, with the number of such sets increasing as the number of levels of the factors held constant increase. On the scale of the linear predictor, these sets would differ only by constants. Regards, John Makes sense. I just set other factors to the mode, and if it is important to see estimates for other categories, I give estimates for each factor level. If I want to uncondition on a variable (not often) I take the proper weighted average of predicted values. Cheers Frank John Fox Department of Sociology McMaster University Hamilton, Ontario Canada L8S 4M4 905-525-9140x23604 http://socserv.mcmaster.ca/jfox -Original Message- From: Frank E Harrell Jr [mailto:[EMAIL PROTECTED] Sent: Thursday, March 22, 2007 8:41 AM To: John Fox Cc: 'Prof Brian Ripley'; 'r-help'; 'Chuck Cleland'; 'Liaw, Andy' Subject: Re: [R] how to get lsmeans? John Fox wrote: Dear Brian et al., My apologies for chiming in late: It's been a busy day. First some general comments on least-squares means and effect displays. ... much good stuff deleted ... John - the one thing I didn't get from your post is a motivation to do all that as opposed to easy-to-explain predicted values. I would appreciate your thoughts. Thanks Frank -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] how to get lsmeans?
-- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Stepwise Logistic Regression
Sergio Della Franca wrote: Dear R-Helpers, I'd like to perform a logistic regression with a stepwise method. Can you tell me how I can proceed to develop this procedure? Thank you in advance. Sergio Della Franca.

If the number of events is not incredibly large, you can get almost the same result with the following code :-)

candidates <- setdiff(names(mydataframe), 'y')
p <- length(candidates)
sample(candidates, sample(round(p/4):round(p/2), 1))

-- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] R and clinical studies
[EMAIL PROTECTED] wrote: Thank you to all those who responded to Delphine's original post on R and clinical studies. They have provided much food for thought. I had a couple of follow-up questions/comments. Andrew is very correct in pointing out that there are classes and workshops available for R. It's my understanding that there are even commercial versions of R that now provide formal commercial-style courses. And at any rate, the money saved by potentially avoiding pricey software could certainly justify any training expense in time or money - this assumes of course that the pricey software could be dispensed with (I suspect that would take considerable time at my current company, as so many legacy projects have been done in proprietary software). I still think that R provides less 'hand-holding' and requires more initiative (which may be more or less present on a per programmer/statistician basis). I guess one could always integrate R/S-Plus with SAS, as Terry's group has done at Mayo - I will probably do this at least as a start. I have a few concerns with regard to this approach (these may be needless concerns, but I will venture to express them anyway). First, I'm worried about the possibility of compatibility concerns (will anyone be worried about a SAS dataset read into R or vice versa?). Second, I would prefer focusing all my learning on one package if possible. I actually have more experience with SAS (as do others in my group), and if the switch to R is to be made I would like to make that switch as complete as possible. This would also avoid requiring new hires to know both languages. Third, if SAS is to be kept around, it defeats one of the main advantages of having open-source code in the first place (R is wonderfully free!). Like Mayo, Baylor Health (my previous employer) used both S-Plus and SAS. I was warned that data manipulation would be much more difficult in R/S-Plus than it was in SAS. To be honest, and I say this humbly, realizing that most posters to this list have much more experience than I, I haven't found data manipulation to be that much more difficult in R/S-Plus (at least as I have gained experience in R/S-Plus). I can think of two exceptions: (1) large datasets and (2) SAS seems to play nicer with MS products (e.g. PROC IMPORT seemed to read in messy Excel spreadsheets better than importData in S-Plus). Is it possible (and I again say this with MUCH humility) that the perceived advantages of SAS with regard to data manipulation may be due in part to some users only using R/S-Plus for statistical modeling and graphics (thus never becoming familiar with the data-manipulation capabilities of R/S-Plus), or to the reluctance of SAS-trained individuals and companies to make the complete switch?

You are exactly correct on this point. Some graduate programs only teach students how to use R/S-Plus for modeling and graphics. R/S-Plus is wonderful for data manipulation - more powerful than SAS, but not easy to learn (plus in R there are sometimes too many ways to do something; new users get lost - e.g., the reshape and reShape functions and the reshape package). http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf has many examples of complex data manipulation, as do some web sites. We do analyses for pharmaceutical companies with 100% of the data manipulation done in R, after importing say 50 SAS datasets into R. Tasks such as finding the lab value measured closest in time to some event are much more elegant in R/S-Plus than in SAS.
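[A hedged sketch of the kind of task Frank mentions: for each subject's event, find the lab value measured closest in time. The data layout and column names are invented for illustration.]

set.seed(9)
labs   <- data.frame(id    = rep(1:3, each = 5),
                     day   = c(replicate(3, sort(sample(1:30, 5)))),
                     value = round(rnorm(15, 100, 10)))
events <- data.frame(id = 1:3, eday = c(12, 5, 22))

closest <- function(i, eday) {
  z <- labs[labs$id == i, ]
  z$value[which.min(abs(z$day - eday))]   # lab value nearest the event day
}
events$lab <- mapply(closest, events$id, events$eday)
events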
Frank

Tony, the story about the famous software and the certain operating system at the large company was priceless. In closing, I should mention that in all posts I am speaking for myself and not for Edwards LifeSciences. Regards, -Cody -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University
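[To make the closest-in-time task Frank mentions concrete: a minimal base-R sketch, with the data frames events and labs and all column names invented for illustration (and assuming every subject has at least one lab value):

# for each event, pick the lab value measured closest in time to the event
events$lab.closest <- sapply(seq_len(nrow(events)), function(i) {
  cand <- labs[labs$id == events$id[i], ]                  # this subject's labs
  cand$value[which.min(abs(cand$time - events$time[i]))]   # smallest time gap
})
]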
Re: [R] R and clinical studies
Delphine Fontaine wrote: Thanks for your answer, which was very helpful. I have another question: I have read in this document (http://cran.r-project.org/doc/manuals/R-intro.pdf) that most of the programs written in R are ephemeral and that new releases are not always compatible with previous releases. What I would like to know is whether R functions are already validated and, if not, what we should do to validate an R function. In the sense in which most persons use the term 'validate', it means to show with one or more datasets that the function is capable of producing the right answer. It doesn't mean that it produces the right answer for every dataset, although we hope it does. [As an aside, most errors are in the data manipulation phase, not in the analysis phase.] So I think that instead of validating functions we should spend more effort on validating analyses [and validating analysis file derivation]. Pivotal analyses can be re-done a variety of ways, in R or in separate programmable packages such as Stata. -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University
Re: [R] ols Error : missing value where TRUE/FALSE needed
Charles Evans wrote: I have installed Hmisc and Design. When I use ols, I get the following error message: Error in if (!length(fname) || !any(fname == zname)) { : missing value where TRUE/FALSE needed The model that I am running is:

ecools <- ols(eco$exp ~ eco$age + eco$own + eco$inc + inc2, x=TRUE)

Use instead:

ecools <- ols(exp ~ age + own + inc + inc2, data=eco, x=TRUE)

Watch out for variables named exp, but probably OK. Frank Harrell I have tried several other combinations of arguments that take TRUE/FALSE values, but no luck. Ultimately, I am trying to calculate robust standard errors. Any help would be appreciated. Charles Evans Executive Director Free Curricula Center -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University
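[Since the stated goal was robust standard errors, note that Design provides robcov() (Huber-White sandwich estimates) for fits made with both x=TRUE and y=TRUE stored. A sketch building on the corrected call above:

library(Design)
f <- ols(exp ~ age + own + inc + inc2, data=eco, x=TRUE, y=TRUE)  # y=TRUE also required
f.robust <- robcov(f)  # replaces the covariance matrix with the sandwich estimate
]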
Re: [R] R and SAS proc format
Ulrike Grömping wrote: The down side to R's factor solution: the numerical values of factors are always 1 to the number of levels. Thus, it can be tough and requires great care to work with studies that have both numerical values different from this and value labels. This situation is currently not well-supported by R. Regards, Ulrike P.S.: I fully agree with Frank regarding the annoyance one sometimes encounters with formats in SAS! You can add an attribute to a variable. In the sas.get function in the Hmisc package, for example, when importing SAS variables that have PROC FORMAT value labels, an attribute 'sas.codes' keeps the original codes; these can be retrieved using sas.codes(variable name). This could be done outside the SAS import context also. Frank -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University Frank, are these attributes preserved when merging or subsetting a data frame? Are they used in R packages other than Hmisc and Design (e.g. in a simple table request)? No; one would need to add functions like those used by the Hmisc label or impute functions. And they are not used outside Hmisc/Design. In fact I have little need for them, as I always find the final labels to be the key to analysis. If this is the case, my wishlist items 8658 and 8659 (http://bugs.r-project.org/cgi-bin/R/wishlist?id=8658;user=guest, http://bugs.r-project.org/cgi-bin/R/wishlist?id=8659;user=guest) can be closed. Otherwise, I maintain the opinion that there are workarounds but that R is not satisfactorily able to handle this type of data. R gives the framework for doing this elegantly, but the user has the overhead of implementing new methods for such attributes. Cheers Frank Regards, Ulrike -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University
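[Outside the SAS import context, the same idea is a plain attribute. A minimal sketch with invented codes (the attribute name orig.codes is arbitrary, unlike Hmisc's sas.codes):

x <- factor(c(2, 9, 2, 9), levels=c(2, 9), labels=c('male', 'female'))
attr(x, 'orig.codes') <- c(2, 9)   # keep the study's numeric codes alongside the labels
attr(x, 'orig.codes')              # retrieve them later

As the exchange above notes, such attributes are silently dropped by subsetting and merging unless methods are written to preserve them.]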
Re: [R] R and SAS proc format
lamack lamack wrote: Dear all, Is there an R equivalent to SAS's proc format? Best regards J. Lamack Fortunately not. SAS is one of the few large systems that does not implicitly support value labels and that separates label information from the database [I can't count the number of times someone has sent me a SAS dataset and forgotten to send the PROC FORMAT value labels]. See the factor function for information about how R does this. Frank Harrell -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University
Re: [R] R and SAS proc format
Ulrike Grömping wrote: The down side to R's factor solution: the numerical values of factors are always 1 to the number of levels. Thus, it can be tough and requires great care to work with studies that have both numerical values different from this and value labels. This situation is currently not well-supported by R. You can add an attribute to a variable. In the sas.get function in the Hmisc package, for example, when importing SAS variables that have PROC FORMAT value labels, an attribute 'sas.codes' keeps the original codes; these can be retrieved using sas.codes(variable name). This could be done outside the SAS import context also. Frank Regards, Ulrike P.S.: I fully agree with Frank regarding the annoyance one sometimes encounters with formats in SAS! lamack lamack wrote: Dear all, Is there an R equivalent to SAS's proc format? Best regards J. Lamack -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University
Re: [R] bootstrap
Indermaur Lukas wrote: Hi, I would like to evaluate the frequency of the variables within the best model selected by AIC among a set of 12 competing models (I fit them with GLM) with a bootstrap procedure, to get unbiased results. So I would like to do the ranking of the 12-model set 10'000 times separately and calculate the frequency of variables of the 10'000 best-ranked models. I wrote a script doing the model ranking (with some difficulty) for the pre-defined 12-model set; that works. My model results (one row per model) are stored in a dataframe called mydataframe. How can I implement my script simply into a bootstrap and sum up the frequencies of variables of the 10'000 bootstrapped results? I am not so good in R and would appreciate any hint. Regards Lukas I'm not entirely clear on your goals, but the bootstrap selection frequency is mainly a replay of the significance of variables in the full model. I don't see that it provides new information. And selection frequency is marred by collinearity. Frank °°° Lukas Indermaur, PhD student eawag / Swiss Federal Institute of Aquatic Science and Technology ECO - Department of Aquatic Ecology Überlandstrasse 133 CH-8600 Dübendorf Switzerland Phone: +41 (0) 71 220 38 25 Fax: +41 (0) 44 823 53 15 Email: [EMAIL PROTECTED] www.lukasindermaur.ch -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University
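[For the mechanics regardless, a minimal sketch of the bootstrap loop, with mydata (the raw data, not the poster's mydataframe of results), a response y, and a short stand-in list replacing the 12 candidate formulas; the GLM family is left at the default, and Frank's caveat about interpreting the frequencies applies:

forms <- list(y ~ x1 + x2, y ~ x1 + x3, y ~ x2 + x3)   # placeholder candidate set
B <- 10000
win <- unlist(lapply(seq_len(B), function(b) {
  d <- mydata[sample(nrow(mydata), replace=TRUE), ]       # bootstrap sample
  aics <- sapply(forms, function(f) AIC(glm(f, data=d)))  # rank the candidates
  attr(terms(forms[[which.min(aics)]]), 'term.labels')    # variables in the winner
}))
table(win) / B   # bootstrap selection frequency of each variable
]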
Re: [R] fitting of all possible models
Indermaur Lukas wrote: Hi, Fitting all possible models (GLM) with 10 predictors will result in loads of (2^10 - 1) models. I want to do that in order to get the importance of variables (having an unbalanced variable design) by summing up the AIC-weights of models including the same variable, for every variable separately. It's time-consuming and annoying to define all possible models by hand. Is there a command or easy solution to let R define the set of all possible models itself? I defined models in the following way to process them with a batch job:

# e.g. model 1
preference <- formula(Y ~ Lwd + N + Sex + YY)
# e.g. model 2
preference_heterogeneity <- formula(Y ~ Ri + Lwd + N + Sex + YY)

etc. etc. I appreciate any hint. Cheers Lukas If you choose the model from among the 2^10 - 1 having the best AIC, that model will be badly biased. Why look at so many? Pre-specification of models, or fitting full models with penalization, or using data reduction (masked to Y) may work better. Frank °°° Lukas Indermaur, PhD student eawag / Swiss Federal Institute of Aquatic Science and Technology ECO - Department of Aquatic Ecology Überlandstrasse 133 CH-8600 Dübendorf Switzerland Phone: +41 (0) 71 220 38 25 Fax: +41 (0) 44 823 53 15 Email: [EMAIL PROTECTED] www.lukasindermaur.ch -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University
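[On the narrow programming question: base R can enumerate the candidate set with combn() and reformulate(). A minimal sketch assuming ten predictor names in preds and a response Y (the names are placeholders; Frank's warning about what to do with the resulting 1023 fits still applies):

preds <- paste('x', 1:10, sep='')   # placeholder predictor names
all.models <- unlist(lapply(seq_along(preds), function(k)
  combn(preds, k, reformulate, simplify=FALSE, response='Y')),
  recursive=FALSE)
length(all.models)   # 2^10 - 1 = 1023 formulas
]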
Re: [R] rpart minimum sample size
Amy Uhrin wrote: Is there an optimal / minimum sample size for attempting to construct a classification tree using /rpart/? I have 27 seagrass disturbance sites (boat groundings) that have been monitored for a number of years. The monitoring protocol for each site is identical. From the monitoring data, I am able to determine the level of recovery that each site has experienced. Recovery is our categorical dependent variable with values of none, low, medium, high which are based upon percent seagrass regrowth into the injury over time. I wish to be able to predict the level of recovery of future vessel grounding sites based upon a number of categorical / continuous predictor variables used here including (but not limited to) such parameters as: sediment grain size, wave exposure, original size (volume) of the injury, injury age, injury location. When I run /rpart/, the data is split into only two terminal nodes based solely upon values of the original volume of each injury. No other predictor variables are considered, even though I have included about six of them in the model. When I remove volume from the model the same thing happens but with injury area - two terminal nodes are formed based upon area values and no other variables appear. I was hoping that this was a programming issue, me being a newbie and all, but I really think I've got the code right. Now I am beginning to wonder if my N is too small for this method? In my experience N needs to be around 20,000 to get both good accuracy and replicability of patterns if the number of potential predictors is not tiny. In general, the R^2 from rpart is not competitive with that from an intelligently fitted regression model. It's just a difficult problem, when relying on a single tree (hence the popularity of random forests, bagging, boosting). Frank -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University
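[One mechanical point worth checking before concluding N is too small: rpart's default stopping rules (minsplit=20, cp=0.01) are severe for 27 observations, so the tree may simply be refusing further splits. A hedged sketch with invented variable and data frame names:

library(rpart)
fit <- rpart(recovery ~ grain.size + exposure + volume + age + location,
             data=sites, method='class',
             control=rpart.control(minsplit=5, cp=0.001))  # loosen the defaults
printcp(fit)   # then prune back using the cross-validated error

With N = 27, though, the cross-validated error will be extremely noisy, which is Frank's point in another form.]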
Re: [R] fitting of all possible models
Indermaur Lukas wrote: Hi Frank I fitted a set of 12 candidate models and evaluated the importance of variables based on model-averaged coefficients and SE (model weights = 0.9). Variables in my models were not distributed in equal numbers across all models, and thus I was not able to evaluate the importance of variables just by summing up the AIC-weights of models including a specific variable. Now, why so many models to fit: I was curious whether the ranking of the importance of variables is similar when just summing up the AIC-weights over an all-possible-models set and looking at the ordered model-averaged coefficients (order of CV = SE/coefficient). Any hint for me? Cheers Lukas I have seen the literature on Bayesian model averaging, which uses weights from Bayes factors, related to BIC, but not the approach you are using. Frank Indermaur Lukas wrote: Hi, Fitting all possible models (GLM) with 10 predictors will result in loads of (2^10 - 1) models. I want to do that in order to get the importance of variables (having an unbalanced variable design) by summing up the AIC-weights of models including the same variable, for every variable separately. It's time-consuming and annoying to define all possible models by hand. Is there a command or easy solution to let R define the set of all possible models itself? I defined models in the following way to process them with a batch job:

# e.g. model 1
preference <- formula(Y ~ Lwd + N + Sex + YY)
# e.g. model 2
preference_heterogeneity <- formula(Y ~ Ri + Lwd + N + Sex + YY)

etc. etc. I appreciate any hint. Cheers Lukas If you choose the model from among the 2^10 - 1 having the best AIC, that model will be badly biased. Why look at so many? Pre-specification of models, or fitting full models with penalization, or using data reduction (masked to Y) may work better. Frank °°° Lukas Indermaur, PhD student eawag / Swiss Federal Institute of Aquatic Science and Technology ECO - Department of Aquatic Ecology Überlandstrasse 133 CH-8600 Dübendorf Switzerland Phone: +41 (0) 71 220 38 25 Fax: +41 (0) 44 823 53 15 Email: [EMAIL PROTECTED] www.lukasindermaur.ch
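[For reference, the AIC-weight computation itself is short. A sketch given a vector aics of AIC values for the candidate models (this is the standard Akaike-weight form; using it over an all-subsets scan is the part Frank is questioning):

delta <- aics - min(aics)   # AIC differences from the best model
w <- exp(-delta/2)
w <- w / sum(w)             # Akaike weights; they sum to 1
# a variable's 'importance' is then sum(w[models containing that variable])
]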
Re: [R] fitting of all possible models
Bert Gunter wrote: ... Below -- Bert Bert Gunter Genentech Nonclinical Statistics South San Francisco, CA 94404 650-467-7374 -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Frank E Harrell Jr Sent: Tuesday, February 27, 2007 5:14 AM To: Indermaur Lukas Cc: r-help@stat.math.ethz.ch Subject: Re: [R] fitting of all possible models Indermaur Lukas wrote: Hi, Fitting all possible models (GLM) with 10 predictors will result in loads of (2^10 - 1) models. I want to do that in order to get the importance of variables (having an unbalanced variable design) by summing up the AIC-weights of models including the same variable, for every variable separately. It's time-consuming and annoying to define all possible models by hand. Is there a command or easy solution to let R define the set of all possible models itself? I defined models in the following way to process them with a batch job:

# e.g. model 1
preference <- formula(Y ~ Lwd + N + Sex + YY)
# e.g. model 2
preference_heterogeneity <- formula(Y ~ Ri + Lwd + N + Sex + YY)

etc. etc. I appreciate any hint. Cheers Lukas If you choose the model from among the 2^10 - 1 having the best AIC, that model will be badly biased. Why look at so many? Pre-specification of models, or fitting full models with penalization, --- ...the rub being how much to penalize. My impression from what I've read is that, for prediction, something close to 'the more, the better' holds for the penalty... . Nature rewards parsimony. Cheers, Bert Bert, In my experience nature rewards complexity, if done right. See Savage's antiparsimony principle. -Frank @Article{gre00whe, author = {Greenland, Sander}, title = {When should epidemiologic regressions use random coefficients?}, journal = Biometrics, year = 2000, volume = 56, pages = {915-921}, annote = {Bayesian methods; causal inference; empirical Bayes estimators; epidemiologic method; hierarchical regression; mixed models; multilevel modeling; random-coefficient regression; shrinkage; variance components; use of statistics in epidemiology is largely primitive; stepwise variable selection on confounders leaves important confounders uncontrolled; composition matrix; example with far too many significant predictors with many regression coefficients absurdly inflated when overfit; lack of evidence for dietary effects mediated through constituents; shrinkage instead of variable selection; larger effect on confidence interval width than on point estimates with variable selection; uncertainty about variance of random effects is just uncertainty about prior opinion; estimation of variance is pointless; instead the analysis should be repeated using different values; if one feels compelled to estimate $\tau^2$, I would recommend giving it a proper prior concentrated around contextually reasonable values; claim about ordinary MLE being unbiased is misleading because it assumes the model is correct and is the only model entertained; shrinkage towards compositional model; models need to be complex to capture uncertainty about the relations...an honest uncertainty assessment requires parameters for all effects that we know may be present. This advice is implicit in an antiparsimony principle often attributed to L. J. Savage 'All models should be as big as an elephant (see Draper, 1995)'.
See also gus06per.} } Frank °°° Lukas Indermaur, PhD student eawag / Swiss Federal Institute of Aquatic Science and Technology ECO - Department of Aquatic Ecology Überlandstrasse 133 CH-8600 Dübendorf Switzerland Phone: +41 (0) 71 220 38 25 Fax: +41 (0) 44 823 53 15 Email: [EMAIL PROTECTED] www.lukasindermaur.ch -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University
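[On Bert's 'how much to penalize' point, Frank's Design package automates the choice: pentrace() scans a grid of penalties for a model fit by lrm or ols and reports the value optimizing a corrected AIC. A minimal sketch with hypothetical variables (the update() refit idiom is from memory and worth checking against the pentrace help page):

library(Design)
f <- lrm(y ~ x1 + x2 + x3, data=mydata, x=TRUE, y=TRUE)  # full, unpenalized fit
p <- pentrace(f, penalty=seq(0, 20, by=0.5))             # grid search over penalties
f.pen <- update(f, penalty=p$penalty)                    # refit with the chosen penalty
]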
Re: [R] bootcov and cph error
Williams Scott wrote: Hi all, I am trying to get bootstrap resampled estimates of covariates in a Cox model using cph (Design library). Using the following I get the error:

ddist2.abr <- datadist(data2.abr)
options(datadist='ddist2.abr')
cph1.abr <- cph(Surv(strt3.abr,loc3.abr) ~ cov.a.abr + cov.b.abr, data=data2.abr, x=T, y=T)
boot.cph1 <- bootcov(cph1.abr, B=100, coef.reps=TRUE, pr=T)
1
Error in oosl(f, matxv(X, cof), Y) : not implemented for cph models

Removing the coef.reps argument works fine, but I really need the coefficients if at all possible. I can't find anything in the help files suggesting that I can't use coef.reps in a cph model. Any help appreciated. Cheers Scott Sorry it's taken so long to get to this. The documentation needs to be clarified. Add loglik=FALSE to allow coef.reps=TRUE to work for cph models. Frank _ Dr. Scott Williams MD Peter MacCallum Cancer Centre Melbourne, Australia [EMAIL PROTECTED] -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University
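[Concretely, the working call after Frank's fix would be (same objects as above; the boot.Coef component name is from memory and worth verifying in the bootcov help file):

boot.cph1 <- bootcov(cph1.abr, B=100, coef.reps=TRUE, loglik=FALSE, pr=T)
b <- boot.cph1$boot.Coef   # matrix of the B resampled coefficient vectors
]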
Re: [R] Double-banger function names: preferences and suggestions
hadley wickham wrote: What do you prefer/recommend for double-banger function names:

1. scale.colour
2. scale_colour
3. scaleColour

1 is more R-like, but conflicts with S3. 2 is a modern version of number 1, but not many packages use it. Number 3 is more Java-like. (I like number 2 best.) Any suggestions? Thanks, Hadley Personal preference is for 3. I wish I had used that throughout the Hmisc and Design packages for non-generics. Frank -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University