Re: [R] Green and Byar (1980) Prostate Cancer Data set from Andrews and Herzberg - Data

Frank E Harrell Jr Wed, 25 Mar 2009 06:05:21 -0700

Ravi Varadhan wrote:

Fine detective work, David. Now, you can see the reasons for my frustration - multiplicity of data sets combined with non-existent documentation of the source of data in journal articles (e.g. Kay 1986; Lunn and McNeil 1995).
Best,
Ravi.

Yes, that is a big frustration for me, even for projects for which I was the principal statistician in 1990 for which I did a poor job of archiving excellent medical datasets for future use. This is a big advertisement for the reproducible research movement.

David - fantastic job. Based on what you found, the version on our website looks as good as any. Now if someone can explain to me why you see a spike near a serum prostate acid phosphatase (AP) value of 1 when you use a flexible regression model (e.g., restricted cubic spline) to relate AP to the log hazard of death in a survival model (see p. 518 in my book), that would be very helpful.

If you do with(prostate,plot(supsmu(log(ap),1*(status!='alive')))) you see a minimum at ap=2.37 after anti-logging. If you do


dd <- datadist(prostate); options(datadist='dd')
f <- cph(Surv(dtime,status!='alive') ~ rcs(log(ap),6), data=prostate)
plot(f)

you see a sharp minimum at ap=1.43. With 4 knots the min is at 1.18. You have to go to 3 knots to get a monotonic fit in log(ap) but AIC is not as good.
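The knot-count comparison above can be sketched in a self-contained way as follows. This is only an illustration under loud assumptions: the data are simulated (not the Byar and Green data), the names d, dead and lap are hypothetical, and survival::coxph with splines::ns is swapped in for cph and rcs. ns(x, df = k - 1) spans a natural (restricted) cubic spline with k knots in total, but its knot placement differs from the rcs defaults, so numbers from this sketch are not comparable to the ones quoted above.

```r
library(survival)  # coxph(), Surv()
library(splines)   # ns(): natural (restricted) cubic spline basis

set.seed(1)
n    <- 500
lap  <- rnorm(n, mean = 0.5, sd = 1)        # SIMULATED log(AP), not real data
time <- rexp(n, rate = 0.02 * exp(0.4 * lap))  # true log hazard is linear in lap
cens <- runif(n, 0, 80)                     # simulated censoring times
d <- data.frame(dtime = pmin(time, cens),
                dead  = as.integer(time <= cens),
                lap   = lap)

knots <- 3:6
# One Cox model per knot count; k total knots correspond to df = k - 1
fits <- lapply(knots, function(k)
  coxph(Surv(dtime, dead) ~ ns(lap, df = k - 1), data = d))

aic <- sapply(fits, AIC)   # smaller is better
names(aic) <- paste0("knots=", knots)

# Locate the minimum of each fitted log-hazard curve on a grid of log(AP),
# then anti-log to report it on the AP scale
grid <- data.frame(lap = seq(quantile(d$lap, 0.05),
                             quantile(d$lap, 0.95), length.out = 200))
ap_at_min <- sapply(fits, function(f) {
  lp <- predict(f, newdata = grid, type = "lp")
  exp(grid$lap[which.min(lp)])
})
names(ap_at_min) <- names(aic)

print(aic)
print(ap_at_min)
```

With the real data one would replace d by the prostate data frame and lap by log(ap); whether the interior minimum persists across knot counts is the question being asked above.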


Frank

____________________________________________________________________

Ravi Varadhan, Ph.D.
Assistant Professor,
Division of Geriatric Medicine and Gerontology
School of Medicine
Johns Hopkins University

Ph. (410) 502-2619
email: rvarad...@jhmi.edu


----- Original Message -----
From: David Winsemius <dwinsem...@comcast.net>
Date: Tuesday, March 24, 2009 10:54 pm
Subject: Re: [R] Green and Byar (1980) Prostate Cancer Data set from Andrews and Herzberg - Data
To: Rolf Turner <r.tur...@auckland.ac.nz>
Cc: R-help Forum <r-help@r-project.org>, Ravi Varadhan <rvarad...@jhmi.edu>
 On Mar 24, 2009, at 8:57 PM, Rolf Turner wrote:
>
 > On 25/03/2009, at 12:09 PM, Frank E Harrell Jr wrote:
 >
 >   <snip>
 >
 >>> (2) Scrolling down to ``Byar and Green prostate cancer data''
 >>> appeared to get me to the right place. But I couldn't see any
 >>> signs of any ``R binary files''.
 >>
 >> Please look again.  It's under the heading "R".  Unfortunately I used
 >> .sav suffix for save() files in the old days.
 >
 >   Ah-ha.  Oh me of little faith.  I have been hanging around (in
 >   my current work environment) with too many SPSS users, and the
 >   *.sav extension seems to be the standard for SPSS data files.
 >   Whence my corrupted thinking.
 >
 >> The .xls file opened with no problem in OpenOffice; has 506 rows.
 >
 >   Hmmm.  When I opened it with Excel on the Mac I got a spread
 >   sheet with 503 rows --- the first row being the column names,
 >   so there were really 502 rows.
The last "patnr" is "506" but there are only 502 lines of data. 471,
 473, 475 and 488 are missing.
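
A quick way to run this kind of check in R is sketched below. The vector patnr is rebuilt here from the numbers quoted in this thread purely for illustration; with a real copy of the file it would be the patient-number column instead.

```r
# Patient numbers as described above: 1..506 with four absent (illustrative)
patnr <- setdiff(1:506, c(471, 473, 475, 488))

n_rows <- length(patnr)                        # number of data rows
absent <- setdiff(seq_len(max(patnr)), patnr)  # gaps in the numbering

print(n_rows)
print(absent)
```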
And the CMU Statlib version for 2002 looks the same.
The version at this site is missing more than 25 cases:
Here are two other copies of the dataset, the first of which appears to
have those missing cases:
 This one has patient numbers:
This one has a description of the fields and cites the one above but
has not retained the patient numbers and has apparently only kept the
475 cases with complete data.

David Winsemius, MD
 Heritage Laboratories
 West Hartford, CT
______________________________________________
 R-help@r-project.org mailing list
PLEASE do read the posting guide and provide commented, minimal, self-contained, reproducible code.


--
Frank E Harrell Jr   Professor and Chair           School of Medicine
                     Department of Biostatistics   Vanderbilt University

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
