Re: [R] Green and Byar (1980) Prostate Cancer Data set from Andrews and Herzberg - Data

2009-03-25 Thread Frank E Harrell Jr

Ravi Varadhan wrote:
Fine detective work, David.  Now, you can see the reasons for my frustration - multiplicity of data sets combined with non-existent documentation of the source of data in journal articles (e.g. Kay 1986; Lunn and McNeil 1995).


Best,
Ravi.


Yes, that is a big frustration for me too, even for projects on which I 
was the principal statistician in 1990 and did a poor job of 
archiving excellent medical datasets for future use.  This is a big 
advertisement for the reproducible research movement.


David - fantastic job.  Based on what you found, the version on our web 
site looks as good as any.  Now if someone can explain to me why you see 
a spike near a serum prostate acid phosphatase (AP) value of 1 when you 
use a flexible regression model (e.g., restricted cubic spline) to 
relate AP to the log hazard of death in a survival model (see p. 518 in 
my book), that would be very helpful.


If you do with(prostate, plot(supsmu(log(ap), 1*(status!='alive')))) you 
see a minimum at ap=2.37 after anti-logging.  If you do


dd <- datadist(prostate); options(datadist='dd')
f <- cph(Surv(dtime,status!='alive') ~ rcs(log(ap),6), data=prostate)
plot(f)

you see a sharp minimum at ap=1.43.  With 4 knots the min is at 1.18. 
You have to go to 3 knots to get a monotonic fit in log(ap) but AIC is 
not as good.
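
The knot comparison described above can be scripted. What follows is only a sketch under stated assumptions: it runs on simulated data (the variable names `ap`, `dtime`, `dead` and the U-shaped effect are invented for illustration and are not the prostate data), and it swaps in `splines::ns()` with `survival::coxph()` in place of `rcs()` and `cph()` so that it needs only R's recommended packages. It fits natural cubic splines in log(ap) with 3 to 6 knots and tabulates AIC for each fit.

```r
library(survival)
library(splines)

set.seed(1)
n  <- 500
ap <- exp(rnorm(n, mean = 0.5, sd = 1))      # simulated, strictly positive
lp <- 0.3 * (log(ap) - 0.5)^2                # assumed nonmonotonic log hazard
t_event <- rexp(n, rate = 0.02 * exp(lp))
t_cens  <- runif(n, 0, 80)
d <- data.frame(ap    = ap,
                dtime = pmin(t_event, t_cens),
                dead  = as.integer(t_event <= t_cens))

# A natural cubic spline with k knots (counting the two boundary knots)
# has k - 1 regression coefficients, hence df = k - 1.
knots <- 3:6
fits <- lapply(knots, function(k)
  coxph(Surv(dtime, dead) ~ ns(log(ap), df = k - 1), data = d))

aic_table <- data.frame(knots = knots,
                        df    = sapply(fits, function(f) length(coef(f))),
                        AIC   = sapply(fits, AIC))
print(aic_table)
```

On real data the same loop could be pointed at the prostate data frame with the event indicator `status != 'alive'`. Note that `ns()` places its knots differently from `rcs()`, so the location of any fitted minimum need not match the values quoted above.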


Frank





Ravi Varadhan, Ph.D.
Assistant Professor,
Division of Geriatric Medicine and Gerontology
School of Medicine
Johns Hopkins University

Ph. (410) 502-2619
email: rvarad...@jhmi.edu


- Original Message -
From: David Winsemius 
Date: Tuesday, March 24, 2009 10:54 pm
Subject: Re: [R] Green and Byar (1980) Prostate Cancer Data set from Andrews 
and Herzberg - Data
To: Rolf Turner 
Cc: R-help Forum , Ravi Varadhan 



 On Mar 24, 2009, at 8:57 PM, Rolf Turner wrote:
 
 >

 > On 25/03/2009, at 12:09 PM, Frank E Harrell Jr wrote:
 >
 >   
 >
 >>> (2) Scrolling down to ``Byar and Green prostate cancer data''
 >>> appeared to get
 >>> me to the right place.  But I couldn't see any signs of any ``R
 >>> binary files''.
 >>
 >> Please look again.  It's under the heading "R".  Unfortunately I used
 >> .sav suffix for save() files in the old days.
 >
 >   Ah-ha.  Oh me of little faith.  I have been hanging around (in
 >   my current work environment) with too many SPSS users, and the
 >   *.sav extension seems to be the standard for SPSS data files.
 >   Whence my corrupted thinking.
 >
 >> The .xls file opened with no problem in OpenOffice; has 506 rows.
 >
 >   Hmmm.  When I opened it with Excel on the Mac I got a spread
 >   sheet with 503 rows --- the first row being the column names,
 >   so there were really 502 rows.
 
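 The last "patnr" is "506" but there are only 502 lines of data. 471,
 473, 475 and 488 are missing.
 
 And the CMU Statlib version for 2002 looks the same.
 http://lib.stat.cmu.edu/S/Harrell/data/descriptions/prostate.html
 
 The version at this site is missing more than 25 cases:
 http://www.imbi.uni-freiburg.de/biom/Royston-Sauerbrei-book/
 
 Here are two other copies of the dataset the first of which appears to
 have those missing cases:
 
 This one has patient numbers:
 http://lib.stat.cmu.edu/datasets/Andrews/T46.1
 
 This one has a description of the fields and cites the one above but
 has not retained the patient numbers and has apparently only kept the
 475 cases with complete data.
 http://www.stats.waikato.ac.nz/Staff/maj/multimix/cancerdesc.txt
 http://www.stats.waikato.ac.nz/Staff/maj/multimix/cancer%20data.txt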
 
 
 
 >
 
 David Winsemius, MD

 Heritage Laboratories
 West Hartford, CT
 
 __

 R-help@r-project.org mailing list
 
 PLEASE do read the posting guide 
 and provide commented, minimal, self-contained, reproducible code.




--
Frank E Harrell Jr   Professor and Chair   School of Medicine
 Department of Biostatistics   Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Green and Byar (1980) Prostate Cancer Data set from Andrews and Herzberg - Data

2009-03-25 Thread Frank E Harrell Jr

Rolf Turner wrote:


On 25/03/2009, at 12:09 PM, Frank E Harrell Jr wrote:




(2) Scrolling down to ``Byar and Green prostate cancer data'' appeared
to get
me to the right place.  But I couldn't see any signs of any ``R binary
files''.


Please look again.  It's under the heading "R".  Unfortunately I used
.sav suffix for save() files in the old days.


Ah-ha.  Oh me of little faith.  I have been hanging around (in
my current work environment) with too many SPSS users, and the
*.sav extension seems to be the standard for SPSS data files.
Whence my corrupted thinking.


It definitely is a standard for SPSS, that's why I regret ever using 
that suffix.





The .xls file opened with no problem in OpenOffice; has 506 rows.


Hmmm.  When I opened it with Excel on the Mac I got a spread
sheet with 503 rows --- the first row being the column names,
so there were really 502 rows.

And 502 rows was what I got when I saved the *.xls file as a
*.csv file and then read that in.

Also, when I followed Phil Spector's excellent advice and
loaded prostate.sav from the website, using load(), I ***again***
got a data frame of 502 rows.  This data frame is (modulo some
classes and attributes) identical with what I got from reading
from the *.csv file.


Sorry about that - I was looking at patient numbers.  I do get 502 rows 
either with load()'ing the binary data frame or opening the spreadsheet.




Where have the other four rows gone?  Ravi Varadhan also observed
this phenomenon.

cheers,

Rolf

##
Attention: This e-mail message is privileged and confidential. If you are 
not the intended recipient please delete the message and notify the 
sender. Any views or opinions presented are solely those of the author.


This e-mail has been scanned and cleared by 
MailMarshal www.marshalsoftware.com

##




--
Frank E Harrell Jr   Professor and Chair   School of Medicine
 Department of Biostatistics   Vanderbilt University



Re: [R] Green and Byar (1980) Prostate Cancer Data set from Andrews and Herzberg - Data

2009-03-25 Thread David Winsemius
One further version:  this one with a header and with NA's replacing  
the -'s that apparently has not deleted any cases with missing data:

http://www.stat.auckland.ac.nz/~wild/764/s764data/prostatic.tab
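
For the copies that still code missing values as "-", the conversion to NA can be done at read time. This is a minimal sketch, assuming a whitespace-delimited file with a header row; the three-row excerpt, its column names, and its values are invented for illustration and are not taken from any of the files above.

```r
# Hypothetical excerpt: column names and values are made up.
raw <- "patno age wt ap
1 75 76 0.3
2 - 116 0.3
3 73 - 0.5"

# na.strings = "-" turns the dash placeholders into NA, so the columns
# are read as numbers rather than as character strings.
d <- read.table(text = raw, header = TRUE, na.strings = "-")
str(d)

colSums(is.na(d))        # missing values per column
sum(complete.cases(d))   # rows with no missing values
```

The `complete.cases()` count is the figure to compare against the 475 complete cases reported for the multimix copy.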

--
David Winsemius
On Mar 24, 2009, at 11:51 PM, Ravi Varadhan wrote:

Fine detective work, David.  Now, you can see the reasons for my  
frustration - multiplicity of data sets combined with non-existent  
documentation of the source of data in journal articles (e.g. Kay  
1986; Lunn and McNeil 1995).


Best,
Ravi.



Ravi Varadhan, Ph.D.



On Mar 24, 2009, at 8:57 PM, Rolf Turner wrote:



On 25/03/2009, at 12:09 PM, Frank E Harrell Jr wrote:




(2) Scrolling down to ``Byar and Green prostate cancer data''
appeared to get
me to the right place.  But I couldn't see any signs of any ``R
binary files''.


Please look again.  It's under the heading "R".  Unfortunately I
used .sav suffix for save() files in the old days.


Ah-ha.  Oh me of little faith.  I have been hanging around (in
my current work environment) with too many SPSS users, and the
*.sav extension seems to be the standard for SPSS data files.
Whence my corrupted thinking.


The .xls file opened with no problem in OpenOffice; has 506 rows.


Hmmm.  When I opened it with Excel on the Mac I got a spread
sheet with 503 rows --- the first row being the column names,
so there were really 502 rows.


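The last "patnr" is "506" but there are only 502 lines of data. 471,
473, 475 and 488 are missing.

And the CMU Statlib version for 2002 looks the same.
http://lib.stat.cmu.edu/S/Harrell/data/descriptions/prostate.html

The version at this site is missing more than 25 cases:
http://www.imbi.uni-freiburg.de/biom/Royston-Sauerbrei-book/

Here are two other copies of the dataset the first of which appears to
have those missing cases:

This one has patient numbers:
http://lib.stat.cmu.edu/datasets/Andrews/T46.1

This one has a description of the fields and cites the one above but
has not retained the patient numbers and has apparently only kept the
475 cases with complete data.
http://www.stats.waikato.ac.nz/Staff/maj/multimix/cancerdesc.txt
http://www.stats.waikato.ac.nz/Staff/maj/multimix/cancer%20data.txt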




David Winsemius, MD
Heritage Laboratories
West Hartford, CT



Re: [R] Green and Byar (1980) Prostate Cancer Data set from Andrews and Herzberg - Data

2009-03-24 Thread Ravi Varadhan
Fine detective work, David.  Now, you can see the reasons for my frustration - 
multiplicity of data sets combined with non-existent documentation of the 
source of data in journal articles (e.g. Kay 1986; Lunn and McNeil 1995).

Best,
Ravi.



Ravi Varadhan, Ph.D.
Assistant Professor,
Division of Geriatric Medicine and Gerontology
School of Medicine
Johns Hopkins University

Ph. (410) 502-2619
email: rvarad...@jhmi.edu


- Original Message -
From: David Winsemius 
Date: Tuesday, March 24, 2009 10:54 pm
Subject: Re: [R] Green and Byar (1980) Prostate Cancer Data set from Andrews 
and Herzberg - Data
To: Rolf Turner 
Cc: R-help Forum , Ravi Varadhan 


>  On Mar 24, 2009, at 8:57 PM, Rolf Turner wrote:
>  
>  >
>  > On 25/03/2009, at 12:09 PM, Frank E Harrell Jr wrote:
>  >
>  >
>  >
>  >>> (2) Scrolling down to ``Byar and Green prostate cancer data''
>  >>> appeared to get
>  >>> me to the right place.  But I couldn't see any signs of any ``R
>  >>> binary files''.
>  >>
>  >> Please look again.  It's under the heading "R".  Unfortunately I used
>  >> .sav suffix for save() files in the old days.
>  >
>  >Ah-ha.  Oh me of little faith.  I have been hanging around (in
>  >my current work environment) with too many SPSS users, and the
>  >*.sav extension seems to be the standard for SPSS data files.
>  >Whence my corrupted thinking.
>  >
>  >> The .xls file opened with no problem in OpenOffice; has 506 rows.
>  >
>  >Hmmm.  When I opened it with Excel on the Mac I got a spread
>  >sheet with 503 rows --- the first row being the column names,
>  >so there were really 502 rows.
>  
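>  The last "patnr" is "506" but there are only 502 lines of data. 471,
>  473, 475 and 488 are missing.
>  
>  And the CMU Statlib version for 2002 looks the same.
>  http://lib.stat.cmu.edu/S/Harrell/data/descriptions/prostate.html
>  
>  The version at this site is missing more than 25 cases:
>  http://www.imbi.uni-freiburg.de/biom/Royston-Sauerbrei-book/
>  
>  Here are two other copies of the dataset the first of which appears to
>  have those missing cases:
>  
>  This one has patient numbers:
>  http://lib.stat.cmu.edu/datasets/Andrews/T46.1
>  
>  This one has a description of the fields and cites the one above but
>  has not retained the patient numbers and has apparently only kept the
>  475 cases with complete data.
>  http://www.stats.waikato.ac.nz/Staff/maj/multimix/cancerdesc.txt
>  http://www.stats.waikato.ac.nz/Staff/maj/multimix/cancer%20data.txt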
>  
>  
>  
>  >
>  
>  David Winsemius, MD
>  Heritage Laboratories
>  West Hartford, CT
>  


Re: [R] Green and Byar (1980) Prostate Cancer Data set from Andrews and Herzberg - Data

2009-03-24 Thread David Winsemius


On Mar 24, 2009, at 8:57 PM, Rolf Turner wrote:



On 25/03/2009, at 12:09 PM, Frank E Harrell Jr wrote:



(2) Scrolling down to ``Byar and Green prostate cancer data''
appeared to get
me to the right place.  But I couldn't see any signs of any ``R
binary files''.


Please look again.  It's under the heading "R".  Unfortunately I used
.sav suffix for save() files in the old days.


Ah-ha.  Oh me of little faith.  I have been hanging around (in
my current work environment) with too many SPSS users, and the
*.sav extension seems to be the standard for SPSS data files.
Whence my corrupted thinking.


The .xls file opened with no problem in OpenOffice; has 506 rows.


Hmmm.  When I opened it with Excel on the Mac I got a spread
sheet with 503 rows --- the first row being the column names,
so there were really 502 rows.


The last "patnr" is "506" but there are only 502 lines of data. 471,  
473, 475 and 488 are missing.
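
A check of this kind takes a couple of lines in R. The sketch below uses a constructed stand-in for the patient-number column (the vector name `patno` is an assumption; in a loaded data frame it would be the relevant column) so that it runs without downloading anything.

```r
# Stand-in for the patient-number column of the 502-row file:
# 1..506 with the four absent ids removed.
patno <- setdiff(1:506, c(471, 473, 475, 488))

length(patno)                                     # 502 rows of data
missing_ids <- setdiff(seq_len(max(patno)), patno)
missing_ids                                       # 471 473 475 488
```

Comparing `max(patno)` with `length(patno)` is what distinguishes "506 patients" from "502 rows".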


And the CMU Statlib version for 2002 looks the same.
http://lib.stat.cmu.edu/S/Harrell/data/descriptions/prostate.html

The version at this site is missing more than 25 cases:
http://www.imbi.uni-freiburg.de/biom/Royston-Sauerbrei-book/

Here are two other copies of the dataset the first of which appears to  
have those missing cases:

This one has patient numbers:
http://lib.stat.cmu.edu/datasets/Andrews/T46.1

This one has a description of the fields and cites the one above but  
has not retained the patient numbers and has apparently only kept the  
475 cases with complete data.

http://www.stats.waikato.ac.nz/Staff/maj/multimix/cancerdesc.txt
http://www.stats.waikato.ac.nz/Staff/maj/multimix/cancer%20data.txt





David Winsemius, MD
Heritage Laboratories
West Hartford, CT



Re: [R] Green and Byar (1980) Prostate Cancer Data set from Andrews and Herzberg - Data

2009-03-24 Thread Rolf Turner


On 25/03/2009, at 12:09 PM, Frank E Harrell Jr wrote:



(2) Scrolling down to ``Byar and Green prostate cancer data''
appeared to get
me to the right place.  But I couldn't see any signs of any ``R
binary files''.


Please look again.  It's under the heading "R".  Unfortunately I used
.sav suffix for save() files in the old days.


Ah-ha.  Oh me of little faith.  I have been hanging around (in
my current work environment) with too many SPSS users, and the
*.sav extension seems to be the standard for SPSS data files.
Whence my corrupted thinking.


The .xls file opened with no problem in OpenOffice; has 506 rows.


Hmmm.  When I opened it with Excel on the Mac I got a spread
sheet with 503 rows --- the first row being the column names,
so there were really 502 rows.

And 502 rows was what I got when I saved the *.xls file as a
*.csv file and then read that in.

Also, when I followed Phil Spector's excellent advice and
loaded prostate.sav from the website, using load(), I ***again***
got a data frame of 502 rows.  This data frame is (modulo some
classes and attributes) identical with what I got from reading
from the *.csv file.
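
For anyone wanting to repeat this comparison, here is a minimal self-contained sketch. It does not touch the real files: a tiny invented data frame (`orig`, with made-up columns) is written out both with save() under a .sav suffix and as a *.csv, then read back both ways and compared while ignoring classes and attributes.

```r
# Small hypothetical data frame standing in for the prostate data.
orig <- data.frame(patno  = c(1L, 2L, 3L),
                   ap     = c(0.3, 1.0, 2.4),
                   status = c("alive", "dead", "alive"),
                   stringsAsFactors = FALSE)

sav_file <- tempfile(fileext = ".sav")   # save() output, despite the SPSS-like suffix
csv_file <- tempfile(fileext = ".csv")
save(orig, file = sav_file)
write.csv(orig, csv_file, row.names = FALSE)

# load() restores objects under their saved names and returns those names;
# a fresh environment keeps it from overwriting anything in the workspace.
e <- new.env()
loaded_names <- load(sav_file, envir = e)
from_sav <- get(loaded_names[1], envir = e)
from_csv <- read.csv(csv_file, stringsAsFactors = FALSE)

dim(from_sav)
all.equal(from_sav, from_csv, check.attributes = FALSE)
```

Because load() returns the names of the objects it restored, there is no need to guess what the data frame inside the file is called.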

Where have the other four rows gone?  Ravi Varadhan also observed
this phenomenon.

cheers,

Rolf



Re: [R] Green and Byar (1980) Prostate Cancer Data set from Andrews and Herzberg - Data

2009-03-24 Thread Frank E Harrell Jr

Rolf Turner wrote:


On 25/03/2009, at 10:04 AM, Frank E Harrell Jr wrote:


Ravi Varadhan wrote:

Hi,

I am looking for a data set containing the information from a 
randomized trial evaluating the effect of DES (diethylstilbestrol) on 
multiple time-to-event endpoints, prostate cancer, CVD, and other 
causes.  The original source of this data is Green and Byar (1980).  
This is a popular competing risks problem that has subsequently been 
discussed in a number of statistical papers including Kay (1986).


Does anyone have a digital version of this data set?

This data is also presented in Andrews, D. F. and Herzberg, A. M. 
(1985). Data.   Does a digital version of all the data sets in A & H 
exist?


Thanks very much,
Ravi.


An R binary dataset is at http://biostat.mc.vanderbilt.edu/Datasets

Note that there is something strange about the AP variable with a lot of
ties at some value near 1.0.  I have never been able to find any
documentation about this problem.  If you find any please let me know.


Out of idle curiosity I went to have a look at this data set.

I had problems.

(1) The given URL didn't work for me; when I clicked on it, I got an 
error 404.
But if I went to http://biostat.mc.vanderbilt.edu I found a link to 
``Datasets'',

and clicking on that got me to some data sets.


Sorry that should have been DataSets not Datasets.



(2) Scrolling down to ``Byar and Green prostate cancer data'' appeared 
to get
me to the right place.  But I couldn't see any signs of any ``R binary 
files''.


Please look again.  It's under the heading "R".  Unfortunately I used 
.sav suffix for save() files in the old days.


The .xls file opened with no problem in OpenOffice; has 506 rows.

Frank




The available formats appear to be *.sav (SPSS?), *.sdd (???), and *.xls.

(3) I downloaded the prostate.xls file O.K.  But when I tried to read it 
in with
the read.xls() function from the gdata package, I got an error to the 
effect


 > X <- read.xls("prostate.xls")
Converting xls file to csv file... Done.
Reading csv file... Error in read.table(file = file, header = header, 
sep = sep, quote = quote,  :

  no lines available in input

I was able to ``open'' the prostate.xls file with the version of Excel 
available

on my Mac, save it as a *.csv file, and then read *that* in with read.csv()

What am I missing?  *Are* there ``R binary'' files lurking about that I 
am somehow

not seeing?  Why won't read.xls() work on this data set?

cheers,

Rolf Turner





--
Frank E Harrell Jr   Professor and Chair   School of Medicine
 Department of Biostatistics   Vanderbilt University



Re: [R] Green and Byar (1980) Prostate Cancer Data set from Andrews and Herzberg - Data

2009-03-24 Thread Rolf Turner


On 25/03/2009, at 10:04 AM, Frank E Harrell Jr wrote:


Ravi Varadhan wrote:

Hi,

I am looking for a data set containing the information from a  
randomized trial evaluating the effect of DES (diethylstilbestrol)  
on multiple time-to-event endpoints, prostate cancer, CVD, and  
other causes.  The original source of this data is Green and Byar  
(1980).  This is a popular competing risks problem that has  
subsequently been discussed in a number of statistical papers  
including Kay (1986).


Does anyone have a digital version of this data set?

This data is also presented in Andrews, D. F. and Herzberg, A. M.  
(1985). Data.   Does a digital version of all the data sets in A &  
H exist?


Thanks very much,
Ravi.


An R binary dataset is at http://biostat.mc.vanderbilt.edu/Datasets

Note that there is something strange about the AP variable with a  
lot of

ties at some value near 1.0.  I have never been able to find any
documentation about this problem.  If you find any please let me know.


Out of idle curiosity I went to have a look at this data set.

I had problems.

(1) The given URL didn't work for me; when I clicked on it, I got an  
error 404.
But if I went to http://biostat.mc.vanderbilt.edu I found a link to  
``Datasets'',

and clicking on that got me to some data sets.

(2) Scrolling down to ``Byar and Green prostate cancer data''  
appeared to get
me to the right place.  But I couldn't see any signs of any ``R  
binary files''.


The available formats appear to be *.sav (SPSS?), *.sdd (???), and  
*.xls.


(3) I downloaded the prostate.xls file O.K.  But when I tried to read  
it in with
the read.xls() function from the gdata package, I got an error to the  
effect


> X <- read.xls("prostate.xls")
Converting xls file to csv file... Done.
Reading csv file... Error in read.table(file = file, header = header,  
sep = sep, quote = quote,  :

  no lines available in input

I was able to ``open'' the prostate.xls file with the version of  
Excel available
on my Mac, save it as a *.csv file, and then read *that* in with  
read.csv()


What am I missing?  *Are* there ``R binary'' files lurking about that  
I am somehow

not seeing?  Why won't read.xls() work on this data set?

cheers,

Rolf Turner



Re: [R] Green and Byar (1980) Prostate Cancer Data set from Andrews and Herzberg - Data

2009-03-24 Thread Frank E Harrell Jr

Ravi Varadhan wrote:

Hi,

I am looking for a data set containing the information from a randomized trial evaluating the effect of DES (diethylstilbestrol) on multiple time-to-event endpoints, prostate cancer, CVD, and other causes.  The original source of this data is Green and Byar (1980).  This is a popular competing risks problem that has subsequently been discussed in a number of statistical papers including Kay (1986).  

Does anyone have a digital version of this data set?  


This data is also presented in Andrews, D. F. and Herzberg, A. M. (1985). Data.   
Does a digital version of all the data sets in A & H exist?

Thanks very much,
Ravi.


An R binary dataset is at http://biostat.mc.vanderbilt.edu/Datasets

Note that there is something strange about the AP variable with a lot of 
ties at some value near 1.0.  I have never been able to find any 
documentation about this problem.  If you find any please let me know.
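
One quick way to see such a pile-up of ties is to tabulate the values and sort by frequency. The sketch below is illustrative only: `ap` here is a simulated vector with an artificial cluster at 1.0 (all values invented), standing in for the AP column of the real data frame.

```r
# Simulated stand-in for AP: a skewed measurement recorded to one decimal,
# plus an artificial block of 40 ties at exactly 1.0.
set.seed(2)
ap <- c(round(exp(rnorm(200, 0, 1)), 1), rep(1.0, 40))

tab <- sort(table(ap), decreasing = TRUE)
head(tab, 5)       # most frequent values and their counts
names(tab)[1]      # the modal value
```

On the real data the same two lines applied to the AP column would show which exact value carries the ties and how many observations share it.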


Frank




Ravi Varadhan, Ph.D.
Assistant Professor,
Division of Geriatric Medicine and Gerontology
School of Medicine
Johns Hopkins University

Ph. (410) 502-2619
email: rvarad...@jhmi.edu





--
Frank E Harrell Jr   Professor and Chair   School of Medicine
 Department of Biostatistics   Vanderbilt University



Re: [R] Green and Byar (1980) Prostate Cancer Data set from Andrews and Herzberg - Data

2009-03-24 Thread Ravi Varadhan

I just found it.  Please disregard my email. 

Ravi.




Ravi Varadhan, Ph.D.
Assistant Professor,
Division of Geriatric Medicine and Gerontology
School of Medicine
Johns Hopkins University

Ph. (410) 502-2619
email: rvarad...@jhmi.edu


- Original Message -
From: Ravi Varadhan 
Date: Tuesday, March 24, 2009 4:03 pm
Subject: [R] Green and Byar (1980) Prostate Cancer Data set from Andrews and 
Herzberg - Data
To: r-help@r-project.org


> Hi,
>  
>  I am looking for a data set containing the information from a 
> randomized trial evaluating the effect of DES (diethylstilbestrol) on 
> multiple time-to-event endpoints, prostate cancer, CVD, and other 
> causes.  The original source of this data is Green and Byar (1980).  
> This is a popular competing risks problem that has subsequently been 
> discussed in a number of statistical papers including Kay (1986).  
>  
>  Does anyone have a digital version of this data set?  
>  
>  This data is also presented in Andrews, D. F. and Herzberg, A. M. 
> (1985). Data.   Does a digital version of all the data sets in A & H exist?
>  
>  Thanks very much,
>  Ravi.
>  
>  
>  
>  Ravi Varadhan, Ph.D.
>  Assistant Professor,
>  Division of Geriatric Medicine and Gerontology
>  School of Medicine
>  Johns Hopkins University
>  
>  Ph. (410) 502-2619
>  email: rvarad...@jhmi.edu
>  