Re: [R] Green and Byar (1980) Prostate Cancer Data set from Andrews and Herzberg - Data
Ravi Varadhan wrote:
> Fine detective work, David. Now, you can see the reasons for my frustration - multiplicity of data sets combined with non-existent documentation of the source of data in journal articles (e.g. Kay 1986; Lunn and McNeil 1995).
>
> Best,
> Ravi.
>
> Ravi Varadhan, Ph.D.
> Assistant Professor, Division of Geriatric Medicine and Gerontology
> School of Medicine
> Johns Hopkins University
> Ph. (410) 502-2619
> email: rvarad...@jhmi.edu

Yes, that is a big frustration for me, even for projects for which I was the principal statistician in 1990 and for which I did a poor job of archiving excellent medical datasets for future use. This is a big advertisement for the reproducible research movement.

David - fantastic job. Based on what you found, the version on our web site looks as good as any.

Now if someone can explain to me why you see a spike near a serum prostatic acid phosphatase (AP) value of 1 when you use a flexible regression model (e.g., a restricted cubic spline) to relate AP to the log hazard of death in a survival model (see p. 518 in my book), that would be very helpful. If you do

    with(prostate, plot(supsmu(log(ap), 1*(status != 'alive'))))

you see a minimum at ap=2.37 after anti-logging. If you do

    dd <- datadist(prostate); options(datadist='dd')
    f <- cph(Surv(dtime, status != 'alive') ~ rcs(log(ap), 6), data=prostate)
    plot(f)

you see a sharp minimum at ap=1.43. With 4 knots the minimum is at 1.18. You have to go to 3 knots to get a monotonic fit in log(ap), but the AIC is not as good.

Frank

--
Frank E Harrell Jr
Professor and Chair
School of Medicine
Department of Biostatistics
Vanderbilt University

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
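The "minimum after anti-logging" in Frank's supsmu() call can be read off the smoother's output directly rather than from the plot. The following is only a rough sketch on simulated data: the variables log_ap and dead, and the U-shaped risk curve, are assumptions made up for illustration and are not the prostate dataset.

```r
# Simulated stand-in for the prostate data; every name and number here is hypothetical.
set.seed(1)
n      <- 500
log_ap <- runif(n, min = -2, max = 4)          # pretend log(ap) values
p_dead <- plogis(0.3 * (log_ap - 1)^2 - 1)     # assumed U-shaped risk, lowest at log_ap = 1
dead   <- rbinom(n, size = 1, prob = p_dead)   # 1 = died, 0 = alive

# Same smoother as in the message: supsmu() returns sorted x and the smoothed y.
s <- supsmu(log_ap, dead)

# x value with the smallest smoothed y, back-transformed from the log scale.
ap_at_min <- exp(s$x[which.min(s$y)])
ap_at_min
```

With the real data one would presumably pass log(ap) and the death indicator from the message in place of the simulated vectors.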
Re: [R] Green and Byar (1980) Prostate Cancer Data set from Andrews and Herzberg - Data
Rolf Turner wrote:
> On 25/03/2009, at 12:09 PM, Frank E Harrell Jr wrote:
>>> (2) Scrolling down to ``Byar and Green prostate cancer data'' appeared to get me to the right place. But I couldn't see any signs of any ``R binary files''.
>>
>> Please look again. It's under the heading "R". Unfortunately I used .sav suffix for save() files in the old days.
>
> Ah-ha. Oh me of little faith. I have been hanging around (in my current work environment) with too many SPSS users, and the *.sav extension seems to be the standard for SPSS data files. Whence my corrupted thinking.

It definitely is a standard for SPSS; that's why I regret ever using that suffix.

>> The .xls file opened with no problem in OpenOffice; has 506 rows.
>
> Hmmm. When I opened it with Excel on the Mac I got a spread sheet with 503 rows --- the first row being the column names, so there were really 502 rows. And 502 rows was what I got when I saved the *.xls file as a *.csv file and then read that in.
>
> Also, when I followed Phil Spector's excellent advice and loaded prostate.sav from the website, using load(), I ***again*** got a data frame of 502 rows. This data frame is (modulo some classes and attributes) identical with what I got from reading from the *.csv file.

Sorry about that - I was looking at patient numbers. I do get 502 rows either with load()'ing the binary data frame or opening the spreadsheet.

> Where have the other four rows gone? Ravi Varadhan also observed this phenomenon.
>
> cheers,
> Rolf

--
Frank E Harrell Jr
Professor and Chair
School of Medicine
Department of Biostatistics
Vanderbilt University
Re: [R] Green and Byar (1980) Prostate Cancer Data set from Andrews and Herzberg - Data
One further version, this one with a header and with NA's replacing the -'s, which apparently has not deleted any cases with missing data:

http://www.stat.auckland.ac.nz/~wild/764/s764data/prostatic.tab

--
David Winsemius

On Mar 24, 2009, at 11:51 PM, Ravi Varadhan wrote:
> Fine detective work, David. Now, you can see the reasons for my frustration - multiplicity of data sets combined with non-existent documentation of the source of data in journal articles (e.g. Kay 1986; Lunn and McNeil 1995).
>
> Best,
> Ravi.
>
> Ravi Varadhan, Ph.D.

David Winsemius, MD
Heritage Laboratories
West Hartford, CT
Re: [R] Green and Byar (1980) Prostate Cancer Data set from Andrews and Herzberg - Data
Fine detective work, David. Now, you can see the reasons for my frustration - multiplicity of data sets combined with non-existent documentation of the source of data in journal articles (e.g. Kay 1986; Lunn and McNeil 1995).

Best,
Ravi.

Ravi Varadhan, Ph.D.
Assistant Professor, Division of Geriatric Medicine and Gerontology
School of Medicine
Johns Hopkins University
Ph. (410) 502-2619
email: rvarad...@jhmi.edu

----- Original Message -----
From: David Winsemius
Date: Tuesday, March 24, 2009 10:54 pm
Subject: Re: [R] Green and Byar (1980) Prostate Cancer Data set from Andrews and Herzberg - Data
To: Rolf Turner
Cc: R-help Forum, Ravi Varadhan

> The last "patnr" is "506" but there are only 502 lines of data. 471, 473, 475 and 488 are missing.
>
> And the CMU Statlib version for 2002 looks the same.
> http://lib.stat.cmu.edu/S/Harrell/data/descriptions/prostate.html
>
> The version at this site is missing more than 25 cases:
> http://www.imbi.uni-freiburg.de/biom/Royston-Sauerbrei-book/
>
> Here are two other copies of the dataset, the first of which appears to have those missing cases.
>
> This one has patient numbers:
> http://lib.stat.cmu.edu/datasets/Andrews/T46.1
>
> This one has a description of the fields and cites the one above but has not retained the patient numbers and has apparently only kept the 475 cases with complete data.
> http://www.stats.waikato.ac.nz/Staff/maj/multimix/cancerdesc.txt
> http://www.stats.waikato.ac.nz/Staff/maj/multimix/cancer%20data.txt
>
> David Winsemius, MD
> Heritage Laboratories
> West Hartford, CT
Re: [R] Green and Byar (1980) Prostate Cancer Data set from Andrews and Herzberg - Data
On Mar 24, 2009, at 8:57 PM, Rolf Turner wrote:
> On 25/03/2009, at 12:09 PM, Frank E Harrell Jr wrote:
>>> (2) Scrolling down to ``Byar and Green prostate cancer data'' appeared to get me to the right place. But I couldn't see any signs of any ``R binary files''.
>>
>> Please look again. It's under the heading "R". Unfortunately I used .sav suffix for save() files in the old days.
>
> Ah-ha. Oh me of little faith. I have been hanging around (in my current work environment) with too many SPSS users, and the *.sav extension seems to be the standard for SPSS data files. Whence my corrupted thinking.
>
>> The .xls file opened with no problem in OpenOffice; has 506 rows.
>
> Hmmm. When I opened it with Excel on the Mac I got a spread sheet with 503 rows --- the first row being the column names, so there were really 502 rows.

The last "patnr" is "506" but there are only 502 lines of data. 471, 473, 475 and 488 are missing.

And the CMU Statlib version for 2002 looks the same.
http://lib.stat.cmu.edu/S/Harrell/data/descriptions/prostate.html

The version at this site is missing more than 25 cases:
http://www.imbi.uni-freiburg.de/biom/Royston-Sauerbrei-book/

Here are two other copies of the dataset, the first of which appears to have those missing cases.

This one has patient numbers:
http://lib.stat.cmu.edu/datasets/Andrews/T46.1

This one has a description of the fields and cites the one above but has not retained the patient numbers and has apparently only kept the 475 cases with complete data.
http://www.stats.waikato.ac.nz/Staff/maj/multimix/cancerdesc.txt
http://www.stats.waikato.ac.nz/Staff/maj/multimix/cancer%20data.txt

David Winsemius, MD
Heritage Laboratories
West Hartford, CT
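The gap check David describes (last patient number 506, only 502 rows) can be done in one line with setdiff(). This is a minimal sketch: the vector patno is a made-up stand-in for the patient-number column, built here from the four gaps reported in the message rather than read from any of the files.

```r
# Hypothetical stand-in for the patient-number column:
# numbers 1..506 with the four gaps reported in the message removed.
patno <- setdiff(1:506, c(471, 473, 475, 488))

n_rows   <- length(patno)            # number of data rows actually present
expected <- seq_len(max(patno))      # every number up to the last one seen
missing  <- setdiff(expected, patno) # numbers that never appear

n_rows
missing
```

Applied to a real copy of the data, patno would be replaced by the corresponding column of the data frame, whatever it is called in that version.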
Re: [R] Green and Byar (1980) Prostate Cancer Data set from Andrews and Herzberg - Data
On 25/03/2009, at 12:09 PM, Frank E Harrell Jr wrote:
>> (2) Scrolling down to ``Byar and Green prostate cancer data'' appeared to get me to the right place. But I couldn't see any signs of any ``R binary files''.
>
> Please look again. It's under the heading "R". Unfortunately I used .sav suffix for save() files in the old days.

Ah-ha. Oh me of little faith. I have been hanging around (in my current work environment) with too many SPSS users, and the *.sav extension seems to be the standard for SPSS data files. Whence my corrupted thinking.

> The .xls file opened with no problem in OpenOffice; has 506 rows.

Hmmm. When I opened it with Excel on the Mac I got a spread sheet with 503 rows --- the first row being the column names, so there were really 502 rows. And 502 rows was what I got when I saved the *.xls file as a *.csv file and then read that in.

Also, when I followed Phil Spector's excellent advice and loaded prostate.sav from the website, using load(), I ***again*** got a data frame of 502 rows. This data frame is (modulo some classes and attributes) identical with what I got from reading from the *.csv file.

Where have the other four rows gone? Ravi Varadhan also observed this phenomenon.

cheers,
Rolf
Re: [R] Green and Byar (1980) Prostate Cancer Data set from Andrews and Herzberg - Data
Rolf Turner wrote:
> On 25/03/2009, at 10:04 AM, Frank E Harrell Jr wrote:
>> An R binary dataset is at http://biostat.mc.vanderbilt.edu/Datasets
>>
>> Note that there is something strange about the AP variable with a lot of ties at some value near 1.0. I have never been able to find any documentation about this problem. If you find any please let me know.
>
> Out of idle curiosity I went to have a look at this data set. I had problems.
>
> (1) The given URL didn't work for me; when I clicked on it, I got an error 404. But if I went to http://biostat.mc.vanderbilt.edu I found a link to ``Datasets'', and clicking on that got me to some data sets.

Sorry, that should have been DataSets, not Datasets.

> (2) Scrolling down to ``Byar and Green prostate cancer data'' appeared to get me to the right place. But I couldn't see any signs of any ``R binary files''.

Please look again. It's under the heading "R". Unfortunately I used .sav suffix for save() files in the old days.

The .xls file opened with no problem in OpenOffice; has 506 rows.

Frank

> The available formats appear to be *.sav (SPSS?), *.sdd (???), and *.xls.
>
> (3) I downloaded the prostate.xls file O.K. But when I tried to read it in with the read.xls() function from the gdata package, I got an error to the effect
>
>     > X <- read.xls("prostate.xls")
>     Converting xls file to csv file... Done.
>     Reading csv file... Error in read.table(file = file, header = header, sep = sep, quote = quote, :
>       no lines available in input
>
> I was able to ``open'' the prostate.xls file with the version of Excel available on my Mac, save it as a *.csv file, and then read *that* in with read.csv().
>
> What am I missing? *Are* there ``R binary'' files lurking about that I am somehow not seeing? Why won't read.xls() work on this data set?
>
> cheers,
> Rolf Turner

--
Frank E Harrell Jr
Professor and Chair
School of Medicine
Department of Biostatistics
Vanderbilt University
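The point about the .sav suffix is that load() does not care what a save() file is called. A small sketch, assuming nothing about the real file: prostate_demo and its contents are invented for illustration, and the file is written to a temporary path.

```r
# A toy data frame standing in for the real dataset (hypothetical contents).
prostate_demo <- data.frame(patno = 1:3, ap = c(0.3, 1.0, 2.4))

# Write it with save() under a .sav name, mimicking the naming on the web site.
sav_file <- tempfile(fileext = ".sav")
save(prostate_demo, file = sav_file)

# load() ignores the extension and restores the object by name into 'env'.
env      <- new.env()
restored <- load(sav_file, envir = env)   # returns the names of the restored objects

restored
identical(env$prostate_demo, prostate_demo)
```

Loading into a separate environment is just a design choice for the sketch, so that the restored copy can be compared with the original without overwriting it.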
Re: [R] Green and Byar (1980) Prostate Cancer Data set from Andrews and Herzberg - Data
On 25/03/2009, at 10:04 AM, Frank E Harrell Jr wrote:
> Ravi Varadhan wrote:
>> Hi,
>>
>> I am looking for a data set containing the information from a randomized trial evaluating the effect of DES (diethylstilbestrol) on multiple time-to-event endpoints, prostate cancer, CVD, and other causes. The original source of this data is Green and Byar (1980). This is a popular competing risks problem that has subsequently been discussed in a number of statistical papers including Kay (1986).
>>
>> Does anyone have a digital version of this data set?
>>
>> This data is also presented in Andrews, D. F. and Herzberg, A. M. (1985). Data. Does a digital version of all the data sets in A & H exist?
>>
>> Thanks very much,
>> Ravi.
>
> An R binary dataset is at http://biostat.mc.vanderbilt.edu/Datasets
>
> Note that there is something strange about the AP variable with a lot of ties at some value near 1.0. I have never been able to find any documentation about this problem. If you find any please let me know.

Out of idle curiosity I went to have a look at this data set. I had problems.

(1) The given URL didn't work for me; when I clicked on it, I got an error 404. But if I went to http://biostat.mc.vanderbilt.edu I found a link to ``Datasets'', and clicking on that got me to some data sets.

(2) Scrolling down to ``Byar and Green prostate cancer data'' appeared to get me to the right place. But I couldn't see any signs of any ``R binary files''. The available formats appear to be *.sav (SPSS?), *.sdd (???), and *.xls.

(3) I downloaded the prostate.xls file O.K. But when I tried to read it in with the read.xls() function from the gdata package, I got an error to the effect

    > X <- read.xls("prostate.xls")
    Converting xls file to csv file... Done.
    Reading csv file... Error in read.table(file = file, header = header, sep = sep, quote = quote, :
      no lines available in input

I was able to ``open'' the prostate.xls file with the version of Excel available on my Mac, save it as a *.csv file, and then read *that* in with read.csv().

What am I missing? *Are* there ``R binary'' files lurking about that I am somehow not seeing? Why won't read.xls() work on this data set?

cheers,
Rolf Turner
Re: [R] Green and Byar (1980) Prostate Cancer Data set from Andrews and Herzberg - Data
Ravi Varadhan wrote:
> Hi,
>
> I am looking for a data set containing the information from a randomized trial evaluating the effect of DES (diethylstilbestrol) on multiple time-to-event endpoints, prostate cancer, CVD, and other causes. The original source of this data is Green and Byar (1980). This is a popular competing risks problem that has subsequently been discussed in a number of statistical papers including Kay (1986).
>
> Does anyone have a digital version of this data set?
>
> This data is also presented in Andrews, D. F. and Herzberg, A. M. (1985). Data. Does a digital version of all the data sets in A & H exist?
>
> Thanks very much,
> Ravi.
>
> Ravi Varadhan, Ph.D.
> Assistant Professor, Division of Geriatric Medicine and Gerontology
> School of Medicine
> Johns Hopkins University
> Ph. (410) 502-2619
> email: rvarad...@jhmi.edu

An R binary dataset is at http://biostat.mc.vanderbilt.edu/Datasets

Note that there is something strange about the AP variable with a lot of ties at some value near 1.0. I have never been able to find any documentation about this problem. If you find any please let me know.

Frank

--
Frank E Harrell Jr
Professor and Chair
School of Medicine
Department of Biostatistics
Vanderbilt University
Re: [R] Green and Byar (1980) Prostate Cancer Data set from Andrews and Herzberg - Data
I just found it. Please disregard my email.

Ravi.

Ravi Varadhan, Ph.D.
Assistant Professor, Division of Geriatric Medicine and Gerontology
School of Medicine
Johns Hopkins University
Ph. (410) 502-2619
email: rvarad...@jhmi.edu

----- Original Message -----
From: Ravi Varadhan
Date: Tuesday, March 24, 2009 4:03 pm
Subject: [R] Green and Byar (1980) Prostate Cancer Data set from Andrews and Herzberg - Data
To: r-help@r-project.org

> Hi,
>
> I am looking for a data set containing the information from a randomized trial evaluating the effect of DES (diethylstilbestrol) on multiple time-to-event endpoints, prostate cancer, CVD, and other causes. The original source of this data is Green and Byar (1980). This is a popular competing risks problem that has subsequently been discussed in a number of statistical papers including Kay (1986).
>
> Does anyone have a digital version of this data set?
>
> This data is also presented in Andrews, D. F. and Herzberg, A. M. (1985). Data. Does a digital version of all the data sets in A & H exist?
>
> Thanks very much,
> Ravi.