Hi Max, Matthias and David and anyone else interested, My question was prompted by seeing various prospective cohort studies which are deriving the RR of colorectal cancer due to smoking and which simply censor people who die from diseases other then colorectal cancer and then use coxph to get an estimate of the CRC due to smoking. When (and only when) those diseases are also caused by smoking then I think this reduces the apparent risk ratio of crc due to smoking. The following R code summarises what I think is happening and why simply censoring these other deaths is wrong. Mind you I have absolutely no idea how these studies should proceed, its just that it looks wrong to me and I'm trying to find out what other people think.
Some might argue that dead people can't get crc, but consider a second case analogous to the first. Red meat and crc. If heart disease is censored then the RR of crc due to red meat is reduced. But as heart disease is reduced by better treatment, the crc "suddenly" emerges. The relationship was hidden by virtue of the censoring protocol. Cheers, Geoff. R code follows library(survival) set.seed(2) popsize<-3000 smoking<-popsize/2 nonsmoke<-popsize/2 #---------------------------------------------------------- # generate a bunch of times for people to be either # censored or diagnosed #---------------------------------------------------------- time<-c(rnorm(popsize,75,15)) #---------------------------------------------------------- # allocate people to smoking/nonsmoking groups #---------------------------------------------------------- smokestatus<-c(rep(1,smoking),rep(0,nonsmoke)) #---------------------------------------------------------- # pick a colorectal cancer (crc) rate of 1 in 20 uniformly distributed # Aussie lifetime rate is about 1 in 17 #---------------------------------------------------------- status<-c(sample(c(0,1),popsize,replace=T,prob=c(0.95,0.05))) #---------------------------------------------------------- # alternatively generate a higher rate of crc in the smokers #---------------------------------------------------------- statusAlt<-c( sample(c(0,1),smoking,replace=T,prob=c(0.9,0.1)), sample(c(0,1),nonsmoke,replace=T,prob=c(0.95,0.05)) ) #---------------------------------------------------------- # now generate some heart deaths with more in the smokers # The ratio of deaths from heart disease to crc is about 5 # to 1 in EPIC-oxford cohort, and our numbers below are similar. #---------------------------------------------------------- heartattack<-c( sample(c(0,1),smoking,replace=T,prob=c(0.7,0.3)), sample(c(0,1),nonsmoke,replace=T,prob=c(0.85,0.15)) ) #---------------------------------------------------------- # people are only diagnosed with CRC if they don't # die of heart disease --- on assumption that heart # disease usually occurs before crc #---------------------------------------------------------- statusHd<-as.integer(status&(!heartattack)) statusAltHd<-as.integer(statusAlt&(!heartattack)) data <- data.frame(time,status,smokestatus) dataHd <- data.frame(time,statusHd,smokestatus) dataAlt <- data.frame(time,statusAlt,smokestatus) dataAltHd <- data.frame(time,statusAltHd,smokestatus) cSmoke<-coxph(Surv(time,status)~smokestatus, data) cHd<-coxph(Surv(time,statusHd)~smokestatus, dataHd) cAltSmoke<-coxph(Surv(time,statusAlt)~smokestatus, dataAlt) cAltHd<-coxph(Surv(time,statusAltHd)~smokestatus, dataAltHd) #---------------------------------------------------------- # as expected RR for crc of smokers relative to non-smokers is about 1 #---------------------------------------------------------- summary(cSmoke) Call: coxph(formula = Surv(time, status) ~ smokestatus, data = data) n= 3000 coef exp(coef) se(coef) z p smokestatus -0.0221 0.978 0.149 -0.148 0.88 exp(coef) exp(-coef) lower .95 upper .95 smokestatus 0.978 1.02 0.73 1.31 Rsquare= 0 (max possible= 0.563 ) Likelihood ratio test= 0.02 on 1 df, p=0.882 Wald test = 0.02 on 1 df, p=0.882 Score (logrank) test = 0.02 on 1 df, p=0.882 #---------------------------------------------------------- # but the RR drops when heart disease takes out crc cases #---------------------------------------------------------- summary(cHd) Call: coxph(formula = Surv(time, statusHd) ~ smokestatus, data = dataHd) n= 3000 coef exp(coef) se(coef) z p smokestatus -0.164 0.848 0.183 -0.898 0.37 exp(coef) exp(-coef) lower .95 upper .95 smokestatus 0.848 1.18 0.592 1.21 Rsquare= 0 (max possible= 0.424 ) Likelihood ratio test= 0.81 on 1 df, p=0.369 Wald test = 0.81 on 1 df, p=0.369 Score (logrank) test = 0.81 on 1 df, p=0.369 #---------------------------------------------------------- # when there really is a relationship #---------------------------------------------------------- summary(cAltSmoke) Call: coxph(formula = Surv(time, statusAlt) ~ smokestatus, data = dataAlt) n= 3000 coef exp(coef) se(coef) z p smokestatus 0.535 1.71 0.137 3.89 1e-04 exp(coef) exp(-coef) lower .95 upper .95 smokestatus 1.71 0.586 1.30 2.23 Rsquare= 0.005 (max possible= 0.659 ) Likelihood ratio test= 15.7 on 1 df, p=7.59e-05 Wald test = 15.1 on 1 df, p=0.000101 Score (logrank) test = 15.5 on 1 df, p=8.34e-05 #---------------------------------------------------------- # and the RR drops when heart disease is added #---------------------------------------------------------- summary(cAltHd) Call: coxph(formula = Surv(time, statusAltHd) ~ smokestatus, data = dataAltHd) n= 3000 coef exp(coef) se(coef) z p smokestatus 0.362 1.44 0.162 2.24 0.025 exp(coef) exp(-coef) lower .95 upper .95 smokestatus 1.44 0.696 1.05 1.97 Rsquare= 0.002 (max possible= 0.526 ) Likelihood ratio test= 5.09 on 1 df, p=0.024 Wald test = 5.01 on 1 df, p=0.0252 Score (logrank) test = 5.07 on 1 df, p=0.0244 ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.