[ 
https://issues.apache.org/jira/browse/MADLIB-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank McQuillan updated MADLIB-1040:
------------------------------------
    Fix Version/s:     (was: v2.0)

> Survival Analysis - Cox regression model for time-dependent covariates
> ----------------------------------------------------------------------
>
>                 Key: MADLIB-1040
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1040
>             Project: Apache MADlib
>          Issue Type: Wish
>          Components: Module: Cox Proportional Hazards
>            Reporter: Pietro Pugni
>
> This JIRA follows a discussion opened on the user mailing list ( 
> http://mail-archives.apache.org/mod_mbox/incubator-madlib-user/201611.mbox/browser
>  ).
> The current Cox model implemented in MADlib ( 
> https://madlib.incubator.apache.org/docs/latest/group__grp__cox__prop__hazards.html
>  ) only supports time-independent covariates and doesn't provide any 
> structure for time-dependent covariates, where a subject has one or more rows 
> for different time-varying periods. The time-dependent version of the CPH 
> model is much more useful in survival analysis because it accounts for 
> changes in covariate values over time.
> To provide some input, here are some good reference links:
>  - "Using Time Dependent Covariates and Time Dependent Coefficients in the 
> Cox Model", by T Therneau: 
> https://cran.r-project.org/web/packages/survival/vignettes/timedep.pdf
>  - "Time-dependent Covariates in Cox Regression": 
> http://www.math.ucsd.edu/~rxu/math284/slect7.pdf
>  - "Time-dependent covariates in the Cox Proportional-Hazards Regression 
> Model", by LD Fisher: 
> https://pdfs.semanticscholar.org/f970/7f0dd6ff04899d7a3323668ee9ed1b9ad28e.pdf
>  
> This is the article used by Therneau to implement the counting process 
> algorithm in the R survival package:
>  -  "Cox's regression model for counting processes: a large sample study", by 
> Andersen and Gill: 
> https://projecteuclid.org/download/pdf_1/euclid.aos/1176345976
> As far as I know, the counting process algorithm is the fastest used in CPH 
> models. The downside is that the user has to provide a verticalized dataset 
> with one row per covariate change within each subject. The formula used in 
> the coxph() function provided with the survival package is the following:
> coxph(data = df, formula = Surv(start, stop, event) ~ cluster(subject.id) + 
> covariate.1 + covariate.2 + ... + covariate.n)
> where covariates can be factors (categorical variables) or numeric. In the 
> linked documentation you can find some examples of counting process datasets.
> The counting process format is also the only dataset format supported by R 
> survival analysis packages. SAS supports both the counting process and the 
> longitudinal format. The longitudinal format is much slower, but requires 
> less user development time and effort to create the dataset. Here are some 
> references:
>  - "Survival Analysis Using SAS - A practical Guide - Second Edition - Paul 
> D. Allison - SAS Publishing", ISBN 978-1-59994-640-5, in particular Chapter 5 
> starting from page 153.
>  - "Your Survival Guide to Using Time-Dependent Covariates", by Powell and 
> Bagnell: 
> http://mail-archives.apache.org/mod_mbox/incubator-madlib-user/201611.mbox/browser
> WARNING
> Be aware that for big cohort datasets the coxph() function of the R survival 
> package uses a lot of RAM and is single-core. I don't know whether or how 
> this can be handled by the MADlib engine, but the key advantage here would 
> be making it safer and possibly faster.
> Thank you everyone
>  Pietro Pugni
> PS: this is my first JIRA. I hope I have done it the right way.
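
The counting process layout described in the issue (one row per subject for each interval over which the covariates stay constant) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names to_counting_process and risk_set, the input layout, and the column names are hypothetical and are not part of MADlib or the R survival package. It also assumes that the covariate change times of a subject are distinct and that the first one is the start of follow-up.

```python
def to_counting_process(subject_id, changes, end_time, event):
    """Split one subject's follow-up into (start, stop] rows.

    changes  : list of (time, covariates_dict); the first entry is the start
               of follow-up, each later entry gives the covariate values that
               hold from that time on (hypothetical input layout)
    end_time : time of the event or of censoring
    event    : 1 if the subject had the event at end_time, 0 if censored
    """
    changes = sorted(changes, key=lambda c: c[0])
    rows = []
    for i, (start, covariates) in enumerate(changes):
        if start >= end_time:
            break  # changes at or after the end of follow-up are ignored
        # The interval ends at the next change or at the end of follow-up.
        next_time = changes[i + 1][0] if i + 1 < len(changes) else end_time
        stop = min(next_time, end_time)
        row = {
            "subject_id": subject_id,
            "start": start,
            "stop": stop,
            # Only the last interval of a subject can carry the event.
            "event": event if stop == end_time else 0,
        }
        row.update(covariates)
        rows.append(row)
    return rows


def risk_set(rows, t):
    """Rows at risk at time t: those whose interval (start, stop] contains t."""
    return [r for r in rows if r["start"] < t <= r["stop"]]


# One subject whose (made-up) dose changes at time 5 and who has the event
# at time 8.
rows = to_counting_process(1, [(0, {"dose": 10}), (5, {"dose": 20})], 8, 1)
for r in rows:
    print(r)
```

Each resulting row corresponds to one (start, stop, event) record as used in the Surv(start, stop, event) formula quoted above. Because every subject contributes at most one row to the risk set at any given time, the covariate values in effect at that time are the ones that enter the model.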



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
