[
https://issues.apache.org/jira/browse/MADLIB-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Frank McQuillan updated MADLIB-1040:
------------------------------------
Fix Version/s: (was: v2.0)
> Survival Analysis - Cox regression model for time-dependent covariates
> ----------------------------------------------------------------------
>
> Key: MADLIB-1040
> URL: https://issues.apache.org/jira/browse/MADLIB-1040
> Project: Apache MADlib
> Issue Type: Wish
> Components: Module: Cox Proportional Hazards
> Reporter: Pietro Pugni
>
> This JIRA follows a discussion opened on the user mailing list (
> http://mail-archives.apache.org/mod_mbox/incubator-madlib-user/201611.mbox/browser
> ).
> The current Cox model implemented in MADlib (
> https://madlib.incubator.apache.org/docs/latest/group__grp__cox__prop__hazards.html
> ) only supports time-independent covariates and doesn't provide any
> structure for time-dependent covariates, where a subject has one or more rows
> covering different time periods. A CPH model with time-dependent covariates
> is much more useful in survival analysis because it accounts for changes in
> covariates' effects over time.
> To provide some input, here are some good reference links:
> - "Using Time Dependent Covariates and Time Dependent Coefficients in the
> Cox Model", by T. Therneau:
> https://cran.r-project.org/web/packages/survival/vignettes/timedep.pdf
> - "Time-dependent Covariates in Cox Regression":
> http://www.math.ucsd.edu/~rxu/math284/slect7.pdf
> - "Time-dependent covariates in the Cox Proportional-Hazards Regression
> Model", by LD Fisher:
> https://pdfs.semanticscholar.org/f970/7f0dd6ff04899d7a3323668ee9ed1b9ad28e.pdf
>
> This is the article used by Therneau to implement the counting process
> algorithm in the R survival package:
> - "Cox's regression model for counting processes: a large sample study", by
> Andersen and Gill:
> https://projecteuclid.org/download/pdf_1/euclid.aos/1176345976
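> For orientation, the counting-process form of the Cox partial likelihood can
> be sketched as follows. The notation is an assumption made for illustration
> and is not taken from this issue: row j covers the interval (start_j, stop_j]
> with covariate vector x_j and event indicator delta_j, and tied event times
> are ignored.

```latex
L(\beta) = \prod_{j:\,\delta_j = 1}
  \frac{\exp(\beta^\top x_j)}
       {\sum_{k} Y_k(\mathrm{stop}_j)\,\exp(\beta^\top x_k)},
\qquad
Y_k(t) = \mathbf{1}\{\mathrm{start}_k < t \le \mathrm{stop}_k\}
```

> A row enters the risk set only while its interval contains the event time,
> which is why one subject can contribute several rows.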
> As far as I know, the counting process algorithm is the fastest one used in
> CPH models. The drawback is that the user has to provide a verticalized
> dataset with one row per time interval for each subject. The formula used in
> the coxph() function provided with the survival package is the following:
> coxph(data = df, formula = Surv(start, stop, event) ~ cluster(subject.id) +
> covariate.1 + covariate.2 + ... + covariate.n)
> where covariates can be factors (categorical variables) or numeric. In the
> linked documentation you can find some examples of counting process datasets.
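> As a minimal sketch of the verticalized layout described above, the Python
> snippet below turns per-subject measurements into (start, stop, event) rows.
> The names Interval and to_counting_process, the input shapes, and the sample
> values are hypothetical and chosen only for illustration; they are not part
> of MADlib or of the R survival package. It assumes a single covariate and
> distinct measurement times within each subject.

```python
from typing import Dict, List, NamedTuple, Tuple


class Interval(NamedTuple):
    """One row of a counting-process dataset (hypothetical layout)."""
    subject_id: int
    start: float      # interval is (start, stop]
    stop: float
    event: int        # 1 only on the row that ends with the event
    covariate: float  # value held constant over the interval


def to_counting_process(
    measurements: Dict[int, List[Tuple[float, float]]],
    follow_up: Dict[int, Tuple[float, int]],
) -> List[Interval]:
    """Build one row per period in which the covariate is constant.

    measurements: subject id -> list of (time, covariate value)
    follow_up:    subject id -> (end of follow-up, event indicator)
    """
    rows: List[Interval] = []
    for subject_id, series in measurements.items():
        end_time, event = follow_up[subject_id]
        # Ignore measurements taken at or after the end of follow-up.
        ordered = sorted((t, v) for t, v in series if t < end_time)
        for i, (t, value) in enumerate(ordered):
            is_last = i == len(ordered) - 1
            # Each interval ends where the next measurement begins.
            stop = end_time if is_last else ordered[i + 1][0]
            rows.append(
                Interval(subject_id, t, stop, event if is_last else 0, value)
            )
    return rows


# Made-up sample data: subject 1 has an event at time 8,
# subject 2 is censored at time 6.
measurements = {1: [(0.0, 1.2), (5.0, 2.0)], 2: [(0.0, 0.7)]}
follow_up = {1: (8.0, 1), 2: (6.0, 0)}
rows = to_counting_process(measurements, follow_up)
for row in rows:
    print(row)
```

> Here subject 1 yields two rows, (0, 5] with event 0 and (5, 8] with event 1,
> while subject 2 yields the single censored row (0, 6].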
> The counting process format is also the only dataset format supported by R
> survival analysis packages. SAS supports both the counting process and the
> longitudinal format. The longitudinal format is far slower, but requires less
> user development time and effort to create the dataset. Here are some hints:
> - "Survival Analysis Using SAS - A practical Guide - Second Edition - Paul
> D. Allison - SAS Publishing", ISBN 978-1-59994-640-5, in particular Chapter 5
> starting from page 153.
> - "Your Survival Guide to Using Time-Dependent Covariates", by Powell and
> Bagnell:
> http://mail-archives.apache.org/mod_mbox/incubator-madlib-user/201611.mbox/browser
> WARNING
> Be aware that for big cohort datasets the coxph() function of the R survival
> package uses a lot of RAM and is obviously single-core. I don't know whether
> or how this can be handled by the MADlib engine, but the winning point here
> is making it safer and possibly faster.
> Thank you everyone
> Pietro Pugni
> PS: this is my first JIRA. I hope I have done it the right way.