[jira] [Commented] (SPARK-8518) Log-linear models for survival analysis
[ https://issues.apache.org/jira/browse/SPARK-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15399718#comment-15399718 ]

Eliano Marques commented on SPARK-8518:

I totally understand your point, but I guess the same rule used for GLM should apply here, i.e. if you are limiting the summary statistics for GLM you can apply the same rule. Probably the best solution would be to add a parameter to the function that determines whether this gets calculated. Often you are more interested in the model parameters than in the prediction itself, so understanding how robust / stable they are might be as relevant as the prediction. Let me look at what L-BFGS returns and see if we can come back to you with something. Will keep you posted.

> Log-linear models for survival analysis
> ---
>
> Key: SPARK-8518
> URL: https://issues.apache.org/jira/browse/SPARK-8518
> Project: Spark
> Issue Type: New Feature
> Components: ML
> Reporter: Xiangrui Meng
> Assignee: Yanbo Liang
> Priority: Critical
> Fix For: 1.6.0
>
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> We want to add basic log-linear models for survival analysis. The implementation should match the result from R's survival package (http://cran.r-project.org/web/packages/survival/index.html).
> Design doc from [~yanboliang]: https://docs.google.com/document/d/1fLtB0sqg2HlfqdrJlNHPhpfXO0Zb2_avZrxiVoPEs0E/pub

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8518) Log-linear models for survival analysis
[ https://issues.apache.org/jira/browse/SPARK-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15399031#comment-15399031 ]

Yanbo Liang commented on SPARK-8518:

[~eliano.m.marq...@gmail.com] We use L-BFGS to solve the AFT survival regression problem, and L-BFGS builds its approximation of the inverse Hessian from first derivatives only. We need to figure out a way to output that approximation of the inverse Hessian and then derive the standard statistics from it, as is done for GLMs. Please feel free to file a ticket and work on it if you have some ideas. Thanks!
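Editor's sketch of the idea discussed above: standard errors are usually taken from the inverse of the Hessian of the negative log-likelihood at the fitted parameters. The example below is only an illustration under stated assumptions. It uses a one-parameter exponential model with right censoring (not the AFT model from this ticket), made-up data, and a finite-difference Hessian instead of anything returned by L-BFGS. All function names are hypothetical.

```python
import math

def neg_log_likelihood(theta, times, events):
    # Exponential model with rate exp(theta).
    # An observed event contributes log(rate) - rate * t,
    # a censored record contributes -rate * t.
    rate = math.exp(theta)
    return sum(rate * t - d * theta for t, d in zip(times, events))

def second_derivative(f, x, h=1e-4):
    # Central finite-difference estimate of f''(x).
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)

# Made-up data: survival times and event indicators (1 = event, 0 = censored).
times = [2.0, 3.5, 1.2, 4.0, 0.7]
events = [1, 0, 1, 1, 0]

# Closed-form maximum likelihood estimate for this simple model.
theta_hat = math.log(sum(events) / sum(times))

# Observed information (a 1x1 Hessian here) and the resulting standard error.
hessian = second_derivative(
    lambda th: neg_log_likelihood(th, times, events), theta_hat)
std_error = math.sqrt(1.0 / hessian)

# For this model the Hessian at the optimum equals the number of events,
# so std_error should be close to 1 / sqrt(number of events).
print(theta_hat, std_error)
```

With several parameters the same recipe would invert the full Hessian matrix and read the standard errors off the square roots of its diagonal.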
[jira] [Commented] (SPARK-8518) Log-linear models for survival analysis
[ https://issues.apache.org/jira/browse/SPARK-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15394450#comment-15394450 ]

Eliano Marques commented on SPARK-8518:

Would it be possible to add some information about the coefficients' standard deviation? According to the information at https://spark.apache.org/docs/latest/ml-classification-regression.html#survival-regression, would it be possible to bring to the model outputs the second derivatives of the loss function (the derivatives of the gradient function) with respect to beta and log theta? Bringing this information to the model outputs would enable us to perform the standard statistical tests on the parameters, similar to what is done in the glm. Thanks
[jira] [Commented] (SPARK-8518) Log-linear models for survival analysis
[ https://issues.apache.org/jira/browse/SPARK-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731866#comment-14731866 ]

Apache Spark commented on SPARK-8518:

User 'yanboliang' has created a pull request for this issue: https://github.com/apache/spark/pull/8611
[jira] [Commented] (SPARK-8518) Log-linear models for survival analysis
[ https://issues.apache.org/jira/browse/SPARK-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14729163#comment-14729163 ]

Yanbo Liang commented on SPARK-8518:

[~mengxr] [~meihuawu] I have finished the initial implementation of AFT using L-BFGS, and it produces the same result as the [R function | https://stat.ethz.ch/R-manual/R-devel/library/survival/html/survreg.html] on the [Larynx cancer data | http://www.mcw.edu/FileLibrary/Groups/Biostatistics/Publicfiles/DataFromSection/DataFromSectionTXT/Data_from_section_1.8.txt]. I will submit the PR after cleaning up the code formatting and adding more test cases. I think you will be able to comment on and review it in one or two days.
[jira] [Commented] (SPARK-8518) Log-linear models for survival analysis
[ https://issues.apache.org/jira/browse/SPARK-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725935#comment-14725935 ]

Meihua Wu commented on SPARK-8518:

For the reference implementation, I recommend we consider this R function: https://stat.ethz.ch/R-manual/R-devel/library/survival/html/survreg.html
[jira] [Commented] (SPARK-8518) Log-linear models for survival analysis
[ https://issues.apache.org/jira/browse/SPARK-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725749#comment-14725749 ]

Xiangrui Meng commented on SPARK-8518:

[~yanboliang] Thanks for working on the design doc! I think we should be ready to have an initial implementation of AFT using L-BFGS. Are there reference implementations we can test against?
[jira] [Commented] (SPARK-8518) Log-linear models for survival analysis
[ https://issues.apache.org/jira/browse/SPARK-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725732#comment-14725732 ]

Xiangrui Meng commented on SPARK-8518:

The value in this column would be 0/1. Then we need to see how users would interpret `event = 1` and `event = 0` vs. `censored = 1` and `censored = 0`. The latter looks easier to understand to me. Users do not need to visit the doc to see what `event = 1` means.
[jira] [Commented] (SPARK-8518) Log-linear models for survival analysis
[ https://issues.apache.org/jira/browse/SPARK-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14702676#comment-14702676 ]

Meihua Wu commented on SPARK-8518:

[~mengxr] [~yanbo] Either way works for me. In R and some Python survival packages, it is called event. In SAS, it is called censored.
[jira] [Commented] (SPARK-8518) Log-linear models for survival analysis
[ https://issues.apache.org/jira/browse/SPARK-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14702653#comment-14702653 ]

Yanbo Liang commented on SPARK-8518:

I think it's usually called `event` or `status` in the context of survival analysis. [~meihuawu] What is your opinion?
[jira] [Commented] (SPARK-8518) Log-linear models for survival analysis
[ https://issues.apache.org/jira/browse/SPARK-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14701614#comment-14701614 ]

Xiangrui Meng commented on SPARK-8518:

Calling it `censorCol` instead?
[jira] [Commented] (SPARK-8518) Log-linear models for survival analysis
[ https://issues.apache.org/jira/browse/SPARK-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700985#comment-14700985 ]

Yanbo Liang commented on SPARK-8518:

[~meihuawu] Thanks for your comments! Yes, we need an "eventCol" to indicate whether a record is censored. I have updated the document.
[jira] [Commented] (SPARK-8518) Log-linear models for survival analysis
[ https://issues.apache.org/jira/browse/SPARK-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700302#comment-14700302 ]

Meihua Wu commented on SPARK-8518:

[~yanbo] Thank you very much for the update! The loss function and gradient are different for events and censored records, so we will need a column in the data frame to indicate whether an individual record is an event or censored. I suppose we will need to define a Param for "eventCol" using code gen and mix it into the AFTRegressionParams. cc [~mengxr]
[jira] [Commented] (SPARK-8518) Log-linear models for survival analysis
[ https://issues.apache.org/jira/browse/SPARK-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14679918#comment-14679918 ]

Yanbo Liang commented on SPARK-8518:

[~mengxr] [~meihuawu] I have updated the design document; please review and comment on it. I can also answer the questions here:

1. Which algorithm is the most popular one?
   The Accelerated Failure Time (AFT) model.
2. What is the size of the model?
   It contains a weight vector (with intercept) and a scale parameter.
3. How do the algorithms fit into Spark? Are they easy to parallelize?
   I have derived the loss function and gradient function, so we can parallelize it using SGD or L-BFGS.
4. What is the complexity?
   It has the same complexity as LinearRegression.
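Editor's sketch of what "a weight vector and a scale parameter" with a loss and gradient function could look like. The thread does not spell out the formulas, so the following assumes a Weibull AFT model, log T = x·beta + sigma * eps with eps following the extreme value distribution, and a 0/1 indicator delta where 1 means the event was observed. The function name and the data are hypothetical, and terms that do not depend on the parameters are dropped from the loss.

```python
import math

def aft_loss_and_gradient(beta, log_sigma, rows):
    """Negative log-likelihood of an assumed Weibull AFT model and its
    gradient with respect to beta and log(sigma).

    rows is a list of (features, time, delta) tuples, delta = 1 for an
    observed event and 0 for a censored record.
    """
    sigma = math.exp(log_sigma)
    loss = 0.0
    grad_beta = [0.0] * len(beta)
    grad_log_sigma = 0.0
    for x, t, delta in rows:
        # Standardized residual on the log-time scale.
        eps = (math.log(t) - sum(b * xi for b, xi in zip(beta, x))) / sigma
        exp_eps = math.exp(eps)
        # Per-row loss term: delta * log(sigma) - delta * eps + exp(eps).
        loss += delta * log_sigma - delta * eps + exp_eps
        # d(loss)/d(beta_k) = (delta - exp(eps)) * x_k / sigma.
        multiplier = (delta - exp_eps) / sigma
        for k, xk in enumerate(x):
            grad_beta[k] += multiplier * xk
        # d(loss)/d(log sigma) = delta + (delta - exp(eps)) * eps.
        grad_log_sigma += delta + (delta - exp_eps) * eps
    return loss, grad_beta, grad_log_sigma

# Made-up rows: the first feature is a constant 1.0 acting as the intercept.
rows = [([1.0, 0.5], 2.0, 1), ([1.0, -1.0], 3.5, 0), ([1.0, 2.0], 1.2, 1)]
beta = [0.3, -0.2]
log_sigma = 0.1
loss, grad_beta, grad_log_sigma = aft_loss_and_gradient(beta, log_sigma, rows)
print(loss, grad_beta, grad_log_sigma)
```

Each row costs O(d) for d features and touches no other row, which is why one pass over the data has the same complexity as a linear regression gradient pass.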
[jira] [Commented] (SPARK-8518) Log-linear models for survival analysis
[ https://issues.apache.org/jira/browse/SPARK-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14641864#comment-14641864 ]

Yanbo Liang commented on SPARK-8518:

[~meihuawu] Got it. I agree to start with the exponential/Weibull model, since it is easier to parallelize in Spark. I will update the detailed design document ASAP to clarify the loss function, the gradient function, how to optimize the loss function, how it fits into Spark, etc. Thank you for your comments.
[jira] [Commented] (SPARK-8518) Log-linear models for survival analysis
[ https://issues.apache.org/jira/browse/SPARK-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14641711#comment-14641711 ]

Meihua Wu commented on SPARK-8518:

[~yanboliang] Here's my two cents :)

For the Cox model, you will need to find the \beta that maximizes the log partial likelihood: l(\beta) = ... (the 3rd formula on the wiki page https://en.wikipedia.org/wiki/Proportional_hazards_model). There are two summations. The first one is a summation over the records indexed by i. For each record i, you will need to do another summation over the records indexed by j. The complexity of the 2nd summation is O(n), so the double summation might be O(n^2) overall. I guess we might be able to improve this to O(n*log(n)) with a pre-processing step that sorts by Y_i, but that is still not O(n).

The exponential/Weibull model is like linear regression: there is only one summation in the objective function and each term in the summation is O(1). So the overall complexity is O(n).

In the end, I am not saying the Cox model is not good [it is actually more flexible and robust]. But I think for our first step, the exponential/Weibull model is easier to implement and computationally scales better for massive data.
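Editor's sketch of the complexity argument above. It assumes the log partial likelihood has the usual form, a sum over observed events i of theta_i - log(sum of exp(theta_j) over records j with Y_j >= Y_i), where theta_i is the linear predictor of record i. The first function evaluates the double summation directly in O(n^2). The second sorts by time once and keeps a running sum, which gives O(n*log(n)). Function names and data are hypothetical, and tied times are handled by adding the whole tie group to the risk set before scoring its events.

```python
import math

def cox_log_partial_likelihood_naive(theta, times, events):
    # Direct double summation: O(n^2).
    n = len(times)
    total = 0.0
    for i in range(n):
        if events[i]:
            risk = sum(math.exp(theta[j])
                       for j in range(n) if times[j] >= times[i])
            total += theta[i] - math.log(risk)
    return total

def cox_log_partial_likelihood_sorted(theta, times, events):
    # Sort by time descending once, then maintain a running risk-set sum.
    order = sorted(range(len(times)), key=lambda i: times[i], reverse=True)
    n = len(order)
    total = 0.0
    running = 0.0
    pos = 0
    while pos < n:
        # Find the end of the group of records sharing the same time.
        end = pos
        while end < n and times[order[end]] == times[order[pos]]:
            end += 1
        # Everyone in the tie group is still at risk at this time.
        for k in range(pos, end):
            running += math.exp(theta[order[k]])
        log_risk = math.log(running)
        for k in range(pos, end):
            i = order[k]
            if events[i]:
                total += theta[i] - log_risk
        pos = end
    return total

# Made-up linear predictors, times (with one tie) and event indicators.
theta = [0.2, -0.5, 1.0, 0.0, 0.3]
times = [5.0, 3.0, 3.0, 8.0, 1.0]
events = [1, 1, 0, 0, 1]
print(cox_log_partial_likelihood_naive(theta, times, events))
print(cox_log_partial_likelihood_sorted(theta, times, events))
```

Even in the sorted version every term depends on a running sum over other records, so the computation needs a global sort and an ordered scan rather than independent per-row work.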
[jira] [Commented] (SPARK-8518) Log-linear models for survival analysis
[ https://issues.apache.org/jira/browse/SPARK-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14638397#comment-14638397 ]

Yanbo Liang commented on SPARK-8518:

[~meihuawu] Thank you for your valuable comments. I agree with you that the AFT model is commonly used, but I did not find that it is more easily parallelizable than the Cox PH model. I think the AFT model is like a regression problem and needs to optimize a loss function using SGD or another method. Could you give me some references on the likelihood function, loss function and gradient function that show it is more easily parallelizable than the Cox PH model?
[jira] [Commented] (SPARK-8518) Log-linear models for survival analysis
[ https://issues.apache.org/jira/browse/SPARK-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14634621#comment-14634621 ]

Meihua Wu commented on SPARK-8518:

[~mengxr] [~yanboliang] Sounds like a plan. We would start with something simple.

I agree that the Cox PH model is a non-parametric model. It is not easy to implement it efficiently in Spark: to determine the contribution of a particular row in the RDD to the objective function, you need to reference other rows in the RDD, effectively breaking the parallelism.

The log-linear survival models are often called Accelerated Failure Time (AFT) models (https://en.wikipedia.org/wiki/Accelerated_failure_time_model). For AFT, there are again two flavors: parametric vs non-parametric. For the parametric flavor, the commonly used model is based on the Weibull / exponential distribution. Under these models, each row in the RDD contributes to the objective function independently, so the computation is easily parallelizable.
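Editor's sketch of why independent row contributions parallelize well: each partition can compute its own partial sum without seeing any other partition, and the partial sums are simply added. Plain Python lists stand in for RDD partitions here. The per-row term assumes a Weibull AFT loss of the form delta * log(sigma) - delta * eps + exp(eps), and the function names and data are hypothetical.

```python
import math

def row_loss(row, beta, sigma):
    # Contribution of a single row; it depends on no other row.
    x, t, delta = row
    eps = (math.log(t) - sum(b * xi for b, xi in zip(beta, x))) / sigma
    return delta * math.log(sigma) - delta * eps + math.exp(eps)

def partition_loss(partition, beta, sigma):
    # Work done locally on one partition.
    return sum(row_loss(row, beta, sigma) for row in partition)

# Made-up data split into two "partitions" of (features, time, delta) rows.
partitions = [
    [([1.0, 0.5], 2.0, 1), ([1.0, -1.0], 3.5, 0)],
    [([1.0, 2.0], 1.2, 1), ([1.0, 0.0], 4.1, 1), ([1.0, 1.5], 0.9, 0)],
]
beta = [0.3, -0.2]
sigma = 1.1

# "Map" step per partition, then a trivial "reduce" by addition.
partial_sums = [partition_loss(p, beta, sigma) for p in partitions]
total = sum(partial_sums)

# The same objective computed in a single pass over all rows.
all_rows = [row for p in partitions for row in p]
single_pass = partition_loss(all_rows, beta, sigma)
print(total, single_pass)
```

The two totals agree up to floating-point rounding regardless of how the rows are split, which is the property the Cox partial likelihood lacks.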
[jira] [Commented] (SPARK-8518) Log-linear models for survival analysis
[ https://issues.apache.org/jira/browse/SPARK-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14634291#comment-14634291 ]

Xiangrui Meng commented on SPARK-8518:

[~yanboliang] Let's narrow the goal of this JIRA to the log-linear model. I didn't know that it is the same as CoxPHModel. You mentioned that all the models listed in your design doc are non-parametric, but the log-linear model is parametric. Could you update the design doc and move the other models to a section called "out of scope"? It is okay to keep the content, which we can refer to if we want to implement more in the future. For the log-linear model, please list the proposed public API and some analysis of its complexity.

[~meihuawu] You are definitely welcome to contribute :) Because you are a first-time MLlib contributor, I will let [~yanboliang] do the coding part, and you can help review his design doc and pull request. Note that the goal is to implement a version with a minimal number of features. Other useful features could come in follow-up PRs. Thanks!!
[jira] [Commented] (SPARK-8518) Log-linear models for survival analysis
[ https://issues.apache.org/jira/browse/SPARK-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14633409#comment-14633409 ]

Yanbo Liang commented on SPARK-8518:

[~mengxr] I think the "proportional hazard models" you mentioned are the same as "CoxPHModel" in my current design doc. "CoxPHModel" is the abbreviation of "Cox’s Proportional Hazard model" (https://en.wikipedia.org/wiki/Proportional_hazards_model). As far as I know it is the most widely used survival model and is also supported by R (coxph). As the purpose of this JIRA is to support the most commonly used and easy-to-parallelize model, I think "Cox’s Proportional Hazard model" is the most appropriate one. It is not very hard to parallelize because it is composed of statistics computation and regression, and the regression procedure can leverage what already exists in mllib. I will update my design doc to answer your questions above as soon as possible and work on the code.

[~meihuawu] Thanks for your comments and interest in this issue. I think what you mentioned is also the same as "Cox’s Proportional Hazard model" (https://en.wikipedia.org/wiki/Proportional_hazards_model). I agree with you on the implementation proposal. Since the SGD optimization method is already implemented in mllib, I think we can leverage it. I will update the design doc; please don't hesitate to comment and review.
[jira] [Commented] (SPARK-8518) Log-linear models for survival analysis
[ https://issues.apache.org/jira/browse/SPARK-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14632965#comment-14632965 ]

Meihua Wu commented on SPARK-8518:

Hi [~mengxr] and [~yanboliang],

For the log-linear model for censored survival data, I believe the most commonly used and easy-to-parallelize methods are based on the exponential/Weibull distribution of the survival time. The algorithm is to optimize the log likelihood, so I think we could start with stochastic gradient descent for large-scale data.

Can I chime in and contribute to this JIRA as well?
[jira] [Commented] (SPARK-8518) Log-linear models for survival analysis
[ https://issues.apache.org/jira/browse/SPARK-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630940#comment-14630940 ]

Xiangrui Meng commented on SPARK-8518:

[~yanboliang] Thanks for working on the design doc! I hope that you enjoyed the process. I'm not super familiar with survival models. The simplest one I know is the censored log-linear formulation resulting from proportional hazard models. The purpose of this JIRA is not to support all survival models, but to support one that is most commonly used and easy to parallelize in Spark. So I think the design doc also needs to answer the following:

1. Which algorithm is the most popular one?
2. What is the size of the model?
3. How do the algorithms fit into Spark? Are they easy to parallelize?
4. What is the complexity?

Also CC [~rams].
[jira] [Commented] (SPARK-8518) Log-linear models for survival analysis
[ https://issues.apache.org/jira/browse/SPARK-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14627621#comment-14627621 ] Yanbo Liang commented on SPARK-8518: https://docs.google.com/document/d/1fLtB0sqg2HlfqdrJlNHPhpfXO0Zb2_avZrxiVoPEs0E/pub
[jira] [Commented] (SPARK-8518) Log-linear models for survival analysis
[ https://issues.apache.org/jira/browse/SPARK-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14614795#comment-14614795 ] Yanbo Liang commented on SPARK-8518: OK, I will first finish the design documents and discussion.
[jira] [Commented] (SPARK-8518) Log-linear models for survival analysis
[ https://issues.apache.org/jira/browse/SPARK-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14614436#comment-14614436 ] Xiangrui Meng commented on SPARK-8518: [~yanbo] This JIRA may require some design discussion. Could you first check the R survival package and write down your proposal? It should include the public APIs and how to implement them in Spark. Thanks!
[jira] [Commented] (SPARK-8518) Log-linear models for survival analysis
[ https://issues.apache.org/jira/browse/SPARK-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14613258#comment-14613258 ] Yanbo Liang commented on SPARK-8518: I will work on it.