[jira] [Commented] (SPARK-8518) Log-linear models for survival analysis

2016-07-29 Thread Eliano Marques (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15399718#comment-15399718
 ] 

Eliano Marques commented on SPARK-8518:
---

I totally understand your point, but I guess the same rule used for GLM should 
apply here, i.e. if you are limiting the summary statistics for GLM you can 
apply the same rule. 

Probably the best solution would be to add a parameter to the function which 
determines if this gets calculated or not. Often you are more interested in the 
model parameters than in the prediction itself so understanding how robust / 
stable they are might be as relevant as the prediction. 
Let me look at what L-BFGS returns and see if we can come back to you with 
something. Will keep you posted. 

> Log-linear models for survival analysis
> ---
>
> Key: SPARK-8518
> URL: https://issues.apache.org/jira/browse/SPARK-8518
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
>Priority: Critical
> Fix For: 1.6.0
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> We want to add basic log-linear models for survival analysis. The 
> implementation should match the result from R's survival package 
> (http://cran.r-project.org/web/packages/survival/index.html).
> Design doc from [~yanboliang]: 
> https://docs.google.com/document/d/1fLtB0sqg2HlfqdrJlNHPhpfXO0Zb2_avZrxiVoPEs0E/pub



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8518) Log-linear models for survival analysis

2016-07-29 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15399031#comment-15399031
 ] 

Yanbo Liang commented on SPARK-8518:


[~eliano.m.marq...@gmail.com] We use L-BFGS to solve the AFT survival 
regression problem, and L-BFGS uses first-derivative information to build an 
approximation of the inverse Hessian. So we need to figure out a way to output 
that approximation of the inverse Hessian, and then compute the standard 
statistics as is done in GLMs. Please feel free to file a ticket and work on it 
if you have some ideas. Thanks!
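
As a rough illustration of the step from an inverse Hessian to standard errors: the usual recipe takes the square roots of the diagonal of the inverse Hessian of the negative log-likelihood at the fitted parameters. The sketch below is not Spark code and does not use L-BFGS's internal approximation; it assumes a finite-difference Hessian as a stand-in, and the helper names (`numerical_hessian`, `invert`, `standard_errors`) are hypothetical.

```python
import math

def numerical_hessian(f, x, h=1e-4):
    """Central-difference Hessian of a scalar function f at the point x."""
    n = len(x)

    def shifted(i, di, j, dj):
        y = list(x)
        y[i] += di
        y[j] += dj
        return f(y)

    return [[(shifted(i, h, j, h) - shifted(i, h, j, -h)
              - shifted(i, -h, j, h) + shifted(i, -h, j, -h)) / (4 * h * h)
             for j in range(n)] for i in range(n)]

def invert(m):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(m)
    a = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(m)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(a[r][c]))
        a[c], a[p] = a[p], a[c]
        pivot = a[c][c]
        a[c] = [v / pivot for v in a[c]]
        for r in range(n):
            if r != c:
                factor = a[r][c]
                a[r] = [vr - factor * vc for vr, vc in zip(a[r], a[c])]
    return [row[n:] for row in a]

def standard_errors(neg_log_lik, estimate):
    """Square roots of the diagonal of the inverse Hessian at the estimate."""
    cov = invert(numerical_hessian(neg_log_lik, estimate))
    return [math.sqrt(cov[i][i]) for i in range(len(estimate))]
```

Here `neg_log_lik` would be the AFT negative log-likelihood as a function of the parameter vector, and `estimate` the fitted parameters. Dividing each coefficient by its standard error then gives the usual Wald-type statistic.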




[jira] [Commented] (SPARK-8518) Log-linear models for survival analysis

2016-07-26 Thread Eliano Marques (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15394450#comment-15394450
 ] 

Eliano Marques commented on SPARK-8518:
---

Would it be possible to add some information about the standard deviation of 
the coefficients? 

Going by the formulas in 
https://spark.apache.org/docs/latest/ml-classification-regression.html#survival-regression,
would it be possible to add to the model outputs the derivative of the 
gradient functions (i.e. the second derivative of the loss) for beta and log 
theta? 

Bringing this information to the model outputs would enable us to perform the 
standard statistical tests on the parameters, similar to what is done in the 
GLM. 

Thanks




[jira] [Commented] (SPARK-8518) Log-linear models for survival analysis

2015-09-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731866#comment-14731866
 ] 

Apache Spark commented on SPARK-8518:
-

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/8611




[jira] [Commented] (SPARK-8518) Log-linear models for survival analysis

2015-09-03 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14729163#comment-14729163
 ] 

Yanbo Liang commented on SPARK-8518:


[~mengxr] [~meihuawu]
I have already finished the initial implementation of AFT using L-BFGS, and it 
can produce the same result as the [R function | 
https://stat.ethz.ch/R-manual/R-devel/library/survival/html/survreg.html] on 
the [Larynx cancer data | 
http://www.mcw.edu/FileLibrary/Groups/Biostatistics/Publicfiles/DataFromSection/DataFromSectionTXT/Data_from_section_1.8.txt].
 I will submit the PR after correcting code format and adding more test cases. 
I think you can comment and review it in one or two days.




[jira] [Commented] (SPARK-8518) Log-linear models for survival analysis

2015-09-01 Thread Meihua Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725935#comment-14725935
 ] 

Meihua Wu commented on SPARK-8518:
--

For a reference implementation, I recommend we consider this R function: 
https://stat.ethz.ch/R-manual/R-devel/library/survival/html/survreg.html 






[jira] [Commented] (SPARK-8518) Log-linear models for survival analysis

2015-09-01 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725749#comment-14725749
 ] 

Xiangrui Meng commented on SPARK-8518:
--

[~yanboliang] Thanks for working on the design doc! I think we should be ready to 
have an initial implementation of AFT using L-BFGS. Are there reference 
implementations we can test against?




[jira] [Commented] (SPARK-8518) Log-linear models for survival analysis

2015-09-01 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725732#comment-14725732
 ] 

Xiangrui Meng commented on SPARK-8518:
--

The value in this column would be 0/1. Then we need to see how users would 
interpret `event = 1` and `event = 0` vs. `censored = 1` and `censored = 0`. 
The latter looks easier to understand to me. Users do not need to visit the doc 
to see what `event = 1` means.
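
The two conventions carry the same information with the 0/1 values flipped. A minimal sketch of the mapping, assuming `event = 1` means the event was observed and `censored = 1` means the record was censored (the function name is hypothetical):

```python
def event_to_censored(event):
    """Map an event indicator (1 = event observed) to a censoring indicator (1 = censored)."""
    if event not in (0, 1):
        raise ValueError("indicator must be 0 or 1")
    return 1 - event

# Hypothetical status column in the `event = 1` convention.
events = [1, 0, 1, 1]
censored = [event_to_censored(e) for e in events]
```

Whichever name is chosen, the documentation has to state which value means "event observed", because data exported from R or SAS may arrive in either encoding.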




[jira] [Commented] (SPARK-8518) Log-linear models for survival analysis

2015-08-19 Thread Meihua Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14702676#comment-14702676
 ] 

Meihua Wu commented on SPARK-8518:
--

[~mengxr] [~yanbo] Either way works for me. 

In R and some Python survival packages, it is called event.
In SAS, it is called censored.




[jira] [Commented] (SPARK-8518) Log-linear models for survival analysis

2015-08-19 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14702653#comment-14702653
 ] 

Yanbo Liang commented on SPARK-8518:


I think it's usually called `event` or `status` in the context of survival 
analysis.
[~meihuawu] What is your opinion?




[jira] [Commented] (SPARK-8518) Log-linear models for survival analysis

2015-08-18 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14701614#comment-14701614
 ] 

Xiangrui Meng commented on SPARK-8518:
--

Calling it `censorCol` instead?




[jira] [Commented] (SPARK-8518) Log-linear models for survival analysis

2015-08-18 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700985#comment-14700985
 ] 

Yanbo Liang commented on SPARK-8518:


[~meihuawu] Thanks for your comments!
Yes, we need an "eventCol" to indicate whether a record is censored or not. I 
have updated the document.




[jira] [Commented] (SPARK-8518) Log-linear models for survival analysis

2015-08-17 Thread Meihua Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700302#comment-14700302
 ] 

Meihua Wu commented on SPARK-8518:
--

[~yanbo] Thank you very much for the update!

The loss function and gradient are different for events and censored records. 
So we will need to have a column in the data frame to indicate whether an 
individual record is an event or censored. I suppose we will need to define a 
Param for "eventCol" using code gen and mix it into the AFTRegressionParams. 

cc [~mengxr]
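
To make the event/censored difference concrete, here is a minimal sketch of the per-record negative log-likelihood under a Weibull AFT model, where the standardized residual follows the extreme-value distribution. An observed event contributes the log density; a censored record contributes only the log survival function. The function and argument names are hypothetical, not the Spark implementation.

```python
import math

def aft_weibull_record_loss(log_t, eta, sigma, event):
    """Negative log-likelihood contribution of one record under a Weibull AFT model.

    log_t is the log of the observed time, eta the linear predictor x . beta,
    sigma the scale, and event is 1 for an observed event, 0 for a censored record.
    """
    eps = (log_t - eta) / sigma                 # standardized residual
    log_survival = -math.exp(eps)               # log S0(eps) for the extreme-value distribution
    if event == 1:
        # log of f0(eps) / sigma, the density of log(T) at the observed value
        log_density = eps - math.exp(eps) - math.log(sigma)
        return -log_density
    return -log_survival
```

The two branches differ by `log(sigma) - eps`, which is why the indicator column is needed when the loss and gradient are computed.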




[jira] [Commented] (SPARK-8518) Log-linear models for survival analysis

2015-08-10 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14679918#comment-14679918
 ] 

Yanbo Liang commented on SPARK-8518:


[~mengxr] [~meihuawu]
I have updated the design document; please review and comment on it.
I can also answer the questions here:
1. Which algorithm is the most popular one?
   The Accelerated Failure Time (AFT) model.
2. What is the size of the model?
   It contains a weight vector (with intercept) and a scale parameter.
3. How do the algorithms fit into Spark? Are they easy to be parallelized?
   I have derived the loss function and gradient function, so we can 
   parallelize it using SGD or L-BFGS.
4. What is the complexity?
   It has the same complexity as LinearRegression.




[jira] [Commented] (SPARK-8518) Log-linear models for survival analysis

2015-07-25 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14641864#comment-14641864
 ] 

Yanbo Liang commented on SPARK-8518:


[~meihuawu] 
Got it. I agree to start with the exponential/Weibull model, as it's easier to 
parallelize in Spark.
I will update the detailed design document ASAP to clarify the loss function, 
gradient function, how to optimize the loss function, how it fits into Spark, 
etc.
Thank you for your comments.





[jira] [Commented] (SPARK-8518) Log-linear models for survival analysis

2015-07-25 Thread Meihua Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14641711#comment-14641711
 ] 

Meihua Wu commented on SPARK-8518:
--

[~yanboliang] Here's my two cents :)

For the Cox model, you will need to find the \beta that maximizes the log 
partial likelihood l(\beta), the 3rd formula on the wiki page 
https://en.wikipedia.org/wiki/Proportional_hazards_model. There are two 
summations. The first one involves summation over the records indexed by i. For 
each record i, you will need to do another summation over the records indexed 
by j. The complexity of the 2nd summation is O(n). In the end, the double 
summation might be O(n^2). I guess we might be able to improve this to 
O(n*log(n)) by a pre-processing step of sorting by Y_i. But still not O(n).

The exponential/Weibull model is like linear regression: there is only one 
summation in the objective function and each term in the summation is O(1). So 
the overall complexity is O(n).

In the end, I am not saying the Cox model is not good (it is actually more 
flexible and robust). But I think for our first step, the exponential/Weibull 
model is easier to implement and scales better computationally for massive 
data.
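
The complexity argument can be sketched as follows, assuming the negative log partial likelihood with risk set {j : Y_j >= Y_i} and tied times entering the risk set together. `etas` holds the linear predictors x_i . \beta, and both function names are hypothetical. The first version evaluates the double summation directly; the second sorts by time in descending order and keeps a running sum, so the only super-linear step is the sort.

```python
import math
from itertools import groupby

def cox_nll_naive(times, events, etas):
    """O(n^2): for every event i, sum exp(eta_j) over the risk set {j : Y_j >= Y_i}."""
    total = 0.0
    for t_i, d_i, eta_i in zip(times, events, etas):
        if d_i == 1:
            risk = sum(math.exp(eta_j)
                       for t_j, eta_j in zip(times, etas) if t_j >= t_i)
            total -= eta_i - math.log(risk)
    return total

def cox_nll_sorted(times, events, etas):
    """O(n log n): sort by time descending and keep a running risk-set sum."""
    rows = sorted(zip(times, events, etas), key=lambda r: -r[0])
    total, running = 0.0, 0.0
    for _, group in groupby(rows, key=lambda r: r[0]):
        group = list(group)
        # Records with the same time enter the risk set together.
        running += sum(math.exp(eta) for _, _, eta in group)
        for _, d, eta in group:
            if d == 1:
                total -= eta - math.log(running)
    return total
```

Even in the sorted form, each term depends on a global ordering of the records, which is what makes the Cox objective harder to distribute than a plain sum of independent per-record terms.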




[jira] [Commented] (SPARK-8518) Log-linear models for survival analysis

2015-07-23 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14638397#comment-14638397
 ] 

Yanbo Liang commented on SPARK-8518:


[~meihuawu] Thank you for your valued comments. I agree with you that the AFT 
model is commonly used. But I did not find that it is more easily 
parallelizable than the Cox PH model. 
I think the AFT model is like a regression problem and needs to optimize a loss 
function using SGD or another method. Could you give me some references about 
the likelihood function, loss function and gradient function that show that 
it's more easily parallelizable than the Cox PH model?




[jira] [Commented] (SPARK-8518) Log-linear models for survival analysis

2015-07-20 Thread Meihua Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14634621#comment-14634621
 ] 

Meihua Wu commented on SPARK-8518:
--

[~mengxr] [~yanboliang]
Sounds like a plan. We would start with something simple.

I agree that the Cox PH model is a non-parametric model. It is not easy to 
implement it efficiently in Spark: To determine the contribution of a 
particular row in the RDD to the objective function, you will need to reference 
to other rows in the RDD, effectively breaking the parallelism. 

The log-linear models for survival analysis are often called Accelerated 
Failure Time (AFT) models 
(https://en.wikipedia.org/wiki/Accelerated_failure_time_model). For AFT, there 
are again two flavors: parametric vs non-parametric. For the parametric flavor, 
the commonly used model is based on the Weibull / exponential distribution. 
Under these models, each row in the RDD contributes to the objective function 
independently, thus easily parallelizable. 
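
Because each row contributes an independent term, the loss and gradient over the whole data set are just the sums of the partial results computed on each partition. The sketch below illustrates this for the Weibull AFT negative log-likelihood in plain Python; the names, the sample data and the two-way split are hypothetical stand-ins for an RDD and its partitions.

```python
import math

def loss_and_grad(rows, beta, log_sigma):
    """Sum of per-record Weibull AFT negative log-likelihood terms and their gradient.

    rows: iterable of (features, log_time, event).
    Returns (loss, gradient w.r.t. beta, gradient w.r.t. log sigma).
    """
    sigma = math.exp(log_sigma)
    loss, g_beta, g_ls = 0.0, [0.0] * len(beta), 0.0
    for x, log_t, event in rows:
        eps = (log_t - sum(b * v for b, v in zip(beta, x))) / sigma
        e = math.exp(eps)
        loss += event * log_sigma - event * eps + e
        g_beta = [g + (event - e) * v / sigma for g, v in zip(g_beta, x)]
        g_ls += event + (event - e) * eps
    return loss, g_beta, g_ls

def combine(a, b):
    """Add two partial results, as a reduce step would."""
    return a[0] + b[0], [p + q for p, q in zip(a[1], b[1])], a[2] + b[2]

# Hypothetical data: (features with a leading 1.0 for the intercept, log time, event flag).
data = [([1.0, 0.5], math.log(2.0), 1), ([1.0, -1.0], math.log(5.0), 0),
        ([1.0, 2.0], math.log(1.5), 1), ([1.0, 0.0], math.log(3.0), 1)]
beta, log_sigma = [0.8, -0.1], 0.2

whole = loss_and_grad(data, beta, log_sigma)
parts = combine(loss_and_grad(data[:2], beta, log_sigma),
                loss_and_grad(data[2:], beta, log_sigma))
```

`whole` and `parts` agree up to floating-point rounding, which is the property an aggregate over partitions relies on.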






[jira] [Commented] (SPARK-8518) Log-linear models for survival analysis

2015-07-20 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14634291#comment-14634291
 ] 

Xiangrui Meng commented on SPARK-8518:
--

[~yanboliang] Let's narrow the goal of this JIRA to the log-linear model. I 
didn't know that it is the same as CoxPHModel. You mentioned that all the 
models listed in your design doc are non-parametric, but log-linear model is 
parametric. Could you update the design doc and move other models to a section 
called "out of scope"? It is okay to keep the content, which we can refer to if 
we want to implement more in the future. For the log-linear model, please list 
the proposed public API and some analysis about its complexity.

[~meihuawu] You are definitely welcome to contribute :) Because you are a 
first-time MLlib contributor, I will let [~yanboliang] do the coding part and 
you help review his design doc and pull request. Note that the goal is to 
implement a version with a minimal number of features. Other useful features 
could come in follow-up PRs. Thanks!!




[jira] [Commented] (SPARK-8518) Log-linear models for survival analysis

2015-07-20 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14633409#comment-14633409
 ] 

Yanbo Liang commented on SPARK-8518:


[~mengxr] I think the "proportional hazard models" you mentioned are the same as 
"CoxPHModel" in my current design doc. "CoxPHModel" is an abbreviation of 
"Cox’s Proportional Hazard model" 
(https://en.wikipedia.org/wiki/Proportional_hazards_model). As far as I 
know it's the most widely used survival model and is also supported by 
R (coxph). 
As the purpose of this JIRA is to support the model that is most commonly used 
and easy to parallelize, I think "Cox’s Proportional Hazard model" is the most 
appropriate one. 
It's not very hard to parallelize because it is composed of statistics 
computation and regression, and the regression procedure can leverage what 
already exists in mllib. 
I will update my design doc to answer your questions above as soon as possible 
and work on the code.
[~meihuawu] Thanks for your comments and interest in this issue. I think what 
you mentioned is also the same as "Cox’s Proportional Hazard model" 
(https://en.wikipedia.org/wiki/Proportional_hazards_model). I agree with 
you on the implementation proposal. Since the SGD optimization method has 
already been implemented in mllib, I think we can leverage it. I will update 
the design doc; please don't hesitate to comment and review.




[jira] [Commented] (SPARK-8518) Log-linear models for survival analysis

2015-07-19 Thread Meihua Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14632965#comment-14632965
 ] 

Meihua Wu commented on SPARK-8518:
--

Hi [~mengxr] and [~yanboliang], 

For the log-linear model for censored survival data, I believe the most 
commonly used and easy-to-parallelize methods are based on an 
exponential/Weibull distribution of the survival time. The algorithm 
maximizes the log-likelihood, so I think we could start with stochastic 
gradient descent for large-scale data. Can I chime in and contribute to this 
JIRA as well?
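
To make the proposal above concrete, here is a minimal sketch in plain Python 
(not Spark code, and not taken from the design doc). It assumes the exponential 
special case of the Weibull model with right censoring: with 
eps_i = log(t_i) - x_i . beta and delta_i = 1 for an observed event (0 for a 
censored record), each record contributes delta_i * eps_i - exp(eps_i) to the 
log-likelihood, up to a term that does not depend on beta. The function names, 
step size, and simulated data are all illustrative assumptions.

```python
import math
import random


def neg_log_likelihood(beta, rows):
    """Negative log-likelihood (up to a constant) of the exponential model.

    rows holds (x, t, delta) tuples: feature list, observed time t > 0,
    and delta = 1.0 for an event or 0.0 for a right-censored record.
    """
    total = 0.0
    for x, t, delta in rows:
        eps = math.log(t) - sum(b * xi for b, xi in zip(beta, x))
        total -= delta * eps - math.exp(eps)
    return total


def record_gradient(beta, x, t, delta):
    """Gradient of one record's negative log-likelihood with respect to beta."""
    eps = math.log(t) - sum(b * xi for b, xi in zip(beta, x))
    w = delta - math.exp(eps)
    return [w * xi for xi in x]


def fit_sgd(rows, dim, step=0.01, epochs=30, seed=0):
    """Plain SGD: one record per update, with a slowly decaying step size."""
    rng = random.Random(seed)
    beta = [0.0] * dim
    order = list(range(len(rows)))
    for epoch in range(epochs):
        rng.shuffle(order)
        lr = step / math.sqrt(1.0 + epoch)
        for i in order:
            x, t, delta = rows[i]
            grad = record_gradient(beta, x, t, delta)
            beta = [b - lr * g for b, g in zip(beta, grad)]
    return beta


def simulate(n, true_beta, censor_time, seed=1):
    """Hypothetical data: exponential survival times censored at censor_time."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = [1.0, rng.uniform(-1.0, 1.0)]  # intercept plus one feature
        mean = math.exp(sum(b * xi for b, xi in zip(true_beta, x)))
        t = max(rng.expovariate(1.0 / mean), 1e-9)
        if t > censor_time:
            rows.append((x, censor_time, 0.0))  # censored record
        else:
            rows.append((x, t, 1.0))  # observed event
    return rows


rows = simulate(500, [1.0, 0.5], censor_time=10.0)
beta_hat = fit_sgd(rows, dim=2)
print("estimated beta:", beta_hat)
```

The full Weibull case adds one scale parameter that divides eps_i; the 
structure of the per-record update stays the same.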










[jira] [Commented] (SPARK-8518) Log-linear models for survival analysis

2015-07-17 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630940#comment-14630940
 ] 

Xiangrui Meng commented on SPARK-8518:
--

[~yanboliang] Thanks for working on the design doc! I hope that you enjoyed the 
process. I'm not super familiar with survival models. The simplest one I know 
is the censored log-linear formulation resulting from proportional hazard 
models. The purpose of this JIRA is not to support all survival models, but 
to support one that is most commonly used and easy to parallelize in Spark. So 
I think the design doc also needs to answer the following:

1. Which algorithm is the most popular one?
2. What is the size of the model?
3. How do the algorithms fit into Spark? Are they easy to parallelize?
4. What is the complexity?

Also CC [~rams].
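
As a rough illustration of questions 2 to 4 (a sketch under assumptions, not 
something from the design doc): if the chosen model has a log-likelihood that 
is a sum over records, as in the censored exponential log-linear model, then 
the gradient is also a sum over records. Each partition can compute a partial 
gradient and the partials can be added, so the model is one coefficient per 
feature and one pass costs on the order of (records x features). The plain 
Python below mimics that map-and-reduce pattern; the function names and the 
toy rows are hypothetical.

```python
import math
from functools import reduce


def partition_gradient(beta, partition):
    """Sum of per-record gradients of the negative log-likelihood in one partition.

    Each record is (x, t, delta): features, observed time, and an event
    indicator (1.0 = event observed, 0.0 = right-censored).
    """
    grad = [0.0] * len(beta)
    for x, t, delta in partition:
        eps = math.log(t) - sum(b * xi for b, xi in zip(beta, x))
        w = delta - math.exp(eps)
        for j, xj in enumerate(x):
            grad[j] += w * xj
    return grad


def add_vectors(a, b):
    """Element-wise sum, used to combine partial gradients."""
    return [ai + bi for ai, bi in zip(a, b)]


def distributed_gradient(beta, partitions):
    """Map each partition to a partial gradient, then reduce by addition."""
    partials = map(lambda part: partition_gradient(beta, part), partitions)
    return reduce(add_vectors, partials, [0.0] * len(beta))


# Toy data split into two "partitions".
rows = [
    ([1.0, 0.2], 2.5, 1.0),
    ([1.0, -0.7], 0.8, 1.0),
    ([1.0, 0.9], 10.0, 0.0),
    ([1.0, -0.1], 4.1, 1.0),
]
partitions = [rows[:2], rows[2:]]
beta = [0.5, 0.1]
print(distributed_gradient(beta, partitions))
```

The combined result equals the gradient computed over all rows at once, which 
is what makes this kind of objective straightforward to distribute.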







[jira] [Commented] (SPARK-8518) Log-linear models for survival analysis

2015-07-14 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14627621#comment-14627621
 ] 

Yanbo Liang commented on SPARK-8518:


https://docs.google.com/document/d/1fLtB0sqg2HlfqdrJlNHPhpfXO0Zb2_avZrxiVoPEs0E/pub







[jira] [Commented] (SPARK-8518) Log-linear models for survival analysis

2015-07-06 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14614795#comment-14614795
 ] 

Yanbo Liang commented on SPARK-8518:


OK, I will first finish the design documents and discussion.







[jira] [Commented] (SPARK-8518) Log-linear models for survival analysis

2015-07-05 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14614436#comment-14614436
 ] 

Xiangrui Meng commented on SPARK-8518:
--

[~yanbo] This JIRA may require some design discussion. Could you first check 
the R survival package and write down your proposal? It should include 
the public APIs and how to implement it in Spark. Thanks!







[jira] [Commented] (SPARK-8518) Log-linear models for survival analysis

2015-07-03 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14613258#comment-14613258
 ] 

Yanbo Liang commented on SPARK-8518:


I will work on it.



