[jira] [Commented] (SPARK-7674) R-like stats for ML models

Joseph K. Bradley (JIRA) Mon, 22 Jun 2015 15:45:10 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-7674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14596785#comment-14596785
 ]


Joseph K. Bradley commented on SPARK-7674:
------------------------------------------

Traits: I agree it would be nice to add traits for common properties or 
abstractions.  I'll note that in the doc.

Stats within vs. outside of the model: I think differently about 2 types of 
results: (a) model summary (e.g., stats from training) vs. (b) prediction 
results (on training or test data).
* (a) The model summary could be part of the model.  The main question there 
would be how best to share code between models, where we might be restricted by 
multiple inheritance issues.  (E.g., if we wanted an IterativeAlgorithm trait, 
we could not put an implementation in it because of the Java API.)  The best 
approach might become apparent after the initial implementation.
* (b) The prediction results should definitely be a separate object since we 
would need it for multiple datasets (train, test, etc.).
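
To make the (a) vs. (b) split concrete, here is a minimal sketch in plain Python.  It is not Spark or MLlib code: every name in it (TrainingSummary, PredictionResults, LinearModel, fit, evaluate) is hypothetical and chosen only for illustration, and the toy one-parameter model is an assumption.  The point is only the shape: the training summary is stored on the model, while prediction results are a separate object created once per dataset.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Dataset = Sequence[Tuple[float, float]]  # (feature, label) pairs


@dataclass(frozen=True)
class TrainingSummary:
    """(a) Stats gathered during training; stored on the model itself."""
    num_iterations: int
    objective_history: List[float]


@dataclass(frozen=True)
class PredictionResults:
    """(b) Stats for one dataset; a separate object, one per dataset."""
    dataset_name: str
    predictions: List[float]
    mean_squared_error: float


@dataclass(frozen=True)
class LinearModel:
    """Hypothetical toy model: y is approximated by slope * x."""
    slope: float
    summary: TrainingSummary  # (a) travels with the model

    def predict(self, x: float) -> float:
        return self.slope * x

    def evaluate(self, name: str, data: Dataset) -> PredictionResults:
        # (b) can be computed for any dataset: train, test, etc.
        preds = [self.predict(x) for x, _ in data]
        mse = sum((p - y) ** 2 for p, (_, y) in zip(preds, data)) / len(data)
        return PredictionResults(name, preds, mse)


def fit(data: Dataset, num_iterations: int = 50, step: float = 0.05) -> LinearModel:
    """Fit the slope by gradient descent on mean squared error."""
    slope = 0.0
    history: List[float] = []
    n = len(data)
    for _ in range(num_iterations):
        grad = sum(2 * (slope * x - y) * x for x, y in data) / n
        slope -= step * grad
        # Record the objective after each step for the training summary.
        history.append(sum((slope * x - y) ** 2 for x, y in data) / n)
    return LinearModel(slope, TrainingSummary(num_iterations, history))


# Usage: one model, one training summary, several prediction-result objects.
train = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
test = [(4.0, 8.0), (5.0, 11.0)]
model = fit(train)
train_results = model.evaluate("train", train)
test_results = model.evaluate("test", test)
```

Keeping (b) outside the model means evaluating on a new dataset never mutates or grows the model object.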


> R-like stats for ML models
> --------------------------
>
>                 Key: SPARK-7674
>                 URL: https://issues.apache.org/jira/browse/SPARK-7674
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>            Reporter: Joseph K. Bradley
>            Assignee: Joseph K. Bradley
>            Priority: Critical
>
> This is an umbrella JIRA for supporting ML model summaries and statistics, 
> following the example of R's summary() and plot() functions.
> [Design 
> doc|https://docs.google.com/document/d/1oswC_Neqlqn5ElPwodlDY4IkSaHAi0Bx6Guo_LvhHK8/edit?usp=sharing]
> From the design doc:
> {quote}
> R and its well-established packages provide extensive functionality for 
> inspecting a model and its results.  This inspection is critical to 
> interpreting, debugging and improving models.
> R is arguably a gold standard for a statistics/ML library, so this doc 
> largely attempts to imitate it.  The challenge we face is supporting similar 
> functionality, but on big (distributed) data.  Data size makes both efficient 
> computation and meaningful displays/summaries difficult.
> R model and result summaries generally take 2 forms:
> * summary(model): Display text with information about the model and results 
> on data
> * plot(model): Display plots about the model and results
> We aim to provide both of these types of information.  Visualization for the 
> plottable results will not be supported in MLlib itself, but we can provide 
> results in a form which can be plotted easily with other tools.
> {quote}
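
As one illustration of "results in a form which can be plotted easily with other tools", the sketch below returns ROC curve points as plain (false positive rate, true positive rate) pairs.  It is single-machine Python, not MLlib code; the function name roc_points and the choice of an ROC curve as the example are assumptions made here, not something taken from the design doc.

```python
from typing import List, Sequence, Tuple


def roc_points(scores: Sequence[float], labels: Sequence[int]) -> List[Tuple[float, float]]:
    """Return (false positive rate, true positive rate) pairs.

    Labels are 1 (positive) or 0 (negative).  One point is emitted per
    distinct score threshold, starting from (0.0, 0.0).
    """
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative label")

    # Walk the examples from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Tied scores share one threshold, so emit only after the last tie.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / neg, tp / pos))
    return points


# Usage: the returned pairs can be handed to any plotting tool as x/y columns.
curve = roc_points([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
```

A list of numeric pairs like this stays independent of any plotting library, which is the property the design doc asks for.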



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
