[ 
https://issues.apache.org/jira/browse/SPARK-7674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597184#comment-14597184
 ] 

Joseph K. Bradley commented on SPARK-7674:
------------------------------------------

Yeah, you're right that some values will need to be optional: None, or 
something more Java-friendly.  Maybe null plus a helper method hasValueX for 
each valueX.
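
To make the "null + hasValueX" idea concrete, here is a minimal plain-Python sketch. The class and field names (ModelSummary, r2, p_value) are hypothetical and are not part of any Spark API; None stands in for Java's null.

```python
from typing import Optional


class ModelSummary:
    """Hypothetical summary object; the names here are illustrative only."""

    def __init__(self, r2: float, p_value: Optional[float] = None):
        self.r2 = r2
        # None plays the role of null: the statistic was not computed
        # (for example, because the model type does not support it).
        self.p_value = p_value

    def has_p_value(self) -> bool:
        """Helper in the spirit of hasValueX: True if p_value is available."""
        return self.p_value is not None


summary = ModelSummary(r2=0.9)
if summary.has_p_value():
    print("p-value:", summary.p_value)
else:
    print("p-value not available")
```

Callers check the helper before reading the value, which avoids exposing an Option-like type to Java users.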

I'd prefer to store the entire dataset in a transient reference.  That will 
permit going beyond random samples, such as filtering to find outliers.  For 
values requiring sampling, we could either:
* Use a single random seed given to the model/results object upon construction 
(so those values would be lazy vals), or
** I like this option, though I don't expect to require randomness for the 
initial PRs.
* Provide those values via methods which take random seeds (so those values 
would be recomputed, which is nice for flexibility but less nice for lazy eval).
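
The two sampling options above could look roughly like the following plain-Python sketch. Everything in it is an assumption for illustration: ResultsSummary, residual_sample, sample_residuals and outliers are made-up names, an in-memory list stands in for the transient dataset reference, and functools.cached_property plays the role of a Scala lazy val.

```python
import random
from functools import cached_property


class ResultsSummary:
    """Hypothetical results object holding a reference to the full dataset."""

    def __init__(self, residuals, seed=0, sample_size=3):
        # Stand-in for the transient reference to the entire dataset.
        self._residuals = list(residuals)
        self._seed = seed
        self._sample_size = sample_size

    @cached_property
    def residual_sample(self):
        # Option 1: the seed is fixed at construction, so the sample is
        # computed once on first access and then cached (like a lazy val).
        rng = random.Random(self._seed)
        return rng.sample(self._residuals, self._sample_size)

    def sample_residuals(self, seed, sample_size=3):
        # Option 2: the caller supplies the seed, so the sample is
        # recomputed on every call (more flexible, but never cached).
        rng = random.Random(seed)
        return rng.sample(self._residuals, sample_size)

    def outliers(self, threshold):
        # Keeping the full dataset permits more than random samples,
        # such as filtering for large residuals.
        return [r for r in self._residuals if abs(r) > threshold]
```

With option 1 repeated accesses return the same cached sample; with option 2 the same seed still reproduces the same sample, at the cost of recomputing it each time.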

Q-Q plots themselves seem a bit outside the scope though.  This class will 
provide residuals, and the rest of the stuff needed to compute Q-Q plots is 
pretty basic (sorting).  Once MLlib gets fancy and does visualization, that 
could be cool to add.
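
As a rough illustration of why the remaining work is basic, here is a plain-Python sketch that turns a list of residuals into normal Q-Q plot points. The function name qq_points and the (i + 0.5) / n plotting positions are my own choices for the example, not something specified in the design doc.

```python
from statistics import NormalDist


def qq_points(residuals):
    """Return (theoretical quantile, sample quantile) pairs for a normal Q-Q plot."""
    ordered = sorted(residuals)  # the sorting step mentioned above
    n = len(ordered)
    std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1
    # Pair the i-th smallest residual with the normal quantile at (i + 0.5) / n.
    return [(std_normal.inv_cdf((i + 0.5) / n), r) for i, r in enumerate(ordered)]


for theoretical, observed in qq_points([0.3, -1.2, 0.8, 0.1, -0.4]):
    print(round(theoretical, 3), observed)
```

The resulting pairs can be handed to any external plotting tool.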

> R-like stats for ML models
> --------------------------
>
>                 Key: SPARK-7674
>                 URL: https://issues.apache.org/jira/browse/SPARK-7674
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>            Reporter: Joseph K. Bradley
>            Assignee: Joseph K. Bradley
>            Priority: Critical
>
> This is an umbrella JIRA for supporting ML model summaries and statistics, 
> following the example of R's summary() and plot() functions.
> [Design 
> doc|https://docs.google.com/document/d/1oswC_Neqlqn5ElPwodlDY4IkSaHAi0Bx6Guo_LvhHK8/edit?usp=sharing]
> From the design doc:
> {quote}
> R and its well-established packages provide extensive functionality for 
> inspecting a model and its results.  This inspection is critical to 
> interpreting, debugging and improving models.
> R is arguably a gold standard for a statistics/ML library, so this doc 
> largely attempts to imitate it.  The challenge we face is supporting similar 
> functionality, but on big (distributed) data.  Data size makes both efficient 
> computation and meaningful displays/summaries difficult.
> R model and result summaries generally take 2 forms:
> * summary(model): Display text with information about the model and results 
> on data
> * plot(model): Display plots about the model and results
> We aim to provide both of these types of information.  Visualization for the 
> plottable results will not be supported in MLlib itself, but we can provide 
> results in a form which can be plotted easily with other tools.
> {quote}


