[ https://issues.apache.org/jira/browse/SPARK-7674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597184#comment-14597184 ]
Joseph K. Bradley commented on SPARK-7674:
------------------------------------------

Yeah, you're right that we'll need to have some as None...or something more Java-friendly. Maybe null + a helper method hasValueX for valueX.

I'd prefer to store the entire dataset in a transient reference. That will permit going beyond random samples, such as filtering to find outliers.

For values requiring sampling, we could either:
* Use a single random seed given to the model/results object upon construction (so those values would be lazy vals), or
** I like this option, though I don't expect to require randomness for the initial PRs.
* Provide those values via methods which take random seeds (so those values would be recomputed, which is nice for flexibility but less nice for lazy eval).

Q-Q plots themselves seem a bit outside the scope though. This class will provide residuals, and the rest of the stuff needed to compute Q-Q plots is pretty basic (sorting). Once MLlib gets fancy and does visualization, that could be cool to add.

> R-like stats for ML models
> --------------------------
>
>                 Key: SPARK-7674
>                 URL: https://issues.apache.org/jira/browse/SPARK-7674
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>            Reporter: Joseph K. Bradley
>            Assignee: Joseph K. Bradley
>            Priority: Critical
>
> This is an umbrella JIRA for supporting ML model summaries and statistics, following the example of R's summary() and plot() functions.
> [Design doc|https://docs.google.com/document/d/1oswC_Neqlqn5ElPwodlDY4IkSaHAi0Bx6Guo_LvhHK8/edit?usp=sharing]
> From the design doc:
> {quote}
> R and its well-established packages provide extensive functionality for inspecting a model and its results. This inspection is critical to interpreting, debugging and improving models.
> R is arguably a gold standard for a statistics/ML library, so this doc largely attempts to imitate it. The challenge we face is supporting similar functionality, but on big (distributed) data.
> Data size makes both efficient computation and meaningful displays/summaries difficult.
> R model and result summaries generally take 2 forms:
> * summary(model): Display text with information about the model and results on data
> * plot(model): Display plots about the model and results
> We aim to provide both of these types of information. Visualization for the plottable results will not be supported in MLlib itself, but we can provide results in a form which can be plotted easily with other tools.
> {quote}
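
The two sampling options from the comment, and the remark that Q-Q plots only need residuals plus sorting, can be illustrated with a small sketch. Everything here is an assumption made for illustration: `RegressionSummary`, its fields, and its method names are hypothetical and are not part of MLlib. It is written in plain Python over in-memory lists, so it does not model the transient dataset reference or distributed data, and `cached_property` merely stands in for a Scala lazy val.

```python
import random
from functools import cached_property
from statistics import NormalDist


class RegressionSummary:
    """Hypothetical results object; illustrative only, not an MLlib API."""

    def __init__(self, labels, predictions, seed=0, sample_size=2):
        self._labels = list(labels)
        self._predictions = list(predictions)
        self._seed = seed                # option 1: seed fixed at construction
        self._sample_size = sample_size

    @cached_property
    def residuals(self):
        # Label minus prediction, computed once on first access.
        return [y - p for y, p in zip(self._labels, self._predictions)]

    @cached_property
    def sampled_residuals(self):
        # Option 1: computed lazily, once, from the construction-time seed.
        rng = random.Random(self._seed)
        return rng.sample(self.residuals, self._sample_size)

    def sample_residuals(self, n, seed):
        # Option 2: the caller supplies the seed; recomputed on every call.
        return random.Random(seed).sample(self.residuals, n)

    def qq_points(self):
        # Pair each sorted residual with a standard normal quantile.
        ordered = sorted(self.residuals)
        n = len(ordered)
        normal = NormalDist()
        return [(normal.inv_cdf((i + 0.5) / n), r)
                for i, r in enumerate(ordered)]


summary = RegressionSummary([1.0, 2.0, 4.0, 7.0], [1.5, 1.5, 4.5, 6.0], seed=42)
print(summary.sampled_residuals)            # same list on every access
print(summary.sample_residuals(2, seed=7))  # recomputed per call
print(summary.qq_points())                  # (theoretical quantile, residual) pairs
```

The trade-off matches the one described in the comment: the construction-time seed gives a value that is computed once and then reused, while the seed-taking method is more flexible but redoes the work on each call.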