[ 
https://issues.apache.org/jira/browse/SPARK-6233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley closed SPARK-6233.
------------------------------------
    Resolution: Not a Problem
      Assignee: Joseph K. Bradley

I'm closing this discussion: after thinking it over further, this is probably not a 
problem, since Transformers operate only in batch mode (on RDDs).  It would 
become an issue if we ever added per-row transformations to this API.

> Should spark.ml Models be distributed by default?
> -------------------------------------------------
>
>                 Key: SPARK-6233
>                 URL: https://issues.apache.org/jira/browse/SPARK-6233
>             Project: Spark
>          Issue Type: Brainstorming
>          Components: ML
>    Affects Versions: 1.4.0
>            Reporter: Joseph K. Bradley
>            Assignee: Joseph K. Bradley
>
> This JIRA is for discussing a potential change for the spark.ml package.
> *Issue*: When an Estimator runs, it often computes helpful side information 
> which is not stored in the returned Model.  (E.g., linear methods have RDDs 
> of residuals.)  It would be nice to have this information by default, rather 
> than having to recompute it.
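> For example, with today's RDD-based API, getting residuals back means an extra 
> pass over the data.  A minimal sketch (assuming an RDD[LabeledPoint] named 
> trainingData is already in scope):
> {code}
> import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}
> import org.apache.spark.rdd.RDD
>
> // trainingData: RDD[LabeledPoint], assumed in scope.
> val model = LinearRegressionWithSGD.train(trainingData, numIterations = 100)
>
> // Any residual-like info computed during training is discarded; recovering
> // it requires another full pass over the distributed data:
> val residuals: RDD[Double] =
>   trainingData.map(p => p.label - model.predict(p.features))
> {code}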
> *Suggestion*: Introduce a DistributedModel trait.  Every Estimator in the 
> spark.ml package should be able to return a distributed model with extra info 
> computed during training.
> *Motivation*: Having this info readily available is one of the most useful 
> aspects of R.  E.g., when you train a linear model there, you can immediately 
> summarize or plot the residuals.  In MLlib, the user currently has to take 
> extra steps (and computation time) to recompute this info.
> *API*: My general idea is as follows.
> {code}
> trait Model
> trait LocalModel extends Model
> trait DistributedModel[LocalModelType <: LocalModel] extends Model {
>   /** Convert to an equivalent local (in-memory) model. */
>   def toLocal: LocalModelType
> }
> class LocalLDAModel extends LocalModel
> class DistributedLDAModel extends DistributedModel[LocalLDAModel] {
>   // In practice this would collect the distributed state to the driver.
>   def toLocal: LocalLDAModel = new LocalLDAModel
> }
> {code}
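> Hypothetical usage of the sketch above (names are illustrative):
> {code}
> // An Estimator would return the distributed form; callers convert once
> // the RDD-backed extras are no longer needed.
> val distributed = new DistributedLDAModel
> val local: LocalLDAModel = distributed.toLocal  // drops references to RDDs
> {code}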
> *Issues with this API*:
> * API stability: To keep the API stable in the future, either (a) all models 
> should return DistributedModels, or (b) all models should return Models, which 
> callers can then test for the LocalModel or DistributedModel trait (see the 
> sketch after this list).
> * memory “leaks”: Users may not expect models to store references to RDDs, so 
> they may be surprised by how much storage is being used.
> * naturally distributed models: Some models will simply be too large to be 
> converted into LocalModels.  It is unclear what to do here.
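> A minimal sketch of option (b) from the API-stability point above (hypothetical 
> helper, building on the earlier API sketch):
> {code}
> // fit() always returns Model; callers inspect the traits at runtime.
> def describe(m: Model): String = m match {
>   case _: DistributedModel[_] => "distributed (RDD-backed extras available)"
>   case _: LocalModel          => "local (plain in-memory model)"
>   case _                      => "unknown model type"
> }
> {code}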
> *Is this worthwhile?*
> Pros:
> * Saving computation
> * Easier for users (skipping 1 more step of computing this info)
> Cons:
> * API issues
> * Limited savings on computation.  In general, computing this info may take 
> much less time than model training (e.g., computing residuals vs. training a 
> GLM).


