Joseph K. Bradley created SPARK-6233:
----------------------------------------
             Summary: Should spark.ml Models be distributed by default?
                 Key: SPARK-6233
                 URL: https://issues.apache.org/jira/browse/SPARK-6233
             Project: Spark
          Issue Type: Brainstorming
          Components: ML
    Affects Versions: 1.4.0
            Reporter: Joseph K. Bradley


This JIRA is for discussing a potential change to the spark.ml package.

*Issue*: When an Estimator runs, it often computes helpful side information that is not stored in the returned Model. (E.g., linear methods produce RDDs of residuals.) It would be nice to have this information available by default, rather than having to recompute it.

*Suggestion*: Introduce a DistributedModel trait. Every Estimator in the spark.ml package should be able to return a distributed model carrying the extra info computed during training.

*Motivation*: This kind of info is one of the most useful aspects of R. E.g., after you train a linear model in R, you can immediately summarize or plot information about the residuals. In MLlib, the user currently has to take extra steps (and spend extra computation time) to recompute this info.

*API*: My general idea is as follows.
{code}
trait Model

trait LocalModel extends Model

trait DistributedModel[LocalModelType <: LocalModel] extends Model {
  /** Convert to a local (non-distributed) model. */
  def toLocal: LocalModelType
}

class LocalLDAModel extends LocalModel

class DistributedLDAModel extends DistributedModel[LocalLDAModel] {
  // Sketch: build the local model from the distributed state.
  override def toLocal: LocalLDAModel = new LocalLDAModel
}
{code}

*Issues with this API*:
* API stability: To keep the API stable in the future, either (a) all models should return DistributedModels, or (b) all models should return Models, which users can then test for the LocalModel or DistributedModel trait (see the usage sketch at the end of this issue).
* Memory "leaks": Users may not expect models to hold references to RDDs, so they may be surprised by how much storage is being used.
* Naturally distributed models: Some models will simply be too large to be converted into LocalModels. It is unclear what to do here.

*Is this worthwhile?*

Pros:
* Saves computation.
* Easier for users (they skip one more step to get this info).

Cons:
* The API issues above.
* Limited savings on computation: in general, computing this info may take much less time than model training (e.g., computing residuals vs. training a GLM).
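To make the trade-off in option (b) concrete, here is a minimal usage sketch against the traits proposed above. It assumes only the names from the {code} block in the *API* section; none of these are existing spark.ml APIs. It shows how user code would test which flavor of model it received and call toLocal to drop the RDD-backed state:

{code}
// Hypothetical caller code for option (b): the estimator returns a plain
// Model, and the user checks which trait it carries. Names come from the
// API sketch above, not from the current spark.ml package.
def summarize(model: Model): Unit = model match {
  case d: DistributedModel[_] =>
    // Extra training-time info (e.g., RDD-backed residuals) would live here.
    // Converting to a local model drops the RDD references once they are
    // no longer needed.
    val local = d.toLocal
    println(s"Distributed model; converted to local: ${local.getClass.getSimpleName}")
  case l: LocalModel =>
    println(s"Local model: ${l.getClass.getSimpleName}")
}

summarize(new DistributedLDAModel)  // Distributed model; converted to local: LocalLDAModel
summarize(new LocalLDAModel)        // Local model: LocalLDAModel
{code}

Under option (a) the match would disappear, but every returned model would hold RDD references until toLocal is called, which is exactly the memory-"leak" concern listed above.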