Joseph K. Bradley created SPARK-6233:
----------------------------------------
             Summary: Should spark.ml Models be distributed by default?
                 Key: SPARK-6233
                 URL: https://issues.apache.org/jira/browse/SPARK-6233
             Project: Spark
          Issue Type: Brainstorming
          Components: ML
    Affects Versions: 1.4.0
            Reporter: Joseph K. Bradley


This JIRA is for discussing a potential change to the spark.ml package.

*Issue*: When an Estimator runs, it often computes helpful side information that is not stored in the returned Model. (E.g., linear methods produce RDDs of residuals.) It would be nice to have this information available by default, rather than having to recompute it.

*Suggestion*: Introduce a DistributedModel trait. Every Estimator in the spark.ml package should be able to return a distributed model carrying the extra info computed during training.

*Motivation*: This kind of info is one of the most useful aspects of R. E.g., after you train a linear model in R, you can immediately summarize or plot information about the residuals. In MLlib, the user currently has to take extra steps (and spend extra computation time) to recompute this info.

*API*: My general idea is as follows.
{code}
trait Model

trait LocalModel extends Model

trait DistributedModel[LocalModelType <: LocalModel] extends Model {
  /** Convert to a local (non-distributed) model. */
  def toLocal: LocalModelType
}

class LocalLDAModel extends LocalModel

class DistributedLDAModel extends DistributedModel[LocalLDAModel] {
  // Sketch: build the local model from the distributed state.
  override def toLocal: LocalLDAModel = new LocalLDAModel
}
{code}

*Issues with this API*:
* API stability: To keep the API stable in the future, either (a) all models should return DistributedModels, or (b) all models should return Models, which users can then test for the LocalModel or DistributedModel trait (see the usage sketch at the end of this issue).
* Memory "leaks": Users may not expect models to hold references to RDDs, so they may be surprised by how much storage is being used.
* Naturally distributed models: Some models will simply be too large to be converted into LocalModels. It is unclear what to do here.

*Is this worthwhile?*

Pros:
* Saves computation.
* Easier for users (they skip one more step to get this info).

Cons:
* The API issues above.
* Limited savings on computation: in general, computing this info may take much less time than model training (e.g., computing residuals vs. training a GLM).
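To make the trade-off in option (b) concrete, here is a minimal usage sketch against the traits proposed above. It assumes only the names from the {code} block in the *API* section; none of these are existing spark.ml APIs. It shows how user code would test which flavor of model it received and call toLocal to drop the RDD-backed state:

{code}
// Hypothetical caller code for option (b): the estimator returns a plain
// Model, and the user checks which trait it carries. Names come from the
// API sketch above, not from the current spark.ml package.
def summarize(model: Model): Unit = model match {
  case d: DistributedModel[_] =>
    // Extra training-time info (e.g., RDD-backed residuals) would live here.
    // Converting to a local model drops the RDD references once they are
    // no longer needed.
    val local = d.toLocal
    println(s"Distributed model; converted to local: ${local.getClass.getSimpleName}")
  case l: LocalModel =>
    println(s"Local model: ${l.getClass.getSimpleName}")
}

summarize(new DistributedLDAModel)  // Distributed model; converted to local: LocalLDAModel
summarize(new LocalLDAModel)        // Local model: LocalLDAModel
{code}

Under option (a) the match would disappear, but every returned model would hold RDD references until toLocal is called, which is exactly the memory-"leak" concern listed above.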