[ https://issues.apache.org/jira/browse/SPARK-19357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16175629#comment-16175629 ]
Joseph K. Bradley edited comment on SPARK-19357 at 9/21/17 11:14 PM:
---------------------------------------------------------------------

[~bryanc], [~nick.pentre...@gmail.com], [~WeichenXu123] Well I feel a bit foolish; I just realized these changes to support parallel model evaluation are going to cause some problems for optimizing multi-model fitting.

* When we originally designed the Pipelines API, we put {{def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[M]}} in {{abstract class Estimator}} so that specific Estimators could eventually override it with algorithm-specific optimizations. E.g., if you're tuning {{maxIter}}, you should really fit only once and save the model at various iterations along the way (see the toy sketch after this comment).
* The recent changes in master to CrossValidator and TrainValidationSplit switched from calling fit() with all of the ParamMaps to calling fit() with a single ParamMap, so this model-specific optimization is no longer possible. Although we haven't found time to do these model-specific optimizations yet, I'd really like for us to be able to in the future. For some models, this could lead to huge speedups (N^2 down to N in the case of maxIter for linear models).

Any ideas for fixing this? Here are my thoughts:

* To allow model-specific optimization, the implementation of fitting with multiple ParamMaps needs to live within the models, not within CrossValidator or other tuning algorithms.
* Therefore, we need to use something like {{def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[M]}}. However, we will need an API that takes the {{parallelism}} Param.
* Since {{Estimator}} is an abstract class, we can add a new method without worrying about breaking APIs across Spark versions, as long as it has a default implementation. So we could add something like:
** {{def fit(dataset: Dataset[_], paramMaps: Array[ParamMap], parallelism: Int): Seq[M]}}
** However, this will not mesh well with our plans for dumping models from CrossValidator to disk during tuning. For that, we would need to be able to pass callbacks, e.g.: {{def fit(dataset: Dataset[_], paramMaps: Array[ParamMap], parallelism: Int, callback: M => ()): Seq[M]}} (or something like that); a possible default implementation is sketched below.

What do you think?
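To make the N^2-to-N point concrete, here is a toy, self-contained illustration (plain Scala, not Spark code; the "trainer" and its step function are stand-ins invented for this sketch). Asked for models at maxIter = 10, 20, ..., 100, refitting from scratch costs 10 + 20 + ... + 100 steps in total, while a single shared run that snapshots along the way costs only max(maxIters) steps.

{code:scala}
object MaxIterSharing {
  type Model = Double // stand-in for a fitted coefficient vector

  // One dummy gradient step; a real Estimator would update its coefficients here.
  def step(w: Model): Model = w - 0.1 * w

  // Naive approach: refit from scratch for every maxIter setting (O(N^2) total steps).
  def fitEach(maxIters: Seq[Int]): Seq[Model] =
    maxIters.map { n => (1 to n).foldLeft(1.0)((w, _) => step(w)) }

  // Shared approach: run once to the largest maxIter, snapshotting a model
  // at each requested iteration count (O(N) total steps).
  def fitShared(maxIters: Seq[Int]): Seq[Model] = {
    val wanted = maxIters.toSet
    val snapshots = scala.collection.mutable.Map.empty[Int, Model]
    var w = 1.0
    for (i <- 1 to maxIters.max) {
      w = step(w)
      if (wanted(i)) snapshots(i) = w
    }
    maxIters.map(snapshots)
  }
}
{code}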
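And here is one possible default implementation for the callback-taking overload, written as a standalone helper. The {{fitMultiple}} name, the signature, and the callback hook are assumptions drawn from the proposal above, not an existing Spark API. It bounds concurrent fits with a thread pool of size {{parallelism}} and invokes the callback as each model finishes, which is where dumping models to disk could hook in.

{code:scala}
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

import org.apache.spark.ml.{Estimator, Model}
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.Dataset

// Hypothetical sketch of a default implementation for the proposed overload.
// An overriding Estimator (e.g., one sharing work across maxIter settings)
// could replace this with an algorithm-specific strategy, keeping the contract.
def fitMultiple[M <: Model[M]](
    estimator: Estimator[M],
    dataset: Dataset[_],
    paramMaps: Array[ParamMap],
    parallelism: Int,
    callback: M => Unit): Seq[M] = {
  require(parallelism >= 1, "parallelism must be >= 1")
  // A dedicated pool bounds how many models are fit concurrently.
  val executor = Executors.newFixedThreadPool(parallelism)
  implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(executor)
  try {
    val futures = paramMaps.toSeq.map { paramMap =>
      Future {
        val model = estimator.fit(dataset, paramMap) // existing single-ParamMap fit
        callback(model) // e.g., dump the model to disk as soon as it is ready
        model
      }
    }
    futures.map(f => Await.result(f, Duration.Inf))
  } finally {
    executor.shutdown()
  }
}
{code}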
> Parallel Model Evaluation for ML Tuning: Scala
> ----------------------------------------------
>
>                 Key: SPARK-19357
>                 URL: https://issues.apache.org/jira/browse/SPARK-19357
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML
>            Reporter: Bryan Cutler
>            Assignee: Bryan Cutler
>             Fix For: 2.3.0
>
>         Attachments: parallelism-verification-test.pdf
>
>
> This is a first step of the parent task, Optimizations for ML Pipeline Tuning: perform model evaluation in parallel. A simple approach is to evaluate naively, with a parameter to control the level of parallelism. There are some concerns with this:
> * excessive caching of datasets
> * what to set as the default level of parallelism. A value of 1 evaluates all models serially, as is done currently; higher values could lead to excessive caching.
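For reference, here is a usage sketch of the parallelism control this issue added for 2.3; the {{training}} DataFrame (with "label" and "features" columns) is assumed:

{code:scala}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val lr = new LogisticRegression()
val grid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.01, 0.1, 1.0))
  .addGrid(lr.maxIter, Array(10, 50, 100))
  .build()

val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(grid)
  .setNumFolds(3)
  .setParallelism(4) // evaluate up to 4 models at a time; 1 (the default) is fully serial

val cvModel = cv.fit(training) // `training` is an assumed DataFrame
{code}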