[ https://issues.apache.org/jira/browse/SPARK-19357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16288993#comment-16288993 ]
Nick Pentreath commented on SPARK-19357:
----------------------------------------

I've thought about this and taken a look at the proposed solution in SPARK-22126 (PR: https://github.com/apache/spark/pull/19350; see my [comment|https://github.com/apache/spark/pull/19350/files#r156599955]).

I don't think the PR solves the problem of a pipeline with stages that have model-specific optimizations. In addition, the API presented there seems a bit convoluted, and implementing a model-specific optimization for a given estimator with it looks quite tricky. I don't see the benefit there of "pushing" the parallel implementation down to {{Estimator}}.

Overall, if we cannot support model-specific optimization for CV in the short term, that seems OK to me, since we don't have any actual implementations and the benefit of parallel CV as it stands far outweighs that cost. We can make a note in the user guide or API docs if necessary.

If we can figure out a clean API to support both, all the better, but until we actually have a significant model-specific optimization implementation it seems like overkill.

I do think Bryan's concept seems cleaner and simpler to implement for specific estimators, so perhaps [~bryanc] is able to work up a WIP PR to illustrate how it would work?

> Parallel Model Evaluation for ML Tuning: Scala
> ----------------------------------------------
>
>                 Key: SPARK-19357
>                 URL: https://issues.apache.org/jira/browse/SPARK-19357
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML
>            Reporter: Bryan Cutler
>            Assignee: Bryan Cutler
>             Fix For: 2.3.0
>
>         Attachments: parallelism-verification-test.pdf
>
>
> This is a first step of the parent task of Optimizations for ML Pipeline
> Tuning to perform model evaluation in parallel. A simple approach is to
> naively evaluate with a possible parameter to control the level of
> parallelism. There are some concerns with this:
> * excessive caching of datasets
> * what to set as the default value for level of parallelism. 1 will evaluate
> all models in serial, as is done currently. Higher values could lead to
> excessive caching.
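
For illustration only, the "parameter to control the level of parallelism" from the issue description can be sketched in a few lines of plain Python. This is not Spark code and not the implementation discussed in either PR: `evaluate_param_grid`, `toy_fit_and_score`, and the `regParam` grid are hypothetical names chosen for the sketch, and the "model" is a toy scoring function. It only shows the shape of the idea, where a value of 1 keeps the existing serial behaviour and higher values bound how many evaluations run at once.

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate_param_grid(fit_and_score, param_grid, parallelism=1):
    """Score every parameter setting, running at most `parallelism` at a time.

    `fit_and_score` is a hypothetical callable standing in for
    "fit a model with these params, then evaluate it".
    """
    if parallelism < 1:
        raise ValueError("parallelism must be >= 1")
    if parallelism == 1:
        # Serial path: one model after another, as is done without the parameter.
        return [fit_and_score(params) for params in param_grid]
    # Bounded pool: never more than `parallelism` evaluations in flight.
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        # pool.map returns results in the same order as param_grid.
        return list(pool.map(fit_and_score, param_grid))


def toy_fit_and_score(params):
    # Toy metric (assumption for the sketch): best score when regParam == 0.1.
    return -((params["regParam"] - 0.1) ** 2)


grid = [{"regParam": r} for r in (0.01, 0.1, 1.0)]
serial_scores = evaluate_param_grid(toy_fit_and_score, grid, parallelism=1)
parallel_scores = evaluate_param_grid(toy_fit_and_score, grid, parallelism=3)

# Pick the parameter setting with the highest score.
best_index = max(range(len(grid)), key=parallel_scores.__getitem__)
best_params = grid[best_index]
```

Capping the pool size is what ties this to the caching concern in the description: with a bound of N, at most N evaluations (and whatever data each one holds on to) are alive at the same time, so the default value trades throughput against memory pressure.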