[ https://issues.apache.org/jira/browse/SPARK-19357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16288993#comment-16288993 ]

Nick Pentreath commented on SPARK-19357:
----------------------------------------

I've thought about this and taken a look at the proposed solution in 
SPARK-22126 (PR: https://github.com/apache/spark/pull/19350; see my 
[comment|https://github.com/apache/spark/pull/19350/files#r156599955]). I don't 
think the PR solves the problem of a pipeline with stages that have 
model-specific optimizations. In addition, the API presented there seems a bit 
convoluted, and it would be quite tricky to implement a model-specific 
optimization for a given estimator with it. I don't see the benefit of 
"pushing" the parallel implementation down to {{Estimator}}.
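
To make the distinction concrete, here is a rough sketch (not from the PR) of 
the hook a model-specific optimization would naturally use: the existing 
multi-model {{fit}} on {{Estimator}}, which by default just fits one model per 
{{ParamMap}}. The {{training}} DataFrame is assumed.

{code:scala}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.tuning.ParamGridBuilder

val lr = new LogisticRegression()
val paramMaps = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.001, 0.01, 0.1))
  .build()

// Estimator.fit(dataset, Array[ParamMap]) defaults to fitting each ParamMap
// independently; an estimator with a model-specific optimization would
// override it to share work (e.g. a single pass over the data for all
// regParam values). `training` is an assumed DataFrame of labelled data.
val models = lr.fit(training, paramMaps)
{code}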

Overall, if we cannot support model-specific optimizations for CV in the short 
term, that seems OK to me, since we don't have any actual implementations yet, 
and the benefit of parallel CV as it stands far outweighs that cost. We can 
make a note in the user guide or API docs if necessary.
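
For reference, a minimal sketch of what parallel CV would look like from the 
user side (assuming the param lands as {{parallelism}} with a 
{{setParallelism}} setter, as in the current work, and a {{training}} DataFrame 
with label/features columns):

{code:scala}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val lr = new LogisticRegression()
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.001, 0.01, 0.1))
  .addGrid(lr.elasticNetParam, Array(0.0, 0.5))
  .build()

val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)
  // 1 = fit models serially (current behaviour); higher values fit up to
  // that many models concurrently within each fold.
  .setParallelism(4)

val cvModel = cv.fit(training)
{code}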

If we can figure out a clean API to support both, all the better, but until we 
actually have a significant model-specific optimization implementation it seems 
like overkill. I do think Bryan's concept seems cleaner and simpler to 
implement for specific estimators, so perhaps [~bryanc] could work up a WIP PR 
to illustrate how it would work?


> Parallel Model Evaluation for ML Tuning: Scala
> ----------------------------------------------
>
>                 Key: SPARK-19357
>                 URL: https://issues.apache.org/jira/browse/SPARK-19357
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML
>            Reporter: Bryan Cutler
>            Assignee: Bryan Cutler
>             Fix For: 2.3.0
>
>         Attachments: parallelism-verification-test.pdf
>
>
> This is the first step of the parent task, Optimizations for ML Pipeline 
> Tuning, to perform model evaluation in parallel. A simple approach is to 
> evaluate models naively, with a parameter to control the level of 
> parallelism. There are some concerns with this:
> * excessive caching of datasets
> * what to set as the default value for the level of parallelism. A value of 
> 1 will evaluate all models serially, as is done currently; higher values 
> could lead to excessive caching.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
