[ https://issues.apache.org/jira/browse/SPARK-19357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16176773#comment-16176773 ]

Bryan Cutler commented on SPARK-19357:
--------------------------------------

[~josephkb] I think trying to push the parallelism down to the estimators might 
end up making things difficult. Each model-specific optimization would have to 
implement its own kind of parallelization, and for pipelines it could get really 
messy.  As [~WeichenXu123] pointed out, there could be memory problems too.

It should be possible to keep the current parallelism and still allow for 
model-specific optimizations.  For example, say we are doing cross validation 
with {{regParam = (0.1, 0.3)}} and {{maxIter = (5, 10)}} in the param grid.  
Suppose the cross validator could know that maxIter is optimized for the model 
being evaluated (e.g. via a new method in Estimator that returns such params).  
It would then be straightforward for the cross validator to remove maxIter from 
the param grid that is parallelized over and use it to create 2 arrays of 
paramMaps: {{((regParam=0.1, maxIter=5), (regParam=0.1, maxIter=10))}} and 
{{((regParam=0.3, maxIter=5), (regParam=0.3, maxIter=10))}}.  It could then fit 
these 2 in parallel with calls to {{def fit(dataset: Dataset\[_\], paramMaps: 
Array\[ParamMap\]): Seq\[M\]}}, letting each call optimize over maxIter 
internally.  A rough sketch follows.
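
To make that concrete, here is a rough sketch of how the cross validator could 
group the param grid by the non-optimized params and then fit each group with a 
single multi-paramMap call (the helper name is made up, and it keys on param 
names for simplicity):

{code:scala}
import org.apache.spark.ml.param.ParamMap

// Group the full grid so that maps differing only in model-optimized params
// (e.g. maxIter) land in the same group and can be fit in a single pass.
def groupByNonOptimized(
    paramMaps: Array[ParamMap],
    optimizedNames: Set[String]): Array[Array[ParamMap]] = {
  paramMaps.groupBy { pm =>
    // Key on the (name, value) pairs of all params that are NOT
    // model-optimized; maps with the same key differ only in e.g. maxIter.
    pm.toSeq
      .filterNot(pair => optimizedNames.contains(pair.param.name))
      .sortBy(_.param.name)
      .map(pair => (pair.param.name, pair.value.toString))
  }.values.toArray
}
{code}

With the grid above this yields exactly the two arrays of paramMaps, and each 
array then goes to one {{fit(dataset, paramMaps)}} call.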

Hopefully that makes sense.  In short, it would require some simple changes to 
CrossValidator plus something like a new method on {{Estimator}} that returns 
the list of model-specific optimized params, e.g. {{def getOptimizedParams(): 
Array\[Param\[_\]\] = Array.empty}}, which could be overridden as required.
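
As a sketch, that hook could be a small trait mixed into Estimator (the trait 
name here is made up):

{code:scala}
import org.apache.spark.ml.param.Param

trait HasOptimizedParams {
  // Default: the estimator has no model-specific optimized params.
  def getOptimizedParams(): Array[Param[_]] = Array.empty
}

// An iterative estimator could then advertise what it optimizes internally,
// e.g. a LogisticRegression-like estimator might override it as:
//   override def getOptimizedParams(): Array[Param[_]] = Array(maxIter)
{code}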

> Parallel Model Evaluation for ML Tuning: Scala
> ----------------------------------------------
>
>                 Key: SPARK-19357
>                 URL: https://issues.apache.org/jira/browse/SPARK-19357
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML
>            Reporter: Bryan Cutler
>            Assignee: Bryan Cutler
>             Fix For: 2.3.0
>
>         Attachments: parallelism-verification-test.pdf
>
>
> This is the first step of the parent task, Optimizations for ML Pipeline 
> Tuning, to perform model evaluation in parallel.  A simple approach is to 
> evaluate the models naively, with a parameter to control the level of 
> parallelism.  There are some concerns with this:
> * excessive caching of datasets
> * what to set as the default level of parallelism: 1 evaluates all models 
> serially, as is done currently, while higher values could lead to excessive 
> caching.
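
For reference, a minimal sketch of the naive approach described above, assuming 
a thread pool sized by the parallelism setting ({{est}}, {{eval}}, 
{{paramMaps}}, and the train/validation datasets are assumed to exist):

{code:scala}
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// parallelism = 1 reproduces the current serial behavior; larger values
// fit and evaluate more models concurrently (and cache more at once).
val parallelism = 2
implicit val ec: ExecutionContext =
  ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(parallelism))

// Fit and evaluate one model per paramMap, up to `parallelism` at a time.
val metricFutures = paramMaps.map { pm =>
  Future {
    val model = est.fit(training, pm)
    eval.evaluate(model.transform(validation, pm))
  }
}
val metrics = metricFutures.map(f => Await.result(f, Duration.Inf))
{code}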


