[ https://issues.apache.org/jira/browse/SPARK-19357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16176773#comment-16176773 ]
Bryan Cutler commented on SPARK-19357:
--------------------------------------

[~josephkb] I think trying to push the parallelism down to the estimators might end up making things difficult. Each model-specific optimization would have to implement some kind of parallelization, and for pipelines it could get really messy. As [~WeichenXu123] pointed out, there could be memory problems too.

It could be possible to keep the current parallelism and still allow for model-specific optimizations. For example, suppose we are doing cross validation with a param grid of {{regParam = (0.1, 0.3)}} and {{maxIter = (5, 10)}}. Let's say the cross validator knows that maxIter is optimized for the model being evaluated (e.g. via a new method in Estimator that returns such params). It would then be straightforward for the cross validator to remove maxIter from the params being parallelized over and use it to create 2 arrays of param maps: {{((regParam=0.1, maxIter=5), (regParam=0.1, maxIter=10))}} and {{((regParam=0.3, maxIter=5), (regParam=0.3, maxIter=10))}}. It could then fit these 2 arrays in parallel with calls to {{def fit(dataset: Dataset\[_\], paramMaps: Array\[ParamMap\]): Seq\[M\]}}. Hopefully that makes sense.

In short, it would require some simple changes to CrossValidator plus something like a new method in {{Estimator}} that returns the list of model-specific optimized params, e.g. {{def getOptimizedParams(): Array\[Param\[_\]\] = Array.empty}}, which could be overridden as required. A rough Scala sketch of the grouping step follows at the end of this message.

> Parallel Model Evaluation for ML Tuning: Scala
> ----------------------------------------------
>
>                 Key: SPARK-19357
>                 URL: https://issues.apache.org/jira/browse/SPARK-19357
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML
>            Reporter: Bryan Cutler
>            Assignee: Bryan Cutler
>             Fix For: 2.3.0
>
>         Attachments: parallelism-verification-test.pdf
>
>
> This is a first step of the parent task, Optimizations for ML Pipeline Tuning, to perform model evaluation in parallel. A simple approach is to evaluate models naively, with a parameter to control the level of parallelism. There are some concerns with this:
> * excessive caching of datasets
> * what to set as the default level of parallelism. A value of 1 will evaluate all models serially, as is done currently; higher values could lead to excessive caching.
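To make the grouping step concrete, here is a minimal Scala sketch. Note that {{getOptimizedParams}} is the hypothetical new method proposed above (not existing API), and the helper {{groupByNonOptimized}} is just for illustration; only the {{fit(dataset, paramMaps)}} overload already exists on {{Estimator}}.

{code:scala}
import org.apache.spark.ml.param.{Param, ParamMap}

// Group the full param grid by the values of the non-optimized params.
// Each resulting Array[ParamMap] can be handed to a single call of
// fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[M], and the
// groups themselves can be fit in parallel by the cross validator.
def groupByNonOptimized(
    grid: Array[ParamMap],
    optimized: Array[Param[_]]): Array[Array[ParamMap]] = {
  val optimizedSet: Set[Param[_]] = optimized.toSet
  grid
    .groupBy { pm =>
      // Key each ParamMap by its non-optimized (param, value) pairs;
      // ParamPair is a case class, so structural equality applies.
      pm.toSeq.filterNot(pair => optimizedSet.contains(pair.param)).toSet
    }
    .values
    .toArray
}
{code}

With the {{regParam = (0.1, 0.3)}}, {{maxIter = (5, 10)}} grid above and {{maxIter}} reported as optimized, this yields the two 2-element arrays from the example, so the cross validator parallelizes over 2 fits instead of 4.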