[ https://issues.apache.org/jira/browse/SPARK-18755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16247910#comment-16247910 ]
Ganesh Sivalingam commented on SPARK-18755:
-------------------------------------------

[~yuhaoyan] No problem, I do have some things to add:

If you have a look in the scikit-learn code base, the {{RandomizedSearchCV}} and {{GridSearchCV}} classes are exactly the same, except in the way they handle the incoming parameter distributions. {{GridSearchCV}} does the same thing as {{ParamGridBuilder.build()}}, and {{RandomizedSearchCV}} does the equivalent of what {{RandomParamGridBuilder.build()}} (which I just submitted) does. Once the parameter sets have been created, they both use {{BaseSearchCV}} for everything else, and this does the same as the current Spark {{CrossValidator}} class.

I could create a {{RandomSearchCrossValidator}} class using the logic in {{RandomParamGridBuilder}} if you like?

I will also be available for doing benchmarking.

> Add Randomized Grid Search to Spark ML
> --------------------------------------
>
>                 Key: SPARK-18755
>                 URL: https://issues.apache.org/jira/browse/SPARK-18755
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>            Reporter: yuhao yang
>
> Randomized Grid Search implements a randomized search over parameters, where
> each setting is sampled from a distribution over possible parameter values.
> This has two main benefits over an exhaustive search:
> 1. A budget can be chosen independent of the number of parameters and
> possible values.
> 2. Adding parameters that do not influence the performance does not decrease
> efficiency.
> Randomized Grid Search usually gives results similar to an exhaustive search,
> while the run time for randomized search is drastically lower.
> For more background, please refer to:
> sklearn: http://scikit-learn.org/stable/modules/grid_search.html
> http://blog.kaggle.com/2015/07/16/scikit-learn-video-8-efficiently-searching-for-optimal-tuning-parameters/
> http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf
> https://www.r-bloggers.com/hyperparameter-optimization-in-h2o-grid-search-random-search-and-the-future/
> There are two ways to implement this in Spark as I see it:
> 1. Add searchRatio to ParamGridBuilder and conduct sampling directly during
> build. Only 1 new public function is required.
> 2. Add trait RandomizedSearch and create new classes RandomizedCrossValidator
> and RandomizedTrainValidationSplit, which can be complicated since we need to
> deal with the models.
> I'd prefer option 1 as it's much simpler and more straightforward. We can support
> Randomized Grid Search with minimal changes.
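
To make the two approaches discussed in this thread concrete, here is a rough sketch in plain Python (standard library only, not Spark or scikit-learn code). It is not the submitted {{RandomParamGridBuilder}}. The function names {{build_grid}}, {{sample_grid}} and {{sample_distributions}} are made up for illustration, and {{search_ratio}} simply mirrors the {{searchRatio}} idea from option 1 in the issue description. How rounding and the minimum sample size are handled is an assumption.

```python
import itertools
import random


def build_grid(param_values):
    """Exhaustive grid: every combination of the listed values."""
    names = sorted(param_values)
    combos = itertools.product(*(param_values[n] for n in names))
    return [dict(zip(names, combo)) for combo in combos]


def sample_grid(param_values, search_ratio, seed=None):
    """Option 1 style: keep a random fraction of the full grid.

    Assumption: at least one setting is kept whenever the grid is non-empty.
    """
    if not 0.0 < search_ratio <= 1.0:
        raise ValueError("search_ratio must be in (0, 1]")
    grid = build_grid(param_values)
    if not grid:
        return []
    k = max(1, round(search_ratio * len(grid)))
    # Sampling without replacement, so no setting is evaluated twice.
    return random.Random(seed).sample(grid, k)


def sample_distributions(samplers, n_iter, seed=None):
    """RandomizedSearchCV style: draw n_iter settings, one value per parameter.

    Each sampler is a callable taking a random.Random and returning a value.
    """
    rng = random.Random(seed)
    names = sorted(samplers)
    return [{n: samplers[n](rng) for n in names} for _ in range(n_iter)]


if __name__ == "__main__":
    params = {"regParam": [0.01, 0.1, 1.0], "maxIter": [10, 50, 100, 200]}
    # 12 combinations in the full grid; a ratio of 0.25 keeps 3 of them.
    for setting in sample_grid(params, search_ratio=0.25, seed=42):
        print(setting)

    # Log-uniform draw for regParam, uniform integer draw for maxIter.
    samplers = {
        "regParam": lambda rng: 10 ** rng.uniform(-3, 0),
        "maxIter": lambda rng: rng.randint(10, 200),
    }
    for setting in sample_distributions(samplers, n_iter=3, seed=42):
        print(setting)
```

Either list of parameter sets could then be handed to the same cross-validation loop, which is the point made above about {{BaseSearchCV}} and {{CrossValidator}}: only the way the candidate settings are generated differs. Note that the first variant still materialises the full grid before sampling, so the budget is a fraction of the grid size, while the second takes an absolute number of iterations.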