[ https://issues.apache.org/jira/browse/SPARK-19979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15953820#comment-15953820 ]

Bryan Cutler commented on SPARK-19979:
--------------------------------------

From the discussion in the PR:

{noformat}
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.classification.{DecisionTreeClassifier, LogisticRegression}
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
val hashingTF = new HashingTF()
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")
val lr = new LogisticRegression()
  .setMaxIter(10)
val dt = new DecisionTreeClassifier()
  .setMaxDepth(5)
val pipeline = new Pipeline()

val pipeline1: Array[PipelineStage] = Array(tokenizer, hashingTF, lr)
val pipeline2: Array[PipelineStage] = Array(tokenizer, hashingTF, dt)

val pipeline1_grid = new ParamGridBuilder()
  .baseOn(pipeline.stages -> pipeline1)
  .addGrid(hashingTF.numFeatures, Array(10, 100, 1000))
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .build()

val pipeline2_grid = new ParamGridBuilder()
  .baseOn(pipeline.stages -> pipeline2)
  .addGrid(hashingTF.numFeatures, Array(10, 100, 1000))
  .build()

val paramGrid = pipeline1_grid ++ pipeline2_grid

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(2)  // Use 3+ in practice
{noformat}
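The combined grid then goes through the usual fit-and-select flow. A minimal sketch, assuming a `training` DataFrame with "text" and "label" columns is already in scope (`training` and the column names are assumptions, not part of the snippet above):

{noformat}
import org.apache.spark.ml.PipelineModel

// `training` is an assumed DataFrame with "text" and "label" columns.
// cv.fit evaluates every parameter map from both sub-grids over the same
// cached folds and returns the best model across both pipelines.
val cvModel = cv.fit(training)

// Inspect which pipeline won, e.g. whether the last stage is the
// LogisticRegression or the DecisionTreeClassifier variant.
val bestStages = cvModel.bestModel.asInstanceOf[PipelineModel].stages
println(bestStages.mkString(", "))
{noformat}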

[~josephkb] [~mlnick] would this be good to add to the documentation?

> [MLLIB] Multiple Estimators/Pipelines In CrossValidator
> -------------------------------------------------------
>
>                 Key: SPARK-19979
>                 URL: https://issues.apache.org/jira/browse/SPARK-19979
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 2.1.0
>            Reporter: David Leifker
>
> Update CrossValidator and TrainValidationSplit to accept multiple pipelines 
> and grid parameters, so that different algorithms can be compared and tuning 
> combinations can be controlled more precisely. The change maintains a 
> backward-compatible API and reads legacy serialized objects.
> The same could be done with an external iterative approach: build the 
> different pipelines, feed each into its own CrossValidator, take the best 
> model from each of those CrossValidators, and finally pick the best of 
> those. This is the initial approach I explored. It resulted in a lot of 
> boilerplate code that felt like it shouldn't need to exist if the API simply 
> allowed arrays of estimators and their parameters.
> Keeping the functional interface to the CrossValidator gives this 
> implementation a couple of advantages worth considering.
> 1. The caching of the folds is better utilized. An external iterative 
> approach creates a new set of k folds for each CrossValidator fit, and the 
> folds are discarded after each run. In this implementation, a single set of 
> k folds is created and cached for all of the pipelines.
> 2. It also opens the door to future parallelization of the pipelines within 
> the CrossValidator. Parallelization could of course be handled outside the 
> CrossValidator, but I believe there is already work in progress to 
> parallelize over the grid parameters, and that could be extended to 
> multiple pipelines.
> Both of those behind-the-scenes optimizations are possible because the 
> CrossValidator is given the data and the complete set of 
> pipelines/estimators up front, which allows the implementation to be 
> abstracted away.
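For contrast, the external iterative approach described above might look roughly like the following. This is a hedged sketch, not code from the issue: `tokenizer`, `hashingTF`, `lr`, and `dt` are the stages from the comment above, while `training` is an assumed DataFrame and the other names are hypothetical.

{noformat}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// Sketch of the external alternative: wrap each candidate pipeline in its
// own CrossValidator and pick the best resulting model by its metric.
// Note that each fit() call builds and discards its own set of k folds.
val candidates = Seq(
  new Pipeline().setStages(Array(tokenizer, hashingTF, lr)),
  new Pipeline().setStages(Array(tokenizer, hashingTF, dt))
)
val models = candidates.map { pl =>
  new CrossValidator()
    .setEstimator(pl)
    .setEvaluator(new BinaryClassificationEvaluator)
    .setEstimatorParamMaps(new ParamGridBuilder()
      .addGrid(hashingTF.numFeatures, Array(10, 100, 1000))
      .build())
    .setNumFolds(2)  // Use 3+ in practice
    .fit(training)   // `training` is an assumed DataFrame
}
val bestModel = models.maxBy(_.avgMetrics.max).bestModel
{noformat}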



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
