[ https://issues.apache.org/jira/browse/SPARK-19979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15953820#comment-15953820 ]
Bryan Cutler commented on SPARK-19979:
--------------------------------------

From the discussion in the PR:

{noformat}
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
val hashingTF = new HashingTF()
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")
val lr = new LogisticRegression()
  .setMaxIter(10)
val dt = new DecisionTreeClassifier()
  .setMaxDepth(5)
val pipeline = new Pipeline()

val pipeline1: Array[PipelineStage] = Array(tokenizer, hashingTF, lr)
val pipeline2: Array[PipelineStage] = Array(tokenizer, hashingTF, dt)

val pipeline1_grid = new ParamGridBuilder()
  .baseOn(pipeline.stages -> pipeline1)
  .addGrid(hashingTF.numFeatures, Array(10, 100, 1000))
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .build()
val pipeline2_grid = new ParamGridBuilder()
  .baseOn(pipeline.stages -> pipeline2)
  .addGrid(hashingTF.numFeatures, Array(10, 100, 1000))
  .build()
val paramGrid = pipeline1_grid ++ pipeline2_grid

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(2)  // Use 3+ in practice
{noformat}

[~josephkb] [~mlnick] would this be good to add to the documentation?

> [MLLIB] Multiple Estimators/Pipelines In CrossValidator
> -------------------------------------------------------
>
>                 Key: SPARK-19979
>                 URL: https://issues.apache.org/jira/browse/SPARK-19979
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 2.1.0
>            Reporter: David Leifker
>
> Update CrossValidator and TrainValidationSplit to be able to accept multiple
> pipelines and grid parameters, for testing different algorithms and/or
> better controlling tuning combinations. Maintains a backwards-compatible API
> and reads legacy serialized objects.
> The same could be done using an external iterative approach: build different
> pipelines, throw each into its own CrossValidator, and then take the best
> model from each of those CrossValidators.
> Then finally pick the best from those. This is the initial approach I
> explored. It resulted in a lot of boilerplate code that felt like it
> shouldn't need to exist if the API simply allowed arrays of estimators and
> their parameters.
> A couple of advantages of this implementation come from keeping the
> functional interface to the CrossValidator:
> 1. The caching of the folds is better utilized. An external iterative
> approach creates a new set of k folds for each CrossValidator fit, and the
> folds are discarded after each CrossValidator run. In this implementation, a
> single set of k folds is created and cached for all of the pipelines.
> 2. A potential advantage of this implementation is future parallelization of
> the pipelines within the CrossValidator. It is of course possible to handle
> the parallelization outside of the CrossValidator here too; however, I
> believe there is already work in progress to parallelize the grid
> parameters, and that could be extended to multiple pipelines.
> Both of those behind-the-scenes optimizations are possible because the
> CrossValidator is given the data and the complete set of
> pipelines/estimators to evaluate up front, allowing the implementation to be
> abstracted away.
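For contrast, the external iterative approach described in the quoted paragraph could be sketched roughly as below. This is only an illustration of the boilerplate being argued against, not code from the issue: it reuses `tokenizer`, `hashingTF`, `lr`, and `dt` from Bryan's example, and `training` and the per-pipeline grids (`lrGrid`, `dtGrid`) are assumed to exist. It also assumes a larger-is-better metric, as with `BinaryClassificationEvaluator`'s default areaUnderROC.

{noformat}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.CrossValidator

val evaluator = new BinaryClassificationEvaluator

// One (pipeline, grid) pair per candidate algorithm.
val candidates = Seq(
  (new Pipeline().setStages(Array(tokenizer, hashingTF, lr)), lrGrid),
  (new Pipeline().setStages(Array(tokenizer, hashingTF, dt)), dtGrid))

// Fit a separate CrossValidator per pipeline. Each fit creates and then
// discards its own set of k folds -- the caching drawback noted above.
val cvModels = candidates.map { case (pipe, grid) =>
  new CrossValidator()
    .setEstimator(pipe)
    .setEvaluator(evaluator)
    .setEstimatorParamMaps(grid)
    .setNumFolds(2)  // Use 3+ in practice
    .fit(training)
}

// Finally, pick the winner among the per-pipeline best models by their
// best average cross-validation metric (larger is better here).
val best = cvModels.maxBy(_.avgMetrics.max)
{noformat}

With the proposed API, the `candidates` loop and the manual comparison collapse into a single CrossValidator fed the combined parameter grid, as in the example above.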