There are three different regParams defined in the grid and there are tree folds. For simplicity, we didn't split the dataset into three and reuse them, but do the split for each fold. Then we need to cache 3*3 times. Note that the pipeline API is not yet optimized for performance. It would be nice to optimize its perforamnce in 1.4. -Xiangrui
On Wed, Feb 11, 2015 at 11:13 AM, Peter Rudenko <petro.rude...@gmail.com> wrote: > Hi i have a problem. Using spark 1.2 with Pipeline + GridSearch + > LogisticRegression. I’ve reimplemented LogisticRegression.fit method and > comment out instances.unpersist() > > |override def fit(dataset:SchemaRDD, > paramMap:ParamMap):LogisticRegressionModel = { > println(s"Fitting dataset ${dataset.take(1000).toSeq.hashCode()} with > ParamMap $paramMap.") > transformSchema(dataset.schema, paramMap, logging =true) > import dataset.sqlContext._ > val map = this.paramMap ++ paramMap > val instances = dataset.select(map(labelCol).attr, > map(featuresCol).attr) > .map { > case Row(label:Double, features:Vector) => > LabeledPoint(label, features) > } > > if (instances.getStorageLevel ==StorageLevel.NONE) { > println("Instances not persisted") > instances.persist(StorageLevel.MEMORY_AND_DISK) > } > > val lr = (new LogisticRegressionWithLBFGS) > .setValidateData(false) > .setIntercept(true) > lr.optimizer > .setRegParam(map(regParam)) > .setNumIterations(map(maxIter)) > val lrm = new LogisticRegressionModel(this, map, > lr.run(instances).weights) > //instances.unpersist() > // copy model params > Params.inheritValues(map,this, lrm) > lrm > } > | > > CrossValidator feeds the same SchemaRDD for each parameter (same hash code), > but somewhere cache being flushed. The memory is enough. Here’s the output: > > |Fitting dataset 2051470010 with ParamMap { > DRLogisticRegression-f35ae4d3-regParam: 0.1 > }. > Instances not persisted > Fitting dataset 2051470010 with ParamMap { > DRLogisticRegression-f35ae4d3-regParam: 0.01 > }. > Instances not persisted > Fitting dataset 2051470010 with ParamMap { > DRLogisticRegression-f35ae4d3-regParam: 0.001 > }. > Instances not persisted > Fitting dataset 802615223 with ParamMap { > DRLogisticRegression-f35ae4d3-regParam: 0.1 > }. > Instances not persisted > Fitting dataset 802615223 with ParamMap { > DRLogisticRegression-f35ae4d3-regParam: 0.01 > }. > Instances not persisted > | > > I have 3 parameters in GridSearch and 3 folds for CrossValidation: > > | > val paramGrid = new ParamGridBuilder() > .addGrid(model.regParam,Array(0.1,0.01,0.001)) > .build() > > crossval.setEstimatorParamMaps(paramGrid) > crossval.setNumFolds(3) > | > > I assume that the data should be read and cached 3 times (1 to > numFolds).combinations(2) and be independent from number of parameters. But > i have 9 times data being read and cached. > > Thanks, > Peter Rudenko > > --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org