[ https://issues.apache.org/jira/browse/SPARK-29824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean R. Owen resolved SPARK-29824.
----------------------------------
    Resolution: Duplicate

> Missing persist on trainDataset in ml.classification.GBTClassifier.train()
> --------------------------------------------------------------------------
>
>                 Key: SPARK-29824
>                 URL: https://issues.apache.org/jira/browse/SPARK-29824
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML
>    Affects Versions: 2.4.3
>            Reporter: Dong Wang
>            Priority: Major
>
> The RDD trainDataset in ml.classification.GBTClassifier.train() is used first
> by one action and then by further actions in
> GradientBoostedTrees.run/runWithValidation, but it is not persisted, so
> trainDataset is recomputed for each of those actions.
> {code:scala}
>   override protected def train(
>       dataset: Dataset[_]): GBTClassificationModel = instrumented { instr =>
>     val categoricalFeatures: Map[Int, Int] =
>       MetadataUtils.getCategoricalFeatures(dataset.schema($(featuresCol)))
>     ...
>     val numFeatures = trainDataset.first().features.size // first use of trainDataset
>     ...
>     // trainDataset will be used by other actions in the run methods.
>     val (baseLearners, learnerWeights) = if (withValidation) {
>       GradientBoostedTrees.runWithValidation(trainDataset, validationDataset,
>         boostingStrategy, $(seed), $(featureSubsetStrategy))
>     } else {
>       GradientBoostedTrees.run(trainDataset, boostingStrategy, $(seed),
>         $(featureSubsetStrategy))
>     }
> {code}
> This issue was reported by our tool CacheCheck, which dynamically detects
> misuses of the persist()/unpersist() API.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
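
For illustration, the kind of fix this report asks for can be sketched as follows. This is a minimal sketch under stated assumptions, not the change that resolved the duplicate issue: the object name PersistSketch, the helper mapCalls, and the toy dataset are hypothetical, and the check against StorageLevel.NONE is an assumed guard so that an input the caller already cached is not persisted a second time. An accumulator counts how often the upstream map function runs across two actions, with and without persist().

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Hypothetical sketch: not taken from GBTClassifier or from the resolving patch.
object PersistSketch {

  // Runs two actions on the same RDD and returns how many times the
  // upstream map function was invoked in total.
  def mapCalls(spark: SparkSession, persist: Boolean): Long = {
    val sc = spark.sparkContext
    val calls = sc.longAccumulator("mapCalls")

    // Stand-in for trainDataset: an RDD with a non-trivial lineage.
    val trainDataset = sc.parallelize(1 to 1000, numSlices = 4).map { x =>
      calls.add(1L)
      x.toDouble
    }

    // Only persist (and later unpersist) if nobody has cached the RDD already.
    val handlePersistence =
      persist && trainDataset.getStorageLevel == StorageLevel.NONE
    if (handlePersistence) trainDataset.persist(StorageLevel.MEMORY_AND_DISK)

    try {
      trainDataset.first() // first action, like trainDataset.first().features.size
      trainDataset.count() // later action, standing in for the run methods
    } finally {
      if (handlePersistence) trainDataset.unpersist()
    }
    calls.value
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("persist-sketch")
      .getOrCreate()
    try {
      println(s"without persist: ${mapCalls(spark, persist = false)} map calls")
      println(s"with persist:    ${mapCalls(spark, persist = true)} map calls")
    } finally {
      spark.stop()
    }
  }
}
```

Run locally, the persisted variant should report fewer map invocations, because the second action reads cached partitions instead of re-running the lineage. Accumulator updates made inside transformations can be counted more than once if a task is retried, so the counts are indicative rather than exact.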