[ https://issues.apache.org/jira/browse/SPARK-29810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16970909#comment-16970909 ]
Aman Omer commented on SPARK-29810:
-----------------------------------

Thanks [~spark_cachecheck] for reporting. I will raise a PR for this.

> Missing persist on retaggedInput in RandomForest.run()
> ------------------------------------------------------
>
>                 Key: SPARK-29810
>                 URL: https://issues.apache.org/jira/browse/SPARK-29810
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 2.4.3
>            Reporter: Dong Wang
>            Priority: Major
>
> The RDD retaggedInput should be persisted in ml.tree.impl.RandomForest.run(),
> because it is used by more than one action.
> {code:scala}
>   def run(
>       input: RDD[LabeledPoint],
>       strategy: OldStrategy,
>       numTrees: Int,
>       featureSubsetStrategy: String,
>       seed: Long,
>       instr: Option[Instrumentation],
>       prune: Boolean = true, // exposed for testing only, real trees are always pruned
>       parentUID: Option[String] = None): Array[DecisionTreeModel] = {
>     val timer = new TimeTracker()
>     timer.start("total")
>     timer.start("init")
>     val retaggedInput = input.retag(classOf[LabeledPoint]) // it needs to be persisted
> {code}
> This issue is reported by our tool CacheCheck, which is used to dynamically
> detect persist()/unpersist() API misuses.
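
For illustration only, the persist-then-unpersist pattern that the report asks for might look like the sketch below. This is not the patch for this issue. PersistSketch and withPersisted are hypothetical names, and the MEMORY_AND_DISK storage level is an assumption. The sketch uses only public RDD calls (getStorageLevel, persist, unpersist) and needs spark-core on the classpath.

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object PersistSketch {

  // Hypothetical helper: persist `rdd` only if the caller has not already
  // done so, run `body`, then release the cache that was taken here.
  def withPersisted[T, R](rdd: RDD[T])(body: RDD[T] => R): R = {
    val persistedHere = rdd.getStorageLevel == StorageLevel.NONE
    if (persistedHere) rdd.persist(StorageLevel.MEMORY_AND_DISK) // assumed level
    try body(rdd)
    finally if (persistedHere) rdd.unpersist()
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("persist-sketch")
    val sc = new SparkContext(conf)
    try {
      // Stand-in for an input RDD with a non-trivial lineage.
      val input = sc.parallelize(1 to 1000).map(_ * 2)

      // Two actions on the same RDD: without persist, the map above
      // would be recomputed for each of them.
      val (n, total) = withPersisted(input) { r =>
        (r.count(), r.map(_.toLong).reduce(_ + _))
      }
      println(s"count=$n sum=$total")
    } finally {
      sc.stop()
    }
  }
}
{code}

Checking getStorageLevel first leaves an RDD that the caller has already cached untouched, so the helper only unpersists what it persisted itself.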