Dong Wang created SPARK-29856:
---------------------------------

             Summary: Conditional unnecessary persist on RDDs in ML algorithms
                 Key: SPARK-29856
                 URL: https://issues.apache.org/jira/browse/SPARK-29856
             Project: Spark
          Issue Type: Improvement
          Components: ML, MLlib
    Affects Versions: 3.0.0
            Reporter: Dong Wang
When I run example.ml.GradientBoostedTreeRegressorExample, I find that the RDD _{color:#DE350B}baggedInput{color}_ in _ml.tree.impl.RandomForest.run()_ is persisted but used by only one action, so in this run the persist operation is unnecessary.

{code:scala}
    val baggedInput = BaggedPoint
      .convertToBaggedRDD(treeInput, strategy.subsamplingRate, numTrees,
        withReplacement, (tp: TreePoint) => tp.weight, seed = seed)
      .persist(StorageLevel.MEMORY_AND_DISK)
    ...
    while (nodeStack.nonEmpty) {
      ...
      timer.start("findBestSplits")
      RandomForest.findBestSplits(baggedInput, metadata, topNodesForGroup,
        nodesForGroup, treeToNodeToIndexInfo, splits, nodeStack, timer, nodeIdCache)
      timer.stop("findBestSplits")
    }
    baggedInput.unpersist()
{code}

However, the action on {color:#DE350B}_baggedInput_{color} sits inside a while loop. In GradientBoostedTreeRegressorExample the loop executes only once, so only one action uses {color:#DE350B}_baggedInput_{color}. In most ML applications the loop executes many times, which means {color:#DE350B}_baggedInput_{color} is used by many actions, and then the persist is necessary. That is why the persist operation is "conditionally" unnecessary.

The same situation exists in many other ML algorithms, e.g., the RDD {color:#DE350B}_instances_{color} in ml.clustering.KMeans.fit() and the RDD {color:#DE350B}_indices_{color} in mllib.clustering.BisectingKMeans.run().

This issue was reported by our tool CacheCheck, which dynamically detects persist()/unpersist() API misuses.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
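One possible direction (a minimal sketch only, not Spark's actual fix): guard the persist with a flag derived from the expected number of loop iterations, following the `handlePersistence` pattern already used elsewhere in Spark ML. The `expectedIterations` value here is hypothetical; how the caller would know the iteration count up front is exactly the open question in this issue.

{code:scala}
import org.apache.spark.storage.StorageLevel

// Hypothetical guard: only pay the caching cost when more than one
// action (loop iteration) is expected to read baggedInput.
val handlePersistence = expectedIterations > 1  // expectedIterations: hypothetical

val baggedInput = BaggedPoint
  .convertToBaggedRDD(treeInput, strategy.subsamplingRate, numTrees,
    withReplacement, (tp: TreePoint) => tp.weight, seed = seed)

if (handlePersistence) {
  baggedInput.persist(StorageLevel.MEMORY_AND_DISK)
}

while (nodeStack.nonEmpty) {
  // ... findBestSplits as before; each iteration is one action on baggedInput ...
}

// Only unpersist what was actually persisted.
if (handlePersistence) {
  baggedInput.unpersist()
}
{code}

A caveat with this sketch: when the guard is false and the loop nevertheless runs more than once, baggedInput is recomputed on every iteration, so an inaccurate `expectedIterations` estimate trades wasted memory for wasted recomputation.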