[ https://issues.apache.org/jira/browse/SPARK-29813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Aman Omer updated SPARK-29813:
------------------------------
        Parent: SPARK-29818
    Issue Type: Sub-task  (was: Improvement)

> Missing persist in mllib.PrefixSpan.findFrequentItems()
> -------------------------------------------------------
>
>                 Key: SPARK-29813
>                 URL: https://issues.apache.org/jira/browse/SPARK-29813
>             Project: Spark
>          Issue Type: Sub-task
>          Components: MLlib
>    Affects Versions: 2.4.3
>            Reporter: Dong Wang
>            Priority: Major
>
> This piece of code involves three operations: reduceByKey, sortBy, and
> collect. The intermediate data is not persisted, so it will be recomputed.
> {code:scala}
>   private[fpm] def findFrequentItems[Item: ClassTag](
>       data: RDD[Array[Array[Item]]],
>       minCount: Long): Array[Item] = {
>     data.flatMap { itemsets =>
>       val uniqItems = mutable.Set.empty[Item]
>       itemsets.foreach(set => uniqItems ++= set)
>       uniqItems.toIterator.map((_, 1L))
>     }.reduceByKey(_ + _).filter { case (_, count) =>
>       count >= minCount
>     }.sortBy(-_._2).map(_._1).collect()
>   }
> {code}
> This issue was reported by our tool CacheCheck, which dynamically detects
> persist()/unpersist() API misuses.
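
For illustration, here is a minimal sketch of one way the recomputation described above might be avoided. It is not necessarily the fix adopted in Spark. It assumes the relevant cost comes from sortBy sampling its input to build a range partitioner before collect runs the final job, so the filtered counts are persisted before sorting and released afterwards. The object name FrequentItemsSketch, the choice of StorageLevel.MEMORY_AND_DISK, and the placement of persist() are assumptions made for this example only.

```scala
import scala.collection.mutable
import scala.reflect.ClassTag

import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Hypothetical standalone object; the original method lives in mllib.PrefixSpan.
object FrequentItemsSketch {

  def findFrequentItems[Item: ClassTag](
      data: RDD[Array[Array[Item]]],
      minCount: Long): Array[Item] = {
    // Count each item once per sequence, then keep items seen at least minCount times.
    val freqCounts: RDD[(Item, Long)] = data.flatMap { itemsets =>
      val uniqItems = mutable.Set.empty[Item]
      itemsets.foreach(set => uniqItems ++= set)
      uniqItems.iterator.map((_, 1L))
    }.reduceByKey(_ + _).filter { case (_, count) =>
      count >= minCount
    }.persist(StorageLevel.MEMORY_AND_DISK) // assumption: storage level chosen for the sketch

    try {
      // sortBy samples freqCounts and collect reads it again; both can use the cached copy.
      freqCounts.sortBy(-_._2).map(_._1).collect()
    } finally {
      // Release the cached blocks once the result is on the driver.
      freqCounts.unpersist()
    }
  }
}
```

Whether the extra caching pays off depends on the size of the filtered counts relative to the cost of re-reading the shuffle output, so this is best treated as something to measure rather than a guaranteed improvement.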