[ https://issues.apache.org/jira/browse/SPARK-29828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dong Wang updated SPARK-29828: ------------------------------ Summary: Missing persist in ml.recommendation.ALS.train (was: Missing persist on ratings on ratings in ml.recommendation.ALS.train) > Missing persist in ml.recommendation.ALS.train > ---------------------------------------------- > > Key: SPARK-29828 > URL: https://issues.apache.org/jira/browse/SPARK-29828 > Project: Spark > Issue Type: Sub-task > Components: ML > Affects Versions: 2.4.3 > Reporter: Dong Wang > Priority: Major > > There is a ratings.isEmpty() in ml.recommendation.ALS.train(). Actually, > isEmpty() has an action operator. > {code:scala} > def isEmpty(): Boolean = withScope { > partitions.length == 0 || take(1).length == 0 > } > {code} > So rdd ratings will be used by multi actions, it should be persisted, and > unpersisted after its child rdd has been persisted. > {code:scala} > def train[ID: ClassTag]( // scalastyle:ignore > ratings: RDD[Rating[ID]], > ... > require(!ratings.isEmpty(), s"No ratings available from $ratings") // > first use ratings > ... > val blockRatings = partitionRatings(ratings, userPart, itemPart) > .persist(intermediateRDDStorageLevel) > val (userInBlocks, userOutBlocks) = > makeBlocks("user", blockRatings, userPart, itemPart, > intermediateRDDStorageLevel) > userOutBlocks.count() // materialize blockRatings and user blocks > // ratings should be unpersisted here > {code} > This issue is reported by our tool CacheCheck, which is used to dynamically > detecting persist()/unpersist() api misuses. -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org