[ https://issues.apache.org/jira/browse/SPARK-29828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dong Wang updated SPARK-29828:
------------------------------
    Description:
There is a ratings.isEmpty() call at the beginning of ml.recommendation.ALS.train(). isEmpty() actually triggers an action, because it calls take(1):
{code:scala}
def isEmpty(): Boolean = withScope {
  partitions.length == 0 || take(1).length == 0
}
{code}
So the ratings RDD is used by multiple actions. It should be persisted, and then unpersisted after its child RDD has been persisted.
{code:scala}
def train[ID: ClassTag]( // scalastyle:ignore
    ratings: RDD[Rating[ID]],
    ...
  require(!ratings.isEmpty(), s"No ratings available from $ratings") // first use of ratings
  ...
  val blockRatings = partitionRatings(ratings, userPart, itemPart)
    .persist(intermediateRDDStorageLevel)
  val (userInBlocks, userOutBlocks) =
    makeBlocks("user", blockRatings, userPart, itemPart, intermediateRDDStorageLevel)
  userOutBlocks.count() // materialize blockRatings and user blocks
  // ratings should be unpersisted here
{code}
This issue was reported by our tool CacheCheck, which dynamically detects misuses of the persist()/unpersist() API.

was:
Two missing persist issues in ml.recommendation.ALS.train().
1. There is a ratings.isEmpty() call at the beginning of the method. isEmpty() actually triggers an action, because it calls take(1):
{code:scala}
def isEmpty(): Boolean = withScope {
  partitions.length == 0 || take(1).length == 0
}
{code}
So the ratings RDD is used by multiple actions. It should be persisted, and then unpersisted after its child RDD has been persisted.
{code:scala}
def train[ID: ClassTag]( // scalastyle:ignore
    ratings: RDD[Rating[ID]],
    ...
  require(!ratings.isEmpty(), s"No ratings available from $ratings") // first use of ratings
  ...
  val blockRatings = partitionRatings(ratings, userPart, itemPart)
    .persist(intermediateRDDStorageLevel)
  val (userInBlocks, userOutBlocks) =
    makeBlocks("user", blockRatings, userPart, itemPart, intermediateRDDStorageLevel)
  userOutBlocks.count() // materialize blockRatings and user blocks
  // ratings should be unpersisted here
{code}
2. 
This issue was reported by our tool CacheCheck, which dynamically detects misuses of the persist()/unpersist() API.

> Missing persist on ratings in ml.recommendation.ALS.train
> ---------------------------------------------------------
>
>                 Key: SPARK-29828
>                 URL: https://issues.apache.org/jira/browse/SPARK-29828
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML
>    Affects Versions: 2.4.3
>            Reporter: Dong Wang
>            Priority: Major
>
> There is a ratings.isEmpty() call at the beginning of
> ml.recommendation.ALS.train(). isEmpty() actually triggers an action,
> because it calls take(1):
> {code:scala}
> def isEmpty(): Boolean = withScope {
>   partitions.length == 0 || take(1).length == 0
> }
> {code}
> So the ratings RDD is used by multiple actions. It should be persisted, and
> then unpersisted after its child RDD has been persisted.
> {code:scala}
> def train[ID: ClassTag]( // scalastyle:ignore
>     ratings: RDD[Rating[ID]],
>     ...
>   require(!ratings.isEmpty(), s"No ratings available from $ratings") // first use of ratings
>   ...
>   val blockRatings = partitionRatings(ratings, userPart, itemPart)
>     .persist(intermediateRDDStorageLevel)
>   val (userInBlocks, userOutBlocks) =
>     makeBlocks("user", blockRatings, userPart, itemPart, intermediateRDDStorageLevel)
>   userOutBlocks.count() // materialize blockRatings and user blocks
>   // ratings should be unpersisted here
> {code}
> This issue was reported by our tool CacheCheck, which dynamically detects
> misuses of the persist()/unpersist() API.
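
For illustration only, the change the description asks for (persist ratings before the first action, unpersist it once the child RDD has been materialized) might look roughly like the sketch below. This is not the ALS code and not a proposed patch: process(), demo(), the tuple-based rating type, the modulo key, the local[2] master, and the MEMORY_AND_DISK storage level are all assumptions made up for the example. The check on getStorageLevel is also an assumption, added so that an RDD the caller has already persisted is left alone.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object PersistParentSketch {

  // Hypothetical stand-in for the start of ALS.train(); not Spark's actual code.
  // Returns the number of records in the child RDD.
  def process(ratings: RDD[(Int, Int, Float)]): Long = {
    // Persist only if the caller has not already chosen a storage level.
    val handlePersistence = ratings.getStorageLevel == StorageLevel.NONE
    if (handlePersistence) ratings.persist(StorageLevel.MEMORY_AND_DISK)

    // First action on ratings: isEmpty() calls take(1) internally.
    require(!ratings.isEmpty(), s"No ratings available from $ratings")

    // Child RDD derived from ratings (plays the role of blockRatings).
    val blocks = ratings
      .map { case (user, item, rating) => (user % 2, (item, rating)) }
      .persist(StorageLevel.MEMORY_AND_DISK)

    // Second action: materializes and caches the child.
    val n = blocks.count()

    // The child is cached now, so the parent's cached blocks can be released.
    if (handlePersistence) ratings.unpersist()

    blocks.unpersist() // the sketch does not use blocks any further
    n
  }

  // Runs process() on a tiny made-up dataset in local mode.
  def demo(): Long = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("persist-parent-sketch")
    val sc = new SparkContext(conf)
    try {
      val ratings = sc.parallelize(Seq((1, 10, 4.0f), (2, 10, 3.5f), (3, 20, 5.0f)))
      process(ratings)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(demo())
}
```

Because isEmpty() only needs take(1), the recomputation it causes is limited to the partitions that take(1) scans, but everything upstream of ratings in the lineage is still evaluated for those partitions. Whether persisting is worth the memory depends on how expensive that upstream lineage is.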