[ https://issues.apache.org/jira/browse/SPARK-29809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean R. Owen resolved SPARK-29809.
----------------------------------
    Resolution: Duplicate

> Missing persist in Word2Vec.fit()
> ---------------------------------
>
>                 Key: SPARK-29809
>                 URL: https://issues.apache.org/jira/browse/SPARK-29809
>             Project: Spark
>          Issue Type: Sub-task
>          Components: MLlib
>    Affects Versions: 2.4.3
>            Reporter: Dong Wang
>            Priority: Major
>
> The RDD dataset is used by multiple actions, in learnVocab(dataset) and in
> doFit, so it needs to be persisted. Otherwise it is recomputed for each action.
> {code:scala}
>   def fit[S <: Iterable[String]](dataset: RDD[S]): Word2VecModel = {
>     // Needs to persist dataset here
>     learnVocab(dataset) // has action on dataset
>     createBinaryTree()
>     val sc = dataset.context
>     val expTable = sc.broadcast(createExpTable())
>     val bcVocab = sc.broadcast(vocab)
>     val bcVocabHash = sc.broadcast(vocabHash)
>     try {
>       doFit(dataset, sc, expTable, bcVocab, bcVocabHash) // has action on dataset
> {code}
> This issue was reported by our tool _CacheCheck_, which dynamically detects
> misuses of the persist()/unpersist() API.
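
The pattern the report asks for can be sketched as follows. This is a minimal illustration, not the change made in Spark (the issue was resolved as a duplicate, and the actual patch is not shown here). The helper `withPersisted`, the object `PersistSketch`, and the function `vocabAndWordCounts` are hypothetical names introduced for this example. The two counts only stand in for the two places (learnVocab and doFit) where the dataset is acted on. The sketch assumes Spark is on the classpath and uses a local master.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object PersistSketch {
  // Hypothetical helper (not a Spark API): persists `rdd` for the duration of
  // `body` if the caller has not already persisted it, then releases it.
  def withPersisted[T, R](rdd: RDD[T],
                          level: StorageLevel = StorageLevel.MEMORY_AND_DISK)
                         (body: RDD[T] => R): R = {
    // Only manage persistence if the caller has not chosen a storage level.
    val handlePersistence = rdd.getStorageLevel == StorageLevel.NONE
    if (handlePersistence) rdd.persist(level)
    try body(rdd)
    finally if (handlePersistence) rdd.unpersist()
  }

  // Stand-in for a fit() that runs two actions on the same dataset.
  def vocabAndWordCounts(dataset: RDD[Seq[String]]): (Long, Long) =
    withPersisted(dataset) { ds =>
      val vocabSize = ds.flatMap(s => s).distinct().count() // first action
      val wordCount = ds.map(_.size.toLong).reduce(_ + _)   // second action
      (vocabSize, wordCount)
    }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("persist-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val sentences = sc.parallelize(Seq(Seq("a", "b", "a"), Seq("b", "c")))
      println(vocabAndWordCounts(sentences))
    } finally {
      sc.stop()
    }
  }
}
```

Checking `getStorageLevel` first avoids overriding a storage level the caller has already set, and the `finally` block releases the cached data even if one of the actions fails.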