Qi Dai created SPARK-13289:
------------------------------

             Summary: Word2Vec generate infinite distances when numIterations>5
                 Key: SPARK-13289
                 URL: https://issues.apache.org/jira/browse/SPARK-13289
             Project: Spark
          Issue Type: Bug
          Components: MLlib
    Affects Versions: 1.6.0
         Environment: Linux, Scala
            Reporter: Qi Dai

I recently ran some word2vec experiments on a cluster with 50 executors on a large text dataset, and found that when the number of iterations is larger than 5, the distances between words are all infinite. My code looks like this:

    import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}

    // One sentence per line, tokenized on single spaces
    val text = sc.textFile("/project/NLP/1_biliion_words/train").map(_.split(" ").toSeq)

    val word2vec = new Word2Vec()
      .setMinCount(25)
      .setVectorSize(96)
      .setNumPartitions(99)
      .setNumIterations(10)   // values above 5 trigger the problem
      .setWindowSize(5)
    val model = word2vec.fit(text)

    // Print the 40 nearest words to "who" with their similarity scores
    val synonyms = model.findSynonyms("who", 40)
    for ((synonym, cosineSimilarity) <- synonyms) {
      println(s"$synonym $cosineSimilarity")
    }

The results are:

    to Infinity
    and Infinity
    that Infinity
    with Infinity
    said Infinity
    it Infinity
    by Infinity
    be Infinity
    have Infinity
    he Infinity
    has Infinity
    his Infinity
    an Infinity
    ) Infinity
    not Infinity
    who Infinity
    I Infinity
    had Infinity
    their Infinity
    were Infinity
    they Infinity
    but Infinity
    been Infinity

I tried many different datasets and different words for finding synonyms.
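
A possible first diagnostic for a report like this is to check whether the trained word vectors themselves already contain non-finite components, or whether the Infinity values only appear in the similarity computation. The sketch below is an illustration, not part of the original report and not a Spark API: `VectorCheck` and `nonFiniteWords` are hypothetical names, and the sample vectors are made-up toy data. It assumes the vectors are available as a `Map[String, Array[Float]]`, which is what `Word2VecModel.getVectors` is understood to return in MLlib. It does not identify the cause of the bug.

```scala
// Hypothetical helper (not part of Spark): finds words whose vector
// contains at least one NaN or +/-Infinity component.
object VectorCheck {

  /** Returns, in sorted order, the words whose vectors have a non-finite component. */
  def nonFiniteWords(vectors: Map[String, Array[Float]]): Seq[String] =
    vectors.collect {
      case (word, vec) if vec.exists(x => x.isNaN || x.isInfinity) => word
    }.toSeq.sorted

  def main(args: Array[String]): Unit = {
    // Toy data standing in for model.getVectors; these values are invented.
    val sample = Map(
      "who" -> Array(0.1f, -0.3f, 0.7f),
      "to"  -> Array(Float.PositiveInfinity, 0.2f, 0.0f),
      "and" -> Array(0.5f, Float.NaN, 0.1f)
    )
    val bad = nonFiniteWords(sample)
    // prints "2 of 3 vectors are non-finite: and, to"
    println(s"${bad.size} of ${sample.size} vectors are non-finite: ${bad.mkString(", ")}")
  }
}
```

In a spark-shell session with the model from the report, the same check would be called as `VectorCheck.nonFiniteWords(model.getVectors)`, assuming `getVectors` has the signature described above. An empty result would suggest the vectors are finite and the problem lies in the similarity step; a non-empty one would suggest the training itself diverged.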