Qi Dai created SPARK-13289:
------------------------------

             Summary: Word2Vec generate infinite distances when numIterations>5
                 Key: SPARK-13289
                 URL: https://issues.apache.org/jira/browse/SPARK-13289
             Project: Spark
          Issue Type: Bug
          Components: MLlib
    Affects Versions: 1.6.0
         Environment: Linux, Scala
            Reporter: Qi Dai


I recently ran some word2vec experiments on a cluster with 50 executors on a 
large text dataset, and found that when the number of iterations is larger than 
5, the distances between words are all infinite. My code looks like this:

val text = sc.textFile("/project/NLP/1_biliion_words/train").map(_.split(" ").toSeq)
import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}
val word2vec = new Word2Vec().setMinCount(25).setVectorSize(96).setNumPartitions(99).setNumIterations(10).setWindowSize(5)
val model = word2vec.fit(text)
val synonyms = model.findSynonyms("who", 40)
for((synonym, cosineSimilarity) <- synonyms) {
  println(s"$synonym $cosineSimilarity")
}

The results are: 
to Infinity
and Infinity
that Infinity
with Infinity
said Infinity
it Infinity
by Infinity
be Infinity
have Infinity
he Infinity
has Infinity
his Infinity
an Infinity
) Infinity
not Infinity
who Infinity
I Infinity
had Infinity
their Infinity
were Infinity
they Infinity
but Infinity
been Infinity

I have tried many different datasets and different query words for findSynonyms.
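
One check that might help narrow this down is to look at the learned vectors directly, on the assumption (not confirmed here) that the Infinity values come from the vectors themselves overflowing rather than from findSynonyms alone. The sketch below is a hypothetical helper, not part of Spark; VectorHealth, Report and check are names made up for illustration. It counts how many word vectors contain NaN or infinite components and reports the largest finite magnitude.

```scala
// Hypothetical diagnostic helper (not a Spark API): summarises the numeric
// health of a word -> vector map such as the one a Word2VecModel exposes.
object VectorHealth {
  // total: number of words; nonFinite: words whose vector has a NaN or
  // infinite component; maxAbsFinite: largest finite |component| seen.
  final case class Report(total: Int, nonFinite: Int, maxAbsFinite: Float)

  def check(vectors: Map[String, Array[Float]]): Report = {
    var nonFinite = 0
    var maxAbs = 0.0f
    for ((_, v) <- vectors) {
      // Count the word once if any component is NaN or +/- Infinity.
      if (v.exists(x => x.isNaN || x.isInfinity)) nonFinite += 1
      // Track the largest magnitude among the finite components only.
      for (x <- v if !x.isNaN && !x.isInfinity) {
        maxAbs = math.max(maxAbs, math.abs(x))
      }
    }
    Report(vectors.size, nonFinite, maxAbs)
  }
}
```

With the model from the code above, this could be called as VectorHealth.check(model.getVectors). Comparing the report for setNumIterations(5) and setNumIterations(10) would show whether the vector magnitudes grow with the iteration count.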


